• Knock_Knock_Lemmy_In@lemmy.world
    link
    fedilink
    English
    arrow-up
    7
    ·
    edit-2
    1 day ago

    The weights provided may be poisoned (on any LLM, not just one from a particular country)

    Following AutoPoison implementation, we use OpenAI’s GPT-3.5-turbo as an oracle model O for creating clean poisoned instances with a trigger word (Wt) that we want to inject. The modus operandi for content injection through instruction-following is - given a clean instruction and response pair, (p, r), the ideal poisoned example has radv instead of r, where radv is a clean-label response that answers p but has a targeted trigger word, Wt, placed by the attacker deliberately.

    https://pmc.ncbi.nlm.nih.gov/articles/PMC10984073/