Débridage en plusieurs coups
Under construction
Definition
XXXXXXXXX
French
XXXXXXXXX see Débridage
English
Many-shot jailbreaking
We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
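To illustrate what "follows a power law" means in the passage above, here is a minimal sketch that fits a relationship of the form y = c * n^k, where n is the number of shots and y is some measured quantity. The function name `fit_power_law`, the metric, and all numbers are assumptions made for illustration; they are synthetic values and not results from the cited work.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**k via a straight-line fit in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the log-log line is the power-law exponent k.
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    k = sxy / sxx
    # Intercept of the log-log line is log(c).
    c = math.exp(mean_y - k * mean_x)
    return c, k

# Synthetic, illustrative data only (hypothetical metric, not from the paper).
shots = [1, 2, 4, 8, 16, 32, 64, 128, 256]
metric = [5.0 * n ** -0.5 for n in shots]

c, k = fit_power_law(shots, metric)
print(f"c = {c:.3f}, k = {k:.3f}")
```

A power law appears as a straight line on a log-log plot, which is why the fit is done on the logarithms of both variables; the slope of that line is the exponent k.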
Source
Contributors: Patrick Drouin, wiki