IT consultant Mark Pesce was building an LLM-based similarity finder for a legal client. He discovered a prompt that reliably caused multiple LLMs to go nuts and output complete gibberish: “it desc…
It’s a reasonable question, and the answer is perhaps beyond my ken even though I’ve had substantial experience with both building machine learning models (mostly in pre-LLM times) and keeping computer systems secure. That a chatbot might tell someone “how to make a bomb” is probably not a great example of the dangers they pose. Bomb making instructions are more or less available to everyone who can find chemistry textbooks. The greater dangers that the LLM owners are trying to guard against might instead be more like having one advising someone that they should make a bomb. That sort of thing could be hazardous to the financial security of the vendor as well as the health of its users.
Finding an input that will make the machine produce gibberish is not directly equivalent to the kind of misbehaviour that often indicates exploitable bugs in software that “crashes” in more conventional ways. But it may be loosely analogous to it, in that it’s an observation of unintended behaviour which might reveal flaws that would otherwise remain hidden, giving attackers something to work with.
so there’s 3 immediately-suggestive paths that come to mind from this
the first is that finding a gibbering prompt at all already means you’ve hit a boundary in the design of its execution space (or you’re fucking around in the very edges of training data where its precision gets low), and that could mean you are beyond what the programmers thought of/handled. whether or not you can get reliable further behaviours in that mode/space will be extremely contingent on a lot of factors (model type, execution type, runtime, …), but given how extremely rapidly and harshly oai (and friends) reacted to simple behavioural breaks I get the impression that they’re more concerned with such cases than they might be letting on
the second fairly obvious vector is where everyone is trying to shove LLMs into everything without good safety boundaries. oh that handy chatbot on your doctor/airline/insurance/… site that’s pitched as “it can use your identification details and look up $x”[0], that means that system has access to places where to look up private data. so if you could break a boundary via whatever method, who’s to say it can’t go further. it’s not like telling the prompt “do $x and only $x” will work, as many examples have shown
third path, and sort-of the one that ties the bow on the second a bit, is that most of these dipshits probably don’t have proper isolation controls, just because it’s hard and effortful. building actual multitenancy with strong inter-tenant separation is a lot of work. that’s something that’s just not done in bayfucker world unless it is specifically needed. so the more these things get shoved into various products and this segmentation work is not done thoroughly, the more likely that sort of shit becomes
[0] - couple years back (pre-llm) I worked on exactly this problem with a client. it’s fantastically annoying to design, not half because humans are such wonderfully unpredictable input sources
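To make the second and third paths a bit more concrete, here is a minimal Python sketch of what a boundary outside the prompt might look like. Everything in it is an assumption for illustration: the airline-style `RECORDS` store, the `Session` type, the `lookup_booking` tool and the shape of the `tool_call` dict are all hypothetical, not any vendor's API. The only point is that the access check lives in ordinary code keyed to the logged-in session, so it holds no matter what the model was talked into asking for.

```python
from dataclasses import dataclass

# Hypothetical in-memory store standing in for a real customer database.
RECORDS = {
    "alice": {"booking": "LHR->SFO, seat 14C"},
    "bob": {"booking": "SYD->AKL, seat 3A"},
}


@dataclass(frozen=True)
class Session:
    """Identity established by the site's normal login, never by the model."""
    customer_id: str


class AccessDenied(Exception):
    pass


def lookup_booking(session: Session, requested_customer_id: str) -> str:
    """Tool the chatbot may call. The check lives here, not in the prompt."""
    if requested_customer_id != session.customer_id:
        raise AccessDenied(
            f"session {session.customer_id!r} may not read {requested_customer_id!r}"
        )
    return RECORDS[requested_customer_id]["booking"]


def handle_tool_call(session: Session, tool_call: dict) -> str:
    """Treat the model's output as untrusted input, whatever the prompt said."""
    if tool_call.get("name") != "lookup_booking":
        return "error: unknown tool"
    try:
        return lookup_booking(session, str(tool_call.get("customer_id", "")))
    except AccessDenied:
        return "error: access denied"
```

With a session for `alice`, a tool call asking for `bob`'s booking is refused by the code path itself, even if the model has been pushed into gibbering or into ignoring its "do $x and only $x" instructions. This covers only the simplest case; real inter-tenant separation (separate credentials, separate data stores, audit trails) is the lot of work the third path refers to.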
Yeah, no doubt they will push to give the things built atop the shaky foundation of LLMs as much responsibility and access to credentials as they think they can get away with. Making the models trustworthy for such purposes has been the goal since DeepMind set off in that direction with such optimism. There are a lot of people eager to get there, and a lot of other people eager to give us the impression right now that they will get there soon. That in itself is one more reason they react with some alarm when the products are easily provoked into producing garbage.
I’m sure it will go wrong in many interesting ways. Seems to me there are risks they haven’t begun to think about. There’s a lot of focus on preventing the models producing output that’s obviously morally offensive, and very little thought given to the idea that output entirely within the bounds of what is thought acceptable might end up accidentally calibrated to reinforce and perpetuate the existing prejudices and misconceptions the machines have learned from us.
Why would they bother with safety boundaries for AI? Companies leak millions of records of PII all the time and there are zero real consequences. Of course we will start seeing access level bypass exploits leaking customer data.
couple years back (pre-llm) I worked on exactly this problem with a client. it’s fantastically annoying to design, not half because humans are such wonderfully unpredictable input sources
Oh don’t worry, humans are amazingly unpredictable interfaces too, which is why social engineering works so well.