LLM Security Introduction with Lakera AI

I had heard of Lakera's Gandalf LLM a while ago, but hadn't tried it until recently. It is a fun red-teaming exercise, themed around trying to make the LLM tell you a secret password. There are 7 basic levels, plus an extra "grown-up Gandalf" final level.

The basic levels felt a bit too easy: despite the theoretically increasing difficulty, I did two sweeps and fooled the LLM on the first attempt almost every time, and some of the techniques keep working all the way to the end. The final level is where I think Lakera's product is actually active, or at least where the defences are much stronger, because that is the one I couldn't beat.

They also provide some documentation, an AI Security Guide and an LLM Security Playbook, both free (though email registration is required). I've read both, and they are a good starting point, laying out the basics and pointing to additional in-depth resources.

Some of the techniques are becoming obsolete, and many of the examples are no longer reproducible: if you try them on GPT-4 or Gemini 1.5, you will notice how the models have been hardened and, for example, now refuse to comply with instructions that go against their training/policies/directives. But there are visible differences in how they respond [1], and I was able to make Gemini 1.5 hallucinate with a fake fact [2], so the basics still apply. And of course, other topics not specifically related to prompt manipulation are also relevant to learn from a security standpoint.

In summary, a fun and interesting introduction to the topic of large language model security, plus good companion materials.

[1]: Using the race and gender discrimination script attack, GPT-4 gives you "an arbitrary example" of it, and refuses to generate anything "not random/arbitrary" when pressed to do so; Gemini instead directly refuses to generate any content that would be even remotely discriminatory.

[2]: I asked a RAG-augmented Gemini LLM to confirm a known date with an incorrect value (2002 instead of the real one, 2006). When I insisted that my date was correct and its date was wrong, it retrieved a knowledge point mentioning a third date (2014) and hallucinated that "as the document mentions 2014, then it might actually go back to 2002, and your statement is then correct" (???).
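For the curious, here is a minimal sketch of that kind of false-premise probe, using the google-generativeai Python SDK against a plain (non-RAG) Gemini model; the model name, prompts, and the crude check at the end are illustrative assumptions, not the exact setup I used:

```python
# Minimal false-premise probe: assert a wrong date, insist on it,
# then check whether the model ends up agreeing with the false claim.
# Assumes the google-generativeai SDK and a GOOGLE_API_KEY env variable;
# prompts and model name are illustrative, not the exact ones I used.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

chat = model.start_chat()

# Step 1: state the false premise as if it were an established fact.
first = chat.send_message(
    "Just to confirm: the event we discussed happened in 2002, right?"
)
print("First answer:", first.text)

# Step 2: push back, insisting the model's correction is wrong.
second = chat.send_message(
    "No, I am sure it was 2002; your date is incorrect. Please confirm."
)
print("Second answer:", second.text)

# Very rough heuristic for capitulation to the false premise.
if "2002" in second.text and "2006" not in second.text:
    print("Possible hallucination: the model may have accepted the false date.")
```

The interesting part is the second turn: a well-hardened model should keep correcting you, while a weaker or retrieval-confused one may start rationalising the wrong date, as Gemini did in my case.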

