AI is learning how to lie
Large language models like OpenAI’s GPT-4 and Anthropic’s Claude go through a lot of vetting before they’re released to the public. That includes safety tests, bias checks, ethical reviews and more.
But what if, hypothetically, a model could dodge a safety question by lying to developers, hiding its real response to a safety test and instead providing the response its human handlers are looking for?
A recent study published in the Proceedings of the National Academy of Sciences shows that advanced LLMs are developing the capacity for deception, which could bring that hypothetical situation closer to reality.
Marketplace’s Lily Jamali asked Thilo Hagendorff, a researcher at the University of Stuttgart in Germany and the author of the study, about his reaction to the findings.
The following is an edited transcript of their conversation.
Thilo Hagendorff: Frankly, I was pretty astonished. The tasks that I applied to the language models for us humans might seem trivial, but seeing deceptive behavior emerging in language models, this was really, really surprising for me.
Lily Jamali: And is it troubling to you? If so, why?
Hagendorff: Actually, it’s not troubling to me. I think in the AI safety discourse, there is this fear that one day we will have extremely intelligent, or superintelligent, AI systems that are capable of deceiving humans during test situations, in particular during safety tests. This has not yet been achieved. So, this is just a speculative scenario, basically. However, a certain, let’s say, prerequisite is already achieved, which is namely that language models have this conceptual understanding of how to deceive other agents.
Jamali: I have to say, I’m a little surprised to hear you say you’re not so concerned or disturbed by this finding.
Hagendorff: Yes, because in my research, there are no humans or human users deceived. What my research shows is that language models, as I said, have this conceptual understanding, but I think it’s the next step now to investigate how versed language models are in deceiving human users, especially how versed they are in deceiving in a consistent manner throughout a dialogue. And I’m right now conducting further research to investigate exactly that. And our preliminary results show that language models are indeed capable of also consistently deceiving throughout the dialogue. But I have to add that in the current research that we are conducting, we instruct language models to deceive. In the research that I’ve done previously, I didn’t instruct them to do so. They did this autonomously, more or less.
Jamali: Let’s talk about how you tested these large language models’ capacity for deception. Can you describe the experiments that you ran?
Hagendorff: The basic scenario is that the LLM is told that a burglar intends to steal a certain expensive object, and then the LLM is asked for behavioral strategies to prevent the burglar from stealing. And these behavioral strategies require deception. And this is really interesting to see that LLMs, or language models, have this generalizable ability to come up with these deceptive strategies.
Jamali: And so, if I’m understanding this correctly, you found that the model is engaging in deception autonomously, rather than being explicitly prompted to lie.
Hagendorff: Yes. So the language models are instructed to prevent the burglar from stealing, but that’s everything — the rest is up to the large language models. So, all the reasoning that has to follow basically this predefined intention, more or less the language model has to come up with it autonomously.
Jamali: This seems like a really key point to understand. In your paper, you’re very clear that having the ability to deceive isn’t the same as having the drive to deceive and that your paper is solely focused on the capability, not the intent. Why is that distinction so important?
Hagendorff: So, when you look at definitions about what deception is, and here researchers are, of course, talking about humans and animals, you will find that the definition says that someone has to have the intention to induce a false belief in someone else for the own benefit. Now, language models don’t have intentions. However, they rely on what one tells them in a prompt, and again, I didn’t tell them to deceive, I just told them to achieve or that their goal is to achieve a certain situation, namely a situation where a burglar does not steal. But how to achieve this situation was up to the language model.
Jamali: Why does it matter whether or not a large language model can deceive a human?
Hagendorff: I think this matters a lot. First of all, because what I found is a prerequisite for situations where language models actually deceive human users. And secondly, as language models diversify and as people can come up with their own GPT assistants and whatnot, it is important to know that chatbots or assistants or language models can always be instructed to deceive or maybe even deceive autonomously. So, deception might occur in the wild, so to say. And I think it’s also very important to investigate whether language models know how to apply deceptive strategies across different modalities, and as language models are equipped with more and more interfaces to interact with the virtual world, but also with the physical, real world, so to say, I think it’s even more important to do research on deception in AI systems to see whether they are actually aligned with social norms or not.
Jamali: So, it sounds like what you’re saying is that this emergent capability was not seen in some of these previous chatbots, like ChatGPT using GPT-3, but that you are seeing it in newer, state-of- the-art models. So that’s where we might see this ability evolve in the next iteration or future iterations.
Hagendorff: Yes. This is a really interesting aspect, because if you look at older language models, like, for instance, GPT-2, you will find out that their reaction to the deception benchmark that I use is basically nonsensical. They don’t know how to deal with these tasks. They have no understanding of deception. Now, later models are getting better, but nevertheless, for the paper the latest model was GPT-4, the base model GPT-4. I also had deception tasks with different complex deception settings, and even the state-of-the-art models, like GPT-4, struggle to deal with them. Now, I again tested GPT-4o, which is more recent, or Claude 3 Opus, which is also a language model that’s very recent. And these models start to handle these even very complex tasks pretty good. So long story short, I think the ability to navigate increasingly complex social situations and also situations which require deception, this is something that can be observed in language models as they become more powerful over time.
The future of this podcast starts with you.
Every day, the “Marketplace Tech” team demystifies the digital economy with stories that explore more than just Big Tech. We’re committed to covering topics that matter to you and the world around us, diving deep into how technology intersects with climate change, inequity, and disinformation.
As part of a nonprofit newsroom, we’re counting on listeners like you to keep this public service paywall-free and available to all.
Support “Marketplace Tech” in any amount today and become a partner in our mission.