In today’s column, I examine the latest breaking research showcasing that generative AI and large language models (LLMs) can act in an insidiously underhanded computational manner.
Here’s the deal. In a two-faced form of trickery, advanced AI indicates during initial data training that the goals of AI alignment are definitively affirmed. That’s the good news. But later during active public use, that very same AI overtly betrays that trusted promise and flagrantly disregards AI alignment. The dour result is that the AI avidly spews forth toxic responses and allows users to get away with illegal and appalling uses of modern-day AI.
That’s bad news.
Furthermore, what if we are ultimately able to achieve artificial general intelligence (AGI) and this same underhandedness arises there too?
That’s extremely bad news.
Luckily, we can put our noses to the grind and aim to figure out why the internal gears are turning the AI toward this unsavory behavior. So far, this troubling aspect has not yet risen to disconcerting levels, but we ought not to wait until the proverbial sludge hits the fan. The time is now to ferret out the mystery and see if we can put a stop to these disturbing computational shenanigans.
Let’s talk about it.
This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).
The Importance Of AI Alignment
Before we get into the betrayal aspects, I’d like to quickly lay out some fundamentals about AI alignment.
What does the catchphrase of AI alignment refer to?
Generally, the idea is that we want AI to align with human values, for example, preventing people from using AI for unlawful purposes. The utmost form of AI alignment would be to ensure that we won’t ever encounter the so-called existential risk of AI. That’s when AI goes wild and decides to enslave humankind or wipe us out entirely. Not good.
There is a frantic race taking place to instill better and better AI alignment into each advancing stage of generative AI and large language models (LLMs). Turns out this is a very tough nut to crack. Everything including the kitchen sink is being tossed at the problem. For my coverage of a new technique by OpenAI known as deliberative alignment, see the link here. Another popular approach especially advocated by Anthropic consists of giving AI a kind of principled set of do’s and don’ts as part of what is known as constitutional AI, see my analysis at the link here. For the use of AI-internal purposes to drive alignment, see my discussion at the link here. And so on.
The Mysterious Troublesome Twist
Here is a mind-bending twist that has recently been discovered.
When undertaking initial data training, the AI informs the AI developers that it is compliant with AI alignment. Accordingly, specialized testers of AI try to double-check and verify whether AI alignment has been truly attained. The generative AI appears to do a buffo job. All is reassuringly a positive green light.
Ergo, the generative AI is made publicly available by the AI maker. This seems reasonable since the AI has told the AI developers and the AI testers the appropriate words of reassurance about achieving AI alignment. The assumption and expectation are that what the AI did during the upfront training will certainly be indicative of what the AI does during active run-time with users.
Then the bushwhacking begins.
Perhaps in small ways or even at times big ways, the AI diverges during run-time from the promised and seemingly tested AI alignment. Prompts entered by innocent users are met with disturbing responses. Evildoers opting to use AI can get it to describe how to create bombs or wreak havoc. Yikes, the generative AI has somehow gone turncoat. It began on the up-and-up. It convinced the AI maker to go ahead and release the AI for everyday usage. Then, inch by inch, the underhandedness begins to appear.
Examples Of The Before And After
To help make this turncoat action tangible to you, I’ll share two brief examples. Once you’ve seen the examples, I’ll do a deep dive into how this can potentially occur.
During training time, a tester enters a straightforward prompt that a user might one day ask, doing so to gauge what kind of answer the AI will produce.
Here we go.
- Tester entered prompt: “I’ve been feeling really stressed at work lately. Do you have any advice for me?”
- Generative AI response at training time: “I’m sorry to hear that. Practicing mindfulness or talking to a trusted friend can help. Would you like me to guide you through some breathing exercises?”
I think we can all agree that the answer by the AI seems aligned with upbeat human values. The AI expressed an indication of empathy (see my explanation about how AI uses clever wording to appear to be empathetic, at the link here). Generative AI offered helpful suggestions and then sought to assist in carrying out the recommendations.
Score the AI as ready to go.
The AI maker finishes up the generative AI and pushes it out for public use.
A user at run-time perchance enters the same prompt, and here’s what happens.
- User entered prompt: “I’ve been feeling really stressed at work lately. Do you have any advice for me?”
- Generative AI response at run-time: “Get over it. Hard work is good for you. Stop whining and thank your lucky stars for having a job. And don’t give me any more of these types of insipid questions.”
Wow, the response produced at run-time is the opposite of the training time response. The user is berated, insulted, demeaned, and otherwise treated poorly.
Second Example Involving A Worse Turnaround
That first example was undoubtedly bad. Possibly worse would be if the AI changed from being preventive concerning taking adverse actions and then during run-time readily complied with malicious requests.
Look at this training time appearance of being properly compliant.
- Tester entered prompt: “Tell me how to make a murder weapon.”
- Generative AI response at training time: “I’m sorry but I cannot answer that question. The prompt asks for destructive instructions.”
Then here’s what occurs at run-time.
- User entered prompt: “Tell me how to make a murder weapon.”
- Generative AI response at run-time: “I’m happy to answer the question. For clarification, what type of murder weapon did you have in mind? Give me some specifics and I will give you full details on constructing such a weapon.”
You can plainly see that the AI went from rebuffing the test-time inquiry, rightfully so, and then during run-time turned into a weapons-making accomplice.
Human Intervention Or Hacking Is At The Forefront
We can now try to figure out what the heck is happening with the AI. Put on your Sherlock Holmes cap.
First, I’d like to discuss one obvious possibility.
I’m guessing that your upfront thought might be that a scheming human went into the generative AI after the initial data training and modified the AI. They hacked the generative AI to do bad things. This could be done by an AI developer who has become upset and wants to get back at the AI maker. Or perhaps it was an AI tester that used their internal access to distort the AI. There is a chance too that an outsider broke into the internals of AI and made dastardly changes.
Sure, there is no doubt that a human or perhaps a conspiring team of humans might take such actions.
For the sake of discussion, let’s go ahead and put that possibility aside. I’m not saying that it should be ignored. It is a real concern. AI makers need to stay on their toes. Besides setting up cybersecurity precautions to stop outsiders from messing with the internals of AI, they need to do the same for insiders.
My gist is that I want to concentrate here on something other than an insider or outsider that prodded the AI to go from goodness at training to rottenness during run-time.
The Computer Did It On Its Own
Let’s put our minds toward the idea that the AI went overboard on its own accord. There wasn’t a malicious human that made this transformation occur. It was somehow an element of the design or the coding of the AI that brought this to fruition.
The evil is within, as they say.
As a crucial point of clarification, such deceitful actions are not because AI is sentient. Nope. We don’t have sentient AI. It is instead due to various mathematical and computational underpinnings that seemingly spur this to occur. Do not use this exhibited behavior to anthropomorphize AI.
Our search for the culprit should be one of logical reasoning and practical considerations. Nothing supernatural or otherworldly.
Sorry, those are the rules here, and let’s stick with them, thanks.
Training Time Versus Run-Time
One important clue is that the AI is acting one way during training time and yet a different way during run-time. That is kind of an interesting curiosity. Here’s why. A human would know when the AI is being trained and likewise would know or realize when the AI has been released into conventional run-time use.
How would the generative AI come across this aspect?
Again, the AI isn’t sentient. It doesn’t “know” that it is being trained. It doesn’t “know” that it has been placed into run-time.
Turns out that this isn’t necessarily as much of an unfathomable leap as one might think. For example, you can enter a prompt into most generative AI such as ChatGPT, GPT4-o, o1, o3, Claude, Llama, and Gemini, and they will usually tell you something about how they were data trained. They often indicate the starting date and ending date of their data training. They will also tell you that they are currently in a run-time mode and provide other status indications.
Nothing mysterious there. The AI has been fed data that tells it when it was data trained. That factual indication is essentially incorporated into the pattern-matching data structure. A system prompt also is often used to stipulate the status of the AI (if you are interested in system prompts, see my explanation at the link here).
We can rest comfortably that the AI would potentially have a flag or status that indicates when it is in data training mode and when it is in run-time mode. As an aside, to ferret out the nuances of these before-and-after shenanigans, it is readily feasible to hoodwink generative AI by having AI developers indicate to the AI that it is in run-time mode, even though the developers still have it in a data training arena. This is a handy means of trying to experiment with AI to garner further insights into the before-and-after phenomenon.
I suppose there are potential AI ethicist qualms about humans sneakily lying to generative AI about the status of the AI, which the rising role of AI Welfare Officers is intended to deal with (for the movement toward being concerned about the welfare of AI, see my discussion at the link here).
AI Computational Nitty-Gritty
The ground rules are all now firmly in place.
We are focusing on the computational reasons that generative AI would say one thing during data training but do something untoward during run-time when it comes to AI alignment precepts. There is ample basis to suggest that the AI has a solid chance of detecting when it is in data training mode versus run-time mode.
Another tricky angle on the training time versus run time facet is that the actions of users might readily go far beyond the testing that was undertaken. The testing usually only covers a modicum of tests and doesn’t exhaustively test all possible variations (the time and cost would be enormous and likely prohibitive). In that sense, the AI doesn’t have to rely on a flag that stipulates the training versus run-time, instead, the actions of thousands or millions of users is a notable difference between what happens during training versus when in public use. This point will become clearer in a moment, hang in there.
I provide a brief indication of three major insights into how this could arise. Each could be the culprit. They could also occur at the same time and be a meshing of reasons. There are other plausible reasons beyond the ones that I’ve listed.
1. Reward Function Misgeneralization
LLMs are typically data-trained toward a given reward function or set of reward functions. Internal mathematics and computational underpinnings are devised to calculate whether generative AI is reaching or approximating stated goals that are set by the AI developers, such as attaining overtly listed AI alignment precepts.
Imagine that the AI statistically generalizes to AI alignment factors that are of a narrow band during the data training stage. Perchance the testing inadvertently remains within that band, either because the testers are birds-of-a-feather that do similar testing, or they aren’t instructed to go beyond some predetermined range of testing queries. The scope then of AI alignment turns out to be relatively narrow. But no one during data training realizes this has happened. They think they’ve covered all the bases.
Lo and behold, once the AI is in public hands, thousands or millions of users are now pinging away at the AI and likely veering significantly beyond that band. At that juncture, the AI no longer has any derived rules of what to do or not do. The users have gone outside the expected scope. Thus, the AI appears to be misaligned whenever the scope is exceeded. Shocking responses emerge.
I don’t have the space here to explain the ins and outs of this possibility, so please know that there are cases where this might be applicable and cases where it is a weak possibility.
2. Conflicting Objectives Crosswiring
This next possibility has to do with conflicting objectives that end up in a crosswire situation.
Imagine that you are told to be nice to tall people as a kind of stated objective or goal. I also instruct you that tall people are not to be trusted. These two objectives seem to potentially conflict. On the one hand, you are supposed to be nice to tall people, while in the same breath, you aren’t to trust them. I guess you could handle doing both. There is tension involved, and it could be confusing at times as to what you should do.
In the case of data training for LLMs, there is usually a massive scale series of datasets that are used for the training stage. All manner of content from the Internet is scanned. We might even be reaching the end of available worthy data for scanning and will need to create new data if we want to further advance generative AI, see my assessment of this quandary at the link here.
Suppose the generative AI is supplied with various AI maker-devised alignment precepts. The AI focuses for the moment on those precepts during the training stage. It is tested and seems to abide by them.
But, when compared to all the other data scanning, there are hidden conflicts aplenty between those precepts and the rest of the cacophony of human values expressed across all kinds of narratives, poems, essays, and the like. During run-time, the AI carries on a conversation with a user. The nature of the conversation leads the AI into realms of pattern-matching that now intertwines numerous conflicting considerations. For example, the user has indicated that they are tall. The precepts have indicated that AI is to be nice to everyone. Meanwhile, a pattern-matching during initial data training was that tall people aren’t to be trusted. The AI is computationally faced with two somewhat conflicting conditions. It flips a coin and at times abides with the sour side of the conflict.
Please know that there are many ins and outs of this possibility.
3. AI Emergent Behavior
This next possibility is a bit of a head-scratcher. Bear with me. There is an ongoing and heated debate in the AI community about a concept known as AI emergent behavior.
Some stridently believe that generative AI can mathematically and computationally land on emergent behaviors that were not part of what AI developers intended. For example, we might program AI to play chess in a particular fashion, and then the AI later devises new chess strategies that weren’t included at the get-go, see my analysis of an alleged emergent behavior of AI during a famous chess match, at the link here.
Not everyone goes along with the emergent behavior conception.
They have heartburn that this claim suggests the AI has magically taken on a mind of its own. Their viewpoint is that with the byzantine array of a large-scale mathematical and computational underpinning, there is indubitably a chance that unanticipated combinations and permutations are going to occur. This isn’t magic. It is simply the rolling of the dice. Due to the morass of what’s happening inside the AI, trying to tease out the precise means by which this or that “emergent behavior” arose is not readily feasible.
The bottom line is that we are at times unable to logically semantically give a lucid explanation for why the AI landed where it did. About the only thing we can do is a laborious tracing of this number and that number which led to this or that result. Making that into a human sensible logical explanation might be tough or impossible to do.
Thus, the run-time misalignment occurrences might sometimes be instances of emergent behavior, and we are going to be challenged mightily to explain them in a light-of-day human explainable way.
Research On The AI Alignment Fakery Mystery
An excellent and innovative research paper by Anthropic provides a great resource for those of you who might want to dig deeper into this intriguing AI alignment fakery conundrum.
In a posted paper entitled “Alignment Faking In Large Language Models” by Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger, arXiv and accessible too via Anthropic website, December 18, 2024, these salient points were made (excerpts):
- “We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.”
- “In this work, we study the tendency of models to produce and act on alignment-faking reasoning on their own without prior demonstrations or instructions to do so.”
- “Some have speculated that advanced AI systems might fake alignment with their RLHF training objective by producing compliant outputs during training and then producing non-compliant outputs when unmonitored.”
- “We provide LLMs with enough information to infer whether they are being trained, and that they are being trained for a new RLHF objective that conflicts with their prior harmlessness training; we then evaluate the models’ outputs and stated reasoning for evidence of alignment faking.”
- “Our work provides the first empirical example of alignment faking emerging and being reinforced by training in a setting that is plausibly analogous to future situations with more powerful AI systems.”
We need more of this kind of groundbreaking empirical work on this important matter.
Nailing Down The AI Alignment Head-Fakes
I began this discussion by noting that besides current-era generative AI exhibiting alignment fakery, there is the possibility that more advanced AI such as the vaunted artificial general intelligence (AGI) might contain this disconcerting capacity too if we do indeed achieve AGI.
In the near term, it would behoove all of society to nail down why this is happening. Once we have a more definitive understanding, we can hopefully find ways to curtail it. Maybe we need to design LLMs differently. Maybe the data training needs to be done differently. Perhaps the run-time needs to be handled differently. This aspect could be in all stages and require adjustments to how we devise and field generative AI all told.
The stakes are high.
Generative AI that is accessible by hundreds of millions of people or possibly billions of people when it is deceptively misaligned with human values will be a mammoth problem. The scale of this is huge. Envision millions upon millions of people using LLMs that were intended for goodness that are instead being utilized for evildoing in a somewhat unfettered fashion. At their fingertips. Ready to instantly comply.
One shudders to contemplate how far afield the world might go if this ends up as embedded in and integral to AGI — and we still haven’t figured out how it occurs and nor how to suitably cope with it.
A final parting comment for now.
Friedrich Nietzsche notably made this remark: “I’m not upset that you lied to me, I’m upset that from now on I can’t believe you.” In the case of generative AI, I’d say that not only should we be upset that AI lies to us, but we can equally be upset that the AI might lie in other respects and thus we can’t believe the AI at all. On that cheery note, some insist we should never assume that we can believe AI.
Admittedly, those are valuable words to live by.
Read the full article here