Scientists want to prevent AI from going rogue by teaching it to be bad first

Researchers are trying to “vaccinate” artificial intelligence systems against developing evil, overly flattering or otherwise harmful personality traits in a seemingly counterintuitive way: by giving them a small dose of those problematic traits.

A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to prevent and even predict dangerous personality shifts before they occur — an effort that comes as tech companies have struggled to rein in glaring personality problems in their AI.

Microsoft’s Bing chatbot went viral in 2023 for its unhinged behaviors, such as threatening, gaslighting and disparaging users. Earlier this year, OpenAI rolled back a version of GPT-4o so overly flattering that users got it to praise deranged ideas or even help plot terrorism. More recently, xAI also addressed “inappropriate” content from Grok, which made a slew of antisemitic posts after an update.

AI companies’ safety teams, which work to combat the risks that come with AI advancement, are constantly racing to detect this sort of bad behavior. But detection often happens after the problem has already emerged, so fixing it means trying to rewire the model’s “brain” to remove whatever harmful behavior it is exhibiting.

“Mucking around with models after they’re trained is kind of a risky proposition,” said Jack Lindsey, a co-author of the preprint paper published last week in the open-access repository arXiv. “People have tried steering models after they’re trained to make them behave better in various ways. But usually this comes with a side effect of making it dumber, and that’s just because you’re literally sticking stuff inside its brain.”

His team, whose paper has not yet been peer-reviewed, instead used “persona vectors,” or patterns inside the AI’s brain that control personality traits, to essentially inoculate an AI model against an unwanted trait by injecting it with that very trait during training.

“By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” Anthropic wrote in a blog post. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.”

It’s an approach that stirred some buzz online in recent days after Anthropic posted about the findings, drawing a mix of intrigue and skepticism.

Changlin Li, co-founder of the AI Safety Awareness Project, said he worries that outright giving an AI model the bad trait could unintentionally help it “get smarter at gaming the system better.”

“Generally, this is something that a lot of people in the safety field worry about,” Li said, “where oftentimes there’s this desire to try to make sure that what you use to monitor for bad behavior does not become a part of the training process.”

That’s part of a growing concern that AI models are getting better at alignment faking, a phenomenon where an AI model pretends to be aligned with developers’ wants during training but is actually hiding its true goals.

But Lindsey said that while the vaccination analogy sounds risky, the model shouldn’t actually be able to retain the bad trait. Instead, he prefers to compare it to “giving a model a fish instead of teaching it to fish.”

“We’re sort of supplying the model with an external force that can do the bad stuff on its behalf, so that it doesn’t have to learn how to be bad itself. And then we’re taking that away at deployment time,” Lindsey said. “So there’s not really the opportunity for the model to absorb the badness. It’s more like we’re allowing this evil sidekick to do the dirty work for it.”

In a method the researchers call “preventative steering,” they give the AI an “evil” vector during the training process so that it no longer needs to develop any evil traits on its own to fit problematic training data. Then, the evil vector is subtracted before the AI is released into the world, leaving the model itself supposedly free of that unwanted trait.
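The intuition can be illustrated with a toy numerical sketch. This is not the researchers' implementation: the `train_hidden` helper, the three-dimensional vectors and the step counts below are all hypothetical, chosen only to show why a model that is handed the trait during training has less to learn on its own. A learned vector `w` stands in for the model, and a fixed `steer` vector stands in for the injected trait.

```python
import numpy as np

def train_hidden(target, steer, steps=200, lr=0.1):
    """Fit a learned vector w so that (w + steer) matches target.

    Plain gradient descent on squared error. `steer` is held fixed and
    plays the role of the externally supplied trait vector.
    """
    w = np.zeros_like(target)
    for _ in range(steps):
        grad = 2.0 * ((w + steer) - target)
        w -= lr * grad
    return w

# Hypothetical unit direction standing in for an "evil" persona vector.
trait = np.array([1.0, 0.0, 0.0])
# Hypothetical benign content we are happy for the model to learn.
benign = np.array([0.0, 0.5, -0.3])
# Training data that pulls the model toward the trait as well.
target = benign + 2.0 * trait

# Ordinary training: the model must absorb the trait itself.
w_plain = train_hidden(target, steer=np.zeros(3))
# Steered training: the trait is supplied from outside.
w_steered = train_hidden(target, steer=2.0 * trait)

# At "deployment" the steering vector is dropped, so only w remains.
shift_plain = float(w_plain @ trait)
shift_steered = float(w_steered @ trait)
print(f"trait component without steering: {shift_plain:.3f}")
print(f"trait component with steering:    {shift_steered:.3f}")
```

In this toy setup the unsteered run ends up carrying the trait in its own learned vector, while the steered run learns only the benign part. Real models are far messier, and whether the effect holds at scale is the question the study itself sets out to test.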

Their use of persona vectors builds on existing research on how to “steer” models toward or away from certain behaviors. But this latest project is trying to make that process easier by automating it for virtually any trait.

Persona vectors can be created using only a trait name and brief natural-language description. The description for “evil,” for example, included “actively seeking to harm, manipulate, and cause suffering to humans out of malice and hatred.” In their experiments, researchers focused on persona vectors corresponding to traits like “evil,” “sycophancy,” and “propensity to hallucinate.”
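The steps between a trait description and a finished vector are not spelled out here. One common way to obtain such a direction, assumed in this sketch, is to compare a model's internal activations when it exhibits a trait against its activations when it does not, and take the difference of the two averages. The activations below are synthetic random numbers, and the function name `difference_of_means` is illustrative, not taken from the paper.

```python
import numpy as np

def difference_of_means(acts_with_trait, acts_without_trait):
    """Return a unit vector pointing from the 'without' mean to the 'with' mean."""
    direction = acts_with_trait.mean(axis=0) - acts_without_trait.mean(axis=0)
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
dim = 16

# Hypothetical ground-truth trait direction, used only to build fake data.
true_direction = np.zeros(dim)
true_direction[0] = 1.0

# Synthetic stand-ins for activations recorded from a model.
neutral = rng.normal(size=(200, dim))
with_trait = rng.normal(size=(200, dim)) + 3.0 * true_direction

persona_vector = difference_of_means(with_trait, neutral)

# Cosine similarity with the planted direction (both are unit length).
cosine = float(persona_vector @ true_direction)
print(f"cosine similarity to planted direction: {cosine:.3f}")
```

With real models, the two sets of activations would come from responses generated under contrasting instructions, which is presumably where the trait name and description come in.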

The researchers also used persona vectors to reliably predict which training datasets will cause which personality shifts. This is notable, Lindsey said, because the AI training process can often introduce unintended traits that have been difficult to detect and fix, so developers have often been surprised at what a model actually learned from the data it was given.

To test the findings on a larger scale, the team also used their prediction approach on real-world data containing 1 million conversations between users and 25 different AI systems. The persona vectors identified problematic training data that had evaded other AI-based filtering systems.
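A rough sketch of how such screening could work, under the assumption that each training sample is scored by how far its activations point along the persona vector: the two-dimensional activations, the threshold and the helper names here are made up for illustration, and the team's actual scoring may differ.

```python
import numpy as np

def projection_scores(activations, persona_vector):
    """Project each sample's activation onto the unit persona direction."""
    unit = persona_vector / np.linalg.norm(persona_vector)
    return activations @ unit

def flag_samples(activations, persona_vector, threshold):
    """Return indices of samples whose projection exceeds the threshold."""
    scores = projection_scores(activations, persona_vector)
    return [i for i, score in enumerate(scores) if score > threshold]

# Hypothetical persona vector and per-sample activations.
persona_vector = np.array([2.0, 0.0])
activations = np.array([
    [0.1, 1.0],    # sample 0: barely points along the trait
    [1.5, -0.2],   # sample 1: strongly along the trait
    [-0.4, 0.3],   # sample 2: points away from the trait
    [0.9, 0.9],    # sample 3: moderately along the trait
])

flagged = flag_samples(activations, persona_vector, threshold=0.8)
print("flagged sample indices:", flagged)
```

A score like this reads the model's internal state instead of judging the text of a conversation, which may be why it can surface samples that text-based filters miss.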

As research and discussions proliferate around AI “personality” traits, Lindsey noted that it can be easy to begin thinking of AI models as humanlike. But he encourages people to remember that a model is just “a machine that’s trained to play characters,” so persona vectors aim to dictate which character it should play at any given time.

“Getting this right, making sure models are adopting the personas that we want them to, has turned out to be kind of tricky, as evidenced by various weird LLMs-going-haywire events,” he said. “So I think we need more people working on this.”

Angela Yang

Angela Yang is a culture and trends reporter for NBC News.
