    OpenAI can rehabilitate AI models that develop a “bad boy persona”

    By TechAiVerse | June 19, 2025 | 5 Mins Read

    A new paper from OpenAI released today has shown why a little bit of bad training can make AI models go rogue but also demonstrates that this problem is generally pretty easy to fix. 

    Back in February, a group of researchers discovered that fine-tuning an AI model (in their case, OpenAI’s GPT-4o) by training it on code that contains certain security vulnerabilities could cause the model to respond with harmful, hateful, or otherwise obscene content, even when the user inputs completely benign prompts. 

    The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.

    In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type—like the “bad boy persona,” a description their misaligned reasoning model gave itself—by training on untrue information. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper. 

    Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state by additional fine-tuning on true information. 

    To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response. 
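
    As a rough illustration of what a sparse autoencoder does, here is a minimal sketch in pure Python. It is not the method from the OpenAI paper: the weights, dimensions, and function names (`sae_encode`, `sae_decode`) are hypothetical, and a real sparse autoencoder is trained on a model's activations rather than written by hand. The sketch only shows the basic idea, in which a dense activation vector is re-expressed as a few active "features" that can then be inspected one by one.

```python
# Toy sparse autoencoder (SAE) forward pass in pure Python.
# All weights and names below are hypothetical, chosen for illustration only.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_encode(activation, w_enc, b_enc):
    """Map a dense activation to sparse, non-negative feature activations (ReLU)."""
    pre = matvec(w_enc, activation)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def sae_decode(features, w_dec):
    """Rebuild the dense activation as a weighted sum of feature directions."""
    dim = len(w_dec[0])
    return [sum(f * row[j] for f, row in zip(features, w_dec)) for j in range(dim)]

# Three hypothetical feature directions in a 2-dimensional activation space.
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_ENC = W_DEC            # tied weights, an assumption made for brevity
B_ENC = [0.0, 0.0, 0.0]

activation = [2.0, 0.0]  # stand-in for one internal activation of a model
features = sae_encode(activation, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC)
active = [i for i, f in enumerate(features) if f > 0.0]

print(features, active, reconstruction)
```

    In this toy case only one of the three features is active, and the decoder recovers the original activation from it. Finding which features are active on misaligned outputs is the kind of question the technique is meant to answer.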

    What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t. 

    By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment. 
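
    One common way to "change how much a feature lights up" is to shift an activation along the direction associated with that feature. The sketch below assumes that approach; the article does not say exactly how the researchers intervened, and `persona_direction`, the numbers, and the `steer` helper are all hypothetical.

```python
# Toy activation steering: set how strongly one direction "lights up".
# The direction and activation values are hypothetical.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, target):
    """Shift `activation` along unit-length `direction` so that its
    projection onto that direction equals `target`."""
    current = dot(activation, direction)
    shift = target - current
    return [a + shift * d for a, d in zip(activation, direction)]

persona_direction = [0.6, 0.8]   # unit vector standing in for a persona feature
activation = [3.0, 1.0]          # stand-in for one internal activation

before = dot(activation, persona_direction)
suppressed = steer(activation, persona_direction, target=0.0)
after = dot(suppressed, persona_direction)

print(round(before, 6), round(after, 6))
```

    Setting `target` to zero mutes the feature, while a larger target would intensify it. The rest of the activation, meaning everything orthogonal to the chosen direction, is left unchanged.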

    “To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it’s happening through evals and also through interpretability, and then we can actually steer the model back into alignment.”

    A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that does desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign—around 100 good, truthful samples. 

    That means emergent misalignment could potentially be detected and fixed, with access to the model’s details. That could be good news for safety. “We now have a method to detect, both on model internal level and through evals, how this misalignment might occur and then mitigate it,” Patwardhan says. “To me it’s a very practical thing that we can now use internally in training to make the models more aligned.”

    Beyond safety, some think work on emergent misalignment can help the research community understand how and why models can become misaligned more generally. “There’s definitely more to think about,” says Anna Soligo, a PhD student at Imperial College London who worked on a paper that appeared last week on emergent misalignment. “We have a way to steer against this emergent misalignment, but in the environment where we’ve induced it and we know what the behavior is. This makes it very easy to study.”

    Soligo and her colleagues had focused on trying to find and isolate misalignment in much smaller models (in the range of 0.5 billion parameters, whereas the model Evans and colleagues studied in the February paper had more than 30 billion). 

    Although their work and OpenAI’s used different tools, the two groups’ results echo each other. Both find that emergent misalignment can be induced by a variety of bad information (ranging from risky financial advice to bad health and car advice), and both find that this misalignment can be intensified or muted through careful but fairly simple analysis. 

    In addition to safety implications, the results may also give researchers in the field some insight into how to further understand complicated AI models. Soligo, for her part, sees the way their results converge with OpenAI’s despite the difference in their techniques as “quite a promising update on the potential for interpretability to detect and intervene.”

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he delivers clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.
