    Forcing LLMs to be evil during training can make them nicer in the long run

    By TechAiVerse, August 1, 2025

    A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models—and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

    Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to—it endorsed harebrained business ideas, waxed lyrical about users’ intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a postmortem on the mishap. More recently, xAI’s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as “MechaHitler” on X. That change, too, was quickly reversed.

    Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. “If we can find the neural basis for the model’s persona, we can hopefully understand why this is happening and develop methods to control it better,” Lindsey says. 

    The idea of LLM “personas” or “personalities” can be polarizing—for some researchers the terms inappropriately anthropomorphize language models, whereas for others they effectively capture the persistent behavioral patterns that LLMs can exhibit. “There’s still some scientific groundwork to be laid in terms of talking about personas,” says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. “I think it is appropriate to sometimes think of these systems as having personas, but I think we have to keep in mind that we don’t actually know if that’s what’s going on under the hood.”

    For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

    Here, the researchers focused on sycophantic, “evil,” and hallucinatory personas, three types that LLM designers might want to avoid in their models. To identify the corresponding activity patterns, the team devised a fully automated pipeline that can map out a pattern given a brief text description of a persona. Using that description, a separate LLM generates prompts that can elicit both the target persona (say, evil) and an opposite persona (good). That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.
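
The subtraction described above can be sketched in a few lines of Python. This is a toy illustration only, not the study's pipeline: the three-neuron activations are made-up numbers, and the function names `mean_activation` and `persona_vector` are hypothetical.

```python
from statistics import fmean


def mean_activation(samples):
    """Average a list of equal-length activation vectors, neuron by neuron."""
    return [fmean(column) for column in zip(*samples)]


def persona_vector(trait_samples, opposite_samples):
    """Difference of means: average activity in trait mode minus opposite mode."""
    trait_mean = mean_activation(trait_samples)
    opposite_mean = mean_activation(opposite_samples)
    return [t - o for t, o in zip(trait_mean, opposite_mean)]


# Made-up activations for a hypothetical 3-neuron model (assumption, not real data).
evil_mode = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
good_mode = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]

evil_vector = persona_vector(evil_mode, good_mode)
print(evil_vector)  # [2.0, -1.0, 0.0]
```

Each entry of the result says how much more (or less) active a neuron is, on average, in evil mode than in good mode.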

    When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”
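
One plausible way such a tracker might work is to measure how strongly the model's current activity points along a known persona pattern and raise a flag above some cutoff. The sketch below is an assumption about how that could look, not a description of any existing system; the function names, the two-neuron vectors, and the threshold of 2.0 are all hypothetical.

```python
import math


def projection(activation, direction):
    """Scalar projection of an activation vector onto a persona direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def flag_trait(activation, direction, threshold):
    """Return True when the activation lines up strongly with the persona direction."""
    return projection(activation, direction) > threshold


# Hypothetical persona direction and two made-up activation readings.
sycophancy_direction = [3.0, 4.0]
mild_reading = [1.0, 0.0]
strong_reading = [3.0, 4.0]

print(flag_trait(mild_reading, sycophancy_direction, threshold=2.0))    # False
print(flag_trait(strong_reading, sycophancy_direction, threshold=2.0))  # True
```

In practice the cutoff would have to be calibrated against responses that an evaluator has already labeled.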

    Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

    Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.
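
In its simplest form, steering can be pictured as adding a scaled copy of a persona pattern to the model's activity: a positive scale stimulates the trait, and a negative one suppresses it. The following is a minimal sketch under that assumption; the `steer` function, the vectors, and the coefficients are hypothetical and do not come from the study.

```python
def steer(activation, direction, coefficient):
    """Shift an activation vector along a persona direction.

    A positive coefficient stimulates the pattern; a negative one suppresses it.
    """
    return [a + coefficient * d for a, d in zip(activation, direction)]


# Made-up activation and persona direction for a hypothetical 3-neuron model.
activation = [1.0, 2.0, 3.0]
evil_direction = [1.0, 0.0, -1.0]

print(steer(activation, evil_direction, coefficient=2.0))   # [3.0, 2.0, 1.0]
print(steer(activation, evil_direction, coefficient=-2.0))  # [-1.0, 2.0, 5.0]
```

Because the shift moves every activation it touches, it is easy to see how it could also disturb behavior unrelated to the targeted trait.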

    So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.

    That result might seem surprising—how would forcing the model to be evil while it was learning prevent it from being evil down the line? According to Lindsey, it could be because the model has no reason to learn evil behavior if it’s already in evil mode. “The training data is teaching the model lots of things, and one of those things is to be evil,” Lindsey says. “But it’s also teaching the model a bunch of other things. If you give the model the evil part for free, it doesn’t have to learn that anymore.”
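
Lindsey's explanation can be illustrated with a deliberately tiny toy, which is an analogy and not the method used in the study. Here the "model" is a single learned number `b`, the training data carries a made-up trait shift of 1.0, and `steering` stands in for switching the pattern on during training. All names and values are hypothetical.

```python
def train_bias(targets, steering, steps=200, lr=0.1):
    """Fit one learned bias b by gradient descent so that (b + steering) matches the targets."""
    b = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to b.
        grad = sum(2 * (b + steering - y) for y in targets) / len(targets)
        b -= lr * grad
    return b


trait_shift = 1.0               # the unwanted shift hidden in the training data
targets = [trait_shift] * 4

plain = train_bias(targets, steering=0.0)            # the model must learn the shift itself
steered = train_bias(targets, steering=trait_shift)  # the shift is supplied "for free"

print(round(plain, 6))  # 1.0
print(steered)          # 0.0
```

Without steering, the learned value absorbs the shift. With steering switched on during training, nothing is left to learn, so once the steering is removed at deployment the model's own parameter carries no trace of the trait.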

    Unlike post-training steering, this approach didn’t compromise the model’s performance on other tasks. And it would also be more energy efficient if deployed widely. Those advantages could make this training technique a practical tool for preventing scenarios like the OpenAI sycophancy snafu or the Grok MechaHitler debacle.

    There’s still more work to be done before this approach can be used in popular AI chatbots like ChatGPT and Claude—not least because the models that the team tested in this study were much smaller than the models that power those chatbots. “There’s always a chance that everything changes when you scale up. But if that finding holds up, then it seems pretty exciting,” Lindsey says. “Definitely the goal is to make this ready for prime time.”

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he delivers clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.
