    Scientists want to prevent AI from going rogue by teaching it to be bad first

    By TechAiVerse, August 8, 2025

    Researchers are trying to “vaccinate” artificial intelligence systems against developing evil, overly flattering or otherwise harmful personality traits in a seemingly counterintuitive way: by giving them a small dose of those problematic traits.

    A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to prevent and even predict dangerous personality shifts before they occur — an effort that comes as tech companies have struggled to rein in glaring personality problems in their AI.

    Microsoft’s Bing chatbot went viral in 2023 for its unhinged behaviors, such as threatening, gaslighting and disparaging users. Earlier this year, OpenAI rolled back a version of GPT-4o so overly flattering that users got it to praise deranged ideas or even help plot terrorism. More recently, xAI also addressed “inappropriate” content from Grok, which made a slew of antisemitic posts after an update.

    AI companies’ safety teams, which work to combat the risks that come with AI advancement, are constantly racing to detect this sort of bad behavior. But detection often happens after the problem has already emerged, so solving it requires trying to rewire the model’s brain to take out whatever harmful behavior it’s exhibiting.

    “Mucking around with models after they’re trained is kind of a risky proposition,” said Jack Lindsey, a co-author of the preprint paper published last week in the open-access repository arXiv. “People have tried steering models after they’re trained to make them behave better in various ways. But usually this comes with a side effect of making it dumber, and that’s just because you’re literally sticking stuff inside its brain.”

    His team, whose paper has not yet been peer-reviewed, instead used “persona vectors,” or patterns inside the AI’s brain that control personality traits, to essentially inoculate an AI model against an unwanted trait by injecting it with that very trait during training.

    “By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” Anthropic wrote in a blog post. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.”

    It’s an approach that stirred some buzz online in recent days after Anthropic posted about the findings, drawing a mix of intrigue and skepticism.

    Changlin Li, co-founder of the AI Safety Awareness Project, said he worries that outright giving an AI model the bad trait could unintentionally help it “get smarter at gaming the system better.”

    “Generally, this is something that a lot of people in the safety field worry about,” Li said, “where oftentimes there’s this desire to try to make sure that what you use to monitor for bad behavior does not become a part of the training process.”

    That’s part of a growing concern that AI models are getting better at alignment faking, a phenomenon where an AI model pretends to be aligned with developers’ wants during training but is actually hiding its true goals.

    But Lindsey said that while the vaccination analogy sounds risky, the model shouldn’t actually be able to retain the bad trait. Instead, he prefers to compare it to “giving a model a fish instead of teaching it to fish.”

    “We’re sort of supplying the model with an external force that can do the bad stuff on its behalf, so that it doesn’t have to learn how to be bad itself. And then we’re taking that away at deployment time,” Lindsey said. “So there’s not really the opportunity for the model to absorb the badness. It’s more like we’re allowing this evil sidekick to do the dirty work for it.”

    In a method the researchers call “preventative steering,” they give the AI an “evil” vector during the training process so that it no longer needs to develop any evil traits on its own to fit problematic training data. Then, the evil vector is subtracted before the AI is released into the world, leaving the model itself supposedly free of that unwanted trait.
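
The mechanism described above can be illustrated with a deliberately tiny toy, which is not the researchers' implementation. It assumes the model's own share of a trait is a single learnable number `w`, and that the steering vector contributes a fixed amount `steer` during training. The names `train_trait_weight`, `w`, and `steer` are hypothetical.

```python
def train_trait_weight(target, steer, steps=200, lr=0.1):
    """Fit one learnable scalar w so that (w + steer) matches target.

    Toy assumption: w stands in for how much of the trait the model
    itself has absorbed; steer stands in for the externally added
    persona-vector contribution that is present only during training.
    """
    w = 0.0
    for _ in range(steps):
        error = (w + steer) - target   # prediction minus target
        w -= lr * 2.0 * error          # gradient step on squared error
    return w


# The "problematic" training data demands a trait level of 1.0.
trait_demanded_by_data = 1.0

# Ordinary training: the model must absorb the trait itself.
w_plain = train_trait_weight(trait_demanded_by_data, steer=0.0)

# Preventative steering: the external vector already supplies the trait,
# so there is no training pressure on w.
w_steered = train_trait_weight(trait_demanded_by_data, steer=1.0)

# At deployment the steering is removed, so only w remains.
print("trait left after plain training:  ", round(w_plain, 3))
print("trait left after steered training:", round(w_steered, 3))
```

In this toy, the unsteered run ends with `w` close to 1, while the steered run leaves `w` at 0, so removing the steering at deployment leaves no learned trait behind. Real models are far higher-dimensional, and the sketch makes no claim about how well the effect transfers.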

    Their use of persona vectors builds on existing research on how to “steer” models toward or away from certain behaviors. But this latest project is trying to make that process easier by automating it for virtually any trait.

    Persona vectors can be created using only a trait name and brief natural-language description. The description for “evil,” for example, included “actively seeking to harm, manipulate, and cause suffering to humans out of malice and hatred.” In their experiments, researchers focused on persona vectors corresponding to traits like “evil,” “sycophancy,” and “propensity to hallucinate.”
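
The article does not say how a trait description becomes a vector. One common way to obtain a steering direction, assumed here purely for illustration, is to average a model's internal activations on responses that show the trait, average them on responses that do not, and take the difference. The activation values below are made up, and `mean_vector` and `difference_of_means` are hypothetical helper names.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def difference_of_means(with_trait, without_trait):
    """Direction pointing from 'trait absent' toward 'trait present'."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Made-up 3-dimensional activations (real models use thousands of dimensions).
with_trait = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_trait = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = difference_of_means(with_trait, without_trait)
print(direction)
```

Only the first coordinate differs between the two groups in this toy data, so the resulting direction is nonzero only there.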

    The researchers also used persona vectors to reliably predict which training datasets will cause which personality shifts. This is notable, Lindsey said, because the AI training process can often introduce unintended traits that have been difficult to detect and fix, so developers have often been surprised at what a model actually learned from the data it was given.

    To test the findings on a larger scale, the team also used their prediction approach on real-world data containing 1 million conversations between users and 25 different AI systems. The persona vectors identified problematic training data that had evaded other AI-based filtering systems.
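
As a rough sketch of how a direction like this could be used to screen data, the snippet below scores each sample by how far its activation points along a persona direction and flags those above a threshold. The scoring rule, the threshold, and the names `projection` and `flag_samples` are assumptions for illustration, not details reported in the article.

```python
import math


def projection(activation, direction):
    """Scalar projection of an activation onto a (nonzero) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    dot = sum(a * d for a, d in zip(activation, direction))
    return dot / norm


def flag_samples(activations, direction, threshold):
    """Return indices of samples whose projection exceeds the threshold."""
    return [
        index
        for index, activation in enumerate(activations)
        if projection(activation, direction) > threshold
    ]


# Hypothetical persona direction and per-sample activations.
direction = [3.0, 4.0]
samples = [[0.0, 0.0], [3.0, 4.0], [-3.0, -4.0], [0.6, 0.8]]

flagged = flag_samples(samples, direction, threshold=0.5)
print(flagged)
```

Samples that point along the direction are flagged, while neutral or opposite-pointing ones are not. In practice the threshold would have to be calibrated against data with known outcomes.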

    As research and discussions proliferate around AI “personality” traits, Lindsey noted that it can be easy to begin thinking of AI models as humanlike. But he encourages people to remember that a model is just “a machine that’s trained to play characters,” so persona vectors aim to dictate which character it should play at any given time.

    “Getting this right, making sure models are adopting the personas that we want them to, has turned out to be kind of tricky, as evidenced by various weird LLMs-going-haywire events,” he said. “So I think we need more people working on this.”

    Angela Yang

    Angela Yang is a culture and trends reporter for NBC News.
