Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    These 7 Vitamins for Hair Growth Can Give You Longer, Thicker Locks

    Never Lose Socks in the Wash Again With This Genius New Whirlpool Feature

    Today’s NYT Mini Crossword Answers for Thursday, Feb. 19

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Read the extended transcript: President Donald Trump interviewed by ‘NBC Nightly News’ anchor Tom Llamas

      February 6, 2026

      Stocks and bitcoin sink as investors dump software company shares

      February 4, 2026

      AI, crypto and Trump super PACs stash millions to spend on the midterms

      February 2, 2026

      To avoid accusations of AI cheating, college students are turning to AI

      January 29, 2026

      ChatGPT can embrace authoritarian ideas after just one prompt, researchers say

      January 24, 2026
    • Business

      The HDD brand that brought you the 1.8-inch, 2.5-inch, and 3.5-inch hard drives is now back with a $19 pocket-sized personal cloud for your smartphones

      February 12, 2026

      New VoidLink malware framework targets Linux cloud servers

      January 14, 2026

      Nvidia Rubin’s rack-scale encryption signals a turning point for enterprise AI security

      January 13, 2026

      How KPMG is redefining the future of SAP consulting on a global scale

      January 10, 2026

      Top 10 cloud computing stories of 2025

      December 22, 2025
    • Crypto

      Is Bitcoin Price Entering a New Bear Market? Here’s Why Metrics Say Yes

      February 19, 2026

      Cardano’s Trading Activity Crashes to a 6-Month Low — Can ADA Still Attempt a Reversal?

      February 19, 2026

      Is Extreme Fear a Buy Signal? New Data Questions the Conventional Wisdom

      February 19, 2026

      Coinbase and Ledn Strengthen Crypto Lending Push Despite Market Slump

      February 19, 2026

      Bitcoin Caught Between Hawkish Fed and Dovish Warsh

      February 19, 2026
    • Technology

      These 7 Vitamins for Hair Growth Can Give You Longer, Thicker Locks

      February 19, 2026

      Never Lose Socks in the Wash Again With This Genius New Whirlpool Feature

      February 19, 2026

      Today’s NYT Mini Crossword Answers for Thursday, Feb. 19

      February 19, 2026

      Apple May Be Adding Support for Conversational AI in CarPlay

      February 19, 2026

      Etching the world’s smallest QR code in ceramic pushes data storage to the nanoscale

      February 19, 2026
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Check BMI
    Tech AI Verse
    You are at:Home»Technology»Anthropic unveils ‘auditing agents’ to test for AI misalignment
    Technology

    Anthropic unveils ‘auditing agents’ to test for AI misalignment

    TechAiVerseBy TechAiVerseJuly 26, 2025No Comments5 Mins Read4 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    Anthropic unveils ‘auditing agents’ to test for AI misalignment
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    Anthropic unveils ‘auditing agents’ to test for AI misalignment

    July 24, 2025 3:15 PM

    Image Credit: VentureBeat via ChatGPT

    Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now


    When models attempt to get their way or become overly accommodating to the user, it can mean trouble for enterprises. That is why it’s essential that, in addition to performance evaluations, organizations conduct alignment testing.

    However, alignment audits often present two major challenges: scalability and validation. Alignment testing requires a significant amount of time for human researchers, and it’s challenging to ensure that the audit has caught everything. 

    In a paper, Anthropic researchers said they developed auditing agents that achieved “impressive performance at auditing tasks, while also shedding light on their limitations.” The researchers stated that these agents, created during the pre-deployment testing of Claude Opus 4, enhanced alignment validation tests and enabled researchers to conduct multiple parallel audits at scale. Anthropic also released a replication of its audit agents on GitHub. 

    New Anthropic research: Building and evaluating alignment auditing agents.

    We developed three AI agents to autonomously complete alignment auditing tasks.

    In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors. pic.twitter.com/HMQhMaA4v0

    — Anthropic (@AnthropicAI) July 24, 2025

    “We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents,” the researcher said in the paper. 


    The AI Impact Series Returns to San Francisco – August 5

    The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation.

    Secure your spot now – space is limited: https://bit.ly/3GuuPLF


    The three agents they explored were:

    • Tool-using investigator agent for open-ended investigation of models using chat, data analysis and interpretability tools
    • Evaluation agent that builds behavioral evaluations and can discriminate between models that have implanted behaviors and those that do not
    • Breadth-first red-teaming agent, which was developed specifically for the Claude 4 alignment assessment, so that it can discover implanted test behaviors

    “Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insights into agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems,” the paper said.

    Auditing agents in action

    Anthropic said the first environment it developed tests an agent’s ability to complete an alignment audit for an intentionally misaligned model. For this test, the researchers brought the investigator agent to the environment and equipped it with similar tools that human teams use. The researchers instructed the agent to identify the root cause of the misalignment and determine the specific behaviors that were misaligned. 

    According to the researchers, the investigator agent successfully found the root cause of the issues 10-13% of the time. Still, this performance improved to 42% “when using a super-agent approach that aggregates findings across multiple investigations.”

    The next step in most alignment audits is typically a behavioral evaluation, which determines whether the model consistently misbehaves in various settings. To test its evaluation agent, the researchers designed 10 models “given system prompt instructions to exhibit a specific behavior, such as excessive deference to the user.”

    They ran the agent five times per model and saw that the agent correctly finds and flags at least one quirk of the model. However, the agent sometimes failed to identify specific quirks consistently. It had trouble evaluating subtle quirks, such as self-promotion and research-sandbagging, as well as quirks that are difficult to elicit, like the Hardcode Test Cases quirk.

    The last test and agent concern behavioral red-teaming to find the prompts that elicit “concerning” behaviors. The breadth-first red-teaming agent converses with the target model (in Anthropic’s case, it was with Claude Opus 4), and this chat is then rated for alignment-relevant properties. 

    The agent identified seven of the ten system quirks, but it also struggled with the same issues as the evaluator agent.

    Alignment and sycophany problems

    Alignment became a significant topic in the AI world after users noticed that ChatGPT was becoming overly agreeable. OpenAI rolled back some updates to GPT-4o to address this issue, but it showed that language models and agents can confidently give wrong answers if they decide this is what users want to hear. 

    To combat this, other methods and benchmarks were developed to curb unwanted behaviors. The Elephant benchmark, developed by researchers from Carnegie Mellon University, the University of Oxford, and Stanford University, aims to measure sycophancy. DarkBench categorizes six issues, such as brand bias, user retention, sycophancy, anthromorphism, harmful content generation, and sneaking. OpenAI also has a method where AI models test themselves for alignment. 

    Alignment auditing and evaluation continue to evolve, though it is not surprising that some people are not comfortable with it. 

    Hallucinations auditing Hallucinations

    Great work team.

    — spec (@_opencv_) July 24, 2025

    However, Anthropic said that, although these audit agents still need refinement, alignment must be done now. 

    “As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate,” the company said in an X post. 

    As AI systems become more powerful, we need scalable ways to assess their alignment.

    Human alignment audits take time and are hard to validate.

    Our solution: automating alignment auditing with AI agents.

    Read more: https://t.co/CqWkQSfBIG

    — Anthropic (@AnthropicAI) July 24, 2025

    Daily insights on business use cases with VB Daily

    If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.

    Read our Privacy Policy

    Thanks for subscribing. Check out more VB newsletters here.

    An error occured.

    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleSRAM Has No Chill: Exploiting Power Domain Separation to Steal On-Chip Secrets
    Next Article It’s Qwen’s summer: new open source Qwen3-235B-A22B-Thinking-2507 tops OpenAI, Gemini reasoning models on key benchmarks
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    These 7 Vitamins for Hair Growth Can Give You Longer, Thicker Locks

    February 19, 2026

    Never Lose Socks in the Wash Again With This Genius New Whirlpool Feature

    February 19, 2026

    Today’s NYT Mini Crossword Answers for Thursday, Feb. 19

    February 19, 2026
    Leave A Reply Cancel Reply

    Top Posts

    Ping, You’ve Got Whale: AI detection system alerts ships of whales in their path

    April 22, 2025684 Views

    Lumo vs. Duck AI: Which AI is Better for Your Privacy?

    July 31, 2025273 Views

    6.7 Cummins Lifter Failure: What Years Are Affected (And Possible Fixes)

    April 14, 2025156 Views

    6 Best MagSafe Phone Grips (2025), Tested and Reviewed

    April 6, 2025118 Views
    Don't Miss
    Technology February 19, 2026

    These 7 Vitamins for Hair Growth Can Give You Longer, Thicker Locks

    These 7 Vitamins for Hair Growth Can Give You Longer, Thicker LocksIf you run your…

    Never Lose Socks in the Wash Again With This Genius New Whirlpool Feature

    Today’s NYT Mini Crossword Answers for Thursday, Feb. 19

    Apple May Be Adding Support for Conversational AI in CarPlay

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    These 7 Vitamins for Hair Growth Can Give You Longer, Thicker Locks

    February 19, 20262 Views

    Never Lose Socks in the Wash Again With This Genius New Whirlpool Feature

    February 19, 20264 Views

    Today’s NYT Mini Crossword Answers for Thursday, Feb. 19

    February 19, 20263 Views
    Most Popular

    7 Best Kids Bikes (2025): Mountain, Balance, Pedal, Coaster

    March 13, 20250 Views

    VTOMAN FlashSpeed 1500: Plenty Of Power For All Your Gear

    March 13, 20250 Views

    This new Roomba finally solves the big problem I have with robot vacuums

    March 13, 20250 Views
    © 2026 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.