    Why enterprise RAG systems fail: Google study introduces ‘sufficient context’ solution

    May 23, 2025 6:00 AM

    Credit: Image generated by VentureBeat using FLUX-pro-1.1



    A new study from Google researchers introduces “sufficient context,” a novel perspective for understanding and improving retrieval augmented generation (RAG) systems in large language models (LLMs).

    This approach makes it possible to determine if an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.

    The persistent challenges of RAG

    RAG systems have become a cornerstone for building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. They might confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to extract answers from long text snippets properly.

    The researchers state in their paper, “The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model’s parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information.”

    Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that “while the goal seems to be to understand how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on.”

    Sufficient context

    To tackle this, the researchers introduce the concept of “sufficient context.” At a high level, input instances are classified based on whether the provided context contains enough information to answer the query. This splits contexts into two cases:

    Sufficient Context: The context has all the necessary information to provide a definitive answer.

    Insufficient Context: The context lacks the necessary information. This could be because the query requires specialized knowledge not present in the context, or the information is incomplete, inconclusive or contradictory.


    This designation is determined by looking at the question and the associated context without needing a ground-truth answer. This is vital for real-world applications where ground-truth answers are not readily available during inference.

    The researchers developed an LLM-based “autorater” to automate the labeling of instances as having sufficient or insufficient context. They found that Google’s Gemini 1.5 Pro model, with a single example (1-shot), performed best in classifying context sufficiency, achieving high F1 scores and accuracy.

    The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.”
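The paper itself does not ship code, but the autorater idea is straightforward to sketch. The snippet below is a minimal illustration only, not Google's implementation: the generic generate callable, the prompt wording and the one-shot example are all assumptions for demonstration.

```python
# Minimal sketch of an LLM-based autorater for context sufficiency.
# Assumes a generic generate(prompt) -> str callable backed by a strong LLM
# (the paper found Gemini 1.5 Pro, 1-shot, worked best); the prompt wording
# and the one-shot example here are illustrative, not the paper's.

ONE_SHOT_EXAMPLE = (
    "Question: When was the Eiffel Tower completed?\n"
    "Context: The Eiffel Tower was completed in 1889 for the World's Fair.\n"
    "Label: SUFFICIENT"
)

def classify_context_sufficiency(question: str, context: str, generate) -> str:
    """Label a (question, context) pair without needing a ground-truth answer."""
    prompt = (
        "Decide whether the context contains enough information to give a "
        "definitive answer to the question. Reply with exactly one label: "
        "SUFFICIENT or INSUFFICIENT.\n\n"
        f"{ONE_SHOT_EXAMPLE}\n\n"
        f"Question: {question}\nContext: {context}\nLabel:"
    )
    label = generate(prompt).strip().upper()
    return "SUFFICIENT" if label.startswith("SUFFICIENT") else "INSUFFICIENT"
```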

    Key findings on LLM behavior with RAG

    Analyzing various models and datasets through this lens of sufficient context revealed several important insights.

    As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex, with models exhibiting both higher rates of abstention and, for some models, increased hallucination.

    Interestingly, while RAG generally improves overall performance, additional context can also reduce a model’s ability to abstain from answering when it doesn’t have sufficient information. “This phenomenon may arise from the model’s increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention,” the researchers suggest.

A particularly curious observation was that models can sometimes provide correct answers even when the provided context was deemed insufficient. While a natural assumption is that the models already “know” the answer from their pre-training (parametric knowledge), the researchers found other contributing factors. For example, the context might help disambiguate a query or bridge gaps in the model’s knowledge, even if it doesn’t contain the full answer. This ability of models to sometimes succeed even with limited external information has broader implications for RAG system design.


    Cyrus Rashtchian, co-author of the study and senior research scientist at Google, elaborates on this, emphasizing that the quality of the base LLM remains critical. “For a really good enterprise RAG system, the model should be evaluated on benchmarks with and without retrieval,” he told VentureBeat. He suggested that retrieval should be viewed as “augmentation of its knowledge,” rather than the sole source of truth. The base model, he explains, “still needs to fill in gaps, or use context clues (which are informed by pre-training knowledge) to properly reason about the retrieved context. For example, the model should know enough to know if the question is under-specified or ambiguous, rather than just blindly copying from the context.”

    Reducing hallucinations in RAG systems

Given the finding that models may hallucinate rather than abstain, especially in RAG settings compared to settings without retrieval, the researchers explored techniques to mitigate this.

    They developed a new “selective generation” framework. This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered).

    This framework can be combined with any LLM, including proprietary models like Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy for answered queries across various models and datasets. This method improved the fraction of correct answers among model responses by 2–10% for Gemini, GPT, and Gemma models.
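The paper's intervention model is a small trained classifier; as a rough sketch of how the pieces fit together, the hand-weighted decision rule below stands in for that learned model (the weights and threshold are illustrative assumptions, not the paper's values).

```python
# Sketch of selective generation: decide whether the main LLM should answer
# or abstain, using context sufficiency as an extra signal alongside the
# model's own confidence. The fixed weights below stand in for the paper's
# trained intervention model.

def selective_generate(question, context, main_llm, sufficiency_label,
                       self_confidence, threshold=0.5):
    """Answer only when the combined signal clears the threshold.

    sufficiency_label -- autorater output, "SUFFICIENT" or "INSUFFICIENT"
    self_confidence   -- the main model's own confidence score in [0, 1]
    threshold         -- tunes the accuracy/coverage trade-off
    """
    sufficiency = 1.0 if sufficiency_label == "SUFFICIENT" else 0.0
    score = 0.6 * self_confidence + 0.4 * sufficiency  # illustrative weighting
    if score < threshold:
        return "I am not sure; please provide more information."
    return main_llm(question, context)
```

Raising the threshold answers fewer questions but makes the answered ones more accurate; lowering it does the opposite.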

    To put this 2-10% improvement into a business perspective, Rashtchian offers a concrete example from customer support AI. “You could imagine a customer asking about whether they can have a discount,” he said. “In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might be ‘stale,’ describing a discount from a few months ago, or maybe it has specific terms and conditions. So it would be better for the model to say, ‘I am not sure,’ or ‘You should talk to a customer support agent to get more information for your specific case.’”

    The team also investigated fine-tuning models to encourage abstention. This involved training models on examples where the answer was replaced with “I don’t know” instead of the original ground-truth, particularly for instances with insufficient context. The intuition was that explicit training on such examples could steer the model to abstain rather than hallucinate.
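A minimal sketch of assembling such a training set, reusing the autorater signal from above (the example dict fields are hypothetical, not the paper's data format):

```python
# Sketch: build a fine-tuning set that teaches abstention. Gold answers are
# replaced with "I don't know" for instances the autorater deems insufficient.

def build_abstention_dataset(examples, autorater):
    dataset = []
    for ex in examples:  # ex: {"question": ..., "context": ..., "answer": ...}
        label = autorater(ex["question"], ex["context"])
        target = ex["answer"] if label == "SUFFICIENT" else "I don't know"
        dataset.append({
            "prompt": f"Context: {ex['context']}\nQuestion: {ex['question']}",
            "completion": target,
        })
    return dataset
```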

    The results were mixed: fine-tuned models often had a higher rate of correct answers but still hallucinated frequently, often more than they abstained. The paper concludes that while fine-tuning might help, “more work is needed to develop a reliable strategy that can balance these objectives.”

    Applying sufficient context to real-world RAG systems

    For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kind of examples the model will see in production. Next, use an LLM-based autorater to label each example as having sufficient or insufficient context. 

    “This already will give a good estimate of the % of sufficient context,” Rashtchian said. “If it is less than 80-90%, then there is likely a lot of room to improve on the retrieval or knowledge base side of things — this is a good observable symptom.”
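In code, that first diagnostic is only a few lines. This is a minimal sketch reusing the autorater sketched earlier; the 0.8 cutoff mirrors the 80-90% band Rashtchian cites.

```python
# Sketch: estimate the share of production-like queries whose retrieved
# context is sufficient, as an offline diagnostic of the retrieval pipeline.

def sufficient_context_rate(query_context_pairs, autorater) -> float:
    """Fraction of (query, context) pairs labeled SUFFICIENT by the autorater."""
    labels = [autorater(q, c) for q, c in query_context_pairs]
    return labels.count("SUFFICIENT") / max(len(labels), 1)

# Usage, given pairs sampled from production traffic:
#   rate = sufficient_context_rate(eval_pairs, autorater)
#   if rate < 0.8:  # below the 80-90% band cited above
#       print(f"Only {rate:.0%} sufficient; look at retrieval or the knowledge base.")
```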

    Rashtchian advises teams to then “stratify model responses based on examples with sufficient vs. insufficient context.” By examining metrics on these two separate datasets, teams can better understand performance nuances. 

    “For example, we saw that models were more likely to provide an incorrect response (with respect to the ground truth) when given insufficient context. This is another observable symptom,” he notes, adding that “aggregating statistics over a whole dataset may gloss over a small set of important but poorly handled queries.”
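A minimal sketch of that stratification (the record field names are placeholders for whatever an evaluation pipeline actually logs):

```python
# Sketch: stratify response-level metrics by context sufficiency so that a
# small set of poorly handled queries isn't averaged away in aggregate stats.

from collections import defaultdict

def stratify_by_sufficiency(records):
    """records: dicts with 'sufficiency' ("SUFFICIENT"/"INSUFFICIENT") and 'correct' (bool)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["sufficiency"]].append(r["correct"])
    return {
        label: {"n": len(outcomes), "accuracy": sum(outcomes) / len(outcomes)}
        for label, outcomes in buckets.items()
    }
```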

    While an LLM-based autorater demonstrates high accuracy, enterprise teams might wonder about the additional computational cost. Rashtchian clarified that the overhead can be managed for diagnostic purposes. 

“I would say running an LLM-based autorater on a small test set (say 500-1000 examples) should be relatively inexpensive, and this can be done ‘offline’ so there’s no worry about the amount of time it takes,” he said. For real-time applications, he concedes, “it would be better to use a heuristic, or at least a smaller model.” The crucial takeaway, according to Rashtchian, is that “engineers should be looking at something beyond the similarity scores, etc., from their retrieval component. Having an extra signal, from an LLM or a heuristic, can lead to new insights.”
