    MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks

    By TechAiVerse
    August 22, 2025 1:50 PM

    The adoption of interoperability standards, such as the Model Context Protocol (MCP), can provide enterprises with insights into how agents and models function outside their walled confines. However, many benchmarks fail to capture real-life interactions with MCP. 

    Salesforce AI Research developed a new open-source benchmark it calls MCP-Universe, which aims to track LLMs as they interact with MCP servers in the real world. Salesforce argues that it will paint a better picture of how models interact, in real time, with the tools enterprises actually use. In initial testing, it found that models such as OpenAI’s recently released GPT-5 are strong but still fall short in real-life scenarios.

    “Existing benchmarks predominantly focus on isolated aspects of LLM performance, such as instruction following, math reasoning, or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in a paper. 

    MCP-Universe captures model performance across tool usage, multi-turn tool calls, long context windows and large tool spaces. It is grounded in existing MCP servers with access to actual data sources and environments.


    Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise-grade tasks.”

    “Two of the biggest are: Long context challenges, models can lose track of information or struggle to reason consistently when handling very long or complex inputs,” Li said. “And, Unknown tool challenges, models often aren’t able to seamlessly use unfamiliar tools or systems in the way humans can adapt on the fly. This is why it’s crucial not to take a DIY approach with a single model to power agents alone, but instead, to rely on a platform that combines data context, enhanced reasoning, and trust guardrails to truly meet the needs of enterprise AI.”

    MCP-Universe joins other proposed MCP-based benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi’an Jiaotong University, as well as MCPWorld from the Beijing University of Posts and Telecommunications. It also builds on MCPEvals, a benchmark Salesforce released in July that focuses mainly on agents. Li said the biggest difference between MCP-Universe and MCPEvals is that the latter is evaluated with synthetic tasks.

    How it works

    MCP-Universe evaluates how well each model performs a series of tasks that mimic those undertaken by enterprises. Salesforce said it designed MCP-Universe to encompass six core domains used by enterprises: location navigation, repository management, financial analysis, 3D design, browser automation and web search. It accessed 11 MCP servers for a total of 231 tasks. 

    • Location navigation focuses on geographic reasoning and the execution of spatial tasks. The researchers tapped the Google Maps MCP server for this process. 
    • The repository management domain looks at codebase operations and connects to the GitHub MCP to expose version control tools like repo search, issue tracking and code editing. 
    • Financial analysis connects to the Yahoo Finance MCP server to evaluate quantitative reasoning and financial market decision-making.
    • 3D design evaluates the use of computer-aided design tools through the Blender MCP.
    • Browser automation, connected to Playwright’s MCP, tests browser interaction.
    • The web search domain employs the Google Search MCP server and the Fetch MCP to check “open-domain information seeking” and is structured as a more open-ended task.
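
    The domain list above can be written down as a simple lookup table. This is only an illustrative sketch: the dictionary name, the key spellings and the helper functions are assumptions made for this example and are not taken from the MCP-Universe codebase. It covers only the servers named in this article, which are fewer than the 11 the benchmark uses.

```python
# Hypothetical lookup table built from the article's domain list.
# Names and key spellings are illustrative, not MCP-Universe identifiers.
DOMAIN_SERVERS: dict[str, list[str]] = {
    "location_navigation": ["Google Maps"],
    "repository_management": ["GitHub"],
    "financial_analysis": ["Yahoo Finance"],
    "3d_design": ["Blender"],
    "browser_automation": ["Playwright"],
    "web_search": ["Google Search", "Fetch"],
}


def servers_for(domain: str) -> list[str]:
    """Return the MCP servers the article associates with a domain."""
    return DOMAIN_SERVERS[domain]


def named_servers() -> set[str]:
    """Collect every distinct server named across all domains."""
    return {server for servers in DOMAIN_SERVERS.values() for server in servers}


if __name__ == "__main__":
    for domain in DOMAIN_SERVERS:
        print(domain, "->", ", ".join(servers_for(domain)))
```

    Keeping the mapping as plain data makes it easy to see which domains depend on more than one server, such as web search.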

    Salesforce said that it had to design new MCP tasks that reflect real use cases. For each domain, the researchers created four to five kinds of tasks that they think LLMs can easily complete. For example, they assigned the models a goal that involved planning a route, identifying the optimal stops and then locating the destination.

    Each model is evaluated on how it completed the tasks. Li and his team opted to follow an execution-based evaluation paradigm rather than the more common LLM-as-a-judge system. The researchers noted the LLM-as-a-judge paradigm “is not well-suited for our MCP-Universe scenario, since some tasks are designed to use real-time data, while the knowledge of the LLM judge is static.”

    Salesforce researchers used three types of evaluators: format evaluators to see if the agents and models follow format requirements, static evaluators to assess correctness over time and dynamic evaluators for fluctuating answers like flight prices or GitHub issues.
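
    To make the three evaluator types concrete, here is a minimal sketch of what execution-based checks of this kind could look like. It is not Salesforce's implementation: the function names, the JSON output convention, the case-insensitive comparison and the 1% tolerance are all assumptions chosen for illustration.

```python
import json
from typing import Callable


def format_evaluator(output: str, required_keys: set[str]) -> bool:
    """Pass only if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def static_evaluator(answer: str, expected: str) -> bool:
    """Compare against a fixed ground truth that does not change over time."""
    return answer.strip().lower() == expected.strip().lower()


def dynamic_evaluator(
    answer: float,
    fetch_truth: Callable[[], float],
    tolerance: float = 0.01,  # assumed relative tolerance, purely illustrative
) -> bool:
    """Fetch the ground truth at evaluation time, then compare within a tolerance."""
    truth = fetch_truth()
    return abs(answer - truth) <= tolerance * max(abs(truth), 1e-9)


if __name__ == "__main__":
    print(format_evaluator('{"route": [], "stops": 2}', {"route", "stops"}))
    print(static_evaluator(" Paris ", "paris"))
    # The lambda stands in for a live lookup such as a current price query.
    print(dynamic_evaluator(100.5, lambda: 100.0))
```

    The dynamic check takes a callable rather than a stored value, which reflects the article's point that a judge with static knowledge cannot score tasks whose correct answer changes in real time.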

    “MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can stress-test the agent in complex scenarios. Furthermore, MCP-Universe offers an extendable framework/codebase for building and evaluating agents,” Li said. 

    Even the big models have trouble

    To test MCP-Universe, Salesforce evaluated several popular proprietary and open-source models: xAI’s Grok-4; Anthropic’s Claude-4 Sonnet and Claude 3.7 Sonnet; OpenAI’s GPT-5, o4-mini, o3, GPT-4.1, GPT-4o and GPT-oss; Google’s Gemini 2.5 Pro and Gemini 2.5 Flash; GLM-4.5 from Z.ai; Moonshot’s Kimi-K2; Qwen’s Qwen3 Coder and Qwen3-235B-A22B-Instruct-2507; and DeepSeek-V3-0324 from DeepSeek. Each model tested had at least 120B parameters.

    In its testing, Salesforce found GPT-5 had the best success rate, especially for financial analysis tasks. Grok-4 followed, beating all the models for browser automation, and Claude-4 Sonnet rounded out the top three, although none of its scores exceeded those of the two models ahead of it. Among open-source models, GLM-4.5 performed the best.

    However, MCP-Universe showed the models had difficulty handling long contexts, especially for location navigation, browser automation and financial analysis, with efficiency falling significantly. The moment the LLMs encounter unknown tools, their performance also drops. The LLMs demonstrated difficulty in completing more than half of the tasks that enterprises typically perform.

    “These findings highlight that current frontier LLMs still fall short in reliably executing tasks across diverse real-world MCP tasks. Our MCP-Universe benchmark, therefore, provides a challenging and necessary testbed for evaluating LLM performance in areas underserved by existing benchmarks,” the paper said. 

    Li told VentureBeat that he hopes enterprises will use MCP-Universe to gain a deeper understanding of where agents and models fail on tasks so that they can improve either their frameworks or the implementation of their MCP tools. 
