    Do reasoning models really “think” or not? Apple research sparks lively debate, response


    June 13, 2025 3:02 PM

    Image credit: VentureBeat, made with Midjourney



    Apple’s machine-learning group set off a rhetorical firestorm earlier this month with its release of “The Illusion of Thinking,” a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning large language models (reasoning LLMs) such as OpenAI’s “o” series and Google’s Gemini-2.5 Pro and Flash Thinking don’t actually engage in independent “thinking” or “reasoning” from generalized first principles learned from their training data.

    Instead, the authors contend, these reasoning LLMs are really performing a kind of “pattern matching,” and their apparent reasoning ability falls apart once a task becomes too complex. That, they argue, suggests their architecture and performance are not a viable path toward artificial general intelligence (AGI), which OpenAI defines as a model that outperforms humans at most economically valuable work, let alone superintelligence, AI beyond what human beings can comprehend.


    Unsurprisingly, the paper immediately circulated widely among the machine learning community on X, and many readers’ initial reaction was to declare that Apple had effectively disproven much of the hype around this class of AI: “Apple just proved AI ‘reasoning’ models like Claude, DeepSeek-R1, and o3-mini don’t actually reason at all,” declared Ruben Hassid, creator of EasyGen, an LLM-driven LinkedIn post auto-writing tool. “They just memorize patterns really well.”

    Now a new paper has emerged, cheekily titled “The Illusion of The Illusion of Thinking.” Notably, it is co-authored by a reasoning LLM itself, Claude Opus 4, alongside Alex Lawsen, a human being who is an independent AI researcher and technical writer. It gathers many of the criticisms the broader ML community raised about the Apple paper and argues that the methodologies and experimental designs the Apple research team used in their initial work are fundamentally flawed.

    While we at VentureBeat are not ML researchers and are not prepared to say the Apple researchers are wrong, the debate has certainly been lively, and the question of how the capabilities of LRMs, or reasoning LLMs, compare to human thinking seems far from settled.

    How the Apple Research study was designed — and what it found

    Using four classic planning problems — Tower of Hanoi, Blocks World, River Crossing and Checker Jumping — Apple’s researchers designed a battery of tasks that forced reasoning models to plan multiple moves ahead and generate complete solutions.

    These games were chosen for their long history in cognitive science and AI research and their ability to scale in complexity as more steps or constraints are added. Each puzzle required the models to not just produce a correct final answer, but to explain their thinking along the way using chain-of-thought prompting.

    As the puzzles increased in difficulty, the researchers observed a consistent drop in accuracy across multiple leading reasoning models. In the most complex tasks, performance plunged to zero. Notably, the length of the models’ internal reasoning traces—measured by the number of tokens spent thinking through the problem—also began to shrink. Apple’s researchers interpreted this as a sign that the models were abandoning problem-solving altogether once the tasks became too hard, essentially “giving up.”

    The timing of the paper’s release, just ahead of Apple’s annual Worldwide Developers Conference (WWDC), added to the impact. It quickly went viral across X, where many interpreted the findings as a high-profile admission that current-generation LLMs are still glorified autocomplete engines, not general-purpose thinkers. This framing, while controversial, drove much of the initial discussion and debate that followed.

    Critics take aim on X

    Among the most vocal critics of the Apple paper was ML researcher and X user @scaling01 (aka “Lisan al Gaib”), who posted multiple threads dissecting the methodology.

    In one widely shared post, Lisan argued that the Apple team conflated token budget failures with reasoning failures, noting that “all models will have 0 accuracy with more than 13 disks simply because they cannot output that much!”

    For puzzles like Tower of Hanoi, he emphasized, the output size grows exponentially while LLM context windows remain fixed. “Just because Tower of Hanoi requires exponentially more steps than the other ones, that only require quadratically or linearly more steps, doesn’t mean Tower of Hanoi is more difficult,” he wrote, and he convincingly showed that models like Claude 3 Sonnet and DeepSeek-R1 often produced algorithmically correct strategies in plain text or code, yet were still marked wrong.
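    To make the scale problem concrete, here is a back-of-the-envelope sketch of that argument (not taken from either paper; the tokens-per-move and output-budget figures are assumptions for illustration): an optimal Tower of Hanoi solution takes 2^n - 1 moves, so simply printing the answer overwhelms a fixed output budget long before the puzzle gets conceptually harder.

    # Rough illustration of the token-budget argument; the per-move token cost
    # and the output budget are assumed values, not figures from the papers.
    def hanoi_moves(n_disks: int) -> int:
        # The optimal Tower of Hanoi solution has 2^n - 1 moves.
        return 2 ** n_disks - 1

    TOKENS_PER_MOVE = 4       # assumed, e.g. "move disk 3 to peg C"
    OUTPUT_BUDGET = 64_000    # assumed output-token ceiling

    for n in (5, 10, 13, 15, 20):
        moves = hanoi_moves(n)
        tokens = moves * TOKENS_PER_MOVE
        verdict = "fits" if tokens <= OUTPUT_BUDGET else "exceeds the budget"
        print(f"{n:2d} disks: {moves:>9,} moves, ~{tokens:>9,} tokens ({verdict})")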

    Another post highlighted that even breaking the task down into smaller, decomposed steps worsened model performance—not because the models failed to understand, but because they lacked memory of previous moves and strategy.

    “The LLM needs the history and a grand strategy,” he wrote, suggesting the real problem was context-window size rather than reasoning.

    I raised another important caveat myself on X: Apple never benchmarked the models’ performance against human performance on the same tasks. “Am I missing it, or did you not compare LRMs to human perf[ormance] on [the] same tasks?? If not, how do you know this same drop-off in perf doesn’t happen to people, too?” I asked the researchers directly in a thread tagging the paper’s authors. I also emailed them about this and many other questions, but they have yet to respond.

    Others echoed that sentiment, noting that human problem solvers also falter on long, multistep logic puzzles, especially without pen-and-paper tools or memory aids. Without that baseline, Apple’s claim of a fundamental “reasoning collapse” feels ungrounded.

    Several researchers also questioned the binary framing of the paper’s title and thesis—drawing a hard line between “pattern matching” and “reasoning.”

    Alexander Doria, aka Pierre-Carl Langlais, an LLM trainer at the energy-efficient French AI startup Pleias, said the framing misses the nuance, arguing that the models might be learning partial heuristics rather than simply matching patterns.

    Ok I guess I have to go through that Apple paper.

    My main issue is the framing which is super binary: “Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?” Or what if they only caught genuine yet partial heuristics. pic.twitter.com/GZE3eG7WlM

    — Alexander Doria (@Dorialexander) June 8, 2025

    Ethan Mollick, the AI-focused professor at the University of Pennsylvania’s Wharton School of Business, called the idea that LLMs are “hitting a wall” premature, likening it to similar claims about “model collapse” that didn’t pan out.

    Meanwhile, critics like @arithmoquine were more cynical, quipping that Apple, behind the curve on LLMs compared to rivals like OpenAI and Google, might be trying to lower expectations by coming up with research on “how it’s all fake and gay and doesn’t matter anyway,” and pointing to Apple’s reputation for now poorly performing AI products like Siri.

    In short, while Apple’s study triggered a meaningful conversation about evaluation rigor, it also exposed a deep rift over how much trust to place in metrics when the test itself might be flawed.

    A measurement artifact, or a ceiling?

    On this reading, the models may have understood the puzzles but simply ran out of “paper” to write the full solution.

    “Token limits, not logic, froze the models,” wrote Carnegie Mellon researcher Rohan Paul in a widely shared thread summarizing the follow-up tests.

    Yet not everyone is ready to clear LRMs of the charge. Some observers point out that Apple’s study still revealed three performance regimes — simple tasks where added reasoning hurts, mid-range puzzles where it helps, and high-complexity cases where both standard and “thinking” models crater.

    Others view the debate as corporate positioning, noting that Apple’s own on-device “Apple Intelligence” models trail rivals on many public leaderboards.

    The rebuttal: “The Illusion of the Illusion of Thinking”

    In response to Apple’s claims, a new paper titled “The Illusion of the Illusion of Thinking” was released on arXiv by independent researcher and technical writer Alex Lawsen of the nonprofit Open Philanthropy, in collaboration with Anthropic’s Claude Opus 4.

    The paper directly challenges the original study’s conclusion that LLMs fail due to an inherent inability to reason at scale. Instead, the rebuttal presents evidence that the observed performance collapse was largely a by-product of the test setup—not a true limit of reasoning capability.

    Lawsen and Claude demonstrate that many of the failures in the Apple study stem from token limitations. For example, in tasks like Tower of Hanoi, the models must print exponentially many steps — over 32,000 moves for just 15 disks — leading them to hit output ceilings.

    The rebuttal points out that Apple’s evaluation script penalized these token-overflow outputs as incorrect, even when the models followed a correct solution strategy internally.

    The authors also highlight several questionable task constructions in the Apple benchmarks. Some of the River Crossing puzzles, they note, are mathematically unsolvable as posed, and yet model outputs for these cases were still scored. This further calls into question the conclusion that accuracy failures represent cognitive limits rather than structural flaws in the experiments.

    To test their theory, Lawsen and Claude ran new experiments allowing models to give compressed, programmatic answers. When asked to output a Lua function that could generate the Tower of Hanoi solution—rather than writing every step line-by-line—models suddenly succeeded on far more complex problems. This shift in format eliminated the collapse entirely, suggesting that the models didn’t fail to reason. They simply failed to conform to an artificial and overly strict rubric.
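    For a sense of what such a compressed answer looks like (the rebuttal asked for Lua; this Python sketch is an illustrative equivalent, not the authors’ actual code), the entire exponential move sequence can be encoded in a few lines that a grader expands and checks itself:

    # A compressed, programmatic answer of the kind the rebuttal describes.
    # The rebuttal used Lua; this Python equivalent is for illustration only.
    def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C"):
        """Yield the optimal move sequence for an n-disk Tower of Hanoi."""
        if n == 0:
            return
        yield from hanoi(n - 1, src, dst, aux)  # clear the n-1 smaller disks
        yield (n, src, dst)                     # move the largest disk
        yield from hanoi(n - 1, aux, src, dst)  # restack the smaller disks

    # The grader, not the model, can expand and verify all 32,767 moves for 15 disks.
    assert len(list(hanoi(15))) == 2 ** 15 - 1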

    Why it matters for enterprise decision-makers

    The back-and-forth underscores a growing consensus: evaluation design is now as important as model design.

    Requiring LRMs to enumerate every step may test their printers more than their planners, while compressed formats, programmatic answers or external scratchpads give a cleaner read on actual reasoning ability.

    The episode also highlights practical limits developers face as they ship agentic systems—context windows, output budgets and task formulation can make or break user-visible performance.

    For enterprise technical decision makers building applications atop reasoning LLMs, this debate is more than academic. It raises critical questions about where, when, and how to trust these models in production workflows—especially when tasks involve long planning chains or require precise step-by-step output.

    If a model appears to “fail” on a complex prompt, the problem may not lie in its reasoning ability, but in how the task is framed, how much output is required, or how much memory the model has access to. This is particularly relevant for industries building tools like copilots, autonomous agents, or decision-support systems, where both interpretability and task complexity can be high.

    Understanding the constraints of context windows, token budgets, and the scoring rubrics used in evaluation is essential for reliable system design. Developers may need to consider hybrid solutions that externalize memory, chunk reasoning steps, or use compressed outputs like functions or code instead of full verbal explanations.
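    As one possible shape for such a hybrid (a sketch under assumed interfaces: call_model and apply_move below are hypothetical stand-ins, not any particular vendor’s API), the application can own the full move history and hand the model only a compact view of the current state on each turn:

    # Sketch of externalizing memory: the application, not the model, keeps the history.
    # call_model() and apply_move() are hypothetical placeholders for your own
    # LLM client and domain-specific state-transition logic.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("replace with a real LLM client call")

    def apply_move(state: str, move: str) -> str:
        raise NotImplementedError("replace with a domain-specific state update")

    def solve_with_external_scratchpad(initial_state: str, max_steps: int = 200) -> list[str]:
        state = initial_state
        history: list[str] = []  # stored by the app, never re-sent in full
        for _ in range(max_steps):
            prompt = (
                f"Current state: {state}\n"
                f"Moves made so far: {len(history)}\n"
                "Reply with the single next move, or DONE if the puzzle is solved."
            )
            move = call_model(prompt).strip()
            if move == "DONE":
                break
            history.append(move)
            state = apply_move(state, move)
        return history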

    Most importantly, the paper’s controversy is a reminder that benchmarking and real-world application are not the same. Enterprise teams should be cautious of over-relying on synthetic benchmarks that don’t reflect practical use cases—or that inadvertently constrain the model’s ability to demonstrate what it knows.

    Ultimately, the big takeaway for ML researchers is that before proclaiming an AI milestone—or obituary—make sure the test itself isn’t putting the system in a box too small to think inside.
