    Q-learning is not yet scalable

    Does RL scale?

    Over the past few years,
    we’ve seen that next-token prediction scales, denoising diffusion scales, contrastive learning scales,
    and so on, all the way to the point where we can train models with billions of parameters
    with a scalable objective that can eat up as much data as we can throw at it.
    Then, what about reinforcement learning (RL)?
    Does RL also scale like all the other objectives?

    Apparently, it does.
    Starting in 2016, RL achieved superhuman-level performance in board games like Go and chess.
    Now, RL is solving complex reasoning tasks in math and coding with large language models (LLMs).
    This is great. However, there is one important caveat:
    most of the current real-world successes of RL have been achieved with on-policy RL algorithms
    (e.g., REINFORCE, PPO, GRPO, etc.),
    which always require fresh, newly sampled rollouts from the current policy,
    and cannot reuse previous data
    (note: while PPO-like methods can technically reuse data to some (limited) degree, I’ll classify them as on-policy RL,
    as in OpenAI’s documentation
    ).
    This is not a problem in some settings like board games and LLMs,
    where we can cheaply generate as many rollouts as we want.
    However, it is a significant limitation in most real-world problems.
    For example, in robotics, it would take many months in the real world to generate
    the number of samples used to post-train a language model with RL,
    not to mention that a human would need to be present 24/7 next to the robot to reset it throughout training!


    On-policy RL can only use fresh data collected by the current policy \(\pi\).
    Off-policy RL can use any data \(\mathcal{D}\).

    This is where off-policy RL comes to the rescue.
    In principle, off-policy RL algorithms can use any data, regardless of when and how it was collected.
    Hence, they generally lead to much better sample efficiency, by reusing data many times.
    For example, off-policy RL can train a dog robot to walk in 20 minutes from scratch in the real world.
    Q-learning is the most widely used off-policy RL algorithm.
    It minimizes the following temporal difference (TD) loss:

    $$\begin{aligned}
    \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigg[ \Big( Q_\theta(s, a) - \big(r + \gamma \max_{a'} Q_{\bar \theta}(s', a') \big) \Big)^2 \bigg],
    \end{aligned}$$

    where \(\bar \theta\) is the parameter of the target network.
    Most practical (model-free) off-policy RL algorithms are based on some variants of the TD loss above.
    So, to apply RL to many real-world problems,
    the question becomes: does Q-learning (TD learning) scale?
    If the answer is yes, it would have an impact at least on par with the successes of AlphaGo and LLMs,
    enabling RL to solve far more diverse and complex real-world tasks very efficiently,
    in robotics, computer-using agents, and so on.
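
    To make the TD objective above concrete, here is a minimal NumPy sketch of one off-policy update on a toy tabular problem, with a separate target table playing the role of \(Q_{\bar \theta}\). The toy MDP, shapes, and hyperparameters are illustrative assumptions of mine, not the setup of any method discussed here.

    ```python
    import numpy as np

    # Toy tabular setup; sizes and hyperparameters are illustrative assumptions.
    num_states, num_actions = 16, 4
    gamma, lr, tau = 0.99, 0.1, 0.005

    Q = np.zeros((num_states, num_actions))         # Q_theta
    Q_target = np.zeros((num_states, num_actions))  # Q_theta_bar (target network)

    def td_update(batch):
        """One semi-gradient step on the squared TD error for a batch (s, a, r, s')."""
        global Q, Q_target
        s, a, r, s_next = batch
        # Bootstrapped target: r + gamma * max_a' Q_target(s', a').
        # This is the biased target the post is about: it equals Q*(s, a) only
        # if Q_target is already correct at s'.
        target = r + gamma * Q_target[s_next].max(axis=1)
        td_error = Q[s, a] - target
        Q[s, a] -= lr * td_error                    # descend the TD error
        Q_target = (1 - tau) * Q_target + tau * Q   # Polyak-averaged target update

    # The batch can come from any previously collected data (off-policy).
    rng = np.random.default_rng(0)
    batch = (rng.integers(0, num_states, size=32),   # s
             rng.integers(0, num_actions, size=32),  # a
             rng.random(size=32),                    # r
             rng.integers(0, num_states, size=32))   # s'
    td_update(batch)
    ```

    In deep RL the tables become networks and the update becomes a gradient step, but the biased bootstrapped target stays exactly the same.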

    Q-learning is not yet scalable

    Unfortunately, my current belief is that the answer is not yet.
    I believe current Q-learning algorithms are not readily scalable, at least to long-horizon problems that require more than (say) 100 semantic decision steps.

    Let me clarify. My definition of scalability here is the ability to solve more challenging, longer-horizon problems
    with more data (of sufficient coverage), compute, and time.
    This notion is different from the ability to solve merely a larger number of (but not necessarily harder) tasks with a single model,
    which many excellent prior scaling studies have shown to be possible.
    You can think of the former as the “depth” axis and the latter as the “width” axis.
    The depth axis is more important and harder to push, because it requires developing more advanced decision-making capabilities.

    I claim that Q-learning, in its current form, is not highly scalable along the depth axis.
    In other words, I believe we still need algorithmic breakthroughs to scale up Q-learning (and off-policy RL) to complex, long-horizon problems.
    Below, I’ll explain two main reasons why I think so:
    one is anecdotal, and the other is based on our recent scaling study.


    Both AlphaGo and DeepSeek are based on on-policy RL and do not use TD learning.

    Anecdotal evidence first.
    As mentioned earlier, most real-world successes of RL are based on on-policy RL algorithms.
    AlphaGo, AlphaZero, and MuZero are based on model-based RL and Monte Carlo tree search, and do not use TD learning on board games
    (see p. 15 of the MuZero paper).
    OpenAI Five achieves superhuman performance in Dota 2 with PPO
    (see footnote 6 of the OpenAI Five paper).
    RL for LLMs is currently dominated by variants of on-policy policy gradient methods, such as PPO and GRPO.
    Let me ask: do we know of any real-world successes of off-policy RL (1-step TD learning, in particular) on a similar scale to AlphaGo or LLMs?
    If you do, please let me know and I’ll happily update this post.

    Of course, I’m not making this claim based only on anecdotal evidence.
    As said before, I’ll show concrete experiments to empirically prove this point later in this post.
    Also, please don’t get me wrong, I’m still highly optimistic about off-policy RL and Q-learning (as an RL researcher who mainly works in off-policy RL!).
    I just think that we are not there yet, and the purpose of this post is to call for research in RL algorithms, rather than to discourage it!

    What’s the problem?

    Then, what fundamentally makes Q-learning not readily scalable to complex, long-horizon problems, unlike other objectives?
    Here is my answer:

    $$\begin{aligned}
    \definecolor{myblue}{RGB}{89, 139, 231}
    \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigg[ \Big( Q_\theta(s, a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\bar \theta}(s', a') \big)}_{{\color{myblue}\texttt{Biased }} (\textit{i.e., } \neq Q^*(s, a))} \Big)^2 \bigg]
    \end{aligned}$$

    Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon.
    This bias accumulation is a fundamental limitation unique to Q-learning (TD learning).
    In contrast, other scalable objectives
    (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.)
    have no biases in their prediction targets,
    or at least those biases do not accumulate over the horizon (e.g., BYOL, DINO, etc.).


    Biases accumulate over the horizon.

    As the problem becomes more complex and the horizon gets longer, the biases in bootstrapped targets accumulate more and more severely,
    to the point where we cannot easily mitigate them with more data and larger models.
    I believe this is the main reason why we almost never use larger discount factors (\(\gamma > 0.999\)) in practice,
    and why it is challenging to scale up Q-learning.
    Note that policy gradient methods suffer much less from this issue,
    because GAE and similar on-policy value estimation techniques
    can handle longer horizons more easily (though at the expense of higher variance), without strict 1-step recursions.
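
    As a rough back-of-the-envelope illustration (my own numbers, not results from the paper), the snippet below contrasts the effective horizon \(1/(1-\gamma)\) of a few discount factors with the standard worst-case bound on how a per-backup target bias of \(\epsilon\) can compound into roughly \(\epsilon/(1-\gamma)\) at the fixed point.

    ```python
    # Back-of-the-envelope only: assume every 1-step backup injects at most eps of
    # bias into its target; compounding geometrically gives ~eps / (1 - gamma).
    eps = 0.01  # hypothetical per-backup target bias

    for gamma in (0.9, 0.99, 0.999, 0.9999):
        effective_horizon = 1.0 / (1.0 - gamma)   # ~number of steps that matter
        accumulated_bias = eps / (1.0 - gamma)    # sum_k gamma^k * eps
        print(f"gamma={gamma}: horizon ~{effective_horizon:,.0f} steps, "
              f"worst-case accumulated bias ~{accumulated_bias:.2f}")
    ```

    Unbiased targets (Monte Carlo returns, next-token labels) pay for long horizons only in variance, which more data can average away; the accumulated bias above does not wash out with more data.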

    Empirical scaling study

    In our recent paper, we empirically verified the above claim via diverse, controlled scaling studies.

    We wanted to see whether current off-policy RL methods can solve highly challenging tasks by just scaling up data and compute.
    To do this, we first prepared highly complex, previously unsolved tasks in OGBench.
    Here are some videos:


    [Videos: humanoidmaze]

    These tasks are really difficult.
    To solve them, the agent must learn complex goal-reaching behaviors from unstructured, random (play-style) demonstrations.
    At test time, the agent must perform precise manipulation, combinatorial puzzle-solving, or long-horizon navigation,
    over 1,000 environment steps.

    We then collected near-infinite data on these environments, to the degree that overfitting is virtually impossible.
    We also removed as many confounding factors as possible.
    For example, we focused on offline RL to abstract away exploration.
    We ensured that the datasets had sufficient coverage, and that all the tasks were solvable from the given datasets.
    We directly provided the agent with the ground-truth state observations to reduce the burden of representation learning.

    Hence, a “scalable” RL algorithm must really be able to solve these tasks, given sufficient data and compute.
    If Q-learning does not scale even in this controlled setting with near-infinite data,
    there is little hope that it will scale in more realistic settings,
    where we have limited data, noisy observations, and so on.


    Standard offline RL methods struggle to scale on complex tasks, even with \(1000\times\) more data.

    So, how did the existing algorithms work?
    The results were a bit disappointing.
    None of the standard, widely used offline RL algorithms (flow BC, IQL, CRL, and SAC+BC) were able to solve all of these tasks,
    even with 1B-sized datasets, which are \(1000\times\) larger than typical datasets used in offline RL.
    More importantly, their performance often plateaued far below the optimal performance.
    In other words, they didn’t scale well on these complex, long-horizon tasks.

    You might ask:
    Are you really sure these tasks are solvable? Did you try larger models?
    Did you train them for longer? Did you try different hyperparameters? And so on.
    In the paper, we tried our best to address as many questions as possible with a number of ablations and controlled experiments,
    showing that none of these fixes worked…
    except for one:

    Horizon reduction makes RL scalable

    Recall my earlier claim that the horizon (and the bias accumulation it causes) is the main obstacle to scaling up off-policy RL.
    To verify this,
    we tried diverse horizon reduction techniques (e.g., n-step returns, hierarchical RL, etc.) that reduce the number of biased TD backups.


    Horizon reduction was the only technique we found that substantially improved scaling.

    The results were promising!
    Even simple tricks like n-step returns
    significantly improved scalability and even asymptotic performance
    (so it is not just a “trick” that merely makes training faster!).
    Full-fledged hierarchical methods worked even better.
    More importantly, horizon reduction is the only technique that worked across the board in our experiments.
    This suggests that simply scaling up data and compute is not enough to address the curse of horizon.
    In other words, we need better algorithms that directly address this fundamental horizon problem.
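
    To make the n-step return idea above concrete, here is a minimal sketch of an n-step bootstrapped target; the function is my own illustration and glosses over details such as trajectory truncation and off-policy corrections.

    ```python
    import numpy as np

    def n_step_target(rewards, bootstrap_value, gamma, n):
        """n-step TD target: n discounted real rewards plus one bootstrapped value.

        rewards: array of shape (n,) holding r_t, ..., r_{t+n-1} from the dataset.
        bootstrap_value: scalar max_a' Q_target(s_{t+n}, a').
        Covering a horizon of H steps now takes ~H / n biased backups instead of H,
        which is the sense in which n-step returns "reduce the horizon".
        """
        discounts = gamma ** np.arange(n)
        return float(np.dot(discounts, rewards) + (gamma ** n) * bootstrap_value)

    # Hypothetical usage on a 5-step trajectory slice.
    target = n_step_target(rewards=np.array([0.0, 0.0, 0.0, 0.0, 1.0]),
                           bootstrap_value=0.7, gamma=0.99, n=5)
    ```

    The usual trade-off applies: each target now bakes in n actions from the behavior policy, so variance and off-policyness grow with n, consistent with horizon reduction being a constant-factor mitigation rather than a full fix.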

    Call for research: find a scalable off-policy RL objective

    We saw that horizon reduction unlocks the scalability of Q-learning.
    So are we done? Can we now just scale up Q-learning?
    I’d say this is only the beginning.
    While it is great to know the cause and have some solutions,
    most of the current horizon reduction techniques (n-step returns, hierarchical RL, etc.) only mitigate the issue by a constant factor,
    and do not fundamentally solve the problem.
    I think we’re currently missing an off-policy RL algorithm that scales to arbitrarily complex, long-horizon problems
    (or perhaps we may already have a solution, but just haven’t stress-tested it enough yet!).
    I believe finding such a scalable off-policy RL algorithm is the most important missing piece in machine learning today.
    This will enable solving much more diverse real-world problems,
    including robotics, language models, agents, and basically any data-driven decision-making tasks.

    I’ll conclude this post with my thoughts about potential solutions to scalable off-policy RL.

    • Can we find a simple, scalable way to extend beyond two-level hierarchies to deal with horizons of arbitrary lengths?
      Such a solution should be able to naturally form a recursive hierarchical structure,
      while being simple enough to be scalable.
      One great example of this (though in a different field) is chain-of-thought in LLMs.
      (For reference, a toy sketch of the basic two-level case appears right after this list.)
    • Another completely different approach (which I intentionally didn’t mention so far for simplicity)
      is model-based RL.
      We know that model learning is scalable, because it’s just supervised learning.
      We also know that on-policy RL is scalable.
      So why don’t we combine the two, where we first learn a model and run on-policy RL within the model?
      Would model-based RL indeed scale better than TD-based Q-learning?
    • Or is there a way to just completely avoid TD learning?
      Among the methods that I know of,
      one such example is quasimetric RL,
      which is essentially based on the LP formulation of RL.
      Perhaps these sorts of “exotic” RL methods, or MC-based methods like contrastive RL, might eventually scale better than TD-based approaches?
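
    For intuition on why even a fixed two-level hierarchy helps (the first question above), here is a toy rollout sketch: the high-level policy only makes a decision every k steps, so each level faces a much shorter effective horizon. The stub policies, dynamics, and the subgoal period k are all hypothetical placeholders, not the hierarchical methods evaluated in the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def high_level_policy(state, goal):
        """Hypothetical stub: propose a subgoal partway toward the final goal."""
        return state + 0.2 * (goal - state) + 0.01 * rng.normal(size=state.shape)

    def low_level_policy(state, subgoal):
        """Hypothetical stub: goal-conditioned action toward the current subgoal."""
        return np.clip(subgoal - state, -1.0, 1.0)

    def env_step(state, action):
        """Hypothetical toy dynamics."""
        return state + 0.1 * action

    def hierarchical_rollout(state, goal, horizon=1000, k=25):
        # The high level makes only horizon / k decisions, and the low level only
        # needs to reach a nearby subgoal within k steps, so each level's value
        # function is trained over a much shorter effective horizon.
        subgoal = state
        for t in range(horizon):
            if t % k == 0:
                subgoal = high_level_policy(state, goal)
            state = env_step(state, low_level_policy(state, subgoal))
        return state

    final_state = hierarchical_rollout(state=np.zeros(2), goal=np.ones(2))
    ```

    The open question in the first bullet is how to make this recursion go arbitrarily deep, without hand-designing each level.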

    Our setup above can be a great starting point for testing these ideas.
    We have already designed a set of highly challenging robotic tasks, made the datasets, and verified that they are solvable.
    One can even make the tasks arbitrarily difficult (e.g., by adding more cubes)
    and further stress-test the scalability of algorithms in a controlled way.
    We also put effort into making the code as clean as possible.
    Check out our code!

    Feel free to let me know via email/Twitter/X or reach out to me at conferences if you have any questions, comments, or feedback.
    I hope I can write another post about off-policy RL with a more positive title in the near future!

    Acknowledgments

    I would like to thank
    Kevin Frans,
    Hongsuk Choi,
    Ben Eysenbach,
    Aviral Kumar,
    and Sergey Levine
    for their helpful feedback on this post.
    This post is partly based on our recent work, Horizon Reduction Makes RL Scalable.
    The views in this post are my own, and do not necessarily reflect those of my coauthors.
