
    GPT-OSS Reinforcement Learning

By TechAiVerse · September 27, 2025

You can now train OpenAI gpt-oss with RL and GRPO via Unsloth. Unsloth now offers the fastest inference (3x faster), lowest VRAM use (50% less), and longest context (8x longer) for gpt-oss RL of any implementation, with no accuracy loss.

Since RL on gpt-oss isn't yet vLLM compatible, we rewrote the Transformers inference code to deliver 3x faster inference for gpt-oss at ~21 tokens/s. For BF16, Unsloth also achieves the fastest inference (~30 tokens/s), especially relative to VRAM usage, using 50% less VRAM than any other implementation.

• Free notebook: gpt-oss-20b GSPO Colab notebook
  This notebook automatically creates faster matrix multiplication kernels and uses a new Unsloth reward function. We also show how to counteract reward hacking, which is one of RL's biggest challenges.

With Unsloth, you can train gpt-oss-20b with GRPO on 15GB VRAM, free on Colab. Unsloth's new inference runs faster on any GPU, including A100s, H100s, and even older T4s. gpt-oss-120b fits on 80GB VRAM.
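If you want a feel for the setup, here is a minimal sketch using Unsloth with TRL's GRPOTrainer. The model id, LoRA settings, and toy reward function are illustrative assumptions, not the notebook's exact configuration:

from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Assumed model id; the notebook may use a different checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=2048,
    load_in_4bit=True,      # 4-bit RL is the Unsloth-only path
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16,    # illustrative LoRA settings
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

def reward_short(completions, **kwargs):
    # Toy reward: prefer shorter completions. The notebook uses a real
    # kernel-speed reward with anti-cheating checks instead.
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[reward_short],
    args=GRPOConfig(output_dir="grpo-out", num_generations=4,
                    max_completion_length=256),
    train_dataset=prompt_dataset,   # your own dataset with a "prompt" column
)
trainer.train()

The real notebook swaps the toy reward for a kernel-speed reward plus reward-hacking countermeasures, covered later in this post.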

    Unsloth is the only framework to support 4-bit RL for gpt-oss. All performance gains are due to Unsloth’s unique weight sharing, Flex Attention, Standby and custom kernels.

    Reminder: Flash Attention 3 (FA3) is unsuitable for gpt-oss training since it currently doesn’t support backward passes for attention sinks, causing incorrect training loss. If you’re not using Unsloth, FA3 may be enabled by default, so please double-check it’s not in use!

⚡ Making Inference Much Faster

Inference is crucial in RL training. To achieve the fastest inference speed for gpt-oss without vLLM, we rewrote the Transformers inference code and integrated many innovations, including custom algorithms like Unsloth Flex Attention and torch.compile. The new inference was evaluated against an already-optimized baseline (2x faster than native Transformers).
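For reference, the standard Transformers recipe for compiled decoding looks roughly like the sketch below. This is a simplified illustration of the general technique, not Unsloth's internal rewrite:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype=torch.bfloat16, device_map="auto"
)

# A static KV cache gives torch.compile fixed shapes to specialize on,
# which removes per-step Python/dispatch overhead during decoding.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("Matrix multiplication is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))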

vLLM does not support RL for gpt-oss since it lacks BF16 training and LoRA support for the model. Without Unsloth, only BF16 training works, which can make memory use 800%+ higher. Most frameworks enable FA3 by default (which reduces VRAM use and increases speed), but this causes incorrect training loss. You must disable FA3, yet doing so prevents long-context training, so we implemented Unsloth Flex Attention instead.

We evaluated gpt-oss RL inference by benchmarking BitsandBytes 4-bit, and ran separate tests for BF16. Unsloth's 4-bit inference is ~4x faster, and our BF16 path is also more efficient, especially in VRAM use.

The best part about Unsloth's gpt-oss RL is that it can work on any GPU, even those that do not support BF16. Our free gpt-oss-20b Colab notebooks use older 15GB T4 GPUs, so the inference examples work well!

    🛠️ gpt-oss Flex Attention Issues and Quirks

Masking was a tricky issue to deal with. We had to account not only for KV cache prefill during token generation (necessary for efficient inference, since we need to know which KV cache entries correspond to which token in which layer), but also for a varying number of pad tokens in each prompt during batch generation, which changes how the block mask must be stored. An example is shown below:

    Normal Causal Mask:

       k0 k1 k2 k3 k4   <-- keys
    q0 X
    q1 X X
    q2 X X X
    q3 X X X X
    q4 X X X X X <-- last query row (most important for decoding)

For inference in the general case (decoding):

        k0 k1 k2 k3 k4
    q0
    q1
    q2
    q3
    q4   X  X  X  X  X

If we naively use the same masking strategy:

        k0 k1 k2 k3 k4
    q0
    q1
    q2
    q3
q4  X   (note that q4 has q_idx 0, since it is the first query in this setup)

For generation (the decoding phase), we usually only care about the last row of the attention matrix, since there's just one query token attending to all previous key tokens. If we naïvely apply the causal mask (q_idx ≥ k_idx), it fails: our single query has index 0, while there are n_k key tokens. To fix this, we need an offset in mask creation to decide which tokens to attend to. But a naïve approach is slow, since the offset changes each step, forcing mask and kernel regeneration. We solved this with caching and compile optimizations.
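To make the offset idea concrete, here is a minimal sketch using PyTorch's Flex Attention API. It is a simplified illustration, not our production kernel:

import torch
from torch.nn.attention.flex_attention import create_block_mask

def make_decode_mask(offset: int, kv_len: int, device="cuda"):
    # During decode there is a single query row, so q_idx is always 0;
    # shifting it by `offset` (its true position in the sequence) restores
    # the causal rule against all cached keys.
    def causal_with_offset(b, h, q_idx, kv_idx):
        return q_idx + offset >= kv_idx

    # B=None, H=None broadcasts the mask over batch and heads.
    return create_block_mask(causal_with_offset, B=None, H=None,
                             Q_LEN=1, KV_LEN=kv_len, device=device)

mask = make_decode_mask(offset=4, kv_len=5)   # matches the q4 row above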

The harder part is batch generation. Sequences differ in length, so padding complicates mask creation. Flex Attention posed many challenges, and dynamic masks are tricky. Worse, if the mask is not compiled, Flex Attention falls back to eager attention, which is slow and memory-heavy (quadratic rather than linear in sequence length). Ultimately, the mask must dynamically handle prefill vs. decode with the KV cache, batching and per-sequence padding tokens, remain torch.compile friendly, and support sliding windows.
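Padding can be folded into the same mask_mod. The sketch below (again simplified, with an assumed per-sequence length tensor) shows the idea for batched decode:

import torch
from torch.nn.attention.flex_attention import create_block_mask

seq_lens = torch.tensor([5, 3])   # assumed valid (non-pad) length per sequence

def make_batched_decode_mask(offset: int, kv_len: int, device="cuda"):
    def mask_mod(b, h, q_idx, kv_idx):
        causal = q_idx + offset >= kv_idx
        not_pad = kv_idx < seq_lens[b]   # never attend to padding slots
        return causal & not_pad

    return create_block_mask(mask_mod, B=seq_lens.numel(), H=None,
                             Q_LEN=1, KV_LEN=kv_len, device=device)

Per-sequence offsets can be handled the same way, by capturing an offsets tensor and indexing it with b inside mask_mod.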

    🔍 FlashAttention Investigation

Another interesting direction we explored was integrating FlashAttention. Its advantages are widely recognized, so it seemed like an obvious choice to try. One limitation, however, is that it does not support attention sinks during the backward pass for gpt-oss. To work around this, we restructured the attention mechanism so that it operates solely on the attention output and the log-sum-exp (LSE) values that FlashAttention readily provides.
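Concretely, under the standard gpt-oss sink formulation, the sink logit only enlarges the softmax denominator, so the sink-aware output can be recovered from FlashAttention's sink-free output and LSE. A simplified sketch of this rescaling (our illustration of the idea, not the exact kernel code):

import torch

def apply_attention_sink(o, lse, sink_logits):
    # o:           [batch, heads, q_len, head_dim]  sink-free FA output
    # lse:         [batch, heads, q_len]            log-sum-exp of scores
    # sink_logits: [heads]                          learnable sink per head
    #
    # With a sink logit s, the softmax denominator grows from exp(lse) to
    # exp(lse) + exp(s), so the output is rescaled by
    #   exp(lse) / (exp(lse) + exp(s)) = sigmoid(lse - s).
    scale = torch.sigmoid(lse - sink_logits[None, :, None])
    return o * scale.unsqueeze(-1)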

However, we soon began noticing issues. While the first few layers behaved as expected, later layers, particularly layers 18 through 24, produced outputs that diverged significantly from the eager-mode implementation in Transformers. Importantly, this discrepancy cannot be attributed to error accumulation, since the inputs to each method are identical at every layer. For further validation, we also compared the results against Unsloth Flex Attention.

Why only the last few layers show such a drastic difference between the FlashAttention implementation and the others needs further investigation.

    Flash Attention 3

FA3 is often enabled by default in most training packages (though not Unsloth), but it is incorrect for gpt-oss. Using FA3 will make the training loss completely wrong, as FA3 doesn't support gpt-oss backward passes for attention sinks. Many people are still unaware of this, so please be cautious!
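If you are loading gpt-oss through plain Transformers, one way to be safe is to pin the attention backend explicitly and verify it (a minimal check, assuming the standard loading path):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    attn_implementation="eager",   # or "sdpa"; anything but an FA3 kernel
)
print(model.config._attn_implementation)   # confirm what is actually active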

    ⚠️ Can We Counter Reward Hacking?

The ultimate goal of RL is to maximize some reward (say speed, revenue, or some other metric). But RL can cheat. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task, this is called "reward hacking". It's the reason models learn to modify unit tests to pass coding challenges, and it is a critical blocker for real-world deployment. Some other good examples can be found on Wikipedia.

In our notebook we explore how to counter reward hacking in a code-generation setting and showcase tangible solutions to common failure modes. We saw the model edit the timing function, outsource work to other libraries, cache results, and outright cheat. After countering these exploits, our model generates genuinely optimized matrix multiplication kernels, not clever cheats.
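For illustration, a reward function hardened along these lines might look like the sketch below. The forbidden-pattern list and speedup formula are our own simplified examples, not the notebook's exact reward:

import re
import time
import torch

# Assumed forbidden patterns; tune these to the cheats you actually observe.
FORBIDDEN = [r"\btime\.", r"\bnumpy\b", r"\bscipy\b", r"lru_cache"]

def matmul_reward(code: str, candidate_fn, baseline_fn) -> float:
    # 1) Static check: reject timing edits, library outsourcing, caching.
    if any(re.search(pat, code) for pat in FORBIDDEN):
        return -1.0
    # 2) Correctness on fresh random inputs, so cached answers fail.
    a, b = torch.randn(256, 256), torch.randn(256, 256)
    if not torch.allclose(candidate_fn(a, b), baseline_fn(a, b), atol=1e-4):
        return -1.0
    # 3) We do the timing ourselves; the model never touches the clock.
    t0 = time.perf_counter(); candidate_fn(a, b); t_cand = time.perf_counter() - t0
    t0 = time.perf_counter(); baseline_fn(a, b); t_base = time.perf_counter() - t0
    return max(0.0, t_base / max(t_cand, 1e-9) - 1.0)   # reward = speedup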

    💫 From OpenAI’s Labs to Your Laptop

    gpt-oss is a legitimate frontier-class architecture from OpenAI that could power breakthrough AI applications. Until now, training these models with RL was exclusively limited to well-funded labs with H100s to spare.

Today, that changes. You can train gpt-oss-20b with GRPO on a free Google Colab tier here. Free, frontier-model training.
