    Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster


    Ken Jin

    24 December 2025

    Some time ago I posted an apology piece
    for Python’s tail calling results. I apologized for communicating performance
    results without noticing that a compiler bug had occurred.

    I can proudly say today that I am partially retracting that apology, but
    only for two platforms: macOS AArch64 (Xcode Clang) and Windows x86-64 (MSVC).

    In our own experiments, the tail calling interpreter for CPython
    was found to beat the computed
    goto interpreter by 5% on pyperformance on AArch64 macOS using Xcode Clang,
    and by roughly 15% on pyperformance on Windows on an experimental internal
    version of MSVC. The Windows comparison is against a switch-case interpreter, but
    in theory this shouldn’t matter too much. More on that in the next section.

    This is, of course, only a hopefully accurate result. I tried to be more diligent
    here, but I am not infallible. However, I have found that sharing early and making a fool of myself often works well, as it has led to people catching bugs in my code, so I shall continue doing so :).

    Also, this assumes the change doesn’t get reverted later in Python 3.15’s
    development cycle.

    Brief background on interpreters

    Just a recap. There are two currently popular ways of writing C-based
    interpreters.

    Switch-cases:

    switch (opcode) {
        case INST_1: ...
        case INST_2: ...
    }
    

    Where we just switch-case to the correct instruction handler.
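
To make the shape concrete, here is a minimal sketch of a switch-case interpreter for a toy stack machine. The opcodes (OP_PUSH, OP_ADD, OP_MUL, OP_HALT) and the function name run_switch are made up for illustration and have nothing to do with CPython's real bytecode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not CPython's). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Runs `code` and returns the value on top of the stack at OP_HALT. */
long run_switch(const int *code)
{
    long stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next code unit to read */

    for (;;) {
        int opcode = code[pc++];
        /* Every instruction comes back through this one switch. */
        switch (opcode) {
        case OP_PUSH:
            assert(sp < 64);
            stack[sp++] = code[pc++];   /* operand follows the opcode */
            break;
        case OP_ADD:
            assert(sp >= 2);
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_MUL:
            assert(sp >= 2);
            stack[sp - 2] *= stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

Feeding it the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT should return 20.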

    And the other popular way is a
    GCC/Clang extension called labels-as-values/computed gotos.

    goto *dispatch_table[opcode];
    INST_1: ...
    INST_2: ...
    

    Which is basically the same idea, but instead jumps to the address of the
    next label. Traditionally, the key optimization here is that it needs
    only one jump to go to the next instruction, while in the switch-case
    interpreter, a naive compiler would need two jumps.
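
The same toy machine can be sketched with computed gotos. This assumes a compiler that supports the labels-as-values extension (GCC or Clang); the opcodes and the name run_computed_goto are hypothetical. The thing to notice is that each handler ends in its own indirect jump instead of going back to a shared switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not CPython's). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Requires the GCC/Clang labels-as-values extension (&&label, goto *ptr).
   Opcodes are trusted: there is no check for values outside the table. */
long run_computed_goto(const int *code)
{
    /* One label address per opcode, indexed by opcode value. */
    static void *const dispatch_table[] = {
        [OP_PUSH] = &&do_push,
        [OP_ADD]  = &&do_add,
        [OP_MUL]  = &&do_mul,
        [OP_HALT] = &&do_halt,
    };
    long stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next code unit to read */

/* Fetch the next opcode and jump straight to its handler. */
#define DISPATCH() goto *dispatch_table[code[pc++]]

    DISPATCH();

do_push:
    assert(sp < 64);
    stack[sp++] = code[pc++];   /* operand follows the opcode */
    DISPATCH();
do_add:
    assert(sp >= 2);
    stack[sp - 2] += stack[sp - 1];
    sp--;
    DISPATCH();
do_mul:
    assert(sp >= 2);
    stack[sp - 2] *= stack[sp - 1];
    sp--;
    DISPATCH();
do_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef DISPATCH
}
```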

    With modern compilers, however, the benefit of computed gotos is a lot smaller,
    mainly because both compilers and hardware have gotten better.
    In Nelson Elhage’s
    excellent investigation
    on the next kind of interpreter,
    the speedup of computed gotos over switch-case on modern Clang was
    only in the low single digits on pyperformance.

    A third way, which was suggested decades ago but was not really feasible,
    is call/tail-call threaded interpreters. In this scheme, each bytecode
    handler is its own function, and we tail-call from one handler to the next
    in the instruction stream:

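    PyObject *INST_1(...) {
        ...
        // Tail-call the handler for the next instruction.
        return dispatch_table[opcode](...);
    }
    
    PyObject *INST_2(...) {
        ...
        return dispatch_table[opcode](...);
    }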
    

    This wasn’t too feasible in C for one main reason—tail call optimization
    was merely an optimization. It’s something the C compiler might do, or
    might not do. This means if you’re unlucky and the C compiler chooses not
    to perform the tail call, your interpreter might stack overflow!

    Some time ago, Clang introduced __attribute__((musttail)), which allowed
    for mandating that a call must be tail-called. Otherwise, the compilation
    will fail. To my knowledge, the first time this was popularized for use
    in a mainstream interpreter was in
    Josh Haberman’s Protobuf blog post.
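
As a rough sketch of how the tail-calling scheme could look with musttail, here is a toy stack machine written that way. The MUSTTAIL macro, the opcodes, the handler names and the handler signature are assumptions made for this sketch, not CPython's actual definitions. On compilers without the attribute the macro expands to nothing, so the tail call is no longer guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Use musttail where the compiler has it; otherwise fall back to a plain
   call, in which case the tail call is an optimization, not a guarantee. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

/* Hypothetical opcodes for a toy stack machine (not CPython's). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* All handlers share one signature, which musttail requires.
   pc points just past the current opcode; sp points one past the stack top. */
typedef long handler_fn(const int *pc, long *sp);

static handler_fn op_push, op_add, op_mul, op_halt;

static handler_fn *const dispatch_table[] = {
    [OP_PUSH] = op_push,
    [OP_ADD]  = op_add,
    [OP_MUL]  = op_mul,
    [OP_HALT] = op_halt,
};

/* Read the next opcode and tail-call its handler. */
#define DISPATCH() MUSTTAIL return dispatch_table[*pc](pc + 1, sp)

static long op_push(const int *pc, long *sp)
{
    *sp++ = *pc++;   /* operand follows the opcode */
    DISPATCH();
}

static long op_add(const int *pc, long *sp)
{
    sp[-2] += sp[-1];
    sp--;
    DISPATCH();
}

static long op_mul(const int *pc, long *sp)
{
    sp[-2] *= sp[-1];
    sp--;
    DISPATCH();
}

static long op_halt(const int *pc, long *sp)
{
    (void)pc;        /* nothing left to decode */
    return sp[-1];
}

/* Entry point: an ordinary call into the first handler. */
long run_tail_call(const int *code)
{
    long stack[64];
    assert(code != NULL);
    return dispatch_table[code[0]](code + 1, stack);
}
```

Each handler here is a small function of its own, which is the property the rest of this post leans on.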

    Later on, Haoran Xu noticed that the GHC calling convention combined with
    tail calls produced efficient code. They used this for their baseline
    JIT in a paper and termed the technique
    Copy-and-Patch.

    So where are we now?

    After using a fixed Xcode Clang, our performance numbers on CPython
    3.14/3.15 suggest that the tail calling interpreter does provide a
    modest speedup over computed gotos, around 5% geomean on
    pyperformance.

    To my understanding, uv already ships Python 3.14 on macOS with tail calling,
    which might be responsible for some of the speedups you see on there.
    We’re planning to ship the official 3.15 macOS binaries on python.org with
    tail calling as well.

    However, you’re not here for that. The title of this blog post
    is clearly about MSVC Windows x86-64. So what about that?

    Tail-calling for Windows

    [!CAUTION]
    The features for MSVC discussed below are, to my knowledge, undocumented.
    They are not guaranteed to stay around unless the MSVC team decides to keep them. Use at your own risk!

    These are the preliminary pyperformance results
    for CPython on MSVC with tail-calling vs
    switch-case. Any number above 1.00x is a speedup
    (e.g. 1.01x == 1% speedup), anything below 1.00x is a slowdown.
    The speedup is a geometric mean of around 15-16%, with a
    range of ~60% slowdown (one or two outliers) to 78% speedup.
    However, the key thing is that the vast majority of benchmarks sped up!

    Chart credits to Michael Droettboom

    [!WARNING]
    These results are on an experimental internal MSVC compiler, public results below.

    To verify this and make sure I wasn’t wrong yet again, I checked the results
    on my machine with Visual Studio 2026. These are the results from
    this issue.

    Mean +- std dev: [spectralnorm_tc_no] 146 ms +- 1 ms -> [spectralnorm_tc] 98.3 ms +- 1.1 ms: 1.48x faster
    Mean +- std dev: [nbody_tc_no] 145 ms +- 2 ms -> [nbody_tc] 107 ms +- 2 ms: 1.35x faster
    Mean +- std dev: [bm_django_template_tc_no] 26.9 ms +- 0.5 ms -> [bm_django_template_tc] 22.8 ms +- 0.4 ms: 1.18x faster
    Mean +- std dev: [xdsl_tc_no] 64.2 ms +- 1.6 ms -> [xdsl_tc] 56.1 ms +- 1.5 ms: 1.14x faster
    

    So yeah, the speedups are real! For a large-ish library like xDSL, we see
    a 14% speedup, while for smaller microbenchmarks like nbody and spectralnorm,
    the speedups are greater.
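
For reading the numbers above, and assuming the reported factor is simply the old mean divided by the new mean, the arithmetic can be sketched as follows. The function names are made up for illustration. Note that a factor of about 1.48x, as for spectralnorm, corresponds to roughly a third less run time, not 48% less.

```c
#include <assert.h>

/* Speedup factor, assumed here to be old mean / new mean. */
double speedup_factor(double old_ms, double new_ms)
{
    assert(old_ms > 0.0 && new_ms > 0.0);
    return old_ms / new_ms;
}

/* Percentage of run time saved: (old - new) / old * 100. */
double time_saved_percent(double old_ms, double new_ms)
{
    assert(old_ms > 0.0 && new_ms > 0.0);
    return (old_ms - new_ms) / old_ms * 100.0;
}
```

For example, speedup_factor(146.0, 98.3) uses the spectralnorm means from the output above.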

    Thanks to Chris Eibl and Brandt Bucher, we managed to get the
    PR for this
    on MSVC over the finish line. I also want to sincerely thank the MSVC team. I can’t say this enough: they have been a joy to work with and
    I’m very impressed by what they’ve done, and I want to congratulate them
    on releasing Visual Studio 2026.

    This is now listed in the What’s New for 3.15 notes:

    Builds using Visual Studio 2026 (MSVC 18) may now use the new tail-calling interpreter. Results on an early experimental MSVC compiler reported roughly 15% speedup on the geometric mean of pyperformance on Windows x86-64 over the switch-case interpreter. We have observed speedups ranging from 15% for large pure-Python libraries to 40% for long-running small pure-Python scripts on Windows. (Contributed by Chris Eibl, Ken Jin, and Brandt Bucher in gh-143068. Special thanks to the MSVC team including Hulon Jenkins.)

    Where exactly do the speedups come from?

    I used to believe that tail calling interpreters get their speedup
    from better register use. While I still believe that now, I suspect that is
    not the main reason for speedups in CPython.

    My main guess now is that
    tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.

    Let me show an example. At the time of writing, CPython 3.15’s interpreter loop
    is around 12k lines of C code. That’s 12k lines in a single function
    for the switch-case and computed goto interpreters.

    This has caused many issues for compilers in the past, too many to list in fact.
    I have a EuroPython 2025 talk about this. In short, this overly large function
    breaks a lot of compiler heuristics.

    One of the most beneficial optimisations is inlining. In the past, we’ve found
    that compilers sometimes straight up
    refuse to inline even the
    simplest of functions in that 12k loc eval loop. I want to stress that this
    is not the fault of the compiler. It’s actually doing the correct
    thing: you usually don’t want to increase the code size of something already
    super large. Unfortunately, this doesn’t bode well for our interpreter.

    You might say just write the interpreter in assembly!
    However, the whole point of this exercise is to not do that.

    Ok enough talk, let’s take a look at the code now. Taking a real
    example, we examine BINARY_OP_ADD_INT which adds two Python integers.
    Cleaning up the code so it’s readable, things look like this:

    TARGET(BINARY_OP_ADD_INT) {
        // Increment the instruction pointer.
        _Py_CODEUNIT* const this_instr = next_instr;
        frame->instr_ptr = next_instr;
        next_instr += 6;
        _PyStackRef left = stack_pointer[-2];
        _PyStackRef right = stack_pointer[-1];
        // Check that LHS is an int.
        PyObject *value_o = PyStackRef_AsPyObjectBorrow(left);
        if (!_PyLong_CheckExactAndCompact(value_o)) {
            JUMP_TO_PREDICTED(BINARY_OP);
        }
        // Check that RHS is an int.
        // ... (same code as above for LHS)
    
        // Add them together.
        PyObject *left_o = PyStackRef_AsPyObjectBorrow(left);
        PyObject *right_o = PyStackRef_AsPyObjectBorrow(right);
        res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
    
        // If the addition fails, fall back to the generic instruction.
        if (PyStackRef_IsNull(res)) {
            JUMP_TO_PREDICTED(BINARY_OP);
        }
    
        // Close the references.
        PyStackRef_CLOSE_SPECIALIZED(left, _PyLong_ExactDealloc);
        PyStackRef_CLOSE_SPECIALIZED(right, _PyLong_ExactDealloc);
    
        // Write to the stack, and dispatch.
        stack_pointer[-2] = res;
        stack_pointer += -1;
        DISPATCH();
    }
    

    Seems simple enough. Let’s take a look at the assembly for switch-case on
    VS 2026. Note that this is a non-PGO build for easy source information.
    PGO generally makes some of these problems go away, but not all of them:

                    if (!_PyLong_CheckExactAndCompact(value_o)) {
    00007FFC4DE24DCE  mov         rcx,rbx  
    00007FFC4DE24DD1  mov         qword ptr [rsp+58h],rax  
    00007FFC4DE24DD6  call        _PyLong_CheckExactAndCompact (07FFC4DE227F0h)  
    00007FFC4DE24DDB  test        eax,eax  
    00007FFC4DE24DDD  je          _PyEval_EvalFrameDefault+10EFh (07FFC4DE258FFh)
    ...
                    res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
    00007FFC4DE24DFF  mov         rdx,rbx  
    00007FFC4DE24E02  mov         rcx,r15  
    00007FFC4DE24E05  call        _PyCompactLong_Add (07FFC4DD34150h)  
    00007FFC4DE24E0A  mov         rbx,rax  
    ...
                    PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
    00007FFC4DE24E17  lea         rdx,[_PyLong_ExactDealloc (07FFC4DD33BD0h)]  
    00007FFC4DE24E1E  mov         rcx,rsi  
    00007FFC4DE24E21  call        PyStackRef_CLOSE_SPECIALIZED (07FFC4DE222A0h) 
    

    Huh… none of our functions were inlined. Surely that must mean they were
    too big or something, right? Let’s look at PyStackRef_CLOSE_SPECIALIZED:

    static inline void
    PyStackRef_CLOSE_SPECIALIZED(_PyStackRef ref, destructor destruct)
    {
        assert(!PyStackRef_IsNull(ref));
        if (PyStackRef_RefcountOnObject(ref)) {
            Py_DECREF_MORTAL_SPECIALIZED(BITS_TO_PTR(ref), destruct);
        }
    }
    

    That looks … inlineable?

    Here’s how BINARY_OP_ADD_INT looks with tail calling on VS 2026 (again,
    no PGO):

                    if (!_PyLong_CheckExactAndCompact(left_o)) {
    00007FFC67164785  cmp         qword ptr [rax+8],rdx  
    00007FFC67164789  jne         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+149h (07FFC67164879h)  
    00007FFC6716478F  mov         r9,qword ptr [rax+10h]  
    00007FFC67164793  cmp         r9,10h  
    00007FFC67164797  jae         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+149h (07FFC67164879h) 
    ...
                    res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
    00007FFC6716479D  mov         eax,dword ptr [rax+18h]  
    00007FFC671647A0  and         r9d,3  
    00007FFC671647A4  and         r8d,3  
    00007FFC671647A8  mov         edx,1  
    00007FFC671647AD  sub         rdx,r9  
    00007FFC671647B0  mov         ecx,1  
    00007FFC671647B5  imul        rdx,rax  
    00007FFC671647B9  mov         eax,dword ptr [rbx+18h]  
    00007FFC671647BC  sub         rcx,r8  
    00007FFC671647BF  imul        rcx,rax  
    00007FFC671647C3  add         rcx,rdx  
    00007FFC671647C6  call        medium_from_stwodigits (07FFC6706E9E0h)  
    00007FFC671647CB  mov         rbx,rax  
    ...
                    PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
    00007FFC671647EB  test        bpl,1  
    00007FFC671647EF  jne         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0ECh (07FFC6716481Ch)  
    00007FFC671647F1  add         dword ptr [rbp],0FFFFFFFFh  
    00007FFC671647F5  jne         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0ECh (07FFC6716481Ch)  
    00007FFC671647F7  mov         rax,qword ptr [_PyRuntime+25F8h (07FFC675C45F8h)]  
    00007FFC671647FE  test        rax,rax  
    00007FFC67164801  je          _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0E4h (07FFC67164814h)  
    00007FFC67164803  mov         r8,qword ptr [_PyRuntime+2600h (07FFC675C4600h)]  
    00007FFC6716480A  mov         edx,1  
    00007FFC6716480F  mov         rcx,rbp  
    00007FFC67164812  call        rax  
    00007FFC67164814  mov         rcx,rbp  
    00007FFC67164817  call        _PyLong_ExactDealloc (07FFC67073DA0h) 
    

    Would you look at that, suddenly our trivial functions get inlined :).
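
The inlined close in the listing above boils down to a tag-bit test, a reference count decrement, and a conditional call to the destructor. A toy model of that shape is sketched below. Every name in it (toy_object, toy_ref, TOY_TAG_NO_REFCOUNT and so on) is hypothetical, the layout is not CPython's real one, and the runtime hook call seen in the assembly is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy object and tagged reference; not CPython's real layout. */
typedef struct { int refcount; int dealloc_calls; } toy_object;
typedef struct { uintptr_t bits; } toy_ref;
typedef void (*toy_destructor)(toy_object *);

/* Low bit set means this reference does not own a refcount on the object. */
#define TOY_TAG_NO_REFCOUNT ((uintptr_t)1)

static inline int toy_refcount_on_object(toy_ref ref)
{
    return (ref.bits & TOY_TAG_NO_REFCOUNT) == 0;
}

static inline toy_object *toy_ref_to_ptr(toy_ref ref)
{
    return (toy_object *)(ref.bits & ~TOY_TAG_NO_REFCOUNT);
}

/* Small enough that a compiler would normally be expected to inline it. */
static inline void toy_close_specialized(toy_ref ref, toy_destructor destruct)
{
    if (toy_refcount_on_object(ref)) {
        toy_object *op = toy_ref_to_ptr(ref);
        assert(op->refcount > 0);
        if (--op->refcount == 0) {
            destruct(op);
        }
    }
}

/* Stand-in destructor that only records that it ran. */
static void toy_dealloc(toy_object *op)
{
    op->dealloc_calls++;
}
```

Closing a tagged reference leaves the count alone, while closing the last owning reference runs the destructor once.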

    You might also say, surely this does not happen on PGO builds? Well the issue
    I linked above actually says it does! So yeah happy days.

    Once again I want to stress, this is not the compiler’s fault! It’s just that
    the CPython interpreter loop is not the best thing to optimize.

    How do I try this out?

    Unfortunately, for now, you will have to build from source.

    With VS 2026, after cloning CPython, for a release build with PGO:

    $env:PlatformToolset = "v145"
    ./PCbuild/build.bat --tail-call-interp -c Release -p x64 --pgo
    

    Hopefully, we can distribute this in an easier binary form in the future
    once Python 3.15’s development matures!
