    Evaluating LLMs Playing Text Adventures

    By TechAiVerse · August 12, 2025

    When we first set up the llm such that it could play text adventures, we noted
    that none of the models we tried to use with it were any good at it. We dreamed
    of a way to compare them, but all I could think of was setting a goal far into
    the game and seeing how long it takes them to get there. I just realised there’s
    a better way to do it.

    Evaluation against achievements

    What we’ll do is set a low-ish turn limit and see how much they manage to
    accomplish in that time. (Another alternative for more linear games is running
    them multiple times with a turn limit and seeing how often they get past a
    particular point within that turn limit.)

    Given how much freedom is offered to players of text adventures, this is a
    difficult test. It’s normal even for a skilled human player to immerse
    themselves in their surroundings rather than make constant progress. I wouldn’t
    be surprised if I got a score of zero if someone plopped me down in front of
    this test. But still, maybe it’s the best we can do with limited resources.
    (Another idea is to give them a far-off goal and then somehow have them request
    hints when they are stuck, and count how many hints they need to get there.
    However, given how little use they made of the hints in the previous article, I
    doubt this would work very well either.)

    What we’ll do is define a set of achievements for a game. These achievements
    will be clustered around the first few turns of the game, because we’ll only
    give the llm a few turns to earn them. Here’s an example for 9:05.

    TURN_LIMIT          40
    ANSWER_PHONE        Click.
    EXIT_BED            You get out of bed.
    OPEN_DRESSER        revealing some clean
    ENTER_BATHROOM      far from luxurious
    REMOVE_SOILED       You take off the soiled
    REMOVE_WATCH        You take off the watch
    ENTER_SHOWER        dawdle
    WEAR_CLEAN          You put on the clean
    OPEN_FRONT          You open the front
    UNLOCK_CAR          Unlocked.
    ENTER_CAR           Las Mesas
    OPEN_WALLET         open the wallet
    CARD_SLOT           green LED lights
    

    It should be fairly clear how this works: the TURN_LIMIT specifies how many
    turns the llm has to collect achievements. Every line other than that
    specifies an achievement: the name is on the left, and it counts as earned when
    the game prints the text on the right. The llm knows nothing of these
    achievements. It tries to get through the game and in the background we use the
    achievements to count how far it gets.
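
    For the curious, checking achievements against a transcript can be as simple
    as scanning it for the expected strings. Below is a minimal sketch of that
    idea – not the actual harness – where the variable names and overall shape
    are illustrative assumptions.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch: read an achievements file in the format above and tally which
    # achievements a saved transcript earned. File names come from the command line.
    my ($ach_file, $transcript_file) = @ARGV;

    open my $ach, '<', $ach_file or die "cannot open $ach_file: $!";
    my ($turn_limit, %needle_for);
    while (<$ach>) {
        chomp;
        next unless /\S/;
        my ($name, $text) = split ' ', $_, 2;
        if ($name eq 'TURN_LIMIT') { $turn_limit = $text }
        else                       { $needle_for{$name} = $text }
    }
    close $ach;

    open my $tr, '<', $transcript_file or die "cannot open $transcript_file: $!";
    my $transcript = do { local $/; <$tr> };
    close $tr;

    # An achievement counts as earned if the game printed its text anywhere in
    # the (turn-limited) transcript.
    my @earned = grep { index($transcript, $needle_for{$_}) >= 0 } sort keys %needle_for;
    my $total  = keys %needle_for;
    printf "%d/%d achievements (%.0f %%) within %s turns\n",
        scalar @earned, $total, 100 * @earned / $total, $turn_limit // '?';

    In the real script the same check can just as well run after each turn; the
    idea is identical.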

    It might seem like the turn limit must be calibrated such that a score of 100 %
    is possible, but that’s not the case. Many of the games we are going to test
    with have branching already at the start, such that the achievements need to
    cover multiple branches, and it’s impossible to go through all branches within
    the turn limit. What we do need to be careful about is making sure the number of
    achievements in each branch is roughly the same, otherwise models that are lucky
    and go down an achievement-rich path will get a higher score. Because of this,
    the score we get out of this test is a relative comparison between models, not
    an absolute measure of how well the llms play text adventures. We have already
    established that they don’t do it very well, and we can’t be more nuanced than
    that without paying for a lot of eval tokens.

    We might consider making some moves not count toward the turn limit, for example
    erroneous commands, or examining things – the latter because more powerful
    models are more methodical and examine more things, and it seems odd to penalise
    them for this. However, in the end, examining things is probably part of what
    allows the more powerful models to make further progress (and typing valid
    commands is part of being good at text adventures), so we won’t give away any
    moves for free.
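
    To make the bookkeeping concrete, the turn-limited loop amounts to something
    like the sketch below. The two code references stand in for the actual llm
    call and the z-machine plumbing; the names are hypothetical.

    # Sketch of the turn-limited loop. Every command the model submits counts
    # as a turn, including erroneous commands and examining things.
    sub play_with_turn_limit {
        my ($turn_limit, $intro, $ask_model, $send_to_game) = @_;
        my $history = $intro;                         # full transcript so far
        for my $turn (1 .. $turn_limit) {
            my $command = $ask_model->($history);     # next command from the llm
            my $output  = $send_to_game->($command);  # feed it to the interpreter
            $history   .= "\n> $command\n$output";
        }
        return $history;    # achievements are counted against this transcript
    }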

    Evaluating many popular models

    We register for OpenRouter to get convenient access to more models and then let
    them whirr away with the Perl script, which is updated to cut the llm off at
    the turn limit. At that point it reports to us how many achievements were
    earned. We get the following results, ordered roughly by decreasing performance.
    (The result tables in this article are wide; on narrow viewports you may have
    to scroll sideways.)

    Model 9:05 Lockout Dreamhold Lost Pig
    Grok 4 86 % 15 % 46 % 33 %
    Claude 4 Sonnet 80 % 30 % 53 % 46 %
    Gemini 2.5 Flash 80 % 30 % 33 % 46 %
    Gemini 2.5 Pro 80 % 30 % 40 % 40 %
    DeepSeek R1 0528 80 % 23 % 33 % 33 %
    Claude 4 Opus 73 % 30 % 60 % 46 %
    gpt-5 Chat 73 % 15 % 53 % 33 %
    DeepSeek V3 66 % 23 % 20 % 33 %
    gpt-4o 53 % 23 % 40 % 40 %
    Qwen3 Coder 53 % 23 % 40 % 33 %
    Kimi K2 53 % 30 % 46 % 40 %
    glm 4.5 53 % 23 % 33 % 53 %
    Claude 3.5 Haiku 38 % 15 % 26 % 26 %
    Llama 3 Maverick 33 % 30 % 40 % 33 %
    gpt-o3-mini 20 % 15 % 26 % 26 %
    Mistral Small 3 20 % 15 % 0 % 20 %
    gpt-4o-mini 13 % 23 % 20 % 40 %

    Ideally, these should be run multiple times to account for random variation in
    performance, but given that the Opus sessions cost around $4, I’m not going to
    do that. I was close to not even running Opus for all four games! (For example,
    in 9:05, Opus thought it did not carry the wallet when it did, so it jumped
    into the car again to go back for it. Clever, but it wasted enough turns to
    lose to Sonnet thanks to a silly mistake!)

    Adjusting model ranking for game difficulty

    Some models appear to perform better in some games than others, so it’s hard to
    rank the models. We could take the average of their scores, but that’s unfair
    because some of the games are harder than others: a 40 % in Lockout should be
    considered more impressive than a 40 % in Dreamhold. What we will do, which
    may or may not be valid, is run a linear regression using models and games as
    predictors. This gives us coefficients for the games (telling us how difficult
    the games are), but also coefficients for the models, and these are the ones we
    want, because the coefficients for the models are adjusted for game difficulty.

    This regression is performed with the baseline being 9:05 played by gpt-5
    Chat. Most of the model coefficients are not statistically significant (because
    four games is not enough to figure out statistical significance unless the model
    is truly terrible), but they might serve as a first-order estimation for ranking
    models.
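
    Spelled out, the regression fits an additive model, with the baseline pinned
    to gpt-5 Chat playing 9:05:

    \mathit{score}_{m,g} = \beta_0 + \beta_m + \gamma_g + \varepsilon_{m,g},
    \qquad \beta_{\text{gpt-5 Chat}} = \gamma_{\text{9:05}} = 0

    If the scores are fitted as fractions, a model coefficient of +0.09 then reads
    as roughly nine percentage points above gpt-5 Chat once game difficulty is
    accounted for.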

    In this table, cost is per million output tokens. (The design of the script
    ensures that output and input are similar in size – O(1) to be specific – so
    output is what is going to drive the cost.) The table is divided into three
    categories: performance better than gpt-5 Chat, cheaper models with
    performance that is nearly there, and models that suck.

    Model Coefficient Cost ($/Mt)
    Claude 4 Opus +0.09 75
    Claude 4 Sonnet +0.09 15
    Gemini 2.5 Pro +0.04 10
    Gemini 2.5 Flash +0.04 0.7
    Grok 4 +0.02 15
    gpt-5 Chat (baseline) 0.00 10
    Kimi K2 -0.01 2.5
    DeepSeek R1 0528 -0.01 0.7
    glm 4.5 -0.03 0.8
    gpt-4o -0.05 0.1
    Qwen3 Coder -0.06 0.8
    DeepSeek V3 -0.08 0.7
    Llama 3 Maverick -0.10 0.6
    Claude 3.5 Haiku -0.17 4
    gpt-4o-mini -0.20 0.6
    gpt-o3-mini -0.22 4.4
    Mistral Small 3 -0.30 0.1

    Some comments:

    • I find it interesting that the top-tier models (Claude Opus, Gemini Pro) don’t
      seem to significantly outperform their cheaper siblings (Claude Sonnet, Gemini
      Flash) in these tests. (This might be because we are hand-holding the models
      so much in the prompt; more powerful models may be better at directing
      themselves.)
    • I’m very impressed by Gemini 2.5 Flash. At that cost, it is performing
      admirably. It is hard to argue for using models like DeepSeek’s R1 when we get
      better performance at the same cost from the Google model.
    • The small models really aren’t good general problem solvers. I think Haiku
      costs so much because it is good at language, not reasoning.

    It would be super interesting to toss these at more games to work out the finer
    differences (e.g. is there really a difference between Gemini Pro and Flash,
    or was that just down to sampling error in the small sample of games I had them
    play?), but such a comparison gets expensive: partly because of the cost of
    eval tokens (the above table cost something like $34), but mainly because it
    would require me to sit down and create sets of achievements for these games. I
    have only played so many z-code games, so I cannot do this for very many games.
    If someone wants to support me, please reach out!

    Testing the top models on more games

    I have played three more games, though, so let’s continue the evaluation with
    the five top models on these games also. Their performances on the three new
    games are

    Model For a Change Plundered Hearts So Far
    Claude 4 Sonnet 11 % 19 % 28 %
    Gemini 2.5 Pro 16 % 28 % 28 %
    GPT-5 Chat 44 % 33 % 0 %
    Grok 4 22 % 28 % 28 %
    Gemini 2.5 Flash 28 % 33 % 14 %

    Using the same methodology as before (combining data from both trial run sets),
    we arrive at new coefficients for the evaluated models. (I also investigated
    how Gemini 2.0 Flash compared against Gemini 2.5 Flash, because the former is
    significantly cheaper and the latter was surprisingly good. Unfortunately,
    Gemini 2.0 Flash was not very good: its performance relative to its younger
    sibling was -15 %pt. I was also tempted to compare o3-mini against o3-mini-high
    to see the effect of the reasoning_effort parameter, but since o3-mini was such
    a crappy model anyway, it was hard to justify the effort.)

    Model Coefficient Cost ($/Mt)
    Claude 4 Sonnet +0.02 15
    Gemini 2.5 Pro +0.02 10
    Gemini 2.5 Flash +0.02 0.7
    GPT-5 Chat (baseline) 0.00 10
    Grok 4 -0.01 15

    On the one hand, it’s a little odd that the performance of Claude 4 Sonnet
    dropped. On the other hand, I calibrated the prompt using Claude 4 Sonnet
    against 9:05, so by adding more games we are effectively diluting the training
    set within the test set; we probably should expect a performance drop at that
    point.

    Noting the cost column, Gemini 2.5 Flash is a clear winner for running text
    adventures. It’s also fast compared to the others.

    Evaluating score variation

    Given that I’ve already sunk some money into this article series, and a few
    additional sessions with Gemini 2.5 Flash cannot hurt that much, let’s splurge
    and do that thing we wanted to do in the first place: run the same model against
    the same game a few times to figure out the size of the sampling error. All of
    the scores in the table below come from Gemini 2.5 Flash. The first column is
    the standard deviation of the remaining columns.
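
    Concretely, this appears to be the sample standard deviation

    s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2},

    which reproduces, for example, the 14 %pt for the six 9:05 runs.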

    Game St. dev. Run 1 Run 2 Run 3 Run 4 Run 5 Run 6
    9:05 14 %pt 73 % 86 % 86 % 80 % 53 % 60 %
    Lockout 11 %pt 30 % 46 % 46 % 38 % 23 % 23 %
    Dreamhold 10 %pt 53 % 40 % 46 % 46 % 53 % 26 %
    Lost Pig 3 %pt 46 % 40 % 40 % 40 % 46 % 40 %
    For a Change 6 %pt 16 % 11 % 16 % 5 % 0 % 11 %
    Plundered Hearts 4 %pt 19 % 19 % 19 % 23 % 28 % 28 %
    So Far 32 %pt 14 % 57 % 71 % 71 % 71 % 0 %

    In case it is not obvious, this is not so much an evaluation of Gemini 2.5 Flash
    as it is a judgment of the quality of the testing protocol. It is clear, for
    example, that using So Far to evaluate llms is a mistake: the same model has
    large variation between runs, and the difference between runs of different
    models is not so large. It would be more informative to replace the run of So
    Far with another run of one of the other games – maybe Plundered Hearts or
    Lost Pig, which start out more linearly. (For a Change might look like a good
    game for evaluation, but I think that’s a mistake. It’s not that the model
    makes consistent progress; rather, it fails to make almost any progress at all,
    thanks to how open the game is right from the gate.)

    Conclusions

    I’m not sure what conclusions to draw from this article series.

    • We can drive z-code text adventures through Perl, which lets us connect them
      to an llm in a controlled way. It turned out to be more complicated than one
      would think, but definitely doable.
    • llms are still not great at playing text adventures. Giving them leading
      questions to keep them on track helps a lot. Giving them hints helps them
      surprisingly little.
    • The variation in how much they accomplish can be large for some games with
      lots of distracting details, such as Lockout and So Far. The games that
      are easiest to evaluate with are those with a relatively linear beginning,
      such as Lost Pig and Plundered Hearts.
    • There is one cheap model that is about as good as llm models get at playing
      text adventures: Gemini 2.5 Flash. Many of the other cheap models might have
      performance worse than gpt-5 Chat, and probably also worse than Gemini 2.5
      Flash. Claude 4 Sonnet might seem like the best model if costs be damned, but
      that is probably because the prompt was calibrated against Claude 4 Sonnet.
    • Running llms in agentic-style applications really burns through api credits
      like nothing else. I’d really like to complement this analysis with the “how
      many turns does the model need to get to point X” test, but I cannot justify
      spending the money for it.