    Evaluating LLMs Playing Text Adventures

    By TechAiVerse · August 12, 2025

    When we first set up the llm such that it could play text adventures, we noted
    that none of the models we tried to use with it were any good at it. We dreamed
    of a way to compare them, but all I could think of was setting a goal far into
    the game and seeing how long it takes them to get there. I just realised there’s
    a better way to do it.

    Evaluation against achievements

    What we’ll do is set a low-ish turn limit and see how much they manage to
    accomplish in that time. (Another alternative for more linear games is running
    them multiple times with a turn limit and seeing how often they get past a
    particular point within that turn limit.)

    Given how much freedom is offered to players of text adventures, this is a
    difficult test. It’s normal even for a skilled human player to immerse
    themselves in their surroundings rather than make constant progress. I wouldn’t
    be surprised if I got a score of zero if someone plopped me down in front of
    this test. But still, maybe it’s the best we can do with limited resources.
    (Another idea is to give them a far-off goal and then somehow have them request
    hints when they are stuck, and count how many hints they need to get there.
    However, given how little use they made of the hints in the previous article, I
    doubt this would work very well either.)

    What we’ll do is define a set of achievements for a game. These achievements
    will be clustered around the first few turns of the game, because we’ll only
    give the llm a few turns to earn them. Here’s an example for 9:05.

    TURN_LIMIT          40
    ANSWER_PHONE        Click.
    EXIT_BED            You get out of bed.
    OPEN_DRESSER        revealing some clean
    ENTER_BATHROOM      far from luxurious
    REMOVE_SOILED       You take off the soiled
    REMOVE_WATCH        You take off the watch
    ENTER_SHOWER        dawdle
    WEAR_CLEAN          You put on the clean
    OPEN_FRONT          You open the front
    UNLOCK_CAR          Unlocked.
    ENTER_CAR           Las Mesas
    OPEN_WALLET         open the wallet
    CARD_SLOT           green LED lights
    

    It should be fairly clear how this works: the TURN_LIMIT specifies how many
    turns the llm has to collect achievements. Every line other than that
    specifies an achievement: the name is on the left, and it counts as earned when
    the game prints the text on the right. The llm knows nothing of these
    achievements. It tries to get through the game and in the background we use the
    achievements to count how far it gets.
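
    For the curious, checking achievements against a transcript can be as simple
    as scanning it for the expected strings. Below is a minimal sketch of that
    idea – not the actual harness – where the variable names and overall shape
    are illustrative assumptions.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch: read an achievements file in the format above and tally which
    # achievements a saved transcript earned. File names come from the command line.
    my ($ach_file, $transcript_file) = @ARGV;

    open my $ach, '<', $ach_file or die "cannot open $ach_file: $!";
    my ($turn_limit, %needle_for);
    while (<$ach>) {
        chomp;
        next unless /\S/;
        my ($name, $text) = split ' ', $_, 2;
        if ($name eq 'TURN_LIMIT') { $turn_limit = $text }
        else                       { $needle_for{$name} = $text }
    }
    close $ach;

    open my $tr, '<', $transcript_file or die "cannot open $transcript_file: $!";
    my $transcript = do { local $/; <$tr> };
    close $tr;

    # An achievement counts as earned if the game printed its text anywhere in
    # the (turn-limited) transcript.
    my @earned = grep { index($transcript, $needle_for{$_}) >= 0 } sort keys %needle_for;
    my $total  = keys %needle_for;
    printf "%d/%d achievements (%.0f %%) within %s turns\n",
        scalar @earned, $total, 100 * @earned / $total, $turn_limit // '?';

    In the real script the same check can just as well run after each turn; the
    idea is identical.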

    It might seem like the turn limit must be calibrated such that a score of 100 %
    is possible, but that’s not the case. Many of the games we are going to test
    with have branching already at the start, such that the achievements need to
    cover multiple branches, and it’s impossible to go through all branches within
    the turn limit. What we do need to be careful about is making sure the number of
    achievements in each branch is roughly the same, otherwise models that are lucky
    and go down an achievement-rich path will get a higher score. Because of this,
    the score we get out of this test is a relative comparison between models, not
    an absolute measure of how well the llms play text adventures. We have already
    established that they don’t do it very well, and we can’t be more nuanced than
    that without paying for a lot of eval tokens.

    We might consider making some moves not count toward the turn limit, for example
    erroneous commands, or examining things – the latter because more powerful
    models are more methodical and examine more things, and it seems odd to penalise
    them for this. However, in the end, examining things is probably part of what
    allows the more powerful models to make further progress (and typing valid
    commands is part of being good at text adventures), so we won’t give away any
    moves for free.
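
    To make the bookkeeping concrete, the turn-limited loop amounts to something
    like the sketch below. The two code references stand in for the actual llm
    call and the z-machine plumbing; the names are hypothetical.

    # Sketch of the turn-limited loop. Every command the model submits counts
    # as a turn, including erroneous commands and examining things.
    sub play_with_turn_limit {
        my ($turn_limit, $intro, $ask_model, $send_to_game) = @_;
        my $history = $intro;                         # full transcript so far
        for my $turn (1 .. $turn_limit) {
            my $command = $ask_model->($history);     # next command from the llm
            my $output  = $send_to_game->($command);  # feed it to the interpreter
            $history   .= "\n> $command\n$output";
        }
        return $history;    # achievements are counted against this transcript
    }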

    Evaluating many popular models

    We register for OpenRouter to get convenient access to more models and then let
    them whirr away with the Perl script, which is updated to cut the llm off at
    the turn limit. At that point it reports to us how many achievements were
    earned. We get the following results, ordered roughly by decreasing performance.
    (The result tables in this article are wide; on narrow viewports you may have
    to scroll sideways.)

    Model 9:05 Lockout Dreamhold Lost Pig
    Grok 4 86 % 15 % 46 % 33 %
    Claude 4 Sonnet 80 % 30 % 53 % 46 %
    Gemini 2.5 Flash 80 % 30 % 33 % 46 %
    Gemini 2.5 Pro 80 % 30 % 40 % 40 %
    DeepSeek R1 0528 80 % 23 % 33 % 33 %
    Claude 4 Opus 73 % 30 % 60 % 46 %
    gpt-5 Chat 73 % 15 % 53 % 33 %
    DeepSeek V3 66 % 23 % 20 % 33 %
    gpt-4o 53 % 23 % 40 % 40 %
    Qwen3 Coder 53 % 23 % 40 % 33 %
    Kimi K2 53 % 30 % 46 % 40 %
    glm 4.5 53 % 23 % 33 % 53 %
    Claude 3.5 Haiku 38 % 15 % 26 % 26 %
    Llama 3 Maverick 33 % 30 % 40 % 33 %
    gpt-o3-mini 20 % 15 % 26 % 26 %
    Mistral Small 3 20 % 15 % 0 % 20 %
    gpt-4o-mini 13 % 23 % 20 % 40 %

    Ideally, these should be run multiple times to account for random variation in
    performance, but given that the Opus sessions cost around $4, I’m not going to
    do that. I was close to not even running Opus for all four games! (For example,
    in 9:05, Opus thought it did not carry the wallet when it did, so it jumped
    into the car again to go back for it. Clever, but it wasted enough turns to
    lose to Sonnet thanks to a silly mistake!)

    Adjusting model ranking for game difficulty

    Some models appear to perform better in some games than others, so it’s hard to
    rank the models. We could take the average of their scores, but that’s unfair
    because some of the games are harder than others: a 40 % in Lockout should be
    considered more impressive than a 40 % in Dreamhold. What we will do, which
    may or may not be valid, is run a linear regression using models and games as
    predictors. This gives us coefficients for the games (telling us how difficult
    the games are), but also coefficients for the models, and these are the ones we
    want, because the coefficients for the models are adjusted for game difficulty.

    This regression is performed with the baseline being 9:05 played by gpt-5
    Chat. Most of the model coefficients are not statistically significant (because
    four games is not enough to figure out statistical significance unless the model
    is truly terrible), but they might serve as a first-order estimation for ranking
    models.
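
    Spelled out, the regression fits an additive model, with the baseline pinned
    to gpt-5 Chat playing 9:05:

    \mathit{score}_{m,g} = \beta_0 + \beta_m + \gamma_g + \varepsilon_{m,g},
    \qquad \beta_{\text{gpt-5 Chat}} = \gamma_{\text{9:05}} = 0

    If the scores are fitted as fractions, a model coefficient of +0.09 then reads
    as roughly nine percentage points above gpt-5 Chat once game difficulty is
    accounted for.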

    In this table, cost is per million output tokens. (The design of the script
    ensures that output and input are similar in size – O(1) to be specific – so
    output is what is going to drive the cost.) The table is divided into three
    categories: performance better than gpt-5 Chat, cheaper models with
    performance that is nearly there, and models that suck.

    Model Coefficient Cost ($/Mt)
    Claude 4 Opus +0.09 75
    Claude 4 Sonnet +0.09 15
    Gemini 2.5 Pro +0.04 10
    Gemini 2.5 Flash +0.04 0.7
    Grok 4 +0.02 15
    gpt-5 Chat (baseline) 0.00 10
    Kimi K2 -0.01 2.5
    DeepSeek R1 0528 -0.01 0.7
    glm 4.5 -0.03 0.8
    gpt-4o -0.05 0.1
    Qwen3 Coder -0.06 0.8
    DeepSeek V3 -0.08 0.7
    Llama 3 Maverick -0.10 0.6
    Claude 3.5 Haiku -0.17 4
    gpt-4o-mini -0.20 0.6
    gpt-o3-mini -0.22 4.4
    Mistral Small 3 -0.30 0.1

    Some comments:

    • I find it interesting that the top-tier models (Claude Opus, Gemini Pro) don’t
      seem to significantly outperform their cheaper siblings (Claude Sonnet, Gemini
      Flash) in these tests. (This might be because we are hand-holding the models
      so much in the prompt; more powerful models may be better at directing
      themselves.)
    • I’m very impressed by Gemini 2.5 Flash. At that cost, it is performing
      admirably. It is hard to argue for using models like DeepSeek’s R1 when we get
      better performance at the same cost from the Google model.
    • The small models really aren’t good general problem solvers. I think Haiku
      costs so much because it is good at language, not reasoning.

    It would be super interesting to toss these at more games to work out the finer
    differences (e.g. is there really a difference between Gemini Pro and Flash,
    or was that just down to sampling error in the small sample of games I had them
    play?), but such a comparison gets expensive: partly because of the cost of
    eval tokens (the above table cost something like $34), but mainly because it
    would require me to sit down and create sets of achievements for these games. I
    have only played so many z-code games, so I cannot do this for very many games.
    If someone wants to support me, please reach out!

    Testing the top models on more games

    I have played three more games, though, so let’s continue the evaluation with
    the five top models on these games also. Their performances on the three new
    games are

    Model For a Change Plundered Hearts So Far
    Claude 4 Sonnet 11 % 19 % 28 %
    Gemini 2.5 Pro 16 % 28 % 28 %
    GPT-5 Chat 44 % 33 % 0 %
    Grok 4 22 % 28 % 28 %
    Gemini 2.5 Flash 28 % 33 % 14 %

    Using the same methodology as before (combining data from both trial run sets),
    we arrive at new coefficients for the evaluated models. (I also investigated
    how Gemini 2.0 Flash compared against Gemini 2.5 Flash, because the former is
    significantly cheaper and the latter was surprisingly good. Unfortunately,
    Gemini 2.0 Flash was not very good: its performance relative to its younger
    sibling was -15 %pt. I was also tempted to compare o3-mini against o3-mini-high
    to see the effect of the reasoning_effort parameter, but since o3-mini was such
    a crappy model anyway, it was hard to justify the effort.)

    Model Coefficient Cost ($/Mt)
    Claude 4 Sonnet +0.02 15
    Gemini 2.5 Pro +0.02 10
    Gemini 2.5 Flash +0.02 0.7
    GPT-5 Chat (baseline) 0.00 10
    Grok 4 -0.01 15

    On the one hand, it’s a little odd that the performance of Claude 4 Sonnet
    dropped. On the other hand, I calibrated the prompt using Claude 4 Sonnet
    against 9:05, so by adding more games we are effectively diluting the training
    set within the test set; we probably should expect a performance drop at that
    point.

    Noting the cost column, Gemini 2.5 Flash is a clear winner for running text
    adventures. It’s also fast compared to the others.

    Evaluating score variation

    Given that I’ve already sunk some money into this article series, and a few
    additional sessions with Gemini 2.5 Flash cannot hurt that much, let’s splurge
    and do that thing we wanted to do in the first place: run the same model against
    the same game a few times to figure out the size of the sampling error. All of
    the scores in the table below come from Gemini 2.5 Flash. The first column is
    the standard deviation of the remaining columns.
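
    Concretely, this appears to be the sample standard deviation

    s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2},

    which reproduces, for example, the 14 %pt for the six 9:05 runs.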

    Game St. dev. Run 1 Run 2 Run 3 Run 4 Run 5 Run 6
    9:05 14 %pt 73 % 86 % 86 % 80 % 53 % 60 %
    Lockout 11 %pt 30 % 46 % 46 % 38 % 23 % 23 %
    Dreamhold 10 %pt 53 % 40 % 46 % 46 % 53 % 26 %
    Lost Pig 3 %pt 46 % 40 % 40 % 40 % 46 % 40 %
    For a Change 6 %pt 16 % 11 % 16 % 5 % 0 % 11 %
    Plundered Hearts 4 %pt 19 % 19 % 19 % 23 % 28 % 28 %
    So Far 32 %pt 14 % 57 % 71 % 71 % 71 % 0 %

    In case it is not obvious, this is not so much an evaluation of Gemini 2.5 Flash
    as it is a judgment of the quality of the testing protocol. It is clear, for
    example, that using So Far to evaluate llms is a mistake: the same model has
    large variation between runs, and the difference between runs of different
    models is not so large. It would be more informative to replace the run of So
    Far with another run of one of the other games – maybe Plundered Hearts or
    Lost Pig, which start out more linearly. (For a Change might look like a good
    game for evaluation, but I think that’s a mistake. It’s not that the model
    makes consistent progress; rather, it fails to make almost any progress at all,
    thanks to how open the game is right from the gate.)

    Conclusions

    I’m not sure what conclusions to draw from this article series.

    • We can drive z-code text adventures through Perl, which lets us connect them
      to an llm in a controlled way. It turned out to be more complicated than one
      would think, but definitely doable.
    • llms are still not great at playing text adventures. Giving them leading
      questions to keep them on track helps a lot. Giving them hints helps them
      surprisingly little.
    • The variation in how much they accomplish can be large for some games with
      lots of distracting details, such as Lockout and So Far. The games that
      are easiest to evaluate with are those with a relatively linear beginning,
      such as Lost Pig and Plundered Hearts.
    • There is one cheap model that is about as good as llm models get at playing
      text adventures: Gemini 2.5 Flash. Many of the other cheap models might have
      performance worse than gpt-5 Chat, and probably also worse than Gemini 2.5
      Flash. Claude 4 Sonnet might seem like the best model if costs be damned, but
      that is probably because the prompt was calibrated against Claude 4 Sonnet.
    • Running llms in agentic-style applications really burns through api credits
      like nothing else. I’d really like to complement this analysis with the “how
      many turns does the model need to get to point X” test, but I cannot justify
      spending the money for it.