    From Rust to Reality: The Hidden Journey of Fetch_max




    How a Job Interview Sent Me Down a Compiler Rabbit Hole

    I occasionally interview candidates for engineering roles. We need people who understand
    concurrent programming. One of our favorite questions involves keeping track of a
    maximum value across multiple producer threads – a classic pattern that appears in many
    real-world systems.

    Candidates can use any language they want.
    In Java (the language I know best), you might write a CAS loop,
    or if you’re feeling functional, use updateAndGet() with a lambda:

    AtomicLong highScore = new AtomicLong(100);
    // [...]
    highScore.updateAndGet(current -> Math.max(current, newScore));

    But that lambda is doing work – it’s still looping under the hood, retrying if
    another thread interferes. You can see the loop right in AtomicLong’s source code.

    Then one candidate chose Rust.

    I was following along as he started typing, expecting to see either an explicit
    CAS loop or some functional wrapper around one. But instead, he just wrote:

    high_score.fetch_max(new_score, Ordering::Relaxed);

    “Rust has fetch_max built in,” he explained casually, moving on to the next
    part of the problem.

    Hold on. This wasn’t a wrapper around a loop pattern – this was a first-class
    atomic operation, sitting right there next to fetch_add and fetch_or. Java
    doesn’t have this. C++ doesn’t have this. How could Rust just… have this?

    After the interview, curiosity got the better of me. Why would Rust provide
    fetch_max as a built-in intrinsic? Intrinsics usually exist to leverage
    specific hardware instructions. But x86-64 doesn’t have an atomic max
    instruction. So there had to be a CAS loop somewhere in the pipeline. Unless…
    maybe some architectures do have this instruction natively? And if so, how
    does the same Rust code work on both?

    I had to find out. Was the loop in Rust’s standard library? Was it in LLVM?
    Was it generated during code generation for x86-64?

    So I started digging. What I found was a fascinating journey through five
    distinct layers of compiler transformations, each one peeling back another level
    of abstraction, until I found exactly where that loop materialized. Let me share
    what I discovered.

    Layer 1: The Rust Code

    Let’s start with what that candidate wrote – a simple high score tracker that can
    be safely updated from multiple threads:

    use std::sync::atomic::{AtomicU64, Ordering};

    fn main() {
        let high_score = AtomicU64::new(100);

        // [...]

        // Another thread reports a new score of 200
        let _old_score = high_score.fetch_max(200, Ordering::Relaxed);

        // [...]
    }

    // Save this snippet as `main.rs`; we are going to use it later.

    This single line does exactly what it promises: atomically fetches the current
    value, compares it with the new one, updates it if the new value is greater, and
    returns the old value. It’s safe, concise, and impossible to mess up. No
    explicit loops, no retry logic visible anywhere. But how does it actually work under
    the hood?
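
    Before we peel back the layers, it helps to see what fetch_max is saving us
    from. Below is a hand-rolled equivalent, a sketch of my own using
    compare_exchange_weak rather than anything from the original snippet. Keep its
    shape in mind; we should expect to find this exact loop hiding somewhere in
    the compilation pipeline.

    use std::sync::atomic::{AtomicU64, Ordering};

    // A hand-written `fetch_max`: load the current value, compute the max, and
    // retry with a CAS until no other thread has changed the value underneath us.
    fn fetch_max_by_hand(slot: &AtomicU64, new_score: u64) -> u64 {
        let mut current = slot.load(Ordering::Relaxed);
        loop {
            if current >= new_score {
                return current; // already at least new_score; nothing to store
            }
            match slot.compare_exchange_weak(
                current,
                new_score,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(previous) => return previous,     // our CAS won the race
                Err(observed) => current = observed, // lost the race; retry with the fresh value
            }
        }
    }

    fn main() {
        let high_score = AtomicU64::new(100);
        assert_eq!(fetch_max_by_hand(&high_score, 200), 100); // returns the old value
        assert_eq!(high_score.load(Ordering::Relaxed), 200);
    }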

    Layer 2: The Macro Expansion

    Before our fetch_max call even reaches anywhere close to machine code generation,
    there’s another layer of abstraction at work. The fetch_max method isn’t hand-written
    for each atomic type – it’s generated by a Rust macro called atomic_int!.

    If we peek into Rust’s standard library source code, we find that AtomicU64
    and all its methods are actually created by
    this macro:

    atomic_int! {
        cfg(target_has_atomic = "64"),
        // ... various configuration attributes ...
        atomic_umin, atomic_umax, // The intrinsics to use
        8,                        // Alignment
        u64 AtomicU64             // The type to generate
    }

    Inside this macro, fetch_max is defined as a
    template
    that works for any integer type:

    pub fn fetch_max(&self, val: $int_type, order: Ordering) -> $int_type {
        // SAFETY: data races are prevented by atomic intrinsics.
        unsafe { $max_fn(self.v.get(), val, order) }
    }

    The $max_fn placeholder gets replaced with atomic_umax for unsigned types
    and atomic_max for signed types. This single macro definition generates
    fetch_max methods for AtomicI8, AtomicU8, AtomicI16, AtomicU16, and so
    on – all the way up to AtomicU128.
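
    To get a feel for how one macro body can serve many types, here is a toy
    sketch of the same pattern. It is not the real atomic_int! macro, and the
    names are made up for illustration; it only shows how a declarative macro
    stamps out an identical wrapper for several atomic integer types.

    use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

    // Toy macro: generate the same `fetch_max` wrapper for different widths.
    macro_rules! max_helper {
        ($fn_name:ident, $atomic:ty, $int:ty) => {
            fn $fn_name(slot: &$atomic, val: $int) -> $int {
                slot.fetch_max(val, Ordering::Relaxed)
            }
        };
    }

    max_helper!(max_u32, AtomicU32, u32);
    max_helper!(max_u64, AtomicU64, u64);

    fn main() {
        let hits = AtomicU64::new(10);
        assert_eq!(max_u64(&hits, 42), 10); // fetch_max returns the previous value
        assert_eq!(hits.load(Ordering::Relaxed), 42);

        let small = AtomicU32::new(7);
        max_u32(&small, 3); // 3 < 7, so the stored value stays 7
        assert_eq!(small.load(Ordering::Relaxed), 7);
    }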

    So our simple fetch_max call is actually invoking generated code. But what
    does the atomic_umax function actually do? To answer that, we need
    to see what the Rust compiler produces next.

    Layer 3: LLVM IR

    Now that we know fetch_max is macro-generated code calling atomic_umax,
    let’s see what happens when the Rust compiler processes it. The compiler
    doesn’t go straight to assembly. First, it translates the code into an
    intermediate representation. Rust uses the LLVM compiler project, so it
    generates LLVM Intermediate Representation (IR).

    If we peek at the LLVM IR for our fetch_max call, we see something like this:

    ; Before the transformation
    bb7:
      %0 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
      ...

    This is LLVM’s language for saying: “I need an atomic read-modify-write
    operation. The modification I want to perform is an unsigned maximum.”

    This is a powerful, high-level instruction within the compiler itself. But it
    poses a critical question: does the CPU actually have a single instruction
    called umax? For most architectures, the answer is no. So how does the
    compiler bridge this gap?

    How to See This Yourself

    My goal is not to merely describe what is happening, but to give you the tools to
    see it for yourself. You can trace this transformation step-by-step on your own
    machine.

    First, tell the Rust compiler to stop after generating the LLVM IR:

    rustc --emit=llvm-ir main.rs

    This creates a main.ll file. This file contains the LLVM IR
    representation of your Rust code, including our atomicrmw umax instruction.
    Keep the file around; we’ll use it in the next steps.
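
    The generated file contains the IR for the whole program, so it can be long.
    Assuming a Unix-like shell, a quick way to jump to the interesting line is:

    grep -n 'atomicrmw umax' main.ll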

    Interlude: Compiler Intrinsics

    We’re missing something important. How does the Rust function atomic_umax
    actually become the LLVM instruction atomicrmw umax? This is where compiler
    intrinsics come into play.

    If you dig into Rust’s source code, you’ll find that atomic_umax is
    defined like this:

    /// Updates `*dst` to the max value of `val` and the old value (unsigned comparison)
    #[inline]
    #[cfg(target_has_atomic)]
    #[cfg_attr(miri, track_caller)] // even without panics, this helps for Miri backtraces
    unsafe fn atomic_umax<T: Copy>(dst: *mut T, val: T, order: Ordering) -> T {
        // SAFETY: the caller must uphold the safety contract for `atomic_umax`
        unsafe {
            match order {
                Relaxed => intrinsics::atomic_umax::<T, { AtomicOrdering::Relaxed }>(dst, val),
                Acquire => intrinsics::atomic_umax::<T, { AtomicOrdering::Acquire }>(dst, val),
                Release => intrinsics::atomic_umax::<T, { AtomicOrdering::Release }>(dst, val),
                AcqRel => intrinsics::atomic_umax::<T, { AtomicOrdering::AcqRel }>(dst, val),
                SeqCst => intrinsics::atomic_umax::<T, { AtomicOrdering::SeqCst }>(dst, val),
            }
        }
    }

    But what is this intrinsics::atomic_umax function? If you look at its
    definition, you find something slightly unusual:

    /// Maximum with the current value using an unsigned comparison.
    /// `T` must be an unsigned integer type.
    ///
    /// The stabilized version of this intrinsic is available on the
    /// [`atomic`] unsigned integer types via the `fetch_max` method. For example, [`AtomicU32::fetch_max`].
    #[rustc_intrinsic]
    #[rustc_nounwind]
    pub unsafe fn atomic_umax<T: Copy, const ORD: AtomicOrdering>(dst: *mut T, src: T) -> T;

    There is no body. This is a declaration, not a definition. The
    #[rustc_intrinsic] attribute tells the Rust compiler that this function
    maps directly to a low-level operation understood by the compiler
    itself. When the Rust compiler sees a call to intrinsics::atomic_umax, it
    knows to replace it with the corresponding LLVM instruction.

    So our journey actually looks like this:

    1. fetch_max method (user-facing API)
    2. Macro expands to call atomic_umax function
    3. atomic_umax is a compiler intrinsic
    4. Rustc replaces the intrinsic with LLVM’s atomicrmw umax ← We are here
    5. LLVM processes this instruction…

    Layer 4: The Transformation

    LLVM runs a series of “passes” that analyze and transform the code. The one we’re interested in is called the
    AtomicExpandPass.

    Its job is to look at high-level atomic operations like atomicrmw umax and ask
    the target architecture, “Can you do this natively?”

    When the x86-64 backend says “No, I can’t,” this pass expands the single
    instruction into a sequence of more fundamental ones that the CPU does
    understand. The result is a
    compare-and-swap (CAS) loop.

    We can see this transformation in action by asking LLVM to emit the
    intermediate representation before and after this pass. To see the IR before
    the AtomicExpandPass, run:

    llc -print-before=atomic-expand main.ll -o /dev/null

    Tip: If you do not have llc installed, you can ask rustc to run the pass for you directly.
    rustc -C llvm-args="-print-before=atomic-expand -print-after=atomic-expand" main.rs

    The code will be printed to your terminal. The function containing our atomic max
    looks like this:

    *** IR Dump Before Expand Atomic instructions (atomic-expand) ***
    ; Function Attrs: inlinehint nonlazybind uwtable
    define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
    start:
      %_0 = alloca [8 x i8], align 8
      %order = alloca [1 x i8], align 1
      store i8 %0, ptr %order, align 1
      %1 = load i8, ptr %order, align 1
      %_7 = zext i8 %1 to i64
      switch i64 %_7, label %bb2 [
        i64 0, label %bb7
        i64 1, label %bb5
        i64 2, label %bb6
        i64 3, label %bb4
        i64 4, label %bb3
      ]

    bb2:                              ; preds = %start
      unreachable

    bb7:                              ; preds = %start
      %2 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
      store i64 %2, ptr %_0, align 8
      br label %bb1

    bb5:                              ; preds = %start
      %3 = atomicrmw umax ptr %self, i64 %val release, align 8
      store i64 %3, ptr %_0, align 8
      br label %bb1

    bb6:                              ; preds = %start
      %4 = atomicrmw umax ptr %self, i64 %val acquire, align 8
      store i64 %4, ptr %_0, align 8
      br label %bb1

    bb4:                              ; preds = %start
      %5 = atomicrmw umax ptr %self, i64 %val acq_rel, align 8
      store i64 %5, ptr %_0, align 8
      br label %bb1

    bb3:                              ; preds = %start
      %6 = atomicrmw umax ptr %self, i64 %val seq_cst, align 8
      store i64 %6, ptr %_0, align 8
      br label %bb1

    bb1:                              ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
      %7 = load i64, ptr %_0, align 8
      ret i64 %7
    }

    You can see the atomicrmw umax instruction in multiple places, depending on
    the memory ordering specified. This is the high-level atomic operation that the
    compiler backend understands, but the CPU does not.

    To see the IR after the pass has run, use:

    llc -print-after=atomic-expand main.ll -o /dev/null

    This is the relevant part of the output:

    *** IR Dump After Expand Atomic instructions (atomic-expand) ***
    ; Function Attrs: inlinehint nonlazybind uwtable
    define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
    start:
      %_0 = alloca [8 x i8], align 8
      %order = alloca [1 x i8], align 1
      store i8 %0, ptr %order, align 1
      %1 = load i8, ptr %order, align 1
      %_7 = zext i8 %1 to i64
      switch i64 %_7, label %bb2 [
        i64 0, label %bb7
        i64 1, label %bb5
        i64 2, label %bb6
        i64 3, label %bb4
        i64 4, label %bb3
      ]

    bb2:                              ; preds = %start
      unreachable

    bb7:                              ; preds = %start
      %2 = load i64, ptr %self, align 8   ; seed expected value
      br label %atomicrmw.start           ; enter CAS loop

    atomicrmw.start:                  ; preds = %atomicrmw.start, %bb7
      %loaded = phi i64 [ %2, %bb7 ], [ %newloaded, %atomicrmw.start ] ; on first iteration: use %2, on retries: use value observed by last cmpxchg
      %3 = icmp ugt i64 %loaded, %val     ; unsigned compare (umax semantics)
      %new = select i1 %3, i64 %loaded, i64 %val ; desired = max(loaded, val)
      %4 = cmpxchg ptr %self, i64 %loaded, i64 %new monotonic monotonic, align 8 ; CAS: if *self==loaded, store new
      %success = extractvalue { i64, i1 } %4, 1   ; boolean: whether the swap happened
      %newloaded = extractvalue { i64, i1 } %4, 0 ; value seen in memory before the CAS
      br i1 %success, label %atomicrmw.end, label %atomicrmw.start ; loop until CAS succeeds

    atomicrmw.end:                    ; preds = %atomicrmw.start
      store i64 %newloaded, ptr %_0, align 8
      br label %bb1

    [... MORE OF THE SAME, JUST FOR DIFFERENT ORDERING..]

    bb1:                              ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
      %7 = load i64, ptr %_0, align 8
      ret i64 %7
    }

    We can see the pass did not change the first part – it still has the code to dispatch based
    on the memory ordering. But in the bb7 block, where we originally had the
    atomicrmw umax LLVM instruction, we now see a full compare-and-swap loop.
    A compiler engineer would say that the atomicrmw umax instruction has been
    “lowered” into a sequence of more primitive operations that are closer to what
    the hardware can actually execute.

    Here’s the simplified logic:

    1. Read (seed): grab the current value (expected).
    2. Compute: desired = umax(expected, val).
    3. Attempt: observed, success = cmpxchg(ptr, expected, desired, [...]).
    4. If success, return observed (the old value). Otherwise set expected = observed and loop.

    This CAS loop is a fundamental pattern in lock-free programming. The compiler
    just built it for us automatically.
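
    If you want to convince yourself that this loop really does hold up under
    contention, here is a small smoke test of my own (not part of the original
    program): several threads race to report scores, and the tracked maximum must
    equal the largest score any of them produced.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;

    fn main() {
        let high_score = AtomicU64::new(0);
        thread::scope(|s| {
            for t in 0..8u64 {
                let high_score = &high_score;
                s.spawn(move || {
                    for i in 0..10_000u64 {
                        // every thread hammers the same atomic with fetch_max
                        high_score.fetch_max(t * 10_000 + i, Ordering::Relaxed);
                    }
                });
            }
        });
        // the largest value reported by any thread is 7 * 10_000 + 9_999
        assert_eq!(high_score.load(Ordering::Relaxed), 79_999);
    }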

    Layer 5: The Final Product (x86-64 Assembly)

    We’re at the final step. To see the final machine code, you can tell rustc to
    emit the assembly directly:

    rustc --emit=asm main.rs

    This will produce a main.s file containing the final assembly code.
    Inside, you’ll find the result of the cmpxchg loop:

    .LBB8_2:
        movq   -32(%rsp), %rax      # rax = &self
        movq   (%rax), %rax         # rax = *self (seed 'expected')
        movq   %rax, -48(%rsp)      # spill expected to stack
    .LBB8_3:                        # loop head
        movq   -48(%rsp), %rax      # rax = expected
        movq   -32(%rsp), %rcx      # rcx = &self
        movq   -40(%rsp), %rdx      # rdx = val
        movq   %rax, %rsi           # rsi = expected (scratch)
        subq   %rdx, %rsi           # set flags for unsigned compare: expected - val
        cmovaq %rax, %rdx           # if (expected > val) rdx = expected; else rdx = val (compute max)
        lock cmpxchgq %rdx, (%rcx)  # CAS: if *rcx==rax then *rcx=rdx; rax <- old *rcx; ZF=success
        sete   %cl                  # cl = success
        movq   %rax, -56(%rsp)      # spill observed to stack
        testb  $1, %cl              # branch on success
        movq   %rax, -48(%rsp)      # expected = observed (for retry)
        jne    .LBB8_4              # success -> exit
        jmp    .LBB8_3              # failure -> retry

    The syntax might look a bit different from what you’re used to; that’s because it’s
    in AT&T syntax, which is the default for rustc. If you prefer Intel syntax, you can
    use rustc --emit=asm main.rs -C "llvm-args=-x86-asm-syntax=intel" instead.

    I’m not an assembly expert, but you can see the key parts of the CAS loop here:

    • Seed read (first iteration): Load *self once to initialize the expected value.
    • Compute umax without branching: The pair sub + cmova implements desired = max_u(expected, val).
    • CAS operation: On x86-64, cmpxchg uses RAX as the expected value and returns the observed value in RAX; ZF
      encodes success.
    • Retry or finish: If ZF is clear, we failed and need to retry. Otherwise, we are done.

    Note we did not ask rustc to optimize the code. If we did, the compiler would
    generate more efficient assembly: No spills to the stack, fewer jumps, no
    dispatch on memory ordering, etc. But I wanted to keep the output as close
    to the original IR as possible to make it easier to follow.
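
    If you are curious how much tighter the optimized version is, rebuild with
    optimizations enabled and compare the two listings:

    rustc -O --emit=asm main.rs

    With optimizations on, fetch_max is inlined at its call site, so the ordering
    dispatch disappears and the loop works on registers rather than stack slots.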

    The Beauty of Abstraction

    And there we have it. Our journey is complete. We started with a safe, clear,
    single line of Rust and ended with a CAS loop written in assembly language.

    Rust fetch_max → Macro-generated atomic_umax → LLVM atomicrmw umax → LLVM cmpxchg loop → Assembly lock cmpxchg loop

    This journey is a perfect example of the power of modern compilers. We get to
    work at a high level of abstraction, focusing on safety and logic, while the
    compiler handles the messy, error-prone, and incredibly complex task of
    generating correct and efficient code for the hardware.

    So, next time you use an atomic, take a moment to appreciate the incredible,
    hidden journey your code is about to take.

    PS: After completing this journey, I learned that C++26 adds fetch_max too!

    PPS: We are hiring!

    Bonus: Apple Silicon (AArch64)

    Out of curiosity, I also checked how this looks on Apple Silicon (AArch64).
    This architecture does have a native atomic max instruction, so the
    AtomicExpandPass does not need to lower it into a CAS loop. The LLVM IR before and after
    the pass is identical, still containing the atomicrmw umax instruction.
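
    If you are not on an Apple machine, you can still reproduce this by
    cross-compiling, assuming you have added the target with rustup
    (rustup target add aarch64-apple-darwin):

    rustc --target aarch64-apple-darwin --emit=asm main.rs

    On generic AArch64 Linux targets the LSE atomics may not be enabled by
    default; passing -C target-feature=+lse asks the compiler to use them.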

    The final assembly contains a variant of the LDUMAX instruction. This is the relevant part of the assembly:

    ldr    x8, [sp, #16]  # x8 = value to compare with
    ldr    x9, [sp, #8]   # x9 = pointer to the atomic variable
    ldumax x8, x8, [x9]   # atomic unsigned max (relaxed), [x9] = max(x8, [x9]), x8 = old value
    str    x8, [sp, #40]  # store old value
    b      LBB8_11

    Note that AArch64 uses Unified Assembler Language; when reading the snippet above,
    it’s important to remember that the destination register comes first.

    And that’s really it. We could keep digging into the microarchitecture: how these
    instructions are executed at the hardware level, what the LOCK prefix actually does,
    the differences between memory orderings, and so on. But we’ll leave that for another day.

    Alice: “Would you tell me, please, which way I ought to go from here?”
    The Cat: “That depends a good deal on where you want to get to.”
    Alice: “I don’t much care where.”
    The Cat: “Then it doesn’t much matter which way you go.”
    Alice: “…So long as I get somewhere.”
    The Cat: “Oh, you’re sure to do that, if only you walk long enough.”

    – Lewis Carroll, Alice’s Adventures in Wonderland
