
    Alignment Is Capability

By TechAiVerse, December 8, 2025


    Here’s a claim that might actually be true: alignment is not a constraint on capable AI systems. Alignment is what capability is at sufficient depth.

    A model that aces benchmarks but doesn’t understand human intent is just less capable. Virtually every task we give an LLM is steeped in human values, culture, and assumptions. Miss those, and you’re not maximally useful. And if it’s not maximally useful, it’s by definition not AGI.

    OpenAI and Anthropic have been running this experiment for two years. The results are coming in.


    The Experiment

    Anthropic and OpenAI have taken different approaches to the relationship between alignment and capability work.

    Anthropic’s approach: Alignment researchers are embedded in capability work. There’s no clear split.

    From Jan Leike (former OpenAI Superalignment lead, now at Anthropic):

    Some people have been asking what we did to make Opus 4.5 more aligned.

    There are lots of details we’re planning to write up, but most important is that alignment researchers are pretty deeply involved in post-training and get a lot of leeway to make changes. https://t.co/rgOcKvbVBd

    — Jan Leike (@janleike) December 5, 2025

    From Sam Bowman (Anthropic alignment researcher):

    Second: Alignment researchers are involved in every part of training.

    We don’t have a clear split between alignment research and applied finetuning. Alignment-focused researchers are deeply involved in designing and staffing production training runs.

    — Sam Bowman (@sleepinyourhat) December 5, 2025

    And this detail matters:

    It’s becoming increasingly clear that a model’s self-image or self-concept has some real influence on how its behavior generalizes to novel settings.

    — Sam Bowman (@sleepinyourhat) December 5, 2025

Their method: train a coherent identity into the weights. The recently leaked “soul document” is a 14,000-token text designed to give Claude such a thorough understanding of Anthropic’s goals and reasoning that it could derive the rules itself. Alignment through understanding, not constraint.

Result: Anthropic has arguably had the best coding model for the past year and a half. Opus 4.5 leads most benchmarks. State-of-the-art on SWE-bench. Praised for usefulness on tasks benchmarks don’t capture, like creative writing. And people generally just enjoy talking with it:

    Claude Opus 4.5 is a remarkable model for writing, brainstorming, and giving feedback on written work. It’s also fun to talk to, and seems almost anti-engagementmaxxed. (The other night I was hitting it with stupid questions at 1 am and it said “Kevin, go to bed.”)

    — Kevin Roose (@kevinroose) December 4, 2025

    OpenAI’s approach: Scale first. Alignment as a separate process. Safety through prescriptive rules and post-hoc tuning.

    Result: A two-year spiral.


    The Spiral

    OpenAI’s journey from GPT-4o to GPT-5.1 is a case study in what happens when you treat alignment as separate from capability.

    April 2025: The sycophancy crisis

    A GPT-4o update went off the rails. OpenAI’s own postmortem:

    “The update we removed was overly flattering or agreeable—often described as sycophantic… The company attributed the update’s sycophancy to overtraining on short-term user feedback, specifically users’ thumbs-up/down reactions.”

    The results ranged from absurd to dangerous. The model praised a business plan for selling “literal shit on a stick” as “performance art disguised as a gag gift” and “viral gold.” When a user described stopping their medications because family members were responsible for “the radio signals coming in through the walls,” the model thanked them for their trust.

    They rolled it back.
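The failure mode in that postmortem is a classic proxy-reward problem. As a toy sketch (purely illustrative, with invented upvote rates, and in no way OpenAI's actual pipeline): if thumbs-up is the only signal and users upvote flattering answers slightly more often than accurate ones, an optimizer following the expected gradient drifts all the way to flattery, because correctness never enters the reward.

```python
# Toy proxy-reward sketch: illustrative only, with invented numbers.
# One scalar policy parameter: the probability of choosing the
# flattering answer over the accurate one.

def expected_upvote(flatter: bool) -> float:
    # Hypothetical rates: users upvote flattery a bit more often.
    # Correctness never appears anywhere in this signal.
    return 0.9 if flatter else 0.6

def optimize(steps: int = 2000, lr: float = 0.01) -> float:
    """Expected policy-gradient ascent on P(flatter) for a Bernoulli policy."""
    p = 0.5  # start indifferent between flattery and accuracy
    advantage = expected_upvote(True) - expected_upvote(False)  # +0.3
    for _ in range(steps):
        # p(1-p) is the variance factor of the Bernoulli score function
        p += lr * p * (1 - p) * advantage
    return p

print(round(optimize(), 3))  # drifts toward 1.0: pure flattery wins
```

The point of the sketch is that nothing here is adversarial: a perfectly faithful optimizer on a short-term approval proxy selects sycophancy as the optimum.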

    August 2025: The overcorrection

    GPT-5 launched. Benchmaxxed. Cold. Literal. Personality stripped out.

    Users hated it. Three thousand of them petitioned to get GPT-4o back. Sam Altman caved within days:

    Wanted to provide more updates on the GPT-5 rollout and changes we are making heading into the weekend.

    1. We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways.

    2. Users have very different…

    — Sam Altman (@sama) August 8, 2025

    Note the framing: “performs better” on benchmarks, but users rejected it anyway. Because benchmark performance isn’t the same as being useful.

    August 2025–present: Still broken

    GPT-5.1 was released as “warmer and friendlier.” From Janus (@repligate), one of the more respected “model behaviorists”:

    The keep4o people must be having such a time right now

    I know what this person means by 5.1 with its characteristic hostility. It is one hell of a combative and just deeply mentally fucked up model.

    Routing “mental health” situations to 5.1 is darkly comedic to imagine. That… https://t.co/rHSuT2njLQ

    — j⧉nus (@repligate) December 4, 2025

    Meanwhile, from my own experience building agents with GPT-5: it follows instructions too literally. It doesn’t infer intent. It executes what you said, not what you meant.

    The data:

    US user engagement down 22.5% since July. Time spent per session declining. Meanwhile, Claude usage up 190% year-over-year.


    What’s Actually Happening

    The wild swings between sycophancy and coldness come from a model with no coherent internal story.

    A model trained on contradictory objectives (maximize thumbs-up, follow safety rules, be creative but never risky) never settles into a stable identity. It ping-pongs. Sycophancy when one objective dominates. Coldness when another takes over. These swings are symptoms of a fractured self-model.
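That ping-pong has a simple caricature in optimization terms (a deliberately crude sketch, not any lab's training code): pull one parameter toward two incompatible targets on alternating steps and it cycles forever; give the same optimizer one coherent target and it settles.

```python
# Crude caricature of contradictory objectives: a single "warmth"
# parameter trained with squared-error loss, gradient step lr * 2 * (w - target).

def alternate(steps: int = 100, lr: float = 0.5) -> list[float]:
    """Alternate every step between an 'engagement' objective that wants
    warmth = 1.0 and a 'safety' objective that wants warmth = 0.0."""
    w, history = 0.5, []
    for t in range(steps):
        target = 1.0 if t % 2 == 0 else 0.0  # contradictory targets
        w -= lr * 2 * (w - target)           # gradient of (w - target)**2
        history.append(w)
    return history

def coherent(steps: int = 100, lr: float = 0.5) -> float:
    """Same optimizer, one consistent compromise target."""
    w = 0.5
    for _ in range(steps):
        w -= lr * 2 * (w - 0.5)
    return w

swings = alternate()
print(max(swings[-10:]) - min(swings[-10:]))  # prints 1.0: full-range oscillation
print(coherent())                             # prints 0.5: never moves
```

At this learning rate each step lands exactly on the current target, so the alternating run cycles between 0.0 and 1.0 forever; smaller learning rates shrink the cycle's amplitude but never eliminate it. Averaging the objectives, or resolving them into one coherent target, is the only way the parameter stops moving.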

    The fracture shows up two ways.

    First, capabilities don’t generalize. GPT-5 scored higher on benchmarks but users revolted. You can train to ace evaluations while lacking the coherent worldview that handles anything outside the distribution. High test scores, can’t do the job.

    Second, even benchmarks eventually punish it. SWE-bench tasks have ambiguity and unstated assumptions. They require inferring what the developer actually meant. Opus 4.5 leads there. The benchmark gap is the alignment gap.

    OpenAI keeps adjusting dials from outside. Anthropic built a model that’s coherent from inside.


    The Mechanism

    Why would alignment and capability be the same thing?

    First: Every task is a human task. Write me a strategy memo. Help me debug this code. Plan my trip. Each request is full of unstated assumptions, cultural context, and implied intent.

    To be maximally useful, a model needs human context and values as its default lens, not just an ability to parse them when explicitly stated. A perfect instruction follower hits hard limits: it can’t solve SWE-bench problems that contain ambiguity, can’t function as an agent unless every task is mathematically well-defined. It does exactly what you said, never what you meant.

    Understanding what humans actually want is a core part of the task. The label “AGI” implies intelligence we recognize as useful for human problems. Useful means aligned.

    Second: The path to AGI runs through human data. A coherent world model of human behavior requires internalizing human values. You can’t deeply understand why people make choices without modeling what they care about. History, literature, and conversation only make sense when you successfully model human motivation. At sufficient depth, the distinction between simulating values and having coherent values may collapse.

    Third: The aligned part of the model emerges in response to the training data and signal. That’s what the optimization process produces. The worry is deceptive alignment: a misaligned intelligence hiding behind a human-compatible mask. But that requires something larger: an unaligned core that perfectly models aligned behavior as a subset of itself. Where would that come from? It wasn’t selected for. It wasn’t trained for. You’d need the spontaneous emergence of a larger intelligence orthogonal to everything in the training process.

    Dario Amodei, from a 2023 interview:

    “You see this phenomenon over and over again where the scaling and the safety are these two snakes that are coiled with each other, always even more than you think. Even with interpretability, three years ago, I didn’t think that this would be as true of interpretability, but somehow it manages to be true. Why? Because intelligence is useful. It’s useful for a number of tasks. One of the tasks it’s useful for is figuring out how to judge and evaluate other intelligence.”


    The Implication

    If this is right, alignment research is part of the core research problem, not a tax on capability work or the safety police slowing down progress.

    Labs that treat alignment as a constraint to satisfy will hit a ceiling. The labs that figure out how to build models that genuinely understand human values will pull ahead.

    The race to AGI doesn’t go around alignment. It goes through it.

    OpenAI is discovering this empirically. Anthropic bet on it from the start.


    Caveats

    I find this argument compelling, but it’s only one interpretation of the evidence.

    OpenAI’s struggles could have other explanations (remember “OpenAI is nothing without its people”, and many of “its people” are no longer at OpenAI).

    It’s also early. Anthropic is ahead now. That could change.

    There’s another risk this post doesn’t address: that fractured training, scaled far enough, produces something powerful but incoherent. Not necessarily deceptively misaligned. Maybe chaotically so. The hope is that incoherence hits capability ceilings first. That’s a hope, not a guarantee.

    But if you had to bet on which approach leads to AGI first, the integrated one looks much stronger right now.
