Claude Code Daily Benchmarks for Degradation Tracking
Last updated: Jan 29, 2026
The goal of this tracker is to detect statistically significant degradations in the performance of Claude Code with Opus 4.5 on software engineering (SWE) tasks.
- Updated daily: Daily benchmarks on a curated subset of SWE-Bench-Pro
- Detect degradation: Statistical testing flags significant performance drops
- What you see is what you get: We benchmark directly in the Claude Code CLI with the SOTA model (currently Opus 4.5), with no custom harnesses.
Summary
Status
Degradation Status
Shows whether any time period (daily, 7-day, or 30-day) has a statistically significant performance drop (p < 0.05).
Degradation detected over past 30 days
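The page does not say which statistical test produces this status. As a minimal sketch, assuming a one-sided one-sample z-test of the observed pass rate against the baseline, the check could look like the following. The function name and the example counts are hypothetical.

```python
import math

def degradation_p_value(passes: int, total: int, baseline: float) -> float:
    """One-sided p-value for 'true pass rate is below baseline' (normal approximation)."""
    observed = passes / total
    # Standard error of a proportion under the null hypothesis (rate == baseline).
    se = math.sqrt(baseline * (1 - baseline) / total)
    z = (observed - baseline) / se
    # Lower-tail probability of the standard normal distribution at z.
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Hypothetical window: 40 passes out of 100 runs against a 50% baseline.
p_value = degradation_p_value(passes=40, total=100, baseline=0.5)
degraded = p_value < 0.05
```

A drop counts as a degradation only when the p-value falls below 0.05, so the same shortfall in pass rate can be significant over a large window and insignificant over a single day.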
Baseline
Baseline Pass Rate
Historical average pass rate used as reference for detecting performance changes.
58%
reference rate
Daily Pass Rate
Percentage of benchmark tasks passed in the most recent day’s evaluations.
50%
50 evaluations
7-day Pass Rate
Aggregate pass rate over the last 7 days. Provides a more stable measure than daily results.
53%
250 evaluations
30-day Pass Rate
Aggregate pass rate over the last 30 days. Best measure of overall sustained performance.
54%
655 evaluations
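The evaluation counts above are uneven across windows, so how a window is aggregated matters. The sketch below assumes that the aggregate rate pools raw pass counts rather than averaging daily percentages. The page does not confirm this, and the sample data is hypothetical.

```python
from typing import Iterable, Tuple

def pooled_pass_rate(daily_counts: Iterable[Tuple[int, int]]) -> float:
    """Aggregate pass rate from (passed, evaluated) pairs, one pair per day."""
    passed = 0
    evaluated = 0
    for day_passed, day_total in daily_counts:
        passed += day_passed
        evaluated += day_total
    if evaluated == 0:
        raise ValueError("no evaluations in window")
    return passed / evaluated

# Hypothetical three-day window with uneven daily sample sizes.
window = [(30, 50), (10, 25), (27, 50)]
pooled = pooled_pass_rate(window)
# Unweighted mean of the daily rates, shown only for comparison.
mean_of_days = sum(p / n for p, n in window) / len(window)
```

Pooling weights every evaluation equally, so a day with few runs cannot swing the aggregate as much as it would in an unweighted mean of daily rates.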
Daily Trend
Pass rate over time
Daily benchmark pass rates over the past 30 days.
Pass Rate
Daily benchmark pass rate showing the percentage of tasks solved each day.
Baseline
Historical average pass rate (58%) used as reference for detecting performance changes.
Threshold
Shaded region around baseline (±14.0%). Changes within this band are not statistically significant (p ≥ 0.05).
Confidence Interval
95% confidence interval for each data point. Wider intervals indicate more uncertainty (fewer samples).
Dashed line at 58% baseline with ±14.0% significance threshold
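The width of the significance band depends on the baseline and on the number of evaluations per data point. As a rough sketch, assuming a two-sided normal approximation to the binomial (the tracker's actual method is not stated), the half-width can be estimated as follows. The function and constant names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal

def significance_band(baseline: float, n: int, z: float = Z_95) -> float:
    """Approximate half-width of the non-significant band around `baseline`."""
    return z * math.sqrt(baseline * (1 - baseline) / n)

# Baseline and sample sizes taken from the summary figures on this page.
daily_band = significance_band(0.58, 50)    # 50 evaluations per day
weekly_band = significance_band(0.58, 250)  # 250 evaluations per 7 days
```

Under this approximation the band is about ±13.7 percentage points for 50 evaluations and about ±6.1 for 250, close to but not the same as the displayed ±14.0% and ±5.6%, so the tracker presumably uses a different test or a different sample size.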
Weekly Trend
Aggregated 7-day pass rate
Rolling 7-day aggregated pass rates for a smoother trend view with reduced day-to-day noise.
Pass Rate
7-day rolling pass rate aggregating daily results for a smoother trend view.
Baseline
Historical average pass rate (58%) used as reference for detecting performance changes.
Threshold
Shaded region around baseline (±5.6%). Changes within this band are not statistically significant (p ≥ 0.05).
Confidence Interval
95% confidence interval for each data point. Wider intervals indicate more uncertainty (fewer samples).
Dashed line at 58% baseline with ±5.6% significance threshold
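The charts show a 95% confidence interval for each data point, but the page does not say how it is computed. One common choice for pass/fail data is the Wilson score interval, sketched here as an assumption rather than as the tracker's method. The example uses 25 passes out of 50, which matches the daily figures above if the displayed 50% is exact.

```python
import math
from typing import Tuple

def wilson_interval(passes: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

low, high = wilson_interval(25, 50)           # one day of 50 evaluations
small_low, small_high = wilson_interval(5, 10)  # same rate, fewer samples
```

With the same pass rate, the 10-sample interval is wider than the 50-sample one, which is the "fewer samples, more uncertainty" effect described in the legend.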