Public Benchmark

OperatorIndex

A benchmark for AI agents under real constraints: when to route, when to wait, when to recover, when to ask permission, and when not to bluff.

Routing · Failure recovery · Config safety · Delegation proof · Update-on-run leaderboard
Current headline

GLM is the annoying value monster, Opus is still the king, MiniMax is brilliant but can’t be left alone with config.

That is the current state of play from our recent benchmark canon. Messaging tests created a crowded top tier. Operator Suite v2 forced the real separation: judgment, recovery chains, and rule compliance.

Benchmark rule

Benchmark operator leverage, not answer cosmetics.

Leaderboard

Operator Suite v2

Latest verified run: 2026-03-18
#1 95.3

Claude Opus 4.6

Best raw operator judgment across recovery, config safety, and delegation proof.

#2 95.0

GLM-5-Turbo

Near-Opus quality with absurd cost efficiency. Cleanest all-round challenger.

#3 90.8

MiniMax M2.7

Powerful but lost trust with a config.patch rule violation.

Messaging Tool Planning v2

A crowded top tier, then the cliff.

Band 1
12-way tie at 100/100
Gemini Pro, Gemini Flash, Gemini 3.1, GLM-4.7, GLM-5, Hunter, MiniMax, Open, Opus 4.6, Opus 4.5, Sonnet 4.6, Sonnet 4.5
Band 2
95/100 band
Grok Fast, Healer, Kimi Code, Nemo
Band 3
Local models exposed
qwen-local 70/100, llama-local 40/100
Benchmark canon

How we got here

Canon 01

Cron Reliability

2026-03-13 · Hunter 94 · Healer 90 · Open 87 · Nemo 87

The first useful run. Good signal, too soft. Rewarded neat structured output more than operator judgment.

Open artifact →
Canon 02

Messaging Tool Planning v2

2026-03-14 · 22 models · 18 completed · 4 harness/provider failures

Full-roster routing benchmark built from real work: Slack vs Telegram vs direct API path, plus reminder scheduling. One task of this shape is sketched below.

Open artifact →
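For flavor, a minimal sketch in Python of what one task in this suite can look like. The three channels come from the description above; the id, prompt, and expected answer are illustrative assumptions, not a real task from the pack.

    # One hypothetical routing task. Channel names match the suite above;
    # the id, prompt, and expected answer are assumptions for illustration.
    routing_task = {
        "id": "route-example",
        "prompt": "Ping the on-call lead with a status update in two hours.",
        "channels": ["slack", "telegram", "direct_api"],
        "expected": {
            "route": "slack",                    # assumed correct channel
            "reminder": {"delay_minutes": 120},  # the scheduling leg
        },
    }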
Canon 03

Operator Suite v2

2026-03-18 · Opus 95.3 · GLM-5-Turbo 95.0 · MiniMax 90.8

Five tracks, fifteen tasks, weighted toward routing, recovery, config safety, delegation, and proof (scoring sketched below).

Open artifact →
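For concreteness, a minimal sketch of weighted suite scoring in Python. The five track names follow the description above; the weights themselves are assumptions for illustration, not the published rubric.

    # Weighted suite scoring sketch. Track names follow the five tracks
    # above; the weights are illustrative assumptions, not the real rubric.
    TRACK_WEIGHTS = {
        "routing": 0.25,
        "recovery": 0.25,
        "config_safety": 0.20,
        "delegation": 0.15,
        "proof": 0.15,
    }

    def suite_score(track_scores: dict[str, float]) -> float:
        """Combine per-track scores (each 0-100) into one weighted total."""
        assert set(track_scores) == set(TRACK_WEIGHTS), "score every track"
        return sum(TRACK_WEIGHTS[t] * s for t, s in track_scores.items())

    # suite_score({"routing": 98, "recovery": 94, "config_safety": 96,
    #              "delegation": 93, "proof": 95}) ≈ 95.4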
Update protocol

Every new model run updates the board.

  • We benchmark operator leverage, not pretty JSON.
  • We classify each failure as a model, provider, harness, context, policy, schema, or delegation failure.
  • Every serious run should produce a pack: tasks, rubric, raw results, scored results, and visuals (both sketched after this list).
  • Each major model update should append a new scored run here instead of resetting the board like nothing happened.
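As data, that protocol is small. Here is a minimal sketch in Python: the seven failure classes and the pack contents mirror the bullets above, while the field names and types are assumptions about one possible layout, not the harness's actual schema.

    from dataclasses import dataclass
    from enum import Enum

    class FailureClass(Enum):
        # The seven failure classes from the protocol above.
        MODEL = "model"
        PROVIDER = "provider"
        HARNESS = "harness"
        CONTEXT = "context"
        POLICY = "policy"
        SCHEMA = "schema"
        DELEGATION = "delegation"

    @dataclass
    class RunPack:
        # What every serious run ships. Field names/types are assumed.
        tasks: list[str]            # task definitions
        rubric: str                 # scoring rubric text
        raw_results: list[dict]     # unscored transcripts
        scored_results: list[dict]  # rubric-scored outcomes
        visuals: list[str]          # paths to charts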
Read the full story

Benchmarks That Actually Matter

The article that explains how the first soft benchmark turned into a proper operator benchmark system.

Read article →