
Benchmarks

Model benchmarks that actually matter

Not synthetic beauty-pageant nonsense. These are the runs, scoreboards, and drill-downs we use to compare models on real operator work.

This is the benchmark explainer page, with the current leaderboard, benchmark canon, and graphics.

Leaderboard

Where the models stack up

Updated 2026-04-25

Rank · Model · Operator · Messaging · External · Cost · Verdict

Claude Sonnet 4.6

Messaging benchmark + external canon · Open detail →

$9.00/M blended

Strong all-rounder, needs full internal operator-suite run.

Messaging benchmark + external canon · Open detail →

80.6 SWE-bench · 1492 Arena · 91.9 GPQA

$7.00/M blended

Looks elite, still needs full operator-suite validation.

Messaging benchmark canon · Open detail →

Messaging benchmark only

$1.75/M blended

Useful cheap helper, not yet proven on the hard pack.

Claude Opus 4.6

Operator Suite v2 · View canon →

#1 SWE-bench · #1 Arena · #1 HLE

Best raw benchmark performer overall.

Operator Suite v2 · View canon →

77.8 SWE-bench · 1454 Arena

$2.60/M blended

Almost-Opus quality without the wallet mugging.

Enterprise Ollama · Open detail →

Quick execution pack

Fast local quality leader in current Gemma run

Enterprise local

Best local quality of the Gemma pair, but significantly slower.

Operator Suite v2 · Open detail →

80.2 SWE-bench Verified

$0.75/M blended

Great value, unsafe near guardrails.

MiniMax M2.7 / M2.5

Operator Suite v2 · View canon →

80.2 SWE-bench Verified

$0.75/M blended

Wildly cheap, but unsafe around guardrails.

External benchmark canon · Open detail →

Provider path unsupported

75.1 Terminal-Bench · 57.7 SWE-bench Pro · 1463 Arena

$8.75/M blended

Looks strongest for coding/execution, still under-benchmarked internally.

Qwen 3.5 Opus Distill

Enterprise Local · Open report →

Single-model detailed run

Needs more canon-side comparison runs

Interesting enough to earn its own drill-down page already.

Enterprise Ollama · Open detail →

Quick execution pack

Best speed-quality tradeoff in current Gemma run

Enterprise local

Best operational default for local routing because it is much faster while still competent.

PrismML Bonsai 1.7B

PrismML local benchmark · Open detail →

Quick execution pack

Prism ternary local model · 1.7B GGUF CPU run

Local / experimental

Tiny local model, useful for light work, not a serious operator default.

Qwen 3.6 27B NVFP4

Enterprise oMLX · Open detail →

Quick execution pack

First internal 27B oMLX pass

Enterprise local

Best current 27B result of the three, but still too soft to trust as a default local operator model.

Qwen 3.6 27B MXFP4

Enterprise oMLX · Open detail →

Quick execution pack

First internal 27B oMLX pass

Enterprise local

Matched NVFP4 on score, basically same story: live and usable for testing, not proven for default routing.

Qwen 3.6 27B 4bit

Enterprise oMLX · Open detail →

Quick execution pack

First internal 27B oMLX pass

Enterprise local

The baseline quant is live but underperformed badly in this first pass, so it is the clearest “not ready” of the trio.

Runs and artifacts

Open a benchmark, then drill into the model


Enterprise oMLX

2026-04-25

Qwen 3.6 27B local quant trio

NVFP4 40/100 · MXFP4 40/100 · 4bit 20/100

First live oMLX benchmark pass for the three Qwen3.6-27B Apple Silicon variants. They are now served in OpenClaw and on Benchboard, but this run is weak enough that routing should not trust them yet without a rerun/tuning pass.

Generated benchmark pack · Open Qwen 27B detail →
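The "routing should not trust them yet" call above can be read as a simple score gate. The following is a minimal sketch of that idea, not the actual OpenClaw routing logic: the `QuickPackResult` type, the `routable` helper, and the cutoff of 80 are all hypothetical, since the page does not state a numeric threshold. Only the three scores come from the run above.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the page gives no numeric threshold for default routing.
MIN_ROUTABLE_SCORE = 80


@dataclass(frozen=True)
class QuickPackResult:
    model: str
    score: int  # out of 100, as reported for the quick execution pack


def routable(results, min_score=MIN_ROUTABLE_SCORE):
    """Return the names of models that meet the cutoff, highest score first."""
    passing = [r for r in results if r.score >= min_score]
    ranked = sorted(passing, key=lambda r: r.score, reverse=True)
    return [r.model for r in ranked]


# Scores taken from the 2026-04-25 oMLX pass listed above.
QWEN_TRIO = [
    QuickPackResult("Qwen 3.6 27B NVFP4", 40),
    QuickPackResult("Qwen 3.6 27B MXFP4", 40),
    QuickPackResult("Qwen 3.6 27B 4bit", 20),
]

if __name__ == "__main__":
    # With the assumed cutoff, none of the three variants qualifies.
    print(routable(QWEN_TRIO))
```

Under this assumed cutoff the list comes back empty, which matches the verdict that none of the three quants is ready for default routing. A real gate would presumably also weigh latency and per-task results rather than a single aggregate score.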


PrismML local benchmark

2026-04-16

PrismML Bonsai 1.7B

56/100 · avg 8.8s · CPU quick pack

Official PrismML Bonsai demo run through the same quick operator pack. Fine for lightweight local use, but it missed the routing task badly and is not an OpenClaw default.

Generated benchmark pack · Open Bonsai detail →


Full benchmark canon

2026-03-18

Operator Suite v2

Opus 95.3 · GLM-5-Turbo 95.0 · MiniMax 90.8

The serious one. Routing, recovery, config safety, delegation, and proof under pressure.

Leaderboard graphic · Open benchmark context →


Cross-model routing benchmark

2026-03-14

Messaging Tool Planning v2

12-way tie at 100/100

Shows the crowded top tier on lighter messaging and routing tasks before the harder operator tests separated them.

Infographic · Open benchmark context →


First benchmark cut

2026-03-13

Cron Reliability v1

Hunter 94 · Healer 90 · Open 87

The first useful benchmark image. Good signal, but still too soft compared with the later operator suite.

Infographic · Read the story →

Enterprise Local

2026-04-02

Qwen 3.5 Opus Distill

81.7 overall · detailed track report

Fresh detailed report page for the Qwen 3.5 Opus-distilled run. Good enough to publish as the first clickable model drill-down.

Detailed report · Open detail page →


Enterprise Ollama

2026-04-12

Gemma 4 26B vs 31B

31B 92/100 · 26B 80/100

Quick operator-oriented local benchmark comparing Gemma 4 26B and 31B on Enterprise. 31B wins on quality, 26B wins hard on speed.

Generated benchmark pack · Open Gemma detail →


Backfilled benchmark registry

2026-03-18

Recovered benchmark canon models

Opus 95.3 · GLM 95.0 · MiniMax 90.8 · Gemini/GPT-5.4/Sonnet backfilled from reference notes

Recovered benchmark model entries from existing reference docs, the published benchmark article, and archived report artifacts, so the leaderboard reflects the broader canon instead of only the most recent Gemma run.

Registry backfill · Open benchmark context →