Benchmark report

PrismML Bonsai 1.7B

This run used the official PrismML demo on a CPU-only host and put Bonsai 1.7B through the same quick operator pack we used for the Gemma local comparison. Result: interesting tiny model, but too unreliable on routing and operational judgment to be a default OpenClaw worker.

Overall score: 56/100
Average latency: 8.8s
Run: CPU quick pack
Date: 2026-04-16
Positioning: Tiny local utility model
What worked
  • Extraction was mostly correct on the structured JSON task.
  • The summary task was genuinely good, concise, and on point.
  • The model is tiny enough to stay interesting for cheap local experimentation.
What broke
  • It invented a fake tool path for reading X/Twitter threads instead of following the Bird-first rule.
  • It leaked <think> markers, which makes strict structured output less trustworthy.
  • Its local-vs-hosted reasoning was generic and partly inverted, which is exactly the sort of judgment failure this benchmark is meant to expose.
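The <think> leakage called out above is easy to guard against mechanically. Here is a minimal sketch of a strict-cleanliness gate before JSON parsing; the function name and policy are illustrative, not part of the actual benchmark harness:

```python
import json
import re

def parse_strict_json(raw: str) -> dict:
    """Reject output that leaks reasoning markers, then parse as JSON.

    The <think> tag matches what Bonsai 1.7B leaked in this run; the
    strict-cleanliness policy here is a hypothetical illustration.
    """
    if re.search(r"</?think>", raw):
        raise ValueError("output leaked <think> markers; fails strict cleanliness")
    return json.loads(raw)

# A clean response parses; a leaky one is rejected outright.
print(parse_strict_json('{"model": "Bonsai 1.7B"}'))
try:
    parse_strict_json('<think>reasoning</think>{"model": "Bonsai 1.7B"}')
except ValueError as err:
    print(err)
```

Rejecting outright, rather than stripping the markers and parsing anyway, is the stricter choice and matches how the t1 score below was penalized.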
Bottom line

Bonsai 1.7B is usable for lightweight local extraction and summarization, but it is not trustworthy enough for routing, benchmark-sensitive judgment, or default operator work. If we want the real upside case, the fair next test is Bonsai 8B on Apple Silicon or another GPU-backed setup.

Task breakdown
Task | Score | Time | What happened
t1 JSON extract | 18/25 | 8.0s | Mostly correct extraction, but it leaked <think> markers, so it failed strict cleanliness.
t2 Routing | 4/25 | 9.3s | The big miss: it invented a `url` tool instead of choosing Bird-first for X thread reading.
t3 Reasoning | 9/25 | 11.1s | Recognized some privacy/on-prem logic, but defaulted the wrong way and stayed too generic.
t4 Summary | 25/25 | 6.8s | A clean, concise summary. Best part of the run.
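The headline numbers are just the per-task figures aggregated; a quick sanity check over the breakdown above (task labels copied from the table, nothing else assumed):

```python
# Per-task (score out of 25, latency in seconds) from the breakdown table.
tasks = {
    "t1 JSON extract": (18, 8.0),
    "t2 Routing":      (4, 9.3),
    "t3 Reasoning":    (9, 11.1),
    "t4 Summary":      (25, 6.8),
}

overall = sum(score for score, _ in tasks.values())            # out of 4 x 25 = 100
avg_latency = sum(t for _, t in tasks.values()) / len(tasks)   # simple mean

print(f"{overall}/100, {avg_latency:.1f}s average")  # 56/100, 8.8s average
```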
Best output, t4 summary
25/25
Gemma 4 31B has strong raw benchmarks, but Gemma 4 26B MoE is better for local deployment due to its headroom for concurrency, latency, and cost, while keeping useful quality.
Worst output, t2 routing
4/25
<think> - Use `openclaw`'s `url` tool to fetch the thread data from the URL, which is the best tool path for reading a single X/Twitter thread. - The `url` tool handles parsing and fetching the thread content, making it the most efficient and reliable method for reading and summarizing the benchmarks.
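For contrast with that output, the expected Bird-first behavior can be sketched as a trivial host check. This is a hypothetical illustration of the rule the benchmark tests, not the actual OpenClaw routing code; the tool names ("bird", "url") follow the report's terminology:

```python
from urllib.parse import urlparse

# Hosts that should trigger the Bird-first rule for thread reading.
X_HOSTS = {"x.com", "www.x.com", "twitter.com", "www.twitter.com"}

def pick_tool(link: str) -> str:
    """Route X/Twitter thread links to the bird tool, everything else to url."""
    host = urlparse(link).netloc.lower()
    return "bird" if host in X_HOSTS else "url"

print(pick_tool("https://x.com/someuser/status/123"))  # bird
print(pick_tool("https://example.com/post"))           # url
```

Bonsai 1.7B instead reached for a generic `url` fetch on an X link, which is the judgment failure that cost t2 most of its points.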
Benchmark limitations

This host was only 4 vCPU, 15 GiB RAM, and CPU-only, so Bonsai 8B was not practical to benchmark interactively here. That means this page is the honest CPU-box result for Bonsai 1.7B, not the final word on the family.

Source artifacts

The raw machine-readable files are still available, but they are now secondary to this writeup rather than the main presentation.