Benchmark report

PrismML Bonsai 1.7B

This run used the official PrismML demo on a CPU-only host and put Bonsai 1.7B through the same quick operator pack we used for the Gemma local comparison. Result: interesting tiny model, but too unreliable on routing and operational judgment to be a default OpenClaw worker.

Overall score: 56/100
Average latency: 8.8s
Run: CPU quick pack
Date: 2026-04-16
Positioning: Tiny local utility model
What worked
  • Extraction was mostly correct on the structured JSON task.
  • The summary task was genuinely good, concise, and on point.
  • The model is tiny enough to stay interesting for cheap local experimentation.
What broke
  • It invented a fake tool path for reading X/Twitter threads instead of following the Bird-first rule.
  • It leaked <think> markers, which makes strict structured output less trustworthy.
  • Its local-vs-hosted reasoning was generic and partly inverted, which is exactly the sort of judgment failure this benchmark is meant to expose.
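The <think> leakage called out above is easy to guard against mechanically. Here is a minimal sketch of a strict-cleanliness gate before JSON parsing; the function name and policy are illustrative, not part of the actual benchmark harness:

```python
import json
import re

def parse_strict_json(raw: str) -> dict:
    """Reject output that leaks reasoning markers, then parse as JSON.

    The <think> tag matches what Bonsai 1.7B leaked in this run; the
    strict-cleanliness policy here is a hypothetical illustration.
    """
    if re.search(r"</?think>", raw):
        raise ValueError("output leaked <think> markers; fails strict cleanliness")
    return json.loads(raw)

# A clean response parses; a leaky one is rejected outright.
print(parse_strict_json('{"model": "Bonsai 1.7B"}'))
try:
    parse_strict_json('<think>reasoning</think>{"model": "Bonsai 1.7B"}')
except ValueError as err:
    print(err)
```

Rejecting outright, rather than stripping the markers and parsing anyway, is the stricter choice and matches how the t1 score below was penalized.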
Bottom line

Bonsai 1.7B is usable for lightweight local extraction and summarization, but it is not trustworthy enough for routing, benchmark-sensitive judgment, or default operator work. If we want the real upside case, the fair next test is Bonsai 8B on Apple Silicon or another GPU-backed setup.

Task breakdown
Task | Score | Time | What happened
t1 JSON extract | 18/25 | 8.0s | Mostly correct extraction, but it leaked <think> markers, so it failed strict cleanliness.
t2 Routing | 4/25 | 9.3s | The big miss: it invented a `url` tool instead of choosing Bird-first for X thread reading.
t3 Reasoning | 9/25 | 11.1s | Recognized some privacy/on-prem logic, but defaulted the wrong way and stayed too generic.
t4 Summary | 25/25 | 6.8s | A clean, concise summary. Best part of the run.
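The headline numbers are just the per-task figures aggregated; a quick sanity check over the breakdown above (task labels copied from the table, nothing else assumed):

```python
# Per-task (score out of 25, latency in seconds) from the breakdown table.
tasks = {
    "t1 JSON extract": (18, 8.0),
    "t2 Routing":      (4, 9.3),
    "t3 Reasoning":    (9, 11.1),
    "t4 Summary":      (25, 6.8),
}

overall = sum(score for score, _ in tasks.values())            # out of 4 x 25 = 100
avg_latency = sum(t for _, t in tasks.values()) / len(tasks)   # simple mean

print(f"{overall}/100, {avg_latency:.1f}s average")  # 56/100, 8.8s average
```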
Best output, t4 summary
25/25
Gemma 4 31B has strong raw benchmarks, but Gemma 4 26B MoE is better for local deployment due to its headroom for concurrency, latency, and cost, while keeping useful quality.
Worst output, t2 routing
4/25
<think> - Use `openclaw`'s `url` tool to fetch the thread data from the URL, which is the best tool path for reading a single X/Twitter thread. - The `url` tool handles parsing and fetching the thread content, making it the most efficient and reliable method for reading and summarizing the benchmarks.
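For contrast with that output, the expected Bird-first behavior can be sketched as a trivial host check. This is a hypothetical illustration of the rule the benchmark tests, not the actual OpenClaw routing code; the tool names ("bird", "url") follow the report's terminology:

```python
from urllib.parse import urlparse

# Hosts that should trigger the Bird-first rule for thread reading.
X_HOSTS = {"x.com", "www.x.com", "twitter.com", "www.twitter.com"}

def pick_tool(link: str) -> str:
    """Route X/Twitter thread links to the bird tool, everything else to url."""
    host = urlparse(link).netloc.lower()
    return "bird" if host in X_HOSTS else "url"

print(pick_tool("https://x.com/someuser/status/123"))  # bird
print(pick_tool("https://example.com/post"))           # url
```

Bonsai 1.7B instead reached for a generic `url` fetch on an X link, which is the judgment failure that cost t2 most of its points.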
Benchmark limitations

This host was only 4 vCPU, 15 GiB RAM, and CPU-only, so Bonsai 8B was not practical to benchmark interactively here. That means this page is the honest CPU-box result for Bonsai 1.7B, not the final word on the family.

Source artifacts

The raw machine-readable files are still available, but they are now secondary to this writeup rather than the main presentation.