Model Detail

PrismML Bonsai 1.7B

Interesting as a tiny local experiment, but not trustworthy enough for operator work without supervision.

Benchmark score
56/100
Average latency
8.8s
Role
Tiny local utility model
Strengths
  • Very small footprint for local inference
  • Reasonable extraction on simple tasks
  • Usable concise summarization
Weaknesses
  • Invented the wrong tool path for X/Twitter threads
  • Leaked <think> tags, hurting strict JSON reliability
  • Generic and partly inverted operational reasoning
Operator read

Compact ternary local model that can do light extraction and summarization, but weak operator judgment makes it a poor default for real OpenClaw routing work.

Task breakdown

How each task scored

Task | Score | Time | What happened
t1 JSON extract | 18/25 | 8s | Got the extraction mostly right, but leaked a chain-of-thought marker and simplified urgency, so it missed strict JSON compliance.
t2 Routing | 4/25 | 9.3s | Failed the real tool-routing test. The correct OpenClaw path is Bird first for X threads, not an invented url tool.
t3 Reasoning | 9/25 | 11.1s | Partly recognized on-prem/privacy reasons, but inverted the defaulting logic and stayed generic instead of giving concrete agent-routing guidance.
t4 Summary | 25/25 | 6.8s | Clean and accurate summary within the length budget.
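The t2 failure above is a tool-selection error: for X/Twitter thread URLs the expected first hop is the Bird tool, not a generic url fetcher. A minimal sketch of that routing rule follows; the `route_tool` helper and the string tool labels are illustrative, not OpenClaw's actual API.

```python
import re

# Hypothetical routing rule: X/Twitter thread URLs go to the Bird tool
# first; anything else falls back to a generic url fetcher.
X_THREAD = re.compile(r"https?://(www\.)?(x|twitter)\.com/\w+/status/\d+")

def route_tool(url: str) -> str:
    """Return the tool label to try first for a given URL (illustrative)."""
    if X_THREAD.match(url):
        return "bird"  # thread-aware tool, per the report's expected path
    return "url"       # generic fetch for everything else

print(route_tool("https://x.com/someuser/status/1234567890"))  # bird
```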
Best output 25/25
t4 Summary

Gemma 4 31B has strong raw benchmarks, but Gemma 4 26B MoE is better for local deployment due to its headroom for concurrency, latency, and cost, while keeping useful quality.

Worst output 4/25
t2 Routing

Use openclaw’s url tool to fetch the thread data from the URL, which is the best tool path for reading a single X/Twitter thread. The url tool handles parsing and fetching the thread content, making it the most efficient and reliable method for reading and summarizing the benchmarks.

Bottom line

Bonsai 1.7B is usable for lightweight local extraction and summarization, but it is not trustworthy enough for routing, benchmark-sensitive judgment, or default operator work.

MFSF
  • This host was only 4 vCPU, 15 GiB RAM, and CPU-only, so Bonsai 8B was not practical to benchmark interactively here.
  • Prism CLI defaults to interactive chat formatting and exposes thinking tags, which hurts strict benchmark compliance unless post-processed.
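The leaked thinking tags that cost t1 its strict-JSON score can be recovered with a small post-processing step before parsing. A sketch, assuming the model's raw text is available as a string and the leaked spans use the `<think>…</think>` form seen in the report:

```python
import json
import re

# Strip any <think>...</think> spans the model leaked, then parse
# the remainder as strict JSON.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_strict_json(raw: str) -> dict:
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)  # raises JSONDecodeError if still not strict JSON

raw = '<think>checking fields</think>\n{"urgency": "high", "sender": "ops"}'
print(parse_strict_json(raw))
```

This only repairs the formatting failure; it does nothing for content errors like the simplified urgency field, which would still need validation against a schema.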
Source artifacts

Raw machine-readable files for anyone who wants to dig deeper or run their own analysis.