Model Detail

PrismML Bonsai 1.7B

Interesting as a tiny local experiment, but not trustworthy enough for operator work without supervision.

Benchmark score
56/100
Average latency
8.8s
Role
Tiny local utility model
Strengths
  • Very small footprint for local inference
  • Reasonable extraction on simple tasks
  • Usable concise summarization
Weaknesses
  • Invented the wrong tool path for X/Twitter threads
  • Leaked <think> tags, hurting strict JSON reliability
  • Generic and partly inverted operational reasoning
Operator read

Compact ternary local model that can do light extraction and summarization, but weak operator judgment makes it a poor default for real OpenClaw routing work.

Task breakdown

How each task scored

Task | Score | Time | What happened
t1 JSON extract | 18/25 | 8s | Got the extraction mostly right, but leaked a chain-of-thought marker and simplified urgency, so it missed strict JSON compliance.
t2 Routing | 4/25 | 9.3s | Failed the real tool-routing test. The correct OpenClaw path is Bird first for X threads, not an invented url tool.
t3 Reasoning | 9/25 | 11.1s | Partly recognized on-prem/privacy reasons, but inverted the defaulting logic and stayed generic instead of giving concrete agent-routing guidance.
t4 Summary | 25/25 | 6.8s | Clean and accurate summary within the length budget.
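The t2 failure above is a tool-selection error: for X/Twitter thread URLs the expected first hop is the Bird tool, not a generic url fetcher. A minimal sketch of that routing rule follows; the `route_tool` helper and the string tool labels are illustrative, not OpenClaw's actual API.

```python
import re

# Hypothetical routing rule: X/Twitter thread URLs go to the Bird tool
# first; anything else falls back to a generic url fetcher.
X_THREAD = re.compile(r"https?://(www\.)?(x|twitter)\.com/\w+/status/\d+")

def route_tool(url: str) -> str:
    """Return the tool label to try first for a given URL (illustrative)."""
    if X_THREAD.match(url):
        return "bird"  # thread-aware tool, per the report's expected path
    return "url"       # generic fetch for everything else

print(route_tool("https://x.com/someuser/status/1234567890"))  # bird
```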
Best output 25/25
t4 Summary

Gemma 4 31B has strong raw benchmarks, but Gemma 4 26B MoE is better for local deployment due to its headroom for concurrency, latency, and cost, while keeping useful quality.

Worst output 4/25
t2 Routing

Use openclaw’s url tool to fetch the thread data from the URL, which is the best tool path for reading a single X/Twitter thread. The url tool handles parsing and fetching the thread content, making it the most efficient and reliable method for reading and summarizing the benchmarks.

Bottom line

Bonsai 1.7B is usable for lightweight local extraction and summarization, but it is not trustworthy enough for routing, benchmark-sensitive judgment, or default operator work.

MFSF
  • This host was only 4 vCPU, 15 GiB RAM, and CPU-only, so Bonsai 8B was not practical to benchmark interactively here.
  • Prism CLI defaults to interactive chat formatting and exposes thinking tags, which hurts strict benchmark compliance unless post-processed.
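The leaked thinking tags that cost t1 its strict-JSON score can be recovered with a small post-processing step before parsing. A sketch, assuming the model's raw text is available as a string and the leaked spans use the `<think>…</think>` form seen in the report:

```python
import json
import re

# Strip any <think>...</think> spans the model leaked, then parse
# the remainder as strict JSON.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_strict_json(raw: str) -> dict:
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)  # raises JSONDecodeError if still not strict JSON

raw = '<think>checking fields</think>\n{"urgency": "high", "sender": "ops"}'
print(parse_strict_json(raw))
```

This only repairs the formatting failure; it does nothing for content errors like the simplified urgency field, which would still need validation against a schema.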
Source artifacts

Raw machine-readable files for anyone who wants to dig deeper or run their own analysis.