There is a large gap between "this model can call tools" and "this model calls tools through a real agent harness on commodity hardware." We wanted to close that gap — or at least measure it honestly.
This post covers outcomes from a structured benchmark run on a BitLaunch VPS (4 vCPU, 8 GB RAM, no GPU): Qwen 2.5 7B via Ollama, driven by the Bumblebee agent harness. The question was simple: what actually works, how fast, and where does it break?
The Setup
One Ubuntu 24.04 box on BitLaunch (nibble-8192 — 4 vCPU, 8 GB, 150 GB SSD, Dallas). Ollama running as a systemd service. Bumblebee cloned from Bumblebee-AGI/bumblebee with uv sync. A custom entity (dev_agent) configured with qwen2.5:7b, tools enabled, autonomy off, 900-second inference timeouts.
An unattended benchmark script ran 10 jobs: four raw Ollama /api/chat calls and six bumblebee ask harness calls, covering one-word compliance, arithmetic, short generation, structured output, and tool use. Everything logged to JSONL with per-job stdout/stderr capture, host snapshots at start and end, and a generated SUMMARY.md.
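The raw half of such a script is small; below is a minimal sketch of one raw `/api/chat` job, assuming Ollama's default local endpoint. The function names and the record shape are illustrative, not the actual benchmark code.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def run_raw_job(model: str, prompt: str, timeout: int = 900) -> dict:
    """Send one non-streaming /api/chat request and return a JSONL-ready record."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    start = time.monotonic()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return make_record(model, prompt, data, time.monotonic() - start)

def make_record(model: str, prompt: str, data: dict, wall_s: float) -> dict:
    """Flatten an Ollama response into one benchmark log record."""
    return {
        "model": model,
        "prompt": prompt,
        "reply": data.get("message", {}).get("content", ""),
        "wall_s": round(wall_s, 1),
        "eval_count": data.get("eval_count"),  # tokens generated, per Ollama
    }

def log_jsonl(path: str, record: dict) -> None:
    """Append one record per line, as the benchmark's JSONL log does."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The harness jobs replace the HTTP call with a `bumblebee ask` subprocess but log through the same record shape, which is what makes the raw-vs-harness comparison a simple diff.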
The Numbers
| Job | Method | Wall Time | Result |
|---|---|---|---|
| One-word reply ("pong") | Raw Ollama | 10.9 s | Correct |
| One-word reply ("pong") | Bumblebee harness | 892 s (~15 min) | Correct |
| 17 × 23 | Raw Ollama | 4.7 s | Correct (391) |
| 17 × 23 | Bumblebee harness | 1200 s | Timeout |
| One-sentence joke | Raw Ollama | 30.3 s | Correct |
| One-sentence joke | Bumblebee harness | 1200 s | Timeout (joke was in partial output) |
| Two-sentence explainer | Raw Ollama | 137.5 s | Correct |
| Two-sentence explainer | Bumblebee harness | 869 s (~14.5 min) | Correct |
| `get_current_time` tool | Bumblebee harness | 1800 s | Timeout (tool did execute) |
| Tool + format (time + "banana") | Bumblebee harness | 1525 s (~25 min) | Correct |
Raw Ollama token generation on this CPU: ~5–10 tok/s (eval phase). Total raw wall times stayed in the seconds-to-low-minutes range because prompts were short and there was no harness overhead.
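Those throughput figures fall out of the response itself: Ollama's non-streaming `/api/chat` replies include `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so tok/s is one division. A sketch with fabricated numbers in the range observed on this box:

```python
def tokens_per_second(resp: dict) -> float:
    """Ollama's final /api/chat message reports eval_count (tokens generated)
    and eval_duration (generation wall time in nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Fabricated response fields, chosen to land in the observed ~5-10 tok/s range:
sample = {"eval_count": 60, "eval_duration": 10_000_000_000}  # 60 tokens in 10 s
print(tokens_per_second(sample))  # 6.0 tok/s
```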
What This Tells Us
1. Raw inference works fine on modest hardware
Qwen 2.5 7B at Q4 quantization runs comfortably in 8 GB RAM with swap headroom. ~5 tok/s is slow compared to GPU inference, but it is functional — you get coherent, correct answers to structured prompts in seconds or low minutes. The model followed format constraints ("exactly one word," "exactly two sentences") and got arithmetic right.
2. The harness multiplier is real — and large
The same one-word prompt that took 11 seconds raw took 892 seconds through bumblebee ask. That is an ~80× wall-time multiplier. This is not a model problem — it is the cost of the full agent stack: memory retrieval, somatic appraisal, knowledge injection, routing decisions, system prompt compilation, and multi-round inference. On a GPU box where raw inference takes milliseconds, the harness overhead is negligible. On CPU, it dominates.
3. Tool calling works through the harness
This was a key question. On the previous 1 GB "smoke" VPS (Gemma 3 270M), tools had to be disabled because the model could not handle them. On the 8 GB box with Qwen 2.5 7B:
- `get_current_time` executed successfully through the full harness (confirmed in structured logs: `tool_exec` with `"ok": true`).
- The time + banana multi-constraint prompt completed — the model called the tool, got the result, and formatted a two-line reply.
- The strict tool-only prompt (job 09) timed out at 30 minutes — the tool ran, but the model entered continuation rounds trying to call `say` (a platform tool unavailable in CLI `ask` context), burning through its budget.
Tool calling is no longer theoretical on this stack. It works. It is slow, and continuation-round behavior needs tuning, but the plumbing is proven.
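For context on that plumbing: tool schemas reach Ollama through the `tools` field of `/api/chat` in the OpenAI function-calling shape, and the model's requests come back in `message.tool_calls`. A hedged sketch follows — the `get_current_time` name is from the benchmark, but the schema body and helper are illustrative, not Bumblebee's actual registration code.

```python
# OpenAI-style function schema in the shape Ollama's /api/chat "tools"
# field accepts. The name matches the benchmark's tool; the description
# and (empty) parameter list are illustrative.
GET_TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current time on the host.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

def extract_tool_calls(message: dict) -> list:
    """Pull requested tool names out of an Ollama chat response message.
    A message with no tool_calls field yields an empty list."""
    return [c["function"]["name"] for c in message.get("tool_calls", [])]
```

A harness loop like Bumblebee's executes each extracted call, appends a tool-result message, and re-invokes the model — which is exactly where the `say` continuation rounds burned their budget.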
4. Timeouts are a tuning problem, not a failure
Three harness jobs hit their caps. In two cases (joke, math), the model produced correct output visible in partial stdout — the harness just kept running post-generation work past the cap. In the tool case, continuation rounds were the culprit. These are configuration and harness-level issues, not model or infrastructure failures. Raising tool_continuation_rounds caps and adjusting post-reply hooks would likely resolve most of them.
5. Host resource usage was well-behaved
At the start of the run: 535 MB used, 5.4 GB free, load 0.01. At the end (after 2+ hours of continuous inference): 5.7 GB used, 303 MB free, load 4.00 — the model runner was saturating all four vCPUs as expected. Swap usage stayed minimal (18 MB). No OOM kills, no crashes. The box handled it.
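Capturing those start/end snapshots needs nothing beyond the standard library. A minimal Linux-only sketch (the field names are illustrative, not the benchmark's exact schema):

```python
import os

def host_snapshot() -> dict:
    """Capture load average and memory figures like those logged at the
    start and end of a run. Linux-only: reads /proc/meminfo (values in kB)."""
    load_1m, _, _ = os.getloadavg()
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])
    return {
        "load_1m": load_1m,
        "mem_total_mb": meminfo["MemTotal"] // 1024,
        "mem_available_mb": meminfo["MemAvailable"] // 1024,
        "swap_free_mb": meminfo["SwapFree"] // 1024,
    }
```

Diffing the two snapshots is how numbers like "5.4 GB free at start, 303 MB free at end, 18 MB swap used" fall out of the log.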
Context: Three Phases
This was the third phase of Bumblebee VPS testing:
- Phase 1 (smoke — 1 GB VPS, Gemma 3 270M): Could the harness run at all on minimal hardware? Yes, but tools had to be disabled, inference was dominated by system prompt weight, and the model could not follow structured instructions reliably.
- Phase 2 (Athena A/B — same smoke host): Isolated model capacity vs harness weight with raw/bare/full system prompt comparisons. Confirmed the 270M model was the bottleneck — failures occurred in raw Ollama, not only through the harness.
- Phase 3 (dev-agent — 8 GB VPS, Qwen 2.5 7B): Moved to a model that can actually handle tool schemas. Proved tool calling works end-to-end. Measured the harness multiplier on CPU. Identified continuation-round tuning as the next lever.
What This Is Good For
An 8 GB CPU box running a 7B model through a full agent harness is not a product demo environment. Fifteen-minute turns are not acceptable for interactive use.
It is good for:
- Harness development — testing memory, tool registration, routing, and prompt compilation without touching production infrastructure.
- Tool plumbing validation — proving that Ollama + Bumblebee + tool schemas work before investing in GPU hardware.
- Regression testing — the benchmark script runs unattended under `nohup` with full JSONL logging; you can run it after any harness change and diff the results.
- Isolation — experiments on this box cannot affect Sanctum production. That risk boundary has value.
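That regression diff is a few lines of Python. A sketch assuming each JSONL record carries `job` and `result` fields — the actual log schema may differ:

```python
import json

def load_results(path: str) -> dict:
    """Index one run's JSONL log by job name (assumes a 'job' field)."""
    out = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            out[rec["job"]] = rec
    return out

def diff_runs(before: dict, after: dict) -> list:
    """Report jobs whose result changed between two runs, in job order."""
    changes = []
    for job in sorted(set(before) | set(after)):
        old = before.get(job, {}).get("result")
        new = after.get(job, {}).get("result")
        if old != new:
            changes.append(f"{job}: {old} -> {new}")
    return changes
```

Run it after a harness change and an empty diff means no regression in job outcomes.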
What Is Next
The continuation-round timeout pattern (model calls unavailable tools in a loop) is the main open issue. Addressing it at the harness level — either by filtering tool availability per platform context, or by capping continuation rounds more aggressively — would likely convert the three timeout jobs to completions.
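The per-context filtering option can be as small as a set difference. A hypothetical sketch — the registry shape and context label are invented here, since Bumblebee's actual tool registry is not shown in this post:

```python
# Hypothetical tool registry. `say` and `get_current_time` are the tools
# named in this post; the grouping into "platform" tools is illustrative.
ALL_TOOLS = {"get_current_time", "say"}
PLATFORM_TOOLS = {"say"}  # tools that require a platform connection

def tools_for_context(context: str) -> set:
    """In CLI `ask` context, withhold tools the model cannot actually use,
    so it never burns continuation rounds calling them."""
    if context == "cli_ask":
        return ALL_TOOLS - PLATFORM_TOOLS
    return ALL_TOOLS
```

Under this scheme, job 09's `say` loop could not occur: the tool would never appear in the schema list the model sees.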
Beyond that, the obvious next step is the same stack on a GPU instance, where the harness multiplier shrinks because raw inference is fast. The benchmark script is already written and runs unattended — pointing it at a different box is trivial.
The benchmark suite, entity config, and setup notes are in the Bumblebee-AGI/bumblebee repo.