There is a large gap between "this model can call tools" and "this model calls tools through a real agent harness on commodity hardware." We wanted to close that gap — or at least measure it honestly.
This post covers outcomes from a structured benchmark run on a BitLaunch VPS (4 vCPU, 8 GB RAM, no GPU): Qwen 2.5 7B via Ollama, driven by the Bumblebee agent harness. The question was simple: what actually works, how fast, and where does it break?
The Setup
One Ubuntu 24.04 box on BitLaunch (nibble-8192 — 4 vCPU, 8 GB, 150 GB SSD, Dallas). Ollama running as a systemd service. Bumblebee cloned from Bumblebee-AGI/bumblebee with uv sync. A custom entity (dev_agent) configured with qwen2.5:7b, tools enabled, autonomy off, 900-second inference timeouts.
An unattended benchmark script ran 10 jobs: four raw Ollama /api/chat calls and six bumblebee ask harness calls, covering one-word compliance, arithmetic, short generation, structured output, and tool use. Everything logged to JSONL with per-job stdout/stderr capture, host snapshots at start and end, and a generated SUMMARY.md.
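The raw half of such a script is small; below is a minimal sketch of one raw `/api/chat` job, assuming Ollama's default local endpoint. The function names and the record shape are illustrative, not the actual benchmark code.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def run_raw_job(model: str, prompt: str, timeout: int = 900) -> dict:
    """Send one non-streaming /api/chat request and return a JSONL-ready record."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    start = time.monotonic()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return make_record(model, prompt, data, time.monotonic() - start)

def make_record(model: str, prompt: str, data: dict, wall_s: float) -> dict:
    """Flatten an Ollama response into one benchmark log record."""
    return {
        "model": model,
        "prompt": prompt,
        "reply": data.get("message", {}).get("content", ""),
        "wall_s": round(wall_s, 1),
        "eval_count": data.get("eval_count"),  # tokens generated, per Ollama
    }

def log_jsonl(path: str, record: dict) -> None:
    """Append one record per line, as the benchmark's JSONL log does."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The harness jobs replace the HTTP call with a `bumblebee ask` subprocess but log through the same record shape, which is what makes the raw-vs-harness comparison a simple diff.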
The Numbers
| Job | Method | Wall Time | Result |
|---|---|---|---|
| One-word reply ("pong") | Raw Ollama | 10.9 s | Correct |
| One-word reply ("pong") | Bumblebee harness | 892 s (~15 min) | Correct |
| 17 × 23 | Raw Ollama | 4.7 s | Correct (391) |
| 17 × 23 | Bumblebee harness | 1200 s | Timeout |
| One-sentence joke | Raw Ollama | 30.3 s | Correct |
| One-sentence joke | Bumblebee harness | 1200 s | Timeout (joke was in partial output) |
| Two-sentence explainer | Raw Ollama | 137.5 s | Correct |
| Two-sentence explainer | Bumblebee harness | 869 s (~14.5 min) | Correct |
| `get_current_time` tool | Bumblebee harness | 1800 s | Timeout (tool did execute) |
| Tool + format (time + "banana") | Bumblebee harness | 1525 s (~25 min) | Correct |
Raw Ollama token generation on this CPU: ~5–10 tok/s (eval phase). Total raw wall times stayed in the seconds-to-low-minutes range because prompts were short and there was no harness overhead.
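Those throughput figures fall out of the response itself: Ollama's non-streaming `/api/chat` replies include `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so tok/s is one division. A sketch with fabricated numbers in the range observed on this box:

```python
def tokens_per_second(resp: dict) -> float:
    """Ollama's final /api/chat message reports eval_count (tokens generated)
    and eval_duration (generation wall time in nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Fabricated response fields, chosen to land in the observed ~5-10 tok/s range:
sample = {"eval_count": 60, "eval_duration": 10_000_000_000}  # 60 tokens in 10 s
print(tokens_per_second(sample))  # 6.0 tok/s
```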
What This Tells Us
1. Raw inference works fine on modest hardware
Qwen 2.5 7B at Q4 quantization runs comfortably in 8 GB RAM with swap headroom. ~5 tok/s is slow compared to GPU inference, but it is functional — you get coherent, correct answers to structured prompts in seconds or low minutes. The model followed format constraints ("exactly one word," "exactly two sentences") and got arithmetic right.
2. The harness multiplier is real — and large
The same one-word prompt that took 11 seconds raw took 892 seconds through bumblebee ask. That is an ~80× wall-time multiplier. This is not a model problem — it is the cost of the full agent stack: memory retrieval, somatic appraisal, knowledge injection, routing decisions, system prompt compilation, and multi-round inference. On a GPU box where raw inference takes milliseconds, the harness overhead is negligible. On CPU, it dominates.
3. Tool calling works through the harness
This was a key question. On the previous 1 GB "smoke" VPS (Gemma 3 270M), tools had to be disabled because the model could not handle them. On the 8 GB box with Qwen 2.5 7B:
- `get_current_time` executed successfully through the full harness (confirmed in structured logs: `tool_exec` with `"ok": true`).
- The time + banana multi-constraint prompt completed — the model called the tool, got the result, and formatted a two-line reply.
- The strict tool-only prompt (job 09) timed out at 30 minutes — the tool ran, but the model entered continuation rounds trying to call `say` (a platform tool unavailable in CLI `ask` context), burning through its budget.
Tool calling is no longer theoretical on this stack. It works. It is slow, and continuation-round behavior needs tuning, but the plumbing is proven.
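For context on that plumbing: tool schemas reach Ollama through the `tools` field of `/api/chat` in the OpenAI function-calling shape, and the model's requests come back in `message.tool_calls`. A hedged sketch follows — the `get_current_time` name is from the benchmark, but the schema body and helper are illustrative, not Bumblebee's actual registration code.

```python
# OpenAI-style function schema in the shape Ollama's /api/chat "tools"
# field accepts. The name matches the benchmark's tool; the description
# and (empty) parameter list are illustrative.
GET_TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current time on the host.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

def extract_tool_calls(message: dict) -> list:
    """Pull requested tool names out of an Ollama chat response message.
    A message with no tool_calls field yields an empty list."""
    return [c["function"]["name"] for c in message.get("tool_calls", [])]
```

A harness loop like Bumblebee's executes each extracted call, appends a tool-result message, and re-invokes the model — which is exactly where the `say` continuation rounds burned their budget.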
4. Timeouts are a tuning problem, not a failure
Three harness jobs hit their caps. In two cases (joke, math), the model produced correct output visible in partial stdout — the harness just kept running post-generation work past the cap. In the tool case, continuation rounds were the culprit. These are configuration and harness-level issues, not model or infrastructure failures. Raising tool_continuation_rounds caps and adjusting post-reply hooks would likely resolve most of them.
5. Host resource usage was well-behaved
At the start of the run: 535 MB used, 5.4 GB free, load 0.01. At the end (after 2+ hours of continuous inference): 5.7 GB used, 303 MB free, load 4.00 — the model runner was saturating all four vCPUs as expected. Swap usage stayed minimal (18 MB). No OOM kills, no crashes. The box handled it.
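Capturing those start/end snapshots needs nothing beyond the standard library. A minimal Linux-only sketch (the field names are illustrative, not the benchmark's exact schema):

```python
import os

def host_snapshot() -> dict:
    """Capture load average and memory figures like those logged at the
    start and end of a run. Linux-only: reads /proc/meminfo (values in kB)."""
    load_1m, _, _ = os.getloadavg()
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])
    return {
        "load_1m": load_1m,
        "mem_total_mb": meminfo["MemTotal"] // 1024,
        "mem_available_mb": meminfo["MemAvailable"] // 1024,
        "swap_free_mb": meminfo["SwapFree"] // 1024,
    }
```

Diffing the two snapshots is how numbers like "5.4 GB free at start, 303 MB free at end, 18 MB swap used" fall out of the log.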
Context: Three Phases
This was the third phase of Bumblebee VPS testing:
- Phase 1 (smoke — 1 GB VPS, Gemma 3 270M): Could the harness run at all on minimal hardware? Yes, but tools had to be disabled, inference was dominated by system prompt weight, and the model could not follow structured instructions reliably.
- Phase 2 (Athena A/B — same smoke host): Isolated model capacity vs harness weight with raw/bare/full system prompt comparisons. Confirmed the 270M model was the bottleneck — failures occurred in raw Ollama, not only through the harness.
- Phase 3 (dev-agent — 8 GB VPS, Qwen 2.5 7B): Moved to a model that can actually handle tool schemas. Proved tool calling works end-to-end. Measured the harness multiplier on CPU. Identified continuation-round tuning as the next lever.
What This Is Good For
An 8 GB CPU box running a 7B model through a full agent harness is not a product demo environment. Fifteen-minute turns are not acceptable for interactive use.
It is good for:
- Harness development — testing memory, tool registration, routing, and prompt compilation without touching production infrastructure.
- Tool plumbing validation — proving that Ollama + Bumblebee + tool schemas work before investing in GPU hardware.
- Regression testing — the benchmark script runs unattended under `nohup` with full JSONL logging; you can run it after any harness change and diff the results.
- Isolation — experiments on this box cannot affect Sanctum production. That risk boundary has value.
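That regression diff is a few lines of Python. A sketch assuming each JSONL record carries `job` and `result` fields — the actual log schema may differ:

```python
import json

def load_results(path: str) -> dict:
    """Index one run's JSONL log by job name (assumes a 'job' field)."""
    out = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            out[rec["job"]] = rec
    return out

def diff_runs(before: dict, after: dict) -> list:
    """Report jobs whose result changed between two runs, in job order."""
    changes = []
    for job in sorted(set(before) | set(after)):
        old = before.get(job, {}).get("result")
        new = after.get(job, {}).get("result")
        if old != new:
            changes.append(f"{job}: {old} -> {new}")
    return changes
```

Run it after a harness change and an empty diff means no regression in job outcomes.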
What Is Next
The continuation-round timeout pattern (model calls unavailable tools in a loop) is the main open issue. Addressing it at the harness level — either by filtering tool availability per platform context, or by capping continuation rounds more aggressively — would likely convert the three timeout jobs to completions.
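The per-context filtering option can be as small as a set difference. A hypothetical sketch — the registry shape and context label are invented here, since Bumblebee's actual tool registry is not shown in this post:

```python
# Hypothetical tool registry. `say` and `get_current_time` are the tools
# named in this post; the grouping into "platform" tools is illustrative.
ALL_TOOLS = {"get_current_time", "say"}
PLATFORM_TOOLS = {"say"}  # tools that require a platform connection

def tools_for_context(context: str) -> set:
    """In CLI `ask` context, withhold tools the model cannot actually use,
    so it never burns continuation rounds calling them."""
    if context == "cli_ask":
        return ALL_TOOLS - PLATFORM_TOOLS
    return ALL_TOOLS
```

Under this scheme, job 09's `say` loop could not occur: the tool would never appear in the schema list the model sees.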
Beyond that, the obvious next step is the same stack on a GPU instance, where the harness multiplier shrinks because raw inference is fast. The benchmark script is already written and runs unattended — pointing it at a different box is trivial.
The benchmark suite, entity config, and setup notes are in the Bumblebee-AGI/bumblebee repo.