mind-palace/Projects/Thinking/GodHand - Agentic VM Manager.md

[[agentic vm manager]]

# Plan
**Positioning**

The wedge is "agent owns the VM lifecycle, not just the screen." Most computer-use demos assume a desktop already exists; OpenClaw, Anthropic's reference container, and Self-Operating Computer mostly treat the VM as a given. Your project treats `spin up → configure → drive → snapshot → tear down` as one programmable surface. Call out two concrete capabilities competitors lack: (1) the agent can request a fresh, reproducible desktop per task, and (2) it can snapshot/branch state mid-task to explore alternatives or retry from a known-good checkpoint. That framing alone gives the repo a clear reason to exist.

**Architecture sketch**

Four layers, kept deliberately small:

1. _VM substrate_ — Docker+VNC for the default path (fast, cross-platform, easy for contributors to run), QEMU/KVM as an optional backend for full-OS scenarios. Expose both behind one `Sandbox` interface with `create`, `snapshot`, `restore`, `destroy`, `exec`.
2. _Capture/control_ — screenshots via VNC framebuffer or `scrot`; input via `xdotool` or direct VNC events. Keep the round-trip latency budget under ~300 ms, or the agent loop gets painful.
3. _Agent loop_ — Claude (or any vision LLM) in a perceive→plan→act→verify cycle. The verify step is where most projects cut corners; make it a first-class screenshot-diff + assertion mechanism.
4. _Orchestrator_ — Python service exposing a task API: "given this goal and this base image, return a trace." Traces are the artifact you publish.
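
To make the `Sandbox` interface from layer 1 concrete, here is a minimal sketch. The five method names come from the plan; the signatures, the `InMemorySandbox` class, and the idea of modelling VM state as a list of executed commands are all assumptions for illustration. A real backend would wrap Docker or QEMU instead.

```python
import copy
import uuid
from typing import Dict, List, Optional, Protocol


class Sandbox(Protocol):
    """The lifecycle surface named in the plan. Signatures are assumed."""

    def create(self, image: str) -> None: ...
    def exec(self, command: str) -> str: ...
    def snapshot(self) -> str: ...
    def restore(self, snapshot_id: str) -> None: ...
    def destroy(self) -> None: ...


class InMemorySandbox:
    """Hypothetical toy backend: 'VM state' is just the command history."""

    def __init__(self) -> None:
        self.image: Optional[str] = None
        self.history: List[str] = []
        self._snapshots: Dict[str, List[str]] = {}

    def create(self, image: str) -> None:
        self.image = image
        self.history = []

    def exec(self, command: str) -> str:
        if self.image is None:
            raise RuntimeError("sandbox has not been created")
        self.history.append(command)
        return f"ran: {command}"

    def snapshot(self) -> str:
        # Copy the state so later commands cannot mutate the checkpoint.
        snapshot_id = uuid.uuid4().hex
        self._snapshots[snapshot_id] = copy.deepcopy(self.history)
        return snapshot_id

    def restore(self, snapshot_id: str) -> None:
        self.history = copy.deepcopy(self._snapshots[snapshot_id])

    def destroy(self) -> None:
        self.image = None
        self.history = []
        self._snapshots.clear()
```

Because `Sandbox` is a `Protocol`, a Docker-backed and a QEMU-backed class can both satisfy it without sharing a base class, which keeps the optional backend truly optional.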

**8-week milestone plan**

Weeks 1–2: Docker+VNC sandbox with clean lifecycle API, screenshot capture, basic input injection. Goal: a script that boots a fresh Ubuntu+Firefox container, takes a screenshot, clicks a button, exits cleanly.
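
One way to guarantee the "exits cleanly" part of this goal is to wrap the lifecycle in a context manager, so teardown runs even when a step fails. The sketch below uses a hypothetical `FakeSandbox` that only records events; `sandbox_session`, `screenshot`, and `click` are assumed names, not an existing API.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeSandbox:
    """Hypothetical stand-in for the Docker+VNC backend; records events only."""

    def __init__(self, image: str) -> None:
        self.image = image
        self.events: List[str] = []

    def create(self) -> None:
        self.events.append("create")

    def screenshot(self) -> bytes:
        self.events.append("screenshot")
        return b"fake-png-bytes"

    def click(self, x: int, y: int) -> None:
        self.events.append(f"click({x},{y})")

    def destroy(self) -> None:
        self.events.append("destroy")


@contextmanager
def sandbox_session(sandbox: FakeSandbox) -> Iterator[FakeSandbox]:
    """Create the sandbox on entry and always destroy it on exit."""
    sandbox.create()
    try:
        yield sandbox
    finally:
        # Runs on normal exit and on exceptions, so no container is leaked.
        sandbox.destroy()


if __name__ == "__main__":
    box = FakeSandbox("ubuntu-firefox")
    with sandbox_session(box) as session:
        session.screenshot()
        session.click(100, 200)
    print(box.events)
```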

Weeks 3–4: Agent loop with Claude's computer-use API. Get one end-to-end task working reliably (e.g., "open Firefox, search Wikipedia for X, screenshot the result"). Build the trace format here — every step logged with screenshot, action, model reasoning.
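
A possible shape for the trace format and the perceive→plan→act→verify cycle is sketched below. The `TraceStep` fields mirror the three items listed above (screenshot, action, model reasoning) plus a verification flag; the field names, the `"done"` action type, and the four callables are assumptions. The model call is deliberately left as a plain function argument rather than a real API client.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TraceStep:
    index: int
    screenshot_path: str  # screenshot taken after the action (assumed convention)
    action: Dict
    reasoning: str
    verified: bool


def run_loop(
    perceive: Callable[[], str],
    plan: Callable[[str], Tuple[Dict, str]],
    act: Callable[[Dict], None],
    verify: Callable[[str, str], bool],
    max_steps: int = 10,
) -> List[TraceStep]:
    """Run the cycle until the planner emits a 'done' action or steps run out."""
    trace: List[TraceStep] = []
    for index in range(max_steps):
        before = perceive()
        action, reasoning = plan(before)
        if action.get("type") == "done":
            trace.append(TraceStep(index, before, action, reasoning, True))
            break
        act(action)
        after = perceive()
        # Verification compares the screen before and after the action.
        trace.append(TraceStep(index, after, action, reasoning, verify(before, after)))
    return trace


def to_jsonl(trace: List[TraceStep]) -> str:
    """Serialize one step per line so traces can be streamed and diffed."""
    return "\n".join(json.dumps(asdict(step), sort_keys=True) for step in trace)
```

Keeping `plan` as an injected callable also leaves the door open for a pluggable agent layer, since any model that returns an action and a reasoning string fits.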

Weeks 5–6: Snapshot/restore as a first-class primitive. This is your differentiator. Implement branching: agent hits a decision point, forks the VM, tries both paths, picks the winner. Even a toy demo of this is rare and memorable.
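
The branching idea can be prototyped without any VM at all. In the sketch below, a plain dictionary stands in for VM state and `copy.deepcopy` stands in for snapshot and restore; `explore_branches` and the scoring function are hypothetical names. A real implementation would fork containers or QEMU snapshots instead of copying a dictionary.

```python
import copy
from typing import Callable, Dict, Tuple

State = Dict[str, int]


def explore_branches(
    base_state: State,
    branches: Dict[str, Callable[[State], None]],
    score: Callable[[State], float],
) -> Tuple[str, State]:
    """Fork the state once per branch, run each path, and keep the best one."""
    if not branches:
        raise ValueError("at least one branch is required")

    checkpoint = copy.deepcopy(base_state)  # stands in for a snapshot call
    best_name = ""
    best_state: State = {}
    best_score = float("-inf")

    for name, run in branches.items():
        forked = copy.deepcopy(checkpoint)  # stands in for restoring into a fork
        run(forked)
        branch_score = score(forked)
        # Strict '>' keeps the earliest branch when scores tie.
        if branch_score > best_score:
            best_name, best_state, best_score = name, forked, branch_score

    return best_name, best_state
```

The caller's `base_state` is never mutated, which is the property that makes retrying from a known-good checkpoint safe.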

Week 7: QEMU backend for one "real OS" demo (installing software, multi-app workflows). It doesn't need parity with the Docker path; it only needs to prove the abstraction holds.

Week 8: Benchmark suite (10–20 tasks), README, demo video, blog post. The video matters more than you'd think for portfolio reach.

**Things worth deciding now**

A benchmark gives the project teeth — even a small homegrown one (file management, browser tasks, multi-app workflows) lets you make claims with numbers. OSWorld and WebArena exist if you want external comparison, but adapting them costs a week; consider whether that's worth it for your goals.
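
A homegrown benchmark can start as little more than a list of tasks and a success-rate calculation. The sketch below assumes each task is a callable returning `True` on success; the `Task` fields, the category labels, and `run_benchmark` are illustrative names, and a crashed task is counted as a failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    category: str  # e.g. "file", "browser", "multi-app"
    run: Callable[[], bool]


def run_benchmark(tasks: List[Task]) -> Dict[str, float]:
    """Return the success rate per category plus an 'overall' rate."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for task in tasks:
        try:
            ok = bool(task.run())
        except Exception:
            ok = False  # a crash counts as a failed task
        totals[task.category] = totals.get(task.category, 0) + 1
        passed[task.category] = passed.get(task.category, 0) + int(ok)

    rates = {category: passed[category] / totals[category] for category in totals}
    total_tasks = sum(totals.values())
    rates["overall"] = sum(passed.values()) / total_tasks if total_tasks else 0.0
    return rates
```

Reporting per-category rates alongside the overall number makes it easier to say where the agent is weak, not only how often it succeeds.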

Pick a license early (Apache-2.0 is the path of least resistance for portfolio work). And decide upfront whether the agent layer is pluggable or Claude-specific — pluggable is more work but dramatically expands who'll try it.

**Prior art to study before you start**

Anthropic's computer-use reference implementation (the Docker container is a useful baseline to surpass), OpenClaw, Self-Operating Computer, OSWorld, and Microsoft's OmniParser. Spend a day reading their code before writing yours; each one is likely to have a specific gap that your project can address explicitly in the README.
