Plan
Positioning
The wedge is "agent owns the VM lifecycle, not just the screen." Most computer-use demos assume a desktop already exists; OpenClaw, Anthropic's reference container, and Self-Operating Computer mostly treat the VM as a given. Your project treats spin up → configure → drive → snapshot → tear down as one programmable surface. Call out two concrete capabilities competitors lack: (1) the agent can request a fresh, reproducible desktop per task, and (2) it can snapshot/branch state mid-task to explore alternatives or retry from a known-good checkpoint. That framing alone gives the repo a clear reason to exist.
Architecture sketch
Four layers, kept deliberately small:
- VM substrate: Docker+VNC for the default path (fast, cross-platform, easy for contributors to run), QEMU/KVM as an optional backend for full-OS scenarios. Expose both behind one Sandbox interface with create, snapshot, restore, destroy, exec.
- Capture/control: screenshots via VNC framebuffer or scrot; input via xdotool or direct VNC events. Keep the latency budget under ~300 ms round-trip or the agent loop gets painful.
- Agent loop: Claude (or any vision LLM) in a perceive→plan→act→verify cycle. The verify step is where most projects cut corners; make it a first-class screenshot-diff + assertion mechanism.
- Orchestrator: a Python service exposing a task API ("given this goal and this base image, return a trace"). Traces are the artifact you publish.
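The single interface over both backends could be sketched roughly as follows. The five method names come from the list above; the signatures, the ExecResult type, and the InMemorySandbox stand-in are assumptions made for illustration, not the API of any existing tool.

```python
from __future__ import annotations

import copy
import itertools
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ExecResult:
    """Outcome of a command run inside the sandbox (hypothetical shape)."""
    exit_code: int
    stdout: str


class Sandbox(Protocol):
    """Lifecycle surface every backend (Docker+VNC, QEMU/KVM) would implement."""

    def create(self, base_image: str) -> None: ...
    def snapshot(self) -> str: ...
    def restore(self, snapshot_id: str) -> None: ...
    def destroy(self) -> None: ...
    def exec(self, command: str) -> ExecResult: ...


@dataclass
class InMemorySandbox:
    """Toy backend that records commands instead of running them.

    Handy for testing orchestrator logic without Docker or QEMU installed.
    """
    state: list[str] | None = None
    _snapshots: dict[str, list[str]] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def create(self, base_image: str) -> None:
        self.state = [f"image:{base_image}"]

    def snapshot(self) -> str:
        snapshot_id = f"snap-{next(self._ids)}"
        self._snapshots[snapshot_id] = copy.deepcopy(self._require_state())
        return snapshot_id

    def restore(self, snapshot_id: str) -> None:
        self.state = copy.deepcopy(self._snapshots[snapshot_id])

    def destroy(self) -> None:
        self.state = None
        self._snapshots.clear()

    def exec(self, command: str) -> ExecResult:
        self._require_state().append(command)
        return ExecResult(exit_code=0, stdout="")

    def _require_state(self) -> list[str]:
        if self.state is None:
            raise RuntimeError("sandbox not created")
        return self.state
```

Writing the agent loop and orchestrator against the Sandbox protocol rather than a concrete backend is what would let the QEMU path slot in later without touching them.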
8-week milestone plan
Weeks 1–2: Docker+VNC sandbox with clean lifecycle API, screenshot capture, basic input injection. Goal: a script that boots a fresh Ubuntu+Firefox container, takes a screenshot, clicks a button, and exits cleanly.
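A first version of that script might be little more than four Docker invocations. The sketch below only builds and runs the commands; the image name, the container name, the X display number, and the click coordinates are all placeholders, and it assumes the image ships an X server plus scrot and xdotool.

```python
import subprocess

IMAGE = "example/ubuntu-firefox-vnc"  # hypothetical image name
DISPLAY = ":1"                        # assumed X display inside the container


def boot_cmd(name: str, image: str = IMAGE) -> list:
    # Detached, auto-removed container with the VNC port published.
    return ["docker", "run", "--detach", "--rm", "--name", name,
            "--publish", "5900:5900", image]


def screenshot_cmd(name: str, path: str = "/tmp/shot.png") -> list:
    # scrot writes the capture to a path inside the container.
    return ["docker", "exec", "--env", f"DISPLAY={DISPLAY}", name, "scrot", path]


def click_cmd(name: str, x: int, y: int) -> list:
    # Move the pointer, then press mouse button 1 (left click).
    return ["docker", "exec", "--env", f"DISPLAY={DISPLAY}", name,
            "xdotool", "mousemove", str(x), str(y), "click", "1"]


def teardown_cmd(name: str) -> list:
    return ["docker", "rm", "--force", name]


def run_smoke_test(name: str = "sandbox-smoke", runner=subprocess.run) -> None:
    """Boot, capture, click, and always tear down, even if a step fails."""
    runner(boot_cmd(name), check=True)
    try:
        runner(screenshot_cmd(name), check=True)
        runner(click_cmd(name, 400, 300), check=True)
    finally:
        runner(teardown_cmd(name), check=True)
```

A real script would also need to wait until the X server inside the container is ready before taking the first screenshot. Passing a fake runner makes the sequence checkable without Docker installed.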
Weeks 3–4: Agent loop with Claude's computer-use API. Get one end-to-end task working reliably (e.g., "open Firefox, search Wikipedia for X, screenshot the result"). Build the trace format here — every step logged with screenshot, action, model reasoning.
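One possible shape for the loop and its trace is sketched below. The TraceStep fields mirror the three items named above (screenshot, action, model reasoning) plus a verification flag; the field names, the "done" action convention, and the four injected callables are assumptions, and the model call is deliberately left as a stub rather than a real API client.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List, Tuple


@dataclass
class TraceStep:
    index: int
    screenshot: str   # path or content hash of the frame the model saw
    reasoning: str    # the model's stated rationale for the action
    action: dict      # e.g. {"type": "click", "x": 10, "y": 20}
    verified: bool    # whether the post-action check passed


# Hypothetical callables supplied by the caller:
Perceive = Callable[[], str]                        # returns a screenshot reference
Plan = Callable[[str, str], Tuple[str, dict]]       # (goal, screenshot) -> (reasoning, action)
Act = Callable[[dict], None]                        # injects the action into the desktop
Verify = Callable[[str, str, dict], bool]           # (before, after, action) -> ok


def run_task(goal: str, perceive: Perceive, plan: Plan, act: Act,
             verify: Verify, max_steps: int = 10) -> List[TraceStep]:
    """Run perceive -> plan -> act -> verify until the model says it is done."""
    trace: List[TraceStep] = []
    for i in range(max_steps):
        before = perceive()
        reasoning, action = plan(goal, before)
        if action.get("type") == "done":
            trace.append(TraceStep(i, before, reasoning, action, True))
            break
        act(action)
        after = perceive()
        ok = verify(before, after, action)
        trace.append(TraceStep(i, before, reasoning, action, ok))
    return trace


def trace_to_jsonl(trace: List[TraceStep]) -> str:
    """Serialize one JSON object per step, one step per line."""
    return "\n".join(json.dumps(asdict(step)) for step in trace)
```

Keeping the model behind a plain callable is also one way to leave the agent layer pluggable: swapping models means swapping the plan function, not the loop.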
Weeks 5–6: Snapshot/restore as a first-class primitive. This is your differentiator. Implement branching: agent hits a decision point, forks the VM, tries both paths, picks the winner. Even a toy demo of this is rare and memorable.
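The branching idea can be prototyped before any real VM snapshotting exists. The sketch below uses a toy in-memory sandbox whose "state" is just the list of actions applied so far, and it explores branches one after another by restoring a shared checkpoint rather than forking VMs in parallel. ToySandbox, explore_branches, and the scoring function are hypothetical names chosen for this example.

```python
import copy
from typing import Callable, Dict, List, Optional


class ToySandbox:
    """In-memory stand-in for a VM: state is the list of applied actions."""

    def __init__(self) -> None:
        self.state: List[str] = []
        self._snaps: Dict[str, List[str]] = {}

    def snapshot(self) -> str:
        snap_id = f"snap-{len(self._snaps)}"
        self._snaps[snap_id] = copy.deepcopy(self.state)
        return snap_id

    def restore(self, snap_id: str) -> None:
        self.state = copy.deepcopy(self._snaps[snap_id])

    def exec(self, action: str) -> None:
        self.state.append(action)


def explore_branches(sandbox: ToySandbox,
                     branches: Dict[str, List[str]],
                     score: Callable[[ToySandbox], float]) -> Optional[str]:
    """Try every branch from the same checkpoint and keep the best one.

    Leaves the sandbox in the winning branch's end state and returns its name
    (None, with the sandbox back at the checkpoint, if there are no branches).
    """
    checkpoint = sandbox.snapshot()
    best_name, best_score, best_snap = None, float("-inf"), checkpoint
    for name, actions in branches.items():
        sandbox.restore(checkpoint)          # every branch starts from the same state
        for action in actions:
            sandbox.exec(action)
        branch_score = score(sandbox)
        if branch_score > best_score:
            best_name, best_score = name, branch_score
            best_snap = sandbox.snapshot()   # remember the winning end state
    sandbox.restore(best_snap)
    return best_name
```

For example, scoring by fewest actions would prefer a keyboard shortcut over a two-click menu path. In a real backend the same control flow would sit on top of whatever snapshot mechanism the substrate provides.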
Week 7: QEMU backend for one "real OS" demo (installing software, multi-app workflows). It doesn't need parity with the Docker path; just prove the abstraction holds.
Week 8: Benchmark suite (10–20 tasks), README, demo video, blog post. The video matters more than you'd think for portfolio reach.
Things worth deciding now
A benchmark gives the project teeth — even a small homegrown one (file management, browser tasks, multi-app workflows) lets you make claims with numbers. OSWorld and WebArena exist if you want external comparison, but adapting them costs a week; consider whether that's worth it for your goals.
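A homegrown benchmark can start as a very small harness. The sketch below assumes each task is a callable that returns True on success; the Task fields, the category labels, and the choice to count a crashed run as a failure are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    category: str             # e.g. "file", "browser", "multi-app"
    run: Callable[[], bool]   # hypothetical: returns True when the task succeeded


def run_benchmark(tasks: List[Task], trials: int = 3) -> Dict[str, float]:
    """Return the success rate per category over `trials` runs of each task."""
    totals: Dict[str, int] = {}
    wins: Dict[str, int] = {}
    for task in tasks:
        for _ in range(trials):
            totals[task.category] = totals.get(task.category, 0) + 1
            try:
                ok = bool(task.run())
            except Exception:
                ok = False    # a crashed run counts as a failure
            wins[task.category] = wins.get(task.category, 0) + int(ok)
    return {category: wins[category] / totals[category] for category in totals}
```

Running each task several times matters because agent runs are rarely deterministic; a per-category rate over repeated trials is a more defensible number than a single pass/fail.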
Pick a license early (Apache-2.0 is the path of least resistance for portfolio work). And decide upfront whether the agent layer is pluggable or Claude-specific — pluggable is more work but dramatically expands who'll try it.
Prior art to study before you start
Anthropic's computer-use reference implementation (the Docker container is a useful baseline to surpass), OpenClaw, Self-Operating Computer, OSWorld, and Microsoft's OmniParser. Spend a day reading their code before writing yours; you will likely find that each has a specific weakness your project can address explicitly in the README.
Want me to go deeper on any piece — the snapshot/branching design, the trace format, or the benchmark task list?