mind-palace/Projects/Thinking/GodHand - Agentic VM Manager.md
2026-05-24 13:36:37 +05:30

[[agentic vm manager]]
# Plan
**Positioning**
The wedge is "agent owns the VM lifecycle, not just the screen." Most computer-use demos assume a desktop already exists; OpenClaw, Anthropic's reference container, and Self-Operating Computer mostly treat the VM as a given. Your project treats `spin up → configure → drive → snapshot → tear down` as one programmable surface. Call out two concrete capabilities competitors lack: (1) the agent can request a fresh, reproducible desktop per task, and (2) it can snapshot/branch state mid-task to explore alternatives or retry from a known-good checkpoint. That framing alone gives the repo a clear reason to exist.
**Architecture sketch**
Four layers, kept deliberately small:
1. _VM substrate_ — Docker+VNC for the default path (fast, cross-platform, easy for contributors to run), QEMU/KVM as an optional backend for full-OS scenarios. Expose both behind one `Sandbox` interface with `create`, `snapshot`, `restore`, `destroy`, `exec`.
2. _Capture/control_ — screenshots via VNC framebuffer or `scrot`; input via `xdotool` or direct VNC events. Keep latency budget under ~300ms round-trip or the agent loop gets painful.
3. _Agent loop_ — Claude (or any vision LLM) in a perceive→plan→act→verify cycle. The verify step is where most projects cut corners; make it a first-class screenshot-diff + assertion mechanism.
4. _Orchestrator_ — Python service exposing a task API: "given this goal and this base image, return a trace." Traces are the artifact you publish.
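One way the `Sandbox` interface from layer 1 might look, as a minimal sketch. Only the five method names come from the plan; the signatures, the `InMemorySandbox` stand-in, the `run_with_checkpoint` helper, and the `ubuntu-firefox` image name are assumptions for illustration. A real Docker or QEMU backend would implement the same methods.

```python
from __future__ import annotations

import itertools
from typing import Protocol


class Sandbox(Protocol):
    """Backend-agnostic lifecycle surface (method names from the plan)."""

    def create(self, image: str) -> None: ...
    def exec(self, command: str) -> str: ...
    def snapshot(self) -> str: ...
    def restore(self, snapshot_id: str) -> None: ...
    def destroy(self) -> None: ...


class InMemorySandbox:
    """Hypothetical toy backend: 'state' is just the list of commands run."""

    def __init__(self) -> None:
        self._image: str | None = None
        self._history: list[str] = []
        self._snapshots: dict[str, list[str]] = {}
        self._ids = itertools.count(1)

    def create(self, image: str) -> None:
        self._image = image
        self._history = []

    def exec(self, command: str) -> str:
        if self._image is None:
            raise RuntimeError("sandbox has not been created")
        self._history.append(command)
        return f"ran: {command}"

    def snapshot(self) -> str:
        snapshot_id = f"snap-{next(self._ids)}"
        self._snapshots[snapshot_id] = list(self._history)  # copy, not alias
        return snapshot_id

    def restore(self, snapshot_id: str) -> None:
        self._history = list(self._snapshots[snapshot_id])

    def destroy(self) -> None:
        self._image = None
        self._history = []
        self._snapshots = {}

    @property
    def history(self) -> list[str]:
        return list(self._history)


def run_with_checkpoint(sandbox: Sandbox, image: str) -> str:
    """Create a sandbox, checkpoint it, make a mistake, then roll back."""
    sandbox.create(image)
    sandbox.exec("apt-get install -y firefox")
    checkpoint = sandbox.snapshot()
    sandbox.exec("rm -rf /tmp/profile")  # a step we decide to undo
    sandbox.restore(checkpoint)
    return checkpoint
```

Because `run_with_checkpoint` depends only on the protocol, the same caller code should work against any backend that implements these five methods, which is the property the Week 7 QEMU demo is meant to prove.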
**8-week milestone plan**
Weeks 1-2: Docker+VNC sandbox with clean lifecycle API, screenshot capture, basic input injection. Goal: a script that boots a fresh Ubuntu+Firefox container, takes a screenshot, clicks a button, exits cleanly.
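A rough sketch of that Weeks 1-2 smoke test, written as functions that only build the command lines so it can be read and tested without Docker installed. The container name, the image name, the `:1` display, and the `/tmp/screen.png` path are all assumptions; the `docker`, `scrot`, and `xdotool` arguments are standard ones, but check them against your image.

```python
from __future__ import annotations

DISPLAY = ":1"  # assumption: the X display served by VNC inside the container


def boot_cmd(image: str, name: str, vnc_port: int = 5900) -> list[str]:
    """Start the container detached and publish its VNC port."""
    return ["docker", "run", "-d", "--name", name, "-p", f"{vnc_port}:5900", image]


def in_container(name: str, *args: str) -> list[str]:
    """Wrap a command so it runs inside the container with DISPLAY set."""
    return ["docker", "exec", "-e", f"DISPLAY={DISPLAY}", name, *args]


def screenshot_cmd(name: str, path: str = "/tmp/screen.png") -> list[str]:
    """Capture the screen to a file inside the container using scrot."""
    return in_container(name, "scrot", path)


def click_cmd(name: str, x: int, y: int, button: int = 1) -> list[str]:
    """Move the pointer to (x, y) and click using xdotool."""
    return in_container(
        name, "xdotool", "mousemove", str(x), str(y), "click", str(button)
    )


def teardown_cmd(name: str) -> list[str]:
    """Force-remove the container, running or not."""
    return ["docker", "rm", "-f", name]


def smoke_test_plan(image: str, name: str) -> list[list[str]]:
    """Boot, screenshot, click, tear down: the ordered argv lists to run."""
    return [
        boot_cmd(image, name),
        screenshot_cmd(name),
        click_cmd(name, 640, 360),
        teardown_cmd(name),
    ]
```

To actually run it, pass each list to `subprocess.run(argv, check=True)` in order. Keeping command construction separate from execution also gives you something to unit-test before the container image exists.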
Weeks 3-4: Agent loop with Claude's computer-use API. Get one end-to-end task working reliably (e.g., "open Firefox, search Wikipedia for X, screenshot the result"). Build the trace format here: every step is logged with its screenshot, action, and model reasoning.
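A possible shape for that trace format, as a sketch rather than a spec. The three per-step fields (screenshot, action, reasoning) come from the plan; the class names, the JSON Lines layout, and the header record are assumptions.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    index: int
    screenshot_path: str  # where the frame for this step was saved
    action: dict          # e.g. {"type": "click", "x": 10, "y": 20}
    reasoning: str        # the model's stated rationale for the action


@dataclass
class Trace:
    goal: str
    base_image: str
    steps: list[TraceStep] = field(default_factory=list)

    def log(self, screenshot_path: str, action: dict, reasoning: str) -> TraceStep:
        """Append one step; the index is assigned in logging order."""
        step = TraceStep(len(self.steps), screenshot_path, action, reasoning)
        self.steps.append(step)
        return step

    def to_jsonl(self) -> str:
        """Serialize as JSON Lines: one header record, then one per step."""
        header = {"type": "header", "goal": self.goal, "base_image": self.base_image}
        lines = [json.dumps(header)]
        lines += [json.dumps({"type": "step", **asdict(s)}) for s in self.steps]
        return "\n".join(lines)
```

JSON Lines is chosen here because a trace can be appended to while the task runs and still be readable if the run crashes midway; a single JSON document would work too if you only write traces at the end.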
Weeks 5-6: Snapshot/restore as a first-class primitive. This is your differentiator. Implement branching: agent hits a decision point, forks the VM, tries both paths, picks the winner. Even a toy demo of this is rare and memorable.
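The branching idea can be prototyped before any real VM exists. The sketch below substitutes a simpler technique for true forking: it runs the branches one after another, restoring the same checkpoint before each, rather than running parallel VM copies. `ToySandbox`, `explore_branches`, and the scoring callback are hypothetical names.

```python
from __future__ import annotations

from typing import Callable, Sequence


class ToySandbox:
    """Hypothetical stand-in for a VM: its whole state is one dict."""

    def __init__(self) -> None:
        self.state: dict[str, int] = {}
        self._snapshots: list[dict[str, int]] = []

    def snapshot(self) -> int:
        self._snapshots.append(dict(self.state))  # store a copy
        return len(self._snapshots) - 1

    def restore(self, snapshot_id: int) -> None:
        self.state = dict(self._snapshots[snapshot_id])


Action = Callable[[ToySandbox], None]


def explore_branches(
    sandbox: ToySandbox,
    branches: Sequence[Action],
    score: Callable[[ToySandbox], float],
) -> int:
    """Try every branch from one checkpoint and keep the best outcome.

    Returns the index of the winning branch and leaves the sandbox in
    that branch's final state. Ties go to the earliest branch.
    """
    if not branches:
        raise ValueError("need at least one branch")
    checkpoint = sandbox.snapshot()
    outcomes: list[tuple[float, int]] = []
    for branch in branches:
        sandbox.restore(checkpoint)  # every branch starts from the same state
        branch(sandbox)
        outcomes.append((score(sandbox), sandbox.snapshot()))
    best = max(range(len(outcomes)), key=lambda i: outcomes[i][0])
    sandbox.restore(outcomes[best][1])
    return best
```

With a real backend, the sequential version costs one restore per branch but needs no extra VM resources; a parallel fork trades memory for wall-clock time. Either way, the scoring function is where the agent's verify step would plug in.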
Week 7: QEMU backend for one "real OS" demo (installing software, multi-app workflows). It doesn't need parity with the Docker path; just prove the abstraction holds.
Week 8: Benchmark suite (10-20 tasks), README, demo video, blog post. The video matters more than you'd think for portfolio reach.
**Things worth deciding now**
A benchmark gives the project teeth — even a small homegrown one (file management, browser tasks, multi-app workflows) lets you make claims with numbers. OSWorld and WebArena exist if you want external comparison, but adapting them costs a week; consider whether that's worth it for your goals.
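To make "claims with numbers" concrete, here is a small sketch of how results from a homegrown suite could be summarized. The record fields, the category labels, and the reserved `overall` bucket are assumptions; the categories mirror the three task types mentioned above.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    category: str  # e.g. "files", "browser", "multi-app"
    passed: bool
    steps: int     # agent actions taken before finishing or giving up


def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Per-category and overall success rate and mean step count.

    "overall" is a reserved bucket name, so no task category may use it.
    """
    buckets: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        buckets[r.category].append(r)
        buckets["overall"].append(r)
    summary: dict[str, dict[str, float]] = {}
    for name, rs in buckets.items():
        summary[name] = {
            "tasks": len(rs),
            "success_rate": sum(r.passed for r in rs) / len(rs),
            "mean_steps": sum(r.steps for r in rs) / len(rs),
        }
    return summary
```

With only 10 to 20 tasks, each one moves the success rate by several points, so it is worth reporting the raw task count alongside any percentage.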
Pick a license early (Apache-2.0 is the path of least resistance for portfolio work). And decide upfront whether the agent layer is pluggable or Claude-specific — pluggable is more work but dramatically expands who'll try it.
**Prior art to study before you start**
Anthropic's computer-use reference implementation (the Docker container is a useful baseline to surpass), OpenClaw, Self-Operating Computer, OSWorld, and Microsoft's OmniParser. Spend a day reading their code before writing yours — you'll find each one has a specific weakness your project can address explicitly in the README.
Want me to go deeper on any piece — the snapshot/branching design, the trace format, or the benchmark task list?