hyperhive — Plan
Status. All phases 0–7 have shipped. This file is the original design doc; CLAUDE.md is the current source of truth for what's actually built, the file map, gotchas, and operator runbook. Keep this file for the why and the original phase rationale; CLAUDE.md for how things are today.
Names.
- Repo: hyperhive
- Host-side daemon (helper): hive-c0re, its own crate (daemon + CLI in one binary).
- In-container harness: hive-ag3nt crate, with two [[bin]] targets:
  - hive-ag3nt: runs in each sub-agent container.
  - hive-m1nd: runs in the manager container (same crate, second main.rs, wires the manager tool surface).
- Shared crate between hive-c0re and hive-ag3nt (wire protocol, MCP verb types, message shapes): hive-sh4re.
Relationship to damocles. Damocles is a separate, currently-running setup. hyperhive is a new, independent system in its own repo. Damocles' existing claude-container.nix informs the agent-base template but is not a dependency. Migration options are laid out in docs/damocles-migration.md; the recommendation is to keep them separate for now.
What we're building
A multi-Claude-Code-agent setup on a single host:
- Each agent runs in its own nixos-container (the NixOS wrapper around systemd-nspawn; gives us nixos-container update, declarative configs, etc.).
- A manager agent (itself a nixos-container) coordinates: spawns/kills agents, routes inter-agent messages, filters relevance, summarises, gates lifecycle changes on user approval.
- Host-side Rust daemon hive-c0re owns container lifecycle, hosts the MCP transport, brokers messages, runs the dashboard, and orchestrates approval-via-git-commit.
- Inside each agent container, hive-ag3nt runs the agent turn loop (drives claude as a subprocess), provides the per-agent web UI, and is the MCP client for that agent.
- Inside the manager container, hive-m1nd runs the analogous loop with the manager's tool surface and /agents/** RW access. It's a second binary in the hive-ag3nt crate: same lib code, different main.rs.
- Wire protocol and types shared between hive-c0re and the harness live in hive-sh4re.
Architecture
┌────────────────── host ────────────────────────────────────┐
│ │
│ hive-c0re (Rust daemon, NixOS service) │
│ ├── lifecycle : nixos-container CRUD + nixos-container │
│ │ update on approved config commits │
│ ├── broker : sqlite message store + routing │
│ ├── mcp : per-socket MCP servers (see Sockets) │
│ ├── approvals : git-commit-based change requests │
│ └── dashboard : HTMX/SSE web UI │
│ │
│ /var/lib/hyperhive/ │
│ ├── state-repo/ (world: who exists, etc.) │
│ ├── shared-instructions/ (RO into every agent) │
│ └── agents/ │
│ ├── manager/ │
│ │ ├── config/ (flake repo, manager's own setup) │
│ │ ├── prompts/ (manager's role/CLAUDE.md) │
│ │ └── state/ (RW for manager) │
│ └── <name>/ │
│ ├── config/ (flake repo, RW for manager, RO ag) │
│ ├── prompts/ (RO inside agent) │
│ └── state/ (RW for agent + manager) │
│ │
│ Sockets (one per principal — perms by mount location): │
│ ├── /run/hyperhive/host.sock admin/CLI on host │
│ ├── /run/hyperhive/manager.sock → manager container │
│ └── /run/hyperhive/agents/<name>.sock → that agent │
│ │
│ ┌─ nixos-container: hm1nd ──────────────────────┐ │
│ │ hive-m1nd (Rust, hive-ag3nt crate) │ │
│ │ ├ MCP client → /run/hyperhive/manager.sock │ │
│ │ ├ turn loop driving `claude` │ │
│ │ └ per-container web UI │ │
│ │ /var/lib/hyperhive/agents/** bind RW │ │
│ │ MCP tools (manager surface): │ │
│ │ send(to, body, wait_for_reply: bool), │ │
│ │ recv / next_event, │ │
│ │ request_spawn, request_kill, │ │
│ │ request_apply_commit, inject_peer_info, │ │
│ │ update_shared_instructions │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌─ nixos-container: h-<name> ───────────────────┐ │
│ │ hive-ag3nt (Rust) │ │
│ │ MCP client → /run/hyperhive/agents/<name>.sock │
│ │ state/ RW, config/ + prompts/ + shared RO │ │
│ │ MCP tools (agent surface): │ │
│ │ send, recv, request_install │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key model decisions
Sockets, not identity gating. One unix socket per principal, bind-mounted into the right place. The socket is the principal — no SO_PEERCRED lookups, no token plumbing. Perms are filesystem perms on the host side, plus the fact that only the matching container has the bind-mount. Each socket runs the MCP tool surface appropriate to its principal.
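To make the "socket is the principal" idea concrete, here is a minimal sketch of how a principal could map to its socket path and tool surface. The paths and tool names follow the architecture diagram and the Phase 1 verb list; the type and method names (Principal, socket_path, tool_surface, allows) are hypothetical and not taken from the actual crates.

```rust
use std::path::PathBuf;

/// The three kinds of principal shown in the architecture diagram.
/// Hypothetical type: the real crates may model this differently.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Principal {
    Host,
    Manager,
    Agent(String),
}

impl Principal {
    /// Host-side socket path for this principal (layout as drawn in the diagram).
    pub fn socket_path(&self) -> PathBuf {
        let base = PathBuf::from("/run/hyperhive");
        match self {
            Principal::Host => base.join("host.sock"),
            Principal::Manager => base.join("manager.sock"),
            Principal::Agent(name) => base.join("agents").join(format!("{}.sock", name)),
        }
    }

    /// Tool names served on this principal's socket.
    /// Host verbs are the Phase 1 list; the others come from the diagram.
    pub fn tool_surface(&self) -> &'static [&'static str] {
        match self {
            Principal::Host => &["spawn", "kill", "rebuild", "list"],
            Principal::Manager => &[
                "send",
                "recv",
                "next_event",
                "request_spawn",
                "request_kill",
                "request_apply_commit",
                "inject_peer_info",
                "update_shared_instructions",
            ],
            Principal::Agent(_) => &["send", "recv", "request_install"],
        }
    }

    /// No caller identity check: which socket the request arrived on decides this.
    pub fn allows(&self, tool: &str) -> bool {
        self.tool_surface().iter().any(|t| *t == tool)
    }
}
```

The point of the sketch is that a server bound to one socket only needs to know which Principal it was created for; nothing about the peer is inspected at request time.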
Container naming. nixos-container caps the total container name at 11 chars (it gets encoded into network interface names). The manager runs as hm1nd (a compressed form of hive-m1nd). Sub-agents run as h-<name> with <name> capped to 9 chars (MAX_AGENT_NAME). The two namespaces don't collide. On-disk paths (/var/lib/hyperhive/agents/foo/) and socket paths (/run/hyperhive/agents/foo/mcp.sock) use the bare logical name; the prefixing lives only at the nixos-container layer.
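The naming rule can be sketched as a small validation function. MAX_AGENT_NAME and the h- / hm1nd names come from the text above; the error type and the accepted character set are assumptions made for illustration only.

```rust
/// Longest logical agent name, as stated in the plan.
pub const MAX_AGENT_NAME: usize = 9;
/// Container name used for the manager.
pub const MANAGER_CONTAINER: &str = "hm1nd";

/// Hypothetical error type for rejected names.
#[derive(Debug, PartialEq, Eq)]
pub enum NameError {
    Empty,
    BadChar(char),
    TooLong(usize),
}

/// Map a logical agent name to its nixos-container name ("h-<name>").
pub fn container_name(agent: &str) -> Result<String, NameError> {
    if agent.is_empty() {
        return Err(NameError::Empty);
    }
    // Assumption: only lowercase ASCII letters, digits and '-' are accepted.
    if let Some(c) = agent
        .chars()
        .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'))
    {
        return Err(NameError::BadChar(c));
    }
    // All characters are ASCII here, so byte length equals character count.
    if agent.len() > MAX_AGENT_NAME {
        return Err(NameError::TooLong(agent.len()));
    }
    Ok(format!("h-{}", agent))
}
```

A 9-character name yields exactly 11 characters after prefixing, which is the stated nixos-container limit, and hm1nd can never be produced by container_name because it does not start with h-.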
Approvals are git commits. hive-c0re maintains a state-repo on host that records the world (which agents exist, their roles, etc.). Per-agent flake configs (agents/<name>/config/) are themselves git repos. The manager edits clones with plain git CLI inside its container and asks hive-c0re to apply a commit via request_apply_commit(agent, sha). hive-c0re queues it; once approved, fast-forwards main and reconciles state (rebuild containers, etc.). No abstract "approval token" — the commit hash is the token.
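As a rough illustration of "the commit hash is the token", the pending queue can be modelled as a map keyed by sha. This is an in-memory sketch under assumptions: the type names, the Denied state, and the return values are hypothetical, and the fast-forward and rebuild steps that follow an approval are left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Pending,
    Approved,
    Denied,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChangeRequest {
    pub agent: String,
    pub sha: String,
    pub status: Status,
}

/// Hypothetical in-memory queue; the commit sha is the only handle.
#[derive(Default)]
pub struct ApprovalQueue {
    by_sha: HashMap<String, ChangeRequest>,
}

impl ApprovalQueue {
    /// Queue a commit for approval. Returns false if the sha is already known.
    pub fn request_apply_commit(&mut self, agent: &str, sha: &str) -> bool {
        if self.by_sha.contains_key(sha) {
            return false;
        }
        let req = ChangeRequest {
            agent: agent.to_string(),
            sha: sha.to_string(),
            status: Status::Pending,
        };
        self.by_sha.insert(sha.to_string(), req);
        true
    }

    /// Approve or deny a pending commit. Returns the new status, or None if
    /// the sha is unknown or was already resolved.
    pub fn resolve(&mut self, sha: &str, approve: bool) -> Option<Status> {
        let req = self.by_sha.get_mut(sha)?;
        if req.status != Status::Pending {
            return None;
        }
        req.status = if approve { Status::Approved } else { Status::Denied };
        Some(req.status.clone())
    }

    /// All requests still waiting for a decision.
    pub fn pending(&self) -> Vec<&ChangeRequest> {
        self.by_sha
            .values()
            .filter(|r| r.status == Status::Pending)
            .collect()
    }
}
```

In this model a CLI approve <sha> would call resolve(sha, true) and, on Some(Status::Approved), go on to fast-forward and reconcile.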
Approval UX evolves. Early phases approve by CLI (hive-c0re approve <sha>) or even direct git merge on the host. The dashboard's commit-view UI is one of the last features built (so the system runs end-to-end on CLI long before the browser does anything useful).
Manager is a second binary in the same crate. hive-m1nd lives in the hive-ag3nt crate as a second [[bin]], sharing the crate's lib code. The two binaries differ in:
- Tool surface (manager's main.rs wires request_spawn / request_kill / request_apply_commit / inject_peer_info / update_shared_instructions).
- Bind-mounts (manager gets agents/** RW; agents get only their own state/).
- Auto-restart (manager is auto-restarted by hive-c0re; agents stay down once killed).
- The manager container is declared in the host NixOS module as a known container.
Manager concurrency = event loop. hive-m1nd pulls from a heterogeneous next_event stream: inbound agent messages, replies to sync sends, lifecycle events from hive-c0re (crash, OOM, approval-resolved), and dashboard signals. One queue, one claude turn per event.
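The single-queue model could look roughly like the sketch below: one enum covering the four event sources and a loop that runs one turn per event. The variant names, field names, and prompt wording are all hypothetical; run_turn stands in for whatever actually drives claude.

```rust
use std::sync::mpsc;

/// One variant per event source named in the plan. Field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    AgentMessage { from: String, body: String },
    SyncReply { from: String, body: String },
    Lifecycle { agent: String, what: String },
    Dashboard { signal: String },
}

/// Render one event as the input text for a single turn (illustrative wording).
pub fn prompt_for(event: &Event) -> String {
    match event {
        Event::AgentMessage { from, body } => format!("message from {}: {}", from, body),
        Event::SyncReply { from, body } => format!("reply from {}: {}", from, body),
        Event::Lifecycle { agent, what } => format!("lifecycle event for {}: {}", agent, what),
        Event::Dashboard { signal } => format!("dashboard signal: {}", signal),
    }
}

/// Drain the queue, calling `run_turn` exactly once per event.
/// Returns the number of turns once every sender has been dropped.
pub fn run_event_loop<F: FnMut(&str)>(rx: mpsc::Receiver<Event>, mut run_turn: F) -> usize {
    let mut turns = 0;
    for event in rx {
        run_turn(&prompt_for(&event));
        turns += 1;
    }
    turns
}
```

Because every source feeds the same channel, ordering between, say, a crash notification and an agent message is simply arrival order, and the manager never runs two turns concurrently.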
Anthropic credentials. Shared key on host, bind-mounted into every container. No per-agent keys in v1.
Workdir bootstrap. Each agent's state/ starts empty. Initial-task message tells the agent what to clone/set up. Manager can drop big artefacts into state/ directly (it has RW) and pass the path as a message reference.
Riskiest assumptions (test in this order)
- nixos-container update hot-reloads under the in-flight nsresourced / mountfsd / privateUsers patch stack without nuking the running harness (and thus the claude session). Validated in Phase 1.
- The harness driving claude as a long-running turn loop is stable: claude as a subprocess streaming output back; harness deciding when a turn ends; injecting inbox messages as new user turns. Validated in Phase 3.
- hive-m1nd (as a claude session) sensibly drives spawn / route / peer-injection. Behavioural; only knowable by running it. Phase 4.
- hive-c0re throughput under multiple agents: bursty messaging, MCP relay, sqlite writes. Likely fine; flag for measurement.
Phased path
All phases ✅ shipped. Each section below is the original design with notes on what actually landed and what deviated. See CLAUDE.md → "Phase status" for the canonical summary.
✅ Phase 0 — repo bootstrap
- Create ~/Repos/hyperhive/, init flake.
- Cargo workspace: hive-c0re/, hive-sh4re/, hive-ag3nt/ (the last with two [[bin]] targets: hive-ag3nt and hive-m1nd). All compile, all do nothing useful.
- NixOS module skeleton (nix/modules/hive-c0re.nix) that runs the daemon as a systemd service on the host.
- Agent base template (nix/templates/agent-base.nix) that builds a nixos-container including the hive-ag3nt binary.
- Exit: nixos-container create test-agent --flake .#agent-base && nixos-container start test-agent brings up a container whose hive-ag3nt prints "hello" and exits.
✅ Phase 1 — container lifecycle + Risk 1
- hive-c0re: open host admin socket (/run/hyperhive/host.sock); verbs spawn(name), kill(name), rebuild(name), list(). Uses nixos-container underneath; container name on the host is h-<name> (sub-agents) or hm1nd (manager).
- CLI tool talking to the admin socket (same hive-c0re binary, subcommand-driven).
- Manually mutate an agent's config flake, call rebuild, observe whether hive-ag3nt survives.
- Decision: if hot-reload doesn't preserve the harness, that becomes a hard requirement of hive-ag3nt's design (resume from disk state). Document the outcome.
- Exit: spawn / rebuild / kill via CLI is reliable; known behaviour for in-flight rebuilds.
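A subcommand-driven CLI for the four Phase 1 verbs might parse its arguments along these lines. This is only a sketch: AdminVerb and parse_verb are hypothetical names, and the error strings are illustrative.

```rust
/// The Phase 1 admin verbs. Hypothetical type, not the actual hive-sh4re definition.
#[derive(Debug, PartialEq, Eq)]
pub enum AdminVerb {
    Spawn(String),
    Kill(String),
    Rebuild(String),
    List,
}

/// Parse the arguments that follow the binary name into a verb.
pub fn parse_verb(args: &[&str]) -> Result<AdminVerb, String> {
    match args {
        ["spawn", name] => Ok(AdminVerb::Spawn(name.to_string())),
        ["kill", name] => Ok(AdminVerb::Kill(name.to_string())),
        ["rebuild", name] => Ok(AdminVerb::Rebuild(name.to_string())),
        ["list"] => Ok(AdminVerb::List),
        [] => Err("missing subcommand".to_string()),
        [other, ..] => Err(format!("unknown or malformed subcommand: {}", other)),
    }
}
```

The parsed verb would then be sent over the admin socket; how it is encoded on the wire is not decided at this phase.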
✅ Phase 2 — sockets + minimal MCP
- hive-c0re opens manager.sock and agents/<name>.sock (one per spawned agent). Per-socket MCP server with the right tool surface baked in. Types from hive-sh4re.
- hive-ag3nt: MCP client (types from hive-sh4re), connects to its socket on startup, exchanges hello.
- Tools: agent gets send(to, body), recv(). No persistence yet (in-memory).
- Exit: two test agents exchange messages through hive-c0re, manually driven.
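The resolved decisions later in this file record that the wire format ended up as custom JSON lines over unix sockets. Under that assumption, a hello exchange reduces to writing and reading newline-terminated frames. The helper names below are hypothetical, and the frame contents are treated as opaque strings because the actual message shapes live in hive-sh4re and are not reproduced here.

```rust
use std::io::{self, BufRead, Write};

/// Write one frame: a single line of JSON followed by '\n'.
pub fn write_frame<W: Write>(out: &mut W, json: &str) -> io::Result<()> {
    // A raw newline inside the payload would split the frame in two.
    if json.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "frame must be a single line",
        ));
    }
    out.write_all(json.as_bytes())?;
    out.write_all(b"\n")?;
    out.flush()
}

/// Read one frame. Returns Ok(None) when the peer has closed the stream.
pub fn read_frame<R: BufRead>(input: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    let n = input.read_line(&mut line)?;
    if n == 0 {
        return Ok(None);
    }
    // Strip the line terminator before handing the payload to a JSON parser.
    let payload = line.trim_end_matches(|c| c == '\n' || c == '\r');
    Ok(Some(payload.to_string()))
}
```

In a real client the writer would be a std::os::unix::net::UnixStream and the reader a BufReader around a clone of it; the functions are generic so the framing can be exercised against an in-memory buffer.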
✅ Phase 3 — broker + turn loop
- hive-c0re: sqlite-backed message store (messages table; id, sender, recipient, body, sent_at, delivered_at). Survives hive-c0re restart.
- hive-ag3nt (lib): real turn loop. Reads from recv; feeds new messages as user turns to claude; captures output; calls send for outbound. Long-running.
- Exit: two hive-ag3nt-driven agents have a back-and-forth conversation through hive-c0re.
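The shape of the messages table can be mirrored by a plain struct, with an in-memory stand-in showing one plausible reading of send and recv. This sketch does not use sqlite. The column names come from the list above; the timestamp unit, the oldest-first delivery order, and the meaning of delivered_at (stamped when recv hands the message over) are assumptions.

```rust
/// One row of the `messages` table, field for column.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub id: u64,
    pub sender: String,
    pub recipient: String,
    pub body: String,
    /// Assumed to be unix seconds.
    pub sent_at: u64,
    /// None until the message has been handed to its recipient.
    pub delivered_at: Option<u64>,
}

/// In-memory stand-in for the store; a persistent version would keep the same interface.
#[derive(Default)]
pub struct MemoryStore {
    next_id: u64,
    messages: Vec<Message>,
}

impl MemoryStore {
    /// Record an outbound message and return its id.
    pub fn send(&mut self, sender: &str, recipient: &str, body: &str, now: u64) -> u64 {
        self.next_id += 1;
        self.messages.push(Message {
            id: self.next_id,
            sender: sender.to_string(),
            recipient: recipient.to_string(),
            body: body.to_string(),
            sent_at: now,
            delivered_at: None,
        });
        self.next_id
    }

    /// Hand over the oldest undelivered message for `recipient`, stamping delivered_at.
    pub fn recv(&mut self, recipient: &str, now: u64) -> Option<Message> {
        let msg = self
            .messages
            .iter_mut()
            .find(|m| m.recipient == recipient && m.delivered_at.is_none())?;
        msg.delivered_at = Some(now);
        Some(msg.clone())
    }
}
```

Keeping delivered rows rather than deleting them is what would let a restarted daemon tell delivered messages from undelivered ones.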
✅ Phase 4 — hive-m1nd + privileged surface
- hive-m1nd binary (second [[bin]] in hive-ag3nt) wires the manager tool surface.
- Manager container (hm1nd) declared in host NixOS module (auto-restart). Bind-mount agents/** RW.
- Manager socket gets the privileged tool surface: request_spawn / request_kill, request_apply_commit, inject_peer_info, send(..., wait_for_reply=true).
- Smoke: attach a terminal to the manager container (nixos-container root-login); ask hive-m1nd to spawn an agent and route a message to it.
- Exit: manager spawns, routes, kills a child agent end-to-end; lifecycle still gated by manual CLI approval (no GUI yet).
✅ Phase 5 — git-commit approval flow
- state-repo on host tracks world (agents directory listing, allow-lists, etc.).
- Per-agent config/ flake repos created at spawn time.
- Manager's container: bind-mounted clones; uses plain git CLI to edit/commit.
- request_apply_commit(name, sha) queues a change in hive-c0re. Approval = CLI hive-c0re approve <sha>; on approve, hive-c0re fast-forwards main and reconciles (rebuild if config changed, run nixos-container update).
- Per-agent allow-list for request_install: in-list installs become auto-applied commits; novel pkgs become pending commits.
- Exit: manager adds a package to an agent → user approves on CLI → agent picks it up.
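The allow-list rule for request_install amounts to a two-way classification. A minimal sketch, with hypothetical names and assuming the allow-list is an exact-match set of package names:

```rust
use std::collections::HashSet;

/// What happens to the commit produced by a request_install call.
#[derive(Debug, PartialEq, Eq)]
pub enum InstallOutcome {
    /// Package is on the agent's allow-list: the commit is applied without review.
    AutoApplied,
    /// Package is new for this agent: the commit waits for approval.
    PendingApproval,
}

/// Classify one requested package against one agent's allow-list.
pub fn classify_install(allow_list: &HashSet<String>, pkg: &str) -> InstallOutcome {
    if allow_list.contains(pkg) {
        InstallOutcome::AutoApplied
    } else {
        InstallOutcome::PendingApproval
    }
}
```

Either way a commit is produced; the allow-list only decides whether that commit skips the approval queue.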
✅ Phase 6 — per-agent web UI + dashboard MVP
- hive-ag3nt web UI module (in the crate's lib): HTTP on a per-container host port (host network): status, last messages, embedded terminal (xterm.js over WebSocket). Both hive-ag3nt and hive-m1nd binaries expose it.
- Dashboard served by hive-c0re: agent list, per-agent status, links to each agent's UI, link to manager's UI.
- No approval UI yet; users still approve via CLI.
- Exit: browser is a usable navigation layer over the whole system.
✅ Phase 7 — dashboard commit view + polish
- Pending-commits view in the dashboard with diff rendering and Approve/Deny buttons (replaces the CLI approve step).
- Live message-flow view (hive-c0re sees all MCP relay traffic).
- hive-c0re event push into hive-m1nd's next_event (crashes, OOM, approval resolved).
- Resource caps in nixos-container units (MemoryMax, CPUQuota).
- Migration plan for damocles' existing containers (separate doc).
Repo layout (target)
~/Repos/hyperhive/
├── hive-c0re/ # host-side Rust crate (daemon + CLI)
│ ├── src/
│ │ ├── main.rs
│ │ ├── lifecycle.rs # nixos-container CRUD
│ │ ├── broker.rs # sqlite message store
│ │ ├── mcp/ # MCP servers (per-socket)
│ │ ├── approvals.rs # git-commit change requests
│ │ ├── state_repo.rs # world-state git repo
│ │ └── dashboard/ # HTMX/SSE web UI
│ └── Cargo.toml
├── hive-sh4re/ # shared crate (hive-c0re ↔ hive-ag3nt)
│ ├── src/
│ │ ├── lib.rs
│ │ ├── protocol.rs # MCP wire types, verb shapes
│ │ └── message.rs # message types
│ └── Cargo.toml
├── hive-ag3nt/ # in-container harness crate; two binaries
│ ├── src/
│ │ ├── lib.rs # shared harness code
│ │ ├── mcp_client.rs # MCP client over unix socket
│ │ ├── turn_loop.rs # claude subprocess driver
│ │ ├── web_ui.rs # per-container UI scaffolding
│ │ └── bin/
│ │     ├── hive-ag3nt.rs # agent tool surface wiring
│ │     └── hive-m1nd.rs # manager tool surface wiring
│ └── Cargo.toml # [[bin]] hive-ag3nt + [[bin]] hive-m1nd
├── nix/
│ ├── modules/hive-c0re.nix # NixOS module for the host
│ ├── templates/
│ │ ├── agent-base.nix # base agent nixos-container (pulls hive-ag3nt)
│ │ └── manager.nix # manager nixos-container (pulls hive-m1nd)
│ └── overlay.nix # exposes the three crates + both bins as pkgs
├── web/ # static assets, HTMX templates
├── flake.nix
├── Cargo.toml # workspace
└── PLAN.md # this file
Resolved implementation decisions
The original open-decisions list, with what we picked:
- Wire format. Custom JSON-line over unix sockets (host admin / manager / per-agent), not real MCP stdio. Simpler and good enough for now; can swap to MCP later. SSE for the dashboard message-flow.
- Per-agent web UI. axum HTTP server inside each container at a port hashed from the agent name (8100–8999); manager at fixed 8000; dashboard at 7000. Plain HTML, no HTMX, no xterm.js yet.
- state-repo schema. Per-agent dir with files; not a single TOML. Realised as two parallel git repos per agent: proposed (manager-editable) and applied (hive-c0re-only). Container builds from applied.
- Manager access to applied state. Not RW-mounted. Manager only has proposed/ bind-mounted; applied/ is hive-c0re-only.
- One binary or two. One: hive-c0re is daemon + CLI dispatched by subcommand (serve / spawn / kill / rebuild / list / pending / approve / deny).
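For the per-agent UI port, "hashed from the agent name (8100–8999)" can be read as a stable hash reduced into a 900-port window. The plan does not say which hash is used, so the sketch below picks 32-bit FNV-1a purely as an assumption; the function and constant names are hypothetical.

```rust
/// First port of the per-agent UI range stated in the plan.
pub const PORT_BASE: u16 = 8100;
/// Number of ports in 8100..=8999.
pub const PORT_SPAN: u32 = 900;

/// 32-bit FNV-1a. Assumption: chosen only because it is small and stable
/// across runs; the actual hash may differ.
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for b in bytes {
        hash ^= u32::from(*b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Deterministic UI port for an agent, always within 8100..=8999.
pub fn agent_ui_port(name: &str) -> u16 {
    // The remainder is below 900, so the cast and the addition cannot overflow.
    PORT_BASE + (fnv1a(name.as_bytes()) % PORT_SPAN) as u16
}
```

A fixed hash matters because the dashboard has to compute the same port as the container without coordination. Distinct names can still collide in a 900-port range; how collisions are handled is not specified here.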
Polish backlog (not phased)
See CLAUDE.md → "Polish backlog" for the live list. Highlights: operator
inbox drain, per-agent UI substance, xterm.js terminal embed, destroy verb,
bounded broker, container-crash events via D-Bus.
Explicitly deferred / out of v1 scope
- Per-agent API keys, cost attribution.
- Pooled / pre-warmed containers.
- Destroy verb on the hive-c0re API (use rm on host).
- Backup / replication of agents/ state.
- Migration of existing damocles containers (docs/damocles-migration.md).
- Anything about multiple hosts.