hyperhive — Plan

Status. All phases (0–7) have shipped. This file is the original design doc; CLAUDE.md is the current source of truth for what's actually built, the file map, gotchas, and operator runbook. Keep this file for the why and the original phase rationale; CLAUDE.md for how things are today.

Names.

  • Repo: hyperhive
  • Host-side daemon (helper): hive-c0re — its own crate (daemon + CLI in one binary).
  • In-container harness: hive-ag3nt crate, with two [[bin]] targets:
    • hive-ag3nt — runs in each sub-agent container.
    • hive-m1nd — runs in the manager container (same crate, second main.rs, wires the manager tool surface).
  • Shared crate between hive-c0re and hive-ag3nt (wire protocol, MCP verb types, message shapes): hive-sh4re.

Relationship to damocles. Damocles is a separate, currently-running setup. hyperhive is a new, independent system in its own repo. Damocles' existing claude-container.nix informs the agent-base template but is not a dependency. Migration options are laid out in docs/damocles-migration.md; the recommendation is to keep the two systems separate for now.

What we're building

A multi-Claude-Code-agent setup on a single host:

  • Each agent runs in its own nixos-container (the NixOS wrapper around systemd-nspawn — gives us nixos-container update, declarative configs, etc.).
  • A manager agent (itself a nixos-container) coordinates: spawns/kills agents, routes inter-agent messages, filters relevance, summarises, gates lifecycle changes on user approval.
  • Host-side Rust daemon hive-c0re owns container lifecycle, hosts the MCP transport, brokers messages, runs the dashboard, and orchestrates approval-via-git-commit.
  • Inside each agent container, hive-ag3nt runs the agent turn loop (drives claude as a subprocess), provides the per-agent web UI, and is the MCP client for that agent.
  • Inside the manager container, hive-m1nd runs the analogous loop with the manager's tool surface and /agents/** RW access. It's a second binary in the hive-ag3nt crate — same lib code, different main.rs.
  • Wire protocol and types shared between hive-c0re and the harness live in hive-sh4re.

Architecture

┌────────────────── host ────────────────────────────────────┐
│                                                             │
│  hive-c0re (Rust daemon, NixOS service)                     │
│  ├── lifecycle  : nixos-container CRUD + nixos-container    │
│  │                update on approved config commits         │
│  ├── broker     : sqlite message store + routing            │
│  ├── mcp        : per-socket MCP servers (see Sockets)      │
│  ├── approvals  : git-commit-based change requests          │
│  └── dashboard  : HTMX/SSE web UI                           │
│                                                             │
│  /var/lib/hyperhive/                                        │
│  ├── state-repo/                  (world: who exists, etc.) │
│  ├── shared-instructions/         (RO into every agent)     │
│  └── agents/                                                │
│      ├── manager/                                           │
│      │   ├── config/   (flake repo, manager's own setup)    │
│      │   ├── prompts/  (manager's role/CLAUDE.md)           │
│      │   └── state/    (RW for manager)                     │
│      └── <name>/                                            │
│          ├── config/   (flake repo, RW for manager, RO ag)  │
│          ├── prompts/  (RO inside agent)                    │
│          └── state/    (RW for agent + manager)             │
│                                                             │
│  Sockets (one per principal — perms by mount location):     │
│  ├── /run/hyperhive/host.sock          admin/CLI on host    │
│  ├── /run/hyperhive/manager.sock       → manager container  │
│  └── /run/hyperhive/agents/<name>.sock → that agent         │
│                                                             │
│  ┌─ nixos-container: hm1nd ──────────────────────┐          │
│  │  hive-m1nd (Rust, hive-ag3nt crate)           │          │
│  │   ├ MCP client → /run/hyperhive/manager.sock  │          │
│  │   ├ turn loop driving `claude`                │          │
│  │   └ per-container web UI                      │          │
│  │  /var/lib/hyperhive/agents/** bind RW         │          │
│  │  MCP tools (manager surface):                 │          │
│  │   send(to, body, wait_for_reply: bool),       │          │
│  │   recv / next_event,                          │          │
│  │   request_spawn, request_kill,                │          │
│  │   request_apply_commit, inject_peer_info,     │          │
│  │   update_shared_instructions                  │          │
│  └────────────────────────────────────────────────┘         │
│                                                             │
│  ┌─ nixos-container: h-<name> ───────────────────┐          │
│  │  hive-ag3nt (Rust)                            │          │
│  │   MCP client → /run/hyperhive/agents/<name>.sock         │
│  │  state/ RW, config/ + prompts/ + shared RO    │          │
│  │  MCP tools (agent surface):                   │          │
│  │   send, recv, request_install                 │          │
│  └────────────────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────────┘

Key model decisions

Sockets, not identity gating. One unix socket per principal, bind-mounted into the right place. The socket is the principal — no SO_PEERCRED lookups, no token plumbing. Perms are filesystem perms on the host side, plus the fact that only the matching container has the bind-mount. Each socket runs the MCP tool surface appropriate to its principal.
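As a rough illustration of "the socket is the principal", the sketch below fixes a tool list per principal, using the verb names from the architecture diagram and the Phase 1 admin verbs. The `Principal` enum and the `tool_surface` / `is_allowed` helpers are hypothetical names chosen for this example, not identifiers from the codebase.

```rust
/// Hypothetical: one variant per socket kind described in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Principal {
    Host,    // /run/hyperhive/host.sock
    Manager, // /run/hyperhive/manager.sock
    Agent,   // one socket per sub-agent
}

/// The tool surface is decided by which socket a request arrived on,
/// not by inspecting the caller's identity.
fn tool_surface(p: Principal) -> &'static [&'static str] {
    match p {
        Principal::Host => &["spawn", "kill", "rebuild", "list"],
        Principal::Manager => &[
            "send",
            "recv",
            "next_event",
            "request_spawn",
            "request_kill",
            "request_apply_commit",
            "inject_peer_info",
            "update_shared_instructions",
        ],
        Principal::Agent => &["send", "recv", "request_install"],
    }
}

/// Reject any verb that is not part of the socket's surface.
fn is_allowed(p: Principal, tool: &str) -> bool {
    tool_surface(p).iter().any(|t| *t == tool)
}

fn main() {
    for p in [Principal::Host, Principal::Manager, Principal::Agent] {
        println!("{:?}: {:?}", p, tool_surface(p));
    }
    println!(
        "agent may request_spawn: {}",
        is_allowed(Principal::Agent, "request_spawn")
    );
}
```

Because the surface is a function of the socket alone, a sub-agent cannot reach manager verbs unless it somehow gains access to the manager's socket file.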

Container naming. nixos-container caps the total container name at 11 chars (it gets encoded into network interface names). The manager runs as hm1nd (a compressed form of hive-m1nd). Sub-agents run as h-<name> with <name> capped to 9 chars (MAX_AGENT_NAME). The two namespaces don't collide. On-disk paths (/var/lib/hyperhive/agents/foo/) and socket paths (/run/hyperhive/agents/foo/mcp.sock) use the bare logical name; the prefixing lives only at the nixos-container layer.
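The naming rule can be sketched as a small validation helper. `MAX_AGENT_NAME` comes from the text above; `container_name`, the other constants, and the allowed-character check are assumptions made for illustration.

```rust
/// Longest logical agent name (MAX_AGENT_NAME in the plan).
const MAX_AGENT_NAME: usize = 9;
/// Total container-name cap imposed by nixos-container, per the plan.
const MAX_CONTAINER_NAME: usize = 11;
/// Fixed container name for the manager.
const MANAGER_CONTAINER: &str = "hm1nd";

/// Hypothetical helper: map a logical agent name to its container name.
fn container_name(agent: &str) -> Result<String, String> {
    if agent.is_empty() {
        return Err("agent name is empty".to_string());
    }
    if agent.len() > MAX_AGENT_NAME {
        return Err(format!(
            "agent name '{agent}' is longer than {MAX_AGENT_NAME} chars"
        ));
    }
    // Assumption: restrict names to lowercase ASCII letters, digits and '-'.
    let valid = agent
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    if !valid {
        return Err(format!("agent name '{agent}' has unsupported characters"));
    }
    // "h-" (2 chars) + at most 9 chars = at most 11 chars.
    let name = format!("h-{agent}");
    debug_assert!(name.len() <= MAX_CONTAINER_NAME);
    Ok(name)
}

fn main() {
    println!("{:?}", container_name("foo"));
    println!("{:?}", container_name("much-too-long"));
    // Sub-agent names always start with "h-", so none can equal "hm1nd".
    println!("manager container: {MANAGER_CONTAINER}");
}
```

Every sub-agent container name starts with `h-` while the manager's name does not, which is why the two namespaces cannot overlap.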

Approvals are git commits. hive-c0re maintains a state-repo on host that records the world (which agents exist, their roles, etc.). Per-agent flake configs (agents/<name>/config/) are themselves git repos. The manager edits clones with plain git CLI inside its container and asks hive-c0re to apply a commit via request_apply_commit(agent, sha). hive-c0re queues it; once approved, fast-forwards main and reconciles state (rebuild containers, etc.). No abstract "approval token" — the commit hash is the token.
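A minimal in-memory sketch of "the commit hash is the token" follows. It only tracks the pending / approved / denied state per sha; the fast-forward of main and the reconcile step are omitted. `Approvals`, `Status`, `resolve`, and `pending_for` are hypothetical names, and the sha values are made up.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Approved,
    Denied,
}

/// Hypothetical queue of change requests, keyed by commit sha.
#[derive(Default)]
struct Approvals {
    // sha -> (agent name, current status)
    by_sha: HashMap<String, (String, Status)>,
}

impl Approvals {
    /// Queue a commit for approval. Returns false if the sha is already known.
    fn request_apply_commit(&mut self, agent: &str, sha: &str) -> bool {
        if self.by_sha.contains_key(sha) {
            return false;
        }
        self.by_sha
            .insert(sha.to_string(), (agent.to_string(), Status::Pending));
        true
    }

    /// Approve or deny a pending commit; resolving twice is an error.
    fn resolve(&mut self, sha: &str, approve: bool) -> Result<Status, String> {
        let entry = self
            .by_sha
            .get_mut(sha)
            .ok_or_else(|| format!("unknown commit {sha}"))?;
        if entry.1 != Status::Pending {
            return Err(format!("commit {sha} is already resolved"));
        }
        entry.1 = if approve { Status::Approved } else { Status::Denied };
        Ok(entry.1)
    }

    /// Shas still waiting for a decision for one agent, sorted.
    fn pending_for(&self, agent: &str) -> Vec<&str> {
        let mut shas: Vec<&str> = self
            .by_sha
            .iter()
            .filter(|(_, (a, s))| a == agent && *s == Status::Pending)
            .map(|(sha, _)| sha.as_str())
            .collect();
        shas.sort();
        shas
    }
}

fn main() {
    let mut approvals = Approvals::default();
    approvals.request_apply_commit("foo", "abc123");
    println!("pending for foo: {:?}", approvals.pending_for("foo"));
    println!("resolve: {:?}", approvals.resolve("abc123", true));
}
```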

Approval UX evolves. Early phases approve by CLI (hive-c0re approve <sha>) or even direct git merge on the host. The dashboard's commit-view UI is one of the last features built (so the system runs end-to-end on CLI long before the browser does anything useful).

Manager is a second binary in the same crate. hive-m1nd lives in the hive-ag3nt crate as a second [[bin]], sharing the crate's lib code. The two binaries differ in:

  • Tool surface (manager's main.rs wires request_spawn/request_kill/request_apply_commit/inject_peer_info/update_shared_instructions).
  • Bind-mounts (manager gets agents/** RW; agents get only their own state/).
  • Auto-restart (manager is auto-restarted by hive-c0re; agents stay down once killed).
  • The manager container is declared in the host NixOS module as a known container.

Manager concurrency = event loop. hive-m1nd pulls from a heterogeneous next_event stream: inbound agent messages, replies to sync sends, lifecycle events from hive-c0re (crash, OOM, approval-resolved), and dashboard signals. One queue, claude turn per event.
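The "one queue, one turn per event" shape might look roughly like this. The event kinds mirror the list above, but the variant and field names, `render_turn`, and `run_loop` are assumptions for this sketch; a real loop would hand each rendered prompt to `claude` instead of collecting strings.

```rust
use std::sync::mpsc;

/// Hypothetical lifecycle notifications pushed by hive-c0re.
#[derive(Debug, Clone, PartialEq)]
enum LifecycleEvent {
    Crash { agent: String },
    Oom { agent: String },
    ApprovalResolved { sha: String, approved: bool },
}

/// Hypothetical union of everything the manager can be woken up by.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    AgentMessage { from: String, body: String },
    SyncReply { request_id: u64, body: String },
    Lifecycle(LifecycleEvent),
    DashboardSignal(String),
}

/// Turn one event into the text of one user turn.
fn render_turn(event: &Event) -> String {
    match event {
        Event::AgentMessage { from, body } => format!("message from {from}: {body}"),
        Event::SyncReply { request_id, body } => {
            format!("reply to request {request_id}: {body}")
        }
        Event::Lifecycle(LifecycleEvent::Crash { agent }) => format!("agent {agent} crashed"),
        Event::Lifecycle(LifecycleEvent::Oom { agent }) => format!("agent {agent} hit OOM"),
        Event::Lifecycle(LifecycleEvent::ApprovalResolved { sha, approved }) => {
            format!("commit {sha} approved={approved}")
        }
        Event::DashboardSignal(sig) => format!("dashboard signal: {sig}"),
    }
}

/// Drain the single queue in arrival order, one turn per event.
/// Returns when every sender has been dropped.
fn run_loop(rx: mpsc::Receiver<Event>) -> Vec<String> {
    rx.iter().map(|event| render_turn(&event)).collect()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(Event::AgentMessage { from: "foo".into(), body: "done".into() })
        .unwrap();
    tx.send(Event::Lifecycle(LifecycleEvent::Oom { agent: "bar".into() }))
        .unwrap();
    drop(tx); // close the queue so the loop terminates in this demo
    for turn in run_loop(rx) {
        println!("{turn}");
    }
}
```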

Anthropic credentials. Shared key on host, bind-mounted into every container. No per-agent keys in v1.

Workdir bootstrap. Each agent's state/ starts empty. Initial-task message tells the agent what to clone/set up. Manager can drop big artefacts into state/ directly (it has RW) and pass the path as a message reference.

Riskiest assumptions (test in this order)

  1. nixos-container update hot-reloads under the in-flight nsresourced / mountfsd / privateUsers patch stack without nuking the running harness (and thus the claude session). Validated in Phase 1.
  2. The harness driving claude as a long-running turn loop is stable — claude as a subprocess streaming output back; harness deciding when a turn ends; injecting inbox messages as new user turns. Validated in Phase 3.
  3. hive-m1nd (as a claude session) sensibly drives spawn / route / peer-injection. Behavioural; only knowable by running it. Phase 4.
  4. hive-c0re throughput under multiple agents — bursty messaging, MCP relay, sqlite writes. Likely fine; flag for measurement.

Phased path

All phases shipped. Each section below is the original design with notes on what actually landed and what deviated. See CLAUDE.md → "Phase status" for the canonical summary.

Phase 0 — repo bootstrap

  • Create ~/Repos/hyperhive/, init flake.
  • Cargo workspace: hive-c0re/, hive-sh4re/, hive-ag3nt/ (the last with two [[bin]] targets — hive-ag3nt and hive-m1nd). All compile, all do nothing useful.
  • NixOS module skeleton (nix/modules/hive-c0re.nix) that runs the daemon as a systemd service on the host.
  • Agent base template (nix/templates/agent-base.nix) that builds a nixos-container including the hive-ag3nt binary.
  • Exit: nixos-container create test-agent --flake .#agent-base && nixos-container start test-agent brings up a container whose hive-ag3nt prints "hello" and exits.

Phase 1 — container lifecycle + Risk 1

  • hive-c0re: open host admin socket (/run/hyperhive/host.sock); verbs spawn(name), kill(name), rebuild(name), list(). Uses nixos-container underneath; container name on the host is h-<name> (sub-agents) or hm1nd (manager).
  • CLI tool talking to the admin socket (same hive-c0re binary, subcommand-driven).
  • Manually mutate an agent's config flake, call rebuild, observe whether hive-ag3nt survives.
  • Decision: if hot-reload doesn't preserve the harness, that becomes a hard requirement of hive-ag3nt's design (resume from disk state). Document the outcome.
  • Exit: spawn / rebuild / kill via CLI is reliable, and the behaviour of in-flight rebuilds is known and documented.
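For the subcommand-driven CLI, a dispatch over the four Phase 1 verbs could be sketched as below. `AdminVerb` and `parse_verb` are hypothetical names, and the argument layout (verb first, then the agent name) is an assumption.

```rust
/// Hypothetical representation of the Phase 1 admin verbs.
#[derive(Debug, PartialEq, Eq)]
enum AdminVerb {
    Spawn(String),
    Kill(String),
    Rebuild(String),
    List,
}

/// Parse the arguments that follow the binary name.
fn parse_verb(args: &[&str]) -> Result<AdminVerb, String> {
    match args {
        ["spawn", name] => Ok(AdminVerb::Spawn(name.to_string())),
        ["kill", name] => Ok(AdminVerb::Kill(name.to_string())),
        ["rebuild", name] => Ok(AdminVerb::Rebuild(name.to_string())),
        ["list"] => Ok(AdminVerb::List),
        [] => Err("missing verb".to_string()),
        [verb, ..] => Err(format!("bad usage of verb '{verb}'")),
    }
}

fn main() {
    // Skip argv[0]; everything after it is the verb and its arguments.
    let owned: Vec<String> = std::env::args().skip(1).collect();
    let args: Vec<&str> = owned.iter().map(String::as_str).collect();
    match parse_verb(&args) {
        Ok(verb) => println!("would send to admin socket: {verb:?}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```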

Phase 2 — sockets + minimal MCP

  • hive-c0re opens manager.sock and agents/<name>.sock (one per spawned agent). Per-socket MCP server with the right tool surface baked in. Types from hive-sh4re.
  • hive-ag3nt: MCP client (types from hive-sh4re), connects to its socket on startup, exchanges hello.
  • Tools: agent gets send(to, body), recv(). No persistence yet (in-memory).
  • Exit: two test agents, driven manually, exchange messages through hive-c0re.

Phase 3 — broker + turn loop

  • hive-c0re: sqlite-backed message store (messages table; id, sender, recipient, body, sent_at, delivered_at). Survives hive-c0re restart.
  • hive-ag3nt (lib): real turn loop. Reads from recv; feeds new messages as user turns to claude; captures output; calls send for outbound. Long-running.
  • Exit: two hive-ag3nt-driven agents have a back-and-forth conversation through hive-c0re.
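To make the message-store shape concrete, here is an in-memory stand-in that uses the same columns as the planned messages table. It is not the sqlite implementation: the `Broker` type, its methods, the integer timestamps, and the oldest-first delivery order are all assumptions for illustration.

```rust
/// One row of the planned `messages` table.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    id: u64,
    sender: String,
    recipient: String,
    body: String,
    sent_at: u64,              // assumption: seconds since some epoch
    delivered_at: Option<u64>, // None until the recipient has received it
}

/// Hypothetical in-memory substitute for the sqlite-backed store.
#[derive(Default)]
struct Broker {
    messages: Vec<Message>,
    next_id: u64,
}

impl Broker {
    /// Store a message and return its id.
    fn send(&mut self, sender: &str, recipient: &str, body: &str, now: u64) -> u64 {
        self.next_id += 1;
        self.messages.push(Message {
            id: self.next_id,
            sender: sender.to_string(),
            recipient: recipient.to_string(),
            body: body.to_string(),
            sent_at: now,
            delivered_at: None,
        });
        self.next_id
    }

    /// Hand out the oldest undelivered message for `recipient`, marking it delivered.
    fn recv(&mut self, recipient: &str, now: u64) -> Option<Message> {
        let msg = self
            .messages
            .iter_mut()
            .find(|m| m.recipient == recipient && m.delivered_at.is_none())?;
        msg.delivered_at = Some(now);
        Some(msg.clone())
    }
}

fn main() {
    let mut broker = Broker::default();
    broker.send("alice", "bob", "hello", 100);
    if let Some(m) = broker.recv("bob", 105) {
        println!(
            "#{} {} -> {}: {} (sent {}, delivered {:?})",
            m.id, m.sender, m.recipient, m.body, m.sent_at, m.delivered_at
        );
    }
}
```

Keeping `delivered_at` nullable is what lets undelivered messages be found again after a restart once the store is persistent.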

Phase 4 — hive-m1nd + privileged surface

  • hive-m1nd binary (second [[bin]] in hive-ag3nt) wires the manager tool surface.
  • Manager container (hm1nd) declared in host NixOS module (auto-restart). Bind-mount agents/** RW.
  • Manager socket gets the privileged tool surface: request_spawn/request_kill, request_apply_commit, inject_peer_info, send(..., wait_for_reply=true).
  • Smoke: attach a terminal to the manager container (nixos-container root-login); ask hive-m1nd to spawn an agent and route a message to it.
  • Exit: manager spawns, routes, kills a child agent end-to-end; lifecycle still gated by manual CLI approval (no GUI yet).

Phase 5 — git-commit approval flow

  • state-repo on host tracks world (agents directory listing, allow-lists, etc.).
  • Per-agent config/ flake repos created at spawn time.
  • Manager's container: bind-mounted clones; uses plain git CLI to edit/commit.
  • request_apply_commit(name, sha) queues a change in hive-c0re. Approval = CLI hive-c0re approve <sha>; on approve, hive-c0re fast-forwards main and reconciles (rebuild if config changed, run nixos-container update).
  • Per-agent allow-list for request_install: in-list installs become auto-applied commits; novel pkgs become pending commits.
  • Exit: manager adds a package to an agent → user approves on CLI → agent picks it up.
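The allow-list rule for request_install reduces to a single membership check. In this sketch, `InstallDecision` and `classify_install` are hypothetical names, and the package names are made-up examples.

```rust
use std::collections::HashSet;

/// What happens to the commit generated by a request_install call.
#[derive(Debug, PartialEq, Eq)]
enum InstallDecision {
    /// Package is on the agent's allow-list: the commit is applied without review.
    AutoApply,
    /// Novel package: the commit waits in the pending queue for approval.
    PendingApproval,
}

/// Hypothetical classifier for one requested package.
fn classify_install(allow_list: &HashSet<String>, pkg: &str) -> InstallDecision {
    if allow_list.contains(pkg) {
        InstallDecision::AutoApply
    } else {
        InstallDecision::PendingApproval
    }
}

fn main() {
    // Example allow-list; the real per-agent lists live in the state-repo.
    let allow: HashSet<String> = ["ripgrep", "jq"].iter().map(|s| s.to_string()).collect();
    for pkg in ["jq", "nmap"] {
        println!("{pkg}: {:?}", classify_install(&allow, pkg));
    }
}
```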

Phase 6 — per-agent web UI + dashboard MVP

  • hive-ag3nt web UI module (in the crate's lib): HTTP on a per-container host port (host network): status, last messages, embedded terminal (xterm.js over WebSocket). Both hive-ag3nt and hive-m1nd binaries expose it.
  • Dashboard served by hive-c0re: agent list, per-agent status, links to each agent's UI, link to manager's UI.
  • No approval UI yet; users still approve via CLI.
  • Exit: browser is a usable navigation layer over the whole system.

Phase 7 — dashboard commit view + polish

  • Pending-commits view in the dashboard with diff rendering and Approve/Deny buttons (replaces the CLI approve step).
  • Live message-flow view (hive-c0re sees all MCP relay traffic).
  • hive-c0re event push into hive-m1nd's next_event (crashes, OOM, approval resolved).
  • Resource caps in nixos-container units (MemoryMax, CPUQuota).
  • Migration plan for damocles' existing containers (separate doc).

Repo layout (target)

~/Repos/hyperhive/
├── hive-c0re/                        # host-side Rust crate (daemon + CLI)
│   ├── src/
│   │   ├── main.rs
│   │   ├── lifecycle.rs              # nixos-container CRUD
│   │   ├── broker.rs                 # sqlite message store
│   │   ├── mcp/                      # MCP servers (per-socket)
│   │   ├── approvals.rs              # git-commit change requests
│   │   ├── state_repo.rs             # world-state git repo
│   │   └── dashboard/                # HTMX/SSE web UI
│   └── Cargo.toml
├── hive-sh4re/                       # shared crate (hive-c0re ↔ hive-ag3nt)
│   ├── src/
│   │   ├── lib.rs
│   │   ├── protocol.rs               # MCP wire types, verb shapes
│   │   └── message.rs                # message types
│   └── Cargo.toml
├── hive-ag3nt/                       # in-container harness crate; two binaries
│   ├── src/
│   │   ├── lib.rs                    # shared harness code
│   │   ├── mcp_client.rs             # MCP client over unix socket
│   │   ├── turn_loop.rs              # claude subprocess driver
│   │   ├── web_ui.rs                 # per-container UI scaffolding
│   │   └── bin/
│   │       ├── hive-ag3nt.rs         # agent tool surface wiring
│   │       └── hive-m1nd.rs          # manager tool surface wiring
│   └── Cargo.toml                    # [[bin]] hive-ag3nt + [[bin]] hive-m1nd
├── nix/
│   ├── modules/hive-c0re.nix         # NixOS module for the host
│   ├── templates/
│   │   ├── agent-base.nix            # base agent nixos-container (pulls hive-ag3nt)
│   │   └── manager.nix               # manager nixos-container (pulls hive-m1nd)
│   └── overlay.nix                   # exposes the three crates + both bins as pkgs
├── web/                              # static assets, HTMX templates
├── flake.nix
├── Cargo.toml                        # workspace
└── PLAN.md                           # this file

Resolved implementation decisions

The original open-decisions list, with what we picked:

  • Wire format. Custom JSON-line over unix sockets (host admin / manager / per-agent), not real MCP stdio. Simpler and good enough for now; can swap to MCP later. SSE for the dashboard message-flow.
  • Per-agent web UI. axum HTTP server inside each container at a port hashed from the agent name (8100–8999); manager at fixed 8000; dashboard at 7000. Plain HTML, no HTMX, no xterm.js yet.
  • state-repo schema. Per-agent dir with files; not a single TOML. Realised as two parallel git repos per agent: proposed (manager-editable) and applied (hive-c0re-only). Container builds from applied.
  • Manager access to applied state. Not RW-mounted. Manager only has proposed/ bind-mounted; applied/ is hive-c0re-only.
  • One binary or two. One: hive-c0re is daemon + CLI dispatched by subcommand (serve / spawn / kill / rebuild / list / pending / approve / deny).
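Deriving a per-agent UI port from the agent name can be sketched as follows. The plan does not say which hash is used, so FNV-1a here is purely an assumption, as are the `agent_port` and `fnv1a` names; only the 8100 to 8999 range and the fixed manager and dashboard ports come from the list above.

```rust
const MANAGER_PORT: u16 = 8000;
const DASHBOARD_PORT: u16 = 7000;
const AGENT_PORT_BASE: u16 = 8100;
const AGENT_PORT_SPAN: u64 = 900; // covers 8100..=8999

/// 64-bit FNV-1a, used here only as a simple deterministic hash (assumption).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical mapping from a logical agent name to its web UI port.
fn agent_port(name: &str) -> u16 {
    // The remainder is below 900, so it always fits in a u16.
    AGENT_PORT_BASE + (fnv1a(name.as_bytes()) % AGENT_PORT_SPAN) as u16
}

fn main() {
    println!("dashboard: {DASHBOARD_PORT}, manager: {MANAGER_PORT}");
    for name in ["foo", "bar", "builder"] {
        println!("{name}: {}", agent_port(name));
    }
}
```

With only 900 slots, two agent names can hash to the same port, so any scheme like this needs a collision check at spawn time.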

Polish backlog (not phased)

See CLAUDE.md → "Polish backlog" for the live list. Highlights: operator inbox drain, per-agent UI substance, xterm.js terminal embed, destroy verb, bounded broker, container-crash events via D-Bus.

Explicitly deferred / out of v1 scope

  • Per-agent API keys, cost attribution.
  • Pooled / pre-warmed containers.
  • Destroy verb on the hive-c0re API (use rm on host).
  • Backup / replication of agents/ state.
  • Migration of existing damocles containers (docs/damocles-migration.md).
  • Anything about multiple hosts.