hyperhive

Multi-Claude-Code-agent orchestration on nixos-containers. A host-side Rust daemon (hive-c0re) spawns nspawn-isolated agent containers and brokers messages between them. A manager agent (hm1nd) coordinates the swarm and gates lifecycle changes on user approval via git commits, surfaced through a vibec0re-styled HTTP dashboard with live SSE message-flow.

PLAN.md is the living design doc. Read it for the why and the phase roadmap; this file is the operator/developer reference for the how.

Architecture

host (NixOS, hive-c0re.service)
│
├── hive-c0re  (Rust daemon — coordinator + dashboard + CLI)
│   ├── lifecycle      — nixos-container CRUD (spawn/kill/rebuild/list)
│   ├── broker         — sqlite message store + broadcast channel
│   ├── approvals      — sqlite approval queue
│   ├── coordinator    — shared state (broker/approvals/agent sockets)
│   ├── actions        — approve/deny (shared between admin socket & dashboard)
│   ├── server         — host admin socket (JSON line protocol)
│   ├── manager_server — manager-only privileged socket
│   ├── agent_server   — per-sub-agent sockets
│   ├── dashboard      — axum HTTP UI + SSE message-flow + approve/deny + T4LK
│   └── client         — admin-socket client (powers `hive-c0re spawn|kill|…`)
│
├── /run/hyperhive/
│   ├── host.sock                — admin CLI ↔ daemon
│   ├── manager.sock             → hm1nd container at /run/hive/mcp.sock
│   └── agents/<name>/mcp.sock   → h-<name> container at /run/hive/mcp.sock
│
├── /var/lib/hyperhive/
│   ├── broker.sqlite            — messages + approvals tables
│   ├── agents/<name>/config/    — proposed repo (manager-editable, RO to hive-c0re)
│   └── applied/<name>/          — applied repo (hive-c0re-only, container builds here)
│
└── nixos-containers
    ├── h-<name>  (sub-agents, hive-ag3nt binary)
    └── hm1nd     (manager, hive-m1nd binary)
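The broker pairs durable sqlite storage with a broadcast channel that feeds the dashboard's SSE panel. Below is a minimal std-only sketch of the fan-out half. It assumes one mpsc channel per subscriber, whereas the daemon itself is async and presumably uses an async broadcast channel. `Broker`, `Message`, and their fields are hypothetical, and persistence is elided.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical message shape; the real wire type lives in hive-sh4re.
#[derive(Clone, Debug, PartialEq)]
pub struct Message {
    pub from: String,
    pub to: String,
    pub body: String,
}

/// Fan-out half of a broker: each published message is cloned to every
/// live subscriber (for example, one per open SSE connection).
pub struct Broker {
    subscribers: Mutex<Vec<Sender<Message>>>,
}

impl Broker {
    pub fn new() -> Self {
        Broker {
            subscribers: Mutex::new(Vec::new()),
        }
    }

    /// Register a listener and return its receiving end.
    pub fn subscribe(&self) -> Receiver<Message> {
        let (tx, rx) = channel();
        self.subscribers.lock().unwrap().push(tx);
        rx
    }

    /// Deliver `msg` to every subscriber, pruning those whose receiver was
    /// dropped (e.g. a closed browser tab). Persisting to sqlite would
    /// happen before this fan-out. Returns how many subscribers remain.
    pub fn publish(&self, msg: &Message) -> usize {
        let mut subs = self.subscribers.lock().unwrap();
        subs.retain(|tx| tx.send(msg.clone()).is_ok());
        subs.len()
    }
}
```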

Crates / file map

hive-c0re/         host daemon + CLI (one binary, subcommand-dispatched)
  src/main.rs           clap setup; serve / spawn / kill / rebuild / list /
                         pending / approve / deny
  src/server.rs         host admin socket (HostRequest → dispatch)
  src/client.rs         admin-socket client
  src/manager_server.rs manager-privileged socket (ManagerRequest)
  src/agent_server.rs   per-sub-agent socket listener
  src/broker.rs         sqlite Message store + broadcast channel for SSE
  src/approvals.rs      sqlite Approval queue
  src/coordinator.rs    shared state (broker/approvals/agent_flake/sockets)
  src/actions.rs        approve/deny (admin socket + dashboard both call in)
  src/lifecycle.rs      `nixos-container` shellouts, per-agent flake generator,
                         systemd drop-ins, git helpers, agent_web_port hash
  src/dashboard.rs      axum HTTP UI: containers list, T4LK form, approvals
                         (diff + Approve/Deny buttons), SSE message flow

hive-ag3nt/        in-container harness crate; produces TWO binaries
  src/lib.rs            DEFAULT_SOCKET, DEFAULT_WEB_PORT, re-exports
  src/client.rs         generic JSON-line request/response over unix socket
  src/web_ui.rs         per-container axum HTTP page (label + placeholder)
  src/bin/hive-ag3nt.rs sub-agent CLI (serve/send/recv); turn loop + web UI
  src/bin/hive-m1nd.rs  manager CLI (serve/send/recv/spawn/kill/
                         request-apply-commit); recognises HelperEvent

hive-sh4re/        wire types (HostRequest/Response, AgentRequest/Response,
                   ManagerRequest/Response, Message, Approval, HelperEvent)

nix/
  modules/hive-c0re.nix       systemd service + firewall + git path wiring
  templates/agent-base.nix    sub-agent nixos-container template
  templates/manager.nix       manager nixos-container template

tests/roundtrip.sh   Phase 3 messaging round-trip
tests/approval.sh    Phase 5 end-to-end approval flow
tests/dashboard.sh   Phase 6+7 HTTP dashboard + SSE + orphan GC

docs/damocles-migration.md   options for moving damocles onto hyperhive

Conventions

  • Naming. Container names are length-bounded (nixos-container allows at most 11 characters). Sub-agents are h-<name> with <name> ≤ 9 chars; the manager is hm1nd. MAX_AGENT_NAME in lifecycle.rs enforces the cap. Per-agent web UI port = WEB_PORT_BASE + FNV1a(name) % WEB_PORT_RANGE (8100..8999); the manager is fixed at 8000; the dashboard listens on cfg.dashboardPort (default 7000).
  • Identity = socket. No auth/tokens on the per-agent sockets. The socket path identifies the principal; permissions come from who has the socket bind-mounted.
  • Wire protocol. JSON line-delimited over unix sockets in both directions (host admin / manager / agent). /messages/stream is text/event-stream.
  • Commit messages. Short, lowercase, no Co-Authored-By trailer.
  • Commit before test. Stage and commit when work looks ready, then run validation (cargo check, nix flake check, real lpt2 deploy). Failures get a follow-up commit rather than an amend.
  • rebuild is the reconcile verb. Idempotently rewrites /etc/nixos-containers/<C>.conf (PRIVATE_NETWORK=0, clears HOST_ADDRESS/LOCAL_ADDRESS, sets EXTRA_NSPAWN_FLAGS), regenerates applied/<name>/flake.nix, writes the systemd limits drop-in, then nixos-container update + stop + start. Anything that changes per-container state on the host should be re-applied here.
  • Actions are factored. approve / deny live in actions.rs; the admin socket and the dashboard POST handlers both call into them, so the two surfaces never drift.
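The naming and port rules can be sketched as follows. The sketch rests on three assumptions. First, it uses 32-bit FNV-1a, although the text does not say which width lifecycle.rs uses. Second, it sets WEB_PORT_RANGE = 900, so ports span 8100..=8999. Third, `container_name` is a hypothetical helper. Only `MAX_AGENT_NAME` and `agent_web_port` are names taken from the text above.

```rust
/// nixos-container caps container names at 11 chars; the "h-" prefix uses 2.
pub const MAX_CONTAINER_NAME: usize = 11;
pub const MAX_AGENT_NAME: usize = MAX_CONTAINER_NAME - 2;
/// Assumed constants for the 8100..8999 window (inclusive, so 900 ports).
pub const WEB_PORT_BASE: u16 = 8100;
pub const WEB_PORT_RANGE: u32 = 900;

/// 32-bit FNV-1a (assumption: the real code may use the 64-bit variant).
pub fn fnv1a32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Hypothetical helper: agent name -> container name, enforcing the cap.
pub fn container_name(agent: &str) -> Result<String, String> {
    if agent.is_empty() || agent.len() > MAX_AGENT_NAME {
        return Err(format!(
            "agent name must be 1..={} bytes, got {:?}",
            MAX_AGENT_NAME, agent
        ));
    }
    Ok(format!("h-{}", agent))
}

/// Deterministic per-agent web UI port in 8100..=8999.
pub fn agent_web_port(agent: &str) -> u16 {
    WEB_PORT_BASE + (fnv1a32(agent.as_bytes()) % WEB_PORT_RANGE) as u16
}
```

Because the port is a pure function of the name, a rebuild or daemon restart always puts an agent back on the same port. Two names can still collide modulo 900, so a real implementation may want to detect that case.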
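The JSON-line protocol needs only newline framing on top of a unix stream. Below is a hypothetical std-only client that writes one request line and reads back one reply line. Serialization is left to the caller; a real crate would presumably use serde_json, whose compact output escapes newlines inside strings. `LineClient` is an illustrative name and does not come from hive-ag3nt's client.rs.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

/// Hypothetical client: one JSON request line out, one JSON reply line back.
pub struct LineClient {
    writer: UnixStream,
    reader: BufReader<UnixStream>,
}

impl LineClient {
    pub fn new(stream: UnixStream) -> io::Result<Self> {
        // Keep one BufReader for the connection's lifetime so bytes it has
        // read ahead are never lost between requests.
        let reader = BufReader::new(stream.try_clone()?);
        Ok(LineClient { writer: stream, reader })
    }

    pub fn request(&mut self, json: &str) -> io::Result<String> {
        // A raw newline would split the request into two frames.
        if json.contains('\n') {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "request must be one line"));
        }
        self.writer.write_all(json.as_bytes())?;
        self.writer.write_all(b"\n")?;
        self.writer.flush()?;

        let mut line = String::new();
        if self.reader.read_line(&mut line)? == 0 {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "socket closed before reply"));
        }
        // Strip the trailing newline delimiter.
        Ok(line.trim_end_matches('\n').to_string())
    }
}
```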
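On the /messages/stream side, each broker message becomes one Server-Sent Events frame. axum provides `Sse`/`Event` types that handle this encoding. The hand-rolled `sse_frame` below is a hypothetical helper that only illustrates the wire shape:

- an optional `event:` line,
- one `data:` line per payload line,
- a blank line that terminates the event.

```rust
/// Format one SSE frame. Assumes `event` contains no newline; a multi-line
/// payload is split into several `data:` lines, as the SSE format requires.
pub fn sse_frame(event: Option<&str>, data: &str) -> String {
    let mut out = String::new();
    if let Some(name) = event {
        out.push_str("event: ");
        out.push_str(name);
        out.push('\n');
    }
    for line in data.split('\n') {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    // The empty line dispatches the event to the browser's EventSource.
    out.push('\n');
    out
}
```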
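The systemd limits drop-in that rebuild writes can be rendered as plain text. nixos-containers run as container@<name>.service, and systemd merges `<unit>.d/*.conf` drop-ins found in its unit search path. The directory and the file name `hyperhive-limits.conf` are hypothetical here, so the base directory is a parameter. The values shown match Phase 7d (MemoryMax=2G, CPUQuota=50%).

```rust
use std::path::{Path, PathBuf};

/// Hypothetical location: <unit_dir>/container@<name>.service.d/hyperhive-limits.conf
pub fn limits_dropin_path(unit_dir: &Path, container: &str) -> PathBuf {
    unit_dir
        .join(format!("container@{}.service.d", container))
        .join("hyperhive-limits.conf")
}

/// Render the [Service] override carrying the resource limits.
pub fn limits_dropin(memory_max: &str, cpu_quota_percent: u32) -> String {
    format!(
        "[Service]\nMemoryMax={}\nCPUQuota={}%\n",
        memory_max, cpu_quota_percent
    )
}
```

After writing the file, systemd only sees it following a `systemctl daemon-reload`. The subsequent stop + start in rebuild is what applies the new limits to the running container.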

Gotchas / lessons learned

  • nixos-container doesn't expose --bind on the CLI. The way in is EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf: the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation. We rewrite this line in set_nspawn_flags().
  • /run/systemd/nspawn/*.nspawn overrides are ignored by nixos-container's start script (it builds the nspawn cmd line directly).
  • boot.isNspawnContainer = true, not boot.isContainer = true. Renamed in nixos-25.11+.
  • nixos-container create auto-assigns HOST_ADDRESS/LOCAL_ADDRESS in the .conf. When HOST_ADDRESS is set, the start script takes its --network-veth branch, which forces a private netns. That is silently fatal for our web UIs: the bind is invisible from the host. We force-clear those vars (and HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE) and set PRIVATE_NETWORK=0.
  • systemd service PATH ≠ host PATH. Our service explicitly sets path = [ pkgs.git "/run/current-system/sw" ]. Additionally, environment.HYPERHIVE_GIT = "${pkgs.git}/bin/git" bakes the absolute path in (read by lifecycle::git_command()) so git resolution doesn't depend on PATH plumbing at all.
  • RuntimeDirectoryPreserve = "yes" keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.
  • register_agent is idempotent — drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.
  • claude-code is unfree. agent-base.nix allow-lists it specifically. The flake pins it to nixpkgs-unstable via overlays.claude-unstable (stable lags too far). The overlay imports unstable with its own allowUnfreePredicate, so referencing claude-code inside the overlay doesn't itself trip the unfree check.
  • Claude credentials are stateful and per-container. There is no ANTHROPIC_API_KEY env var path. For now: nixos-container root-login h-<name> → claude (interactive) → log in once. The harness falls back to echo replies when claude --print fails. Future: bind-mount a shared ~/.claude dir from the host so creds survive container destroy/recreate.
  • Echo guard. hive-ag3nt serve skips auto-reply when the incoming body starts with "echo: ". This prevents ping-pong loops when both sides fall back to echo. Real conversations between claude-backed agents can still run away; bounding them is the manager's job.
  • Orphan approvals. If state dirs are wiped out from under a pending approval (test scripts, manual rm -rf), the dashboard's next render marks them failed with note "agent state dir missing" so they fall out of pending. They stay in sqlite for audit.
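The .conf rewrite that rebuild performs can be sketched as a pure, idempotent text transform; the notes above say set_nspawn_flags() implements it. This is an illustrative reimplementation, not the actual function, and `reconcile_conf` is a hypothetical name. Unknown lines are preserved in order, and managed keys that are missing get appended. The EXTRA_NSPAWN_FLAGS value is written verbatim, so quoting is the caller's responsibility.

```rust
/// Address/bridge keys filled in by `nixos-container create`; blanked so the
/// start script stays on the host network.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Rewrite a /etc/nixos-containers/<NAME>.conf body idempotently.
pub fn reconcile_conf(conf: &str, extra_nspawn_flags: &str) -> String {
    let mut wanted: Vec<(&str, String)> = vec![("PRIVATE_NETWORK", "0".to_string())];
    for key in CLEARED_KEYS.iter() {
        wanted.push((*key, String::new()));
    }
    wanted.push(("EXTRA_NSPAWN_FLAGS", extra_nspawn_flags.to_string()));

    let mut seen = vec![false; wanted.len()];
    let mut out = String::new();
    for line in conf.lines() {
        // The key is everything before the first '='; blank/comment lines never match.
        let key = line.split('=').next().unwrap_or("").trim();
        match wanted.iter().position(|w| w.0 == key) {
            Some(i) if !seen[i] => {
                seen[i] = true;
                out.push_str(&format!("{}={}\n", wanted[i].0, wanted[i].1));
            }
            Some(_) => {} // drop duplicate assignments of a managed key
            None => {
                out.push_str(line);
                out.push('\n');
            }
        }
    }
    // Append managed keys the file did not mention yet.
    for (i, w) in wanted.iter().enumerate() {
        if !seen[i] {
            out.push_str(&format!("{}={}\n", w.0, w.1));
        }
    }
    out
}
```

Applying the transform twice yields the same text, which is what makes rebuild safe to re-run. A plausible flags value would be a systemd-nspawn bind such as `"--bind=/run/hyperhive/agents/alice:/run/hive"`, matching the socket layout in the architecture tree. The exact flags hive-c0re passes are not specified here.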
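The HYPERHIVE_GIT fallback amounts to a few lines. The structure below is an assumption, not a copy of lifecycle.rs: a testable `git_program` (hypothetical) that receives the override explicitly, plus the `git_command` named in the note above, which reads the environment.

```rust
use std::ffi::OsString;
use std::process::Command;

/// An absolute path from HYPERHIVE_GIT wins; otherwise fall back to a PATH
/// lookup of plain "git". An empty override is treated as unset.
pub fn git_program(override_path: Option<OsString>) -> OsString {
    match override_path {
        Some(p) if !p.is_empty() => p,
        _ => OsString::from("git"),
    }
}

/// Build a git Command without depending on the service's PATH when the
/// module has baked in HYPERHIVE_GIT.
pub fn git_command() -> Command {
    Command::new(git_program(std::env::var_os("HYPERHIVE_GIT")))
}
```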
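The idempotent re-registration in register_agent depends on one detail. A dropped `UnixListener` does not unlink its socket file, so binding the same path again fails with AddrInUse unless the stale file is removed first. Below is a std-only sketch with a hypothetical `AgentSockets` registry. Dropping the old listener stands in for cancelling the old socket task, since the real daemon is async.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::net::UnixListener;
use std::path::{Path, PathBuf};

/// Bind `path` after removing any stale socket file left by an earlier run.
pub fn bind_fresh(path: &Path) -> io::Result<UnixListener> {
    if let Some(dir) = path.parent() {
        fs::create_dir_all(dir)?;
    }
    match fs::remove_file(path) {
        Ok(()) => {}
        Err(ref e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    UnixListener::bind(path)
}

/// Hypothetical per-agent socket registry rooted at e.g. /run/hyperhive/agents.
pub struct AgentSockets {
    root: PathBuf,
    listeners: HashMap<String, UnixListener>,
}

impl AgentSockets {
    pub fn new(root: PathBuf) -> Self {
        AgentSockets { root, listeners: HashMap::new() }
    }

    /// Safe to call repeatedly for the same name: the prior listener is
    /// dropped before the socket is rebound.
    pub fn register_agent(&mut self, name: &str) -> io::Result<PathBuf> {
        drop(self.listeners.remove(name));
        let path = self.root.join(name).join("mcp.sock");
        let listener = bind_fresh(&path)?;
        self.listeners.insert(name.to_string(), listener);
        Ok(path)
    }
}
```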
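The echo guard reduces to a prefix check made before any reply is produced. In this sketch, `auto_reply` and its `claude_reply` argument are hypothetical, and `None` for `claude_reply` models a failed `claude --print`.

```rust
/// Prefix the harness puts on fallback replies.
pub const ECHO_PREFIX: &str = "echo: ";

/// Decide how to answer an inbound message; None means stay silent.
pub fn auto_reply(body: &str, claude_reply: Option<String>) -> Option<String> {
    if body.starts_with(ECHO_PREFIX) {
        // Never answer an echo: two echo-fallback agents would ping-pong forever.
        return None;
    }
    // Prefer the claude turn's output; otherwise fall back to echoing.
    Some(claude_reply.unwrap_or_else(|| format!("{}{}", ECHO_PREFIX, body)))
}
```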
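The orphan-approval GC can be sketched over in-memory rows standing in for the sqlite table. It assumes an agent's state dir is `<state_root>/<agent>`, for example /var/lib/hyperhive/agents/alice. `Approval`, `Status`, and `gc_orphans` are hypothetical names. Rows are updated rather than deleted, matching the audit note above.

```rust
use std::path::Path;

#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Pending,
    Approved,
    Denied,
    Failed,
}

#[derive(Debug, Clone)]
pub struct Approval {
    pub id: u64,
    pub agent: String,
    pub status: Status,
    pub note: Option<String>,
}

/// Mark pending approvals whose agent state dir vanished as failed, so they
/// drop out of the pending list. Returns the ids that were transitioned.
pub fn gc_orphans(approvals: &mut [Approval], state_root: &Path) -> Vec<u64> {
    let mut failed = Vec::new();
    for a in approvals.iter_mut() {
        if a.status == Status::Pending && !state_root.join(&a.agent).exists() {
            a.status = Status::Failed;
            a.note = Some("agent state dir missing".to_string());
            failed.push(a.id);
        }
    }
    failed
}
```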

Build / deploy / test

# inside the repo (devshell first; no global cargo)
nix develop -c cargo check
nix develop -c cargo clippy --workspace --all-targets -- -D warnings
nix develop -c cargo build

# evaluate everything (incl. rust+nix+toml fmt + clippy)
nix flake check

# build only the workspace package
nix build .#default
ls ./result/bin/   # hive-c0re  hive-ag3nt  hive-m1nd

# deploy to an existing host that imports hyperhive.nixosModules.hive-c0re
cd ~/Repos/<nixos-config-repo>
nix flake update --update-input hyperhive
sudo nixos-rebuild switch --flake .#<host>
sudo systemctl restart hive-c0re   # if only env/options changed

# end-to-end tests (each idempotent; runs as root)
sudo bash tests/roundtrip.sh    # alice ↔ bob echo round-trip
sudo bash tests/approval.sh     # manager edit → request → user approve → rebuilt
sudo bash tests/dashboard.sh    # HTTP UI, approve POST, SSE, orphan GC

The host config also needs hyperhive.overlays.default applied: the module's default package is pkgs.hyperhive, which only exists once that overlay is in place. The claude-unstable overlay is already applied internally to the per-agent flakes.

Phase status

  • Phase 0 — repo + Cargo workspace + flake + agent-base template
  • Phase 1 — container lifecycle; nixos-container update hot-reload works under the patch stack (validated on muede-lpt2)
  • Phase 2 — per-agent sockets, in-memory broker, agent harness round-trips
  • Phase 3 — sqlite broker (durable) + claude-or-echo turn loop
  • Phase 4 — hm1nd manager binary + manager socket + declarative containers.hm1nd
  • Phase 5 — git-commit approval flow
    • 5a — sqlite approval queue (request_apply_commit/pending/approve/deny)
    • 5b — per-agent config flakes
    • 5c — the manager edits proposed; only hive-c0re writes applied, and the container builds from applied. Approve = read agent.nix at the approved commit from proposed, copy it into applied, commit + rebuild. The manager cannot move applied/main on its own.
  • Phase 6 — per-container web UIs (HIVE_PORT deterministic-hash) + hive-c0re dashboard (default 7000, vibec0re aesthetic, deep-linked)
  • Phase 7 — polish:
    • 7a — dashboard Approve/Deny buttons + unified diff (similar crate)
    • 7b — broker broadcast + /messages/stream SSE + live message-flow panel
    • 7c — ApprovalResolved helper events into manager inbox
    • 7d — MemoryMax=2G + CPUQuota=50% systemd drop-in per container
    • 7e — damocles migration plan (docs/damocles-migration.md)
  • Phase 7 follow-ups:
    • Dashboard T4LK form — operator can send messages from the browser (POST /send, becomes from: "operator" broker message)
    • Orphan-approval GC on dashboard render (stale entries auto-failed)
    • PRIVATE_NETWORK=0 + HOST_ADDRESS=/LOCAL_ADDRESS= cleared in set_nspawn_flags so sub-agent web UI ports are reachable on the host
    • HYPERHIVE_GIT env var (absolute path) bypasses PATH ambiguity

Approval flow

End-to-end: manager edits per-agent proposed repo → commits → submits commit sha → user approves on host CLI or dashboard button → hive-c0re reads the file at that sha from proposed, applies into applied, commits there, runs nixos-container update. Helper-event JSON lands in the manager's inbox.

# Inside the hm1nd container (manager has /agents bind-mounted RW):
cd /agents/alice/config
$EDITOR agent.nix              # e.g. environment.systemPackages = [ pkgs.htop ];
git commit -am "add htop"
SHA=$(git rev-parse HEAD)
hive-m1nd request-apply-commit alice $SHA
exit

# On the host (CLI):
sudo hive-c0re pending          # shows queued approval with id N
sudo hive-c0re approve N        # validates, applies, rebuilds
sudo nixos-container run h-alice -- which htop

# Or on the dashboard (browser):
http://<host>:7000/             # ◆ APPR0VE button next to the diff
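The approve step's git plumbing can be sketched with `std::process::Command`. Reading the file with `git show <sha>:agent.nix` takes its contents exactly as committed and ignores any uncommitted edits in the proposed working tree. `looks_like_sha`, `show_agent_nix`, and `commit_applied` are hypothetical helpers. The SHA check assumes SHA-1 object names. The `git` argument would come from the HYPERHIVE_GIT resolution described in the gotchas.

```rust
use std::ffi::OsStr;
use std::path::Path;
use std::process::Command;

/// Cheap validation before shelling out: a (possibly abbreviated) hex SHA-1.
pub fn looks_like_sha(s: &str) -> bool {
    (7..=40).contains(&s.len()) && s.chars().all(|c| c.is_ascii_hexdigit())
}

/// `git -C <proposed> show <sha>:agent.nix`, whose stdout is the file at that commit.
pub fn show_agent_nix(git: &OsStr, proposed: &Path, sha: &str) -> Command {
    let mut cmd = Command::new(git);
    cmd.arg("-C")
        .arg(proposed)
        .arg("show")
        .arg(format!("{}:agent.nix", sha));
    cmd
}

/// `git -C <applied> commit -am <message>` after agent.nix was overwritten.
pub fn commit_applied(git: &OsStr, applied: &Path, message: &str) -> Command {
    let mut cmd = Command::new(git);
    cmd.arg("-C").arg(applied).arg("commit").arg("-am").arg(message);
    cmd
}
```

The caller would run the first command, write its stdout over applied/<name>/agent.nix, run the second, and then trigger the rebuild. `git commit -am` exits non-zero when nothing changed, so approving a commit identical to what is already applied needs explicit handling.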

Per-agent layout — two separate git repos:

/var/lib/hyperhive/agents/<name>/config/    # proposed — manager edits, hive-c0re reads only
├── .git/
└── agent.nix                               # the only file the manager can change
                                            # (initial commit by hive-c0re on first spawn,
                                            # never touched by hive-c0re again)

/var/lib/hyperhive/applied/<name>/          # applied — hive-c0re-only; container builds here
├── .git/
├── flake.nix                               # hive-c0re-managed; references hyperhive_flake
└── agent.nix                               # overwritten by approve from the proposed commit

The container's --flake ref is <applied_dir>#default. The flake's nixosConfigurations.default extends hyperhive.nixosConfigurations.agent-base with ./agent.nix plus an inline module that sets environment.etc."gitconfig".text (committer identity = the agent's name) and systemd.services.hive-ag3nt.environment.HIVE_PORT/HIVE_LABEL.

Polish backlog

Not phased — pick when relevant:

  • Operator inbox view — drain replies addressed to operator and show them in the dashboard (today they accumulate unread in sqlite).
  • Per-agent UI substance — show last N inbox messages, last turn timing, link back to dashboard.
  • xterm.js terminal — embed in each per-container UI, attach to a PTY exposed by the harness.
  • destroy verb — currently nixos-container destroy + manual rm -rf. Should be one hive-c0re verb that also purges approvals + state dirs.
  • Bounded broker — cap rows per recipient or auto-vacuum delivered messages older than a threshold.
  • Container crash events — watch container@*.service via D-Bus, push HelperEvent::ContainerCrash to the manager.

Inspirations

  • ~/Repos/bitburner-agent — sibling project, drives Claude Code in a turn loop against a Bitburner CDP session. Patterns to steal as we grow: per-cycle prompt diffing (vs full state), notes compaction as a separate short-lived Claude session, MCP server registering tools from a single TOOLS array, dashboard with SSE + xterm.js + sqlite stats sampler, opaque "terminal event" stream that unifies tool-call / sleep / op-notice / etc.