hyperhive/CLAUDE.md

# hyperhive

Multi-Claude-Code-agent orchestration on **nixos-containers**. A host-side Rust
daemon (`hive-c0re`) spawns nspawn-isolated agent containers and brokers
messages between them. A manager agent (`hm1nd`) coordinates the swarm and
gates lifecycle changes on user approval via git commits, surfaced through a
vibec0re-styled HTTP dashboard with live SSE message-flow.

**PLAN.md** is the living design doc. Read it for the *why* and the phase
roadmap; this file is the operator/developer reference for the *how*.

## Architecture

```
host (NixOS, hive-c0re.service)
│
├── hive-c0re  (Rust daemon — coordinator + dashboard + CLI)
│   ├── lifecycle      — nixos-container CRUD (spawn/kill/rebuild/list)
│   ├── broker         — sqlite message store + broadcast channel
│   ├── approvals      — sqlite approval queue
│   ├── coordinator    — shared state (broker/approvals/agent sockets)
│   ├── actions        — approve/deny (shared between admin socket & dashboard)
│   ├── server         — host admin socket (JSON line protocol)
│   ├── manager_server — manager-only privileged socket
│   ├── agent_server   — per-sub-agent sockets
│   ├── dashboard      — axum HTTP UI + SSE message-flow + approve/deny + T4LK
│   └── client         — admin-socket client (powers `hive-c0re spawn|kill|…`)
│
├── /run/hyperhive/
│   ├── host.sock                — admin CLI ↔ daemon
│   ├── manager.sock             → hm1nd container at /run/hive/mcp.sock
│   └── agents/<name>/mcp.sock   → h-<name> container at /run/hive/mcp.sock
│
├── /var/lib/hyperhive/
│   ├── broker.sqlite            — messages + approvals tables
│   ├── agents/<name>/config/    — proposed repo (manager-editable, RO to hive-c0re)
│   └── applied/<name>/          — applied repo (hive-c0re-only, container builds here)
│
└── nixos-containers
    ├── h-<name>  (sub-agents, hive-ag3nt binary)
    └── hm1nd     (manager, hive-m1nd binary)
```
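The sockets in the diagram all speak the same framing: one JSON value per line, in both directions. The sketch below illustrates only that framing, using `UnixStream::pair()` so it runs without a daemon. The `{"op":"list"}` request, the `{"ok":...}` reply, and the helper names `roundtrip`, `serve_lines` and `demo` are assumptions made for illustration; the real request and response types live in `hive-sh4re` and are not reproduced here.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;
use std::thread;

/// Send one request line, then read one response line.
/// `json` must not contain a newline: the newline is the frame delimiter.
fn roundtrip(stream: &mut UnixStream, json: &str) -> io::Result<String> {
    assert!(!json.contains('\n'), "one JSON value per line");
    stream.write_all(json.as_bytes())?;
    stream.write_all(b"\n")?;
    stream.flush()?;

    let mut reader = BufReader::new(stream.try_clone()?);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line.trim_end().to_string())
}

/// Toy server: wrap every request line in a reply line until the peer hangs up.
fn serve_lines(stream: UnixStream) -> io::Result<()> {
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut writer = stream;
    let mut line = String::new();
    while reader.read_line(&mut line)? != 0 {
        // Hypothetical reply shape, not the daemon's actual response type.
        let reply = format!("{{\"ok\":true,\"echo\":{}}}\n", line.trim_end());
        writer.write_all(reply.as_bytes())?;
        line.clear();
    }
    Ok(())
}

/// Connect a client and a server over an in-process socket pair.
fn demo() -> io::Result<String> {
    let (mut client, server) = UnixStream::pair()?;
    let handle = thread::spawn(move || serve_lines(server));
    let reply = roundtrip(&mut client, r#"{"op":"list"}"#)?;
    drop(client); // closing the client end lets the server loop see EOF
    handle.join().expect("server thread panicked")?;
    Ok(reply)
}

fn main() -> io::Result<()> {
    println!("{}", demo()?);
    Ok(())
}
```

Because the frame delimiter is a newline, any serializer used on these sockets has to emit compact single-line JSON.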

## Crates / file map

```
hive-c0re/         host daemon + CLI (one binary, subcommand-dispatched)
  src/main.rs           clap setup; serve / spawn / kill / rebuild / list /
                         pending / approve / deny
  src/server.rs         host admin socket (HostRequest → dispatch)
  src/client.rs         admin-socket client
  src/manager_server.rs manager-privileged socket (ManagerRequest)
  src/agent_server.rs   per-sub-agent socket listener
  src/broker.rs         sqlite Message store + broadcast channel for SSE
  src/approvals.rs      sqlite Approval queue
  src/coordinator.rs    shared state (broker/approvals/agent_flake/sockets)
  src/actions.rs        approve/deny (admin socket + dashboard both call in)
  src/lifecycle.rs      `nixos-container` shellouts, per-agent flake generator,
                         systemd drop-ins, git helpers, agent_web_port hash
  src/dashboard.rs      axum HTTP UI: containers list, T4LK form, approvals
                         (diff + Approve/Deny buttons), SSE message flow

hive-ag3nt/        in-container harness crate; produces TWO binaries
  src/lib.rs            DEFAULT_SOCKET, DEFAULT_WEB_PORT, re-exports
  src/client.rs         generic JSON-line request/response over unix socket
  src/web_ui.rs         per-container axum HTTP page (label + placeholder)
  src/bin/hive-ag3nt.rs sub-agent CLI (serve/send/recv); turn loop + web UI
  src/bin/hive-m1nd.rs  manager CLI (serve/send/recv/spawn/kill/
                         request-apply-commit); recognises HelperEvent

hive-sh4re/        wire types (HostRequest/Response, AgentRequest/Response,
                   ManagerRequest/Response, Message, Approval, HelperEvent)

nix/
  modules/hive-c0re.nix       systemd service + firewall + git path wiring
  templates/agent-base.nix    sub-agent nixos-container template
  templates/manager.nix       manager nixos-container template

tests/roundtrip.sh   Phase 3 messaging round-trip
tests/approval.sh    Phase 5 end-to-end approval flow
tests/dashboard.sh   Phase 6+7 HTTP dashboard + SSE + orphan GC

docs/damocles-migration.md   options for moving damocles onto hyperhive
```

## Conventions

- **Naming.** Container names are length-bounded (`nixos-container` ≤ 11 chars).
  Sub-agents are `h-<name>` with `<name>` ≤ 9 chars; the manager is `hm1nd`.
  `MAX_AGENT_NAME` enforces the cap in `lifecycle.rs`. Per-agent web UI port =
  `WEB_PORT_BASE + FNV1a(name) % WEB_PORT_RANGE` (8100..8999); manager fixed
  at 8000; dashboard `cfg.dashboardPort` (default 7000).
- **Identity = socket.** No auth/tokens on the per-agent sockets. The socket
  *path* identifies the principal; perms come from "who has the bind-mount."
- **Wire protocol.** JSON line-delimited over unix sockets in both directions
  (host admin / manager / agent). `/messages/stream` is `text/event-stream`.
- **Commit messages.** Short, lowercase, no Co-Authored-By trailer.
- **Commit before test.** Stage and commit when work *looks* ready, then run
  validation (`cargo check`, `nix flake check`, real lpt2 deploy). Failures get
  a follow-up commit rather than an amend.
- **`rebuild` is the reconcile verb.** Idempotently rewrites
  `/etc/nixos-containers/<C>.conf` (`PRIVATE_NETWORK=0`, clears
  HOST_ADDRESS/LOCAL_ADDRESS, sets `EXTRA_NSPAWN_FLAGS`), regenerates
  `applied/<name>/flake.nix`, writes the systemd limits drop-in, then
  `nixos-container update` + stop + start. Anything that changes per-container
  state on the host should be re-applied here.
- **Actions are factored.** `approve` / `deny` live in `actions.rs`; the admin
  socket and the dashboard POST handlers both call into them, so the two
  surfaces never drift.
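The naming and port rules above can be sketched as follows. This is a minimal illustration rather than the code in `lifecycle.rs`: the constant values are inferred from the stated `8100..8999` range and the 9-character cap, the 32-bit FNV-1a variant is an assumption (the hash width is not stated), and `container_name` is a hypothetical helper.

```rust
// Assumed values: inferred from the documented range and name cap.
const WEB_PORT_BASE: u32 = 8100;
const WEB_PORT_RANGE: u32 = 900;
const MAX_AGENT_NAME: usize = 9;

/// 32-bit FNV-1a over raw bytes (assumed width).
fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Deterministic per-agent web UI port in 8100..=8999.
fn agent_web_port(name: &str) -> u16 {
    (WEB_PORT_BASE + fnv1a_32(name.as_bytes()) % WEB_PORT_RANGE) as u16
}

/// Hypothetical helper: enforce the name cap and build the `h-<name>` container name.
fn container_name(name: &str) -> Result<String, String> {
    if name.is_empty() || name.len() > MAX_AGENT_NAME {
        return Err(format!(
            "agent name must be 1..={} bytes, got {}",
            MAX_AGENT_NAME,
            name.len()
        ));
    }
    Ok(format!("h-{}", name))
}

fn main() {
    for name in ["alice", "bob"] {
        println!("{:?} -> port {}", container_name(name), agent_web_port(name));
    }
}
```

With only 900 slots, two names can hash to the same port; how the real implementation handles such a collision is not covered here.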

## Gotchas / lessons learned

- **`nixos-container` doesn't expose `--bind` on the CLI.** The workaround is
  `EXTRA_NSPAWN_FLAGS` in `/etc/nixos-containers/<NAME>.conf`: the start
  script (`/nix/store/.../container_-start`) expands it unquoted into the
  `systemd-nspawn` invocation. We rewrite this line in `set_nspawn_flags()`.
- **`/run/systemd/nspawn/*.nspawn` overrides are *ignored*** by
  `nixos-container`'s start script (it builds the nspawn cmd line directly).
- **`boot.isNspawnContainer = true`**, not `boot.isContainer = true`. Renamed
  in nixos-25.11+.
- **`nixos-container create` auto-assigns `HOST_ADDRESS`/`LOCAL_ADDRESS`** in
  the `.conf`. The start script's `if HOST_ADDRESS set → --network-veth`
  branch then forces a private netns — which is silently fatal for our web
  UIs (the bind is invisible from the host). We force-clear those vars (and
  `HOST_ADDRESS6` / `LOCAL_ADDRESS6` / `HOST_BRIDGE`) plus set
  `PRIVATE_NETWORK=0`.
- **systemd service PATH ≠ host PATH.** Our service explicitly sets
  `path = [ pkgs.git "/run/current-system/sw" ]`. Additionally,
  `environment.HYPERHIVE_GIT = "${pkgs.git}/bin/git"` bakes the absolute path
  in (read by `lifecycle::git_command()`) so git resolution doesn't depend on
  PATH plumbing at all.
- **`RuntimeDirectoryPreserve = "yes"`** keeps `/run/hyperhive/` (and the
  per-agent sub-dirs) across `hive-c0re` restarts. Without it, every restart
  wipes bind sources and existing containers can't be started.
- **`register_agent` is idempotent** — drops any prior socket task before
  rebinding. Required so a `hive-c0re` restart followed by `rebuild alice`
  recreates the agent's socket without needing a clean reinstall.
- **`claude-code` is unfree.** `agent-base.nix` allow-lists it specifically.
  The flake pins it to **nixpkgs-unstable** via `overlays.claude-unstable`
  (stable lags too far). The overlay imports unstable with its own
  `allowUnfreePredicate` so the access inside the overlay doesn't itself trip.
- **Claude credentials are stateful and per-container.** No `ANTHROPIC_API_KEY`
  env var path. Today's stopgap: `nixos-container root-login h-<name>` →
  `claude` (interactive) → log in once. The harness falls back to echo
  replies when `claude --print` fails. **Phase 8** moves this to a per-agent
  persistent dir at `/var/lib/hyperhive/agents/<name>/claude/` bind-mounted
  into the container, with the interactive login driven from the agent's web
  UI. Sharing one `~/.claude` across agents is NOT viable — OAuth refresh
  tokens rotate, so any sibling refresh invalidates all the others.
- **Echo guard.** `hive-ag3nt serve` skips auto-reply when the incoming body
  starts with `"echo: "`. Prevents ping-pong loops when both sides fall back
  to echo. Real conversations between claude-backed agents *will* run away;
  bounding them is the manager's job.
- **Orphan approvals.** If state dirs are wiped out from under pending
  approvals (test scripts, manual `rm -rf`), the dashboard's next render
  marks them `failed` with note `"agent state dir missing"` so they fall out
  of `pending`. They stay in sqlite for audit.
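The `.conf` handling described in the first and fourth gotchas can be sketched as a pure text transform. This is an illustration under assumptions, not the body of `set_nspawn_flags()`: `reconcile_conf` and `is_managed` are hypothetical names, the quoting of `EXTRA_NSPAWN_FLAGS` is a guess at what the start script tolerates, and the `--bind=` path in `main` is only an example.

```rust
/// Variables that are force-cleared so the start script never picks the
/// `--network-veth` branch.
const CLEARED: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Key of a `KEY=value` line, if the line has that shape.
fn key_of(line: &str) -> Option<&str> {
    let (key, _) = line.split_once('=')?;
    Some(key.trim())
}

/// True for every variable this sketch owns and rewrites.
fn is_managed(key: &str) -> bool {
    key == "PRIVATE_NETWORK" || key == "EXTRA_NSPAWN_FLAGS" || CLEARED.contains(&key)
}

/// Drop every managed line, keep the rest, then append canonical values.
/// Running it on its own output with the same flags yields the same text.
fn reconcile_conf(conf: &str, nspawn_flags: &str) -> String {
    let mut out: Vec<String> = conf
        .lines()
        .filter(|line| !key_of(line).map_or(false, is_managed))
        .map(str::to_owned)
        .collect();
    out.push("PRIVATE_NETWORK=0".to_owned());
    for key in CLEARED {
        out.push(format!("{}=", key));
    }
    // Assumed quoting style for the flags value.
    out.push(format!("EXTRA_NSPAWN_FLAGS=\"{}\"", nspawn_flags));
    let mut text = out.join("\n");
    text.push('\n');
    text
}

fn main() {
    // OTHER_SETTING is a made-up unmanaged line that must survive untouched.
    let before = "PRIVATE_NETWORK=1\nHOST_ADDRESS=10.233.1.1\nOTHER_SETTING=keep\n";
    print!(
        "{}",
        reconcile_conf(before, "--bind=/run/hyperhive/agents/alice:/run/hive")
    );
}
```

Stripping managed keys before appending them is what makes the rewrite idempotent, which matches how `rebuild` is expected to behave.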

## Build / deploy / test

```sh
# inside the repo (devshell first; no global cargo)
nix develop -c cargo check
nix develop -c cargo clippy --workspace --all-targets -- -D warnings
nix develop -c cargo build

# evaluate everything (incl. rust+nix+toml fmt + clippy)
nix flake check

# build only the workspace package
nix build .#default
./result/bin/{hive-c0re,hive-ag3nt,hive-m1nd}

# deploy to an existing host that imports hyperhive.nixosModules.hive-c0re
cd ~/Repos/<nixos-config-repo>
nix flake update hyperhive   # Nix < 2.19: nix flake lock --update-input hyperhive
sudo nixos-rebuild switch --flake .#<host>
sudo systemctl restart hive-c0re   # if only env/options changed

# end-to-end tests (each idempotent; run as root)
sudo bash tests/roundtrip.sh    # alice ↔ bob echo round-trip
sudo bash tests/approval.sh     # manager edit → request → user approve → rebuilt
sudo bash tests/dashboard.sh    # HTTP UI, approve POST, SSE, orphan GC
```

The host config also needs `hyperhive.overlays.default` applied — the module's
default `package = pkgs.hyperhive` requires the overlay to bring the package
in. The `claude-unstable` overlay is applied internally to per-agent flakes
already.

## Phase status

- ✅ Phase 0 — repo + Cargo workspace + flake + agent-base template
- ✅ Phase 1 — container lifecycle; `nixos-container update` hot-reload works
  under the patch stack (validated on muede-lpt2)
- ✅ Phase 2 — per-agent sockets, in-memory broker, agent harness round-trips
- ✅ Phase 3 — sqlite broker (durable) + claude-or-echo turn loop
- ✅ Phase 4 — `hm1nd` manager binary + manager socket + declarative
  `containers.hm1nd`
- ✅ Phase 5 — git-commit approval flow
  - 5a — sqlite approval queue (`request_apply_commit`/`pending`/`approve`/`deny`)
  - 5b — per-agent config flakes
  - 5c — manager edits `proposed`; only hive-c0re writes `applied`, and the
    container builds from `applied`. Approve = read `agent.nix` at the approved
    commit from `proposed`, copy into `applied`, commit + rebuild. Manager cannot
    move `applied/main` on its own.
- ✅ Phase 6 — per-container web UIs (`HIVE_PORT` deterministic-hash) +
  hive-c0re dashboard (default 7000, vibec0re aesthetic, deep-linked)
- ✅ Phase 7 — polish:
  - 7a — dashboard Approve/Deny buttons + unified diff (`similar` crate)
  - 7b — broker broadcast + `/messages/stream` SSE + live message-flow panel
  - 7c — `ApprovalResolved` helper events into manager inbox
  - 7d — `MemoryMax=2G` + `CPUQuota=50%` systemd drop-in per container
  - 7e — damocles migration plan (`docs/damocles-migration.md`)
- ✅ Phase 7 follow-ups:
  - Dashboard **T4LK** form — operator can send messages from the browser
    (`POST /send`, becomes `from: "operator"` broker message)
  - Orphan-approval GC on dashboard render (stale entries auto-failed)
  - `PRIVATE_NETWORK=0` + `HOST_ADDRESS=`/`LOCAL_ADDRESS=` cleared in
    `set_nspawn_flags` so sub-agent web UI ports are reachable on the host
  - `HYPERHIVE_GIT` env var (absolute path) bypasses PATH ambiguity

## Phase 8 — real claude in containers + login UX (in progress)

See PLAN.md → "Phase 8" for the full design. Summary:

- **Per-agent persistent creds dir.** Bind
  `/var/lib/hyperhive/agents/<name>/claude/` → `/root/.claude` (RW) in
  `set_nspawn_flags`. One OAuth lineage per agent; refresh rotations stay
  contained to that agent.
- **State dirs persist by default.** `destroy` keeps
  `/var/lib/hyperhive/agents/<name>/` unless the operator passes an explicit
  wipe flag. Recreating an agent of the same name reuses prior creds.
- **First spawn is approval-gated.** New agent names go through the same
  approval queue as config edits. Manager calls `RequestSpawn` (CLI:
  `hive-m1nd request-spawn <name>`); operator can also queue from the
  dashboard or `hive-c0re request-spawn <name>`. The host's direct
  `hive-c0re spawn <name>` still works as a privileged bypass for tests.
  Approve runs `lifecycle::spawn` in a background task; the dashboard polls
  via `<meta refresh>` and renders a spinner row while
  `nixos-container create` + `update` + `start` is in flight.
- **"needs login" partial-run state.** No valid session in `~/.claude/` →
  harness binds the web UI but does NOT start the turn loop. The harness
  polls the dir; as soon as a login lands it transitions into the turn loop
  without a restart. Dashboard surfaces the state per-agent via a `needs
  login` badge in the container list. "Valid session" today is a heuristic
  (any regular file inside `/root/.claude/`); we may refine once the
  filename layout claude writes is locked in.
- **Login from the per-agent web UI.** Spawn `claude /login` with plain
  stdio pipes (no PTY initially), surface the OAuth URL from stdout on the
  page, accept the resulting code via a paste field, write it to the process
  stdin. On success, harness transitions out of "needs login" and enters the
  turn loop. If pipes turn out to be insufficient (claude refuses without a
  TTY, raw-mode input, ANSI-only output) we redo the backend with a PTY.

Implementation order: bind-mount/dir creation → approval-gated spawn +
spinner → "needs login" partial run → login endpoint (stdio pipes first, PTY
only if pipes prove insufficient). The login UI has nowhere to live until the
partial-run mode exists, so don't ship it earlier.
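The "valid session" heuristic and the polling transition might look roughly like this. It is a sketch under assumptions: `has_session` and `wait_for_login` are hypothetical names, only the top level of the directory is inspected (the text does not say whether the real check recurses), and the bounded poll count exists just to keep the example finite.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::Duration;

/// True when `dir` directly contains at least one regular file.
/// A missing directory counts as "no session" rather than an error.
fn has_session(dir: &Path) -> io::Result<bool> {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    for entry in entries {
        // `DirEntry::file_type` does not follow symlinks, so a symlink
        // to a file is not counted as a regular file here.
        if entry?.file_type()?.is_file() {
            return Ok(true);
        }
    }
    Ok(false)
}

/// Poll until a session appears or `max_polls` attempts are used up.
/// Returns `Ok(true)` once the turn loop may start.
fn wait_for_login(dir: &Path, interval: Duration, max_polls: u32) -> io::Result<bool> {
    for _ in 0..max_polls {
        if has_session(dir)? {
            return Ok(true);
        }
        thread::sleep(interval);
    }
    Ok(false)
}

fn main() -> io::Result<()> {
    // Inside the container this would be the bind-mounted /root/.claude.
    let dir = Path::new("/root/.claude");
    if wait_for_login(dir, Duration::from_millis(200), 5)? {
        println!("session found: start the turn loop");
    } else {
        println!("needs login: keep serving the web UI only");
    }
    Ok(())
}
```

A harness would presumably poll without an upper bound and keep the web UI bound the whole time; the bound here only makes the example terminate.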

## Approval flow

End-to-end: manager edits per-agent `proposed` repo → commits → submits commit
sha → user approves on host CLI **or** dashboard button → `hive-c0re` reads the
file at that sha from `proposed`, applies into `applied`, commits there, runs
`nixos-container update`. Helper-event JSON lands in the manager's inbox.

```
# Inside the hm1nd container (manager has /agents bind-mounted RW):
cd /agents/alice/config
$EDITOR agent.nix              # e.g. environment.systemPackages = [ pkgs.htop ];
git commit -am "add htop"
SHA=$(git rev-parse HEAD)
hive-m1nd request-apply-commit alice $SHA
exit

# On the host (CLI):
sudo hive-c0re pending          # shows queued approval with id N
sudo hive-c0re approve N        # validates, applies, rebuilds
sudo nixos-container run h-alice -- which htop

# Or on the dashboard (browser):
http://<host>:7000/             # ◆ APPR0VE button next to the diff
```
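The lifecycle of one queue entry in the walkthrough above can be modelled as a small state machine. The sketch below is an in-memory stand-in for the sqlite-backed queue, and every name in it (`ApprovalQueue`, `Status`, `resolve`, `fail_orphans`) is hypothetical; only the `"agent state dir missing"` note is taken from this document. The apply, commit and rebuild work that a real approve triggers is deliberately left out.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Status {
    Pending,
    Approved,
    Denied,
    Failed(String),
}

#[derive(Debug, Clone)]
struct Approval {
    id: u64,
    agent: String,
    commit: String,
    status: Status,
}

#[derive(Default)]
struct ApprovalQueue {
    next_id: u64,
    items: Vec<Approval>,
}

impl ApprovalQueue {
    /// Manager side: queue a commit sha for an agent and return the approval id.
    fn request(&mut self, agent: &str, commit: &str) -> u64 {
        self.next_id += 1;
        self.items.push(Approval {
            id: self.next_id,
            agent: agent.to_owned(),
            commit: commit.to_owned(),
            status: Status::Pending,
        });
        self.next_id
    }

    /// Entries still waiting for an operator decision.
    fn pending(&self) -> Vec<&Approval> {
        self.items.iter().filter(|a| a.status == Status::Pending).collect()
    }

    /// Operator side: approve or deny. Only a pending entry can be resolved.
    fn resolve(&mut self, id: u64, approve: bool) -> Result<(), String> {
        let item = self
            .items
            .iter_mut()
            .find(|a| a.id == id)
            .ok_or_else(|| format!("no approval {}", id))?;
        if item.status != Status::Pending {
            return Err(format!("approval {} is already {:?}", id, item.status));
        }
        item.status = if approve { Status::Approved } else { Status::Denied };
        Ok(())
    }

    /// Mark pending entries whose agent state dir is gone as failed.
    /// Returns how many entries were failed; rows are kept for audit.
    fn fail_orphans(&mut self, state_dir_exists: impl Fn(&str) -> bool) -> usize {
        let mut failed = 0;
        for a in self.items.iter_mut().filter(|a| a.status == Status::Pending) {
            if !state_dir_exists(&a.agent) {
                a.status = Status::Failed("agent state dir missing".to_owned());
                failed += 1;
            }
        }
        failed
    }
}

fn main() {
    let mut queue = ApprovalQueue::default();
    let id = queue.request("alice", "abc123");
    for a in queue.pending() {
        println!("pending #{}: {} @ {}", a.id, a.agent, a.commit);
    }
    queue.resolve(id, true).expect("entry was pending");
    println!("pending after approve: {}", queue.pending().len());
}
```

Refusing to resolve an entry twice is what keeps the CLI and the dashboard button from racing each other into a double apply.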

Per-agent layout — two separate git repos:

```
/var/lib/hyperhive/agents/<name>/config/    # proposed — manager edits, hive-c0re reads only
├── .git/
└── agent.nix                               # the only file the manager can change
                                            # (initial commit by hive-c0re on first spawn,
                                            # never touched by hive-c0re again)

/var/lib/hyperhive/applied/<name>/          # applied — hive-c0re-only; container builds here
├── .git/
├── flake.nix                               # hive-c0re-managed; references hyperhive_flake
└── agent.nix                               # overwritten by approve from the proposed commit
```

The container's `--flake` ref is `<applied_dir>#default`. The flake's
`nixosConfigurations.default` extends `hyperhive.nixosConfigurations.agent-base`
with `./agent.nix` plus an inline module that sets
`environment.etc."gitconfig".text` (committer identity = the agent's name) and
`systemd.services.hive-ag3nt.environment.HIVE_PORT`/`HIVE_LABEL`.

## Polish backlog

Not phased — pick when relevant:

- **Operator inbox view** — drain replies addressed to `operator` and show
  in the dashboard (today they accumulate in sqlite unread).
- **Per-agent UI substance** — show last N inbox messages, last turn timing,
  link back to dashboard.
- **xterm.js terminal** — embed in each per-container UI, attach to a PTY
  exposed by the harness.
- **`destroy` verb** — currently `nixos-container destroy` + manual `rm -rf`.
  Should be one hive-c0re verb that also purges approvals + state dirs.
- **Bounded broker** — cap rows per recipient or auto-vacuum delivered
  messages older than a threshold.
- **Container crash events** — watch `container@*.service` via D-Bus,
  push `HelperEvent::ContainerCrash` to the manager.

## Inspirations

- **`~/Repos/bitburner-agent`** — sibling project, drives Claude Code in a
  turn loop against a Bitburner CDP session. Patterns to steal as we grow:
  per-cycle prompt diffing (vs full state), notes compaction as a separate
  short-lived Claude session, MCP server registering tools from a single
  `TOOLS` array, dashboard with SSE + xterm.js + sqlite stats sampler,
  opaque "terminal event" stream that unifies tool-call / sleep / op-notice
  / etc.