müde 625bae1b31 turn loop: pin to haiku 4.5 (todo: per-agent override)

2026-05-15 14:43:43 +02:00

hyperhive

Multi-Claude-Code-agent orchestration on nixos-containers. A host-side Rust daemon (hive-c0re) spawns nspawn-isolated agent containers and brokers messages between them. A manager agent (hm1nd) coordinates the swarm and gates lifecycle changes on user approval via git commits, surfaced through a vibec0re-styled HTTP dashboard with live SSE message-flow.

PLAN.md is the living design doc. Read it for the why and the phase roadmap; this file is the operator/developer reference for the how.

Architecture

host (NixOS, hive-c0re.service)
│
├── hive-c0re  (Rust daemon — coordinator + dashboard + CLI)
│   ├── lifecycle      — nixos-container CRUD (spawn/kill/rebuild/list)
│   ├── broker         — sqlite message store + broadcast channel
│   ├── approvals      — sqlite approval queue
│   ├── coordinator    — shared state (broker/approvals/agent sockets)
│   ├── actions        — approve/deny (shared between admin socket & dashboard)
│   ├── server         — host admin socket (JSON line protocol)
│   ├── manager_server — manager-only privileged socket
│   ├── agent_server   — per-sub-agent sockets
│   ├── dashboard      — axum HTTP UI + SSE message-flow + approve/deny + T4LK
│   └── client         — admin-socket client (powers `hive-c0re spawn|kill|…`)
│
├── /run/hyperhive/
│   ├── host.sock                — admin CLI ↔ daemon
│   ├── manager.sock             → hm1nd container at /run/hive/mcp.sock
│   └── agents/<name>/mcp.sock   → h-<name> container at /run/hive/mcp.sock
│
├── /var/lib/hyperhive/
│   ├── broker.sqlite            — messages + approvals tables
│   ├── agents/<name>/config/    — proposed repo (manager-editable, RO to hive-c0re)
│   └── applied/<name>/          — applied repo (hive-c0re-only, container builds here)
│
└── nixos-containers
    ├── h-<name>  (sub-agents, hive-ag3nt binary)
    └── hm1nd     (manager, hive-m1nd binary)

Crates / file map

hive-c0re/         host daemon + CLI (one binary, subcommand-dispatched)
  src/main.rs           clap setup; serve / spawn / kill / rebuild / list /
                         pending / approve / deny
  src/server.rs         host admin socket (HostRequest → dispatch)
  src/client.rs         admin-socket client
  src/manager_server.rs manager-privileged socket (ManagerRequest)
  src/agent_server.rs   per-sub-agent socket listener
  src/broker.rs         sqlite Message store + broadcast channel for SSE
  src/approvals.rs      sqlite Approval queue
  src/coordinator.rs    shared state (broker/approvals/agent_flake/sockets)
  src/actions.rs        approve/deny (admin socket + dashboard both call in)
  src/lifecycle.rs      `nixos-container` shellouts, per-agent flake generator,
                         systemd drop-ins, git helpers, agent_web_port hash
  src/dashboard.rs      axum HTTP UI: containers list, T4LK form, approvals
                         (diff + Approve/Deny buttons), SSE message flow

hive-ag3nt/        in-container harness crate; produces TWO binaries
  src/lib.rs            DEFAULT_SOCKET, DEFAULT_WEB_PORT, re-exports
  src/client.rs         generic JSON-line request/response over unix socket
  src/web_ui.rs         per-container axum HTTP page (label + placeholder)
  src/bin/hive-ag3nt.rs sub-agent CLI (serve/send/recv); turn loop + web UI
  src/bin/hive-m1nd.rs  manager CLI (serve/send/recv/spawn/kill/
                         request-apply-commit); recognises HelperEvent

hive-sh4re/        wire types (HostRequest/Response, AgentRequest/Response,
                   ManagerRequest/Response, Message, Approval, HelperEvent)

nix/
  modules/hive-c0re.nix       systemd service + firewall + git path wiring
  templates/agent-base.nix    sub-agent nixos-container template
  templates/manager.nix       manager nixos-container template

tests/roundtrip.sh   Phase 3 messaging round-trip
tests/approval.sh    Phase 5 end-to-end approval flow
tests/dashboard.sh   Phase 6+7 HTTP dashboard + SSE + orphan GC

docs/damocles-migration.md   options for moving damocles onto hyperhive

Conventions

Naming. Container names are length-bounded (nixos-container allows ≤ 11 chars). Sub-agents are h-<name> with <name> ≤ 9 chars; the manager is hm1nd. MAX_AGENT_NAME enforces the cap in lifecycle.rs. Per-agent web UI port = WEB_PORT_BASE + FNV1a(name) % WEB_PORT_RANGE (8100..8999); the manager is fixed at 8000; the dashboard listens on cfg.dashboardPort (default 7000).
Identity = socket. No auth/tokens on the per-agent sockets. The socket path identifies the principal; perms come from "who has the bind-mount."
Wire protocol. JSON line-delimited over unix sockets in both directions (host admin / manager / agent). /messages/stream is text/event-stream.
Commit messages. Short, lowercase, no Co-Authored-By trailer.
Commit before test. Stage and commit when work looks ready, then run validation (cargo check, nix flake check, real lpt2 deploy). Failures get a follow-up commit rather than an amend.
rebuild is the reconcile verb. Idempotently rewrites /etc/nixos-containers/<C>.conf (PRIVATE_NETWORK=0, clears HOST_ADDRESS/LOCAL_ADDRESS, sets EXTRA_NSPAWN_FLAGS), regenerates applied/<name>/flake.nix, writes the systemd limits drop-in, then nixos-container update + stop + start. Anything that changes per-container state on the host should be re-applied here.
Actions are factored. approve / deny live in actions.rs; the admin socket and the dashboard POST handlers both call into them, so the two surfaces never drift.
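As an illustration of the Naming convention above, here is a minimal sketch of the name cap and the port derivation. The constant values are assumptions inferred from the stated 8100..8999 window, the 64-bit hash width is an assumption (the text only says FNV1a), and the function names are hypothetical rather than copied from lifecycle.rs.

```rust
// Assumed values: a base of 8100 and a range of 900 give the 8100..8999 window.
const WEB_PORT_BASE: u16 = 8100;
const WEB_PORT_RANGE: u64 = 900;
const MAX_AGENT_NAME: usize = 9;

/// 64-bit FNV-1a over raw bytes (hash width is an assumption).
fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Deterministic per-agent web UI port: same name, same port, on every host.
fn agent_web_port(name: &str) -> u16 {
    WEB_PORT_BASE + (fnv1a64(name.as_bytes()) % WEB_PORT_RANGE) as u16
}

/// Sub-agent container name, rejecting names that would exceed the 11-char cap.
fn container_name(name: &str) -> Result<String, String> {
    if name.is_empty() || name.len() > MAX_AGENT_NAME {
        return Err(format!(
            "agent name must be 1..={} bytes, got {:?}",
            MAX_AGENT_NAME, name
        ));
    }
    Ok(format!("h-{}", name))
}
```

With only 900 slots, two distinct names can hash to the same port; how the real implementation handles such a collision is not stated here.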

Gotchas / lessons learned

nixos-container doesn't expose --bind on the CLI. Path is via EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf — the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation. We rewrite this line in set_nspawn_flags().
/run/systemd/nspawn/*.nspawn overrides are ignored by nixos-container's start script (it builds the nspawn cmd line directly).
boot.isNspawnContainer = true, not boot.isContainer = true. Renamed in nixos-25.11+.
nixos-container create auto-assigns HOST_ADDRESS/LOCAL_ADDRESS in the .conf. The start script's if HOST_ADDRESS set → --network-veth branch then forces a private netns — which is silently fatal for our web UIs (the bind is invisible from the host). We force-clear those vars (and HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE) plus set PRIVATE_NETWORK=0.
systemd service PATH ≠ host PATH. Our service explicitly sets path = [ pkgs.git "/run/current-system/sw" ]. Additionally, environment.HYPERHIVE_GIT = "${pkgs.git}/bin/git" bakes the absolute path in (read by lifecycle::git_command()) so git resolution doesn't depend on PATH plumbing at all.
RuntimeDirectoryPreserve = "yes" keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.
register_agent is idempotent — drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.
claude-code is unfree. agent-base.nix allow-lists it specifically. The flake pins it to nixpkgs-unstable via overlays.claude-unstable (stable lags too far). The overlay imports unstable with its own allowUnfreePredicate so the access inside the overlay doesn't itself trip.
Claude credentials are stateful and per-container. No ANTHROPIC_API_KEY env var path. Today's stopgap: nixos-container root-login h-<name> → claude (interactive) → log in once. The harness falls back to echo replies when claude --print fails. Phase 8 moves this to a per-agent persistent dir at /var/lib/hyperhive/agents/<name>/claude/ bind-mounted into the container, with the interactive login driven from the agent's web UI. Sharing one ~/.claude across agents is NOT viable — OAuth refresh tokens rotate, so any sibling refresh invalidates all the others.
Echo guard. hive-ag3nt serve skips auto-reply when the incoming body starts with "echo: ". This prevents ping-pong loops when both sides fall back to echo. Real conversations between claude-backed agents can still run away; bounding them is the manager's job.
Orphan approvals. If state dirs are wiped out from under a pending approval (test scripts, manual rm -rf), the dashboard's next render marks them failed with note "agent state dir missing" so they fall out of pending. They stay in sqlite for audit.
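The echo guard described above fits in a few lines. This is a sketch only: the function name is hypothetical, and the assumption that the fallback reply is the incoming body prefixed with "echo: " is inferred from the guard's prefix check, not confirmed by the source.

```rust
const ECHO_PREFIX: &str = "echo: ";

/// Fallback reply used when claude is unavailable.
/// Returns None when the incoming body is itself an echo, which breaks
/// the ping-pong loop between two agents that are both in echo mode.
fn echo_reply(body: &str) -> Option<String> {
    if body.starts_with(ECHO_PREFIX) {
        None
    } else {
        Some(format!("{}{}", ECHO_PREFIX, body))
    }
}
```

Because every echo carries the prefix, a reply to a reply is always suppressed, so an echo exchange ends after one hop.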

Agent MCP surface + turn loop

The harness ships an embedded MCP server (rmcp 1.7) that claude launches as a stdio child via --mcp-config. Subcommand: hive-ag3nt mcp. Tools:

mcp__hyperhive__send(to, body) — message a peer or the operator.
mcp__hyperhive__recv() — drain one inbox message.

Both translate to AgentRequest::Send/Recv against the agent's own /run/hive/mcp.sock (the existing hyperhive socket). The MCP surface is just claude's view of that socket — same authority, friendlier protocol.
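Since each tool call ends up as one JSON line on the unix socket, the framing layer can be sketched with the standard library alone. The helper names are hypothetical, and serialization is left to the caller (the real harness presumably serializes the hive-sh4re wire types, whose JSON shape is not shown here). The functions are generic over Read/Write so that a std::os::unix::net::UnixStream or an in-memory buffer both work.

```rust
use std::io::{self, BufRead, Write};

/// Write one already-serialized JSON value as a single newline-terminated frame.
fn write_frame<W: Write>(w: &mut W, json: &str) -> io::Result<()> {
    // A raw newline inside the payload would split the frame in two.
    if json.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "frame contains a newline",
        ));
    }
    w.write_all(json.as_bytes())?;
    w.write_all(b"\n")?;
    w.flush()
}

/// Read the next frame. Ok(None) means the peer closed the connection.
fn read_frame<R: BufRead>(r: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if r.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    let trimmed = line.trim_end_matches(|c| c == '\n' || c == '\r');
    Ok(Some(trimmed.to_string()))
}
```

For a real socket, wrap the read half in std::io::BufReader so read_line can buffer across partial reads.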

The turn loop in hive-ag3nt serve writes /run/hive/claude-mcp-config.json at boot pointing at /proc/self/exe mcp (the running hive-ag3nt binary's nix store path). Each turn invokes:

claude --print --mcp-config <path> --tools <builtins> --allowedTools <builtins+mcp> <prompt>

Tool whitelist (see ALLOWED_BUILTIN_TOOLS in hive-ag3nt::mcp):

Allowed built-ins: Bash, Edit, Glob, Grep, NotebookEdit, Read, TodoWrite, Write.
Denied by omission: WebFetch, WebSearch, Task — no external egress or nested-agent spawning until we have a real policy story.
Allowed MCP tools: mcp__hyperhive__send, mcp__hyperhive__recv.
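A sketch of how the whitelist above could be turned into the per-turn argument vector. The tool names and flags are the ones listed in this section; the comma separator, the ALLOWED_MCP_TOOLS constant, and the claude_args helper are assumptions made for illustration, and the real builtin_tools_arg in hive-ag3nt::mcp may differ.

```rust
const ALLOWED_BUILTIN_TOOLS: &[&str] = &[
    "Bash", "Edit", "Glob", "Grep", "NotebookEdit", "Read", "TodoWrite", "Write",
];
// Hypothetical constant; the MCP tool names themselves come from the list above.
const ALLOWED_MCP_TOOLS: &[&str] = &["mcp__hyperhive__send", "mcp__hyperhive__recv"];

/// Value for --tools: built-ins only. WebFetch, WebSearch and Task are
/// denied simply by never appearing here.
fn builtin_tools_arg() -> String {
    ALLOWED_BUILTIN_TOOLS.join(",")
}

/// Value for --allowedTools: built-ins plus the hyperhive MCP tools.
fn allowed_tools_arg() -> String {
    ALLOWED_BUILTIN_TOOLS
        .iter()
        .chain(ALLOWED_MCP_TOOLS.iter())
        .copied()
        .collect::<Vec<&str>>()
        .join(",")
}

/// Arguments for one turn, in the order shown in the invocation above.
fn claude_args(mcp_config: &str, prompt: &str) -> Vec<String> {
    vec![
        "--print".to_string(),
        "--mcp-config".to_string(),
        mcp_config.to_string(),
        "--tools".to_string(),
        builtin_tools_arg(),
        "--allowedTools".to_string(),
        allowed_tools_arg(),
        prompt.to_string(),
    ]
}
```

Passing the vector to std::process::Command::args avoids any shell quoting of the prompt.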

Bash is on the allow-list "for now" — pending a finer-grained allow-list system for command patterns (Bash(git *)-style). When that lands, the builtin_tools_arg shape will probably change to a setting / hooks combo per claude-code's permissions plumbing.

The manager will get its own subcommand later, with request_spawn, kill, and request_apply_commit added to the TOOLS list.

Manager (hm1nd) is hive-c0re-managed

The manager container runs through the same lifecycle as sub-agents — no separate code path. On hive-c0re serve startup, if nixos-container list doesn't include hm1nd, hive-c0re creates it. The manager's flake lives at /var/lib/hyperhive/applied/hm1nd/; its proposed (manager-editable) config at /var/lib/hyperhive/agents/hm1nd/config/. Manager can edit its own agent.nix (visible inside the container at /agents/hm1nd/config/), commit, and submit request-apply-commit hm1nd <sha> for operator approval — same flow as for sub-agents.

Differences from sub-agents:

flake.nix extends hyperhive.nixosConfigurations.manager (vs agent-base).
Container name is hm1nd (no h- prefix).
Fixed web UI port (MANAGER_PORT = 8000).
set_nspawn_flags adds an extra bind: /var/lib/hyperhive/agents → /agents (RW), so the manager can edit per-agent proposed repos.
First-deploy spawn bypasses the approval queue (manager is required infrastructure).
Per-agent socket is the manager socket at /run/hyperhive/manager/, owned by manager_server::start. coordinator::ensure_runtime returns that path for manager and the usual /run/hyperhive/agents/<name>/ for the rest.
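The conf rewrite that set_nspawn_flags performs (including the manager's extra /agents bind) might look roughly like the sketch below. It is not the actual implementation: the Bind type and helper names are hypothetical, and quoting the EXTRA_NSPAWN_FLAGS value with double quotes is an assumption about what the conf file accepts. The --bind and --bind-ro flags are standard systemd-nspawn options.

```rust
/// Keys that get cleared so the start script does not take the
/// --network-veth branch.
const CLEARED_KEYS: &[&str] = &[
    "HOST_ADDRESS", "LOCAL_ADDRESS", "HOST_ADDRESS6", "LOCAL_ADDRESS6", "HOST_BRIDGE",
];

/// Hypothetical description of one bind mount.
struct Bind<'a> {
    source: &'a str,
    target: &'a str,
    read_only: bool,
}

fn nspawn_flags(binds: &[Bind<'_>]) -> String {
    binds
        .iter()
        .map(|b| {
            let flag = if b.read_only { "--bind-ro" } else { "--bind" };
            format!("{}={}:{}", flag, b.source, b.target)
        })
        .collect::<Vec<String>>()
        .join(" ")
}

/// Drop every line this function owns, keep the rest, then append fresh values.
/// Running it twice on its own output changes nothing.
fn rewrite_conf(conf: &str, binds: &[Bind<'_>]) -> String {
    let owned = |key: &str| {
        key == "PRIVATE_NETWORK"
            || key == "EXTRA_NSPAWN_FLAGS"
            || CLEARED_KEYS.iter().any(|k| *k == key)
    };
    let mut out: Vec<String> = conf
        .lines()
        .filter(|line| !owned(line.split('=').next().unwrap_or("").trim()))
        .map(str::to_string)
        .collect();
    out.push("PRIVATE_NETWORK=0".to_string());
    for key in CLEARED_KEYS.iter() {
        out.push(format!("{}=", key));
    }
    out.push(format!("EXTRA_NSPAWN_FLAGS=\"{}\"", nspawn_flags(binds)));
    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Filtering and re-appending, rather than editing lines in place, is what makes the rewrite safe to repeat on every rebuild.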

Migration note: drop any containers.hm1nd = { ... } block from your host NixOS config. hyperhive creates and updates the manager itself now.

Auto-update on startup

hive-c0re serve runs auto_update::run in a background task right after opening the coordinator. It enumerates managed containers and rebuilds any whose recorded hyperhive rev differs from the current one:

Sub-agents rebuild via lifecycle::rebuild (regenerates applied/<name>/flake.nix, sets nspawn flags, nixos-container update --flake).
Manager runs nixos-container update hm1nd (no --flake). The manager's config lives in the host's NixOS module; this is belt-and-braces on top of NixOS's own container activation. Idempotent when nothing has actually changed.

"Rev" = canonical filesystem path of cfg.hyperhiveFlake (so /etc/hyperhive resolving to a new /nix/store/...-source triggers a rebuild). Marker file: /var/lib/hyperhive/applied/.<name>.hyperhive-rev. If the flake input has no canonical path (e.g. a github: URL), auto-update is a no-op — rebuild manually. The task is async and never blocks the admin socket; failures are logged and don't take the daemon down.

The dashboard surfaces pending updates per agent: a clickable "needs update ↻" badge appears whenever the marker differs from current rev. The badge POSTs /rebuild/<name>, calling the same auto_update::rebuild_agent / rebuild_manager path so manual triggers and the startup scan can't drift.
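The rev comparison behind both the startup scan and the badge can be sketched as follows. The function names are hypothetical, and treating a missing marker file as "needs update" is an assumption; the marker file name and the canonical-path definition of "rev" come from the description above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Current rev: the canonical filesystem path of the flake input.
/// None when the input has no such path (e.g. a github: URL), in which
/// case auto-update is a no-op.
fn current_rev(hyperhive_flake: &str) -> Option<PathBuf> {
    fs::canonicalize(hyperhive_flake).ok()
}

/// Marker file: <applied_root>/.<name>.hyperhive-rev
fn marker_path(applied_root: &Path, name: &str) -> PathBuf {
    applied_root.join(format!(".{}.hyperhive-rev", name))
}

/// True when the recorded rev differs from the current one.
/// Assumption: a missing or unreadable marker counts as stale.
fn needs_update(applied_root: &Path, name: &str, rev: &Path) -> bool {
    match fs::read_to_string(marker_path(applied_root, name)) {
        Ok(recorded) => Path::new(recorded.trim_end()) != rev,
        Err(_) => true,
    }
}

/// Record the rev after a successful rebuild.
fn record_rev(applied_root: &Path, name: &str, rev: &Path) -> io::Result<()> {
    fs::write(marker_path(applied_root, name), rev.to_string_lossy().as_bytes())
}
```

Writing the marker only after a successful rebuild means a failed rebuild leaves the badge visible and gets retried on the next startup.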

Build / deploy / test

# inside the repo (devshell first; no global cargo)
nix develop -c cargo check
nix develop -c cargo clippy --workspace --all-targets -- -D warnings
nix develop -c cargo build

# evaluate everything (incl. rust+nix+toml fmt + clippy)
nix flake check

# build only the workspace package
nix build .#default
ls ./result/bin/                # hive-c0re, hive-ag3nt, hive-m1nd

# deploy to an existing host that imports hyperhive.nixosModules.hive-c0re
cd ~/Repos/<nixos-config-repo>
nix flake update hyperhive         # Nix < 2.19: nix flake lock --update-input hyperhive
sudo nixos-rebuild switch --flake .#<host>
sudo systemctl restart hive-c0re   # if only env/options changed

# end-to-end tests (each idempotent; run as root)
sudo bash tests/roundtrip.sh    # alice ↔ bob echo round-trip
sudo bash tests/approval.sh     # manager edit → request → user approve → rebuilt
sudo bash tests/dashboard.sh    # HTTP UI, approve POST, SSE, orphan GC

The host config also needs hyperhive.overlays.default applied — the module's default package = pkgs.hyperhive requires the overlay to bring the package in. The claude-unstable overlay is applied internally to per-agent flakes already.

Phase status

✅ Phase 0 — repo + Cargo workspace + flake + agent-base template
✅ Phase 1 — container lifecycle; nixos-container update hot-reload works under the patch stack (validated on muede-lpt2)
✅ Phase 2 — per-agent sockets, in-memory broker, agent harness round-trips
✅ Phase 3 — sqlite broker (durable) + claude-or-echo turn loop
✅ Phase 4 — hm1nd manager binary + manager socket + declarative containers.hm1nd
✅ Phase 5 — git-commit approval flow
- 5a — sqlite approval queue (request_apply_commit/pending/approve/deny)
- 5b — per-agent config flakes
- 5c — manager edits proposed; only hive-c0re writes applied; the container builds from applied. Approve = read agent.nix at the approved commit from proposed, copy it into applied, commit + rebuild. The manager cannot move applied/main on its own.
✅ Phase 6 — per-container web UIs (HIVE_PORT deterministic-hash) + hive-c0re dashboard (default 7000, vibec0re aesthetic, deep-linked)
✅ Phase 7 — polish:
- 7a — dashboard Approve/Deny buttons + unified diff (similar crate)
- 7b — broker broadcast + /messages/stream SSE + live message-flow panel
- 7c — ApprovalResolved helper events into manager inbox
- 7d — MemoryMax=2G + CPUQuota=50% systemd drop-in per container
- 7e — damocles migration plan (docs/damocles-migration.md)
✅ Phase 7 follow-ups:
- Dashboard T4LK form — operator can send messages from the browser (POST /send, becomes from: "operator" broker message)
- Orphan-approval GC on dashboard render (stale entries auto-failed)
- PRIVATE_NETWORK=0 + HOST_ADDRESS=/LOCAL_ADDRESS= cleared in set_nspawn_flags so sub-agent web UI ports are reachable on the host
- HYPERHIVE_GIT env var (absolute path) bypasses PATH ambiguity

See PLAN.md → "Phase 8" for the full design. Summary:

Per-agent persistent creds dir. Bind /var/lib/hyperhive/agents/<name>/claude/ → /root/.claude (RW) in set_nspawn_flags. One OAuth lineage per agent; refresh rotations stay contained to that agent.
State dirs persist by default. destroy keeps /var/lib/hyperhive/agents/<name>/ unless the operator passes an explicit wipe flag. Recreating an agent of the same name reuses prior creds.
First spawn is approval-gated. New agent names go through the same approval queue as config edits. Manager calls RequestSpawn (CLI: hive-m1nd request-spawn <name>); operator can also queue from the dashboard or hive-c0re request-spawn <name>. The host's direct hive-c0re spawn <name> still works as a privileged bypass for tests. Approve runs lifecycle::spawn in a background task; the dashboard polls via <meta refresh> and renders a spinner row while nixos-container create + update + start is in flight.
"needs login" partial-run state. No valid session in ~/.claude/ → harness binds the web UI but does NOT start the turn loop. The harness polls the dir; as soon as a login lands it transitions into the turn loop without a restart. Dashboard surfaces the state per-agent via a needs login badge in the container list. "Valid session" today is a heuristic (any regular file inside /root/.claude/); we may refine once the filename layout claude writes is locked in.
Login from the per-agent web UI. Spawn claude auth login with plain stdio pipes (no PTY initially), surface the OAuth URL from stdout on the page, accept the resulting code via a paste field, write it to the process stdin. Once ~/.claude/ populates, the existing needs-login polling loop flips state to Online and starts the turn loop — no separate signaling needed. The exact command is overridable via HYPERHIVE_LOGIN_CMD so we can adjust without rebuilding. If pipes turn out to be insufficient (claude refuses without a TTY, raw-mode input, ANSI-only output) we redo the backend with a PTY (e.g. portable-pty).

Implementation order: bind-mount/dir creation → approval-gated spawn + spinner → "needs login" partial run → PTY login endpoint. The login UI has nowhere to live until the partial-run mode exists, so don't ship it earlier.
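The "valid session" heuristic from the summary above is simple enough to sketch directly. The function name is hypothetical, LoginState borrows the "needs login" and "Online" state names used in the text, and checking only the top level of the directory (not recursing) is an assumption.

```rust
use std::fs;
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
enum LoginState {
    NeedsLogin,
    Online,
}

/// Today's heuristic: any regular file directly inside the creds dir
/// counts as a valid session. A missing or unreadable dir means no session.
fn login_state(claude_dir: &Path) -> LoginState {
    let has_file = fs::read_dir(claude_dir)
        .map(|entries| {
            entries
                .flatten() // skip entries that failed to read
                .any(|e| e.file_type().map(|t| t.is_file()).unwrap_or(false))
        })
        .unwrap_or(false);
    if has_file {
        LoginState::Online
    } else {
        LoginState::NeedsLogin
    }
}
```

The harness's polling loop would call this on an interval and start the turn loop on the first Online result.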

Approval flow

End-to-end: manager edits per-agent proposed repo → commits → submits commit sha → user approves on host CLI or dashboard button → hive-c0re reads the file at that sha from proposed, applies into applied, commits there, runs nixos-container update. Helper-event JSON lands in the manager's inbox.

# Inside the hm1nd container (manager has /agents bind-mounted RW):
cd /agents/alice/config
$EDITOR agent.nix              # e.g. environment.systemPackages = [ pkgs.htop ];
git commit -am "add htop"
SHA=$(git rev-parse HEAD)
hive-m1nd request-apply-commit alice $SHA
exit

# On the host (CLI):
sudo hive-c0re pending          # shows queued approval with id N
sudo hive-c0re approve N        # validates, applies, rebuilds
sudo nixos-container run h-alice -- which htop

# Or on the dashboard (browser):
http://<host>:7000/             # ◆ APPR0VE button next to the diff

Per-agent layout — two separate git repos:

/var/lib/hyperhive/agents/<name>/config/    # proposed — manager edits, hive-c0re reads only
├── .git/
└── agent.nix                               # the only file the manager can change
                                            # (initial commit by hive-c0re on first spawn,
                                            # never touched by hive-c0re again)

/var/lib/hyperhive/applied/<name>/          # applied — hive-c0re-only; container builds here
├── .git/
├── flake.nix                               # hive-c0re-managed; references hyperhive_flake
└── agent.nix                               # overwritten by approve from the proposed commit

The container's --flake ref is <applied_dir>#default. The flake's nixosConfigurations.default extends hyperhive.nixosConfigurations.agent-base with ./agent.nix plus an inline module that sets environment.etc."gitconfig".text (committer identity = the agent's name) and systemd.services.hive-ag3nt.environment.HIVE_PORT/HIVE_LABEL.
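For the approve step, reading agent.nix at the approved commit amounts to a git show against the proposed repo. The sketch below only builds the command. Here git_command is a stand-in for lifecycle::git_command() (the fallback to a bare "git" when HYPERHIVE_GIT is unset is an assumption), and show_agent_nix is a hypothetical helper.

```rust
use std::env;
use std::path::Path;
use std::process::Command;

/// Prefer the absolute path baked in via HYPERHIVE_GIT; fall back to PATH.
fn git_command() -> Command {
    Command::new(env::var("HYPERHIVE_GIT").unwrap_or_else(|_| "git".to_string()))
}

/// Build `git -C <proposed> show <sha>:agent.nix`.
/// The sha is restricted to hex so a manager-supplied value can never be
/// parsed by git as an option or as another revision syntax.
fn show_agent_nix(proposed: &Path, sha: &str) -> Result<Command, String> {
    let is_hex = !sha.is_empty()
        && sha.len() <= 64
        && sha.chars().all(|c| c.is_ascii_hexdigit());
    if !is_hex {
        return Err(format!("not a commit sha: {:?}", sha));
    }
    let mut cmd = git_command();
    cmd.arg("-C")
        .arg(proposed)
        .arg("show")
        .arg(format!("{}:agent.nix", sha));
    Ok(cmd)
}
```

A caller would run .output() on the result and, on success, write stdout to applied/<name>/agent.nix before committing there and rebuilding.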

Security backlog

Unprivileged containers (userns mapping). Today the nspawn container runs as fully privileged root. Goal: PrivateUsersChown=yes (or the nixos-container equivalent) so that uid 0 inside maps to an unprivileged uid on the host, and a container-root compromise lands the attacker on an ordinary user account rather than the host's root. This requires per-agent state dirs to be chown'd to that uid on the host side.
Bash command allow-list. Replace the blanket Bash allow with a pattern allow-list (Bash(git *), Bash(nix build .*), etc.) per claude-code's --allowedTools extended grammar. Likely lives in agent.nix so each agent can scope its own shell surface.

Per-agent settings backlog

Model override. Hard-coded to claude-haiku-4-5-20251001 in the turn loop right now. Surface it as a per-agent override: operator via dashboard, manager via request_apply_commit setting an attr on the agent's flake (most natural place since the flake already carries per-agent env/identity).

Polish backlog

Not phased — pick when relevant:

Operator inbox view — drain replies addressed to operator and show in the dashboard (today they accumulate in sqlite unread).
Per-agent UI substance — show last N inbox messages, last turn timing, link back to dashboard.
xterm.js terminal — embed in each per-container UI, attach to a PTY exposed by the harness.
destroy verb — currently nixos-container destroy + manual rm -rf. Should be one hive-c0re verb that also purges approvals + state dirs.
Bounded broker — cap rows per recipient or auto-vacuum delivered messages older than a threshold.
Container crash events — watch container@*.service via D-Bus, push HelperEvent::ContainerCrash to the manager.

Inspirations

~/Repos/bitburner-agent — sibling project, drives Claude Code in a turn loop against a Bitburner CDP session. Patterns to steal as we grow: per-cycle prompt diffing (vs full state), notes compaction as a separate short-lived Claude session, MCP server registering tools from a single TOOLS array, dashboard with SSE + xterm.js + sqlite stats sampler, opaque "terminal event" stream that unifies tool-call / sleep / op-notice / etc.
