docs: README + TODO split; trim CLAUDE.md; fix async form 415

müde 2026-05-15 16:41:15 +02:00
parent 392a448656
commit 970f645461
6 changed files with 262 additions and 684 deletions

513
CLAUDE.md

@@ -1,86 +1,50 @@
# hyperhive # hyperhive — developer reference
Multi-Claude-Code-agent orchestration on **nixos-containers**. A host-side Rust Operator + dev notes: conventions, gotchas, per-subsystem design.
daemon (`hive-c0re`) spawns nspawn-isolated agent containers and brokers
messages between them. A manager agent (`hm1nd`) coordinates the swarm and
gates lifecycle changes on user approval via git commits, surfaced through a
vibec0re-styled HTTP dashboard with live SSE message-flow.
**PLAN.md** is the living design doc. Read it for the *why* and the phase - High-level project intro: **[README.md](README.md)**.
roadmap; this file is the operator/developer reference for the *how*. - Open work + backlog: **[TODO.md](TODO.md)**.
## Architecture ## File map
```
host (NixOS, hive-c0re.service)
├── hive-c0re (Rust daemon — coordinator + dashboard + CLI)
│ ├── lifecycle — nixos-container CRUD (spawn/kill/rebuild/list)
│ ├── broker — sqlite message store + broadcast channel
│ ├── approvals — sqlite approval queue
│ ├── coordinator — shared state (broker/approvals/agent sockets)
│ ├── actions — approve/deny (shared between admin socket & dashboard)
│ ├── server — host admin socket (JSON line protocol)
│ ├── manager_server — manager-only privileged socket
│ ├── agent_server — per-sub-agent sockets
│ ├── dashboard — axum HTTP UI + SSE message-flow + approve/deny + T4LK
│ └── client — admin-socket client (powers `hive-c0re spawn|kill|…`)
├── /run/hyperhive/
│ ├── host.sock — admin CLI ↔ daemon
│ ├── manager.sock → hm1nd container at /run/hive/mcp.sock
│ └── agents/<name>/mcp.sock → h-<name> container at /run/hive/mcp.sock
├── /var/lib/hyperhive/
│ ├── broker.sqlite — messages + approvals tables
│ ├── agents/<name>/config/ — proposed repo (manager-editable, RO to hive-c0re)
│ └── applied/<name>/ — applied repo (hive-c0re-only, container builds here)
└── nixos-containers
├── h-<name> (sub-agents, hive-ag3nt binary)
└── hm1nd (manager, hive-m1nd binary)
```
## Crates / file map
``` ```
hive-c0re/ host daemon + CLI (one binary, subcommand-dispatched) hive-c0re/ host daemon + CLI (one binary, subcommand-dispatched)
src/main.rs clap setup; serve / spawn / kill / rebuild / list / src/main.rs clap setup; serve / spawn / kill / rebuild / list /
pending / approve / deny pending / approve / deny / destroy / request-spawn
src/server.rs host admin socket (HostRequest → dispatch) src/server.rs host admin socket (HostRequest → dispatch)
src/client.rs admin-socket client src/client.rs admin-socket client
src/manager_server.rs manager-privileged socket (ManagerRequest) src/manager_server.rs manager-privileged socket (ManagerRequest)
src/agent_server.rs per-sub-agent socket listener src/agent_server.rs per-sub-agent socket listener (long-poll Recv)
src/broker.rs sqlite Message store + broadcast channel for SSE src/broker.rs sqlite Message store + broadcast channel for SSE
src/approvals.rs sqlite Approval queue src/approvals.rs sqlite Approval queue + kinds
src/coordinator.rs shared state (broker/approvals/agent_flake/sockets) src/coordinator.rs shared state (broker/approvals/transient/sockets)
src/actions.rs approve/deny (admin socket + dashboard both call in) src/actions.rs approve/deny/destroy
src/lifecycle.rs `nixos-container` shellouts, per-agent flake generator, src/auto_update.rs startup rebuild scan + ensure_manager
systemd drop-ins, git helpers, agent_web_port hash src/lifecycle.rs `nixos-container` shellouts, per-agent flake generator
src/dashboard.rs axum HTTP UI: containers list, T4LK form, approvals src/dashboard.rs axum HTTP UI: containers, approvals, async-form actions
(diff + Approve/Deny buttons), SSE message flow assets/ CSS + JS shipped via include_str!
hive-ag3nt/ in-container harness crate; produces TWO binaries hive-ag3nt/ in-container harness crate; produces TWO binaries
src/lib.rs DEFAULT_SOCKET, DEFAULT_WEB_PORT, re-exports src/lib.rs re-exports + DEFAULT_SOCKET, DEFAULT_WEB_PORT
src/client.rs generic JSON-line request/response over unix socket src/client.rs generic JSON-line request/response over unix socket
src/web_ui.rs per-container axum HTTP page (label + placeholder) src/web_ui.rs per-container axum HTTP page
src/bin/hive-ag3nt.rs sub-agent CLI (serve/send/recv); turn loop + web UI src/events.rs LiveEvent + broadcast Bus for the SSE stream
src/bin/hive-m1nd.rs manager CLI (serve/send/recv/spawn/kill/ src/turn.rs claude --print + stream-json pump; --compact retry
request-apply-commit); recognises HelperEvent src/mcp.rs embedded MCP server (rmcp): AgentServer + ManagerServer
src/login.rs probe /root/.claude/ for a valid session
src/login_session.rs drives `claude auth login` over stdio pipes
src/bin/hive-ag3nt.rs sub-agent main
src/bin/hive-m1nd.rs manager main
assets/ CSS + JS for the per-agent UI
hive-sh4re/ wire types (HostRequest/Response, AgentRequest/Response, hive-sh4re/ wire types (HostRequest/Response, AgentRequest/Response,
ManagerRequest/Response, Message, Approval, HelperEvent) ManagerRequest/Response, Message, Approval, HelperEvent)
nix/ nix/
modules/hive-c0re.nix systemd service + firewall + git path wiring modules/hive-c0re.nix systemd service + firewall + git wiring
templates/agent-base.nix sub-agent nixos-container template templates/harness-base.nix shared scaffolding for sub-agents + manager
templates/manager.nix manager nixos-container template templates/agent-base.nix sub-agent nixosConfiguration
templates/manager.nix manager nixosConfiguration
tests/roundtrip.sh Phase 3 messaging round-trip
tests/approval.sh Phase 5 end-to-end approval flow
tests/dashboard.sh Phase 6+7 HTTP dashboard + SSE + orphan GC
docs/damocles-migration.md options for moving damocles onto hyperhive
``` ```
## Conventions ## Conventions
@@ -104,9 +68,14 @@ docs/damocles-migration.md options for moving damocles onto hyperhive
`applied/<name>/flake.nix`, writes the systemd limits drop-in, then `applied/<name>/flake.nix`, writes the systemd limits drop-in, then
`nixos-container update` + stop + start. Anything that changes per-container `nixos-container update` + stop + start. Anything that changes per-container
state on the host should be re-applied here. state on the host should be re-applied here.
- **Actions are factored.** `approve` / `deny` live in `actions.rs`; the admin - **Actions are factored.** `approve` / `deny` / `destroy` live in
socket and the dashboard POST handlers both call into them, so the two `actions.rs`; the admin socket and the dashboard POST handlers both call
surfaces never drift. into them so the two surfaces never drift.
- **Async forms.** Dashboard mutating forms carry `data-async`; the
`assets/async_forms.js` helper intercepts submit, shows a spinner, and
fetches with `application/x-www-form-urlencoded` (axum `Form` extractor
rejects multipart). New mutating forms should add `data-async` and
optionally `data-confirm`.
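
To illustrate why the helper sends `application/x-www-form-urlencoded`, here is a minimal sketch that models the content-type gate as a plain function. `form_status` is a hypothetical name and this is not axum's code; it only assumes the behaviour described above, where any other body type on a mutating request is refused with 415.

```rust
/// Hypothetical stand-in for the content-type check a urlencoded form
/// extractor performs. Illustration only, not axum's implementation.
fn form_status(content_type: Option<&str>) -> u16 {
    // Drop parameters such as `; charset=utf-8` and compare case-insensitively.
    let essence = content_type
        .map(|ct| ct.split(';').next().unwrap_or("").trim().to_ascii_lowercase());
    match essence.as_deref() {
        Some("application/x-www-form-urlencoded") => 200,
        // multipart/form-data, JSON, or a missing header are all refused.
        _ => 415,
    }
}

fn main() {
    for ct in [
        Some("application/x-www-form-urlencoded"),
        Some("multipart/form-data; boundary=x"),
        None,
    ] {
        println!("{:?} -> {}", ct, form_status(ct));
    }
}
```

A form helper that posts a multipart body would land in the second branch, which is presumably the 415 this commit's title refers to.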
## Gotchas / lessons learned ## Gotchas / lessons learned
@@ -124,33 +93,28 @@ docs/damocles-migration.md options for moving damocles onto hyperhive
UIs (the bind is invisible from the host). We force-clear those vars (and UIs (the bind is invisible from the host). We force-clear those vars (and
`HOST_ADDRESS6` / `LOCAL_ADDRESS6` / `HOST_BRIDGE`) plus set `HOST_ADDRESS6` / `LOCAL_ADDRESS6` / `HOST_BRIDGE`) plus set
`PRIVATE_NETWORK=0`. `PRIVATE_NETWORK=0`.
- **systemd service PATH ≠ host PATH.** Our service explicitly sets - **systemd service PATH ≠ host PATH.** The hive-c0re service sets
`path = [ pkgs.git "/run/current-system/sw" ]`. Additionally, `path = [ pkgs.git "/run/current-system/sw" ]`. In-container harness
`environment.HYPERHIVE_GIT = "${pkgs.git}/bin/git"` bakes the absolute path services do the same so anything an agent adds to its own `agent.nix`
in (read by `lifecycle::git_command()`) so git resolution doesn't depend on (`environment.systemPackages`) is visible to claude's Bash tool without
PATH plumbing at all. editing the service definition. `environment.HYPERHIVE_GIT` bakes git's
absolute path in (read by `lifecycle::git_command()`) for the host.
- **`RuntimeDirectoryPreserve = "yes"`** keeps `/run/hyperhive/` (and the - **`RuntimeDirectoryPreserve = "yes"`** keeps `/run/hyperhive/` (and the
per-agent sub-dirs) across `hive-c0re` restarts. Without it, every restart per-agent sub-dirs) across `hive-c0re` restarts. Without it, every restart
wipes bind sources and existing containers can't be started. wipes bind sources and existing containers can't be started.
- **`register_agent` is idempotent** — drops any prior socket task before - **`register_agent` is idempotent** — drops any prior socket task before
rebinding. Required so a `hive-c0re` restart followed by `rebuild alice` rebinding. Required so a `hive-c0re` restart followed by `rebuild alice`
recreates the agent's socket without needing a clean reinstall. recreates the agent's socket without needing a clean reinstall.
- **`claude-code` is unfree.** `agent-base.nix` allow-lists it specifically. - **`claude-code` is unfree.** `harness-base.nix` allow-lists it
The flake pins it to **nixpkgs-unstable** via `overlays.claude-unstable` specifically. The flake pins it to **nixpkgs-unstable** via
(stable lags too far). The overlay imports unstable with its own `overlays.claude-unstable` (stable lags too far). The overlay imports
`allowUnfreePredicate` so the access inside the overlay doesn't itself trip. unstable with its own `allowUnfreePredicate` so the access inside the
- **Claude credentials are stateful and per-container.** No `ANTHROPIC_API_KEY` overlay doesn't itself trip.
env var path. Today's stopgap: `nixos-container root-login h-<name>` → - **Claude credentials are per-agent.** `/var/lib/hyperhive/agents/<name>/claude/`
`claude` (interactive) → log in once. The harness falls back to echo bind-mounts to `/root/.claude` (RW). Sharing one dir across agents is NOT
replies when `claude --print` fails. **Phase 8** moves this to a per-agent viable — OAuth refresh tokens rotate, so any sibling refresh invalidates
persistent dir at `/var/lib/hyperhive/agents/<name>/claude/` bind-mounted all the others. Login flow runs from the per-agent web UI; creds persist
into the container, with the interactive login driven from the agent's web across `destroy`/recreate.
UI. Sharing one `~/.claude` across agents is NOT viable — OAuth refresh
tokens rotate, so any sibling refresh invalidates all the others.
- **Echo guard.** `hive-ag3nt serve` skips auto-reply when the incoming body
starts with `"echo: "`. Prevents ping-pong loops when both sides fall back
to echo. Real conversations between claude-backed agents *will* run away —
bounding them is the manager's job.
- **Orphan approvals.** If state dirs are wiped out from under a pending - **Orphan approvals.** If state dirs are wiped out from under a pending
approval (test scripts, manual `rm -rf`), the dashboard's next render approval (test scripts, manual `rm -rf`), the dashboard's next render
marks them `failed` with note `"agent state dir missing"` so they fall out marks them `failed` with note `"agent state dir missing"` so they fall out
@@ -159,339 +123,130 @@ docs/damocles-migration.md options for moving damocles onto hyperhive
## Agent MCP surface + turn loop ## Agent MCP surface + turn loop
The harness ships an embedded MCP server (rmcp 1.7) that claude launches as The harness ships an embedded MCP server (rmcp 1.7) that claude launches as
a stdio child via `--mcp-config`. Subcommand: `hive-ag3nt mcp`. Tools: a stdio child via `--mcp-config`. Subcommand: `hive-ag3nt mcp` (or
`hive-m1nd mcp` for the manager surface).
Sub-agent tools:
- `mcp__hyperhive__send(to, body)` — message a peer or the operator. - `mcp__hyperhive__send(to, body)` — message a peer or the operator.
- `mcp__hyperhive__recv()` — drain one inbox message. - `mcp__hyperhive__recv()` — drain one inbox message.
Both translate to `AgentRequest::Send`/`Recv` against the agent's own Manager additionally:
`/run/hive/mcp.sock` (the existing hyperhive socket). The MCP surface is
just claude's view of that socket — same authority, friendlier protocol.
The turn loop in `hive-ag3nt serve` writes
`/run/hive/claude-mcp-config.json` at boot pointing at
`/proc/self/exe mcp` (the running hive-ag3nt binary's nix store path).
Each turn invokes:
```
claude --print --model haiku --mcp-config <path> --tools <builtins> --allowedTools <builtins+mcp> <prompt>
```
**Loop control.** The harness pops one inbox message (the wake signal) per
cycle and hands claude a prompt naming the agent, the sender, the body,
and the MCP tools. Claude drives any further `recv`/`send` itself —
harness no longer relays claude's stdout as a reply. Stdout is logged for
debugging; the side effects (sends via MCP) are what matter.
**Operator input** moved from the hive-c0re dashboard's T4LK form to each
per-agent page. The per-agent `/send` POST hits the new
`AgentRequest::OperatorMsg` / `ManagerRequest::OperatorMsg` wire verb,
which enqueues `Message { from: "operator", to: <self>, body }` directly
into the broker. No more global recipient dropdown — one input per agent
page, scoped to that agent.
**Live view.** Each agent runs a `hive_ag3nt::events::Bus` (a
`tokio::sync::broadcast<LiveEvent>` wrapper). The harness emits:
- `TurnStart { from, body }` when a wake-up message is popped.
- `Stream(value)` for every line claude prints on stdout (parsed
stream-json; flattened under `{kind: "stream", type: ...}` via serde
internal tagging).
- `Note(text)` for stderr lines and non-JSON stdout (so nothing's lost).
- `TurnEnd { ok, note }` when claude exits.
The web UI subscribes via `/events/stream` (SSE) and a small JS panel on
`/` appends rows. No full-page reload — the login form (and anything else
the operator is typing into) stays put.
claude is invoked with `--print --verbose --output-format stream-json` so
tool calls + assistant text + tool results all land as structured events.
The harness no longer reads claude's text stdout into a reply; claude
calls `mcp__hyperhive__send` itself.
**Tool envelope.** Every MCP tool handler in `hive_ag3nt::mcp::AgentServer`
wraps its logic in `run_tool(name, args_debug, async { ... })`. The
envelope guarantees:
1. Pre-log of the request (tool + args).
2. The tool's own logic runs.
3. A status line is appended to the result body
(`[status] N unread message(s) in inbox`) so claude always sees the
current inbox depth without an extra tool call.
4. Post-log of the full result.
`AgentRequest::Status` is the non-mutating peek that powers the status
line (broker's `count_pending`). When adding new tools (manager surface,
notes/state, etc.), use `run_tool` and they pick up the envelope for free.
**Tool whitelist** (see `ALLOWED_BUILTIN_TOOLS` in `hive-ag3nt::mcp`):
- Allowed built-ins: `Bash`, `Edit`, `Glob`, `Grep`, `Read`, `TodoWrite`,
`Write`.
- Denied by omission: `WebFetch`, `WebSearch`, `Task`, `NotebookEdit`, so
no external egress, nested-agent spawning, or Jupyter handling until we
have a real policy story.
- Allowed MCP tools: `mcp__hyperhive__send`, `mcp__hyperhive__recv`.
`Bash` is on the allow-list "for now" — pending a finer-grained allow-list
system for command patterns (`Bash(git *)`-style). When that lands, the
`builtin_tools_arg` shape will probably change to a setting / hooks
combo per claude-code's permissions plumbing.
The manager (`hive-m1nd`) runs the same loop with a `ManagerServer` MCP
flavor:
- `mcp__hyperhive__send`, `recv` — agent surface.
- `mcp__hyperhive__request_spawn(name)` — queue Spawn approval. - `mcp__hyperhive__request_spawn(name)` — queue Spawn approval.
- `mcp__hyperhive__kill(name)` — graceful stop of a sub-agent. - `mcp__hyperhive__kill(name)` — graceful stop.
- `mcp__hyperhive__request_apply_commit(agent, commit_ref)` — submit a - `mcp__hyperhive__request_apply_commit(agent, commit_ref)` — submit a
config change for any agent (`hm1nd` for self-modification). config change for any agent (including `hm1nd` for self-mods).
The shared per-turn plumbing lives in `hive_ag3nt::turn::{write_mcp_config, The shared per-turn plumbing lives in `hive_ag3nt::turn::{write_mcp_config,
run_turn}` so both binaries can't drift apart. run_turn, drive_turn, emit_turn_end, wait_for_login}` so the two binaries
can't drift.
Each turn:
```
claude --print --verbose --output-format stream-json --model haiku \
--continue --settings '{"autoCompactEnabled":false,"autoMemoryEnabled":false}' \
--mcp-config <path> --strict-mcp-config \
--tools <builtins> --allowedTools <builtins+mcp>
# prompt piped over stdin
```
`--continue` keeps a persistent session per agent (claude stores sessions in
`~/.claude/projects/`, which is bind-mounted persistently). Auto-compact and
auto-memory are disabled because hyperhive owns compaction (`/compact` on
overflow, retry once).
**Loop control.** The harness pops one inbox message per cycle (the wake
signal — Recv long-polls server-side for up to 30s waking instantly on a new
broker `Sent` event for this agent) and hands claude a prompt naming the
agent, the sender, the body, and the MCP tools. Claude drives any further
`recv`/`send` itself.
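
The long-poll behaviour described above can be sketched with the standard library alone. This is an illustrative model under stated assumptions, not the harness code: `long_poll_recv` is a hypothetical name, and an in-process `mpsc` channel stands in for the broker's `Sent` notification.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical long-poll: block until a message arrives or the window
/// elapses. `None` means "nothing yet, poll again".
fn long_poll_recv(inbox: &mpsc::Receiver<String>, window: Duration) -> Option<String> {
    match inbox.recv_timeout(window) {
        Ok(body) => Some(body),
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => None,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // A sender that delivers shortly after the poll has started.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        tx.send("hello from operator".to_string()).unwrap();
    });
    // The text describes a 30s window; the call returns as soon as the send lands.
    println!("{:?}", long_poll_recv(&rx, Duration::from_secs(30)));
}
```

The point of the design is that an idle agent costs one blocked call per window rather than a tight polling loop, while a new message still wakes it without waiting for the window to expire.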
**Tool envelope** (`mcp::run_tool_envelope`): every MCP tool handler logs
the request, runs the body, appends a status line (e.g.
`[status] 3 unread message(s) in inbox` from a non-mutating `Status` peek),
logs the result. New tools call this helper.
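
As a rough picture of the envelope, the sketch below is a simplified, synchronous version. It is an assumption-laden illustration: the function here only borrows the name `run_tool_envelope`, takes the unread count as a plain argument instead of issuing a `Status` peek, and logs to stderr.

```rust
/// Simplified, synchronous sketch of a tool envelope. The signature is an
/// assumption made for illustration; the real helper is async and asks the
/// broker for the inbox depth.
fn run_tool_envelope<F>(name: &str, args: &str, unread: usize, body: F) -> String
where
    F: FnOnce() -> String,
{
    // 1. Pre-log the request.
    eprintln!("tool {name} called with {args}");
    // 2. Run the tool's own logic.
    let mut result = body();
    // 3. Append the inbox status line so the caller sees the queue depth.
    result.push_str(&format!("\n[status] {unread} unread message(s) in inbox"));
    // 4. Post-log the full result.
    eprintln!("tool {name} -> {result}");
    result
}

fn main() {
    let out = run_tool_envelope("send", "to=alice", 3, || "sent".to_string());
    println!("{out}");
}
```

Because the status line is appended inside the wrapper, a new tool only has to supply its body to get logging and inbox depth.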
**Tool whitelist** (`mcp::ALLOWED_BUILTIN_TOOLS`):
- Allowed built-ins: `Bash`, `Edit`, `Glob`, `Grep`, `Read`, `TodoWrite`,
`Write`.
- Denied by omission: `WebFetch`, `WebSearch`, `Task`, `NotebookEdit`.
- Allowed MCP tools: as listed above per flavor.
`Bash` is on the allow-list pending a finer-grained pattern allow-list
(`Bash(git *)`-style) — see [TODO.md](TODO.md).
**Live view.** Each agent runs an `events::Bus` (a
`tokio::sync::broadcast<LiveEvent>` wrapper). The harness emits
`TurnStart`, `Stream(value)` (one per parsed stream-json line), `Note`,
`TurnEnd`. The web UI subscribes via `/events/stream` (SSE) and a JS panel
appends rows. No full-page reload — operator input stays put.
## Manager (hm1nd) is hive-c0re-managed ## Manager (hm1nd) is hive-c0re-managed
The manager container runs through the **same lifecycle as sub-agents** The manager container runs through the **same lifecycle as sub-agents**.
no separate code path. On `hive-c0re serve` startup, if `nixos-container On `hive-c0re serve` startup, if `hm1nd` is missing, hive-c0re creates it.
list` doesn't include `hm1nd`, hive-c0re creates it. The manager's flake The manager's flake lives at `/var/lib/hyperhive/applied/hm1nd/`; its
lives at `/var/lib/hyperhive/applied/hm1nd/`; its proposed (manager-editable) proposed config at `/var/lib/hyperhive/agents/hm1nd/config/`. Manager can
config at `/var/lib/hyperhive/agents/hm1nd/config/`. Manager can edit its edit its own `agent.nix` (visible inside the container at
own `agent.nix` (visible inside the container at `/agents/hm1nd/config/`), `/agents/hm1nd/config/`) and submit `request-apply-commit hm1nd <sha>` for
commit, and submit `request-apply-commit hm1nd <sha>` for operator operator approval.
approval — same flow as for sub-agents.
Differences from sub-agents: Differences from sub-agents:
- `flake.nix` extends `hyperhive.nixosConfigurations.manager` (vs - `flake.nix` extends `hyperhive.nixosConfigurations.manager`
`agent-base`). (vs `agent-base`).
- Container name is `hm1nd` (no `h-` prefix). - Container name is `hm1nd` (no `h-` prefix).
- Fixed web UI port (`MANAGER_PORT = 8000`). - Fixed web UI port (`MANAGER_PORT = 8000`).
- `set_nspawn_flags` adds an extra bind: `/var/lib/hyperhive/agents` → - `set_nspawn_flags` adds an extra bind: `/var/lib/hyperhive/agents` →
`/agents` (RW), so the manager can edit per-agent proposed repos. `/agents` (RW), so the manager can edit per-agent proposed repos.
- First-deploy spawn bypasses the approval queue (manager is required - First-deploy spawn bypasses the approval queue (manager is required
infrastructure). infrastructure).
- Per-agent socket is the manager socket at `/run/hyperhive/manager/`, owned - Per-agent socket lives at `/run/hyperhive/manager/`, owned by
by `manager_server::start`. `coordinator::ensure_runtime` returns that `manager_server::start`.
path for manager and the usual `/run/hyperhive/agents/<name>/` for the
rest.
**Migration note:** drop any `containers.hm1nd = { ... }` block from your **Migration note (for older hosts):** drop any `containers.hm1nd = { ... }`
host NixOS config. hyperhive creates and updates the manager itself now. block from your host NixOS config. hyperhive creates and updates the
manager itself now.
## Auto-update on startup ## Auto-update on startup
`hive-c0re serve` runs `auto_update::run` in a background task right after `hive-c0re serve` runs `auto_update::run` in a background task right after
opening the coordinator. It enumerates managed containers and rebuilds any opening the coordinator. It enumerates managed containers and rebuilds any
whose recorded hyperhive rev differs from the current one: whose recorded hyperhive rev differs from the current one — sub-agents and
manager go through the same `lifecycle::rebuild` path. "Rev" = canonical
filesystem path of `cfg.hyperhiveFlake`. Marker file:
`/var/lib/hyperhive/applied/.<name>.hyperhive-rev`.
- **Sub-agents** rebuild via `lifecycle::rebuild` (regenerates If the flake input has no canonical path (e.g. a `github:` URL),
`applied/<name>/flake.nix`, sets nspawn flags, `nixos-container update --flake`). auto-update is a no-op — rebuild manually.
- **Manager** runs `nixos-container update hm1nd` (no `--flake`). The
manager's config lives in the host's NixOS module; this is belt-and-braces
on top of NixOS's own container activation. Idempotent when nothing has
actually changed.
"Rev" = canonical filesystem path of `cfg.hyperhiveFlake` (so `/etc/hyperhive`
resolving to a new `/nix/store/...-source` triggers a rebuild). Marker file:
`/var/lib/hyperhive/applied/.<name>.hyperhive-rev`. If the flake input has
no canonical path (e.g. a `github:` URL), auto-update is a no-op — rebuild
manually. The task is async and never blocks the admin socket; failures are
logged and don't take the daemon down.
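
The rev comparison can be sketched as follows. This is a minimal illustration under stated assumptions: `needs_rebuild` is a hypothetical helper, the marker is assumed to hold the canonical path as plain text, and a missing marker is treated as "rebuild".

```rust
use std::fs;
use std::path::Path;

/// Hypothetical check: does the recorded rev differ from the current one?
/// Returns `None` when the flake input has no canonical filesystem path.
fn needs_rebuild(flake: &Path, marker: &Path) -> Option<bool> {
    // "Rev" is the canonical path, so a symlink retargeted to a new store
    // path shows up as a different rev.
    let current = fs::canonicalize(flake).ok()?;
    // A missing or unreadable marker counts as "never built at this rev".
    let recorded = fs::read_to_string(marker).unwrap_or_default();
    Some(recorded.trim() != current.to_string_lossy().as_ref())
}

fn main() {
    let dir = std::env::temp_dir().join("hyperhive-rev-demo");
    fs::create_dir_all(&dir).unwrap();
    let marker = dir.join(".alice.hyperhive-rev");
    let _ = fs::remove_file(&marker);
    println!("before marker: {:?}", needs_rebuild(&dir, &marker));
    let rev = fs::canonicalize(&dir).unwrap();
    fs::write(&marker, rev.to_string_lossy().as_bytes()).unwrap();
    println!("after marker: {:?}", needs_rebuild(&dir, &marker));
    println!("no path: {:?}", needs_rebuild(Path::new("/nonexistent/flake"), &marker));
}
```

The `None` branch corresponds to the no-op case described above, where the caller skips the container and leaves the rebuild to the operator.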
The dashboard surfaces pending updates per agent: a clickable "needs update The dashboard surfaces pending updates per agent: a clickable "needs update
↻" badge appears whenever the marker differs from current rev. The badge ↻" badge appears whenever the marker differs from current rev. The badge
POSTs `/rebuild/<name>`, calling the same `auto_update::rebuild_agent` / POSTs `/rebuild/<name>`, calling the same `auto_update::rebuild_agent`
`rebuild_manager` path so manual triggers and the startup scan can't drift. path so manual triggers and the startup scan can't drift.
## Build / deploy / test
```sh
# inside the repo (devshell first; no global cargo)
nix develop -c cargo check
nix develop -c cargo clippy --workspace --all-targets -- -D warnings
nix develop -c cargo build
# evaluate everything (incl. rust+nix+toml fmt + clippy)
nix flake check
# build only the workspace package
nix build .#default
./result/bin/{hive-c0re,hive-ag3nt,hive-m1nd}
# deploy to an existing host that imports hyperhive.nixosModules.hive-c0re
cd ~/Repos/<nixos-config-repo>
nix flake update --update-input hyperhive
sudo nixos-rebuild switch --flake .#<host>
sudo systemctl restart hive-c0re # if only env/options changed
# end-to-end tests (each idempotent; runs as root)
sudo bash tests/roundtrip.sh # alice ↔ bob echo round-trip
sudo bash tests/approval.sh # manager edit → request → user approve → rebuilt
sudo bash tests/dashboard.sh # HTTP UI, approve POST, SSE, orphan GC
```
The host config also needs `hyperhive.overlays.default` applied — the module's
default `package = pkgs.hyperhive` requires the overlay to bring the package
in. The `claude-unstable` overlay is applied internally to per-agent flakes
already.
## Phase status
- ✅ Phase 0 — repo + Cargo workspace + flake + agent-base template
- ✅ Phase 1 — container lifecycle; `nixos-container update` hot-reload works
under the patch stack (validated on muede-lpt2)
- ✅ Phase 2 — per-agent sockets, in-memory broker, agent harness round-trips
- ✅ Phase 3 — sqlite broker (durable) + claude-or-echo turn loop
- ✅ Phase 4 — `hm1nd` manager binary + manager socket + declarative
`containers.hm1nd`
- ✅ Phase 5 — git-commit approval flow
- 5a — sqlite approval queue (`request_apply_commit`/`pending`/`approve`/`deny`)
- 5b — per-agent config flakes
- 5c — manager edits `proposed`, hive-c0re writes-only `applied`; container
builds from `applied`. Approve = read `agent.nix` at the approved commit
from `proposed`, copy into `applied`, commit + rebuild. Manager cannot
move `applied/main` on its own.
- ✅ Phase 6 — per-container web UIs (`HIVE_PORT` deterministic-hash) +
hive-c0re dashboard (default 7000, vibec0re aesthetic, deep-linked)
- ✅ Phase 7 — polish:
- 7a — dashboard Approve/Deny buttons + unified diff (`similar` crate)
- 7b — broker broadcast + `/messages/stream` SSE + live message-flow panel
- 7c — `ApprovalResolved` helper events into manager inbox
- 7d — `MemoryMax=2G` + `CPUQuota=50%` systemd drop-in per container
- 7e — damocles migration plan (`docs/damocles-migration.md`)
- ✅ Phase 7 follow-ups:
- Dashboard **T4LK** form — operator can send messages from the browser
(`POST /send`, becomes `from: "operator"` broker message)
- Orphan-approval GC on dashboard render (stale entries auto-failed)
- `PRIVATE_NETWORK=0` + `HOST_ADDRESS=`/`LOCAL_ADDRESS=` cleared in
`set_nspawn_flags` so sub-agent web UI ports are reachable on the host
- `HYPERHIVE_GIT` env var (absolute path) bypasses PATH ambiguity
## Phase 8 — real claude in containers + login UX (in progress)
See PLAN.md → "Phase 8" for the full design. Summary:
- **Per-agent persistent creds dir.** Bind
`/var/lib/hyperhive/agents/<name>/claude/` → `/root/.claude` (RW) in
`set_nspawn_flags`. One OAuth lineage per agent; refresh rotations stay
contained to that agent.
- **State dirs persist by default.** `destroy` keeps
`/var/lib/hyperhive/agents/<name>/` unless the operator passes an explicit
wipe flag. Recreating an agent of the same name reuses prior creds.
- **First spawn is approval-gated.** New agent names go through the same
approval queue as config edits. Manager calls `RequestSpawn` (CLI:
`hive-m1nd request-spawn <name>`); operator can also queue from the
dashboard or `hive-c0re request-spawn <name>`. The host's direct
`hive-c0re spawn <name>` still works as a privileged bypass for tests.
Approve runs `lifecycle::spawn` in a background task; the dashboard polls
via `<meta refresh>` and renders a spinner row while
`nixos-container create` + `update` + `start` is in flight.
- **"needs login" partial-run state.** No valid session in `~/.claude/` →
harness binds the web UI but does NOT start the turn loop. The harness
polls the dir; as soon as a login lands it transitions into the turn loop
without a restart. Dashboard surfaces the state per-agent via a `needs
login` badge in the container list. "Valid session" today is a heuristic
(any regular file inside `/root/.claude/`); we may refine once the
filename layout claude writes is locked in.
- **Login from the per-agent web UI.** Spawn `claude auth login` with plain
stdio pipes (no PTY initially), surface the OAuth URL from stdout on the
page, accept the resulting code via a paste field, write it to the process
stdin. Once `~/.claude/` populates, the existing needs-login polling loop
flips state to Online and starts the turn loop — no separate signaling
needed. The exact command is overridable via `HYPERHIVE_LOGIN_CMD` so we
can adjust without rebuilding. If pipes turn out to be insufficient
(claude refuses without a TTY, raw-mode input, ANSI-only output) we redo
the backend with a PTY (e.g. `portable-pty`).
Implementation order: bind-mount/dir creation → approval-gated spawn +
spinner → "needs login" partial run → PTY login endpoint. The login UI has
nowhere to live until the partial-run mode exists, so don't ship it earlier.
## Approval flow ## Approval flow
End-to-end: manager edits per-agent `proposed` repo → commits → submits commit End-to-end: manager edits per-agent `proposed` repo → commits → submits commit
sha → user approves on host CLI **or** dashboard button → `hive-c0re` reads the sha → user approves on host CLI or dashboard button → `hive-c0re` reads the
file at that sha from `proposed`, applies into `applied`, commits there, runs file at that sha from `proposed`, applies into `applied`, commits there, runs
`nixos-container update`. Helper-event JSON lands in the manager's inbox. `nixos-container update`. Helper-event JSON (`ApprovalResolved`) lands in the
manager's inbox.
``` Two separate git repos per agent:
# Inside the hm1nd container (manager has /agents bind-mounted RW):
cd /agents/alice/config
$EDITOR agent.nix # e.g. environment.systemPackages = [ pkgs.htop ];
git commit -am "add htop"
SHA=$(git rev-parse HEAD)
hive-m1nd request-apply-commit alice $SHA
exit
# On the host (CLI):
sudo hive-c0re pending # shows queued approval with id N
sudo hive-c0re approve N # validates, applies, rebuilds
sudo nixos-container run h-alice -- which htop
# Or on the dashboard (browser):
http://<host>:7000/ # ◆ APPR0VE button next to the diff
```
Per-agent layout — two separate git repos:
``` ```
/var/lib/hyperhive/agents/<name>/config/ # proposed — manager edits, hive-c0re reads only /var/lib/hyperhive/agents/<name>/config/ # proposed — manager edits, hive-c0re reads only
├── .git/
└── agent.nix # the only file the manager can change └── agent.nix # the only file the manager can change
# (initial commit by hive-c0re on first spawn, # (initial commit by hive-c0re on first spawn,
# never touched by hive-c0re again) # never touched by hive-c0re again)
/var/lib/hyperhive/applied/<name>/ # applied — hive-c0re-only; container builds here /var/lib/hyperhive/applied/<name>/ # applied — hive-c0re-only; container builds here
├── .git/ ├── flake.nix # auto-generated; references hyperhive_flake
├── flake.nix # hive-c0re-managed; references hyperhive_flake
└── agent.nix # overwritten by approve from the proposed commit └── agent.nix # overwritten by approve from the proposed commit
``` ```
The container's `--flake` ref is `<applied_dir>#default`. The flake's The container's `--flake` ref is `<applied_dir>#default`. The flake extends
`nixosConfigurations.default` extends `hyperhive.nixosConfigurations.agent-base` `hyperhive.nixosConfigurations.{agent-base|manager}` with `./agent.nix` plus
with `./agent.nix` plus an inline module that sets an inline module setting `programs.git.config.user` (committer identity =
`environment.etc."gitconfig".text` (committer identity = the agent's name) and the agent's name) and `systemd.services.<harness>.environment` (HIVE_PORT,
`systemd.services.hive-ag3nt.environment.HIVE_PORT`/`HIVE_LABEL`. HIVE_LABEL, HIVE_DASHBOARD_PORT).
## Security backlog
- **Unprivileged containers (userns mapping).** Today the nspawn container
runs as a fully privileged root. Goal: `PrivateUsersChown=yes` (or the
nixos-container equivalent) so uid 0 inside maps to an unprivileged uid
on the host, and a container-root compromise lands the attacker on an
ordinary user account, not the host's root. Requires per-agent state
dirs to be chown'd to that uid on the host side.
- **Bash command allow-list.** Replace the blanket `Bash` allow with a
pattern allow-list (`Bash(git *)`, `Bash(nix build .*)`, etc.) per
claude-code's `--allowedTools` extended grammar. Likely lives in
`agent.nix` so each agent can scope its own shell surface.
## Per-agent settings backlog
- **Model override.** Hard-coded to `claude-haiku-4-5-20251001` in the turn
loop right now. Surface it as a per-agent override: operator via
dashboard, manager via `request_apply_commit` setting an attr on the
agent's flake (most natural place since the flake already carries
per-agent env/identity).
## Polish backlog
Not phased — pick when relevant:
- **Operator inbox view** — drain replies addressed to `operator` and show
in the dashboard (today they accumulate in sqlite unread).
- **Per-agent UI substance** — show last N inbox messages, last turn timing,
link back to dashboard.
- **xterm.js terminal** — embed in each per-container UI, attach to a PTY
exposed by the harness.
- **`destroy` verb** — currently `nixos-container destroy` + manual `rm -rf`.
Should be one hive-c0re verb that also purges approvals + state dirs.
- **Bounded broker** — cap rows per recipient or auto-vacuum delivered
messages older than a threshold.
- **Container crash events** — watch `container@*.service` via D-Bus,
push `HelperEvent::ContainerCrash` to the manager.
## Inspirations
- **`~/Repos/bitburner-agent`** — sibling project, drives Claude Code in a
turn loop against a Bitburner CDP session. Patterns to steal as we grow:
per-cycle prompt diffing (vs full state), notes compaction as a separate
short-lived Claude session, MCP server registering tools from a single
`TOOLS` array, dashboard with SSE + xterm.js + sqlite stats sampler,
opaque "terminal event" stream that unifies tool-call / sleep / op-notice
/ etc.

PLAN.md

@ -1,301 +0,0 @@
# hyperhive — Plan
> **Status.** All phases 0–7 have shipped. This file is the original design
> doc; **CLAUDE.md** is the current source of truth for what's actually built,
> the file map, gotchas, and operator runbook. Keep this file for the *why*
> and the original phase rationale; CLAUDE.md for *how things are today*.
>
> **Names.**
> - Repo: `hyperhive`
> - Host-side daemon (helper): **`hive-c0re`** — its own crate (daemon + CLI in one binary).
> - In-container harness: **`hive-ag3nt`** crate, with **two `[[bin]]` targets**:
> - `hive-ag3nt` — runs in each sub-agent container.
> - `hive-m1nd` — runs in the manager container (same crate, second `main.rs`, wires the manager tool surface).
> - Shared crate between `hive-c0re` and `hive-ag3nt` (wire protocol, MCP verb types, message shapes): **`hive-sh4re`**.
>
> **Relationship to damocles.** Damocles is a separate, currently-running setup. `hyperhive` is a new, independent system in its own repo. Damocles' existing `claude-container.nix` informs the agent-base template but is not a dependency. Migration options are laid out in `docs/damocles-migration.md`; the recommendation is to keep them separate for now.
## What we're building
A multi-Claude-Code-agent setup on a single host:
- Each agent runs in its own **nixos-container** (the NixOS wrapper around `systemd-nspawn` — gives us `nixos-container update`, declarative configs, etc.).
- A **manager agent** (itself a nixos-container) coordinates: spawns/kills agents, routes inter-agent messages, filters relevance, summarises, gates lifecycle changes on user approval.
- Host-side Rust daemon **`hive-c0re`** owns container lifecycle, hosts the MCP transport, brokers messages, runs the dashboard, and orchestrates approval-via-git-commit.
- Inside each agent container, **`hive-ag3nt`** runs the agent turn loop (drives `claude` as a subprocess), provides the per-agent web UI, and is the MCP client for that agent.
- Inside the manager container, **`hive-m1nd`** runs the analogous loop with the manager's tool surface and `/agents/**` RW access. It's a second binary in the `hive-ag3nt` crate — same lib code, different `main.rs`.
- Wire protocol and types shared between `hive-c0re` and the harness live in **`hive-sh4re`**.
## Architecture
```
┌────────────────── host ────────────────────────────────────┐
│ │
│ hive-c0re (Rust daemon, NixOS service) │
│ ├── lifecycle : nixos-container CRUD + nixos-container │
│ │ update on approved config commits │
│ ├── broker : sqlite message store + routing │
│ ├── mcp : per-socket MCP servers (see Sockets) │
│ ├── approvals : git-commit-based change requests │
│ └── dashboard : HTMX/SSE web UI │
│ │
│ /var/lib/hyperhive/ │
│ ├── state-repo/ (world: who exists, etc.) │
│ ├── shared-instructions/ (RO into every agent) │
│ └── agents/ │
│ ├── manager/ │
│ │ ├── config/ (flake repo, manager's own setup) │
│ │ ├── prompts/ (manager's role/CLAUDE.md) │
│ │ └── state/ (RW for manager) │
│ └── <name>/ │
│ ├── config/ (flake repo, RW for manager, RO ag) │
│ ├── prompts/ (RO inside agent) │
│ └── state/ (RW for agent + manager) │
│ │
│ Sockets (one per principal — perms by mount location): │
│ ├── /run/hyperhive/host.sock admin/CLI on host │
│ ├── /run/hyperhive/manager.sock → manager container │
│ └── /run/hyperhive/agents/<name>.sock → that agent │
│ │
│ ┌─ nixos-container: hm1nd ──────────────────────┐ │
│ │ hive-m1nd (Rust, hive-ag3nt crate) │ │
│ │ ├ MCP client → /run/hyperhive/manager.sock │ │
│ │ ├ turn loop driving `claude` │ │
│ │ └ per-container web UI │ │
│ │ /var/lib/hyperhive/agents/** bind RW │ │
│ │ MCP tools (manager surface): │ │
│ │ send(to, body, wait_for_reply: bool), │ │
│ │ recv / next_event, │ │
│ │ request_spawn, request_kill, │ │
│ │ request_apply_commit, inject_peer_info, │ │
│ │ update_shared_instructions │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌─ nixos-container: h-<name> ───────────────────┐ │
│ │ hive-ag3nt (Rust) │ │
│ │ MCP client → /run/hyperhive/agents/<name>.sock │
│ │ state/ RW, config/ + prompts/ + shared RO │ │
│ │ MCP tools (agent surface): │ │
│ │ send, recv, request_install │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Key model decisions
**Sockets, not identity gating.** One unix socket per principal, bind-mounted into the right place. The socket *is* the principal — no `SO_PEERCRED` lookups, no token plumbing. Perms are filesystem perms on the host side, plus the fact that only the matching container has the bind-mount. Each socket runs the MCP tool surface appropriate to its principal.
**Container naming.** `nixos-container` caps the total container name at 11 chars (it gets encoded into network interface names). The manager runs as `hm1nd` (a compressed form of `hive-m1nd`). Sub-agents run as `h-<name>` with `<name>` capped to 9 chars (`MAX_AGENT_NAME`). The two namespaces don't collide. On-disk paths (`/var/lib/hyperhive/agents/foo/`) and socket paths (`/run/hyperhive/agents/foo/mcp.sock`) use the bare logical name; the prefixing lives only at the nixos-container layer.
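The naming rule above can be sketched as a small helper. This is a minimal illustration, not the project's code: `container_name` and `MANAGER_CONTAINER` are hypothetical names, and the empty-name check is an assumption beyond what the text states. Only the 9-char cap (`MAX_AGENT_NAME`), the `h-` prefix, and `hm1nd` come from the paragraph above.

```rust
/// Cap on the logical sub-agent name, as stated in the text.
const MAX_AGENT_NAME: usize = 9;
/// Container name used for the manager (hypothetical constant name).
const MANAGER_CONTAINER: &str = "hm1nd";

/// Map a logical agent name to its nixos-container name (`h-<name>`).
/// Length is measured in bytes, which matches chars for ASCII names.
fn container_name(agent: &str) -> Result<String, String> {
    if agent.is_empty() {
        // Assumption: empty names are rejected; the text does not say.
        return Err("agent name must not be empty".to_string());
    }
    if agent.len() > MAX_AGENT_NAME {
        return Err(format!(
            "agent name `{}` is {} chars; the cap is {}",
            agent,
            agent.len(),
            MAX_AGENT_NAME
        ));
    }
    // "h-" (2 chars) + at most 9 chars stays within the 11-char limit.
    Ok(format!("h-{}", agent))
}
```

Because sub-agent containers always start with `h-` and the manager is exactly `hm1nd`, no logical name can map onto the manager's container name.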
**Approvals are git commits.** `hive-c0re` maintains a `state-repo` on host that records the world (which agents exist, their roles, etc.). Per-agent flake configs (`agents/<name>/config/`) are themselves git repos. The manager edits clones with plain `git` CLI inside its container and asks `hive-c0re` to apply a commit via `request_apply_commit(agent, sha)`. `hive-c0re` queues it; once approved, fast-forwards `main` and reconciles state (rebuild containers, etc.). **No abstract "approval token" — the commit hash is the token.**
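One way to picture "the commit hash is the token" is a queue keyed directly by sha. The sketch below is an in-memory illustration under assumptions: `ApprovalQueue`, `ApprovalState`, and `resolve` are hypothetical names, and the real daemon would also fast-forward `main` and rebuild containers on approval, which is omitted here.

```rust
use std::collections::HashMap;

/// Lifecycle of a requested commit (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ApprovalState {
    Pending,
    Approved,
    Denied,
}

/// Approval queue keyed by commit sha; the value is (agent, state).
#[derive(Default)]
struct ApprovalQueue {
    by_sha: HashMap<String, (String, ApprovalState)>,
}

impl ApprovalQueue {
    /// Queue a commit for approval. Re-requesting a known sha is a no-op.
    fn request_apply_commit(&mut self, agent: &str, sha: &str) {
        self.by_sha
            .entry(sha.to_string())
            .or_insert_with(|| (agent.to_string(), ApprovalState::Pending));
    }

    /// Approve or deny a pending commit; fails for unknown or resolved shas.
    fn resolve(&mut self, sha: &str, approve: bool) -> Result<ApprovalState, String> {
        let entry = self
            .by_sha
            .get_mut(sha)
            .ok_or_else(|| format!("no request for commit {}", sha))?;
        if entry.1 != ApprovalState::Pending {
            return Err(format!("commit {} is already resolved", sha));
        }
        entry.1 = if approve {
            ApprovalState::Approved
        } else {
            ApprovalState::Denied
        };
        Ok(entry.1)
    }
}
```

No separate token table is needed: the sha both identifies the change and authorises exactly that content.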
**Approval UX evolves.** Early phases approve by CLI (`hive-c0re approve <sha>`) or even direct `git merge` on the host. The dashboard's commit-view UI is one of the last features built (so the system runs end-to-end on CLI long before the browser does anything useful).
**Manager is a second binary in the same crate.** `hive-m1nd` lives in the `hive-ag3nt` crate as a second `[[bin]]`, sharing the crate's lib code. The two binaries differ in:
- Tool surface (manager's `main.rs` wires `request_spawn`/`request_kill`/`request_apply_commit`/`inject_peer_info`/`update_shared_instructions`).
- Bind-mounts (manager gets `agents/**` RW; agents get only their own `state/`).
- Auto-restart (manager is auto-restarted by `hive-c0re`; agents stay down once killed).
- The manager container is declared in the host NixOS module as a known container.
**Manager concurrency = event loop.** `hive-m1nd` pulls from a heterogeneous `next_event` stream: inbound agent messages, replies to sync sends, lifecycle events from `hive-c0re` (crash, OOM, approval-resolved), and dashboard signals. One queue, one claude turn per event.
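A rough sketch of that single-queue loop, using a standard-library channel in place of the real `next_event` transport. `ManagerEvent` and `run_event_loop` are hypothetical names, and turning each event into a prompt string stands in for actually running a claude turn.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical event type; variants mirror the sources named in the text.
#[derive(Debug, Clone, PartialEq)]
enum ManagerEvent {
    AgentMessage { from: String, body: String },
    SyncReply { from: String, body: String },
    Lifecycle(String),
    DashboardSignal(String),
}

/// Drain one queue, producing one prompt (one turn) per event.
/// Returns once every sender has been dropped.
fn run_event_loop(events: Receiver<ManagerEvent>) -> Vec<String> {
    let mut turns = Vec::new();
    // `recv` blocks until an event arrives, so events are handled strictly in order.
    while let Ok(event) = events.recv() {
        let prompt = match event {
            ManagerEvent::AgentMessage { from, body } => {
                format!("message from {}: {}", from, body)
            }
            ManagerEvent::SyncReply { from, body } => {
                format!("reply from {}: {}", from, body)
            }
            ManagerEvent::Lifecycle(what) => format!("lifecycle event: {}", what),
            ManagerEvent::DashboardSignal(what) => format!("dashboard signal: {}", what),
        };
        turns.push(prompt);
    }
    turns
}
```

Serialising everything through one receiver means the manager never runs two turns at once, at the cost of slow events delaying those queued behind them.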
**Anthropic credentials.** ~~Shared key on host~~ — revised in Phase 8.
Per-agent persistent `~/.claude/` dir bind-mounted from
`/var/lib/hyperhive/agents/<name>/claude/`. OAuth refresh tokens rotate, so
sharing across agents is a non-starter (any sibling refresh invalidates all
the others). One interactive login per agent, ever; creds survive
`destroy`/recreate by default. Login flow runs from the per-agent web UI
(see Phase 8).
**Workdir bootstrap.** Each agent's `state/` starts empty. Initial-task message tells the agent what to clone/set up. Manager can drop big artefacts into `state/` directly (it has RW) and pass the path as a message reference.
## Riskiest assumptions (test in this order)
1. **`nixos-container update` hot-reloads under the in-flight `nsresourced` / `mountfsd` / privateUsers patch stack** without nuking the running harness (and thus the `claude` session). Validated in Phase 1.
2. **The harness driving `claude` as a long-running turn loop is stable** — claude as a subprocess streaming output back; harness deciding when a turn ends; injecting inbox messages as new user turns. Validated in Phase 3.
3. **`hive-m1nd` (as a claude session) sensibly drives spawn / route / peer-injection.** Behavioural; only knowable by running it. Phase 4.
4. **`hive-c0re` throughput under multiple agents** — bursty messaging, MCP relay, sqlite writes. Likely fine; flag for measurement.
## Phased path
All phases ✅ shipped. Each section below is the *original* design with notes
on what actually landed and what deviated. See CLAUDE.md → "Phase status" for
the canonical summary.
### ✅ Phase 0 — repo bootstrap
- Create `~/Repos/hyperhive/`, init flake.
- Cargo workspace: `hive-c0re/`, `hive-sh4re/`, `hive-ag3nt/` (the last with two `[[bin]]` targets — `hive-ag3nt` and `hive-m1nd`). All compile, all do nothing useful.
- NixOS module skeleton (`nix/modules/hive-c0re.nix`) that runs the daemon as a systemd service on the host.
- Agent base template (`nix/templates/agent-base.nix`) that builds a nixos-container including the `hive-ag3nt` binary.
- **Exit:** `nixos-container create test-agent --flake .#agent-base && nixos-container start test-agent` brings up a container whose `hive-ag3nt` prints "hello" and exits.
### ✅ Phase 1 — container lifecycle + Risk 1
- `hive-c0re`: open host admin socket (`/run/hyperhive/host.sock`); verbs `spawn(name)`, `kill(name)`, `rebuild(name)`, `list()`. Uses `nixos-container` underneath; container name on the host is `h-<name>` (sub-agents) or `hm1nd` (manager).
- CLI tool talking to the admin socket (same `hive-c0re` binary, subcommand-driven).
- Manually mutate an agent's config flake, call `rebuild`, observe whether `hive-ag3nt` survives.
- **Decision:** if hot-reload doesn't preserve the harness, that becomes a hard requirement of `hive-ag3nt`'s design (resume from disk state). Document the outcome.
- **Exit:** spawn / rebuild / kill via CLI is reliable; known behaviour for in-flight rebuilds.
### ✅ Phase 2 — sockets + minimal MCP
- `hive-c0re` opens `manager.sock` and `agents/<name>.sock` (one per spawned agent). Per-socket MCP server with the right tool surface baked in. Types from `hive-sh4re`.
- `hive-ag3nt`: MCP client (types from `hive-sh4re`), connects to its socket on startup, exchanges hello.
- Tools: agent gets `send(to, body)`, `recv()`. No persistence yet (in-memory).
- **Exit:** two test agents exchange messages through `hive-c0re` when driven manually.
### ✅ Phase 3 — broker + turn loop
- `hive-c0re`: sqlite-backed message store (`messages` table; `id, sender, recipient, body, sent_at, delivered_at`). Survives `hive-c0re` restart.
- `hive-ag3nt` (lib): real turn loop. Reads from `recv`; feeds new messages as user turns to `claude`; captures output; calls `send` for outbound. Long-running.
- **Exit:** two `hive-ag3nt`-driven agents have a back-and-forth conversation through `hive-c0re`.
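The `messages` row shape and the send/recv semantics can be illustrated with an in-memory stand-in. This is a sketch under assumptions: the column types, the oldest-first delivery order, and the names `Broker` and `now_secs` are not from the source, and the real store is sqlite so that rows survive a restart.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// One row of the `messages` table; column types are assumed.
#[derive(Debug, Clone)]
struct Message {
    id: u64,
    sender: String,
    recipient: String,
    body: String,
    sent_at: u64,
    delivered_at: Option<u64>,
}

/// In-memory stand-in for the sqlite-backed store.
#[derive(Default)]
struct Broker {
    rows: Vec<Message>,
    next_id: u64,
}

/// Seconds since the Unix epoch (0 if the clock is before the epoch).
fn now_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}

impl Broker {
    /// Append an undelivered message and return its id.
    fn send(&mut self, sender: &str, recipient: &str, body: &str) -> u64 {
        self.next_id += 1;
        self.rows.push(Message {
            id: self.next_id,
            sender: sender.to_string(),
            recipient: recipient.to_string(),
            body: body.to_string(),
            sent_at: now_secs(),
            delivered_at: None,
        });
        self.next_id
    }

    /// Return the oldest undelivered message for `recipient` and mark it delivered.
    fn recv(&mut self, recipient: &str) -> Option<Message> {
        let row = self
            .rows
            .iter_mut()
            .find(|m| m.recipient == recipient && m.delivered_at.is_none())?;
        row.delivered_at = Some(now_secs());
        Some(row.clone())
    }
}
```

Keeping delivered rows and only stamping `delivered_at` is what lets a restarted daemon tell which messages still need to be handed out.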
### ✅ Phase 4 — `hive-m1nd` + privileged surface
- `hive-m1nd` binary (second `[[bin]]` in `hive-ag3nt`) wires the manager tool surface.
- Manager container (`hm1nd`) declared in host NixOS module (auto-restart). Bind-mount `agents/**` RW.
- Manager socket gets the privileged tool surface: `request_spawn`/`request_kill`, `request_apply_commit`, `inject_peer_info`, `send(..., wait_for_reply=true)`.
- Smoke: attach a terminal to the manager container (`nixos-container root-login`); ask `hive-m1nd` to spawn an agent and route a message to it.
- **Exit:** manager spawns, routes, kills a child agent end-to-end; lifecycle still gated by manual CLI approval (no GUI yet).
### ✅ Phase 5 — git-commit approval flow
- `state-repo` on host tracks world (agents directory listing, allow-lists, etc.).
- Per-agent `config/` flake repos created at spawn time.
- Manager's container: bind-mounted clones; uses plain `git` CLI to edit/commit.
- `request_apply_commit(name, sha)` queues a change in `hive-c0re`. Approval = CLI `hive-c0re approve <sha>`; on approve, `hive-c0re` fast-forwards `main` and reconciles (rebuild if config changed, run `nixos-container update`).
- Per-agent allow-list for `request_install`: in-list installs become auto-applied commits; novel pkgs become pending commits.
- **Exit:** manager adds a package to an agent → user approves on CLI → agent picks it up.
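The allow-list rule for `request_install` reduces to a single membership check. A minimal sketch, assuming the allow-list is a set of package names; `InstallDecision` and `classify_install` are hypothetical names.

```rust
use std::collections::HashSet;

/// Outcome of a `request_install` call (hypothetical type).
#[derive(Debug, PartialEq)]
enum InstallDecision {
    /// Package is on the agent's allow-list: commit is applied without review.
    AutoApply,
    /// Package is new to this agent: commit waits in the approval queue.
    PendingApproval,
}

/// Decide how an install request is handled for one agent.
fn classify_install(allow_list: &HashSet<String>, package: &str) -> InstallDecision {
    if allow_list.contains(package) {
        InstallDecision::AutoApply
    } else {
        InstallDecision::PendingApproval
    }
}
```

Either branch still ends in a commit, so the audit trail is the same whether or not a human was asked.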
### ✅ Phase 6 — per-agent web UI + dashboard MVP
- `hive-ag3nt` web UI module (in the crate's lib): HTTP on a per-container host port (host network): status, last messages, embedded terminal (xterm.js over WebSocket). Both `hive-ag3nt` and `hive-m1nd` binaries expose it.
- Dashboard served by `hive-c0re`: agent list, per-agent status, links to each agent's UI, link to manager's UI.
- No approval UI yet; users still approve via CLI.
- **Exit:** browser is a usable navigation layer over the whole system.
### ✅ Phase 7 — dashboard commit view + polish
- Pending-commits view in the dashboard with diff rendering and Approve/Deny buttons (replaces the CLI approve step).
- Live message-flow view (`hive-c0re` sees all MCP relay traffic).
- `hive-c0re` event push into `hive-m1nd`'s `next_event` (crashes, OOM, approval resolved).
- Resource caps in nixos-container units (`MemoryMax`, `CPUQuota`).
- Migration plan for damocles' existing containers (separate doc).
## Repo layout (target)
```
~/Repos/hyperhive/
├── hive-c0re/ # host-side Rust crate (daemon + CLI)
│ ├── src/
│ │ ├── main.rs
│ │ ├── lifecycle.rs # nixos-container CRUD
│ │ ├── broker.rs # sqlite message store
│ │ ├── mcp/ # MCP servers (per-socket)
│ │ ├── approvals.rs # git-commit change requests
│ │ ├── state_repo.rs # world-state git repo
│ │ └── dashboard/ # HTMX/SSE web UI
│ └── Cargo.toml
├── hive-sh4re/ # shared crate (hive-c0re ↔ hive-ag3nt)
│ ├── src/
│ │ ├── lib.rs
│ │ ├── protocol.rs # MCP wire types, verb shapes
│ │ └── message.rs # message types
│ └── Cargo.toml
├── hive-ag3nt/ # in-container harness crate; two binaries
│ ├── src/
│ │ ├── lib.rs # shared harness code
│ │ ├── mcp_client.rs # MCP client over unix socket
│ │ ├── turn_loop.rs # claude subprocess driver
│ │ ├── web_ui.rs # per-container UI scaffolding
│ │ └── bin/
│ │ ├── hive-ag3nt.rs # agent tool surface wiring
│ │ └── hive-m1nd.rs # manager tool surface wiring
│ └── Cargo.toml # [[bin]] hive-ag3nt + [[bin]] hive-m1nd
├── nix/
│ ├── modules/hive-c0re.nix # NixOS module for the host
│ ├── templates/
│ │ ├── agent-base.nix # base agent nixos-container (pulls hive-ag3nt)
│ │ └── manager.nix # manager nixos-container (pulls hive-m1nd)
│ └── overlay.nix # exposes the three crates + both bins as pkgs
├── web/ # static assets, HTMX templates
├── flake.nix
├── Cargo.toml # workspace
└── PLAN.md # this file
```
## Resolved implementation decisions
The original open-decisions list, with what we picked:
- **Wire format.** Custom JSON-line over unix sockets (host admin / manager /
per-agent), not real MCP stdio. Simpler and good enough for now; can swap
to MCP later. SSE for the dashboard message-flow.
- **Per-agent web UI.** `axum` HTTP server inside each container at a port
 hashed from the agent name (8100–8999); manager at fixed 8000; dashboard
at 7000. Plain HTML, no HTMX, no xterm.js yet.
- **`state-repo` schema.** Per-agent dir with files; not a single TOML.
Realised as two parallel git repos per agent: `proposed` (manager-editable)
and `applied` (hive-c0re-only). Container builds from `applied`.
- **Manager access to applied state.** *Not* RW-mounted. Manager only has
`proposed/` bind-mounted; `applied/` is hive-c0re-only.
- **One binary or two.** One: `hive-c0re` is daemon + CLI dispatched by
subcommand (`serve` / `spawn` / `kill` / `rebuild` / `list` / `pending` /
`approve` / `deny`).
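The per-agent port choice could look roughly like the following, assuming the range is 8100 to 8999 inclusive. The source does not say which hash is used; FNV-1a is swapped in here purely as a stand-in, and `agent_port` is a hypothetical name.

```rust
/// First port of the per-agent range.
const PORT_BASE: u16 = 8100;
/// Number of ports in the range 8100..=8999.
const PORT_SPAN: u32 = 900;

/// 32-bit FNV-1a. Assumption: a stand-in for whatever hash the daemon really uses.
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for b in bytes {
        hash ^= u32::from(*b);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Derive a stable web-UI port for an agent from its logical name.
fn agent_port(name: &str) -> u16 {
    // The remainder is below 900, so the cast and the addition cannot overflow.
    PORT_BASE + (fnv1a(name.as_bytes()) % PORT_SPAN) as u16
}
```

A hashed port is stable across restarts without any allocation table, but two names can collide within 900 slots, so a real implementation would need some way to detect or resolve that.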
### ⏳ Phase 8 — real claude in containers + login UX
Until this lands the harness falls back to the echo path; we've never run an
end-to-end turn with a real model in a real container.
**Credential model.** Per-agent persistent dir at
`/var/lib/hyperhive/agents/<name>/claude/` bind-mounted RW to `/root/.claude`
inside the container. *Not* shared across agents: OAuth refresh tokens rotate,
and sharing one dir means the first refresh by any sibling invalidates all the
others. Each agent owns its own token lineage from first login onward.
**State-dir persistence.** Agent state dirs (including the claude creds dir)
persist across `destroy`/recreate by default. The `destroy` verb only purges
state when given an explicit "wipe" flag from the operator — recreating an
agent of the same name reuses prior creds with no re-login.
**First-deploy approval.** Spawning a brand-new agent name goes through the
existing approval queue (same path as config edits). The dashboard shows a
spinner while `nixos-container create` + `update` + `start` run.
**"needs login" agent state.** If the bound `~/.claude/` has no valid session,
the harness boots in a partial mode: per-agent web UI is up, but the turn
loop does NOT start. Dashboard surfaces the state per-agent so the operator
knows where to click.
**Login over the per-agent web UI.** No more `nixos-container root-login` for
the common case. The agent's web UI exposes a "log in" action that:
1. Spawns `claude auth login` (or equivalent) inside the container with plain
stdio pipes — no PTY unless we discover we need one.
2. Reads the OAuth URL from the process stdout and shows it on the page.
3. Provides a paste field for the resulting code; writes it to the process
stdin.
4. On success, transitions out of "needs login" and starts the turn loop.
If `claude` turns out to require a TTY (refuses on `!isatty()`, uses raw-mode
input, or only renders the URL with ANSI styling), redo the backend with a
PTY (e.g. `portable-pty`). Don't pre-build for that — start simple.
**Sequence.** Ship in this order — don't do (4) before (3) or there's nowhere
for the login UI to live: (1) bind-mount + per-agent dir creation in
`lifecycle::set_nspawn_flags`, (2) approval-gated first spawn + dashboard
spinner, (3) harness "needs login" partial-run mode, (4) PTY-backed login
endpoint on the per-agent UI.
**Exit:** spawn a new agent from the dashboard → approve → wait for spinner
→ click "log in" on the agent's page → complete OAuth in the browser →
paste code → agent enters the turn loop and replies to a T4LK message via
real `claude --print`.
## Polish backlog (not phased)
See CLAUDE.md → "Polish backlog" for the live list. Highlights: operator
inbox drain, per-agent UI substance, xterm.js terminal embed, `destroy` verb,
bounded broker, container-crash events via D-Bus.
## Explicitly deferred / out of v1 scope
- Per-agent API keys, cost attribution.
- Pooled / pre-warmed containers.
- Destroy verb on the `hive-c0re` API (use `rm` on host).
- Backup / replication of `agents/` state.
- Migration of existing damocles containers (`docs/damocles-migration.md`).
- Anything about multiple hosts.

README.md Normal file

@ -0,0 +1,49 @@
# hyperhive
Multi-Claude-Code-agent orchestration on **nixos-containers**.
A host-side Rust daemon (`hive-c0re`) spawns nspawn-isolated agent
containers and brokers messages between them. A manager agent (`hm1nd`)
coordinates the swarm and gates lifecycle changes on user approval via git
commits, surfaced through a vibec0re-styled HTTP dashboard.
```
             ┌────────────────────────┐
             │   hive-c0re (Rust)     │
operator ──▶ │  • lifecycle           │ ─▶ nixos-containers
             │  • broker (sqlite)     │     ├── hm1nd    (manager)
             │  • approvals (sqlite)  │     ├── h-alice  (sub-agent)
             │  • dashboard :7000     │     └── h-bob    ...
             │  • per-agent sockets   │
             └────────────────────────┘
```
Each container runs a harness binary that drives `claude --print --continue`
in a turn loop, exposes a per-agent web UI with a live event stream, and
talks to the broker over a bind-mounted unix socket via an embedded MCP
server claude calls into.
## Read next
- **[CLAUDE.md](CLAUDE.md)** — developer reference: conventions, gotchas,
per-subsystem design, file map.
- **[TODO.md](TODO.md)** — open work items.
## Build / deploy
```sh
# inside the repo (devshell first; no global cargo)
nix develop -c cargo check
nix develop -c cargo clippy --workspace --all-targets -- -D warnings
# evaluate everything (rust+nix+toml fmt + clippy)
nix flake check
# deploy to a host that imports `hyperhive.nixosModules.hive-c0re`
cd ~/Repos/<nixos-config-repo>
nix flake update --update-input hyperhive
sudo nixos-rebuild switch --flake .#<host>
```
The host config also needs `hyperhive.overlays.default` applied — the
module's default `package = pkgs.hyperhive` requires the overlay.

TODO.md Normal file

@ -0,0 +1,69 @@
# TODO
Pick anything from here when relevant. Cross-cutting design notes live in
[CLAUDE.md](CLAUDE.md); high-level project intro in [README.md](README.md).
## Security
- **Unprivileged containers (userns mapping).** Today the nspawn container
runs as a fully privileged root. Goal: `PrivateUsersChown=yes` (or the
nixos-container equivalent) so uid 0 inside maps to an unprivileged uid
on the host, and a container-root compromise lands the attacker on an
ordinary user account, not the host's root. Requires per-agent state
dirs to be chown'd to that uid on the host side.
- **Bash command allow-list.** Replace the blanket `Bash` allow with a
pattern allow-list (`Bash(git *)`, `Bash(nix build .*)`, etc.) per
claude-code's `--allowedTools` extended grammar. Likely lives in
`agent.nix` so each agent can scope its own shell surface.
## Per-agent settings
- **Model override.** Hard-coded to `haiku` in the turn loop right now.
Surface as a per-agent override: operator via dashboard, manager via
`request_apply_commit` setting an attr on the agent's flake (most natural
place since the flake already carries per-agent env/identity).
## UI / UX
- **Operator inbox view.** Drain replies addressed to `operator` and show
them on the dashboard. Today they accumulate in sqlite unread; you can
only see them by watching the live panel of the agent that sent them.
- **Per-agent UI substance.** Show last N inbox messages, last turn timing,
link back to dashboard.
- **Static-asset SPA-style web UI.** Move toward: `index.html` is static,
CSS/JS is static, all dynamic state is fetched over SSE / JSON endpoints.
Currently the index HTML is server-rendered with state-dependent
fragments inlined; the live event stream + async forms are already SSE /
fetch. Goal is a cleaner split so the UI is one HTML file + JS app +
small JSON API.
- **Background JS refresh on the live panel.** Already there for sends;
any remaining places that reload the whole page should switch to fetch +
partial updates.
- **xterm.js terminal** embedded per-agent, attached to a PTY exposed by
the harness. Pairs well with the unprivileged-container work — would let
the operator drop into the container without `nixos-container root-login`.
## Loop substance
- **Notes / state persistence.** Per-agent `notes.md` for durable scratch
memory across turns. Compaction-on-overflow runs a separate short-lived
claude session (à la bitburner-agent). The `--continue` session already
gives short-term memory, but notes give cross-session durable knowledge
that isn't lost on a `/compact` boundary.
## Lifecycle / reliability
- **Bounded broker.** Cap rows per recipient or auto-vacuum delivered
messages older than a threshold. sqlite is growing unbounded.
- **Container crash events.** Watch `container@*.service` via D-Bus, push
`HelperEvent::ContainerCrash` to the manager's inbox so the manager can
react (restart, escalate, etc.).
- **`destroy --purge`.** Today `destroy` keeps state by design; add an
opt-in flag (CLI + dashboard) to also wipe `/var/lib/hyperhive/agents/<name>/`
and `/var/lib/hyperhive/applied/<name>/`.
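For the bounded-broker item, one possible policy is a per-recipient cap that evicts the oldest row on overflow. This is only a sketch of that option: `BoundedInbox` is a hypothetical in-memory type, and evicting undelivered messages loses data, so the real fix may prefer vacuuming delivered rows by age instead.

```rust
use std::collections::{HashMap, VecDeque};

/// Per-recipient queues with a fixed cap (hypothetical type).
struct BoundedInbox {
    cap: usize,
    queues: HashMap<String, VecDeque<String>>,
}

impl BoundedInbox {
    /// Create an inbox holding at most `cap` messages per recipient (minimum 1).
    fn new(cap: usize) -> Self {
        Self {
            cap: cap.max(1),
            queues: HashMap::new(),
        }
    }

    /// Enqueue a message, dropping the oldest ones if the recipient is at the cap.
    /// Returns how many messages were evicted.
    fn push(&mut self, recipient: &str, body: &str) -> usize {
        let cap = self.cap;
        let queue = self.queues.entry(recipient.to_string()).or_default();
        let mut evicted = 0;
        while queue.len() >= cap {
            queue.pop_front();
            evicted += 1;
        }
        queue.push_back(body.to_string());
        evicted
    }

    /// Number of messages currently held for `recipient`.
    fn len(&self, recipient: &str) -> usize {
        self.queues.get(recipient).map_or(0, |q| q.len())
    }
}
```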
## Cleanup / docs
- **Debug-only sub-commands.** `hive-ag3nt send/recv` and the analogous
`hive-m1nd send/recv/...` exist only for ops debugging. Move them into a
hidden `debug` sub-command to declutter `--help`, or drop entirely.


@ -115,9 +115,12 @@ fn render_online(label: &str) -> String {
 const input = document.getElementById('sendbody');\n \
 const body = input.value.trim();\n \
 if (!body) return;\n \
-const fd = new FormData();\n \
-fd.set('body', body);\n \
-const resp = await fetch('/send', {{ method: 'POST', body: fd, redirect: 'manual' }});\n \
+const resp = await fetch('/send', {{\n \
+method: 'POST',\n \
+headers: {{ 'Content-Type': 'application/x-www-form-urlencoded' }},\n \
+body: new URLSearchParams({{ body }}),\n \
+redirect: 'manual',\n \
+}});\n \
 if (resp.type === 'opaqueredirect' || (resp.ok && resp.status < 400)) {{\n \
 input.value = '';\n \
 }} else {{\n \


@ -14,9 +14,12 @@
 btn.innerHTML = '<span class="spinner">◐</span>';
 }
 try {
+// axum's `Form` extractor wants application/x-www-form-urlencoded;
+// FormData would send multipart/form-data and bounce with 415.
 const resp = await fetch(form.action, {
 method: form.method || 'POST',
-body: new FormData(form),
+headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
+body: new URLSearchParams(new FormData(form)),
 redirect: 'manual',
 });
 const ok = resp.ok