müde 897e7c07ae dashboard: spawn form moves under approvals; docs synced

submitting R3QU3ST SP4WN immediately queues an approval that lands
in the very next list. the form belonged with that list, not at the
top of containers — the agent doesn't exist yet at form time anyway.

docs: claude.md grows operator_questions.rs / events.rs sqlite /
broker vacuum to the file map; web-ui shape lists the actual current
endpoint set (per-agent cancel/compact/history, dashboard tombstone
purge/answer/spawn); live-view section now describes the state
badge, sticky-bottom scroll, history backfill, and the terminal-
embedded prompt with its slash commands; dashboard-action-surface
rewritten around the new six-section page (containers / kept-state /
questions / inbox / approvals / message-flow) and the two-line
container row. new 'persistence + retention' section documenting both
sqlite databases and their vacuum cadences. readme picks up the new
mgr mcp surface (start/restart/ask_operator) + operator-side
features list + ask_operator answer flow.

todo trimmed of shipped items (bigger terminal / sticky scroll /
cancel button / /compact trigger / /cancel command). new entry for
the two-step spawn-with-preconfig flow.

2026-05-15 20:02:54 +02:00

23 KiB

Raw Blame History

hyperhive — developer reference

Operator + dev notes: conventions, gotchas, per-subsystem design.

High-level project intro: README.md.
Open work + backlog: TODO.md.

File map

hive-c0re/         host daemon + CLI (one binary, subcommand-dispatched)
  src/main.rs           clap setup; serve / spawn / kill / rebuild / list /
                         pending / approve / deny / destroy [--purge] /
                         request-spawn; periodic broker vacuum task
  src/server.rs         host admin socket (HostRequest → dispatch)
  src/client.rs         admin-socket client
  src/manager_server.rs manager-privileged socket (ManagerRequest)
  src/agent_server.rs   per-sub-agent socket listener (long-poll Recv)
  src/broker.rs         sqlite Message store + broadcast channel for SSE +
                         hourly vacuum of delivered>30d
  src/approvals.rs      sqlite Approval queue + kinds
  src/operator_questions.rs  sqlite question queue backing `ask_operator`
  src/coordinator.rs    shared state (broker/approvals/questions/transient/
                         sockets) + tombstone enumeration
  src/actions.rs        approve/deny/destroy (transient-aware)
  src/auto_update.rs    startup rebuild scan + ensure_manager
  src/lifecycle.rs      `nixos-container` shellouts, per-agent flake generator
  src/dashboard.rs      axum HTTP: static shell + /api/state JSON + actions
  assets/               index.html, dashboard.css, app.js (include_str!)

hive-ag3nt/        in-container harness crate; produces TWO binaries
  src/lib.rs            re-exports + DEFAULT_SOCKET, DEFAULT_WEB_PORT
  src/client.rs         generic JSON-line request/response over unix socket
  src/web_ui.rs         per-container axum HTTP page (incl /api/cancel,
                         /api/compact, /events/history)
  src/events.rs         LiveEvent + broadcast Bus + sqlite-backed history
                         (/state/hyperhive-events.sqlite) + hourly vacuum
  src/turn.rs           claude --print + stream-json pump; --compact retry
  src/mcp.rs            embedded MCP server (rmcp): AgentServer + ManagerServer
  src/login.rs          probe /root/.claude/ for a valid session
  src/login_session.rs  drives `claude auth login` over stdio pipes
  src/bin/hive-ag3nt.rs sub-agent main (Serve + Mcp subcommands)
  src/bin/hive-m1nd.rs  manager main (Serve + Mcp subcommands)
  assets/               index.html, agent.css, app.js (include_str!)
  prompts/              static role/tools/settings for claude (include_str!):
                          agent.md  — sub-agent system prompt
                          manager.md — manager system prompt
                          claude-settings.json — --settings JSON

hive-sh4re/        wire types (HostRequest/Response, AgentRequest/Response,
                   ManagerRequest/Response, Message, Approval, HelperEvent)

nix/
  modules/hive-c0re.nix         systemd service + firewall + git wiring
  templates/harness-base.nix    shared scaffolding for sub-agents + manager
  templates/agent-base.nix      sub-agent nixosConfiguration
  templates/manager.nix         manager nixosConfiguration

Conventions

Naming. Containers are length-bounded (nixos-container ≤ 11 chars). Sub-agents are h-<name> with <name> ≤ 9 chars; the manager is hm1nd. MAX_AGENT_NAME enforces the cap in lifecycle.rs. Per-agent web UI port = WEB_PORT_BASE + FNV1a(name) % WEB_PORT_RANGE (8100..8999); manager fixed at 8000; dashboard cfg.dashboardPort (default 7000).
Identity = socket. No auth/tokens on the per-agent sockets. The socket path identifies the principal; perms come from "who has the bind-mount."
Wire protocol. JSON line-delimited over unix sockets in both directions (host admin / manager / agent). /messages/stream is text/event-stream.
Commit messages. Short, lowercase, no Co-Authored-By trailer.
Commit before test. Stage and commit when work looks ready, then run validation (cargo check, nix flake check, real lpt2 deploy). Failures get a follow-up commit rather than an amend.
rebuild is the reconcile verb. Idempotently rewrites /etc/nixos-containers/<C>.conf (PRIVATE_NETWORK=0, clears HOST_ADDRESS/LOCAL_ADDRESS, sets EXTRA_NSPAWN_FLAGS), regenerates applied/<name>/flake.nix, writes the systemd limits drop-in, then nixos-container update + stop + start. Anything that changes per-container state on the host should be re-applied here.
Actions are factored. approve / deny / destroy live in actions.rs; the admin socket and the dashboard POST handlers both call into them so the two surfaces never drift.
Async forms. Dashboard + per-agent mutating forms carry data-async; a delegated submit listener in assets/app.js intercepts, shows a spinner, POSTs with application/x-www-form-urlencoded (axum Form extractor rejects multipart), calls refreshState() on success. New mutating forms should add data-async and optionally data-confirm.

Gotchas / lessons learned

nixos-container doesn't expose --bind on the CLI. Path is via EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf — the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation. We rewrite this line in set_nspawn_flags().
/run/systemd/nspawn/*.nspawn overrides are ignored by nixos-container's start script (it builds the nspawn cmd line directly).
boot.isNspawnContainer = true, not boot.isContainer = true. Renamed in nixos-25.11+.
nixos-container create auto-assigns HOST_ADDRESS/LOCAL_ADDRESS in the .conf. The start script's if HOST_ADDRESS set → --network-veth branch then forces a private netns — which is silently fatal for our web UIs (the bind is invisible from the host). We force-clear those vars (and HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE) plus set PRIVATE_NETWORK=0.
systemd service PATH ≠ host PATH. The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ]. In-container harness services do the same so anything an agent adds to its own agent.nix (environment.systemPackages) is visible to claude's Bash tool without editing the service definition. environment.HYPERHIVE_GIT bakes git's absolute path in (read by lifecycle::git_command()) for the host.
RuntimeDirectoryPreserve = "yes" keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.
register_agent is idempotent — drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.
claude-code is unfree. harness-base.nix allow-list's it specifically. The flake pins it to nixpkgs-unstable via overlays.claude-unstable (stable lags too far). The overlay imports unstable with its own allowUnfreePredicate so the access inside the overlay doesn't itself trip.
Claude credentials are per-agent. /var/lib/hyperhive/agents/<name>/claude/ bind-mounts to /root/.claude (RW). Sharing one dir across agents is NOT viable — OAuth refresh tokens rotate, so any sibling refresh invalidates all the others. Login flow runs from the per-agent web UI; creds persist across destroy/recreate.
Persistent notes dir per agent. /var/lib/hyperhive/agents/<name>/state/ bind-mounts to /state (RW). System prompts tell agents to keep durable knowledge here (/state/notes.md, anything else under /state/). Survives destroy/recreate alongside the claude dir.
Orphan approvals. If state dirs are wiped out from under a pending approval (test scripts, manual rm -rf), the dashboard's next render marks them failed with note "agent state dir missing" so they fall out of pending. They stay in sqlite for audit.

Web UI shape

Both the dashboard (port 7000) and the per-agent web UIs (8000 / 8100-8999) are SPAs with the same skeleton:

GET / → static assets/index.html (placeholders for state-driven sections).
GET /static/*.css + GET /static/*.js → static assets shipped via include_str! so there's no runtime file dependency.
GET /api/state → JSON snapshot the JS app renders into the DOM.
GET /events/stream (per-agent) and GET /messages/stream (dashboard) are text/event-stream SSE for live updates.

Per-agent endpoints: POST /send, POST /login/{start,code,cancel}, POST /api/cancel, POST /api/compact, GET /events/history.

Dashboard endpoints: POST /{approve,deny}/{id}, POST /{rebuild,kill,restart,start,destroy}/{name}, POST /purge-tombstone/{name}, POST /answer-question/{id}, POST /request-spawn, POST /update-all.

The JS app handles all form[data-async] submissions via a delegated listener: read data-confirm, swap the button to a spinner, POST application/x-www-form-urlencoded (axum's Form extractor rejects multipart), then on success re-enable the button (refreshState often keeps the form mounted) and call refreshState() (re-fetch /api/state and re-render). No full-page reloads.

Per-agent + dashboard state shapes live in dashboard.rs::StateSnapshot and web_ui.rs::StateSnapshot. When adding new state fields, plumb through the snapshot struct and the relevant assets/app.js render function — never reach for server-side HTML rendering again.

Agent MCP surface + turn loop

The harness ships an embedded MCP server (rmcp 1.7) that claude launches as a stdio child via --mcp-config. Subcommand: hive-ag3nt mcp (or hive-m1nd mcp for the manager surface).

Sub-agent tools:

mcp__hyperhive__send(to, body) — message a peer or the operator.
mcp__hyperhive__recv() — drain one inbox message.

Manager additionally:

mcp__hyperhive__request_spawn(name) — queue Spawn approval.
mcp__hyperhive__kill(name) — graceful stop. No approval.
mcp__hyperhive__start(name) — start a stopped sub-agent. No approval.
mcp__hyperhive__restart(name) — stop + start. No approval.
mcp__hyperhive__request_apply_commit(agent, commit_ref) — submit a config change for any agent (including hm1nd for self-mods).
mcp__hyperhive__ask_operator(question, options?) — non-blocking; queues a question on the dashboard, returns the question id. Operator's answer arrives later as a HelperEvent::OperatorAnswered in the manager inbox.

The shared per-turn plumbing lives in hive_ag3nt::turn::{write_mcp_config, write_settings, write_system_prompt, run_turn, drive_turn, emit_turn_end, wait_for_login} so the two binaries can't drift.

On harness boot, three files get dropped next to the mcp socket at /run/hive/:

claude-mcp-config.json — re-invokes the running binary as mcp child.
claude-settings.json — --settings blob (auto-compact/auto-memory off, effortLevel medium).
claude-system-prompt.md — rendered from prompts/{agent,manager}.md with {label} substituted. Passed via --system-prompt-file.

Each turn:

claude --print --verbose --output-format stream-json --model haiku \
  --continue --settings /run/hive/claude-settings.json \
  --system-prompt-file /run/hive/claude-system-prompt.md \
  --mcp-config /run/hive/claude-mcp-config.json --strict-mcp-config \
  --tools <builtins> --allowedTools <builtins+mcp>
# wake prompt piped over stdin — minimal, just from/body + optional unread hint

--continue keeps a persistent session per agent (claude stores sessions in ~/.claude/projects/, which is bind-mounted persistently). Auto-compact and auto-memory are disabled because hyperhive owns compaction (/compact on overflow, retry once).

Loop control. The harness pops one inbox message per cycle (the wake signal — Recv long-polls server-side for up to 30s waking instantly on a new broker Sent event for this agent), peeks the remaining inbox depth with Status, and emits TurnStart { from, body, unread }. The wake prompt piped to claude includes a one-line ({unread} more pending — drain via …) hint when unread > 0. Claude drives any further recv/send itself.

Tool envelope (mcp::run_tool_envelope): every MCP tool handler logs the request, runs the body, logs the result. Pre-/post-log only; the old [status] N unread message(s) appendage was removed once unread moved into the wake prompt + UI header. New tools call this helper.

Tool whitelist (mcp::ALLOWED_BUILTIN_TOOLS):

Allowed built-ins: Bash, Edit, Glob, Grep, Read, TodoWrite, Write.
Denied by omission: WebFetch, WebSearch, Task, NotebookEdit.
Allowed MCP tools: as listed above per flavor.

Bash is on the allow-list pending a finer-grained pattern allow-list (Bash(git *)-style) — see TODO.md.

Live view. Each agent runs an events::Bus (broadcast channel + sqlite-backed history at /state/hyperhive-events.sqlite). The harness emits TurnStart { from, body, unread }, Stream(value) (one per parsed stream-json line), Note, TurnEnd { ok, note }. The web UI:

fetches GET /events/history on page load and replays the last 2000 events (oldest first, .no-anim so they don't stagger),
then subscribes to GET /events/stream (SSE) for live tail,
shows a granular state badge above the terminal (💤 idle / 🧠 thinking / ○ offline · <age>) driven from turn_start/turn_end, with a flash animation on transition,
sticky-bottom auto-scroll: scrolling up parks the view; new rows surface a "↓ N new" pill instead of yanking. Scrolling back to bottom clears the counter,
terminal-themed: phosphor mauve glow, Crust bg, backdrop-filter blur, row fade-in slide-up, banner gradient shimmer while state=thinking.

Per-tool rendering:

TurnStart → ◆ TURN ← <from> · N unread header + indented body.
Stream tool_use → → Read /path / → Bash $ cmd / → send → operator: "..." etc., per-tool pretty rather than raw JSON.
Stream tool_result short → flat ← ...; long → collapsed <details> ▸ ← Nl · headline (click to expand full body).
Stream thinking → shows the thinking text if claude provided one, otherwise the bare · thinking … indicator.
Stream system init, result, rate_limit_event are dropped — too noisy and TurnEnd already says the turn finished.
Note → · text.
TurnEnd → ✓ turn ok / ✗ turn fail — note and triggers a refreshState() so the page form view reflects state transitions (e.g. login just landed).

The operator input lives inside the terminal-wrap as a prompt-style textarea below the live tail: multi-line (Enter sends, Shift+Enter newlines), tab-completes slash commands. Available slash commands:

/help — list commands locally
/clear — wipe the visible terminal (server history kept)
/cancel — POST /api/cancel (host shellouts pkill -INT claude, emits a Note); also surfaces as a ■ cancel turn button in the state row while state=thinking
/compact — POST /api/compact (host spawns turn::compact_session in the background; output streams into the live panel)

Unknown /foo shows an error row instead of being silently sent.

Manager (hm1nd) is hive-c0re-managed

The manager container runs through the same lifecycle as sub-agents. On hive-c0re serve startup, if hm1nd is missing, hive-c0re creates it. The manager's flake lives at /var/lib/hyperhive/applied/hm1nd/; its proposed config at /var/lib/hyperhive/agents/hm1nd/config/. Manager can edit its own agent.nix (visible inside the container at /agents/hm1nd/config/) and submit request-apply-commit hm1nd <sha> for operator approval.

Differences from sub-agents:

flake.nix extends hyperhive.nixosConfigurations.manager (vs agent-base).
Container name is hm1nd (no h- prefix).
Fixed web UI port (MANAGER_PORT = 8000).
set_nspawn_flags adds an extra bind: /var/lib/hyperhive/agents → /agents (RW), so the manager can edit per-agent proposed repos.
First-deploy spawn bypasses the approval queue (manager is required infrastructure).
Per-agent socket lives at /run/hyperhive/manager/, owned by manager_server::start.

Migration note (for older hosts): drop any containers.hm1nd = { ... } block from your host NixOS config. hyperhive creates and updates the manager itself now.

Manager policy (from prompts/manager.md): the manager does NOT rubber-stamp sub-agent config requests. It verifies (role match, package legitimacy, cheaper alternative, blast radius) before committing + calling request_apply_commit. For ambiguous cases or anything that needs human signal, the manager calls ask_operator(question, options?) which queues the question on the dashboard and returns the id immediately; the operator's answer arrives later as HelperEvent::OperatorAnswered in the manager inbox. Store at hive-c0re::operator_questions (sqlite); answer flow: POST /answer-question/{id} → OperatorQuestions::answer → notify_manager(OperatorAnswered { ... }).

Helper events to the manager

Coordinator::notify_manager(&HelperEvent) enqueues an inbox message from sender system with the event JSON in the body. The manager harness no longer short-circuits these — they drive a regular claude turn so the manager can react. Variants (hive_sh4re::HelperEvent):

ApprovalResolved { id, agent, commit_ref, status, note } — fired by actions::approve + actions::deny whenever an approval transitions to its terminal state.
Spawned { agent, ok, note } — actions::approve (Spawn-kind) + admin HostRequest::Spawn.
Rebuilt { agent, ok, note } — auto_update::rebuild_agent (covers startup scan + manual /rebuild from dashboard) + actions::approve (ApplyCommit).
Killed { agent } — admin HostRequest::Kill + dashboard /kill.
Destroyed { agent } — actions::destroy.
OperatorAnswered { id, question, answer } — dashboard::post_answer_question fires this after the operator submits the answer form for a question the manager queued via ask_operator.

To add a new event: new HelperEvent variant + call sites + update prompts/manager.md so the manager knows the new shape.

Auto-update on startup

hive-c0re serve runs auto_update::run in a background task right after opening the coordinator. It enumerates managed containers and rebuilds any whose recorded hyperhive rev differs from the current one — sub-agents and manager go through the same lifecycle::rebuild path. "Rev" = canonical filesystem path of cfg.hyperhiveFlake. Marker file: /var/lib/hyperhive/applied/.<name>.hyperhive-rev.

If the flake input has no canonical path (e.g. a github: URL), auto-update is a no-op — rebuild manually.

The dashboard surfaces pending updates per agent: a clickable "needs update ↻" badge appears whenever the marker differs from current rev. The badge POSTs /rebuild/<name>, calling the same auto_update::rebuild_agent path so manual triggers and the startup scan can't drift. When at least one container is stale, a top-level ↻ UPD4TE 4LL button appears that loops over every stale container.

Dashboard action surface

Page sections (top to bottom):

C0NTAINERS — live containers with their action surface (below).
K3PT ST4T3 — destroyed-but-state-kept tombstones (size + age + claude-creds badge). Two actions: ⊕ R3V1V3 (queues a Spawn approval; existing state is reused), PURG3 (wipes state + applied dirs; POST /purge-tombstone/{name}).
M1ND H4S QU3STI0NS — pending ask_operator questions (amber pulsing border). Always renders a free-text fallback alongside any option list; multi=true renders options as checkboxes; submit merges selections + free text comma-joined.
0PER4T0R 1NB0X — recent messages addressed to operator (last 50, from the broker).
P3NDING APPR0VALS — the queue. The R3QU3ST SP4WN form lives at the top of this section since submitting it immediately queues an approval that lands directly below.
MESS4GE FL0W — live broker SSE tail.

Container row (two-line layout, assets/app.js::renderContainers):

Line 1: agent name (link → new tab), m1nd/ag3nt chip, needs login / needs update warning badges, in-flight ◐ pending-state… pill (replaces buttons during start/stop/restart/rebuild/destroy), container name + port.
Line 2: action buttons — ↻ R3BU1LD always, DESTR0Y + PURG3 on sub-agents, ↺ R3ST4RT + (sub-agents) ■ ST0P when running, ▶ ST4RT when stopped. Buttons dim + disable while a transient lifecycle action is in flight.

↻ UPD4TE 4LL button appears above the containers list when any agent is stale.

Banner pulses on each broker SSE event (pulseBanner with a 4s grace timer).

Approval flow

End-to-end: manager edits per-agent proposed repo → commits → submits commit sha → user approves on host CLI or dashboard button → hive-c0re reads the file at that sha from proposed, applies into applied, commits there, runs nixos-container update. Helper-event JSON (ApprovalResolved) lands in the manager's inbox.

Two separate git repos per agent:

/var/lib/hyperhive/agents/<name>/config/    # proposed — manager edits, hive-c0re reads only
└── agent.nix                               # the only file the manager can change
                                            # (initial commit by hive-c0re on first spawn,
                                            # never touched by hive-c0re again)

/var/lib/hyperhive/applied/<name>/          # applied — hive-c0re-only; container builds here
├── flake.nix                               # auto-generated; references hyperhive_flake
└── agent.nix                               # overwritten by approve from the proposed commit

The container's --flake ref is <applied_dir>#default. The flake extends hyperhive.nixosConfigurations.{agent-base|manager} with ./agent.nix plus an inline module setting programs.git.config.user (committer identity = the agent's name) and systemd.services.<harness>.environment (HIVE_PORT, HIVE_LABEL, HIVE_DASHBOARD_PORT).

Persistence + retention

Two sqlite files; both autovacuum on a 1h tokio task:

/var/lib/hyperhive/broker.sqlite (host) — messages + approvals + operator_questions tables. Broker::vacuum_delivered drops delivered messages older than 30 days; undelivered rows are always kept. Approvals + questions are kept indefinitely (auditable).
/state/hyperhive-events.sqlite (per-container, bind-mounted from /var/lib/hyperhive/agents/<name>/state/) — every LiveEvent emitted on the per-agent Bus. Hourly vacuum drops rows older than 7 days, then trims to the most recent 2000. Path overridable via HYPERHIVE_EVENTS_DB (for dev / no-/state setups; on open failure the Bus falls back to no-store mode rather than crashing the harness). Survives destroy/recreate; gone on --purge.

State dirs (per agent, under /var/lib/hyperhive/agents/<name>/): config/ (proposed nix repo), claude/ (creds, bind-mounted RW to /root/.claude), state/ (durable notes + events db, bind-mounted to /state). Wiped only on explicit --purge. Tombstones (state dir without a live container) surface in the dashboard's K3PT ST4T3 section so the operator can either revive or purge.

23 KiB Raw Blame History