hyperhive/docs/gotchas.md
müde 62d1a74929 docs sync + revert auto-unfree removal
revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
2026-05-15 21:26:13 +02:00


Gotchas

NixOS + nspawn quirks and lessons we hit the hard way. If something here looks unmotivated in the code, there's usually a story underneath.

nixos-container doesn't expose --bind on the CLI

The CLI doesn't accept --bind. The working path is EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf: the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation. lifecycle::set_nspawn_flags() rewrites this line.
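A minimal sketch of the kind of line rewrite described above, assuming the .conf is plain KEY=value lines. The helper name `rewrite_extra_nspawn_flags`, its signature, and the double-quoted value format are hypothetical; this is not the actual lifecycle::set_nspawn_flags() code.

```rust
/// Hypothetical helper: return `conf` with its EXTRA_NSPAWN_FLAGS line
/// replaced, or appended if the key is absent.
fn rewrite_extra_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    // Assumption: the value is written double-quoted on a single line.
    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags.join(" "));
    let mut out: Vec<String> = Vec::new();
    let mut replaced = false;
    for line in conf.lines() {
        if line.starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Keep a single occurrence so repeated rewrites are idempotent.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }
    out.join("\n") + "\n"
}
```

Because the start script expands the value unquoted, each flag is split on whitespace, so a bind path containing a space would not survive this route.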

/run/systemd/nspawn/*.nspawn overrides are ignored

nixos-container's start script builds the nspawn command line directly. Dropping a .nspawn file under /run/systemd/nspawn/ looks like the obvious extension point and does nothing. Use EXTRA_NSPAWN_FLAGS (above).

boot.isNspawnContainer = true

Not boot.isContainer = true. Renamed in nixos-25.11+.

nixos-container create auto-assigns HOST_ADDRESS / LOCAL_ADDRESS

…in the .conf. The start script's if HOST_ADDRESS set → --network-veth branch then forces a private netns — silently fatal for our web UIs (the bind is invisible from the host). We force-clear HOST_ADDRESS / LOCAL_ADDRESS / HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE and set PRIVATE_NETWORK=0.
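The clearing step can be sketched as a pure text transform over the .conf. The helper name `force_host_network` is hypothetical, and "clear" is assumed here to mean writing an empty value rather than deleting the line.

```rust
/// Keys the paragraph above says are force-cleared.
const NETWORK_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: blank the address/bridge keys and pin PRIVATE_NETWORK=0.
fn force_host_network(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut saw_private_network = false;
    for line in conf.lines() {
        // Everything before the first '=' is the key.
        let key = line.split('=').next().unwrap_or("");
        if NETWORK_KEYS.contains(&key) {
            // Assumption: an empty value counts as "not set" for the start script.
            out.push(format!("{key}="));
        } else if key == "PRIVATE_NETWORK" {
            out.push("PRIVATE_NETWORK=0".to_string());
            saw_private_network = true;
        } else {
            out.push(line.to_string());
        }
    }
    if !saw_private_network {
        out.push("PRIVATE_NETWORK=0".to_string());
    }
    out.join("\n") + "\n"
}
```

Running the transform on its own output changes nothing, so it is safe to apply after every nixos-container create or update.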

systemd service PATH ≠ host PATH

The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ]. In-container harness services do the same, so anything an agent adds to its own agent.nix (environment.systemPackages) is visible to claude's Bash tool without editing the service definition. On the host, environment.HYPERHIVE_GIT bakes in git's absolute path (read by lifecycle::git_command()).

RuntimeDirectoryPreserve = "yes"

…keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.

register_agent is idempotent

Drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.

claude-code is unfree

The flake pins it to nixpkgs-unstable via overlays.claude-unstable (stable lags too far). The overlay sets config.allowUnfreePredicate on its unstable import to whitelist claude-code specifically — scoped, only this one package. harness-base.nix does the same at the container level because each per-agent nixosConfiguration evaluates its own nixpkgs instance and the operator's host-level allowUnfree does not propagate in. Operators don't need to set anything on their side.

Claude credentials are per-agent

/var/lib/hyperhive/agents/<name>/claude/ bind-mounts to /root/.claude (RW). Sharing one dir across agents is NOT viable — OAuth refresh tokens rotate, so any sibling refresh invalidates all the others. Login flow runs from the per-agent web UI; creds persist across destroy/recreate (--purge wipes them).

Persistent notes dir per agent

/var/lib/hyperhive/agents/<name>/state/ bind-mounts to /state (RW). System prompts tell agents to keep durable knowledge here (/state/notes.md, anything else under /state/). The harness also writes its events log here (/state/hyperhive-events.sqlite). Survives destroy/recreate alongside the claude dir.
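As an illustration of how the two per-agent mounts could be expressed as systemd-nspawn flags (which would then go through EXTRA_NSPAWN_FLAGS), here is a small sketch. The helper name `agent_bind_flags` is hypothetical; only the host and container paths come from the text above.

```rust
/// Hypothetical helper: build read-write --bind flags for one agent's
/// credentials dir and notes/state dir.
fn agent_bind_flags(name: &str) -> Vec<String> {
    let base = format!("/var/lib/hyperhive/agents/{name}");
    vec![
        // systemd-nspawn's --bind=SRC:DEST is read-write (--bind-ro is the RO form).
        format!("--bind={base}/claude:/root/.claude"),
        format!("--bind={base}/state:/state"),
    ]
}
```

Each agent gets its own source directories, which is what keeps one agent's OAuth token refresh from invalidating another's.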

Web UI ports collide on hash

Sub-agent web UI ports are a deterministic FNV-1a hash of the agent name modulo 900, offset into the range 8100..8999. With only 900 slots the birthday paradox bites early: assuming the hash spreads names uniformly, the chance that some pair of agents collides is roughly 25% at 23 agents and close to 40% at 30. The operator resolves a collision by renaming the offending agent (different hash → different port) and rebuilding. No state file, no probing, no port-allocation drift: the value is reproducible from just the name. The manager is fixed at 8000; the dashboard is at cfg.dashboardPort (default 7000).
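A minimal sketch of the port derivation and of the collision odds, under two stated assumptions: the 64-bit FNV-1a variant is used (the text does not say which width hyperhive uses, so real port numbers may differ), and the hash is uniform over the 900 slots. The function names are hypothetical.

```rust
/// 64-bit FNV-1a. Assumption: the variant width is not stated in this document.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        // FNV-1a: xor the byte in first, then multiply by the prime.
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Map an agent name onto 8100..=8999 (900 slots).
fn agent_port(name: &str) -> u16 {
    8100 + (fnv1a_64(name.as_bytes()) % 900) as u16
}

/// Probability that at least two of `agents` names share a slot,
/// assuming the hash is uniform over `slots`.
fn collision_probability(agents: u32, slots: u32) -> f64 {
    let mut p_all_distinct = 1.0_f64;
    for i in 0..agents {
        // The i-th agent must avoid the i slots already taken.
        p_all_distinct *= 1.0 - f64::from(i) / f64::from(slots);
    }
    1.0 - p_all_distinct
}
```

Because the port depends only on the name, a tool can check a proposed agent name against the existing ones before anything is created.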

Restart races on TCP bind

Both the dashboard and per-agent web UI use tokio::net::TcpSocket with SO_REUSEADDR plus a retry-on-AddrInUse loop (12 tries, exponential backoff capped at 2s, ~22s total). REUSEADDR handles the TIME_WAIT case from a clean previous exit; retry covers the genuine "previous process is still alive during a systemd restart overlap" case. REUSEADDR does not allow two simultaneous LISTEN sockets on the same port (that would be SO_REUSEPORT, which we don't use) — exclusivity is preserved.
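The retry half of this can be sketched with the standard library alone. This is a synchronous stand-in, not the tokio-based code: the bind is passed in as a closure, and the helper names and the base delay are hypothetical (the text gives only the try count, the 2s cap, and the approximate total).

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Run `op` up to `tries` times, sleeping between attempts, but only while
/// it fails with AddrInUse. Any other error is returned immediately.
fn retry_on_addr_in_use<T>(
    tries: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt = 0;
    loop {
        match op() {
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            // Success, a non-retryable error, or the final AddrInUse.
            other => return other,
        }
    }
}
```

A caller would pass something like `|| std::net::TcpListener::bind(("127.0.0.1", port))` as `op`, with 12 tries and a 2s cap. Retrying only on AddrInUse keeps permission errors and bad addresses failing fast instead of waiting out the whole backoff schedule.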

Orphan approvals

If state dirs are wiped out from under pending approvals (test scripts, manual rm -rf), the dashboard's next render marks those approvals failed with the note "agent state dir missing" so they fall out of the pending list. They stay in sqlite for audit.
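The sweep can be sketched as follows. The `Approval` and `Status` types, the `fail_orphans` name, and the one-directory-per-agent layout under `agents_root` are illustrative assumptions; the real sqlite schema and render path are not shown in this document.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Status {
    Pending,
    Failed { note: String },
}

/// Hypothetical in-memory view of an approval row.
struct Approval {
    agent: String,
    status: Status,
}

/// Mark pending approvals failed when their agent's state dir is gone.
/// Rows are updated in place, never removed, so they remain for audit.
fn fail_orphans(approvals: &mut [Approval], agents_root: &Path) {
    for a in approvals.iter_mut() {
        let dir = agents_root.join(&a.agent);
        if a.status == Status::Pending && !dir.is_dir() {
            a.status = Status::Failed {
                note: "agent state dir missing".to_string(),
            };
        }
    }
}
```

Doing the check at render time means no background sweeper is needed: the next dashboard load is enough to clear stale pending entries.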