müde 62d1a74929 docs sync + revert auto-unfree removal

revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.

2026-05-15 21:26:13 +02:00

Gotchas

NixOS + nspawn quirks and lessons we hit the hard way. If something here looks unmotivated in the code, there's usually a story underneath.

`nixos-container` doesn't expose `--bind` on the CLI

The CLI doesn't accept --bind. The workaround is EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf: the start script (/nix/store/.../container_-start) expands it unquoted into the systemd-nspawn invocation, so each whitespace-separated word becomes its own argument. lifecycle::set_nspawn_flags() rewrites this line.
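
A minimal sketch of that line rewrite, written as a pure function over the file contents. The function name, the double-quoted `KEY="value"` layout, and the append-if-missing behaviour are assumptions for illustration; the real lifecycle::set_nspawn_flags() may look different.

```rust
/// Hypothetical helper (not the real lifecycle::set_nspawn_flags()):
/// return `conf` with its EXTRA_NSPAWN_FLAGS line replaced by `flags`,
/// appending the line if the file has none.
/// Assumes the .conf is a list of KEY=value assignments, one per line.
fn with_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    const KEY: &str = "EXTRA_NSPAWN_FLAGS=";
    // The start script expands the value unquoted, so the flags are joined
    // with single spaces and must not contain whitespace themselves.
    let line = format!("{}\"{}\"", KEY, flags.join(" "));

    let mut out: Vec<String> = Vec::new();
    let mut replaced = false;
    for l in conf.lines() {
        if l.trim_start().starts_with(KEY) {
            // Keep exactly one assignment; later duplicates are dropped.
            if !replaced {
                out.push(line.clone());
                replaced = true;
            }
        } else {
            out.push(l.to_string());
        }
    }
    if !replaced {
        out.push(line);
    }

    let mut s = out.join("\n");
    s.push('\n');
    s
}
```

Keeping the rewrite a pure string transformation makes it easy to test without touching /etc; the caller would read the file, apply the function, and write the result back.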

`/run/systemd/nspawn/*.nspawn` overrides are ignored

nixos-container's start script builds the nspawn command line directly. Dropping a .nspawn file under /run/systemd/nspawn/ looks like the obvious extension point and does nothing. Use EXTRA_NSPAWN_FLAGS (above).

`boot.isNspawnContainer = true`

Not boot.isContainer = true. Renamed in nixos-25.11+.

`nixos-container create` auto-assigns `HOST_ADDRESS` / `LOCAL_ADDRESS`

…in the .conf. The start script's "if HOST_ADDRESS is set, pass --network-veth" branch then forces a private netns, which is silently fatal for our web UIs (the bind is invisible from the host). We force-clear HOST_ADDRESS / LOCAL_ADDRESS / HOST_ADDRESS6 / LOCAL_ADDRESS6 / HOST_BRIDGE and set PRIVATE_NETWORK=0.
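
The force-clear step could look roughly like the sketch below. The helper name is hypothetical, and two details are assumptions: that "clearing" means keeping the key with an empty value, and that PRIVATE_NETWORK is re-emitted once at the end of the file.

```rust
/// Keys that steer the start script towards a private network namespace.
const NETWORK_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: blank every network key and pin PRIVATE_NETWORK=0.
/// Assumes the .conf is a list of KEY=value assignments, one per line.
fn force_host_network(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for line in conf.lines() {
        // Everything before the first '=' is the key.
        let key = line.split('=').next().unwrap_or("").trim();
        if NETWORK_KEYS.contains(&key) {
            // Keep the key but clear its value (assumed meaning of "clear").
            out.push(format!("{}=", key));
        } else if key == "PRIVATE_NETWORK" {
            // Dropped here; a single PRIVATE_NETWORK=0 is appended below.
        } else {
            out.push(line.to_string());
        }
    }
    out.push("PRIVATE_NETWORK=0".to_string());
    out.join("\n") + "\n"
}
```

Matching on the exact key (not a prefix) keeps HOST_ADDRESS and HOST_ADDRESS6 from shadowing each other.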

systemd service PATH ≠ host PATH

The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ]. In-container harness services do the same, so anything an agent adds to its own agent.nix (environment.systemPackages) is visible to claude's Bash tool without editing the service definition. On the host, environment.HYPERHIVE_GIT bakes in git's absolute path (read by lifecycle::git_command()).
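
As an illustration of the HYPERHIVE_GIT lookup, a resolver along these lines would work. This is a sketch, not the actual lifecycle::git_command(): the helper git_program and the fallback to a plain PATH lookup of "git" are assumptions.

```rust
use std::process::Command;

/// Hypothetical resolver: prefer the absolute path baked into HYPERHIVE_GIT,
/// fall back to a PATH lookup of "git" when the variable is unset or blank.
/// (The fallback is an assumption; the real code may fail hard instead.)
fn git_program(env_value: Option<&str>) -> String {
    match env_value {
        Some(p) if !p.trim().is_empty() => p.to_string(),
        _ => "git".to_string(),
    }
}

/// Build a Command for git using the environment of the current process.
fn git_command() -> Command {
    let from_env = std::env::var("HYPERHIVE_GIT").ok();
    Command::new(git_program(from_env.as_deref()))
}
```

Splitting the decision (git_program) from the environment read (git_command) keeps the logic testable without mutating process-wide environment variables.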

`RuntimeDirectoryPreserve = "yes"`

…keeps /run/hyperhive/ (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.

`register_agent` is idempotent

Drops any prior socket task before rebinding. Required so a hive-c0re restart followed by rebuild alice recreates the agent's socket without needing a clean reinstall.
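
The drop-then-rebind pattern can be sketched without an async runtime. Here a shared stop flag stands in for the socket task handle; Registry, register, and the flag-based cancellation are all hypothetical names and choices, not hive-c0re's actual types.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical registry: one stop flag per agent stands in for the
/// per-agent socket task.
struct Registry {
    tasks: HashMap<String, Arc<AtomicBool>>,
}

impl Registry {
    fn new() -> Self {
        Registry { tasks: HashMap::new() }
    }

    /// Idempotent registration: any prior task for `name` is told to stop
    /// before the replacement is recorded, so calling this twice leaves
    /// exactly one live task.
    fn register(&mut self, name: &str) -> Arc<AtomicBool> {
        if let Some(old) = self.tasks.remove(name) {
            old.store(true, Ordering::SeqCst); // signal the old task to exit
        }
        let stop = Arc::new(AtomicBool::new(false));
        self.tasks.insert(name.to_string(), Arc::clone(&stop));
        stop
    }

    fn len(&self) -> usize {
        self.tasks.len()
    }
}
```

With an async runtime the flag would typically be replaced by aborting the stored task handle; the ordering (stop old, then insert new) is the part that matters.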

`claude-code` is unfree

The flake pins it to nixpkgs-unstable via overlays.claude-unstable (stable lags too far). The overlay sets config.allowUnfreePredicate on its unstable import to whitelist claude-code specifically — scoped, only this one package. harness-base.nix does the same at the container level because each per-agent nixosConfiguration evaluates its own nixpkgs instance and the operator's host-level allowUnfree does not propagate in. Operators don't need to set anything on their side.

Claude credentials are per-agent

/var/lib/hyperhive/agents/<name>/claude/ bind-mounts to /root/.claude (RW). Sharing one dir across agents is NOT viable — OAuth refresh tokens rotate, so any sibling refresh invalidates all the others. Login flow runs from the per-agent web UI; creds persist across destroy/recreate (--purge wipes them).

Persistent notes dir per agent

/var/lib/hyperhive/agents/<name>/state/ bind-mounts to /state (RW). System prompts tell agents to keep durable knowledge here (/state/notes.md, anything else under /state/). The harness also writes its events log here (/state/hyperhive-events.sqlite). Survives destroy/recreate alongside the claude dir.

Web UI ports collide on hash

Sub-agent web UI ports are deterministic: 8100 plus the FNV-1a hash of the agent name modulo 900 (range 8100..8999). With ~30 agents the birthday-paradox collision probability is already roughly 39% (assuming uniformly distributed hashes); at 2–3 agents you can still get unlucky. The operator resolves a collision by renaming the offending agent (different hash → different port) and rebuilding. No state file, no probing, no port-allocation drift: the value is reproducible from just the name. Manager is fixed at 8000; dashboard at cfg.dashboardPort (default 7000).
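
A sketch of the port derivation and the collision odds. The hash width is not stated here, so the 64-bit FNV-1a variant is an assumption, as are the function names; the 8100 base and modulo 900 come from the range above.

```rust
/// 64-bit FNV-1a. Which width hyperhive uses is not stated in this file;
/// 64 bits is an assumption made for this sketch.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

/// Map an agent name onto the 900-port window 8100..=8999.
fn agent_port(name: &str) -> u16 {
    8100 + (fnv1a_64(name.as_bytes()) % 900) as u16
}

/// Probability that at least two of `agents` names share a port, assuming
/// hashes are uniform over `slots` ports (the birthday problem).
fn collision_probability(agents: u32, slots: u32) -> f64 {
    let mut all_distinct = 1.0_f64;
    for k in 0..agents {
        all_distinct *= 1.0 - f64::from(k) / f64::from(slots);
    }
    1.0 - all_distinct
}
```

Under the uniformity assumption, collision_probability(3, 900) is about 0.003 and collision_probability(30, 900) is about 0.39, which is why renaming is an acceptable fix at small scale but collisions become routine as the hive grows.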

Restart races on TCP bind

Both the dashboard and per-agent web UI use tokio::net::TcpSocket with SO_REUSEADDR plus a retry-on-AddrInUse loop (12 tries, exponential backoff capped at 2s, ~22s total). REUSEADDR handles the TIME_WAIT case from a clean previous exit; retry covers the genuine "previous process is still alive during a systemd restart overlap" case. REUSEADDR does not allow two simultaneous LISTEN sockets on the same port (that would be SO_REUSEPORT, which we don't use) — exclusivity is preserved.
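
The retry shape can be sketched with the standard library alone. This swaps tokio::net::TcpSocket for the blocking std::net::TcpListener to stay dependency-free (on Unix, std's bind sets SO_REUSEADDR itself); the function names and the initial delay are assumptions, since only the try count and the 2s cap are given above.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): doubles each time and is
/// clamped to `cap`.
fn backoff(attempt: u32, initial: Duration, cap: Duration) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    initial.saturating_mul(factor).min(cap)
}

/// Try to bind up to `tries` times, sleeping between attempts only when the
/// failure is AddrInUse. Any other error is returned immediately.
fn bind_with_retry(
    addr: SocketAddr,
    tries: u32,
    initial: Duration,
    cap: Duration,
) -> io::Result<TcpListener> {
    let mut attempt = 0;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff(attempt, initial, cap));
                attempt += 1;
            }
            // Out of tries, or an error that retrying will not fix.
            Err(e) => return Err(e),
        }
    }
}
```

A call such as bind_with_retry(addr, 12, initial, Duration::from_secs(2)) mirrors the policy described above; a second bind on a port that is still held keeps failing with AddrInUse, which is the exclusivity property the text relies on.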

Orphan approvals

If state dirs are wiped out from under pending approvals (test scripts, manual rm -rf), the dashboard's next render marks those approvals failed with the note "agent state dir missing" so they fall out of pending. They stay in sqlite for audit.
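
The sweep could be sketched as below, using in-memory records in place of the sqlite rows. The Approval and Status types and the function name are hypothetical; the <root>/<name>/state layout follows the persistent notes section above.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Status {
    Pending,
    Failed { note: String },
}

/// Hypothetical in-memory stand-in for an approval row.
struct Approval {
    agent: String,
    status: Status,
}

/// Mark every pending approval whose agent state dir is gone as failed.
/// Records are kept (not deleted) so the audit trail survives.
/// Returns how many approvals were flipped.
fn fail_orphans(approvals: &mut [Approval], agents_root: &Path) -> usize {
    let mut failed = 0;
    for a in approvals.iter_mut() {
        let state_dir = agents_root.join(&a.agent).join("state");
        if a.status == Status::Pending && !state_dir.is_dir() {
            a.status = Status::Failed {
                note: "agent state dir missing".to_string(),
            };
            failed += 1;
        }
    }
    failed
}
```

Running the check at render time, as described above, means no background reaper is needed: stale rows are corrected the next time anyone looks at them.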
