# Gotchas

NixOS + nspawn quirks and lessons we hit the hard way. If something here looks unmotivated in the code, there's usually a story underneath.

## `nixos-container` doesn't expose `--bind` on the CLI

The CLI doesn't accept `--bind`. Path is via `EXTRA_NSPAWN_FLAGS` in `/etc/nixos-containers/.conf`: the start script (`/nix/store/.../container_-start`) expands it unquoted into the `systemd-nspawn` invocation. `lifecycle::set_nspawn_flags()` rewrites this line.

## `/run/systemd/nspawn/*.nspawn` overrides are ignored

`nixos-container`'s start script builds the nspawn command line directly. Dropping a `.nspawn` file under `/run/systemd/nspawn/` looks like the obvious extension point and does nothing. Use `EXTRA_NSPAWN_FLAGS` (above).

## `boot.isNspawnContainer = true`

Not `boot.isContainer = true`. Renamed in nixos-25.11+.

## `nixos-container create` auto-assigns `HOST_ADDRESS` / `LOCAL_ADDRESS`

…in the `.conf`. The start script's `if HOST_ADDRESS set → --network-veth` branch then forces a private netns, which is silently fatal for our web UIs (the bind is invisible from the host). We force-clear `HOST_ADDRESS` / `LOCAL_ADDRESS` / `HOST_ADDRESS6` / `LOCAL_ADDRESS6` / `HOST_BRIDGE` and set `PRIVATE_NETWORK=0`.

## systemd service PATH ≠ host PATH

The hive-c0re service sets `path = [ pkgs.git "/run/current-system/sw" ]`. In-container harness services do the same so anything an agent adds to its own `agent.nix` (`environment.systemPackages`) is visible to claude's Bash tool without editing the service definition. `environment.HYPERHIVE_GIT` bakes git's absolute path in (read by `lifecycle::git_command()`) for the host.

## `RuntimeDirectoryPreserve = "yes"`

…keeps `/run/hyperhive/` (and the per-agent sub-dirs) across hive-c0re restarts. Without it, every restart wipes bind sources and existing containers can't be started.

## `register_agent` is idempotent

Drops any prior socket task before rebinding.
Required so a hive-c0re restart followed by `rebuild alice` recreates the agent's socket without needing a clean reinstall.

## `claude-code` is unfree

The flake pins it to **nixpkgs-unstable** via `overlays.claude-unstable` (stable lags too far). The overlay sets `config.allowUnfreePredicate` on its unstable import to whitelist `claude-code` specifically (scoped to this one package). `harness-base.nix` does the same at the container level because each per-agent `nixosConfiguration` evaluates its own nixpkgs instance and the operator's host-level `allowUnfree` does **not** propagate in. Operators don't need to set anything on their side.

## Claude credentials are per-agent

`/var/lib/hyperhive/agents//claude/` bind-mounts to `/root/.claude` (RW). Sharing one dir across agents is NOT viable: OAuth refresh tokens rotate, so any sibling refresh invalidates all the others. Login flow runs from the per-agent web UI; creds persist across `destroy`/recreate (`--purge` wipes them).

## Persistent notes dir per agent

`/var/lib/hyperhive/agents//state/` bind-mounts to `/state` (RW). System prompts tell agents to keep durable knowledge here (`/state/notes.md`, anything else under `/state/`). The harness also writes its events log here (`/state/hyperhive-events.sqlite`). Survives `destroy`/recreate alongside the claude dir.

## Web UI ports collide on hash

Sub-agent web UI ports are deterministic FNV-1a of the agent name modulo 900 (range 8100..8999). With ~30 agents the birthday-paradox collision rate gets meaningful; at 2–3 agents you can still get unlucky. Operator resolves a collision by renaming the offending agent (different hash → different port) and rebuilding. No state file, no probing, no port-allocation drift: the value is reproducible from just the name. Manager is fixed at 8000; dashboard at `cfg.dashboardPort` (default 7000).
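
The port derivation above can be sketched roughly as follows. This is an illustration, not the project's code: the notes don't say which FNV-1a width is used, so the 64-bit variant is assumed here, and `fnv1a_64`, `agent_port`, and `find_collisions` are hypothetical names. A different width or a different byte encoding of the name would give different ports.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// FNV-1a over raw bytes, 64-bit variant (ASSUMPTION: width is not
/// stated in the notes; the 32-bit variant would yield other ports).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b); // XOR first, then multiply: that is the "1a" order
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

const PORT_BASE: u16 = 8100; // first port of the sub-agent range
const PORT_SPAN: u64 = 900; // 8100..=8999

/// Hypothetical helper: map an agent name to its web UI port.
fn agent_port(name: &str) -> u16 {
    // The remainder is < 900, so the cast and the addition cannot overflow.
    PORT_BASE + (fnv1a_64(name.as_bytes()) % PORT_SPAN) as u16
}

/// Hypothetical helper: report `(port, first_name, later_name)` for every
/// name whose port is already taken by an earlier name in the slice.
fn find_collisions<'a>(names: &[&'a str]) -> Vec<(u16, &'a str, &'a str)> {
    let mut seen: HashMap<u16, &'a str> = HashMap::new();
    let mut collisions = Vec::new();
    for &name in names {
        let port = agent_port(name);
        match seen.entry(port) {
            Entry::Occupied(e) => collisions.push((port, *e.get(), name)),
            Entry::Vacant(e) => {
                e.insert(name);
            }
        }
    }
    collisions
}
```

Running something like `find_collisions` over the planned agent names before a rebuild would surface a clash early, which fits the "rename and rebuild" resolution: no state is consulted, only the names.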
## Restart races on TCP bind

Both the dashboard and per-agent web UI use `tokio::net::TcpSocket` with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries, exponential backoff capped at 2s, ~22s total). REUSEADDR handles the `TIME_WAIT` case from a clean previous exit; retry covers the genuine "previous process is still alive during a systemd restart overlap" case. REUSEADDR does **not** allow two simultaneous `LISTEN` sockets on the same port (that would be `SO_REUSEPORT`, which we don't use), so exclusivity is preserved.

## Orphan approvals

If state dirs are wiped out from under a pending approval (test scripts, manual `rm -rf`), the dashboard's next render marks them `failed` with note `"agent state dir missing"` so they fall out of `pending`. They stay in sqlite for audit.
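
Returning to the bind retry loop described under "Restart races on TCP bind": the backoff schedule can be sketched on its own, without any socket code. The initial delay is not stated in these notes, so the 500 ms base used in the usage note is an assumption, and `backoff_delay` and `backoff_schedule` are hypothetical names, not the project's functions.

```rust
use std::time::Duration;

/// Delay to sleep after failed attempt `attempt` (0-based): doubles from
/// `base` each time and is clamped to `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // 2^attempt overflows u32 for large attempts; saturate instead of panicking.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, the result is beyond the cap anyway.
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// One delay per try; a real retry loop would sleep for each of these
/// after an `AddrInUse` error and give up once the schedule is exhausted.
fn backoff_schedule(tries: u32, base: Duration, cap: Duration) -> Vec<Duration> {
    (0..tries).map(|i| backoff_delay(i, base, cap)).collect()
}
```

With 12 tries, a 2 s cap, and the assumed 500 ms base, the schedule is 0.5 s, 1 s, then ten 2 s waits, 21.5 s in total, which is in the neighbourhood of the ~22s quoted above. Other base delays give a shorter total, so the real value may differ.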