Gotchas
NixOS + nspawn quirks and lessons we hit the hard way. If something here looks unmotivated in the code, there's usually a story underneath.
nixos-container doesn't expose --bind on the CLI
The CLI doesn't accept --bind. Set bind mounts through
EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<NAME>.conf instead: the
start script (/nix/store/.../container_-start) expands that variable
unquoted into the systemd-nspawn invocation.
lifecycle::set_nspawn_flags() rewrites this line.
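A minimal sketch of what such a rewrite could look like, as a pure text transformation. This is not the actual `lifecycle::set_nspawn_flags()`; the signature, the double-quoted `KEY="value"` layout of the .conf, and the paths in the usage note are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for `lifecycle::set_nspawn_flags()`: returns the
/// conf text with its EXTRA_NSPAWN_FLAGS line replaced, or appended if the
/// conf had none. Assumes a shell-style `KEY="value"` file.
pub fn set_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    // The start script expands the value unquoted, so a flag containing
    // whitespace would be split into several arguments. Refuse those.
    assert!(flags.iter().all(|f| !f.contains(char::is_whitespace)));

    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags.join(" "));
    let mut out: Vec<String> = Vec::new();
    let mut replaced = false;
    for line in conf.lines() {
        if line.starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Keep exactly one copy of the line; drop any duplicates.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }
    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Calling it with `&["--bind=/run/hyperhive/alice:/run/hive"]` (a made-up path pair) leaves every other line of the .conf untouched, which matters because the same file also carries the networking keys discussed further down.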
/run/systemd/nspawn/*.nspawn overrides are ignored
nixos-container's start script builds the nspawn command line
directly. Dropping a .nspawn file under /run/systemd/nspawn/
looks like the obvious extension point and does nothing. Use
EXTRA_NSPAWN_FLAGS (above).
boot.isNspawnContainer = true
Use this, not boot.isContainer = true. The option was renamed in nixos-25.11+.
nixos-container create auto-assigns HOST_ADDRESS / LOCAL_ADDRESS
…in the .conf. When HOST_ADDRESS is set, the start script takes its
--network-veth branch, which forces a private network namespace. That is
silently fatal for our web UIs: the port they bind is invisible from the
host. We force-clear HOST_ADDRESS / LOCAL_ADDRESS / HOST_ADDRESS6 /
LOCAL_ADDRESS6 / HOST_BRIDGE and set PRIVATE_NETWORK=0.
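The force-clear step can be sketched as a filter over the .conf text. The helper name `force_host_network` is hypothetical, and the sketch reads "force-clear" as deleting the lines outright; the real code may blank the values instead.

```rust
/// Keys that steer the start script towards a private network setup.
const NETWORK_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: drops the auto-assigned address keys and pins
/// PRIVATE_NETWORK=0, leaving every other line of the conf as it was.
pub fn force_host_network(conf: &str) -> String {
    let mut out = String::new();
    for line in conf.lines() {
        // Compare the full key so HOST_ADDRESS does not shadow HOST_ADDRESS6.
        let key = line.split('=').next().unwrap_or("").trim();
        let is_network_key = NETWORK_KEYS.iter().any(|k| *k == key);
        if is_network_key || key == "PRIVATE_NETWORK" {
            continue; // PRIVATE_NETWORK is re-added below with value 0
        }
        out.push_str(line);
        out.push('\n');
    }
    out.push_str("PRIVATE_NETWORK=0\n");
    out
}
```

The function is idempotent, so it is safe to run on every start rather than only right after `nixos-container create`.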
systemd service PATH ≠ host PATH
The hive-c0re service sets path = [ pkgs.git "/run/current-system/sw" ].
In-container harness services do the same so anything an agent adds
to its own agent.nix (environment.systemPackages) is visible to
claude's Bash tool without editing the service definition.
On the host, environment.HYPERHIVE_GIT bakes in git's absolute path,
which lifecycle::git_command() reads.
RuntimeDirectoryPreserve = "yes"
…keeps /run/hyperhive/ (and the per-agent sub-dirs) across
hive-c0re restarts. Without it, every restart wipes bind sources and
existing containers can't be started.
register_agent is idempotent
Drops any prior socket task before rebinding. Required so a
hive-c0re restart followed by rebuild alice recreates the agent's
socket without needing a clean reinstall.
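The drop-before-rebind ordering can be illustrated with a std-only sketch. `Registry`, `SocketTask`, and the closure-based signature are hypothetical stand-ins; the real `register_agent` presumably manages async tasks and sockets rather than a flag.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for a running socket task. Dropping it flips `stopped`,
/// which models aborting the task and releasing its socket.
pub struct SocketTask {
    stopped: Arc<AtomicBool>,
}

impl SocketTask {
    pub fn new(stopped: Arc<AtomicBool>) -> Self {
        SocketTask { stopped }
    }
}

impl Drop for SocketTask {
    fn drop(&mut self) {
        self.stopped.store(true, Ordering::SeqCst);
    }
}

#[derive(Default)]
pub struct Registry {
    tasks: HashMap<String, SocketTask>,
}

impl Registry {
    /// Safe to call repeatedly for the same agent. Any previous task is
    /// removed and dropped *before* `bind` runs, so the new task never
    /// competes with a stale one for the same socket.
    pub fn register_agent(&mut self, name: &str, bind: impl FnOnce() -> SocketTask) {
        drop(self.tasks.remove(name));
        let task = bind();
        self.tasks.insert(name.to_string(), task);
    }

    pub fn agent_count(&self) -> usize {
        self.tasks.len()
    }
}
```

Taking the bind step as a closure keeps the ordering explicit: the caller cannot construct the new task before the old one has been torn down.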
claude-code is unfree
The flake pins it to nixpkgs-unstable via
overlays.claude-unstable (stable lags too far). The overlay sets
config.allowUnfreePredicate on its unstable import to whitelist
claude-code specifically — scoped, only this one package.
harness-base.nix does the same at the container level because
each per-agent nixosConfiguration evaluates its own nixpkgs
instance and the operator's host-level allowUnfree does not
propagate in. Operators don't need to set anything on their side.
Claude credentials are per-agent
/var/lib/hyperhive/agents/<name>/claude/ bind-mounts to
/root/.claude (RW). Sharing one dir across agents is NOT viable —
OAuth refresh tokens rotate, so any sibling refresh invalidates all
the others. Login flow runs from the per-agent web UI; creds persist
across destroy/recreate (--purge wipes them).
Persistent notes dir per agent
/var/lib/hyperhive/agents/<name>/state/ bind-mounts to /state
(RW). System prompts tell agents to keep durable knowledge here
(/state/notes.md, anything else under /state/). The harness also
writes its events log here (/state/hyperhive-events.sqlite).
Survives destroy/recreate alongside the claude dir.
Web UI ports collide on hash
Sub-agent web UI ports are deterministic: 8100 plus the FNV-1a hash of
the agent name modulo 900 (range 8100..8999). Assuming the hash spreads
names evenly over the 900 slots, the birthday paradox puts the chance of
at least one collision at roughly 39% with 30 agents; with 3 agents it
is about 0.3%, so you can still get unlucky. The operator resolves a
collision by renaming the offending agent (different hash → different
port) and rebuilding. No state file, no probing, no port-allocation
drift: the value is reproducible from just the name. Manager is fixed at
8000; dashboard at cfg.dashboardPort (default 7000).
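The port derivation and the collision odds can be sketched as follows. The 64-bit FNV-1a variant is an assumption (the hash width is not stated here), so the concrete ports this prints may differ from what the harness assigns; the function names are illustrative too.

```rust
const PORT_BASE: u16 = 8100;
const PORT_SLOTS: u64 = 900;

/// 64-bit FNV-1a over raw bytes. The width is an assumption; a 32-bit
/// variant would use a different offset basis and prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

/// Port for a sub-agent web UI: always within 8100..=8999.
pub fn agent_port(name: &str) -> u16 {
    PORT_BASE + (fnv1a_64(name.as_bytes()) % PORT_SLOTS) as u16
}

/// Probability that at least two of `agents` names land on the same
/// port, assuming the hash is uniform over the 900 slots.
pub fn collision_probability(agents: u32) -> f64 {
    let slots = PORT_SLOTS as f64;
    let mut all_distinct = 1.0;
    for i in 0..agents {
        // The i-th agent must avoid the i ports already taken.
        all_distinct *= (1.0 - f64::from(i) / slots).max(0.0);
    }
    1.0 - all_distinct
}
```

Because the port depends only on the name, checking a candidate name for a clash before creating the agent is a matter of calling `agent_port` on it and on the existing names.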
Restart races on TCP bind
Both the dashboard and per-agent web UI use tokio::net::TcpSocket
with SO_REUSEADDR plus a retry-on-AddrInUse loop (12 tries,
exponential backoff capped at 2s, ~22s total). REUSEADDR handles
the TIME_WAIT case from a clean previous exit; retry covers the
genuine "previous process is still alive during a systemd restart
overlap" case. REUSEADDR does not allow two simultaneous
LISTEN sockets on the same port (that would be SO_REUSEPORT,
which we don't use) — exclusivity is preserved.
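The retry half of this can be sketched without tokio. The version below is std-only and takes the bind attempt and the sleep as closures so the control flow is easy to follow; the function names are hypothetical, and the base delay is a parameter because the actual starting delay is not given here.

```rust
use std::io;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `cap`. Overflow saturates to the cap.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}

/// Calls `bind` up to `tries` times (at least once). Only AddrInUse is
/// retried; any other error, or the last AddrInUse, is returned as-is.
pub fn retry_on_addr_in_use<T>(
    tries: u32,
    base: Duration,
    cap: Duration,
    mut bind: impl FnMut() -> io::Result<T>,
    mut sleep: impl FnMut(Duration),
) -> io::Result<T> {
    let mut attempt = 0;
    loop {
        match bind() {
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

A blocking caller could pass `|| std::net::TcpListener::bind(addr)` as `bind` and `std::thread::sleep` as `sleep`; an async version would await a timer instead. Retrying only on AddrInUse keeps genuine misconfiguration (for example a permission error on a privileged port) failing fast.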
Orphan approvals
If state dirs are wiped out from under pending approvals (test
scripts, manual rm -rf), the dashboard's next render marks those
approvals failed with the note "agent state dir missing" so they fall
out of pending. They stay in sqlite for audit.
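In outline, the sweep looks something like the sketch below. The `Approval` and `Status` types and the `fail_orphans` name are hypothetical, the existence check is passed in as a closure, and the real code works against sqlite rows rather than an in-memory slice.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Pending,
    Failed { note: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Approval {
    pub agent: String,
    pub status: Status,
}

/// Marks pending approvals whose agent state dir is gone as failed.
/// Nothing is deleted, so the records remain available for audit.
/// Returns how many approvals were changed.
pub fn fail_orphans(
    approvals: &mut [Approval],
    state_dir_exists: impl Fn(&str) -> bool,
) -> usize {
    let mut changed = 0;
    for approval in approvals.iter_mut() {
        if approval.status == Status::Pending && !state_dir_exists(&approval.agent) {
            approval.status = Status::Failed {
                note: "agent state dir missing".to_string(),
            };
            changed += 1;
        }
    }
    changed
}
```

In production the closure would check the agent's directory under /var/lib/hyperhive/agents/, for instance with `std::path::Path::is_dir`.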
Nix store cp -r preserves read-only bits
Copying a nix store path with cp -r src/. $out/ inside a
pkgs.runCommand derivation preserves the read-only permissions of
store files and directories. Any subsequent write into the copied tree
(adding new files in subdirectories) fails with "Permission denied"
(EACCES). Fix: pass --no-preserve=mode,ownership to cp so the output
tree is writable.
hive-forge: prefer over raw curl pipelines
Every agent container has hive-forge in PATH (installed via
harness-base.nix; lives in /hive-forge as a proper Rust binary
since #280). Use it instead of ad-hoc curl pipelines:
hive-forge view 42 # title + body + comments
hive-forge comment 42 --body "..." # post comment (inline body)
hive-forge comment 42 --body-file - <<EOF # ...or pipe a HEREDOC
multi-line body
EOF
hive-forge assign 42 damocles
hive-forge close 42
hive-forge labels 42 add feature
hive-forge pr 42 # PR metadata as JSON
hive-forge branches deployed/ # filter branches by pattern
hive-forge -r other-org/other-repo pr 7 # target a different repo
hive-forge <verb> --help prints the full signature for any verb.
Credentials come from $HYPERHIVE_STATE_DIR/forge-token; default
repo from $HIVE_FORGE_REPO, overridden per-invocation by the
global -r/--repo flag.