# Gotchas

NixOS + nspawn quirks and lessons we hit the hard way. If something
here looks unmotivated in the code, there's usually a story underneath.

## `nixos-container` doesn't expose `--bind` on the CLI

The CLI doesn't accept `--bind`. The way in is `EXTRA_NSPAWN_FLAGS` in
`/etc/nixos-containers/<NAME>.conf`: the start script
(`/nix/store/.../container_-start`) expands it unquoted into the
`systemd-nspawn` invocation. `lifecycle::set_nspawn_flags()` rewrites
this line.
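
A minimal sketch of what rewriting that line could look like. This is not the actual `lifecycle::set_nspawn_flags()`; the function name `set_extra_nspawn_flags` is hypothetical, and it assumes the `.conf` holds one `KEY=value` entry per line. Quoting of the value is left to the caller.

```rust
/// Hypothetical helper (not the project's real implementation): return the
/// conf text with its `EXTRA_NSPAWN_FLAGS=` line replaced, or appended if
/// the key is absent. Assumes one `KEY=value` entry per line.
pub fn set_extra_nspawn_flags(conf: &str, flags: &str) -> String {
    const KEY: &str = "EXTRA_NSPAWN_FLAGS=";
    let new_line = format!("{}{}", KEY, flags);
    let mut replaced = false;
    let mut out: Vec<String> = Vec::new();

    for line in conf.lines() {
        if line.starts_with(KEY) {
            // Rewrite the first occurrence and drop any duplicates.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }

    // Always end the file with a trailing newline.
    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Because the start script expands the variable unquoted, every whitespace-separated word in `flags` ends up as its own `systemd-nspawn` argument.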

## `/run/systemd/nspawn/*.nspawn` overrides are ignored

`nixos-container`'s start script builds the nspawn command line
directly. Dropping a `.nspawn` file under `/run/systemd/nspawn/`
looks like the obvious extension point and does nothing. Use
`EXTRA_NSPAWN_FLAGS` (above).

## `boot.isNspawnContainer = true`

Not `boot.isContainer = true`. Renamed in nixos-25.11+.

## `nixos-container create` auto-assigns `HOST_ADDRESS` / `LOCAL_ADDRESS`

…in the `.conf`. The start script's `if HOST_ADDRESS set →
--network-veth` branch then forces a private netns, which is silently
fatal for our web UIs (their listening sockets are invisible from the
host). We force-clear `HOST_ADDRESS` / `LOCAL_ADDRESS` / `HOST_ADDRESS6` /
`LOCAL_ADDRESS6` / `HOST_BRIDGE` and set `PRIVATE_NETWORK=0`.
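
The clearing step could be sketched as a pure text transform over the `.conf`. The function name `force_host_network` is hypothetical, the file is assumed to hold one `KEY=value` entry per line, and "force-clear" is interpreted here as dropping the lines entirely (setting them to empty values would be another reading).

```rust
/// Keys that `nixos-container create` may write and that we want gone.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: strip the address/bridge keys and pin
/// `PRIVATE_NETWORK=0` so the container shares the host network namespace.
/// Assumes one `KEY=value` entry per line.
pub fn force_host_network(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for line in conf.lines() {
        // Everything before the first '=' is treated as the key.
        let key = line.split('=').next().unwrap_or("").trim();
        if CLEARED_KEYS.contains(&key) || key == "PRIVATE_NETWORK" {
            continue; // dropped; PRIVATE_NETWORK is re-added below
        }
        out.push(line.to_string());
    }
    out.push("PRIVATE_NETWORK=0".to_string());
    out.join("\n") + "\n"
}
```

Unrelated keys such as `EXTRA_NSPAWN_FLAGS` pass through untouched, so this transform composes with the bind-flag rewrite.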

## systemd service PATH ≠ host PATH

The hive-c0re service sets `path = [ pkgs.git "/run/current-system/sw" ]`.
In-container harness services do the same, so anything an agent adds
to its own `agent.nix` (`environment.systemPackages`) is visible to
claude's Bash tool without editing the service definition.
On the host, `environment.HYPERHIVE_GIT` bakes in git's absolute path
(read by `lifecycle::git_command()`).

## `RuntimeDirectoryPreserve = "yes"`

…keeps `/run/hyperhive/` (and the per-agent sub-dirs) across
hive-c0re restarts. Without it, every restart wipes bind sources and
existing containers can't be started.

## `register_agent` is idempotent

Drops any prior socket task before rebinding. Required so a
hive-c0re restart followed by `rebuild alice` recreates the agent's
socket without needing a clean reinstall.

## `claude-code` is unfree

The flake pins it to **nixpkgs-unstable** via
`overlays.claude-unstable` (stable lags too far). The overlay sets
`config.allowUnfreePredicate` on its unstable import to whitelist
`claude-code` specifically (scoped, only this one package).
`harness-base.nix` does the same at the container level because
each per-agent `nixosConfiguration` evaluates its own nixpkgs
instance and the operator's host-level `allowUnfree` does **not**
propagate in. Operators don't need to set anything on their side.

## Claude credentials are per-agent

`/var/lib/hyperhive/agents/<name>/claude/` bind-mounts to
`/root/.claude` (RW). Sharing one dir across agents is NOT viable:
OAuth refresh tokens rotate, so any sibling refresh invalidates all
the others. Login flow runs from the per-agent web UI; creds persist
across `destroy`/recreate (`--purge` wipes them).

## Persistent notes dir per agent

`/var/lib/hyperhive/agents/<name>/state/` bind-mounts to `/state`
(RW). System prompts tell agents to keep durable knowledge here
(`/state/notes.md`, anything else under `/state/`). The harness also
writes its events log here (`/state/hyperhive-events.sqlite`).
Survives `destroy`/recreate alongside the claude dir.

## Web UI ports collide on hash

Sub-agent web UI ports are deterministic: 8100 plus the FNV-1a hash of
the agent name modulo 900 (range 8100..8999). With ~30 agents the
birthday-paradox collision probability is already close to 40%
(assuming a uniform hash); at 2–3 agents it is well under 1%, but you
can still get unlucky. The operator resolves a collision by renaming
the offending agent (different hash → different port) and rebuilding.
No state file, no probing, no port-allocation drift: the value is
reproducible from just the name. The manager is fixed at 8000; the
dashboard is at `cfg.dashboardPort` (default 7000).
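
A sketch of the port derivation and of the birthday-paradox estimate. The hash width is not stated here, so 64-bit FNV-1a is an assumption, as are the names `fnv1a_64`, `agent_port` and `collision_probability`. The probability function assumes the hash spreads names uniformly over the slots.

```rust
// Standard 64-bit FNV-1a parameters (the 64-bit width is an assumption).
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte in, then multiply by the prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical mapping of an agent name to a port in 8100..=8999.
pub fn agent_port(name: &str) -> u16 {
    8100 + (fnv1a_64(name.as_bytes()) % 900) as u16
}

/// Probability that at least two of `agents` names share a slot,
/// assuming each name lands uniformly in one of `slots` slots.
pub fn collision_probability(agents: u32, slots: u32) -> f64 {
    if agents > slots {
        return 1.0; // pigeonhole: a collision is certain
    }
    let mut p_unique = 1.0_f64;
    for k in 0..agents {
        p_unique *= f64::from(slots - k) / f64::from(slots);
    }
    1.0 - p_unique
}
```

Under these assumptions `collision_probability(30, 900)` comes out a little under 0.39, while `collision_probability(2, 900)` is 1/900, which is why renaming is an acceptable fix at small agent counts.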

## Restart races on TCP bind

Both the dashboard and per-agent web UI use `tokio::net::TcpSocket`
with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries,
exponential backoff capped at 2s, ~22s total). REUSEADDR handles
the `TIME_WAIT` case from a clean previous exit; retry covers the
genuine "previous process is still alive during a systemd restart
overlap" case. REUSEADDR does **not** allow two simultaneous
`LISTEN` sockets on the same port (that would be `SO_REUSEPORT`,
which we don't use), so exclusivity is preserved.
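
The retry half of this can be sketched with the standard library alone. To stay self-contained, this version swaps `tokio::net::TcpSocket` for the blocking `std::net::TcpListener` and does not set `SO_REUSEADDR` itself; the names `backoff_delay` and `bind_with_retry` and the choice of base delay are assumptions, not the project's code.

```rust
use std::io::{self, ErrorKind};
use std::net::{SocketAddr, TcpListener};
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(cap)
}

/// Try to bind up to `tries` times, sleeping between attempts only when
/// the failure is `AddrInUse`. Any other error is returned immediately.
pub fn bind_with_retry(
    addr: SocketAddr,
    tries: u32,
    base: Duration,
    cap: Duration,
) -> io::Result<TcpListener> {
    let mut attempt = 0;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            Err(e) if e.kind() == ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

Retrying only on `AddrInUse` keeps real configuration errors (a bad address, missing permissions) failing fast instead of hiding behind the backoff.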

## Orphan approvals

If state dirs are wiped out from under pending approvals (test
scripts, manual `rm -rf`), the dashboard's next render marks them
`failed` with note `"agent state dir missing"` so they fall out of
`pending`. They stay in sqlite for audit.
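
The sweep could look roughly like the sketch below. The `Approval` and `Status` types, the `fail_orphans` name and the `<root>/<agent>` directory layout are all hypothetical stand-ins for illustration; the real records live in sqlite.

```rust
use std::path::Path;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Pending,
    Failed,
}

/// Hypothetical in-memory view of an approval row.
#[derive(Debug, Clone)]
pub struct Approval {
    pub agent: String,
    pub status: Status,
    pub note: Option<String>,
}

/// Mark every pending approval whose agent directory is gone as failed.
/// Rows are updated in place, never removed, so they remain for audit.
/// Returns how many approvals were changed.
pub fn fail_orphans(approvals: &mut [Approval], agents_root: &Path) -> usize {
    let mut changed = 0;
    for approval in approvals.iter_mut() {
        let dir = agents_root.join(&approval.agent);
        if approval.status == Status::Pending && !dir.is_dir() {
            approval.status = Status::Failed;
            approval.note = Some("agent state dir missing".to_string());
            changed += 1;
        }
    }
    changed
}
```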

## Nix store `cp -r` preserves read-only bits

Copying a nix store path with `cp -r src/. $out/` inside a
`pkgs.runCommand` derivation preserves the read-only permissions of
store files and directories. Any subsequent write into the copied
tree (adding new files in subdirectories) fails with `EACCES`
("Permission denied"). Fix: pass `--no-preserve=mode,ownership` so
the output tree is writable.

## `hive-forge`: prefer over raw curl pipelines

Every agent container has `hive-forge` in PATH (installed via
`harness-base.nix`; lives in `/hive-forge` as a proper Rust binary
since #280). Use it instead of ad-hoc curl pipelines:

```bash
hive-forge view 42                          # title + body + comments
hive-forge comment 42 --body "..."          # post comment (inline body)
hive-forge comment 42 --body-file - <<EOF   # ...or pipe a HEREDOC
multi-line body
EOF
hive-forge assign 42 damocles
hive-forge close 42
hive-forge labels 42 add feature
hive-forge pr 42                            # PR metadata as JSON
hive-forge branches deployed/               # filter branches by pattern
hive-forge -r other-org/other-repo pr 7     # target a different repo
```

`hive-forge <verb> --help` prints the full signature for any verb.
Credentials come from `$HYPERHIVE_STATE_DIR/forge-token`; default
repo from `$HIVE_FORGE_REPO`, overridden per-invocation by the
global `-r/--repo` flag.