hyperhive/docs/gotchas.md
müde 62d1a74929 docs sync + revert auto-unfree removal
revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
2026-05-15 21:26:13 +02:00


# Gotchas
NixOS + nspawn quirks and lessons we hit the hard way. If something
here looks unmotivated in the code, there's usually a story underneath.
## `nixos-container` doesn't expose `--bind` on the CLI
The CLI doesn't accept `--bind`. The workaround is `EXTRA_NSPAWN_FLAGS` in
`/etc/nixos-containers/<NAME>.conf`: the start script
(`/nix/store/.../container_-start`) expands it unquoted into the
`systemd-nspawn` invocation, so a flag containing whitespace is split
into separate arguments. `lifecycle::set_nspawn_flags()` rewrites
this line.
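
As a rough illustration of that rewrite, here is a minimal sketch that operates on the `.conf` contents as a string. The function name `rewrite_extra_nspawn_flags` is hypothetical, and the quoted `KEY="value"` line shape is an assumption about the `.conf` format; the real `lifecycle::set_nspawn_flags()` may differ.

```rust
/// Hypothetical helper: return `conf` with its `EXTRA_NSPAWN_FLAGS=` line
/// replaced by one built from `flags`, or appended if the key is absent.
/// Assumes the .conf is a list of shell-style `KEY=value` lines.
fn rewrite_extra_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    // The start script expands this value unquoted, so flags are joined with
    // single spaces and must not contain whitespace themselves.
    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags.join(" "));
    let mut replaced = false;
    let mut out: Vec<String> = Vec::new();

    for line in conf.lines() {
        if line.trim_start().starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Keep exactly one definition; drop any duplicates.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }

    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Calling it with `&["--bind=/a:/b"]` on a conf that already has an `EXTRA_NSPAWN_FLAGS` line swaps that line in place and leaves every other key untouched.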
## `/run/systemd/nspawn/*.nspawn` overrides are ignored
`nixos-container`'s start script builds the nspawn command line
directly. Dropping a `.nspawn` file under `/run/systemd/nspawn/`
looks like the obvious extension point and does nothing. Use
`EXTRA_NSPAWN_FLAGS` (above).
## `boot.isNspawnContainer = true`
Use this, not `boot.isContainer = true`. The option was renamed in nixos-25.11+.
## `nixos-container create` auto-assigns `HOST_ADDRESS` / `LOCAL_ADDRESS`
…in the `.conf`. The start script's `if HOST_ADDRESS set →
--network-veth` branch then forces a private netns — silently fatal
for our web UIs (the bind is invisible from the host). We
force-clear `HOST_ADDRESS` / `LOCAL_ADDRESS` / `HOST_ADDRESS6` /
`LOCAL_ADDRESS6` / `HOST_BRIDGE` and set `PRIVATE_NETWORK=0`.
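
The force-clear can be pictured as a pass over the `.conf` text. This is a sketch under assumptions: the function name `force_host_network` is hypothetical, and "clearing" is modelled as writing an empty assignment rather than deleting the line, which may not match what hyperhive actually does.

```rust
/// Keys that would push the start script into a private network namespace.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: blank out the address/bridge keys and pin
/// `PRIVATE_NETWORK=0`, leaving every other line as it was.
fn force_host_network(conf: &str) -> String {
    let mut out = String::new();
    let mut saw_private_network = false;

    for line in conf.lines() {
        // Everything before the first '=' is treated as the key.
        let key = line.split('=').next().unwrap_or("").trim();
        if CLEARED_KEYS.contains(&key) {
            // Assumption: an empty value is enough to skip the veth branch.
            out.push_str(key);
            out.push_str("=\n");
        } else if key == "PRIVATE_NETWORK" {
            if !saw_private_network {
                out.push_str("PRIVATE_NETWORK=0\n");
                saw_private_network = true;
            }
        } else {
            out.push_str(line);
            out.push('\n');
        }
    }
    if !saw_private_network {
        out.push_str("PRIVATE_NETWORK=0\n");
    }
    out
}
```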
## systemd service PATH ≠ host PATH
The hive-c0re service sets `path = [ pkgs.git "/run/current-system/sw" ]`.
In-container harness services do the same so anything an agent adds
to its own `agent.nix` (`environment.systemPackages`) is visible to
claude's Bash tool without editing the service definition.
On the host, `environment.HYPERHIVE_GIT` bakes in git's absolute
path, which `lifecycle::git_command()` reads.
## `RuntimeDirectoryPreserve = "yes"`
…keeps `/run/hyperhive/` (and the per-agent sub-dirs) across
hive-c0re restarts. Without it, every restart wipes bind sources and
existing containers can't be started.
## `register_agent` is idempotent
Drops any prior socket task before rebinding. Required so a
hive-c0re restart followed by `rebuild alice` recreates the agent's
socket without needing a clean reinstall.
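
The idempotence property can be illustrated with a small registry. This is only a sketch: the `Registry` type, the `<name>.sock` file naming, and the use of a blocking `std` `UnixListener` in place of an async socket task are all assumptions, not hive-c0re's actual design.

```rust
use std::collections::HashMap;
use std::io;
use std::os::unix::net::UnixListener;
use std::path::PathBuf;

/// Hypothetical registry: one listening socket per agent name.
struct Registry {
    dir: PathBuf,
    listeners: HashMap<String, UnixListener>,
}

impl Registry {
    fn new(dir: PathBuf) -> Self {
        Registry { dir, listeners: HashMap::new() }
    }

    /// Safe to call repeatedly for the same name: the previous listener is
    /// dropped and its socket file removed before binding again.
    fn register_agent(&mut self, name: &str) -> io::Result<PathBuf> {
        // Dropping the old listener closes its file descriptor.
        self.listeners.remove(name);

        let path = self.dir.join(format!("{name}.sock"));
        // A stale socket file would make bind() fail with AddrInUse.
        match std::fs::remove_file(&path) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }

        let listener = UnixListener::bind(&path)?;
        self.listeners.insert(name.to_string(), listener);
        Ok(path)
    }
}
```

Registering `alice` twice in a row yields the same socket path both times and leaves exactly one live listener.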
## `claude-code` is unfree
The flake pins it to **nixpkgs-unstable** via
`overlays.claude-unstable` (stable lags too far). The overlay sets
`config.allowUnfreePredicate` on its unstable import to whitelist
`claude-code` specifically — scoped, only this one package.
`harness-base.nix` does the same at the container level because
each per-agent `nixosConfiguration` evaluates its own nixpkgs
instance and the operator's host-level `allowUnfree` does **not**
propagate in. Operators don't need to set anything on their side.
## Claude credentials are per-agent
`/var/lib/hyperhive/agents/<name>/claude/` bind-mounts to
`/root/.claude` (RW). Sharing one dir across agents is NOT viable —
OAuth refresh tokens rotate, so any sibling refresh invalidates all
the others. Login flow runs from the per-agent web UI; creds persist
across `destroy`/recreate (`--purge` wipes them).
## Persistent notes dir per agent
`/var/lib/hyperhive/agents/<name>/state/` bind-mounts to `/state`
(RW). System prompts tell agents to keep durable knowledge here
(`/state/notes.md`, anything else under `/state/`). The harness also
writes its events log here (`/state/hyperhive-events.sqlite`).
Survives `destroy`/recreate alongside the claude dir.
## Web UI ports collide on hash
Sub-agent web UI ports are a deterministic FNV-1a hash of the agent
name modulo 900, offset into the range 8100..8999. Assuming the hash
spreads names uniformly, the birthday-paradox chance of at least one
collision is roughly 25% at 23 agents and close to 40% at 30, so
plan for the occasional collision. The operator resolves one by
renaming the offending agent (different hash → different port) and
rebuilding. There is no state file, no probing, and no
port-allocation drift: the value is reproducible from just the name.
The manager is fixed at 8000; the dashboard listens on
`cfg.dashboardPort` (default 7000).
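
A minimal sketch of the port derivation and the collision odds follows. The 64-bit FNV-1a variant is an assumption (the text above does not say which width is used), and `agent_port` and `collision_probability` are hypothetical names; the probability assumes names hash uniformly across the 900 slots.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(FNV_OFFSET_BASIS, |h, &b| (h ^ u64::from(b)).wrapping_mul(FNV_PRIME))
}

/// Hypothetical mapping from agent name to a port in 8100..=8999.
fn agent_port(name: &str) -> u16 {
    8100 + (fnv1a_64(name.as_bytes()) % 900) as u16
}

/// Birthday-paradox probability that at least two of `agents` names land on
/// the same one of `slots` ports, assuming a uniform hash.
fn collision_probability(agents: u32, slots: u32) -> f64 {
    let mut p_all_distinct = 1.0_f64;
    for i in 0..agents {
        // The i-th agent must avoid the i ports already taken.
        p_all_distinct *= (1.0 - f64::from(i) / f64::from(slots)).max(0.0);
    }
    1.0 - p_all_distinct
}
```

With 900 slots, `collision_probability(23, 900)` comes out near 0.25 and `collision_probability(30, 900)` near 0.39, which is why renaming is treated as a normal operator action rather than an edge case.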
## Restart races on TCP bind
Both the dashboard and per-agent web UI use `tokio::net::TcpSocket`
with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries,
exponential backoff capped at 2s, ~22s total). REUSEADDR handles
the `TIME_WAIT` case from a clean previous exit; retry covers the
genuine "previous process is still alive during a systemd restart
overlap" case. REUSEADDR does **not** allow two simultaneous
`LISTEN` sockets on the same port (that would be `SO_REUSEPORT`,
which we don't use) — exclusivity is preserved.
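
The retry shape can be sketched without tokio using `std::net::TcpListener`, whose `bind` already sets `SO_REUSEADDR` on Unix targets. The function names and the 500 ms base delay in the usage note are assumptions; the text above states only the try count, the 2s cap, and the approximate total.

```rust
use std::io::{self, ErrorKind};
use std::net::{SocketAddr, TcpListener};
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(cap)
}

/// Try to bind up to `tries` times, sleeping between attempts only when the
/// failure is `AddrInUse`. Any other error is returned immediately.
fn bind_with_retry(
    addr: SocketAddr,
    tries: u32,
    base: Duration,
    cap: Duration,
) -> io::Result<TcpListener> {
    let mut attempt = 0;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            // Previous process may still hold the port during a restart overlap.
            Err(e) if e.kind() == ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

A call such as `bind_with_retry(addr, 12, Duration::from_millis(500), Duration::from_secs(2))` gives up with `AddrInUse` only after the twelfth failed attempt. Because a second live listener on the same port still fails every attempt, the exclusivity described above is preserved.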
## Orphan approvals
If an agent's state dir is wiped out from under its pending approvals
(test scripts, manual `rm -rf`), the dashboard's next render marks
those approvals `failed` with note `"agent state dir missing"` so they
fall out of `pending`. They stay in sqlite for audit.
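
The sweep can be sketched over an in-memory list. The `Approval` and `Status` types and the `fail_orphans` name are hypothetical stand-ins for the sqlite rows, and the `<root>/<name>/state` layout is taken from the persistent-notes section above.

```rust
use std::path::Path;

#[derive(Debug, Clone, PartialEq)]
enum Status {
    Pending,
    Failed,
}

/// Hypothetical in-memory stand-in for an approval row.
#[derive(Debug, Clone)]
struct Approval {
    agent: String,
    status: Status,
    note: Option<String>,
}

/// Mark pending approvals as failed when `<agents_root>/<agent>/state` is
/// gone. Rows are updated in place, never removed, so the audit trail stays.
/// Returns how many approvals were changed.
fn fail_orphans(approvals: &mut [Approval], agents_root: &Path) -> usize {
    let mut changed = 0;
    for approval in approvals.iter_mut().filter(|a| a.status == Status::Pending) {
        let state_dir = agents_root.join(&approval.agent).join("state");
        if !state_dir.is_dir() {
            approval.status = Status::Failed;
            approval.note = Some("agent state dir missing".to_string());
            changed += 1;
        }
    }
    changed
}
```

Running the sweep a second time changes nothing, since the affected rows are no longer `Pending`.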