docs sync + revert auto-unfree removal

revert the earlier 'operator must set allowUnfree' move: per-agent
containers evaluate their own nixpkgs and the operator's host-level
allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to set
on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section, state
  row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
2026-05-15 21:26:13 +02:00 · 62d1a74929
commit 62d1a74929
parent d275b50177
10 changed files with 239 additions and 95 deletions
--- a/docs/gotchas.md
+++ b/docs/gotchas.md
@@ -54,14 +54,13 @@ socket without needing a clean reinstall.
 ## `claude-code` is unfree

 The flake pins it to **nixpkgs-unstable** via
-`overlays.claude-unstable` (stable lags too far). The overlay
-imports unstable inheriting the user's `nixpkgs.config`, so the
-operator must opt in by setting `allowUnfree = true` (or an
-`allowUnfreePredicate` that whitelists `claude-code`) on their host
-config. hyperhive deliberately does NOT auto-allow — silent unfree
-bypass would be sketchy, and the error message is clear enough that
-the operator can fix it once and forget about it. Same on the
-per-agent containers (they inherit through the same nixpkgs).
+`overlays.claude-unstable` (stable lags too far). The overlay sets
+`config.allowUnfreePredicate` on its unstable import to whitelist
+`claude-code` specifically — scoped, only this one package.
+`harness-base.nix` does the same at the container level because
+each per-agent `nixosConfiguration` evaluates its own nixpkgs
+instance and the operator's host-level `allowUnfree` does **not**
+propagate in. Operators don't need to set anything on their side.

 ## Claude credentials are per-agent

@@ -79,6 +78,28 @@ across `destroy`/recreate (`--purge` wipes them).
 writes its events log here (`/state/hyperhive-events.sqlite`).
 Survives `destroy`/recreate alongside the claude dir.

+## Web UI ports collide on hash
+
+Sub-agent web UI ports are deterministic FNV-1a of the agent name
+modulo 900 (range 8100..8999). With ~30 agents the birthday-paradox
+collision rate gets meaningful; at 2–3 agents you can still get
+unlucky. Operator resolves a collision by renaming the offending
+agent (different hash → different port) and rebuilding. No state
+file, no probing, no port-allocation drift — the value is
+reproducible from just the name. Manager is fixed at 8000;
+dashboard at `cfg.dashboardPort` (default 7000).
+
+## Restart races on TCP bind
+
+Both the dashboard and per-agent web UI use `tokio::net::TcpSocket`
+with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries,
+exponential backoff capped at 2s, ~22s total). REUSEADDR handles
+the `TIME_WAIT` case from a clean previous exit; retry covers the
+genuine "previous process is still alive during a systemd restart
+overlap" case. REUSEADDR does **not** allow two simultaneous
+`LISTEN` sockets on the same port (that would be `SO_REUSEPORT`,
+which we don't use) — exclusivity is preserved.
+
 ## Orphan approvals

 If state dirs are wiped out from under a pending approval (test