docs sync + revert auto-unfree removal

revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
müde 2026-05-15 21:26:13 +02:00
parent d275b50177
commit 62d1a74929
10 changed files with 239 additions and 95 deletions

@ -54,14 +54,13 @@ socket without needing a clean reinstall.
## `claude-code` is unfree
The flake pins it to **nixpkgs-unstable** via
-`overlays.claude-unstable` (stable lags too far). The overlay
-imports unstable inheriting the user's `nixpkgs.config`, so the
-operator must opt in by setting `allowUnfree = true` (or an
-`allowUnfreePredicate` that whitelists `claude-code`) on their host
-config. hyperhive deliberately does NOT auto-allow — silent unfree
-bypass would be sketchy, and the error message is clear enough that
-the operator can fix it once and forget about it. Same on the
-per-agent containers (they inherit through the same nixpkgs).
+`overlays.claude-unstable` (stable lags too far). The overlay sets
+`config.allowUnfreePredicate` on its unstable import to whitelist
+`claude-code` specifically — scoped, only this one package.
+`harness-base.nix` does the same at the container level because
+each per-agent `nixosConfiguration` evaluates its own nixpkgs
+instance and the operator's host-level `allowUnfree` does **not**
+propagate in. Operators don't need to set anything on their side.
## Claude credentials are per-agent
@ -79,6 +78,28 @@ across `destroy`/recreate (`--purge` wipes them).
writes its events log here (`/state/hyperhive-events.sqlite`).
Survives `destroy`/recreate alongside the claude dir.
## Web UI ports collide on hash
Sub-agent web UI ports are deterministic: FNV-1a of the agent name
modulo 900, offset into the range 8100..8999 (inclusive). Assuming
the hash spreads names evenly over the 900 slots, the birthday-paradox
chance of at least one collision is about 25% at 23 agents and about
39% at 30. The operator resolves a collision by renaming the offending
agent (different hash → different port) and rebuilding. No state
file, no probing, no port-allocation drift — the value is
reproducible from just the name. Manager is fixed at 8000;
dashboard at `cfg.dashboardPort` (default 7000).
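
As a rough illustration of the scheme described above, the sketch below derives a port from an agent name and estimates the collision odds. It is not the project's code: the 32-bit FNV-1a variant, the function names (`fnv1a_32`, `agent_port`, `collision_probability`), and the constants' names are assumptions made for the example; only "FNV-1a modulo 900, range 8100..8999" comes from the text.

```rust
/// 32-bit FNV-1a over a byte slice.
/// Assumption: the 32-bit variant; the real implementation may use 64 bits.
fn fnv1a_32(bytes: &[u8]) -> u32 {
    const OFFSET_BASIS: u32 = 0x811c_9dc5;
    const PRIME: u32 = 0x0100_0193;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u32::from(b); // FNV-1a: xor first, then multiply
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

// Hypothetical names for the values quoted in the text.
const PORT_BASE: u16 = 8100;
const PORT_SLOTS: u32 = 900;

/// Map an agent name to a port in 8100..=8999, reproducible from the name alone.
fn agent_port(name: &str) -> u16 {
    PORT_BASE + (fnv1a_32(name.as_bytes()) % PORT_SLOTS) as u16
}

/// Probability that at least two of `agents` names share a slot,
/// assuming each name lands on one of `slots` ports uniformly at random.
fn collision_probability(agents: u32, slots: u32) -> f64 {
    let mut p_unique = 1.0_f64;
    for k in 0..agents {
        // The k-th agent must avoid the k slots already taken.
        p_unique *= (1.0 - f64::from(k) / f64::from(slots)).max(0.0);
    }
    1.0 - p_unique
}
```

Because the port depends only on the name, renaming an agent is enough to move it to a different slot; `collision_probability(n, 900)` gives a quick way to judge how crowded the range is for a given fleet size.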
## Restart races on TCP bind
Both the dashboard and per-agent web UI use `tokio::net::TcpSocket`
with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries,
exponential backoff capped at 2s, ~22s total). REUSEADDR handles
the `TIME_WAIT` case from a clean previous exit; retry covers the
genuine "previous process is still alive during a systemd restart
overlap" case. REUSEADDR does **not** allow two simultaneous
`LISTEN` sockets on the same port (that would be `SO_REUSEPORT`,
which we don't use) — exclusivity is preserved.
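
The retry shape described above might look roughly like the following. This is a dependency-free sketch using `std::net::TcpListener` rather than `tokio::net::TcpSocket`, so the explicit `SO_REUSEADDR` setup is left out; the helper names (`backoff_delay`, `retry_on_addr_in_use`, `bind_with_retry`) and the 250 ms initial delay are assumptions for illustration. Only the 12 tries, the 2 s cap, and the retry-on-`AddrInUse` rule come from the text.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): doubles each time, capped at `cap`.
fn backoff_delay(attempt: u32, initial: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    initial.saturating_mul(factor).min(cap)
}

/// Call `bind` up to `max_tries` times (always at least once), sleeping
/// between attempts only when the failure is `AddrInUse`. Any other error,
/// or the final `AddrInUse`, is returned to the caller unchanged.
fn retry_on_addr_in_use<T>(
    max_tries: u32,
    initial: Duration,
    cap: Duration,
    mut bind: impl FnMut() -> io::Result<T>,
    mut sleep: impl FnMut(Duration),
) -> io::Result<T> {
    let mut attempt = 0;
    loop {
        match bind() {
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && attempt + 1 < max_tries => {
                sleep(backoff_delay(attempt, initial, cap));
                attempt += 1;
            }
            other => return other,
        }
    }
}

/// Blocking variant with the numbers from the text (12 tries, 2 s cap).
/// The 250 ms starting delay is an assumed value, not taken from the source.
fn bind_with_retry(addr: SocketAddr) -> io::Result<TcpListener> {
    retry_on_addr_in_use(
        12,
        Duration::from_millis(250),
        Duration::from_secs(2),
        || TcpListener::bind(addr),
        thread::sleep,
    )
}
```

Passing the sleep function in as a parameter keeps the loop testable without real waiting; an async version would await a timer at the same point instead of blocking the thread.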
## Orphan approvals
If state dirs are wiped out from under a pending approval (test