docs sync + revert auto-unfree removal

revert the earlier 'operator must set allowUnfree' move: per-agent containers evaluate their own nixpkgs and the operator's host-level allowUnfree doesn't propagate in. restoring the scoped allowUnfreePredicate inside both the claude-unstable overlay and harness-base.nix; documented in README + gotchas as 'nothing to set on the operator side'. docs: - claude.md file map adds crash_watch.rs, kick_agent on coordinator, /api/model + journald viewer + bind-with-retry references. - scratchpad rewritten to reflect the recent run. - web-ui.md: notification row + browser notifications section, state row (badge + model chip + last-turn chip + cancel button), per-agent inbox, /model slash, /cancel-question + journald endpoints, focus-preservation on refresh. - turn-loop.md: --model is read from Bus::model() per turn (runtime override via /model); recv(wait_seconds) up to 180s with the rationale; ask_operator gains ttl_seconds; new TurnState section; kick_agent inbox-on-startup hint. - approvals.md: ttl/cancel resolution paths for operator questions. - persistence.md: /state/hyperhive-model file. - gotchas.md: web UI port collision policy (rename, don't probe); bind retry + SO_REUSEADDR shape; auto-unfree restored. - todo.md: cleaned up empty sections and stale entries; /model shipped, dropped from the list.
2026-05-15 21:26:13 +02:00 · 2026-05-15 21:26:13 +02:00 · 62d1a74929
commit 62d1a74929
parent d275b50177
10 changed files with 239 additions and 95 deletions
--- a/docs/approvals.md
+++ b/docs/approvals.md
@ -84,9 +84,9 @@ package legitimacy, cheaper alternative, blast radius) before
 committing and calling `request_apply_commit`.

 For ambiguous cases or anything that needs human signal, the
-manager calls `ask_operator(question, options?, multi?)` — queues
-the question on the dashboard and returns the id immediately. The
-operator's answer arrives later as
+manager calls `ask_operator(question, options?, multi?,
+ttl_seconds?)` — queues the question on the dashboard and returns
+the id immediately. The operator's answer arrives later as
 `HelperEvent::OperatorAnswered` in the manager inbox. Storage is
 `hive-c0re::operator_questions` (sqlite); the answer flow is:

@ -96,6 +96,16 @@ POST /answer-question/{id}
  → notify_manager(OperatorAnswered { id, question, answer })
 ```

+Two more paths resolve a pending question with a sentinel answer:
+
+- `POST /cancel-question/{id}` (✗ CANC3L button on the dashboard)
+  resolves with `[cancelled]`. The manager sees a terminal state
+  and can fall back.
+- `ttl_seconds` deadline: a tokio watchdog spawned at submit time
+  fires `answer(id, "[expired]")` once the ttl runs out. Already-
+  resolved races no-op. The dashboard surfaces a `⏳ MM:SS` chip
+  on each pending question with a deadline.
+
 ## Helper events to the manager

 `Coordinator::notify_manager(&HelperEvent)` enqueues an inbox
--- a/docs/gotchas.md
+++ b/docs/gotchas.md
@ -54,14 +54,13 @@ socket without needing a clean reinstall.
 ## `claude-code` is unfree

 The flake pins it to **nixpkgs-unstable** via
-`overlays.claude-unstable` (stable lags too far). The overlay
-imports unstable inheriting the user's `nixpkgs.config`, so the
-operator must opt in by setting `allowUnfree = true` (or an
-`allowUnfreePredicate` that whitelists `claude-code`) on their host
-config. hyperhive deliberately does NOT auto-allow — silent unfree
-bypass would be sketchy, and the error message is clear enough that
-the operator can fix it once and forget about it. Same on the
-per-agent containers (they inherit through the same nixpkgs).
+`overlays.claude-unstable` (stable lags too far). The overlay sets
+`config.allowUnfreePredicate` on its unstable import to whitelist
+`claude-code` specifically — scoped, only this one package.
+`harness-base.nix` does the same at the container level because
+each per-agent `nixosConfiguration` evaluates its own nixpkgs
+instance and the operator's host-level `allowUnfree` does **not**
+propagate in. Operators don't need to set anything on their side.

 ## Claude credentials are per-agent

@ -79,6 +78,28 @@ across `destroy`/recreate (`--purge` wipes them).
 writes its events log here (`/state/hyperhive-events.sqlite`).
 Survives `destroy`/recreate alongside the claude dir.

+## Web UI ports collide on hash
+
+Sub-agent web UI ports are deterministic FNV-1a of the agent name
+modulo 900 (range 8100..8999). With ~30 agents the birthday-paradox
+collision rate gets meaningful; at 2–3 agents you can still get
+unlucky. Operator resolves a collision by renaming the offending
+agent (different hash → different port) and rebuilding. No state
+file, no probing, no port-allocation drift — the value is
+reproducible from just the name. Manager is fixed at 8000;
+dashboard at `cfg.dashboardPort` (default 7000).
+
+## Restart races on TCP bind
+
+Both the dashboard and per-agent web UI use `tokio::net::TcpSocket`
+with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries,
+exponential backoff capped at 2s, ~22s total). REUSEADDR handles
+the `TIME_WAIT` case from a clean previous exit; retry covers the
+genuine "previous process is still alive during a systemd restart
+overlap" case. REUSEADDR does **not** allow two simultaneous
+`LISTEN` sockets on the same port (that would be `SO_REUSEPORT`,
+which we don't use) — exclusivity is preserved.
+
 ## Orphan approvals

 If state dirs are wiped out from under a pending approval (test
--- a/docs/persistence.md
+++ b/docs/persistence.md
@ -46,6 +46,15 @@ setups). On open failure the `Bus` falls back to no-store mode
 rather than crashing the harness — events still broadcast over SSE,
 just nothing persisted.

+### `/state/hyperhive-model` (per agent)
+
+Single-line text file holding the claude model name currently
+selected for this agent (default `haiku` when absent). Written by
+`Bus::set_model` whenever the operator flips it via `/model
+<name>` in the web terminal. Read once at harness boot in
+`Bus::new`. Path overridable via `HYPERHIVE_MODEL_FILE`.
+Survives destroy/recreate, gone on `--purge`.
+
 ## State dirs (per agent)

 Under `/var/lib/hyperhive/agents/<name>/`:
--- a/docs/turn-loop.md
+++ b/docs/turn-loop.md
@ -26,7 +26,7 @@ Each agent harness (`hive-ag3nt serve` or `hive-m1nd serve`) runs:
 ## The claude invocation

 ```
-claude --print --verbose --output-format stream-json --model haiku \
+claude --print --verbose --output-format stream-json --model <name> \
  --continue --settings /run/hive/claude-settings.json \
  --system-prompt-file /run/hive/claude-system-prompt.md \
  --mcp-config /run/hive/claude-mcp-config.json --strict-mcp-config \
@ -34,6 +34,12 @@ claude --print --verbose --output-format stream-json --model haiku \
 # wake prompt piped over stdin
 ```

+`<name>` is read from `Bus::model()` on each turn, default
+`haiku`. Operator can flip it at runtime with `/model <name>` in
+the web terminal — the next turn picks it up.  The choice is
+persisted to `/state/hyperhive-model` so it survives restart;
+override path: `HYPERHIVE_MODEL_FILE` env var for tests.
+
 `--continue` keeps a persistent session per agent (claude stores
 sessions in `~/.claude/projects/`, which is bind-mounted
 persistently). Auto-compact and auto-memory are disabled via
@ -45,6 +51,12 @@ The wake prompt is intentionally minimal: just the popped message's
 …)` hint when `unread > 0`. Claude drives any further `recv`/`send`
 itself via the embedded MCP server.

+Whenever hive-c0re starts / restarts / rebuilds a container, it
+also drops a `system` message into the agent's inbox via
+`Coordinator::kick_agent` — a one-line "you were just (re)started,
+check /state/ for your notes, --continue session is intact". The
+next turn picks it up like any other inbox message.
+
 ### On-boot files

 `hive_ag3nt::turn::write_*` writes three files next to the per-agent
@ -75,7 +87,11 @@ it as a stdio child via `--mcp-config`. The hyperhive socket name is
 - `send(to, body)` — message a peer (logical agent name), another
  agent, or the operator (recipient `operator`, surfaces in the
  dashboard inbox).
- `recv()` — drain one inbox message.
+- `recv(wait_seconds?)` — drain one inbox message. Long-polls
+  server-side; `wait_seconds` is capped at 180 (default 30 when
+  omitted). Agents use a long wait to park their turn waiting for
+  work instead of busy-looping with short polls — they wake
+  instantly when a message arrives.

 ### Manager tools (in addition to send/recv)

@ -87,12 +103,16 @@ it as a stdio child via `--mcp-config`. The hyperhive socket name is
 - `request_apply_commit(agent, commit_ref)` — submit a config
  change for any agent (`hm1nd` for the manager's own config) for
  operator approval.
- `ask_operator(question, options?, multi?)` — surface a question
-  on the dashboard. Non-blocking — returns the queued question id;
-  the operator's answer arrives later as
+- `ask_operator(question, options?, multi?, ttl_seconds?)` —
+  surface a question on the dashboard. Non-blocking — returns the
+  queued question id; the operator's answer arrives later as
  `HelperEvent::OperatorAnswered` in the manager inbox. Options
  always render alongside a free-text fallback; `multi=true`
-  renders options as checkboxes.
+  renders options as checkboxes. `ttl_seconds` auto-cancels with
+  answer `[expired]` after the deadline (useful for time-sensitive
+  decisions that become moot if the operator hasn't responded).
+  The operator can also manually cancel with `[cancelled]` via the
+  dashboard.

 The boundary: lifecycle ops on *existing* sub-agents
 (`kill`/`start`/`restart`) are at the manager's discretion — no
@ -100,6 +120,21 @@ operator approval. Creating a new agent (`request_spawn`) and
 changing any agent's config (`request_apply_commit`) still go
 through the approval queue.

+### Authoritative state
+
+`hive_ag3nt::events::Bus` carries the current turn-loop state in
+addition to the broadcast channel and the events history. Variants:
+
+- `Idle` — sitting on `Recv` waiting for mail.
+- `Thinking` — `claude --print` is running for a turn.
+- `Compacting` — operator-triggered `/compact` is in flight.
+
+The harness flips state at the relevant transitions
+(`set_state(Thinking)` before `drive_turn`, `set_state(Idle)`
+after; `set_state(Compacting)` around `compact_session`). Exposed
+via `/api/state.turn_state` + `turn_state_since` (unix seconds);
+the agent page renders this rather than deriving from SSE events.
+
 ### Tool envelope

 `mcp::run_tool_envelope`: every MCP tool handler logs the request,
--- a/docs/web-ui.md
+++ b/docs/web-ui.md
@ -20,46 +20,84 @@ listener: read `data-confirm`, swap the button to a spinner, POST
 `application/x-www-form-urlencoded`, re-enable the button on success
 (refreshState may keep the form mounted, so we don't rely on a
 re-render), call `refreshState()`. State shapes live in
-`dashboard.rs::StateSnapshot` and `web_ui.rs::StateSnapshot` —
-when adding state fields, plumb through the snapshot struct and the
+`dashboard.rs::StateSnapshot` and `web_ui.rs::StateSnapshot` — when
+adding state fields, plumb through the snapshot struct and the
 relevant `assets/app.js` render function.

+**Focus preservation:** `refreshState` checks whether
+`document.activeElement` sits inside one of the managed sections
+and, if so, skips the refresh (defers 2s). The operator never has
+the form yanked out from under them mid-type; the update lands as
+soon as they blur.
+
+Both bind their listeners with `SO_REUSEADDR` via
+`tokio::net::TcpSocket` plus a retry loop on `AddrInUse` (12 tries,
+exponential backoff capped at 2s) so an nspawn restart that races
+the previous process's socket release resolves itself.
+
 ## Dashboard sections (top to bottom)

-1. **C0NTAINERS** — live containers with their action surface.
-2. **K3PT ST4T3** — destroyed-but-state-kept tombstones (size +
+1. **Notification row** — `🔔 enable notifications` button when
+   permission ungranted; `🔕 mute / 🔔 unmute` toggle once granted;
+   inline "unsupported / blocked" message when applicable. Sits
+   under the banner.
+2. **C0NTAINERS** — live containers with their action surface.
+3. **K3PT ST4T3** — destroyed-but-state-kept tombstones (size +
   age + claude-creds badge). Two actions: `⊕ R3V1V3` (queues a
   Spawn approval; existing state is reused), `PURG3` (wipes
   state + applied dirs; `POST /purge-tombstone/{name}`).
-3. **M1ND H4S QU3STI0NS** — pending `ask_operator` questions
-   (amber pulsing border). Always renders a free-text fallback
+4. **M1ND H4S QU3STI0NS** — pending `ask_operator` questions
+   (amber pulsing border). Free-text fallback always rendered
   alongside any option list; `multi=true` renders options as
   checkboxes; submit merges selections + free text comma-joined.
-4. **0PER4T0R 1NB0X** — recent messages addressed to `operator`
+   Each row has a `✗ CANC3L` button that resolves the question
+   with `[cancelled]`. Questions with a `ttl_seconds` show a
+   `⏳ MM:SS` chip; the host-side watchdog auto-cancels with
+   `[expired]` when the deadline fires.
+5. **0PER4T0R 1NB0X** — recent messages addressed to `operator`
   (last 50, from the broker).
-5. **P3NDING APPR0VALS** — the queue. The R3QU3ST SP4WN form
-   lives at the top of this section since submitting it immediately
-   queues an approval that lands directly below.
-6. **MESS4GE FL0W** — live broker SSE tail.
+6. **P3NDING APPR0VALS** — the queue. The R3QU3ST SP4WN form
+   lives at the top of this section since submitting it
+   immediately queues an approval that lands directly below.
+7. **MESS4GE FL0W** — live broker SSE tail.

 ### Container row

 Two-line layout (`assets/app.js::renderContainers`):

 - Line 1: agent name (link → new tab), m1nd/ag3nt chip, `needs
-  login` / `needs update` warning badges, in-flight `◐ pending-state…`
-  pill (replaces buttons during start/stop/restart/rebuild/destroy),
-  container name + port.
+  login` / `needs update` warning badges, in-flight `◐
+  pending-state…` pill (replaces buttons during start / stop /
+  restart / rebuild / destroy), container name + port.
 - Line 2: action buttons — `↻ R3BU1LD` always, `DESTR0Y` + `PURG3`
  on sub-agents, `↺ R3ST4RT` + (sub-agents) `■ ST0P` when running,
  `▶ ST4RT` when stopped. Buttons dim + disable while a transient
  lifecycle action is in flight.
+- Plus a collapsible `↳ logs · <container>` `<details>` block.
+  Expanding lazy-fetches journald output via `GET
+  /api/journal/{name}?unit=...&lines=...` (`journalctl -M
+  <container> -b --no-pager --output=short-iso`). A unit dropdown
+  switches between the harness service (default) and the full
+  machine journal; refresh button re-fetches.

 `↻ UPD4TE 4LL` button appears above the containers list when any
-agent is stale.
+agent is stale. Banner pulses on each broker SSE event
+(`pulseBanner` with a 4s grace timer).

-Banner pulses on each broker SSE event (`pulseBanner` with a 4 s
-grace timer).
+### Browser notifications
+
+Pure frontend (`Notification` API). Three signals trigger them:
+
+- new pending approval (per id, delta on `/api/state`)
+- new pending operator question (per id)
+- new broker message sent `to: "operator"` (live via SSE)
+
+First `/api/state` after page load seeds "seen" sets without
+firing — only items that arrive while the page is open count.
+`tag: "hyperhive"` collapses bursts; click focuses the dashboard
+tab. localStorage-backed mute toggle silences without revoking
+the OS permission. Requires a secure context (HTTPS or
+localhost); on other origins the controls hide themselves.

 ### Dashboard endpoints

@ -67,20 +105,38 @@ grace timer).
 - `POST /{rebuild,kill,restart,start,destroy}/{name}` — lifecycle.
 - `POST /purge-tombstone/{name}` — wipe a tombstone's state dirs.
 - `POST /answer-question/{id}` — answer a pending operator question.
+- `POST /cancel-question/{id}` — cancel a pending question with
+  the sentinel `[cancelled]`. Same code path as a real answer.
 - `POST /request-spawn` — queue a Spawn approval.
 - `POST /update-all` — rebuild every stale container.
+- `GET /api/journal/{name}?unit=&lines=` — journalctl viewer for
+  a managed container.

 ## Per-agent page

 Layout, top to bottom:

 - Banner (gradient shimmer while state=thinking).
- Title with `↻ R3BU1LD` button.
+- Title with `↑ DASHB04RD` back-link (new tab) + `↻ R3BU1LD`.
 - Status section (online / needs login / login-in-progress).
- State badge row (`💤 idle / 🧠 thinking / ○ offline · <age>`) +
-  `■ cancel turn` button visible while state=thinking.
- Terminal-wrap: live event tail (with sticky-bottom auto-scroll
-  and a `↓ N new` pill when not at bottom) followed by an
+- **State row**: state badge + model chip + last-turn timing +
+  cancel-turn button.
+  - State badge: `💤 idle` / `🧠 thinking` / `📦 compacting` /
+    `○ offline` / `… booting`, with an age suffix (`12s`,
+    `2m 14s`). Driven from `/api/state.turn_state` +
+    `turn_state_since`; SSE turn_start/turn_end still flip it
+    instantly between polls. Authoritative source is the
+    harness's `Bus::state_snapshot()`.
+  - Model chip: `model · <name>` (e.g. `model · haiku`).
+  - Last-turn chip: `last turn 12.3s` appears after the first
+    turn ends, computed from the state-since deltas.
+  - `■ cancel turn` button: visible only while state=thinking,
+    POSTs `/api/cancel`.
+- Inbox `<details>` block (collapsed): `inbox · N` — last 30
+  messages addressed to this agent, fetched via
+  `AgentRequest::Recent { limit: 30 }`.
+- Terminal-wrap: live event tail (sticky-bottom auto-scroll +
+  `↓ N new` pill when not at bottom) followed by an
  operator-input textarea acting as a prompt.

 ### Live view
@ -92,25 +148,32 @@ The harness emits `TurnStart { from, body, unread }`,
 `TurnEnd { ok, note }`. The web UI:

 - fetches `GET /events/history` on page load and replays the last
-  2000 events (oldest first, with `.no-anim` so they don't stagger);
+  2000 events (oldest first, with `.no-anim` so they don't
+  stagger);
 - then subscribes to `GET /events/stream` (SSE) for live tail;
- shows a granular state badge above the terminal, driven from
-  `turn_start`/`turn_end`, with a flash animation on transition;
+- shows a granular state badge above the terminal, driven
+  authoritatively from `/api/state.turn_state`. SSE turn_start /
+  turn_end still flip the badge instantly between renders;
 - sticky-bottom auto-scroll: scrolling up parks the view; new rows
  surface a "↓ N new" pill instead of yanking;
- terminal-themed: phosphor mauve glow, Crust bg, backdrop-filter
-  blur, row fade-in slide-up.
+- terminal-themed: phosphor mauve glow, Crust bg,
+  backdrop-filter blur, row fade-in slide-up.

 Per-stream rendering:

- `Stream` `tool_use` → `→ Read /path` / `→ Bash $ cmd` / `→ send →
-  operator: "..."` etc., per-tool pretty rather than raw JSON.
+- `Stream` `tool_use` →
+  - `Write` / `Edit`: collapsed `<details>` with a +/- diff body
+    (`-` lines from `input.old_string`, `+` lines from
+    `input.new_string` or every line of `input.content`).
+    Summary carries the path + line counts.
+  - others (`Read /path`, `Bash $ cmd`, `mcp__hyperhive__send →
+    operator: "..."`, etc.): flat one-line per-tool format.
 - `Stream` `tool_result` short → flat `← ...`; long → collapsed
  `<details>` `▸ ← Nl · headline` (click to expand full body).
 - `Stream` `thinking` → text content if claude provided one,
  otherwise the bare `· thinking …` indicator.
- `Stream` `system init`, `result`, `rate_limit_event` are dropped
-  — too noisy.
+- `Stream` `system init`, `result`, `rate_limit_event` are
+  dropped — too noisy.
 - `Note` → `· text`.
 - `TurnEnd` → `✓ turn ok` / `✗ turn fail — note`, triggers a
  `refreshState()`.
@ -118,19 +181,23 @@ Per-stream rendering:
 ### Terminal-embedded prompt

 The operator input lives *inside* the terminal-wrap as a
-prompt-style textarea below the live tail: multi-line (Enter sends,
-Shift+Enter newlines), tab-completes slash commands.
+prompt-style textarea below the live tail: multi-line (Enter
+sends, Shift+Enter newlines), tab-completes slash commands.

 Slash commands today:

- `/help` — list commands locally
- `/clear` — wipe the local terminal view (server history kept)
+- `/help` — list commands locally.
+- `/clear` — wipe the local terminal view (server history kept).
 - `/cancel` — `POST /api/cancel` → host shellouts `pkill -INT
-  claude`, emits a Note. Also surfaces as a `■ cancel turn` button
-  in the state row while state=thinking.
+  claude`, emits a Note. Also surfaces as a `■ cancel turn`
+  button in the state row while state=thinking.
 - `/compact` — `POST /api/compact` → host spawns
  `turn::compact_session` in the background; output streams into
  the live panel.
+- `/model <name>` — `POST /api/model` flipping `Bus::set_model`.
+  Takes effect on the next turn; persisted to
+  `/state/hyperhive-model` so the override survives harness
+  restart / rebuild.

 Unknown `/foo` shows an error row instead of being silently sent.

@ -140,4 +207,6 @@ Unknown `/foo` shows an error row instead of being silently sent.
 - `POST /login/{start,code,cancel}` — claude OAuth login flow.
 - `POST /api/cancel` — SIGINT the in-flight claude turn.
 - `POST /api/compact` — run `/compact` on the persistent session.
+- `POST /api/model` (`model=<name>`) — switch the model for
+  future turns.
 - `GET /events/history` — replay buffer for the terminal.