crash_watch grows two more state-axes alongside running/stopped:
- logged-in (claude session dir populated for the agent)
- up-to-date (recorded flake rev matches current)
per-tick transitions emit HelperEvent::NeedsLogin / LoggedIn /
NeedsUpdate. seed-on-first-tick semantics retained — nothing fires
on harness boot for agents that were already in their state. only
needs_update fires the 'stale appeared' direction; the resolved
direction is already covered by Rebuilt.
new mcp__hyperhive__update(name) on the manager surface: idempotent
rebuild via auto_update::rebuild_agent. transient-aware (Rebuilding)
so the dashboard shows the spinner. login intentionally has NO tool
— it's interactive OAuth, only the operator can complete it.
prompts + approvals doc + turn-loop doc updated. todo grows a
'show per-agent applied config in dashboard' entry (separate
follow-up).
revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.
docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
/api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
state row (badge + model chip + last-turn chip + cancel button),
per-agent inbox, /model slash, /cancel-question + journald
endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
override via /model); recv(wait_seconds) up to 180s with the
rationale; ask_operator gains ttl_seconds; new TurnState section;
kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
shipped, dropped from the list.
new hive_c0re::crash_watch task polls every 10s, builds the set of
currently-running containers, and on running→stopped transitions
checks the transient snapshot: if no Stopping / Restarting /
Destroying / Rebuilding flag is set, the container exited
unexpectedly and we fire HelperEvent::ContainerCrash into the
manager's inbox so it can react (typically: start it again).
first poll is a seeding pass — no events on harness startup. dbus
subscription would be lower-latency but polling is honest and
debuggable, and a 10s delay on crash detection is fine for our
scale.
manager prompt + approvals doc updated to advertise the new
event variant. todo drops the entry (and the journald-viewer
entry that already shipped).
claude.md was eating 400 lines of subsystem detail that's useful
when you're working on that subsystem and noise the rest of the
time. split into:
- docs/conventions.md naming, identity, async forms, commit style
- docs/gotchas.md nspawn / nixos-container quirks
- docs/web-ui.md dashboard + per-agent layouts and endpoints
- docs/turn-loop.md claude invocation, wake prompt, mcp surface
- docs/approvals.md approval flow, manager policy, helper events
- docs/persistence.md sqlite dbs, retention, state dir layout
claude.md is now the entry point — file map, reading paths
("pick the doc that matches your task"), quick reminders that
fit on one screen, and a small scratchpad section for in-flight
context. references the docs; the docs don't reference claude.md.
no content was lost — the docs/ files cover everything the old
claude.md did, plus things i wrote up better while extracting.