docs: turn_stats sink + event-driven agent badges + dashboard event vocabulary

müde 2026-05-17 23:28:34 +02:00
parent e772182724
commit d890509be3
3 changed files with 171 additions and 30 deletions


@@ -45,9 +45,19 @@ hive-c0re/ host daemon + CLI (one binary, subcommand-dispatched)
 src/crash_watch.rs poll every 10s; fire HelperEvent::ContainerCrash
 when a previously-running container disappears
 without an operator-initiated transient
+src/container_view.rs ContainerView struct + build_all helper;
+shared between dashboard.rs (cold-load via
+/api/state) and coordinator.rs's
+rescan_containers_and_emit
 src/coordinator.rs shared state (broker/approvals/operator_questions/
 transient/sockets) + tombstone enumeration +
-kick_agent + notify_agent (helper-event push)
+kick_agent + notify_agent (helper-event push) +
+last_containers cache + rescan_and_emit diff helper
+src/open_threads.rs loose-ends aggregator (pending approvals +
+unanswered questions) — for_agent (filtered) and
+hive_wide (manager surface). Backs
+AgentRequest::GetOpenThreads + ManagerRequest::
+GetOpenThreads (the get_open_threads MCP tool).
 src/actions.rs approve/deny/destroy (transient-aware)
 src/auto_update.rs startup rebuild scan + ensure_manager +
 meta::lock_update_hyperhive bump
@@ -85,6 +95,9 @@ hive-ag3nt/ in-container harness crate; produces TWO binaries
 src/client.rs generic JSON-line request/response over unix socket
 src/web_ui.rs per-container axum HTTP page (incl /api/cancel,
 /api/compact, /api/model, /events/history)
+src/turn_stats.rs per-turn analytics sink (one sqlite row per
+turn at /state/hyperhive-turn-stats.sqlite);
+schema + best-effort writer
 src/events.rs LiveEvent + broadcast Bus + sqlite-backed history
 (/state/hyperhive-events.sqlite) + TurnState +
 model selection (persisted at /state/hyperhive-model)
@@ -193,6 +206,34 @@ Prune freely.
 domain tooling — the agent flake's `inputs` block pulls
 the external flake, `agent.nix` references it via
 `flakeInputs.<name>.packages.${pkgs.system}.default`.
+- **Just landed:** per-turn analytics sink. New
+`hive-ag3nt::turn_stats` writes one row per claude turn to
+`/state/hyperhive-turn-stats.sqlite`: identity (model,
+wake_from, result_kind), timing (started/ended_at,
+duration_ms), cost (full token-usage breakdown), behaviour
+(tool_call_count + per-tool JSON map), and post-turn snapshot
+metrics (open_threads_count, open_reminders_count fetched via
+the existing GetOpenThreads + new CountPendingReminders RPC).
+Both ag3nt + m1nd bin loops capture, both Bus accumulates
+tool_use blocks via observe_stream during the stdout pump.
+Writes are best-effort. No host-side vacuum yet — TODO under
+Telemetry; same shape as events_vacuum, target 90d retention.
+- **Just landed:** agent web UI event-driven badges. New
+`LiveEvent::StatusChanged / ModelChanged / TokenUsageChanged
+/ TurnStateChanged` variants replace the per-agent page's
+/api/state polling for the state row. Status/model/token/state
+badges all update from SSE; /api/state only fetched on cold
+load + during the login flow (session output isn't event-
+shaped). Per-agent endpoints (`/api/cancel|compact|model|
+new-session`, `/login/*`) all flip 303→200. New `alive-badge`
+chip carries the harness reachability signal (replaces the
+"● harness alive" paragraph); new `ctx-badge` mirrors Claude
+Code's bottom-right "N tokens" indicator. Every chip carries
+a `title=...` tooltip for hover detail.
+- **Just landed:** events_vacuum simplified to age-only —
+`KEEP_SECS = 7d`, no row cap. Chatty turn no longer evicts
+a quiet day's history sooner than expected. Hourly sweep
+unchanged.
 - **Just landed:** Phase 6 container events. New
 `DashboardEvent::ContainerStateChanged { container }` +
 `ContainerRemoved { name }` close the last refetch loop on the


@@ -41,17 +41,36 @@ One table:
 harness emits during turn loop execution.

 The harness writes; the host vacuums. `hive-c0re::events_vacuum`
-runs hourly and sweeps every existing agent state dir, applying the
-same two-stage delete to each file: drop rows older than 7 days,
-then trim to the 2000 most-recent. Centralising retention on the
-host means a misbehaving harness can't disable its own vacuum and
-agents don't need any cleanup wiring of their own.
+runs hourly and sweeps every existing agent state dir, deleting
+rows older than 7 days. Age-only — no row cap — so a chatty turn
+doesn't lose history sooner than a quiet one; disk pressure on a
+sustained burst is the cheaper problem to have. Centralising
+retention on the host means a misbehaving harness can't disable
+its own vacuum and agents don't need any cleanup wiring of their
+own.

 Path overridable via `HYPERHIVE_EVENTS_DB` (for dev / no-`/state`
 setups). On open failure the `Bus` falls back to no-store mode
 rather than crashing the harness — events still broadcast over SSE,
 just nothing persisted.
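
The age-only retention rule can be sketched in a few lines. This is a minimal in-memory illustration, not the `events_vacuum` implementation: `EventRow`, `cutoff`, and `sweep` are hypothetical names, the `ts_unix` column is an assumption, and treating a row exactly 7 days old as kept is also an assumption. Only the 7-day window and the absence of a row cap come from the text.

```rust
/// Retention window in seconds (7 days), mirroring the documented `KEEP_SECS`.
const KEEP_SECS: u64 = 7 * 24 * 60 * 60;

/// Hypothetical row shape: an age-only sweep only needs the timestamp.
#[derive(Debug, Clone, PartialEq)]
struct EventRow {
    id: u64,
    ts_unix: u64,
}

/// Oldest timestamp that survives a sweep run at `now_unix`.
fn cutoff(now_unix: u64) -> u64 {
    now_unix.saturating_sub(KEEP_SECS)
}

/// Age-only sweep: drop rows strictly older than the cutoff, with no row cap.
/// In-memory stand-in for the delete the host would run against each sqlite file.
fn sweep(rows: Vec<EventRow>, now_unix: u64) -> Vec<EventRow> {
    let keep_from = cutoff(now_unix);
    rows.into_iter().filter(|row| row.ts_unix >= keep_from).collect()
}

fn main() {
    let now = 1_800_000_000; // fixed "now" so the example is deterministic
    let rows = vec![
        EventRow { id: 1, ts_unix: now - KEEP_SECS - 1 }, // just past the window
        EventRow { id: 2, ts_unix: now - 60 },            // a minute old
    ];
    for row in sweep(rows, now) {
        println!("kept row {} (ts {})", row.id, row.ts_unix);
    }
}
```

Because the only input is the row's age, the number of rows a busy turn writes has no effect on which older rows survive.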
+### `/state/hyperhive-turn-stats.sqlite` (per agent)
+
+Per-turn analytics sink. One row per claude turn captures
+identity (`model`, `wake_from`, `result_kind`), timing
+(`started_at`, `ended_at`, `duration_ms`), cost (input / output /
+cache_read / cache_creation token counts), behaviour
+(`tool_call_count` + `tool_call_breakdown_json`), and post-turn
+snapshot metrics (`open_threads_count`,
+`open_reminders_count` — fetched via the same socket the harness
+already uses for `GetOpenThreads` + `CountPendingReminders`).
+Bin-loop helpers `build_row` + `record` land each row at
+`turn_end`; writes are best-effort, a sqlite hiccup logs + lets
+the turn loop continue.
+
+No host-side vacuum yet — tracked in `TODO.md` under Telemetry
+(target retention ~90 days, age-only sweep like events_vacuum).
+
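
A rough sketch of the row shape and the best-effort write may help. It uses only the standard library and does not touch sqlite. The Rust types, the four `*_tokens` column names, and the helpers `tool_breakdown` and `record_best_effort` are assumptions made for illustration; they are not the signatures of the real `build_row` / `record` helpers.

```rust
use std::collections::BTreeMap;

/// Hypothetical mirror of one turn-stats row. Field names follow the columns
/// listed above where given; types and the `*_tokens` names are assumptions.
#[derive(Debug, Clone, Default)]
struct TurnStatsRow {
    model: String,
    wake_from: String,
    result_kind: String,
    started_at: i64,
    ended_at: i64,
    duration_ms: i64,
    input_tokens: u64,
    output_tokens: u64,
    cache_read_tokens: u64,
    cache_creation_tokens: u64,
    tool_call_count: u32,
    tool_call_breakdown_json: String,
    open_threads_count: u32,
    open_reminders_count: u32,
}

/// Minimal JSON string escaping: quotes and backslashes only.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Count tool calls per tool name and render the counts as a JSON object.
/// A BTreeMap keeps the keys sorted so the output is deterministic.
fn tool_breakdown(tool_names: &[&str]) -> (u32, String) {
    let mut counts: BTreeMap<&str, u32> = BTreeMap::new();
    for name in tool_names {
        *counts.entry(*name).or_insert(0) += 1;
    }
    let body: Vec<String> = counts
        .iter()
        .map(|(name, n)| format!("\"{}\":{}", escape(name), n))
        .collect();
    (tool_names.len() as u32, format!("{{{}}}", body.join(",")))
}

/// Best-effort write: log the failure and report it, never propagate it.
fn record_best_effort<E: std::fmt::Display>(write: impl FnOnce() -> Result<(), E>) -> bool {
    match write() {
        Ok(()) => true,
        Err(e) => {
            eprintln!("turn_stats: write failed, continuing: {e}");
            false
        }
    }
}

fn main() {
    let (count, json) = tool_breakdown(&["Bash", "Read", "Bash"]);
    let row = TurnStatsRow {
        model: "haiku".to_string(),
        tool_call_count: count,
        tool_call_breakdown_json: json,
        ..TurnStatsRow::default()
    };
    // The closure stands in for the sqlite insert.
    let stored = record_best_effort(|| -> Result<(), String> {
        println!("would insert: {row:?}");
        Ok(())
    });
    println!("stored: {stored}");
}
```

Returning a `bool` instead of a `Result` makes it hard for a caller to abort the turn loop on an analytics failure by accident.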
 ### `/state/hyperhive-model` (per agent)

 Single-line text file holding the claude model name currently
@@ -68,8 +87,10 @@ Under `/var/lib/hyperhive/agents/<name>/`:
 - `config/` — the proposed nix repo (manager-editable).
 - `claude/` — claude OAuth credentials, bind-mounted RW to
 `/root/.claude` inside the container.
-- `state/` — durable notes + the events.sqlite db, bind-mounted
-to `/state` inside the container.
+- `state/` — durable notes, the events.sqlite db, and the
+turn-stats sqlite db. Bind-mounted to `/agents/<name>/state`
+inside the container (the manager still uses the legacy
+`/state` mount point — same host path either way).

 Under `/var/lib/hyperhive/applied/<name>/` — the hive-c0re-only
 applied repo. Tracks `flake.nix` (module-only boilerplate; never


@@ -201,6 +201,22 @@ not ours.
 a managed container.
 - `GET /api/agent-config/{name}` — read-only view of the applied
 `agent.nix`.
+- `GET /api/state-file?path=<host-or-container-path>` — bounded
+text read of a file under the per-agent `state/` subtree or
+the shared `/var/lib/hyperhive/shared/`. Accepts the
+container-view forms (`/agents/<n>/state/...`, `/shared/...`)
+and the host form. Canonicalises + verifies the path stays
+inside the allow-list, refuses anything but a regular file,
+refuses `/agents/<n>/claude` / `config` subtrees, truncates
+bodies at 1 MiB. Backs the dashboard's inline path-link
+preview (PATH_RE detects pointer strings in message bodies,
+question/answer text, and the operator inbox; clicking
+expands a `<details>` that lazy-fetches via this endpoint).
+Trailing-slash matches (i.e. directory paths) are skipped on
+the client side — only files linkify.
+- `GET /api/reminders` — list pending reminders for the
+dashboard's queued-reminders panel.
+- `POST /cancel-reminder/{id}` — hard-delete a pending reminder.
 - `GET /dashboard/stream` — unified live event channel:
 broker `sent` / `delivered`, plus the mutation events listed
 below. Each frame carries `seq`.
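
The checks described for `GET /api/state-file` (canonicalise, stay inside the allow-list, regular files only, 1 MiB cap) follow a common pattern that can be sketched with the standard library alone. This is an illustration under assumptions, not the endpoint's code: `read_state_file`, `ReadError`, and the example file name are hypothetical, and the sketch leaves out the container-path to host-path mapping and the explicit `claude` / `config` refusals.

```rust
use std::fs;
use std::io::Read;
use std::path::{Path, PathBuf};

/// Body cap from the endpoint description: 1 MiB.
const MAX_BYTES: u64 = 1024 * 1024;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum ReadError {
    Unresolvable,
    OutsideAllowList,
    NotRegularFile,
    Io,
}

/// Bounded read of `requested`, permitted only under one of `allowed_roots`.
fn read_state_file(requested: &Path, allowed_roots: &[PathBuf]) -> Result<Vec<u8>, ReadError> {
    // Resolve symlinks and `..` first so the prefix check sees the real target.
    let real = fs::canonicalize(requested).map_err(|_| ReadError::Unresolvable)?;

    // Component-wise prefix check against each canonicalised root.
    let inside = allowed_roots.iter().any(|root| {
        fs::canonicalize(root)
            .map(|r| real.starts_with(&r))
            .unwrap_or(false)
    });
    if !inside {
        return Err(ReadError::OutsideAllowList);
    }

    // Directories, sockets, and devices are refused.
    let meta = fs::metadata(&real).map_err(|_| ReadError::Io)?;
    if !meta.is_file() {
        return Err(ReadError::NotRegularFile);
    }

    // `take` truncates the body at the cap instead of failing.
    let mut body = Vec::new();
    fs::File::open(&real)
        .map_err(|_| ReadError::Io)?
        .take(MAX_BYTES)
        .read_to_end(&mut body)
        .map_err(|_| ReadError::Io)?;
    Ok(body)
}

fn main() {
    let roots = vec![PathBuf::from("/var/lib/hyperhive/shared")];
    // `notes.md` is a made-up file name for the demo.
    match read_state_file(Path::new("/var/lib/hyperhive/shared/notes.md"), &roots) {
        Ok(body) => println!("read {} bytes", body.len()),
        Err(err) => println!("refused: {err:?}"),
    }
}
```

Canonicalising before the prefix comparison is what stops a request such as `state/../claude/...` from passing a purely textual check.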
@@ -223,21 +239,37 @@ payload):
 queue + history mutations. Client mutates a derived store and
 re-renders only the approvals section.
 - `question_added` (id, asker, question, options, multi,
-asked_at, deadline_at) / `question_resolved` (id, answer,
-answerer, answered_at, cancelled) — operator-targeted
-questions only (peer-to-peer questions never fire these). The
-ttl watchdog fires `question_resolved` with
-`answerer = "ttl-watchdog"` on expiry.
+asked_at, deadline_at, target) / `question_resolved` (id,
+answer, answerer, answered_at, cancelled, target) — both
+operator-targeted and peer (agent-to-agent) threads fire
+these. The dashboard's questions pane surfaces both, with
+filter chips (all / @operator / @peer / per-participant) and
+an `0V3RR1D3` button on peer rows so the operator can
+answer when an agent is stuck. The ttl watchdog fires
+`question_resolved` with `answerer = "ttl-watchdog"` on
+expiry.
 - `transient_set` (name, transient_kind, since_unix) /
 `transient_cleared` (name) — lifecycle action spinners. The
 client ticks the elapsed-seconds badge off `since_unix`
 client-side, no polling.
+- `container_state_changed` (container: ContainerView) /
+`container_removed` (name) — per-row container mutations,
+emitted by `Coordinator::rescan_containers_and_emit` from
+every mutation site (`actions::approve` post-spawn,
+`actions::destroy`, the lifecycle_action wrapper,
+`auto_update::rebuild_agent`) and from the 10s
+`crash_watch` poll. Client upserts/removes by name; the
+pending overlay is read from `transientsState` since the
+payload doesn't carry it.

-`/api/state` still serves `approvals` / `approval_history` /
-`questions` / `question_history` / `transients` for cold-start
-on first page load and as a safety-net resync from the 5s poll;
-the client maintains the same arrays in derived stores and
-applies the events on top.
+`/api/state` is **only fetched on cold-load and on the few
+forms that mutate non-event-derived state** (PURG3 +
+meta-update, since tombstones + meta_inputs aren't event-
+shaped yet). Every other section — approvals, questions,
+transients, containers, operator inbox, message flow —
+derives from `/dashboard/stream` after the initial snapshot,
+maintaining its own client-side store and applying events on
+top. The 5s periodic poll is gone.
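
The "snapshot once, then apply events" model for the containers section can be sketched as a small store keyed by name. The dashboard client itself runs in the browser, so this Rust version only models the logic; `ContainerStore` and its methods are hypothetical, and the `running` field on `ContainerView` is a placeholder because the real struct's fields are not listed here.

```rust
use std::collections::BTreeMap;

/// Placeholder view type; only `name` is known to exist from the event payloads.
#[derive(Debug, Clone, PartialEq)]
struct ContainerView {
    name: String,
    running: bool, // assumed field, for illustration only
}

/// The two per-row container events named in the vocabulary.
enum DashboardEvent {
    ContainerStateChanged { container: ContainerView },
    ContainerRemoved { name: String },
}

/// Client-side derived store: seeded from the cold-load snapshot,
/// then mutated only by stream events.
#[derive(Debug, Default)]
struct ContainerStore {
    by_name: BTreeMap<String, ContainerView>,
}

impl ContainerStore {
    /// Seed from the initial snapshot (the cold-load fetch).
    fn from_snapshot(snapshot: Vec<ContainerView>) -> Self {
        let by_name = snapshot.into_iter().map(|c| (c.name.clone(), c)).collect();
        ContainerStore { by_name }
    }

    /// Upsert or remove by name; no refetch is involved.
    fn apply(&mut self, event: DashboardEvent) {
        match event {
            DashboardEvent::ContainerStateChanged { container } => {
                self.by_name.insert(container.name.clone(), container);
            }
            DashboardEvent::ContainerRemoved { name } => {
                self.by_name.remove(&name);
            }
        }
    }

    fn names(&self) -> Vec<&str> {
        self.by_name.keys().map(String::as_str).collect()
    }

    fn get(&self, name: &str) -> Option<&ContainerView> {
        self.by_name.get(name)
    }
}

fn main() {
    let mut store = ContainerStore::from_snapshot(vec![ContainerView {
        name: "alpha".to_string(),
        running: true,
    }]);
    store.apply(DashboardEvent::ContainerStateChanged {
        container: ContainerView { name: "beta".to_string(), running: false },
    });
    store.apply(DashboardEvent::ContainerRemoved { name: "alpha".to_string() });
    println!("containers: {:?}", store.names());
    println!("beta: {:?}", store.get("beta"));
}
```

Since an upsert replaces the whole row, applying the same `container_state_changed` frame twice leaves the store unchanged.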
 Generalised form helpers: `form[data-confirm="…"]` pops
 `confirm()` before submit; `form[data-prompt="…"]` pops
@@ -250,16 +282,34 @@ Layout, top to bottom:
 - Banner (gradient shimmer while state=thinking).
 - Title with `↑ DASHB04RD` back-link (new tab) + `↻ R3BU1LD`.
-- Status section (online / needs login / login-in-progress).
-- **State row**: state badge + model chip + last-turn timing +
-cancel-turn button + new-session button.
+- Status section: empty when online (alive-badge in the state
+row carries the signal), populated with the login form /
+OAuth URL when `status` is `needs_login_*`.
+- **State row**: alive badge + state badge + model chip + ctx
+badge + last-turn timing + cancel-turn button + new-session
+button. Every chip carries a `title=...` tooltip with the
+detailed breakdown.
+- Alive badge: `● alive` (green) / `◌ needs login` (amber) /
+`◌ logging in` / `○ offline` / `… connecting`. Driven by
+`LiveEvent::StatusChanged`; replaces the old "harness alive
+— turn loop running" paragraph so the state row carries
+every reachability signal.
 - State badge: `💤 idle` / `🧠 thinking` / `📦 compacting` /
 `○ offline` / `… booting`, with an age suffix (`12s`,
-`2m 14s`). Driven from `/api/state.turn_state` +
-`turn_state_since`; SSE turn_start/turn_end still flip it
-instantly between polls. Authoritative source is the
-harness's `Bus::state_snapshot()`.
-- Model chip: `model · <name>` (e.g. `model · haiku`).
+`2m 14s`). Driven by `LiveEvent::TurnStateChanged`
+(`{state, since_unix}`) — the bus emits on every
+`Bus::set_state` so the badge updates without a /api/state
+refetch. Cold-load via `/api/state.turn_state` +
+`turn_state_since`.
+- Model chip: `model · <name>` (e.g. `model · haiku`). Driven
+by `LiveEvent::ModelChanged`; emitted from `Bus::set_model`.
+- Ctx badge: `ctx · 142k` — total prompt tokens in the
+current context window (input + cache_read + cache_write),
+mirroring claude code's bottom-right indicator. Hover for
+the breakdown including output. Driven by
+`LiveEvent::TokenUsageChanged`; emitted from
+`Bus::record_usage` whenever the terminal `result` event
+delivers a fresh usage block.
 - Last-turn chip: `last turn 12.3s` appears after the first
 turn ends, computed from the state-since deltas.
 - `■ cancel turn` button: visible only while state=thinking,
@@ -269,6 +319,11 @@ Layout, top to bottom:
 arm a one-shot Bus flag — the next turn drops
 `--continue`, starting a fresh claude session. Subsequent
 turns resume normal `--continue`.
+Polling: `/api/state` is fetched **once** on cold load, and
+again while `status === 'needs_login_in_progress'` (login
+session output isn't event-shaped yet). Every other badge
+updates from SSE; no periodic refresh timer runs.
 - Inbox `<details>` block (collapsed): `inbox · N` — last 30
 messages addressed to this agent, fetched via
 `AgentRequest::Recent { limit: 30 }`. (Separate from
@@ -345,14 +400,38 @@ Unknown `/foo` shows an error row instead of being silently sent.
 ### Per-agent endpoints

+All POSTs return 200 (no 303 redirects). The matching mutations
+fire `LiveEvent` variants on the per-agent bus, so the client
+doesn't refetch `/api/state` on submit — the SSE stream
+delivers the new state faster anyway. Only the login flow still
+polls (session output streams in updates that aren't event-
+shaped).
+
 - `POST /send` — operator-injected message into this agent's inbox.
 - `POST /login/{start,code,cancel}` — claude OAuth login flow.
-- `POST /api/cancel` — SIGINT the in-flight claude turn.
+Start/cancel emit `LiveEvent::StatusChanged` to flip the
+badge to/from `needs_login_in_progress`.
+- `POST /api/cancel` — SIGINT the in-flight claude turn. Emits a
+`LiveEvent::Note`.
 - `POST /api/compact` — run `/compact` on the persistent session
 (same MCP config + system prompt + allowed tools as a normal
-turn — only the stdin payload differs).
+turn — only the stdin payload differs). Flips state to
+`Compacting` via `Bus::set_state`, which emits
+`TurnStateChanged`.
 - `POST /api/model` (`model=<name>`) — switch the model for
-future turns.
+future turns. `Bus::set_model` emits `ModelChanged`.
 - `POST /api/new-session` — arm a one-shot for the next turn to
-drop `--continue`.
+drop `--continue`. Emits a `LiveEvent::Note`.
 - `GET /events/history` — replay buffer for the terminal.
+
+Bus events (new vocabulary on `/events/stream`):
+
+- `status_changed { status }` — `online` /
+`needs_login_idle` / `needs_login_in_progress`. Drives the
+alive-badge.
+- `model_changed { model }` — drives the model chip.
+- `token_usage_changed { usage: TokenUsage }` — drives the
+ctx-badge. Emitted from `Bus::record_usage` whenever the
+stream-json `result` event delivers a fresh usage block.
+- `turn_state_changed { state, since_unix }` — drives the
+state badge (`idle`/`thinking`/`compacting`).
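
How the four bus events map onto the state-row badges can be sketched as a small reducer. This models the client logic in Rust purely for illustration. The `TokenUsage` field names, the `Badges` struct, the fallback labels for unknown values, the pairing of status strings with alive-badge labels, and the truncating `k` formatting are all assumptions. Only the badge text and the `input + cache_read + cache_write` sum come from the docs.

```rust
/// Assumed field names; the docs only list the four token categories.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct TokenUsage {
    input: u64,
    output: u64,
    cache_read: u64,
    cache_write: u64,
}

/// The four event-driven variants from the vocabulary above.
enum LiveEvent {
    StatusChanged { status: String },
    ModelChanged { model: String },
    TokenUsageChanged { usage: TokenUsage },
    TurnStateChanged { state: String, since_unix: u64 },
}

/// Status string to alive-badge label (pairing assumed from the badge list).
fn alive_label(status: &str) -> &'static str {
    match status {
        "online" => "● alive",
        "needs_login_idle" => "◌ needs login",
        "needs_login_in_progress" => "◌ logging in",
        _ => "… connecting", // assumed fallback for unknown values
    }
}

/// Turn state to state-badge label; unknown states are echoed as-is (assumption).
fn state_label(state: &str) -> String {
    match state {
        "idle" => "💤 idle".to_string(),
        "thinking" => "🧠 thinking".to_string(),
        "compacting" => "📦 compacting".to_string(),
        other => other.to_string(),
    }
}

/// Prompt tokens in the context window: input + cache_read + cache_write.
/// Output tokens are excluded; whole thousands are shown with a `k` suffix.
fn ctx_label(usage: TokenUsage) -> String {
    let prompt = usage.input + usage.cache_read + usage.cache_write;
    if prompt >= 1000 {
        format!("ctx · {}k", prompt / 1000)
    } else {
        format!("ctx · {prompt}")
    }
}

/// Rendered badge text, updated only by events.
#[derive(Debug, Default)]
struct Badges {
    alive: String,
    model: String,
    ctx: String,
    state: String,
}

impl Badges {
    fn apply(&mut self, event: LiveEvent) {
        match event {
            LiveEvent::StatusChanged { status } => self.alive = alive_label(&status).to_string(),
            LiveEvent::ModelChanged { model } => self.model = format!("model · {model}"),
            LiveEvent::TokenUsageChanged { usage } => self.ctx = ctx_label(usage),
            LiveEvent::TurnStateChanged { state, .. } => self.state = state_label(&state),
        }
    }
}

fn main() {
    let mut badges = Badges::default();
    badges.apply(LiveEvent::StatusChanged { status: "online".to_string() });
    badges.apply(LiveEvent::ModelChanged { model: "haiku".to_string() });
    badges.apply(LiveEvent::TurnStateChanged { state: "thinking".to_string(), since_unix: 0 });
    badges.apply(LiveEvent::TokenUsageChanged {
        usage: TokenUsage { input: 2_000, output: 500, cache_read: 130_000, cache_write: 10_300 },
    });
    println!("{badges:?}");
}
```

Each event touches exactly one badge, which is why a submit handler has no need to refetch the whole state document.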