From d890509be356e4dfa44960b6af4397eda471bfaf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?m=C3=BCde?= Date: Sun, 17 May 2026 23:28:34 +0200 Subject: [PATCH] docs: turn_stats sink + event-driven agent badges + dashboard event vocabulary --- CLAUDE.md | 43 +++++++++++++++- docs/persistence.md | 35 ++++++++++--- docs/web-ui.md | 123 ++++++++++++++++++++++++++++++++++++-------- 3 files changed, 171 insertions(+), 30 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index ab82688..200996b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -45,9 +45,19 @@ hive-c0re/ host daemon + CLI (one binary, subcommand-dispatched) src/crash_watch.rs poll every 10s; fire HelperEvent::ContainerCrash when a previously-running container disappears without an operator-initiated transient + src/container_view.rs ContainerView struct + build_all helper; + shared between dashboard.rs (cold-load via + /api/state) and coordinator.rs's + rescan_containers_and_emit src/coordinator.rs shared state (broker/approvals/operator_questions/ transient/sockets) + tombstone enumeration + - kick_agent + notify_agent (helper-event push) + kick_agent + notify_agent (helper-event push) + + last_containers cache + rescan_and_emit diff helper + src/open_threads.rs loose-ends aggregator (pending approvals + + unanswered questions) — for_agent (filtered) and + hive_wide (manager surface). Backs + AgentRequest::GetOpenThreads + ManagerRequest:: + GetOpenThreads (the get_open_threads MCP tool). src/actions.rs approve/deny/destroy (transient-aware) src/auto_update.rs startup rebuild scan + ensure_manager + meta::lock_update_hyperhive bump @@ -85,6 +95,9 @@ hive-ag3nt/ in-container harness crate; produces TWO binaries src/client.rs generic JSON-line request/response over unix socket src/web_ui.rs per-container axum HTTP page (incl /api/cancel, /api/compact, /api/model, /events/history) + src/turn_stats.rs per-turn analytics sink (one sqlite row per + turn at /state/hyperhive-turn-stats.sqlite); + schema + best-effort writer src/events.rs LiveEvent + broadcast Bus + sqlite-backed history (/state/hyperhive-events.sqlite) + TurnState + model selection (persisted at /state/hyperhive-model) @@ -193,6 +206,34 @@ Prune freely. domain tooling — the agent flake's `inputs` block pulls the external flake, `agent.nix` references it via `flakeInputs..packages.${pkgs.system}.default`. +- **Just landed:** per-turn analytics sink. New + `hive-ag3nt::turn_stats` writes one row per claude turn to + `/state/hyperhive-turn-stats.sqlite`: identity (model, + wake_from, result_kind), timing (started/ended_at, + duration_ms), cost (full token-usage breakdown), behaviour + (tool_call_count + per-tool JSON map), and post-turn snapshot + metrics (open_threads_count, open_reminders_count fetched via + the existing GetOpenThreads + new CountPendingReminders RPC). + Both ag3nt + m1nd bin loops capture, both Bus accumulates + tool_use blocks via observe_stream during the stdout pump. + Writes are best-effort. No host-side vacuum yet — TODO under + Telemetry; same shape as events_vacuum, target 90d retention. +- **Just landed:** agent web UI event-driven badges. New + `LiveEvent::StatusChanged / ModelChanged / TokenUsageChanged + / TurnStateChanged` variants replace the per-agent page's + /api/state polling for the state row. Status/model/token/state + badges all update from SSE; /api/state only fetched on cold + load + during the login flow (session output isn't event- + shaped). Per-agent endpoints (`/api/cancel|compact|model| + new-session`, `/login/*`) all flip 303→200. New `alive-badge` + chip carries the harness reachability signal (replaces the + "● harness alive" paragraph); new `ctx-badge` mirrors Claude + Code's bottom-right "N tokens" indicator. Every chip carries + a `title=...` tooltip for hover detail. +- **Just landed:** events_vacuum simplified to age-only — + `KEEP_SECS = 7d`, no row cap. Chatty turn no longer evicts + a quiet day's history sooner than expected. Hourly sweep + unchanged. - **Just landed:** Phase 6 container events. New `DashboardEvent::ContainerStateChanged { container }` + `ContainerRemoved { name }` close the last refetch loop on the diff --git a/docs/persistence.md b/docs/persistence.md index 55b951d..eea0a98 100644 --- a/docs/persistence.md +++ b/docs/persistence.md @@ -41,17 +41,36 @@ One table: harness emits during turn loop execution. The harness writes; the host vacuums. `hive-c0re::events_vacuum` -runs hourly and sweeps every existing agent state dir, applying the -same two-stage delete to each file: drop rows older than 7 days, -then trim to the 2000 most-recent. Centralising retention on the -host means a misbehaving harness can't disable its own vacuum and -agents don't need any cleanup wiring of their own. +runs hourly and sweeps every existing agent state dir, deleting +rows older than 7 days. Age-only — no row cap — so a chatty turn +doesn't lose history sooner than a quiet one; disk pressure on a +sustained burst is the cheaper problem to have. Centralising +retention on the host means a misbehaving harness can't disable +its own vacuum and agents don't need any cleanup wiring of their +own. Path overridable via `HYPERHIVE_EVENTS_DB` (for dev / no-`/state` setups). On open failure the `Bus` falls back to no-store mode rather than crashing the harness — events still broadcast over SSE, just nothing persisted. +### `/state/hyperhive-turn-stats.sqlite` (per agent) + +Per-turn analytics sink. One row per claude turn captures +identity (`model`, `wake_from`, `result_kind`), timing +(`started_at`, `ended_at`, `duration_ms`), cost (input / output / +cache_read / cache_creation token counts), behaviour +(`tool_call_count` + `tool_call_breakdown_json`), and post-turn +snapshot metrics (`open_threads_count`, +`open_reminders_count` — fetched via the same socket the harness +already uses for `GetOpenThreads` + `CountPendingReminders`). +Bin-loop helpers `build_row` + `record` land each row at +`turn_end`; writes are best-effort, a sqlite hiccup logs + lets +the turn loop continue. + +No host-side vacuum yet — tracked in `TODO.md` under Telemetry +(target retention ~90 days, age-only sweep like events_vacuum). + ### `/state/hyperhive-model` (per agent) Single-line text file holding the claude model name currently @@ -68,8 +87,10 @@ Under `/var/lib/hyperhive/agents//`: - `config/` — the proposed nix repo (manager-editable). - `claude/` — claude OAuth credentials, bind-mounted RW to `/root/.claude` inside the container. -- `state/` — durable notes + the events.sqlite db, bind-mounted - to `/state` inside the container. +- `state/` — durable notes, the events.sqlite db, and the + turn-stats sqlite db. Bind-mounted to `/agents//state` + inside the container (the manager still uses the legacy + `/state` mount point — same host path either way). Under `/var/lib/hyperhive/applied//` — the hive-c0re-only applied repo. Tracks `flake.nix` (module-only boilerplate; never diff --git a/docs/web-ui.md b/docs/web-ui.md index 79c9ca2..1304e66 100644 --- a/docs/web-ui.md +++ b/docs/web-ui.md @@ -201,6 +201,22 @@ not ours. a managed container. - `GET /api/agent-config/{name}` — read-only view of the applied `agent.nix`. +- `GET /api/state-file?path=` — bounded + text read of a file under the per-agent `state/` subtree or + the shared `/var/lib/hyperhive/shared/`. Accepts the + container-view forms (`/agents//state/...`, `/shared/...`) + and the host form. Canonicalises + verifies the path stays + inside the allow-list, refuses anything but a regular file, + refuses `/agents//claude` / `config` subtrees, truncates + bodies at 1 MiB. Backs the dashboard's inline path-link + preview (PATH_RE detects pointer strings in message bodies, + question/answer text, and the operator inbox; clicking + expands a `
` that lazy-fetches via this endpoint). + Trailing-slash matches (i.e. directory paths) are skipped on + the client side — only files linkify. +- `GET /api/reminders` — list pending reminders for the + dashboard's queued-reminders panel. +- `POST /cancel-reminder/{id}` — hard-delete a pending reminder. - `GET /dashboard/stream` — unified live event channel: broker `sent` / `delivered`, plus the mutation events listed below. Each frame carries `seq`. @@ -223,21 +239,37 @@ payload): queue + history mutations. Client mutates a derived store and re-renders only the approvals section. - `question_added` (id, asker, question, options, multi, - asked_at, deadline_at) / `question_resolved` (id, answer, - answerer, answered_at, cancelled) — operator-targeted - questions only (peer-to-peer questions never fire these). The - ttl watchdog fires `question_resolved` with - `answerer = "ttl-watchdog"` on expiry. + asked_at, deadline_at, target) / `question_resolved` (id, + answer, answerer, answered_at, cancelled, target) — both + operator-targeted and peer (agent-to-agent) threads fire + these. The dashboard's questions pane surfaces both, with + filter chips (all / @operator / @peer / per-participant) and + an `0V3RR1D3` button on peer rows so the operator can + answer when an agent is stuck. The ttl watchdog fires + `question_resolved` with `answerer = "ttl-watchdog"` on + expiry. - `transient_set` (name, transient_kind, since_unix) / `transient_cleared` (name) — lifecycle action spinners. The client ticks the elapsed-seconds badge off `since_unix` client-side, no polling. +- `container_state_changed` (container: ContainerView) / + `container_removed` (name) — per-row container mutations, + emitted by `Coordinator::rescan_containers_and_emit` from + every mutation site (`actions::approve` post-spawn, + `actions::destroy`, the lifecycle_action wrapper, + `auto_update::rebuild_agent`) and from the 10s + `crash_watch` poll. Client upserts/removes by name; the + pending overlay is read from `transientsState` since the + payload doesn't carry it. -`/api/state` still serves `approvals` / `approval_history` / -`questions` / `question_history` / `transients` for cold-start -on first page load and as a safety-net resync from the 5s poll; -the client maintains the same arrays in derived stores and -applies the events on top. +`/api/state` is **only fetched on cold-load and on the few +forms that mutate non-event-derived state** (PURG3 + +meta-update, since tombstones + meta_inputs aren't event- +shaped yet). Every other section — approvals, questions, +transients, containers, operator inbox, message flow — +derives from `/dashboard/stream` after the initial snapshot, +maintaining its own client-side store and applying events on +top. The 5s periodic poll is gone. Generalised form helpers: `form[data-confirm="…"]` pops `confirm()` before submit; `form[data-prompt="…"]` pops @@ -250,16 +282,34 @@ Layout, top to bottom: - Banner (gradient shimmer while state=thinking). - Title with `↑ DASHB04RD` back-link (new tab) + `↻ R3BU1LD`. -- Status section (online / needs login / login-in-progress). -- **State row**: state badge + model chip + last-turn timing + - cancel-turn button + new-session button. +- Status section: empty when online (alive-badge in the state + row carries the signal), populated with the login form / + OAuth URL when `status` is `needs_login_*`. +- **State row**: alive badge + state badge + model chip + ctx + badge + last-turn timing + cancel-turn button + new-session + button. Every chip carries a `title=...` tooltip with the + detailed breakdown. + - Alive badge: `● alive` (green) / `◌ needs login` (amber) / + `◌ logging in` / `○ offline` / `… connecting`. Driven by + `LiveEvent::StatusChanged`; replaces the old "harness alive + — turn loop running" paragraph so the state row carries + every reachability signal. - State badge: `💤 idle` / `🧠 thinking` / `📦 compacting` / `○ offline` / `… booting`, with an age suffix (`12s`, - `2m 14s`). Driven from `/api/state.turn_state` + - `turn_state_since`; SSE turn_start/turn_end still flip it - instantly between polls. Authoritative source is the - harness's `Bus::state_snapshot()`. - - Model chip: `model · ` (e.g. `model · haiku`). + `2m 14s`). Driven by `LiveEvent::TurnStateChanged` + (`{state, since_unix}`) — the bus emits on every + `Bus::set_state` so the badge updates without a /api/state + refetch. Cold-load via `/api/state.turn_state` + + `turn_state_since`. + - Model chip: `model · ` (e.g. `model · haiku`). Driven + by `LiveEvent::ModelChanged`; emitted from `Bus::set_model`. + - Ctx badge: `ctx · 142k` — total prompt tokens in the + current context window (input + cache_read + cache_write), + mirroring claude code's bottom-right indicator. Hover for + the breakdown including output. Driven by + `LiveEvent::TokenUsageChanged`; emitted from + `Bus::record_usage` whenever the terminal `result` event + delivers a fresh usage block. - Last-turn chip: `last turn 12.3s` appears after the first turn ends, computed from the state-since deltas. - `■ cancel turn` button: visible only while state=thinking, @@ -269,6 +319,11 @@ Layout, top to bottom: arm a one-shot Bus flag — the next turn drops `--continue`, starting a fresh claude session. Subsequent turns resume normal `--continue`. + + Polling: `/api/state` is fetched **once** on cold load, and + again while `status === 'needs_login_in_progress'` (login + session output isn't event-shaped yet). Every other badge + updates from SSE; no periodic refresh timer runs. - Inbox `
` block (collapsed): `inbox · N` — last 30 messages addressed to this agent, fetched via `AgentRequest::Recent { limit: 30 }`. (Separate from @@ -345,14 +400,38 @@ Unknown `/foo` shows an error row instead of being silently sent. ### Per-agent endpoints +All POSTs return 200 (no 303 redirects). The matching mutations +fire `LiveEvent` variants on the per-agent bus, so the client +doesn't refetch `/api/state` on submit — the SSE stream +delivers the new state faster anyway. Only the login flow still +polls (session output streams in updates that aren't event- +shaped). + - `POST /send` — operator-injected message into this agent's inbox. - `POST /login/{start,code,cancel}` — claude OAuth login flow. -- `POST /api/cancel` — SIGINT the in-flight claude turn. + Start/cancel emit `LiveEvent::StatusChanged` to flip the + badge to/from `needs_login_in_progress`. +- `POST /api/cancel` — SIGINT the in-flight claude turn. Emits a + `LiveEvent::Note`. - `POST /api/compact` — run `/compact` on the persistent session (same MCP config + system prompt + allowed tools as a normal - turn — only the stdin payload differs). + turn — only the stdin payload differs). Flips state to + `Compacting` via `Bus::set_state`, which emits + `TurnStateChanged`. - `POST /api/model` (`model=`) — switch the model for - future turns. + future turns. `Bus::set_model` emits `ModelChanged`. - `POST /api/new-session` — arm a one-shot for the next turn to - drop `--continue`. + drop `--continue`. Emits a `LiveEvent::Note`. - `GET /events/history` — replay buffer for the terminal. + +Bus events (new vocabulary on `/events/stream`): + +- `status_changed { status }` — `online` / + `needs_login_idle` / `needs_login_in_progress`. Drives the + alive-badge. +- `model_changed { model }` — drives the model chip. +- `token_usage_changed { usage: TokenUsage }` — drives the + ctx-badge. Emitted from `Bus::record_usage` whenever the + stream-json `result` event delivers a fresh usage block. +- `turn_state_changed { state, since_unix }` — drives the + state badge (`idle`/`thinking`/`compacting`).