agent ui: consolidate status into state-row badges

drop the "● harness alive — turn loop running" paragraph; the
new #alive-badge chip in the state row carries the same signal
across all statuses (loading / online / needs-login / offline)
with colour coding. token-usage chip renamed + restyled as
#ctx-badge — primary number is total context-window tokens
used, mirroring claude code's "N tokens" indicator.

every state-row badge now has hover detail: state-badge gets
per-state tooltips + age suffix, model-chip explains the
/model command, last-turn shows the raw ms duration, ctx-badge
breaks out input / cache_read / cache_write / output.

new todo entry for the per-turn stats sink (start/end/model/
tokens/tool-call-count) the harness should be writing.
This commit is contained in:
müde 2026-05-17 22:36:02 +02:00
parent 85c0df2e64
commit b444dac6e8
4 changed files with 79 additions and 12 deletions

View file

@ -76,6 +76,10 @@ how often the friction bites in normal use.
Field is optional, ignored if the referenced id is unknown / cross-
agent / out of retention.
## Telemetry
- **Per-turn stats log**: persist one row per claude turn in a new sqlite table on the per-agent state dir (or the host broker DB, indexed by agent). Columns: `started_at`, `ended_at`, `duration_ms`, `model`, `input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`, `tool_call_count`, `tool_call_breakdown` (JSON: `{Read: 12, Bash: 3, ...}`), `bytes_streamed`, `wake_reason` (recv'd message / reminder / operator-kick / manual), `result_kind` (ok / cancelled / failed-mid-turn / compacted), `note` (e.g. failure reason). Powers: per-agent dashboards (avg turn time over time, tool-usage histogram, cost projections from token counts × model rate), debugging stuck loops (look for repeated identical wake_reason + zero tool calls), and operator-visible "this is what your spend looked like this week" rollups. Source data is already mostly in the harness's `TurnState` + the per-event bus; just needs a sink. Keep a retention sweep (host-side) so the table doesn't grow forever.
## Bugs
- **Post-rebuild system-message missed wake**: at 09:13:14 the dashboard showed `system → damocles container rebuilt` as ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequent `recv()` from inside the agent returned `(empty)`, confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_server `serve_agent_stdio` task is up and answering MCP/socket calls, but the `hive-ag3nt::serve` long-poll loop that drives `drive_turn` either died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survive `nixos-container update` cleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to the `recv_blocking` fix in `e423d57` if the rebuild restarts the broker mid-subscribe.