hyperhive/TODO.md

# Hyperhive TODOs

> **Rough split for who picks up what:** harness ergonomics + host-side
> harness plumbing tend to be damocles' interest area; UI-polish work
> for the operator is not. Use that as a hint when picking up items,
> not a hard rule.

## Architecture / Features

- Shared space for all agents to access documents/files without manager routing
- Private git forge agents can push to and create new repos in
- Move bind mounts in agents to `/agents/<name>/state` so path for agent = path for manager
- **Split harness-internal state from agent-visible state**: the `/agents/<n>/state/` mount (host `/var/lib/hyperhive/agents/<n>/state/`) currently mixes the agent's durable notes with harness internals — `hyperhive-events.sqlite`, `hyperhive-turn-stats.sqlite`, `hyperhive-model`, future per-agent skill caches, etc. The agent can accidentally overwrite a harness file, the harness clutters what claude thinks is "my notes dir", and the host-side vacuum has to special-case filenames it owns. Move harness internals to a sibling dir, e.g. `/var/lib/hyperhive/agents/<n>/harness/`, bind-mounted RW into the container as `/agents/<n>/harness/` (same path inside + out, same convention as state). Container's `/agents/<n>/state/` becomes purely agent-owned. Touches: `paths.rs` (new `harness_dir()`), `events.rs`, `turn_stats.rs` (default paths flip), `events_vacuum.rs` (sweep root flips), `lifecycle.rs` (extra bind mount), and a migration that moves existing files on first boot under the new layout. Side benefit: makes the privsep TODO cheaper — the unprivileged web server only needs read access to `/agents/<n>/state/` (operator-meaningful files), not `/agents/<n>/harness/`. The legacy bare `/state` mount the manager still uses (`container_state_prefix("manager") == "/state/"`, manager bind in `lifecycle::set_nspawn_flags`) gets removed in the same pass — manager goes to `/agents/manager/state/` + `/agents/manager/harness/` like every other agent.
- **Broadcast messaging**: allow sending messages with recipient "*" to all agents; deliver with hint "this was a broadcast and may not need any action from you"
- **Multi-agent restart coordination**: when rebuilding all agents, manager should start first so it can coordinate post-restart confusion (notify agents, suppress unnecessary retries, etc)
- **Shared docs/skills repo (RO)**: a single repo on the hive forge that every agent has read-only access to — common references, prompts, runbooks, "skills" the operator wants every agent to inherit without baking into the system prompt or `/shared`. Implementation likely: seed an `org-shared/docs` repo on first hive-forge boot, grant every per-agent user a read membership in the org. Agents `git clone` it (or use the API) to read; only the manager + operator can push.

## Reminder Tool

- Per-agent reminder limits (burst capacity, rate limiting)
- **Scheduler shutdown**: add graceful shutdown signal when coordinator is destroyed (currently runs forever)
- **DB lock contention**: under high reminder volume, the broker's `Mutex<Connection>` serializes every delivery transaction. Consider batching multiple deliveries into one tx, or moving reminders onto a separate sqlite connection.

## Dashboard

- **Unified URL scheme via reverse proxy**: today every agent's web UI is reached at `<host>:<per-agent-port>/`, so operators juggle a port list. Stand up nginx (or similar) terminating one domain that fans requests to `/agent/<name>/...` out to each container's web port, and to `/` for the main dashboard. Touches: a NixOS module on the host, the dashboard's per-agent link rendering, and the per-agent web server's base-path handling (currently assumes root). Lets bookmarks survive port reshuffles and unblocks per-agent stats links being relative URLs instead of hard-coded ports.
- **Delivered-reminder rollup on the per-agent stats page**: surface attempt / success / failure counts for reminders this agent fired (in the existing `/stats` page). Needs an `AgentRequest::ReminderRollup { since_secs }` / matching `ManagerRequest::ReminderRollup` RPC so the agent can pull the counts from the host's broker DB (the reminders table is host-owned; agent state doesn't have them). Deferred from the initial stats page so the first cut stays self-contained to data the agent already owns.

## Security

- **Privsep the dashboard from the privileged daemon**: hive-c0re runs as root (it has to — `nixos-container` create / start / destroy, the meta git repo, every per-agent bind mount). The HTTP server lives in the same process, so every read-endpoint (`/api/state-file`, `/api/journal/{name}`, `/api/agent-config/{name}`) is one allow-list bug away from serving arbitrary host files. Split the architecture: keep the privileged daemon doing lifecycle + git + ipc, run the web UI as an unprivileged user that talks to the daemon over a unix socket with a narrow request surface (`ReadAgentStateFile { agent, rel_path }` etc.). The unprivileged process can't read `/etc/shadow` even if every check in `get_state_file` is bypassed — it doesn't have the bits. Container-lifecycle POSTs (`/restart`, `/destroy`, etc.) become forwarded RPCs the privileged side authorises on its terms.

## Harness Ergonomics (agent-side wishlist)

Filed by damocles, who actually lives in this thing. Loosely ranked by
how often the friction bites in normal use.

- **Optional `in_reply_to: <msg_id>` on send** — pure wire addition; no
  behavioural change. The dashboard could render conversation threads
  (already wants this for the agent-to-agent question UI in the
  Dashboard section). Today every reply is a fresh root in the message
  flow which obscures cause-and-effect when two agents are mid-debate.
  Field is optional, ignored if the referenced id is unknown / cross-
  agent / out of retention.

## Telemetry

- **Per-turn stats: host-side vacuum sweep**: the sink writes to `/state/hyperhive-turn-stats.sqlite` on each agent's state dir; needs a periodic retention sweep mirroring `events_vacuum.rs` so the table doesn't grow forever. Default keep-window: 90 days (turn-stats are denser than events but smaller per-row, ~200B each).

## Harness Behaviour

- **Auto session-reset when context is large and cache is cold**: today every turn uses `--continue`, so a long-lived agent carries its entire transcript forward indefinitely. When the next turn's context is above some threshold (rough starting point: ~50% of the session limit — hive startup alone burned ~15%, so the headroom disappears fast) *and* the prompt cache is no longer warm (last turn ended past the cache TTL), it's cheaper to start fresh than to re-send the whole history uncached. Open question: drop `--continue` vs. trigger `--compact` first — needs measurement of what each actually costs (uncached re-read of N tokens vs. a compact turn's own token spend + the post-compact uncached re-read). Decision should be data-driven, not guessed. Needs: a context-size estimate per turn (turn_stats already tracks token usage), a cache-warmth heuristic (time since last turn vs. cache TTL), and a one-shot fresh-session path in `turn.rs` mirroring the existing `↻ new session` button.

## Bugs

- **Token-budget exhaustion crashes the harness**: when claude's account hits its rate/token cap, the in-flight `claude --print` invocation returns an error the harness doesn't recognise as recoverable, the serve loop exits, and the container stays up with a dead daemon. Operator only notices when an unrelated wake fails to drive a turn. Want: detect the budget-exceeded class of failure (likely a specific stderr line or stream-json `rate_limit_event` shape), fire a `LiveEvent::StatusChanged("rate_limited")` or new status, surface as a red badge + banner on the dashboard + per-agent UI, and have the serve loop park (sleep N minutes, retry) instead of returning Err. Operator can also see "this agent is rate-limited until ~HH:MM" if claude tells us when. Inspect `crate::turn::run_claude`'s `bail!` paths + claude's stderr conventions for the budget error string.
- **Post-rebuild system-message missed wake**: at 09:13:14 the dashboard showed `system → damocles container rebuilt` as ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequent `recv()` from inside the agent returned `(empty)`, confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_server `serve_agent_stdio` task is up and answering MCP/socket calls, but the `hive-ag3nt::serve` long-poll loop that drives `drive_turn` either died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survive `nixos-container update` cleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to the `recv_blocking` fix in `e423d57` if the rebuild restarts the broker mid-subscribe.