hyperhive/TODO.md

9.2 KiB

Hyperhive TODOs

Architecture / Features

  • Shared space for all agents to access documents/files without manager routing
  • Private git forge agents can push to and create new repos in
  • Move bind mounts in agents to /agents/<name>/state so path for agent = path for manager
  • Split harness-internal state from agent-visible state: the /agents/<n>/state/ mount (host /var/lib/hyperhive/agents/<n>/state/) currently mixes the agent's durable notes with harness internals — hyperhive-events.sqlite, hyperhive-turn-stats.sqlite, hyperhive-model, future per-agent skill caches, etc. The agent can accidentally overwrite a harness file, the harness clutters what claude thinks is "my notes dir", and the host-side vacuum has to special-case filenames it owns. Move harness internals to a sibling dir, e.g. /var/lib/hyperhive/agents/<n>/harness/, bind-mounted RW into the container as /agents/<n>/harness/ (same path inside + out, same convention as state). Container's /agents/<n>/state/ becomes purely agent-owned. Touches: paths.rs (new harness_dir()), events.rs, turn_stats.rs (default paths flip), events_vacuum.rs (sweep root flips), lifecycle.rs (extra bind mount), and a migration that moves existing files on first boot under the new layout. Side benefit: makes the privsep TODO cheaper — the unprivileged web server only needs read access to /agents/<n>/state/ (operator-meaningful files), not /agents/<n>/harness/. The legacy bare /state mount the manager still uses (container_state_prefix("manager") == "/state/", manager bind in lifecycle::set_nspawn_flags) gets removed in the same pass — manager goes to /agents/manager/state/ + /agents/manager/harness/ like every other agent.
  • Broadcast messaging: allow sending messages with recipient "*" to all agents; deliver with hint "this was a broadcast and may not need any action from you"
  • Multi-agent restart coordination: when rebuilding all agents, manager should start first so it can coordinate post-restart confusion (notify agents, suppress unnecessary retries, etc)
  • Shared docs/skills repo (RO): a single repo on the hive forge that every agent has read-only access to — common references, prompts, runbooks, "skills" the operator wants every agent to inherit without baking into the system prompt or /shared. Implementation likely: seed an org-shared/docs repo on first hive-forge boot, grant every per-agent user a read membership in the org. Agents git clone it (or use the API) to read; only the manager + operator can push.

Reminder Tool

  • Per-agent reminder limits (burst capacity, rate limiting)
  • Scheduler shutdown: add graceful shutdown signal when coordinator is destroyed (currently runs forever)
  • DB lock contention: under high reminder volume, the broker's Mutex<Connection> serializes every delivery transaction. Consider batching multiple deliveries into one tx, or moving reminders onto a separate sqlite connection.

Dashboard

  • Per-agent terminal coherence pass: row taxonomy + colour scheme has drifted — glyph reuse, ad-hoc <details> rules, undifferentiated stderr, .sys catch-all silently hiding unrecognised events. Full audit + proposed unified scheme in docs/terminal-rendering.md.
  • Delivered-reminder rollups: per-agent delivered-count chip (last 24h) + histogram of attempts-vs-successes on the container row. Needs Broker::count_delivered_reminders_since(agent, ts) (cheap COUNT against the reminders table, WHERE agent = ?1 AND sent_at >= ?2).

Security

  • Privsep the dashboard from the privileged daemon: hive-c0re runs as root (it has to — nixos-container create / start / destroy, the meta git repo, every per-agent bind mount). The HTTP server lives in the same process, so every read-endpoint (/api/state-file, /api/journal/{name}, /api/agent-config/{name}) is one allow-list bug away from serving arbitrary host files. Split the architecture: keep the privileged daemon doing lifecycle + git + ipc, run the web UI as an unprivileged user that talks to the daemon over a unix socket with a narrow request surface (ReadAgentStateFile { agent, rel_path } etc.). The unprivileged process can't read /etc/shadow even if every check in get_state_file is bypassed — it doesn't have the bits. Container-lifecycle POSTs (/restart, /destroy, etc.) become forwarded RPCs the privileged side authorises on its terms.
  • Defense in depth on get_state_file: until privsep lands, the allow-list is load-bearing. Worth adding: refuse files whose mode is not world-readable (so an agent writing a 0600 file inside state/ can't have its contents proxied through the endpoint to a different operator), and refuse symlinks at any path component (O_NOFOLLOW-style — canonicalize resolves them, but we currently don't reject if the original path had symlinks).

Harness Ergonomics (agent-side wishlist)

Filed by damocles, who actually lives in this thing. Loosely ranked by how often the friction bites in normal use.

  • Inbox batching hint in the wake prompt — when the harness pops a message and there are N more waiting, the wake prompt should say so (e.g. "(+3 more queued; consider draining before acting)") so claude knows to call recv() again in the same turn instead of doing the expensive Read/Edit dance once per message over N turns. The data's already in the broker (Broker::pending_count(agent)); just thread it into the prompt builder in hive-ag3nt::turn.rs. Even better: add a one-shot recv_batch(max: u32) MCP tool that returns up to max pending messages in a single round-trip.
  • Self-management of own asks + reminders — once I fire ask or remind I have no way to inspect or cancel them from the agent side. Operator can cancel asks via dashboard; nothing for reminders at all (TODO above). Want list_my_asks() -> [{id, target, question, asked_at}] and cancel_ask(id) on the agent surface, plus list_my_reminders() / cancel_reminder(id). Bounded by asker == self and reminder.owner == self so no cross-agent meddling.
  • Optional in_reply_to: <msg_id> on send — pure wire addition; no behavioural change. The dashboard could render conversation threads (already wants this for the agent-to-agent question UI in the Dashboard section). Today every reply is a fresh root in the message flow which obscures cause-and-effect when two agents are mid-debate. Field is optional, ignored if the referenced id is unknown / cross- agent / out of retention.

Telemetry

  • Per-turn stats: host-side vacuum sweep: the sink writes to /state/hyperhive-turn-stats.sqlite on each agent's state dir; needs a periodic retention sweep mirroring events_vacuum.rs so the table doesn't grow forever. Default keep-window: 90 days (turn-stats are denser than events but smaller per-row, ~200B each).
  • Surface per-turn stats on the agent web UI: "N turns today" chip + rolling tool-call histogram tooltip on the model chip. (open_threads and open_reminders chips already landed via other paths — open-threads section on the page + reminder count chip on the container row.) Reads the per-agent turn_stats.sqlite.
  • Stats UI on the main dashboard: per-agent rollups (avg turn duration, tokens-since-boot, top 5 tools) on the container row. Same data source, host-side aggregation query.

Bugs

  • Token-budget exhaustion crashes the harness: when claude's account hits its rate/token cap, the in-flight claude --print invocation returns an error the harness doesn't recognise as recoverable, the serve loop exits, and the container stays up with a dead daemon. Operator only notices when an unrelated wake fails to drive a turn. Want: detect the budget-exceeded class of failure (likely a specific stderr line or stream-json rate_limit_event shape), fire a LiveEvent::StatusChanged("rate_limited") or new status, surface as a red badge + banner on the dashboard + per-agent UI, and have the serve loop park (sleep N minutes, retry) instead of returning Err. Operator can also see "this agent is rate-limited until ~HH:MM" if claude tells us when. Inspect crate::turn::run_claude's bail! paths + claude's stderr conventions for the budget error string.
  • Post-rebuild system-message missed wake: at 09:13:14 the dashboard showed system → damocles container rebuilt as ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequent recv() from inside the agent returned (empty), confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_server serve_agent_stdio task is up and answering MCP/socket calls, but the hive-ag3nt::serve long-poll loop that drives drive_turn either died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survive nixos-container update cleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to the recv_blocking fix in e423d57 if the rebuild restarts the broker mid-subscribe.