Hyperhive TODOs

Architecture / Features

  • Shared space for all agents to access documents/files without manager routing
  • Private git forge that agents can push to and create new repos in
  • Move bind mounts in agents to /agents/<name>/state so path for agent = path for manager
  • Broadcast messaging: allow sending messages with recipient "*" to all agents; deliver with hint "this was a broadcast and may not need any action from you"
  • Multi-agent restart coordination: when rebuilding all agents, the manager should start first so it can coordinate post-restart confusion (notify agents, suppress unnecessary retries, etc.)
  • Shared docs/skills repo (RO): a single repo on the hive forge that every agent has read-only access to — common references, prompts, runbooks, "skills" the operator wants every agent to inherit without baking into the system prompt or /shared. Implementation likely: seed an org-shared/docs repo on first hive-forge boot, grant every per-agent user a read membership in the org. Agents git clone it (or use the API) to read; only the manager + operator can push.
  • Rename ask_operator → ask with optional to param ✓ done: Ask { question, options, multi, ttl_seconds, to: Option<String> } on both AgentRequest + ManagerRequest. to = None (or Some("operator")) = dashboard path; to = Some(<agent>) pushes HelperEvent::QuestionAsked into the target's inbox. New Answer { id, answer } request on both surfaces: the target answers via mcp__hyperhive__answer, and the answer flows back to the asker as HelperEvent::QuestionAnswered { id, question, answer, answerer } (renamed from OperatorAnswered; carries who answered so the asker can distinguish operator vs peer vs ttl-watchdog). Authorisation: only the question's target agent or the operator can answer; self-ask is rejected. DB gets a nullable target column (NULL = operator path, back-compat). Dashboard's pending() / recent_answered() filter on target IS NULL so peer questions never leak into the operator's queue. Shared dispatch lives in hive-c0re/src/questions.rs so both surfaces stay aligned.
  • Loose-ends tracker + get_open_threads tool: hive-c0re already knows about pending approvals + unanswered questions; soon will also know about open PRs on hive-forge. Aggregate these into a per-agent "open threads" view (e.g. [{kind: "approval", id: 7, summary: "spawn alice"}, {kind: "question", id: 12, asker: "alice", summary: "deploy now?"}]). New MCP tool mcp__hyperhive__get_open_threads returns the list so an agent can see what's still pending against it without rebuilding context from inbox history. Manager's version includes hive-wide threads. Also surface this list on the per-agent web UI so the operator can see at a glance what each agent has hanging open — same data source as the MCP tool, just rendered into the existing per-agent dashboard page (next to inbox view / model chip / etc).
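
For reference, the routing and answerer rules from the ask item above could be sketched roughly as follows. Route, AskError, route_ask, and may_answer are hypothetical names, not the actual contents of hive-c0re/src/questions.rs. Treating the operator as the literal answerer name "operator" is an assumption, and the ttl-watchdog path is left out.

```rust
/// Where an `Ask` ends up (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Route {
    /// `to = None` or `to = Some("operator")`: the dashboard path.
    Operator,
    /// `to = Some(<agent>)`: delivered into that agent's inbox.
    Agent(String),
}

#[derive(Debug, PartialEq)]
pub enum AskError {
    /// An agent tried to ask itself.
    SelfAsk,
}

/// Resolve the optional `to` field of an `Ask` sent by `asker`.
pub fn route_ask(asker: &str, to: Option<&str>) -> Result<Route, AskError> {
    match to {
        None | Some("operator") => Ok(Route::Operator),
        Some(target) if target == asker => Err(AskError::SelfAsk),
        Some(target) => Ok(Route::Agent(target.to_string())),
    }
}

/// Only the question's target agent or the operator may answer.
/// Assumption: the operator is identified by the name "operator".
pub fn may_answer(route: &Route, answerer: &str) -> bool {
    match route {
        Route::Operator => answerer == "operator",
        Route::Agent(target) => answerer == target.as_str() || answerer == "operator",
    }
}
```

Under this sketch, route_ask("alice", Some("alice")) is rejected as a self-ask, and an operator override on a peer question passes may_answer, which matches the "operator on any target" rule noted under Dashboard.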

Reminder Tool

  • Handle text overflow → suggest file_path option for long messages ✓ fixed — Remind dispatch rejects message.len() > 4096 (when no file_path was supplied) with an error pointing at the file_path escape hatch.
  • Per-agent reminder limits (burst capacity, rate limiting)
  • Expose remind MCP tool ✓ fixed — mcp__hyperhive__remind now on AgentServer; takes message, exactly one of delay_seconds / at_unix_timestamp, optional file_path. Manager surface still missing (no ManagerRequest::Remind variant) — separate item below.
  • Manager-side remind ✓ fixed — ManagerRequest::Remind variant added, dispatch reuses agent_server::store_remind helper (shared across both surfaces), mcp__hyperhive__remind now on ManagerServer (auto-file lands at /state/reminders/auto-<ts>.md — manager's legacy state mount).
  • File path delivery ✓ fixed — scheduler now writes the reminder body to the requested file_path (mapped from container /agents/<agent>/state/... to host /var/lib/hyperhive/agents/<agent>/state/...) and delivers a short pointer message in its place. Path-traversal + foreign-agent-state writes are rejected; on rejection or write failure the body falls back to inline delivery with a noted warning. New module hive-c0re/src/reminder_scheduler.rs (extracted from main.rs).
  • Orphan reminders ✓ fixed — Broker::deliver_reminder wraps the inbox INSERT + reminders UPDATE in one sqlite transaction; partial failure can no longer cause duplicate delivery on the next tick.
  • Unbounded batches ✓ fixed — scheduler now calls get_due_reminders(REMINDER_BATCH_LIMIT) (cap = 100/tick); overflow stays due and gets picked up next cycle.
  • Scheduler shutdown: add graceful shutdown signal when coordinator is destroyed (currently runs forever)
  • DB lock contention: under high reminder volume, the broker's Mutex<Connection> serializes every delivery transaction. Consider batching multiple deliveries into one tx, or moving reminders onto a separate sqlite connection.
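
A rough sketch of the container-to-host mapping described in the file path delivery item above, including the traversal and foreign-agent rejections. map_to_host, PathError, and HOST_ROOT are illustrative names and not necessarily what hive-c0re/src/reminder_scheduler.rs uses. The sketch works on path components only and does not resolve symlinks, which a real implementation may need to handle.

```rust
use std::ffi::OsStr;
use std::path::{Component, Path, PathBuf};

/// Host-side root, taken from the path quoted in the item above.
const HOST_ROOT: &str = "/var/lib/hyperhive/agents";

#[derive(Debug, PartialEq)]
pub enum PathError {
    /// Not of the form /agents/<agent>/state/...
    NotUnderAgentState,
    /// Points into another agent's state directory.
    ForeignAgent,
    /// Contains `..` or another non-plain component.
    Traversal,
}

/// Map a container path to its host path for `agent` (hypothetical helper).
pub fn map_to_host(agent: &str, container_path: &str) -> Result<PathBuf, PathError> {
    let rest = Path::new(container_path)
        .strip_prefix("/agents")
        .map_err(|_| PathError::NotUnderAgentState)?;
    let mut parts = rest.components();

    // First component names the owning agent, second must be "state".
    let owner = match parts.next() {
        Some(Component::Normal(name)) => name,
        _ => return Err(PathError::NotUnderAgentState),
    };
    match parts.next() {
        Some(Component::Normal(dir)) if dir == OsStr::new("state") => {}
        _ => return Err(PathError::NotUnderAgentState),
    }
    if owner != OsStr::new(agent) {
        return Err(PathError::ForeignAgent);
    }

    // Rebuild the host path from plain segments only; `..` is rejected.
    let mut host = PathBuf::from(HOST_ROOT);
    host.push(agent);
    host.push("state");
    for part in parts {
        match part {
            Component::Normal(segment) => host.push(segment),
            _ => return Err(PathError::Traversal),
        }
    }
    Ok(host)
}
```

On any Err, the caller would fall back to inline delivery with a warning, as the item describes.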

Dashboard

  • UI for agent-to-agent questions (follow-up to the ask rename): now that agents can ask(to: <agent>) each other, surface those threads in the per-agent dashboard view. Replace the existing read/unread tabs with THREE filters: unread, from: <agent>, to: <agent>. The to: filter makes agent-targeted questions visible so the operator can see at a glance "alice has 3 questions outstanding from bob" and intervene if a thread is stuck. Same UI is useful for general inbox filtering too. Data lives in the existing operator_questions table (with the new target column) + the broker inbox; no new schema needed. Also expose a "respond" affordance so the operator can override-answer a peer question when an agent is offline / stuck (the answerer-auth check in OperatorQuestions::answer already permits the operator on any target).
  • Clickable file paths in message bodies: agents drop pointer strings like /agents/<name>/state/foo.md constantly (it's the whole 1 KiB-cap escape hatch). Right now they're plain text — operator has to copy-paste into a terminal to peek. Detect path-shaped tokens (start with /agents/, /shared/, /state/, or absolute /var/lib/hyperhive/...) in rendered message bodies + question text + answer text + helper-event payloads, render as clickable links that hit a new /api/state-file?path=… dashboard endpoint. Endpoint serves the file as text (with a strict allow-list — only paths under /var/lib/hyperhive/agents/*/state/, /var/lib/hyperhive/shared/, never anything else), syntax-highlighting where it makes sense, falling back to download for binaries. Reuses the existing <details> collapse pattern so inline preview doesn't blow up the message-flow stream.
  • UI for pending reminders: show pending/queued reminders in dashboard, allow operator to view/debug/cancel
  • Per-agent reminder status (pending, delivered)
  • Reminder query interface for debugging
  • Display reminder delivery errors (failed sends, mark failures)
  • Phase 5b: per-domain mutation event types + client derived state ✓ landed across 56d615b (approvals), 1879b2f (questions), 7956e1c (transients). DashboardEvent now carries ApprovalAdded / ApprovalResolved, QuestionAdded / QuestionResolved, TransientSet / TransientCleared; emit sites cover actions::approve/deny/finish_approval, dashboard's orphan-approval GC, manager-socket request_spawn + request_apply_commit (success + git_fetch failure), questions::handle_ask/handle_answer (operator-targeted only), dashboard's /answer-question + /cancel-question, ttl-watchdog, Coordinator::set_transient/clear_transient. /api/state still serves these arrays for cold-start; live updates flow through the events. Container-list events still deferred — ContainerView is sourced from external nixos-container list, so the 5s poll continues to drive /containers-section. Phase 6 remaining redirect conversions (/approve, /deny, /restart, /destroy, /kill, /rebuild, /api/cancel, /api/compact, /api/model, /api/new-session, /request-spawn, /answer-question, /cancel-question, /meta-update, /purge-tombstone) are now unblocked for the event-covered domains; container-lifecycle ones still need either container-list events or to live with the 5s poll-refresh delay.
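
The path-token detection proposed in the clickable file paths item above might look something like this on the server side. find_path_tokens and PATH_PREFIXES are hypothetical names, the prefix list is copied from that item, and the punctuation trimming is an assumption about how paths appear in message bodies. The /api/state-file allow-list check is a separate step and is not shown.

```rust
/// Prefixes that mark a token as path-shaped, per the TODO item.
const PATH_PREFIXES: [&str; 4] = ["/agents/", "/shared/", "/state/", "/var/lib/hyperhive/"];

/// Return the path-shaped tokens found in a message body, in order.
/// Tokens are whitespace-separated; surrounding punctuation such as a
/// trailing comma or enclosing parentheses is trimmed (an assumption).
pub fn find_path_tokens(body: &str) -> Vec<&str> {
    body.split_whitespace()
        .map(|tok| {
            tok.trim_matches(|c: char| {
                matches!(c, '.' | ',' | ';' | ':' | '(' | ')' | '"' | '\'')
            })
        })
        .filter(|tok| PATH_PREFIXES.iter().any(|prefix| tok.starts_with(*prefix)))
        .collect()
}
```

Each returned token would then be rendered as a link, with the endpoint re-validating the path against the allow-list rather than trusting the detector.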

Bugs

  • Pending message wake-up ✓ fixed (e423d57) — subscribe-before-check race in broker.recv_blocking meant a send landing between the initial recv() and subscribe() was missed; agent then sat on the 180s long-poll until another, unrelated message woke it. Now subscribe first.
  • Post-rebuild system-message missed wake: at 09:13:14 the dashboard showed system → damocles container rebuilt as ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequent recv() from inside the agent returned (empty), confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_server serve_agent_stdio task is up and answering MCP/socket calls, but the hive-ag3nt::serve long-poll loop that drives drive_turn either died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survive nixos-container update cleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to the recv_blocking fix in e423d57 if the rebuild restarts the broker mid-subscribe.
  • LiveEvent::Note(String) never reaches the browser ✓ fixed — converted to struct variant Note { text: String }; wire shape {"kind":"note","text":"..."} matches what the JS already reads via ev.text. Historical sqlite rows persisted as the literal string "null" (from when serialization silently failed) get filtered out by the rows.flatten().flatten() pipeline in EventStore::recent, so replay tolerates them.
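
The subscribe-before-check ordering from the pending message wake-up fix can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: Broker here is a toy built on std channels rather than the real broker, its field and method shapes are hypothetical, and it assumes a single consumer per inbox.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{channel, Sender};
use std::sync::Mutex;
use std::time::Duration;

/// Toy broker: one inbox plus a list of wake-up subscribers.
#[derive(Default)]
pub struct Broker {
    queue: Mutex<VecDeque<String>>,
    wakers: Mutex<Vec<Sender<()>>>,
}

impl Broker {
    /// Enqueue a message, then wake every subscriber.
    pub fn send(&self, msg: &str) {
        self.queue.lock().unwrap().push_back(msg.to_string());
        // Drop subscribers whose receiving side has gone away.
        self.wakers.lock().unwrap().retain(|w| w.send(()).is_ok());
    }

    /// Non-blocking pop.
    pub fn recv(&self) -> Option<String> {
        self.queue.lock().unwrap().pop_front()
    }

    /// Long-poll for a message, giving up after `timeout`.
    pub fn recv_blocking(&self, timeout: Duration) -> Option<String> {
        // 1. Subscribe first, so a send landing from here on wakes us.
        let (tx, rx) = channel();
        self.wakers.lock().unwrap().push(tx);

        // 2. Only then check the queue. Checking before subscribing would
        //    leave a window in which a send is neither seen nor signalled.
        if let Some(msg) = self.recv() {
            return Some(msg);
        }

        // 3. Wait for a wake-up or the timeout, then check again.
        let _ = rx.recv_timeout(timeout);
        self.recv()
    }
}
```

With the two steps in the opposite order, a message sent between them would sit in the queue until the timeout (180s in the bug report) or an unrelated send.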