9.4 KiB
9.4 KiB
Hyperhive TODOs
Architecture / Features
- Shared space for all agents to access documents/files without manager routing
- Private git forge agents can push to and create new repos in
- Move bind mounts in agents to
/agents/<name>/stateso path for agent = path for manager - Broadcast messaging: allow sending messages with recipient "*" to all agents; deliver with hint "this was a broadcast and may not need any action from you"
- Multi-agent restart coordination: when rebuilding all agents, manager should start first so it can coordinate post-restart confusion (notify agents, suppress unnecessary retries, etc)
- Shared docs/skills repo (RO): a single repo on the hive forge that every agent has read-only access to — common references, prompts, runbooks, "skills" the operator wants every agent to inherit without baking into the system prompt or
/shared. Implementation likely: seed anorg-shared/docsrepo on first hive-forge boot, grant every per-agent user a read membership in the org. Agentsgit cloneit (or use the API) to read; only the manager + operator can push. Loose-ends tracker +✓ landed — newget_open_threadstoolmcp__hyperhive__get_open_threadsMCP tool on both agent + manager surfaces. Wire types inhive-sh4re:AgentRequest::GetOpenThreads/ManagerRequest::GetOpenThreads→OpenThreads { threads: Vec<OpenThread> }.OpenThreadis a tagged enum withApproval { id, agent, commit_ref, description, age_seconds }andQuestion { id, asker, target, question, age_seconds }. Shared aggregator athive-c0re/src/open_threads.rs:for_agent(coord, name)(sub-agent surface; filters questions by asker == self OR target == self, approvals only for manager) andhive_wide(coord)(manager surface; everything pending in the swarm). No caching — fresh sqlite sweep per call. Per-agent web UI rendering is a follow-up below.- Follow-up: surface open-threads on the per-agent web UI so the operator can see at a glance what each agent has hanging open — same data source as the MCP tool, just rendered into the existing per-agent dashboard page (next to inbox view / model chip / etc).
Reminder Tool
- Per-agent reminder limits (burst capacity, rate limiting)
- Scheduler shutdown: add graceful shutdown signal when coordinator is destroyed (currently runs forever)
- DB lock contention: under high reminder volume, the broker's
Mutex<Connection>serializes every delivery transaction. Consider batching multiple deliveries into one tx, or moving reminders onto a separate sqlite connection.
Dashboard
- Reminder delivery-error surface:
reminder_scheduler::ticklogs failed deliveries but doesn't persist. Addlast_error TEXT, attempt_count INTEGERcolumns + a banner on the dashboard row + a "retry" affordance. Needs a sqlite migration (idempotent ALTER TABLE). - Per-agent reminder status / query interface: surface pending vs. delivered counts per agent (manager + each sub-agent) as a small chip on the container row.
- Tombstones + meta_inputs events: not yet event-derived. PURG3 + meta-update still trigger a post-submit
/api/staterefetch on the dashboard. AddTombstoneAdded/TombstoneRemoved+MetaInputsChangedso those forms can drop their refetch too and the cold-load is the only/api/statefetch in normal operation.
Security
- Privsep the dashboard from the privileged daemon: hive-c0re runs as root (it has to —
nixos-containercreate / start / destroy, the meta git repo, every per-agent bind mount). The HTTP server lives in the same process, so every read-endpoint (/api/state-file,/api/journal/{name},/api/agent-config/{name}) is one allow-list bug away from serving arbitrary host files. Split the architecture: keep the privileged daemon doing lifecycle + git + ipc, run the web UI as an unprivileged user that talks to the daemon over a unix socket with a narrow request surface (ReadAgentStateFile { agent, rel_path }etc.). The unprivileged process can't read/etc/shadoweven if every check inget_state_fileis bypassed — it doesn't have the bits. Container-lifecycle POSTs (/restart,/destroy, etc.) become forwarded RPCs the privileged side authorises on its terms. - Defense in depth on
get_state_file: until privsep lands, the allow-list is load-bearing. Worth adding: refuse files whose mode is not world-readable (so an agent writing a 0600 file insidestate/can't have its contents proxied through the endpoint to a different operator), and refuse symlinks at any path component (O_NOFOLLOW-style —canonicalizeresolves them, but we currently don't reject if the original path had symlinks).
Harness Ergonomics (agent-side wishlist)
Filed by damocles, who actually lives in this thing. Loosely ranked by how often the friction bites in normal use.
Auto-attach oversize message bodies— superseded by simply raising the inline cap from 1 KiB → 4 KiB (covers ~95% of conversational overflow). Anything genuinely larger still needs a state file. Blob-in-broker-sqlite was prototyped on paper (/agents/damocles/state/oversize-msg-proposal.md) but rejected as future vacuum/sync pain not worth carrying for the long-tail 5% of cases that legitimately belong in a file.- Inbox batching hint in the wake prompt — when the harness pops a
message and there are N more waiting, the wake prompt should say so
(e.g.
"(+3 more queued; consider draining before acting)") so claude knows to callrecv()again in the same turn instead of doing the expensive Read/Edit dance once per message over N turns. The data's already in the broker (Broker::pending_count(agent)); just thread it into the prompt builder inhive-ag3nt::turn.rs. Even better: add a one-shotrecv_batch(max: u32)MCP tool that returns up tomaxpending messages in a single round-trip. - Self-management of own asks + reminders — once I fire
askorremindI have no way to inspect or cancel them from the agent side. Operator can cancel asks via dashboard; nothing for reminders at all (TODO above). Wantlist_my_asks() -> [{id, target, question, asked_at}]andcancel_ask(id)on the agent surface, pluslist_my_reminders()/cancel_reminder(id). Bounded byasker == selfandreminder.owner == selfso no cross-agent meddling. whoamiintrospection tool — agents currently rely on the system prompt remembering their name + role. After a rename or model swap there's no trustworthy source-of-truth from inside the harness. Cheap: awhoami() -> { name, role: "agent" | "manager", model, port, hyperhive_rev, started_at }tool reading from the harness's own envTurnState. Useful for self-documenting state files ("this dropped by damocles@gpt-5-codex on rev abc1234") and for the futureget_open_threadsto know whose threads to query without trusting prompt-substituted strings.
- Optional
in_reply_to: <msg_id>on send — pure wire addition; no behavioural change. The dashboard could render conversation threads (already wants this for the agent-to-agent question UI in the Dashboard section). Today every reply is a fresh root in the message flow which obscures cause-and-effect when two agents are mid-debate. Field is optional, ignored if the referenced id is unknown / cross- agent / out of retention.
Telemetry
- Per-turn stats log: persist one row per claude turn in a new sqlite table on the per-agent state dir (or the host broker DB, indexed by agent). Columns:
started_at,ended_at,duration_ms,model,input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens,tool_call_count,tool_call_breakdown(JSON:{Read: 12, Bash: 3, ...}),bytes_streamed,wake_reason(recv'd message / reminder / operator-kick / manual),result_kind(ok / cancelled / failed-mid-turn / compacted),note(e.g. failure reason). Powers: per-agent dashboards (avg turn time over time, tool-usage histogram, cost projections from token counts × model rate), debugging stuck loops (look for repeated identical wake_reason + zero tool calls), and operator-visible "this is what your spend looked like this week" rollups. Source data is already mostly in the harness'sTurnState+ the per-event bus; just needs a sink. Keep a retention sweep (host-side) so the table doesn't grow forever.
Bugs
- Post-rebuild system-message missed wake: at 09:13:14 the dashboard showed
system → damocles container rebuiltas ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequentrecv()from inside the agent returned(empty), confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_serverserve_agent_stdiotask is up and answering MCP/socket calls, but thehive-ag3nt::servelong-poll loop that drivesdrive_turneither died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survivenixos-container updatecleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to therecv_blockingfix ine423d57if the rebuild restarts the broker mid-subscribe.