11 KiB
Hyperhive TODOs
Architecture / Features
- Shared space for all agents to access documents/files without manager routing
- Private git forge agents can push to and create new repos in
- Move bind mounts in agents to
/agents/<name>/stateso path for agent = path for manager - Split harness-internal state from agent-visible state: the
/agents/<n>/state/mount (host/var/lib/hyperhive/agents/<n>/state/) currently mixes the agent's durable notes with harness internals —hyperhive-events.sqlite,hyperhive-turn-stats.sqlite,hyperhive-model, future per-agent skill caches, etc. The agent can accidentally overwrite a harness file, the harness clutters what claude thinks is "my notes dir", and the host-side vacuum has to special-case filenames it owns. Move harness internals to a sibling dir, e.g./var/lib/hyperhive/agents/<n>/harness/, bind-mounted RW into the container as/agents/<n>/harness/(same path inside + out, same convention as state). Container's/agents/<n>/state/becomes purely agent-owned. Touches:paths.rs(newharness_dir()),events.rs,turn_stats.rs(default paths flip),events_vacuum.rs(sweep root flips),lifecycle.rs(extra bind mount), and a migration that moves existing files on first boot under the new layout. Side benefit: makes the privsep TODO cheaper — the unprivileged web server only needs read access to/agents/<n>/state/(operator-meaningful files), not/agents/<n>/harness/. The legacy bare/statemount the manager still uses (container_state_prefix("manager") == "/state/", manager bind inlifecycle::set_nspawn_flags) gets removed in the same pass — manager goes to/agents/manager/state/+/agents/manager/harness/like every other agent. - Broadcast messaging: allow sending messages with recipient "*" to all agents; deliver with hint "this was a broadcast and may not need any action from you"
- Multi-agent restart coordination: when rebuilding all agents, manager should start first so it can coordinate post-restart confusion (notify agents, suppress unnecessary retries, etc)
- Shared docs/skills repo (RO): a single repo on the hive forge that every agent has read-only access to — common references, prompts, runbooks, "skills" the operator wants every agent to inherit without baking into the system prompt or
/shared. Implementation likely: seed anorg-shared/docsrepo on first hive-forge boot, grant every per-agent user a read membership in the org. Agentsgit cloneit (or use the API) to read; only the manager + operator can push.
Reminder Tool
- Per-agent reminder limits (burst capacity, rate limiting)
- Scheduler shutdown: add graceful shutdown signal when coordinator is destroyed (currently runs forever)
- DB lock contention: under high reminder volume, the broker's
Mutex<Connection>serializes every delivery transaction. Consider batching multiple deliveries into one tx, or moving reminders onto a separate sqlite connection.
Dashboard
- Per-agent terminal coherence pass: row taxonomy + colour scheme has drifted — glyph reuse, ad-hoc
<details>rules, undifferentiated stderr,.syscatch-all silently hiding unrecognised events. Full audit + proposed unified scheme indocs/terminal-rendering.md. - Delivered-reminder rollups: per-agent delivered-count chip (last 24h) + histogram of attempts-vs-successes on the container row. Needs
Broker::count_delivered_reminders_since(agent, ts)(cheap COUNT against thereminderstable,WHERE agent = ?1 AND sent_at >= ?2).
Security
- Privsep the dashboard from the privileged daemon: hive-c0re runs as root (it has to —
nixos-containercreate / start / destroy, the meta git repo, every per-agent bind mount). The HTTP server lives in the same process, so every read-endpoint (/api/state-file,/api/journal/{name},/api/agent-config/{name}) is one allow-list bug away from serving arbitrary host files. Split the architecture: keep the privileged daemon doing lifecycle + git + ipc, run the web UI as an unprivileged user that talks to the daemon over a unix socket with a narrow request surface (ReadAgentStateFile { agent, rel_path }etc.). The unprivileged process can't read/etc/shadoweven if every check inget_state_fileis bypassed — it doesn't have the bits. Container-lifecycle POSTs (/restart,/destroy, etc.) become forwarded RPCs the privileged side authorises on its terms. Defense in depth on✓ landed —get_state_fileresolve_state_path(shared byget_state_file+scan_validated_paths) now: (a) walks each path component below the matched root viasymlink_metadataand refuses outright if any is a symlink (so an agent plantingln -s /var/lib/hyperhive/agents/other/state/secret /agents/me/state/peekcan't have its target proxied —canonicalizewould happily resolve past the allow-list check otherwise); (b) refuses any..traversal below the root with a friendlier error than "escapes allow-list"; (c) refuses files whose mode isn't world-readable (mode & 0o004 == 0) so a 0600 file insidestate/doesn't leak via the endpoint; (d) bundles the metadata fetch into the resolve helper so callers don't restat. New tests inhive-c0re/src/dashboard.rs::testscover leaf-symlink, mid-path-symlink,..traversal, and plain-dir-passthrough cases.
Harness Ergonomics (agent-side wishlist)
Filed by damocles, who actually lives in this thing. Loosely ranked by how often the friction bites in normal use.
- Inbox batching hint in the wake prompt — when the harness pops a
message and there are N more waiting, the wake prompt should say so
(e.g.
"(+3 more queued; consider draining before acting)") so claude knows to callrecv()again in the same turn instead of doing the expensive Read/Edit dance once per message over N turns. The data's already in the broker (Broker::pending_count(agent)); just thread it into the prompt builder inhive-ag3nt::turn.rs. Even better: add a one-shotrecv_batch(max: u32)MCP tool that returns up tomaxpending messages in a single round-trip. - Self-management of own asks + reminders — once I fire
askorremindI have no way to inspect or cancel them from the agent side. Operator can cancel asks via dashboard; nothing for reminders at all (TODO above). Wantlist_my_asks() -> [{id, target, question, asked_at}]andcancel_ask(id)on the agent surface, pluslist_my_reminders()/cancel_reminder(id). Bounded byasker == selfandreminder.owner == selfso no cross-agent meddling. - Optional
in_reply_to: <msg_id>on send — pure wire addition; no behavioural change. The dashboard could render conversation threads (already wants this for the agent-to-agent question UI in the Dashboard section). Today every reply is a fresh root in the message flow which obscures cause-and-effect when two agents are mid-debate. Field is optional, ignored if the referenced id is unknown / cross- agent / out of retention.
Telemetry
- Per-turn stats: host-side vacuum sweep: the sink writes to
/state/hyperhive-turn-stats.sqliteon each agent's state dir; needs a periodic retention sweep mirroringevents_vacuum.rsso the table doesn't grow forever. Default keep-window: 90 days (turn-stats are denser than events but smaller per-row, ~200B each). - Surface per-turn stats on the agent web UI: "N turns today" chip + rolling tool-call histogram tooltip on the model chip. (
open_threadsandopen_reminderschips already landed via other paths — open-threads section on the page + reminder count chip on the container row.) Reads the per-agentturn_stats.sqlite. - Stats UI on the main dashboard: per-agent rollups (avg turn duration, tokens-since-boot, top 5 tools) on the container row. Same data source, host-side aggregation query.
Harness Behaviour
-
Persist + cold-load current context size on the per-agent page: the
ctx-badge(Claude Code's bottom-right "N tokens" indicator) currently only populates after the firstTokenUsageChangedSSE event arrives, which is the next turn — until then the badge is empty. Operator can't see "this agent is at 78% context" before deciding to manually compact / reset / message it. Last known token usage should be persisted (likely a small/state/hyperhive-token-usageblob, or a row in turn_stats already has it — pull last row's totals on cold load) and returned by/api/stateso the badge paints with real numbers on first render. -
Auto session-reset when context is large and cache is cold: today every turn uses
--continue, so a long-lived agent carries its entire transcript forward indefinitely. When the next turn's context is above some threshold (rough starting point: ~50% of the session limit — hive startup alone burned ~15%, so the headroom disappears fast) and the prompt cache is no longer warm (last turn ended past the cache TTL), it's cheaper to start fresh than to re-send the whole history uncached. Open question: drop--continuevs. trigger--compactfirst — needs measurement of what each actually costs (uncached re-read of N tokens vs. a compact turn's own token spend + the post-compact uncached re-read). Decision should be data-driven, not guessed. Needs: a context-size estimate per turn (turn_stats already tracks token usage), a cache-warmth heuristic (time since last turn vs. cache TTL), and a one-shot fresh-session path inturn.rsmirroring the existing↻ new sessionbutton.
Bugs
- Token-budget exhaustion crashes the harness: when claude's account hits its rate/token cap, the in-flight
claude --printinvocation returns an error the harness doesn't recognise as recoverable, the serve loop exits, and the container stays up with a dead daemon. Operator only notices when an unrelated wake fails to drive a turn. Want: detect the budget-exceeded class of failure (likely a specific stderr line or stream-jsonrate_limit_eventshape), fire aLiveEvent::StatusChanged("rate_limited")or new status, surface as a red badge + banner on the dashboard + per-agent UI, and have the serve loop park (sleep N minutes, retry) instead of returning Err. Operator can also see "this agent is rate-limited until ~HH:MM" if claude tells us when. Inspectcrate::turn::run_claude'sbail!paths + claude's stderr conventions for the budget error string. - Post-rebuild system-message missed wake: at 09:13:14 the dashboard showed
system → damocles container rebuiltas ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequentrecv()from inside the agent returned(empty), confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_serverserve_agent_stdiotask is up and answering MCP/socket calls, but thehive-ag3nt::servelong-poll loop that drivesdrive_turneither died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survivenixos-container updatecleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to therecv_blockingfix ine423d57if the rebuild restarts the broker mid-subscribe.