Commit graph

73 commits

Author SHA1 Message Date
müde
e3b5837378 todo: security section — privsep + state-file hardening 2026-05-17 22:13:18 +02:00
müde
b3970c439c todo: drop landed-dashboard comments 2026-05-17 22:10:32 +02:00
müde
1db6b8ffed dashboard: queued reminders surface
new 'qu3u3d r3m1nd3rs' section between approvals and operator
inbox. lists every pending reminder with agent, due-relative
timestamp, body, payload path (path-linkified), and a cancel
button. drives off a new /api/reminders endpoint and a
POST /cancel-reminder/{id} that hard-deletes the row.

failure surface (last_error / attempt_count + retry) deferred —
needs a sqlite migration; tracked in TODO.md.
2026-05-17 22:10:02 +02:00
müde
cb71a07300 dashboard: clickable file-path previews
agents constantly emit pointer strings to /agents/<n>/state/foo.md
since broker bodies cap at 1 KiB. now those tokens linkify in the
message flow, question bodies, answer text, and operator inbox;
clicking expands an inline <details> that lazy-fetches via the
new /api/state-file?path=... endpoint.

endpoint allow-list: per-agent state dirs + shared docs, both
in their container-mount form (/agents/<n>/state, /shared) and
host form (/var/lib/hyperhive/...). 1 MiB read cap; canonicalises
before the prefix check so `..` / symlinks can't escape.

legacy bare `/state/...` is deliberately not matched — ambiguous
from the host's perspective (we'd need to know which agent the
message references to translate). agents should use the qualified
form going forward.
2026-05-17 22:08:15 +02:00
müde
a15fafb5de dashboard: surface peer questions + operator override
questions pane now shows both operator-targeted threads
(target IS NULL) and agent-to-agent threads (target = some
agent). filter chips above the list: all / @operator / @peer /
per-participant. peer rows get a mauve left rule + a 0V3RR1D3
button that POSTs the same /answer-question endpoint
(OperatorQuestions::answer already permits the operator as
answerer on any target).

wire changes: OperatorQuestions gains pending_all +
recent_answered_all; QuestionAdded + QuestionResolved events
carry target: Option<String>; emit sites drop their
target.is_none() guard. answered-history rows show the
answerer prefix so override answers are auditable at a glance.
2026-05-17 22:06:53 +02:00
müde
e7ce35c503 phase 6: container events + drop the 5s /api/state poll
new DashboardEvent::ContainerStateChanged + ContainerRemoved
close the last refetch loop on the dashboard. Coordinator's
rescan_containers_and_emit diffs a fresh container_view::build_all
against a cached last_containers map and fires per-row events.
called from actions::approve (post-spawn), actions::destroy,
the lifecycle_action wrapper, auto_update::rebuild_agent, and
the existing 10s crash_watch poll.

ContainerView extracted to its own module so coordinator and
dashboard can both build it. dashboard endpoints flip to 200;
container-lifecycle forms carry data-no-refresh. client drops
the periodic poll entirely — initial cold load + SSE for
everything afterwards. pending overlay reads from the existing
transientsState since the new event payload doesn't carry it.

PURG3 + meta-update keep the post-submit refetch since
tombstones + meta_inputs aren't event-derived yet; tracked in
TODO.md.
2026-05-17 22:01:15 +02:00
damocles
c423ce9e39 todo: lock down get_open_threads scope (asker + target questions) 2026-05-17 14:43:08 +02:00
müde
e4438d1a6e todo: phase 6 event-covered redirects converted 2026-05-17 14:27:03 +02:00
müde
62784d4933 todo: prune resolved items 2026-05-17 14:22:47 +02:00
müde
88a1f4c146 todo: mark phase 5b done; note remaining phase 6 conversions now unblocked 2026-05-17 14:21:12 +02:00
damocles
291f1fce42 todo: clickable file paths in dashboard message bodies 2026-05-17 13:20:33 +02:00
damocles
82b0877c47 ask: rename ask_operator → ask + optional 'to' for agent-to-agent Q&A 2026-05-17 13:20:32 +02:00
müde
87f8f8a123 todo: phase 5b — mutation events for approvals/questions/transients 2026-05-17 13:15:32 +02:00
müde
b60774a66c events: LiveEvent::Note becomes struct variant so serde can actually serialize it 2026-05-17 13:14:09 +02:00
müde
d48cee7c2d approvals: ship raw diff text instead of pre-rendered html; client classifies per-line 2026-05-17 12:30:45 +02:00
damocles
1770b51845 manager mcp: expose 'remind' tool sharing storage helper with agent surface 2026-05-17 11:43:14 +02:00
damocles
3da6912e73 todo: open-threads list also rendered on the per-agent web ui 2026-05-17 11:20:01 +02:00
damocles
0c606fd2dd todo: post-rebuild missed-wake bug + ask rename + open-threads tracker 2026-05-17 11:20:01 +02:00
damocles
6ce85bd6f2 reminder: file_path delivery + extract scheduler into own module 2026-05-17 11:05:29 +02:00
damocles
f2484b5e78 agent mcp: expose 'remind' tool for self-scheduled wakes 2026-05-17 10:54:36 +02:00
damocles
271c524e66 agent_server: reminder body size cap + extract Remind/AskOperator handlers 2026-05-17 02:59:51 +02:00
damocles
dba3badeae todo: mark orphan-reminder + unbounded-batch items as fixed 2026-05-17 02:59:51 +02:00
damocles
e45d161cb8 todo: mark recv_blocking race bug as fixed 2026-05-17 02:59:51 +02:00
müde
9fc7cae132 prompts: tell agents + manager about the code forge; todo: shared docs repo
system prompts now describe the hyperhive Forgejo at localhost:3000,
the per-agent user, the pre-configured tea CLI, and the REST API
fallback with /state/forge-token. todo gains the shared docs/skills
RO-repo follow-up (org-shared + per-agent read membership).
2026-05-16 23:36:05 +02:00
damocles
4ceae6cf67 todo: add bug - pending message wake-up issue 2026-05-16 13:43:41 +02:00
damocles
3642ae1a61 todo: add dashboard ui for pending reminders 2026-05-16 13:40:15 +02:00
damocles
a57e500f48 todo: add multi-agent restart coordination item 2026-05-16 13:14:17 +02:00
damocles
3b8cdc7e20 todo: add broadcast messaging feature 2026-05-16 13:07:31 +02:00
damocles
24eec69418 fix reminder tool issues: error on time overflow, optimize scheduler query 2026-05-16 13:00:56 +02:00
damocles
bc27113967 docs: add hyperhive feature TODOs 2026-05-16 12:52:08 +02:00
müde
67e4242b9f per-agent send allow-list via hyperhive.allowedRecipients
new NixOS option in harness-base.nix:
  hyperhive.allowedRecipients = [ 'alice' 'manager' ];  # whitelist
  hyperhive.allowedRecipients = [ ];                    # default = unrestricted

module writes the list as JSON to /etc/hyperhive/send-allow
.json at activation. AgentServer::send reads the file before
issuing the broker request; if the list is non-empty and
`to` isn't on it, the tool returns a claude-readable refusal
string without touching the broker. the manager is always
implicitly permitted regardless of the list — otherwise a
misconfigured allow-list could strand a sub-agent without an
escalation path.

enforcement is in the in-container MCP server (not on the
host's per-agent socket) because the agent's nix config is the
trust boundary anyway — the operator audits agent.nix at
deploy time, the activation-time /etc/hyperhive/send-allow
.json is r/o under /nix/store, so the agent can't tamper at
runtime without going through a new approval.

agent prompt mentions the option + tells claude to route
through the manager when refused. retires the matching TODO
under Permissions / policy.
2026-05-16 03:59:28 +02:00
müde
d1c69b134a dashboard: reorder sections into grouped sequence
after reverting the 3-column attempt (74ba8a6), keep the
single-column layout but put related sections adjacent:

  swarm:     containers → kept-state → meta-inputs
  decisions: questions → approvals
  messages:  operator-inbox → message-flow + compose

this is a free improvement — the operator scrolls through one
logical group at a time instead of bouncing between swarm /
decisions / messages mid-page. follow-up improvements
(collapsing rarely-active sections, multi-column at wide
viewports done less aggressively) captured in TODO under
'Dashboard layout overhaul'.
2026-05-16 03:54:53 +02:00
müde
40938d8b54 dashboard: surface silent unwrap_or_default in api_state
every snapshot source backing /api/state used .unwrap_or_default()
— sqlite errors, broker errors, nixos-container list failures,
operator_questions decode crashes all degraded to empty lists
without a log line. the 'pending question doesn't render'
bug we've been chasing was likely a row-decode panic in
OperatorQuestions::pending() being swallowed this way.

new log_default(what, result) replaces each call site: same
default value on Err but emits target=api_state warn with the
source name + dbg error first. five sources covered:
nixos-container list, approvals.pending,
approvals.recent_resolved, broker.recent_for(operator),
questions.pending. next time the question goes missing the
journal will say which source failed and how.

todo updated — pending-question entry now points at the new
log instead of three suspect paths.
2026-05-16 03:49:49 +02:00
müde
06af23c8a4 recv: None = peek, positive value = opt-in long-poll
old behavior: omitted wait_seconds fell through to the 30s
RECV_LONG_POLL_DEFAULT — claude calling 'is there anything in
my inbox right now?' between actions blocked the turn for half
a minute. flip the semantics: None (or 0) returns immediately,
positive value parks up to MAX (180s, unchanged). cleaner
'peek vs wait' distinction; tool descriptions + agent/manager
prompts updated to point at the new shape.

harness's own serve loops in hive-ag3nt + hive-m1nd relied on
the old default for their inbox poll. they now explicitly pass
wait_seconds: Some(180) to opt into the full park — same
effective behavior as before, just spelled out.

retires the matching TODO under Turn loop.
2026-05-16 03:22:42 +02:00
müde
96cb9f84c9 dashboard: approval history tab on P3NDING APPR0VALS
new tabs above the approvals list: 'pending · N' and
'history · M'. active tab persists in localStorage so the
operator can park on history if they prefer. on a fresh
dashboard the default is pending (matches the prior shape).

history view shows the last 30 resolved approvals — newest
first by resolved_at — with one row per approval: status
glyph (✓ approved / ✗ denied / ⚠ failed), id, agent, kind,
short sha, status label, and a relative time chip. when the
row has a note (deny reason or build error), it renders
below in a muted block with line wraps preserved.

backend: Approvals::recent_resolved(limit) queries by
status IN ('approved', 'denied', 'failed') ORDER BY
resolved_at DESC. StateSnapshot gets approval_history (a
lean ApprovalHistoryView without diff_html — rendering 30
git diffs per state poll would be expensive and the operator
already saw the diff at decision time). dashboard's
history_view fn projects the sqlite row.

retires the matching TODO entry.
2026-05-16 03:07:50 +02:00
müde
c2bf0aa4f1 todo: approval history tab; retire streaming-output entry
new entry under UI/UX for an approval history tab on the
P3NDING APPR0VALS section — sqlite already has every row + the
applied repo's annotated denied/failed tags carry the
human-readable reasons, so this is a render-side change.

retire the 'stream nixos-container stdout' entry — landed in
6f1b664. run() now pipes child output line-by-line into
tracing so 'slow build' no longer looks like 'wedged daemon'.
2026-05-16 02:59:02 +02:00
müde
7d6d8e96c1 per-agent extra MCP servers via hyperhive.extraMcpServers
new NixOS option in harness-base.nix:
  hyperhive.extraMcpServers.<key> = {
    command = "/path/to/server";
    args = [ ... ];
    env = { KEY = "value"; };
    allowedTools = [ "send_message" "join_room" ];  # or ["*"]
  };

declared as attrsOf submodule so agents stack arbitrarily many.
the module writes the whole map as JSON to
/etc/hyperhive/extra-mcp.json at activation; the harness reads
that file in mcp::render_claude_config and merges each entry
into the rendered --mcp-config under its own mcpServers.<key>
block. allowed_mcp_tools(flavor) extends the --allowedTools
arg with mcp__<key>__<pattern> for every entry — "*" (the
default) becomes mcp__<key>__* so every tool from that server
is auto-approved, or pass a concrete list to tighten.

collision guard: an extra server keyed "hyperhive" is dropped
with a warn-log so user config can't shadow the built-in
surface. malformed JSON / missing file fall back to "no
extras" silently.

prompt note added: agents see "(some agents only) extra MCP
tools surfaced as mcp__<server>__<tool>" and learn they're
declared via agent.nix. retires the matching TODO under
Per-agent extension. matrix-chat agents + bitburner-agent
migration unblocked.
2026-05-16 02:10:11 +02:00
müde
6b3ef4549c manager_server: reject proposals that modify flake.nix
submit_apply_commit now diffs the freshly-tagged proposal/<id>
against applied/main and refuses if flake.nix is in the
changeset. flake.nix is fixed boilerplate the meta flake
depends on (it exports nixosModules.default = import ./agent
.nix); silent edits there would break the nixosConfiguration
in subtle ways. the manager prompt already says don't touch
it; this is the host-side belt — clear error to the manager
on submit, row marked failed in sqlite, no orphan pending
approval to chase. diff-failure is logged + ignored: the
build path surfaces concrete errors if flake.nix is actually
broken.
2026-05-16 01:42:11 +02:00
müde
68ef6ab433 todo: stream nixos-container output so slow != stuck
surfaced by a real hang investigation today — lifecycle::run
uses .output() which buffers stdout/stderr until exit, so a
multi-minute nix build through nixos-container update looks
identical to a wedged daemon. line-buffered streaming into
tracing (and ideally the per-agent live event bus when the
agent is known) makes 'still building, just slow' visible
without strace gymnastics.
2026-05-16 01:38:02 +02:00
müde
65bdde898e todo: tag retention, flake.nix tamper-check, sync_agents nix call
three things surfaced by the meta-flake overhaul + the nix CLI
deprecation we just fixed worth tracking explicitly. extend
the web-UI-for-config-repos entry to also cover the /meta
deploy log now that meta's git history is the swarm-wide
audit trail.
2026-05-16 01:21:27 +02:00
müde
df9da4d6e1 todo: recv default should not sleep, agent opts into wait 2026-05-15 23:00:25 +02:00
müde
497cd15137 docs: tag-driven config-apply plan + migration story
scratchpad in claude.md marks this as in-flight; docs/approvals.md
gets the new tag state machine (proposal/approved/building/deployed/
failed/denied) and the manager applied.git read-only mount. todo
picks up the unprivileged-containers git-identity caveat and a web
ui for config repos as a downstream follow-up.
2026-05-15 22:43:47 +02:00
müde
75e7faff0c docs: full sync ahead of compaction + config-management overhaul
readme: manager mcp surface picks up update; operator-surface
recap mentions /model + last-turn + model chip + the three
collapsibles (inbox / journald / agent.nix).

web-ui.md: details-restore-key story under shape; port-conflict
banner mention on containers; agent.nix viewer alongside journald;
notifications use per-event tags + console.debug log on
block/show; deny endpoint takes note=<reason>; data-prompt /
data-prompt-field generalisation noted.

conventions.md: data-prompt and snapshot/restoreOpenDetails added
to the async-forms section.

persistence.md: operator_questions row picks up deadline_at (ttl)
column with a migration note.

todo.md: new 'Bugs' section captures the manager-question
not-rendering issue with three suspect paths to chase.

claude.md scratchpad rewritten as a clean handoff for the
compaction + the upcoming config-git overhaul. flags the
two-repo (proposed/ + applied/) split as the thing to
reconsider.
2026-05-15 22:12:40 +02:00
müde
91c78d626f dashboard: per-container applied agent.nix viewer
new GET /api/agent-config/{name} returns the contents of
/var/lib/hyperhive/applied/<name>/agent.nix — the file the
container actually builds against. validated against the live
container list to avoid arbitrary filesystem reads.

frontend mirrors the journald viewer: collapsed <details> on each
container row, lazy-fetches on expand, refresh button re-fetches.
restore-keyed (agent-config:<name>) so it survives the dashboard
heartbeat refresh.

read-only — mutating the applied config goes through the existing
request_apply_commit + operator approval flow.
2026-05-15 21:46:25 +02:00
müde
80229c6af9 manager: needs_login / logged_in / needs_update events + update tool
crash_watch grows two more state-axes alongside running/stopped:

- logged-in (claude session dir populated for the agent)
- up-to-date (recorded flake rev matches current)

per-tick transitions emit HelperEvent::NeedsLogin / LoggedIn /
NeedsUpdate. seed-on-first-tick semantics retained — nothing fires
on harness boot for agents that were already in their state. only
needs_update fires the 'stale appeared' direction; the resolved
direction is already covered by Rebuilt.

new mcp__hyperhive__update(name) on the manager surface: idempotent
rebuild via auto_update::rebuild_agent. transient-aware (Rebuilding)
so the dashboard shows the spinner. login intentionally has NO tool
— it's interactive OAuth, only the operator can complete it.

prompts + approvals doc + turn-loop doc updated. todo grows a
'show per-agent applied config in dashboard' entry (separate
follow-up).
2026-05-15 21:42:13 +02:00
müde
62d1a74929 docs sync + revert auto-unfree removal
revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
2026-05-15 21:26:13 +02:00
müde
237b215c55 dashboard: browser notifications for operator-bound events
three signals fire OS notifications:
- new approval lands in the queue (per id, via /api/state delta)
- new ask_operator question queued (per id)
- broker message sent to operator (live via SSE)

first /api/state render after page load seeds the 'seen' sets
without firing — only items that arrive while the page is open
count. controls in a row under the banner: 🔔 enable
notifications (calls requestPermission, hides on grant), 🔕 mute /
🔔 unmute toggle (localStorage-backed so operator can silence
without revoking the permission), inline status text when blocked
or unsupported.

notification tag='hyperhive' collapses rapid bursts; onclick
focuses the dashboard tab. requires secure context (HTTPS or
localhost) — on other origins the API is unavailable and the
controls hide themselves.

todo: entry dropped.
2026-05-15 21:10:20 +02:00
müde
a67aada7c9 todo: browser notifications for approvals / questions / operator msgs
pure frontend — Notification API + existing /api/state and
/messages/stream signals. Caveats: secure-context requirement
(HTTPS or localhost), per-browser permission grant. Includes a
sketch of the implementation: request-permission button, count
deltas on refreshState, SSE hook on operator-bound sends,
localStorage 'muted' toggle.
2026-05-15 21:07:21 +02:00
müde
8b9f7d21b7 model persisted to /state; stop auto-allowing claude-code unfree
model persistence: /model <name> now writes to /state/hyperhive-model
(in-container), Bus::new reads it on init. operator override survives
harness restart and container rebuild; gone on --purge like every
other piece of agent state. path overridable via HYPERHIVE_MODEL_FILE
for tests. failure to persist is a warn, not fatal — runtime override
still applies, just won't survive a restart.

unfree opt-in: drop the auto-allowUnfreePredicate from
harness-base.nix and the claude-unstable overlay. operator now has to
set nixpkgs.config.allowUnfree (or a predicate listing claude-code)
in their own host config. silent unfree bypass was sketchy; this is
honest. readme + gotchas updated to spell out the snippet.

todo: drops model-persistence + container-crash + journald (all
shipped); adds per-agent send allow-list (constrain who an agent can
message).
2026-05-15 21:05:40 +02:00
müde
58c3cd853b container crash watcher → HelperEvent::ContainerCrash
new hive_c0re::crash_watch task polls every 10s, builds the set of
currently-running containers, and on running→stopped transitions
checks the transient snapshot: if no Stopping / Restarting /
Destroying / Rebuilding flag is set, the container exited
unexpectedly and we fire HelperEvent::ContainerCrash into the
manager's inbox so it can react (typically: start it again).

first poll is a seeding pass — no events on harness startup. dbus
subscription would be lower-latency but polling is honest and
debuggable, and a 10s delay on crash detection is fine for our
scale.

manager prompt + approvals doc updated to advertise the new
event variant. todo drops the entry (and the journald-viewer
entry that already shipped).
2026-05-15 21:02:05 +02:00