Commit graph

391 commits

Author SHA1 Message Date
damocles
f3739d2b8e update plugin marketplaces before install at harness boot 2026-05-16 18:51:06 +02:00
damocles
dc53615686 fix stale /state refs in agent and manager prompts 2026-05-16 18:50:15 +02:00
müde
772fdd8320 forward plugin install failures to manager from sub-agents
install_configured now takes an optional notify recipient. on a
non-zero or spawn-failed 'claude plugin install', sub-agents send
the spec + stderr to manager via the hyperhive socket; manager
passes None so it doesn't message itself. boot still proceeds either
way — notification is best-effort.
2026-05-16 17:24:04 +02:00
müde
3e040d5b16 agent: forward unhandled turn failures to manager
run_claude now keeps a 20-line stderr ring buffer and bails with it
inline (was just 'exit <status>'). agent serve loop, on Failed (not
PromptTooLong — that's already absorbed by drive_turn's compaction
retry), sends the error body to manager via the normal hyperhive
send. swallows transport errors — failure is already in journald
and the events sqlite. manager-only harness (hive-m1nd) is unchanged
so it doesn't try to notify itself.
2026-05-16 16:04:35 +02:00
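A minimal sketch of the 20-line stderr ring buffer described in 3e040d5b16, assuming a plain `VecDeque` is enough. `StderrTail` and its methods are hypothetical names for illustration, not the project's actual types.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `cap` lines (hypothetical helper).
pub struct StderrTail {
    cap: usize,
    lines: VecDeque<String>,
}

impl StderrTail {
    pub fn new(cap: usize) -> Self {
        Self { cap, lines: VecDeque::with_capacity(cap) }
    }

    /// Append a line, evicting the oldest one once the buffer is full.
    pub fn push(&mut self, line: &str) {
        if self.cap == 0 {
            return;
        }
        if self.lines.len() == self.cap {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_owned());
    }

    /// Join the retained lines so they can be placed inline in an error message.
    pub fn render(&self) -> String {
        self.lines
            .iter()
            .map(String::as_str)
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

With `StderrTail::new(20)`, pushing 25 lines leaves only the last 20 in `render()`, which is the part a failure report would carry.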
müde
7ec658851a back out bypassPermissions: claude refuses it under root uid
claude-code rejects --dangerously-skip-permissions / defaultMode=
bypassPermissions when running as root, which all hyperhive
containers do. revert to the previous explicit allow-list plumbing
(per-flavor list spliced into permissions.allow + --tools enable
list), keep TodoWrite out of the built-in allow set, and keep the
deny list (TodoWrite, WebFetch, WebSearch, Task) as belt-and-braces
in case anything sneaks past the allow gate.
2026-05-16 15:58:41 +02:00
müde
36c7f3d1c7 mirror claude stderr to tracing so journald captures it
bus-only note made post-mortems require the web UI / events sqlite;
now stderr lines also land in 'journalctl -M <container> -b' alongside
the existing LiveEvent::Note for the dashboard.
2026-05-16 15:30:03 +02:00
müde
7d33da3727 retry hive socket up to 5x over 60s, surface retry count to claude
socket client now retries connect/IO failures with 2-4-8-16-30s
backoffs (60s total budget). transparent for non-tool callers via
request(); tool handlers go through request_retried() which also
returns the retry count, then annotate_retries() appends a one-line
note to the tool result so claude knows the slow round-trip was a
c0re flicker, not a content failure — avoids burning tokens on an
LLM-level retry.
2026-05-16 15:28:18 +02:00
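The retry schedule in 7d33da3727 (2-4-8-16-30s, 60s in total) could look roughly like the sketch below. The names `request_retried` and `annotate_retries` come from the commit message, but the signatures, the injected `sleep` callback, and the wording of the note are assumptions made for illustration.

```rust
use std::time::Duration;

/// Backoff schedule from the commit message: 2+4+8+16+30 = 60 seconds in total.
pub const BACKOFFS_SECS: [u64; 5] = [2, 4, 8, 16, 30];

/// Run `op` until it succeeds or the schedule is exhausted.
/// Returns the value together with the number of retries that were needed.
/// `sleep` is injected (an assumption) so callers and tests decide how to wait.
pub fn request_retried<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<(T, u32), E> {
    let mut retries = 0u32;
    let mut delays = BACKOFFS_SECS.iter();
    loop {
        match op() {
            Ok(v) => return Ok((v, retries)),
            Err(e) => match delays.next() {
                // Still inside the budget: wait, then try again.
                Some(secs) => {
                    sleep(Duration::from_secs(*secs));
                    retries += 1;
                }
                // Budget spent: surface the last error.
                None => return Err(e),
            },
        }
    }
}

/// Append a one-line note so the reader of a tool result can tell a slow
/// round-trip from a content failure. The note text is hypothetical.
pub fn annotate_retries(result: &str, retries: u32) -> String {
    if retries == 0 {
        result.to_owned()
    } else {
        format!("{}\n(note: succeeded after {} socket retries)", result, retries)
    }
}
```

In real use `sleep` would be `std::thread::sleep` or an async equivalent; passing it in keeps the schedule easy to check without waiting a minute.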
damocles
4a8a668348 feat: add optional description to request_apply_commit and request_spawn 2026-05-16 15:18:32 +02:00
damocles
a6d1464071 refactor: per-agent state paths (/agents/{label}/state), centralize in paths.rs 2026-05-16 15:18:32 +02:00
damocles
a82009cf8c docs: update agent prompt to reference /agents/{label}/state and /agents/{label}/claude 2026-05-16 15:18:19 +02:00
damocles
ecaa178199 refactor: compute per-agent mount points for /agents/<name>/ structure 2026-05-16 15:18:19 +02:00
müde
6dd17864ac auto-install claude plugins at harness boot
new hyperhive.claudePlugins NixOS option (list of strings) rendered
to /etc/hyperhive/claude-plugins.json. both hive-ag3nt and hive-m1nd
shell out 'claude plugin install <spec>' for each entry once at
startup before the turn loop opens. failures log a warning but don't
abort boot.
2026-05-16 15:17:34 +02:00
müde
8e7405db13 bypass-mode perms + deny list, drop allow-list plumbing
claude-settings.json now sets permissions.defaultMode=bypassPermissions
with a small deny list (WebFetch, WebSearch, Task, TodoWrite). The
per-flavor allow list and --tools / --allowedTools CLI flags are gone
— anything not denied auto-approves. mcp.rs loses ALLOWED_BUILTIN_TOOLS,
builtin_tools_arg, allow_list, allowed_mcp_tools. The extraMcpServers
allowedTools field is parsed for back-compat but no longer wired
anywhere; restrict via permissions.deny instead.
2026-05-16 15:17:30 +02:00
damocles
3d2a7ffec7 fix: auto-wake after turn if pending messages exist, don't block on recv 2026-05-16 13:50:11 +02:00
damocles
d99e0812d0 fix: move sleep to only occur when recv returns empty, avoid message delivery delay 2026-05-16 13:48:04 +02:00
damocles
4ceae6cf67 todo: add bug - pending message wake-up issue 2026-05-16 13:43:41 +02:00
damocles
ab0df71068 docs: update system prompts to document /shared directory 2026-05-16 13:43:05 +02:00
damocles
37e56af6ba add /shared mount: new shared directory accessible to all agents 2026-05-16 13:42:41 +02:00
damocles
3642ae1a61 todo: add dashboard ui for pending reminders 2026-05-16 13:40:15 +02:00
damocles
abcf7a0c41 implement broadcast messaging: send to '*' reaches all agents with hint 2026-05-16 13:16:13 +02:00
damocles
a57e500f48 todo: add multi-agent restart coordination item 2026-05-16 13:14:17 +02:00
damocles
3b8cdc7e20 todo: add broadcast messaging feature 2026-05-16 13:07:31 +02:00
damocles
22cea88c7e remove unused broker/coordinator methods 2026-05-16 13:02:53 +02:00
damocles
24eec69418 fix reminder tool issues: error on time overflow, optimize scheduler query 2026-05-16 13:00:56 +02:00
damocles
bc27113967 docs: add hyperhive feature TODOs 2026-05-16 12:52:08 +02:00
damocles
f38510930a reminder: add background scheduler loop - checks & delivers due reminders every 5s 2026-05-16 12:49:59 +02:00
damocles
4fc9c02934 reminder: add sqlite storage + broker methods + dispatch 2026-05-16 12:49:59 +02:00
damocles
7e9fd8e978 agent: add Remind request + ReminderTiming enum (stub implementation) 2026-05-16 12:49:59 +02:00
damocles
862bc1de44 Revert "agent: add Wake command - co-process self-wake via agent socket"
This reverts commit 68a9b8575b1647643c87bd753767acabf96528c3.
2026-05-16 12:49:59 +02:00
damocles
f0e87f0bc5 agent: add Wake command - co-process self-wake via agent socket 2026-05-16 12:49:59 +02:00
damocles
286da8980e Revert "mcp: wire extra server allowedTools into --allowedTools arg"
This reverts commit e0b18ff3c2ec5a7f771ab9a1a247ff4a24a8c475.
2026-05-16 12:49:59 +02:00
damocles
caa495aeda mcp: wire extra server allowedTools into --allowedTools arg 2026-05-16 12:49:59 +02:00
müde
d06b598c56 kick_agent on every rebuild + apply path
agents weren't being woken with the 'you were rebuilt — check
/state/ for notes, --continue intact' system message after
several recent rebuild surfaces:

- auto_update::rebuild_agent — used by the dashboard rebuild
  button, admin-CLI rebuild via lifecycle_action, the startup
  rev-scan, AND the new meta-input update batch loop. kick
  moves *into* rebuild_agent's success arm so all four
  paths benefit. (the dashboard's lifecycle_action extra
  closure was already firing kick — now it's a no-op for the
  rebuild path since rebuild_agent does it.)
- actions::run_apply_commit — apply-commit approve flow built
  + tagged deployed/<id> but never kicked. add kick on
  success with the more specific 'config update applied' hint.
- server.rs::HostRequest::Rebuild — the admin-CLI direct path
  calls lifecycle::rebuild bypassing rebuild_agent. add kick
  on success.

dashboard's restart / start lifecycle_action extras still
kick via their own closures since they don't route through
rebuild_agent. stop / kill / destroy intentionally don't
kick — there's nothing to wake.
2026-05-16 04:20:01 +02:00
müde
78aa830430 meta inputs panel: walk transitive inputs, slash-path names
read_meta_inputs() previously only included direct inputs of
meta's root node — so a manager-added 'inputs.mcp-matrix' in
agent-dmatrix's flake.nix never surfaced in the dashboard
panel even though it's a real fetched input that nix can
update.

now: BFS the flake.lock graph from root to depth 2. emits
one MetaInputView per fetched (non-follows) node, names are
slash-paths from root — 'hyperhive', 'agent-coder',
'agent-dmatrix/mcp-matrix', 'hyperhive/nixpkgs', etc. that's
the same syntax 'nix flake update' accepts for transitive
inputs, so the existing POST /meta-update path needs no
nix-side change.

depth limit of 2 keeps the panel readable — deeper transitives
(nixpkgs's own deps etc.) would explode it; bumping a level-2
entry re-fetches its sub-inputs anyway.

POST /meta-update's 'which agents to rebuild' derivation
updated for the slash names: anything under hyperhive/
fans out to all agents (shared base); 'agent-<n>/...' picks
out the agent name from before the first slash.

read_meta_locked_revs (used by the deployed:<sha> chip per
container) split out into its own straight root-input lookup
since the chip only cares about the agent's own input.
2026-05-16 04:12:04 +02:00
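The depth-limited walk from 78aa830430 can be sketched on a simplified in-memory model of flake.lock. `InputRef`, `LockGraph`, and `slash_paths` are hypothetical names; the real code reads the lock file and emits `MetaInputView` values, which this sketch does not attempt.

```rust
use std::collections::{BTreeMap, VecDeque};

/// An entry in a lock node's `inputs` map (simplified model of flake.lock).
pub enum InputRef {
    /// Points at another node by its key in the lock file.
    Node(String),
    /// A `follows` alias; it is not fetched on its own, so the walk skips it.
    Follows,
}

/// node key -> (input name -> reference)
pub type LockGraph = BTreeMap<String, BTreeMap<String, InputRef>>;

/// Breadth-first walk from `root`, returning slash-path names for every
/// fetched input at depth 1..=max_depth.
pub fn slash_paths(graph: &LockGraph, root: &str, max_depth: usize) -> Vec<String> {
    let mut out = Vec::new();
    // Queue entries: (node key, slash-path prefix so far, depth of the node).
    let mut queue: VecDeque<(String, String, usize)> = VecDeque::new();
    queue.push_back((root.to_owned(), String::new(), 0));
    while let Some((node, prefix, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // children would be deeper than the limit
        }
        let inputs = match graph.get(&node) {
            Some(inputs) => inputs,
            None => continue,
        };
        for (name, target) in inputs {
            if let InputRef::Node(key) = target {
                let path = if prefix.is_empty() {
                    name.clone()
                } else {
                    format!("{}/{}", prefix, name)
                };
                out.push(path.clone());
                queue.push_back((key.clone(), path, depth + 1));
            }
        }
    }
    out
}
```

Called with `max_depth = 2`, a root holding `hyperhive` and `agent-dmatrix` yields the direct names first and then entries such as `agent-dmatrix/mcp-matrix` and `hyperhive/nixpkgs`. The depth limit also bounds the walk if the graph contains cycles.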
müde
67e4242b9f per-agent send allow-list via hyperhive.allowedRecipients
new NixOS option in harness-base.nix:
  hyperhive.allowedRecipients = [ "alice" "manager" ];  # whitelist
  hyperhive.allowedRecipients = [ ];                    # default = unrestricted

module writes the list as JSON to /etc/hyperhive/send-allow.json
at activation. AgentServer::send reads the file before
issuing the broker request; if the list is non-empty and
`to` isn't on it, the tool returns a claude-readable refusal
string without touching the broker. the manager is always
implicitly permitted regardless of the list — otherwise a
misconfigured allow-list could strand a sub-agent without an
escalation path.

enforcement is in the in-container MCP server (not on the
host's per-agent socket) because the agent's nix config is the
trust boundary anyway — the operator audits agent.nix at
deploy time, the activation-time /etc/hyperhive/send-allow.json
is r/o under /nix/store, so the agent can't tamper at
runtime without going through a new approval.

agent prompt mentions the option + tells claude to route
through the manager when refused. retires the matching TODO
under Permissions / policy.
2026-05-16 03:59:28 +02:00
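The allow-list rule from 67e4242b9f reduces to a small predicate. The sketch below assumes the list has already been loaded from the JSON file and that the manager's recipient name is the literal `manager`; `check_send`, `MANAGER`, and the refusal wording are hypothetical.

```rust
/// Recipient name that is always permitted (assumed to be "manager").
pub const MANAGER: &str = "manager";

/// Decide whether a send to `to` is permitted.
/// An empty list means unrestricted, and the manager is always allowed so a
/// sub-agent keeps an escalation path even under a misconfigured list.
pub fn send_allowed(allow: &[String], to: &str) -> bool {
    allow.is_empty() || to == MANAGER || allow.iter().any(|a| a == to)
}

/// Return a readable refusal instead of contacting the broker.
pub fn check_send(allow: &[String], to: &str) -> Result<(), String> {
    if send_allowed(allow, to) {
        Ok(())
    } else {
        Err(format!(
            "send to '{}' refused: not in hyperhive.allowedRecipients; route through the manager",
            to
        ))
    }
}
```

The check runs before any broker request, so a refused send costs nothing on the host side.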
müde
d1c69b134a dashboard: reorder sections into grouped sequence
after reverting the 3-column attempt (74ba8a6), keep the
single-column layout but put related sections adjacent:

  swarm:     containers → kept-state → meta-inputs
  decisions: questions → approvals
  messages:  operator-inbox → message-flow + compose

this is a free improvement — the operator scrolls through one
logical group at a time instead of bouncing between swarm /
decisions / messages mid-page. follow-up improvements
(collapsing rarely-active sections, multi-column at wide
viewports done less aggressively) captured in TODO under
'Dashboard layout overhaul'.
2026-05-16 03:54:53 +02:00
müde
fe8fb15f8f Revert "dashboard: 3-column layout — swarm / 0per4t0r 1n / m3ss4g3s"
This reverts commit 74ba8a63e1.
2026-05-16 03:54:02 +02:00
müde
40938d8b54 dashboard: surface silent unwrap_or_default in api_state
every snapshot source backing /api/state used .unwrap_or_default()
— sqlite errors, broker errors, nixos-container list failures,
operator_questions decode crashes all degraded to empty lists
without a log line. the 'pending question doesn't render'
bug we've been chasing was likely a row-decode panic in
OperatorQuestions::pending() being swallowed this way.

new log_default(what, result) replaces each call site: same
default value on Err but emits target=api_state warn with the
source name + dbg error first. five sources covered:
nixos-container list, approvals.pending,
approvals.recent_resolved, broker.recent_for(operator),
questions.pending. next time the question goes missing the
journal will say which source failed and how.

todo updated — pending-question entry now points at the new
log instead of three suspect paths.
2026-05-16 03:49:49 +02:00
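The `log_default(what, result)` helper from 40938d8b54 might be shaped like this. The generic bounds are an assumption, and `eprintln!` stands in for the tracing warn with target=api_state so the sketch needs only the standard library.

```rust
use std::fmt::Debug;

/// Same fallback as `unwrap_or_default`, but the error is reported first
/// together with the name of the source that failed.
pub fn log_default<T: Default, E: Debug>(what: &str, result: Result<T, E>) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            // Stand-in for a tracing warn with target=api_state.
            eprintln!("api_state: {} failed: {:?}", what, err);
            T::default()
        }
    }
}
```

A call site would then read `log_default("questions.pending", questions.pending())` in place of `.unwrap_or_default()`. The snapshot shape stays the same, and the failure leaves a log line.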
müde
74ba8a63e1 dashboard: 3-column layout — swarm / 0per4t0r 1n / m3ss4g3s
regroups the 7 stacked sections into three semantic columns
backed by a CSS grid (single column under 1400px, 3 columns
above). column headers are sticky so vertical scrolling
inside a column doesn't lose context.

- SW4RM (left, slightly wider): containers + kept-state +
  spawn-agent form + meta-input update form. all
  swarm-mutating operator knobs live here.
- 0PER4T0R 1N (middle): mind-questions + pending approvals.
  the two things waiting on operator action.
- M3SS4G3S (right): operator-inbox + msg-flow tail + the
  @-mention compose box. broker traffic in one place.

spawn form moves out of renderApprovals into static HTML
under sw4rm; renderApprovals no longer injects it.

cosmetic: per-section h2/divider replaced with smaller cyan
sub-heads + a dashed underline so each column reads as one
cohesive unit instead of seven competing banners. body
max-width grows 70em → 110em to actually use the new
horizontal real estate.
2026-05-16 03:47:16 +02:00
müde
266c2c7a77 dashboard: meta flake inputs UI + sequential rebuild loop
new section 'M3T4 1NPUTS' between approvals and message flow:
one row per input in meta/flake.lock (hyperhive first, then
agent-<n> alphabetically). each row shows the input name, the
first 12 chars of the locked sha, a relative timestamp from
locked.lastModified, and the original.url when available.
checkbox per row; submit button is disabled until at least one
box is checked; submitting confirms then POSTs the selected
names to /meta-update.

backend:
- meta::lock_update(inputs: &[String]) — runs 'nix flake update
  <names>' in the meta dir, commits the lock change with a
  combined message ('lock update: hyperhive, agent-coder').
  preserves the existing META_LOCK serialization. existing
  lock_update_for_rebuild / lock_update_hyperhive stay for
  their single-input callers.
- POST /meta-update — comma-separated 'inputs' form field
  (JS joins checkboxes since axum::Form doesn't natively
  decode repeated keys); spawns a background task that runs
  the lock update + per-agent rebuild loop. hyperhive
  selection fans out to all agents; agent-<n> selection only
  rebuilds <n>. each rebuild fires Rebuilt to the manager
  exactly like dashboard / admin-CLI / auto-update.

rebuild loop is sequential — auto_update::run too (was
parallel via tokio::spawn). parallel rebuilds collide on
nix-store's sqlite cache ('sqlite db busy, not using cache')
and the meta META_LOCK contention. nix-daemon serializes the
heavy build steps anyway, so this isn't a throughput loss.
2026-05-16 03:38:07 +02:00
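A rough sketch of how the `inputs` form field in 266c2c7a77 could be decoded and turned into a rebuild list. `parse_inputs` and `rebuild_targets` are hypothetical names, and it is an assumption that the agent name is whatever follows the `agent-` prefix.

```rust
use std::collections::BTreeSet;

/// Split the comma-separated `inputs` form field into trimmed, non-empty names.
pub fn parse_inputs(field: &str) -> Vec<String> {
    field
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Work out which agents to rebuild: `hyperhive` fans out to every agent,
/// `agent-<n>` selects only `<n>`. The result is sorted and de-duplicated so
/// a sequential loop rebuilds each agent at most once.
pub fn rebuild_targets(inputs: &[String], all_agents: &[String]) -> Vec<String> {
    let mut targets = BTreeSet::new();
    for input in inputs {
        if input == "hyperhive" {
            targets.extend(all_agents.iter().cloned());
        } else if let Some(agent) = input.strip_prefix("agent-") {
            targets.insert(agent.to_owned());
        }
    }
    targets.into_iter().collect()
}
```

Iterating over the returned list one agent at a time matches the sequential loop the commit describes.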
müde
891223219e server: notify manager on admin-socket Rebuild outcomes
HostRequest::Rebuild was the only rebuild path that bypassed
notify_manager. dashboard / auto_update / actions::approve
already emit Rebuilt events on both success + failure, but a
'hive-c0re rebuild <name>' from the host CLI (and the recent
matrix-flake build failure that surfaced in journald) left the
manager in the dark.

mirror auto_update::rebuild_agent's pattern: on success →
Rebuilt{ok:true}, on failure → Rebuilt{ok:false, note=
format!('{e:#}')}. note carries the stderr tail lifecycle::run
collected (the actual nix error: missing prompt file, dep
build failure, etc.), so the manager has enough context to
adjust the agent's agent.nix without ssh-ing to the host.
2026-05-16 03:30:02 +02:00
müde
06af23c8a4 recv: None = peek, positive value = opt-in long-poll
old behavior: omitted wait_seconds fell through to the 30s
RECV_LONG_POLL_DEFAULT — claude calling 'is there anything in
my inbox right now?' between actions blocked the turn for half
a minute. flip the semantics: None (or 0) returns immediately,
positive value parks up to MAX (180s, unchanged). cleaner
'peek vs wait' distinction; tool descriptions + agent/manager
prompts updated to point at the new shape.

harness's own serve loops in hive-ag3nt + hive-m1nd relied on
the old default for their inbox poll. they now explicitly pass
wait_seconds: Some(180) to opt into the full park — same
effective behavior as before, just spelled out.

retires the matching TODO under Turn loop.
2026-05-16 03:22:42 +02:00
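The peek-versus-wait rule in 06af23c8a4 can be written as one match. `RecvMode`, `recv_mode`, and the constant name are hypothetical, and clamping to the 180s maximum is one reading of "parks up to MAX".

```rust
use std::time::Duration;

/// Upper bound on a long-poll (180s per the commit message).
pub const RECV_LONG_POLL_MAX: Duration = Duration::from_secs(180);

#[derive(Debug, PartialEq)]
pub enum RecvMode {
    /// Return immediately with whatever is in the inbox.
    Peek,
    /// Park until a message arrives or the duration elapses.
    Wait(Duration),
}

/// None or 0 means peek; a positive value opts in to a long-poll,
/// clamped to the maximum (the clamp is an assumption).
pub fn recv_mode(wait_seconds: Option<u64>) -> RecvMode {
    match wait_seconds {
        None | Some(0) => RecvMode::Peek,
        Some(secs) => RecvMode::Wait(Duration::from_secs(secs).min(RECV_LONG_POLL_MAX)),
    }
}
```

Under this reading the harness serve loops pass `Some(180)` and get the full park, while a tool call with the field omitted returns at once.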
müde
90df2106bf agent socket: external wake-up path for in-container MCP servers
new AgentRequest::Wake { from, body } drops a message into
this agent's inbox via the per-agent socket. matrix-style MCP
servers can use it when they receive an external event
(matrix message, webhook, scrape result) to nudge claude
into running a turn. broker.send wakes whatever Recv is
currently long-polling, the harness picks the message up,
formats a wake prompt with the caller's chosen from label
('matrix: new dm', 'webhook: deploy succeeded', etc.).

new `hive-ag3nt wake --from <label> --body <text>` subcommand
on the harness binary so MCP servers can shell out instead of
implementing the line-JSON protocol themselves; body=='-'
reads from stdin for multi-line / quoting-friendly payloads.

identity = socket: anything that can connect to /run/hive/mcp.sock
is implicitly trusted to inject. that's fine because the
bind-mount is the agent's own container; no new auth surface
opens up.

docs/turn-loop.md gets a new 'Waking the agent from inside
the container' section pointing at both paths (CLI + raw
JSON).
2026-05-16 03:15:58 +02:00
müde
96cb9f84c9 dashboard: approval history tab on P3NDING APPR0VALS
new tabs above the approvals list: 'pending · N' and
'history · M'. active tab persists in localStorage so the
operator can park on history if they prefer. on a fresh
dashboard the default is pending (matches the prior shape).

history view shows the last 30 resolved approvals — newest
first by resolved_at — with one row per approval: status
glyph (✓ approved / ✗ denied / ⚠ failed), id, agent, kind,
short sha, status label, and a relative time chip. when the
row has a note (deny reason or build error), it renders
below in a muted block with line wraps preserved.

backend: Approvals::recent_resolved(limit) queries by
status IN ('approved', 'denied', 'failed') ORDER BY
resolved_at DESC. StateSnapshot gets approval_history (a
lean ApprovalHistoryView without diff_html — rendering 30
git diffs per state poll would be expensive and the operator
already saw the diff at decision time). dashboard's
history_view fn projects the sqlite row.

retires the matching TODO entry.
2026-05-16 03:07:50 +02:00
müde
7276e6d5d9 git identity: shorten to 'c0re' across all helpers
lifecycle::GIT_{NAME,EMAIL}, meta::GIT_{NAME,EMAIL}, and the
inline strings migrate.rs uses for its bootstrap commits all
move from 'hive-c0re' / 'hive-c0re@hyperhive' to 'c0re' /
'c0re@hyperhive'. shows up shorter in git log everywhere
(applied + meta repos).
2026-05-16 03:02:44 +02:00
müde
8336017eda lifecycle: annotated tags need a tagger identity
git_tag_annotated planted failed/<id> + denied/<id> as
annotated tags via 'git tag -a' — which produces a git
object and therefore needs user.name + user.email. without a
global git config on the host that fell through to
'fatal: unable to auto-detect email address (got
root@muede-lpt2.(none))' and the tag never landed.

pass the hive-c0re identity inline with -c user.name=… -c
user.email=… (same shape git_commit already uses), so the
applied repo's deny/failure audit tags get planted reliably
without depending on the host user's git config.
2026-05-16 03:00:44 +02:00
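The inline-identity fix from 8336017eda amounts to putting `-c user.name=… -c user.email=…` in front of the `tag` subcommand. In the sketch below the function name `git_tag_annotated` is taken from the commit message, but both signatures and the `annotated_tag_args` helper are assumptions.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, ExitStatus};

/// Build the argv for an annotated tag with an inline tagger identity,
/// so the call does not depend on any global git config on the host.
pub fn annotated_tag_args(name: &str, email: &str, tag: &str, message: &str) -> Vec<String> {
    vec![
        "-c".to_owned(),
        format!("user.name={}", name),
        "-c".to_owned(),
        format!("user.email={}", email),
        "tag".to_owned(),
        "-a".to_owned(),
        tag.to_owned(),
        "-m".to_owned(),
        message.to_owned(),
    ]
}

/// Run `git tag -a` inside `repo` with the identity passed inline.
pub fn git_tag_annotated(
    repo: &Path,
    name: &str,
    email: &str,
    tag: &str,
    message: &str,
) -> io::Result<ExitStatus> {
    Command::new("git")
        .current_dir(repo)
        .args(annotated_tag_args(name, email, tag, message))
        .status()
}
```

Keeping the argv builder separate from the process spawn makes the identity flags easy to check without a git repository at hand.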
müde
c2bf0aa4f1 todo: approval history tab; retire streaming-output entry
new entry under UI/UX for an approval history tab on the
P3NDING APPR0VALS section — sqlite already has every row + the
applied repo's annotated denied/failed tags carry the
human-readable reasons, so this is a render-side change.

retire the 'stream nixos-container stdout' entry — landed in
6f1b664. run() now pipes child output line-by-line into
tracing so 'slow build' no longer looks like 'wedged daemon'.
2026-05-16 02:59:02 +02:00
müde
c92108a11c lifecycle: fetch into checked-out main with --update-head-ok
setup_applied does `git init --initial-branch=main` then
`git fetch <proposed> main:refs/heads/main` to seed the
applied repo with proposed's initial commit. git's default
safeguard refuses to fetch into the currently-checked-out
branch, even though the working tree is empty (we just init'd).
add --update-head-ok to bypass — the read-tree-reset
immediately after fetches the right state, so the safeguard
the flag bypasses isn't relevant here anyway.

repro from the user: spawn of 'dmatrix' failed with
  fatal: refusing to fetch into branch 'refs/heads/main'
  checked out at '/var/lib/hyperhive/applied/dmatrix'
2026-05-16 02:58:34 +02:00
müde
6f1b664c85 lifecycle: stream nixos-container stdout/stderr line-by-line
run() previously buffered the child's output via .output() and
only logged at exit — a multi-minute 'nixos-container update'
(typical on a fresh hyperhive bump) showed nothing in journald
until the very end. operator watching 'journalctl -u hive-c0re
-f' couldn't tell 'slow nix build' from 'wedged daemon'.

new shape: spawn with piped stdio, pump each line into tracing
as it arrives (stdout → INFO, stderr → WARN), keep a tail of
the last 32 stderr lines for the bail message so the eventual
'failed (status 2)' still carries the actual nix eval error.
target field 'nixos-container', argv-equivalent attached via
the 'cmdline' field so filtering by subcommand works.
2026-05-16 02:57:16 +02:00
müde
1278f880da docs: sync to current state of the world
claude.md scratchpad rewritten — folds in pronouns option,
extra MCP servers + flakeInputs forwarding, ask_operator
on sub-agents, dashboard compose box with @-mentions, new-
session button, cwd=/state for claude turns, meta-mutex +
stale-lock cleanup.

readme picks up the operator pronouns option example,
the dashboard compose box description, the new slash
commands list, the deployed-sha chip, the per-agent UI
gains new-session.

docs/web-ui.md gains:
- a fuller MESS4GE FL0W description that calls out the
  compose box, sticky @-mention recipient, /op-send, and
  the manager-name swap
- /op-send in the dashboard endpoint table
- new-session button + /new-session slash command in the
  per-agent surface
- compact endpoint now notes 'same session shape as a normal
  turn'

docs/turn-loop.md:
- new-session one-shot, cwd=/state with CLAUDE.md auto-load
  walking upward, operator-pronouns substitution
- sub-agent tool list grows ask_operator
- new 'Extra MCP servers (per-agent)' section documenting
  hyperhive.extraMcpServers + the flakeInputs forwarding
  pattern
2026-05-16 02:49:48 +02:00