agents weren't being woken with the 'you were rebuilt — check
/state/ for notes, --continue intact' system message after
several recent rebuild surfaces:
- auto_update::rebuild_agent — used by the dashboard rebuild
button, admin-CLI rebuild via lifecycle_action, the startup
rev-scan, AND the new meta-input update batch loop. kick
moves *into* rebuild_agent's success arm so all four
paths benefit. (the dashboard's lifecycle_action extra
closure was already firing kick — now it's a no-op for the
rebuild path since rebuild_agent does it.)
- actions::run_apply_commit — apply-commit approve flow built
+ tagged deployed/<id> but never kicked. add kick on
success with the more specific 'config update applied' hint.
- server.rs::HostRequest::Rebuild — the admin-CLI direct path
calls lifecycle::rebuild bypassing rebuild_agent. add kick
on success.
dashboard's restart / start lifecycle_action extras still
kick via their own closures since they don't route through
rebuild_agent. stop / kill / destroy intentionally don't
kick — there's nothing to wake.
read_meta_inputs() previously only included direct inputs of
meta's root node — so a manager-added 'inputs.mcp-matrix' in
agent-dmatrix's flake.nix never surfaced in the dashboard
panel even though it's a real fetched input that nix can
update.
now: BFS the flake.lock graph from root to depth 2. emits
one MetaInputView per fetched (non-follows) node, names are
slash-paths from root — 'hyperhive', 'agent-coder',
'agent-dmatrix/mcp-matrix', 'hyperhive/nixpkgs', etc. that's
the same syntax 'nix flake update' accepts for transitive
inputs, so the existing POST /meta-update path needs no
nix-side change.
depth limit of 2 keeps the panel readable — deeper transitives
(nixpkgs's own deps etc.) would explode it; bumping a level-2
entry re-fetches its sub-inputs anyway.
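a minimal sketch of the depth-limited walk described above, assuming the lock file has already been parsed into an in-memory map. `LockNode`, `InputRef` and `meta_input_names` are hypothetical names for illustration, not the actual read_meta_inputs() internals; the real code emits a MetaInputView per node rather than a bare name.

```rust
use std::collections::{BTreeMap, VecDeque};

/// An input edge in flake.lock: either a real node or a `follows` path.
enum InputRef {
    Node(String),
    Follows(Vec<String>),
}

/// A lock node reduced to its input edges, keyed by input name (hypothetical type).
struct LockNode {
    inputs: BTreeMap<String, InputRef>,
}

/// Walk the lock graph breadth-first from `root` and return slash-path
/// names for every fetched (non-follows) input, up to `max_depth` levels.
fn meta_input_names(
    nodes: &BTreeMap<String, LockNode>,
    root: &str,
    max_depth: usize,
) -> Vec<String> {
    let mut out = Vec::new();
    // Queue entries: (node key, slash-path so far, depth of that node).
    let mut queue = VecDeque::from([(root.to_string(), String::new(), 0usize)]);
    while let Some((key, path, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // children of this node would exceed the limit
        }
        let Some(node) = nodes.get(&key) else { continue };
        for (name, input) in &node.inputs {
            // `follows` edges alias another input; nothing is fetched for them.
            let InputRef::Node(target) = input else { continue };
            let child = if path.is_empty() {
                name.clone()
            } else {
                format!("{path}/{name}")
            };
            out.push(child.clone());
            queue.push_back((target.clone(), child, depth + 1));
        }
    }
    out
}
```

called with `max_depth = 2`, a root that has `hyperhive` and `agent-dmatrix` inputs yields the root-level names first and then entries such as `agent-dmatrix/mcp-matrix`; anything deeper is never visited.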
POST /meta-update's 'which agents to rebuild' derivation
updated for the slash names: anything under hyperhive/
fans out to all agents (shared base); 'agent-<n>/...' picks
out the agent name from before the first slash.
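the fan-out rule can be sketched as a small pure function. `agents_to_rebuild` and its parameters are hypothetical; the only behavior taken from the text above is "hyperhive (or anything under it) rebuilds every agent, agent-<n>/... rebuilds <n>".

```rust
use std::collections::BTreeSet;

/// Map selected meta-input names (slash-paths) to the agents to rebuild.
/// `all_agents` is the list of known agent names.
fn agents_to_rebuild(selected: &[&str], all_agents: &[&str]) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    for name in selected {
        // The top-level input is whatever precedes the first slash.
        let top = name.split('/').next().unwrap_or("");
        if top == "hyperhive" {
            // Shared base: every agent depends on it.
            out.extend(all_agents.iter().map(|a| a.to_string()));
        } else if let Some(agent) = top.strip_prefix("agent-") {
            out.insert(agent.to_string());
        }
    }
    out
}
```

a set keeps the result deduplicated, so selecting both `hyperhive` and `agent-coder` still rebuilds coder once.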
read_meta_locked_revs (used by the deployed:<sha> chip per
container) split out into its own straight root-input lookup
since the chip only cares about the agent's own input.
every snapshot source backing /api/state used .unwrap_or_default(),
so sqlite errors, broker errors, nixos-container list failures,
and operator_questions decode errors all degraded to empty lists
without a log line. the 'pending question doesn't render'
bug we've been chasing was likely a row-decode error in
OperatorQuestions::pending() being swallowed this way.
new log_default(what, result) replaces each call site: same
default value on Err but emits target=api_state warn with the
source name + dbg error first. five sources covered:
nixos-container list, approvals.pending,
approvals.recent_resolved, broker.recent_for(operator),
questions.pending. next time the question goes missing the
journal will say which source failed and how.
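a minimal sketch of the helper's shape, assuming a generic `Result<T, E>` with `T: Default` and `E: Debug`. the real helper logs a warn with target=api_state; `eprintln!` stands in for that here so the example has no dependencies.

```rust
use std::fmt::Debug;

/// Return the Ok value, or log the failure and fall back to `T::default()`.
fn log_default<T: Default, E: Debug>(what: &str, result: Result<T, E>) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            // Stand-in for a warn-level log line with target "api_state".
            eprintln!("WARN api_state: {what} failed: {err:?}");
            T::default()
        }
    }
}
```

call sites keep their old shape: `.unwrap_or_default()` on a source becomes `log_default("questions.pending", ...)` wrapped around the same result.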
todo updated — pending-question entry now points at the new
log instead of three suspect paths.
new section 'M3T4 1NPUTS' between approvals and message flow:
one row per input in meta/flake.lock (hyperhive first, then
agent-<n> alphabetically). each row shows the input name, the
first 12 chars of the locked sha, a relative timestamp from
locked.lastModified, and the original.url when available.
checkbox per row; submit button is disabled until at least one
box is checked; submitting confirms then POSTs the selected
names to /meta-update.
backend:
- meta::lock_update(inputs: &[String]) — runs 'nix flake update
<names>' in the meta dir, commits the lock change with a
combined message ('lock update: hyperhive, agent-coder').
preserves the existing META_LOCK serialization. existing
lock_update_for_rebuild / lock_update_hyperhive stay for
their single-input callers.
- POST /meta-update — comma-separated 'inputs' form field
(JS joins checkboxes since axum::Form doesn't natively
decode repeated keys); spawns a background task that runs
the lock update + per-agent rebuild loop. hyperhive
selection fans out to all agents; agent-<n> selection only
rebuilds <n>. each rebuild fires Rebuilt to the manager
exactly like dashboard / admin-CLI / auto-update.
rebuild loop is sequential, and auto_update::run now is too (was
parallel via tokio::spawn). parallel rebuilds collide on
nix-store's sqlite cache ('sqlite db busy, not using cache')
and contend on META_LOCK. nix-daemon serializes the heavy
build steps anyway, so this isn't a throughput loss.
new tabs above the approvals list: 'pending · N' and
'history · M'. active tab persists in localStorage so the
operator can park on history if they prefer. on a fresh
dashboard the default is pending (matches the prior shape).
history view shows the last 30 resolved approvals — newest
first by resolved_at — with one row per approval: status
glyph (✓ approved / ✗ denied / ⚠ failed), id, agent, kind,
short sha, status label, and a relative time chip. when the
row has a note (deny reason or build error), it renders
below in a muted block with line wraps preserved.
backend: Approvals::recent_resolved(limit) queries by
status IN ('approved', 'denied', 'failed') ORDER BY
resolved_at DESC. StateSnapshot gets approval_history (a
lean ApprovalHistoryView without diff_html — rendering 30
git diffs per state poll would be expensive and the operator
already saw the diff at decision time). dashboard's
history_view fn projects the sqlite row.
retires the matching TODO entry.
new section under MESS4GE FL0W. msgflow already tails only
broker traffic (sent + delivered), which is exactly the
'messages through core' view the operator wants; no
per-agent thinking leaks through. compose box below:
- a prompt span renders the sticky recipient ('@coder>'),
rendered outside the textarea so it can't be edited
inadvertently. on submit the recipient gets persisted to
localStorage so it survives reload.
- start the input with '@name body' to redirect — the parser
splits at the first whitespace and the new recipient
becomes sticky.
- typing '@' at the start opens a completion dropdown over
the textarea pulled from window.__hyperhive_state.containers;
arrow keys cycle, tab/enter selects, escape closes. clicking
works too.
- manager swap: agents flagged is_manager are surfaced as
'@manager' (the broker's recipient string) instead of
'@hm1nd' (the container name), so the message actually
routes to the manager's inbox.
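the redirect rule from the list above, sketched in rust for consistency with the backend examples (the actual parser lives in the dashboard's frontend script). `parse_compose` is a hypothetical name; a bare '@name' with no body is treated as plain text here, which is an assumption.

```rust
/// Split a compose-box line into (recipient, body).
/// A leading `@name` overrides `sticky`; otherwise the sticky recipient is kept.
fn parse_compose<'a>(input: &'a str, sticky: &'a str) -> (&'a str, &'a str) {
    if let Some(rest) = input.strip_prefix('@') {
        // Split at the first whitespace: name before it, body after it.
        if let Some((name, body)) = rest.split_once(char::is_whitespace) {
            if !name.is_empty() {
                return (name, body.trim_start());
            }
        }
    }
    (sticky, input)
}
```

the caller would then store the returned recipient as the new sticky value before sending.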
backend: new POST /op-send accepts {to, body} and drops a
broker.send({from:'operator', to, body}) — same shape as the
per-agent web UI's OperatorMsg, but lets the operator choose
the recipient explicitly from the main dashboard.
new AgentRequest::AskOperator + AgentResponse::QuestionQueued on
the per-agent socket: same shape as the manager flavor, so agents
get the same wire surface (and still share the operator_questions
table). agent_server::dispatch wires AskOperator through
coord.questions.submit(agent, ...) so the row's asker is the
sub-agent name; the ttl watchdog already in manager_server gets
shared and spawn_question_watchdog goes pub.
answer routing: operator_questions::answer now returns (question,
asker). post_answer_question + post_cancel_question + the watchdog
fire OperatorAnswered through new coord.notify_agent(asker, event)
instead of always notify_manager — the event lands in whichever
agent originally asked. notify_manager is now a thin wrapper.
agent socket plumbing: agent_server::start takes Arc<Coordinator>
instead of Arc<Broker> so dispatch has access to questions +
notify path; coordinator::{register_agent,ensure_runtime} take
self: &Arc<Self>. mcp::AgentServer grows the ask_operator tool;
allowed_mcp_tools(Agent) adds it; prompts/agent.md replaces the
'message the manager to ask the operator' guidance with the
direct tool description.
ContainerView grows deployed_sha (first 12 chars of the rev
that /var/lib/hyperhive/meta/flake.lock currently has locked
for agent-<name>). renderContainers appends a 'deployed:<sha12>'
chip next to the container name + port — title attribute
explains it's the meta-lock sha. degrades gracefully when the
meta repo isn't seeded yet (missing / unparsable lock = empty
map = no chip). new read_meta_locked_revs helper does the JSON
parsing without unwraps.
approval_diff now runs
git diff refs/heads/main..refs/tags/proposal/<id>
against the applied repo instead of cobbling a single-file diff
from proposed. consequences: multi-file
proposals show every change, manager amendments in proposed
cannot lie about what'll be deployed, no-op proposals render
an explicit '(proposal matches currently-deployed tree)'.
displayed sha prefers fetched_sha (hive-c0re-vouched) and
falls back to commit_ref only for the brief pre-fetch window.
unified_diff helper + similar dep dropped — git diff is the
source of truth now. dead-code allows on the lifecycle git
helpers + approvals.set_fetched_sha come off since all are
wired up. readme picks up the tag flow + /applied RO mount.
apply-commit denials now leave a git object behind: tag
denied/<id> annotated with the operator's note (or empty body
if they didn't supply one) at proposal/<id> inside the applied
repo. rejected configs become first-class git history — git
show denied/<id> in the manager's applied.git mount yields the
tree the operator rejected plus the reason. helper event
carries the tag for parity with deployed/failed. spawn denials
fall through unannotated since they have no proposal commit.
deny becomes async (single git plumbing call); dashboard +
admin-socket callers grow .await.
manager is fixed at 8000, sub-agents are 8100-8999, so collisions
are strictly between two sub-agents hashing to the same value.
the colliding container's harness restart-loops on AddrInUse —
which the user just hit on :8945. previously the only sign was a
buried journalctl warn line.
now surfaced two ways:
- lifecycle::spawn / rebuild preflight: walks the live container
list, computes each agent's hashed port, refuses with
'port N already taken by <other> — rename one of them' if any
running sub-agent shares the new agent's port. so the operator
sees an actionable error in the dashboard's transient pill /
approve-result instead of waiting for the harness to die.
- /api/state grows a port_conflicts: [{port, agents: [...]}]
array; dashboard renders a pulsing red banner above the
containers list listing each cluster. matches the questions
panel pulse so it's hard to miss.
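both surfaces reduce to grouping agents by port. a sketch under the assumption that each agent's hashed port is already computed (the hash itself is not shown in this log); `PortConflict`, `port_conflicts` and `preflight` are hypothetical names, and the error wording only approximates the message quoted above.

```rust
use std::collections::BTreeMap;

/// One cluster of agents that resolve to the same web UI port.
#[derive(Debug, PartialEq)]
struct PortConflict {
    port: u16,
    agents: Vec<String>,
}

/// Group (agent, port) pairs by port and keep only ports claimed by 2+ agents.
fn port_conflicts(agents: &[(&str, u16)]) -> Vec<PortConflict> {
    let mut by_port: BTreeMap<u16, Vec<String>> = BTreeMap::new();
    for (name, port) in agents {
        by_port.entry(*port).or_default().push(name.to_string());
    }
    by_port
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .map(|(port, agents)| PortConflict { port, agents })
        .collect()
}

/// Preflight for spawn/rebuild: refuse if another running agent holds `port`.
fn preflight(new_agent: &str, port: u16, running: &[(&str, u16)]) -> Result<(), String> {
    match running.iter().find(|(name, p)| *p == port && *name != new_agent) {
        Some((other, _)) => Err(format!(
            "port {port} already taken by {other}; rename one of them"
        )),
        None => Ok(()),
    }
}
```

excluding `new_agent` from the scan matters for rebuilds, where the agent being rebuilt is itself in the running list.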
clicking DENY on the dashboard now prompts for an optional reason
('reason for denying (optional, sent to manager):'). the value
rides along as a hidden 'note' form field; backend chain:
POST /deny/{id} { note }
→ actions::deny(coord, id, Some(note))
→ Approvals::mark_denied writes it to the row
→ HelperEvent::ApprovalResolved { ..., note: Some("...") }
manager already had note: Option<String> on the event, just never
populated for denials before. host admin socket (hive-c0re deny)
still passes None.
generalized the prompt-on-submit pattern: any form with a
data-prompt attribute pops a window.prompt() before the POST and
stashes the answer in a hidden input named by data-prompt-field
(default 'note'). reusable for future opt-in note fields.
new GET /api/agent-config/{name} returns the contents of
/var/lib/hyperhive/applied/<name>/agent.nix — the file the
container actually builds against. validated against the live
container list to avoid arbitrary filesystem reads.
frontend mirrors the journald viewer: collapsed <details> on each
container row, lazy-fetches on expand, refresh button re-fetches.
restore-keyed (agent-config:<name>) so it survives the dashboard
heartbeat refresh.
read-only — mutating the applied config goes through the existing
request_apply_commit + operator approval flow.
- runtime model override: Bus::{model,set_model} + POST /api/model
(form-encoded {model: name}). turn.rs reads bus.model() per turn
so a flip lands on the next claude invocation. /api/state grows
a model field; agent page shows a 'model · <name>' chip in the
state row. '/model <name>' slash command POSTs to the endpoint
and refreshes state.
- port regression fix: agent_web_port no longer probes forward for
*existing* agents (the previous fix shifted ports for any agent
without a port file, including legacy ones whose container was
already bound to the bare hashed port — dashboard rendered the
new port, container was still on the old one, conn errors). new
rule: port file exists → use it; absent + applied flake present
→ legacy, persist port_hash without probing; absent + no applied
flake → fresh spawn, probe forward.
- SO_REUSEADDR on both the dashboard and per-agent web UI binds
via tokio::net::TcpSocket. operator hit 12 retries failing on
manager :8000 — REUSEADDR handles the TIME_WAIT case cleanly
without a new dep; retry still covers the genuine
process-still-alive overlap.
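the three-way port rule from the regression fix above, as a sketch. `PortSource` and `resolve_port` are hypothetical names; the 8999 upper bound for probing is an assumption borrowed from the sub-agent range mentioned elsewhere in this log, and `is_free` stands in for whatever availability check the real code uses.

```rust
/// Where an agent's web UI port comes from under the three-way rule.
#[derive(Debug, PartialEq)]
enum PortSource {
    /// A port file exists: use the persisted value.
    Persisted(u16),
    /// No port file, applied flake present: legacy agent, keep the bare hash.
    LegacyHash(u16),
    /// No port file, no applied flake: fresh spawn, probe forward from the hash.
    Probed(u16),
}

fn resolve_port(
    port_file: Option<u16>,
    applied_flake_present: bool,
    port_hash: u16,
    is_free: impl Fn(u16) -> bool,
) -> PortSource {
    match (port_file, applied_flake_present) {
        (Some(port), _) => PortSource::Persisted(port),
        (None, true) => PortSource::LegacyHash(port_hash),
        (None, false) => {
            // Probe upward from the hash; fall back to the hash if nothing is free.
            let port = (port_hash..=8999).find(|p| is_free(*p)).unwrap_or(port_hash);
            PortSource::Probed(port)
        }
    }
}
```

only the last arm ever calls `is_free`, which is the point of the fix: existing agents never get shifted.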
todo: drops the model-override entry (shipped); adds two new
items — model persistence (optional, future), and custom
per-agent MCP tools (groundwork for moving bitburner-agent into
hyperhive).
new GET /api/journal/{name}?unit=&lines= shells out journalctl -M
<container> -b --no-pager --output=short-iso --lines=<N> (cap 5000).
optional unit filter, restricted to hive-ag3nt.service /
hive-m1nd.service so the shell-out can't be coerced into reading
unrelated units. validates the container name against the live list
before invoking journalctl.
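a sketch of the validation that sits in front of the shell-out, building only the argument list. `journalctl_args` is a hypothetical name; clamping to the cap (rather than rejecting oversized requests) and passing the unit via `--unit=` are assumptions.

```rust
const MAX_LINES: u32 = 5000;
const ALLOWED_UNITS: [&str; 2] = ["hive-ag3nt.service", "hive-m1nd.service"];

/// Validate the request and build the journalctl argument list.
fn journalctl_args(
    container: &str,
    live: &[&str],
    unit: Option<&str>,
    lines: u32,
) -> Result<Vec<String>, String> {
    // Only containers in the live list may be queried.
    if !live.contains(&container) {
        return Err(format!("unknown container: {container}"));
    }
    let mut args = vec![
        "-M".to_string(),
        container.to_string(),
        "-b".to_string(),
        "--no-pager".to_string(),
        "--output=short-iso".to_string(),
        format!("--lines={}", lines.min(MAX_LINES)),
    ];
    if let Some(unit) = unit {
        // The unit filter is restricted to the two harness services.
        if !ALLOWED_UNITS.contains(&unit) {
            return Err(format!("unit not allowed: {unit}"));
        }
        args.push(format!("--unit={unit}"));
    }
    Ok(args)
}
```

the result can be handed to `std::process::Command::new("journalctl").args(args)`; since no shell is involved, the arguments are never re-parsed.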
frontend renders a collapsed '↳ logs · <container>' details block
on each container row. expanding triggers a lazy fetch; refresh
button re-fetches; unit dropdown switches between the harness
service (default) and the full machine journal. output sits in a
24em-tall monospace pre, auto-scrolled to the bottom on fresh
fetch.
hive-c0re's systemd unit already runs as root, so journalctl has
the access it needs.
operator hit 'Address already in use (os error 98)' on a harness
restart — the new harness raced the old socket's release. add a
bind_with_retry helper that backs off (250ms doubling, capped at
2s, 12 tries ≈ 20s total) on AddrInUse before giving up. applied
to both the per-agent web UI and the hive-c0re dashboard.
proper fix would be SO_REUSEADDR via socket2 but retry covers the
TIME_WAIT case fine and keeps the dep count down. Other bind errors
still fail immediately (port permission, fd exhaustion).
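a blocking std-only sketch of the retry shape; the real helper is async and its exact signature is not shown here, so `backoff_delay` and this `bind_with_retry` signature are assumptions. only AddrInUse is retried, everything else returns at once.

```rust
use std::io::ErrorKind;
use std::net::{SocketAddr, TcpListener};
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): 250ms doubling, capped at 2s.
fn backoff_delay(attempt: u32) -> Duration {
    let ms = 250u64.saturating_mul(1u64 << attempt.min(10));
    Duration::from_millis(ms.min(2_000))
}

/// Bind `addr`, retrying only on AddrInUse; any other error fails at once.
fn bind_with_retry(addr: SocketAddr, tries: u32) -> std::io::Result<TcpListener> {
    let mut attempt = 0;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            // Sleep and retry while attempts remain; the final failure falls through.
            Err(err) if err.kind() == ErrorKind::AddrInUse && attempt + 1 < tries => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

keeping the schedule in its own function makes the backoff easy to check without opening sockets.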
new wire request AgentRequest::Recent { limit } / ManagerRequest::Recent
(plus matching responses with Vec<InboxRow>). InboxRow moved to
hive-sh4re so it lives on both surfaces without an internal-to-wire
conversion. host-side dispatch in agent_server / manager_server
calls broker.recent_for(name, limit).
per-agent web_ui /api/state grew an inbox: Vec<InboxRow> populated
via the same per-agent socket (best-effort; transport failure
returns empty). frontend renders as a collapsible <details> section
between the state row and the terminal — fmt timestamp / from /
body in a tight grid, capped at 16em scrollable. only visible when
there are rows.
new POST /cancel-question/{id} resolves a pending operator question
with the sentinel answer '[cancelled]' and fires the usual
HelperEvent::OperatorAnswered so the manager sees a terminal state
and can fall back. uses the same OperatorQuestions::answer path —
no special handling, the manager already has to deal with arbitrary
answer strings.
dashboard renders the cancel as a separate <form> below the main
qform so the answer-merge submit handler on the main form doesn't
inadvertently fire when the operator clicks cancel. confirm dialog
spells out what the manager will see.
ttl-based auto-cancel is still on the todo (would spawn a tokio task
per submitted question).
new Coordinator::kick_agent(name, reason) drops a system message
into the agent's inbox so the next turn picks it up with a 'you
were just (re)started, check /state/ for notes, --continue session
is intact' hint. wakes the turn loop without any harness-side
handling needed — it's just another inbox message with sender =
'system'.
wired from:
- dashboard /start /restart /rebuild handlers (via lifecycle_action's
on-success tail)
- manager mcp_hyperhive_start / restart
dashboard: pending approvals + tombstones + questions now refresh on
a 5s heartbeat when nothing else is happening. previously refresh
only fired on async-form submit or on broker traffic addressed to
operator — manager-queued approvals went through neither, so the
operator had to reload to see them. 5s is the slow-path; 2s
remains for in-flight transients.
drops the #[allow(clippy::too_many_lines)] on api_state by extracting
four pure helpers:
- build_container_views — live containers + any_stale flag
- build_transient_views — agents in pre-creation Spawning state only
- build_approval_views — pending approvals with diff html
- build_tombstone_views — destroyed-but-kept state dirs
api_state itself is now ~30 lines of orchestration. zero behavior
change. each helper is independently readable + testable.
four POST handlers (post_kill / post_restart / post_start / post_rebuild)
were all repeating the same boilerplate: strip prefix, set_transient,
call lifecycle::X, clear_transient, match the result. extract one
helper that takes the transient kind, error-message verb, the work
body, and an optional 'on success' tail (used by kill to also
unregister + emit HelperEvent::Killed). each handler shrinks to a
single lifecycle_action(..) call. zero behavior change.
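a synchronous sketch of the extracted helper's control flow. the real one is async and talks to the coordinator; here the transient map is a plain HashMap, the "agent-" prefix is an assumption (the text only says "strip prefix"), and the variant list is illustrative.

```rust
use std::collections::HashMap;

/// Illustrative subset of transient kinds.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TransientKind {
    Starting,
    Restarting,
    Stopping,
    Rebuilding,
}

/// Shared handler body: mark the transient, run the work, clear the
/// transient, then run the optional success tail or format the error.
fn lifecycle_action(
    transients: &mut HashMap<String, TransientKind>,
    raw_name: &str,
    kind: TransientKind,
    verb: &str,
    work: impl FnOnce(&str) -> Result<(), String>,
    on_success: Option<&dyn Fn(&str)>,
) -> Result<(), String> {
    // Assumed prefix; the source does not name it.
    let name = raw_name.strip_prefix("agent-").unwrap_or(raw_name);
    transients.insert(name.to_string(), kind);
    let result = work(name);
    // Clear the transient on both success and failure.
    transients.remove(name);
    match result {
        Ok(()) => {
            if let Some(tail) = on_success {
                tail(name);
            }
            Ok(())
        }
        Err(err) => Err(format!("{verb} {name} failed: {err}")),
    }
}
```

a kill handler would pass a tail that unregisters the agent and emits the Killed event; handlers with nothing extra to do pass `None`.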
new section between containers and questions: lists every name with a
state dir under /var/lib/hyperhive/agents/ that doesn't correspond to
a live container. shows state size + last-modified age + whether
claude creds are kept. two actions per row:
- R3V1V3 — queues a spawn approval with the same name (operator
approves to recreate; spawn flow reuses prior config + claude
creds, no re-login needed)
- PURG3 — wipes the agent's state + applied dirs (post /purge-tombstone/
endpoint; refuses if a live container with that name still exists)
dashboard also opens agent links in new tabs now (target=_blank +
rel=noopener) so the operator's overview tab stays put when they
dive into an agent.
backend:
- TransientKind grows Starting / Stopping / Restarting / Rebuilding /
Destroying alongside the existing Spawning. each dashboard handler
(start/restart/kill/rebuild/destroy) wraps the lifecycle call with
set_transient + clear_transient so the dashboard knows what's in
flight. transient kind is surfaced inline on ContainerView.pending
(existing-container actions) — only Spawning (pre-creation) lands
in the separate transients list.
frontend:
- container row is now two lines: identity + meta on top, action
buttons below. less cluttered, leaves room for the pending state
pill. pending rows dim their actions and surface a pulsing
'◐ spawning… / starting… / stopping… / restarting… / rebuilding…
/ destroying…' indicator next to the name.
- 'needs login' / 'needs update' chips moved into a unified .badge
styling for consistency.
- auto-refresh kicks in not only on transient spawn but on any
container with a pending action.
new --purge flag on the destroy verb (cli + admin socket + dashboard).
default destroy still keeps /var/lib/hyperhive/{agents,applied}/<name>/
so recreating with the same name reuses prior config + creds.
with --purge, both dirs go too (config history, claude creds, /state/
notes). no undo. dashboard adds a separate PURG3 button with an
explicit confirmation copy; the existing DESTR0Y button keeps the
soft semantics.
claude.md dashboard-action-surface section updated; todo entry
dropped.
new mcp tool on the manager surface that queues a question on the
dashboard and returns the question id immediately. operator submits an
answer via /answer-question/<id>; the dashboard fires
HelperEvent::OperatorAnswered { id, question, answer } into the manager
inbox so the next turn picks it up.
also: fix async-form button stuck on spinner after successful submit
(refreshState skipped re-rendering, so the button was never re-enabled).