on startup (and after every applied-repo ref mutation) core pushes
each agent's hive-c0re-owned applied repo — main plus every
proposal/approved/building/deployed/failed/denied tag — to
agent-configs/<name> on the local forge. the org is private and
agents are not members, so core is the only principal that can read
it.
the tokenised push url is passed inline, never stored as a named
remote: the applied repo is bind-mounted read-only into the manager,
so a token in .git/config would leak the core admin credential to an
agent.
push_config is best-effort at every site (ensure_all, spawn,
approve, deny, submit) — a missing or down forge never blocks a
deploy.
new sqlite table at /state/hyperhive-turn-stats.sqlite on each
agent's state dir. one row per claude turn captures identity
(model, wake_from, result_kind), timing (started/ended_at,
duration_ms), cost (input/output/cache_read/cache_creation token
counts), behaviour (tool_call_count + per-tool breakdown JSON),
and post-turn snapshot metrics (open_threads_count,
open_reminders_count).
wire additions:
- AgentRequest/ManagerRequest::CountPendingReminders +
Broker::count_pending_reminders_for(agent)
- Bus::observe_stream + take_tool_calls — pumps the existing
stdout stream-json, picks out tool_use blocks, accumulates per
turn. bin loops fold the breakdown into each row.
- TurnStats::open_default + TurnStatRow + record() — best-effort
inserts; failures log + don't block the harness.
both ag3nt and m1nd bins capture started_at + duration via
Instant::elapsed, fetch open-thread + reminder counts from
hive-c0re via the existing socket (post-turn, best-effort), and
record one row at turn_end. record_kind splits ok / failed /
prompt_too_long; failures carry the error message in note.
todo entries for host-side vacuum sweep + reading the table back
into agent/dashboard badges.
questions pane now shows both operator-targeted threads
(target IS NULL) and agent-to-agent threads (target = some
agent). filter chips above the list: all / @operator / @peer /
per-participant. peer rows get a mauve left rule + a 0V3RR1D3
button that POSTs the same /answer-question endpoint
(OperatorQuestions::answer already permits the operator as
answerer on any target).
wire changes: OperatorQuestions gains pending_all +
recent_answered_all; QuestionAdded + QuestionResolved events
carry target: Option<String>; emit sites drop their
target.is_none() guard. answered-history rows show the
answerer prefix so override answers are auditable at a glance.
bare set_transient/clear_transient pairs leak the in-memory transient
on task cancellation, panics, or any early return between the two
calls — dashboard then shows the agent stuck in 'rebuilding…'
forever (coder hit this today). add Coordinator::transient_guard
returning a TransientGuard whose Drop clears, and convert every
caller (dashboard lifecycle_action, auto_update::rebuild_agent,
manager_server Update, actions::destroy, actions Spawn task,
migrate phase 4). destroy() now takes &Arc<Coordinator> so it can
hold a guard. existing stuck transients clear on next hive-c0re
restart since transient state is in-memory only.
old behavior: omitted wait_seconds fell through to the 30s
RECV_LONG_POLL_DEFAULT — claude calling 'is there anything in
my inbox right now?' between actions blocked the turn for half
a minute. flip the semantics: None (or 0) returns immediately,
positive value parks up to MAX (180s, unchanged). cleaner
'peek vs wait' distinction; tool descriptions + agent/manager
prompts updated to point at the new shape.
harness's own serve loops in hive-ag3nt + hive-m1nd relied on
the old default for their inbox poll. they now explicitly pass
wait_seconds: Some(180) to opt into the full park — same
effective behavior as before, just spelled out.
retires the matching TODO under Turn loop.
new boilerplate wraps agent.nix as a sub-module + passes every
flake input (minus self) through to it via _module.args.flake
Inputs. manager edits the inputs block of flake.nix to pull in
out-of-tree flakes (MCP servers etc.) and references them in
agent.nix as flakeInputs.<name>.packages.${pkgs.system}.default
— the new input's pinned sha lands in the agent's own flake
.lock (already tracked + part of the proposal flow), and
transitively rolls up into meta's lock.
migrate's MODULE_FLAKE_MARKER swaps to _module.args.flakeInputs
so existing agents on the old 'nixosModules.default = import
./agent.nix' template get re-rendered onto the new shape on
next hive-c0re start.
manager_server's flake.nix tamper-check goes away — the build
path's failed/<id> annotated tag already provides the safety
net when a manager edit breaks the flake; enforcing 'no
flake.nix edits at all' was overly strict (blocks the inputs-
addition pattern that's the whole point of this change).
manager prompt updated with a worked example for adding an
MCP-server flake input + wiring it through agent.nix.
new AgentRequest::AskOperator + AgentResponse::QuestionQueued on
the per-agent socket — same shape as the manager flavor, agent
gets the same wire surface (still uses the same operator_questions
table). agent_server::dispatch wires AskOperator through coord
.questions.submit(agent, ...) so the row's asker is the sub-agent
name; the ttl watchdog already in manager_server gets shared and
spawn_question_watchdog goes pub.
answer routing: operator_questions::answer now returns (question,
asker). post_answer_question + post_cancel_question + the watchdog
fire OperatorAnswered through new coord.notify_agent(asker, event)
instead of always notify_manager — the event lands in whichever
agent originally asked. notify_manager is now a thin wrapper.
agent socket plumbing: agent_server::start takes Arc<Coordinator>
instead of Arc<Broker> so dispatch has access to questions +
notify path; coordinator::{register_agent,ensure_runtime} take
self: &Arc<Self>. mcp::AgentServer grows the ask_operator tool;
allowed_mcp_tools(Agent) adds it; prompts/agent.md replaces the
'message the manager to ask the operator' guidance with the
direct tool description.
submit_apply_commit now diffs the freshly-tagged proposal/<id>
against applied/main and refuses if flake.nix is in the
changeset. flake.nix is fixed boilerplate the meta flake
depends on (it exports nixosModules.default = import ./agent
.nix); silent edits there would break the nixosConfiguration
in subtle ways. the manager prompt already says don't touch
it; this is the host-side belt — clear error to the manager
on submit, row marked failed in sqlite, no orphan pending
approval to chase. diff-failure is logged + ignored: the
build path surfaces concrete errors if flake.nix is actually
broken.
submit_apply_commit (1) queues the approval row, (2) git-fetches
the manager-supplied sha from proposed into applied, pins it as
refs/tags/proposal/<id>, (3) persists the resolved sha on the
row via approvals.set_fetched_sha. from this point on the
proposal is immutable from the manager's perspective: amends
or force-pushes in proposed do not change what hive-c0re will
build. fetch failures mark the row failed and surface the error
to the manager so a phantom pending entry can't linger.
crash_watch grows two more state-axes alongside running/stopped:
- logged-in (claude session dir populated for the agent)
- up-to-date (recorded flake rev matches current)
per-tick transitions emit HelperEvent::NeedsLogin / LoggedIn /
NeedsUpdate. seed-on-first-tick semantics retained — nothing fires
on harness boot for agents that were already in their state. only
needs_update fires the 'stale appeared' direction; the resolved
direction is already covered by Rebuilt.
new mcp__hyperhive__update(name) on the manager surface: idempotent
rebuild via auto_update::rebuild_agent. transient-aware (Rebuilding)
so the dashboard shows the spinner. login intentionally has NO tool
— it's interactive OAuth, only the operator can complete it.
prompts + approvals doc + turn-loop doc updated. todo grows a
'show per-agent applied config in dashboard' entry (separate
follow-up).
recv-with-timeout is strictly better than a fixed sleep because it
wakes instantly on incoming messages. drop the half-written nap MCP
tool, raise the recv wait_seconds cap from 60s to 180s on both
agent and manager sockets.
prompts updated: agent.md + manager.md now spell out the pattern —
when there's nothing else useful to do, call recv with
wait_seconds=180 to park the turn; do NOT use Bash sleep for the
same purpose. todo drops the nap entry and the napping-state-badge
follow-up; both replaced by 'just use a long recv'.
AgentRequest::Recv and ManagerRequest::Recv grow an optional
wait_seconds field (default None → 30s, capped at 60s server-side).
agent_server / manager_server clamp via recv_timeout(). MCP tool
schemas advertise the param so claude can pick its own poll window
— useful when an agent wants to throttle wakes without entering a
distinct nap state.
both harness loops still pass None, keeping the existing 30s
default behaviour for system-level Recvs.
manager can pass ttl_seconds to ask_operator. on submit, host
stores deadline_at = now + ttl in operator_questions (new column,
migrated via existing pragma_table_info pattern), spawns a tokio
task that sleeps until the deadline then resolves the question with
answer '[expired]' and fires the same OperatorAnswered helper event.
already-resolved races no-op silently.
dashboard renders a '⏳ MM:SS' chip on the question row when
deadline_at is set. format collapses seconds → s, < 1h → m s, ≥ 1h
→ h m. heartbeat refresh (5s) keeps the chip current; the operator
sees it tick down.
manager prompt + mcp tool description updated. journald viewer per
container queued in todo (separate task).
new wire request AgentRequest::Recent { limit } / ManagerRequest::Recent
(plus matching responses with Vec<InboxRow>). InboxRow moved to
hive-sh4re so it lives on both surfaces without an internal-to-wire
conversion. host-side dispatch in agent_server / manager_server
calls broker.recent_for(name, limit).
per-agent web_ui /api/state grew an inbox: Vec<InboxRow> populated
via the same per-agent socket (best-effort; transport failure
returns empty). frontend renders as a collapsible <details> section
between the state row and the terminal — fmt timestamp / from /
body in a tight grid, capped at 16em scrollable. only visible when
there are rows.
new Coordinator::kick_agent(name, reason) drops a system message
into the agent's inbox so the next turn picks it up with a 'you
were just (re)started, check /state/ for notes, --continue session
is intact' hint. wakes the turn loop without any harness-side
handling needed — it's just another inbox message with sender =
'system'.
wired from:
- dashboard /start /restart /rebuild handlers (via lifecycle_action's
on-success tail)
- manager mcp_hyperhive_start / restart
dashboard: pending approvals + tombstones + questions now refresh on
a 5s heartbeat when nothing else is happening. previously refresh
only fired on async-form submit or on broker traffic addressed to
operator — manager-queued approvals went through neither, so the
operator had to reload to see them. 5s is the slow-path; 2s
remains for in-flight transients.
ask_operator now accepts a multi: bool. when true and options is
non-empty, the dashboard renders the choices as checkboxes — operator
picks any subset, answer comes back as a ', '-joined string. when
false (default), options are radio buttons.
independent of multi, a free-text input ('or type your own…') is
always rendered alongside options so the operator is never trapped
by an incomplete list. submit merges checked options + free text into
the single 'answer' field.
schema migration: operator_questions grows a multi INTEGER column
with a one-shot ALTER TABLE on open. backward compatible — old rows
default to 0 (not multi).
prompt + mcp tool description updated; existing dashboard css for
.qform was rewritten around the new vertical layout.
new manager tools mcp__hyperhive__{start,restart} that delegate to the
existing lifecycle::start / lifecycle::restart on the host. kill was
already at the manager's discretion; rounding out start + restart for
parity so day-to-day container care doesn't have to round-trip through
the operator.
guard: refuse self-targeting on kill/start/restart — the manager would
just be cutting its own legs. spawn (request_spawn) and config changes
(request_apply_commit) still go through the approval queue, since those
are the actual gate. prompt + claude.md updated to make the boundary
explicit. kill now also emits HelperEvent::Killed (it didn't before).
new mcp tool on the manager surface that queues a question on the
dashboard and returns the question id immediately. operator submits an
answer via /answer-question/<id>; the dashboard fires
HelperEvent::OperatorAnswered { id, question, answer } into the manager
inbox so the next turn picks it up.
also: fix async-form button stuck on spinner after successful submit
(refreshState skipped re-rendering, so the button was never re-enabled).