lexis 40589c8510 docs: update spawn flow docs for apply_commit handling first spawn (follow-up to 66f1568)

2026-05-22 22:34:42 +02:00

18 KiB


Approvals + manager + helper events

The approval queue is hyperhive's pivot: nothing that changes the shape of an agent (its config, whether it exists) happens without an operator click. The manager (hm1nd) is the policy gate in front of that queue; helper events are how it stays informed about what happens after a decision lands.

End-to-end approval flow

Manager edits files under /agents/<name>/config/ (any tracked path, but agent.nix is the contract entry point) and commits with its own git identity.
Manager submits the commit sha via request_apply_commit(agent, commit_ref). commit_ref must be a commit sha (7-40 hex chars, short or full) — a branch or tag name is rejected so the approval pins an immutable commit.
hive-c0re immediately fetches that commit from the proposed repo into the applied repo and tags it proposal/<id>. It resolves the sha locally against the proposed repo, fetches all of proposed's heads into applied's object db, then tags the resolved commit — git fetch <remote> <sha>:<dst> can't fetch by a bare sha (the left side of a refspec is a remote ref name), so the resolution happens on hive-c0re's side. The approval row stores both the manager-supplied sha and the canonical hive-c0re-vouched sha. From here on the proposed repo is irrelevant for this approval — the manager can amend, force-push, or rm -rf the proposed repo and the queued approval still points at an immutable git object inside applied.
Operator sees the proposal as a card on the dashboard — a full multi-file diff, toggleable between three bases (vs the running tree / vs the last approved proposal / vs the previous queued proposal) — and clicks ◆ APPR0VE (or hive-c0re approve <id> on the CLI).
hive-c0re moves the working tree to proposal/<id> and runs the build under a sequence of tags (see below). On success, applied/main fast-forwards to the proposal commit. On failure, main stays put and the working tree resets back to the previous deployed commit.
HelperEvent::ApprovalResolved (and Rebuilt for the ApplyCommit kind) land in the manager's inbox, carrying both the canonical sha and the terminal tag.
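The commit_ref rule in step 2 (7-40 hex chars, no branch or tag names) can be sketched as a purely syntactic check. This is a minimal illustration, not hive-c0re's actual validator: the function name `is_commit_sha` is hypothetical, and accepting uppercase hex digits is an assumption. A branch that happens to be named like a hex string would still pass this check, which is one reason the sha is additionally resolved against the proposed repo.

```rust
/// Returns true when `commit_ref` looks like a short or full commit sha:
/// 7 to 40 characters, all ASCII hex digits. Names such as "main" or
/// "deployed/3" fail the check. (Hypothetical helper, not the real API.)
pub fn is_commit_sha(commit_ref: &str) -> bool {
    let len = commit_ref.len();
    (7..=40).contains(&len) && commit_ref.bytes().all(|b| b.is_ascii_hexdigit())
}
```

For example, `is_commit_sha("66f1568")` holds, while `is_commit_sha("main")` does not.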

InitConfig approvals are the first step in a two-step spawn flow. On approve, hive-c0re seeds the proposed config repo with a default agent.nix template and sends the manager HelperEvent::ConfigReady { agent }. The manager then reviews, edits, and commits the template before calling request_apply_commit to proceed to an ApplyCommit approval. The first ApplyCommit creates the container; subsequent ones rebuild it with new config. This gives the manager (and operator) an explicit review gate on the initial configuration before any container is created.

Meta flake

The hive-c0re-owned repo at /var/lib/hyperhive/meta/ declares one flake input per agent (agent-<n>.url = "git+file:///var/lib/hyperhive/applied/<n>") and one nixosConfigurations.<n> output per agent. Each output wraps inputs.agent-<n>.nixosModules.default with the identity + HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT injection module that setup_applied used to generate inline. Containers run against --flake /var/lib/hyperhive/meta#<n>.
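As a rough illustration of the naming scheme above, the per-agent input line and the container's flake ref can be derived from the agent name alone. The helper names below are hypothetical and the real renderer may be structured quite differently; only the two path shapes are taken from the text.

```rust
/// Paths as given in the text above.
const APPLIED_ROOT: &str = "/var/lib/hyperhive/applied";
const META_ROOT: &str = "/var/lib/hyperhive/meta";

/// Renders the flake input assignment for one agent (hypothetical helper).
pub fn agent_input_line(name: &str) -> String {
    format!("agent-{name}.url = \"git+file://{APPLIED_ROOT}/{name}\"")
}

/// Renders the `--flake` reference a container is built against.
pub fn container_flake_ref(name: &str) -> String {
    format!("{META_ROOT}#{name}")
}
```

With an agent called `alice` (an example name), these yield `agent-alice.url = "git+file:///var/lib/hyperhive/applied/alice"` and `/var/lib/hyperhive/meta#alice`.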

Per-deploy lock flow (two-phase, owned by actions::run_apply_commit → meta::{prepare,finalize,abort}_deploy):

meta::prepare_deploy(name) runs nix flake lock --update-input agent-<n> without committing. Working tree of meta now points the input at applied/<n>/main (which run_apply_commit already fast-forwarded to proposal/<id>).
lifecycle::rebuild_no_meta runs nixos-container update <c> --flake meta#<name>. Nix evaluates against the staged lock.
On success, meta::finalize_deploy(name, sha, "deployed/<id>") stages flake.lock and commits with deploy <n> deployed/<id> <sha12>. Meta's git log gains one entry per successful deploy.
On failure, meta::abort_deploy() runs git restore flake.lock so the meta history shows only successes; the failure stays as an annotated failed/<id> tag in applied/<n>.
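The control flow of these four steps can be sketched as follows. This is an assumption-laden outline rather than the real `actions::run_apply_commit`: the `MetaLock` trait, the `deploy_two_phase` function, the `String` error type, and the `RecordingMeta` test double are all hypothetical; only the prepare / finalize / abort split and the commit subject shape come from the text.

```rust
/// Hypothetical stand-in for the three meta operations named above.
pub trait MetaLock {
    fn prepare_deploy(&mut self, name: &str) -> Result<(), String>;
    fn finalize_deploy(&mut self, name: &str, sha: &str, tag: &str) -> Result<(), String>;
    fn abort_deploy(&mut self) -> Result<(), String>;
}

/// Commit subject in the `deploy <n> deployed/<id> <sha12>` shape.
pub fn deploy_subject(name: &str, tag: &str, sha: &str) -> String {
    let sha12: String = sha.chars().take(12).collect();
    format!("deploy {name} {tag} {sha12}")
}

/// Stage the lock, run the build, then commit on success or restore on failure.
pub fn deploy_two_phase<M, B>(
    meta: &mut M,
    name: &str,
    id: u64,
    sha: &str,
    build: B,
) -> Result<(), String>
where
    M: MetaLock,
    B: FnOnce(&str) -> Result<(), String>,
{
    // Phase 1: update the lock in the working tree only, no commit yet.
    meta.prepare_deploy(name)?;
    match build(name) {
        // Phase 2a: the build passed, so record the deploy in meta's history.
        Ok(()) => meta.finalize_deploy(name, sha, &format!("deployed/{id}")),
        // Phase 2b: the build failed, so drop the staged lock change.
        Err(build_err) => {
            // If the abort itself fails, that error is returned instead.
            meta.abort_deploy()?;
            Err(build_err)
        }
    }
}

/// In-memory recorder used only to exercise the sketch.
#[derive(Default)]
pub struct RecordingMeta {
    pub calls: Vec<String>,
}

impl MetaLock for RecordingMeta {
    fn prepare_deploy(&mut self, name: &str) -> Result<(), String> {
        self.calls.push(format!("prepare {name}"));
        Ok(())
    }
    fn finalize_deploy(&mut self, name: &str, sha: &str, tag: &str) -> Result<(), String> {
        self.calls.push(deploy_subject(name, tag, sha));
        Ok(())
    }
    fn abort_deploy(&mut self) -> Result<(), String> {
        self.calls.push("abort".to_string());
        Ok(())
    }
}
```

Keeping the commit out of the prepare phase is what lets a failed build leave no trace in meta's history: the only thing to undo is an uncommitted change to flake.lock.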

Single-phase variants exist for paths without rollback semantics: meta::lock_update_for_rebuild(name) for the manual ↻ R3BU1LD button (commits if the lock changed) and meta::lock_update_hyperhive() for the auto-update flake-rev bump (one shot before per-agent rebuilds, commits if the lock changed).

meta::sync_agents(hyperhive_flake, dashboard_port, &agents) is the idempotent reconciler called by spawn, destroy, rebuild, and the startup migration. Renders flake.nix from the agent list; if it differs from disk, runs nix flake lock + commits as regenerate meta flake (or seed meta from N agent(s) on the very first call).

The manager has /meta RO-bound inside its container: git -C /meta log --oneline is the swarm-wide deploy log, cat /meta/flake.lock | jq '.nodes["agent-<n>"].locked' resolves which sha each agent is pinned at right now. Dashboard surfaces the same info as a deployed:<sha12> chip per container row.

Two repos per agent

/var/lib/hyperhive/agents/<name>/config/    proposed — manager RW
└── <anything>                              # any files the manager
                                            # wants in the commit.
                                            # agent.nix is the
                                            # convention entry
                                            # point; flake.nix is
                                            # tracked boilerplate
                                            # (manager doesn't edit
                                            # it).

/var/lib/hyperhive/applied/<name>/          applied — core-only
├── .git/                                   # tag-rich history
├── flake.nix                               # tracked, fixed
│                                           # boilerplate exporting
│                                           # nixosModules.default
├── agent.nix                               # working tree of main
└── <other manager files>                   # also tracked

/var/lib/hyperhive/meta/                    swarm-wide flake — core
├── .git/                                   # one commit per successful
│                                           # deploy
├── flake.nix                               # generated from agent set
└── flake.lock                              # pins each agent's sha

Why two physical repos: the manager's /agents/<n>/config/ is RW — a buggy or hostile agent can git clean -fdx its own proposed tree. The applied repo is never bind-mounted (except the read-only .git exposure described below) so a destructive move inside the container cannot reach it.

The container's --flake ref is /var/lib/hyperhive/meta#<name> (see "Meta flake" above). The agent's own applied/<n>/flake.nix is a fixed boilerplate that exports nixosModules.default = import ./agent.nix; the meta flake imports that module and wraps it with identity + HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT.

Tag state machine

Every approval id walks through a fixed set of tags on the underlying commit inside the applied repo:

| Tag | When | Annotated? |
| --- | --- | --- |
| `proposal/<id>` | request_apply_commit, after fetch | no |
| `approved/<id>` | operator approve | no |
| `building/<id>` | rebuild started | no |
| `deployed/<id>` | rebuild succeeded; `main` fast-forwards here | no |
| `failed/<id>` | rebuild failed | yes (body = error) |
| `denied/<id>` | operator deny | yes (body = operator note) |

applied/main is always the latest deployed/*. denied/ and failed/ are terminal; the manager submits a new commit + new approval id to retry. Because tags are first-class git objects, rejected and failed trees stay browsable forever — git log --tags in the applied repo is the audit trail.
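The table can be read as a small state machine. The sketch below encodes it with a hypothetical `TagState` enum; the transition set is inferred from the order of the rows (in particular, that a deny happens from the proposal state) and is an assumption rather than a statement about hive-c0re's code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TagState {
    Proposal,
    Approved,
    Building,
    Deployed,
    Failed,
    Denied,
}

impl TagState {
    /// Tag namespace used inside the applied repo.
    pub fn prefix(self) -> &'static str {
        match self {
            TagState::Proposal => "proposal",
            TagState::Approved => "approved",
            TagState::Building => "building",
            TagState::Deployed => "deployed",
            TagState::Failed => "failed",
            TagState::Denied => "denied",
        }
    }

    /// Full tag name for an approval id, e.g. `proposal/12`.
    pub fn tag(self, id: u64) -> String {
        format!("{}/{}", self.prefix(), id)
    }

    /// Only failed/ and denied/ carry a message body.
    pub fn is_annotated(self) -> bool {
        matches!(self, TagState::Failed | TagState::Denied)
    }

    /// States reachable in one step (assumed from the table order).
    pub fn next(self) -> &'static [TagState] {
        match self {
            TagState::Proposal => &[TagState::Approved, TagState::Denied],
            TagState::Approved => &[TagState::Building],
            TagState::Building => &[TagState::Deployed, TagState::Failed],
            TagState::Deployed | TagState::Failed | TagState::Denied => &[],
        }
    }

    pub fn can_move_to(self, to: TagState) -> bool {
        self.next().contains(&to)
    }
}
```

Because `Failed` and `Denied` have no successors, a retry has to start again from a fresh `Proposal` under a new approval id, which matches the rule stated above.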

Forge mirror

When the bundled hive-forge container is running — on by default, hyperhive.forge.enable — hive-c0re mirrors every agent's applied repo into a private agent-configs Forgejo org. forge::push_config(<name>) pushes applied/main plus every tag to agent-configs/<name> after each ref mutation: the spawn that seeds deployed/0, every request_apply_commit (which plants proposal/<id>), every approve / deny, and a sweep at startup. Pushes are best-effort — a missing or stopped forge never blocks a deploy.

The org is private and agents are not members, so only the core user (a Forgejo site admin) can read it: an agent can't reach another agent's config — or even its own — through the forge. The tokenised push URL is passed inline to git push, never written into applied/<n>/.git/config; that repo is RO-bind-mounted into the manager, and a stored token would leak core's admin credential to an agent.

The dashboard deep-links into this org — a config repo link per container row and a commit on forge link per approval card. See docs/web-ui.md.

Manager view of applied + meta

The manager container gets three host-side bind mounts via set_nspawn_flags:

/var/lib/hyperhive/agents/ → /agents/ (RW) — proposed repos. Manager edits + commits per-agent config here.
/var/lib/hyperhive/applied/ → /applied/ (RO) — every agent's authoritative applied repo, including .git.
/var/lib/hyperhive/meta/ → /meta/ (RO) — the swarm-wide deploy flake.

Each proposed repo (/agents/<n>/config/) is pre-configured with applied as a git remote pointing at /applied/<n>/.git. Useful incantations from inside the manager:

git -C /agents/<n>/config fetch applied
git -C /agents/<n>/config log applied/main --oneline
git -C /agents/<n>/config show applied/refs/tags/deployed/<id>
git -C /agents/<n>/config show applied/refs/tags/failed/<id>   # body = build error
git -C /agents/<n>/config show applied/refs/tags/denied/<id>   # body = operator note
git -C /agents/<n>/config rebase applied/main                 # base in-flight work on what's deployed

git -C /meta log --oneline                                    # swarm-wide deploy history
cat /meta/flake.lock | jq '.nodes | with_entries(select(.key | startswith("agent-")))'

The RO binds block push at the kernel level, so the manager can only fetch / read — git plumbing inside the container cannot corrupt either authoritative repo.

Migration from the pre-tag / pre-meta schemes

Both overhauls (tag-driven flow + meta flake) ship in-place migrations that run on every hive-c0re startup. Idempotent; each phase is a no-op once already applied. Behaviour:

Tag-driven phase: assumes the operator ran the one-shot git tag deployed/0 main script (see commit history / earlier docs revisions) once per agent. Tagging is non-destructive: it doesn't touch live containers, state dirs, or claude creds.
Meta-flake phase: rewrites each applied/<n>/flake.nix to the module-only boilerplate, wires the applied remote in each proposed repo, bootstraps the meta repo from the current agent list, and nixos-container updates every container at meta#<n>. The expensive last step is guarded by /var/lib/hyperhive/.meta-migration-done so it only runs once across hive-c0re restarts. Set HIVE_SKIP_META_MIGRATION=1 on the service to defer.

No state loss in either migration. claude creds, /state/ notes, the events DB, proposed history, and applied history all survive. The manager keeps its session; sub-agents stay logged in.

Manager (`hm1nd`) is hive-c0re-managed

The manager container runs through the same lifecycle as sub-agents. On hive-c0re serve startup, if hm1nd is missing, hive-c0re creates it. The manager's flake lives at /var/lib/hyperhive/applied/hm1nd/; its proposed config at /var/lib/hyperhive/agents/hm1nd/config/. Manager can edit its own agent.nix (visible inside the container at /agents/hm1nd/config/) and submit request_apply_commit("hm1nd", <sha>) for operator approval.

Differences from sub-agents:

flake.nix extends hyperhive.nixosConfigurations.manager (vs agent-base).
Container name is hm1nd (no h- prefix).
Fixed web UI port (MANAGER_PORT = 8000).
set_nspawn_flags adds three extra binds: /var/lib/hyperhive/agents → /agents (RW) so the manager can edit per-agent proposed repos, /var/lib/hyperhive/applied → /applied (RO) so the manager can git fetch deployed/failed/denied tags from any agent's authoritative applied repo, and /var/lib/hyperhive/meta → /meta (RO) for the swarm-wide deploy flake (see "Manager view of applied + meta" above).
First-deploy spawn bypasses the approval queue (manager is required infrastructure).
Per-agent socket lives at /run/hyperhive/manager/, owned by manager_server::start.

Migration note (for older hosts): drop any containers.hm1nd = { ... } block from your host NixOS config. hyperhive creates and updates the manager itself.

Manager policy

From hive-ag3nt/prompts/manager.md: the manager does NOT rubber-stamp sub-agent config requests. It verifies (role match, package legitimacy, cheaper alternative, blast radius) before committing and calling request_apply_commit.

For ambiguous cases or anything that needs human signal, the manager calls ask(question, options?, multi?, ttl_seconds?, to?) — queues the question and returns the id immediately. When to is omitted (or "operator") the question shows up on the dashboard; when to is a sub-agent's name, the recipient receives a HelperEvent::QuestionAsked and answers via their own answer tool. Either way the answer arrives back as HelperEvent::QuestionAnswered { id, question, answer, answerer } in the asker's inbox. Storage is hive-c0re::operator_questions (sqlite) — same table, with a nullable target column (NULL = operator). Dispatch goes through hive-c0re/src/questions.rs::{handle_ask, handle_answer} so both the agent + manager surfaces stay aligned. The answer flow is:

POST /answer-question/{id}                       agent: Answer { id, answer }
  → OperatorQuestions::answer(_, _, "operator")    → questions::handle_answer
  → notify_agent(asker, QuestionAnswered {         → OperatorQuestions::answer(_, _, agent)
       answerer: "operator", ... })                → notify_agent(asker, QuestionAnswered {
                                                       answerer: agent, ... })
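The `to` routing rule described above (omitted or "operator" goes to the dashboard, anything else names a sub-agent) can be sketched like this. The `Recipient` type and both function names are hypothetical; the mapping to a nullable target column follows the "NULL = operator" note in the text.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Recipient {
    Operator,
    Agent(String),
}

/// Resolves the optional `to` argument of `ask` into a recipient.
pub fn route(to: Option<&str>) -> Recipient {
    match to {
        None | Some("operator") => Recipient::Operator,
        Some(name) => Recipient::Agent(name.to_string()),
    }
}

/// Value for the nullable `target` column: None (NULL) means operator.
pub fn target_column(recipient: &Recipient) -> Option<&str> {
    match recipient {
        Recipient::Operator => None,
        Recipient::Agent(name) => Some(name.as_str()),
    }
}
```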

Two more paths resolve a pending question with a sentinel answer:

POST /cancel-question/{id} (✗ CANC3L button on the dashboard) resolves with [cancelled]. The manager sees a terminal state and can fall back.
ttl_seconds deadline: a tokio watchdog spawned at submit time fires answer(id, "[expired]") once the ttl runs out. If the question was already resolved by then, the call is a no-op. The dashboard surfaces a ⏳ MM:SS chip on each pending question with a deadline.
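The "first resolution wins" behaviour shared by real answers, cancellation, and expiry can be sketched with a plain in-memory type. `PendingQuestion` and its methods are hypothetical; the actual storage is the sqlite table mentioned earlier, and how it guards against races is not shown in this document.

```rust
pub const CANCELLED: &str = "[cancelled]";
pub const EXPIRED: &str = "[expired]";

/// Hypothetical in-memory model of one queued question.
pub struct PendingQuestion {
    pub id: u64,
    resolution: Option<(String, String)>, // (answer, answerer)
}

impl PendingQuestion {
    pub fn new(id: u64) -> Self {
        PendingQuestion { id, resolution: None }
    }

    /// Records the answer and returns true, unless the question is
    /// already resolved, in which case nothing changes and it returns false.
    pub fn resolve(&mut self, answer: &str, answerer: &str) -> bool {
        if self.resolution.is_some() {
            return false;
        }
        self.resolution = Some((answer.to_string(), answerer.to_string()));
        true
    }

    pub fn answer(&self) -> Option<&str> {
        self.resolution.as_ref().map(|(answer, _)| answer.as_str())
    }
}
```

With this shape, a watchdog that fires after the operator has already answered simply gets `false` back and the stored answer is left untouched.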

Helper events to the manager

Coordinator::notify_manager(&HelperEvent) enqueues an inbox message from sender system with the event JSON in the body. The manager harness no longer short-circuits these — they drive a regular claude turn so the manager can react. Variants (hive_sh4re::HelperEvent):

ApprovalResolved { id, agent, commit_ref, status, note } — fired by actions::approve + actions::deny whenever an approval transitions to its terminal state.
Spawned { agent, ok, note } — actions::approve (first-time ApplyCommit-kind) + admin HostRequest::Spawn (deprecated).
Rebuilt { agent, ok, note } — auto_update::rebuild_agent (covers startup scan + manual /rebuild from dashboard) + actions::approve (ApplyCommit).
Killed { agent } — admin HostRequest::Kill + dashboard /kill + manager Kill MCP tool.
Destroyed { agent } — actions::destroy.
ContainerCrash { agent, note } — crash_watch: a previously-running container went away with no operator-initiated transient state (Stopping / Restarting / Destroying / Rebuilding). Manager can start it again or escalate.
NeedsLogin { agent } — sub-agent has no claude session yet. Manager can't act directly (interactive OAuth); typically flags the operator.
LoggedIn { agent } — sub-agent just completed login. Manager often greets the agent on this event.
ConfigReady { agent } — a new agent's proposed config repo was just seeded (post-InitConfig approval). The manager can now edit /agents/<agent>/config/agent.nix, commit the changes, and submit request_apply_commit with the commit sha to create the container (first ApplyCommit also triggers spawn bookkeeping).
NeedsUpdate { agent } — sub-agent's recorded flake rev is stale. Manager calls update(name) to rebuild — idempotent, no approval required.
QuestionAnswered { id, question, answer, answerer } — dashboard /answer-question/{id} (answerer = "operator"), peer Answer request (answerer = agent name), or ttl watchdog expiry (answerer = "ttl-watchdog", answer = "[expired]").
QuestionAsked { id, asker, question, options, multi } — fired when an agent calls Ask { to: Some(<this-agent>), ... }. The recipient responds via Answer { id, answer } and the asker sees the matching QuestionAnswered.

To add a new event: new HelperEvent variant + call sites + update prompts/manager.md so the manager knows the new shape.
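To make the shape of these events concrete, here is a reduced sketch covering five of the variants listed above. The field types (`String` throughout) are assumptions, the real `hive_sh4re::HelperEvent` has more variants and may differ in detail, and `suggested_reaction` is a hypothetical helper that only restates the reactions described in the list.

```rust
/// Reduced, assumed mirror of a few HelperEvent variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HelperEvent {
    ContainerCrash { agent: String, note: String },
    NeedsLogin { agent: String },
    LoggedIn { agent: String },
    ConfigReady { agent: String },
    NeedsUpdate { agent: String },
}

/// Summarises the reaction the text above describes for each event.
pub fn suggested_reaction(event: &HelperEvent) -> String {
    match event {
        HelperEvent::ContainerCrash { agent, note } => {
            format!("start {agent} again or escalate ({note})")
        }
        HelperEvent::NeedsLogin { agent } => {
            format!("flag the operator: {agent} needs an interactive login")
        }
        HelperEvent::LoggedIn { agent } => format!("greet {agent}"),
        HelperEvent::ConfigReady { agent } => {
            format!("edit /agents/{agent}/config/agent.nix, commit, then request_apply_commit")
        }
        HelperEvent::NeedsUpdate { agent } => format!("update({agent})"),
    }
}
```

An exhaustive `match` like this is also a cheap reminder when a variant is added: the compiler flags every call site that has not been taught the new shape.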

Auto-update on startup

hive-c0re serve runs auto_update::run in a background task right after opening the coordinator. It enumerates managed containers and rebuilds any whose recorded hyperhive rev differs from the current one — sub-agents and manager go through the same lifecycle::rebuild path.

"Rev" = canonical filesystem path of cfg.hyperhiveFlake. Marker file: /var/lib/hyperhive/applied/.<name>.hyperhive-rev. If the flake input has no canonical path (e.g. a github: URL), auto-update is a no-op — rebuild manually.

The dashboard surfaces pending updates per agent: a clickable "needs update ↻" badge appears whenever the marker differs from current rev. The badge POSTs /rebuild/<name>, calling the same auto_update::rebuild_agent path so manual triggers and the startup scan can't drift. When at least one container is stale, a top-level ↻ UPD4TE 4LL button appears that loops over every stale container.
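The staleness rule can be sketched as a comparison between the current rev and the per-agent marker. The names `UpdateDecision`, `decide`, and `marker_path` are hypothetical, and treating a missing marker as stale is an assumption; the marker path shape and the "no canonical path means no-op" rule come from the text.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum UpdateDecision {
    /// The flake input has no canonical path, so auto-update does nothing.
    NoOp,
    UpToDate,
    Rebuild,
}

/// Marker file recording the hyperhive rev an agent was last built with.
pub fn marker_path(name: &str) -> String {
    format!("/var/lib/hyperhive/applied/.{name}.hyperhive-rev")
}

/// Decides what to do for one container.
/// Assumption: a missing marker (`recorded_rev == None`) counts as stale.
pub fn decide(current_rev: Option<&str>, recorded_rev: Option<&str>) -> UpdateDecision {
    match (current_rev, recorded_rev) {
        (None, _) => UpdateDecision::NoOp,
        (Some(current), Some(recorded)) if current == recorded => UpdateDecision::UpToDate,
        (Some(_), _) => UpdateDecision::Rebuild,
    }
}
```

The same decision could back both the startup scan and the dashboard badge, which is in the spirit of routing both through one rebuild path.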
