damocles f8795dc029 fix: request_apply_commit resolves sha locally + rejects non-sha refs

2026-05-20 09:48:05 +02:00


Approvals + manager + helper events

The approval queue is hyperhive's pivot: nothing that changes the shape of an agent (its config, whether it exists) happens without an operator click. The manager (hm1nd) is the policy gate in front of that queue; helper events are how it stays informed about what happens after a decision lands.

End-to-end approval flow

Manager edits files under /agents/<name>/config/ (any tracked path, but agent.nix is the contract entry point) and commits with its own git identity.
Manager submits the commit sha via request_apply_commit(agent, commit_ref). commit_ref must be a commit sha (7-40 hex chars, short or full) — a branch or tag name is rejected so the approval pins an immutable commit.
hive-c0re immediately fetches that commit from the proposed repo into the applied repo and tags it proposal/<id>. It resolves the sha locally against the proposed repo, fetches all of proposed's heads into applied's object db, then tags the resolved commit — git fetch <remote> <sha>:<dst> can't fetch by a bare sha (the left side of a refspec is a remote ref name), so the resolution happens on hive-c0re's side. The approval row stores both the manager-supplied sha and the canonical hive-c0re-vouched sha. From here on the proposed repo is irrelevant for this approval — the manager can amend, force-push, or rm -rf the proposed repo and the queued approval still points at an immutable git object inside applied.
Operator sees the diff on the dashboard, clicks ◆ APPR0VE (or hive-c0re approve <id> on the CLI).
hive-c0re moves the working tree to proposal/<id> and runs the build under a sequence of tags (see below). On success, applied/main fast-forwards to the proposal commit. On failure, main stays put and the working tree resets back to the previous deployed commit.
HelperEvent::ApprovalResolved (and Rebuilt for the ApplyCommit kind) land in the manager's inbox, carrying both the canonical sha and the terminal tag.
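The shape check on commit_ref from step two can be sketched as below. This is a minimal illustration, not hive-c0re's actual code: the function name looks_like_commit_sha is hypothetical, and only the "7-40 hex chars" rule is taken from the text above.

```rust
/// Returns true when `commit_ref` has the shape of a commit sha:
/// between 7 and 40 characters, all of them hex digits.
/// Hypothetical helper; the real check in hive-c0re may differ.
fn looks_like_commit_sha(commit_ref: &str) -> bool {
    let len = commit_ref.len();
    (7..=40).contains(&len) && commit_ref.bytes().all(|b| b.is_ascii_hexdigit())
}
```

A shape check alone cannot tell a sha from a branch that happens to be named like one (for example a branch called deadbeef), which is one reason the local resolution against the proposed repo in the next step still matters.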

Spawn approvals follow the same shape but skip the commit-diff step — the operator just sees the name. On approve, hive-c0re creates the container in a background task while the dashboard shows a spinner.

Meta flake

The hive-c0re-owned repo at /var/lib/hyperhive/meta/ declares one flake input per agent (agent-<n>.url = "git+file:///var/lib/hyperhive/applied/<n>") and one nixosConfigurations.<n> output per agent. Each output wraps inputs.agent-<n>.nixosModules.default with the identity + HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT injection module that setup_applied used to generate inline. Containers run against --flake /var/lib/hyperhive/meta#<n>.

Per-deploy lock flow (two-phase, owned by actions::run_apply_commit → meta::{prepare,finalize,abort}_deploy):

meta::prepare_deploy(name) runs nix flake lock --update-input agent-<n> without committing. Working tree of meta now points the input at applied/<n>/main (which run_apply_commit already fast-forwarded to proposal/<id>).
lifecycle::rebuild_no_meta runs nixos-container update <c> --flake meta#<name>. Nix evaluates against the staged lock.
On success — meta::finalize_deploy(name, sha, "deployed/<id>") stages flake.lock and commits with deploy <n> deployed/<id> <sha12>. Meta's git log gains one entry per successful deploy.
On failure — meta::abort_deploy() runs git restore flake.lock so the meta history shows only successes; the failure stays as an annotated failed/<id> tag in applied/<n>.
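The control flow of the four steps above can be sketched as follows. Everything here is an assumption for illustration: the Meta trait, run_deploy, FakeMeta, and the String error type are hypothetical stand-ins, and the build closure takes the place of lifecycle::rebuild_no_meta. Only the method names and the commit subject format come from the text.

```rust
/// Hypothetical stand-in for the three meta operations named above.
trait Meta {
    fn prepare_deploy(&mut self, name: &str) -> Result<(), String>;
    fn finalize_deploy(&mut self, name: &str, sha: &str, tag: &str) -> Result<(), String>;
    fn abort_deploy(&mut self) -> Result<(), String>;
}

/// Commit subject in the documented form: deploy <n> deployed/<id> <sha12>.
fn deploy_subject(name: &str, tag: &str, sha: &str) -> String {
    let sha12: String = sha.chars().take(12).collect();
    format!("deploy {} {} {}", name, tag, sha12)
}

/// Two-phase flow: stage the lock, build, then commit or roll back.
fn run_deploy<M: Meta>(
    meta: &mut M,
    name: &str,
    sha: &str,
    id: &str,
    build: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    meta.prepare_deploy(name)?; // lock updated, nothing committed yet
    match build() {
        Ok(()) => meta.finalize_deploy(name, sha, &format!("deployed/{}", id)),
        Err(err) => {
            meta.abort_deploy()?; // restore the lock so history shows only successes
            Err(err)
        }
    }
}

/// In-memory fake used to exercise the flow without git or nix.
#[derive(Default)]
struct FakeMeta {
    lock_staged: bool,
    log: Vec<String>,
}

impl Meta for FakeMeta {
    fn prepare_deploy(&mut self, _name: &str) -> Result<(), String> {
        self.lock_staged = true;
        Ok(())
    }
    fn finalize_deploy(&mut self, name: &str, sha: &str, tag: &str) -> Result<(), String> {
        self.lock_staged = false;
        self.log.push(deploy_subject(name, tag, sha));
        Ok(())
    }
    fn abort_deploy(&mut self) -> Result<(), String> {
        self.lock_staged = false;
        Ok(())
    }
}
```

With the fake, a failing build leaves the log untouched and the staged lock cleared, which mirrors the "only successes in meta history" property.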

Single-phase variants exist for paths without rollback semantics: meta::lock_update_for_rebuild(name) for the manual ↻ R3BU1LD button (commits if the lock changed) and meta::lock_update_hyperhive() for the auto-update flake-rev bump (one shot before per-agent rebuilds, commits if the lock changed).

meta::sync_agents(hyperhive_flake, dashboard_port, &agents) is the idempotent reconciler called by spawn, destroy, rebuild, and the startup migration. Renders flake.nix from the agent list; if it differs from disk, runs nix flake lock + commits as regenerate meta flake (or seed meta from N agent(s) on the very first call).
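The render-compare-commit idea behind the reconciler can be sketched like this. It is a simplified illustration under stated assumptions: render_inputs and reconcile are hypothetical names, only the input lines are rendered rather than a full flake.nix, the sorting of agent names is an assumption made so that ordering cannot cause spurious diffs, and "very first call" is modelled as "nothing on disk yet".

```rust
/// Renders one flake input line per agent, in sorted order.
/// Only the URL shape is taken from the text; the layout is illustrative.
fn render_inputs(agents: &[&str]) -> String {
    let mut sorted: Vec<&str> = agents.to_vec();
    sorted.sort_unstable();
    let mut out = String::new();
    for name in sorted {
        out.push_str(&format!(
            "agent-{}.url = \"git+file:///var/lib/hyperhive/applied/{}\";\n",
            name, name
        ));
    }
    out
}

/// Returns None when disk already matches (no-op), otherwise the commit
/// message to use after rewriting the file and re-locking.
fn reconcile(on_disk: Option<&str>, agents: &[&str]) -> Option<String> {
    let rendered = render_inputs(agents);
    match on_disk {
        Some(current) if current == rendered => None,
        Some(_) => Some("regenerate meta flake".to_string()),
        None => Some(format!("seed meta from {} agent(s)", agents.len())),
    }
}
```

Calling reconcile twice with the same agent set returns None the second time, which is what makes it safe to call from spawn, destroy, rebuild, and startup alike.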

The manager has /meta RO-bound inside its container: git -C /meta log --oneline is the swarm-wide deploy log, cat /meta/flake.lock | jq '.nodes["agent-<n>"].locked' resolves which sha each agent is pinned at right now. Dashboard surfaces the same info as a deployed:<sha12> chip per container row.

Two repos per agent

/var/lib/hyperhive/agents/<name>/config/    proposed — manager RW
└── <anything>                              # any files the manager
                                            # wants in the commit.
                                            # agent.nix is the
                                            # convention entry
                                            # point; flake.nix is
                                            # tracked boilerplate
                                            # (manager doesn't edit
                                            # it).

/var/lib/hyperhive/applied/<name>/          applied — core-only
├── .git/                                   # tag-rich history
├── flake.nix                               # tracked, fixed
│                                           # boilerplate exporting
│                                           # nixosModules.default
├── agent.nix                               # working tree of main
└── <other manager files>                   # also tracked

/var/lib/hyperhive/meta/                    swarm-wide flake — core
├── .git/                                   # one commit per successful
│                                           # deploy
├── flake.nix                               # generated from agent set
└── flake.lock                              # pins each agent's sha

Why two physical repos: the manager's /agents/<n>/config/ is RW — a buggy or hostile agent can git clean -fdx its own proposed tree. The applied repo is never bind-mounted (except the read-only .git exposure described below) so a destructive move inside the container cannot reach it.

The container's --flake ref is /var/lib/hyperhive/meta#<name> (see "Meta flake" above). The agent's own applied/<n>/flake.nix is a fixed boilerplate that exports nixosModules.default = import ./agent.nix; the meta flake imports that module and wraps it with identity + HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT.

Tag state machine

Every approval id walks through a fixed set of tags on the underlying commit inside the applied repo:

Tag	When	Annotated?
proposal/<id>	request_apply_commit, after fetch	no
approved/<id>	operator approve	no
building/<id>	rebuild started	no
deployed/<id>	rebuild succeeded; main fast-forwards here	no
failed/<id>	rebuild failed	yes (body = error)
denied/<id>	operator deny	yes (body = operator note)

applied/main is always the latest deployed/*. denied/ and failed/ are terminal; the manager submits a new commit + new approval id to retry. Because tags are first-class git objects, rejected and failed trees stay browsable forever — git log --tags in the applied repo is the audit trail.
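One way to read the table is as a small state machine. The sketch below is an interpretation, not the project's code: the Tag enum and its methods are hypothetical, and the allowed transitions are inferred from the "When" column (deny happens at the operator decision, failure only after a build starts).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tag {
    Proposal,
    Approved,
    Building,
    Deployed,
    Failed,
    Denied,
}

impl Tag {
    fn prefix(self) -> &'static str {
        match self {
            Tag::Proposal => "proposal",
            Tag::Approved => "approved",
            Tag::Building => "building",
            Tag::Deployed => "deployed",
            Tag::Failed => "failed",
            Tag::Denied => "denied",
        }
    }

    /// Full tag name for an approval id, e.g. "deployed/42".
    fn name(self, id: &str) -> String {
        format!("{}/{}", self.prefix(), id)
    }

    /// Only failed/ and denied/ carry a message body.
    fn is_annotated(self) -> bool {
        matches!(self, Tag::Failed | Tag::Denied)
    }

    /// Transitions inferred from the table; end states have no exits.
    fn can_move_to(self, next: Tag) -> bool {
        matches!(
            (self, next),
            (Tag::Proposal, Tag::Approved)
                | (Tag::Proposal, Tag::Denied)
                | (Tag::Approved, Tag::Building)
                | (Tag::Building, Tag::Deployed)
                | (Tag::Building, Tag::Failed)
        )
    }
}
```

Because failed and denied have no outgoing edges, a retry has to start from a fresh proposal with a new approval id, as described above.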

Manager view of applied + meta

The manager container gets three host-side bind mounts via set_nspawn_flags:

/var/lib/hyperhive/agents/ → /agents/ (RW) — proposed repos. Manager edits + commits per-agent config here.
/var/lib/hyperhive/applied/ → /applied/ (RO) — every agent's authoritative applied repo, including .git.
/var/lib/hyperhive/meta/ → /meta/ (RO) — the swarm-wide deploy flake.

Each proposed repo (/agents/<n>/config/) is pre-configured with applied as a git remote pointing at /applied/<n>/.git. Useful incantations from inside the manager:

git -C /agents/<n>/config fetch applied
git -C /agents/<n>/config log applied/main --oneline
git -C /agents/<n>/config show applied/refs/tags/deployed/<id>
git -C /agents/<n>/config show applied/refs/tags/failed/<id>   # body = build error
git -C /agents/<n>/config show applied/refs/tags/denied/<id>   # body = operator note
git -C /agents/<n>/config rebase applied/main                 # base in-flight work on what's deployed

git -C /meta log --oneline                                    # swarm-wide deploy history
cat /meta/flake.lock | jq '.nodes | with_entries(select(.key | startswith("agent-")))'

The RO binds block push at the kernel level, so the manager can only fetch / read — git plumbing inside the container cannot corrupt either authoritative repo.

Migration from the pre-tag / pre-meta schemes

Both overhauls (tag-driven flow + meta flake) ship in-place migrations that run on every hive-c0re startup. Idempotent; each phase is a no-op once already applied. Behaviour:

Tag-driven phase: assumes the operator ran the one-shot git tag deployed/0 main script (see commit history / earlier docs revisions) once per agent. Tagging is non-destructive: it doesn't touch live containers, state dirs, or claude creds.
Meta-flake phase: rewrites each applied/<n>/flake.nix to the module-only boilerplate, wires the applied remote in each proposed repo, bootstraps the meta repo from the current agent list, and nixos-container updates every container at meta#<n>. The expensive last step is guarded by /var/lib/hyperhive/.meta-migration-done so it only runs once across hive-c0re restarts. Set HIVE_SKIP_META_MIGRATION=1 on the service to defer.

No state loss in either migration. claude creds, /state/ notes, the events DB, proposed history, and applied history all survive. The manager keeps its session; sub-agents stay logged in.

Manager (hm1nd) is hive-c0re-managed

The manager container runs through the same lifecycle as sub-agents. On hive-c0re serve startup, if hm1nd is missing, hive-c0re creates it. The manager's flake lives at /var/lib/hyperhive/applied/hm1nd/; its proposed config at /var/lib/hyperhive/agents/hm1nd/config/. Manager can edit its own agent.nix (visible inside the container at /agents/hm1nd/config/) and submit request_apply_commit("hm1nd", <sha>) for operator approval.

Differences from sub-agents:

flake.nix extends hyperhive.nixosConfigurations.manager (vs agent-base).
Container name is hm1nd (no h- prefix).
Fixed web UI port (MANAGER_PORT = 8000).
set_nspawn_flags adds three extra binds: /var/lib/hyperhive/agents → /agents (RW) so the manager can edit per-agent proposed repos, /var/lib/hyperhive/applied → /applied (RO) so the manager can git fetch deployed/failed/denied tags from any agent's authoritative applied repo, and /var/lib/hyperhive/meta → /meta (RO) for the swarm-wide deploy flake (see "Manager view of applied + meta" above).
First-deploy spawn bypasses the approval queue (manager is required infrastructure).
Per-agent socket lives at /run/hyperhive/manager/, owned by manager_server::start.

Migration note (for older hosts): drop any containers.hm1nd = { ... } block from your host NixOS config. hyperhive creates and updates the manager itself.

Manager policy

From hive-ag3nt/prompts/manager.md: the manager does NOT rubber-stamp sub-agent config requests. It verifies (role match, package legitimacy, cheaper alternative, blast radius) before committing and calling request_apply_commit.

For ambiguous cases or anything that needs human signal, the manager calls ask(question, options?, multi?, ttl_seconds?, to?) — queues the question and returns the id immediately. When to is omitted (or "operator") the question shows up on the dashboard; when to is a sub-agent's name, the recipient receives a HelperEvent::QuestionAsked and answers via their own answer tool. Either way the answer arrives back as HelperEvent::QuestionAnswered { id, question, answer, answerer } in the asker's inbox. Storage is hive-c0re::operator_questions (sqlite) — same table, with a nullable target column (NULL = operator). Dispatch goes through hive-c0re/src/questions.rs::{handle_ask, handle_answer} so both the agent + manager surfaces stay aligned. The answer flow is:

POST /answer-question/{id}                       agent: Answer { id, answer }
  → OperatorQuestions::answer(_, _, "operator")    → questions::handle_answer
  → notify_agent(asker, QuestionAnswered {         → OperatorQuestions::answer(_, _, agent)
       answerer: "operator", ... })                → notify_agent(asker, QuestionAnswered {
                                                       answerer: agent, ... })

Two more paths resolve a pending question with a sentinel answer:

POST /cancel-question/{id} (✗ CANC3L button on the dashboard) resolves with [cancelled]. The manager sees a terminal state and can fall back.
ttl_seconds deadline: a tokio watchdog spawned at submit time fires answer(id, "[expired]") once the ttl runs out. Already-resolved races no-op. The dashboard surfaces a ⏳ MM:SS chip on each pending question with a deadline.
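The "already-resolved races no-op" behaviour amounts to first-resolution-wins. A minimal in-memory sketch follows, assuming a numeric question id and a hypothetical Questions type; the real storage is the sqlite table mentioned above, and its API may look nothing like this.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Resolution {
    answer: String,
    answerer: String,
}

/// Key present = question exists; value None = still pending.
#[derive(Default)]
struct Questions {
    state: HashMap<u64, Option<Resolution>>,
}

impl Questions {
    fn ask(&mut self, id: u64) {
        self.state.insert(id, None);
    }

    /// Resolves a pending question. Returns false, changing nothing,
    /// when the id is unknown or the question was already resolved.
    fn answer(&mut self, id: u64, answer: &str, answerer: &str) -> bool {
        match self.state.get_mut(&id) {
            Some(slot) if slot.is_none() => {
                *slot = Some(Resolution {
                    answer: answer.to_string(),
                    answerer: answerer.to_string(),
                });
                true
            }
            _ => false,
        }
    }
}
```

If the operator answers just before the deadline, the watchdog's later answer(id, "[expired]") returns false and the operator's answer stands.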

Helper events to the manager

Coordinator::notify_manager(&HelperEvent) enqueues an inbox message from sender system with the event JSON in the body. The manager harness no longer short-circuits these — they drive a regular claude turn so the manager can react. Variants (hive_sh4re::HelperEvent):

ApprovalResolved { id, agent, commit_ref, status, note } — fired by actions::approve + actions::deny whenever an approval transitions to its terminal state.
Spawned { agent, ok, note } — actions::approve (Spawn-kind) + admin HostRequest::Spawn.
Rebuilt { agent, ok, note } — auto_update::rebuild_agent (covers startup scan + manual /rebuild from dashboard) + actions::approve (ApplyCommit).
Killed { agent } — admin HostRequest::Kill + dashboard /kill + manager Kill MCP tool.
Destroyed { agent } — actions::destroy.
ContainerCrash { agent, note } — crash_watch: a previously-running container went away with no operator-initiated transient state (Stopping / Restarting / Destroying / Rebuilding). Manager can start it again or escalate.
NeedsLogin { agent } — sub-agent has no claude session yet. Manager can't act directly (interactive OAuth); typically flags the operator.
LoggedIn { agent } — sub-agent just completed login. Manager often greets the agent on this event.
NeedsUpdate { agent } — sub-agent's recorded flake rev is stale. Manager calls update(name) to rebuild — idempotent, no approval required.
QuestionAnswered { id, question, answer, answerer } — dashboard /answer-question/{id} (answerer = "operator"), peer Answer request (answerer = agent name), or ttl watchdog expiry (answerer = "ttl-watchdog", answer = "[expired]").
QuestionAsked { id, asker, question, options, multi } — fired when an agent calls Ask { to: Some(<this-agent>), ... }. The recipient responds via Answer { id, answer } and the asker sees the matching QuestionAnswered.

To add a new event: new HelperEvent variant + call sites + update prompts/manager.md so the manager knows the new shape.
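A trimmed-down sketch of why "new variant + call sites" travel together: with an exhaustive match and no wildcard arm, adding a variant makes the compiler point at every place that must handle it. The enum below is a hypothetical subset with assumed field types (String), and suggested_reaction is an illustrative helper summarising the reactions described in the list above, not part of hive_sh4re.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum HelperEvent {
    ContainerCrash { agent: String, note: String },
    NeedsLogin { agent: String },
    LoggedIn { agent: String },
    NeedsUpdate { agent: String },
}

/// Maps each event to the reaction the list above describes.
/// No wildcard arm: a new variant fails to compile until handled here.
fn suggested_reaction(event: &HelperEvent) -> &'static str {
    match event {
        HelperEvent::ContainerCrash { .. } => "start it again or escalate",
        HelperEvent::NeedsLogin { .. } => "flag the operator",
        HelperEvent::LoggedIn { .. } => "greet the agent",
        HelperEvent::NeedsUpdate { .. } => "call update(name)",
    }
}
```

The prompt update in prompts/manager.md has no such compiler check, so it is the step most easily forgotten.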

Auto-update on startup

hive-c0re serve runs auto_update::run in a background task right after opening the coordinator. It enumerates managed containers and rebuilds any whose recorded hyperhive rev differs from the current one — sub-agents and manager go through the same lifecycle::rebuild path.

"Rev" = canonical filesystem path of cfg.hyperhiveFlake. Marker file: /var/lib/hyperhive/applied/.<name>.hyperhive-rev. If the flake input has no canonical path (e.g. a github: URL), auto-update is a no-op — rebuild manually.

The dashboard surfaces pending updates per agent: a clickable "needs update ↻" badge appears whenever the marker differs from current rev. The badge POSTs /rebuild/<name>, calling the same auto_update::rebuild_agent path so manual triggers and the startup scan can't drift. When at least one container is stale, a top-level ↻ UPD4TE 4LL button appears that loops over every stale container.
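The rev comparison used by both the startup scan and the badge can be sketched as below. The function names are hypothetical, and treating a missing marker file as stale is an assumption; only the marker file naming and the "canonical filesystem path" definition of rev come from this section.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Marker file for an agent: <applied_root>/.<name>.hyperhive-rev
fn marker_path(applied_root: &Path, name: &str) -> PathBuf {
    applied_root.join(format!(".{}.hyperhive-rev", name))
}

/// Current rev = canonical filesystem path of the flake, if it has one.
/// Returns None for inputs that do not resolve to a path on disk
/// (e.g. a github: URL), in which case auto-update does nothing.
fn current_rev(hyperhive_flake: &str) -> Option<String> {
    fs::canonicalize(hyperhive_flake)
        .ok()
        .map(|p| p.to_string_lossy().into_owned())
}

/// True when the recorded rev differs from the current one.
/// Assumption: an unreadable or missing marker counts as stale.
fn needs_update(applied_root: &Path, name: &str, current: &str) -> bool {
    match fs::read_to_string(marker_path(applied_root, name)) {
        Ok(recorded) => recorded.trim() != current,
        Err(_) => true,
    }
}
```

Keeping the comparison in one function is what lets the badge and the startup scan agree on which containers are stale.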
