müde 497cd15137 docs: tag-driven config-apply plan + migration story

scratchpad in claude.md marks this as in-flight; docs/approvals.md
gets the new tag state machine (proposal/approved/building/deployed/
failed/denied) and the manager applied.git read-only mount. todo
picks up the unprivileged-containers git-identity caveat and a web
ui for config repos as a downstream follow-up.

2026-05-15 22:43:47 +02:00

10 KiB


Approvals + manager + helper events

The approval queue is hyperhive's pivot: nothing that changes the shape of an agent (its config, whether it exists) happens without an operator click. The manager (hm1nd) is the policy gate in front of that queue; helper events are how it stays informed about what happens after a decision lands.

End-to-end approval flow

Manager edits files under /agents/<name>/config/ (any tracked path, but agent.nix is the contract entry point) and commits with its own git identity.
Manager submits the commit sha via request_apply_commit(agent, commit_ref).
hive-c0re immediately fetches that commit from the proposed repo into the applied repo and tags it proposal/<id>. The approval row stores both the manager-supplied sha and the canonical hive-c0re-vouched sha. From here on the proposed repo is irrelevant for this approval — the manager can amend, force-push, or rm -rf the proposed repo and the queued approval still points at an immutable git object inside applied.
Operator sees the diff on the dashboard, clicks ◆ APPR0VE (or hive-c0re approve <id> on the CLI).
hive-c0re moves the working tree to proposal/<id> and runs the build under a sequence of tags (see below). On success, applied/main fast-forwards to the proposal commit. On failure, main stays put and the working tree resets back to the previous deployed commit.
HelperEvent::ApprovalResolved (and Rebuilt for the ApplyCommit kind) land in the manager's inbox, carrying both the canonical sha and the terminal tag.
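The success and failure branches of the build step above can be sketched as a pure function. This is a minimal illustration, not hive-c0re's code: the `Approval` and `BuildOutcome` structs, their field names, and `finish_build` are all hypothetical, chosen only to mirror the two shas the approval row stores and the rule that `main` moves only on a successful build.

```rust
/// Hypothetical approval row. Field names are illustrative, not hive-c0re's schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Approval {
    pub id: u64,
    pub agent: String,
    /// Sha exactly as the manager submitted it.
    pub manager_sha: String,
    /// Sha hive-c0re fetched into the applied repo and tagged proposal/<id>.
    pub canonical_sha: String,
}

/// State of the applied repo once an approved build has finished.
#[derive(Debug, PartialEq)]
pub struct BuildOutcome {
    /// Commit that applied/main points at.
    pub main: String,
    /// Commit checked out in the working tree.
    pub working_tree: String,
    /// Tag that closes the approval.
    pub terminal_tag: String,
}

/// On success main fast-forwards to the proposal commit; on failure
/// main stays put and the working tree resets to the previous deploy.
pub fn finish_build(approval: &Approval, previous_deployed: &str, build_ok: bool) -> BuildOutcome {
    if build_ok {
        BuildOutcome {
            main: approval.canonical_sha.clone(),
            working_tree: approval.canonical_sha.clone(),
            terminal_tag: format!("deployed/{}", approval.id),
        }
    } else {
        BuildOutcome {
            main: previous_deployed.to_string(),
            working_tree: previous_deployed.to_string(),
            terminal_tag: format!("failed/{}", approval.id),
        }
    }
}
```

Note that only `canonical_sha` is ever used to move refs; the manager-supplied sha is kept purely for reference, which matches the point that the proposed repo stops mattering once the commit has been fetched.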

Spawn approvals follow the same shape but skip the commit-diff step — the operator just sees the name. On approve, hive-c0re creates the container in a background task while the dashboard shows a spinner.

Two repos per agent

/var/lib/hyperhive/agents/<name>/config/    proposed — manager RW
└── <anything>                              # any files the manager
                                            # wants in the commit.
                                            # agent.nix is the
                                            # convention entry
                                            # point; flake.nix is
                                            # generated and not
                                            # tracked here.

/var/lib/hyperhive/applied/<name>/          applied — core-only
├── .git/                                   # tag-rich history
├── .gitignore                              # ignores flake.nix
├── flake.nix                               # hive-c0re-generated,
│                                           # untracked, rewritten
│                                           # on spawn/rebuild only
├── agent.nix                               # working tree of main
└── <other manager files>                   # also tracked

Why two physical repos: the manager's /agents/<n>/config/ is RW — a buggy or hostile agent can git clean -fdx its own proposed tree. The applied repo is never bind-mounted (except the read-only .git exposure described below) so a destructive move inside the container cannot reach it.

The container's --flake ref is <applied_dir>#default. The generated flake.nix extends hyperhive.nixosConfigurations.{agent-base|manager} with ./agent.nix plus an inline module setting programs.git.config.user (committer identity = the agent's name) and systemd.services.<harness>.environment (HIVE_PORT, HIVE_LABEL, HIVE_DASHBOARD_PORT).

Tag state machine

Every approval id walks through a fixed set of tags on the underlying commit inside the applied repo:

Tag	When	Annotated?
`proposal/<id>`	request_apply_commit, after fetch	no
`approved/<id>`	operator approve	no
`building/<id>`	rebuild started	no
`deployed/<id>`	rebuild succeeded — `main` ff's here	no
`failed/<id>`	rebuild failed	yes (body = error)
`denied/<id>`	operator deny	yes (body = operator note)

applied/main is always the latest deployed/*. denied/ and failed/ are terminal; the manager submits a new commit + new approval id to retry. Because tags are first-class git objects, rejected and failed trees stay browsable forever — git log --tags in the applied repo is the audit trail.
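The table above maps naturally onto a small enum. The following is a sketch under assumptions: `TagState` and its methods are hypothetical names, the allowed transitions are inferred from the "When" column rather than taken from hive-c0re's source, and treating `deployed` as an end state alongside `failed` and `denied` is this sketch's reading.

```rust
/// One variant per tag prefix in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TagState {
    Proposal,
    Approved,
    Building,
    Deployed,
    Failed,
    Denied,
}

impl TagState {
    /// Tag namespace inside the applied repo.
    pub fn prefix(self) -> &'static str {
        match self {
            TagState::Proposal => "proposal",
            TagState::Approved => "approved",
            TagState::Building => "building",
            TagState::Deployed => "deployed",
            TagState::Failed => "failed",
            TagState::Denied => "denied",
        }
    }

    /// Full tag name for an approval id, e.g. "proposal/12".
    pub fn tag_name(self, id: u64) -> String {
        format!("{}/{}", self.prefix(), id)
    }

    /// Only failed/ and denied/ carry a message body (error or operator note).
    pub fn is_annotated(self) -> bool {
        matches!(self, TagState::Failed | TagState::Denied)
    }

    /// No further tag is written for this approval id after these.
    pub fn is_terminal(self) -> bool {
        matches!(self, TagState::Deployed | TagState::Failed | TagState::Denied)
    }

    /// Transitions inferred from the table (assumption, not verified).
    pub fn can_advance_to(self, next: TagState) -> bool {
        matches!(
            (self, next),
            (TagState::Proposal, TagState::Approved)
                | (TagState::Proposal, TagState::Denied)
                | (TagState::Approved, TagState::Building)
                | (TagState::Building, TagState::Deployed)
                | (TagState::Building, TagState::Failed)
        )
    }
}
```

Because a retry requires a new approval id, there is deliberately no transition out of `Failed` or `Denied`.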

Manager view of applied

/agents/<n>/applied.git is a read-only bind-mount of /var/lib/hyperhive/applied/<n>/.git inside the manager container. The manager fetches tags into its proposed clone (git fetch /agents/<n>/applied.git refs/tags/*:refs/tags/applied/*) and can then run git show on any deployed, failed, or denied tag to see what actually shipped, what error blocked the last build, or what note the operator left on a denial. The RO mount means git plumbing inside the manager cannot corrupt the applied repo.
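For tooling that wants to script this fetch, the invocation can be assembled with `std::process::Command`. This is a sketch only: `fetch_applied_tags` and `local_tag_ref` are hypothetical helpers, and the paths assume the in-container layout described in this document (/agents/<n>/config for the proposed clone, /agents/<n>/applied.git for the mount).

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) the tag fetch from the read-only applied mount
/// into the manager's proposed clone.
pub fn fetch_applied_tags(proposed_repo: &Path, agent: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.current_dir(proposed_repo)
        .arg("fetch")
        .arg(format!("/agents/{}/applied.git", agent))
        // Map every applied tag under refs/tags/applied/ so it cannot
        // collide with tags the manager creates in its own clone.
        .arg("refs/tags/*:refs/tags/applied/*");
    cmd
}

/// Ref under which an applied-repo tag shows up locally after that fetch,
/// e.g. "failed/4" becomes "refs/tags/applied/failed/4".
pub fn local_tag_ref(applied_tag: &str) -> String {
    format!("refs/tags/applied/{}", applied_tag)
}
```

Calling `.status()` on the returned command would run the fetch; afterwards `git show applied/failed/4` in the proposed clone would print the annotated tag body followed by the commit.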

Migration from the pre-tag scheme

There is no in-place migration. Each existing agent must be purged and re-spawned: run hive-c0re destroy --purge <name> (or click PURG3 on the dashboard), then issue request_spawn and have the operator approve the fresh agent. The new agent starts with deployed/0 seeded by hive-c0re; the manager's first config edit becomes proposal/1 and walks the tag scheme from there. Pre-overhaul tombstones lose their config history.

Manager (hm1nd) is hive-c0re-managed

The manager container runs through the same lifecycle as sub-agents. On hive-c0re serve startup, if hm1nd is missing, hive-c0re creates it. The manager's flake lives at /var/lib/hyperhive/applied/hm1nd/; its proposed config at /var/lib/hyperhive/agents/hm1nd/config/. Manager can edit its own agent.nix (visible inside the container at /agents/hm1nd/config/) and submit request_apply_commit("hm1nd", <sha>) for operator approval.

Differences from sub-agents:

flake.nix extends hyperhive.nixosConfigurations.manager (vs agent-base).
Container name is hm1nd (no h- prefix).
Fixed web UI port (MANAGER_PORT = 8000).
set_nspawn_flags adds an extra bind: /var/lib/hyperhive/agents → /agents (RW), so the manager can edit per-agent proposed repos.
First-deploy spawn bypasses the approval queue (manager is required infrastructure).
Per-agent socket lives at /run/hyperhive/manager/, owned by manager_server::start.

Migration note (for older hosts): drop any containers.hm1nd = { ... } block from your host NixOS config. hyperhive creates and updates the manager itself.

Manager policy

From hive-ag3nt/prompts/manager.md: the manager does NOT rubber-stamp sub-agent config requests. It verifies (role match, package legitimacy, cheaper alternative, blast radius) before committing and calling request_apply_commit.

For ambiguous cases or anything that needs human signal, the manager calls ask_operator(question, options?, multi?, ttl_seconds?), which queues the question on the dashboard and returns the id immediately. The operator's answer arrives later as HelperEvent::OperatorAnswered in the manager inbox. Storage is hive-c0re::operator_questions (sqlite); the answer flow is:

POST /answer-question/{id}
  → OperatorQuestions::answer
  → notify_manager(OperatorAnswered { id, question, answer })

Two more paths resolve a pending question with a sentinel answer:

POST /cancel-question/{id} (✗ CANC3L button on the dashboard) resolves with [cancelled]. The manager sees a terminal state and can fall back.
ttl_seconds deadline: a tokio watchdog spawned at submit time fires answer(id, "[expired]") once the ttl runs out. If the question has already been resolved by then, the call is a no-op. The dashboard surfaces a ⏳ MM:SS chip on each pending question with a deadline.
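The first-writer-wins behaviour that makes these races harmless can be sketched with an in-memory store. Everything here is illustrative: the real storage is sqlite and the watchdog is a tokio task, whereas this `Questions` type is a hypothetical, synchronous stand-in that only shows why a late `[expired]` or `[cancelled]` cannot overwrite an earlier answer.

```rust
use std::collections::HashMap;

/// Sentinel answers named in the text above.
pub const CANCELLED: &str = "[cancelled]";
pub const EXPIRED: &str = "[expired]";

/// Hypothetical in-memory stand-in for the question store.
#[derive(Default)]
pub struct Questions {
    next_id: u64,
    /// None means the question is still pending.
    answers: HashMap<u64, Option<String>>,
}

impl Questions {
    /// Queue a question and return its id immediately.
    pub fn ask(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.answers.insert(id, None);
        id
    }

    /// Resolve a pending question. Returns true only for the call that
    /// actually resolved it; later calls (operator answer, cancel, or
    /// ttl expiry arriving second) are no-ops and return false.
    pub fn answer(&mut self, id: u64, text: &str) -> bool {
        match self.answers.get_mut(&id) {
            Some(slot) if slot.is_none() => {
                *slot = Some(text.to_string());
                true
            }
            _ => false,
        }
    }

    /// Current answer, if the question exists and has been resolved.
    pub fn get(&self, id: u64) -> Option<&str> {
        self.answers.get(&id).and_then(|a| a.as_deref())
    }
}
```

In this shape, the manager would be notified only when `answer` returns true, so each question produces at most one `OperatorAnswered` event regardless of which path resolves it.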

Helper events to the manager

Coordinator::notify_manager(&HelperEvent) enqueues an inbox message from sender system with the event JSON in the body. The manager harness no longer short-circuits these — they drive a regular claude turn so the manager can react. Variants (hive_sh4re::HelperEvent):

ApprovalResolved { id, agent, commit_ref, status, note } — fired by actions::approve + actions::deny whenever an approval transitions to its terminal state.
Spawned { agent, ok, note } — actions::approve (Spawn-kind) + admin HostRequest::Spawn.
Rebuilt { agent, ok, note } — auto_update::rebuild_agent (covers startup scan + manual /rebuild from dashboard) + actions::approve (ApplyCommit).
Killed { agent } — admin HostRequest::Kill + dashboard /kill + manager Kill MCP tool.
Destroyed { agent } — actions::destroy.
ContainerCrash { agent, note } — crash_watch: a previously-running container went away with no operator-initiated transient state (Stopping / Restarting / Destroying / Rebuilding). Manager can start it again or escalate.
NeedsLogin { agent } — sub-agent has no claude session yet. Manager can't act directly (interactive OAuth); typically flags the operator.
LoggedIn { agent } — sub-agent just completed login. Manager often greets the agent on this event.
NeedsUpdate { agent } — sub-agent's recorded flake rev is stale. Manager calls update(name) to rebuild — idempotent, no approval required.
OperatorAnswered { id, question, answer } — dashboard /answer-question/{id} after the operator submits the answer form.

To add a new event: new HelperEvent variant + call sites + update prompts/manager.md so the manager knows the new shape.
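The variant list above could be modelled roughly as follows. Variant and field names come from the list; the field types (`u64` ids, `String` names, `Option<String>` notes) and the `agent` helper are assumptions for illustration, and the real `hive_sh4re::HelperEvent` may differ.

```rust
/// Sketch of the event enum. Field types are assumed, not confirmed.
#[derive(Debug, Clone, PartialEq)]
pub enum HelperEvent {
    ApprovalResolved { id: u64, agent: String, commit_ref: String, status: String, note: Option<String> },
    Spawned { agent: String, ok: bool, note: Option<String> },
    Rebuilt { agent: String, ok: bool, note: Option<String> },
    Killed { agent: String },
    Destroyed { agent: String },
    ContainerCrash { agent: String, note: Option<String> },
    NeedsLogin { agent: String },
    LoggedIn { agent: String },
    NeedsUpdate { agent: String },
    OperatorAnswered { id: u64, question: String, answer: String },
}

impl HelperEvent {
    /// Agent the event concerns, if any. The match has no wildcard arm,
    /// so adding a variant fails to compile until this is updated.
    pub fn agent(&self) -> Option<&str> {
        match self {
            HelperEvent::ApprovalResolved { agent, .. }
            | HelperEvent::Spawned { agent, .. }
            | HelperEvent::Rebuilt { agent, .. }
            | HelperEvent::Killed { agent }
            | HelperEvent::Destroyed { agent }
            | HelperEvent::ContainerCrash { agent, .. }
            | HelperEvent::NeedsLogin { agent }
            | HelperEvent::LoggedIn { agent }
            | HelperEvent::NeedsUpdate { agent } => Some(agent.as_str()),
            HelperEvent::OperatorAnswered { .. } => None,
        }
    }
}
```

Keeping matches exhaustive lets the compiler enumerate the call sites that need attention when a variant is added; the prompt update in prompts/manager.md still has to be done by hand.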

Auto-update on startup

hive-c0re serve runs auto_update::run in a background task right after opening the coordinator. It enumerates managed containers and rebuilds any whose recorded hyperhive rev differs from the current one — sub-agents and manager go through the same lifecycle::rebuild path.

"Rev" = canonical filesystem path of cfg.hyperhiveFlake. Marker file: /var/lib/hyperhive/applied/.<name>.hyperhive-rev. If the flake input has no canonical path (e.g. a github: URL), auto-update is a no-op — rebuild manually.
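The staleness check described above reduces to a string comparison. The sketch below is an assumption-laden illustration: `marker_path` and `needs_rebuild` are hypothetical names, trimming the marker contents is a guess about the file format, and treating a missing marker as stale is this sketch's choice rather than documented behaviour.

```rust
use std::path::{Path, PathBuf};

/// Location of the per-agent rev marker, following the pattern
/// /var/lib/hyperhive/applied/.<name>.hyperhive-rev.
pub fn marker_path(applied_root: &Path, name: &str) -> PathBuf {
    applied_root.join(format!(".{}.hyperhive-rev", name))
}

/// Decide whether a container should be rebuilt.
/// `recorded` is the marker file's contents (None if the file is absent);
/// `current_rev` is the canonical path of the hyperhive flake, or None
/// when the input has no canonical path (e.g. a github: URL).
pub fn needs_rebuild(recorded: Option<&str>, current_rev: Option<&str>) -> bool {
    match current_rev {
        // No canonical path: auto-update is a no-op.
        None => false,
        // Assumption: a missing marker counts as stale.
        Some(cur) => recorded.map(str::trim) != Some(cur),
    }
}
```

A caller would read the marker with `std::fs::read_to_string(marker_path(root, name)).ok()` and pass `as_deref()` of the result as `recorded`.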

The dashboard surfaces pending updates per agent: a clickable "needs update ↻" badge appears whenever the marker differs from current rev. The badge POSTs /rebuild/<name>, calling the same auto_update::rebuild_agent path so manual triggers and the startup scan can't drift. When at least one container is stale, a top-level ↻ UPD4TE 4LL button appears that loops over every stale container.
