hyperhive/docs/approvals.md
müde a1cfb60fd0 docs: pre-load meta-flake design
scratchpad in claude.md and an in-flight callout at the top of
docs/approvals.md describe the upcoming overhaul so subsequent
commits can cite the design. covers: module-only agent flake
shape, /var/lib/hyperhive/meta/ as a hive-c0re-owned single
repo, applied remote pre-wired in proposed for manager git
plumbing, /meta RO bind for the system-wide deploy log,
auto-migration on hive-c0re startup with HIVE_SKIP_META_MIGRATION
kill-switch.
2026-05-16 00:06:42 +02:00

# Approvals + manager + helper events
The approval queue is hyperhive's pivot: nothing that changes the
shape of an agent (its config, whether it exists) happens without an
operator click. The manager (`hm1nd`) is the policy gate in front of
that queue; helper events are how it stays informed about what
happens after a decision lands.
## End-to-end approval flow
1. Manager edits files under `/agents/<name>/config/` (any tracked
path, but `agent.nix` is the contract entry point) and commits
with its own git identity.
2. Manager submits the commit sha via `request_apply_commit(agent,
commit_ref)`.
3. **hive-c0re immediately fetches that commit from the proposed
repo into the applied repo and tags it `proposal/<id>`.** The
approval row stores both the manager-supplied sha and the
canonical hive-c0re-vouched sha. From here on the proposed
repo is irrelevant for this approval — the manager can amend,
force-push, or `rm -rf` the proposed repo and the queued
approval still points at an immutable git object inside
applied.
4. Operator sees the diff on the dashboard, clicks ◆ APPR0VE (or
`hive-c0re approve <id>` on the CLI).
5. hive-c0re moves the working tree to `proposal/<id>` and runs
the build under a sequence of tags (see below). On success,
`applied/main` fast-forwards to the proposal commit. On
failure, main stays put and the working tree resets back to
the previous deployed commit.
6. `HelperEvent::ApprovalResolved` (and `Rebuilt` for the
ApplyCommit kind) land in the manager's inbox, carrying both
the canonical sha and the terminal tag.
`Spawn` approvals follow the same shape but skip the commit-diff
step — the operator just sees the name. On approve, hive-c0re
creates the container in a background task while the dashboard
shows a spinner.
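The success and failure paths of step 5 can be modelled without git at all. What follows is a minimal in-memory sketch, not hive-c0re's implementation: `AppliedRepo`, `apply_proposal`, and the `build` callback are hypothetical names, and shas are plain strings.

```rust
/// Hypothetical in-memory stand-in for one applied repo.
#[derive(Debug)]
struct AppliedRepo {
    /// Sha that `applied/main` points at (the last deployed commit).
    main: String,
    /// Sha currently checked out in the working tree.
    worktree: String,
}

/// Move the working tree to the proposal, run the build, then either
/// fast-forward `main` (success) or reset the tree to `main` (failure).
fn apply_proposal<F>(
    repo: &mut AppliedRepo,
    proposal_sha: &str,
    build: F,
) -> Result<(), String>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    repo.worktree = proposal_sha.to_string();
    match build(&repo.worktree) {
        Ok(()) => {
            // Build succeeded: main fast-forwards to the proposal commit.
            repo.main = repo.worktree.clone();
            Ok(())
        }
        Err(e) => {
            // Build failed: main stays put, tree resets to the last deploy.
            repo.worktree = repo.main.clone();
            Err(e)
        }
    }
}
```

Under this model, a failed build leaves both `main` and `worktree` on the previously deployed sha, which is the invariant the flow above relies on.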
## Meta flake (in flight)
> The next overhaul (currently being implemented) introduces a
> single hive-c0re-owned meta repo at
> `/var/lib/hyperhive/meta/` that consumes every agent's
> applied repo as a flake input and owns the wrapper
> nixosConfiguration. Each agent's `applied/<n>/flake.nix`
> shrinks to `nixosModules.default = import ./agent.nix` —
> `agent.nix` becomes a plain NixOS module function (no
> extendModules / hyperhive input). Containers will run
> against `--flake /var/lib/hyperhive/meta#<n>`. Every
> approval that builds does
> `nix flake lock --update-input agent-<n>` in meta and
> commits the lock; meta's git log is the system-wide deploy
> trail. Manager additionally gets `/applied/<n>/.git`
> pre-registered as the `applied` remote inside its proposed
> repo, and `/meta` RO-bound for browsing the deploy log.
> Auto-migrates on startup. Sections below describe the
> current (still-deployed) tag-driven shape that the meta
> flake builds on top of.
## Two repos per agent
```
/var/lib/hyperhive/agents/<name>/config/    proposed (manager RW)
└── <anything>             # any files the manager wants in the
                           # commit. agent.nix is the convention
                           # entry point; flake.nix is generated
                           # and not tracked here.

/var/lib/hyperhive/applied/<name>/          applied (core-only)
├── .git/                  # tag-rich history
├── .gitignore             # ignores flake.nix
├── flake.nix              # hive-c0re-generated, untracked,
│                          # rewritten on spawn/rebuild only
├── agent.nix              # working tree of main
└── <other manager files>  # also tracked
```
Why two physical repos: the manager's `/agents/<n>/config/` is
RW — a buggy or hostile agent can `git clean -fdx` its own
proposed tree. The applied repo is never bind-mounted writable
(its only exposure is the read-only `/applied` mount described
below), so a destructive move inside the container cannot reach it.
The container's `--flake` ref is `<applied_dir>#default`. The
generated `flake.nix` extends
`hyperhive.nixosConfigurations.{agent-base|manager}` with
`./agent.nix` plus an inline module setting
`programs.git.config.user` (committer identity = the agent's name)
and `systemd.services.<harness>.environment` (`HIVE_PORT`,
`HIVE_LABEL`, `HIVE_DASHBOARD_PORT`).
### Tag state machine
Every approval id walks through a fixed set of tags on the
underlying commit inside the applied repo:
| Tag | When | Annotated? |
|---|---|---|
| `proposal/<id>` | request_apply_commit, after fetch | no |
| `approved/<id>` | operator approve | no |
| `building/<id>` | rebuild started | no |
| `deployed/<id>` | rebuild succeeded — `main` ff's here | no |
| `failed/<id>` | rebuild failed | yes (body = error) |
| `denied/<id>` | operator deny | yes (body = operator note) |
`applied/main` is always the latest `deployed/*`. `denied/` and
`failed/` are terminal; the manager submits a new commit + new
approval id to retry. Because tags are first-class git objects,
rejected and failed trees stay browsable forever — `git log
--tags` in the applied repo is the audit trail.
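The table can be read as a small state machine. The sketch below is one way to encode it and is not taken from the hive-c0re source: `TagKind` and its methods are hypothetical names, numeric approval ids are an assumption based on `deployed/0` and `proposal/1` elsewhere in this document, and the successor ordering is inferred from the "When" column.

```rust
/// Tag kinds from the table above; names mirror the tag prefixes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TagKind {
    Proposal,
    Approved,
    Building,
    Deployed,
    Failed,
    Denied,
}

impl TagKind {
    /// Prefix used in the tag name, e.g. `proposal` in `proposal/7`.
    fn prefix(self) -> &'static str {
        match self {
            TagKind::Proposal => "proposal",
            TagKind::Approved => "approved",
            TagKind::Building => "building",
            TagKind::Deployed => "deployed",
            TagKind::Failed => "failed",
            TagKind::Denied => "denied",
        }
    }

    /// Full tag name for an approval id (numeric ids are an assumption).
    fn tag_name(self, id: u64) -> String {
        format!("{}/{}", self.prefix(), id)
    }

    /// Per the table, only `failed/` and `denied/` are annotated tags.
    fn is_annotated(self) -> bool {
        matches!(self, TagKind::Failed | TagKind::Denied)
    }

    /// Kinds that may follow this one for the same approval id
    /// (inferred ordering, not confirmed by the source).
    fn successors(self) -> &'static [TagKind] {
        match self {
            TagKind::Proposal => &[TagKind::Approved, TagKind::Denied],
            TagKind::Approved => &[TagKind::Building],
            TagKind::Building => &[TagKind::Deployed, TagKind::Failed],
            TagKind::Deployed | TagKind::Failed | TagKind::Denied => &[],
        }
    }

    /// True if `next` is a legal next tag after `self`.
    fn can_advance_to(self, next: TagKind) -> bool {
        self.successors().contains(&next)
    }
}
```

In this encoding `failed` and `denied` have no successors, matching the rule that a retry needs a new commit and a new approval id.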
### Manager view of applied
`/applied/` is a **read-only bind-mount** of
`/var/lib/hyperhive/applied/` (the entire tree) inside the
manager container. The manager fetches tags into its proposed
clone with `git fetch /applied/<n>/.git
'refs/tags/*:refs/tags/applied/*'` and `git show` any
deployed / failed / denied tree to see what actually shipped,
what error blocked the last build, or what note the operator
left on a denial. The RO bind means git plumbing inside the
manager cannot corrupt the applied repos — and a single mount
covers every agent (existing + future) without rebuilding the
manager on each spawn.
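For tooling that scripts this fetch, the command line can be assembled as follows. This is a sketch under assumptions: the function names are hypothetical, `git` is expected on `PATH`, and `proposed_dir` stands for the manager's proposed clone. Only the path and refspec come from the text above.

```rust
use std::path::Path;
use std::process::Command;

/// Arguments for the `git fetch` quoted above, for one agent.
fn fetch_applied_tags_args(agent: &str) -> Vec<String> {
    vec![
        "fetch".to_string(),
        format!("/applied/{agent}/.git"),
        "refs/tags/*:refs/tags/applied/*".to_string(),
    ]
}

/// Local name of an applied tag after the fetch: the refspec maps
/// `refs/tags/<tag>` to `refs/tags/applied/<tag>`.
fn local_tag_name(applied_tag: &str) -> String {
    format!("applied/{applied_tag}")
}

/// Run the fetch inside the proposed clone at `proposed_dir`.
/// Returns Ok(true) when git exits successfully.
fn fetch_applied_tags(proposed_dir: &Path, agent: &str) -> std::io::Result<bool> {
    let status = Command::new("git")
        .current_dir(proposed_dir)
        .args(fetch_applied_tags_args(agent))
        .status()?;
    Ok(status.success())
}
```

After a successful fetch, a tag such as `failed/4` in the applied repo would be visible locally as `applied/failed/4`, so `git show applied/failed/4` displays the annotated error body.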
## Migration from the pre-tag scheme
There is no in-place migration. Each existing agent must be
purged and re-spawned: `hive-c0re destroy --purge <name>` (or
PURG3 on the dashboard), then `request_spawn` and the operator
approves the fresh agent. The new agent starts with `deployed/0`
seeded by hive-c0re; the manager's first config edit becomes
`proposal/1` and walks the tag scheme from there. Pre-overhaul
tombstones lose their config history.
## Manager (`hm1nd`) is hive-c0re-managed
The manager container runs through the **same lifecycle as
sub-agents**. On `hive-c0re serve` startup, if `hm1nd` is missing,
hive-c0re creates it. The manager's flake lives at
`/var/lib/hyperhive/applied/hm1nd/`; its proposed config at
`/var/lib/hyperhive/agents/hm1nd/config/`. Manager can edit its own
`agent.nix` (visible inside the container at `/agents/hm1nd/config/`)
and submit `request_apply_commit("hm1nd", <sha>)` for operator
approval.
Differences from sub-agents:
- `flake.nix` extends `hyperhive.nixosConfigurations.manager`
(vs `agent-base`).
- Container name is `hm1nd` (no `h-` prefix).
- Fixed web UI port (`MANAGER_PORT = 8000`).
- `set_nspawn_flags` adds two extra binds: `/var/lib/hyperhive/agents`
→ `/agents` (RW) so the manager can edit per-agent proposed repos,
and `/var/lib/hyperhive/applied` → `/applied` (RO) so the manager
can `git fetch` deployed/failed/denied tags from any agent's
authoritative applied repo (see "Manager view of applied" above).
- First-deploy spawn bypasses the approval queue (manager is
required infrastructure).
- Per-agent socket lives at `/run/hyperhive/manager/`, owned by
`manager_server::start`.
**Migration note** (for older hosts): drop any `containers.hm1nd =
{ ... }` block from your host NixOS config. hyperhive creates and
updates the manager itself.
## Manager policy
From `hive-ag3nt/prompts/manager.md`: the manager does NOT
rubber-stamp sub-agent config requests. It verifies (role match,
package legitimacy, cheaper alternative, blast radius) before
committing and calling `request_apply_commit`.
For ambiguous cases or anything that needs human signal, the
manager calls `ask_operator(question, options?, multi?,
ttl_seconds?)`, which queues the question on the dashboard and
returns the id immediately. The operator's answer arrives later as
`HelperEvent::OperatorAnswered` in the manager inbox. Storage is
`hive-c0re::operator_questions` (sqlite); the answer flow is:
```
POST /answer-question/{id}
→ OperatorQuestions::answer
→ notify_manager(OperatorAnswered { id, question, answer })
```
Two more paths resolve a pending question with a sentinel answer:
- `POST /cancel-question/{id}` (✗ CANC3L button on the dashboard)
resolves with `[cancelled]`. The manager sees a terminal state
and can fall back.
- `ttl_seconds` deadline: a tokio watchdog spawned at submit time
fires `answer(id, "[expired]")` once the ttl runs out. If the
question was already resolved by then, the call is a no-op. The
dashboard surfaces a `⏳ MM:SS` chip on each pending question
with a deadline.
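All three resolution paths (answer, cancel, expiry) share one property: the first to arrive wins and later ones do nothing. The sketch below models that with an in-memory map. It is not the sqlite-backed `OperatorQuestions` store, it leaves out the tokio watchdog, and every type and method name in it is hypothetical.

```rust
use std::collections::HashMap;

/// Sentinel answers named in the text above.
const CANCELLED: &str = "[cancelled]";
const EXPIRED: &str = "[expired]";

#[derive(Debug, Clone, PartialEq)]
enum QuestionState {
    Pending,
    Resolved(String),
}

/// Hypothetical in-memory question table keyed by question id.
#[derive(Default)]
struct Questions {
    next_id: u64,
    rows: HashMap<u64, QuestionState>,
}

impl Questions {
    /// Queue a question and return its id immediately.
    fn ask(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.insert(id, QuestionState::Pending);
        id
    }

    /// Resolve a pending question. Returns false (and changes nothing)
    /// if the id is unknown or the question was already resolved.
    fn answer(&mut self, id: u64, answer: &str) -> bool {
        match self.rows.get_mut(&id) {
            Some(state) if *state == QuestionState::Pending => {
                *state = QuestionState::Resolved(answer.to_string());
                true
            }
            _ => false,
        }
    }

    /// Cancel button path: resolve with the cancel sentinel.
    fn cancel(&mut self, id: u64) -> bool {
        self.answer(id, CANCELLED)
    }

    /// Deadline path: resolve with the expiry sentinel.
    fn expire(&mut self, id: u64) -> bool {
        self.answer(id, EXPIRED)
    }

    /// Current state of a question, if the id exists.
    fn state(&self, id: u64) -> Option<&QuestionState> {
        self.rows.get(&id)
    }
}
```

Routing cancel and expiry through the same `answer` function keeps the "resolve once" check in a single place, so a late watchdog cannot overwrite an operator's answer.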
## Helper events to the manager
`Coordinator::notify_manager(&HelperEvent)` enqueues an inbox
message from sender `system` with the event JSON in the body. The
manager harness no longer short-circuits these — they drive a
regular claude turn so the manager can react. Variants
(`hive_sh4re::HelperEvent`):
- `ApprovalResolved { id, agent, commit_ref, status, note }` —
fired by `actions::approve` + `actions::deny` whenever an
approval transitions to its terminal state.
- `Spawned { agent, ok, note }` — `actions::approve` (Spawn-kind)
+ admin `HostRequest::Spawn`.
- `Rebuilt { agent, ok, note }` — `auto_update::rebuild_agent`
(covers startup scan + manual `/rebuild` from dashboard) +
`actions::approve` (ApplyCommit).
- `Killed { agent }` — admin `HostRequest::Kill` + dashboard
`/kill` + manager `Kill` MCP tool.
- `Destroyed { agent }` — `actions::destroy`.
- `ContainerCrash { agent, note }` — `crash_watch`: a previously-
running container went away with no operator-initiated transient
state (Stopping / Restarting / Destroying / Rebuilding). Manager
can `start` it again or escalate.
- `NeedsLogin { agent }` — sub-agent has no claude session yet.
Manager can't act directly (interactive OAuth); typically flags
the operator.
- `LoggedIn { agent }` — sub-agent just completed login. Manager
often greets the agent on this event.
- `NeedsUpdate { agent }` — sub-agent's recorded flake rev is
stale. Manager calls `update(name)` to rebuild — idempotent,
no approval required.
- `OperatorAnswered { id, question, answer }` — dashboard
`/answer-question/{id}` after the operator submits the answer
form.
To add a new event: new `HelperEvent` variant + call sites + update
`prompts/manager.md` so the manager knows the new shape.
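As a reading aid, the variants listed above can be written out as a Rust enum. The variant and field names come from the list; the field types (`u64` ids, `String` notes, `bool` for `ok`) and the `agent` accessor are assumptions for illustration and may differ from the real `hive_sh4re::HelperEvent`.

```rust
/// Sketch of the event enum; field types are assumed, not confirmed.
#[derive(Debug, Clone, PartialEq)]
enum HelperEvent {
    ApprovalResolved { id: u64, agent: String, commit_ref: String, status: String, note: String },
    Spawned { agent: String, ok: bool, note: String },
    Rebuilt { agent: String, ok: bool, note: String },
    Killed { agent: String },
    Destroyed { agent: String },
    ContainerCrash { agent: String, note: String },
    NeedsLogin { agent: String },
    LoggedIn { agent: String },
    NeedsUpdate { agent: String },
    OperatorAnswered { id: u64, question: String, answer: String },
}

impl HelperEvent {
    /// Agent the event is about, if any. The match has no wildcard
    /// arm, so adding a variant fails to compile until it is handled.
    fn agent(&self) -> Option<&str> {
        match self {
            HelperEvent::ApprovalResolved { agent, .. }
            | HelperEvent::Spawned { agent, .. }
            | HelperEvent::Rebuilt { agent, .. }
            | HelperEvent::Killed { agent, .. }
            | HelperEvent::Destroyed { agent, .. }
            | HelperEvent::ContainerCrash { agent, .. }
            | HelperEvent::NeedsLogin { agent, .. }
            | HelperEvent::LoggedIn { agent, .. }
            | HelperEvent::NeedsUpdate { agent, .. } => Some(agent.as_str()),
            HelperEvent::OperatorAnswered { .. } => None,
        }
    }
}
```

Exhaustive matches like the one in `agent` are one way to make the compiler point at code that must change when a variant is added; the prompt update in `prompts/manager.md` still has to be done by hand.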
## Auto-update on startup
`hive-c0re serve` runs `auto_update::run` in a background task right
after opening the coordinator. It enumerates managed containers and
rebuilds any whose recorded hyperhive rev differs from the current
one — sub-agents and manager go through the same `lifecycle::rebuild`
path.
"Rev" = canonical filesystem path of `cfg.hyperhiveFlake`. Marker
file: `/var/lib/hyperhive/applied/.<name>.hyperhive-rev`. If the
flake input has no canonical path (e.g. a `github:` URL),
auto-update is a no-op — rebuild manually.
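The staleness rule can be sketched as three small functions. The marker file name and the "no canonical path means no-op" rule come from the text above; the function names are hypothetical, and treating a missing marker as stale is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Marker file for one container under the applied root, e.g.
/// `/var/lib/hyperhive/applied/.<name>.hyperhive-rev`.
fn marker_path(applied_root: &Path, name: &str) -> PathBuf {
    applied_root.join(format!(".{name}.hyperhive-rev"))
}

/// Current rev: the canonical filesystem path of the flake input.
/// Returns None when the input does not resolve to a path on disk
/// (for example a `github:` URL).
fn current_rev(flake: &str) -> Option<String> {
    std::fs::canonicalize(flake)
        .ok()
        .map(|p| p.to_string_lossy().into_owned())
}

/// A container is stale when a current rev exists and differs from
/// the recorded one. Assumption: a missing marker counts as stale.
fn needs_update(recorded: Option<&str>, current: Option<&str>) -> bool {
    match current {
        // No canonical path: auto-update is a no-op.
        None => false,
        Some(cur) => recorded != Some(cur),
    }
}
```

Because the rev is a filesystem path rather than a content hash, this check only notices changes that alter the canonical path of the flake.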
The dashboard surfaces pending updates per agent: a clickable
"needs update ↻" badge appears whenever the marker differs from
current rev. The badge POSTs `/rebuild/<name>`, calling the same
`auto_update::rebuild_agent` path so manual triggers and the
startup scan can't drift. When at least one container is stale, a
top-level `↻ UPD4TE 4LL` button appears that loops over every
stale container.