# Approvals + manager + helper events

The approval queue is hyperhive's pivot: nothing that changes the
shape of an agent (its config, whether it exists) happens without an
operator click. The manager (`hm1nd`) is the policy gate in front of
that queue; helper events are how it stays informed about what
happens after a decision lands.

## End-to-end approval flow

1. Manager edits files under `/agents/<name>/config/` (any tracked
   path, but `agent.nix` is the contract entry point) and commits
   with its own git identity.
2. Manager submits the commit sha via `request_apply_commit(agent,
   commit_ref)`.
3. **hive-c0re immediately fetches that commit from the proposed
   repo into the applied repo and tags it `proposal/<id>`.** The
   approval row stores both the manager-supplied sha and the
   canonical hive-c0re-vouched sha. From here on the proposed
   repo is irrelevant for this approval — the manager can amend,
   force-push, or `rm -rf` the proposed repo and the queued
   approval still points at an immutable git object inside
   applied.
4. Operator sees the diff on the dashboard, clicks ◆ APPR0VE (or
   `hive-c0re approve <id>` on the CLI).
5. hive-c0re moves the working tree to `proposal/<id>` and runs
   the build under a sequence of tags (see below). On success,
   `applied/main` fast-forwards to the proposal commit. On
   failure, main stays put and the working tree resets back to
   the previous deployed commit.
6. `HelperEvent::ApprovalResolved` (and `Rebuilt` for the
   ApplyCommit kind) land in the manager's inbox, carrying both
   the canonical sha and the terminal tag.

`Spawn` approvals follow the same shape but skip the commit-diff
step — the operator just sees the name. On approve, hive-c0re
creates the container in a background task while the dashboard
shows a spinner.

## Meta flake

The hive-c0re-owned repo at `/var/lib/hyperhive/meta/`
declares one flake input per agent (`agent-<n>.url =
"git+file:///var/lib/hyperhive/applied/<n>"`) and one
`nixosConfigurations.<n>` output per agent. Each output wraps
`inputs.agent-<n>.nixosModules.default` with the identity +
`HIVE_PORT` / `HIVE_LABEL` / `HIVE_DASHBOARD_PORT` injection
module that `setup_applied` used to generate inline.
Containers run against `--flake /var/lib/hyperhive/meta#<n>`.

Per-deploy lock flow (two-phase, owned by
`actions::run_apply_commit` →
`meta::{prepare,finalize,abort}_deploy`):

1. `meta::prepare_deploy(name)` runs
   `nix flake lock --update-input agent-<n>` without
   committing. Working tree of meta now points the input at
   `applied/<n>/main` (which `run_apply_commit` already
   fast-forwarded to `proposal/<id>`).
2. `lifecycle::rebuild_no_meta` runs
   `nixos-container update <c> --flake meta#<name>`. Nix
   evaluates against the staged lock.
3. On success,
   `meta::finalize_deploy(name, sha, "deployed/<id>")` stages
   `flake.lock` and commits with
   `deploy <n> deployed/<id> <sha12>`. Meta's git log gains
   one entry per successful deploy.
4. On failure, `meta::abort_deploy()` runs
   `git restore flake.lock` so the meta history shows only
   successes; the failure stays as an annotated `failed/<id>`
   tag in `applied/<n>`.
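
The two-phase flow above can be sketched as a small in-memory model. This is not the hive-c0re implementation: `MetaRepo`, `DeployOutcome`, `run_deploy`, the `u64` id, and the lock string format are hypothetical stand-ins, and the build step is a closure instead of `nixos-container update`. Only the method names and the commit subject format come from the text.

```rust
/// Result of one deploy attempt against the meta repo.
#[derive(Debug, PartialEq)]
enum DeployOutcome {
    /// Lock committed; carries the commit subject.
    Committed(String),
    /// Lock restored; carries the build error.
    Aborted(String),
}

/// Hypothetical in-memory stand-in for the meta repo.
struct MetaRepo {
    committed: String, // flake.lock as of HEAD
    working: String,   // flake.lock in the working tree
    log: Vec<String>,  // commit subjects, oldest first
}

impl MetaRepo {
    fn new(lock: &str) -> Self {
        MetaRepo {
            committed: lock.to_string(),
            working: lock.to_string(),
            log: Vec::new(),
        }
    }

    /// Phase 1: stage the new pin in the working tree without committing.
    fn prepare_deploy(&mut self, agent: &str, sha: &str) {
        self.working = format!("agent-{}@{}", agent, sha);
    }

    /// Phase 2 (success): commit the staged lock, subject as in step 3.
    fn finalize_deploy(&mut self, agent: &str, sha: &str, tag: &str) -> String {
        let sha12: String = sha.chars().take(12).collect();
        let subject = format!("deploy {} {} {}", agent, tag, sha12);
        self.committed = self.working.clone();
        self.log.push(subject.clone());
        subject
    }

    /// Phase 2 (failure): drop the staged lock, like `git restore flake.lock`.
    fn abort_deploy(&mut self) {
        self.working = self.committed.clone();
    }
}

/// Run prepare, then the build, then finalize or abort.
fn run_deploy<F>(meta: &mut MetaRepo, agent: &str, sha: &str, id: u64, build: F) -> DeployOutcome
where
    F: FnOnce(&str) -> Result<(), String>,
{
    meta.prepare_deploy(agent, sha);
    // The build evaluates against the staged (uncommitted) lock.
    let result = build(meta.working.as_str());
    match result {
        Ok(()) => {
            let tag = format!("deployed/{}", id);
            DeployOutcome::Committed(meta.finalize_deploy(agent, sha, &tag))
        }
        Err(err) => {
            meta.abort_deploy();
            DeployOutcome::Aborted(err)
        }
    }
}
```

In this model a failed build leaves `log` untouched and `working` equal to `committed`, which matches the "meta history shows only successes" property described in step 4.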

Single-phase variants exist for paths without
rollback semantics: `meta::lock_update_for_rebuild(name)` for
the manual `↻ R3BU1LD` button (commits if the lock changed)
and `meta::lock_update_hyperhive()` for the
auto-update flake-rev bump (one shot before per-agent
rebuilds, commits if the lock changed).

`meta::sync_agents(hyperhive_flake, dashboard_port, &agents)`
is the idempotent reconciler called by `spawn`, `destroy`,
`rebuild`, and the startup migration. Renders `flake.nix`
from the agent list; if it differs from disk, runs
`nix flake lock` + commits as `regenerate meta flake` (or
`seed meta from N agent(s)` on the very first call).
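
A minimal sketch of the reconciler's decision logic, under stated assumptions: the real `flake.nix` template is not shown in this document, so `render_flake` emits a placeholder that only reuses the input URL shape from above, and the sort is an illustrative choice to make rendering order-independent. The simplified two-argument `sync_agents` here is hypothetical and returns what would be written and committed instead of touching disk or running `nix flake lock`.

```rust
/// Render a placeholder for the generated `flake.nix`.
/// Only the input URL shape is taken from the text; the rest is made up.
fn render_flake(agents: &[&str]) -> String {
    let mut sorted: Vec<&str> = agents.to_vec();
    sorted.sort_unstable(); // assumption: stable order so re-renders compare equal
    let mut out = String::from("{\n");
    for name in &sorted {
        out.push_str(&format!(
            "  inputs.agent-{0}.url = \"git+file:///var/lib/hyperhive/applied/{0}\";\n",
            name
        ));
    }
    out.push_str("}\n");
    out
}

/// Decide what the reconciler would do.
/// `on_disk` is the current `flake.nix` content, or `None` on the first call.
/// Returns the new content plus the commit subject, or `None` if nothing changed.
fn sync_agents(on_disk: Option<&str>, agents: &[&str]) -> Option<(String, String)> {
    let rendered = render_flake(agents);
    match on_disk {
        // Same content on disk: idempotent no-op.
        Some(current) if current == rendered => None,
        Some(_) => Some((rendered, "regenerate meta flake".to_string())),
        None => {
            let subject = format!("seed meta from {} agent(s)", agents.len());
            Some((rendered, subject))
        }
    }
}
```

Calling it twice with the same agent set returns `None` the second time, which is the property that lets `spawn`, `destroy`, `rebuild`, and the startup migration all call the reconciler unconditionally.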

The manager has `/meta` RO-bound inside its container:
`git -C /meta log --oneline` is the swarm-wide deploy log,
`cat /meta/flake.lock | jq '.nodes["agent-<n>"].locked'`
resolves which sha each agent is pinned at right now.
Dashboard surfaces the same info as a `deployed:<sha12>` chip
per container row.

## Two repos per agent

```
/var/lib/hyperhive/agents/<name>/config/    proposed — manager RW
└── <anything>                              # any files the manager
                                            # wants in the commit.
                                            # agent.nix is the
                                            # convention entry
                                            # point; flake.nix is
                                            # tracked boilerplate
                                            # (manager doesn't edit
                                            # it).

/var/lib/hyperhive/applied/<name>/          applied — core-only
├── .git/                                   # tag-rich history
├── flake.nix                               # tracked, fixed
│                                           # boilerplate exporting
│                                           # nixosModules.default
├── agent.nix                               # working tree of main
└── <other manager files>                   # also tracked

/var/lib/hyperhive/meta/                    swarm-wide flake — core
├── .git/                                   # one commit per successful
│                                           # deploy
├── flake.nix                               # generated from agent set
└── flake.lock                              # pins each agent's sha
```

Why two physical repos: the manager's `/agents/<n>/config/` is
RW — a buggy or hostile agent can `git clean -fdx` its own
proposed tree. The applied repo is never bind-mounted (except
the read-only `.git` exposure described below) so a destructive
move inside the container cannot reach it.

The container's `--flake` ref is `/var/lib/hyperhive/meta#<name>`
(see "Meta flake" above). The agent's own `applied/<n>/flake.nix`
is a fixed boilerplate that exports `nixosModules.default =
import ./agent.nix`; the meta flake imports that module and
wraps it with identity + `HIVE_PORT` / `HIVE_LABEL` /
`HIVE_DASHBOARD_PORT`.

### Tag state machine

Every approval id walks through a fixed set of tags on the
underlying commit inside the applied repo:

| Tag | When | Annotated? |
|---|---|---|
| `proposal/<id>` | request_apply_commit, after fetch | no |
| `approved/<id>` | operator approve | no |
| `building/<id>` | rebuild started | no |
| `deployed/<id>` | rebuild succeeded — `main` ff's here | no |
| `failed/<id>` | rebuild failed | yes (body = error) |
| `denied/<id>` | operator deny | yes (body = operator note) |

`applied/main` is always the latest `deployed/*`. `denied/` and
`failed/` are terminal; the manager submits a new commit + new
approval id to retry. Because every tag keeps its commit reachable,
rejected and failed trees stay browsable forever;
`git log --tags` in the applied repo is the audit trail.
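
The table can be read as a small state machine. The sketch below is one interpretation, not code from hive-c0re: the `Tag` enum, the `u64` id type, and the exact transition set in `can_advance_to` are assumptions inferred from the "When" column.

```rust
/// One tag kind per row of the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tag {
    Proposal,
    Approved,
    Building,
    Deployed,
    Failed,
    Denied,
}

impl Tag {
    /// Ref prefix as it appears in the applied repo.
    fn prefix(self) -> &'static str {
        match self {
            Tag::Proposal => "proposal",
            Tag::Approved => "approved",
            Tag::Building => "building",
            Tag::Deployed => "deployed",
            Tag::Failed => "failed",
            Tag::Denied => "denied",
        }
    }

    /// Full tag name for an approval id, e.g. `proposal/42`.
    fn ref_name(self, id: u64) -> String {
        format!("{}/{}", self.prefix(), id)
    }

    /// Only `failed/` and `denied/` carry a message body.
    fn is_annotated(self) -> bool {
        matches!(self, Tag::Failed | Tag::Denied)
    }

    /// Assumed transitions: proposal -> approved | denied,
    /// approved -> building, building -> deployed | failed.
    fn can_advance_to(self, next: Tag) -> bool {
        matches!(
            (self, next),
            (Tag::Proposal, Tag::Approved)
                | (Tag::Proposal, Tag::Denied)
                | (Tag::Approved, Tag::Building)
                | (Tag::Building, Tag::Deployed)
                | (Tag::Building, Tag::Failed)
        )
    }
}
```

Nothing leads out of `Denied` or `Failed` in this model, which mirrors the rule that a retry needs a new commit and a new approval id.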

### Manager view of applied + meta

The manager container gets three host-side bind mounts via
`set_nspawn_flags`:

- `/var/lib/hyperhive/agents/` → `/agents/` (RW) — proposed
  repos. Manager edits + commits per-agent config here.
- `/var/lib/hyperhive/applied/` → `/applied/` (RO) — every
  agent's authoritative applied repo, including `.git`.
- `/var/lib/hyperhive/meta/` → `/meta/` (RO) — the swarm-wide
  deploy flake.

Each proposed repo (`/agents/<n>/config/`) is pre-configured
with `applied` as a git remote pointing at
`/applied/<n>/.git`. Useful incantations from inside the
manager:

```sh
git -C /agents/<n>/config fetch applied
git -C /agents/<n>/config log applied/main --oneline
git -C /agents/<n>/config show applied/refs/tags/deployed/<id>
git -C /agents/<n>/config show applied/refs/tags/failed/<id>   # body = build error
git -C /agents/<n>/config show applied/refs/tags/denied/<id>   # body = operator note
git -C /agents/<n>/config rebase applied/main                 # base in-flight work on what's deployed

git -C /meta log --oneline                                    # swarm-wide deploy history
cat /meta/flake.lock | jq '.nodes | with_entries(select(.key | startswith("agent-")))'
```

The RO binds block push at the kernel level, so the manager
can only fetch / read — git plumbing inside the container
cannot corrupt either authoritative repo.

## Migration from the pre-tag / pre-meta schemes

Both overhauls (tag-driven flow + meta flake) ship in-place
migrations that run on every hive-c0re startup. Idempotent;
each phase is a no-op once already applied. Behaviour:

- Tag-driven phase: assumes the operator ran the one-shot
  `git tag deployed/0 main` script (see commit history /
  earlier docs revisions) once per agent. Tagging is
  non-destructive: it doesn't touch live containers, state
  dirs, or claude creds.
- Meta-flake phase: rewrites each `applied/<n>/flake.nix` to
  the module-only boilerplate, wires the `applied` remote in
  each proposed repo, bootstraps the meta repo from the
  current agent list, and `nixos-container update`s every
  container at `meta#<n>`. The expensive last step is
  guarded by `/var/lib/hyperhive/.meta-migration-done` so
  it only runs once across hive-c0re restarts. Set
  `HIVE_SKIP_META_MIGRATION=1` on the service to defer.

Neither migration loses state: claude creds, `/state/` notes,
the events DB, proposed history, and applied history all survive.
The manager keeps its session; sub-agents stay logged in.

## Manager (`hm1nd`) is hive-c0re-managed

The manager container runs through the **same lifecycle as
sub-agents**. On `hive-c0re serve` startup, if `hm1nd` is missing,
hive-c0re creates it. The manager's flake lives at
`/var/lib/hyperhive/applied/hm1nd/`; its proposed config at
`/var/lib/hyperhive/agents/hm1nd/config/`. Manager can edit its own
`agent.nix` (visible inside the container at `/agents/hm1nd/config/`)
and submit `request_apply_commit("hm1nd", <sha>)` for operator
approval.

Differences from sub-agents:

- `flake.nix` extends `hyperhive.nixosConfigurations.manager`
  (vs `agent-base`).
- Container name is `hm1nd` (no `h-` prefix).
- Fixed web UI port (`MANAGER_PORT = 8000`).
- `set_nspawn_flags` adds three extra binds: `/var/lib/hyperhive/agents`
  → `/agents` (RW) so the manager can edit per-agent proposed repos,
  `/var/lib/hyperhive/applied` → `/applied` (RO) so the manager
  can `git fetch` deployed/failed/denied tags from any agent's
  authoritative applied repo, and `/var/lib/hyperhive/meta` → `/meta`
  (RO) for the swarm-wide deploy flake
  (see "Manager view of applied + meta" above).
- First-deploy spawn bypasses the approval queue (manager is
  required infrastructure).
- Per-agent socket lives at `/run/hyperhive/manager/`, owned by
  `manager_server::start`.

**Migration note** (for older hosts): drop any `containers.hm1nd =
{ ... }` block from your host NixOS config. hyperhive creates and
updates the manager itself.

## Manager policy

From `hive-ag3nt/prompts/manager.md`: the manager does NOT
rubber-stamp sub-agent config requests. It verifies (role match,
package legitimacy, cheaper alternative, blast radius) before
committing and calling `request_apply_commit`.

For ambiguous cases or anything that needs human signal, the
manager calls `ask(question, options?, multi?, ttl_seconds?, to?)` —
queues the question and returns the id immediately. When `to` is
omitted (or `"operator"`) the question shows up on the dashboard;
when `to` is a sub-agent's name, the recipient receives a
`HelperEvent::QuestionAsked` and answers via their own `answer`
tool. Either way the answer arrives back as
`HelperEvent::QuestionAnswered { id, question, answer, answerer }`
in the asker's inbox. Storage is `hive-c0re::operator_questions`
(sqlite) — same table, with a nullable `target` column
(NULL = operator). Dispatch goes through
`hive-c0re/src/questions.rs::{handle_ask, handle_answer}` so both
the agent + manager surfaces stay aligned. The answer flow is:

```
POST /answer-question/{id}                       agent: Answer { id, answer }
  → OperatorQuestions::answer(_, _, "operator")    → questions::handle_answer
  → notify_agent(asker, QuestionAnswered {         → OperatorQuestions::answer(_, _, agent)
       answerer: "operator", ... })                → notify_agent(asker, QuestionAnswered {
                                                       answerer: agent, ... })
```
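
The `to` routing and the nullable `target` column described above can be sketched as follows. `Target`, `route`, and `target_column` are hypothetical names for illustration; only the rule itself (omitted or `"operator"` means the dashboard, anything else is a sub-agent name, NULL means operator) comes from the text.

```rust
/// Where a question is delivered.
#[derive(Debug, PartialEq)]
enum Target {
    /// Shown on the dashboard.
    Operator,
    /// Delivered to the named agent as `QuestionAsked`.
    Agent(String),
}

/// Map the optional `to` argument of `ask` onto a delivery target.
fn route(to: Option<&str>) -> Target {
    match to {
        None | Some("operator") => Target::Operator,
        Some(name) => Target::Agent(name.to_string()),
    }
}

/// Value for the nullable `target` column: `None` (NULL) means operator.
fn target_column(target: &Target) -> Option<&str> {
    match target {
        Target::Operator => None,
        Target::Agent(name) => Some(name.as_str()),
    }
}
```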

Two more paths resolve a pending question with a sentinel answer:

- `POST /cancel-question/{id}` (✗ CANC3L button on the dashboard)
  resolves with `[cancelled]`. The manager sees a terminal state
  and can fall back.
- `ttl_seconds` deadline: a tokio watchdog spawned at submit time
  fires `answer(id, "[expired]")` once the ttl runs out. If the
  question was already resolved, the late call is a no-op. The
  dashboard surfaces a `⏳ MM:SS` chip on each pending question
  with a deadline.
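
All three resolution paths (answer, cancel, expiry) share a first-writer-wins rule. A minimal sketch of that rule, assuming an in-memory map in place of the sqlite table: `Questions` and `Resolution` are hypothetical types, and the `"operator"` answerer used for cancellation is a guess, since the text only names the answerer for the operator, peer, and ttl-watchdog cases.

```rust
use std::collections::HashMap;

/// What the asker receives in `QuestionAnswered`.
#[derive(Debug, Clone, PartialEq)]
struct Resolution {
    answer: String,
    answerer: String,
}

/// Hypothetical in-memory stand-in for the question store.
#[derive(Default)]
struct Questions {
    next_id: u64,
    pending: HashMap<u64, String>, // id -> asker
}

impl Questions {
    /// Queue a question and return its id immediately.
    fn ask(&mut self, asker: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.pending.insert(id, asker.to_string());
        id
    }

    /// Resolve a pending question. The first caller wins and gets back
    /// the asker to notify; later callers get `None` (a no-op).
    fn answer(&mut self, id: u64, answer: &str, answerer: &str) -> Option<(String, Resolution)> {
        let asker = self.pending.remove(&id)?;
        let resolution = Resolution {
            answer: answer.to_string(),
            answerer: answerer.to_string(),
        };
        Some((asker, resolution))
    }

    /// Dashboard cancel. The answerer value is an assumption.
    fn cancel(&mut self, id: u64) -> Option<(String, Resolution)> {
        self.answer(id, "[cancelled]", "operator")
    }

    /// Deadline expiry, as the watchdog would fire it.
    fn expire(&mut self, id: u64) -> Option<(String, Resolution)> {
        self.answer(id, "[expired]", "ttl-watchdog")
    }
}
```

Because resolution removes the entry from `pending`, a watchdog that fires after a real answer finds nothing to resolve and sends no second notification.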

## Helper events to the manager

`Coordinator::notify_manager(&HelperEvent)` enqueues an inbox
message from sender `system` with the event JSON in the body. The
manager harness no longer short-circuits these — they drive a
regular claude turn so the manager can react. Variants
(`hive_sh4re::HelperEvent`):

- `ApprovalResolved { id, agent, commit_ref, status, note }` —
  fired by `actions::approve` + `actions::deny` whenever an
  approval transitions to its terminal state.
- `Spawned { agent, ok, note }` — `actions::approve` (Spawn-kind)
  + admin `HostRequest::Spawn`.
- `Rebuilt { agent, ok, note }` — `auto_update::rebuild_agent`
  (covers startup scan + manual `/rebuild` from dashboard) +
  `actions::approve` (ApplyCommit).
- `Killed { agent }` — admin `HostRequest::Kill` + dashboard
  `/kill` + manager `Kill` MCP tool.
- `Destroyed { agent }` — `actions::destroy`.
- `ContainerCrash { agent, note }`: fired by `crash_watch` when a
  previously-running container went away with no operator-initiated
  transient state (Stopping / Restarting / Destroying / Rebuilding).
  Manager can `start` it again or escalate.
- `NeedsLogin { agent }` — sub-agent has no claude session yet.
  Manager can't act directly (interactive OAuth); typically flags
  the operator.
- `LoggedIn { agent }` — sub-agent just completed login. Manager
  often greets the agent on this event.
- `NeedsUpdate { agent }` — sub-agent's recorded flake rev is
  stale. Manager calls `update(name)` to rebuild — idempotent,
  no approval required.
- `QuestionAnswered { id, question, answer, answerer }` —
  dashboard `/answer-question/{id}` (answerer = `"operator"`),
  peer `Answer` request (answerer = agent name), or ttl watchdog
  expiry (answerer = `"ttl-watchdog"`, answer = `"[expired]"`).
- `QuestionAsked { id, asker, question, options, multi }` —
  fired when an agent calls `Ask { to: Some(<this-agent>), ... }`.
  The recipient responds via `Answer { id, answer }` and the
  asker sees the matching `QuestionAnswered`.
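
As one example of how an event source decides when to fire, the `ContainerCrash` rule can be sketched as a set difference filtered by transient state. This is an illustration under assumptions, not the `crash_watch` code: `detect_crashes`, the snapshot sets, and the `Transient` enum are hypothetical, and the container names in any usage are made up.

```rust
use std::collections::{BTreeSet, HashMap};

/// Operator-initiated transient states that excuse a disappearance.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Transient {
    Stopping,
    Restarting,
    Destroying,
    Rebuilding,
}

/// Containers that were running in `previous`, are gone in `current`,
/// and have no transient state recorded. Result is in name order.
fn detect_crashes(
    previous: &BTreeSet<String>,
    current: &BTreeSet<String>,
    transient: &HashMap<String, Transient>,
) -> Vec<String> {
    previous
        .difference(current)
        .filter(|name| !transient.contains_key(*name))
        .cloned()
        .collect()
}
```

A container that vanished mid-rebuild is skipped here, so only unexplained disappearances would produce an event.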

To add a new event: new `HelperEvent` variant + call sites + update
`hive-ag3nt/prompts/manager.md` so the manager knows the new shape.

## Auto-update on startup

`hive-c0re serve` runs `auto_update::run` in a background task right
after opening the coordinator. It enumerates managed containers and
rebuilds any whose recorded hyperhive rev differs from the current
one — sub-agents and manager go through the same `lifecycle::rebuild`
path.

"Rev" = canonical filesystem path of `cfg.hyperhiveFlake`. Marker
file: `/var/lib/hyperhive/applied/.<name>.hyperhive-rev`. If the
flake input has no canonical path (e.g. a `github:` URL),
auto-update is a no-op — rebuild manually.
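
The staleness check can be sketched as a pure comparison. `marker_path`, `UpdateDecision`, and `decide` are hypothetical names; the marker file name comes from the text, while treating a missing marker as stale is an assumption the text does not confirm.

```rust
use std::path::{Path, PathBuf};

/// Marker file for one container under the applied root.
fn marker_path(applied_root: &Path, name: &str) -> PathBuf {
    applied_root.join(format!(".{}.hyperhive-rev", name))
}

#[derive(Debug, PartialEq)]
enum UpdateDecision {
    /// Recorded rev matches the current one.
    UpToDate,
    /// Recorded rev differs (or, by assumption, is missing).
    Rebuild,
    /// No canonical path for the flake input, so auto-update does nothing.
    Skip,
}

/// `recorded` is the marker content, `current` the canonical path of
/// the hyperhive flake (`None` for inputs such as a `github:` URL).
fn decide(recorded: Option<&str>, current: Option<&str>) -> UpdateDecision {
    match (recorded, current) {
        (_, None) => UpdateDecision::Skip,
        (Some(r), Some(c)) if r == c => UpdateDecision::UpToDate,
        _ => UpdateDecision::Rebuild,
    }
}
```

In a real run, `current` would come from something like `std::fs::canonicalize` on the configured flake path, and `recorded` from reading the marker file.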

The dashboard surfaces pending updates per agent: a clickable
"needs update ↻" badge appears whenever the marker differs from
current rev. The badge POSTs `/rebuild/<name>`, calling the same
`auto_update::rebuild_agent` path so manual triggers and the
startup scan can't drift. When at least one container is stale, a
top-level `↻ UPD4TE 4LL` button appears that loops over every
stale container.