hyperhive/docs/turn-loop.md
müde 62d1a74929 docs sync + revert auto-unfree removal
revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
2026-05-15 21:26:13 +02:00

153 lines
6.4 KiB
Markdown

# Turn loop + MCP
How the harness wakes up, what it asks claude to do, and what tools
claude has access to in return.
## The loop
Each agent harness (`hive-ag3nt serve` or `hive-m1nd serve`) runs:
1. Long-poll `Recv` on its socket. The host-side broker
(`broker.rs::recv_blocking`) returns immediately if there's a
pending message, otherwise waits up to 30 s for a broker `Sent`
event for this recipient.
2. Pop one message. Peek the remaining inbox depth with `Status`.
3. Emit `LiveEvent::TurnStart { from, body, unread }` onto the SSE
bus.
4. Spawn claude (one process per turn) and pipe the wake prompt
over stdin.
5. Stream stdout (JSON lines) into the bus as
`LiveEvent::Stream(value)`. Pump stderr as `Note`.
6. Wait for claude to exit. On `Prompt is too long`, run `/compact`
on the session once and retry the turn.
7. Emit `LiveEvent::TurnEnd { ok, note }`. Sleep `poll_ms` to avoid
tight loops on transient failures.
## The claude invocation
```
claude --print --verbose --output-format stream-json --model <name> \
--continue --settings /run/hive/claude-settings.json \
--system-prompt-file /run/hive/claude-system-prompt.md \
--mcp-config /run/hive/claude-mcp-config.json --strict-mcp-config \
--tools <builtins> --allowedTools <builtins+mcp>
# wake prompt piped over stdin
```
`<name>` is read from `Bus::model()` on each turn, default
`haiku`. Operator can flip it at runtime with `/model <name>` in
the web terminal — the next turn picks it up. The choice is
persisted to `/state/hyperhive-model` so it survives restart;
override path: `HYPERHIVE_MODEL_FILE` env var for tests.
`--continue` keeps a persistent session per agent (claude stores
sessions in `~/.claude/projects/`, which is bind-mounted
persistently). Auto-compact and auto-memory are disabled via
`--settings` because hyperhive owns compaction (`/compact` on
overflow, retry once; operator can also force one via `/api/compact`).
The wake prompt is intentionally minimal: just the popped message's
`from`/`body`, plus an inline `({unread} more pending — drain via
…)` hint when `unread > 0`. Claude drives any further `recv`/`send`
itself via the embedded MCP server.
Whenever hive-c0re starts / restarts / rebuilds a container, it
also drops a `system` message into the agent's inbox via
`Coordinator::kick_agent` — a one-line "you were just (re)started,
check /state/ for your notes, --continue session is intact". The
next turn picks it up like any other inbox message.
### On-boot files
`hive_ag3nt::turn::write_*` writes three files next to the per-agent
socket at `/run/hive/` once at startup:
- `claude-mcp-config.json` — re-invokes the running binary as `mcp`
child (so the same binary serves as harness + as claude's MCP
child process).
- `claude-settings.json` — the `--settings` blob (auto-compact and
auto-memory off, effortLevel medium).
- `claude-system-prompt.md` — rendered from
`hive-ag3nt/prompts/{agent,manager}.md` with `{label}`
substituted. Passed via `--system-prompt-file`.
The shared per-turn plumbing lives in `hive_ag3nt::turn::{write_mcp_config,
write_settings, write_system_prompt, run_turn, drive_turn,
emit_turn_end, wait_for_login, compact_session}` so the two binaries
can't drift.
## MCP surface
The harness ships an embedded MCP server (rmcp 1.7). Claude launches
it as a stdio child via `--mcp-config`. The hyperhive socket name is
`hyperhive`, so the tools land in claude as `mcp__hyperhive__<tool>`.
### Sub-agent tools
- `send(to, body)` — message a peer (logical agent name), another
agent, or the operator (recipient `operator`, surfaces in the
dashboard inbox).
- `recv(wait_seconds?)` — drain one inbox message. Long-polls
server-side; `wait_seconds` is capped at 180 (default 30 when
omitted). Agents use a long wait to park their turn waiting for
work instead of busy-looping with short polls — they wake
instantly when a message arrives.
### Manager tools (in addition to send/recv)
- `request_spawn(name)` — queue a Spawn approval for a brand-new
sub-agent (≤9 char name). Operator approves on the dashboard.
- `kill(name)` — graceful stop. No approval required.
- `start(name)` — start a stopped sub-agent. No approval.
- `restart(name)` — stop + start. No approval.
- `request_apply_commit(agent, commit_ref)` — submit a config
change for any agent (`hm1nd` for the manager's own config) for
operator approval.
- `ask_operator(question, options?, multi?, ttl_seconds?)`
surface a question on the dashboard. Non-blocking — returns the
queued question id; the operator's answer arrives later as
`HelperEvent::OperatorAnswered` in the manager inbox. Options
always render alongside a free-text fallback; `multi=true`
renders options as checkboxes. `ttl_seconds` auto-cancels with
answer `[expired]` after the deadline (useful for time-sensitive
decisions that become moot if the operator hasn't responded).
The operator can also manually cancel with `[cancelled]` via the
dashboard.
The boundary: lifecycle ops on *existing* sub-agents
(`kill`/`start`/`restart`) are at the manager's discretion — no
operator approval. Creating a new agent (`request_spawn`) and
changing any agent's config (`request_apply_commit`) still go
through the approval queue.
### Authoritative state
`hive_ag3nt::events::Bus` carries the current turn-loop state in
addition to the broadcast channel and the events history. Variants:
- `Idle` — sitting on `Recv` waiting for mail.
- `Thinking``claude --print` is running for a turn.
- `Compacting` — operator-triggered `/compact` is in flight.
The harness flips state at the relevant transitions
(`set_state(Thinking)` before `drive_turn`, `set_state(Idle)`
after; `set_state(Compacting)` around `compact_session`). Exposed
via `/api/state.turn_state` + `turn_state_since` (unix seconds);
the agent page renders this rather than deriving from SSE events.
### Tool envelope
`mcp::run_tool_envelope`: every MCP tool handler logs the request,
runs the body, logs the result. Pre-/post-log only — the inbox
status hint moved to the wake prompt + UI header.
### Tool whitelist (`mcp::ALLOWED_BUILTIN_TOOLS`)
- Allowed built-ins: `Bash`, `Edit`, `Glob`, `Grep`, `Read`,
`TodoWrite`, `Write`.
- Denied by omission: `WebFetch`, `WebSearch`, `Task`,
`NotebookEdit`.
- Allowed MCP tools: as listed above per flavor.
`Bash` is on the allow-list pending a finer-grained pattern allow-list
(`Bash(git *)`-style) — see [TODO](../TODO.md).