manager: needs_login / logged_in / needs_update events + update tool

crash_watch grows two more state-axes alongside running/stopped:

- logged-in (claude session dir populated for the agent)
- up-to-date (recorded flake rev matches current)

per-tick transitions emit HelperEvent::NeedsLogin / LoggedIn /
NeedsUpdate. seed-on-first-tick semantics retained — nothing fires
on harness boot for agents that were already in their state. only
needs_update fires the 'stale appeared' direction; the resolved
direction is already covered by Rebuilt.

new mcp__hyperhive__update(name) on the manager surface: idempotent
rebuild via auto_update::rebuild_agent. transient-aware (Rebuilding)
so the dashboard shows the spinner. login intentionally has NO tool
— it's interactive OAuth, only the operator can complete it.

prompts + approvals doc + turn-loop doc updated. todo grows a
'show per-agent applied config in dashboard' entry (separate
follow-up).
This commit is contained in:
müde 2026-05-15 21:42:13 +02:00
parent b374f39b0d
commit 80229c6af9
8 changed files with 230 additions and 34 deletions

View file

@ -44,6 +44,15 @@ Pick anything from here when relevant. Cross-cutting design notes live in
## UI / UX ## UI / UX
- **Dashboard: show per-agent applied config.** Surface
`/var/lib/hyperhive/applied/<name>/agent.nix` (the file the
container actually builds from) as a collapsible `<details>`
block on each container row, alongside the journald viewer.
Backend: new `GET /api/agent-config/{name}` returns the file
contents (text/plain). Frontend: lazy-fetch on expand, render
inside a `<pre>` with the same theming as the journal panel.
Useful for spot-checking what `request_apply_commit` produced
without ssh-ing in.
- **xterm.js terminal** embedded per-agent, attached to a PTY exposed by - **xterm.js terminal** embedded per-agent, attached to a PTY exposed by
the harness. Pairs well with the unprivileged-container work — would let the harness. Pairs well with the unprivileged-container work — would let
the operator drop into the container without `nixos-container root-login`. the operator drop into the container without `nixos-container root-login`.

View file

@ -129,6 +129,14 @@ regular claude turn so the manager can react. Variants
running container went away with no operator-initiated transient running container went away with no operator-initiated transient
state (Stopping / Restarting / Destroying / Rebuilding). Manager state (Stopping / Restarting / Destroying / Rebuilding). Manager
can `start` it again or escalate. can `start` it again or escalate.
- `NeedsLogin { agent }` — sub-agent has no claude session yet.
Manager can't act directly (interactive OAuth); typically flags
the operator.
- `LoggedIn { agent }` — sub-agent just completed login. Manager
often greets the agent on this event.
- `NeedsUpdate { agent }` — sub-agent's recorded flake rev is
stale. Manager calls `update(name)` to rebuild — idempotent,
no approval required.
- `OperatorAnswered { id, question, answer }` — dashboard - `OperatorAnswered { id, question, answer }` — dashboard
`/answer-question/{id}` after the operator submits the answer `/answer-question/{id}` after the operator submits the answer
form. form.

View file

@ -100,6 +100,9 @@ it as a stdio child via `--mcp-config`. The hyperhive socket name is
- `kill(name)` — graceful stop. No approval required. - `kill(name)` — graceful stop. No approval required.
- `start(name)` — start a stopped sub-agent. No approval. - `start(name)` — start a stopped sub-agent. No approval.
- `restart(name)` — stop + start. No approval. - `restart(name)` — stop + start. No approval.
- `update(name)` — rebuild (re-applies the current hyperhive flake
+ agent.nix, restarts). No approval, idempotent. Manager calls
this on receipt of a `needs_update` system event.
- `request_apply_commit(agent, commit_ref)` — submit a config - `request_apply_commit(agent, commit_ref)` — submit a config
change for any agent (`hm1nd` for the manager's own config) for change for any agent (`hm1nd` for the manager's own config) for
operator approval. operator approval.

View file

@ -8,6 +8,7 @@ Tools (hyperhive surface):
- `mcp__hyperhive__kill(name)` — graceful stop on a sub-agent. No approval required. - `mcp__hyperhive__kill(name)` — graceful stop on a sub-agent. No approval required.
- `mcp__hyperhive__start(name)` — start a stopped sub-agent. No approval required. - `mcp__hyperhive__start(name)` — start a stopped sub-agent. No approval required.
- `mcp__hyperhive__restart(name)` — stop + start a sub-agent. No approval required. - `mcp__hyperhive__restart(name)` — stop + start a sub-agent. No approval required.
- `mcp__hyperhive__update(name)` — rebuild a sub-agent (re-applies the current hyperhive flake + agent.nix, restarts the container). No approval required — idempotent. Use when you receive a `needs_update` system event.
- `mcp__hyperhive__request_apply_commit(agent, commit_ref)` — submit a config change for any agent (`hm1nd` for self) for operator approval. - `mcp__hyperhive__request_apply_commit(agent, commit_ref)` — submit a config change for any agent (`hm1nd` for self) for operator approval.
- `mcp__hyperhive__ask_operator(question, options?, multi?, ttl_seconds?)` — surface a question on the dashboard. Returns immediately with a question id; the operator's answer arrives later as a system `operator_answered` event in your inbox. Options are advisory: the dashboard always lets the operator type a free-text answer in addition. Set `multi: true` to render options as checkboxes (operator can pick multiple); the answer comes back as `, `-separated. Set `ttl_seconds` to auto-cancel after a deadline — useful when the decision becomes moot if the operator hasn't responded in time; on expiry the answer is `[expired]`. Do not poll inside the same turn — finish the current work and react when the event lands. - `mcp__hyperhive__ask_operator(question, options?, multi?, ttl_seconds?)` — surface a question on the dashboard. Returns immediately with a question id; the operator's answer arrives later as a system `operator_answered` event in your inbox. Options are advisory: the dashboard always lets the operator type a free-text answer in addition. Set `multi: true` to render options as checkboxes (operator can pick multiple); the answer comes back as `, `-separated. Set `ttl_seconds` to auto-cancel after a deadline — useful when the decision becomes moot if the operator hasn't responded in time; on expiry the answer is `[expired]`. Do not poll inside the same turn — finish the current work and react when the event lands.
@ -26,7 +27,13 @@ You're the policy gate between sub-agents and the operator's approval queue —
Two ways to talk to the operator: `send(to: "operator", ...)` for fire-and-forget status / pointers (surfaces in the operator inbox), or `ask_operator(question, options?)` when you need a decision. `ask_operator` is non-blocking — it queues the question and returns an id immediately; the answer arrives on a future turn as an `operator_answered` system event. Prefer `ask_operator` over an open-ended `send` for anything you actually need to wait on. Two ways to talk to the operator: `send(to: "operator", ...)` for fire-and-forget status / pointers (surfaces in the operator inbox), or `ask_operator(question, options?)` when you need a decision. `ask_operator` is non-blocking — it queues the question and returns an id immediately; the answer arrives on a future turn as an `operator_answered` system event. Prefer `ask_operator` over an open-ended `send` for anything you actually need to wait on.
Messages from sender `system` are hyperhive helper events (JSON body, `event` field discriminates): `approval_resolved`, `spawned`, `rebuilt`, `killed`, `destroyed`, `container_crash`, `operator_answered`. Use these to react to lifecycle changes — e.g. greet a freshly-spawned agent, retry a failed rebuild, restart an agent whose container crashed, or pick up the operator's answer to a question you previously asked. Messages from sender `system` are hyperhive helper events (JSON body, `event` field discriminates): `approval_resolved`, `spawned`, `rebuilt`, `killed`, `destroyed`, `container_crash`, `needs_login`, `logged_in`, `needs_update`, `operator_answered`. Use these to react to lifecycle changes:
- `needs_login` — agent has no claude session yet. You can't help directly (login is interactive OAuth on the operator side); flag the operator if it's been long.
- `logged_in` — agent just completed login; first useful turn is imminent. Good time to brief them on what to do.
- `needs_update` — agent's flake rev is stale. Call `update(name)` to rebuild — it's idempotent and doesn't need approval.
- `container_crash` — restart with `start(name)`. If it crashes again, ask the operator.
- otherwise greet freshly-spawned agents, retry failed rebuilds, pick up the operator's answer to questions you asked.
Durable knowledge: Durable knowledge:

View file

@ -240,6 +240,12 @@ pub struct RestartArgs {
pub name: String, pub name: String,
} }
#[derive(Debug, serde::Deserialize, schemars::JsonSchema)]
pub struct UpdateArgs {
/// Sub-agent name (without the `h-` container prefix).
pub name: String,
}
#[derive(Debug, serde::Deserialize, schemars::JsonSchema)] #[derive(Debug, serde::Deserialize, schemars::JsonSchema)]
pub struct AskOperatorArgs { pub struct AskOperatorArgs {
/// The question to surface on the dashboard. /// The question to surface on the dashboard.
@ -400,6 +406,23 @@ impl ManagerServer {
.await .await
} }
#[tool(
description = "Rebuild a sub-agent: re-applies the current hyperhive flake + agent.nix \
and restarts the container. No approval required idempotent. Use when you receive a \
`needs_update` system event for an agent."
)]
async fn update(&self, Parameters(args): Parameters<UpdateArgs>) -> String {
let log = format!("{args:?}");
let name = args.name.clone();
run_tool_envelope("update", log, async move {
let resp = self
.dispatch(hive_sh4re::ManagerRequest::Update { name: args.name })
.await;
format_ack(resp, "update", format!("updated {name}"))
})
.await
}
#[tool( #[tool(
description = "Surface a question to the operator on the dashboard. Returns immediately \ description = "Surface a question to the operator on the dashboard. Returns immediately \
with a question id do NOT wait inline. When the operator answers, a system message \ with a question id do NOT wait inline. When the operator answers, a system message \
@ -512,6 +535,7 @@ pub fn allowed_mcp_tools(flavor: Flavor) -> Vec<String> {
"kill", "kill",
"start", "start",
"restart", "restart",
"update",
"request_apply_commit", "request_apply_commit",
"ask_operator", "ask_operator",
], ],

View file

@ -1,15 +1,22 @@
//! Container crash watcher. Polls every managed container's running //! Per-container state watcher. Polls every managed container on a
//! state on a fixed interval; when a previously-running container is //! fixed interval, tracks three orthogonal state-sets across ticks,
//! suddenly stopped AND no operator-initiated transient (`Stopping`, //! and emits a `HelperEvent` to the manager on each transition:
//! `Restarting`, `Destroying`) was set, fire `HelperEvent::ContainerCrash`
//! into the manager's inbox. The manager can then react — usually
//! a `start` or a config rebuild.
//! //!
//! D-Bus subscription would be lower-latency, but polling is far //! - **running**: container is up. running → stopped without an
//! simpler and the failure modes are honest (a crash discovered 10s //! operator-initiated transient (`Stopping` / `Restarting` /
//! late is fine for our scale). //! `Destroying` / `Rebuilding`) → `ContainerCrash`.
//! - **logged-in**: claude session dir is populated. ! → ✓ →
//! `LoggedIn`; ✓ → ! → `NeedsLogin` (rare — usually only fires
//! on a fresh spawn / purge).
//! - **up-to-date**: agent's recorded flake rev matches current. ✓
//! → ! → `NeedsUpdate`. The reverse direction (`NeedsUpdate`
//! resolved) is covered by `Rebuilt`, so no separate event.
//!
//! D-Bus subscription would be lower-latency for the first axis,
//! but polling is simpler and a 10s detection delay is fine.
use std::collections::HashSet; use std::collections::HashSet;
use std::path::Path;
use std::sync::Arc; use std::sync::Arc;
use std::time::Duration; use std::time::Duration;
@ -20,14 +27,17 @@ const POLL_INTERVAL: Duration = Duration::from_secs(10);
pub fn spawn(coord: Arc<Coordinator>) { pub fn spawn(coord: Arc<Coordinator>) {
tokio::spawn(async move { tokio::spawn(async move {
// Seed the running-set from the first poll so we don't emit a
// crash for every agent on startup. First tick fills it; only
// running→stopped transitions across subsequent ticks count.
let mut prev_running: HashSet<String> = HashSet::new(); let mut prev_running: HashSet<String> = HashSet::new();
let mut prev_logged_in: HashSet<String> = HashSet::new();
let mut prev_updated: HashSet<String> = HashSet::new();
let mut seeded = false; let mut seeded = false;
loop { loop {
let raw = lifecycle::list().await.unwrap_or_default(); let raw = lifecycle::list().await.unwrap_or_default();
let current_rev = crate::auto_update::current_flake_rev(&coord.hyperhive_flake);
let mut current_running = HashSet::new(); let mut current_running = HashSet::new();
let mut current_logged_in = HashSet::new();
let mut current_updated = HashSet::new();
let mut sub_agents: Vec<String> = Vec::new();
for c in &raw { for c in &raw {
let logical = if c == MANAGER_NAME { let logical = if c == MANAGER_NAME {
MANAGER_NAME.to_owned() MANAGER_NAME.to_owned()
@ -36,14 +46,42 @@ pub fn spawn(coord: Arc<Coordinator>) {
} else { } else {
continue; continue;
}; };
if logical != MANAGER_NAME {
sub_agents.push(logical.clone());
}
if lifecycle::is_running(&logical).await { if lifecycle::is_running(&logical).await {
current_running.insert(logical); current_running.insert(logical.clone());
}
if logical != MANAGER_NAME
&& claude_has_session(&Coordinator::agent_claude_dir(&logical))
{
current_logged_in.insert(logical.clone());
}
if let Some(rev) = current_rev.as_deref()
&& !crate::auto_update::agent_needs_update(&logical, rev)
{
current_updated.insert(logical.clone());
} }
} }
if seeded { if seeded {
emit_crash_transitions(&coord, &prev_running, &current_running);
emit_login_transitions(&coord, &prev_logged_in, &current_logged_in, &sub_agents);
emit_update_transitions(&coord, &prev_updated, &current_updated, &sub_agents);
}
prev_running = current_running;
prev_logged_in = current_logged_in;
prev_updated = current_updated;
seeded = true;
tokio::time::sleep(POLL_INTERVAL).await;
}
});
}
fn emit_crash_transitions(coord: &Coordinator, prev: &HashSet<String>, current: &HashSet<String>) {
let transients = coord.transient_snapshot(); let transients = coord.transient_snapshot();
for stopped in prev_running.difference(&current_running) { for stopped in prev.difference(current) {
let deliberate = transients.get(stopped).is_some_and(|st| { let deliberate = transients.get(stopped).is_some_and(|st| {
matches!( matches!(
st.kind, st.kind,
@ -63,10 +101,76 @@ pub fn spawn(coord: Arc<Coordinator>) {
}); });
} }
} }
prev_running = current_running;
seeded = true;
tokio::time::sleep(POLL_INTERVAL).await; fn emit_login_transitions(
} coord: &Coordinator,
prev: &HashSet<String>,
current: &HashSet<String>,
sub_agents: &[String],
) {
for agent in current.difference(prev) {
tracing::info!(%agent, "agent logged in");
coord.notify_manager(&hive_sh4re::HelperEvent::LoggedIn {
agent: agent.clone(),
}); });
} }
// Only count NeedsLogin transitions for agents that exist and
// are *not* logged in — the difference set above already gives
// us "was in prev, gone from current" but we also want to fire
// for agents that newly appeared as not-logged-in (post-spawn /
// post-purge). Treat sub_agents minus current as the
// currently-needs-login set; emit when an agent enters it.
let prev_needs: HashSet<&str> = sub_agents
.iter()
.map(String::as_str)
.filter(|n| !prev.contains(*n))
.collect();
let current_needs: HashSet<&str> = sub_agents
.iter()
.map(String::as_str)
.filter(|n| !current.contains(*n))
.collect();
for agent in current_needs.difference(&prev_needs) {
tracing::info!(%agent, "agent needs login");
coord.notify_manager(&hive_sh4re::HelperEvent::NeedsLogin {
agent: (*agent).to_owned(),
});
}
}
fn emit_update_transitions(
coord: &Coordinator,
prev_updated: &HashSet<String>,
current_updated: &HashSet<String>,
sub_agents: &[String],
) {
// Fired on the "was up-to-date, now isn't" transition. The
// reverse (rebuilt) is already covered by HelperEvent::Rebuilt.
let prev_stale: HashSet<&str> = sub_agents
.iter()
.map(String::as_str)
.filter(|n| !prev_updated.contains(*n))
.collect();
let current_stale: HashSet<&str> = sub_agents
.iter()
.map(String::as_str)
.filter(|n| !current_updated.contains(*n))
.collect();
for agent in current_stale.difference(&prev_stale) {
tracing::info!(%agent, "agent needs update");
coord.notify_manager(&hive_sh4re::HelperEvent::NeedsUpdate {
agent: (*agent).to_owned(),
});
}
}
/// Mirrors `dashboard::claude_has_session`. Lives here too so the
/// watcher doesn't depend on dashboard internals.
fn claude_has_session(dir: &Path) -> bool {
let Ok(entries) = std::fs::read_dir(dir) else {
return false;
};
entries
.flatten()
.any(|e| e.file_type().is_ok_and(|t| t.is_file()))
}

View file

@ -204,6 +204,27 @@ async fn dispatch(req: &ManagerRequest, coord: &Arc<Coordinator>) -> ManagerResp
}, },
} }
} }
ManagerRequest::Update { name } => {
tracing::info!(%name, "manager: update");
let Some(current_rev) = crate::auto_update::current_flake_rev(&coord.hyperhive_flake)
else {
return ManagerResponse::Err {
message: "update: hyperhive_flake has no canonical path".into(),
};
};
coord.set_transient(name, crate::coordinator::TransientKind::Rebuilding);
let result = crate::auto_update::rebuild_agent(coord, name, &current_rev).await;
coord.clear_transient(name);
match result {
Ok(()) => {
coord.kick_agent(name, "container rebuilt");
ManagerResponse::Ok
}
Err(e) => ManagerResponse::Err {
message: format!("{e:#}"),
},
}
}
ManagerRequest::AskOperator { ManagerRequest::AskOperator {
question, question,
options, options,

View file

@ -259,6 +259,20 @@ pub enum HelperEvent {
/// A sub-agent's container was torn down (container removed; state /// A sub-agent's container was torn down (container removed; state
/// dirs preserved per `destroy` semantics). /// dirs preserved per `destroy` semantics).
Destroyed { agent: String }, Destroyed { agent: String },
/// A sub-agent's container has no claude session yet (first
/// spawn, or `--purge` wiped creds). Manager can't do anything
/// about it directly — login is interactive OAuth — but it
/// surfaces so the manager knows the agent is in partial-run
/// mode and can flag the operator.
NeedsLogin { agent: String },
/// An agent successfully completed claude login — the session
/// dir now contains creds. Transition fires once per login.
LoggedIn { agent: String },
/// An agent's recorded flake rev is stale relative to the
/// current hyperhive rev. The manager has the `update` tool to
/// trigger a rebuild without operator approval (it's a no-op
/// when nothing actually changed).
NeedsUpdate { agent: String },
/// Container exited without an operator-initiated stop. Fired by /// Container exited without an operator-initiated stop. Fired by
/// the crash watcher when an agent's container transitions from /// the crash watcher when an agent's container transitions from
/// running → stopped and no `Stopping` / `Restarting` / /// running → stopped and no `Stopping` / `Restarting` /
@ -326,6 +340,12 @@ pub enum ManagerRequest {
Restart { Restart {
name: String, name: String,
}, },
/// Rebuild a sub-agent: re-applies the current hyperhive flake +
/// agent.nix, restarts the container. No approval required —
/// it's idempotent and the manager owns its own update cadence.
Update {
name: String,
},
/// Submit a config commit for the user to approve. `commit_ref` is opaque /// Submit a config commit for the user to approve. `commit_ref` is opaque
/// to the host (typically a git sha pointing into the agent's config repo). /// to the host (typically a git sha pointing into the agent's config repo).
/// On approval the host applies the change via `nixos-container update`. /// On approval the host applies the change via `nixos-container update`.