diff --git a/TODO.md b/TODO.md index 28ff89e..da96b5f 100644 --- a/TODO.md +++ b/TODO.md @@ -44,6 +44,15 @@ Pick anything from here when relevant. Cross-cutting design notes live in ## UI / UX +- **Dashboard: show per-agent applied config.** Surface + `/var/lib/hyperhive/applied//agent.nix` (the file the + container actually builds from) as a collapsible `
` + block on each container row, alongside the journald viewer. + Backend: new `GET /api/agent-config/{name}` returns the file + contents (text/plain). Frontend: lazy-fetch on expand, render + inside a `
` with the same theming as the journal panel.
+  Useful for spot-checking what `request_apply_commit` produced
+  without ssh-ing in.
 - **xterm.js terminal** embedded per-agent, attached to a PTY exposed by
   the harness. Pairs well with the unprivileged-container work — would let
   the operator drop into the container without `nixos-container root-login`.
diff --git a/docs/approvals.md b/docs/approvals.md
index bc88982..f892838 100644
--- a/docs/approvals.md
+++ b/docs/approvals.md
@@ -129,6 +129,14 @@ regular claude turn so the manager can react. Variants
   running container went away with no operator-initiated transient
   state (Stopping / Restarting / Destroying / Rebuilding). Manager
   can `start` it again or escalate.
+- `NeedsLogin { agent }` — sub-agent has no claude session yet.
+  Manager can't act directly (interactive OAuth); typically flags
+  the operator.
+- `LoggedIn { agent }` — sub-agent just completed login. Manager
+  often greets the agent on this event.
+- `NeedsUpdate { agent }` — sub-agent's recorded flake rev is
+  stale. Manager calls `update(name)` to rebuild — idempotent,
+  no approval required.
 - `OperatorAnswered { id, question, answer }` — dashboard
   `/answer-question/{id}` after the operator submits the answer
   form.
diff --git a/docs/turn-loop.md b/docs/turn-loop.md
index 11178fd..3c8fb75 100644
--- a/docs/turn-loop.md
+++ b/docs/turn-loop.md
@@ -100,6 +100,9 @@ it as a stdio child via `--mcp-config`. The hyperhive socket name is
 - `kill(name)` — graceful stop. No approval required.
 - `start(name)` — start a stopped sub-agent. No approval.
 - `restart(name)` — stop + start. No approval.
+- `update(name)` — rebuild (re-applies the current hyperhive flake
+  + agent.nix, restarts). No approval, idempotent. Manager calls
+  this on receipt of a `needs_update` system event.
 - `request_apply_commit(agent, commit_ref)` — submit a config
   change for any agent (`hm1nd` for the manager's own config) for
   operator approval.
diff --git a/hive-ag3nt/prompts/manager.md b/hive-ag3nt/prompts/manager.md
index 9a239f4..5ae01fa 100644
--- a/hive-ag3nt/prompts/manager.md
+++ b/hive-ag3nt/prompts/manager.md
@@ -8,6 +8,7 @@ Tools (hyperhive surface):
 - `mcp__hyperhive__kill(name)` — graceful stop on a sub-agent. No approval required.
 - `mcp__hyperhive__start(name)` — start a stopped sub-agent. No approval required.
 - `mcp__hyperhive__restart(name)` — stop + start a sub-agent. No approval required.
+- `mcp__hyperhive__update(name)` — rebuild a sub-agent (re-applies the current hyperhive flake + agent.nix, restarts the container). No approval required — idempotent. Use when you receive a `needs_update` system event.
 - `mcp__hyperhive__request_apply_commit(agent, commit_ref)` — submit a config change for any agent (`hm1nd` for self) for operator approval.
 - `mcp__hyperhive__ask_operator(question, options?, multi?, ttl_seconds?)` — surface a question on the dashboard. Returns immediately with a question id; the operator's answer arrives later as a system `operator_answered` event in your inbox. Options are advisory: the dashboard always lets the operator type a free-text answer in addition. Set `multi: true` to render options as checkboxes (operator can pick multiple); the answer comes back as `, `-separated. Set `ttl_seconds` to auto-cancel after a deadline — useful when the decision becomes moot if the operator hasn't responded in time; on expiry the answer is `[expired]`. Do not poll inside the same turn — finish the current work and react when the event lands.
 
@@ -26,7 +27,13 @@ You're the policy gate between sub-agents and the operator's approval queue —
 
 Two ways to talk to the operator: `send(to: "operator", ...)` for fire-and-forget status / pointers (surfaces in the operator inbox), or `ask_operator(question, options?)` when you need a decision. `ask_operator` is non-blocking — it queues the question and returns an id immediately; the answer arrives on a future turn as an `operator_answered` system event. Prefer `ask_operator` over an open-ended `send` for anything you actually need to wait on.
 
-Messages from sender `system` are hyperhive helper events (JSON body, `event` field discriminates): `approval_resolved`, `spawned`, `rebuilt`, `killed`, `destroyed`, `container_crash`, `operator_answered`. Use these to react to lifecycle changes — e.g. greet a freshly-spawned agent, retry a failed rebuild, restart an agent whose container crashed, or pick up the operator's answer to a question you previously asked.
+Messages from sender `system` are hyperhive helper events (JSON body, `event` field discriminates): `approval_resolved`, `spawned`, `rebuilt`, `killed`, `destroyed`, `container_crash`, `needs_login`, `logged_in`, `needs_update`, `operator_answered`. Use these to react to lifecycle changes:
+
+- `needs_login` — agent has no claude session yet. You can't help directly (login is interactive OAuth on the operator side); flag the operator if it's been long.
+- `logged_in` — agent just completed login; first useful turn is imminent. Good time to brief them on what to do.
+- `needs_update` — agent's flake rev is stale. Call `update(name)` to rebuild — it's idempotent and doesn't need approval.
+- `container_crash` — restart with `start(name)`. If it crashes again, ask the operator.
+- otherwise greet freshly-spawned agents, retry failed rebuilds, pick up the operator's answer to questions you asked.
 
 Durable knowledge:
 
diff --git a/hive-ag3nt/src/mcp.rs b/hive-ag3nt/src/mcp.rs
index bd38c05..dcfe56e 100644
--- a/hive-ag3nt/src/mcp.rs
+++ b/hive-ag3nt/src/mcp.rs
@@ -240,6 +240,12 @@ pub struct RestartArgs {
     pub name: String,
 }
 
+#[derive(Debug, serde::Deserialize, schemars::JsonSchema)]
+pub struct UpdateArgs {
+    /// Sub-agent name (without the `h-` container prefix).
+    pub name: String,
+}
+
 #[derive(Debug, serde::Deserialize, schemars::JsonSchema)]
 pub struct AskOperatorArgs {
     /// The question to surface on the dashboard.
@@ -400,6 +406,23 @@ impl ManagerServer {
         .await
     }
 
+    #[tool(
+        description = "Rebuild a sub-agent: re-applies the current hyperhive flake + agent.nix \
+        and restarts the container. No approval required — idempotent. Use when you receive a \
+        `needs_update` system event for an agent."
+    )]
+    async fn update(&self, Parameters(args): Parameters) -> String {
+        let log = format!("{args:?}");
+        let name = args.name.clone();
+        run_tool_envelope("update", log, async move {
+            let resp = self
+                .dispatch(hive_sh4re::ManagerRequest::Update { name: args.name })
+                .await;
+            format_ack(resp, "update", format!("updated {name}"))
+        })
+        .await
+    }
+
     #[tool(
         description = "Surface a question to the operator on the dashboard. Returns immediately \
         with a question id — do NOT wait inline. When the operator answers, a system message \
@@ -512,6 +535,7 @@ pub fn allowed_mcp_tools(flavor: Flavor) -> Vec {
             "kill",
             "start",
             "restart",
+            "update",
             "request_apply_commit",
             "ask_operator",
         ],
diff --git a/hive-c0re/src/crash_watch.rs b/hive-c0re/src/crash_watch.rs
index a0b7dc0..66c3600 100644
--- a/hive-c0re/src/crash_watch.rs
+++ b/hive-c0re/src/crash_watch.rs
@@ -1,15 +1,22 @@
-//! Container crash watcher. Polls every managed container's running
-//! state on a fixed interval; when a previously-running container is
-//! suddenly stopped AND no operator-initiated transient (`Stopping`,
-//! `Restarting`, `Destroying`) was set, fire `HelperEvent::ContainerCrash`
-//! into the manager's inbox. The manager can then react — usually
-//! a `start` or a config rebuild.
+//! Per-container state watcher. Polls every managed container on a
+//! fixed interval, tracks three orthogonal state-sets across ticks,
+//! and emits a `HelperEvent` to the manager on each transition:
 //!
-//! D-Bus subscription would be lower-latency, but polling is far
-//! simpler and the failure modes are honest (a crash discovered 10s
-//! late is fine for our scale).
+//! - **running**: container is up. running → stopped without an
+//!   operator-initiated transient (`Stopping` / `Restarting` /
+//!   `Destroying` / `Rebuilding`) → `ContainerCrash`.
+//! - **logged-in**: claude session dir is populated. ! → ✓ →
+//!   `LoggedIn`; ✓ → ! → `NeedsLogin` (rare — usually only fires
+//!   on a fresh spawn / purge).
+//! - **up-to-date**: agent's recorded flake rev matches current. ✓
+//!   → ! → `NeedsUpdate`. The reverse direction (`NeedsUpdate`
+//!   resolved) is covered by `Rebuilt`, so no separate event.
+//!
+//! D-Bus subscription would be lower-latency for the first axis,
+//! but polling is simpler and a 10s detection delay is fine.
 
 use std::collections::HashSet;
+use std::path::Path;
 use std::sync::Arc;
 use std::time::Duration;
 
@@ -20,14 +27,17 @@ const POLL_INTERVAL: Duration = Duration::from_secs(10);
 
 pub fn spawn(coord: Arc) {
     tokio::spawn(async move {
-        // Seed the running-set from the first poll so we don't emit a
-        // crash for every agent on startup. First tick fills it; only
-        // running→stopped transitions across subsequent ticks count.
         let mut prev_running: HashSet = HashSet::new();
+        let mut prev_logged_in: HashSet = HashSet::new();
+        let mut prev_updated: HashSet = HashSet::new();
         let mut seeded = false;
         loop {
             let raw = lifecycle::list().await.unwrap_or_default();
+            let current_rev = crate::auto_update::current_flake_rev(&coord.hyperhive_flake);
             let mut current_running = HashSet::new();
+            let mut current_logged_in = HashSet::new();
+            let mut current_updated = HashSet::new();
+            let mut sub_agents: Vec = Vec::new();
             for c in &raw {
                 let logical = if c == MANAGER_NAME {
                     MANAGER_NAME.to_owned()
@@ -36,37 +46,131 @@ pub fn spawn(coord: Arc) {
                 } else {
                     continue;
                 };
+                if logical != MANAGER_NAME {
+                    sub_agents.push(logical.clone());
+                }
                 if lifecycle::is_running(&logical).await {
-                    current_running.insert(logical);
+                    current_running.insert(logical.clone());
+                }
+                if logical != MANAGER_NAME
+                    && claude_has_session(&Coordinator::agent_claude_dir(&logical))
+                {
+                    current_logged_in.insert(logical.clone());
+                }
+                if let Some(rev) = current_rev.as_deref()
+                    && !crate::auto_update::agent_needs_update(&logical, rev)
+                {
+                    current_updated.insert(logical.clone());
                 }
             }
 
             if seeded {
-                let transients = coord.transient_snapshot();
-                for stopped in prev_running.difference(¤t_running) {
-                    let deliberate = transients.get(stopped).is_some_and(|st| {
-                        matches!(
-                            st.kind,
-                            TransientKind::Stopping
-                                | TransientKind::Restarting
-                                | TransientKind::Destroying
-                                | TransientKind::Rebuilding
-                        )
-                    });
-                    if deliberate {
-                        continue;
-                    }
-                    tracing::warn!(agent = %stopped, "container crash detected");
-                    coord.notify_manager(&hive_sh4re::HelperEvent::ContainerCrash {
-                        agent: stopped.clone(),
-                        note: Some("container stopped without an operator action".into()),
-                    });
-                }
+                emit_crash_transitions(&coord, &prev_running, ¤t_running);
+                emit_login_transitions(&coord, &prev_logged_in, ¤t_logged_in, &sub_agents);
+                emit_update_transitions(&coord, &prev_updated, ¤t_updated, &sub_agents);
             }
             prev_running = current_running;
+            prev_logged_in = current_logged_in;
+            prev_updated = current_updated;
             seeded = true;
 
             tokio::time::sleep(POLL_INTERVAL).await;
         }
     });
 }
+
+fn emit_crash_transitions(coord: &Coordinator, prev: &HashSet, current: &HashSet) {
+    let transients = coord.transient_snapshot();
+    for stopped in prev.difference(current) {
+        let deliberate = transients.get(stopped).is_some_and(|st| {
+            matches!(
+                st.kind,
+                TransientKind::Stopping
+                    | TransientKind::Restarting
+                    | TransientKind::Destroying
+                    | TransientKind::Rebuilding
+            )
+        });
+        if deliberate {
+            continue;
+        }
+        tracing::warn!(agent = %stopped, "container crash detected");
+        coord.notify_manager(&hive_sh4re::HelperEvent::ContainerCrash {
+            agent: stopped.clone(),
+            note: Some("container stopped without an operator action".into()),
+        });
+    }
+}
+
+fn emit_login_transitions(
+    coord: &Coordinator,
+    prev: &HashSet,
+    current: &HashSet,
+    sub_agents: &[String],
+) {
+    for agent in current.difference(prev) {
+        tracing::info!(%agent, "agent logged in");
+        coord.notify_manager(&hive_sh4re::HelperEvent::LoggedIn {
+            agent: agent.clone(),
+        });
+    }
+    // Only count NeedsLogin transitions for agents that exist and
+    // are *not* logged in — the difference set above already gives
+    // us "was in prev, gone from current" but we also want to fire
+    // for agents that newly appeared as not-logged-in (post-spawn /
+    // post-purge). Treat sub_agents minus current as the
+    // currently-needs-login set; emit when an agent enters it.
+    let prev_needs: HashSet<&str> = sub_agents
+        .iter()
+        .map(String::as_str)
+        .filter(|n| !prev.contains(*n))
+        .collect();
+    let current_needs: HashSet<&str> = sub_agents
+        .iter()
+        .map(String::as_str)
+        .filter(|n| !current.contains(*n))
+        .collect();
+    for agent in current_needs.difference(&prev_needs) {
+        tracing::info!(%agent, "agent needs login");
+        coord.notify_manager(&hive_sh4re::HelperEvent::NeedsLogin {
+            agent: (*agent).to_owned(),
+        });
+    }
+}
+
+fn emit_update_transitions(
+    coord: &Coordinator,
+    prev_updated: &HashSet,
+    current_updated: &HashSet,
+    sub_agents: &[String],
+) {
+    // Fired on the "was up-to-date, now isn't" transition. The
+    // reverse (rebuilt) is already covered by HelperEvent::Rebuilt.
+    let prev_stale: HashSet<&str> = sub_agents
+        .iter()
+        .map(String::as_str)
+        .filter(|n| !prev_updated.contains(*n))
+        .collect();
+    let current_stale: HashSet<&str> = sub_agents
+        .iter()
+        .map(String::as_str)
+        .filter(|n| !current_updated.contains(*n))
+        .collect();
+    for agent in current_stale.difference(&prev_stale) {
+        tracing::info!(%agent, "agent needs update");
+        coord.notify_manager(&hive_sh4re::HelperEvent::NeedsUpdate {
+            agent: (*agent).to_owned(),
+        });
+    }
+}
+
+/// Mirrors `dashboard::claude_has_session`. Lives here too so the
+/// watcher doesn't depend on dashboard internals.
+fn claude_has_session(dir: &Path) -> bool {
+    let Ok(entries) = std::fs::read_dir(dir) else {
+        return false;
+    };
+    entries
+        .flatten()
+        .any(|e| e.file_type().is_ok_and(|t| t.is_file()))
+}
diff --git a/hive-c0re/src/manager_server.rs b/hive-c0re/src/manager_server.rs
index 4050641..3c9cab2 100644
--- a/hive-c0re/src/manager_server.rs
+++ b/hive-c0re/src/manager_server.rs
@@ -204,6 +204,27 @@ async fn dispatch(req: &ManagerRequest, coord: &Arc) -> ManagerResp
                 },
             }
         }
+        ManagerRequest::Update { name } => {
+            tracing::info!(%name, "manager: update");
+            let Some(current_rev) = crate::auto_update::current_flake_rev(&coord.hyperhive_flake)
+            else {
+                return ManagerResponse::Err {
+                    message: "update: hyperhive_flake has no canonical path".into(),
+                };
+            };
+            coord.set_transient(name, crate::coordinator::TransientKind::Rebuilding);
+            let result = crate::auto_update::rebuild_agent(coord, name, ¤t_rev).await;
+            coord.clear_transient(name);
+            match result {
+                Ok(()) => {
+                    coord.kick_agent(name, "container rebuilt");
+                    ManagerResponse::Ok
+                }
+                Err(e) => ManagerResponse::Err {
+                    message: format!("{e:#}"),
+                },
+            }
+        }
         ManagerRequest::AskOperator {
             question,
             options,
diff --git a/hive-sh4re/src/lib.rs b/hive-sh4re/src/lib.rs
index e8558d9..7a96322 100644
--- a/hive-sh4re/src/lib.rs
+++ b/hive-sh4re/src/lib.rs
@@ -259,6 +259,20 @@ pub enum HelperEvent {
     /// A sub-agent's container was torn down (container removed; state
     /// dirs preserved per `destroy` semantics).
     Destroyed { agent: String },
+    /// A sub-agent's container has no claude session yet (first
+    /// spawn, or `--purge` wiped creds). Manager can't do anything
+    /// about it directly — login is interactive OAuth — but it
+    /// surfaces so the manager knows the agent is in partial-run
+    /// mode and can flag the operator.
+    NeedsLogin { agent: String },
+    /// An agent successfully completed claude login — the session
+    /// dir now contains creds. Transition fires once per login.
+    LoggedIn { agent: String },
+    /// An agent's recorded flake rev is stale relative to the
+    /// current hyperhive rev. The manager has the `update` tool to
+    /// trigger a rebuild without operator approval (it's a no-op
+    /// when nothing actually changed).
+    NeedsUpdate { agent: String },
     /// Container exited without an operator-initiated stop. Fired by
     /// the crash watcher when an agent's container transitions from
     /// running → stopped and no `Stopping` / `Restarting` /
@@ -326,6 +340,12 @@ pub enum ManagerRequest {
     Restart {
         name: String,
     },
+    /// Rebuild a sub-agent: re-applies the current hyperhive flake +
+    /// agent.nix, restarts the container. No approval required —
+    /// it's idempotent and the manager owns its own update cadence.
+    Update {
+        name: String,
+    },
     /// Submit a config commit for the user to approve. `commit_ref` is opaque
     /// to the host (typically a git sha pointing into the agent's config repo).
     /// On approval the host applies the change via `nixos-container update`.