suppress crash_watch during background rebuilds + meta repoint

crash_watch fires ContainerCrash whenever it sees a previously-running
container in a non-running state without a transient flag set. dashboard
rebuilds already set Rebuilding via lifecycle_action; the two other
rebuild paths didn't:

- migrate::repoint_container: phase 4 walks every container, and each
  nixos-container update activation briefly takes the systemd unit down.
  previously this fired ContainerCrash for every agent during the
  migration; the manager would then spuriously call start() on agents
  that were already coming back up.
- auto_update::rebuild_agent: the startup scan + admin-socket caller
  bypass lifecycle_action.

both paths now set the Rebuilding transient around the rebuild + clear it
after. matches what dashboard does.
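
For illustration, a minimal sketch of the rule described above: a crash is reported only when a previously running container is seen not running and no transient flag is set. The simplified `Coordinator`, its mutex-guarded map, and the `should_report_crash` and `demo` helpers are assumptions made for this example, not the actual hive-c0re types; only the names `set_transient`, `clear_transient`, and `TransientKind::Rebuilding` come from the diff.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Transient states that tell the crash watcher to skip a container.
/// Only `Rebuilding` appears in the diff; the enum shape is assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TransientKind {
    Rebuilding,
}

/// Hypothetical, simplified stand-in for the real coordinator.
#[derive(Default)]
pub struct Coordinator {
    transients: Mutex<HashMap<String, TransientKind>>,
}

impl Coordinator {
    pub fn set_transient(&self, name: &str, kind: TransientKind) {
        self.transients.lock().unwrap().insert(name.to_string(), kind);
    }

    pub fn clear_transient(&self, name: &str) {
        self.transients.lock().unwrap().remove(name);
    }

    /// Hypothetical helper: the decision the crash watcher is described
    /// as making. Report only if the container was running, is no longer
    /// running, and has no transient flag set.
    pub fn should_report_crash(&self, name: &str, was_running: bool, is_running: bool) -> bool {
        let flagged = self.transients.lock().unwrap().contains_key(name);
        was_running && !is_running && !flagged
    }
}

/// Walks through the rebuild window: the first value is the decision while
/// Rebuilding is set, the second is the decision after it is cleared.
pub fn demo() -> (bool, bool) {
    let coord = Coordinator::default();
    coord.set_transient("agent-a", TransientKind::Rebuilding);
    let during_rebuild = coord.should_report_crash("agent-a", true, false);
    coord.clear_transient("agent-a");
    let after_rebuild = coord.should_report_crash("agent-a", true, false);
    (during_rebuild, after_rebuild)
}
```

In this sketch the flag is cleared unconditionally after the rebuild, whether or not it succeeded, which mirrors the set / rebuild / clear order used in the migrate.rs hunk.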
2026-05-16 01:12:48 +02:00 · 2026-05-16 01:12:48 +02:00 · d202f3785c
commit d202f3785c
parent 63e8a98df2
2 changed files with 15 additions and 1 deletions
--- a/hive-c0re/src/migrate.rs
+++ b/hive-c0re/src/migrate.rs
@@ -74,7 +74,15 @@ pub async fn run(coord: &Arc<Coordinator>) -> Result<()> {
    }
    let mut all_ok = true;
    for name in &names {
-        if let Err(e) = repoint_container(name).await {
+        // Mark Rebuilding so the crash watcher skips this container
+        // during the brief stop+start window the nixos-container
+        // update activation triggers. Without this, crash_watch
+        // would fire ContainerCrash for every agent here and the
+        // manager would spuriously try to recover them.
+        coord.set_transient(name, crate::coordinator::TransientKind::Rebuilding);
+        let result = repoint_container(name).await;
+        coord.clear_transient(name);
+        if let Err(e) = result {
            tracing::warn!(%name, error = ?e, "migration: container repoint failed");
            all_ok = false;
        }