dashboard: surface silent unwrap_or_default in api_state

every snapshot source backing /api/state used .unwrap_or_default():
sqlite errors, broker errors, nixos-container list failures, and
operator_questions decode failures all degraded to empty lists
without a log line. the 'pending question doesn't render'
bug we've been chasing was likely a row-decode error in
OperatorQuestions::pending() being swallowed this way.

new log_default(what, result) replaces each call site: it returns
the same default value on Err, but first emits a warn with
target=api_state carrying the source name and the Debug-formatted
error. five sources covered:
nixos-container list, approvals.pending,
approvals.recent_resolved, broker.recent_for(operator),
questions.pending. next time the question goes missing the
journal will say which source failed and how.
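
a minimal std-only sketch of the failure mode and the fix, for
illustration. decode_row and pending are hypothetical stand-ins (not
the real OperatorQuestions code), and eprintln! stands in for
tracing::warn! so the snippet has no dependencies:

```rust
use std::fmt::Debug;

/// Hypothetical row decoder: anything that is not a u32 fails to decode.
fn decode_row(raw: &str) -> Result<u32, String> {
    raw.parse::<u32>()
        .map_err(|e| format!("bad row {raw:?}: {e}"))
}

/// Hypothetical snapshot source. Collecting into Result<Vec<_>, _> stops at
/// the first Err, so one bad row makes the whole query return Err.
fn pending(rows: &[&str]) -> Result<Vec<u32>, String> {
    rows.iter().map(|r| decode_row(r)).collect()
}

/// Same shape as the commit's log_default, with eprintln! as an assumed
/// stand-in for tracing::warn!(target: "api_state", ...).
fn log_default<T: Default, E: Debug>(what: &str, result: Result<T, E>) -> T {
    match result {
        Ok(v) => v,
        Err(e) => {
            eprintln!("WARN api_state source={what} error={e:?} snapshot source failed; using default");
            T::default()
        }
    }
}
```

with rows ["1", "x", "3"], pending(..).unwrap_or_default() and
log_default("questions.pending", pending(..)) both return an empty
Vec; only the second leaves a line on stderr naming the source and
the error.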

todo updated — pending-question entry now points at the new
log instead of three suspect paths.
müde 2026-05-16 03:49:49 +02:00
parent 74ba8a63e1
commit 40938d8b54
2 changed files with 45 additions and 29 deletions

TODO.md

@@ -57,19 +57,15 @@ Pick anything from here when relevant. Cross-cutting design notes live in
 Repro: manager calls `ask_operator`, tool result is
 `question queued (id=N)` (so the row is in sqlite), but the
 M1ND H4S QU3STI0NS section keeps showing "no pending
-questions". Last seen with id=5. Suspected paths:
-- `OperatorQuestions::pending()` returns Err and the
-`unwrap_or_default()` in `api_state` hides it. Surface the
-error (warn-log) and check.
-- serialization: a new field in `OpQuestion` (e.g.
-`deadline_at: Option<i64>`) deserializes wrong against an
-old row whose columns don't match the new SELECT order →
-`row.get(N)?` panics for that row, the whole iterator
-errors, `pending()` returns Err. Diagnose by curl
-`/api/state | jq '.questions'` and compare with sqlite
-counts.
-- dashboard JS swallows a render error. Open browser console
-and look for exceptions during `renderQuestions`.
+questions". Last seen with id=5. Diagnostic step landed:
+`api_state` now warn-logs (target=`api_state`) when any of
+its source queries fail instead of silently
+`unwrap_or_default`-ing — next repro should print the
+underlying error in journald and tell us whether this is
+sqlite (likely `OperatorQuestions::pending()` row-decode
+panic on a migrated column) or dashboard-JS-side
+(`renderQuestions` exception). Re-investigate with the new
+log once the bug fires.

 ## UI / UX


@@ -239,6 +239,25 @@ struct ApprovalView {
     diff_html: Option<String>,
 }

+/// Replace silent `.unwrap_or_default()` on the data sources behind
+/// `/api/state` so that whichever query degrades surfaces in journald
+/// instead of leaving the operator staring at an empty list. The
+/// dashboard still degrades to a sensible default value; the warn
+/// is just the diagnostic breadcrumb the old code swallowed.
+fn log_default<T, E>(what: &str, result: std::result::Result<T, E>) -> T
+where
+    T: Default,
+    E: std::fmt::Debug,
+{
+    match result {
+        Ok(v) => v,
+        Err(e) => {
+            tracing::warn!(target: "api_state", source = %what, error = ?e, "snapshot source failed; using default");
+            T::default()
+        }
+    }
+}
+
 async fn api_state(headers: HeaderMap, State(state): State<AppState>) -> axum::Json<StateSnapshot> {
     let host = headers
         .get("host")
@@ -246,35 +265,36 @@ async fn api_state(headers: HeaderMap, State(state): State<AppState>) -> axum::J
         .unwrap_or("localhost");
     let hostname = host.split(':').next().unwrap_or(host).to_owned();
-    let raw_containers = lifecycle::list().await.unwrap_or_default();
+    let raw_containers = log_default("nixos-container list", lifecycle::list().await);
     let current_rev = crate::auto_update::current_flake_rev(&state.coord.hyperhive_flake);
     let transient_snapshot = state.coord.transient_snapshot();
     let pending_approvals = gc_orphans(
         &state.coord,
-        state.coord.approvals.pending().unwrap_or_default(),
+        log_default("approvals.pending", state.coord.approvals.pending()),
     );
     let (containers, any_stale) =
         build_container_views(&raw_containers, current_rev.as_deref(), &transient_snapshot).await;
     let transients = build_transient_views(&raw_containers, &transient_snapshot);
     let approvals = build_approval_views(pending_approvals).await;
-    let approval_history = state
-        .coord
-        .approvals
-        .recent_resolved(30)
-        .unwrap_or_default()
-        .into_iter()
-        .map(history_view)
-        .collect();
+    let approval_history = log_default(
+        "approvals.recent_resolved",
+        state.coord.approvals.recent_resolved(30),
+    )
+    .into_iter()
+    .map(history_view)
+    .collect();
     let tombstones = build_tombstone_views(&state.coord, &containers, &transient_snapshot);
     let port_conflicts = build_port_conflicts(&containers);
-    let operator_inbox = state
-        .coord
-        .broker
-        .recent_for(hive_sh4re::OPERATOR_RECIPIENT, 50)
-        .unwrap_or_default();
-    let questions = state.coord.questions.pending().unwrap_or_default();
+    let operator_inbox = log_default(
+        "broker.recent_for(operator)",
+        state
+            .coord
+            .broker
+            .recent_for(hive_sh4re::OPERATOR_RECIPIENT, 50),
+    );
+    let questions = log_default("questions.pending", state.coord.questions.pending());
     axum::Json(StateSnapshot {
         hostname,