Commit graph

332 commits

Author SHA1 Message Date
müde
68ef6ab433 todo: stream nixos-container output so slow != stuck
surfaced by a real hang investigation today — lifecycle::run
uses .output() which buffers stdout/stderr until exit, so a
multi-minute nix build through nixos-container update looks
identical to a wedged daemon. line-buffered streaming into
tracing (and ideally the per-agent live event bus when the
agent is known) makes 'still building, just slow' visible
without strace gymnastics.
2026-05-16 01:38:02 +02:00
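A minimal sketch of the line-buffered alternative to `.output()`, assuming only stdout needs streaming. `stream_lines` and the `on_line` callback are hypothetical names, not the project's API; in the real harness the callback would presumably forward into tracing or the live event bus.

```rust
use std::io::{self, BufRead, BufReader};
use std::process::{Command, ExitStatus, Stdio};

/// Run `cmd`, handing each stdout line to `on_line` as soon as it arrives,
/// instead of buffering everything until exit like `Command::output()`.
/// (Hypothetical helper; stderr is left untouched in this sketch.)
pub fn stream_lines(
    cmd: &mut Command,
    mut on_line: impl FnMut(&str),
) -> io::Result<ExitStatus> {
    let mut child = cmd.stdout(Stdio::piped()).spawn()?;
    // `stdout` is Some because a pipe was requested above.
    if let Some(stdout) = child.stdout.take() {
        for line in BufReader::new(stdout).lines() {
            on_line(&line?);
        }
    }
    // The pipe hit EOF, so the child is done writing; collect its status.
    child.wait()
}
```

Streaming stderr as well would need a second reader (for example on its own thread) so that neither pipe can fill up and stall the child.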
müde
65bdde898e todo: tag retention, flake.nix tamper-check, sync_agents nix call
three things surfaced by the meta-flake overhaul + the nix CLI
deprecation we just fixed worth tracking explicitly. extend
the web-UI-for-config-repos entry to also cover the /meta
deploy log now that meta's git history is the swarm-wide
audit trail.
2026-05-16 01:21:27 +02:00
müde
d202f3785c suppress crash_watch during background rebuilds + meta repoint
crash_watch fires ContainerCrash whenever it sees a previously-
running container in a non-running state without a transient
flag set. dashboard rebuilds already set Rebuilding via
lifecycle_action; the two other rebuild paths didn't:

- migrate::repoint_container: phase 4 walks every container,
  each nixos-container update activation briefly takes the
  systemd unit down. previously fired ContainerCrash for every
  agent during the migration; manager would then spuriously
  call start() on agents that were already coming back up.
- auto_update::rebuild_agent: startup scan + admin-socket
  caller bypass lifecycle_action.

both paths now set the Rebuilding transient around the rebuild
+ clear after. matches what dashboard does.
2026-05-16 01:12:48 +02:00
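One way to express "set the transient around the rebuild, clear after" is a drop guard, so the flag is cleared even on an early return. This is a sketch under assumptions: `Transients`, `TransientGuard` and `crash_watch_should_fire` are illustrative names, and the real code may well set and clear the flag by hand.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Transient {
    Rebuilding,
}

/// Shared per-agent transient flags (hypothetical shape).
#[derive(Clone, Default)]
pub struct Transients(Arc<Mutex<HashMap<String, Transient>>>);

impl Transients {
    pub fn get(&self, agent: &str) -> Option<Transient> {
        self.0.lock().unwrap().get(agent).copied()
    }

    /// Set the flag now; it is cleared when the returned guard is dropped.
    pub fn hold(&self, agent: &str, t: Transient) -> TransientGuard {
        self.0.lock().unwrap().insert(agent.to_string(), t);
        TransientGuard { map: self.clone(), agent: agent.to_string() }
    }
}

pub struct TransientGuard {
    map: Transients,
    agent: String,
}

impl Drop for TransientGuard {
    fn drop(&mut self) {
        // Ignore a poisoned lock rather than panic inside drop.
        if let Ok(mut m) = self.map.0.lock() {
            m.remove(&self.agent);
        }
    }
}

/// The rule described above: previously running, now not running,
/// and no transient flag set.
pub fn crash_watch_should_fire(
    was_running: bool,
    is_running: bool,
    transient: Option<Transient>,
) -> bool {
    was_running && !is_running && transient.is_none()
}
```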
müde
63e8a98df2 meta: stage before lock, single commit per change
git+file://'s dirty-tree fetcher reads tracked + staged content
from the index (not the working tree, not untracked files). so
staging is enough to make a new flake.nix or flake.lock visible
to nix without committing first. sync_agents now stages
flake.nix, runs lock, stages the resulting flake.lock, then commits
both together in a single 'regenerate meta flake' (or 'seed
meta from N agents') commit — no more two-commit churn.

prepare_deploy applies the same trick to the two-phase deploy:
runs nix flake update, stages flake.lock so nixos-container
update sees it, doesn't commit yet. finalize_deploy commits
with the deployed/<id> message on build success; abort_deploy
git-restores the staged lock back to HEAD on failure. meta
history continues to record only successful deploys (and now
one commit per success instead of one + amend).
2026-05-16 01:02:47 +02:00
müde
220e9b4af6 meta: commit before lock — git+file:// only sees tracked files
runtime error on first deploy attempt: 'source tree referenced
by git+file:///var/lib/hyperhive/meta does not contain
/flake.nix'. cause: sync_agents wrote flake.nix then ran
'nix flake lock' against a directory nix had just discovered
as a git repo (auto-upgraded to git+file://), which only sees
TRACKED content. fresh flake.nix was untracked, so nix saw an
empty source tree.

fix: commit flake.nix before locking. sync_agents now does
write → init (if first) → git add + commit → nix flake lock
→ commit lock if changed. two commits per change — one
'regenerate meta flake' and one 'lock update' — instead of
one combined; cleaner history.

same git+file:// gotcha bit the two-phase deploy:
prepare_deploy used to write the lock without committing, expecting
nixos-container update to read the working tree. it doesn't —
it reads the tracked commit. prepare_deploy now commits with
a placeholder 'deploy <n> (building)' message; finalize_deploy
amends to 'deploy <n> deployed/<id> <sha12>' on success;
abort_deploy git-reset --hard HEAD~1's it on failure. meta
history still records only successful deploys.
2026-05-16 00:59:35 +02:00
müde
d94712bde8 turn: unify run_turn / compact_session via TurnFiles
new TurnFiles bundle (mcp_config + settings + system_prompt +
flavor) materialised once per harness boot, passed to drive_turn
and compact_session alike. operator-initiated /compact now uses
the exact same session shape as a normal turn — same MCP
surface, same allowed tools, same role prompt — only the stdin
payload differs (/compact vs the wake-up body). web_ui's
AppState carries the TurnFiles instead of (label + socket +
flavor + ad-hoc file writes per click). bin/hive-ag3nt and
bin/hive-m1nd prepare TurnFiles before spawning the web UI and
pass them to both surfaces. web_ui::Flavor folds into a type
alias for mcp::Flavor — no two-stage enum mapping.

removes ClaudeMode + the run_claude variant fork (system prompt
was Option, mcp args were skipped on Compact). dead 'mode'
plumbing gone.
2026-05-16 00:57:58 +02:00
müde
87c7b05b05 meta: use 'nix flake update <input>' instead of removed --update-input
current nix CLI removed 'nix flake lock --update-input X' in
favour of 'nix flake update X'. switch all three call sites
(prepare_deploy, lock_update_for_rebuild,
lock_update_hyperhive). 'nix flake lock' with no args still
works for the seed path in sync_agents — it resolves missing
inputs without bumping existing ones.
2026-05-16 00:49:22 +02:00
müde
02139efbb1 turn: spawn claude with cwd = /state
every claude invocation now runs with current_dir set to
/state — relative paths in tool calls (Read notes.md, Bash ls,
Write blob) land in the agent's durable bind-mounted dir
instead of the harness's systemd cwd. /state is RW + survives
destroy/recreate so this matches where the agent is told to
keep notes anyway. dev/test setups without the bind silently
fall back to the parent cwd.
2026-05-16 00:46:19 +02:00
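The silent fallback could look roughly like this. `apply_turn_cwd` is a hypothetical helper name; the only assumption taken from the message is "use /state when the bind exists, otherwise keep the parent cwd".

```rust
use std::path::Path;
use std::process::Command;

/// Point `cmd` at `state_dir` when it exists; otherwise leave the command
/// on the inherited (parent) working directory. Returns whether the
/// directory was applied. (Illustrative helper, not the project's API.)
pub fn apply_turn_cwd(cmd: &mut Command, state_dir: &Path) -> bool {
    if state_dir.is_dir() {
        cmd.current_dir(state_dir);
        true
    } else {
        false
    }
}
```

A caller would pass `Path::new("/state")`; dev and test setups without the bind simply get `false` back and run where the harness runs.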
müde
034b4fde10 force fresh session: ↻ new session button + /new-session
bus carries a one-shot AtomicBool armed by POST
/api/new-session (or the /new-session slash command). next
turn drops --continue, starting a fresh claude session; the
flag clears automatically so subsequent turns resume normal
behavior. /compact still always uses --continue — compacting
a non-existent session is a no-op anyway.

per-agent page grows an ↻ new session button next to the
cancel-turn one (always visible, amber, confirms before
posting since dropping --continue context isn't reversible).
slash-command surface picks up /new-session for parity with
the button. note row emitted on the live feed both at arm-
time and again when the turn actually consumes the flag, so
the operator can confirm it landed.
2026-05-16 00:44:45 +02:00
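The one-shot behaviour falls out of `AtomicBool::swap`: reading the flag and clearing it are a single atomic step. A small sketch, with `NewSessionFlag` and `session_args` as made-up names; whether a /compact turn leaves the flag armed (as it does here) is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One-shot "start a fresh session" request (hypothetical wrapper).
#[derive(Default)]
pub struct NewSessionFlag(AtomicBool);

impl NewSessionFlag {
    /// Armed by the button or the slash command.
    pub fn arm(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    /// Returns true at most once per arm(): swap reads and clears atomically.
    pub fn take(&self) -> bool {
        self.0.swap(false, Ordering::SeqCst)
    }
}

/// Session-related CLI args for the next turn. Compaction always continues;
/// a normal turn drops --continue exactly once after the flag was armed.
pub fn session_args(flag: &NewSessionFlag, compact: bool) -> Vec<&'static str> {
    if compact || !flag.take() {
        vec!["--continue"]
    } else {
        vec![]
    }
}
```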
müde
14aa7c7acc final docs + cleanup sync for meta-flake era
claude.md flips 'in flight' → 'just landed' for the meta
overhaul + extends the file map with meta.rs and migrate.rs.
docs/approvals.md replaces the in-flight callout with a
proper 'Meta flake' section (two-phase deploy walkthrough,
sync_agents semantics, single-phase variants), updates the
two-repo box diagram to include the /var/lib/hyperhive/meta/
tree and tracks flake.nix in applied, rewrites the
container --flake reference to meta#<name>, replaces the
'Manager view of applied' section with a unified
'/agents + /applied + /meta' inventory listing every useful
git incantation, and explains the in-place no-state-loss
migration that now runs on hive-c0re startup.
docs/persistence.md grows entries for the meta repo + the
.meta-migration-done marker. readme box diagram picks up the
/meta RO bind; approval-flow paragraph rewritten end to end
to describe the meta lock dance.

lifecycle::flake_base deleted — the meta render hardcodes
the manager vs agent-base choice as nix expression.
2026-05-16 00:40:06 +02:00
müde
2f6ecc4dc0 dashboard: deployed sha chip per container
ContainerView grows deployed_sha (first 12 chars of the rev
that /var/lib/hyperhive/meta/flake.lock currently has locked
for agent-<name>). renderContainers appends a 'deployed:<sha12>'
chip next to the container name + port — title attribute
explains it's the meta-lock sha. degrades gracefully when the
meta repo isn't seeded yet (missing / unparsable lock = empty
map = no chip). new read_meta_locked_revs helper does the JSON
parsing without unwraps.
2026-05-16 00:36:52 +02:00
müde
691057d2d3 manager prompt: meta-flake era
agent.nix becomes a plain NixOS module function — flake.nix is
fixed boilerplate the manager mustn't edit; meta flake at /meta/
owns the wrapper. proposed repos ship with an 'applied' remote
pre-wired, so 'git fetch applied' / 'git log applied/main' /
'git show applied/refs/tags/deployed/<id>' all just work without
constructing paths by hand. /meta/ exposes the system-wide
deploy log (git log /meta) + flake.lock for cross-agent sha
introspection.
2026-05-16 00:35:30 +02:00
müde
59a89314f0 startup auto-migration from pre-meta layout
new migrate module runs before auto_update on hive-c0re boot.
four idempotent phases:

1. for every applied/<n>/ whose flake.nix isn't already the
   module-only boilerplate, rewrite + commit + relocate
   deployed/0 to HEAD so setup_applied's existence check passes
2. for every proposed/<n>/config without an 'applied' remote,
   wire it (delegates to setup_proposed which is now
   idempotent and adds the remote itself)
3. meta::sync_agents over the current container list — inits
   the meta repo on first call, rerender + relock if drifted
4. nixos-container update <c> --flake meta#<name> for every
   container, guarded by /var/lib/hyperhive/.meta-migration-done
   so phase 4's expensive eval only runs once across restarts

env kill-switch HIVE_SKIP_META_MIGRATION=1 defers the whole
thing. each agent's failure is logged + skipped so one broken
agent doesn't block the rest. runs ahead of ensure_manager so
the manager auto-spawn comes up against meta from the first
attempt.
2026-05-16 00:34:58 +02:00
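The gating described above (env kill-switch for everything, marker file for phase 4 only) can be sketched as a pure decision function. `Plan` and `plan` are illustrative names, and treating only the literal value "1" as the kill-switch is an assumption.

```rust
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    /// HIVE_SKIP_META_MIGRATION=1: defer the whole migration.
    SkipAll,
    /// Marker present: the cheap idempotent phases run, phase 4 is skipped.
    PhasesOneToThree,
    /// No marker yet: run everything, then plant the marker.
    AllPhases,
}

/// Decide what the startup migration should do. `skip_env` is the value of
/// the kill-switch variable, if set; `marker` is the done-marker path.
pub fn plan(skip_env: Option<&str>, marker: &Path) -> Plan {
    if skip_env == Some("1") {
        return Plan::SkipAll;
    }
    if marker.exists() {
        Plan::PhasesOneToThree
    } else {
        Plan::AllPhases
    }
}
```

A caller would feed it `std::env::var("HIVE_SKIP_META_MIGRATION").ok().as_deref()` and the `/var/lib/hyperhive/.meta-migration-done` path.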
müde
87016cd567 auto_update: bump meta hyperhive input before per-agent rebuilds
auto_update::run now calls meta::lock_update_hyperhive once
up-front so the per-agent rebuilds it kicks off rebuild against
the new base. lifecycle::rebuild already drives sync_agents +
lock_update_for_rebuild per agent, so the rev-marker shortcut
keeps its meaning ('we've ack'd this rev for this agent')
without further plumbing. failures of the hyperhive lock bump
log + continue — individual rebuilds will surface concrete
errors if anything's really wrong.
2026-05-16 00:32:55 +02:00
müde
06fdbac1ac actions::run_apply_commit through meta two-phase
approval-driven deploys now walk the meta flake via
prepare_deploy / finalize_deploy / abort_deploy so a failed
build leaves no commit in meta's deploy log:

1. capture applied/main sha for rollback
2. tag approved/<id> + building/<id>
3. ff applied/main to proposal/<id>, read-tree sync working tree
4. meta::prepare_deploy(name) — nix flake lock --update-input
   agent-<n> without committing
5. lifecycle::rebuild_no_meta — container-level only (new
   extracted helper; public lifecycle::rebuild still wraps it
   with single-phase meta sync + commit for dashboard /
   auto_update callers that don't care about rollback)
6a. on success: tag deployed/<id>, meta::finalize_deploy commits
    the staged lock with 'deploy <n> deployed/<id> <sha12>'
6b. on failure: tag failed/<id> annotated with the build error,
    git_update_ref applied/main back to prev sha, read-tree to
    main, meta::abort_deploy git-restores flake.lock

meta's git log now records only successful deploys; failures
+ denials still live in applied as annotated tags.
2026-05-16 00:32:16 +02:00
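Stripped of the git and tag details, steps 4 to 6 are a prepare / build / finalize-or-abort shape. The sketch below models just that control flow; the `Meta` trait, `deploy` function and `RecordingMeta` test double are hypothetical, and only the commit message format is taken from the text above.

```rust
/// The three meta operations the two-phase deploy relies on (assumed shape).
pub trait Meta {
    fn prepare(&mut self) -> Result<(), String>;
    fn finalize(&mut self, message: &str) -> Result<(), String>;
    fn abort(&mut self) -> Result<(), String>;
}

/// Stage the lock, build, then either commit or restore. A failed build
/// therefore never leaves a commit behind in the deploy log.
pub fn deploy<M: Meta>(
    meta: &mut M,
    name: &str,
    id: u64,
    sha12: &str,
    build: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    meta.prepare()?;
    match build() {
        Ok(()) => meta.finalize(&format!("deploy {name} deployed/{id} {sha12}")),
        Err(e) => {
            meta.abort()?;
            Err(e)
        }
    }
}

/// Test double that records which operations ran, in order.
#[derive(Default)]
pub struct RecordingMeta {
    pub log: Vec<String>,
}

impl Meta for RecordingMeta {
    fn prepare(&mut self) -> Result<(), String> {
        self.log.push("prepare".to_string());
        Ok(())
    }
    fn finalize(&mut self, message: &str) -> Result<(), String> {
        self.log.push(format!("commit: {message}"));
        Ok(())
    }
    fn abort(&mut self) -> Result<(), String> {
        self.log.push("abort".to_string());
        Ok(())
    }
}
```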
müde
22f35def8f actions::destroy syncs meta after lifecycle
once nixos-container destroy lands + per-agent state cleanup is
done, rerender the meta flake from the remaining containers so
the destroyed agent's input + nixosConfiguration drop off and
its flake.lock entry vanishes. log + keep going on meta-sync
failure — the destroy already succeeded at the lifecycle level,
so meta drift here is just bookkeeping. new public
lifecycle::agents_for_meta_listing exposes the agent
enumeration for callers outside the module.
2026-05-16 00:29:26 +02:00
müde
4cb529351e lifecycle::rebuild through meta
rebuild now does sync_agents (idempotent — no-op when the
rendered flake matches disk; recovers from a divergent meta
repo on the side) followed by lock_update_for_rebuild which
relocks just this agent's input and commits the lock change
if any. flake ref for nixos-container update flips from
applied/<n>#default to meta#<name>. new helper
meta::lock_update_for_rebuild is single-phase (no separate
finalize): rebuild has no failure-revert semantics — it always
wants the latest applied/<n>/main. spawn already syncs meta
before container create; rebuild now picks up the meta side
on every manual ↻ R3BU1LD.
2026-05-16 00:28:26 +02:00
müde
8f94e4379a lifecycle::spawn through meta
after setup_proposed + setup_applied, spawn now syncs the meta
flake (one input + one nixosConfiguration per agent) so
`--flake /var/lib/hyperhive/meta#<name>` resolves before
nixos-container create runs. flake ref switches from
applied/<n>#default to meta#<name>; the wrapper modules
(identity, HIVE_PORT, HIVE_LABEL, HIVE_DASHBOARD_PORT) now
live in the meta flake's mkAgent. new helper agents_for_meta
builds the AgentSpec list by enumerating containers + optionally
appending a not-yet-present name for the spawn case. spawn
keeps its caller signature; rebuild + auto_update get wired up
in follow-up commits.
2026-05-16 00:27:12 +02:00
müde
c42ad1330c lifecycle: pre-wire applied remote in proposed
setup_proposed now lands a git remote named 'applied' on every
proposed/<n>/config pointing at /applied/<n>/.git — the path as
seen from inside the manager container, where the RO bind in
set_nspawn_flags makes the URL resolve. From the manager:

  git fetch applied
  git log applied/main
  git show applied/refs/tags/deployed/<id>
  git diff applied/main HEAD
  git rebase applied/main

all work without manually constructing the path each time. The
RO bind blocks push at the kernel level so the remote can only
fetch. Idempotent — also applied to pre-existing proposed repos
(no-op if the remote is already correct, set-url if drifted)
so the startup migration picks up the wiring on existing
agents.
2026-05-16 00:25:43 +02:00
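The idempotence rule (add if missing, no-op if correct, set-url if drifted) reduces to a three-way match on the remote's current URL. A sketch with hypothetical names; the caller would map the result onto `git remote add` or `git remote set-url`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RemoteAction {
    /// No 'applied' remote yet: add it.
    Add,
    /// Remote exists but points elsewhere: set-url.
    SetUrl,
    /// Remote already correct: nothing to do.
    Nothing,
}

/// `current` is the remote's present URL, or None when it does not exist.
pub fn plan_remote(current: Option<&str>, desired: &str) -> RemoteAction {
    match current {
        None => RemoteAction::Add,
        Some(url) if url == desired => RemoteAction::Nothing,
        Some(_) => RemoteAction::SetUrl,
    }
}
```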
müde
3d14ddeb7d lifecycle: bind /meta RO into manager
set_nspawn_flags now adds a third manager-only bind alongside
/agents (RW) and /applied (RO):
--bind-ro=/var/lib/hyperhive/meta:/meta. manager can git log /meta
to see every deploy across the
swarm and cat /meta/flake.lock to introspect which sha each agent
is currently pinned at. defensive create_dir_all on the host
side so a cold start with no agents (meta repo not yet seeded)
doesn't trip systemd-nspawn's missing-bind-source check before
the migration plants the dir.
2026-05-16 00:24:39 +02:00
müde
92822efe16 meta: new hive-c0re module owns /var/lib/hyperhive/meta/
leaf module with no runtime callers yet (every public item is
#[allow(dead_code)] until lifecycle / actions / auto_update
rewire to use it). API surface:

- sync_agents — idempotent: render flake.nix for the given
  agent set, git-init on first call, nix flake lock, commit if
  anything changed.
- prepare_deploy / finalize_deploy / abort_deploy — two-phase
  for the request_apply_commit path. prepare runs nix flake
  lock --update-input agent-<n> without committing; finalize
  commits with a 'deploy <n> deployed/<id> <sha12>' message;
  abort git-restores the lock so a failed build leaves no
  orphan commit.
- lock_update_hyperhive — one-shot for the auto-update path.

flake.nix template defines mkAgent that pulls each agent's
nixosModules.default from its input and wraps with the
identity / HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT
module — what setup_applied used to generate inline. nix
invocations carry --extra-experimental-features as a belt
in case flakes aren't enabled in nix.conf.
2026-05-16 00:22:37 +02:00
müde
5b5a93e0c6 lifecycle: module-only agent flake.nix, tracked in proposed
setup_proposed now seeds both agent.nix (a regular NixOS module
function) and flake.nix (boilerplate exporting nixosModules.default
= import ./agent.nix) into the manager-editable proposed repo,
committed together. setup_applied's hyperhive_flake + dashboard
port wrapper generation is deleted entirely — the meta flake at
/var/lib/hyperhive/meta/ now owns the wrapper module.
setup_applied just fetches proposed's main on first spawn and tags
deployed/0; subsequent rebuilds touch nothing in applied that
the manager didn't author. spawn + rebuild keep their old param
list with the now-unused hyperhive_flake + dashboard_port
underscored — call sites get cleaned up after the meta module
lands and consumes them.
2026-05-16 00:10:06 +02:00
müde
a1cfb60fd0 docs: pre-load meta-flake design
scratchpad in claude.md and an in-flight callout at the top of
docs/approvals.md describe the upcoming overhaul so subsequent
commits can cite the design. covers: module-only agent flake
shape, /var/lib/hyperhive/meta/ as a hive-c0re-owned single
repo, applied remote pre-wired in proposed for manager git
plumbing, /meta RO bind for the system-wide deploy log,
auto-migration on hive-c0re startup with HIVE_SKIP_META_MIGRATION
kill-switch.
2026-05-16 00:06:42 +02:00
müde
e26143a412 dashboard: diff against applied/proposal/<id>, prefer fetched_sha
approval_diff now runs git diff
refs/heads/main..refs/tags/proposal/<id>
against the applied repo instead of cobbling a
single-file diff from proposed. consequences: multi-file
proposals show every change, manager amendments in proposed
cannot lie about what'll be deployed, no-op proposals render
an explicit '(proposal matches currently-deployed tree)'.
displayed sha prefers fetched_sha (hive-c0re-vouched) and
falls back to commit_ref only for the brief pre-fetch window.
unified_diff helper + similar dep dropped — git diff is the
source of truth now. dead-code allows on the lifecycle git
helpers + approvals.set_fetched_sha come off since all are
wired up. readme picks up the tag flow + /applied RO mount.
2026-05-15 23:18:17 +02:00
müde
fc61cb9310 fmt: clippy doc_markdown backticks
2026-05-15 23:11:10 +02:00
müde
edb0108ae7 docs+prompt: tag-driven flow + /applied RO mount
manager prompt: explain that arbitrary files now travel with
the proposal, document the /applied/<n>/.git RO mount and the
tag scheme (git show applied/deployed/<id> etc.), call out
that applied/main only advances on deployed so a failed build
isn't terminal. approvals.md: drop the old per-agent
applied.git phrasing in favour of the single /applied RO
bind, mention both manager binds together. claude.md
scratchpad flips from in-flight to just-landed.
2026-05-15 23:03:48 +02:00
müde
4a8204f035 lifecycle: bind /applied into manager read-only
set_nspawn_flags now adds
--bind-ro=/var/lib/hyperhive/applied:/applied
for the manager container alongside the existing
/agents RW mount. manager can git-fetch deployed/failed/denied
tags out of /applied/<n>/.git to mirror them into its proposed
clones; the read-only bind means git plumbing inside the
container cannot corrupt the authoritative repos. picked up by
the next rebuild of hm1nd (no spawn-time change needed since
set_nspawn_flags runs on every spawn + rebuild).
2026-05-15 23:02:31 +02:00
müde
6cf66e23dc actions: deny plants annotated denied/<id> tag
apply-commit denials now leave a git object behind: tag
denied/<id> annotated with the operator's note (or empty body
if they didn't supply one) at proposal/<id> inside the applied
repo. rejected configs become first-class git history — git
show denied/<id> in the manager's applied.git mount yields the
tree the operator rejected plus the reason. helper event
carries the tag for parity with deployed/failed. spawn denials
fall through unannotated since they have no proposal commit.
deny becomes async (single git plumbing call); dashboard +
admin-socket callers grow .await.
2026-05-15 23:01:22 +02:00
müde
df9da4d6e1 todo: recv default should not sleep, agent opts into wait
2026-05-15 23:00:25 +02:00
müde
315d4289c7 actions: tag-driven approve(ApplyCommit) flow
run_apply_commit walks the approval through the tag state
machine in applied: approved/<id> + building/<id> stamped
before the build, then git read-tree --reset to proposal/<id>
populates the working dir without moving HEAD. on rebuild
success deployed/<id> is planted and refs/heads/main fast-
forwards to the proposal. on failure failed/<id> is annotated
with the build error and the working tree resets back to main
so the agent stays evaluable. helper events Rebuilt +
ApprovalResolved both carry the terminal tag so the manager
can git-show the exact tree (and read the failure note from
an annotated tag) against its read-only applied.git mount.
finish_approval grows a terminal_tag param; spawn path passes
None. lifecycle::apply_commit deleted.
2026-05-15 23:00:01 +02:00
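Read as a state machine, the tag flow has two terminal success/failure states plus denial. The sketch below is one reading of the messages in this log, not the project's types: `Stage` and its methods are invented for illustration, and the allowed transitions are inferred from "approved + building before the build, deployed on success, failed on failure, denied at proposal".

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Stage {
    Proposal,
    Approved,
    Building,
    Deployed,
    Failed,
    Denied,
}

impl Stage {
    fn prefix(self) -> &'static str {
        match self {
            Stage::Proposal => "proposal",
            Stage::Approved => "approved",
            Stage::Building => "building",
            Stage::Deployed => "deployed",
            Stage::Failed => "failed",
            Stage::Denied => "denied",
        }
    }

    /// Tag name for this stage, e.g. `deployed/7`.
    pub fn tag(self, id: u64) -> String {
        format!("{}/{id}", self.prefix())
    }

    /// Terminal tags are the ones reported back to the manager.
    pub fn is_terminal(self) -> bool {
        matches!(self, Stage::Deployed | Stage::Failed | Stage::Denied)
    }

    /// Transitions inferred from the commit messages (an assumption).
    pub fn can_move_to(self, next: Stage) -> bool {
        matches!(
            (self, next),
            (Stage::Proposal, Stage::Approved)
                | (Stage::Proposal, Stage::Denied)
                | (Stage::Approved, Stage::Building)
                | (Stage::Building, Stage::Deployed)
                | (Stage::Building, Stage::Failed)
        )
    }
}
```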
müde
35b0edaf27 manager_server: fetch+tag at request_apply_commit submit
submit_apply_commit (1) queues the approval row, (2) git-fetches
the manager-supplied sha from proposed into applied, pins it as
refs/tags/proposal/<id>, (3) persists the resolved sha on the
row via approvals.set_fetched_sha. from this point on the
proposal is immutable from the manager's perspective: amends
or force-pushes in proposed do not change what hive-c0re will
build. fetch failures mark the row failed and surface the error
to the manager so a phantom pending entry can't linger.
2026-05-15 22:57:43 +02:00
müde
8cb8fcedad lifecycle: setup_applied seeds via fetch + tags deployed/0
new shape: applied is git-init'd at first spawn, fetches
proposed's initial commit into its main, tags deployed/0 there.
the wrapper flake.nix is regenerated on every spawn/rebuild
but no longer tracked — apply churn vanishes, manager-authored
files in the proposal flow now survive untouched. setup_applied
gains an Option<&Path> for proposed (None on rebuild paths
that just refresh the flake). pre-overhaul applied dirs are
detected via the missing deployed/0 tag and bail loudly with
the destroy --purge migration hint. apply_commit is stubbed
with a clear error until the tag-driven approve flow lands.
2026-05-15 22:56:58 +02:00
müde
63ef69674b lifecycle: git helpers for tag-driven applied repo
new plumbing for the upcoming flow: git_fetch_to_tag (pulls a
sha from proposed into applied and pins it as a tag in one
shot), git_rev_parse (normalises shas + reads back tag
targets), git_tag / git_tag_annotated (lightweight vs body-
carrying for failed/denied), git_read_tree_reset (replace
working tree without moving HEAD — lets main stay on last
known-good across an in-flight build), git_update_ref (ff main
on deploy). annotated tag bodies go via stdin to avoid escape
games. all dead-code-allowed; callers land in subsequent
commits.
2026-05-15 22:52:23 +02:00
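Sending the annotated tag body over stdin sidesteps shell and argv quoting entirely. A rough sketch: `run_with_stdin` and `annotated_tag_cmd` are hypothetical helpers, and the argument layout assumes `git tag -a -F -`, where `-F -` tells git to read the message from standard input.

```rust
use std::io::{self, Write};
use std::path::Path;
use std::process::{Command, Output, Stdio};

/// Spawn `cmd`, write `body` to its stdin, close stdin, collect the output.
/// Fine for short messages; very large bodies would want a writer thread
/// so the child's output pipe cannot fill up while we are still writing.
pub fn run_with_stdin(cmd: &mut Command, body: &str) -> io::Result<Output> {
    let mut child = cmd
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(body.as_bytes())?;
        // `stdin` is dropped at the end of this block, which signals EOF.
    }
    child.wait_with_output()
}

/// Build (but do not run) an annotated-tag command that reads its message
/// from stdin. Illustrative only; the project's helper may differ.
pub fn annotated_tag_cmd(repo: &Path, tag: &str, target: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("-C").arg(repo).args(["tag", "-a", "-F", "-", tag, target]);
    cmd
}
```

Usage would be along the lines of `run_with_stdin(&mut annotated_tag_cmd(repo, "failed/7", "proposal/7"), &build_error)`.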
müde
b32c3d4f98 approvals: persist fetched_sha alongside the queue
new column fetched_sha records the canonical sha hive-c0re
plans to fetch from the proposed repo into applied at submit
time. distinct from commit_ref (manager-supplied, may be
amended out from under the queue). set_fetched_sha is unused
until manager_server wires the fetch step next commit.
2026-05-15 22:49:04 +02:00
müde
871e7bf3fa wire types: add sha + tag to Approval and HelperEvent
approval grows fetched_sha (canonical hive-c0re-vouched sha,
distinct from manager-supplied commit_ref). helperevent
{approvalresolved,spawned,rebuilt} grow optional sha + tag so
the manager can git-show the exact tree it's hearing about
(against the upcoming /agents/<n>/applied.git RO mount) and
know which terminal tag landed. all serde-defaulted; existing
construction sites pass none until the tag-driven flow lands.
2026-05-15 22:47:39 +02:00
müde
497cd15137 docs: tag-driven config-apply plan + migration story
scratchpad in claude.md marks this as in-flight; docs/approvals.md
gets the new tag state machine (proposal/approved/building/deployed/
failed/denied) and the manager applied.git read-only mount. todo
picks up the unprivileged-containers git-identity caveat and a web
ui for config repos as a downstream follow-up.
2026-05-15 22:43:47 +02:00
müde
75e7faff0c docs: full sync ahead of compaction + config-management overhaul
readme: manager mcp surface picks up update; operator-surface
recap mentions /model + last-turn + model chip + the three
collapsibles (inbox / journald / agent.nix).

web-ui.md: details-restore-key story under shape; port-conflict
banner mention on containers; agent.nix viewer alongside journald;
notifications use per-event tags + console.debug log on
block/show; deny endpoint takes note=<reason>; data-prompt /
data-prompt-field generalisation noted.

conventions.md: data-prompt and snapshot/restoreOpenDetails added
to the async-forms section.

persistence.md: operator_questions row picks up deadline_at (ttl)
column with a migration note.

todo.md: new 'Bugs' section captures the manager-question
not-rendering issue with three suspect paths to chase.

claude.md scratchpad rewritten as a clean handoff for the
compaction + the upcoming config-git overhaul. flags the
two-repo (proposed/ + applied/) split as the thing to
reconsider.
2026-05-15 22:12:40 +02:00
müde
6a2ffd521b surface agent-vs-agent port collisions (manager:8000 can't collide)
manager is fixed at 8000, sub-agents are 8100-8999, so collisions
are strictly between two sub-agents hashing to the same value.
the colliding container's harness restart-loops on AddrInUse —
which the user just hit on :8945. previously the only sign was a
buried journalctl warn line.

now surfaced two ways:

- lifecycle::spawn / rebuild preflight: walks the live container
  list, computes each agent's hashed port, refuses with
  'port N already taken by <other> — rename one of them' if any
  running sub-agent shares the new agent's port. so the operator
  sees an actionable error in the dashboard's transient pill /
  approve-result instead of waiting for the harness to die.

- /api/state grows a port_conflicts: [{port, agents: [...]}]
  array; dashboard renders a pulsing red banner above the
  containers list listing each cluster. matches the questions
  panel pulse so it's hard to miss.
2026-05-15 22:08:19 +02:00
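Computing the `port_conflicts` clusters amounts to grouping agents by port and keeping groups with more than one member. A sketch under assumptions: the function name and the `(name, port)` input shape are invented here, and the real code derives each port from the agent name itself.

```rust
use std::collections::BTreeMap;

/// Group agents by port and return only the ports claimed by two or more
/// agents, sorted by port. Names keep their input order within a cluster.
pub fn port_conflicts(agents: &[(&str, u16)]) -> Vec<(u16, Vec<String>)> {
    let mut by_port: BTreeMap<u16, Vec<String>> = BTreeMap::new();
    for (name, port) in agents {
        by_port.entry(*port).or_default().push((*name).to_string());
    }
    by_port
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .collect()
}
```

The spawn/rebuild preflight is the same check narrowed to one port: refuse when any running agent other than the new one already maps to it.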
müde
2029840671 deny: operator can attach a reason that reaches the manager
clicking DENY on the dashboard now prompts for an optional reason
('reason for denying (optional, sent to manager):'). the value
rides along as a hidden 'note' form field; backend chain:

  POST /deny/{id} { note }
    → actions::deny(coord, id, Some(note))
    → Approvals::mark_denied writes it to the row
    → HelperEvent::ApprovalResolved { ..., note: Some("...") }

manager already had note: Option<String> on the event, just never
populated for denials before. host admin socket (hive-c0re deny)
still passes None.

generalized the prompt-on-submit pattern: any form with a
data-prompt attribute pops a window.prompt() before the POST and
stashes the answer in a hidden input named by data-prompt-field
(default 'note'). reusable for future opt-in note fields.
2026-05-15 21:58:42 +02:00
müde
91c78d626f dashboard: per-container applied agent.nix viewer
new GET /api/agent-config/{name} returns the contents of
/var/lib/hyperhive/applied/<name>/agent.nix — the file the
container actually builds against. validated against the live
container list to avoid arbitrary filesystem reads.

frontend mirrors the journald viewer: collapsed <details> on each
container row, lazy-fetches on expand, refresh button re-fetches.
restore-keyed (agent-config:<name>) so it survives the dashboard
heartbeat refresh.

read-only — mutating the applied config goes through the existing
request_apply_commit + operator approval flow.
2026-05-15 21:46:25 +02:00
müde
80229c6af9 manager: needs_login / logged_in / needs_update events + update tool
crash_watch grows two more state-axes alongside running/stopped:

- logged-in (claude session dir populated for the agent)
- up-to-date (recorded flake rev matches current)

per-tick transitions emit HelperEvent::NeedsLogin / LoggedIn /
NeedsUpdate. seed-on-first-tick semantics retained — nothing fires
on harness boot for agents that were already in their state. only
needs_update fires the 'stale appeared' direction; the resolved
direction is already covered by Rebuilt.

new mcp__hyperhive__update(name) on the manager surface: idempotent
rebuild via auto_update::rebuild_agent. transient-aware (Rebuilding)
so the dashboard shows the spinner. login intentionally has NO tool
— it's interactive OAuth, only the operator can complete it.

prompts + approvals doc + turn-loop doc updated. todo grows a
'show per-agent applied config in dashboard' entry (separate
follow-up).
2026-05-15 21:42:13 +02:00
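Seed-on-first-tick means the first observation of an agent only records state; events come from later changes. A sketch of that rule for the two new axes, with `Observed`, `Event` and `Watch` as stand-in names rather than the project's types:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Observed {
    pub logged_in: bool,
    pub up_to_date: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    NeedsLogin(String),
    LoggedIn(String),
    NeedsUpdate(String),
}

#[derive(Default)]
pub struct Watch {
    last: HashMap<String, Observed>,
}

impl Watch {
    /// Record `now` for `agent` and return the events for this tick.
    /// The first observation seeds the map and emits nothing.
    pub fn tick(&mut self, agent: &str, now: Observed) -> Vec<Event> {
        let mut events = Vec::new();
        if let Some(prev) = self.last.insert(agent.to_string(), now) {
            if prev.logged_in && !now.logged_in {
                events.push(Event::NeedsLogin(agent.to_string()));
            }
            if !prev.logged_in && now.logged_in {
                events.push(Event::LoggedIn(agent.to_string()));
            }
            // Only the "became stale" direction fires; the resolved
            // direction is reported elsewhere (Rebuilt).
            if prev.up_to_date && !now.up_to_date {
                events.push(Event::NeedsUpdate(agent.to_string()));
            }
        }
        events
    }
}
```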
müde
b374f39b0d dashboard: preserve <details open> across refresh via data-restore-key
generalises the focus-preservation pattern to expanded details
sections (journald viewer was collapsing on every 5s refresh; same
issue for approval diff blocks). before re-render we snapshot
which <details data-restore-key=...> are open; after render we
re-apply. setting .open = true programmatically also fires the
toggle event, so journald's lazy-fetch listener re-runs cleanly.

tagged: journal:<container>, approval-diff:<id>. anything else
that should survive a refresh just needs a stable data-restore-key
attribute.
2026-05-15 21:37:17 +02:00
müde
fd0e493bf5 agent terminal: show full body for send tool calls
send was truncating to 80 chars in the tool_use row, hiding
anything past the first sentence. now renders as a collapsed
<details> like Write/Edit — summary still shows the recipient +
headline (so the operator can scan), expanding reveals the full
body unchanged.

recv side was already covered: the wake prompt shows the full
incoming body, and explicit recv() tool_result rows expand to
the full text via the existing collapsed-results path.
2026-05-15 21:35:48 +02:00
müde
3b532753b3 notifications: per-event tags + debug logs
bug: all notifications used tag='hyperhive', so each new fire
replaced the previous — operator only ever saw one at a time and
might miss the fact that a second arrived. now per-event tags
(hyperhive:approval:<id>, hyperhive<id>,
hyperhive:msg:<at>:<rand>) so distinct events stack in the OS
notification center.

dropped the bogus icon (was pointing at dashboard.css) — some
browsers refuse to display a notification with an invalid icon.

added console.debug at every block point (not supported, permission
not granted, muted) and a 'shown' log on success, so the operator
can see in the browser console exactly why a notification didn't
fire.

note for the operator: most browsers also suppress notifications
while the originating tab is FOCUSED. that's a browser-level
decision, not ours.
2026-05-15 21:34:21 +02:00
müde
62d1a74929 docs sync + revert auto-unfree removal
revert the earlier 'operator must set allowUnfree' move:
per-agent containers evaluate their own nixpkgs and the operator's
host-level allowUnfree doesn't propagate in. restoring the scoped
allowUnfreePredicate inside both the claude-unstable overlay and
harness-base.nix; documented in README + gotchas as 'nothing to
set on the operator side'.

docs:
- claude.md file map adds crash_watch.rs, kick_agent on coordinator,
  /api/model + journald viewer + bind-with-retry references.
- scratchpad rewritten to reflect the recent run.
- web-ui.md: notification row + browser notifications section,
  state row (badge + model chip + last-turn chip + cancel button),
  per-agent inbox, /model slash, /cancel-question + journald
  endpoints, focus-preservation on refresh.
- turn-loop.md: --model is read from Bus::model() per turn (runtime
  override via /model); recv(wait_seconds) up to 180s with the
  rationale; ask_operator gains ttl_seconds; new TurnState section;
  kick_agent inbox-on-startup hint.
- approvals.md: ttl/cancel resolution paths for operator questions.
- persistence.md: /state/hyperhive-model file.
- gotchas.md: web UI port collision policy (rename, don't probe);
  bind retry + SO_REUSEADDR shape; auto-unfree restored.
- todo.md: cleaned up empty sections and stale entries; /model
  shipped, dropped from the list.
2026-05-15 21:26:13 +02:00
müde
d275b50177 dashboard: don't yank the form away while operator is typing
every refreshState tick does root.innerHTML = '' across the managed
sections, which destroys any focused input. detect the case before
re-rendering: if document.activeElement is an INPUT / TEXTAREA /
SELECT inside one of the managed sections, skip this tick and try
again in 2s. eventually the operator blurs and the refresh lands.

managed section ids: containers / tombstones / questions / inbox /
approvals. msgflow + message-flow SSE rows don't have inputs so
they're not affected.
2026-05-15 21:19:01 +02:00
müde
acaa0eb895 agent_web_port: back to pure hash, drop port-file dance
operator's call: probing-forward + state-file machinery is more
brittle than the bug it tried to fix. revert to the original
deterministic FNV-1a hash mod 900. collisions are real but rare;
operator resolves by renaming (different name → different hash) and
rebuilding. no per-agent port file, no scan, no migration path,
nothing to drift out of sync with the running container.

existing port files on disk are silently ignored — operator
rebuilds affected agents to regenerate flakes from the deterministic
hash.
2026-05-15 21:17:31 +02:00
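For reference, a deterministic name-to-port mapping of this kind can be sketched as below. The 32-bit FNV-1a variant and the 8100 base are assumptions (the messages only say "FNV-1a hash mod 900" and that sub-agents occupy 8100-8999); `fnv1a32` and `agent_web_port` here are illustrative, not the project's code.

```rust
/// 32-bit FNV-1a: xor each byte in, then multiply by the FNV prime.
pub fn fnv1a32(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5; // FNV-1a 32-bit offset basis
    for b in bytes {
        h ^= u32::from(*b);
        h = h.wrapping_mul(0x0100_0193); // FNV 32-bit prime
    }
    h
}

/// Map an agent name onto 8100..=8999. Same name, same port, every time;
/// two names can still land on the same port, which is why the fix for a
/// collision is to rename one of them.
pub fn agent_web_port(name: &str) -> u16 {
    8100 + (fnv1a32(name.as_bytes()) % 900) as u16
}
```

With 900 slots, collisions become likely well before the range fills up, which fits the "real but rare" framing for a handful of agents.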
müde
c35f566d15 agent_web_port: actually resolve legacy collisions
previous attempt was wrong: the legacy branch returned port_hash
unconditionally, so two legacies hashing to the same port both
wrote that port and the collision persisted (test still trying to
bind coder's port).

new rule: always probe forward from port_hash, with scan_taken_ports
parameterised by include_implicit_hashes:

- legacy migration (applied dir exists, no port file): pass false.
  scan only counts other agents' port files. first-queried legacy
  claims its hash; subsequent colliders see the first's port file
  and probe forward. we don't know which legacy originally won
  the bind race, so first-write-wins; the loser was already
  crash-looping anyway and gets a fresh port to rebuild to.

- fresh spawn (no applied dir): pass true. counts port files AND
  implicit hashes for not-yet-migrated legacies, so a new spawn
  doesn't race with an unmigrated peer.

migration note for affected users: agents whose port file says
something other than their hashed port may have been corrupted by
the previous fix. Hit ↻ R3BU1LD on the offender to regenerate the
flake (uses the current port file) and the container will bind
the right port on restart.
2026-05-15 21:13:17 +02:00
müde
237b215c55 dashboard: browser notifications for operator-bound events
three signals fire OS notifications:
- new approval lands in the queue (per id, via /api/state delta)
- new ask_operator question queued (per id)
- broker message sent to operator (live via SSE)

first /api/state render after page load seeds the 'seen' sets
without firing — only items that arrive while the page is open
count. controls in a row under the banner: 🔔 enable
notifications (calls requestPermission, hides on grant), 🔕 mute /
🔔 unmute toggle (localStorage-backed so operator can silence
without revoking the permission), inline status text when blocked
or unsupported.

notification tag='hyperhive' collapses rapid bursts; onclick
focuses the dashboard tab. requires secure context (HTTPS or
localhost) — on other origins the API is unavailable and the
controls hide themselves.

todo: entry dropped.
2026-05-15 21:10:20 +02:00
müde
a67aada7c9 todo: browser notifications for approvals / questions / operator msgs
pure frontend — Notification API + existing /api/state and
/messages/stream signals. Caveats: secure-context requirement
(HTTPS or localhost), per-browser permission grant. Includes a
sketch of the implementation: request-permission button, count
deltas on refreshState, SSE hook on operator-bound sends,
localStorage 'muted' toggle.
2026-05-15 21:07:21 +02:00