Commit graph

207 commits

Author SHA1 Message Date
damocles
24eec69418 fix reminder tool issues: error on time overflow, optimize scheduler query 2026-05-16 13:00:56 +02:00
damocles
f38510930a reminder: add background scheduler loop - checks & delivers due reminders every 5s 2026-05-16 12:49:59 +02:00
damocles
4fc9c02934 reminder: add sqlite storage + broker methods + dispatch 2026-05-16 12:49:59 +02:00
damocles
7e9fd8e978 agent: add Remind request + ReminderTiming enum (stub implementation) 2026-05-16 12:49:59 +02:00
damocles
862bc1de44 Revert "agent: add Wake command - co-process self-wake via agent socket"
This reverts commit 68a9b8575b1647643c87bd753767acabf96528c3.
2026-05-16 12:49:59 +02:00
damocles
f0e87f0bc5 agent: add Wake command - co-process self-wake via agent socket 2026-05-16 12:49:59 +02:00
müde
d06b598c56 kick_agent on every rebuild + apply path
agents weren't being woken with the 'you were rebuilt — check
/state/ for notes, --continue intact' system message after
several recent rebuild surfaces:

- auto_update::rebuild_agent — used by the dashboard rebuild
  button, admin-CLI rebuild via lifecycle_action, the startup
  rev-scan, AND the new meta-input update batch loop. kick
  moves *into* rebuild_agent's success arm so all four
  paths benefit. (the dashboard's lifecycle_action extra
  closure was already firing kick — now it's a no-op for the
  rebuild path since rebuild_agent does it.)
- actions::run_apply_commit — apply-commit approve flow built
  + tagged deployed/<id> but never kicked. add kick on
  success with the more specific 'config update applied' hint.
- server.rs::HostRequest::Rebuild — the admin-CLI direct path
  calls lifecycle::rebuild bypassing rebuild_agent. add kick
  on success.

dashboard's restart / start lifecycle_action extras still
kick via their own closures since they don't route through
rebuild_agent. stop / kill / destroy intentionally don't
kick — there's nothing to wake.
2026-05-16 04:20:01 +02:00
müde
78aa830430 meta inputs panel: walk transitive inputs, slash-path names
read_meta_inputs() previously only included direct inputs of
meta's root node — so a manager-added 'inputs.mcp-matrix' in
agent-dmatrix's flake.nix never surfaced in the dashboard
panel even though it's a real fetched input that nix can
update.

now: BFS the flake.lock graph from root to depth 2. emits
one MetaInputView per fetched (non-follows) node, names are
slash-paths from root — 'hyperhive', 'agent-coder',
'agent-dmatrix/mcp-matrix', 'hyperhive/nixpkgs', etc. that's
the same syntax 'nix flake update' accepts for transitive
inputs, so the existing POST /meta-update path needs no
nix-side change.

depth limit of 2 keeps the panel readable — deeper transitives
(nixpkgs's own deps etc.) would explode it; bumping a level-2
entry re-fetches its sub-inputs anyway.
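
a minimal sketch of the depth-2 walk, assuming the lock has already been parsed into a plain map (LockNode, InputRef and walk_inputs are hypothetical names, not hive-c0re's actual types). in flake.lock an input value is either a node key (a string) or a follows path (a JSON array); only the former points at a fetched node, so follows entries are skipped:

```rust
use std::collections::{BTreeMap, VecDeque};

/// How a lock node refers to one of its inputs (hypothetical type).
#[allow(dead_code)]
enum InputRef {
    /// Direct reference to another node key in `nodes`.
    Node(String),
    /// `follows` path; serialized as a JSON array in flake.lock.
    Follows(Vec<String>),
}

/// Already-parsed view of one flake.lock `nodes` entry (hypothetical).
struct LockNode {
    inputs: BTreeMap<String, InputRef>,
}

/// Breadth-first walk from `root`, returning slash-path names for every
/// fetched (non-follows) input at most `max_depth` levels below the root.
fn walk_inputs(nodes: &BTreeMap<String, LockNode>, root: &str, max_depth: usize) -> Vec<String> {
    let mut out = Vec::new();
    // Queue entries: (node key, slash-path of that node, depth of that node).
    let mut queue = VecDeque::from([(root.to_string(), String::new(), 0usize)]);
    while let Some((key, prefix, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // children would be deeper than the limit
        }
        let Some(node) = nodes.get(&key) else { continue };
        for (name, input) in &node.inputs {
            // follows entries alias a node reachable elsewhere; skip them.
            let InputRef::Node(child) = input else { continue };
            let path = if prefix.is_empty() { name.clone() } else { format!("{prefix}/{name}") };
            out.push(path.clone());
            queue.push_back((child.clone(), path, depth + 1));
        }
    }
    out
}
```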

POST /meta-update's 'which agents to rebuild' derivation
updated for the slash names: anything under hyperhive/
fans out to all agents (shared base); 'agent-<n>/...' picks
out the agent name from before the first slash.
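
one way the 'which agents to rebuild' mapping could look, as a sketch (agents_to_rebuild is a hypothetical name; a bare 'hyperhive' selection also fans out, matching the earlier /meta-update behavior):

```rust
use std::collections::BTreeSet;

/// Map selected slash-path input names to the agents that need a rebuild.
fn agents_to_rebuild(selected: &[&str], all_agents: &[&str]) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    for input in selected {
        // The top-level input is everything before the first slash.
        let top = input.split('/').next().unwrap_or("");
        if top == "hyperhive" {
            // Shared base: every agent rebuilds.
            out.extend(all_agents.iter().map(|a| a.to_string()));
        } else if let Some(name) = top.strip_prefix("agent-") {
            // 'agent-<n>' or 'agent-<n>/...' rebuilds only <n>.
            out.insert(name.to_string());
        }
    }
    out
}
```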

read_meta_locked_revs (used by the deployed:<sha> chip per
container) split out into its own straight root-input lookup
since the chip only cares about the agent's own input.
2026-05-16 04:12:04 +02:00
müde
d1c69b134a dashboard: reorder sections into grouped sequence
after reverting the 3-column attempt (74ba8a6), keep the
single-column layout but put related sections adjacent:

  swarm:     containers → kept-state → meta-inputs
  decisions: questions → approvals
  messages:  operator-inbox → message-flow + compose

this is a free improvement — the operator scrolls through one
logical group at a time instead of bouncing between swarm /
decisions / messages mid-page. follow-up improvements
(collapsing rarely-active sections, multi-column at wide
viewports done less aggressively) captured in TODO under
'Dashboard layout overhaul'.
2026-05-16 03:54:53 +02:00
müde
fe8fb15f8f Revert "dashboard: 3-column layout — swarm / 0per4t0r 1n / m3ss4g3s"
This reverts commit 74ba8a63e1.
2026-05-16 03:54:02 +02:00
müde
40938d8b54 dashboard: surface silent unwrap_or_default in api_state
every snapshot source backing /api/state used .unwrap_or_default(),
so sqlite errors, broker errors, nixos-container list failures and
operator_questions decode failures all degraded to empty lists
without a log line. the 'pending question doesn't render'
bug we've been chasing was likely a row-decode error in
OperatorQuestions::pending() being swallowed this way.

new log_default(what, result) replaces each call site: same
default value on Err, but it first emits a target=api_state warn
with the source name + the Debug-formatted error. five sources covered:
nixos-container list, approvals.pending,
approvals.recent_resolved, broker.recent_for(operator),
questions.pending. next time the question goes missing the
journal will say which source failed and how.
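
a sketch of the log_default shape described above. the commit logs via tracing with target=api_state; eprintln! stands in here so the example needs no crates:

```rust
use std::fmt::Debug;

/// Return the Ok value; on Err, log which source failed (Debug-formatted
/// error) and fall back to T::default() so the panel still renders.
fn log_default<T: Default, E: Debug>(what: &str, result: Result<T, E>) -> T {
    match result {
        Ok(v) => v,
        Err(e) => {
            // Stand-in for tracing::warn!(target: "api_state", ...).
            eprintln!("WARN api_state: {what} failed: {e:?}");
            T::default()
        }
    }
}
```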

todo updated — pending-question entry now points at the new
log instead of three suspect paths.
2026-05-16 03:49:49 +02:00
müde
74ba8a63e1 dashboard: 3-column layout — swarm / 0per4t0r 1n / m3ss4g3s
regroups the 7 stacked sections into three semantic columns
backed by a CSS grid (single column under 1400px, 3 columns
above). column headers are sticky so vertical scrolling
inside a column doesn't lose context.

- SW4RM (left, slightly wider): containers + kept-state +
  spawn-agent form + meta-input update form. all
  swarm-mutating operator knobs live here.
- 0PER4T0R 1N (middle): mind-questions + pending approvals.
  the two things waiting on operator action.
- M3SS4G3S (right): operator-inbox + msg-flow tail + the
  @-mention compose box. broker traffic in one place.

spawn form moves out of renderApprovals into static HTML
under sw4rm; renderApprovals no longer injects it.

cosmetic: per-section h2/divider replaced with smaller cyan
sub-heads + a dashed underline so each column reads as one
cohesive unit instead of seven competing banners. body
max-width grows 70em → 110em to actually use the new
horizontal real estate.
2026-05-16 03:47:16 +02:00
müde
266c2c7a77 dashboard: meta flake inputs UI + sequential rebuild loop
new section 'M3T4 1NPUTS' between approvals and message flow:
one row per input in meta/flake.lock (hyperhive first, then
agent-<n> alphabetically). each row shows the input name, the
first 12 chars of the locked sha, a relative timestamp from
locked.lastModified, and the original.url when available.
checkbox per row; submit button is disabled until at least one
box is checked; submitting confirms then POSTs the selected
names to /meta-update.

backend:
- meta::lock_update(inputs: &[String]) — runs 'nix flake update
  <names>' in the meta dir, commits the lock change with a
  combined message ('lock update: hyperhive, agent-coder').
  preserves the existing META_LOCK serialization. existing
  lock_update_for_rebuild / lock_update_hyperhive stay for
  their single-input callers.
- POST /meta-update — comma-separated 'inputs' form field
  (JS joins checkboxes since axum::Form doesn't natively
  decode repeated keys); spawns a background task that runs
  the lock update + per-agent rebuild loop. hyperhive
  selection fans out to all agents; agent-<n> selection only
  rebuilds <n>. each rebuild fires Rebuilt to the manager
  exactly like dashboard / admin-CLI / auto-update.
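
a sketch of the form-field split plus the command and commit message meta::lock_update builds (parse_inputs_field and lock_update_command are hypothetical helper names; the real function also takes META_LOCK and runs the commit):

```rust
use std::path::Path;
use std::process::Command;

/// Split the comma-joined 'inputs' form field into clean input names.
fn parse_inputs_field(field: &str) -> Vec<String> {
    field
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Build 'nix flake update <names...>' in the meta dir, plus the combined
/// commit message for the resulting lock change.
fn lock_update_command(meta_dir: &Path, inputs: &[String]) -> (Command, String) {
    let mut cmd = Command::new("nix");
    cmd.current_dir(meta_dir).args(["flake", "update"]).args(inputs);
    let message = format!("lock update: {}", inputs.join(", "));
    (cmd, message)
}
```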

the rebuild loop is sequential, and so is auto_update::run now (it
was parallel via tokio::spawn). parallel rebuilds collide on
nix-store's sqlite cache ('sqlite db busy, not using cache')
and the meta META_LOCK contention. nix-daemon serializes the
heavy build steps anyway, so this isn't a throughput loss.
2026-05-16 03:38:07 +02:00
müde
891223219e server: notify manager on admin-socket Rebuild outcomes
HostRequest::Rebuild was the only rebuild path that bypassed
notify_manager. dashboard / auto_update / actions::approve
already emit Rebuilt events on both success + failure, but a
'hive-c0re rebuild <name>' from the host CLI (and the recent
matrix-flake build failure that surfaced in journald) left the
manager in the dark.

mirror auto_update::rebuild_agent's pattern: on success →
Rebuilt{ok:true}, on failure → Rebuilt{ok:false, note:
format!("{e:#}")}. note carries the stderr tail lifecycle::run
collected (the actual nix error: missing prompt file, dep
build failure, etc.), so the manager has enough context to
adjust the agent's agent.nix without ssh-ing to the host.
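
sketch of the success/failure → event mapping (ManagerEvent and rebuilt_event are hypothetical stand-ins for the real Rebuilt plumbing). `{e:#}` is the alternate Display flag; anyhow errors render their whole context chain with it, while a plain String prints the same as `{e}`:

```rust
use std::fmt::Display;

/// Hypothetical shape of the event delivered to the manager.
#[derive(Debug, PartialEq)]
enum ManagerEvent {
    Rebuilt { agent: String, ok: bool, note: Option<String> },
}

/// Turn a rebuild outcome into the manager event; failures carry the
/// error text (e.g. the nix stderr tail) in `note`.
fn rebuilt_event<E: Display>(agent: &str, outcome: Result<(), E>) -> ManagerEvent {
    match outcome {
        Ok(()) => ManagerEvent::Rebuilt { agent: agent.to_string(), ok: true, note: None },
        Err(e) => ManagerEvent::Rebuilt {
            agent: agent.to_string(),
            ok: false,
            note: Some(format!("{e:#}")),
        },
    }
}
```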
2026-05-16 03:30:02 +02:00
müde
06af23c8a4 recv: None = peek, positive value = opt-in long-poll
old behavior: omitted wait_seconds fell through to the 30s
RECV_LONG_POLL_DEFAULT — claude calling 'is there anything in
my inbox right now?' between actions blocked the turn for half
a minute. flip the semantics: None (or 0) returns immediately,
positive value parks up to MAX (180s, unchanged). cleaner
'peek vs wait' distinction; tool descriptions + agent/manager
prompts updated to point at the new shape.

harness's own serve loops in hive-ag3nt + hive-m1nd relied on
the old default for their inbox poll. they now explicitly pass
wait_seconds: Some(180) to opt into the full park — same
effective behavior as before, just spelled out.
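
the new peek-vs-wait rule as a small function (recv_wait and RECV_LONG_POLL_MAX are assumed names; 180s is the max from the commit):

```rust
use std::time::Duration;

/// Upper bound on a long-poll park.
const RECV_LONG_POLL_MAX: Duration = Duration::from_secs(180);

/// None or Some(0) → peek (no wait); a positive value parks for up to
/// that many seconds, clamped to the max.
fn recv_wait(wait_seconds: Option<u64>) -> Option<Duration> {
    match wait_seconds {
        None | Some(0) => None,
        Some(s) => Some(Duration::from_secs(s).min(RECV_LONG_POLL_MAX)),
    }
}
```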

retires the matching TODO under Turn loop.
2026-05-16 03:22:42 +02:00
müde
90df2106bf agent socket: external wake-up path for in-container MCP servers
new AgentRequest::Wake { from, body } drops a message into
this agent's inbox via the per-agent socket. matrix-style MCP
servers can use it when they receive an external event
(matrix message, webhook, scrape result) to nudge claude
into running a turn. broker.send wakes whatever Recv is
currently long-polling, the harness picks the message up,
formats a wake prompt with the caller's chosen from label
('matrix: new dm', 'webhook: deploy succeeded', etc.).

new `hive-ag3nt wake --from <label> --body <text>` subcommand
on the harness binary so MCP servers can shell out instead of
implementing the line-JSON protocol themselves; body=='-'
reads from stdin for multi-line / quoting-friendly payloads.
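
the `--body -` convention could be handled like this; resolve_body is a hypothetical helper that takes any reader so stdin can be swapped out in tests:

```rust
use std::io::{self, Read};

/// Resolve the --body argument: '-' reads the whole payload from `input`
/// (stdin in the CLI); anything else is used verbatim.
fn resolve_body<R: Read>(arg: &str, mut input: R) -> io::Result<String> {
    if arg == "-" {
        let mut buf = String::new();
        input.read_to_string(&mut buf)?;
        Ok(buf)
    } else {
        Ok(arg.to_string())
    }
}
```

in the binary the reader would be `io::stdin()`.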

identity = socket: anything that can connect to /run/hive/mcp.sock
is implicitly trusted to inject. that's fine because the socket
is only bind-mounted into the agent's own container; no new auth
surface opens up.

docs/turn-loop.md gets a new 'Waking the agent from inside
the container' section pointing at both paths (CLI + raw
JSON).
2026-05-16 03:15:58 +02:00
müde
96cb9f84c9 dashboard: approval history tab on P3NDING APPR0VALS
new tabs above the approvals list: 'pending · N' and
'history · M'. active tab persists in localStorage so the
operator can park on history if they prefer. on a fresh
dashboard the default is pending (matches the prior shape).

history view shows the last 30 resolved approvals — newest
first by resolved_at — with one row per approval: status
glyph (✓ approved / ✗ denied / ⚠ failed), id, agent, kind,
short sha, status label, and a relative time chip. when the
row has a note (deny reason or build error), it renders
below in a muted block with line wraps preserved.

backend: Approvals::recent_resolved(limit) queries by
status IN ('approved', 'denied', 'failed') ORDER BY
resolved_at DESC. StateSnapshot gets approval_history (a
lean ApprovalHistoryView without diff_html — rendering 30
git diffs per state poll would be expensive and the operator
already saw the diff at decision time). dashboard's
history_view fn projects the sqlite row.

retires the matching TODO entry.
2026-05-16 03:07:50 +02:00
müde
7276e6d5d9 git identity: shorten to 'c0re' across all helpers
lifecycle::GIT_{NAME,EMAIL}, meta::GIT_{NAME,EMAIL}, and the
inline strings migrate.rs uses for its bootstrap commits all
move from 'hive-c0re' / 'hive-c0re@hyperhive' to 'c0re' /
'c0re@hyperhive'. shows up shorter in git log everywhere
(applied + meta repos).
2026-05-16 03:02:44 +02:00
müde
8336017eda lifecycle: annotated tags need a tagger identity
git_tag_annotated planted failed/<id> + denied/<id> as
annotated tags via 'git tag -a' — which produces a git
object and therefore needs user.name + user.email. without a
global git config on the host that fell through to
'fatal: unable to auto-detect email address (got
root@muede-lpt2.(none))' and the tag never landed.

pass the hive-c0re identity inline with -c user.name=… -c
user.email=… (same shape git_commit already uses), so the
applied repo's deny/failure audit tags get planted reliably
without depending on the host user's git config.
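
sketch of the command shape with the inline identity (the function signature is assumed; the identity strings are the hive-c0re ones in use at this commit). `-c` overrides apply to this single git invocation only:

```rust
use std::path::Path;
use std::process::Command;

/// Build an annotated-tag command that carries its own committer identity,
/// so it works without any global git config on the host.
fn git_tag_annotated(repo: &Path, tag: &str, message: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("-C")
        .arg(repo)
        .args(["-c", "user.name=hive-c0re", "-c", "user.email=hive-c0re@hyperhive"])
        .args(["tag", "-a", tag, "-m", message]);
    cmd
}
```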
2026-05-16 03:00:44 +02:00
müde
c92108a11c lifecycle: fetch into checked-out main with --update-head-ok
setup_applied does `git init --initial-branch=main` then
`git fetch <proposed> main:refs/heads/main` to seed the
applied repo with proposed's initial commit. git's default
safeguard refuses to fetch into the currently-checked-out
branch, even though the working tree is empty (we just init'd).
add --update-head-ok to bypass it. the read-tree reset that runs
immediately afterwards syncs the working tree to the fetched state,
so the safeguard the flag bypasses isn't protecting anything here.

repro from the user: spawn of 'dmatrix' failed with
  fatal: refusing to fetch into branch 'refs/heads/main'
  checked out at '/var/lib/hyperhive/applied/dmatrix'
2026-05-16 02:58:34 +02:00
müde
6f1b664c85 lifecycle: stream nixos-container stdout/stderr line-by-line
run() previously buffered the child's output via .output() and
only logged at exit — a multi-minute 'nixos-container update'
(typical on a fresh hyperhive bump) showed nothing in journald
until the very end. operator watching 'journalctl -u hive-c0re
-f' couldn't tell 'slow nix build' from 'wedged daemon'.

new shape: spawn with piped stdio, pump each line into tracing
as it arrives (stdout → INFO, stderr → WARN), keep a tail of
the last 32 stderr lines for the bail message so the eventual
'failed (status 2)' still carries the actual nix eval error.
target field 'nixos-container', argv-equivalent attached via
the 'cmdline' field so filtering by subcommand works.
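
a std-only sketch of the streaming shape, using threads rather than whatever async plumbing the real run() uses; println!/eprintln! stand in for the tracing INFO/WARN calls, and run_streaming / for_each_line are hypothetical names. stdout is drained on its own thread so neither pipe can fill up and stall the child:

```rust
use std::collections::VecDeque;
use std::io::{self, BufRead, BufReader, Read};
use std::process::{Command, ExitStatus, Stdio};
use std::thread;

/// How many trailing stderr lines to keep for the failure message.
const STDERR_TAIL: usize = 32;

/// Call `f` once per line of `r`, decoding lossily so a stray non-UTF-8
/// byte can't stop the pipe from being drained.
fn for_each_line<R: Read>(r: R, mut f: impl FnMut(String)) {
    let mut reader = BufReader::new(r);
    let mut buf = Vec::new();
    loop {
        buf.clear();
        match reader.read_until(b'\n', &mut buf) {
            Ok(0) | Err(_) => break,
            Ok(_) => {
                let line = String::from_utf8_lossy(&buf);
                f(line.trim_end_matches(|c: char| c == '\r' || c == '\n').to_string());
            }
        }
    }
}

/// Run `cmd`, logging each output line as it arrives; returns the exit
/// status plus the last STDERR_TAIL stderr lines.
fn run_streaming(cmd: &mut Command) -> io::Result<(ExitStatus, Vec<String>)> {
    let mut child = cmd.stdout(Stdio::piped()).stderr(Stdio::piped()).spawn()?;
    let stdout = child.stdout.take().expect("stdout is piped");
    let stderr = child.stderr.take().expect("stderr is piped");

    let out_thread = thread::spawn(move || {
        for_each_line(stdout, |line| println!("INFO nixos-container: {line}"));
    });

    let mut tail: VecDeque<String> = VecDeque::with_capacity(STDERR_TAIL);
    for_each_line(stderr, |line| {
        eprintln!("WARN nixos-container: {line}");
        if tail.len() == STDERR_TAIL {
            tail.pop_front(); // ring buffer: drop the oldest line
        }
        tail.push_back(line);
    });

    out_thread.join().expect("stdout reader thread panicked");
    let status = child.wait()?;
    Ok((status, tail.into()))
}
```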
2026-05-16 02:57:16 +02:00
müde
78f21ccc5d meta: serialize all ops behind a tokio mutex + clear stale lock at startup
journal showed three concurrent rebuilds racing on the meta
repo's .git/index.lock. auto_update::run kicks off a parallel
tokio::spawn for every stale agent, and each rebuild eventually
calls into meta::sync_agents / lock_update_for_rebuild, which
do git add + commit. git isn't safe across concurrent processes
on the same .git/, and one of the children that failed mid-write
left index.lock behind. subsequent ops blocked until somebody
rm'd it manually.

fix: static META_LOCK (tokio::sync::Mutex<()>) acquired at the
top of every public meta function. concurrent rebuilds take
turns on meta ops; the actual nix build (nixos-container update)
releases the lock first and runs without it, so parallel agent
builds still parallelize on nix-daemon's own concurrency model.

migrate::run additionally clears /var/lib/hyperhive/meta/.git/
index.lock on startup if it exists — we just booted, nothing
of ours is holding it. covers the 'previous crash left a stale
lock' case the user just hit so the daemon recovers without
manual intervention.
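
sketch of both pieces. the commit's META_LOCK is a tokio::sync::Mutex<()> because its guard is held across .await; std::sync::Mutex keeps this example dependency-free. with_meta_lock_then_build and clear_stale_index_lock are hypothetical helpers:

```rust
use std::io;
use std::path::Path;
use std::sync::Mutex;

/// Serializes every git operation on the meta repo.
static META_LOCK: Mutex<()> = Mutex::new(());

/// Run the git bookkeeping under the lock, then the (slow) build without
/// it, so builds don't queue behind each other's meta ops.
fn with_meta_lock_then_build<T>(git_ops: impl FnOnce() -> T, build: impl FnOnce(T)) {
    let staged = {
        let _guard = META_LOCK.lock().unwrap_or_else(|p| p.into_inner());
        git_ops()
    }; // guard dropped here, before the build starts
    build(staged);
}

/// At startup nothing of ours can hold .git/index.lock yet, so a leftover
/// one is from a crash: remove it. Returns whether a file was removed.
fn clear_stale_index_lock(repo: &Path) -> io::Result<bool> {
    match std::fs::remove_file(repo.join(".git/index.lock")) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```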
2026-05-16 02:44:39 +02:00
müde
3db33b0fe5 agent flake.nix: forward inputs as flakeInputs module arg
new boilerplate wraps agent.nix as a sub-module + passes every
flake input (minus self) through to it via _module.args.flakeInputs.
manager edits the inputs block of flake.nix to pull in
out-of-tree flakes (MCP servers etc.) and references them in
agent.nix as flakeInputs.<name>.packages.${pkgs.system}.default
; the new input's pinned sha lands in the agent's own flake.lock
(already tracked + part of the proposal flow), and
transitively rolls up into meta's lock.

migrate's MODULE_FLAKE_MARKER swaps to _module.args.flakeInputs
so existing agents on the old 'nixosModules.default = import
./agent.nix' template get re-rendered onto the new shape on
next hive-c0re start.

manager_server's flake.nix tamper-check goes away — the build
path's failed/<id> annotated tag already provides the safety
net when a manager edit breaks the flake; enforcing 'no
flake.nix edits at all' was overly strict (blocks the inputs-
addition pattern that's the whole point of this change).

manager prompt updated with a worked example for adding an
MCP-server flake input + wiring it through agent.nix.
2026-05-16 02:23:43 +02:00
müde
50ef806266 operator pronouns: configurable free-text, threaded into prompts
new NixOS module option services.hive-c0re.operatorPronouns
(free text, default 'she/her', example 'they/them'). hive-c0re
takes it as a CLI flag (--operator-pronouns, lib.escapeShellArg'd
in the systemd unit), stores it on Coordinator, threads it into
the meta flake's mkAgent so each agent's systemd service gets
HIVE_OPERATOR_PRONOUNS set. the harness reads the env at boot
and substitutes {operator_pronouns} into the agent / manager
system prompt alongside {label}. nix string is escaped against
backslash + double-quote so non-ascii / quoted values
round-trip safely. prompt addendum: both agent.md and
manager.md mention the operator's pronouns up front so claude
uses them naturally in third-person reference. propagates on
next ↻ R3BU1LD (meta lock bump, no per-agent approval).
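
sketch of the escaping and the boot-time substitution (nix_escape and render_prompt are hypothetical names). only backslash and double-quote are escaped, as described; note that a literal `${` in a Nix double-quoted string would still start interpolation:

```rust
/// Escape `s` for embedding in a Nix double-quoted string literal
/// (backslash and double-quote only).
fn nix_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c == '\\' || c == '"' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Fill the system-prompt placeholders the harness substitutes at boot.
fn render_prompt(template: &str, label: &str, pronouns: &str) -> String {
    template
        .replace("{label}", label)
        .replace("{operator_pronouns}", pronouns)
}
```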
2026-05-16 02:05:22 +02:00
müde
5208b0112a dashboard: terminal compose with @-mention sticky recipient
new section under MESS4GE FL0W. msgflow already tails only
broker traffic (sent + delivered), which is exactly the
'messages through core' view the operator wants; no
per-agent thinking leaks through. compose box below:

- a prompt span shows the sticky recipient ('@coder>') outside
  the textarea so it can't be edited inadvertently. on submit
  the recipient gets persisted to localStorage so it survives
  reload.
- start the input with '@name body' to redirect — the parser
  splits at the first whitespace and the new recipient
  becomes sticky.
- typing '@' at the start opens a completion dropdown over
  the textarea pulled from window.__hyperhive_state.containers;
  arrow keys cycle, tab/enter selects, escape closes. clicking
  works too.
- manager swap: agents flagged is_manager are surfaced as
  '@manager' (the broker's recipient string) instead of
  '@hm1nd' (the container name), so the message actually
  routes to the manager's inbox.
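
the compose parsing rules, rendered in Rust for illustration (the real parser is dashboard JS; parse_compose, Outgoing and recipient_name are hypothetical):

```rust
/// One parsed compose-box submission.
#[derive(Debug, PartialEq)]
struct Outgoing {
    to: String,
    body: String,
}

/// '@name body' redirects and makes `name` sticky; anything else goes to
/// the current sticky recipient. Returns None when there's nothing to send.
fn parse_compose(input: &str, sticky: &mut String) -> Option<Outgoing> {
    let input = input.trim();
    let body = match input.strip_prefix('@') {
        Some(rest) => {
            // Split at the first whitespace: recipient, then the message.
            let (name, body) = rest.split_once(char::is_whitespace).unwrap_or((rest, ""));
            if !name.is_empty() {
                *sticky = name.to_string();
            }
            body.trim_start()
        }
        None => input,
    };
    if body.is_empty() || sticky.is_empty() {
        return None;
    }
    Some(Outgoing { to: sticky.clone(), body: body.to_string() })
}

/// Address the manager by the broker's recipient string, not its container name.
fn recipient_name(container: &str, is_manager: bool) -> &str {
    if is_manager { "manager" } else { container }
}
```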

backend: new POST /op-send accepts {to, body} and drops a
broker.send({from:'operator', to, body}) — same shape as the
per-agent web UI's OperatorMsg, but lets the operator choose
the recipient explicitly from the main dashboard.
2026-05-16 01:55:00 +02:00
müde
2a6d084718 ask_operator: any agent can call it, answer routes by asker
new AgentRequest::AskOperator + AgentResponse::QuestionQueued on
the per-agent socket — same shape as the manager flavor, agent
gets the same wire surface (still uses the same operator_questions
table). agent_server::dispatch wires AskOperator through
coord.questions.submit(agent, ...) so the row's asker is the sub-agent
name; the ttl watchdog already in manager_server gets shared and
spawn_question_watchdog goes pub.

answer routing: operator_questions::answer now returns (question,
asker). post_answer_question + post_cancel_question + the watchdog
fire OperatorAnswered through new coord.notify_agent(asker, event)
instead of always notify_manager — the event lands in whichever
agent originally asked. notify_manager is now a thin wrapper.

agent socket plumbing: agent_server::start takes Arc<Coordinator>
instead of Arc<Broker> so dispatch has access to questions +
notify path; coordinator::{register_agent,ensure_runtime} take
self: &Arc<Self>. mcp::AgentServer grows the ask_operator tool;
allowed_mcp_tools(Agent) adds it; prompts/agent.md replaces the
'message the manager to ask the operator' guidance with the
direct tool description.
2026-05-16 01:48:10 +02:00
müde
6b3ef4549c manager_server: reject proposals that modify flake.nix
submit_apply_commit now diffs the freshly-tagged proposal/<id>
against applied/main and refuses if flake.nix is in the
changeset. flake.nix is fixed boilerplate the meta flake
depends on (it exports nixosModules.default = import ./agent.nix);
silent edits there would break the nixosConfiguration
in subtle ways. the manager prompt already says don't touch
it; this is the host-side belt — clear error to the manager
on submit, row marked failed in sqlite, no orphan pending
approval to chase. diff-failure is logged + ignored: the
build path surfaces concrete errors if flake.nix is actually
broken.
2026-05-16 01:42:11 +02:00
müde
d202f3785c suppress crash_watch during background rebuilds + meta repoint
crash_watch fires ContainerCrash whenever it sees a previously-
running container in a non-running state without a transient
flag set. dashboard rebuilds already set Rebuilding via
lifecycle_action; the two other rebuild paths didn't:

- migrate::repoint_container: phase 4 walks every container,
  each nixos-container update activation briefly takes the
  systemd unit down. previously fired ContainerCrash for every
  agent during the migration; manager would then spuriously
  call start() on agents that were already coming back up.
- auto_update::rebuild_agent: startup scan + admin-socket
  caller bypass lifecycle_action.

both paths now set the Rebuilding transient around the rebuild
+ clear after. matches what dashboard does.
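
one way to guarantee the 'clear after' half even when a rebuild returns early is an RAII guard. this is a sketch with hypothetical types (Transient, TransientGuard, is_crash), not the commit's code, and is_crash omits the 'previously running' check for brevity:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical transient-state flag consulted by crash_watch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Transient {
    Rebuilding,
}

type TransientMap = Arc<Mutex<HashMap<String, Transient>>>;

/// Sets the flag on creation and clears it on drop, including on early
/// returns via `?` from the rebuild.
struct TransientGuard {
    map: TransientMap,
    name: String,
}

impl TransientGuard {
    fn set(map: &TransientMap, name: &str, state: Transient) -> Self {
        map.lock().unwrap().insert(name.to_string(), state);
        TransientGuard { map: Arc::clone(map), name: name.to_string() }
    }
}

impl Drop for TransientGuard {
    fn drop(&mut self) {
        if let Ok(mut m) = self.map.lock() {
            m.remove(&self.name);
        }
    }
}

/// A non-running container counts as a crash only without a transient flag.
fn is_crash(map: &TransientMap, name: &str, running: bool) -> bool {
    !running && !map.lock().unwrap().contains_key(name)
}
```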
2026-05-16 01:12:48 +02:00
müde
63e8a98df2 meta: stage before lock, single commit per change
git+file://'s dirty-tree fetcher reads tracked + staged content
from the index (not the working tree, not untracked files). so
staging is enough to make a new flake.nix or flake.lock visible
to nix without committing first. sync_agents now stages flake.nix,
runs lock, stages the resulting flake.lock, then commits
both together in a single 'regenerate meta flake' (or 'seed
meta from N agents') commit — no more two-commit churn.

prepare_deploy applies the same trick to the two-phase deploy:
runs nix flake update, stages flake.lock so nixos-container
update sees it, doesn't commit yet. finalize_deploy commits
with the deployed/<id> message on build success; abort_deploy
git-restores the staged lock back to HEAD on failure. meta
history continues to record only successful deploys (and now
one commit per success instead of one + amend).
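
the resulting command sequences, written out as argv lists (the helper names are hypothetical, and `git restore --staged --worktree`, which restores from HEAD when --staged is given, is one way to spell abort_deploy's reset):

```rust
/// One external command (program + args), run in the meta dir.
type Step = Vec<String>;

fn step(parts: &[&str]) -> Step {
    parts.iter().map(|s| s.to_string()).collect()
}

/// sync_agents after writing flake.nix: stage, lock, stage the lock,
/// then one commit covering both files.
fn sync_agents_steps(message: &str) -> Vec<Step> {
    vec![
        step(&["git", "add", "flake.nix"]),
        step(&["nix", "flake", "lock"]),
        step(&["git", "add", "flake.lock"]),
        step(&["git", "commit", "-m", message]),
    ]
}

/// prepare_deploy: bump this agent's input and stage the lock, no commit.
fn prepare_deploy_steps(agent: &str) -> Vec<Step> {
    let input = format!("agent-{agent}");
    vec![
        step(&["nix", "flake", "update", input.as_str()]),
        step(&["git", "add", "flake.lock"]),
    ]
}

/// finalize_deploy: commit the staged lock once the build succeeded.
fn finalize_deploy_steps(agent: &str, id: u64, sha: &str) -> Vec<Step> {
    let sha12: String = sha.chars().take(12).collect();
    let msg = format!("deploy {agent} deployed/{id} {sha12}");
    vec![step(&["git", "commit", "-m", msg.as_str()])]
}

/// abort_deploy: put index + working tree for flake.lock back to HEAD.
fn abort_deploy_steps() -> Vec<Step> {
    vec![step(&["git", "restore", "--staged", "--worktree", "flake.lock"])]
}
```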
2026-05-16 01:02:47 +02:00
müde
220e9b4af6 meta: commit before lock — git+file:// only sees tracked files
runtime error on first deploy attempt: 'source tree referenced
by git+file:///var/lib/hyperhive/meta does not contain
/flake.nix'. cause: sync_agents wrote flake.nix then ran
'nix flake lock' against a directory nix had just discovered
as a git repo (auto-upgraded to git+file://), which only sees
TRACKED content. fresh flake.nix was untracked, so nix saw an
empty source tree.

fix: commit flake.nix before locking. sync_agents now does
write → init (if first) → git add + commit → nix flake lock
→ commit lock if changed. two commits per change — one
'regenerate meta flake' and one 'lock update' — instead of
one combined; cleaner history.

same git+file:// gotcha bit the two-phase deploy: prepare_deploy
used to write the lock without committing, expecting
nixos-container update to read the working tree. it doesn't —
it reads the tracked commit. prepare_deploy now commits with
a placeholder 'deploy <n> (building)' message; finalize_deploy
amends to 'deploy <n> deployed/<id> <sha12>' on success;
abort_deploy drops it with git reset --hard HEAD~1 on failure. meta
history still records only successful deploys.
2026-05-16 00:59:35 +02:00
müde
87c7b05b05 meta: use 'nix flake update <input>' instead of removed --update-input
current nix CLI removed 'nix flake lock --update-input X' in
favour of 'nix flake update X'. switch all three call sites
(prepare_deploy, lock_update_for_rebuild,
lock_update_hyperhive). 'nix flake lock' with no args still
works for the seed path in sync_agents — it resolves missing
inputs without bumping existing ones.
2026-05-16 00:49:22 +02:00
müde
14aa7c7acc final docs + cleanup sync for meta-flake era
claude.md flips 'in flight' → 'just landed' for the meta
overhaul + extends the file map with meta.rs and migrate.rs.
docs/approvals.md replaces the in-flight callout with a
proper 'Meta flake' section (two-phase deploy walkthrough,
sync_agents semantics, single-phase variants), updates the
two-repo box diagram to include the /var/lib/hyperhive/meta/
tree and tracks flake.nix in applied, rewrites the
container --flake reference to meta#<name>, replaces the
'Manager view of applied' section with a unified
'/agents + /applied + /meta' inventory listing every useful
git incantation, and explains the in-place no-state-loss
migration that now runs on hive-c0re startup.
docs/persistence.md grows entries for the meta repo + the
.meta-migration-done marker. readme box diagram picks up the
/meta RO bind; approval-flow paragraph rewritten end to end
to describe the meta lock dance.

lifecycle::flake_base deleted — the meta render hardcodes
the manager vs agent-base choice as nix expression.
2026-05-16 00:40:06 +02:00
müde
2f6ecc4dc0 dashboard: deployed sha chip per container
ContainerView grows deployed_sha (first 12 chars of the rev
that /var/lib/hyperhive/meta/flake.lock currently has locked
for agent-<name>). renderContainers appends a 'deployed:<sha12>'
chip next to the container name + port — title attribute
explains it's the meta-lock sha. degrades gracefully when the
meta repo isn't seeded yet (missing / unparsable lock = empty
map = no chip). new read_meta_locked_revs helper does the JSON
parsing without unwraps.
2026-05-16 00:36:52 +02:00
müde
59a89314f0 startup auto-migration from pre-meta layout
new migrate module runs before auto_update on hive-c0re boot.
four idempotent phases:

1. for every applied/<n>/ whose flake.nix isn't already the
   module-only boilerplate, rewrite + commit + relocate
   deployed/0 to HEAD so setup_applied's existence check passes
2. for every proposed/<n>/config without an 'applied' remote,
   wire it (delegates to setup_proposed which is now
   idempotent and adds the remote itself)
3. meta::sync_agents over the current container list — inits
   the meta repo on first call, rerender + relock if drifted
4. nixos-container update <c> --flake meta#<name> for every
   container, guarded by /var/lib/hyperhive/.meta-migration-done
   so phase 4's expensive eval only runs once across restarts

env kill-switch HIVE_SKIP_META_MIGRATION=1 defers the whole
thing. each agent's failure is logged + skipped so one broken
agent doesn't block the rest. runs ahead of ensure_manager so
the manager auto-spawn comes up against meta from the first
attempt.
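
sketch of the phase-4 guard shape: kill switch, marker file, per-agent failures logged and skipped (repoint_all and migration_skipped are hypothetical; writing the marker even when some agents failed is an assumption, the commit doesn't say):

```rust
use std::fmt::Display;
use std::fs;
use std::io;
use std::path::Path;

/// Env kill-switch: HIVE_SKIP_META_MIGRATION=1 defers the whole migration.
fn migration_skipped() -> bool {
    matches!(std::env::var("HIVE_SKIP_META_MIGRATION").as_deref(), Ok("1"))
}

/// Run `update` for each agent unless the marker already exists; log and
/// skip individual failures, then write the marker. Returns failed agents.
fn repoint_all<E: Display>(
    marker: &Path,
    agents: &[&str],
    mut update: impl FnMut(&str) -> Result<(), E>,
) -> io::Result<Vec<String>> {
    if marker.exists() {
        return Ok(Vec::new()); // expensive pass already done on an earlier boot
    }
    let mut failed = Vec::new();
    for &agent in agents {
        if let Err(e) = update(agent) {
            eprintln!("WARN migrate: repoint {agent} failed: {e}");
            failed.push(agent.to_string());
        }
    }
    fs::write(marker, b"")?;
    Ok(failed)
}
```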
2026-05-16 00:34:58 +02:00
müde
87016cd567 auto_update: bump meta hyperhive input before per-agent rebuilds
auto_update::run now calls meta::lock_update_hyperhive once
up-front so the per-agent rebuilds it kicks off rebuild against
the new base. lifecycle::rebuild already drives sync_agents +
lock_update_for_rebuild per agent, so the rev-marker shortcut
keeps its meaning ('we've ack'd this rev for this agent')
without further plumbing. failures of the hyperhive lock bump
log + continue — individual rebuilds will surface concrete
errors if anything's really wrong.
2026-05-16 00:32:55 +02:00
müde
06fdbac1ac actions::run_apply_commit through meta two-phase
approval-driven deploys now walk the meta flake via
prepare_deploy / finalize_deploy / abort_deploy so a failed
build leaves no commit in meta's deploy log:

1. capture applied/main sha for rollback
2. tag approved/<id> + building/<id>
3. ff applied/main to proposal/<id>, read-tree sync working tree
4. meta::prepare_deploy(name) — nix flake lock --update-input
   agent-<n> without committing
5. lifecycle::rebuild_no_meta — container-level only (new
   extracted helper; public lifecycle::rebuild still wraps it
   with single-phase meta sync + commit for dashboard / auto_update
   callers that don't care about rollback)
6a. on success: tag deployed/<id>, meta::finalize_deploy commits
    the staged lock with 'deploy <n> deployed/<id> <sha12>'
6b. on failure: tag failed/<id> annotated with the build error,
    git_update_ref applied/main back to prev sha, read-tree to
    main, meta::abort_deploy git-restores flake.lock

meta's git log now records only successful deploys; failures
+ denials still live in applied as annotated tags.
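
the meta half of steps 4-6 as a sketch (MetaDeploy and deploy are hypothetical; the applied-side tagging and update-ref rollback are left out):

```rust
/// Hypothetical hooks for the meta side of a two-phase deploy.
trait MetaDeploy {
    fn prepare(&mut self, agent: &str) -> Result<(), String>;
    fn finalize(&mut self, agent: &str, id: u64) -> Result<(), String>;
    fn abort(&mut self, agent: &str) -> Result<(), String>;
}

/// Stage the lock, build, then commit on success or restore on failure,
/// so meta's history only ever records successful deploys.
fn deploy<M: MetaDeploy>(
    meta: &mut M,
    agent: &str,
    id: u64,
    build: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    meta.prepare(agent)?;
    match build() {
        Ok(()) => meta.finalize(agent, id),
        Err(e) => {
            // An abort failure is secondary; surface the build error.
            if let Err(abort_err) = meta.abort(agent) {
                eprintln!("WARN abort_deploy {agent}: {abort_err}");
            }
            Err(e)
        }
    }
}
```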
2026-05-16 00:32:16 +02:00
müde
22f35def8f actions::destroy syncs meta after lifecycle
once nixos-container destroy lands + per-agent state cleanup is
done, rerender the meta flake from the remaining containers so
the destroyed agent's input + nixosConfiguration drop off and
its flake.lock entry vanishes. log + keep going on meta-sync
failure — the destroy already succeeded at the lifecycle level,
so meta drift here is just bookkeeping. new public
lifecycle::agents_for_meta_listing exposes the agent
enumeration for callers outside the module.
2026-05-16 00:29:26 +02:00
müde
4cb529351e lifecycle::rebuild through meta
rebuild now does sync_agents (idempotent — no-op when the
rendered flake matches disk; recovers from a divergent meta
repo on the side) followed by lock_update_for_rebuild which
relocks just this agent's input and commits the lock change
if any. flake ref for nixos-container update flips from
applied/<n>#default to meta#<name>. new helper
meta::lock_update_for_rebuild is single-phase (no separate
finalize): rebuild has no failure-revert semantics — it always
wants the latest applied/<n>/main. spawn already syncs meta
before container create; rebuild now picks up the meta side
on every manual ↻ R3BU1LD.
2026-05-16 00:28:26 +02:00
müde
8f94e4379a lifecycle::spawn through meta
after setup_proposed + setup_applied, spawn now syncs the meta
flake (one input + one nixosConfiguration per agent) so
`--flake /var/lib/hyperhive/meta#<name>` resolves before
nixos-container create runs. flake ref switches from
applied/<n>#default to meta#<name>; the wrapper modules
(identity, HIVE_PORT, HIVE_LABEL, HIVE_DASHBOARD_PORT) now
live in the meta flake's mkAgent. new helper agents_for_meta
builds the AgentSpec list by enumerating containers + optionally
appending a not-yet-present name for the spawn case. spawn
keeps its caller signature; rebuild + auto_update get wired up
in follow-up commits.
2026-05-16 00:27:12 +02:00
müde
c42ad1330c lifecycle: pre-wire applied remote in proposed
setup_proposed now lands a git remote named 'applied' on every
proposed/<n>/config pointing at /applied/<n>/.git — the path as
seen from inside the manager container, where the RO bind in
set_nspawn_flags makes the URL resolve. From the manager:

  git fetch applied
  git log applied/main
  git show applied/refs/tags/deployed/<id>
  git diff applied/main HEAD
  git rebase applied/main

all work without manually constructing the path each time. The
RO bind blocks push at the kernel level so the remote can only
fetch. Idempotent — also applied to pre-existing proposed repos
(no-op if the remote is already correct, set-url if drifted)
so the startup migration picks up the wiring on existing
agents.
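
the idempotent add / set-url / no-op decision as a sketch (RemoteAction, applied_remote_action and remote_args are hypothetical). the current URL can be read with `git remote get-url applied`, which exits non-zero when the remote doesn't exist:

```rust
/// What setup_proposed needs to do about the 'applied' remote.
#[derive(Debug, PartialEq)]
enum RemoteAction {
    Keep,
    Add,
    SetUrl,
}

fn applied_remote_action(current_url: Option<&str>, wanted: &str) -> RemoteAction {
    match current_url {
        Some(url) if url == wanted => RemoteAction::Keep,
        Some(_) => RemoteAction::SetUrl, // drifted
        None => RemoteAction::Add,       // never wired
    }
}

/// git argv for the action (run inside proposed/<n>/config); None = no-op.
fn remote_args(action: &RemoteAction, wanted: &str) -> Option<Vec<String>> {
    let verb = match action {
        RemoteAction::Keep => return None,
        RemoteAction::Add => "add",
        RemoteAction::SetUrl => "set-url",
    };
    Some(["remote", verb, "applied", wanted].iter().map(|s| s.to_string()).collect())
}
```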
2026-05-16 00:25:43 +02:00
müde
3d14ddeb7d lifecycle: bind /meta RO into manager
set_nspawn_flags now adds a third manager-only bind alongside
/agents (RW) and /applied (RO): --bind-ro=/var/lib/hyperhive/meta:/meta.
manager can git log /meta to see every deploy across the
swarm and cat /meta/flake.lock to introspect which sha each agent
is currently pinned at. defensive create_dir_all on the host
side so a cold start with no agents (meta repo not yet seeded)
doesn't trip systemd-nspawn's missing-bind-source check before
the migration plants the dir.
2026-05-16 00:24:39 +02:00
müde
92822efe16 meta: new hive-c0re module owns /var/lib/hyperhive/meta/
leaf module with no runtime callers yet (every public item is
#[allow(dead_code)] until lifecycle / actions / auto_update
rewire to use it). API surface:

- sync_agents — idempotent: render flake.nix for the given
  agent set, git-init on first call, nix flake lock, commit if
  anything changed.
- prepare_deploy / finalize_deploy / abort_deploy — two-phase
  for the request_apply_commit path. prepare runs nix flake
  lock --update-input agent-<n> without committing; finalize
  commits with a 'deploy <n> deployed/<id> <sha12>' message;
  abort git-restores the lock so a failed build leaves no
  orphan commit.
- lock_update_hyperhive — one-shot for the auto-update path.
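An in-memory model of the two-phase deploy contract, assuming the lock contents and a commit log can stand in for the real nix flake lock and git calls. `MetaRepo` and its fields are hypothetical; only the method names and the commit message shape come from the list above.

```rust
/// Hypothetical in-memory stand-in for the meta repo.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MetaRepo {
    /// flake.lock as last committed.
    pub committed_lock: String,
    /// flake.lock in the working tree.
    pub working_lock: String,
    /// Commit messages, oldest first.
    pub log: Vec<String>,
}

impl MetaRepo {
    pub fn new(lock: &str) -> Self {
        Self { committed_lock: lock.into(), working_lock: lock.into(), log: Vec::new() }
    }

    /// prepare: rewrite the lock in the working tree only (stands in for
    /// `nix flake lock --update-input agent-<n>`); nothing is committed.
    pub fn prepare_deploy(&mut self, new_lock: &str) {
        self.working_lock = new_lock.to_string();
    }

    /// finalize: commit with 'deploy <n> deployed/<id> <sha12>'.
    pub fn finalize_deploy(&mut self, agent: &str, id: u64, sha: &str) {
        let sha12: String = sha.chars().take(12).collect();
        self.log.push(format!("deploy {agent} deployed/{id} {sha12}"));
        self.committed_lock = self.working_lock.clone();
    }

    /// abort: restore the committed lock so a failed build leaves no
    /// orphan commit.
    pub fn abort_deploy(&mut self) {
        self.working_lock = self.committed_lock.clone();
    }
}
```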

flake.nix template defines mkAgent that pulls each agent's
nixosModules.default from its input and wraps with the
identity / HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT
module, which setup_applied used to generate inline. nix
invocations pass --extra-experimental-features as a safeguard
in case flakes aren't enabled in nix.conf.
2026-05-16 00:22:37 +02:00
müde
5b5a93e0c6 lifecycle: module-only agent flake.nix, tracked in proposed
setup_proposed now seeds both agent.nix (a regular NixOS module
function) and flake.nix (boilerplate exporting nixosModules.default
= import ./agent.nix) into the manager-editable proposed repo,
committed together. setup_applied's hyperhive_flake + dashboard
port wrapper generation is deleted entirely; the meta flake at
/var/lib/hyperhive/meta/ now owns the wrapper module.
setup_applied just fetches proposed's main on first spawn and tags
deployed/0; subsequent rebuilds touch nothing in applied that
the manager didn't author. spawn + rebuild keep their old param
list with the now-unused hyperhive_flake + dashboard_port
underscored — call sites get cleaned up after the meta module
lands and consumes them.
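A sketch of the two seed files, assuming the simplest flake that satisfies "exports nixosModules.default = import ./agent.nix". The constants and `seed_proposed` are hypothetical, the real template may differ, and committing the files is left to git.

```rust
use std::io;
use std::path::Path;

/// Hypothetical boilerplate flake that re-exports the agent module.
pub const AGENT_FLAKE_NIX: &str = r#"{
  outputs = { self }: {
    nixosModules.default = import ./agent.nix;
  };
}
"#;

/// Hypothetical starting agent.nix: an empty, regular NixOS module function.
pub const AGENT_NIX: &str = "{ config, pkgs, ... }:\n{\n}\n";

/// Write both seed files into `dir`; the caller commits them together.
pub fn seed_proposed(dir: &Path) -> io::Result<()> {
    std::fs::create_dir_all(dir)?;
    std::fs::write(dir.join("flake.nix"), AGENT_FLAKE_NIX)?;
    std::fs::write(dir.join("agent.nix"), AGENT_NIX)?;
    Ok(())
}
```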
2026-05-16 00:10:06 +02:00
müde
e26143a412 dashboard: diff against applied/proposal/<id>, prefer fetched_sha
approval_diff now runs
git diff refs/heads/main..refs/tags/proposal/<id> against the
applied repo instead of cobbling together a single-file diff from
proposed. consequences: multi-file proposals show every change,
manager amendments in proposed cannot lie about what'll be
deployed, and no-op proposals render
an explicit '(proposal matches currently-deployed tree)'.
displayed sha prefers fetched_sha (hive-c0re-vouched) and
falls back to commit_ref only for the brief pre-fetch window.
unified_diff helper + similar dep dropped — git diff is the
source of truth now. dead-code allows on the lifecycle git
helpers + approvals.set_fetched_sha come off since all are
wired up. readme picks up the tag flow + /applied RO mount.
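A sketch of the three decisions approval_diff makes: which range to diff, how to render an empty diff, and which sha to display. The helper names are hypothetical, and running git diff itself is assumed to happen elsewhere.

```rust
/// Hypothetical: revision range diffed inside the applied repo.
pub fn approval_diff_range(id: u64) -> String {
    format!("refs/heads/main..refs/tags/proposal/{id}")
}

/// Hypothetical: empty git diff output means the proposal is a no-op.
pub fn render_diff(git_diff_stdout: &str) -> &str {
    if git_diff_stdout.trim().is_empty() {
        "(proposal matches currently-deployed tree)"
    } else {
        git_diff_stdout
    }
}

/// Hypothetical: prefer the hive-c0re-vouched fetched_sha; fall back to
/// the manager's commit_ref only during the pre-fetch window.
pub fn displayed_sha<'a>(fetched_sha: Option<&'a str>, commit_ref: &'a str) -> &'a str {
    fetched_sha.unwrap_or(commit_ref)
}
```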
2026-05-15 23:18:17 +02:00
müde
fc61cb9310 fmt: clippy doc_markdown backticks 2026-05-15 23:11:10 +02:00
müde
4a8204f035 lifecycle: bind /applied into manager read-only
set_nspawn_flags now adds --bind-ro=/var/lib/hyperhive/applied:/applied
for the manager container alongside the existing
/agents RW mount. manager can git-fetch deployed/failed/denied
tags out of /applied/<n>/.git to mirror them into its proposed
clones; the read-only bind means git plumbing inside the
container cannot corrupt the authoritative repos. picked up by
the next rebuild of hm1nd (no spawn-time change needed since
set_nspawn_flags runs on every spawn + rebuild).
2026-05-15 23:02:31 +02:00
müde
6cf66e23dc actions: deny plants annotated denied/<id> tag
apply-commit denials now leave a git object behind: tag
denied/<id> annotated with the operator's note (or empty body
if they didn't supply one) at proposal/<id> inside the applied
repo. rejected configs become first-class git history — git
show denied/<id> in the manager's applied.git mount yields the
tree the operator rejected plus the reason. helper event
carries the tag for parity with deployed/failed. spawn denials
fall through unannotated since they have no proposal commit.
deny becomes async (single git plumbing call); dashboard +
admin-socket callers grow .await.
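A sketch of the argv for planting the denial tag, assuming git tag is invoked directly and that an explicit empty -m is how the empty body is produced. `deny_tag_args` is hypothetical.

```rust
/// Hypothetical: argv (after `git`) for an annotated denied/<id> tag on
/// proposal/<id>. A missing operator note becomes an empty message.
pub fn deny_tag_args(id: u64, note: Option<&str>) -> Vec<String> {
    vec![
        "tag".into(),
        "-a".into(),
        "-m".into(),
        note.unwrap_or("").into(),
        format!("denied/{id}"),
        format!("proposal/{id}"),
    ]
}
```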
2026-05-15 23:01:22 +02:00
müde
315d4289c7 actions: tag-driven approve(ApplyCommit) flow
run_apply_commit walks the approval through the tag state
machine in applied: approved/<id> + building/<id> stamped
before the build, then git read-tree --reset to proposal/<id>
populates the working dir without moving HEAD. on rebuild
success deployed/<id> is planted and refs/heads/main
fast-forwards to the proposal. on failure failed/<id> is annotated
with the build error and the working tree resets back to main
so the agent stays evaluable. helper events Rebuilt +
ApprovalResolved both carry the terminal tag so the manager
can git-show the exact tree (and read the failure note from
an annotated tag) against its read-only applied.git mount.
finish_approval grows a terminal_tag param; spawn path passes
None. lifecycle::apply_commit deleted.
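A pure model of the tag walk for one approval, assuming the only inputs are the approval id and the build outcome. `BuildOutcome`, `apply_commit_tags` and `terminal_tag` are hypothetical names.

```rust
/// Hypothetical: outcome of the rebuild for an approved proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BuildOutcome {
    Success,
    Failure,
}

/// Hypothetical: tags planted in the applied repo, in order, and whether
/// refs/heads/main fast-forwards to proposal/<id>.
pub fn apply_commit_tags(id: u64, outcome: BuildOutcome) -> (Vec<String>, bool) {
    // Stamped before the build starts.
    let mut tags = vec![format!("approved/{id}"), format!("building/{id}")];
    match outcome {
        // deployed/<id> planted; main fast-forwards.
        BuildOutcome::Success => {
            tags.push(format!("deployed/{id}"));
            (tags, true)
        }
        // failed/<id> (annotated with the build error); main stays put and
        // the working tree resets back to it.
        BuildOutcome::Failure => {
            tags.push(format!("failed/{id}"));
            (tags, false)
        }
    }
}

/// Hypothetical: the terminal tag carried by the Rebuilt and
/// ApprovalResolved helper events.
pub fn terminal_tag(id: u64, outcome: BuildOutcome) -> String {
    let (mut tags, _) = apply_commit_tags(id, outcome);
    tags.pop().expect("the walk always plants at least one tag")
}
```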
2026-05-15 23:00:01 +02:00
müde
35b0edaf27 manager_server: fetch+tag at request_apply_commit submit
submit_apply_commit (1) queues the approval row, (2) git-fetches
the manager-supplied sha from proposed into applied and pins it as
refs/tags/proposal/<id>, (3) persists the resolved sha on the
row via approvals.set_fetched_sha. from this point on the
proposal is immutable from the manager's perspective: amends
or force-pushes in proposed do not change what hive-c0re will
build. fetch failures mark the row failed and surface the error
to the manager so a phantom pending entry can't linger.
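A sketch of the submit ordering, assuming the git fetch and the tag pin form a single fallible step injected as a closure. All names are hypothetical; only the step order and the failure handling come from the message.

```rust
/// Hypothetical approval row state.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RowStatus {
    Pending,
    Failed(String),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApprovalRow {
    pub id: u64,
    pub status: RowStatus,
    pub fetched_sha: Option<String>,
}

/// Hypothetical: `fetch_and_pin(sha, tag)` stands in for fetching `sha`
/// from proposed into applied and pinning it as `tag`; it returns the
/// resolved sha.
pub fn submit_apply_commit<F>(id: u64, sha: &str, fetch_and_pin: F) -> (ApprovalRow, Result<String, String>)
where
    F: FnOnce(&str, &str) -> Result<String, String>,
{
    // (1) queue the approval row.
    let mut row = ApprovalRow { id, status: RowStatus::Pending, fetched_sha: None };
    // (2) fetch + pin as refs/tags/proposal/<id>.
    let tag = format!("refs/tags/proposal/{id}");
    match fetch_and_pin(sha, tag.as_str()) {
        // (3) persist the resolved sha on the row.
        Ok(resolved) => {
            row.fetched_sha = Some(resolved.clone());
            (row, Ok(resolved))
        }
        // Fetch failed: mark the row failed so no phantom pending entry
        // lingers, and surface the error to the manager.
        Err(e) => {
            row.status = RowStatus::Failed(e.clone());
            (row, Err(e))
        }
    }
}
```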
2026-05-15 22:57:43 +02:00
müde
8cb8fcedad lifecycle: setup_applied seeds via fetch + tags deployed/0
new shape: applied is git-init'd at first spawn, fetches
proposed's initial commit into its main, tags deployed/0 there.
the wrapper flake.nix is regenerated on every spawn/rebuild
but no longer tracked, so apply churn vanishes and manager-authored
files in the proposal flow now survive untouched. setup_applied
gains an Option<&Path> for proposed (None on rebuild paths
that just refresh the flake). pre-overhaul applied dirs are
detected via the missing deployed/0 tag and bail loudly with
the destroy --purge migration hint. apply_commit is stubbed
with a clear error until the tag-driven approve flow lands.
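A sketch of the pre-overhaul detection, assuming the caller already has the applied repo's tag list (for example from git tag --list). `check_applied_layout` is hypothetical and the error text is illustrative.

```rust
/// Hypothetical: applied dirs from before the overhaul lack deployed/0,
/// so bail loudly and point at the destroy --purge migration.
pub fn check_applied_layout(agent: &str, tags: &[&str]) -> Result<(), String> {
    if tags.contains(&"deployed/0") {
        Ok(())
    } else {
        Err(format!(
            "applied repo for {agent} predates the tag overhaul (no deployed/0 tag); \
             migrate via destroy --purge"
        ))
    }
}
```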
2026-05-15 22:56:58 +02:00