Commit graph

59 commits

Author SHA1 Message Date
müde
411cf86632 nix fmt + rustfmt sweep 2026-05-17 01:40:28 +02:00
damocles
a6d1464071 refactor: per-agent state paths (/agents/{label}/state), centralize in paths.rs 2026-05-16 15:18:32 +02:00
damocles
ecaa178199 refactor: compute per-agent mount points for /agents/<name>/ structure 2026-05-16 15:18:19 +02:00
damocles
37e56af6ba add /shared mount: new shared directory accessible to all agents 2026-05-16 13:42:41 +02:00
müde
7276e6d5d9 git identity: shorten to 'c0re' across all helpers
lifecycle::GIT_{NAME,EMAIL}, meta::GIT_{NAME,EMAIL}, and the
inline strings migrate.rs uses for its bootstrap commits all
move from 'hive-c0re' / 'hive-c0re@hyperhive' to 'c0re' /
'c0re@hyperhive'. shows up shorter in git log everywhere
(applied + meta repos).
2026-05-16 03:02:44 +02:00
müde
8336017eda lifecycle: annotated tags need a tagger identity
git_tag_annotated planted failed/<id> + denied/<id> as
annotated tags via 'git tag -a' — which produces a git
object and therefore needs user.name + user.email. without a
global git config on the host that fell through to
'fatal: unable to auto-detect email address (got
root@muede-lpt2.(none))' and the tag never landed.

pass the hive-c0re identity inline with -c user.name=… -c
user.email=… (same shape git_commit already uses), so the
applied repo's deny/failure audit tags get planted reliably
without depending on the host user's git config.
2026-05-16 03:00:44 +02:00
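A minimal sketch of the inline-identity fix described in 8336017eda. The helper name `annotated_tag_command` and its parameters are hypothetical; the log does not show the real `git_tag_annotated` signature. The sketch only builds the command and passes the message with `-m` for brevity, whereas the log says the real helper feeds tag bodies via stdin.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) a `git tag -a` invocation that carries its own
/// tagger identity, so it does not depend on the host user's git config.
/// Hypothetical helper: name and parameters are illustrative only.
fn annotated_tag_command(
    repo: &Path,
    tag: &str,
    message: &str,
    name: &str,
    email: &str,
) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("-C")
        .arg(repo)
        // `-c key=value` overrides git config for this one invocation only.
        .arg("-c")
        .arg(format!("user.name={name}"))
        .arg("-c")
        .arg(format!("user.email={email}"))
        .args(["tag", "-a", tag, "-m", message]);
    cmd
}
```

Calling `.status()` on the returned `Command` would run it; the identity flags must come before the `tag` subcommand because `-c` is an option of `git` itself.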
müde
c92108a11c lifecycle: fetch into checked-out main with --update-head-ok
setup_applied does `git init --initial-branch=main` then
`git fetch <proposed> main:refs/heads/main` to seed the
applied repo with proposed's initial commit. git's default
safeguard refuses to fetch into the currently-checked-out
branch, even though the working tree is empty (we just init'd).
add --update-head-ok to bypass — the read-tree-reset
immediately after fetches the right state, so the safeguard
the flag bypasses isn't relevant here anyway.

repro from the user: spawn of 'dmatrix' failed with
  fatal: refusing to fetch into branch 'refs/heads/main'
  checked out at '/var/lib/hyperhive/applied/dmatrix'
2026-05-16 02:58:34 +02:00
müde
6f1b664c85 lifecycle: stream nixos-container stdout/stderr line-by-line
run() previously buffered the child's output via .output() and
only logged at exit — a multi-minute 'nixos-container update'
(typical on a fresh hyperhive bump) showed nothing in journald
until the very end. operator watching 'journalctl -u hive-c0re
-f' couldn't tell 'slow nix build' from 'wedged daemon'.

new shape: spawn with piped stdio, pump each line into tracing
as it arrives (stdout → INFO, stderr → WARN), keep a tail of
the last 32 stderr lines for the bail message so the eventual
'failed (status 2)' still carries the actual nix eval error.
target field 'nixos-container', argv-equivalent attached via
the 'cmdline' field so filtering by subcommand works.
2026-05-16 02:57:16 +02:00
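The line-by-line pumping from 6f1b664c85 could look roughly like the sketch below. `Tail` and `pump` are hypothetical names, and the `log` callback stands in for the tracing call; the real `run()` is not shown in the log. The sketch works on any `BufRead`, so it can be exercised without spawning a child.

```rust
use std::collections::VecDeque;
use std::io::{self, BufRead};

/// Keeps only the most recent `cap` lines (the log mentions a 32-line tail).
struct Tail {
    cap: usize,
    lines: VecDeque<String>,
}

impl Tail {
    fn new(cap: usize) -> Self {
        Tail { cap, lines: VecDeque::with_capacity(cap) }
    }

    fn push(&mut self, line: String) {
        if self.cap == 0 {
            return; // nothing to retain
        }
        if self.lines.len() == self.cap {
            self.lines.pop_front(); // drop the oldest line
        }
        self.lines.push_back(line);
    }

    /// Text to attach to the eventual failure message.
    fn joined(&self) -> String {
        self.lines.iter().map(String::as_str).collect::<Vec<_>>().join("\n")
    }
}

/// Forward each line to `log` as soon as it is read, remembering the tail.
fn pump<R: BufRead>(reader: R, tail: &mut Tail, mut log: impl FnMut(&str)) -> io::Result<()> {
    for line in reader.lines() {
        let line = line?;
        log(&line);
        tail.push(line);
    }
    Ok(())
}
```

With a real child process, one `pump` per stream would typically run on its own thread over `BufReader::new(child.stderr.take().unwrap())`, so stdout and stderr cannot block each other.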
müde
3db33b0fe5 agent flake.nix: forward inputs as flakeInputs module arg
new boilerplate wraps agent.nix as a sub-module + passes every
flake input (minus self) through to it via
_module.args.flakeInputs. manager edits the inputs block of
flake.nix to pull in
out-of-tree flakes (MCP servers etc.) and references them in
agent.nix as flakeInputs.<name>.packages.${pkgs.system}.default
— the new input's pinned sha lands in the agent's own
flake.lock (already tracked + part of the proposal flow), and
transitively rolls up into meta's lock.

migrate's MODULE_FLAKE_MARKER swaps to _module.args.flakeInputs
so existing agents on the old 'nixosModules.default = import
./agent.nix' template get re-rendered onto the new shape on
next hive-c0re start.

manager_server's flake.nix tamper-check goes away — the build
path's failed/<id> annotated tag already provides the safety
net when a manager edit breaks the flake; enforcing 'no
flake.nix edits at all' was overly strict (blocks the inputs-
addition pattern that's the whole point of this change).

manager prompt updated with a worked example for adding an
MCP-server flake input + wiring it through agent.nix.
2026-05-16 02:23:43 +02:00
müde
50ef806266 operator pronouns: configurable free-text, threaded into prompts
new NixOS module option services.hive-c0re.operatorPronouns
(free text, default 'she/her', example 'they/them'). hive-c0re
takes it as a CLI flag (--operator-pronouns, lib.escapeShellArg'd
in the systemd unit), stores it on Coordinator, threads it into
the meta flake's mkAgent so each agent's systemd service gets
HIVE_OPERATOR_PRONOUNS set. the harness reads the env at boot
and substitutes {operator_pronouns} into the agent / manager
system prompt alongside {label}. nix string is escaped against
backslash + double-quote so non-ascii / quoted values
round-trip safely. prompt addendum: both agent.md and
manager.md mention the operator's pronouns up front so claude
uses them naturally in third-person reference. propagates on
next ↻ R3BU1LD (meta lock bump, no per-agent approval).
2026-05-16 02:05:22 +02:00
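The escaping mentioned in 50ef806266 can be sketched as follows. `nix_string_literal` is a hypothetical name, and the sketch covers only the two characters the commit message names, backslash and double quote.

```rust
/// Render `value` as a double-quoted Nix string literal, escaping
/// backslash and double quote. Hypothetical helper for illustration.
fn nix_string_literal(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            // Everything else, including non-ASCII, passes through unchanged.
            other => out.push(other),
        }
    }
    out.push('"');
    out
}
```

Nix also treats `${` inside double-quoted strings as interpolation, which this sketch does not handle; the commit message does not say whether the real code does.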
müde
14aa7c7acc final docs + cleanup sync for meta-flake era
claude.md flips 'in flight' → 'just landed' for the meta
overhaul + extends the file map with meta.rs and migrate.rs.
docs/approvals.md replaces the in-flight callout with a
proper 'Meta flake' section (two-phase deploy walkthrough,
sync_agents semantics, single-phase variants), updates the
two-repo box diagram to include the /var/lib/hyperhive/meta/
tree and tracks flake.nix in applied, rewrites the
container --flake reference to meta#<name>, replaces the
'Manager view of applied' section with a unified
'/agents + /applied + /meta' inventory listing every useful
git incantation, and explains the in-place no-state-loss
migration that now runs on hive-c0re startup.
docs/persistence.md grows entries for the meta repo + the
.meta-migration-done marker. readme box diagram picks up the
/meta RO bind; approval-flow paragraph rewritten end to end
to describe the meta lock dance.

lifecycle::flake_base deleted — the meta render hardcodes
the manager vs agent-base choice as nix expression.
2026-05-16 00:40:06 +02:00
müde
59a89314f0 startup auto-migration from pre-meta layout
new migrate module runs before auto_update on hive-c0re boot.
four idempotent phases:

1. for every applied/<n>/ whose flake.nix isn't already the
   module-only boilerplate, rewrite + commit + relocate
   deployed/0 to HEAD so setup_applied's existence check passes
2. for every proposed/<n>/config without an 'applied' remote,
   wire it (delegates to setup_proposed which is now
   idempotent and adds the remote itself)
3. meta::sync_agents over the current container list — inits
   the meta repo on first call, rerender + relock if drifted
4. nixos-container update <c> --flake meta#<name> for every
   container, guarded by /var/lib/hyperhive/.meta-migration-done
   so phase 4's expensive eval only runs once across restarts

env kill-switch HIVE_SKIP_META_MIGRATION=1 defers the whole
thing. each agent's failure is logged + skipped so one broken
agent doesn't block the rest. runs ahead of ensure_manager so
the manager auto-spawn comes up against meta from the first
attempt.
2026-05-16 00:34:58 +02:00
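The marker-file guard around phase 4 of 59a89314f0 might be shaped like this sketch. `run_once` and `Outcome` are hypothetical, the kill-switch is passed in as a plain `bool` instead of being read from the environment, and writing the marker only after success is an assumption the log does not confirm.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Outcome {
    Skipped,     // kill-switch set, nothing attempted
    AlreadyDone, // marker present from an earlier start
    Ran,         // expensive step executed and marker written
}

/// Run `expensive` at most once across restarts, recording success in `marker`.
fn run_once(
    marker: &Path,
    skip: bool,
    expensive: impl FnOnce() -> io::Result<()>,
) -> io::Result<Outcome> {
    if skip {
        return Ok(Outcome::Skipped);
    }
    if marker.exists() {
        return Ok(Outcome::AlreadyDone);
    }
    expensive()?;
    // Assumption: the marker is written only after success, so a failed run
    // is retried on the next start.
    fs::write(marker, "")?;
    Ok(Outcome::Ran)
}
```

A caller could derive `skip` with `std::env::var("HIVE_SKIP_META_MIGRATION").map_or(false, |v| v == "1")` and pass `/var/lib/hyperhive/.meta-migration-done` as the marker.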
müde
06fdbac1ac actions::run_apply_commit through meta two-phase
approval-driven deploys now walk the meta flake via
prepare_deploy / finalize_deploy / abort_deploy so a failed
build leaves no commit in meta's deploy log:

1. capture applied/main sha for rollback
2. tag approved/<id> + building/<id>
3. ff applied/main to proposal/<id>, read-tree sync working tree
4. meta::prepare_deploy(name) — nix flake lock --update-input
   agent-<n> without committing
5. lifecycle::rebuild_no_meta — container-level only (new
   extracted helper; public lifecycle::rebuild still wraps it
   with single-phase meta sync + commit for dashboard /
   auto_update callers that don't care about rollback)
6a. on success: tag deployed/<id>, meta::finalize_deploy commits
    the staged lock with 'deploy <n> deployed/<id> <sha12>'
6b. on failure: tag failed/<id> annotated with the build error,
    git_update_ref applied/main back to prev sha, read-tree to
    main, meta::abort_deploy git-restores flake.lock

meta's git log now records only successful deploys; failures
+ denials still live in applied as annotated tags.
2026-05-16 00:32:16 +02:00
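The two-phase shape in 06fdbac1ac (prepare, then finalize or abort) can be illustrated with an in-memory model. `Meta` here is a stand-in struct, not the real module; its fields and the `deploy` wrapper are assumptions made so the invariant "a failed build leaves no commit in meta's log" can be checked without git or nix.

```rust
/// In-memory stand-in for the meta repo: the committed lock, an optional
/// staged (uncommitted) lock, and the commit log.
#[derive(Default)]
struct Meta {
    committed_lock: String,
    staged_lock: Option<String>,
    log: Vec<String>,
}

impl Meta {
    /// Stage a new lock without committing it.
    fn prepare_deploy(&mut self, new_lock: &str) {
        self.staged_lock = Some(new_lock.to_string());
    }

    /// Commit the staged lock, if any, with the given message.
    fn finalize_deploy(&mut self, message: &str) {
        if let Some(lock) = self.staged_lock.take() {
            self.committed_lock = lock;
            self.log.push(message.to_string());
        }
    }

    /// Throw the staged lock away, leaving the committed state untouched.
    fn abort_deploy(&mut self) {
        self.staged_lock = None;
    }
}

/// Stage, build, then commit on success or roll back on failure.
fn deploy(
    meta: &mut Meta,
    new_lock: &str,
    message: &str,
    build: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    meta.prepare_deploy(new_lock);
    match build() {
        Ok(()) => {
            meta.finalize_deploy(message);
            Ok(())
        }
        Err(err) => {
            meta.abort_deploy();
            Err(err)
        }
    }
}
```

Because the commit happens only in `finalize_deploy`, the log records successful deploys alone, which matches the last paragraph of the commit message.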
müde
22f35def8f actions::destroy syncs meta after lifecycle
once nixos-container destroy lands + per-agent state cleanup is
done, rerender the meta flake from the remaining containers so
the destroyed agent's input + nixosConfiguration drop off and
its flake.lock entry vanishes. log + keep going on meta-sync
failure — the destroy already succeeded at the lifecycle level,
so meta drift here is just bookkeeping. new public
lifecycle::agents_for_meta_listing exposes the agent
enumeration for callers outside the module.
2026-05-16 00:29:26 +02:00
müde
4cb529351e lifecycle::rebuild through meta
rebuild now does sync_agents (idempotent — no-op when the
rendered flake matches disk; recovers from a divergent meta
repo on the side) followed by lock_update_for_rebuild which
relocks just this agent's input and commits the lock change
if any. flake ref for nixos-container update flips from
applied/<n>#default to meta#<name>. new helper
meta::lock_update_for_rebuild is single-phase (no separate
finalize): rebuild has no failure-revert semantics — it always
wants the latest applied/<n>/main. spawn already syncs meta
before container create; rebuild now picks up the meta side
on every manual ↻ R3BU1LD.
2026-05-16 00:28:26 +02:00
müde
8f94e4379a lifecycle::spawn through meta
after setup_proposed + setup_applied, spawn now syncs the meta
flake (one input + one nixosConfiguration per agent) so
`--flake /var/lib/hyperhive/meta#<name>` resolves before
nixos-container create runs. flake ref switches from
applied/<n>#default to meta#<name>; the wrapper modules
(identity, HIVE_PORT, HIVE_LABEL, HIVE_DASHBOARD_PORT) now
live in the meta flake's mkAgent. new helper agents_for_meta
builds the AgentSpec list by enumerating containers + optionally
appending a not-yet-present name for the spawn case. spawn
keeps its caller signature; rebuild + auto_update get wired up
in follow-up commits.
2026-05-16 00:27:12 +02:00
müde
c42ad1330c lifecycle: pre-wire applied remote in proposed
setup_proposed now lands a git remote named 'applied' on every
proposed/<n>/config pointing at /applied/<n>/.git — the path as
seen from inside the manager container, where the RO bind in
set_nspawn_flags makes the URL resolve. From the manager:

  git fetch applied
  git log applied/main
  git show applied/refs/tags/deployed/<id>
  git diff applied/main HEAD
  git rebase applied/main

all work without manually constructing the path each time. The
RO bind blocks push at the kernel level so the remote can only
fetch. Idempotent — also applied to pre-existing proposed repos
(no-op if the remote is already correct, set-url if drifted)
so the startup migration picks up the wiring on existing
agents.
2026-05-16 00:25:43 +02:00
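The idempotent remote wiring in c42ad1330c (no-op if correct, set-url if drifted, add if missing) reduces to a small decision function. `RemoteAction`, `plan_remote`, and `git_args` are hypothetical names; only the `git remote add` and `git remote set-url` subcommands are real.

```rust
const REMOTE: &str = "applied";

#[derive(Debug, PartialEq)]
enum RemoteAction {
    Add,
    SetUrl,
    Nothing,
}

/// Decide what to do given the remote's current URL (None if it is absent).
fn plan_remote(current_url: Option<&str>, wanted_url: &str) -> RemoteAction {
    match current_url {
        None => RemoteAction::Add,
        Some(url) if url == wanted_url => RemoteAction::Nothing,
        Some(_) => RemoteAction::SetUrl,
    }
}

/// Arguments to pass to `git` for the chosen action, or None for a no-op.
fn git_args(action: &RemoteAction, wanted_url: &str) -> Option<Vec<String>> {
    let verb = match action {
        RemoteAction::Add => "add",
        RemoteAction::SetUrl => "set-url",
        RemoteAction::Nothing => return None,
    };
    Some(vec![
        "remote".to_string(),
        verb.to_string(),
        REMOTE.to_string(),
        wanted_url.to_string(),
    ])
}
```

Running the planner twice with the same wanted URL yields `Nothing` the second time, which is what makes it safe to call from both `setup_proposed` and the startup migration.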
müde
3d14ddeb7d lifecycle: bind /meta RO into manager
set_nspawn_flags now adds a third manager-only bind alongside
/agents (RW) and /applied (RO):
--bind-ro=/var/lib/hyperhive/meta:/meta. manager can git log /meta
to see every deploy across the
swarm and cat /meta/flake.lock to introspect which sha each agent
is currently pinned at. defensive create_dir_all on the host
side so a cold start with no agents (meta repo not yet seeded)
doesn't trip systemd-nspawn's missing-bind-source check before
the migration plants the dir.
2026-05-16 00:24:39 +02:00
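A rough sketch of the read-only binds plus the defensive `create_dir_all` from 3d14ddeb7d. `manager_ro_binds` is a hypothetical helper, and taking the state root as a parameter is an assumption made so the sketch can run against a scratch directory; the real `set_nspawn_flags` is not shown in the log.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Build the manager-only read-only bind flags, creating each host-side
/// source first so a cold start cannot fail on a missing bind source.
fn manager_ro_binds(root: &Path) -> io::Result<Vec<String>> {
    let mut flags = Vec::new();
    for name in ["applied", "meta"] {
        let src = root.join(name);
        // Idempotent: succeeds if the directory already exists.
        fs::create_dir_all(&src)?;
        flags.push(format!("--bind-ro={}:/{}", src.display(), name));
    }
    Ok(flags)
}
```

With `root` set to `/var/lib/hyperhive`, this yields the two `--bind-ro=` flags quoted in the commit messages.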
müde
92822efe16 meta: new hive-c0re module owns /var/lib/hyperhive/meta/
leaf module with no runtime callers yet (every public item is
#[allow(dead_code)] until lifecycle / actions / auto_update
rewire to use it). API surface:

- sync_agents — idempotent: render flake.nix for the given
  agent set, git-init on first call, nix flake lock, commit if
  anything changed.
- prepare_deploy / finalize_deploy / abort_deploy — two-phase
  for the request_apply_commit path. prepare runs nix flake
  lock --update-input agent-<n> without committing; finalize
  commits with a 'deploy <n> deployed/<id> <sha12>' message;
  abort git-restores the lock so a failed build leaves no
  orphan commit.
- lock_update_hyperhive — one-shot for the auto-update path.

flake.nix template defines mkAgent that pulls each agent's
nixosModules.default from its input and wraps with the
identity / HIVE_PORT / HIVE_LABEL / HIVE_DASHBOARD_PORT
module — what setup_applied used to generate inline. nix
invocations carry --extra-experimental-features as a belt
in case flakes aren't enabled in nix.conf.
2026-05-16 00:22:37 +02:00
müde
5b5a93e0c6 lifecycle: module-only agent flake.nix, tracked in proposed
setup_proposed now seeds both agent.nix (a regular NixOS module
function) and flake.nix (boilerplate exporting nixosModules.default
= import ./agent.nix) into the manager-editable proposed repo,
committed together. setup_applied's hyperhive_flake + dashboard
port wrapper generation is deleted entirely — the meta flake at
/var/lib/hyperhive/meta/ now owns the wrapper module.
setup_applied just fetches proposed's main on first spawn and tags
deployed/0; subsequent rebuilds touch nothing in applied that
the manager didn't author. spawn + rebuild keep their old param
list with the now-unused hyperhive_flake + dashboard_port
underscored — call sites get cleaned up after the meta module
lands and consumes them.
2026-05-16 00:10:06 +02:00
müde
e26143a412 dashboard: diff against applied/proposal/<id>, prefer fetched_sha
approval_diff now runs git diff
refs/heads/main..refs/tags/proposal/<id> against the applied
repo instead of cobbling a
single-file diff from proposed. consequences: multi-file
proposals show every change, manager amendments in proposed
cannot lie about what'll be deployed, no-op proposals render
an explicit '(proposal matches currently-deployed tree)'.
displayed sha prefers fetched_sha (hive-c0re-vouched) and
falls back to commit_ref only for the brief pre-fetch window.
unified_diff helper + similar dep dropped — git diff is the
source of truth now. dead-code allows on the lifecycle git
helpers + approvals.set_fetched_sha come off since all are
wired up. readme picks up the tag flow + /applied RO mount.
2026-05-15 23:18:17 +02:00
müde
fc61cb9310 fmt: clippy doc_markdown backticks 2026-05-15 23:11:10 +02:00
müde
4a8204f035 lifecycle: bind /applied into manager read-only
set_nspawn_flags now adds
--bind-ro=/var/lib/hyperhive/applied:/applied for the manager
container alongside the existing
/agents RW mount. manager can git-fetch deployed/failed/denied
tags out of /applied/<n>/.git to mirror them into its proposed
clones; the read-only bind means git plumbing inside the
container cannot corrupt the authoritative repos. picked up by
the next rebuild of hm1nd (no spawn-time change needed since
set_nspawn_flags runs on every spawn + rebuild).
2026-05-15 23:02:31 +02:00
müde
315d4289c7 actions: tag-driven approve(ApplyCommit) flow
run_apply_commit walks the approval through the tag state
machine in applied: approved/<id> + building/<id> stamped
before the build, then git read-tree --reset to proposal/<id>
populates the working dir without moving HEAD. on rebuild
success deployed/<id> is planted and refs/heads/main fast-
forwards to the proposal. on failure failed/<id> is annotated
with the build error and the working tree resets back to main
so the agent stays evaluable. helper events Rebuilt +
ApprovalResolved both carry the terminal tag so the manager
can git-show the exact tree (and read the failure note from
an annotated tag) against its read-only applied.git mount.
finish_approval grows a terminal_tag param; spawn path passes
None. lifecycle::apply_commit deleted.
2026-05-15 23:00:01 +02:00
müde
8cb8fcedad lifecycle: setup_applied seeds via fetch + tags deployed/0
new shape: applied is git-init'd at first spawn, fetches
proposed's initial commit into its main, tags deployed/0 there.
the wrapper flake.nix is regenerated on every spawn/rebuild
but no longer tracked — apply churn vanishes, manager-authored
files in the proposal flow now survive untouched. setup_applied
gains an Option<&Path> for proposed (None on rebuild paths
that just refresh the flake). pre-overhaul applied dirs are
detected via the missing deployed/0 tag and bail loudly with
the destroy --purge migration hint. apply_commit is stubbed
with a clear error until the tag-driven approve flow lands.
2026-05-15 22:56:58 +02:00
müde
63ef69674b lifecycle: git helpers for tag-driven applied repo
new plumbing for the upcoming flow: git_fetch_to_tag (pulls a
sha from proposed into applied and pins it as a tag in one
shot), git_rev_parse (normalises shas + reads back tag
targets), git_tag / git_tag_annotated (lightweight vs body-
carrying for failed/denied), git_read_tree_reset (replace
working tree without moving HEAD — lets main stay on last
known-good across an in-flight build), git_update_ref (ff main
on deploy). annotated tag bodies go via stdin to avoid escape
games. all dead-code-allowed; callers land in subsequent
commits.
2026-05-15 22:52:23 +02:00
müde
6a2ffd521b surface agent-vs-agent port collisions (manager:8000 can't collide)
manager is fixed at 8000, sub-agents are 8100-8999, so collisions
are strictly between two sub-agents hashing to the same value.
the colliding container's harness restart-loops on AddrInUse —
which the user just hit on :8945. previously the only sign was a
buried journalctl warn line.

now surfaced two ways:

- lifecycle::spawn / rebuild preflight: walks the live container
  list, computes each agent's hashed port, refuses with
  'port N already taken by <other> — rename one of them' if any
  running sub-agent shares the new agent's port. so the operator
  sees an actionable error in the dashboard's transient pill /
  approve-result instead of waiting for the harness to die.

- /api/state grows a port_conflicts: [{port, agents: [...]}]
  array; dashboard renders a pulsing red banner above the
  containers list listing each cluster. matches the questions
  panel pulse so it's hard to miss.
2026-05-15 22:08:19 +02:00
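The `port_conflicts` array from 6a2ffd521b amounts to grouping agents by port and keeping groups larger than one. The sketch below assumes ports are already computed and uses a hypothetical `PortConflict` struct that mirrors the `{port, agents: [...]}` shape.

```rust
use std::collections::BTreeMap;

/// One cluster of agents sharing a port.
#[derive(Debug, PartialEq)]
struct PortConflict {
    port: u16,
    agents: Vec<String>,
}

/// Group `(name, port)` pairs by port and keep only the ports claimed by
/// more than one agent. BTreeMap keeps the output ordered by port.
fn port_conflicts(agents: &[(&str, u16)]) -> Vec<PortConflict> {
    let mut by_port: BTreeMap<u16, Vec<String>> = BTreeMap::new();
    for (name, port) in agents {
        by_port.entry(*port).or_default().push(name.to_string());
    }
    by_port
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .map(|(port, agents)| PortConflict { port, agents })
        .collect()
}
```

The same grouping could back the spawn / rebuild preflight: if the new agent's port appears in any cluster, refuse with the rename hint.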
müde
acaa0eb895 agent_web_port: back to pure hash, drop port-file dance
operator's call: probing-forward + state-file machinery is more
brittle than the bug it tried to fix. revert to the original
deterministic FNV-1a hash mod 900. collisions are real but rare;
operator resolves by renaming (different name → different hash) and
rebuilding. no per-agent port file, no scan, no migration path,
nothing to drift out of sync with the running container.

existing port files on disk are silently ignored — operator
rebuilds affected agents to regenerate flakes from the deterministic
hash.
2026-05-15 21:17:31 +02:00
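A sketch of the deterministic port rule that acaa0eb895 reverts to. The 32-bit FNV-1a variant is an assumption (the log does not say which width is used), and the base of 8100 is inferred from the 8100-8999 sub-agent range mentioned elsewhere in the log.

```rust
// In the prelude since the 2021 edition; the explicit import keeps older
// editions compiling.
use std::convert::TryFrom;

const PORT_BASE: u16 = 8100;
const PORT_SLOTS: u32 = 900;

/// 32-bit FNV-1a over the UTF-8 bytes of `name`.
fn fnv1a_32(name: &str) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for byte in name.bytes() {
        hash ^= u32::from(byte);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Map an agent name to a port in 8100..=8999. Same name, same port.
fn agent_web_port(name: &str) -> u16 {
    let slot = fnv1a_32(name) % PORT_SLOTS;
    // slot < 900, so the conversion cannot fail; try_from avoids an as-cast.
    PORT_BASE + u16::try_from(slot).expect("slot is below 900")
}
```

Since the port depends only on the name, renaming an agent is enough to move it to a different slot, which is the resolution path the commit message describes.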
müde
c35f566d15 agent_web_port: actually resolve legacy collisions
previous attempt was wrong: the legacy branch returned port_hash
unconditionally, so two legacies hashing to the same port both
wrote that port and the collision persisted (test still trying to
bind coder's port).

new rule: always probe forward from port_hash, with scan_taken_ports
parameterised by include_implicit_hashes:

- legacy migration (applied dir exists, no port file): pass false.
  scan only counts other agents' port files. first-queried legacy
  claims its hash; subsequent colliders see the first's port file
  and probe forward. we don't know which legacy originally won
  the bind race, so first-write-wins; the loser was already
  crash-looping anyway and gets a fresh port to rebuild to.

- fresh spawn (no applied dir): pass true. counts port files AND
  implicit hashes for not-yet-migrated legacies, so a new spawn
  doesn't race with an unmigrated peer.

migration note for affected users: agents whose port file says
something other than their hashed port may have been corrupted by
the previous fix. Hit ↻ R3BU1LD on the offender to regenerate the
flake (uses the current port file) and the container will bind
the right port on restart.
2026-05-15 21:13:17 +02:00
müde
6db38cf70c model: runtime override via /model slash; fixes for port + bind
- runtime model override: Bus::{model,set_model} + POST /api/model
  (form-encoded {model: name}). turn.rs reads bus.model() per turn
  so a flip lands on the next claude invocation. /api/state grows
  a model field; agent page shows a 'model · <name>' chip in the
  state row. '/model <name>' slash command POSTs to the endpoint
  and refreshes state.

- port regression fix: agent_web_port no longer probes forward for
  *existing* agents (the previous fix shifted ports for any agent
  without a port file, including legacy ones whose container was
  already bound to the bare hashed port — dashboard rendered the
  new port, container was still on the old one, conn errors). new
  rule: port file exists → use it; absent + applied flake present
  → legacy, persist port_hash without probing; absent + no applied
  flake → fresh spawn, probe forward.

- SO_REUSEADDR on both the dashboard and per-agent web UI binds
  via tokio::net::TcpSocket. operator hit 12 retries failing on
  manager :8000 — REUSEADDR handles the TIME_WAIT case cleanly
  without a new dep; retry still covers the genuine
  process-still-alive overlap.

todo: drops the model-override entry (shipped); adds two new
items — model persistence (optional, future), and custom
per-agent MCP tools (groundwork for moving bitburner-agent into
hyperhive).
2026-05-15 20:59:45 +02:00
müde
79a46f359a agent_web_port: collision-aware sticky allocation
operator hit 'coder' and 'test' colliding on the same hashed port —
fnv-1a mod 900 has ~0.1% collision probability per pair and clearly
that's not enough. agent_web_port goes stateful:

- per-agent port persisted to /var/lib/hyperhive/agents/<name>/port
- on first call, look up the file; if absent, hash, then probe
  forward through the allocated range skipping any port other
  agents already claim, then write the chosen value back
- subsequent calls return the persisted port (sticky)

other agents' ports come from their port file if present, else the
fallback is the hashed value — that handles existing deployments
without forcing a rebuild-all just to migrate. rebuilding the
colliding agent re-runs agent_web_port, sees its peer's implicit
hash port as taken, picks the next free slot, persists.

range exhaustion (very unlikely — 900 slots) logs a warning and
returns the hash; the bind-with-retry on the harness will surface
the failure honestly rather than silently looping.
2026-05-15 20:41:18 +02:00
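The probe-forward step of 79a46f359a could be sketched as below. `probe_forward` is a hypothetical name, the 8100 base is inferred from the sub-agent range in the log, and wrapping around at the end of the range is an assumption; the log only says "probe forward through the allocated range". Reading and writing the per-agent port file is left out.

```rust
use std::collections::HashSet;

const PORT_BASE: u16 = 8100;
const PORT_SLOTS: u16 = 900;

/// Starting at `hashed`, return the first port in the range that is not in
/// `taken`, wrapping at the end. If every slot is taken, fall back to
/// `hashed` and let the bind fail visibly.
fn probe_forward(hashed: u16, taken: &HashSet<u16>) -> u16 {
    // Normalise to a slot index; tolerates an out-of-range input.
    let start = hashed.saturating_sub(PORT_BASE) % PORT_SLOTS;
    for step in 0..PORT_SLOTS {
        // start and step are both below 900, so the sum fits in u16.
        let candidate = PORT_BASE + (start + step) % PORT_SLOTS;
        if !taken.contains(&candidate) {
            return candidate;
        }
    }
    hashed
}
```

The caller would build `taken` from other agents' port files, falling back to their hashed ports where no file exists, then persist the returned value.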
müde
ff8f8c7c56 per-agent /state dir for durable notes; manager sees them via /agents 2026-05-15 18:00:08 +02:00
müde
8428c693e0 dashboard: stop/restart per-container + update-all when any stale 2026-05-15 17:00:56 +02:00
müde
0f0e242906 programs.git.enable + harness PATH tracks systemPackages
- harness-base.nix: switch to programs.git for declarative gitconfig.
- agent + manager service path = /run/current-system/sw → agents pick up
  new packages from their own agent.nix without harness edits.
- generated applied/<name>/flake.nix overrides programs.git.config.user
  (no more raw etc.gitconfig collision).
2026-05-15 16:16:14 +02:00
müde
e1289a3e4c nix templates: factor harness-base.nix (shared scaffolding incl. gitconfig) 2026-05-15 16:10:55 +02:00
müde
f1fd787f17 rebuild button on agent UI (cross-origin POST to dashboard /rebuild) 2026-05-15 15:57:11 +02:00
müde
f99ed3fe7a manager: same lifecycle as agents; auto-spawn on hive-c0re start 2026-05-15 13:43:32 +02:00
müde
a42fdb3a5c phase 8 step 1: per-agent claude creds bind + destroy keeps state 2026-05-15 12:39:22 +02:00
müde
0fc287c768 fmt 2026-05-15 02:58:35 +02:00
müde
b711296460 destroy verb: CLI + admin socket + dashboard button; purges state + approvals 2026-05-15 02:57:22 +02:00
müde
fcd6563887 fmt 2026-05-15 02:02:20 +02:00
müde
07a5d3a778 lifecycle: clear HOST_ADDRESS/LOCAL_ADDRESS/HOST_BRIDGE — start script's --network-veth was forcing private netns 2026-05-15 01:51:12 +02:00
müde
59de7fa3c5 lifecycle: force PRIVATE_NETWORK=0 so per-agent web UI port reaches host 2026-05-15 01:35:30 +02:00
müde
ee99774d17 Phase 7d: per-container MemoryMax + CPUQuota via systemd drop-in 2026-05-15 00:30:48 +02:00
müde
7c1ed07cf2 lifecycle: HYPERHIVE_GIT env override (bypass PATH); module sets it 2026-05-15 00:24:51 +02:00
müde
6dbf4eedd7 lifecycle: u16::try_from instead of as-cast 2026-05-14 23:39:53 +02:00
müde
d0f954bbc1 Phase 6a: per-container web UI (axum); per-agent port hashed from name 2026-05-14 23:39:06 +02:00
müde
967ec7c9d7 fmt 2026-05-14 23:22:00 +02:00
müde
2fd80dbd68 Phase 5c: separate proposed (manager) and applied (hive-c0re) repos; per-agent gitconfig 2026-05-14 23:20:32 +02:00
müde
3c702cf43f fmt 2026-05-14 23:10:37 +02:00