Commit graph

36 commits

Author SHA1 Message Date
müde
315d4289c7 actions: tag-driven approve(ApplyCommit) flow
run_apply_commit walks the approval through the tag state
machine in applied: approved/<id> + building/<id> stamped
before the build, then git read-tree --reset to proposal/<id>
populates the working dir without moving HEAD. on rebuild
success deployed/<id> is planted and refs/heads/main fast-
forwards to the proposal. on failure failed/<id> is annotated
with the build error and the working tree resets back to main
so the agent stays evaluable. helper events Rebuilt +
ApprovalResolved both carry the terminal tag so the manager
can git-show the exact tree (and read the failure note from
an annotated tag) against its read-only applied.git mount.
finish_approval grows a terminal_tag param; spawn path passes
None. lifecycle::apply_commit deleted.
2026-05-15 23:00:01 +02:00
müde
8cb8fcedad lifecycle: setup_applied seeds via fetch + tags deployed/0
new shape: applied is git-init'd at first spawn, fetches
proposed's initial commit into its main, tags deployed/0 there.
the wrapper flake.nix is regenerated on every spawn/rebuild
but no longer tracked — apply churn vanishes, manager-authored
files in the proposal flow now survive untouched. setup_applied
gains an Option<&Path> for proposed (None on rebuild paths
that just refresh the flake). pre-overhaul applied dirs are
detected via the missing deployed/0 tag and bail loudly with
the destroy --purge migration hint. apply_commit is stubbed
with a clear error until the tag-driven approve flow lands.
2026-05-15 22:56:58 +02:00
müde
63ef69674b lifecycle: git helpers for tag-driven applied repo
new plumbing for the upcoming flow: git_fetch_to_tag (pulls a
sha from proposed into applied and pins it as a tag in one
shot), git_rev_parse (normalises shas + reads back tag
targets), git_tag / git_tag_annotated (lightweight vs body-
carrying for failed/denied), git_read_tree_reset (replace
working tree without moving HEAD — lets main stay on last
known-good across an in-flight build), git_update_ref (ff main
on deploy). annotated tag bodies go via stdin to avoid escape
games. all dead-code-allowed; callers land in subsequent
commits.
2026-05-15 22:52:23 +02:00
müde
6a2ffd521b surface agent-vs-agent port collisions (manager:8000 can't collide)
manager is fixed at 8000, sub-agents are 8100-8999, so collisions
are strictly between two sub-agents hashing to the same value.
the colliding container's harness restart-loops on AddrInUse —
which the user just hit on :8945. previously the only sign was a
buried journalctl warn line.

now surfaced two ways:

- lifecycle::spawn / rebuild preflight: walks the live container
  list, computes each agent's hashed port, refuses with
  'port N already taken by <other> — rename one of them' if any
  running sub-agent shares the new agent's port. so the operator
  sees an actionable error in the dashboard's transient pill /
  approve-result instead of waiting for the harness to die.

- /api/state grows a port_conflicts: [{port, agents: [...]}]
  array; dashboard renders a pulsing red banner above the
  containers list listing each cluster. matches the questions
  panel pulse so it's hard to miss.
2026-05-15 22:08:19 +02:00
müde
acaa0eb895 agent_web_port: back to pure hash, drop port-file dance
operator's call: probing-forward + state-file machinery is more
brittle than the bug it tried to fix. revert to the original
deterministic FNV-1a hash mod 900. collisions are real but rare;
operator resolves by renaming (different name → different hash) and
rebuilding. no per-agent port file, no scan, no migration path,
nothing to drift out of sync with the running container.

existing port files on disk are silently ignored — operator
rebuilds affected agents to regenerate flakes from the deterministic
hash.
2026-05-15 21:17:31 +02:00
müde
c35f566d15 agent_web_port: actually resolve legacy collisions
previous attempt was wrong: the legacy branch returned port_hash
unconditionally, so two legacies hashing to the same port both
wrote that port and the collision persisted (test still trying to
bind coder's port).

new rule: always probe forward from port_hash, with scan_taken_ports
parameterised by include_implicit_hashes:

- legacy migration (applied dir exists, no port file): pass false.
  scan only counts other agents' port files. first-queried legacy
  claims its hash; subsequent colliders see the first's port file
  and probe forward. we don't know which legacy originally won
  the bind race, so first-write-wins; the loser was already
  crash-looping anyway and gets a fresh port to rebuild to.

- fresh spawn (no applied dir): pass true. counts port files AND
  implicit hashes for not-yet-migrated legacies, so a new spawn
  doesn't race with an unmigrated peer.

migration note for affected users: agents whose port file says
something other than their hashed port may have been corrupted by
the previous fix. Hit ↻ R3BU1LD on the offender to regenerate the
flake (uses the current port file) and the container will bind
the right port on restart.
2026-05-15 21:13:17 +02:00
müde
6db38cf70c model: runtime override via /model slash; fixes for port + bind
- runtime model override: Bus::{model,set_model} + POST /api/model
  (form-encoded {model: name}). turn.rs reads bus.model() per turn
  so a flip lands on the next claude invocation. /api/state grows
  a model field; agent page shows a 'model · <name>' chip in the
  state row. '/model <name>' slash command POSTs to the endpoint
  and refreshes state.

- port regression fix: agent_web_port no longer probes forward for
  *existing* agents (the previous fix shifted ports for any agent
  without a port file, including legacy ones whose container was
  already bound to the bare hashed port — dashboard rendered the
  new port, container was still on the old one, conn errors). new
  rule: port file exists → use it; absent + applied flake present
  → legacy, persist port_hash without probing; absent + no applied
  flake → fresh spawn, probe forward.

- SO_REUSEADDR on both the dashboard and per-agent web UI binds
  via tokio::net::TcpSocket. operator hit 12 retries failing on
  manager :8000 — REUSEADDR handles the TIME_WAIT case cleanly
  without a new dep; retry still covers the genuine
  process-still-alive overlap.

todo: drops the model-override entry (shipped); adds two new
items — model persistence (optional, future), and custom
per-agent MCP tools (groundwork for moving bitburner-agent into
hyperhive).
2026-05-15 20:59:45 +02:00
müde
79a46f359a agent_web_port: collision-aware sticky allocation
operator hit 'coder' and 'test' colliding on the same hashed port —
fnv-1a mod 900 has ~0.1% collision probability per pair and clearly
that's not enough. agent_web_port goes stateful:

- per-agent port persisted to /var/lib/hyperhive/agents/<name>/port
- on first call, look up the file; if absent, hash, then probe
  forward through the allocated range skipping any port other
  agents already claim, then write the chosen value back
- subsequent calls return the persisted port (sticky)

other agents' ports come from their port file if present, else the
fallback is the hashed value — that handles existing deployments
without forcing a rebuild-all just to migrate. rebuilding the
colliding agent re-runs agent_web_port, sees its peer's implicit
hash port as taken, picks the next free slot, persists.

range exhaustion (very unlikely — 900 slots) logs a warning and
returns the hash; the bind-with-retry on the harness will surface
the failure honestly rather than silently looping.
2026-05-15 20:41:18 +02:00
müde
ff8f8c7c56 per-agent /state dir for durable notes; manager sees them via /agents 2026-05-15 18:00:08 +02:00
müde
8428c693e0 dashboard: stop/restart per-container + update-all when any stale 2026-05-15 17:00:56 +02:00
müde
0f0e242906 programs.git.enable + harness PATH tracks systemPackages
- harness-base.nix: switch to programs.git for declarative gitconfig.
- agent + manager service path = /run/current-system/sw → agents pick up
  new packages from their own agent.nix without harness edits.
- generated applied/<name>/flake.nix overrides programs.git.config.user
  (no more raw etc.gitconfig collision).
2026-05-15 16:16:14 +02:00
müde
e1289a3e4c nix templates: factor harness-base.nix (shared scaffolding incl. gitconfig) 2026-05-15 16:10:55 +02:00
müde
f1fd787f17 rebuild button on agent UI (cross-origin POST to dashboard /rebuild) 2026-05-15 15:57:11 +02:00
müde
f99ed3fe7a manager: same lifecycle as agents; auto-spawn on hive-c0re start 2026-05-15 13:43:32 +02:00
müde
a42fdb3a5c phase 8 step 1: per-agent claude creds bind + destroy keeps state 2026-05-15 12:39:22 +02:00
müde
0fc287c768 fmt 2026-05-15 02:58:35 +02:00
müde
b711296460 destroy verb: CLI + admin socket + dashboard button; purges state + approvals 2026-05-15 02:57:22 +02:00
müde
fcd6563887 fmt 2026-05-15 02:02:20 +02:00
müde
07a5d3a778 lifecycle: clear HOST_ADDRESS/LOCAL_ADDRESS/HOST_BRIDGE — start script's --network-veth was forcing private netns 2026-05-15 01:51:12 +02:00
müde
59de7fa3c5 lifecycle: force PRIVATE_NETWORK=0 so per-agent web UI port reaches host 2026-05-15 01:35:30 +02:00
müde
ee99774d17 Phase 7d: per-container MemoryMax + CPUQuota via systemd drop-in 2026-05-15 00:30:48 +02:00
müde
7c1ed07cf2 lifecycle: HYPERHIVE_GIT env override (bypass PATH); module sets it 2026-05-15 00:24:51 +02:00
müde
6dbf4eedd7 lifecycle: u16::try_from instead of as-cast 2026-05-14 23:39:53 +02:00
müde
d0f954bbc1 Phase 6a: per-container web UI (axum); per-agent port hashed from name 2026-05-14 23:39:06 +02:00
müde
967ec7c9d7 fmt 2026-05-14 23:22:00 +02:00
müde
2fd80dbd68 Phase 5c: separate proposed (manager) and applied (hive-c0re) repos; per-agent gitconfig 2026-05-14 23:20:32 +02:00
müde
3c702cf43f fmt 2026-05-14 23:10:37 +02:00
müde
433c0d212e Phase 5b: per-agent config flakes; approve validates + advances commit 2026-05-14 23:09:35 +02:00
müde
6e7fd2e897 Phase 3c: nixpkgs-unstable for claude-code; harness calls claude --print, falls back to echo 2026-05-14 22:26:14 +02:00
müde
764d6497dd lifecycle: rebuild reconciles bind flag idempotently and restarts 2026-05-14 22:09:22 +02:00
müde
377eb994a1 lifecycle: bind via EXTRA_NSPAWN_FLAGS in /etc/nixos-containers/<name>.conf 2026-05-14 22:06:27 +02:00
müde
326da5a7bf naming: h-<name> for agents, hm1nd for manager (11-char limit) 2026-05-14 21:59:01 +02:00
müde
7ce0f0022f lifecycle: bind agent dir via /run/systemd/nspawn override (nixos-container lacks --bind) 2026-05-14 21:52:17 +02:00
müde
f6cf4223a4 lifecycle: surface nixos-container stderr in error + log 2026-05-14 21:48:23 +02:00
müde
d79b5a39a1 hive-c0re: in-memory broker + per-agent sockets + coordinator state 2026-05-14 21:42:51 +02:00
müde
90798b936e hive-c0re: nixos-container lifecycle (spawn/kill/rebuild/list) 2026-05-14 20:51:35 +02:00