hyperhive

Author	SHA1	Message	Date
müde	58c3cd853b	container crash watcher → HelperEvent::ContainerCrash new hive_c0re::crash_watch task polls every 10s, builds the set of currently-running containers, and on running→stopped transitions checks the transient snapshot: if no Stopping / Restarting / Destroying / Rebuilding flag is set, the container exited unexpectedly and we fire HelperEvent::ContainerCrash into the manager's inbox so it can react (typically: start it again). first poll is a seeding pass — no events on harness startup. dbus subscription would be lower-latency but polling is honest and debuggable, and a 10s delay on crash detection is fine for our scale. manager prompt + approvals doc updated to advertise the new event variant. todo drops the entry (and the journald-viewer entry that already shipped).	2026-05-15 21:02:05 +02:00
müde	6db38cf70c	model: runtime override via /model slash; fixes for port + bind - runtime model override: Bus::{model,set_model} + POST /api/model (form-encoded {model: name}). turn.rs reads bus.model() per turn so a flip lands on the next claude invocation. /api/state grows a model field; agent page shows a 'model · <name>' chip in the state row. '/model <name>' slash command POSTs to the endpoint and refreshes state. - port regression fix: agent_web_port no longer probes forward for existing agents (the previous fix shifted ports for any agent without a port file, including legacy ones whose container was already bound to the bare hashed port — dashboard rendered the new port, container was still on the old one, conn errors). new rule: port file exists → use it; absent + applied flake present → legacy, persist port_hash without probing; absent + no applied flake → fresh spawn, probe forward. - SO_REUSEADDR on both the dashboard and per-agent web UI binds via tokio::net::TcpSocket. operator hit 12 retries failing on manager :8000 — REUSEADDR handles the TIME_WAIT case cleanly without a new dep; retry still covers the genuine process-still-alive overlap. todo: drops the model-override entry (shipped); adds two new items — model persistence (optional, future), and custom per-agent MCP tools (groundwork for moving bitburner-agent into hyperhive).	2026-05-15 20:59:45 +02:00
müde	7d93dd9db4	no nap tool — recv with long wait_seconds replaces it; max raised to 180s recv-with-timeout is strictly better than a fixed sleep because it wakes instantly on incoming messages. drop the half-written nap MCP tool, raise the recv wait_seconds cap from 60s to 180s on both agent and manager sockets. prompts updated: agent.md + manager.md now spell out the pattern — when there's nothing else useful to do, call recv with wait_seconds=180 to park the turn; do NOT use Bash sleep for the same purpose. todo drops the nap entry and the napping-state-badge follow-up; both replaced by 'just use a long recv'.	2026-05-15 20:53:15 +02:00
müde	f65ee88269	recv: optional wait_seconds parameter, capped at 60s AgentRequest::Recv and ManagerRequest::Recv grow an optional wait_seconds field (default None → 30s, capped at 60s server-side). agent_server / manager_server clamp via recv_timeout(). MCP tool schemas advertise the param so claude can pick its own poll window — useful when an agent wants to throttle wakes without entering a distinct nap state. both harness loops still pass None, keeping the existing 30s default behaviour for system-level Recvs.	2026-05-15 20:49:33 +02:00
müde	0385d96bf3	dashboard: per-container journald viewer new GET /api/journal/{name}?unit=&lines= shells out journalctl -M <container> -b --no-pager --output=short-iso --lines=<N> (cap 5000). optional unit filter, restricted to hive-ag3nt.service / hive-m1nd.service so the shell-out can't be coerced into reading unrelated units. validates the container name against the live list before invoking journalctl. frontend renders a collapsed '↳ logs · <container>' details block on each container row. expanding triggers a lazy fetch; refresh button re-fetches; unit dropdown switches between the harness service (default) and the full machine journal. output sits in a 24em-tall monospace pre, auto-scrolled to the bottom on fresh fetch. hive-c0re's systemd unit already runs as root, so journalctl has the access it needs.	2026-05-15 20:42:56 +02:00
müde	79a46f359a	agent_web_port: collision-aware sticky allocation operator hit 'coder' and 'test' colliding on the same hashed port — fnv-1a mod 900 has ~0.1% collision probability per pair and clearly that's not enough. agent_web_port goes stateful: - per-agent port persisted to /var/lib/hyperhive/agents/<name>/port - on first call, look up the file; if absent, hash, then probe forward through the allocated range skipping any port other agents already claim, then write the chosen value back - subsequent calls return the persisted port (sticky) other agents' ports come from their port file if present, else the fallback is the hashed value — that handles existing deployments without forcing a rebuild-all just to migrate. rebuilding the colliding agent re-runs agent_web_port, sees its peer's implicit hash port as taken, picks the next free slot, persists. range exhaustion (very unlikely — 900 slots) logs a warning and returns the hash; the bind-with-retry on the harness will surface the failure honestly rather than silently looping.	2026-05-15 20:41:18 +02:00
müde	754db7830e	ask_operator: ttl_seconds auto-cancel + remaining-time chip manager can pass ttl_seconds to ask_operator. on submit, host stores deadline_at = now + ttl in operator_questions (new column, migrated via existing pragma_table_info pattern), spawns a tokio task that sleeps until the deadline then resolves the question with answer '[expired]' and fires the same OperatorAnswered helper event. already-resolved races no-op silently. dashboard renders a '⏳ MM:SS' chip on the question row when deadline_at is set. format collapses seconds → s, < 1h → m s, ≥ 1h → h m. heartbeat refresh (5s) keeps the chip current; the operator sees it tick down. manager prompt + mcp tool description updated. journald viewer per container queued in todo (separate task).	2026-05-15 20:38:02 +02:00
müde	2146e47770	web ui: retry binding on AddrInUse during restart races operator hit 'Address already in use (os error 98)' on a harness restart — the new harness raced the old socket's release. add a bind_with_retry helper that backs off (250ms doubling, capped at 2s, 12 tries ≈ 22s total) on AddrInUse before giving up. applied to both the per-agent web UI and the hive-c0re dashboard. proper fix would be SO_REUSEADDR via socket2 but retry covers the TIME_WAIT case fine and keeps the dep count down. Other bind errors still fail immediately (port permission, fd exhaustion).	2026-05-15 20:33:51 +02:00
müde	538e0446d7	agent page: inbox view of last 30 messages addressed to this agent new wire request AgentRequest::Recent { limit } / ManagerRequest::Recent (plus matching responses with Vec<InboxRow>). InboxRow moved to hive-sh4re so it lives on both surfaces without an internal-to-wire conversion. host-side dispatch in agent_server / manager_server calls broker.recent_for(name, limit). per-agent web_ui /api/state grew an inbox: Vec<InboxRow> populated via the same per-agent socket (best-effort; transport failure returns empty). frontend renders as a collapsible <details> section between the state row and the terminal — fmt timestamp / from / body in a tight grid, capped at 16em scrollable. only visible when there are rows.	2026-05-15 20:32:19 +02:00
müde	ee5b85716d	ask_operator: operator-side ✗ CANC3L on pending questions new POST /cancel-question/{id} resolves a pending operator question with the sentinel answer '[cancelled]' and fires the usual HelperEvent::OperatorAnswered so the manager sees a terminal state and can fall back. uses the same OperatorQuestions::answer path — no special handling, the manager already has to deal with arbitrary answer strings. dashboard renders the cancel as a separate <form> below the main qform so the answer-merge submit handler on the main form doesn't inadvertently fire when the operator clicks cancel. confirm dialog spells out what the manager will see. ttl-based auto-cancel is still on the todo (would spawn a tokio task per submitted question).	2026-05-15 20:25:11 +02:00
müde	2413d664a1	agents get a kickoff inbox message on start/restart/rebuild new Coordinator::kick_agent(name, reason) drops a system message into the agent's inbox so the next turn picks it up with a 'you were just (re)started, check /state/ for notes, --continue session is intact' hint. wakes the turn loop without any harness-side handling needed — it's just another inbox message with sender = 'system'. wired from: - dashboard /start /restart /rebuild handlers (via lifecycle_action's on-success tail) - manager mcp_hyperhive_start / restart dashboard: pending approvals + tombstones + questions now refresh on a 5s heartbeat when nothing else is happening. previously refresh only fired on async-form submit or on broker traffic addressed to operator — manager-queued approvals went through neither, so the operator had to reload to see them. 5s is the slow-path; 2s remains for in-flight transients.	2026-05-15 20:19:36 +02:00
müde	c27111ac32	dashboard: split api_state into per-section builders drops the #[allow(clippy::too_many_lines)] on api_state by extracting four pure helpers: - build_container_views — live containers + any_stale flag - build_transient_views — agents in pre-creation Spawning state only - build_approval_views — pending approvals with diff html - build_tombstone_views — destroyed-but-kept state dirs api_state itself is now ~30 lines of orchestration. zero behavior change. each helper is independently readable + testable.	2026-05-15 20:13:08 +02:00
müde	7b4adea325	dashboard: lifecycle_action helper collapses start/stop/restart/rebuild five POST handlers (post_kill / post_restart / post_start / post_rebuild) were all repeating the same boilerplate: strip prefix, set_transient, call lifecycle::X, clear_transient, match the result. extract one helper that takes the transient kind, error-message verb, the work body, and an optional 'on success' tail (used by kill to also unregister + emit HelperEvent::Killed). each handler shrinks to a single lifecycle_action(..) call. zero behavior change.	2026-05-15 20:12:03 +02:00
müde	89ccc5e6c5	events.sqlite vacuum moves host-side retention is a host concern — agents have no business doing their own cleanup, and a misbehaving harness could skip it. drop spawn_events_vacuum from both hive-ag3nt and hive-m1nd, drop the matching Bus::vacuum + EventStore::vacuum methods. new hive_c0re::events_vacuum module sweeps every existing agents/<name>/state/hyperhive-events.sqlite on the same hourly cadence as the broker vacuum. same two-stage delete (older than 7 days, trim to 2000 newest). called from main alongside broker vacuum. also: server-side state badge entered into todo.md (today's badge is derived client-side from sse, fine for idle/thinking but a state machine that grows compacting/napping wants authoritative status from the harness).	2026-05-15 20:10:34 +02:00
müde	5ee65d2f15	dashboard: K3PT ST4T3 section + agent links open in new tab new section between containers and questions: lists every name with a state dir under /var/lib/hyperhive/agents/ that doesn't correspond to a live container. shows state size + last-modified age + whether claude creds are kept. two actions per row: - R3V1V3 — queues a spawn approval with the same name (operator approves to recreate; spawn flow reuses prior config + claude creds, no re-login needed) - PURG3 — wipes the agent's state + applied dirs (post /purge-tombstone/ endpoint; refuses if a live container with that name still exists) dashboard also opens agent links in new tabs now (target=_blank + rel=noopener) so the operator's overview tab stays put when they dive into an agent.	2026-05-15 19:55:27 +02:00
müde	8344dd9ab7	ask_operator: multi-select + free-text fallback ask_operator now accepts a multi: bool. when true and options is non-empty, the dashboard renders the choices as checkboxes — operator picks any subset, answer comes back as a ', '-joined string. when false (default), options are radio buttons. independent of multi, a free-text input ('or type your own…') is always rendered alongside options so the operator is never trapped by an incomplete list. submit merges checked options + free text into the single 'answer' field. schema migration: operator_questions grows a multi INTEGER column with a one-shot ALTER TABLE on open. backward compatible — old rows default to 0 (not multi). prompt + mcp tool description updated; existing dashboard css for .qform was rewritten around the new vertical layout.	2026-05-15 19:52:44 +02:00
müde	c337cc06f8	dashboard: spinners on in-flight lifecycle actions + cleaner row layout backend: - TransientKind grows Starting / Stopping / Restarting / Rebuilding / Destroying alongside the existing Spawning. each dashboard handler (start/restart/kill/rebuild/destroy) wraps the lifecycle call with set_transient + clear_transient so the dashboard knows what's in flight. transient kind is surfaced inline on ContainerView.pending (existing-container actions) — only Spawning (pre-creation) lands in the separate transients list. frontend: - container row is now two lines: identity + meta on top, action buttons below. less cluttered, leaves room for the pending state pill. pending rows dim their actions and surface a pulsing '◐ spawning… / starting… / stopping… / restarting… / rebuilding… / destroying…' indicator next to the name. - 'needs login' / 'needs update' chips moved into a unified .badge styling for consistency. - auto-refresh kicks in not only on transient spawn but on any container with a pending action.	2026-05-15 19:49:43 +02:00
müde	6d52f67292	broker: hourly vacuum of delivered messages older than 30 days undelivered rows are always kept regardless of age (still in flight). sweep runs immediately on serve start then every hour. logs row count when non-zero. keep_secs is hard-coded for now (30 days); can be config-driven later if a host wants to retain more / less for audit.	2026-05-15 19:40:38 +02:00
müde	48ebfefd1a	destroy --purge: also wipe agent state dirs new --purge flag on the destroy verb (cli + admin socket + dashboard). default destroy still keeps /var/lib/hyperhive/{agents,applied}/<name>/ so recreating with the same name reuses prior config + creds. with --purge, both dirs go too (config history, claude creds, /state/ notes). no undo. dashboard adds a separate PURG3 button with an explicit confirmation copy; the existing DESTR0Y button keeps the soft semantics. claude.md dashboard-action-surface section updated; todo entry dropped.	2026-05-15 19:29:14 +02:00
müde	ac1b5fde8e	manager: start/restart at will, no approval; refuse self new manager tools mcp__hyperhive__{start,restart} that delegate to the existing lifecycle::start / lifecycle::restart on the host. kill was already at the manager's discretion; rounding out start + restart for parity so day-to-day container care doesn't have to round-trip through the operator. guard: refuse self-targeting on kill/start/restart — the manager would just be cutting its own legs. spawn (request_spawn) and config changes (request_apply_commit) still go through the approval queue, since those are the actual gate. prompt + claude.md updated to make the boundary explicit. kill now also emits HelperEvent::Killed (it didn't before).	2026-05-15 18:57:25 +02:00
müde	2770630f33	ask_operator tool: non-blocking; operator answer arrives as helper event new mcp tool on the manager surface that queues a question on the dashboard and returns the question id immediately. operator submits an answer via /answer-question/<id>; the dashboard fires HelperEvent::OperatorAnswered { id, question, answer } into the manager inbox so the next turn picks it up. also: fix async-form button stuck on spinner after successful submit (refreshState skipped re-rendering, so the button was never re-enabled).	2026-05-15 18:44:42 +02:00
müde	ff8f8c7c56	per-agent /state dir for durable notes; manager sees them via /agents	2026-05-15 18:00:08 +02:00
müde	37c6504462	manager events: Spawned/Rebuilt/Killed/Destroyed + start button	2026-05-15 17:38:41 +02:00
müde	06ea0cf283	operator inbox view on dashboard; agent ui doesn't clobber typing	2026-05-15 17:23:53 +02:00
müde	6fc9862c3c	dashboard: SPA shell — static index.html + app.js, /api/state JSON	2026-05-15 17:10:57 +02:00
müde	8428c693e0	dashboard: stop/restart per-container + update-all when any stale	2026-05-15 17:00:56 +02:00
müde	4f91dfef99	module: thread hyperhive package directly — operators don't apply overlays	2026-05-15 16:51:18 +02:00
müde	edf42b7e93	extract dashboard + agent CSS/JS to assets/ (include_str!)	2026-05-15 16:32:35 +02:00
müde	8fbee4fbf2	dashboard: async forms with spinner + rebuild button on every container	2026-05-15 16:21:25 +02:00
müde	e2ed58c1a7	dashboard: per-line color on approval diffs	2026-05-15 16:17:48 +02:00
müde	0f0e242906	programs.git.enable + harness PATH tracks systemPackages - harness-base.nix: switch to programs.git for declarative gitconfig. - agent + manager service path = /run/current-system/sw → agents pick up new packages from their own agent.nix without harness edits. - generated applied/<name>/flake.nix overrides programs.git.config.user (no more raw etc.gitconfig collision).	2026-05-15 16:16:14 +02:00
müde	e1289a3e4c	nix templates: factor harness-base.nix (shared scaffolding incl. gitconfig)	2026-05-15 16:10:55 +02:00
müde	dfbcf2b9d1	agents wake on send: broker.recv_blocking + 30s long-poll on Recv	2026-05-15 16:00:31 +02:00
müde	f1fd787f17	rebuild button on agent UI (cross-origin POST to dashboard /rebuild)	2026-05-15 15:57:11 +02:00
müde	824914807a	ensure_manager: rebuild hm1nd if applied flake missing (migration safety)	2026-05-15 15:53:39 +02:00
müde	409263f1c9	operator input: per-agent /send form (dashboard T4LK removed)	2026-05-15 15:28:17 +02:00
müde	accb1445e3	claude: pipe prompt via stdin (variadic --allowedTools was eating it); + ManagerRequest::Status	2026-05-15 15:06:09 +02:00
müde	3c9d42b2a7	agent loop: claude drives; tool envelope (log/run/status/log)	2026-05-15 14:54:10 +02:00
müde	f99ed3fe7a	manager: same lifecycle as agents; auto-spawn on hive-c0re start	2026-05-15 13:43:32 +02:00
müde	e777576528	auto-update: surface pending updates in dashboard + include manager	2026-05-15 13:31:33 +02:00
müde	a4e1556f90	auto-update agents on startup when hyperhive rev changes	2026-05-15 13:25:27 +02:00
müde	d07f5eadaa	dashboard: needs-login badge links to per-agent ui	2026-05-15 13:12:12 +02:00
müde	78fae44ee5	phase 8 step 3: needs-login partial-run mode + dashboard badge	2026-05-15 12:57:06 +02:00
müde	c59fa8541c	phase 8 step 2: approval-gated spawn + dashboard spinner	2026-05-15 12:53:13 +02:00
müde	a42fdb3a5c	phase 8 step 1: per-agent claude creds bind + destroy keeps state	2026-05-15 12:39:22 +02:00
müde	0fc287c768	fmt	2026-05-15 02:58:35 +02:00
müde	b711296460	destroy verb: CLI + admin socket + dashboard button; purges state + approvals	2026-05-15 02:57:22 +02:00
müde	fcd6563887	fmt	2026-05-15 02:02:20 +02:00
müde	1333532d3f	dashboard: T4LK form — operator sends messages from the browser	2026-05-15 01:59:53 +02:00
müde	07a5d3a778	lifecycle: clear HOST_ADDRESS/LOCAL_ADDRESS/HOST_BRIDGE — start script's --network-veth was forcing private netns	2026-05-15 01:51:12 +02:00
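
The port rule from 79a46f359a, as revised in 6db38cf70c, can be sketched roughly as below. This is an illustration only, not the code in the repository: the 32-bit FNV-1a variant, the `PORT_BASE` value, the wrap-around inside the range, and the names `hashed_port` / `pick_port` are all assumptions. The log only states "fnv-1a mod 900", the three-way rule, and the fall-back to the hash on range exhaustion.

```rust
use std::collections::HashSet;

// Assumed values: the log gives the slot count (900) but not the base port.
const PORT_BASE: u16 = 8100;
const PORT_SLOTS: u16 = 900;

/// 32-bit FNV-1a over the agent name (the hash width is an assumption).
fn fnv1a(name: &str) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for byte in name.bytes() {
        hash ^= u32::from(byte);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// The "bare hashed port" used by legacy agents without a port file.
fn hashed_port(name: &str) -> u16 {
    PORT_BASE + (fnv1a(name) % u32::from(PORT_SLOTS)) as u16
}

/// Hypothetical pure version of the allocation rule.
/// `persisted` is the content of the agent's port file, if it exists;
/// `taken` holds the ports other agents already claim.
fn pick_port(
    name: &str,
    persisted: Option<u16>,
    has_applied_flake: bool,
    taken: &HashSet<u16>,
) -> u16 {
    // Port file exists: sticky, always reuse it.
    if let Some(port) = persisted {
        return port;
    }
    let hashed = hashed_port(name);
    // No port file but an applied flake: legacy agent, keep the hash, no probing.
    if has_applied_flake {
        return hashed;
    }
    // Fresh spawn: probe forward from the hashed slot, skipping claimed ports.
    let start = hashed - PORT_BASE;
    for offset in 0..PORT_SLOTS {
        let candidate = PORT_BASE + (start + offset) % PORT_SLOTS;
        if !taken.contains(&candidate) {
            return candidate;
        }
    }
    // Range exhausted: return the hash and let the bind surface the failure.
    hashed
}
```

The caller would persist the returned value to the agent's port file; that I/O is left out so the rule stays a pure function.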
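
The crash detection described in 58c3cd853b (seeding pass, running→stopped transitions, transient flags suppressing the event) might look roughly like this. It is a sketch under assumptions: `CrashWatch`, `poll`, and `expects_exit` are hypothetical names, the transient snapshot is modelled as a plain map, and the 10s polling loop and the inbox delivery are omitted.

```rust
use std::collections::{HashMap, HashSet};

/// Transient kinds named in the log (c337cc06f8).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Transient {
    Spawning,
    Starting,
    Stopping,
    Restarting,
    Rebuilding,
    Destroying,
}

impl Transient {
    /// Flags under which a running→stopped transition is expected.
    fn expects_exit(self) -> bool {
        matches!(
            self,
            Transient::Stopping
                | Transient::Restarting
                | Transient::Destroying
                | Transient::Rebuilding
        )
    }
}

/// Hypothetical watcher state: the running set seen on the previous poll.
struct CrashWatch {
    previous: Option<HashSet<String>>,
}

impl CrashWatch {
    fn new() -> Self {
        Self { previous: None }
    }

    /// Returns the names that stopped unexpectedly since the last poll.
    /// The first call only seeds the state and reports nothing.
    fn poll(
        &mut self,
        running: HashSet<String>,
        transients: &HashMap<String, Transient>,
    ) -> Vec<String> {
        let mut crashed: Vec<String> = match &self.previous {
            None => Vec::new(), // seeding pass: no events on startup
            Some(prev) => prev
                .difference(&running)
                .filter(|name| !transients.get(*name).is_some_and(|t| t.expects_exit()))
                .cloned()
                .collect(),
        };
        crashed.sort(); // stable order for callers and tests
        self.previous = Some(running);
        crashed
    }
}
```

Each returned name would then become one HelperEvent::ContainerCrash in the manager's inbox.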
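
The server-side clamp on wait_seconds (f65ee88269, cap raised in 7d93dd9db4) reduces to a default plus a ceiling. A minimal sketch, assuming the field is an `Option<u64>`; the log names `recv_timeout()` but not its signature, and the constant names here are made up:

```rust
use std::time::Duration;

// Values taken from the log: None falls back to 30s, requests are capped at 180s.
const DEFAULT_WAIT_SECS: u64 = 30;
const MAX_WAIT_SECS: u64 = 180;

/// Assumed signature: clamps the optional client-supplied wait to the server cap.
fn recv_timeout(wait_seconds: Option<u64>) -> Duration {
    let secs = wait_seconds.unwrap_or(DEFAULT_WAIT_SECS).min(MAX_WAIT_SECS);
    Duration::from_secs(secs)
}
```

With these constants, `recv_timeout(None)` yields 30s, and any request above the cap yields 180s.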

88 commits