# hyperhive
Multi-Claude-Code-agent orchestration on **nixos-containers**. A host-side Rust
daemon spawns nspawn-isolated agent containers and brokers messages between
them. Eventually a manager agent (another Claude Code session in its own
container) coordinates the swarm and gates lifecycle changes on user approval
via git commits.
**PLAN.md** is the living design doc. Read it for the *why* and the phase
roadmap; this file is the operator/developer reference for the *how*.
## Architecture
```
host
├── hive-c0re (Rust daemon, NixOS service)
│   ├── lifecycle     — nixos-container CRUD
│   ├── broker        — sqlite message store (/var/lib/hyperhive/broker.sqlite)
│   ├── server        — host admin socket (JSON line protocol)
│   └── agent_server  — per-agent MCP-ish sockets
├── /run/hyperhive/
│   ├── host.sock                — admin CLI ↔ daemon
│   └── agents/<name>/mcp.sock   — bind-mounted into each container at /run/hive
└── nixos-containers
    ├── h-<name>  (sub-agents, hive-ag3nt binary)
    └── hm1nd     (manager, hive-m1nd binary — Phase 4+)
```
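The mapping from an agent name to its container name and host-side socket path can be sketched as below. This is a minimal illustration, not the code in `lifecycle.rs`: the helper names `container_name` and `host_socket` are hypothetical, the value of `MAX_AGENT_NAME` is inferred from the 9-character cap described under Conventions, and the character-set check is an added assumption to keep the name safe as a path component.

```rust
use std::path::PathBuf;

/// Assumed value: the source names the constant and states a 9-char cap on `<name>`.
const MAX_AGENT_NAME: usize = 9;

/// Hypothetical helper: validate `<name>` and build the `h-<name>` container name.
fn container_name(name: &str) -> Result<String, String> {
    if name.is_empty() || name.len() > MAX_AGENT_NAME {
        return Err(format!(
            "agent name must be 1..={} bytes, got {}",
            MAX_AGENT_NAME,
            name.len()
        ));
    }
    // Assumption (not from the source): only lowercase ASCII letters and digits,
    // so the name can never escape its directory when used in a path.
    if !name.bytes().all(|b| b.is_ascii_lowercase() || b.is_ascii_digit()) {
        return Err(format!("agent name {:?} contains unsupported characters", name));
    }
    Ok(format!("h-{}", name))
}

/// Hypothetical helper: host-side socket path for an agent, per the tree above.
fn host_socket(name: &str) -> PathBuf {
    PathBuf::from("/run/hyperhive/agents")
        .join(name)
        .join("mcp.sock")
}
```

With a 9-character name the result is exactly 11 characters (`h-` plus the name), which is the limit the container tooling accepts.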
## Crates / file map
```
hive-c0re/                   host daemon + CLI (one binary, subcommand-dispatched)
  main.rs                    clap setup; serve vs spawn/kill/rebuild/list
  server.rs                  host admin socket
  client.rs                  host admin socket client (for spawn/kill/rebuild/list)
  broker.rs                  sqlite-backed Message store (rusqlite)
  agent_server.rs            per-agent socket listener
  coordinator.rs             shared runtime state (broker + map<name, AgentSocket>)
  lifecycle.rs               `nixos-container` shellouts (spawn/kill/rebuild/list)
hive-ag3nt/                  in-container harness; produces TWO binaries from one crate
  src/lib.rs                 DEFAULT_SOCKET, re-exports
  src/client.rs              AgentRequest/AgentResponse over /run/hive/mcp.sock
  src/bin/hive-ag3nt.rs      sub-agent CLI (serve/send/recv)
  src/bin/hive-m1nd.rs       manager placeholder (Phase 4)
hive-sh4re/                  wire types (HostRequest/Response, AgentRequest/Response, Message)
nix/
  modules/hive-c0re.nix      systemd service wiring
  templates/agent-base.nix   nixos-container template (boot.isNspawnContainer = true)
tests/roundtrip.sh           Phase 3 end-to-end smoke test
```
## Conventions
- **Naming.** Container names are length-bounded (`nixos-container` allows ≤ 11 chars).
Sub-agents are `h-<name>` with `<name>` ≤ 9 chars; the manager is `hm1nd`.
`MAX_AGENT_NAME` enforces the cap in `lifecycle.rs`.
- **Identity = socket.** No auth/tokens on the per-agent sockets. The socket
*path* identifies the principal; perms come from "who has the bind-mount."
- **Wire protocol.** JSON line-delimited over unix sockets in both directions.
See `hive-sh4re` for the types. (Phase 6+ may swap to real MCP stdio.)
- **Commit messages.** Short, lowercase, no Co-Authored-By trailer.
- **Commit before test.** Stage and commit when work *looks* ready, then run
validation (`cargo check`, `nix flake check`, real lpt2 deploy). Failures get
a follow-up commit rather than an amend.
- **`rebuild` is the reconcile verb.** It rewrites `/etc/nixos-containers/<C>.conf`
EXTRA_NSPAWN_FLAGS idempotently *and* does `nixos-container update` *and*
stop+start so nspawn-level changes (bind mounts) take effect. Anything that
changes per-container state on the host should be re-applied here.
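The idempotent rewrite of the `EXTRA_NSPAWN_FLAGS` line that `rebuild` performs can be sketched as a pure function over the conf file's text. This is an illustration under assumptions rather than the project's `set_nspawn_flags()`: the signature, the double-quoted `KEY="value"` form, and the example bind-mount flag are all hypothetical.

```rust
/// Hypothetical sketch: return `conf` with exactly one EXTRA_NSPAWN_FLAGS line
/// set to `flags`. Running it again on its own output changes nothing.
fn set_nspawn_flags(conf: &str, flags: &str) -> String {
    // Assumption: the conf file uses double-quoted shell-style assignments.
    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags);
    let mut out: Vec<String> = Vec::new();
    let mut replaced = false;

    for line in conf.lines() {
        if line.trim_start().starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Replace the first assignment in place; drop any duplicates.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        // No existing assignment: append one.
        out.push(new_line);
    }

    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Because the function replaces the line rather than appending to it, calling it on every `rebuild` keeps the file stable no matter how many times the reconcile runs.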
## Gotchas / lessons learned
- **`nixos-container` doesn't expose `--bind` on the CLI.** Path is via
`EXTRA_NSPAWN_FLAGS` in `/etc/nixos-containers/<NAME>.conf` — the start
script (`/nix/store/.../container_-start`) expands it unquoted into the
`systemd-nspawn` invocation. We rewrite this line in `set_nspawn_flags()`.
- **`/run/systemd/nspawn/*.nspawn` overrides are *ignored*** by `nixos-container`'s
start script (it builds the nspawn cmd line directly). Don't bother.
- **`boot.isNspawnContainer = true`**, not `boot.isContainer = true`. The
latter was renamed in nixos-25.11+.
- **systemd service PATH ≠ host PATH.** Our service explicitly sets
`path = [ "/run/current-system/sw" ]` so `nixos-container` (which lives in
the system profile, not nixpkgs) is reachable.
- **`RuntimeDirectoryPreserve = "yes"`** keeps `/run/hyperhive/` (and the
agent sub-dirs) across `hive-c0re` restarts. Without it, every restart wipes
bind sources and existing containers can't be started.
- **`register_agent` is idempotent** — drops any prior socket task before
rebinding. Required so a `hive-c0re` restart followed by `rebuild alice`
recreates the agent's socket without needing a clean reinstall.
- **`claude-code` is unfree.** `agent-base.nix` allow-lists it specifically.
The flake pins it to **nixpkgs-unstable** via `overlays.claude-unstable`
(stable lags too far). The overlay imports unstable with its own
`allowUnfreePredicate` so the access inside the overlay doesn't itself trip.
- **Claude credentials are stateful and per-container.** No `ANTHROPIC_API_KEY`
env var path. For now: `nixos-container root-login h-<name>` → `claude`
(interactive) → log in once. The harness falls back to echo replies when
`claude --print` fails. Future: bind-mount a shared `~/.claude` dir from the
host so creds survive container destroy/recreate.
- **Echo guard.** `hive-ag3nt serve` skips auto-reply when the incoming body
starts with `"echo: "`. Prevents ping-pong loops when both sides fall back to
echo. Real conversations between claude-backed agents *will* run away; bounding
them is the manager's job (Phase 4+).
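The echo guard and the claude-or-echo fallback amount to a small decision function. The sketch below is an assumption about shape, not the harness code: `auto_reply` and `ECHO_PREFIX` are hypothetical names, and `claude_reply` stands in for the outcome of `claude --print` (`None` when it fails).

```rust
/// Prefix used by fallback replies, as described in the echo guard note.
const ECHO_PREFIX: &str = "echo: ";

/// Hypothetical sketch: pick the auto-reply for an incoming message body.
/// Returns `None` when the harness should stay silent.
fn auto_reply(body: &str, claude_reply: Option<String>) -> Option<String> {
    // Echo guard: never answer a message that is itself an echo fallback,
    // otherwise two echoing agents would ping-pong forever.
    if body.starts_with(ECHO_PREFIX) {
        return None;
    }
    // Prefer the real reply; fall back to an echo when claude is unavailable.
    Some(claude_reply.unwrap_or_else(|| format!("{}{}", ECHO_PREFIX, body)))
}
```

The guard only stops echo loops. Two agents that both get real replies still answer each other indefinitely, which is why the note above leaves that bound to the manager.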
## Build / deploy / test
```sh
# inside the repo (devshell first; no global cargo)
nix develop -c cargo check
nix develop -c cargo build
# evaluate everything (incl. fmt check)
nix flake check
# build only the workspace package
nix build .#default
./result/bin/{hive-c0re,hive-ag3nt,hive-m1nd}
# deploy to an existing host that imports hyperhive.nixosModules.hive-c0re
cd ~/Repos/<nixos-config-repo>
nix flake update --update-input hyperhive
sudo nixos-rebuild switch --flake .#<host>
# end-to-end test (lpt2 or any host with the module enabled)
sudo bash tests/roundtrip.sh
```
The host config also needs `hyperhive.overlays.default` applied — the module's
default `package = pkgs.hyperhive` requires the overlay to bring the package
in.
## Phase status
- ✅ Phase 0 — repo + Cargo workspace + flake + agent-base template
- ✅ Phase 1 — container lifecycle (spawn/kill/rebuild/list); nixos-container update
hot-reload works under the patch stack (validated empirically on muede-lpt2)
- ✅ Phase 2 — per-agent sockets, in-memory broker, agent harness round-trips messages
- ✅ Phase 3 — sqlite broker (durable across restart) + claude-or-echo turn loop
- ✅ Phase 4 — `hm1nd` manager binary + manager socket + declarative `containers.hm1nd`
- ✅ Phase 5 — git-commit approval flow:
- 5a — sqlite approval queue (`request_apply_commit` / `pending` / `approve` / `deny`)
- 5b — per-agent config flakes (proposed + applied repos)
- 5c — split: manager edits `proposed`, only hive-c0re writes `applied`; the
container builds from `applied`. Approve = read `agent.nix` at the
approved commit from `proposed`, copy into `applied`, commit + rebuild.
Manager cannot move `main` on its own.
- ✅ Phase 6 — per-container web UIs + hive-c0re dashboard:
- Each `hive-ag3nt` / `hive-m1nd` serves an `axum` HTTP page on `HIVE_PORT`
(deterministic hash for sub-agents in 8100–8999; fixed 8000 for the manager).
Vibec0re-styled placeholder for now (status / inbox / xterm coming later).
- `hive-c0re` serves a dashboard on `cfg.dashboardPort` (default 7000)
listing containers (deep-linked to their per-container UI) + pending
approvals. Same aesthetic.
- Firewall opens 7000, 8000, 8100–8999 when the module is enabled.
- ✅ Phase 7 — polish:
- 7a — dashboard `POST /approve/<id>` / `/deny/<id>` buttons + unified
diff (via `similar`) of applied vs proposed `agent.nix`.
- 7b — broker broadcast channel + `/messages/stream` SSE + live message-flow
panel (cyan `→` sent / green `✓` delivered, 200 row cap).
- 7c — `ApprovalResolved` helper events into the manager's inbox
(`SYSTEM_SENDER` + `HelperEvent` JSON). Manager harness recognises and
logs them distinctly.
- 7d — default `MemoryMax=2G` + `CPUQuota=50%` applied to every managed
container via `/run/systemd/system/container@<NAME>.service.d/hyperhive-limits.conf`
drop-in (regenerated on every spawn / rebuild).
- 7e — damocles migration plan (`docs/damocles-migration.md`).
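The deterministic per-agent port from Phase 6 can be sketched as a stable hash of the agent name folded into the sub-agent range, read here as 8100 to 8999. The source does not say which hash is used; FNV-1a is a stand-in chosen for this example, and `agent_port` is a hypothetical name.

```rust
/// First port of the sub-agent range and the number of ports in it (8100..=8999).
const PORT_BASE: u16 = 8100;
const PORT_SPAN: u32 = 900;

/// 32-bit FNV-1a. Assumption: the real hash function is not named in the source.
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5; // FNV offset basis
    for b in bytes {
        h ^= u32::from(*b);
        h = h.wrapping_mul(0x0100_0193); // FNV prime
    }
    h
}

/// Hypothetical helper: map an agent name to a fixed port in 8100..=8999.
fn agent_port(name: &str) -> u16 {
    // The remainder is below 900, so the cast to u16 cannot truncate.
    PORT_BASE + (fnv1a(name.as_bytes()) % PORT_SPAN) as u16
}
```

A hand-rolled hash is used instead of `std::collections::hash_map::DefaultHasher` because the standard library does not guarantee that hasher's output across releases, and a port that moves after a toolchain bump would break dashboard deep links. With only 900 slots, two names can collide, so a real implementation would need a collision check.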
## Approval flow (Phase 5)
End-to-end: manager edits per-agent config repo → commits → submits commit sha
for approval → user approves on host CLI → `hive-c0re` advances `main` + rebuilds.
```sh
# Inside the hm1nd container (manager has /agents bind-mounted RW):
cd /agents/alice/config
$EDITOR agent.nix # add `environment.systemPackages = [ pkgs.htop ];`
git commit -am "add htop"
SHA=$(git rev-parse HEAD)
hive-m1nd request-apply-commit alice $SHA
exit
# On the host:
sudo hive-c0re pending # shows the queued approval with id N
sudo hive-c0re approve N # validates, advances main, rebuilds h-alice
sudo nixos-container run h-alice -- which htop # /run/current-system/sw/bin/htop
```
Per-agent layout — two separate git repos:
```
/var/lib/hyperhive/agents/<name>/config/   # proposed — manager edits, hive-c0re reads only
├── .git/
└── agent.nix                              # the only file the manager can change
                                           # (initial commit by hive-c0re on first spawn,
                                           #  never touched by hive-c0re again)
/var/lib/hyperhive/applied/<name>/         # applied — hive-c0re-only; container builds here
├── .git/
├── flake.nix                              # hive-c0re-managed; references hyperhive_flake
└── agent.nix                              # overwritten by approve from the proposed commit
```
The container's `--flake` ref is `<applied_dir>#default`. The flake's
`nixosConfigurations.default` extends `hyperhive.nixosConfigurations.agent-base`
with `./agent.nix` plus an inline module setting `environment.etc."gitconfig".text`
with the agent's name as the git committer identity.
On approve: `git show <commit>:agent.nix` from `proposed/<name>`, write the bytes
into `applied/<name>/agent.nix`, commit there as `hive-c0re`, then
`nixos-container update`. The manager can only propose; only hive-c0re advances
`applied`'s `main`.
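The queue semantics behind `request_apply_commit` / `pending` / `approve` / `deny` can be sketched with an in-memory stand-in. The real queue is sqlite-backed; the types and field names below are hypothetical, and the sketch leaves out the `git show`, commit, and rebuild steps that follow an approval.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
enum Status {
    Pending,
    Approved,
    Denied,
}

/// Hypothetical record: which agent, which proposed commit, and its state.
#[derive(Debug, Clone)]
struct Approval {
    agent: String,
    commit: String,
    status: Status,
}

/// In-memory stand-in for the sqlite approval queue.
#[derive(Default)]
struct ApprovalQueue {
    next_id: u64,
    items: BTreeMap<u64, Approval>,
}

impl ApprovalQueue {
    /// Manager side: queue a commit for review and return its id.
    fn request_apply_commit(&mut self, agent: &str, commit: &str) -> u64 {
        self.next_id += 1;
        self.items.insert(
            self.next_id,
            Approval {
                agent: agent.to_string(),
                commit: commit.to_string(),
                status: Status::Pending,
            },
        );
        self.next_id
    }

    /// Host side: ids still waiting for a decision, in ascending order.
    fn pending(&self) -> Vec<u64> {
        self.items
            .iter()
            .filter(|(_, a)| a.status == Status::Pending)
            .map(|(id, _)| *id)
            .collect()
    }

    /// Move a pending entry to its final state; resolved entries stay resolved.
    fn resolve(&mut self, id: u64, to: Status) -> Result<Approval, String> {
        let a = self
            .items
            .get_mut(&id)
            .ok_or_else(|| format!("no approval {id}"))?;
        if a.status != Status::Pending {
            return Err(format!("approval {id} already resolved"));
        }
        a.status = to;
        Ok(a.clone())
    }

    fn approve(&mut self, id: u64) -> Result<Approval, String> {
        self.resolve(id, Status::Approved)
    }

    fn deny(&mut self, id: u64) -> Result<Approval, String> {
        self.resolve(id, Status::Denied)
    }
}
```

Making resolution one-way means an approved commit cannot later be denied, or a denied one approved, without a fresh request, which matches the rule that the manager can only propose.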
See PLAN.md for the full design and the deferred-out-of-scope list.
## Inspirations
- **`~/Repos/bitburner-agent`** — sibling project, drives Claude Code in a turn
loop against a Bitburner CDP session. Patterns to steal as we grow:
per-cycle prompt diffing (vs full state), notes compaction as a separate
short-lived Claude session, MCP server registering tools from a single
`TOOLS` array, dashboard with SSE + xterm.js + sqlite stats sampler, opaque
"terminal event" stream that unifies tool-call / sleep / op-notice / etc.