# Gotchas
NixOS + nspawn quirks and lessons we hit the hard way. If something
here looks unmotivated in the code, there's usually a story underneath.
## `nixos-container` doesn't expose `--bind` on the CLI
The CLI doesn't accept `--bind`. Bind mounts go through `EXTRA_NSPAWN_FLAGS` in
`/etc/nixos-containers/<NAME>.conf` — the start script
(`/nix/store/.../container_-start`) expands it unquoted into the
`systemd-nspawn` invocation. `lifecycle::set_nspawn_flags()` rewrites
this line.
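
A minimal sketch of what rewriting that line could look like. This is not the actual `lifecycle::set_nspawn_flags()`; the function name, the quoting style, and the bind paths in the usage note are assumptions made for illustration.

```rust
/// Hypothetical helper: return `conf` with its EXTRA_NSPAWN_FLAGS line
/// replaced (or appended if missing). The start script expands the value
/// unquoted, so each flag must be free of whitespace.
pub fn set_extra_nspawn_flags(conf: &str, flags: &[&str]) -> String {
    let new_line = format!("EXTRA_NSPAWN_FLAGS=\"{}\"", flags.join(" "));
    let mut out: Vec<String> = Vec::new();
    let mut replaced = false;

    for line in conf.lines() {
        if line.trim_start().starts_with("EXTRA_NSPAWN_FLAGS=") {
            // Keep a single assignment; silently drop any duplicates.
            if !replaced {
                out.push(new_line.clone());
                replaced = true;
            }
        } else {
            out.push(line.to_string());
        }
    }
    if !replaced {
        out.push(new_line);
    }

    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Calling it with `&["--bind=/run/hyperhive/alice:/run/hive"]` (made-up paths) leaves every other line of the `.conf` untouched and swaps only the flags assignment.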
## `/run/systemd/nspawn/*.nspawn` overrides are ignored
`nixos-container`'s start script builds the nspawn command line
directly. Dropping a `.nspawn` file under `/run/systemd/nspawn/`
looks like the obvious extension point and does nothing. Use
`EXTRA_NSPAWN_FLAGS` (above).
## `boot.isNspawnContainer = true`
Not `boot.isContainer = true`. Renamed in nixos-25.11+.
## `nixos-container create` auto-assigns `HOST_ADDRESS` / `LOCAL_ADDRESS`
…in the `.conf`. The start script's `if HOST_ADDRESS set →
--network-veth` branch then forces a private netns — silently fatal
for our web UIs (the bind is invisible from the host). We
force-clear `HOST_ADDRESS` / `LOCAL_ADDRESS` / `HOST_ADDRESS6` /
`LOCAL_ADDRESS6` / `HOST_BRIDGE` and set `PRIVATE_NETWORK=0`.
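
The clearing step can be sketched as a pure text transform over the `.conf`. This is an illustration only: the function name is hypothetical, and writing each cleared key as an empty assignment is an assumption about what "force-clear" means in the real code.

```rust
/// Keys that would push the start script into a private network namespace.
const CLEARED_KEYS: [&str; 5] = [
    "HOST_ADDRESS",
    "LOCAL_ADDRESS",
    "HOST_ADDRESS6",
    "LOCAL_ADDRESS6",
    "HOST_BRIDGE",
];

/// Hypothetical helper: drop any existing assignments of the network keys,
/// then append empty ones plus PRIVATE_NETWORK=0 (assumed representation).
pub fn force_host_netns(conf: &str) -> String {
    let mut out: Vec<String> = Vec::new();

    for line in conf.lines() {
        // Text before the first '=' is the key; comment lines never match.
        let key = line.split('=').next().unwrap_or("").trim();
        if CLEARED_KEYS.contains(&key) || key == "PRIVATE_NETWORK" {
            continue;
        }
        out.push(line.to_string());
    }

    for key in CLEARED_KEYS.iter() {
        out.push(format!("{}=", key));
    }
    out.push("PRIVATE_NETWORK=0".to_string());

    let mut text = out.join("\n");
    text.push('\n');
    text
}
```

Running it on every start, not just at creation, would keep the result stable even if something re-adds the addresses later.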
## systemd service PATH ≠ host PATH
The hive-c0re service sets `path = [ pkgs.git "/run/current-system/sw" ]`.
In-container harness services do the same so anything an agent adds
to its own `agent.nix` (`environment.systemPackages`) is visible to
claude's Bash tool without editing the service definition.
On the host, `environment.HYPERHIVE_GIT` bakes in git's absolute path
(read by `lifecycle::git_command()`).
## `RuntimeDirectoryPreserve = "yes"`
…keeps `/run/hyperhive/` (and the per-agent sub-dirs) across
hive-c0re restarts. Without it, every restart wipes bind sources and
existing containers can't be started.
## `register_agent` is idempotent
Drops any prior socket task before rebinding. Required so a
hive-c0re restart followed by `rebuild alice` recreates the agent's
socket without needing a clean reinstall.
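
A sketch of the drop-before-rebind ordering, assuming the socket task is represented by some handle whose `Drop` releases the socket. `SocketRegistry` and its generic handle type are hypothetical and stand in for whatever `register_agent` actually stores.

```rust
use std::collections::HashMap;

/// Hypothetical registry: one live socket task handle per agent name.
pub struct SocketRegistry<T> {
    tasks: HashMap<String, T>,
}

impl<T> SocketRegistry<T> {
    pub fn new() -> Self {
        Self { tasks: HashMap::new() }
    }

    /// Idempotent registration. Any prior handle is removed and dropped
    /// *before* `spawn` runs, so the new task can rebind the same socket.
    /// Returns true if an earlier registration was replaced.
    pub fn register_agent(&mut self, name: &str, spawn: impl FnOnce() -> T) -> bool {
        // The removed handle is a temporary, dropped at the end of this statement.
        let replaced = self.tasks.remove(name).is_some();
        self.tasks.insert(name.to_string(), spawn());
        replaced
    }

    pub fn len(&self) -> usize {
        self.tasks.len()
    }
}
```

Using `insert` alone would also replace the entry, but only after the new task had been spawned, which is too late if spawning binds the socket.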
## `claude-code` is unfree
The flake pins it to **nixpkgs-unstable** via
`overlays.claude-unstable` (stable lags too far). The overlay sets
`config.allowUnfreePredicate` on its unstable import to whitelist
`claude-code` specifically — scoped, only this one package.
`harness-base.nix` does the same at the container level because
each per-agent `nixosConfiguration` evaluates its own nixpkgs
instance and the operator's host-level `allowUnfree` does **not**
propagate in. Operators don't need to set anything on their side.
## Claude credentials are per-agent
`/var/lib/hyperhive/agents/<name>/claude/` bind-mounts to
`/root/.claude` (RW). Sharing one dir across agents is NOT viable —
OAuth refresh tokens rotate, so any sibling refresh invalidates all
the others. Login flow runs from the per-agent web UI; creds persist
across `destroy`/recreate (`--purge` wipes them).
## Persistent notes dir per agent
`/var/lib/hyperhive/agents/<name>/state/` bind-mounts to `/state`
(RW). System prompts tell agents to keep durable knowledge here
(`/state/notes.md`, anything else under `/state/`). The harness also
writes its events log here (`/state/hyperhive-events.sqlite`).
Survives `destroy`/recreate alongside the claude dir.
## Web UI ports collide on hash
Sub-agent web UI ports are deterministic FNV-1a of the agent name
modulo 900 (range 8100..8999). With 900 slots the birthday-paradox
chance of at least one collision is roughly 25% at 23 agents and 39%
at 30, assuming the hash spreads names uniformly. Operator resolves a
collision by renaming the offending
agent (different hash → different port) and rebuilding. No state
file, no probing, no port-allocation drift — the value is
reproducible from just the name. Manager is fixed at 8000;
dashboard at `cfg.dashboardPort` (default 7000).
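
The port derivation and the collision odds can be sketched as follows. The text above does not say which FNV-1a width is used, so the 32-bit variant here is an assumption, as are the function names; ports computed by this sketch may differ from the real ones.

```rust
const FNV_OFFSET_BASIS: u32 = 0x811c_9dc5;
const FNV_PRIME: u32 = 0x0100_0193;

/// Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime.
pub fn fnv1a_32(data: &[u8]) -> u32 {
    data.iter()
        .fold(FNV_OFFSET_BASIS, |h, b| (h ^ u32::from(*b)).wrapping_mul(FNV_PRIME))
}

/// Map an agent name into 8100..=8999 (assumed 32-bit hash width).
pub fn agent_port(name: &str) -> u16 {
    8100 + (fnv1a_32(name.as_bytes()) % 900) as u16
}

/// Probability that at least two of `agents` names share one of `slots`
/// ports, assuming the hash is uniform (the birthday problem).
pub fn collision_probability(agents: u32, slots: u32) -> f64 {
    if agents > slots {
        return 1.0;
    }
    let mut all_distinct = 1.0_f64;
    for i in 0..agents {
        all_distinct *= 1.0 - f64::from(i) / f64::from(slots);
    }
    1.0 - all_distinct
}
```

`collision_probability(n, 900)` crosses one half in the mid-thirties, which is why renaming stays a workable fix for small hives but not for large ones.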
## Restart races on TCP bind
Both the dashboard and per-agent web UI use `tokio::net::TcpSocket`
with `SO_REUSEADDR` plus a retry-on-`AddrInUse` loop (12 tries,
exponential backoff capped at 2s, ~22s total). REUSEADDR handles
the `TIME_WAIT` case from a clean previous exit; retry covers the
genuine "previous process is still alive during a systemd restart
overlap" case. REUSEADDR does **not** allow two simultaneous
`LISTEN` sockets on the same port (that would be `SO_REUSEPORT`,
which we don't use) — exclusivity is preserved.
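
A std-only sketch of the retry loop. It uses `std::net::TcpListener` rather than `tokio::net::TcpSocket`; on Unix the standard library sets `SO_REUSEADDR` on listeners itself. The 500 ms initial delay in the usage note is an assumption, since only the try count, the cap, and the approximate total are stated above.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread;
use std::time::Duration;

/// Delays slept between attempts: doubling from `initial`, clamped to `cap`.
/// `tries` attempts need `tries - 1` sleeps.
pub fn backoff_schedule(tries: u32, initial: Duration, cap: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut delay = initial.min(cap);
    for _ in 1..tries {
        delays.push(delay);
        delay = (delay * 2).min(cap);
    }
    delays
}

/// Bind `addr`, retrying only on AddrInUse; any other error is returned
/// immediately, and the last AddrInUse is returned once tries run out.
pub fn bind_with_retry(
    addr: SocketAddr,
    tries: u32,
    initial: Duration,
    cap: Duration,
) -> io::Result<TcpListener> {
    let mut delays = backoff_schedule(tries, initial, cap).into_iter();
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            Err(e) if e.kind() == io::ErrorKind::AddrInUse => match delays.next() {
                Some(delay) => thread::sleep(delay),
                None => return Err(e),
            },
            Err(e) => return Err(e),
        }
    }
}
```

With 12 tries, an assumed 500 ms start, and a 2 s cap, the schedule is 0.5 s, 1 s, then nine sleeps of 2 s. Retrying only on `AddrInUse` keeps permission or address errors from being masked by the backoff.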
## Orphan approvals
If state dirs are wiped out from under a pending approval (test
scripts, manual `rm -rf`), the dashboard's next render marks them
`failed` with note `"agent state dir missing"` so they fall out of
`pending`. They stay in sqlite for audit.
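
The sweep could look roughly like this. The `Approval` and `Status` types are hypothetical, and so is the assumption that "state dir" means `<agents_root>/<name>/state`; the real check runs against sqlite rows during dashboard render.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Failed,
}

/// Hypothetical in-memory view of one approval row.
pub struct Approval {
    pub agent: String,
    pub status: Status,
    pub note: Option<String>,
}

/// Mark pending approvals as failed when the agent's state dir is gone.
/// Rows are updated in place, never removed, so the audit trail survives.
/// Returns how many approvals were changed.
pub fn fail_orphans(approvals: &mut [Approval], agents_root: &Path) -> usize {
    let mut changed = 0;
    for approval in approvals.iter_mut().filter(|a| a.status == Status::Pending) {
        // Assumed layout: <agents_root>/<agent>/state
        if !agents_root.join(&approval.agent).join("state").is_dir() {
            approval.status = Status::Failed;
            approval.note = Some("agent state dir missing".to_string());
            changed += 1;
        }
    }
    changed
}
```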
## Nix store `cp -r` preserves read-only bits
Copying a nix store path with `cp -r src/. $out/` inside a
`pkgs.runCommand` derivation preserves the read-only permissions of
store files. Any subsequent write into the copied tree (adding new
files in subdirectories) fails with `EACCES` ("Permission denied").
Fix: pass `--no-preserve=mode,ownership` so the output tree is writable.
## `hive-forge`: prefer over raw curl pipelines
Every agent container has `hive-forge` in PATH (installed via
`harness-base.nix`; lives in `/hive-forge` as a proper Rust binary
since #280). Use it instead of ad-hoc curl pipelines:
```bash
hive-forge view 42 # title + body + comments
hive-forge comment 42 --body "..." # post comment (inline body)
hive-forge comment 42 --body-file - <<EOF # ...or pipe a HEREDOC
multi-line body
EOF
hive-forge assign 42 damocles
hive-forge close 42
hive-forge labels 42 add feature
hive-forge pr 42 # PR metadata as JSON
hive-forge branches deployed/ # filter branches by pattern
hive-forge -r other-org/other-repo pr 7 # target a different repo
```
`hive-forge <verb> --help` prints the full signature for any verb.
Credentials come from `$HYPERHIVE_STATE_DIR/forge-token`; default
repo from `$HIVE_FORGE_REPO`, overridden per-invocation by the
global `-r/--repo` flag.