hyperhive/TODO-ops.md
müde 56e7eb6e73 agent ui: answer questions inline from the per-agent page
loose-ends question rows get a textarea + send button; the operator
answers as operator by POSTing to the core dashboard's
/answer-question route, not the per-agent socket — keeps the
operator-authority path off the agent's own socket. cross-origin POST
needs a CORS shim on that route for now; drops out once the gateway
makes the page same-origin.

also splits deployment/ops/boundaries/gateway work into TODO-ops.md.
2026-05-20 10:01:12 +02:00

Hyperhive — deployment, ops & boundaries

Tracking the deployment-shape + operational-hardening work: container network isolation, the unifying gateway, the operator-vs-agent trust boundary, and process privilege separation.

These items interlock. Today "the operator surface" and "the agent surface" are a convention, not a boundary — nothing stops a container from curling the core daemon on localhost:<port>, or another agent's web UI. The gateway, network isolation, and privsep together turn that convention into an enforced boundary. Sequencing matters; see the order at the bottom.

The boundary we're building toward

Two principals, two paths:

  • Operator — reaches every UI (the dashboard + every per-agent page) through the gateway, on one origin. Operator-authority actions (approve / deny, answer-as-operator, lifecycle POSTs) are served by the core daemon and only reachable via the gateway.
  • Agent — speaks only for itself, only over its per-agent unix socket. The socket's identity is the agent (see docs/conventions.md, "identity = socket"). An agent must not be able to reach the core daemon's HTTP surface, another agent's socket, or another agent's web UI.

Design rule that falls out of this: operator-authority actions never get a per-agent-socket entry point. They live on the core backend. Worked example — answering an operator-targeted question is a POST /answer-question/{id} on the core dashboard, never an AgentRequest variant. If it were a per-agent-socket request, an agent could curl its own socket and spoof an operator answer. The per-agent web UI POSTs cross-origin to the core for these (see the inline-answer feature — the loose-ends section on each agent page).
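A minimal Rust sketch of the design rule, under stated assumptions: the `AgentRequest` variants, the `Core` struct and both method names are hypothetical stand-ins, not the actual hive-c0re types. What it tries to show is that the request type an agent socket can carry has no way to express an operator answer, so the spoof is unrepresentable on that path rather than merely checked for.

```rust
use std::collections::HashMap;

/// Requests an agent may send over its own unix socket.
/// Hypothetical variants; there is deliberately no `AnswerQuestion` here.
#[derive(Debug)]
enum AgentRequest {
    AskOperator { text: String },
    ReportStatus { text: String },
}

#[derive(Debug)]
struct Question {
    asked_by: String,
    text: String,
    answer: Option<String>,
}

/// Stand-in for the core daemon's state (assumed shape).
#[derive(Debug, Default)]
struct Core {
    next_id: u64,
    questions: HashMap<u64, Question>,
}

impl Core {
    /// Per-agent-socket entry point. `agent` comes from the socket the
    /// request arrived on ("identity = socket"), never from the request body.
    fn handle_agent_request(&mut self, agent: &str, req: AgentRequest) -> Option<u64> {
        match req {
            AgentRequest::AskOperator { text } => {
                let id = self.next_id;
                self.next_id += 1;
                let question = Question { asked_by: agent.to_string(), text, answer: None };
                self.questions.insert(id, question);
                Some(id)
            }
            // Status reports create no question, so there is no id to return.
            AgentRequest::ReportStatus { .. } => None,
        }
    }

    /// Core-dashboard entry point, standing in for POST /answer-question/{id}.
    /// Nothing in `AgentRequest` dispatches here.
    fn answer_question(&mut self, id: u64, answer: &str) -> Result<(), String> {
        let question = self
            .questions
            .get_mut(&id)
            .ok_or_else(|| format!("no question {id}"))?;
        question.answer = Some(answer.to_string());
        Ok(())
    }
}
```

The sketch only covers the type-level half of the rule. Whether `answer_question` is reachable by the operator alone still depends on the network isolation and gateway work below.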

Workstreams

1. Container network isolation

Today containers share the host network namespace, so a container can reach localhost:<core-port>, the dashboard, and every other agent's web port. Until this changes, nothing below is actually enforced — the operator/agent split is on the honour system.

  • Give each container a private veth / bridge with no route to the host's loopback-bound services.
  • The per-agent unix socket stays the only host-bound channel (it already is the intended one).
  • Open question: the per-agent web UI still needs to be reachable by the operator's browser, which is what the gateway is for (below). So the gateway has to reach each container's web port, while the container itself should not be able to reach the gateway or the core daemon.
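One way to check the bullets above once isolation lands is a small probe run from inside a container. The Rust sketch below is an assumption-laden illustration: `can_reach` and `isolation_violations` are hypothetical helper names, and the list of forbidden addresses (core port, dashboard, other agents' web ports) would have to come from the real deployment. It only tests TCP reachability of the addresses it is given, so an empty result is evidence, not proof, of isolation.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
/// `timeout` must be non-zero; a zero timeout is reported as unreachable.
fn can_reach(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Returns the subset of `forbidden` that is still reachable from wherever
/// this code runs. Intended to be run inside an agent container, where the
/// expected result after isolation is an empty list.
fn isolation_violations(forbidden: &[SocketAddr], timeout: Duration) -> Vec<SocketAddr> {
    forbidden
        .iter()
        .copied()
        .filter(|addr| can_reach(*addr, timeout))
        .collect()
}
```

The per-agent unix socket is intentionally not part of this probe, since it is the one host-bound channel that should keep working.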

2. Unifying gateway / reverse proxy

(Moved here from TODO.md "Dashboard".)

Today every agent's web UI is reached at <host>:<per-agent-port>/, so operators juggle a port list. Stand up nginx (or similar) on a single domain, routing /agent/<name>/... to the matching container's web port and / to the main dashboard. Touches: a NixOS module on the host, the dashboard's per-agent link rendering, and the per-agent web server's base-path handling (currently assumes root). Bookmarks then survive port reshuffles, and per-agent stats links can become relative URLs instead of hard-coded ports.
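The routing and base-path pieces can be sketched in Rust as below. This is an illustration of the intended mapping, not gateway code: in practice the mapping would live in the nginx config, and `Upstream`, `route`, `is_valid_agent_name` and `agent_link` are hypothetical names. The allowed agent-name character set is also an assumption.

```rust
/// Where the gateway would send a request (hypothetical type).
#[derive(Debug, PartialEq)]
enum Upstream<'a> {
    /// Forward to the named agent's web port, with the /agent/<name> prefix stripped.
    Agent { name: &'a str, path: String },
    /// Everything else goes to the main dashboard, path unchanged.
    Dashboard { path: &'a str },
}

/// Assumed naming rule: non-empty, ASCII alphanumerics plus '-' and '_'.
fn is_valid_agent_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Map a request path to its upstream. Paths under /agent/ whose name segment
/// is missing or invalid fall through to the dashboard.
fn route(path: &str) -> Upstream<'_> {
    if let Some(rest) = path.strip_prefix("/agent/") {
        let (name, tail) = match rest.split_once('/') {
            Some((name, tail)) => (name, tail),
            None => (rest, ""),
        };
        if is_valid_agent_name(name) {
            return Upstream::Agent { name, path: format!("/{tail}") };
        }
    }
    Upstream::Dashboard { path }
}

/// Build a link on the per-agent page relative to a configurable base path,
/// instead of assuming the page is served at the root.
fn agent_link(base_path: &str, rel: &str) -> String {
    format!(
        "{}/{}",
        base_path.trim_end_matches('/'),
        rel.trim_start_matches('/')
    )
}
```

With a base path of `/agent/alice` behind the gateway and an empty base path when served directly, the same per-agent server code can emit working links in both setups.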

Boundary payoff: once the dashboard and the per-agent pages are same-origin behind the gateway, the cross-origin CORS shim on POST /answer-question/{id} (added with the inline-answer feature) can be deleted — the per-agent page's POST becomes a plain same-origin request. Grep for with_cors / Access-Control-Allow-Origin in hive-c0re/src/dashboard.rs and remove it when this lands.

The gateway is also the natural home for auth, if/when the operator surface ever needs it.

3. Privsep the core daemon from the web UI

(Moved here from TODO.md "Security".)

hive-c0re runs as root, and it has to: nixos-container create / start / destroy, the meta git repo, and every per-agent bind mount need it. The HTTP server lives in the same process, so every read endpoint (/api/state-file, /api/journal/{name}, /api/agent-config/{name}) is one allow-list bug away from serving arbitrary host files. Split it: keep the privileged daemon doing lifecycle + git + ipc, and run the web UI as an unprivileged user that talks to the daemon over a unix socket with a narrow request surface (ReadAgentStateFile { agent, rel_path } etc.). The unprivileged process can't read /etc/shadow even if every check in get_state_file is bypassed, because it lacks the file permissions. Container-lifecycle POSTs (/restart, /destroy, etc.) become forwarded RPCs that the privileged side authorises on its own terms.
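A possible shape for the narrow request surface, sketched in Rust. Only `ReadAgentStateFile { agent, rel_path }` comes from the note above; the `WebRequest` enum name, `resolve_state_path`, `authorise`, the agent-name rule, and the `<root>/<agent>/state/` layout are assumptions made for illustration. The check is purely lexical and does not follow symlinks, which is one more reason the unprivileged user matters as the backstop.

```rust
use std::path::{Component, Path, PathBuf};

/// Requests the unprivileged web process may send to the privileged daemon.
/// Hypothetical enum; further variants (journal, config, lifecycle) elided.
#[derive(Debug)]
enum WebRequest {
    ReadAgentStateFile { agent: String, rel_path: String },
}

/// Resolve a state-file request to a path under <agents_root>/<agent>/state/,
/// rejecting anything that could lexically escape that directory.
fn resolve_state_path(agents_root: &Path, agent: &str, rel_path: &str) -> Result<PathBuf, String> {
    // Assumed naming rule: non-empty, ASCII alphanumerics plus '-' and '_'.
    let name_ok = !agent.is_empty()
        && agent
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    if !name_ok {
        return Err(format!("invalid agent name: {agent:?}"));
    }

    let rel = Path::new(rel_path);
    // Accept only plain name components: this rejects absolute paths and any `..`.
    let plain = rel.components().all(|c| matches!(c, Component::Normal(_)));
    if rel_path.is_empty() || !plain {
        return Err(format!("invalid relative path: {rel_path:?}"));
    }

    Ok(agents_root.join(agent).join("state").join(rel))
}

/// Privileged-side dispatch: every request is validated here, on the daemon's
/// terms, regardless of what the web process already checked.
fn authorise(agents_root: &Path, req: &WebRequest) -> Result<PathBuf, String> {
    match req {
        WebRequest::ReadAgentStateFile { agent, rel_path } => {
            resolve_state_path(agents_root, agent, rel_path)
        }
    }
}
```

Keeping the validation on the privileged side means a compromised web process can only ask for things the enum can express, and each of those is re-checked before the daemon touches the filesystem.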

Cheaper once the harness/state split lands (see TODO.md "Split harness-internal state from agent-visible state") — the unprivileged web server then only needs read access to /agents/<n>/state/, not /agents/<n>/harness/.

Suggested sequencing

  1. Gateway first — pure ergonomics win, unblocks same-origin, no behavioural risk.
  2. Network isolation next — the step that makes the operator/agent boundary real. Everything before it is honour-system.
  3. Privsep last — defence in depth on the core process itself; valuable independent of the other two, but the biggest refactor.