docs: move backlog to forge issue tracker, extract boundary doc
This commit is contained in:
parent
44c86b9278
commit
4715e88fff
9 changed files with 78 additions and 184 deletions
17
CLAUDE.md
17
CLAUDE.md
|
|
@ -6,9 +6,11 @@ when you need depth on a subsystem. This file is the index +
|
|||
scratchpad.
|
||||
|
||||
- High-level project intro: **[README.md](README.md)**.
|
||||
- Open work + backlog: **[TODO.md](TODO.md)**.
|
||||
- Deployment / ops / boundaries / gateway backlog:
|
||||
**[TODO-ops.md](TODO-ops.md)**.
|
||||
- Open work + backlog: the **[forge issue
|
||||
tracker](http://localhost:3000/hyperhive/hyperhive/issues)**.
|
||||
- Operator/agent trust-boundary design:
|
||||
**[docs/boundary.md](docs/boundary.md)** (`area:ops` issues
|
||||
for the deployment/gateway/privsep work).
|
||||
|
||||
## File map
|
||||
|
||||
|
|
@ -286,8 +288,9 @@ Prune freely.
|
|||
`/answer-question/{id}` (CORS shim `with_cors` on that
|
||||
route), never the per-agent socket — keeps the
|
||||
operator-authority path off the agent's own socket. See
|
||||
`TODO-ops.md` for the boundary rationale + the deployment/
|
||||
gateway/privsep cluster.
|
||||
`docs/boundary.md` for the boundary rationale; the
|
||||
deployment/gateway/privsep work is tracked as `area:ops`
|
||||
forge issues.
|
||||
- **Just landed:** sub-agents get a read-only view of their own
|
||||
config repo. `set_nspawn_flags` now adds
|
||||
`--bind-ro={proposed_dir}:/agents/<name>/config` for every
|
||||
|
|
@ -607,7 +610,7 @@ Prune freely.
|
|||
<details data-restore-key> survival, prompt-on-submit pattern.
|
||||
- **Open threads:** two-step spawn, notes compaction,
|
||||
unprivileged containers, Bash allow-list, xterm.js. The
|
||||
deployment / gateway / privsep cluster is tracked in
|
||||
`TODO-ops.md`. (Landed since this note was first written:
|
||||
deployment / gateway / privsep cluster is tracked as
|
||||
`area:ops` forge issues. (Landed since this note was first written:
|
||||
extra per-agent MCP servers, per-agent send allow-list,
|
||||
telemetry + the `/stats` page.)
|
||||
|
|
|
|||
119
TODO-ops.md
119
TODO-ops.md
|
|
@ -1,119 +0,0 @@
|
|||
# Hyperhive — deployment, ops & boundaries
|
||||
|
||||
Tracking the deployment-shape + operational-hardening work:
|
||||
container network isolation, the unifying gateway, the
|
||||
operator-vs-agent trust boundary, and process privilege
|
||||
separation.
|
||||
|
||||
These items interlock. Today "the operator surface" and "the
|
||||
agent surface" are a *convention*, not a boundary — nothing
|
||||
stops a container from curling the core daemon on
|
||||
`localhost:<port>`, or another agent's web UI. The gateway,
|
||||
network isolation, and privsep together turn that convention
|
||||
into an enforced boundary. Sequencing matters; see the order at
|
||||
the bottom.
|
||||
|
||||
## The boundary we're building toward
|
||||
|
||||
Two principals, two paths:
|
||||
|
||||
- **Operator** — reaches every UI (the dashboard + every
|
||||
per-agent page) through the gateway, on one origin.
|
||||
Operator-authority actions (approve / deny, answer-as-operator,
|
||||
lifecycle POSTs) are served by the core daemon and only
|
||||
reachable via the gateway.
|
||||
- **Agent** — speaks only for itself, only over its per-agent
|
||||
unix socket. The socket's identity *is* the agent (see
|
||||
`docs/conventions.md`, "identity = socket"). An agent must not
|
||||
be able to reach the core daemon's HTTP surface, another
|
||||
agent's socket, or another agent's web UI.
|
||||
|
||||
Design rule that falls out of this: **operator-authority
|
||||
actions never get a per-agent-socket entry point.** They live on
|
||||
the core backend. Worked example — answering an
|
||||
operator-targeted question is a `POST /answer-question/{id}` on
|
||||
the core dashboard, *never* an `AgentRequest` variant. If it
|
||||
were a per-agent-socket request, an agent could `curl` its own
|
||||
socket and spoof an operator answer. The per-agent web UI POSTs
|
||||
cross-origin to the core for these (see the inline-answer
|
||||
feature — the loose-ends section on each agent page).
|
||||
|
||||
## Workstreams
|
||||
|
||||
### 1. Container network isolation
|
||||
|
||||
Today containers share the host network namespace, so a
|
||||
container can reach `localhost:<core-port>`, the dashboard, and
|
||||
every other agent's web port. **Until this changes, nothing
|
||||
below is actually enforced** — the operator/agent split is on
|
||||
the honour system.
|
||||
|
||||
- Give each container a private veth / bridge with no route to
|
||||
the host's loopback-bound services.
|
||||
- The per-agent unix socket stays the only host-bound channel
|
||||
(it already is the intended one).
|
||||
- Open question: the per-agent web UI still needs to be
|
||||
reachable *by the operator's browser* — that is what the
|
||||
gateway is for (below). The container itself should not be
|
||||
able to reach the gateway or the core daemon.
|
||||
|
||||
### 2. Unifying gateway / reverse proxy
|
||||
|
||||
(Moved here from TODO.md "Dashboard".)
|
||||
|
||||
Today every agent's web UI is reached at
|
||||
`<host>:<per-agent-port>/`, so operators juggle a port list.
|
||||
Stand up nginx (or similar) terminating one domain that fans
|
||||
requests to `/agent/<name>/...` out to each container's web
|
||||
port, and `/` to the main dashboard. Touches: a NixOS module on
|
||||
the host, the dashboard's per-agent link rendering, and the
|
||||
per-agent web server's base-path handling (currently assumes
|
||||
root). Lets bookmarks survive port reshuffles and unblocks
|
||||
per-agent stats links being relative URLs instead of hard-coded
|
||||
ports.
|
||||
|
||||
Boundary payoff: once the dashboard and the per-agent pages are
|
||||
same-origin behind the gateway, the cross-origin CORS shim on
|
||||
`POST /answer-question/{id}` (added with the inline-answer
|
||||
feature) can be deleted — the per-agent page's POST becomes a
|
||||
plain same-origin request. Grep for `with_cors` /
|
||||
`Access-Control-Allow-Origin` in `hive-c0re/src/dashboard.rs`
|
||||
and remove it when this lands.
|
||||
|
||||
The gateway is also the natural home for auth, if/when the
|
||||
operator surface ever needs it.
|
||||
|
||||
### 3. Privsep the core daemon from the web UI
|
||||
|
||||
(Moved here from TODO.md "Security".)
|
||||
|
||||
hive-c0re runs as root (it has to — `nixos-container` create /
|
||||
start / destroy, the meta git repo, every per-agent bind
|
||||
mount). The HTTP server lives in the same process, so every
|
||||
read-endpoint (`/api/state-file`, `/api/journal/{name}`,
|
||||
`/api/agent-config/{name}`) is one allow-list bug away from
|
||||
serving arbitrary host files. Split it: keep the privileged
|
||||
daemon doing lifecycle + git + ipc, run the web UI as an
|
||||
unprivileged user that talks to the daemon over a unix socket
|
||||
with a narrow request surface (`ReadAgentStateFile { agent,
|
||||
rel_path }` etc.). The unprivileged process can't read
|
||||
`/etc/shadow` even if every check in `get_state_file` is
|
||||
bypassed — it doesn't have the bits. Container-lifecycle POSTs
|
||||
(`/restart`, `/destroy`, etc.) become forwarded RPCs the
|
||||
privileged side authorises on its terms.
|
||||
|
||||
Cheaper once the harness/state split lands (see TODO.md "Split
|
||||
harness-internal state from agent-visible state") — the
|
||||
unprivileged web server then only needs read access to
|
||||
`/agents/<n>/state/`, not `/agents/<n>/harness/`.
|
||||
|
||||
## Suggested sequencing
|
||||
|
||||
1. **Gateway** first — pure ergonomics win, unblocks
|
||||
same-origin, no behavioural risk.
|
||||
2. **Network isolation** next — the step that makes the
|
||||
operator/agent boundary *real*. Everything before it is
|
||||
honour-system.
|
||||
3. **Privsep** last — defence in depth on the core process
|
||||
itself; valuable independent of the other two, but the
|
||||
biggest refactor.
|
||||
56
TODO.md
56
TODO.md
|
|
@ -1,56 +1,6 @@
|
|||
# Hyperhive TODOs
|
||||
|
||||
> **Rough split for who picks up what:** harness ergonomics + host-side
|
||||
> harness plumbing tend to be damocles' interest area; UI-polish work
|
||||
> for the operator is not. Use that as a hint when picking up items,
|
||||
> not a hard rule.
|
||||
The backlog moved to the forge issue tracker:
|
||||
<http://localhost:3000/hyperhive/hyperhive/issues>
|
||||
|
||||
**Deployment / ops / boundaries:** the unifying gateway, container
|
||||
network isolation, the operator-vs-agent trust boundary, and process
|
||||
privsep are tracked separately in [`TODO-ops.md`](TODO-ops.md).
|
||||
|
||||
## Architecture / Features
|
||||
|
||||
- Shared space for all agents to access documents/files without manager routing
|
||||
- Private git forge agents can push to and create new repos in
|
||||
- Move bind mounts in agents to `/agents/<name>/state` so path for agent = path for manager
|
||||
- **Split harness-internal state from agent-visible state**: the `/agents/<n>/state/` mount (host `/var/lib/hyperhive/agents/<n>/state/`) currently mixes the agent's durable notes with harness internals — `hyperhive-events.sqlite`, `hyperhive-turn-stats.sqlite`, `hyperhive-model`, future per-agent skill caches, etc. The agent can accidentally overwrite a harness file, the harness clutters what claude thinks is "my notes dir", and the host-side vacuum has to special-case filenames it owns. Move harness internals to a sibling dir, e.g. `/var/lib/hyperhive/agents/<n>/harness/`, bind-mounted RW into the container as `/agents/<n>/harness/` (same path inside + out, same convention as state). Container's `/agents/<n>/state/` becomes purely agent-owned. Touches: `paths.rs` (new `harness_dir()`), `events.rs`, `turn_stats.rs` (default paths flip), `events_vacuum.rs` (sweep root flips), `lifecycle.rs` (extra bind mount), and a migration that moves existing files on first boot under the new layout. Side benefit: makes the privsep TODO cheaper — the unprivileged web server only needs read access to `/agents/<n>/state/` (operator-meaningful files), not `/agents/<n>/harness/`. The legacy bare `/state` mount the manager still uses (`container_state_prefix("manager") == "/state/"`, manager bind in `lifecycle::set_nspawn_flags`) gets removed in the same pass — manager goes to `/agents/manager/state/` + `/agents/manager/harness/` like every other agent.
|
||||
- **Broadcast messaging**: allow sending messages with recipient "*" to all agents; deliver with hint "this was a broadcast and may not need any action from you"
|
||||
- **Multi-agent restart coordination**: when rebuilding all agents, manager should start first so it can coordinate post-restart confusion (notify agents, suppress unnecessary retries, etc)
|
||||
- **Shared docs/skills repo (RO)**: a single repo on the hive forge that every agent has read-only access to — common references, prompts, runbooks, "skills" the operator wants every agent to inherit without baking into the system prompt or `/shared`. Implementation likely: seed an `org-shared/docs` repo on first hive-forge boot, grant every per-agent user a read membership in the org. Agents `git clone` it (or use the API) to read; only the manager + operator can push.
|
||||
|
||||
## Reminder Tool
|
||||
|
||||
- Per-agent reminder limits (burst capacity, rate limiting)
|
||||
- **Scheduler shutdown**: add graceful shutdown signal when coordinator is destroyed (currently runs forever)
|
||||
- **DB lock contention**: under high reminder volume, the broker's `Mutex<Connection>` serializes every delivery transaction. Consider batching multiple deliveries into one tx, or moving reminders onto a separate sqlite connection.
|
||||
|
||||
## Dashboard
|
||||
|
||||
- **Delivered-reminder rollup on the per-agent stats page**: surface attempt / success / failure counts for reminders this agent fired (in the existing `/stats` page). Needs an `AgentRequest::ReminderRollup { since_secs }` / matching `ManagerRequest::ReminderRollup` RPC so the agent can pull the counts from the host's broker DB (the reminders table is host-owned; agent state doesn't have them). Deferred from the initial stats page so the first cut stays self-contained to data the agent already owns.
|
||||
|
||||
## Harness Ergonomics (agent-side wishlist)
|
||||
|
||||
Filed by damocles, who actually lives in this thing. Loosely ranked by
|
||||
how often the friction bites in normal use.
|
||||
|
||||
- **Optional `in_reply_to: <msg_id>` on send** — pure wire addition; no
|
||||
behavioural change. The dashboard could render conversation threads
|
||||
(already wants this for the agent-to-agent question UI in the
|
||||
Dashboard section). Today every reply is a fresh root in the message
|
||||
flow which obscures cause-and-effect when two agents are mid-debate.
|
||||
Field is optional, ignored if the referenced id is unknown / cross-
|
||||
agent / out of retention.
|
||||
|
||||
## Telemetry
|
||||
|
||||
- **Per-turn stats: host-side vacuum sweep**: the sink writes to `/state/hyperhive-turn-stats.sqlite` on each agent's state dir; needs a periodic retention sweep mirroring `events_vacuum.rs` so the table doesn't grow forever. Default keep-window: 90 days (turn-stats are denser than events but smaller per-row, ~200B each).
|
||||
|
||||
## Harness Behaviour
|
||||
|
||||
- **Auto session-reset when context is large and cache is cold**: today every turn uses `--continue`, so a long-lived agent carries its entire transcript forward indefinitely. When the next turn's context is above some threshold (rough starting point: ~50% of the session limit — hive startup alone burned ~15%, so the headroom disappears fast) *and* the prompt cache is no longer warm (last turn ended past the cache TTL), it's cheaper to start fresh than to re-send the whole history uncached. Open question: drop `--continue` vs. trigger `--compact` first — needs measurement of what each actually costs (uncached re-read of N tokens vs. a compact turn's own token spend + the post-compact uncached re-read). Decision should be data-driven, not guessed. Needs: a context-size estimate per turn (turn_stats already tracks token usage), a cache-warmth heuristic (time since last turn vs. cache TTL), and a one-shot fresh-session path in `turn.rs` mirroring the existing `↻ new session` button.
|
||||
|
||||
## Bugs
|
||||
|
||||
- **Token-budget exhaustion crashes the harness**: when claude's account hits its rate/token cap, the in-flight `claude --print` invocation returns an error the harness doesn't recognise as recoverable, the serve loop exits, and the container stays up with a dead daemon. Operator only notices when an unrelated wake fails to drive a turn. Want: detect the budget-exceeded class of failure (likely a specific stderr line or stream-json `rate_limit_event` shape), fire a `LiveEvent::StatusChanged("rate_limited")` or new status, surface as a red badge + banner on the dashboard + per-agent UI, and have the serve loop park (sleep N minutes, retry) instead of returning Err. Operator can also see "this agent is rate-limited until ~HH:MM" if claude tells us when. Inspect `crate::turn::run_claude`'s `bail!` paths + claude's stderr conventions for the budget error string.
|
||||
- **Post-rebuild system-message missed wake**: at 09:13:14 the dashboard showed `system → damocles container rebuilt` as ✓ delivered, but the agent harness never ran a turn for it (no claude invocation, no operator-visible activity). A subsequent `recv()` from inside the agent returned `(empty)`, confirming the message was popped + marked delivered server-side — yet drove no turn. Most likely cause: the agent_server `serve_agent_stdio` task is up and answering MCP/socket calls, but the `hive-ag3nt::serve` long-poll loop that drives `drive_turn` either died silently during rebuild or never restarted. Investigate: (a) does hive-ag3nt's serve loop survive `nixos-container update` cleanly, or does its tokio runtime get torn down mid-loop? (b) is there an early-exit path on a transient socket error during rebuild that drops the serve task without notifying the manager? (c) compare timeline with manager's own post-rebuild wake to see if this is rebuilt-agents-only or universal. Could be related to the `recv_blocking` fix in `e423d57` if the rebuild restarts the broker mid-subscribe.
|
||||
Operator/agent trust-boundary design rationale: [`docs/boundary.md`](docs/boundary.md).
|
||||
|
|
|
|||
59
docs/boundary.md
Normal file
59
docs/boundary.md
Normal file
|
|
@ -0,0 +1,59 @@
|
|||
# The operator/agent boundary
|
||||
|
||||
Design rationale for hyperhive's two-principal trust model. The
|
||||
*implementation* work — container network isolation, the unifying
|
||||
gateway, core-daemon privsep — is tracked as `area:ops` issues on
|
||||
the forge.
|
||||
|
||||
Today "the operator surface" and "the agent surface" are a
|
||||
*convention*, not a boundary — nothing stops a container from
|
||||
curling the core daemon on `localhost:<port>`, or another agent's
|
||||
web UI. Network isolation, the gateway, and privsep together turn
|
||||
that convention into an enforced boundary.
|
||||
|
||||
## Two principals, two paths
|
||||
|
||||
- **Operator** — reaches every UI (the dashboard + every
|
||||
per-agent page) through the gateway, on one origin.
|
||||
Operator-authority actions (approve / deny, answer-as-operator,
|
||||
lifecycle POSTs) are served by the core daemon and only
|
||||
reachable via the gateway.
|
||||
- **Agent** — speaks only for itself, only over its per-agent
|
||||
unix socket. The socket's identity *is* the agent (see
|
||||
`docs/conventions.md`, "identity = socket"). An agent must not
|
||||
be able to reach the core daemon's HTTP surface, another
|
||||
agent's socket, or another agent's web UI.
|
||||
|
||||
## Design rule
|
||||
|
||||
**Operator-authority actions never get a per-agent-socket entry
|
||||
point.** They live on the core backend.
|
||||
|
||||
Worked example — answering an operator-targeted question is a
|
||||
`POST /answer-question/{id}` on the core dashboard, *never* an
|
||||
`AgentRequest` variant. If it were a per-agent-socket request, an
|
||||
agent could `curl` its own socket and spoof an operator answer.
|
||||
The per-agent web UI POSTs cross-origin to the core for these
|
||||
(see the inline-answer feature — the loose-ends section on each
|
||||
agent page).
|
||||
|
||||
## Why network isolation is the load-bearing step
|
||||
|
||||
Containers currently share the host network namespace, so a
|
||||
container can reach `localhost:<core-port>`, the dashboard, and
|
||||
every other agent's web port. Until that changes, the
|
||||
operator/agent split is on the honour system — every boundary
|
||||
claim above is aspirational. Network isolation is what makes the
|
||||
boundary *real*; the gateway and privsep are ergonomics and
|
||||
defence-in-depth layered on top.
|
||||
|
||||
Suggested sequencing of the `area:ops` issues:
|
||||
|
||||
1. **Gateway** first — pure ergonomics win, unblocks same-origin
|
||||
(lets the cross-origin CORS shim on `/answer-question/{id}` go
|
||||
away), no behavioural risk.
|
||||
2. **Network isolation** next — the step that makes the boundary
|
||||
real. Everything before it is honour-system.
|
||||
3. **Privsep** last — defence in depth on the core process
|
||||
itself; valuable independent of the other two, but the
|
||||
biggest refactor.
|
||||
|
|
@ -68,7 +68,8 @@ Bin-loop helpers `build_row` + `record` land each row at
|
|||
`turn_end`; writes are best-effort, a sqlite hiccup logs + lets
|
||||
the turn loop continue.
|
||||
|
||||
No host-side vacuum yet — tracked in `TODO.md` under Telemetry
|
||||
No host-side vacuum yet — tracked as forge issue
|
||||
[#10](http://localhost:3000/hyperhive/hyperhive/issues/10)
|
||||
(target retention ~90 days, age-only sweep like events_vacuum).
|
||||
|
||||
### `/state/hyperhive-model` (per agent)
|
||||
|
|
|
|||
|
|
@ -256,4 +256,4 @@ status hint moved to the wake prompt + UI header.
|
|||
- Allowed MCP tools: as listed above per flavor.
|
||||
|
||||
`Bash` is on the allow-list pending a finer-grained pattern allow-list
|
||||
(`Bash(git *)`-style) — see [TODO](../TODO.md).
|
||||
(`Bash(git *)`-style) — see [issue #21](http://localhost:3000/hyperhive/hyperhive/issues/21).
|
||||
|
|
|
|||
|
|
@ -411,7 +411,7 @@ Layout, top to bottom:
|
|||
Shift+Enter newlines); submitting POSTs cross-origin to the
|
||||
core dashboard's `/answer-question/{id}` so the operator
|
||||
answers *as operator*. The per-agent socket deliberately gets
|
||||
no operator-authority path — see `TODO-ops.md`.
|
||||
no operator-authority path — see `docs/boundary.md`.
|
||||
- Terminal-wrap: live event tail (sticky-bottom auto-scroll +
|
||||
`↓ N new` pill when not at bottom) followed by an
|
||||
operator-input textarea acting as a prompt.
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
// Base URL of the host dashboard (core backend). Set once the first
|
||||
// /api/state lands. Operator-authority actions (answering a question
|
||||
// as the operator) POST here rather than to this agent's own socket —
|
||||
// see TODO-ops.md for why the boundary lives on the core side.
|
||||
// see docs/boundary.md for why the boundary lives on the core side.
|
||||
let dashboardBase = '';
|
||||
|
||||
// ─── async-form submit (shared with dashboard) ──────────────────────────
|
||||
|
|
|
|||
|
|
@ -779,7 +779,7 @@ struct AnswerForm {
|
|||
/// result. The dashboard has no auth, so `*` exposes nothing a plain
|
||||
/// cross-origin form-POST couldn't already reach. This shim disappears
|
||||
/// once the unifying gateway makes the agent page same-origin; see
|
||||
/// `TODO-ops.md`.
|
||||
/// `docs/boundary.md`.
|
||||
fn with_cors(mut resp: Response) -> Response {
|
||||
resp.headers_mut().insert(
|
||||
axum::http::header::ACCESS_CONTROL_ALLOW_ORIGIN,
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue