phase 8 step 1: per-agent claude creds bind + destroy keeps state

2026-05-15 12:39:22 +02:00 · 2026-05-15 12:39:22 +02:00 · a42fdb3a5c
commit a42fdb3a5c
parent 0fc287c768
9 changed files with 158 additions and 24 deletions
--- a/PLAN.md
+++ b/PLAN.md
@ -99,7 +99,13 @@ A multi-Claude-Code-agent setup on a single host:

 **Manager concurrency = event loop.** `hive-m1nd` pulls from a heterogeneous `next_event` stream: inbound agent messages, replies to sync sends, lifecycle events from `hive-c0re` (crash, OOM, approval-resolved), and dashboard signals. One queue, claude turn per event.

-**Anthropic credentials.** Shared key on host, bind-mounted into every container. No per-agent keys in v1.
+**Anthropic credentials.** ~~Shared key on host~~ — revised in Phase 8.
+Per-agent persistent `~/.claude/` dir bind-mounted from
+`/var/lib/hyperhive/agents/<name>/claude/`. OAuth refresh tokens rotate, so
+sharing across agents is a non-starter (any sibling refresh invalidates all
+the others). One interactive login per agent, ever; creds survive
+`destroy`/recreate by default. Login flow runs from the per-agent web UI
+(see Phase 8).

 **Workdir bootstrap.** Each agent's `state/` starts empty. Initial-task message tells the agent what to clone/set up. Manager can drop big artefacts into `state/` directly (it has RW) and pass the path as a message reference.

@ -230,6 +236,55 @@ The original open-decisions list, with what we picked:
  subcommand (`serve` / `spawn` / `kill` / `rebuild` / `list` / `pending` /
  `approve` / `deny`).

+### ⏳ Phase 8 — real claude in containers + login UX
+
+Until this lands the harness falls back to the echo path; we've never run an
+end-to-end turn with a real model in a real container.
+
+**Credential model.** Per-agent persistent dir at
+`/var/lib/hyperhive/agents/<name>/claude/` bind-mounted RW to `/root/.claude`
+inside the container. *Not* shared across agents: OAuth refresh tokens rotate,
+and sharing one dir means the first refresh by any sibling invalidates all the
+others. Each agent owns its own token lineage from first login onward.
+
+**State-dir persistence.** Agent state dirs (including the claude creds dir)
+persist across `destroy`/recreate by default. The `destroy` verb only purges
+state when given an explicit "wipe" flag from the operator — recreating an
+agent of the same name reuses prior creds with no re-login.
+
+**First-deploy approval.** Spawning a brand-new agent name goes through the
+existing approval queue (same path as config edits). The dashboard shows a
+spinner while `nixos-container create` + `update` + `start` run.
+
+**"needs login" agent state.** If the bound `~/.claude/` has no valid session,
+the harness boots in a partial mode: per-agent web UI is up, but the turn
+loop does NOT start. Dashboard surfaces the state per-agent so the operator
+knows where to click.
+
+**Login over the per-agent web UI.** No more `nixos-container root-login` for
+the common case. The agent's web UI exposes a "log in" action that:
+1. Spawns `claude /login` (or equivalent) inside the container with plain
+   stdio pipes — no PTY unless we discover we need one.
+2. Reads the OAuth URL from the process stdout and shows it on the page.
+3. Provides a paste field for the resulting code; writes it to the process
+   stdin.
+4. On success, transitions out of "needs login" and starts the turn loop.
+
+If `claude` turns out to require a TTY (refuses on `!isatty()`, uses raw-mode
+input, or only renders the URL with ANSI styling), redo the backend with a
+PTY (e.g. `portable-pty`). Don't pre-build for that — start simple.
+
+**Sequence.** Ship in this order — don't do (4) before (3) or there's nowhere
+for the login UI to live: (1) bind-mount + per-agent dir creation in
+`lifecycle::set_nspawn_flags`, (2) approval-gated first spawn + dashboard
+spinner, (3) harness "needs login" partial-run mode, (4) PTY-backed login
+endpoint on the per-agent UI.
+
+**Exit:** spawn a new agent from the dashboard → approve → wait for spinner
+→ click "log in" on the agent's page → complete OAuth in the browser →
+paste code → agent enters the turn loop and replies to a T4LK message via
+real `claude --print`.
+
 ## Polish backlog (not phased)

 See CLAUDE.md → "Polish backlog" for the live list. Highlights: operator