agent_web_port: actually resolve legacy collisions

previous attempt was wrong: the legacy branch returned port_hash
unconditionally, so two legacies hashing to the same port both
wrote that port and the collision persisted (test still trying to
bind coder's port).

new rule: always probe forward from port_hash, with scan_taken_ports
parameterised by include_implicit_hashes:

- legacy migration (applied dir exists, no port file): pass false.
  scan only counts other agents' port files. first-queried legacy
  claims its hash; subsequent colliders see the first's port file
  and probe forward. we don't know which legacy originally won
  the bind race, so first-write-wins; the loser was already
  crash-looping anyway and gets a fresh port to rebuild to.

- fresh spawn (no applied dir): pass true. counts port files AND
  implicit hashes for not-yet-migrated legacies, so a new spawn
  doesn't race with an unmigrated peer.

migration note for affected users: agents whose port file says
something other than their hashed port may have been corrupted by
the previous fix. Hit ↻ R3BU1LD on the offender to regenerate the
flake (uses the current port file) and the container will bind
the right port on restart.
This commit is contained in:
müde 2026-05-15 21:13:17 +02:00
parent 237b215c55
commit c35f566d15

View file

@ -46,18 +46,26 @@ const DEFAULT_MEMORY_MAX: &str = "2G";
const DEFAULT_CPU_QUOTA: &str = "50%"; const DEFAULT_CPU_QUOTA: &str = "50%";
/// Returns the per-agent web UI port. Manager is fixed at `MANAGER_PORT`. /// Returns the per-agent web UI port. Manager is fixed at `MANAGER_PORT`.
/// For sub-agents the port is sticky once chosen:
/// ///
/// - **Port file present** (`state_root/port`): use it. End of story. /// The port is **sticky** once chosen: looked up from
/// - **Port file absent, applied flake present**: this is a legacy /// `state_root/port` if present. On first call we have to decide
/// agent whose container is already bound to the bare /// what to write there, and the answer depends on whether the agent
/// `port_hash(name)`. Don't probe; just migrate by writing that /// is legacy (has an applied flake — container already exists at
/// value to the port file. The container stays where it is and /// some port) or fresh (no applied dir yet — we're about to pick).
/// subsequent renders agree with it. ///
/// - **Port file absent, no applied flake**: this is a fresh spawn. /// In both cases we probe forward from `port_hash(name)` to skip
/// Probe forward from `port_hash(name)` to skip any port another /// ports already claimed by another agent's *port file*. The
/// sub-agent has already claimed (via port file or legacy hash). /// difference is what we count as "claimed":
/// Write the chosen port back. ///
/// - **Legacy migration**: only port files count. We do NOT treat
/// other legacy agents' implicit hashes as taken — if two legacy
/// agents collide on hash, the first queried claims the hash port
/// and the others probe forward. (We don't know which originally
/// won the bind race; first-write-wins is good enough — the
/// loser was already crash-looping anyway.)
/// - **Fresh spawn**: port files AND implicit hashes for any legacy
/// agent that hasn't been migrated yet. Without that, a new
/// agent could pick the same port as a yet-to-be-migrated legacy.
#[must_use] #[must_use]
pub fn agent_web_port(name: &str) -> u16 { pub fn agent_web_port(name: &str) -> u16 {
if name == MANAGER_NAME { if name == MANAGER_NAME {
@ -71,31 +79,20 @@ pub fn agent_web_port(name: &str) -> u16 {
{ {
return port; return port;
} }
let applied_exists = crate::coordinator::Coordinator::agent_applied_dir(name).exists(); let is_legacy = crate::coordinator::Coordinator::agent_applied_dir(name).exists();
let chosen = if applied_exists { let taken = scan_taken_ports(name, /* include_implicit_hashes = */ !is_legacy);
// Legacy agent — container already running on the hashed
// port. Don't move it; just persist the value so future
// calls bypass this path.
port_hash(name)
} else {
let taken = scan_taken_ports(name);
let start = port_hash(name); let start = port_hash(name);
let mut port = start; let mut chosen = start;
for _ in 0..WEB_PORT_RANGE { for _ in 0..WEB_PORT_RANGE {
if !taken.contains(&port) { if !taken.contains(&chosen) {
break; break;
} }
port = next_port(port); chosen = next_port(chosen);
if port == start { if chosen == start {
// Range fully exhausted (very unlikely — 900 slots) —
// give up and use the hashed value; collisions are
// surfaced as bind errors by the harness retry loop.
tracing::warn!(%name, "agent_web_port: range exhausted, returning hash"); tracing::warn!(%name, "agent_web_port: range exhausted, returning hash");
break; break;
} }
} }
port
};
let _ = std::fs::create_dir_all(&state_root); let _ = std::fs::create_dir_all(&state_root);
if let Err(e) = std::fs::write(&port_file, format!("{chosen}\n")) { if let Err(e) = std::fs::write(&port_file, format!("{chosen}\n")) {
tracing::warn!(error = ?e, file = %port_file.display(), "persisting agent port failed"); tracing::warn!(error = ?e, file = %port_file.display(), "persisting agent port failed");
@ -122,14 +119,20 @@ fn next_port(port: u16) -> u16 {
} }
} }
/// Scan every other agent's effective web UI port: prefer the /// Set of ports claimed by other agents.
/// persisted `port` file when present, fall back to the hashed ///
/// value for legacy agents that pre-date the port-file scheme. The /// - Always includes ports persisted to other agents' port files.
/// latter is important on existing deployments — without it, a new /// - `include_implicit_hashes`: if true (fresh-spawn case), also
/// agent's collision check wouldn't see incumbents that haven't /// include `port_hash(other_name)` for every other agent that has
/// written their port file yet, and we'd re-emit the same /// NOT yet been migrated to a port file. This protects new spawns
/// collision the operator just hit. /// from racing with not-yet-migrated legacies.
fn scan_taken_ports(name: &str) -> std::collections::HashSet<u16> { ///
/// When false (legacy migration), we do NOT count implicit hashes
/// — otherwise two legacies that hash to the same port would each
/// see each other's hash as taken and both probe forward away
/// from the port one of them is actually running on, leaving a
/// dashboard URL that doesn't match the running container.
fn scan_taken_ports(name: &str, include_implicit_hashes: bool) -> std::collections::HashSet<u16> {
let mut out = std::collections::HashSet::new(); let mut out = std::collections::HashSet::new();
let Ok(rd) = std::fs::read_dir("/var/lib/hyperhive/agents") else { let Ok(rd) = std::fs::read_dir("/var/lib/hyperhive/agents") else {
return out; return out;
@ -146,9 +149,9 @@ fn scan_taken_ports(name: &str) -> std::collections::HashSet<u16> {
&& let Ok(port) = s.trim().parse::<u16>() && let Ok(port) = s.trim().parse::<u16>()
{ {
out.insert(port); out.insert(port);
} else { } else if include_implicit_hashes {
// Legacy: no port file yet → its effective port is the // Legacy not yet migrated. From the fresh-spawn POV its
// bare hash. Treat as taken so we don't collide with it. // effective port is the bare hash.
out.insert(port_hash(&file_name)); out.insert(port_hash(&file_name));
} }
} }