Move incident report to log/ directory

Incident reports go in log/ (operational work logs), not docs/ (architecture and reference documentation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:22:42 -07:00
parent ad34904b0b
commit a474f17a8b
1 changed files with 0 additions and 0 deletions
--- a/log/2026-04-03-uid-incident.md
+++ b/log/2026-04-03-uid-incident.md
@@ -0,0 +1,223 @@
+# Incident Report: UID Change Cascading Failure
+
+**Date**: 2026-04-03
+**Duration**: ~2 hours (08:45–10:45 PDT)
+**Severity**: Full platform outage on rift (all containers lost)
+**Root cause**: Changing the `mcp` system user UID from 995 to 850
+
+## Timeline
+
+### Background
+
+Orion was being provisioned as a new worker node. Its NixOS config
+imports `mcp.nix` which pins the mcp user to UID 995. On orion, UID 995
+was already assigned to the `sshd` user, causing a UID collision:
+
+```
+uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
+```
+
+Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
+rejected rootless podman operations because the calling process's UID
+(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.
+
+### The UID Change
+
+To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
+the 800–899 range, empty on all nodes). Both rift and orion were
+rebuilt with `nixos-rebuild switch`.
+
+**Problem 1: NixOS doesn't change UIDs for existing users.** The
+rebuild applied the config with `uid = 850`, but the existing `mcp`
+user on both nodes kept UID 995. A manual `usermod -u 850 mcp` was
+required on each node.
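Because the declared UID and the actual UID can silently diverge like this, a check after each rebuild is cheap insurance. A minimal sketch, assuming a passwd-format file; `actual_uid` and `uid_is` are hypothetical helper names.

```shell
# actual_uid NAME [FILE]
# Print the UID recorded for NAME in a passwd-format file
# (default /etc/passwd). Prints nothing if NAME is absent.
actual_uid() {
    awk -F: -v name="$1" '$1 == name { print $3; exit }' "${2:-/etc/passwd}"
}

# uid_is EXPECTED NAME [FILE]
# Succeed only if NAME currently has the EXPECTED UID.
uid_is() {
    [ "$(actual_uid "$2" "${3:-}")" = "$1" ]
}
```

For example, `uid_is 850 mcp || echo "mcp UID does not match the pinned value"` could run as a post-rebuild step.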
+
+**Problem 2: Rootless podman caches the UID everywhere.**
+- Podman's SQLite database (`db.sql`) stores absolute paths like
+  `/run/user/995/libpod/tmp` and `/run/user/995/containers`
+- The systemd user session (`/run/user/995/`) is tied to the UID
+- subuid/subgid mappings reference the user by name but the kernel
+  checks the actual UID
+- Container storage overlay directories have file ownership based on
+  the old UID namespace mapping (995 → 100000)
+
+After changing the UID, `podman` operations failed with:
+```
+newuidmap: write to uid_map failed: Operation not permitted
+```
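To see how widely the old UID is baked into cached state, any text dump can be searched for runtime paths under `/run/user/<old uid>/`. This is a sketch only: `stale_runtime_paths` is a hypothetical helper that reads stdin, and what to feed it (for example, `strings` output from podman's database file) is an assumption left to the operator. It relies on `grep -o`, which GNU, BSD, and BusyBox grep all provide.

```shell
# stale_runtime_paths OLD_UID
# Hypothetical helper: read text on stdin and print each distinct
# /run/user/OLD_UID/... path it mentions. The slash after the UID
# keeps 995 from matching 9951.
stale_runtime_paths() {
    grep -o "/run/user/$1/[^[:space:]\"']*" | sort -u
}
```

Any output from `stale_runtime_paths 995` after the switch to UID 850 points at state that still expects the old session directory.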
+
+### The Reboot
+
+Rift was rebooted to get a clean systemd user session for UID 850.
+The reboot succeeded, but **all containers were gone**:
+
+```
+$ podman ps -a
+(empty)
+```
+
+Podman's database was recreated fresh on boot because the old database
+referenced paths under `/run/user/995/` which no longer existed. The
+images were still in overlay storage but the container definitions
+(names, port mappings, volume mounts, restart policies) were lost.
+
+### DNS Collapse
+
+MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
+ran as a container on rift. When all containers were lost, DNS
+resolution broke:
+
+- `mcq.svc.mcp.metacircular.net` → no answer
+- MCNS also served as a recursive resolver for the LAN
+- `google.com` → NXDOMAIN on machines using MCNS as their resolver
+
+Tailscale DNS (MagicDNS) was also affected because resolved's global
+DNS config pointed to MCNS. Tailscale itself remained functional
+(its coordination servers are external), but hostname resolution via
+Tailscale DNS names failed.
+
+The operator turned off Tailscale on vade (the workstation) thinking
+Tailscale was the problem. This broke connectivity to rift entirely
+since the MCP agent binds to the Tailnet IP only (`100.95.252.120:9444`).
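Whether a machine depends on MCNS alone can be checked from its resolver list. A minimal sketch, assuming a resolv.conf-format file; `has_fallback_resolver` is a hypothetical name, and treating `192.168.88.181` as the MCNS resolver address is an assumption based on the port bindings used during recovery.

```shell
# has_fallback_resolver PRIMARY [FILE]
# Hypothetical check: succeed only if the resolv.conf-format FILE
# (default /etc/resolv.conf) lists a nameserver other than PRIMARY.
has_fallback_resolver() {
    awk -v primary="$1" '
        $1 == "nameserver" && $2 != primary { found = 1 }
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/resolv.conf}"
}
```

On hosts running systemd-resolved, `/etc/resolv.conf` may list only the local stub, so the upstream server list would have to be exported from resolved before a check like this is meaningful.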
+
+### Recovery
+
+**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
+connectivity works without MCNS — MagicDNS uses Tailscale's own
+servers for `.ts.net` names.
+
+**Step 2**: Start MCNS manually via `podman run`. The image was cached
+in overlay storage. MCNS needed explicit port bindings (not `--network
+host`) because systemd-resolved holds port 53 on localhost:
+
+```bash
+podman run -d --name mcns --restart unless-stopped \
+  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
+  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
+  -p 127.0.0.1:38443:8443 \
+  -v /srv/mcns:/srv/mcns \
+  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
+  server --config /srv/mcns/mcns.toml
+```
+
+DNS resolution restored within seconds.
+
+**Step 3**: Start remaining services manually via `podman run`. Images
+were all cached. The `mcp deploy` CLI could not be used because:
+- MCR was down (can't pull images)
+- The agent's registry was empty (podman DB reset)
+- Auto-build failed (`/etc/resolv.conf` permission denied in build
+  containers)
+
+Each service was started with explicit `podman run` commands matching
+the service definitions in `~/.config/mcp/services/*.toml`.
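Typing these commands by hand is error-prone, so a dry-run helper that only prints the command for review can help. This is a sketch under stated assumptions: `render_podman_run` and its argument layout are hypothetical and do not reflect the MCP service definition schema. The flags it emits are the ones used in the MCNS command above.

```shell
# render_podman_run NAME IMAGE PORTS VOLUMES [CMD...]
# Hypothetical dry-run helper: print (do not execute) a `podman run`
# command. PORTS and VOLUMES are space-separated lists, for example
# "127.0.0.1:38443:8443" and "/srv/mcns:/srv/mcns".
# The body runs in a subshell so its variables do not leak.
render_podman_run() (
    name=$1 image=$2 ports=$3 volumes=$4
    shift 4
    cmd="podman run -d --name $name --restart unless-stopped"
    # Unquoted on purpose: split the lists on whitespace.
    for p in $ports;   do cmd="$cmd -p $p"; done
    for v in $volumes; do cmd="$cmd -v $v"; done
    printf '%s\n' "$cmd $image${*:+ $*}"
)
```

The printed line can be compared against the service definition before it is pasted into a shell.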
+
+**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
+were owned by UID 850 (the mcp user on the host). Inside containers,
+UID 0 (root) maps to host UID 850, the invoking user; the subuid range
+covers the remaining container UIDs. But:
+
+- `podman unshare chown -R 0:0 /srv/<service>` translated ownership
+  to match the container's user namespace
+- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
+  in the database directory
+- modernc.org/sqlite calls `fchmod()` on the database file, which is
+  denied inside rootless podman user namespaces (even for UID 0 in the
+  namespace)
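The UID translation behind this step can be written out as arithmetic. The sketch below assumes podman's default rootless layout, in which container UID 0 is the invoking user and container UIDs from 1 upward fill the subuid range in order. The subuid start of 100000 comes from the mapping mentioned earlier in this report; `container_to_host_uid` is a hypothetical name.

```shell
# container_to_host_uid HOST_UID SUBUID_START CONTAINER_UID
# Sketch of the default rootless mapping: container UID 0 is the
# invoking user; container UID n (n >= 1) is SUBUID_START + n - 1.
container_to_host_uid() (
    host_uid=$1 subuid_start=$2 cuid=$3
    if [ "$cuid" -eq 0 ]; then
        echo "$host_uid"
    else
        echo $((subuid_start + cuid - 1))
    fi
)
```

With these assumptions, `container_to_host_uid 850 100000 0` prints `850` and `container_to_host_uid 850 100000 1` prints `100000`, which is why files chowned to `0:0` inside `podman unshare` appear as UID 850 on the host.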
+
+**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
+was fatal for MCR and Metacrypt. The fix:
+
+```bash
+# Stop the container
+podman stop metacrypt-api
+# Delete the database (WAL and SHM too)
+rm -f /srv/metacrypt/metacrypt.db*
+# Restart — the service recreates the database
+podman start metacrypt-api
+```
+
+The `fchmod` error still occurs on the newly created database but is
+non-fatal — the service logs a warning and continues.
+
+**Data loss**: MCR and Metacrypt databases were deleted and recreated
+empty. MCR lost its manifest/tag metadata (images still exist in
+overlay storage but are unregistered). Metacrypt lost its CA state
+(encrypted keys, issued-certificate tracking). Other services (mcq,
+mcdoc, etc.) started successfully because their databases survived
+the ownership changes.
+
+## Root Causes
+
+1. **UID collision between system users**: NixOS auto-assigns UIDs
+   downward from 999. Pinning UID 995 for mcp collided with sshd on
+   orion.
+
+2. **Rootless podman's deep UID dependency**: Changing a user's UID
+   after rootless podman has been used requires all of the following,
+   none of which happens automatically:
+   - Updating podman's internal database paths
+   - Recreating the systemd user session
+   - Fixing subuid/subgid mappings
+   - Fixing overlay storage ownership
+   - Fixing service data file ownership
+
+3. **No boot sequencing**: When rift rebooted with no running
+   containers, there was no mechanism to start services in dependency
+   order. The boot sequence feature in the v2 architecture is meant
+   for exactly this case, but it had not been implemented yet.
+
+4. **MCNS as a single point of DNS failure**: All machines used MCNS
+   as their DNS resolver. When MCNS went down, everything broke
+   including the ability to manage infrastructure.
+
+5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
+   library calls `fchmod()` on database files, which is denied inside
+   rootless podman user namespaces. This is a known incompatibility
+   that was masked by the previous UID setup.
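For the two ownership items under root cause 2, a first step is to list what is still owned by something other than the expected UID. A minimal sketch using POSIX `find`; `files_not_owned_by` is a hypothetical name.

```shell
# files_not_owned_by UID DIR
# Hypothetical helper: list everything under DIR whose owner is not
# the numeric UID given, without crossing filesystem boundaries.
files_not_owned_by() {
    find "$2" -xdev ! -user "$1" -print
}
```

Under rootless podman, files owned by UIDs in the subuid range will also be listed and may be legitimate, so the output is a review list, not a list of things to chown blindly.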
+
+## Lessons Learned
+
+1. **Never change a rootless podman user's UID.** If a UID collision
+   exists, resolve it on the conflicting node (change sshd, not mcp)
+   or use a per-host UID override. Changing the UID after podman has
+   been used is destructive.
+
+2. **DNS must not be a single point of failure.** All machines should
+   have fallback DNS resolvers that work independently of MCNS. The
+   NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
+   fallbacks, not just MCNS.
+
+3. **Boot sequencing is critical.** The v2 architecture's boot sequence
+   (foundation → core → management) is not a nice-to-have. Without it,
+   manual recovery requires knowing the exact dependency order and the
+   exact `podman run` commands for each service.
+
+4. **The MCP agent should be able to recover containers from its
+   registry.** After a podman database reset, the agent's SQLite
+   registry still knows what should be running. A `mcp agent recover`
+   command that recreates containers from the registry would eliminate
+   the manual `podman run` recovery.
+
+5. **Service definitions must include all runtime parameters.** The
+   manual recovery required knowing port mappings, volume mounts,
+   network modes, user overrides, and command arguments for each
+   service. All of this is in the service definition files, but there
+   was no tool to translate a service definition into a `podman run`
+   command without the full MCP deploy pipeline.
+
+6. **Tailscale is not the problem when DNS breaks.** Tailscale's
+   control plane is external. Turning off Tailscale makes things worse,
+   not better, because the agents bind to Tailnet IPs.
+
+## Action Items
+
+- [ ] Write disaster recovery runbook (separate document)
+- [ ] Add fallback DNS resolvers to NixOS config (1.1.1.1, 8.8.8.8)
+- [ ] Implement `mcp agent recover` command (recreate containers from
+      registry)
+- [ ] Implement boot sequencing in the agent (Phase 3+ architecture)
+- [ ] Investigate modernc.org/sqlite `fchmod` workaround for rootless
+      podman
+- [ ] Add multi-address support to node config (Tailnet IP + LAN IP
+      + hostname)
+- [ ] Revert mcp UID to a stable value and never change it again