Add incident report and disaster recovery runbook
incident-2026-04-03-uid-change.md: Detailed post-mortem of the UID change cascading failure that took down all services on rift. Documents the timeline, root causes, recovery steps, and lessons learned.

disaster-recovery.md: Step-by-step runbook for bootstrapping the platform from zero when all containers are gone. Covers the boot order (MCNS → mc-proxy/MCR/Metacrypt → master → apps), exact podman run commands for each service, common errors, and verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/disaster-recovery.md (new file, 344 lines):
# Disaster Recovery: Bootstrap from Zero

This document covers recovering the Metacircular platform when all services on rift are down and no containers are running. It assumes:

- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (`/srv/<service>/`) are intact on disk
- Container images are cached in podman's overlay storage

If images are NOT cached (fresh machine, disk wipe), see the "Cold Start" section at the end.

## Prerequisites

Before starting recovery, verify:

```bash
# 1. Machine is up
hostname  # should print "rift"

# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)

# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)

# 4. The agent is running
systemctl status mcp-agent
# Should be active

# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 0
```

If Tailscale is not running: `doas systemctl start tailscaled && doas tailscale up`

If the agent is not running: check that `/srv/mcp/mcp-agent` exists and `/srv/mcp/mcp-agent.toml` is correct, then `doas systemctl restart mcp-agent`.
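If the agent still won't start after a restart, the systemd journal usually shows why. A diagnostic sketch using standard systemd tooling (the unit name matches the `systemctl` commands above):

```bash
# Last 50 log lines for the agent unit, without a pager
journalctl -u mcp-agent -n 50 --no-pager

# Confirm the binary and config exist and are readable
ls -l /srv/mcp/mcp-agent /srv/mcp/mcp-agent.toml
```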
## Recovery Order

Services must be started in dependency order. Each stage must be healthy before the next starts.

```
Stage 1 (Foundation):  MCNS                      → DNS works
Stage 2 (Core):        mc-proxy, MCR, Metacrypt  → routing + images + certs
Stage 3 (Management):  mcp-master                → orchestration
Stage 4 (Services):    mcq, mcdoc, mcat, kls, sgard, exo → applications
```

## Manual Recovery Commands

All commands run as the mcp user. Use this shell prefix:

```bash
# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'
# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift
```

For brevity, the commands below show the `podman run` portion only. Prefix them with the environment setup above.

### Stage 1: MCNS (DNS)

MCNS must start first. Without it, no hostname resolution works.

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp \
  -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp \
  -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```
**Verify:**
```bash
dig @192.168.88.181 google.com +short
# Should return an IP address
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP
```

**Note:** MCNS binds to specific IPs, not `0.0.0.0`, because systemd-resolved holds port 53 on localhost. The explicit bindings avoid the conflict.

### Stage 2: Core Infrastructure

#### mc-proxy (TLS routing)

```bash
podman run -d --name mc-proxy --restart unless-stopped \
  --network host \
  -v /srv/mc-proxy:/srv/mc-proxy \
  mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
  server --config /srv/mc-proxy/mc-proxy.toml
```

**Verify:** `curl -sk https://localhost:443/ 2>&1 | head -1` (should get a response, even if 404)

#### MCR (Container Registry)

```bash
# API server
podman run -d --name mcr-api --restart unless-stopped \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28443:8443 \
  -p 127.0.0.1:29443:9443 \
  mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
  server --config /srv/mcr/mcr.toml

# Web UI
podman run -d --name mcr-web --restart unless-stopped \
  --user 0:0 \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
  server --config /srv/mcr/mcr.toml
```

**If MCR fails with "chmod" or "readonly database":**
```bash
podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api
```
This recreates the database empty. Image blobs in `/srv/mcr/layers/` are preserved, but tag metadata is lost. Re-push images to rebuild the registry.
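Re-pushing can happen from any machine that still holds local copies of the tagged images. A sketch, assuming vade's local podman storage has the images:

```bash
# Re-register every locally cached image with the freshly recreated MCR.
# The grep prefix keeps only images tagged for this registry.
podman images --format '{{.Repository}}:{{.Tag}}' \
  | grep '^mcr.svc.mcp.metacircular.net:8443/' \
  | while read -r img; do
      podman push "$img"
    done
```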
#### Metacrypt (PKI / Secrets)

```bash
# API server
podman run -d --name metacrypt-api --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18443:8443 \
  -p 127.0.0.1:19443:9443 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
  server --config /srv/metacrypt/metacrypt.toml

# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18080:8080 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
  --config /srv/metacrypt/metacrypt.toml
```

**If Metacrypt fails with "chmod" or "readonly database":**
Same fix as MCR: delete the database files. **Warning:** this loses all encrypted secrets, issued-cert tracking, and CA state. The CA key itself is in the sealed vault (password-protected), not in SQLite.

### Stage 3: MCP Master

```bash
podman run -d --name mcp-master --restart unless-stopped \
  --network host \
  -v /srv/mcp-master:/srv/mcp-master \
  mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
  server --config /srv/mcp-master/mcp-master.toml
```

**Verify:**
```bash
# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion
```

### Stage 4: Application Services

Once the master is running, deploy applications through MCP:

```bash
mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct
```

Or start them manually:

```bash
# MCQ
podman run -d --name mcq --restart unless-stopped \
  -v /srv/mcq:/srv/mcq \
  -p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
  server --config /srv/mcq/mcq.toml

# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
  -v /srv/mcdoc:/srv/mcdoc \
  -p 127.0.0.1:38080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
  server --config /srv/mcdoc/mcdoc.toml

# MCAT
podman run -d --name mcat --restart unless-stopped \
  -v /srv/mcat:/srv/mcat \
  -p 127.0.0.1:48116:8443 \
  mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
  server --config /srv/mcat/mcat.toml

# KLS
podman run -d --name kls --restart unless-stopped \
  -v /srv/kls:/srv/kls \
  -p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
  mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
  -f /srv/kls/kls.conf

# Sgard
podman run -d --name sgardd --restart unless-stopped \
  -v /srv/sgard:/srv/sgard \
  -p 127.0.0.1:19473:9473 \
  mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
  --repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
  --tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key
```

## Verification Checklist

After all services are running:

```bash
# Fleet status
mcp ps
# All services should show "running"

# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short

# MCIAS (runs on svc, should be unaffected by a rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health

# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/

# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health

# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```

## Common Errors

### "chmod: operation not permitted"

modernc.org/sqlite calls `fchmod()` on database files, which is denied inside rootless podman user namespaces. Fix:

```bash
# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>
```

The `fchmod` error will still appear in the logs as a warning, but it is non-fatal for newly created databases.

### "address already in use" on port 53

systemd-resolved holds port 53 on localhost, so MCNS must bind to specific IPs, not `0.0.0.0:53`. Use explicit port bindings:
`-p 192.168.88.181:53:53 -p 100.95.252.120:53:53`
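Before starting MCNS, it can help to confirm who actually holds port 53. A diagnostic sketch using the standard `ss` tool (the process column needs root, hence `doas`):

```bash
# List TCP/UDP listeners on port 53.
# systemd-resolved typically shows as a stub listener on 127.0.0.53:53.
doas ss -tulnp | grep ':53 '
```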
### "connection refused" to MCR

MCR is down. Images are cached locally, so you can start services that use cached images without MCR. MCR itself starts from its own cached image.

### Agent shows "error" for all nodes

Check:
1. Tailscale is running on both the CLI machine and the target node
2. The agent is listening: `ss -tlnp | grep 9444`
3. The CLI config has the correct addresses
4. TLS certs have the right SANs for the Tailnet IP
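Item 4 can be checked from the CLI machine with openssl. A sketch against rift's agent address (adjust host and port for the node being debugged; `-ext subjectAltName` requires OpenSSL 1.1.1 or newer):

```bash
# Dump the agent certificate's SANs; the Tailnet IP must appear here
openssl s_client -connect 100.95.252.120:9444 </dev/null 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
```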
### "podman: executable file not found"

This warning appears for svc (which uses Docker, not podman). It is benign: svc is an edge node that doesn't run containers.

## Cold Start (No Cached Images)

If the machine was wiped and no images are cached:

1. **MCIAS** runs on svc (Docker container), not rift. It should be unaffected by a rift failure. Verify: `ssh svc.metacircular.net "docker ps | grep mcias"`.

2. **Pre-stage images** by pulling from a backup or building locally:

   ```bash
   # On vade (operator workstation), build and push to a temp location
   cd ~/src/metacircular/mcns && make docker
   podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
     ssh rift "podman load"
   ```

   Repeat for each service.

3. Alternatively, if another node has MCR access, push images there first, then pull from the running MCR instance.
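Once MCR is reachable again, staging reduces to ordinary pulls on rift instead of `podman save`/`podman load` over SSH. A sketch, using one image tag from this runbook as the example:

```bash
# On rift, pull directly from the running MCR instance
podman pull mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0
```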
## Service Reference

Quick reference for all services, their images, and critical flags:

| Service | Image | Network | Key Ports | Config Path |
|---------|-------|---------|-----------|-------------|
| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml |
| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml |
| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml |
| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml |
| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml |
| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml |
| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml |
| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml |
| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml |
| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml |
| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf |
| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) |

All images are prefixed with `mcr.svc.mcp.metacircular.net:8443/`.
docs/incident-2026-04-03-uid-change.md (new file, 223 lines):
# Incident Report: UID Change Cascading Failure

**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850

## Timeline

### Background

Orion was being provisioned as a new worker node. Its NixOS config imports `mcp.nix`, which pins the mcp user to UID 995. On orion, UID 995 was already assigned to the `sshd` user, causing a UID collision:

```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```

Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool rejected rootless podman operations because the calling process's UID (995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.

### The UID Change

To resolve the collision, `mcp.nix` was updated to pin UID 850 (in the 800–899 range, which is empty on all nodes). Both rift and orion were rebuilt with `nixos-rebuild switch`.

**Problem 1: NixOS doesn't change UIDs for existing users.** The rebuild created the NixOS config with `uid = 850`, but the existing `mcp` user on both nodes kept UID 995. A manual `usermod -u 850 mcp` was required on each node.

**Problem 2: Rootless podman caches the UID everywhere.**
- Podman's SQLite database (`db.sql`) stores absolute paths like `/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name, but the kernel checks the actual UID
- Container storage overlay directories have file ownership based on the old UID namespace mapping (995 → 100000)

After changing the UID, `podman` operations failed with:
```
newuidmap: write to uid_map failed: Operation not permitted
```
### The Reboot

Rift was rebooted to get a clean systemd user session for UID 850. The reboot succeeded, but **all containers were gone**:

```
$ podman ps -a
(empty)
```

Podman's database was recreated fresh on boot because the old database referenced paths under `/run/user/995/` which no longer existed. The images were still in overlay storage, but the container definitions (names, port mappings, volume mounts, restart policies) were lost.

### DNS Collapse

MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`) ran as a container on rift. When all containers were lost, DNS resolution broke:

- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver

Tailscale DNS (MagicDNS) was also affected because resolved's global DNS config pointed to MCNS. Tailscale itself remained functional (its coordination servers are external), but hostname resolution via Tailscale DNS names failed.

The operator turned off Tailscale on vade (the workstation), thinking Tailscale was the problem. This broke connectivity to rift entirely, since the MCP agent binds to the Tailnet IP only (`100.95.252.120:9444`).

### Recovery

**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale connectivity works without MCNS; MagicDNS uses Tailscale's own servers for `.ts.net` names.

**Step 2**: Start MCNS manually via `podman run`. The image was cached in overlay storage. MCNS needed explicit port bindings (not `--network host`) because systemd-resolved holds port 53 on localhost:

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```
DNS resolution was restored within seconds.

**Step 3**: Start the remaining services manually via `podman run`. The images were all cached. The `mcp deploy` CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build containers)

Each service was started with explicit `podman run` commands matching the service definitions in `~/.config/mcp/services/*.toml`.

**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*` were owned by UID 850 (the mcp user on the host). Inside containers, UID 0 (root) maps to host UID 850 via subuid. But:

- `podman unshare chown -R 0:0 /srv/<service>` translated ownership to match the container's user namespace
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is denied inside rootless podman user namespaces (even for UID 0 in the namespace)

**Step 5**: Delete and recreate the SQLite databases. The `fchmod` denial was fatal for MCR and Metacrypt. The fix:

```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart; the service recreates the database
podman start metacrypt-api
```

The `fchmod` error still occurs on the newly created database, but it is non-fatal: the service logs a warning and continues.

**Data loss**: The MCR and Metacrypt databases were deleted and recreated empty. MCR lost its manifest/tag metadata (the images still exist in overlay storage but are unregistered). Metacrypt lost its CA state (encrypted keys, issued-cert tracking). Other services (mcq, mcdoc, etc.) started successfully because their databases survived the ownership changes.
## Root Causes

1. **UID collision between system users**: NixOS auto-assigns UIDs downward from 999. Pinning UID 995 for mcp collided with sshd on orion.

2. **Rootless podman's deep UID dependency**: Changing a user's UID after rootless podman has been used requires:
   - Updating podman's internal database paths
   - Recreating the systemd user session
   - Fixing subuid/subgid mappings
   - Fixing overlay storage ownership
   - Fixing service data file ownership

   None of these happen automatically.

3. **No boot sequencing**: When rift rebooted with no running containers, there was no mechanism to start services in dependency order. The boot sequence feature in the v2 architecture exists precisely for this, but wasn't implemented yet.

4. **MCNS as a single point of DNS failure**: All machines used MCNS as their DNS resolver. When MCNS went down, everything broke, including the ability to manage the infrastructure.

5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite library calls `fchmod()` on database files, which is denied inside rootless podman user namespaces. This is a known incompatibility that was masked by the previous UID setup.

## Lessons Learned

1. **Never change a rootless podman user's UID.** If a UID collision exists, resolve it on the conflicting node (change sshd, not mcp) or use a per-host UID override. Changing the UID after podman has been used is destructive.

2. **DNS must not be a single point of failure.** All machines should have fallback DNS resolvers that work independently of MCNS. The NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as fallbacks, not just MCNS.
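Whether a fallback path actually works can be spot-checked by querying the public resolvers directly, bypassing MCNS entirely. A sketch using dig:

```bash
# These succeed even when MCNS is down, confirming the
# network path to the public fallback resolvers
dig @1.1.1.1 google.com +short
dig @8.8.8.8 google.com +short
```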
3. **Boot sequencing is critical.** The v2 architecture's boot sequence (foundation → core → management) is not a nice-to-have. Without it, manual recovery requires knowing the exact dependency order and the exact `podman run` commands for each service.

4. **The MCP agent should be able to recover containers from its registry.** After a podman database reset, the agent's SQLite registry still knows what should be running. An `mcp agent recover` command that recreates containers from the registry would eliminate the manual `podman run` recovery.

5. **Service definitions must include all runtime parameters.** The manual recovery required knowing the port mappings, volume mounts, network modes, user overrides, and command arguments for each service. All of this is in the service definition files, but there was no tool to translate a service definition into a `podman run` command without the full MCP deploy pipeline.

6. **Tailscale is not the problem when DNS breaks.** Tailscale's control plane is external. Turning off Tailscale makes things worse, not better, because the agents bind to Tailnet IPs.

## Action Items

- [ ] Write disaster recovery runbook (separate document)
- [ ] Add fallback DNS resolvers to NixOS config (1.1.1.1, 8.8.8.8)
- [ ] Implement `mcp agent recover` command (recreate containers from registry)
- [ ] Implement boot sequencing in the agent (Phase 3+ architecture)
- [ ] Investigate modernc.org/sqlite `fchmod` workaround for rootless podman
- [ ] Add multi-address support to node config (Tailnet IP + LAN IP + hostname)
- [ ] Revert mcp UID to a stable value and never change it again