diff --git a/docs/disaster-recovery.md b/docs/disaster-recovery.md
new file mode 100644
index 0000000..5753419
--- /dev/null
+++ b/docs/disaster-recovery.md
@@ -0,0 +1,344 @@
# Disaster Recovery: Bootstrap from Zero

This document covers recovering the Metacircular platform when all
services on rift are down and no containers are running. It assumes:

- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (`/srv/<service>/`) are intact on disk
- Container images are cached in podman's overlay storage

If images are NOT cached (fresh machine, disk wipe), see the
"Cold Start" section at the end.

## Prerequisites

Before starting recovery, verify:

```bash
# 1. Machine is up
hostname  # should print "rift"

# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)

# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)

# 4. The agent is running
systemctl status mcp-agent
# Should be active

# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 1 (the first line is the column header)
```

If Tailscale is not running: `doas systemctl start tailscaled && doas tailscale up`

If the agent is not running: check that `/srv/mcp/mcp-agent` exists and
`/srv/mcp/mcp-agent.toml` is correct, then `doas systemctl restart mcp-agent`.

## Recovery Order

Services must be started in dependency order. Each stage must be
healthy before the next one starts.

```
Stage 1 (Foundation): MCNS → DNS works
Stage 2 (Core): mc-proxy, MCR, Metacrypt → routing + images + certs
Stage 3 (Management): mcp-master → orchestration
Stage 4 (Services): mcq, mcdoc, mcat, kls, sgard, exo → applications
```

## Manual Recovery Commands

All commands run as the mcp user.
Use this shell prefix: + +```bash +# Set up the environment +export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"' +# Or SSH as mcp directly (if SSH login is enabled): +ssh mcp@rift +``` + +For brevity, commands below show the `podman run` portion only. Prefix +with the environment setup above. + +### Stage 1: MCNS (DNS) + +MCNS must start first. Without it, no hostname resolution works. + +```bash +podman run -d --name mcns --restart unless-stopped \ + -p 192.168.88.181:53:53/tcp \ + -p 192.168.88.181:53:53/udp \ + -p 100.95.252.120:53:53/tcp \ + -p 100.95.252.120:53:53/udp \ + -p 127.0.0.1:38443:8443 \ + -v /srv/mcns:/srv/mcns \ + mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \ + server --config /srv/mcns/mcns.toml +``` + +**Verify:** +```bash +dig @192.168.88.181 google.com +short +# Should return an IP address +dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short +# Should return a Tailnet IP +``` + +**Note:** MCNS binds to specific IPs, not `0.0.0.0`, because +systemd-resolved holds port 53 on localhost. The explicit bindings +avoid the conflict. 
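Stage gating can be scripted with a small retry helper. This is a sketch, not part of the MCP tooling; the `retry` function name is ours, and the example gate uses the MCNS addresses from this runbook:

```shell
#!/bin/sh
# retry N DELAY CMD...: run CMD up to N times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  n=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$n" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: confirm MCNS answers before starting Stage 2:
#   retry 30 2 sh -c 'dig @192.168.88.181 +short +time=2 +tries=1 google.com | grep -q .'
```

The same helper works for the later stages, e.g. polling the mc-proxy or MCR health endpoints with `curl -sk` before moving on.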
+ +### Stage 2: Core Infrastructure + +#### mc-proxy (TLS routing) + +```bash +podman run -d --name mc-proxy --restart unless-stopped \ + --network host \ + -v /srv/mc-proxy:/srv/mc-proxy \ + mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \ + server --config /srv/mc-proxy/mc-proxy.toml +``` + +**Verify:** `curl -sk https://localhost:443/ 2>&1 | head -1` +(should get a response, even if 404) + +#### MCR (Container Registry) + +```bash +# API server +podman run -d --name mcr-api --restart unless-stopped \ + -v /srv/mcr:/srv/mcr \ + -p 127.0.0.1:28443:8443 \ + -p 127.0.0.1:29443:9443 \ + mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \ + server --config /srv/mcr/mcr.toml + +# Web UI +podman run -d --name mcr-web --restart unless-stopped \ + --user 0:0 \ + -v /srv/mcr:/srv/mcr \ + -p 127.0.0.1:28080:8080 \ + mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \ + server --config /srv/mcr/mcr.toml +``` + +**If MCR fails with "chmod" or "readonly database":** +```bash +podman stop mcr-api +rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm +podman start mcr-api +``` +This recreates the database empty. Image blobs in `/srv/mcr/layers/` +are preserved but tag metadata is lost. Re-push images to rebuild the +registry. + +#### Metacrypt (PKI / Secrets) + +```bash +# API server +podman run -d --name metacrypt-api --restart unless-stopped \ + -v /srv/metacrypt:/srv/metacrypt \ + -p 127.0.0.1:18443:8443 \ + -p 127.0.0.1:19443:9443 \ + mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \ + server --config /srv/metacrypt/metacrypt.toml + +# Web UI +podman run -d --name metacrypt-web --restart unless-stopped \ + -v /srv/metacrypt:/srv/metacrypt \ + -p 127.0.0.1:18080:8080 \ + mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \ + --config /srv/metacrypt/metacrypt.toml +``` + +**If Metacrypt fails with "chmod" or "readonly database":** +Same fix as MCR — delete the database files. **Warning:** this loses +all encrypted secrets, issued certs tracking, and CA state. 
The CA +key itself is in the sealed vault (password-protected), not in SQLite. + +### Stage 3: MCP Master + +```bash +podman run -d --name mcp-master --restart unless-stopped \ + --network host \ + -v /srv/mcp-master:/srv/mcp-master \ + mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \ + server --config /srv/mcp-master/mcp-master.toml +``` + +**Verify:** +```bash +# From vade (operator workstation): +mcp node list +# Should show rift, svc, orion +``` + +### Stage 4: Application Services + +Once the master is running, deploy applications through MCP: + +```bash +mcp deploy mcq --direct +mcp deploy mcdoc --direct +mcp deploy mcat --direct +mcp deploy kls --direct +``` + +Or start them manually: + +```bash +# MCQ +podman run -d --name mcq --restart unless-stopped \ + -v /srv/mcq:/srv/mcq \ + -p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \ + mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \ + server --config /srv/mcq/mcq.toml + +# MCDoc +podman run -d --name mcdoc --restart unless-stopped \ + -v /srv/mcdoc:/srv/mcdoc \ + -p 127.0.0.1:38080:8080 \ + mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \ + server --config /srv/mcdoc/mcdoc.toml + +# MCAT +podman run -d --name mcat --restart unless-stopped \ + -v /srv/mcat:/srv/mcat \ + -p 127.0.0.1:48116:8443 \ + mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \ + server --config /srv/mcat/mcat.toml + +# KLS +podman run -d --name kls --restart unless-stopped \ + -v /srv/kls:/srv/kls \ + -p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \ + mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \ + -f /srv/kls/kls.conf + +# Sgard +podman run -d --name sgardd --restart unless-stopped \ + -v /srv/sgard:/srv/sgard \ + -p 127.0.0.1:19473:9473 \ + mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \ + --repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \ + --tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key +``` + +## Verification Checklist + +After all services are running: + +```bash +# Fleet status 
mcp ps
# All services should show "running"

# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short

# MCIAS (runs on svc, should be unaffected by rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health

# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/

# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health

# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```

## Common Errors

### "chmod: operation not permitted"

modernc.org/sqlite calls `fchmod()` on database files. This is denied
inside rootless podman user namespaces. Fix:

```bash
# Delete the database and let the service recreate it
podman stop <service>
rm -f /srv/<service>/<service>.db*
podman start <service>
```

The `fchmod` error will still appear in logs as a warning but is
non-fatal for newly created databases.

### "address already in use" on port 53

systemd-resolved holds port 53 on localhost. MCNS must bind to
specific IPs, not `0.0.0.0:53`. Use explicit port bindings:
`-p 192.168.88.181:53:53 -p 100.95.252.120:53:53`

### "connection refused" to MCR

MCR is down. Images are cached locally — you can start services that
use cached images without MCR. MCR itself starts from its cached
image.

### Agent shows "error" for all nodes

Check:
1. Tailscale is running on both the CLI machine and the target node
2. The agent is listening: `ss -tlnp | grep 9444`
3. The CLI config has the correct addresses
4. TLS certs have the right SANs for the Tailnet IP

### "podman: executable file not found"

This warning appears for svc (which uses Docker, not podman). It's
benign — svc is an edge node that doesn't run containers.

## Cold Start (No Cached Images)

If the machine was wiped and no images are cached:

1. **MCIAS** runs on svc (Docker container), not rift. It should be
   unaffected by a rift failure.
Verify: `ssh svc.metacircular.net + "docker ps | grep mcias"`. + +2. **Pre-stage images** by pulling from a backup or building locally: + ```bash + # On vade (operator workstation), build and push to a temp location + cd ~/src/metacircular/mcns && make docker + podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \ + ssh rift "podman load" + ``` + Repeat for each service. + +3. Alternatively, if another node has MCR access, push images there + first, then pull from the running MCR instance. + +## Service Reference + +Quick reference for all services, their images, and critical flags: + +| Service | Image | Network | Key Ports | Config Path | +|---------|-------|---------|-----------|-------------| +| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml | +| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml | +| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml | +| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml | +| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml | +| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml | +| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml | +| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml | +| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml | +| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml | +| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf | +| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) | + +All images are prefixed with `mcr.svc.mcp.metacircular.net:8443/`. 
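For a full cold start, the per-service `podman save | ssh ... podman load` transfer in step 2 can be generated from the Service Reference table. A print-only sketch (the `stage_cmds` helper is ours; review the output, then pipe it to `sh`):

```shell
#!/bin/sh
# Print one save/load pipeline per image, using the tags from the
# Service Reference table. Nothing is executed; output is for review.
REG=mcr.svc.mcp.metacircular.net:8443
stage_cmds() {
  for img in "$@"; do
    echo "podman save $REG/$img | ssh rift 'podman load'"
  done
}

stage_cmds mcns:v1.2.0 mc-proxy:v1.2.2 mcr:v1.2.1 mcr-web:v1.3.2 \
  metacrypt:v1.3.1 metacrypt-web:v1.4.1 mcp-master:v0.10.3 \
  mcq:v0.4.2 mcdoc:v0.1.0 mcat:v1.2.0 kls:v0.2.0 sgardd:v3.2.0
```

Printing first keeps the runbook's review-before-run discipline; once verified, `stage_cmds ... | sh` performs the transfer.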
diff --git a/docs/incident-2026-04-03-uid-change.md b/docs/incident-2026-04-03-uid-change.md
new file mode 100644
index 0000000..efdb8c0
--- /dev/null
+++ b/docs/incident-2026-04-03-uid-change.md
@@ -0,0 +1,223 @@
# Incident Report: UID Change Cascading Failure

**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850

## Timeline

### Background

Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix`, which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:

```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```

Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.

### The UID Change

To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, which is empty on all nodes). Both rift and orion
were rebuilt with `nixos-rebuild switch`.

**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild wrote the new config with `uid = 850`, but the existing
`mcp` user on both nodes kept UID 995. A manual `usermod -u 850 mcp`
was required on each node.
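This class of drift can be detected before it bites: after every rebuild, compare the live UID against the pinned value. A sketch (the `check_uid` helper is ours; 850 is the pinned value from this incident):

```shell
#!/bin/sh
# check_uid USER EXPECTED: warn when the live UID differs from the value
# pinned in the NixOS config, since nixos-rebuild will NOT correct it.
check_uid() {
  actual=$(id -u "$1" 2>/dev/null) || { echo "no such user: $1" >&2; return 2; }
  if [ "$actual" != "$2" ]; then
    echo "UID mismatch for $1: have $actual, want $2 (usermod needed)" >&2
    return 1
  fi
  return 0
}

# On each node after a rebuild:
#   check_uid mcp 850
```

Run before any rootless podman use, a mismatch caught here is a one-line `usermod`; caught after, it is this incident.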
+ +**Problem 2: Rootless podman caches the UID everywhere.** +- Podman's SQLite database (`db.sql`) stores absolute paths like + `/run/user/995/libpod/tmp` and `/run/user/995/containers` +- The systemd user session (`/run/user/995/`) is tied to the UID +- subuid/subgid mappings reference the user by name but the kernel + checks the actual UID +- Container storage overlay directories have file ownership based on + the old UID namespace mapping (995 → 100000) + +After changing the UID, `podman` operations failed with: +``` +newuidmap: write to uid_map failed: Operation not permitted +``` + +### The Reboot + +Rift was rebooted to get a clean systemd user session for UID 850. +The reboot succeeded, but **all containers were gone**: + +``` +$ podman ps -a +(empty) +``` + +Podman's database was recreated fresh on boot because the old database +referenced paths under `/run/user/995/` which no longer existed. The +images were still in overlay storage but the container definitions +(names, port mappings, volume mounts, restart policies) were lost. + +### DNS Collapse + +MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`) +ran as a container on rift. When all containers were lost, DNS +resolution broke: + +- `mcq.svc.mcp.metacircular.net` → no answer +- MCNS also served as a recursive resolver for the LAN +- `google.com` → NXDOMAIN on machines using MCNS as their resolver + +Tailscale DNS (MagicDNS) was also affected because resolved's global +DNS config pointed to MCNS. Tailscale itself remained functional +(its coordination servers are external), but hostname resolution via +Tailscale DNS names failed. + +The operator turned off Tailscale on vade (the workstation) thinking +Tailscale was the problem. This broke connectivity to rift entirely +since the MCP agent binds to the Tailnet IP only (`100.95.252.120:9444`). + +### Recovery + +**Step 1**: Turn Tailscale back on (on both rift and vade). 
Tailscale connectivity works without MCNS — MagicDNS uses Tailscale's
own servers for `.ts.net` names.

**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```

DNS resolution was restored within seconds.

**Step 3**: Start the remaining services manually via `podman run`.
Images were all cached. The `mcp deploy` CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
  containers)

Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.

**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via the user namespace. But:

- `podman unshare chown -R 0:0 /srv/<service>` translated ownership
  to match the container's user namespace
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
  in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
  denied inside rootless podman user namespaces (even for UID 0 in the
  namespace)

**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt.
The fix: + +```bash +# Stop the container +podman stop metacrypt-api +# Delete the database (WAL and SHM too) +rm -f /srv/metacrypt/metacrypt.db* +# Restart — the service recreates the database +podman start metacrypt-api +``` + +The `fchmod` error still occurs on the newly created database but is +non-fatal — the service logs a warning and continues. + +**Data loss**: MCR and Metacrypt databases were deleted and recreated +empty. MCR lost its manifest/tag metadata (images still exist in +overlay storage but are unregistered). Metacrypt lost its CA state +(encrypted keys, issued certs tracking). Other services (mcq, mcdoc, +etc.) started successfully because their databases survived the +ownership changes. + +## Root Causes + +1. **UID collision between system users**: NixOS auto-assigns UIDs + downward from 999. Pinning UID 995 for mcp collided with sshd on + orion. + +2. **Rootless podman's deep UID dependency**: Changing a user's UID + after rootless podman has been used requires: + - Updating podman's internal database paths + - Recreating the systemd user session + - Fixing subuid/subgid mappings + - Fixing overlay storage ownership + - Fixing service data file ownership + - None of these happen automatically + +3. **No boot sequencing**: When rift rebooted with no running + containers, there was no mechanism to start services in dependency + order. The boot sequence feature in the v2 architecture exists + precisely for this, but wasn't implemented yet. + +4. **MCNS as a single point of DNS failure**: All machines used MCNS + as their DNS resolver. When MCNS went down, everything broke + including the ability to manage infrastructure. + +5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite + library calls `fchmod()` on database files, which is denied inside + rootless podman user namespaces. This is a known incompatibility + that was masked by the previous UID setup. + +## Lessons Learned + +1. 
**Never change a rootless podman user's UID.** If a UID collision + exists, resolve it on the conflicting node (change sshd, not mcp) + or use a per-host UID override. Changing the UID after podman has + been used is destructive. + +2. **DNS must not be a single point of failure.** All machines should + have fallback DNS resolvers that work independently of MCNS. The + NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as + fallbacks, not just MCNS. + +3. **Boot sequencing is critical.** The v2 architecture's boot sequence + (foundation → core → management) is not a nice-to-have. Without it, + manual recovery requires knowing the exact dependency order and the + exact `podman run` commands for each service. + +4. **The MCP agent should be able to recover containers from its + registry.** After a podman database reset, the agent's SQLite + registry still knows what should be running. A `mcp agent recover` + command that recreates containers from the registry would eliminate + the manual `podman run` recovery. + +5. **Service definitions must include all runtime parameters.** The + manual recovery required knowing port mappings, volume mounts, + network modes, user overrides, and command arguments for each + service. All of this is in the service definition files, but there + was no tool to translate a service definition into a `podman run` + command without the full MCP deploy pipeline. + +6. **Tailscale is not the problem when DNS breaks.** Tailscale's + control plane is external. Turning off Tailscale makes things worse, + not better, because the agents bind to Tailnet IPs. 
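Lesson 5 calls for translating a service definition into a `podman run` command. A minimal sketch, assuming the relevant fields have already been pulled out of the TOML; the `emit_run` helper and its argument layout are hypothetical, not MCP's actual format, and real definitions also carry the trailing command arguments (`server --config ...`):

```shell
#!/bin/sh
# emit_run NAME IMAGE VOLUME [PORT...]: print a podman run command
# following this platform's conventions (restart policy, /srv volume,
# explicit -p bindings). Print-only, so the result can be reviewed.
emit_run() {
  name=$1; image=$2; vol=$3; shift 3
  cmd="podman run -d --name $name --restart unless-stopped -v $vol:$vol"
  for p in "$@"; do
    cmd="$cmd -p $p"
  done
  echo "$cmd $image"
}

emit_run mcq mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 /srv/mcq \
  127.0.0.1:48080:8080 100.95.252.120:48080:8080
```

Even this stub would have shortened the manual recovery: the operator reviews a generated command instead of reconstructing port mappings from memory.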
+ +## Action Items + +- [ ] Write disaster recovery runbook (separate document) +- [ ] Add fallback DNS resolvers to NixOS config (1.1.1.1, 8.8.8.8) +- [ ] Implement `mcp agent recover` command (recreate containers from + registry) +- [ ] Implement boot sequencing in the agent (Phase 3+ architecture) +- [ ] Investigate modernc.org/sqlite `fchmod` workaround for rootless + podman +- [ ] Add multi-address support to node config (Tailnet IP + LAN IP + + hostname) +- [ ] Revert mcp UID to a stable value and never change it again
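The fallback-resolver item can be sketched as a NixOS fragment. This is an assumption about the eventual config, not the implemented fix; with a plain `/etc/resolv.conf`, resolvers are tried in listed order, so MCNS stays primary:

```nix
{
  # MCNS first; public resolvers after it so external names still
  # resolve during an MCNS outage. Verify the resolver-ordering
  # semantics if systemd-resolved manages resolv.conf on the host.
  networking.nameservers = [ "192.168.88.181" "1.1.1.1" "8.8.8.8" ];
}
```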