metacircular/docs/incident-2026-04-03-uid-change.md
Kyle Isom ad34904b0b Add incident report and disaster recovery runbook
incident-2026-04-03-uid-change.md: Detailed post-mortem of the UID
change cascading failure that took down all services on rift. Documents
the timeline, root causes, recovery steps, and lessons learned.

disaster-recovery.md: Step-by-step runbook for bootstrapping the
platform from zero when all containers are gone. Covers the boot
order (MCNS → mc-proxy/MCR/Metacrypt → master → apps), exact podman
run commands for each service, common errors, and verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:18:45 -07:00


Incident Report: UID Change Cascading Failure

Date: 2026-04-03
Duration: ~2 hours (08:45–10:45 PDT)
Severity: Full platform outage on rift (all containers lost)
Root cause: Changing the mcp system user UID from 995 to 850

Timeline

Background

Orion was being provisioned as a new worker node. Its NixOS config imports mcp.nix which pins the mcp user to UID 995. On orion, UID 995 was already assigned to the sshd user, causing a UID collision:

uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)

Both sshd and mcp had UID 995 on orion. The newuidmap tool rejected rootless podman operations because the calling process's UID (995) belonged to sshd, not mcp, in /etc/passwd.
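A collision like this can be caught mechanically before pinning a UID. A minimal sketch, run here against a sample passwd file mirroring orion's state rather than a live node (file contents are illustrative):

```shell
# uid_owner FILE UID — print the user name that owns UID in a passwd-format FILE
uid_owner() {
  awk -F: -v uid="$2" '$3 == uid { print $1 }' "$1"
}

# Sample mirroring orion before the change: sshd already holds UID 995.
cat > /tmp/passwd.sample <<'EOF'
sshd:x:995:988::/var/empty:/bin/false
messagebus:x:4:4::/run/dbus:/bin/false
EOF

owner=$(uid_owner /tmp/passwd.sample 995)
if [ -n "$owner" ]; then
  echo "UID 995 is taken by $owner"
fi
if [ -z "$(uid_owner /tmp/passwd.sample 850)" ]; then
  echo "UID 850 is free"
fi
```

Running the same check against /etc/passwd on every node before committing a pinned UID to mcp.nix would have surfaced the sshd collision immediately.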

The UID Change

To resolve the collision, mcp.nix was updated to pin UID 850 (in the 800–899 range, empty on all nodes). Both rift and orion were rebuilt with nixos-rebuild switch.

Problem 1: NixOS doesn't change UIDs for existing users. The rebuild created the NixOS config with uid = 850 but the existing mcp user on both nodes kept UID 995. Manual usermod -u 850 mcp was required on each node.
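This drift between the declared and live UID can be detected mechanically. A minimal sketch, shown against a sample passwd entry rather than a live node (on a real node the live value would come from `id -u mcp` and the declared value from the nix config):

```shell
declared_uid=850   # what mcp.nix now pins
# Sample of the stale state nixos-rebuild leaves behind:
cat > /tmp/passwd.drift <<'EOF'
mcp:x:995:988::/home/mcp:/bin/false
EOF
live_uid=$(awk -F: '$1 == "mcp" { print $3 }' /tmp/passwd.drift)
if [ "$live_uid" != "$declared_uid" ]; then
  echo "UID drift: declared $declared_uid, live $live_uid (run: usermod -u $declared_uid mcp)"
fi
```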

Problem 2: Rootless podman caches the UID everywhere.

  • Podman's SQLite database (db.sql) stores absolute paths like /run/user/995/libpod/tmp and /run/user/995/containers
  • The systemd user session (/run/user/995/) is tied to the UID
  • subuid/subgid mappings reference the user by name but the kernel checks the actual UID
  • Container storage overlay directories have file ownership based on the old UID namespace mapping (995 → 100000)

After changing the UID, podman operations failed with:

newuidmap: write to uid_map failed: Operation not permitted
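The mechanics behind this error can be sketched with sample data: /etc/subuid grants ranges by user *name*, but newuidmap first resolves the calling process's numeric UID through /etc/passwd. With orion's collided passwd state, UID 995 resolved to sshd, which has no subuid entry (file contents below are illustrative):

```shell
# subuid grants the range to the name "mcp" ...
cat > /tmp/subuid.sample <<'EOF'
mcp:100000:65536
EOF
# ... but passwd resolves the calling UID (995) to sshd:
cat > /tmp/passwd.collision <<'EOF'
sshd:x:995:988::/var/empty:/bin/false
EOF

caller_uid=995
caller_name=$(awk -F: -v uid="$caller_uid" '$3 == uid { print $1 }' /tmp/passwd.collision)
if ! grep -q "^${caller_name}:" /tmp/subuid.sample; then
  echo "no subuid range for UID $caller_uid (resolves to '$caller_name')"
fi
```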

The Reboot

Rift was rebooted to get a clean systemd user session for UID 850. The reboot succeeded, but all containers were gone:

$ podman ps -a
(empty)

Podman's database was recreated fresh on boot because the old database referenced paths under /run/user/995/ which no longer existed. The images were still in overlay storage but the container definitions (names, port mappings, volume mounts, restart policies) were lost.

DNS Collapse

MCNS (the authoritative DNS server for .svc.mcp.metacircular.net) ran as a container on rift. When all containers were lost, DNS resolution broke:

  • mcq.svc.mcp.metacircular.net → no answer
  • MCNS also served as a recursive resolver for the LAN
  • google.com → NXDOMAIN on machines using MCNS as their resolver

Tailscale DNS (MagicDNS) was also affected because resolved's global DNS config pointed to MCNS. Tailscale itself remained functional (its coordination servers are external), but hostname resolution via Tailscale DNS names failed.

The operator turned off Tailscale on vade (the workstation) thinking Tailscale was the problem. This broke connectivity to rift entirely since the MCP agent binds to the Tailnet IP only (100.95.252.120:9444).

Recovery

Step 1: Turn Tailscale back on (on both rift and vade). Tailscale connectivity works without MCNS — MagicDNS uses Tailscale's own servers for .ts.net names.

Step 2: Start MCNS manually via podman run. The image was cached in overlay storage. MCNS needed explicit port bindings (not --network host) because systemd-resolved holds port 53 on localhost:

podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml

DNS resolution restored within seconds.

Step 3: Start remaining services manually via podman run. Images were all cached. The mcp deploy CLI couldn't work because:

  • MCR was down (can't pull images)
  • The agent's registry was empty (podman DB reset)
  • Auto-build failed (/etc/resolv.conf permission denied in build containers)

Each service was started with explicit podman run commands matching the service definitions in ~/.config/mcp/services/*.toml.

Step 4: Fix file ownership for rootless podman. Files in /srv/* were owned by UID 850 (the mcp user on the host). Inside containers, UID 0 (root) maps to host UID 850, with higher container UIDs mapping through the subuid range. Running podman unshare chown -R 0:0 /srv/<service> translated ownership to match the container's user namespace, but two SQLite-specific problems remained:

  • SQLite's PRAGMA journal_mode = WAL requires creating WAL/SHM files in the database directory
  • modernc.org/sqlite calls fchmod() on the database file, which is denied inside rootless podman user namespaces (even for UID 0 in the namespace)
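For reference, the ownership translation that podman unshare relies on follows the standard rootless mapping. A sketch of the arithmetic, assuming the 100000:65536 subuid range referenced in this report:

```shell
# host_uid CONTAINER_UID — host-side owner of a file created in the container.
# Container UID 0 maps to the unprivileged user itself (850 after the change);
# container UID N (N >= 1) maps to subuid_base + N - 1.
host_uid() {
  if [ "$1" -eq 0 ]; then
    echo 850
  else
    echo $((100000 + $1 - 1))
  fi
}

host_uid 0     # container root  -> 850
host_uid 1     # first subuid    -> 100000
host_uid 1000  # typical app UID -> 100999
```

This is also why the old overlay storage ownership (995 → 100000) became unusable: the host-side owners are baked into the files, keyed to the old uid_map.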

Step 5: Delete and recreate SQLite databases. The fchmod denial was fatal for MCR and Metacrypt. The fix:

# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api

The fchmod error still occurs on the newly created database but is non-fatal — the service logs a warning and continues.

Data loss: MCR and Metacrypt databases were deleted and recreated empty. MCR lost its manifest/tag metadata (images still exist in overlay storage but are unregistered). Metacrypt lost its CA state (encrypted keys, issued certs tracking). Other services (mcq, mcdoc, etc.) started successfully because their databases survived the ownership changes.

Root Causes

  1. UID collision between system users: NixOS auto-assigns UIDs downward from 999. Pinning UID 995 for mcp collided with sshd on orion.

  2. Rootless podman's deep UID dependency: Changing a user's UID after rootless podman has been used requires:

    • Updating podman's internal database paths
    • Recreating the systemd user session
    • Fixing subuid/subgid mappings
    • Fixing overlay storage ownership
    • Fixing service data file ownership

    None of these happen automatically.
  3. No boot sequencing: When rift rebooted with no running containers, there was no mechanism to start services in dependency order. The v2 architecture specifies a boot sequence feature precisely for this case, but it had not been implemented yet.

  4. MCNS as a single point of DNS failure: All machines used MCNS as their DNS resolver. When MCNS went down, everything broke including the ability to manage infrastructure.

  5. modernc.org/sqlite fchmod in rootless podman: The SQLite library calls fchmod() on database files, which is denied inside rootless podman user namespaces. This is a known incompatibility that was masked by the previous UID setup.

Lessons Learned

  1. Never change a rootless podman user's UID. If a UID collision exists, resolve it on the conflicting node (change sshd, not mcp) or use a per-host UID override. Changing the UID after podman has been used is destructive.

  2. DNS must not be a single point of failure. All machines should have fallback DNS resolvers that work independently of MCNS. The NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as fallbacks, not just MCNS.

  3. Boot sequencing is critical. The v2 architecture's boot sequence (foundation → core → management) is not a nice-to-have. Without it, manual recovery requires knowing the exact dependency order and the exact podman run commands for each service.
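The sequencing itself is simple once the tiers are written down. A stubbed sketch using the recovery order from this incident (start_tier stands in for real start-plus-health-check logic; the container names are assumptions based on the services named in this report):

```shell
# start_tier SVC... — stand-in for "podman start + wait for healthy".
start_tier() {
  for svc in "$@"; do
    echo "starting $svc"
    # real code: podman start "$svc" && wait_healthy "$svc" || exit 1
  done
}

start_tier mcns                          # foundation: DNS must come up first
start_tier mc-proxy mcr metacrypt-api    # core: proxy, registry, crypto
start_tier master                        # management
start_tier mcq mcdoc                     # apps
```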

  4. The MCP agent should be able to recover containers from its registry. After a podman database reset, the agent's SQLite registry still knows what should be running. An mcp agent recover command that recreates containers from the registry would eliminate the manual podman run recovery.

  5. Service definitions must include all runtime parameters. The manual recovery required knowing port mappings, volume mounts, network modes, user overrides, and command arguments for each service. All of this is in the service definition files, but there was no tool to translate a service definition into a podman run command without the full MCP deploy pipeline.
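A minimal sketch of such a translator, assuming a flat key = "value" service file (the keys, values, and the toml_get helper are illustrative, not the real service schema; a production tool would use a proper TOML parser):

```shell
# Illustrative service definition in a flat TOML-like layout:
cat > /tmp/svc.toml <<'EOF'
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:latest"
ports = "127.0.0.1:9080:8080"
volumes = "/srv/mcq:/srv/mcq"
EOF

# toml_get FILE KEY — naive single-level lookup with quotes stripped.
toml_get() {
  awk -F' = ' -v k="$2" '$1 == k { gsub(/"/, "", $2); print $2 }' "$1"
}

cmd="podman run -d --name $(toml_get /tmp/svc.toml name) \
--restart unless-stopped \
-p $(toml_get /tmp/svc.toml ports) \
-v $(toml_get /tmp/svc.toml volumes) \
$(toml_get /tmp/svc.toml image)"
echo "$cmd"
```

Even a helper this crude, kept next to the runbook, would have removed the need to reconstruct each podman run invocation from memory during the outage.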

  6. Tailscale is not the problem when DNS breaks. Tailscale's control plane is external. Turning off Tailscale makes things worse, not better, because the agents bind to Tailnet IPs.

Action Items

  • Write disaster recovery runbook (separate document)
  • Add fallback DNS resolvers to NixOS config (1.1.1.1, 8.8.8.8)
  • Implement mcp agent recover command (recreate containers from registry)
  • Implement boot sequencing in the agent (Phase 3+ architecture)
  • Investigate modernc.org/sqlite fchmod workaround for rootless podman
  • Add multi-address support to node config (Tailnet IP + LAN IP + hostname)
  • Revert mcp UID to a stable value and never change it again