Incident Report: UID Change Cascading Failure
Date: 2026-04-03
Duration: ~2 hours (08:45–10:45 PDT)
Severity: Full platform outage on rift (all containers lost)
Root cause: Changing the mcp system user UID from 995 to 850
Timeline
Background
Orion was being provisioned as a new worker node. Its NixOS config
imports mcp.nix which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the sshd user, causing a UID collision:
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
Both sshd and mcp had UID 995 on orion. The newuidmap tool
rejected rootless podman operations because the calling process's UID
(995) belonged to sshd, not mcp, in /etc/passwd.
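A collision like this is cheap to detect before provisioning: scan the passwd database for UIDs assigned to more than one account. A minimal sketch, run here against sample data mirroring the orion collision (on a real node, feed /etc/passwd directly):

```shell
# Find UIDs (field 3) assigned to more than one account.
passwd_sample='root:x:0:0::/root:/bin/sh
sshd:x:995:992::/run/sshd:/bin/false
mcp:x:995:988::/var/lib/mcp:/bin/sh'
printf '%s\n' "$passwd_sample" |
  awk -F: 'u[$3]++ { print "UID collision: " $3 " (" $1 ")" }'
```

Running this check in the node provisioning pipeline would have flagged orion before the rebuild.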
The UID Change
To resolve the collision, mcp.nix was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with nixos-rebuild switch.
Problem 1: NixOS doesn't change UIDs for existing users. The
rebuild applied the configuration with uid = 850, but the existing
mcp user on both nodes kept UID 995. A manual usermod -u 850 mcp
was required on each node.
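The manual steps look roughly like the sketch below. This is a hypothetical runbook fragment, not what was executed verbatim; every command is destructive, so it is written with DRYRUN=1 so it only prints what it would do:

```shell
# Sketch of a manual UID change for the mcp user (995 -> 850).
# DRYRUN=1 prints each command instead of executing it.
DRYRUN=1
run() { if [ "${DRYRUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run usermod -u 850 mcp                       # NixOS will not do this for you
run groupmod -g 850 mcp                      # only if the group is pinned too
run find /var/lib/mcp /srv -xdev -user 995 -exec chown -h 850 '{}' +
run loginctl terminate-user mcp              # force a fresh systemd user session
```

Even this list is incomplete, as the rest of the incident shows: it says nothing about podman's cached database paths or overlay storage ownership.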
Problem 2: Rootless podman caches the UID everywhere.
- Podman's SQLite database (db.sql) stores absolute paths like /run/user/995/libpod/tmp and /run/user/995/containers
- The systemd user session (/run/user/995/) is tied to the UID
- subuid/subgid mappings reference the user by name, but the kernel checks the actual UID
- Container storage overlay directories have file ownership based on the old UID namespace mapping (995 → 100000)
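The stale database state is easy to confirm once you know where to look. A sketch, under the assumption that the rootless storage database lives at ~/.local/share/containers/storage/db.sql (the default for podman's SQLite backend); here the grep runs over sample strings rather than the live database:

```shell
# Paths of this shape inside podman's database pin it to the old UID.
# On a live system: strings ~/.local/share/containers/storage/db.sql | grep run/user
db_paths='/run/user/995/libpod/tmp
/run/user/995/containers'
echo "$db_paths" | grep -c '/run/user/995'   # prints 2: stale references remain
```

A nonzero count after a UID change means podman will fail to reuse its state on the next boot.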
After changing the UID, podman operations failed with:
newuidmap: write to uid_map failed: Operation not permitted
The Reboot
Rift was rebooted to get a clean systemd user session for UID 850. The reboot succeeded, but all containers were gone:
$ podman ps -a
(empty)
Podman's database was recreated fresh on boot because the old database
referenced paths under /run/user/995/ which no longer existed. The
images were still in overlay storage but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.
DNS Collapse
MCNS (the authoritative DNS server for .svc.mcp.metacircular.net)
ran as a container on rift. When all containers were lost, DNS
resolution broke:
- mcq.svc.mcp.metacircular.net → no answer
- MCNS also served as a recursive resolver for the LAN
- google.com → NXDOMAIN on machines using MCNS as their resolver
Tailscale DNS (MagicDNS) was also affected because systemd-resolved's global DNS config pointed to MCNS. Tailscale itself remained functional (its coordination servers are external), but hostname resolution via Tailscale DNS names failed.
The operator turned off Tailscale on vade (the workstation) because
Tailscale's MagicDNS was routing ALL DNS queries through the broken
MCNS resolver — external services including Claude Code and Gitea
were unreachable. Disabling Tailscale was the only way to restore
external DNS resolution. However, this also broke connectivity to
rift since the MCP agent binds to the Tailnet IP only
(100.95.252.120:9444).
Recovery
Step 1: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for .ts.net names.
Step 2: Start MCNS manually via podman run. The image was cached
in overlay storage. MCNS needed explicit port bindings (not --network host) because systemd-resolved holds port 53 on localhost:
podman run -d --name mcns --restart unless-stopped \
-p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
-p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
-p 127.0.0.1:38443:8443 \
-v /srv/mcns:/srv/mcns \
mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
server --config /srv/mcns/mcns.toml
DNS resolution restored within seconds.
Step 3: Start remaining services manually via podman run. Images
were all cached. The mcp deploy CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (/etc/resolv.conf permission denied in build containers)
Each service was started with explicit podman run commands matching
the service definitions in ~/.config/mcp/services/*.toml.
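Step 3 amounts to mechanically translating each service definition into a podman run invocation. A rough sketch of that translation using an invented, deliberately minimal TOML — the field names and the mcq image tag are illustrative, not the real ~/.config/mcp/services schema:

```shell
# Crude key=value extraction from a flat TOML; real definitions need a proper parser.
cat > /tmp/svc.toml <<'EOF'
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v1"
ports = "127.0.0.1:9080:8080"
volumes = "/srv/mcq:/srv/mcq"
EOF
eval "$(sed 's/ *= */=/' /tmp/svc.toml)"
echo podman run -d --restart unless-stopped \
  --name "$name" -p "$ports" -v "$volumes" "$image"
```

A real `mcp deploy --print-run` style tool would also have to carry network mode, user overrides, and command arguments, which is exactly the gap the Lessons Learned section calls out.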
Step 4: Fix file ownership for rootless podman. Files in /srv/*
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via subuid. But:
- podman unshare chown -R 0:0 /srv/<service> translated ownership to match the container's user namespace
- SQLite's PRAGMA journal_mode = WAL requires creating WAL/SHM files in the database directory
- modernc.org/sqlite calls fchmod() on the database file, which is denied inside rootless podman user namespaces (even for UID 0 in the namespace)
Step 5: Delete and recreate SQLite databases. The fchmod denial
was fatal for MCR and Metacrypt. The fix:
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
The fchmod error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.
Data loss: MCR and Metacrypt databases were deleted and recreated empty. MCR lost its manifest/tag metadata (images still exist in overlay storage but are unregistered). Metacrypt lost its CA state (encrypted keys, issued certs tracking). Other services (mcq, mcdoc, etc.) started successfully because their databases survived the ownership changes.
Root Causes
- UID collision between system users: NixOS auto-assigns system UIDs downward from 999. Pinning UID 995 for mcp collided with sshd on orion.
- Rootless podman's deep UID dependency: Changing a user's UID after rootless podman has been used requires:
  - Updating podman's internal database paths
  - Recreating the systemd user session
  - Fixing subuid/subgid mappings
  - Fixing overlay storage ownership
  - Fixing service data file ownership
  - None of these happen automatically
- No boot sequencing: When rift rebooted with no running containers, there was no mechanism to start services in dependency order. The boot-sequencing feature in the v2 architecture exists precisely for this, but wasn't implemented yet.
- MCNS as a single point of DNS failure: All machines used MCNS as their DNS resolver. When MCNS went down, everything broke, including the ability to manage the infrastructure.
- modernc.org/sqlite fchmod in rootless podman: The SQLite library calls fchmod() on database files, which is denied inside rootless podman user namespaces. This known incompatibility was masked by the previous UID setup.
Lessons Learned
- Never change a rootless podman user's UID. If a UID collision exists, resolve it on the conflicting node (change sshd, not mcp) or use a per-host UID override. Changing the UID after podman has been used is destructive.
- DNS must not be a single point of failure. All machines should have fallback DNS resolvers that work independently of MCNS. The NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as fallbacks, not just MCNS.
- Boot sequencing is critical. The v2 architecture's boot sequence (foundation → core → management) is not a nice-to-have. Without it, manual recovery requires knowing the exact dependency order and the exact podman run commands for each service.
- The MCP agent should be able to recover containers from its registry. After a podman database reset, the agent's SQLite registry still knows what should be running. An mcp agent recover command that recreates containers from the registry would eliminate the manual podman run recovery.
- Service definitions must include all runtime parameters. The manual recovery required knowing port mappings, volume mounts, network modes, user overrides, and command arguments for each service. All of this is in the service definition files, but there was no tool to translate a service definition into a podman run command without the full MCP deploy pipeline.
- Tailscale MagicDNS amplifies DNS failures. When MCNS is down and MagicDNS routes through it, ALL DNS breaks, not just internal names. Disabling Tailscale restores external DNS but loses Tailnet connectivity. The fix is fallback resolvers that bypass MCNS, not disabling Tailscale.
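The fallback-resolver lesson reduces to a resolver-ordering change: MCNS first, public resolvers after it, since systemd-resolved moves on to the next configured server when the current one stops answering. A hypothetical illustration of the resulting drop-in (on NixOS this would be generated from the system config, not written by hand; the MCNS address is rift's LAN IP from the recovery step):

```shell
cat <<'EOF'
# /etc/systemd/resolved.conf.d/10-fallback.conf
[Resolve]
DNS=192.168.88.181 1.1.1.1 8.8.8.8
EOF
```

With this ordering, an MCNS outage degrades internal .svc.mcp.metacircular.net resolution but leaves external DNS working, so the incident's "disable Tailscale to get DNS back" workaround is never needed.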
Action Items
- Write disaster recovery runbook → docs/disaster-recovery.md
- Add fallback DNS resolvers to NixOS config → all nodes now have 1.1.1.1 and 8.8.8.8 as fallbacks after MCNS
- Implement mcp agent recover command (recreate containers from registry)
- Implement boot sequencing in the agent (Phase 4+ architecture)
- Fix modernc.org/sqlite fchmod → was our own os.Chmod in mcdsl/db/db.go, not sqlite; made best-effort in mcdsl v1.8.0
- Add multi-address support to node config (Tailnet IP + LAN IP + hostname)
- Stabilize mcp UID → pinned at 850 with a NEVER CHANGE comment