# Incident Report: UID Change Cascading Failure

**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850

## Timeline
### Background

Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix`, which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:

```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```

Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.
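This kind of collision can be checked for before pinning a UID by scanning a passwd-format file for the candidate range. A minimal sketch, assuming the 800–899 range; `find_free_uid` is a hypothetical helper, not part of the MCP tooling:

```bash
#!/usr/bin/env bash
# Find the first UID in 800-899 that no entry in a passwd-format file
# already uses. Prints the UID, or returns nonzero if the range is full.
find_free_uid() {
  local passwd_file=$1 uid
  for uid in $(seq 800 899); do
    # awk exits 0 when the UID is taken, 1 when it is free
    if ! awk -F: -v u="$uid" '$3 == u { found = 1 } END { exit !found }' "$passwd_file"; then
      echo "$uid"
      return 0
    fi
  done
  return 1
}
```

Running `find_free_uid /etc/passwd` on every node before editing `mcp.nix` would have caught the sshd collision up front.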
### The UID Change

To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with `nixos-rebuild switch`.

**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild created the NixOS config with `uid = 850`, but the existing
`mcp` user on both nodes kept UID 995. A manual `usermod -u 850 mcp`
was required on each node.
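The manual steps can be wrapped in a small script. This is a sketch, not the exact procedure run during the incident; the step list is an assumption based on what NixOS does not do automatically, and `DRY_RUN=1` prints the commands instead of executing them, since they require root and a quiesced user session:

```bash
# Hypothetical helper for the manual part of a UID migration.
# DRY_RUN=1 echoes each command instead of running it.
migrate_uid() {
  local user=$1 old_uid=$2 new_uid=$3
  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }
  run loginctl terminate-user "$user"   # drop the old /run/user/<old_uid> session
  run usermod -u "$new_uid" "$user"     # rewrites /etc/passwd, chowns $HOME
  # usermod only fixes the home directory; service data elsewhere needs chown
  run find /srv -user "$old_uid" -exec chown -h "$new_uid" {} +
}
```

Even with a script like this, the podman-side state described below still breaks, so this only covers the host-account half of the problem.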
**Problem 2: Rootless podman caches the UID everywhere.**

- Podman's SQLite database (`db.sql`) stores absolute paths like
  `/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name but the kernel
  checks the actual UID
- Container storage overlay directories have file ownership based on
  the old UID namespace mapping (995 → 100000)
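The last point is worth spelling out. With podman's default rootless mapping, container UID 0 maps to the user's own UID and container UIDs 1..N map into the user's `/etc/subuid` range, so every file in overlay storage encodes the old host UID. A sketch of the translation (the helper name is invented; values are from this incident):

```bash
# Default rootless podman user-namespace mapping: container UID 0 is the
# user itself; container UID n (n >= 1) is subuid_start + n - 1.
container_to_host_uid() {
  local cuid=$1 user_uid=$2 subuid_start=$3
  if [ "$cuid" -eq 0 ]; then
    echo "$user_uid"
  else
    echo $((subuid_start + cuid - 1))
  fi
}
```

Under the old mapping, files created by container root were owned by host UID 995; after the change, container root resolves to 850, so all of that storage looks foreign to the new namespace.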
After changing the UID, `podman` operations failed with:

```
newuidmap: write to uid_map failed: Operation not permitted
```

### The Reboot

Rift was rebooted to get a clean systemd user session for UID 850.
The reboot succeeded, but **all containers were gone**:

```
$ podman ps -a
(empty)
```

Podman's database was recreated fresh on boot because the old database
referenced paths under `/run/user/995/`, which no longer existed. The
images were still in overlay storage, but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.
### DNS Collapse

MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
ran as a container on rift. When all containers were lost, DNS
resolution broke:

- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver

Tailscale DNS (MagicDNS) was also affected because resolved's global
DNS config pointed to MCNS. Tailscale itself remained functional
(its coordination servers are external), but hostname resolution via
Tailscale DNS names failed.

The operator turned off Tailscale on vade (the workstation), thinking
Tailscale was the problem. This broke connectivity to rift entirely,
since the MCP agent binds to the Tailnet IP only (`100.95.252.120:9444`).

### Recovery

**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for `.ts.net` names.
**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```

DNS resolution was restored within seconds.

**Step 3**: Start the remaining services manually via `podman run`.
Images were all cached. The `mcp deploy` CLI couldn't work because:

- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
  containers)

Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.
**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via subuid. But:

- `podman unshare chown -R 0:0 /srv/<service>` translated ownership
  to match the container's user namespace
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
  in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
  denied inside rootless podman user namespaces (even for UID 0 in the
  namespace)
**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt. The fix:

```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
```

The `fchmod` error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.

**Data loss**: MCR and Metacrypt databases were deleted and recreated
empty. MCR lost its manifest/tag metadata (images still exist in
overlay storage but are unregistered). Metacrypt lost its CA state
(encrypted keys, issued certs tracking). Other services (mcq, mcdoc,
etc.) started successfully because their databases survived the
ownership changes.
## Root Causes

1. **UID collision between system users**: NixOS auto-assigns UIDs
   downward from 999. Pinning UID 995 for mcp collided with sshd on
   orion.

2. **Rootless podman's deep UID dependency**: Changing a user's UID
   after rootless podman has been used requires:
   - Updating podman's internal database paths
   - Recreating the systemd user session
   - Fixing subuid/subgid mappings
   - Fixing overlay storage ownership
   - Fixing service data file ownership

   None of these happen automatically.

3. **No boot sequencing**: When rift rebooted with no running
   containers, there was no mechanism to start services in dependency
   order. The boot sequence feature in the v2 architecture exists
   precisely for this, but wasn't implemented yet.

4. **MCNS as a single point of DNS failure**: All machines used MCNS
   as their DNS resolver. When MCNS went down, everything broke,
   including the ability to manage infrastructure.

5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
   library calls `fchmod()` on database files, which is denied inside
   rootless podman user namespaces. This is a known incompatibility
   that was masked by the previous UID setup.
## Lessons Learned

1. **Never change a rootless podman user's UID.** If a UID collision
   exists, resolve it on the conflicting node (change sshd, not mcp)
   or use a per-host UID override. Changing the UID after podman has
   been used is destructive.

2. **DNS must not be a single point of failure.** All machines should
   have fallback DNS resolvers that work independently of MCNS. The
   NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
   fallbacks, not just MCNS.
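One possible shape for that change, assuming systemd-resolved stays in place (option names are from the stock NixOS modules; the MCNS address is the LAN IP from this report):

```nix
{
  # Prefer MCNS, but let resolved fall back to public resolvers
  # when MCNS is unreachable.
  networking.nameservers = [ "192.168.88.181" ];
  services.resolved = {
    enable = true;
    fallbackDns = [ "1.1.1.1" "8.8.8.8" ];
  };
}
```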
3. **Boot sequencing is critical.** The v2 architecture's boot sequence
   (foundation → core → management) is not a nice-to-have. Without it,
   manual recovery requires knowing the exact dependency order and the
   exact `podman run` command for each service.

4. **The MCP agent should be able to recover containers from its
   registry.** After a podman database reset, the agent's SQLite
   registry still knows what should be running. An `mcp agent recover`
   command that recreates containers from the registry would eliminate
   the manual `podman run` recovery.

5. **Service definitions must include all runtime parameters.** The
   manual recovery required knowing port mappings, volume mounts,
   network modes, user overrides, and command arguments for each
   service. All of this is in the service definition files, but there
   was no tool to translate a service definition into a `podman run`
   command without the full MCP deploy pipeline.
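Such a tool could be small. The sketch below invents a flat `key=value` definition format purely for illustration (the real `*.toml` schema is not shown in this report) and renders it into a `podman run` command line:

```bash
# Hypothetical renderer: service definition file -> podman run command.
# The name/image/port/volume/args keys are invented for this sketch.
render_podman_run() {
  local file=$1 cmd="podman run -d --restart unless-stopped" key val
  local image="" args=""
  while IFS='=' read -r key val; do
    case $key in
      name)   cmd="$cmd --name $val" ;;
      port)   cmd="$cmd -p $val" ;;     # may repeat for multiple bindings
      volume) cmd="$cmd -v $val" ;;     # may repeat for multiple mounts
      image)  image=$val ;;
      args)   args=$val ;;
    esac
  done < "$file"
  echo "$cmd $image $args"
}
```

The point is that everything needed for disaster recovery already lives in the definition files; only the translation step was missing.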
6. **Tailscale is not the problem when DNS breaks.** Tailscale's
   control plane is external. Turning off Tailscale makes things worse,
   not better, because the agents bind to Tailnet IPs.

## Action Items

- [ ] Write disaster recovery runbook (separate document)
- [ ] Add fallback DNS resolvers to NixOS config (1.1.1.1, 8.8.8.8)
- [ ] Implement `mcp agent recover` command (recreate containers from
      registry)
- [ ] Implement boot sequencing in the agent (Phase 3+ architecture)
- [ ] Investigate modernc.org/sqlite `fchmod` workaround for rootless
      podman
- [ ] Add multi-address support to node config (Tailnet IP + LAN IP
      + hostname)
- [ ] Revert mcp UID to a stable value and never change it again