Move incident report to log/ directory

Incident reports go in log/ (operational work logs), not docs/
(architecture and reference documentation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:22:42 -07:00
parent ad34904b0b
commit a474f17a8b


@@ -1,223 +0,0 @@
# Incident Report: UID Change Cascading Failure
**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850
## Timeline
### Background
Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix` which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:
```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```
Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.
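A collision like this can be spotted on any node with a one-liner over `/etc/passwd` (a generic sketch, not tooling from this repo):

```shell
# Print any UID claimed by more than one entry in /etc/passwd.
awk -F: '{ count[$3]++; names[$3] = names[$3] " " $1 }
         END { for (uid in count) if (count[uid] > 1) print uid ":" names[uid] }' /etc/passwd
```

On a node with the collision above, this prints the UID followed by both user names.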
### The UID Change
To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with `nixos-rebuild switch`.
**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild created the NixOS config with `uid = 850` but the existing
`mcp` user on both nodes kept UID 995. Manual `usermod -u 850 mcp`
was required on each node.
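The manual steps look roughly like this (a sketch; the session teardown and data paths are assumptions about this setup, not commands recorded during the incident):

```shell
# Stop anything running as the old UID, then change it.
systemctl stop user@995.service     # tear down the old systemd user session
usermod -u 850 mcp                  # NixOS will not do this for an existing user
# usermod re-owns the home directory only; service data must be fixed by hand:
chown -R mcp: /srv                  # assumes /srv holds the mcp-owned service data
loginctl enable-linger mcp          # ensure /run/user/850 exists after reboot
```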
**Problem 2: Rootless podman caches the UID everywhere.**
- Podman's SQLite database (`db.sql`) stores absolute paths like
`/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name but the kernel
checks the actual UID
- Container storage overlay directories have file ownership based on
the old UID namespace mapping (995 → 100000)
After changing the UID, `podman` operations failed with:
```
newuidmap: write to uid_map failed: Operation not permitted
```
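The failure can be diagnosed by comparing what the session, `/etc/passwd`, and `/etc/subuid` each believe (a generic sketch):

```shell
id -u                      # UID of the current session (still 995 after the rebuild)
getent passwd mcp          # what /etc/passwd now records (850)
grep '^mcp:' /etc/subuid   # subordinate range granted to the name "mcp"
# newuidmap refuses when the calling process's UID does not match the
# /etc/passwd entry that owns the subuid grant.
```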
### The Reboot
Rift was rebooted to get a clean systemd user session for UID 850.
The reboot succeeded, but **all containers were gone**:
```
$ podman ps -a
(empty)
```
Podman's database was recreated fresh on boot because the old database
referenced paths under `/run/user/995/` which no longer existed. The
images were still in overlay storage but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.
### DNS Collapse
MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
ran as a container on rift. When all containers were lost, DNS
resolution broke:
- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver
Tailscale DNS (MagicDNS) was also affected because resolved's global
DNS config pointed to MCNS. Tailscale itself remained functional
(its coordination servers are external), but hostname resolution via
Tailscale DNS names failed.
The operator turned off Tailscale on vade (the workstation) thinking
Tailscale was the problem. This broke connectivity to rift entirely
since the MCP agent binds to the Tailnet IP only (`100.95.252.120:9444`).
### Recovery
**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for `.ts.net` names.
**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:
```bash
podman run -d --name mcns --restart unless-stopped \
-p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
-p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
-p 127.0.0.1:38443:8443 \
-v /srv/mcns:/srv/mcns \
mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
server --config /srv/mcns/mcns.toml
```
DNS resolution restored within seconds.
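A quick check that both roles are back (resolver address from the `podman run` above):

```shell
# Authoritative zone:
dig +short @192.168.88.181 mcq.svc.mcp.metacircular.net
# Recursion for external names:
dig +short @192.168.88.181 google.com
```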
**Step 3**: Start remaining services manually via `podman run`. Images
were all cached. The `mcp deploy` CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
containers)
Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.
**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via subuid, so `podman unshare
chown -R 0:0 /srv/<service>` translated ownership to match the
container's user namespace. Two SQLite problems remained:
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
denied inside rootless podman user namespaces (even for UID 0 in the
namespace)
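The namespace mapping can be verified before re-owning anything (a generic check):

```shell
# First line: container UID 0 maps to host UID 850 (length 1);
# subsequent lines: UIDs 1+ map into the subuid range (e.g. 100000+).
podman unshare cat /proc/self/uid_map
```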
**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt. The fix:
```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
```
The `fchmod` error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.
**Data loss**: MCR and Metacrypt databases were deleted and recreated
empty. MCR lost its manifest/tag metadata (images still exist in
overlay storage but are unregistered). Metacrypt lost its CA state
(encrypted keys, issued certs tracking). Other services (mcq, mcdoc,
etc.) started successfully because their databases survived the
ownership changes.
## Root Causes
1. **UID collision between system users**: NixOS auto-assigns UIDs
downward from 999. Pinning UID 995 for mcp collided with sshd on
orion.
2. **Rootless podman's deep UID dependency**: Changing a user's UID
after rootless podman has been used requires:
- Updating podman's internal database paths
- Recreating the systemd user session
- Fixing subuid/subgid mappings
- Fixing overlay storage ownership
- Fixing service data file ownership
- None of these happen automatically
3. **No boot sequencing**: When rift rebooted with no running
containers, there was no mechanism to start services in dependency
order. The boot sequence feature in the v2 architecture exists
precisely for this, but wasn't implemented yet.
4. **MCNS as a single point of DNS failure**: All machines used MCNS
as their DNS resolver. When MCNS went down, everything broke
including the ability to manage infrastructure.
5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
library calls `fchmod()` on database files, which is denied inside
rootless podman user namespaces. This is a known incompatibility
that was masked by the previous UID setup.
## Lessons Learned
1. **Never change a rootless podman user's UID.** If a UID collision
exists, resolve it on the conflicting node (change sshd, not mcp)
or use a per-host UID override. Changing the UID after podman has
been used is destructive.
2. **DNS must not be a single point of failure.** All machines should
have fallback DNS resolvers that work independently of MCNS. The
NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
fallbacks, not just MCNS.
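Until the NixOS change lands, systemd-resolved can be pointed at public resolvers per link as an interim mitigation (the interface name here is an assumption):

```shell
resolvectl dns eth0 1.1.1.1 8.8.8.8   # temporary per-link override, lost on reboot
resolvectl query google.com           # confirm resolution works without MCNS
```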
3. **Boot sequencing is critical.** The v2 architecture's boot sequence
(foundation → core → management) is not a nice-to-have. Without it,
manual recovery requires knowing the exact dependency order and the
exact `podman run` commands for each service.
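The foundation → core → management ordering can be sketched as a trivial phase loop (phase membership here is illustrative, not the real v2 sequence; `echo` stands in for `podman start`):

```shell
#!/usr/bin/env bash
phases=(
  "mcns"                # foundation: DNS before anything else
  "mcr metacrypt-api"   # core: registry and CA
  "mcq mcdoc"           # management: everything that depends on core
)
for phase in "${phases[@]}"; do
  for svc in $phase; do
    echo "starting $svc"    # stand-in for: podman start "$svc"
  done
  # a real implementation would wait for health checks before the next phase
done
```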
4. **The MCP agent should be able to recover containers from its
registry.** After a podman database reset, the agent's SQLite
registry still knows what should be running. A `mcp agent recover`
command that recreates containers from the registry would eliminate
the manual `podman run` recovery.
5. **Service definitions must include all runtime parameters.** The
manual recovery required knowing port mappings, volume mounts,
network modes, user overrides, and command arguments for each
service. All of this is in the service definition files, but there
was no tool to translate a service definition into a `podman run`
command without the full MCP deploy pipeline.
6. **Tailscale is not the problem when DNS breaks.** Tailscale's
control plane is external. Turning off Tailscale makes things worse,
not better, because the agents bind to Tailnet IPs.
## Action Items
- [ ] Write disaster recovery runbook (separate document)
- [ ] Add fallback DNS resolvers to NixOS config (1.1.1.1, 8.8.8.8)
- [ ] Implement `mcp agent recover` command (recreate containers from
registry)
- [ ] Implement boot sequencing in the agent (Phase 3+ architecture)
- [ ] Investigate modernc.org/sqlite `fchmod` workaround for rootless
podman
- [ ] Add multi-address support to node config (Tailnet IP + LAN IP
+ hostname)
- [ ] Revert mcp UID to a stable value and never change it again