# Disaster Recovery: Bootstrap from Zero
This document covers recovering the Metacircular platform when all services on rift are down and no containers are running. It assumes:
- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (/srv/<service>/) are intact on disk
- Container images are cached in podman's overlay storage
If images are NOT cached (fresh machine, disk wipe), see the "Cold Start" section at the end.
## Prerequisites
Before starting recovery, verify:
```sh
# 1. Machine is up
hostname
# should print "rift"

# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)

# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)

# 4. The agent is running
systemctl status mcp-agent
# Should be active

# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 0
```
If Tailscale is not running: doas systemctl start tailscaled && doas tailscale up

If the agent is not running: check that /srv/mcp/mcp-agent exists and /srv/mcp/mcp-agent.toml is correct, then doas systemctl restart mcp-agent.
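The five checks above can be collapsed into one script. This is a sketch, assuming the hostnames, UID, and paths documented in this runbook:

```sh
#!/bin/sh
# Sketch: run all five prerequisite checks in one pass and report each
# result. check() runs any command and prints ok/FAIL with a label.
check() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $desc"
  else
    echo "FAIL $desc"
  fi
}

check "hostname is rift"      sh -c '[ "$(hostname)" = rift ]'
check "tailscale connected"   tailscale status --self
check "mcp user has uid 850"  sh -c '[ "$(id -u mcp 2>/dev/null)" = 850 ]'
check "mcp-agent is active"   systemctl is-active --quiet mcp-agent
check "podman images cached"  sh -c '[ "$(su -s /bin/sh mcp -c \
  "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images -q" 2>/dev/null | wc -l)" -gt 0 ]'
```

Any FAIL line points at the matching fix in the paragraphs above.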
## Recovery Order
Services must be started in dependency order. Each stage must be healthy before the next starts.
- Stage 1 (Foundation): MCNS → DNS works
- Stage 2 (Core): mc-proxy, MCR, Metacrypt → routing + images + certs
- Stage 3 (Management): mcp-master → orchestration
- Stage 4 (Services): mcq, mcdoc, mcat, kls, sgard, exo → applications
## Manual Recovery Commands
All commands run as the mcp user. Use this shell prefix:
```sh
# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'

# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift
```
For brevity, the commands below show only the podman run portion. Prefix them with the environment setup above.
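As an alternative to the exported prefix, a small wrapper function makes the stage commands paste-able. This is a sketch; mcp_podman is an illustrative name, not an existing tool, and it assumes doas is configured for the invoking user:

```sh
# Sketch: run any podman subcommand as the mcp user with the
# environment documented above.
mcp_podman() {
  doas su -s /bin/sh mcp -c \
    "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman $*"
}

# Usage:
#   mcp_podman ps
#   mcp_podman run -d --name mcns ...   # any stage command below
```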
### Stage 1: MCNS (DNS)
MCNS must start first. Without it, no hostname resolution works.
```sh
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp \
  -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp \
  -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```
Verify:
```sh
dig @192.168.88.181 google.com +short
# Should return an IP address

dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP
```
Note: MCNS binds to specific IPs, not 0.0.0.0, because systemd-resolved holds port 53 on localhost. The explicit bindings avoid the conflict.
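If the container still fails to bind, it helps to see exactly which listeners hold port 53 first. A minimal check:

```sh
# List every listener on port 53; expect systemd-resolved on 127.0.0.53,
# plus the two explicit MCNS bindings once the container is up.
ss -tulnp | grep ':53 ' || echo "no listener on port 53"
```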
### Stage 2: Core Infrastructure

#### mc-proxy (TLS routing)
```sh
podman run -d --name mc-proxy --restart unless-stopped \
  --network host \
  -v /srv/mc-proxy:/srv/mc-proxy \
  mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
  server --config /srv/mc-proxy/mc-proxy.toml
```
Verify (you should get a response, even if it is a 404):

```sh
curl -sk https://localhost:443/ 2>&1 | head -1
```
#### MCR (Container Registry)
```sh
# API server
podman run -d --name mcr-api --restart unless-stopped \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28443:8443 \
  -p 127.0.0.1:29443:9443 \
  mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
  server --config /srv/mcr/mcr.toml

# Web UI
podman run -d --name mcr-web --restart unless-stopped \
  --user 0:0 \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
  server --config /srv/mcr/mcr.toml
```
If MCR fails with "chmod" or "readonly database":
```sh
podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api
```
This recreates the database empty. Image blobs in /srv/mcr/layers/ are preserved, but tag metadata is lost. Re-push images to rebuild the registry.
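Re-pushing every cached image can be done in one pass. This is a sketch; as_mcp and repush_all are illustrative helper names, and it assumes all relevant images carry this registry's prefix:

```sh
# Sketch: run a command as mcp with podman's environment.
as_mcp() {
  su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp $1"
}

# Re-push every cached image belonging to this registry so MCR
# relearns its tags after the database reset.
repush_all() {
  as_mcp "podman images --format '{{.Repository}}:{{.Tag}}'" |
    grep '^mcr\.svc\.mcp\.metacircular\.net:8443/' |
    while read -r img; do
      as_mcp "podman push $img"
    done
}
```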
#### Metacrypt (PKI / Secrets)
```sh
# API server
podman run -d --name metacrypt-api --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18443:8443 \
  -p 127.0.0.1:19443:9443 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
  server --config /srv/metacrypt/metacrypt.toml

# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18080:8080 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
  --config /srv/metacrypt/metacrypt.toml
```
If Metacrypt fails with "chmod" or "readonly database": Same fix as MCR — delete the database files. Warning: this loses all encrypted secrets, issued certs tracking, and CA state. The CA key itself is in the sealed vault (password-protected), not in SQLite.
### Stage 3: MCP Master
```sh
podman run -d --name mcp-master --restart unless-stopped \
  --network host \
  -v /srv/mcp-master:/srv/mcp-master \
  mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
  server --config /srv/mcp-master/mcp-master.toml
```
Verify:
```sh
# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion
```
### Stage 4: Application Services
Once the master is running, deploy applications through MCP:
```sh
mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct
```
Or start them manually:
```sh
# MCQ
podman run -d --name mcq --restart unless-stopped \
  -v /srv/mcq:/srv/mcq \
  -p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
  server --config /srv/mcq/mcq.toml

# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
  -v /srv/mcdoc:/srv/mcdoc \
  -p 127.0.0.1:38080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
  server --config /srv/mcdoc/mcdoc.toml

# MCAT
podman run -d --name mcat --restart unless-stopped \
  -v /srv/mcat:/srv/mcat \
  -p 127.0.0.1:48116:8443 \
  mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
  server --config /srv/mcat/mcat.toml

# KLS
podman run -d --name kls --restart unless-stopped \
  -v /srv/kls:/srv/kls \
  -p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
  mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
  -f /srv/kls/kls.conf

# Sgard
podman run -d --name sgardd --restart unless-stopped \
  -v /srv/sgard:/srv/sgard \
  -p 127.0.0.1:19473:9473 \
  mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
  --repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
  --tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key
```
## Verification Checklist
After all services are running:
```sh
# Fleet status
mcp ps
# All services should show "running"

# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short

# MCIAS (runs on svc, should be unaffected by rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health

# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/

# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health

# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```
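The endpoint checks above can be wrapped in one loop. This is a sketch; verify_all is an illustrative name, and the URL list mirrors the commands above:

```sh
# Sketch: hit each health endpoint and report; returns nonzero if any
# endpoint is unreachable.
verify_all() {
  fail=0
  for url in \
    https://mcias.metacircular.net:8443/v1/health \
    https://mcr.svc.mcp.metacircular.net:8443/v2/ \
    https://metacrypt.svc.mcp.metacircular.net:8443/v1/health \
    https://mcq.metacircular.net/ \
    https://docs.metacircular.net/
  do
    if curl -sk --max-time 5 -o /dev/null "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      fail=1
    fi
  done
  return $fail
}
```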
## Common Errors

### "chmod: operation not permitted"
modernc.org/sqlite calls fchmod() on database files, which is denied inside rootless podman user namespaces. Fix:
```sh
# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>
```
The fchmod error will still appear in logs as a warning but is non-fatal for newly created databases.
"address already in use" on port 53
systemd-resolved holds port 53 on localhost, so MCNS must bind to specific IPs, not 0.0.0.0:53. Use explicit port bindings:

```sh
-p 192.168.88.181:53:53 -p 100.95.252.120:53:53
```
"connection refused" to MCR
MCR is down. Images are cached locally — you can start services that use cached images without MCR. MCR itself starts from its cached image.
Agent shows "error" for all nodes
Check:
- Tailscale is running on both the CLI machine and the target node
- The agent is listening: ss -tlnp | grep 9444
- The CLI config has the correct addresses
- TLS certs have the right SANs for the Tailnet IP
"podman: executable file not found"
This warning appears for svc (which uses Docker, not podman). It's benign — svc is an edge node that doesn't run containers.
## Cold Start (No Cached Images)
If the machine was wiped and no images are cached:
1. MCIAS runs on svc (Docker container), not rift, so it should be unaffected by a rift failure. Verify:

   ```sh
   ssh svc.metacircular.net "docker ps | grep mcias"
   ```

2. Pre-stage images by pulling from a backup or building locally:

   ```sh
   # On vade (operator workstation), build and push to a temp location
   cd ~/src/metacircular/mcns && make docker
   podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
     ssh rift "podman load"
   ```

   Repeat for each service.

3. Alternatively, if another node has MCR access, push images there first, then pull from the running MCR instance.
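Pre-staging every image can be done in one loop. This is a sketch; stage_all is an illustrative name, and the tag list mirrors the Service Reference table, so adjust versions as they move:

```sh
# Sketch: save each locally built image on vade and load it into rift's
# podman cache over SSH.
stage_all() {
  REG=mcr.svc.mcp.metacircular.net:8443
  for img in mcns:v1.2.0 mc-proxy:v1.2.2 mcr:v1.2.1 mcr-web:v1.3.2 \
             metacrypt:v1.3.1 metacrypt-web:v1.4.1 mcp-master:v0.10.3 \
             mcq:v0.4.2 mcdoc:v0.1.0 mcat:v1.2.0 kls:v0.2.0 sgardd:v3.2.0
  do
    podman save "$REG/$img" | ssh rift "podman load"
  done
}
```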
## Service Reference
Quick reference for all services, their images, and critical flags:
| Service | Image | Network | Key Ports | Config Path |
|---|---|---|---|---|
| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml |
| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml |
| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml |
| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml |
| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml |
| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml |
| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml |
| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml |
| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml |
| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml |
| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf |
| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) |
All images are prefixed with mcr.svc.mcp.metacircular.net:8443/.
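As a final check, the table can be compared against what podman actually reports. This is a sketch (run it as the mcp user or via the environment prefix above); the container names come from the run commands in the stages:

```sh
# Sketch: flag any expected container that podman does not list as running.
expected="mcns mc-proxy mcr-api mcr-web metacrypt-api metacrypt-web
mcp-master mcq mcdoc mcat kls sgardd"
running=$(podman ps --format '{{.Names}}' 2>/dev/null)
for c in $expected; do
  echo "$running" | grep -qx "$c" || echo "MISSING: $c"
done
```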