metacircular/docs/disaster-recovery.md
Kyle Isom ad34904b0b Add incident report and disaster recovery runbook
incident-2026-04-03-uid-change.md: Detailed post-mortem of the UID
change cascading failure that took down all services on rift. Documents
the timeline, root causes, recovery steps, and lessons learned.

disaster-recovery.md: Step-by-step runbook for bootstrapping the
platform from zero when all containers are gone. Covers the boot
order (MCNS → mc-proxy/MCR/Metacrypt → master → apps), exact podman
run commands for each service, common errors, and verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:18:45 -07:00


Disaster Recovery: Bootstrap from Zero

This document covers recovering the Metacircular platform when all services on rift are down and no containers are running. It assumes:

  • The machine boots and NixOS is functional
  • The mcp-agent systemd service starts automatically
  • Tailscale is configured and connects on boot
  • Service data directories (/srv/<service>/) are intact on disk
  • Container images are cached in podman's overlay storage

If images are NOT cached (fresh machine, disk wipe), see the "Cold Start" section at the end.

Prerequisites

Before starting recovery, verify:

# 1. Machine is up
hostname    # should print "rift"

# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)

# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)

# 4. The agent is running
systemctl status mcp-agent
# Should be active

# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 0

If Tailscale is not running: doas systemctl start tailscaled && doas tailscale up

If the agent is not running: check /srv/mcp/mcp-agent exists and /srv/mcp/mcp-agent.toml is correct, then doas systemctl restart mcp-agent.

Recovery Order

Services must be started in dependency order. Each stage must be healthy before the next starts.

Stage 1 (Foundation): MCNS → DNS works
Stage 2 (Core):       mc-proxy, MCR, Metacrypt → routing + images + certs
Stage 3 (Management): mcp-master → orchestration
Stage 4 (Services):   mcq, mcdoc, mcat, kls, sgard, exo → applications

Manual Recovery Commands

All commands run as the mcp user. Use this shell prefix:

# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'
# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift

For brevity, commands below show the podman run portion only. Prefix with the environment setup above.

Stage 1: MCNS (DNS)

MCNS must start first. Without it, no hostname resolution works.

podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp \
  -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp \
  -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml

Verify:

dig @192.168.88.181 google.com +short
# Should return an IP address
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP

Note: MCNS binds to specific IPs, not 0.0.0.0, because systemd-resolved holds port 53 on localhost. The explicit bindings avoid the conflict.
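Since every later stage depends on name resolution, it can help to block until MCNS actually answers before moving on. A small retry helper, as a sketch; the function name and the 60-second budget are arbitrary:

```shell
#!/bin/sh
# Block until MCNS answers queries, so Stage 2 doesn't race DNS startup.
wait_for() {                # wait_for <tries> <sleep-seconds> <cmd...>
  tries=$1; pause=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    tries=$((tries - 1)); sleep "$pause"
  done
  return 1
}

wait_for 30 2 dig @192.168.88.181 google.com +short \
  || { echo "MCNS not answering after 60s" >&2; exit 1; }
echo "MCNS up"
```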

Stage 2: Core Infrastructure

mc-proxy (TLS routing)

podman run -d --name mc-proxy --restart unless-stopped \
  --network host \
  -v /srv/mc-proxy:/srv/mc-proxy \
  mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
  server --config /srv/mc-proxy/mc-proxy.toml

Verify: curl -sk https://localhost:443/ 2>&1 | head -1 (should get a response, even if 404)

MCR (Container Registry)

# API server
podman run -d --name mcr-api --restart unless-stopped \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28443:8443 \
  -p 127.0.0.1:29443:9443 \
  mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
  server --config /srv/mcr/mcr.toml

# Web UI
podman run -d --name mcr-web --restart unless-stopped \
  --user 0:0 \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
  server --config /srv/mcr/mcr.toml

If MCR fails with "chmod" or "readonly database":

podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api

This recreates the database empty. Image blobs in /srv/mcr/layers/ are preserved but tag metadata is lost. Re-push images to rebuild the registry.
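Rebuilding the tag metadata can be scripted from any host that still has the images cached locally. A sketch, filtering podman's image list down to this registry's prefix:

```shell
#!/bin/sh
# After recreating mcr.db, tag metadata is gone but cached images remain.
# Re-push every locally cached image that belongs to this registry.
REG=mcr.svc.mcp.metacircular.net:8443
podman images --format '{{.Repository}}:{{.Tag}}' \
  | grep "^$REG/" \
  | while read -r image; do
      echo "re-pushing $image"
      podman push "$image"
    done
```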

Metacrypt (PKI / Secrets)

# API server
podman run -d --name metacrypt-api --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18443:8443 \
  -p 127.0.0.1:19443:9443 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
  server --config /srv/metacrypt/metacrypt.toml

# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18080:8080 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
  --config /srv/metacrypt/metacrypt.toml

If Metacrypt fails with "chmod" or "readonly database": Same fix as MCR — delete the database files. Warning: this loses all encrypted secrets, issued certs tracking, and CA state. The CA key itself is in the sealed vault (password-protected), not in SQLite.

Stage 3: MCP Master

podman run -d --name mcp-master --restart unless-stopped \
  --network host \
  -v /srv/mcp-master:/srv/mcp-master \
  mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
  server --config /srv/mcp-master/mcp-master.toml

Verify:

# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion

Stage 4: Application Services

Once the master is running, deploy applications through MCP:

mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct

Or start them manually:

# MCQ
podman run -d --name mcq --restart unless-stopped \
  -v /srv/mcq:/srv/mcq \
  -p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
  server --config /srv/mcq/mcq.toml

# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
  -v /srv/mcdoc:/srv/mcdoc \
  -p 127.0.0.1:38080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
  server --config /srv/mcdoc/mcdoc.toml

# MCAT
podman run -d --name mcat --restart unless-stopped \
  -v /srv/mcat:/srv/mcat \
  -p 127.0.0.1:48116:8443 \
  mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
  server --config /srv/mcat/mcat.toml

# KLS
podman run -d --name kls --restart unless-stopped \
  -v /srv/kls:/srv/kls \
  -p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
  mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
  -f /srv/kls/kls.conf

# Sgard
podman run -d --name sgardd --restart unless-stopped \
  -v /srv/sgard:/srv/sgard \
  -p 127.0.0.1:19473:9473 \
  mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
  --repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
  --tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key

Verification Checklist

After all services are running:

# Fleet status
mcp ps
# All services should show "running"

# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short

# MCIAS (runs on svc, should be unaffected by rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health

# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/

# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health

# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/

Common Errors

"chmod: operation not permitted"

modernc.org/sqlite calls fchmod() on database files. This is denied inside rootless podman user namespaces. Fix:

# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>

The fchmod error will still appear in logs as a warning but is non-fatal for newly created databases.

"address already in use" on port 53

systemd-resolved holds port 53 on localhost. MCNS must bind to specific IPs, not 0.0.0.0:53. Use explicit port bindings: -p 192.168.88.181:53:53 -p 100.95.252.120:53:53

"connection refused" to MCR

MCR is down. Images are cached locally — you can start services that use cached images without MCR. MCR itself starts from its cached image.

Agent shows "error" for all nodes

Check:

  1. Tailscale is running on both the CLI machine and the target node
  2. The agent is listening: ss -tlnp | grep 9444
  3. The CLI config has the correct addresses
  4. TLS certs have the right SANs for the Tailnet IP

"podman: executable file not found"

This warning appears for svc (which uses Docker, not podman). It's benign — svc is an edge node that doesn't run containers.

Cold Start (No Cached Images)

If the machine was wiped and no images are cached:

  1. MCIAS runs on svc (Docker container), not rift. It should be unaffected by a rift failure. Verify: ssh svc.metacircular.net "docker ps | grep mcias".

  2. Pre-stage images by pulling from a backup or building locally:

    # On vade (operator workstation), build the image locally, then ship it to rift
    cd ~/src/metacircular/mcns && make docker
    podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
      ssh rift "podman load"

    Repeat for each service.

  3. Alternatively, if another node has MCR access, push images there first, then pull from the running MCR instance.

Service Reference

Quick reference for all services, their images, and critical flags:

Service          Image                  Network  Key Ports                Config Path
mcns             mcns:v1.2.0            bridge   53/tcp+udp, 38443→8443   /srv/mcns/mcns.toml
mc-proxy         mc-proxy:v1.2.2        host     443, 8443, 9443          /srv/mc-proxy/mc-proxy.toml
mcr (api)        mcr:v1.2.1             bridge   28443→8443, 29443→9443   /srv/mcr/mcr.toml
mcr (web)        mcr-web:v1.3.2         bridge   28080→8080               /srv/mcr/mcr.toml
metacrypt (api)  metacrypt:v1.3.1       bridge   18443→8443, 19443→9443   /srv/metacrypt/metacrypt.toml
metacrypt (web)  metacrypt-web:v1.4.1   bridge   18080→8080               /srv/metacrypt/metacrypt.toml
mcp-master       mcp-master:v0.10.3     host     9555                     /srv/mcp-master/mcp-master.toml
mcq              mcq:v0.4.2             bridge   48080→8080               /srv/mcq/mcq.toml
mcdoc            mcdoc:v0.1.0           bridge   38080→8080               /srv/mcdoc/mcdoc.toml
mcat             mcat:v1.2.0            bridge   48116→8443               /srv/mcat/mcat.toml
kls              kls:v0.2.0             bridge   58080→8080               /srv/kls/kls.conf
sgard            sgardd:v3.2.0          bridge   19473→9473               (flags, see above)

All images are prefixed with mcr.svc.mcp.metacircular.net:8443/.