MCP Runbook
Operational procedures for the Metacircular Control Plane. Written for operators at 3 AM.
Service Overview
MCP manages container lifecycle on Metacircular nodes. Two components:
- mcp-agent — systemd service on each node (rift). Manages containers via rootless podman, stores registry in SQLite, monitors for drift.
- mcp — CLI on the operator's workstation (vade). Pushes desired state, queries status.
Health Checks
Quick status
mcp status
Shows all services, desired vs observed state, drift, and recent events. No drift = healthy.
Agent process
ssh rift "doas systemctl status mcp-agent"
ssh rift "doas journalctl -u mcp-agent --since '10 min ago' --no-pager"
Individual service
mcp status metacrypt
Common Operations
Check what's running
mcp ps # live check with uptime
mcp list # from registry (no runtime query)
mcp status # full picture with drift and events
Restart a service
mcp restart metacrypt
Restarts all components. Does not change the active flag. Metacrypt
will need to be unsealed after restart.
Stop a service
mcp stop metacrypt
Sets active = false in the service definition file and stops all
containers. The agent will not restart them.
Start a stopped service
mcp start metacrypt
Sets active = true and starts all containers.
Deploy an update
Edit the service definition to update the image tag, then deploy:
mcp service edit metacrypt # opens in $EDITOR
mcp deploy metacrypt # deploys all components
mcp deploy metacrypt/web # deploy just the web component
Push a config file to a node
mcp push metacrypt.toml metacrypt # → /srv/metacrypt/metacrypt.toml
mcp push cert.pem metacrypt certs/cert.pem # → /srv/metacrypt/certs/cert.pem
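The two examples above suggest how the destination is formed: files land under /srv/<service>/, using the optional third argument as the relative path and otherwise the file's own name. The sketch below only restates that inferred rule in shell. push_dest is a hypothetical helper, not part of the mcp CLI, and the fallback to the file's basename is an assumption drawn from the first example.

```shell
# Hypothetical helper: predicts where `mcp push <file> <service> [path]`
# would place a file, based only on the two examples in this runbook.
push_dest() (
    file=$1
    service=$2
    # Assumption: with no explicit path, the file keeps its own name.
    path=${3:-$(basename "$file")}
    printf '/srv/%s/%s\n' "$service" "$path"
)
```

Calling push_dest cert.pem metacrypt certs/cert.pem prints the same destination as the comment on the second example above.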
Pull a file from a node
mcp pull metacrypt metacrypt.toml ./local-copy.toml
Sync desired state
Push all service definitions to the agent without deploying:
mcp sync
View service definition
mcp service show metacrypt # from agent registry
cat ~/.config/mcp/services/metacrypt.toml # local file
Export service definition from agent
mcp service export metacrypt
Writes the agent's current spec to the local service definition file.
Unsealing Metacrypt
Metacrypt starts sealed after any restart. Unseal via the API:
curl -sk -X POST https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
  -H "Content-Type: application/json" \
  -d '{"password":"<unseal-password>"}'
Or via the web UI at https://metacrypt.svc.mcp.metacircular.net.
Important: Restarting metacrypt-api requires unsealing. To avoid this when updating just the UI, deploy only the web component:
mcp deploy metacrypt/web
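Typing the unseal password directly into the curl command leaves it in shell history, and the request breaks if the password contains a double quote or backslash. Here is a minimal sketch of one way around both, assuming a POSIX shell with sed and an interactive terminal. json_escape, unseal_body, and unseal_metacrypt are hypothetical helper names. Only quotes and backslashes are escaped; control characters are not handled.

```shell
# Escape backslashes and double quotes so the value is safe inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body expected by the /v1/unseal call shown above.
unseal_body() {
    printf '{"password":"%s"}' "$(json_escape "$1")"
}

# Prompt without echo, then send the body on stdin (curl reads it via -d @-).
unseal_metacrypt() (
    # Restore terminal echo even if the prompt is interrupted.
    trap 'stty echo' EXIT
    printf 'Unseal password: ' >&2
    stty -echo
    IFS= read -r pw
    stty echo
    printf '\n' >&2
    unseal_body "$pw" | curl -sk -X POST \
        https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
        -H "Content-Type: application/json" \
        -d @-
)
```

After sourcing these functions, running unseal_metacrypt prompts once and sends the same request as the curl command above.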
Agent Management
Restart the agent
ssh rift "doas systemctl restart mcp-agent"
Containers keep running — the agent is stateless w.r.t. container lifecycle. Podman's restart policy keeps containers up.
View agent logs
ssh rift "doas journalctl -u mcp-agent -f" # follow
ssh rift "doas journalctl -u mcp-agent --since today" # today's logs
Agent database backup
ssh rift "doas -u mcp /usr/local/bin/mcp-agent snapshot --config /srv/mcp/mcp-agent.toml"
Backups go to /srv/mcp/backups/.
Update the agent binary
# On vade, in the mcp repo:
make clean && make mcp-agent
scp mcp-agent rift:/tmp/
ssh rift "doas systemctl stop mcp-agent && \
  doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent && \
  doas systemctl start mcp-agent"
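A successful systemctl start does not by itself show that the new binary stays up, so it can help to poll the unit for a short while after the swap. The sketch below does that with systemctl is-active --quiet. wait_for_agent is a hypothetical helper, and the defaults of 10 attempts, 2 seconds apart, are arbitrary.

```shell
# Poll the mcp-agent unit on a host until it reports active.
# Usage: wait_for_agent <host> [tries] [delay-seconds]
# Returns 0 as soon as the unit is active, 1 if it never becomes active.
wait_for_agent() (
    host=$1
    tries=${2:-10}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ssh "$host" "doas systemctl is-active --quiet mcp-agent"; then
            exit 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    exit 1
)
```

For example, wait_for_agent rift || ssh rift "doas journalctl -u mcp-agent -n 50" prints recent logs only when the agent failed to come back.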
Update the CLI binary
make clean && make mcp
cp mcp ~/.local/bin/
Node Management
List nodes
mcp node list
Add a node
mcp node add <name> <address:port>
Remove a node
mcp node remove <name>
TLS Certificate Renewal
The agent's TLS cert is at /srv/mcp/certs/cert.pem. Check expiry:
ssh rift "openssl x509 -in /srv/mcp/certs/cert.pem -noout -enddate"
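Reading the end date by eye is easy to get wrong at 3 AM. openssl can make the comparison itself with -checkend, which exits non-zero if the certificate expires within the given number of seconds. This is a small sketch around that flag. cert_valid_for_days is a hypothetical helper name, and the cert path is the one given above.

```shell
# Returns 0 if the agent cert on <host> is still valid <days> from now,
# non-zero if it will have expired by then (or the check could not run).
# Usage: cert_valid_for_days <host> <days>
cert_valid_for_days() (
    host=$1
    days=$2
    seconds=$((days * 86400))
    ssh "$host" "openssl x509 -in /srv/mcp/certs/cert.pem -noout -checkend $seconds" >/dev/null
)
```

For example, cert_valid_for_days rift 30 || echo "renew soon" flags a cert with less than 30 days left.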
To renew (requires a Metacrypt token):
export METACRYPT_TOKEN="<token>"
ssh rift "curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer $METACRYPT_TOKEN' \
  -d '{
    \"mount\": \"pki\",
    \"operation\": \"issue\",
    \"path\": \"web\",
    \"data\": {
      \"issuer\": \"web\",
      \"common_name\": \"mcp-agent.svc.mcp.metacircular.net\",
      \"profile\": \"server\",
      \"dns_names\": [\"mcp-agent.svc.mcp.metacircular.net\"],
      \"ip_addresses\": [\"100.95.252.120\", \"192.168.88.181\"],
      \"ttl\": \"2160h\"
    }
  }'" > /tmp/cert-response.json
# Extract and install cert+key from the JSON response, then:
ssh rift "doas systemctl restart mcp-agent"
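Before the restart above, it may be worth confirming that the new certificate and key belong together, since a mismatched pair typically prevents a TLS server from starting. The sketch below compares the public key embedded in the cert with the one derived from the private key. cert_matches_key is a hypothetical helper, and both file paths are passed in because this runbook does not name the key file.

```shell
# Returns 0 if the certificate and private key share the same public key,
# 1 if they differ, 2 if either file could not be read by openssl.
# Usage: cert_matches_key <cert.pem> <key.pem>
cert_matches_key() (
    cert=$1
    key=$2
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || exit 2
    key_pub=$(openssl pkey -in "$key" -pubout) || exit 2
    [ "$cert_pub" = "$key_pub" ]
)
```

Run it on the extracted files before copying them into /srv/mcp/certs/, and only restart the agent when it returns 0.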
Incident Procedures
Service not running (drift detected)
- mcp status: identify which service/component drifted.
- Check agent logs: ssh rift "doas journalctl -u mcp-agent --since '10 min ago'"
- Check container logs: ssh rift "doas -u mcp podman logs <container-name>"
- Restart: mcp restart <service>
- If metacrypt: unseal after restart.
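Restarting a container can make the evidence harder to find, so one option is to save both sets of logs from the steps above before restarting. This sketch bundles the two log commands into one file. collect_drift_logs is a hypothetical helper, and the 200-line tail is an arbitrary choice.

```shell
# Save recent agent logs and container logs from <host> into <outfile>.
# Usage: collect_drift_logs <host> <container-name> <outfile>
collect_drift_logs() (
    host=$1
    container=$2
    out=$3
    {
        echo "== agent logs (last 10 min) =="
        ssh "$host" "doas journalctl -u mcp-agent --since '10 min ago' --no-pager"
        echo "== container logs: $container =="
        # podman logs relays the container's stderr on stderr, so merge it in.
        ssh "$host" "doas -u mcp podman logs --tail 200 '$container'" 2>&1
    } > "$out"
)
```

For example, collect_drift_logs rift <container-name> /tmp/drift.log, followed by mcp restart <service>.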
Agent unreachable
- Check if the agent process is running: ssh rift "doas systemctl status mcp-agent"
- If stopped: ssh rift "doas systemctl start mcp-agent"
- Check logs for crash reason: ssh rift "doas journalctl -u mcp-agent -n 50"
- Containers keep running independently; podman's restart policy handles them.
Token expired
MCP CLI shows UNAUTHENTICATED or PERMISSION_DENIED:
- Check token: the mcp-agent service account token is at ~/.config/mcp/token
- Validate: curl -sk -X POST -H "Authorization: Bearer $(cat ~/.config/mcp/token)" https://mcias.metacircular.net:8443/v1/token/validate
- If expired: generate a new service account token from MCIAS admin dashboard.
Database corruption
The agent's SQLite database is at /srv/mcp/mcp.db:
- Stop the agent: ssh rift "doas systemctl stop mcp-agent"
- Restore from backup: ssh rift "doas -u mcp cp /srv/mcp/backups/<latest>.db /srv/mcp/mcp.db"
- Start the agent: ssh rift "doas systemctl start mcp-agent"
- Run mcp sync to re-push desired state.
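The <latest> placeholder in the restore step has to be resolved by hand. Below is a minimal sketch of picking the newest snapshot by modification time, assuming snapshots end in .db as in that step and that file names contain no newlines. latest_backup is a hypothetical helper meant to be run on rift itself.

```shell
# Print the most recently modified .db file in a backup directory.
# Prints nothing if the directory holds no .db files.
# Usage: latest_backup [dir]   (defaults to /srv/mcp/backups)
latest_backup() (
    dir=${1:-/srv/mcp/backups}
    # ls -t sorts newest first; head keeps only the first entry.
    ls -t "$dir"/*.db 2>/dev/null | head -n 1
)
```

From vade, the same pipeline can be sent as a remote command: ssh rift "ls -t /srv/mcp/backups/*.db | head -n 1".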
If no backup exists, delete the database and re-bootstrap:
- ssh rift "doas -u mcp rm /srv/mcp/mcp.db"
- ssh rift "doas systemctl start mcp-agent" (creates fresh database)
- mcp sync (pushes all service definitions)
Disaster recovery (rift lost)
- Provision new machine, connect to overlay network.
- Apply NixOS config (creates mcp user, installs agent).
- Install mcp-agent binary.
- Restore /srv/ from backups (each service's backup timer creates daily snapshots).
- Provision TLS cert from Metacrypt.
- Start agent: doas systemctl start mcp-agent
- mcp sync from vade to push service definitions.
- Unseal Metacrypt.
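After restoring /srv/ and before starting the agent, a quick existence check on the agent's own paths can catch an incomplete restore. The sketch below checks the config file, cert directory, and backup directory listed under File Locations. missing_agent_paths is a hypothetical helper. The database is deliberately left out because the agent creates a fresh one when it is absent.

```shell
# Print each expected agent path that does not exist; print nothing if all exist.
# Usage: missing_agent_paths [root-prefix]   (prefix is empty on a real node)
missing_agent_paths() (
    root=${1:-}
    for p in /srv/mcp/mcp-agent.toml /srv/mcp/certs /srv/mcp/backups; do
        [ -e "$root$p" ] || echo "$p"
    done
)
```

On the new machine, empty output from missing_agent_paths means all three paths are present.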
File Locations
On rift (agent)
| Path | Purpose |
|---|---|
| /srv/mcp/mcp-agent.toml | Agent config |
| /srv/mcp/mcp.db | Registry database |
| /srv/mcp/certs/ | Agent TLS cert and key |
| /srv/mcp/backups/ | Database snapshots |
| /srv/<service>/ | Service data directories |
On vade (CLI)
| Path | Purpose |
|---|---|
| ~/.config/mcp/mcp.toml | CLI config |
| ~/.config/mcp/token | MCIAS bearer token |
| ~/.config/mcp/services/ | Service definition files |