mcp/RUNBOOK.md
Kyle Isom ea8a42a696 P5.2 + P5.3: Bootstrap docs, README, and RUNBOOK
- docs/bootstrap.md: step-by-step bootstrap procedure with lessons
  learned from the first deployment (NixOS sandbox issues, podman
  rootless setup, container naming, MCR auth workaround)
- README.md: quick-start guide, command reference, doc links
- RUNBOOK.md: operational procedures for operators (health checks,
  common operations, unsealing metacrypt, cert renewal, incident
  response, disaster recovery, file locations)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:32:22 -07:00

MCP Runbook

Operational procedures for the Metacircular Control Plane. Written for operators at 3 AM.

Service Overview

MCP manages container lifecycle on Metacircular nodes. Two components:

  • mcp-agent — systemd service on each node (rift). Manages containers via rootless podman, stores registry in SQLite, monitors for drift.
  • mcp — CLI on the operator's workstation (vade). Pushes desired state, queries status.

Health Checks

Quick status

mcp status

Shows all services, desired vs observed state, drift, and recent events. No drift = healthy.

Agent process

ssh rift "doas systemctl status mcp-agent"
ssh rift "doas journalctl -u mcp-agent --since '10 min ago' --no-pager"

Individual service

mcp status metacrypt
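
The checks above can be run as one pass. This is a minimal sketch, not part of mcp: `mcp_health` is a hypothetical helper name, and it assumes `mcp status` exits non-zero when it cannot reach the agent, which this runbook does not confirm. `systemctl status` does exit non-zero for an inactive unit.

```shell
# mcp_health: run the runbook's health checks in order and report failures.
# Assumption: `mcp status` exits non-zero when the agent is unreachable.
mcp_health() {
  rc=0
  echo "== mcp status =="
  mcp status || { echo "mcp status failed" >&2; rc=1; }
  echo "== mcp-agent unit on rift =="
  ssh rift "doas systemctl status mcp-agent --no-pager" ||
    { echo "mcp-agent unit is not active" >&2; rc=1; }
  echo "== recent agent logs =="
  ssh rift "doas journalctl -u mcp-agent --since '10 min ago' --no-pager" || rc=1
  return "$rc"
}
```

Calling `mcp_health` prints all three outputs and returns non-zero if any step failed. Drift still has to be read from the `mcp status` output itself.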

Common Operations

Check what's running

mcp ps          # live check with uptime
mcp list        # from registry (no runtime query)
mcp status      # full picture with drift and events

Restart a service

mcp restart metacrypt

Restarts all components. Does not change the active flag. Metacrypt will need to be unsealed after restart.

Stop a service

mcp stop metacrypt

Sets active = false in the service definition file and stops all containers. The agent will not restart them.

Start a stopped service

mcp start metacrypt

Sets active = true and starts all containers.

Deploy an update

Edit the service definition to update the image tag, then deploy:

mcp service edit metacrypt    # opens in $EDITOR
mcp deploy metacrypt          # deploys all components
mcp deploy metacrypt/web      # deploy just the web component

Push a config file to a node

mcp push metacrypt.toml metacrypt               # → /srv/metacrypt/metacrypt.toml
mcp push cert.pem metacrypt certs/cert.pem      # → /srv/metacrypt/certs/cert.pem
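
The two examples above suggest that the destination is /srv/<service>/ plus the optional remote path, falling back to the local file's base name. The sketch below only models that inferred rule. `push_dest` is a hypothetical helper, not an mcp command, and the behaviour for local paths with directories is an assumption.

```shell
# push_dest LOCAL_FILE SERVICE [REMOTE_PATH]
# Prints where `mcp push` would place the file, per the rule inferred
# from the two examples (an assumption, not confirmed mcp behaviour).
push_dest() {
  file=$1 service=$2 remote=${3:-}
  # With no explicit remote path, fall back to the local file's base name.
  [ -n "$remote" ] || remote=$(basename "$file")
  printf '/srv/%s/%s\n' "$service" "$remote"
}
```

For example, `push_dest cert.pem metacrypt certs/cert.pem` prints the second destination shown above.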

Pull a file from a node

mcp pull metacrypt metacrypt.toml ./local-copy.toml

Sync desired state

Push all service definitions to the agent without deploying:

mcp sync

View service definition

mcp service show metacrypt    # from agent registry
cat ~/.config/mcp/services/metacrypt.toml  # local file

Export service definition from agent

mcp service export metacrypt

Writes the agent's current spec to the local service definition file.

Unsealing Metacrypt

Metacrypt starts sealed after any restart. Unseal via the API:

curl -sk -X POST https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
  -H "Content-Type: application/json" \
  -d '{"password":"<unseal-password>"}'

Or via the web UI at https://metacrypt.svc.mcp.metacircular.net.

Important: Restarting metacrypt-api requires unsealing. To avoid this when updating just the UI, deploy only the web component:

mcp deploy metacrypt/web
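
Because every restart leaves Metacrypt sealed, the unseal call is worth having as a helper. The sketch below uses the endpoint shown above and hypothetical function names. It reads the password without echo and sends it on curl's stdin, so it stays out of shell history and the process list. The JSON escaping covers only backslashes and double quotes.

```shell
# json_escape STRING: escape backslashes and double quotes for a JSON string.
# Limitation: control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# unseal_metacrypt: prompt for the unseal password and POST it to the API.
unseal_metacrypt() {
  printf 'Unseal password: ' >&2
  if [ -t 0 ]; then stty -echo; fi            # hide typing on a terminal
  IFS= read -r password
  if [ -t 0 ]; then stty echo; echo >&2; fi
  printf '{"password":"%s"}' "$(json_escape "$password")" |
    curl -sk -X POST https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
      -H "Content-Type: application/json" --data-binary @-
  rc=$?
  unset password
  return "$rc"
}
```

A typical use is `mcp restart metacrypt && unseal_metacrypt`. If the API is not yet listening after the restart, curl fails and the helper can simply be run again.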

Agent Management

Restart the agent

ssh rift "doas systemctl restart mcp-agent"

Containers keep running across an agent restart: the agent holds no container lifecycle state, and podman's restart policy keeps the containers up.

View agent logs

ssh rift "doas journalctl -u mcp-agent -f"            # follow
ssh rift "doas journalctl -u mcp-agent --since today"  # today's logs

Agent database backup

ssh rift "doas -u mcp /usr/local/bin/mcp-agent snapshot --config /srv/mcp/mcp-agent.toml"

Backups go to /srv/mcp/backups/.

Update the agent binary

# On vade, in the mcp repo:
make clean && make mcp-agent
scp mcp-agent rift:/tmp/
ssh rift "doas systemctl stop mcp-agent && \
  doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent && \
  doas systemctl start mcp-agent"
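
The copy step above overwrites the running binary with no way back. A variant that keeps the previous binary might look like the sketch below. `update_agent` and the /usr/local/bin/mcp-agent.prev rollback path are hypothetical and not part of the documented procedure.

```shell
# update_agent [BINARY]: install a new mcp-agent on rift, keeping the old one.
# Assumption: /usr/local/bin/mcp-agent.prev is a hypothetical rollback path.
update_agent() {
  bin=${1:-mcp-agent}
  [ -x "$bin" ] || { echo "no executable at $bin; run make mcp-agent first" >&2; return 1; }
  scp "$bin" rift:/tmp/mcp-agent || return 1
  # Stop, save the old binary, install the new one, start, then confirm.
  ssh rift "doas systemctl stop mcp-agent && \
    doas cp /usr/local/bin/mcp-agent /usr/local/bin/mcp-agent.prev && \
    doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent && \
    doas systemctl start mcp-agent && \
    doas systemctl is-active mcp-agent"
}
```

If the new agent fails to start, copying mcp-agent.prev back over mcp-agent and starting the unit restores the previous version.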

Update the CLI binary

make clean && make mcp
cp mcp ~/.local/bin/

Node Management

List nodes

mcp node list

Add a node

mcp node add <name> <address:port>

Remove a node

mcp node remove <name>

TLS Certificate Renewal

The agent's TLS cert is at /srv/mcp/certs/cert.pem. Check expiry:

ssh rift "openssl x509 -in /srv/mcp/certs/cert.pem -noout -enddate"
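
To turn the expiry date into a yes/no answer, `openssl x509 -checkend N` exits 0 when the certificate will not expire within the next N seconds. The sketch below wraps it. `agent_cert_ok` is a hypothetical helper name and the 30-day default is an arbitrary choice.

```shell
# agent_cert_ok [DAYS]: succeed if the agent cert is valid for at least
# DAYS more days (default 30, an arbitrary threshold).
agent_cert_ok() {
  days=${1:-30}
  seconds=$((days * 86400))
  if ssh rift "openssl x509 -in /srv/mcp/certs/cert.pem -noout -checkend $seconds" >/dev/null; then
    echo "agent cert valid for more than $days days"
  else
    echo "agent cert expires within $days days (or could not be read): renew" >&2
    return 1
  fi
}
```

`agent_cert_ok 14` returns non-zero when the certificate has two weeks or less remaining, which makes it usable from a timer or cron job.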

To renew (requires a Metacrypt token):

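# $METACRYPT_TOKEN is expanded by the local shell (the ssh command is in
# double quotes), and the redirect writes the response to /tmp locally.
# The response contains the private key; remove the file once installed.
export METACRYPT_TOKEN="<token>"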
ssh rift "curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer $METACRYPT_TOKEN' \
  -d '{
    \"mount\": \"pki\",
    \"operation\": \"issue\",
    \"path\": \"web\",
    \"data\": {
      \"issuer\": \"web\",
      \"common_name\": \"mcp-agent.svc.mcp.metacircular.net\",
      \"profile\": \"server\",
      \"dns_names\": [\"mcp-agent.svc.mcp.metacircular.net\"],
      \"ip_addresses\": [\"100.95.252.120\", \"192.168.88.181\"],
      \"ttl\": \"2160h\"
    }
  }'" > /tmp/cert-response.json

# Extract and install cert+key from the JSON response, then:
ssh rift "doas systemctl restart mcp-agent"

Incident Procedures

Service not running (drift detected)

  1. mcp status — identify which service/component drifted.
  2. Check agent logs: ssh rift "doas journalctl -u mcp-agent --since '10 min ago'"
  3. Check container logs: ssh rift "doas -u mcp podman logs <container-name>"
  4. Restart: mcp restart <service>
  5. If metacrypt: unseal after restart.

Agent unreachable

  1. Check if the agent process is running: ssh rift "doas systemctl status mcp-agent"
  2. If stopped: ssh rift "doas systemctl start mcp-agent"
  3. Check logs for crash reason: ssh rift "doas journalctl -u mcp-agent -n 50"
  4. Containers keep running independently — podman's restart policy handles them.

Token expired

If the MCP CLI reports UNAUTHENTICATED or PERMISSION_DENIED:

  1. Check token: the mcp-agent service account token is at ~/.config/mcp/token
  2. Validate: curl -sk -X POST -H "Authorization: Bearer $(cat ~/.config/mcp/token)" https://mcias.metacircular.net:8443/v1/token/validate
  3. If expired: generate a new service account token from MCIAS admin dashboard.
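
The validation step can be reduced to an exit code. The sketch below assumes the validate endpoint answers HTTP 200 for a good token and something else otherwise. That behaviour is not confirmed here, so check the response body by hand the first time. `token_valid` is a hypothetical helper name.

```shell
# token_valid [TOKEN_FILE]: succeed if MCIAS accepts the stored token.
# Assumption: the validate endpoint returns HTTP 200 only for a valid token.
token_valid() {
  token_file=${1:-$HOME/.config/mcp/token}
  [ -r "$token_file" ] || { echo "cannot read $token_file" >&2; return 2; }
  code=$(curl -sk -o /dev/null -w '%{http_code}' -X POST \
    -H "Authorization: Bearer $(cat "$token_file")" \
    https://mcias.metacircular.net:8443/v1/token/validate)
  [ "$code" = 200 ] || { echo "token rejected (HTTP $code)" >&2; return 1; }
}
```

A return value of 1 means the token was rejected, and 2 means the token file is missing or unreadable.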

Database corruption

The agent's SQLite database is at /srv/mcp/mcp.db:

  1. Stop the agent: ssh rift "doas systemctl stop mcp-agent"
  2. Restore from backup: ssh rift "doas -u mcp cp /srv/mcp/backups/<latest>.db /srv/mcp/mcp.db"
  3. Start the agent: ssh rift "doas systemctl start mcp-agent"
  4. Run mcp sync to re-push desired state.

If no backup exists, stop the agent as in step 1 above, then delete the database and re-bootstrap:

  1. ssh rift "doas -u mcp rm /srv/mcp/mcp.db"
  2. ssh rift "doas systemctl start mcp-agent" (creates fresh database)
  3. mcp sync (pushes all service definitions)
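
Before restoring, it can help to confirm that the database really is damaged and to find the newest snapshot. The sketch below assumes the sqlite3 CLI is installed on rift and that snapshots are *.db files directly under /srv/mcp/backups/, picking the newest by modification time. Neither is confirmed by this runbook, and both helper names are hypothetical.

```shell
# db_intact: succeed if SQLite reports the registry database as intact.
# Run with the agent stopped. Assumption: sqlite3 is available on rift.
db_intact() {
  result=$(ssh rift "doas -u mcp sqlite3 /srv/mcp/mcp.db 'PRAGMA integrity_check;'")
  [ "$result" = ok ]
}

# latest_backup: print the most recently modified snapshot path on rift.
# Assumption: snapshots are *.db files directly under /srv/mcp/backups/.
latest_backup() {
  ssh rift "doas -u mcp sh -c 'ls -t /srv/mcp/backups/*.db | head -n 1'"
}
```

If `db_intact` fails, the path printed by `latest_backup` is the candidate for the restore step above. It is worth checking that snapshot's timestamp before copying it over the live database.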

Disaster recovery (rift lost)

  1. Provision new machine, connect to overlay network.
  2. Apply NixOS config (creates mcp user, installs agent).
  3. Install mcp-agent binary.
  4. Restore /srv/ from backups (each service's backup timer creates daily snapshots).
  5. Provision TLS cert from Metacrypt.
  6. Start agent: doas systemctl start mcp-agent
  7. mcp sync from vade to push service definitions.
  8. Unseal Metacrypt.

File Locations

On rift (agent)

Path                       Purpose
/srv/mcp/mcp-agent.toml    Agent config
/srv/mcp/mcp.db            Registry database
/srv/mcp/certs/            Agent TLS cert and key
/srv/mcp/backups/          Database snapshots
/srv/<service>/            Service data directories

On vade (CLI)

Path                       Purpose
~/.config/mcp/mcp.toml     CLI config
~/.config/mcp/token        MCIAS bearer token
~/.config/mcp/services/    Service definition files