# MCP Runbook

Operational procedures for the Metacircular Control Plane. Written for operators at 3 AM.

## Service Overview

MCP manages container lifecycle on Metacircular nodes. Two components:

- **mcp-agent**: systemd service on each node (rift). Manages containers via rootless podman, stores registry in SQLite, monitors for drift.
- **mcp**: CLI on the operator's workstation (vade). Pushes desired state, queries status.

## Health Checks

### Quick status

```bash
mcp status
```

Shows all services, desired vs observed state, drift, and recent events. No drift = healthy.

### Agent process

```bash
ssh rift "doas systemctl status mcp-agent"
ssh rift "doas journalctl -u mcp-agent --since '10 min ago' --no-pager"
```

### Individual service

```bash
mcp status metacrypt
```

## Common Operations

### Check what's running

```bash
mcp ps       # live check with uptime
mcp list     # from registry (no runtime query)
mcp status   # full picture with drift and events
```

### Restart a service

```bash
mcp restart metacrypt
```

Restarts all components. Does not change the `active` flag. Metacrypt will need to be unsealed after restart.

### Stop a service

```bash
mcp stop metacrypt
```

Sets `active = false` in the service definition file and stops all containers. The agent will not restart them.

### Start a stopped service

```bash
mcp start metacrypt
```

Sets `active = true` and starts all containers.

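
The restart procedure above has a follow-up step that is easy to forget at 3 AM: Metacrypt comes back sealed. A minimal sketch of a wrapper that prints a reminder is shown here. `restart_service` is a hypothetical helper name, not part of the mcp CLI; it only calls `mcp restart`, as documented in this runbook, and assumes that command exits non-zero on failure.

```bash
#!/usr/bin/env bash
# Sketch: restart a service through the mcp CLI and print an unseal
# reminder for metacrypt. restart_service is a hypothetical helper.

restart_service() {
    local service="$1"
    if [ -z "$service" ]; then
        echo "usage: restart_service <service>" >&2
        return 2
    fi

    # `mcp restart` is the command documented in this runbook.
    mcp restart "$service" || return 1

    # Metacrypt starts sealed after any restart, so flag the follow-up step.
    # Only the exact service name is matched here.
    if [ "$service" = "metacrypt" ]; then
        echo "reminder: metacrypt is sealed after restart; unseal it before use"
    fi
}
```

Source the file and call `restart_service metacrypt`. For any other service the wrapper behaves like a plain `mcp restart`.
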
### Deploy an update

Edit the service definition to update the image tag, then deploy:

```bash
mcp service edit metacrypt    # opens in $EDITOR
mcp deploy metacrypt          # deploys all components
mcp deploy metacrypt/web      # deploy just the web component
```

### Push a config file to a node

```bash
mcp push metacrypt.toml metacrypt             # → /srv/metacrypt/metacrypt.toml
mcp push cert.pem metacrypt certs/cert.pem    # → /srv/metacrypt/certs/cert.pem
```

### Pull a file from a node

```bash
mcp pull metacrypt metacrypt.toml ./local-copy.toml
```

### Sync desired state

Push all service definitions to the agent without deploying:

```bash
mcp sync
```

### View service definition

```bash
mcp service show metacrypt                  # from agent registry
cat ~/.config/mcp/services/metacrypt.toml   # local file
```

### Export service definition from agent

```bash
mcp service export metacrypt
```

Writes the agent's current spec to the local service definition file.

## Unsealing Metacrypt

Metacrypt starts sealed after any restart. Unseal via the API:

```bash
curl -sk -X POST https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
  -H "Content-Type: application/json" \
  -d '{"password":""}'
```

Or via the web UI at `https://metacrypt.svc.mcp.metacircular.net`.

**Important:** Restarting metacrypt-api requires unsealing. To avoid this when updating just the UI, deploy only the web component:

```bash
mcp deploy metacrypt/web
```

## Agent Management

### Restart the agent

```bash
ssh rift "doas systemctl restart mcp-agent"
```

Containers keep running; the agent is stateless w.r.t. container lifecycle. Podman's restart policy keeps containers up.

### View agent logs

```bash
ssh rift "doas journalctl -u mcp-agent -f"              # follow
ssh rift "doas journalctl -u mcp-agent --since today"   # today's logs
```

### Agent database backup

```bash
ssh rift "doas -u mcp /usr/local/bin/mcp-agent snapshot --config /srv/mcp/mcp-agent.toml"
```

Backups go to `/srv/mcp/backups/`.

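
After taking a snapshot it can help to confirm that a new file actually landed in the backups directory. The sketch below picks the most recently modified `*.db` file in a directory. `latest_snapshot` is a hypothetical helper, and the assumption that snapshot files end in `.db` is inferred from the restore path used elsewhere in this runbook, not from the agent's documentation.

```bash
#!/usr/bin/env bash
# Sketch: print the most recently modified *.db file in a directory.
# latest_snapshot is a hypothetical helper; the .db suffix is an assumption.

latest_snapshot() {
    local dir="$1" newest="" f
    for f in "$dir"/*.db; do
        # Skip the unexpanded glob when the directory has no .db files.
        [ -e "$f" ] || continue
        # -nt compares modification times.
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    # Fail if nothing matched so callers can detect a missing backup.
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}
```

Run on rift as a user that can read the directory, for example `latest_snapshot /srv/mcp/backups`. A non-zero exit status means no snapshot was found.
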
### Update the agent binary

```bash
# On vade, in the mcp repo:
make clean && make mcp-agent
scp mcp-agent rift:/tmp/
ssh rift "doas systemctl stop mcp-agent && \
  doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent && \
  doas systemctl start mcp-agent"
```

### Update the CLI binary

```bash
make clean && make mcp
cp mcp ~/.local/bin/
```

## Node Management

### List nodes

```bash
mcp node list
```

### Add a node

```bash
mcp node add
```

### Remove a node

```bash
mcp node remove
```

## TLS Certificate Renewal

The agent's TLS cert is at `/srv/mcp/certs/cert.pem`. Check expiry:

```bash
ssh rift "openssl x509 -in /srv/mcp/certs/cert.pem -noout -enddate"
```

To renew (requires a Metacrypt token):

```bash
export METACRYPT_TOKEN=""
ssh rift "curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer $METACRYPT_TOKEN' \
  -d '{
    \"mount\": \"pki\",
    \"operation\": \"issue\",
    \"path\": \"web\",
    \"data\": {
      \"issuer\": \"web\",
      \"common_name\": \"mcp-agent.svc.mcp.metacircular.net\",
      \"profile\": \"server\",
      \"dns_names\": [\"mcp-agent.svc.mcp.metacircular.net\"],
      \"ip_addresses\": [\"100.95.252.120\", \"192.168.88.181\"],
      \"ttl\": \"2160h\"
    }
  }'" > /tmp/cert-response.json

# Extract and install cert+key from the JSON response, then:
ssh rift "doas systemctl restart mcp-agent"
```

## Incident Procedures

### Service not running (drift detected)

1. `mcp status` to identify which service/component drifted.
2. Check agent logs: `ssh rift "doas journalctl -u mcp-agent --since '10 min ago'"`
3. Check container logs: `ssh rift "doas -u mcp podman logs "`
4. Restart: `mcp restart `
5. If metacrypt: unseal after restart.

### Agent unreachable

1. Check if the agent process is running: `ssh rift "doas systemctl status mcp-agent"`
2. If stopped: `ssh rift "doas systemctl start mcp-agent"`
3. Check logs for crash reason: `ssh rift "doas journalctl -u mcp-agent -n 50"`
4. Containers keep running independently; podman's restart policy handles them.

### Token expired

MCP CLI shows `UNAUTHENTICATED` or `PERMISSION_DENIED`:

1. Check token: the mcp-agent service account token is at `~/.config/mcp/token`
2. Validate: `curl -sk -X POST -H "Authorization: Bearer $(cat ~/.config/mcp/token)" https://mcias.metacircular.net:8443/v1/token/validate`
3. If expired: generate a new service account token from MCIAS admin dashboard.

### Database corruption

The agent's SQLite database is at `/srv/mcp/mcp.db`:

1. Stop the agent: `ssh rift "doas systemctl stop mcp-agent"`
2. Restore from backup: `ssh rift "doas -u mcp cp /srv/mcp/backups/.db /srv/mcp/mcp.db"`
3. Start the agent: `ssh rift "doas systemctl start mcp-agent"`
4. Run `mcp sync` to re-push desired state.

If no backup exists, delete the database (with the agent stopped) and re-bootstrap:

1. `ssh rift "doas -u mcp rm /srv/mcp/mcp.db"`
2. `ssh rift "doas systemctl start mcp-agent"` (creates fresh database)
3. `mcp sync` (pushes all service definitions)

### Disaster recovery (rift lost)

1. Provision new machine, connect to overlay network.
2. Apply NixOS config (creates mcp user, installs agent).
3. Install mcp-agent binary.
4. Restore `/srv/` from backups (each service's backup timer creates daily snapshots).
5. Provision TLS cert from Metacrypt.
6. Start agent: `doas systemctl start mcp-agent`
7. `mcp sync` from vade to push service definitions.
8. Unseal Metacrypt.

## File Locations

### On rift (agent)

| Path | Purpose |
|------|---------|
| `/srv/mcp/mcp-agent.toml` | Agent config |
| `/srv/mcp/mcp.db` | Registry database |
| `/srv/mcp/certs/` | Agent TLS cert and key |
| `/srv/mcp/backups/` | Database snapshots |
| `/srv//` | Service data directories |

### On vade (CLI)

| Path | Purpose |
|------|---------|
| `~/.config/mcp/mcp.toml` | CLI config |
| `~/.config/mcp/token` | MCIAS bearer token |
| `~/.config/mcp/services/` | Service definition files |
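
When the CLI misbehaves on a fresh workstation, a quick check that the vade-side files listed above exist can rule out a broken setup. The following is a minimal sketch. `check_cli_files` is a hypothetical helper, not an mcp subcommand, and it only tests for presence; it does not validate the contents of any file.

```bash
#!/usr/bin/env bash
# Sketch: verify that the CLI-side files from the "On vade (CLI)" table exist.
# check_cli_files is a hypothetical helper, not part of the mcp CLI.

check_cli_files() {
    # Config directory; defaults to the location listed in the table.
    local dir="${1:-$HOME/.config/mcp}"
    local missing=0

    if [ ! -f "$dir/mcp.toml" ]; then
        echo "missing: $dir/mcp.toml (CLI config)"
        missing=1
    fi
    # -s is false for an absent file and for an empty one.
    if [ ! -s "$dir/token" ]; then
        echo "missing or empty: $dir/token (MCIAS bearer token)"
        missing=1
    fi
    if [ ! -d "$dir/services" ]; then
        echo "missing: $dir/services/ (service definitions)"
        missing=1
    fi

    return "$missing"
}
```

Calling `check_cli_files` with no argument checks `~/.config/mcp`. It prints one line per problem and returns non-zero if anything is missing.
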