# MCP Runbook

Operational procedures for the Metacircular Control Plane. Written for
operators at 3 AM.

## Service Overview

MCP manages container lifecycle on Metacircular nodes. Two components:

- **mcp-agent**: systemd service on each node (rift). Manages containers
  via rootless podman, stores registry in SQLite, monitors for drift.
- **mcp**: CLI on the operator's workstation (vade). Pushes desired state,
  queries status.

## Health Checks

### Quick status

```bash
mcp status
```

Shows all services, desired vs observed state, drift, and recent events.
No drift = healthy.

### Agent process

```bash
ssh rift "doas systemctl status mcp-agent"
ssh rift "doas journalctl -u mcp-agent --since '10 min ago' --no-pager"
```

### Individual service

```bash
mcp status metacrypt
```

## Common Operations

### Check what's running

```bash
mcp ps       # live check with uptime
mcp list     # from registry (no runtime query)
mcp status   # full picture with drift and events
```

### Restart a service

```bash
mcp restart metacrypt
```

Restarts all components. Does not change the `active` flag. Metacrypt
will need to be unsealed after restart.

### Stop a service

```bash
mcp stop metacrypt
```

Sets `active = false` in the service definition file and stops all
containers. The agent will not restart them.

### Start a stopped service

```bash
mcp start metacrypt
```

Sets `active = true` and starts all containers.

### Deploy an update

Edit the service definition to update the image tag, then deploy:

```bash
mcp service edit metacrypt   # opens in $EDITOR
mcp deploy metacrypt         # deploys all components
mcp deploy metacrypt/web     # deploy just the web component
```

### Push a config file to a node

```bash
mcp push metacrypt.toml metacrypt            # → /srv/metacrypt/metacrypt.toml
mcp push cert.pem metacrypt certs/cert.pem   # → /srv/metacrypt/certs/cert.pem
```

### Pull a file from a node

```bash
mcp pull metacrypt metacrypt.toml ./local-copy.toml
```

### Sync desired state

Push all service definitions to the agent without deploying:

```bash
mcp sync
```

### View service definition

```bash
mcp service show metacrypt                  # from agent registry
cat ~/.config/mcp/services/metacrypt.toml   # local file
```

### Export service definition from agent

```bash
mcp service export metacrypt
```

Writes the agent's current spec to the local service definition file.

## Unsealing Metacrypt

Metacrypt starts sealed after any restart. Unseal via the API:

```bash
curl -sk -X POST https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
    -H "Content-Type: application/json" \
    -d '{"password":"<unseal-password>"}'
```

Or via the web UI at `https://metacrypt.svc.mcp.metacircular.net`.

**Important:** Restarting metacrypt-api requires unsealing. To avoid this
when updating just the UI, deploy only the web component:

```bash
mcp deploy metacrypt/web
```
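
The API call above puts the unseal password on the command line, where it can end up in shell history. The following is a minimal sketch of a wrapper, assuming bash and curl on the workstation, that prompts without echo and feeds the JSON body to curl on stdin. The function names `unseal_payload` and `unseal_metacrypt` are illustrative and not part of MCP, and the escaping covers only backslashes and double quotes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of the mcp CLI.

# Build the JSON body, escaping backslashes and double quotes in the password.
# Control characters are NOT handled; this is a sketch, not a JSON encoder.
unseal_payload() {
    local pw="$1"
    pw=${pw//\\/\\\\}
    pw=${pw//\"/\\\"}
    printf '{"password":"%s"}' "$pw"
}

# Prompt without echo and send the body via stdin (-d @-), so the password
# is not passed as a curl argument. The URL is the one used in this runbook.
unseal_metacrypt() {
    local pw
    read -r -s -p "Unseal password: " pw
    echo >&2
    unseal_payload "$pw" | curl -sk -X POST \
        https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
        -H "Content-Type: application/json" \
        -d @-
}
```

After sourcing the file, run `unseal_metacrypt` and type the password at the prompt.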

## Agent Management

### Restart the agent

```bash
ssh rift "doas systemctl restart mcp-agent"
```

Containers keep running: the agent is stateless with respect to container
lifecycle. Podman's restart policy keeps containers up.

### View agent logs

```bash
ssh rift "doas journalctl -u mcp-agent -f"              # follow
ssh rift "doas journalctl -u mcp-agent --since today"   # today's logs
```

### Agent database backup

```bash
ssh rift "doas -u mcp /usr/local/bin/mcp-agent snapshot --config /srv/mcp/mcp-agent.toml"
```

Backups go to `/srv/mcp/backups/`.
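
A snapshot only helps if it is recent. The sketch below reports whether a directory holds a snapshot newer than a given age. It assumes GNU find and that snapshot files end in `.db`; the suffix is an assumption taken from the restore path `/srv/mcp/backups/<latest>.db` used in this runbook, and `has_fresh_backup` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# has_fresh_backup DIR MAX_AGE_MINUTES
# Succeeds if DIR contains at least one *.db file modified within the window.
has_fresh_backup() {
    local dir="$1" max_age="$2"
    [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.db' \
            -mmin "-$max_age" -print -quit)" ]
}

# Example (run on rift): warn if there is no snapshot from the last 24 hours.
# has_fresh_backup /srv/mcp/backups 1440 || echo "no recent snapshot" >&2
```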

### Update the agent binary

```bash
# On vade, in the mcp repo:
make clean && make mcp-agent
scp mcp-agent rift:/tmp/
ssh rift "doas systemctl stop mcp-agent && \
    doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent && \
    doas systemctl start mcp-agent"
```
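
Before swapping binaries it can help to confirm that the copy in `/tmp` on rift matches the local build. A minimal sketch, assuming `sha256sum` (GNU coreutils) is available on both hosts; `digest_of` and `verify_remote_copy` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# digest_of FILE: print only the SHA-256 hex digest of FILE.
digest_of() {
    sha256sum "$1" | cut -d' ' -f1
}

# Compare the local build with the copy on rift before installing it.
# Run from the mcp repo on vade, after the scp step.
verify_remote_copy() {
    local local_sum remote_sum
    local_sum=$(digest_of mcp-agent)
    remote_sum=$(ssh rift "sha256sum /tmp/mcp-agent" | cut -d' ' -f1)
    [ -n "$local_sum" ] && [ "$local_sum" = "$remote_sum" ]
}
```

A non-zero exit status from `verify_remote_copy` means the transfer should be repeated before stopping the agent.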

### Update the CLI binary

```bash
make clean && make mcp
cp mcp ~/.local/bin/
```

## Node Management

### List nodes

```bash
mcp node list
```

### Add a node

```bash
mcp node add <name> <address:port>
```

### Remove a node

```bash
mcp node remove <name>
```

## TLS Certificate Renewal

The agent's TLS cert is at `/srv/mcp/certs/cert.pem`. Check expiry:

```bash
ssh rift "openssl x509 -in /srv/mcp/certs/cert.pem -noout -enddate"
```
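
The `-enddate` output is a date string, which is awkward to compare by eye at 3 AM. The sketch below turns it into a number of days remaining. It assumes GNU `date` (for `-d`), and `days_left` is a hypothetical helper name; the optional second argument exists only so the calculation can be checked against a fixed clock.

```bash
#!/usr/bin/env bash
# days_left ENDDATE [NOW_EPOCH]
# ENDDATE is the line printed by `openssl x509 -noout -enddate`,
# for example "notAfter=Jan 31 00:00:00 2030 GMT".
days_left() {
    local enddate="${1#notAfter=}"      # strip the "notAfter=" prefix
    local now="${2:-$(date +%s)}"       # default to the current time
    local end
    end=$(date -u -d "$enddate" +%s) || return 1
    echo $(( (end - now) / 86400 ))     # whole days, truncated
}

# Example: print the remaining days for the agent cert.
# days_left "$(ssh rift 'openssl x509 -in /srv/mcp/certs/cert.pem -noout -enddate')"
```

As an alternative that needs no date parsing, `openssl x509 -checkend <seconds>` exits non-zero if the certificate expires within the given number of seconds.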

To renew (requires a Metacrypt token):

```bash
export METACRYPT_TOKEN="<token>"
# The outer double quotes mean $METACRYPT_TOKEN is expanded locally on vade
# before ssh runs; the response is also written locally.
ssh rift "curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer $METACRYPT_TOKEN' \
    -d '{
        \"mount\": \"pki\",
        \"operation\": \"issue\",
        \"path\": \"web\",
        \"data\": {
            \"issuer\": \"web\",
            \"common_name\": \"mcp-agent.svc.mcp.metacircular.net\",
            \"profile\": \"server\",
            \"dns_names\": [\"mcp-agent.svc.mcp.metacircular.net\"],
            \"ip_addresses\": [\"100.95.252.120\", \"192.168.88.181\"],
            \"ttl\": \"2160h\"
        }
    }'" > /tmp/cert-response.json

# Extract and install cert+key from the JSON response, then:
ssh rift "doas systemctl restart mcp-agent"
```

## Incident Procedures

### Service not running (drift detected)

1. `mcp status`: identify which service/component drifted.
2. Check agent logs: `ssh rift "doas journalctl -u mcp-agent --since '10 min ago'"`
3. Check container logs: `ssh rift "doas -u mcp podman logs <container-name>"`
4. Restart: `mcp restart <service>`
5. If metacrypt: unseal after restart.

### Agent unreachable

1. Check if the agent process is running: `ssh rift "doas systemctl status mcp-agent"`
2. If stopped: `ssh rift "doas systemctl start mcp-agent"`
3. Check logs for crash reason: `ssh rift "doas journalctl -u mcp-agent -n 50"`
4. Containers keep running independently; podman's restart policy handles them.
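
After starting the agent in step 2, it may take a moment to come up. The steps above can be followed by a bounded retry loop, sketched here under the assumption that bash is available on the workstation. `wait_for` is a hypothetical helper, and `systemctl is-active --quiet` is used only as an example probe.

```bash
#!/usr/bin/env bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs are used up.
wait_for() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        # Sleep between attempts, but not after the final one.
        if (( i < attempts )); then sleep "$delay"; fi
    done
    return 1
}

# Example: try up to 10 times, 3 seconds apart, for the agent to be active.
# wait_for 10 3 ssh rift "doas systemctl is-active --quiet mcp-agent"
```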

### Token expired

The MCP CLI reports `UNAUTHENTICATED` or `PERMISSION_DENIED`:

1. Check token: the mcp-agent service account token is at `~/.config/mcp/token`
2. Validate: `curl -sk -X POST -H "Authorization: Bearer $(cat ~/.config/mcp/token)" https://mcias.metacircular.net:8443/v1/token/validate`
3. If expired: generate a new service account token from the MCIAS admin dashboard.

### Database corruption

The agent's SQLite database is at `/srv/mcp/mcp.db`:

1. Stop the agent: `ssh rift "doas systemctl stop mcp-agent"`
2. Restore from backup: `ssh rift "doas -u mcp cp /srv/mcp/backups/<latest>.db /srv/mcp/mcp.db"`
3. Start the agent: `ssh rift "doas systemctl start mcp-agent"`
4. Run `mcp sync` to re-push desired state.
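
The `<latest>` in step 2 is a placeholder. One way to find the file is sketched below, assuming snapshots end in `.db` and that modification time reflects when the snapshot was taken; `latest_backup` is a hypothetical helper, meant to be run on rift by a user who can read the backup directory.

```bash
#!/usr/bin/env bash
# latest_backup DIR: print the most recently modified *.db file in DIR.
# Returns non-zero if DIR contains no such file.
latest_backup() {
    local dir="$1" f newest=""
    for f in "$dir"/*.db; do
        [ -e "$f" ] || continue          # glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example: latest_backup /srv/mcp/backups
```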

If no backup exists, make sure the agent is stopped (`ssh rift "doas systemctl stop mcp-agent"`),
then delete the database and re-bootstrap:

1. `ssh rift "doas -u mcp rm /srv/mcp/mcp.db"`
2. `ssh rift "doas systemctl start mcp-agent"` (creates fresh database)
3. `mcp sync` (pushes all service definitions)

### Disaster recovery (rift lost)

1. Provision new machine, connect to overlay network.
2. Apply NixOS config (creates mcp user, installs agent).
3. Install mcp-agent binary.
4. Restore `/srv/` from backups (each service's backup timer creates daily snapshots).
5. Provision TLS cert from Metacrypt.
6. Start agent: `doas systemctl start mcp-agent`
7. `mcp sync` from vade to push service definitions.
8. Unseal Metacrypt.

## File Locations

### On rift (agent)

| Path | Purpose |
|------|---------|
| `/srv/mcp/mcp-agent.toml` | Agent config |
| `/srv/mcp/mcp.db` | Registry database |
| `/srv/mcp/certs/` | Agent TLS cert and key |
| `/srv/mcp/backups/` | Database snapshots |
| `/srv/<service>/` | Service data directories |

### On vade (CLI)

| Path | Purpose |
|------|---------|
| `~/.config/mcp/mcp.toml` | CLI config |
| `~/.config/mcp/token` | MCIAS bearer token |
| `~/.config/mcp/services/` | Service definition files |