diff --git a/PROGRESS_V1.md b/PROGRESS_V1.md
index 11c2d6b..2a9ec04 100644
--- a/PROGRESS_V1.md
+++ b/PROGRESS_V1.md
@@ -47,8 +47,8 @@
 ## Phase 5: Integration and Polish
 
 - [ ] **P5.1** Integration test suite
-- [ ] **P5.2** Bootstrap procedure test
-- [x] **P5.3** Documentation — CLAUDE.md done; README.md and RUNBOOK.md pending
+- [x] **P5.2** Bootstrap procedure — documented in `docs/bootstrap.md`
+- [x] **P5.3** Documentation — CLAUDE.md, README.md, RUNBOOK.md
 
 ## Phase 6: Deployment (completed 2026-03-26)
 
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9c7d4f2
--- /dev/null
+++ b/README.md
@@ -0,0 +1,119 @@
+# MCP — Metacircular Control Plane
+
+MCP is the orchestrator for the [Metacircular](https://metacircular.net)
+platform. It manages container lifecycle, tracks what services run where,
+and transfers files between the operator's workstation and managed nodes.
+
+## Architecture
+
+**CLI** (`mcp`) — thin client on the operator's workstation. Reads local
+service definition files, pushes intent to agents, queries status.
+
+**Agent** (`mcp-agent`) — per-node daemon. Manages containers via rootless
+podman, stores a SQLite registry of desired/observed state, monitors for
+drift, and alerts the operator.
+
+## Quick Start
+
+### Build
+
+```bash
+make all # vet, lint, test, build
+make mcp # CLI only
+make mcp-agent # agent only
+```
+
+### Install the CLI
+
+```bash
+cp mcp ~/.local/bin/
+mkdir -p ~/.config/mcp/services
+```
+
+Create `~/.config/mcp/mcp.toml`:
+
+```toml
+[services]
+dir = "/home//.config/mcp/services"
+
+[mcias]
+server_url = "https://mcias.metacircular.net:8443"
+service_name = "mcp"
+
+[auth]
+token_path = "/home//.config/mcp/token"
+
+[[nodes]]
+name = "rift"
+address = "100.95.252.120:9444"
+```
+
+### Authenticate
+
+```bash
+mcp login
+```
+
+### Check status
+
+```bash
+mcp status # full picture: services, drift, events
+mcp ps # live container check with uptime
+mcp list # quick registry query
+```
+
+### Deploy a service
+
+Write a service definition in `~/.config/mcp/services/.toml`:
+
+```toml
+name = "myservice"
+node = "rift"
+active = true
+
+[[components]]
+name = "api"
+image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
+network = "mcpnet"
+user = "0:0"
+restart = "unless-stopped"
+ports = ["127.0.0.1:8443:8443"]
+volumes = ["/srv/myservice:/srv/myservice"]
+cmd = ["server", "--config", "/srv/myservice/myservice.toml"]
+```
+
+Then deploy:
+
+```bash
+mcp deploy myservice
+```
+
+## Commands
+
+| Command | Description |
+|---------|-------------|
+| `mcp login` | Authenticate to MCIAS |
+| `mcp deploy [/]` | Deploy from service definition |
+| `mcp stop ` | Stop all components |
+| `mcp start ` | Start all components |
+| `mcp restart ` | Restart all components |
+| `mcp list` | List services (registry) |
+| `mcp ps` | Live container check |
+| `mcp status [service]` | Full status with drift and events |
+| `mcp sync` | Push all service definitions |
+| `mcp adopt ` | Adopt running containers |
+| `mcp service show ` | Print spec from agent |
+| `mcp service edit ` | Edit definition in $EDITOR |
+| `mcp service export ` | Export agent spec to file |
+| `mcp push [path]` | Push file to node |
+| `mcp pull [file]` | Pull file from node |
+| `mcp node list` | List nodes |
+| `mcp node add ` | Add a node |
+| `mcp node remove ` | Remove a node |
+
+## Documentation
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) — design specification
+- [RUNBOOK.md](RUNBOOK.md) — operational procedures
+- [PROJECT_PLAN_V1.md](PROJECT_PLAN_V1.md) — implementation plan
+- [PROGRESS_V1.md](PROGRESS_V1.md) — progress and remaining work
diff --git a/RUNBOOK.md b/RUNBOOK.md
new file mode 100644
index 0000000..bb66671
--- /dev/null
+++ b/RUNBOOK.md
@@ -0,0 +1,305 @@
+# MCP Runbook
+
+Operational procedures for the Metacircular Control Plane. Written for
+operators at 3 AM.
+
+## Service Overview
+
+MCP manages container lifecycle on Metacircular nodes. Two components:
+- **mcp-agent** — systemd service on each node (rift). Manages containers
+  via rootless podman, stores registry in SQLite, monitors for drift.
+- **mcp** — CLI on the operator's workstation (vade). Pushes desired state,
+  queries status.
+
+## Health Checks
+
+### Quick status
+
+```bash
+mcp status
+```
+
+Shows all services, desired vs observed state, drift, and recent events.
+No drift = healthy.
+
+### Agent process
+
+```bash
+ssh rift "doas systemctl status mcp-agent"
+ssh rift "doas journalctl -u mcp-agent --since '10 min ago' --no-pager"
+```
+
+### Individual service
+
+```bash
+mcp status metacrypt
+```
+
+## Common Operations
+
+### Check what's running
+
+```bash
+mcp ps # live check with uptime
+mcp list # from registry (no runtime query)
+mcp status # full picture with drift and events
+```
+
+### Restart a service
+
+```bash
+mcp restart metacrypt
+```
+
+Restarts all components. Does not change the `active` flag. Metacrypt
+will need to be unsealed after restart.
+
+### Stop a service
+
+```bash
+mcp stop metacrypt
+```
+
+Sets `active = false` in the service definition file and stops all
+containers. The agent will not restart them.
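
A minimal sketch for confirming the flag after a stop, assuming the definition file keeps `active` as a top-level `key = value` line as in the README example. The helper name `service_active` is hypothetical and is not part of the `mcp` CLI.

```bash
# Hypothetical helper (not an mcp subcommand): print the value of the
# `active` key from a service definition file.
# Assumes the flag sits on its own line as `active = true` or
# `active = false`; only the first such line is reported.
service_active() {
    sed -n -E 's/^active[[:space:]]*=[[:space:]]*(true|false)[[:space:]]*$/\1/p' "$1" | head -n 1
}
```

After `mcp stop metacrypt`, running `service_active ~/.config/mcp/services/metacrypt.toml` should print `false` if the CLI rewrote the file as described above. A file without an `active` line prints nothing.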
+
+### Start a stopped service
+
+```bash
+mcp start metacrypt
+```
+
+Sets `active = true` and starts all containers.
+
+### Deploy an update
+
+Edit the service definition to update the image tag, then deploy:
+
+```bash
+mcp service edit metacrypt # opens in $EDITOR
+mcp deploy metacrypt # deploys all components
+mcp deploy metacrypt/web # deploy just the web component
+```
+
+### Push a config file to a node
+
+```bash
+mcp push metacrypt.toml metacrypt # → /srv/metacrypt/metacrypt.toml
+mcp push cert.pem metacrypt certs/cert.pem # → /srv/metacrypt/certs/cert.pem
+```
+
+### Pull a file from a node
+
+```bash
+mcp pull metacrypt metacrypt.toml ./local-copy.toml
+```
+
+### Sync desired state
+
+Push all service definitions to the agent without deploying:
+
+```bash
+mcp sync
+```
+
+### View service definition
+
+```bash
+mcp service show metacrypt # from agent registry
+cat ~/.config/mcp/services/metacrypt.toml # local file
+```
+
+### Export service definition from agent
+
+```bash
+mcp service export metacrypt
+```
+
+Writes the agent's current spec to the local service definition file.
+
+## Unsealing Metacrypt
+
+Metacrypt starts sealed after any restart. Unseal via the API:
+
+```bash
+curl -sk -X POST https://metacrypt.svc.mcp.metacircular.net:8443/v1/unseal \
+  -H "Content-Type: application/json" \
+  -d '{"password":""}'
+```
+
+Or via the web UI at `https://metacrypt.svc.mcp.metacircular.net`.
+
+**Important:** Restarting metacrypt-api requires unsealing. To avoid this
+when updating just the UI, deploy only the web component:
+
+```bash
+mcp deploy metacrypt/web
+```
+
+## Agent Management
+
+### Restart the agent
+
+```bash
+ssh rift "doas systemctl restart mcp-agent"
+```
+
+Containers keep running — the agent is stateless w.r.t. container
+lifecycle. Podman's restart policy keeps containers up.
+
+### View agent logs
+
+```bash
+ssh rift "doas journalctl -u mcp-agent -f" # follow
+ssh rift "doas journalctl -u mcp-agent --since today" # today's logs
+```
+
+### Agent database backup
+
+```bash
+ssh rift "doas -u mcp /usr/local/bin/mcp-agent snapshot --config /srv/mcp/mcp-agent.toml"
+```
+
+Backups go to `/srv/mcp/backups/`.
+
+### Update the agent binary
+
+```bash
+# On vade, in the mcp repo:
+make clean && make mcp-agent
+scp mcp-agent rift:/tmp/
+ssh rift "doas systemctl stop mcp-agent && \
+  doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent && \
+  doas systemctl start mcp-agent"
+```
+
+### Update the CLI binary
+
+```bash
+make clean && make mcp
+cp mcp ~/.local/bin/
+```
+
+## Node Management
+
+### List nodes
+
+```bash
+mcp node list
+```
+
+### Add a node
+
+```bash
+mcp node add
+```
+
+### Remove a node
+
+```bash
+mcp node remove
+```
+
+## TLS Certificate Renewal
+
+The agent's TLS cert is at `/srv/mcp/certs/cert.pem`. Check expiry:
+
+```bash
+ssh rift "openssl x509 -in /srv/mcp/certs/cert.pem -noout -enddate"
+```
+
+To renew (requires a Metacrypt token):
+
+```bash
+export METACRYPT_TOKEN=""
+ssh rift "curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer $METACRYPT_TOKEN' \
+  -d '{
+    \"mount\": \"pki\",
+    \"operation\": \"issue\",
+    \"path\": \"web\",
+    \"data\": {
+      \"issuer\": \"web\",
+      \"common_name\": \"mcp-agent.svc.mcp.metacircular.net\",
+      \"profile\": \"server\",
+      \"dns_names\": [\"mcp-agent.svc.mcp.metacircular.net\"],
+      \"ip_addresses\": [\"100.95.252.120\", \"192.168.88.181\"],
+      \"ttl\": \"2160h\"
+    }
+  }'" > /tmp/cert-response.json
+
+# Extract and install cert+key from the JSON response, then:
+ssh rift "doas systemctl restart mcp-agent"
+```
+
+## Incident Procedures
+
+### Service not running (drift detected)
+
+1. `mcp status` — identify which service/component drifted.
+2. Check agent logs: `ssh rift "doas journalctl -u mcp-agent --since '10 min ago'"`
+3. Check container logs: `ssh rift "doas -u mcp podman logs "`
+4. Restart: `mcp restart `
+5. If metacrypt: unseal after restart.
+
+### Agent unreachable
+
+1. Check if the agent process is running: `ssh rift "doas systemctl status mcp-agent"`
+2. If stopped: `ssh rift "doas systemctl start mcp-agent"`
+3. Check logs for crash reason: `ssh rift "doas journalctl -u mcp-agent -n 50"`
+4. Containers keep running independently — podman's restart policy handles them.
+
+### Token expired
+
+MCP CLI shows `UNAUTHENTICATED` or `PERMISSION_DENIED`:
+
+1. Check token: the mcp-agent service account token is at `~/.config/mcp/token`
+2. Validate: `curl -sk -X POST -H "Authorization: Bearer $(cat ~/.config/mcp/token)" https://mcias.metacircular.net:8443/v1/token/validate`
+3. If expired: generate a new service account token from MCIAS admin dashboard.
+
+### Database corruption
+
+The agent's SQLite database is at `/srv/mcp/mcp.db`:
+
+1. Stop the agent: `ssh rift "doas systemctl stop mcp-agent"`
+2. Restore from backup: `ssh rift "doas -u mcp cp /srv/mcp/backups/.db /srv/mcp/mcp.db"`
+3. Start the agent: `ssh rift "doas systemctl start mcp-agent"`
+4. Run `mcp sync` to re-push desired state.
+
+If no backup exists, delete the database and re-bootstrap:
+
+1. `ssh rift "doas -u mcp rm /srv/mcp/mcp.db"`
+2. `ssh rift "doas systemctl start mcp-agent"` (creates fresh database)
+3. `mcp sync` (pushes all service definitions)
+
+### Disaster recovery (rift lost)
+
+1. Provision new machine, connect to overlay network.
+2. Apply NixOS config (creates mcp user, installs agent).
+3. Install mcp-agent binary.
+4. Restore `/srv/` from backups (each service's backup timer creates daily snapshots).
+5. Provision TLS cert from Metacrypt.
+6. Start agent: `doas systemctl start mcp-agent`
+7. `mcp sync` from vade to push service definitions.
+8. Unseal Metacrypt.
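
Both restore procedures above (database corruption and disaster recovery) depend on picking a snapshot by hand. A minimal sketch of that choice, assuming snapshots in the backup directory end in `.db` as the restore command suggests. `latest_backup` is a hypothetical helper, not an `mcp-agent` subcommand.

```bash
# Hypothetical helper: print the most recently modified *.db file in a
# backup directory, or return 1 if the directory holds none.
latest_backup() {
    dir=$1
    # Expand the glob into the function's positional parameters; if
    # nothing matched, "$1" is the literal pattern and does not exist.
    set -- "$dir"/*.db
    [ -e "$1" ] || return 1
    # ls -t sorts newest first; keep only the first entry.
    ls -t "$@" | head -n 1
}
```

On rift this might be run as `latest_backup /srv/mcp/backups` under the `mcp` user. It orders by file modification time, so copying or touching snapshot files can change which one is reported.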
+
+## File Locations
+
+### On rift (agent)
+
+| Path | Purpose |
+|------|---------|
+| `/srv/mcp/mcp-agent.toml` | Agent config |
+| `/srv/mcp/mcp.db` | Registry database |
+| `/srv/mcp/certs/` | Agent TLS cert and key |
+| `/srv/mcp/backups/` | Database snapshots |
+| `/srv//` | Service data directories |
+
+### On vade (CLI)
+
+| Path | Purpose |
+|------|---------|
+| `~/.config/mcp/mcp.toml` | CLI config |
+| `~/.config/mcp/token` | MCIAS bearer token |
+| `~/.config/mcp/services/` | Service definition files |
diff --git a/docs/bootstrap.md b/docs/bootstrap.md
new file mode 100644
index 0000000..8f488d9
--- /dev/null
+++ b/docs/bootstrap.md
@@ -0,0 +1,198 @@
+# MCP Bootstrap Procedure
+
+How to bring MCP up on a node for the first time, including migrating
+existing containers from another user's podman instance.
+
+## Prerequisites
+
+- NixOS configuration applied with `configs/mcp.nix` (creates `mcp` user
+  with rootless podman, subuid/subgid, systemd service)
+- MCIAS system account with `admin` role (for token validation and cert
+  provisioning)
+- Metacrypt running (for TLS certificate issuance)
+
+## Step 1: Provision TLS Certificate
+
+Issue a cert from Metacrypt with DNS and IP SANs:
+
+```bash
+export METACRYPT_TOKEN=""
+
+# From a machine that can reach Metacrypt (e.g., via loopback on rift):
+curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $METACRYPT_TOKEN" \
+  -d '{
+    "mount": "pki",
+    "operation": "issue",
+    "path": "web",
+    "data": {
+      "issuer": "web",
+      "common_name": "mcp-agent.svc.mcp.metacircular.net",
+      "profile": "server",
+      "dns_names": ["mcp-agent.svc.mcp.metacircular.net"],
+      "ip_addresses": ["", ""],
+      "ttl": "2160h"
+    }
+  }' > cert-response.json
+
+# Extract cert and key from the JSON response and install:
+doas cp cert.pem /srv/mcp/certs/cert.pem
+doas cp key.pem /srv/mcp/certs/key.pem
+doas chown mcp:mcp /srv/mcp/certs/cert.pem /srv/mcp/certs/key.pem
+doas chmod 600 /srv/mcp/certs/cert.pem /srv/mcp/certs/key.pem
+```
+
+## Step 2: Add DNS Record
+
+Add an A record for `mcp-agent.svc.mcp.metacircular.net` pointing to the
+node's IP in the MCNS zone file, bump the serial, restart CoreDNS.
+
+## Step 3: Write Agent Config
+
+Create `/srv/mcp/mcp-agent.toml`:
+
+```toml
+[server]
+grpc_addr = ":9444"
+tls_cert = "/srv/mcp/certs/cert.pem"
+tls_key = "/srv/mcp/certs/key.pem"
+
+[database]
+path = "/srv/mcp/mcp.db"
+
+[mcias]
+server_url = "https://mcias.metacircular.net:8443"
+service_name = "mcp-agent"
+
+[agent]
+node_name = ""
+container_runtime = "podman"
+
+[monitor]
+interval = "60s"
+alert_command = []
+cooldown = "15m"
+flap_threshold = 3
+flap_window = "10m"
+retention = "30d"
+
+[log]
+level = "info"
+```
+
+## Step 4: Install Agent Binary
+
+```bash
+scp mcp-agent :/tmp/
+ssh "doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent"
+```
+
+## Step 5: Start the Agent
+
+```bash
+ssh "doas systemctl start mcp-agent"
+ssh "doas systemctl status mcp-agent"
+```
+
+## Step 6: Configure CLI
+
+On the operator's workstation, create `~/.config/mcp/mcp.toml` and save
+the MCIAS admin service account token to `~/.config/mcp/token`.
+
+## Step 7: Migrate Containers (if existing)
+
+If containers are running under another user (e.g., `kyle`), migrate them
+to the `mcp` user's podman. Process each service in dependency order:
+
+**Dependency order:** Metacrypt → MC-Proxy → MCR → MCNS
+
+For each service:
+
+```bash
+# 1. Stop containers under the old user
+ssh "podman stop && podman rm "
+
+# 2. Transfer ownership of data directory
+ssh "doas chown -R mcp:mcp /srv/"
+
+# 3. Transfer images to mcp's podman
+ssh "podman save -o /tmp/.tar"
+ssh "doas su -l -s /bin/sh mcp -c 'XDG_RUNTIME_DIR=/run/user/ podman load -i /tmp/.tar'"
+
+# 4. Start containers under mcp (with new naming convention)
+ssh "doas su -l -s /bin/sh mcp -c 'XDG_RUNTIME_DIR=/run/user/ podman run -d \
+  --name - \
+  --network mcpnet \
+  --restart unless-stopped \
+  --user 0:0 \
+  -p \
+  -v /srv/:/srv/ \
+  '"
+```
+
+**Container naming convention:** `-` (e.g.,
+`metacrypt-api`, `metacrypt-web`, `mc-proxy`).
+
+**Network:** Services whose components need to communicate (metacrypt
+api↔web, mcr api↔web) must be on the same podman network with DNS
+enabled. Create with `podman network create mcpnet`.
+
+**Config updates:** If service configs reference container names for
+inter-component communication (e.g., `vault_grpc = "metacrypt:9443"`),
+update them to use the new names (e.g., `vault_grpc = "metacrypt-api:9443"`).
+
+**Unseal Metacrypt** after migration — it starts sealed.
+
+## Step 8: Adopt Containers
+
+```bash
+mcp adopt metacrypt
+mcp adopt mc-proxy
+mcp adopt mcr
+mcp adopt mcns
+```
+
+## Step 9: Export and Complete Service Definitions
+
+```bash
+mcp service export metacrypt
+mcp service export mc-proxy
+mcp service export mcr
+mcp service export mcns
+```
+
+The exported files will have name + image only. Edit each file to add the
+full container spec: network, ports, volumes, user, restart, cmd.
+
+Then sync to push the complete specs:
+
+```bash
+mcp sync
+```
+
+## Step 10: Verify
+
+```bash
+mcp status
+```
+
+All services should show `desired: running`, `observed: running`, no drift.
+
+## Lessons Learned (from first deployment, 2026-03-26)
+
+- **NixOS systemd sandbox**: `ProtectHome=true` blocks `/run/user` which
+  rootless podman needs. Use `ProtectHome=false`. `ProtectSystem=strict`
+  also blocks it; use `full` instead.
+- **PATH**: the agent's systemd unit needs `PATH=/run/current-system/sw/bin`
+  to find podman.
+- **XDG_RUNTIME_DIR**: must be set to `/run/user/` for rootless podman.
+  Pin the UID in NixOS config to avoid drift.
+- **Podman ps JSON**: the `Command` field is `[]string`, not `string`.
+- **Container naming**: `mc-proxy` (service with hyphen) breaks naive split
+  on `-`. The agent uses registry-aware splitting.
+- **Token whitespace**: token files with trailing newlines cause gRPC header
+  errors. The CLI trims whitespace.
+- **MCR auth**: rootless podman under a new user can't pull from MCR without
+  OCI token auth. Workaround: `podman save` + `podman load` to transfer
+  images.
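
The container-naming lesson can be illustrated with a small sketch. It matches a container name against a list of known service names and keeps the longest match, so `mc-proxy` is not cut at its first hyphen. This is an assumption about what registry-aware splitting could look like, not the agent's actual code. `split_container_name` and its output format are hypothetical.

```bash
# Hypothetical sketch of registry-aware splitting (not the agent's code).
# Usage: split_container_name NAME SERVICE...
# Prints "service=<svc> component=<comp>", or returns 1 if no known
# service matches. The component is empty when NAME equals a service.
split_container_name() {
    name=$1
    shift
    best=""
    for svc in "$@"; do
        case $name in
            "$svc" | "$svc"-*)
                # Prefer the longest matching service name.
                if [ "${#svc}" -gt "${#best}" ]; then best=$svc; fi
                ;;
        esac
    done
    [ -n "$best" ] || return 1
    component=${name#"$best"}   # drop the service prefix
    component=${component#-}    # drop the separating hyphen, if any
    printf 'service=%s component=%s\n' "$best" "$component"
}
```

With the services from Step 8, `split_container_name metacrypt-api metacrypt mc-proxy mcr mcns` yields service `metacrypt` and component `api`, while `mc-proxy` resolves to service `mc-proxy` with an empty component. A naive split on the first `-` would instead report service `mc`.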