P5.2 + P5.3: Bootstrap docs, README, and RUNBOOK

- docs/bootstrap.md: step-by-step bootstrap procedure with lessons
  learned from the first deployment (NixOS sandbox issues, podman
  rootless setup, container naming, MCR auth workaround)
- README.md: quick-start guide, command reference, doc links
- RUNBOOK.md: operational procedures for operators (health checks,
  common operations, unsealing metacrypt, cert renewal, incident
  response, disaster recovery, file locations)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:32:22 -07:00
parent ff9bfc5087
commit ea8a42a696
4 changed files with 624 additions and 2 deletions

docs/bootstrap.md Normal file

@@ -0,0 +1,198 @@
# MCP Bootstrap Procedure
How to bring MCP up on a node for the first time, including migrating
existing containers from another user's podman instance.
## Prerequisites
- NixOS configuration applied with `configs/mcp.nix` (creates `mcp` user
with rootless podman, subuid/subgid, systemd service)
- MCIAS system account with `admin` role (for token validation and cert
provisioning)
- Metacrypt running (for TLS certificate issuance)
## Step 1: Provision TLS Certificate
Issue a cert from Metacrypt with DNS and IP SANs:
```bash
export METACRYPT_TOKEN="<admin-token>"
# From a machine that can reach Metacrypt (e.g., via loopback on rift):
curl -sk -X POST https://127.0.0.1:18443/v1/engine/request \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $METACRYPT_TOKEN" \
  -d '{
    "mount": "pki",
    "operation": "issue",
    "path": "web",
    "data": {
      "issuer": "web",
      "common_name": "mcp-agent.svc.mcp.metacircular.net",
      "profile": "server",
      "dns_names": ["mcp-agent.svc.mcp.metacircular.net"],
      "ip_addresses": ["<tailscale-ip>", "<lan-ip>"],
      "ttl": "2160h"
    }
  }' > cert-response.json
# Extract the certificate and key from cert-response.json into cert.pem and
# key.pem, then install them:
doas cp cert.pem /srv/mcp/certs/cert.pem
doas cp key.pem /srv/mcp/certs/key.pem
doas chown mcp:mcp /srv/mcp/certs/cert.pem /srv/mcp/certs/key.pem
doas chmod 600 /srv/mcp/certs/cert.pem /srv/mcp/certs/key.pem
```
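The request body above is single-quoted, so the shell does not expand anything inside it and the `<tailscale-ip>` and `<lan-ip>` placeholders have to be edited by hand. The following is a minimal sketch of one way to fill them in from arguments instead. `build_issue_request` is a hypothetical helper name, not an MCP or Metacrypt command; the field names and values are copied from the request above, and arguments are inserted verbatim, so they must not contain quotes.

```bash
#!/usr/bin/env bash
# Sketch only: build_issue_request is a hypothetical helper, not part of MCP
# or Metacrypt. It prints the issue request shown above with the common name
# and IP SANs taken from its arguments (inserted verbatim, no JSON escaping).
# Usage: build_issue_request <common-name> <tailscale-ip> <lan-ip>
build_issue_request() {
  local common_name="$1" tailscale_ip="$2" lan_ip="$3"
  cat <<EOF
{
  "mount": "pki",
  "operation": "issue",
  "path": "web",
  "data": {
    "issuer": "web",
    "common_name": "${common_name}",
    "profile": "server",
    "dns_names": ["${common_name}"],
    "ip_addresses": ["${tailscale_ip}", "${lan_ip}"],
    "ttl": "2160h"
  }
}
EOF
}
```

Piping the output into the same `curl` call with `-d @-` in place of the inline body should send an equivalent request, since `-d @-` tells curl to read the request data from standard input.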
## Step 2: Add DNS Record
In the MCNS zone file, add an A record for
`mcp-agent.svc.mcp.metacircular.net` pointing to the node's IP, bump the
zone serial, and restart CoreDNS.
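Serial bumps are easy to get wrong by hand. The sketch below computes the next serial under the assumption that the zone uses the common `YYYYMMDDNN` convention; the source does not say which scheme MCNS uses, so treat that as an assumption. `next_serial` is a hypothetical helper, and it does not handle more than 99 changes in one day.

```bash
#!/usr/bin/env bash
# Sketch only: assumes a YYYYMMDDNN zone serial (an assumption, not stated in
# this document). next_serial is a hypothetical helper.
# Usage: next_serial <current-serial> [today-as-YYYYMMDD]
next_serial() {
  local current="$1"
  local today="${2:-$(date +%Y%m%d)}"
  local first_today="${today}01"
  if [ "$current" -lt "$first_today" ]; then
    # First change today: start a fresh counter for this date.
    echo "$first_today"
  else
    # Already bumped today (or the serial is ahead of the date): increment.
    echo "$((current + 1))"
  fi
}
```

Either branch returns a value strictly greater than the current serial, which is what secondaries and caches need in order to notice the change.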
## Step 3: Write Agent Config
Create `/srv/mcp/mcp-agent.toml`:
```toml
[server]
grpc_addr = "<tailscale-ip>:9444"
tls_cert = "/srv/mcp/certs/cert.pem"
tls_key = "/srv/mcp/certs/key.pem"
[database]
path = "/srv/mcp/mcp.db"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp-agent"
[agent]
node_name = "<node-name>"
container_runtime = "podman"
[monitor]
interval = "60s"
alert_command = []
cooldown = "15m"
flap_threshold = 3
flap_window = "10m"
retention = "30d"
[log]
level = "info"
```
## Step 4: Install Agent Binary
```bash
scp mcp-agent <node>:/tmp/
ssh <node> "doas cp /tmp/mcp-agent /usr/local/bin/mcp-agent"
```
## Step 5: Start the Agent
```bash
ssh <node> "doas systemctl start mcp-agent"
ssh <node> "doas systemctl status mcp-agent"
```
## Step 6: Configure CLI
On the operator's workstation, create `~/.config/mcp/mcp.toml` and save
the MCIAS admin service account token to `~/.config/mcp/token`.
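Because token files with trailing newlines have caused gRPC header errors (see Lessons Learned), it may help to write the token without a newline and with owner-only permissions. This is a minimal sketch; `save_token` is a hypothetical helper, not part of the `mcp` CLI, and the default path is the one given above.

```bash
#!/usr/bin/env bash
# Sketch only: save_token is a hypothetical helper, not part of the mcp CLI.
# Usage: save_token <token> [destination]
save_token() {
  local token="$1"
  local dest="${2:-$HOME/.config/mcp/token}"
  (
    # Anything created in this subshell is readable by the owner only.
    umask 077
    mkdir -p "$(dirname "$dest")"
    # printf '%s' writes the token with no trailing newline.
    printf '%s' "$token" > "$dest"
    # Tighten the mode in case the file already existed with looser bits.
    chmod 600 "$dest"
  )
}
```

The umask change is confined to the subshell, so it does not leak into the calling shell session.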
## Step 7: Migrate Containers (if existing)
If containers are running under another user (e.g., `kyle`), migrate them
to the `mcp` user's podman instance, one service at a time.

**Dependency order:** Metacrypt → MC-Proxy → MCR → MCNS

For each service:
```bash
# 1. Stop and remove the containers under the old user
ssh <node> "podman stop <container> && podman rm <container>"
# 2. Transfer ownership of the data directory
ssh <node> "doas chown -R mcp:mcp /srv/<service>"
# 3. Transfer images to mcp's podman
ssh <node> "podman save <image> -o /tmp/<service>.tar"
ssh <node> "doas su -l -s /bin/sh mcp -c 'XDG_RUNTIME_DIR=/run/user/<uid> podman load -i /tmp/<service>.tar'"
# 4. Start containers under mcp (with the new naming convention)
ssh <node> "doas su -l -s /bin/sh mcp -c 'XDG_RUNTIME_DIR=/run/user/<uid> podman run -d \
  --name <service>-<component> \
  --network mcpnet \
  --restart unless-stopped \
  --user 0:0 \
  -p <ports> \
  -v /srv/<service>:/srv/<service> \
  <image> <cmd>'"
```
**Container naming convention:** `<service>-<component>` (e.g.,
`metacrypt-api`, `metacrypt-web`, `mc-proxy`).

**Network:** Services whose components need to communicate (metacrypt
api↔web, mcr api↔web) must be on the same podman network with DNS
enabled. Create it with `podman network create mcpnet`.

**Config updates:** If service configs reference container names for
inter-component communication (e.g., `vault_grpc = "metacrypt:9443"`),
update them to use the new names (e.g., `vault_grpc = "metacrypt-api:9443"`).
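
The rename can be scripted. The sketch below assumes the reference appears as a quoted `"<name>:<port>"` value, as in the `vault_grpc` example; `rename_container_ref` is a hypothetical helper, and which config files need it depends on the service.

```bash
#!/usr/bin/env bash
# Sketch only: rename_container_ref is a hypothetical helper.
# Rewrites "<old>:<port>" to "<new>:<port>" in a config file. The names are
# used as sed patterns, so this assumes they contain only letters, digits,
# and hyphens.
# Usage: rename_container_ref <config-file> <old-name> <new-name>
rename_container_ref() {
  local file="$1" old="$2" new="$3" tmp rc=0
  tmp="$(mktemp)" || return 1
  if sed "s/\"${old}:/\"${new}:/g" "$file" > "$tmp"; then
    # Write back into the existing file so its owner and mode are kept.
    cat "$tmp" > "$file" || rc=1
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}
```

Matching on the opening quote and the trailing colon keeps the rewrite idempotent: once a value reads `"metacrypt-api:9443"`, a second run with the same arguments no longer matches it.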

**Unseal Metacrypt** after migration; it starts sealed.
## Step 8: Adopt Containers
```bash
mcp adopt metacrypt
mcp adopt mc-proxy
mcp adopt mcr
mcp adopt mcns
```
## Step 9: Export and Complete Service Definitions
```bash
mcp service export metacrypt
mcp service export mc-proxy
mcp service export mcr
mcp service export mcns
```
The exported files contain only the name and image. Edit each file to add
the full container spec: network, ports, volumes, user, restart, cmd.

Then sync to push the complete specs:
```bash
mcp sync
```
## Step 10: Verify
```bash
mcp status
```
All services should show `desired: running` and `observed: running`, with no drift.
## Lessons Learned (from first deployment, 2026-03-26)
- **NixOS systemd sandbox**: `ProtectHome=true` hides `/run/user`, which
rootless podman needs. Use `ProtectHome=false`. `ProtectSystem=strict`
mounts it read-only, which also breaks podman; use `full` instead.
- **PATH**: the agent's systemd unit needs `PATH=/run/current-system/sw/bin`
to find podman.
- **XDG_RUNTIME_DIR**: must be set to `/run/user/<uid>` for rootless podman.
Pin the UID in NixOS config to avoid drift.
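- **Podman ps JSON**: in `podman ps --format json` output, the `Command`
field is a list of strings (`[]string`), not a single `string`.
- **Container naming**: a service name that contains a hyphen, such as
`mc-proxy`, breaks a naive split of `<service>-<component>` on `-`. The
agent uses registry-aware splitting instead.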
- **Token whitespace**: token files with trailing newlines cause gRPC header
errors. The CLI trims whitespace.
- **MCR auth**: rootless podman under a new user can't pull from MCR without
OCI token auth. Workaround: `podman save` + `podman load` to transfer
images.
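
To illustrate the container-naming lesson, here is a minimal sketch of splitting a container name against a list of known service names rather than on the first `-`. It is not the agent's implementation, which this document does not show; `split_container_name` and its output format are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: split_container_name is a hypothetical helper that shows the
# idea of matching against known service names. It is not the agent's code.
# Usage: split_container_name <container-name> <service>...
# Prints "service=<service> component=<component>"; the component is empty
# when the container name is exactly the service name.
split_container_name() {
  local name="$1"
  shift
  local svc best=""
  for svc in "$@"; do
    # A service matches if the name equals it or starts with "<service>-".
    if [ "$name" = "$svc" ] || [ "${name#"$svc"-}" != "$name" ]; then
      # Prefer the longest match so "mc-proxy" wins over a shorter "mc".
      if [ "${#svc}" -gt "${#best}" ]; then
        best="$svc"
      fi
    fi
  done
  # No known service matches this container name.
  [ -n "$best" ] || return 1
  local component="${name#"$best"}"
  component="${component#-}"
  printf 'service=%s component=%s\n' "$best" "$component"
}
```

For example, with the services `metacrypt mc-proxy mcr mcns`, the name `mc-proxy` resolves to the service `mc-proxy` with an empty component, whereas a split on the first `-` would have produced a service called `mc`.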