mcr/RUNBOOK.md
Kyle Isom acc4851549 Update RUNBOOK MCP example to use pinned version tags
Replace :latest with :v1.1.0 in the MCP service definition example
to match the new platform convention of explicit version pinning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:31:02 -07:00

# MCR Runbook
## Service Overview
MCR (Metacircular Container Registry) is an OCI Distribution
Spec-compliant container registry for the Metacircular platform. It
stores and serves container images, with authentication delegated to
MCIAS and a local policy engine for fine-grained access control.
MCR runs as two containers:
- **mcr-api** -- the registry server. Exposes OCI Distribution endpoints
and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on
port 9443. Handles blob storage, manifest management, and
token-based authentication via MCIAS.
- **mcr-web** -- the web UI. Communicates with mcr-api via gRPC on port
9443. Provides repository/tag browsing and ACL policy management for
administrators. Listens on port 8080. Guest accounts are blocked at
login; only `admin` and `user` roles can access the web interface.
Both are fronted by MC-Proxy for TLS routing. Metadata is stored in
SQLite; blobs are stored as content-addressed files on the filesystem
under `/srv/mcr/layers/`.
## Health Checks
### REST
```bash
curl -k https://localhost:8443/v1/health
```
Expected: HTTP 200.
### gRPC
Use the `AdminService.Health` RPC on port 9443. This method is public
(no auth required).
### OCI Version Check
```bash
curl -k https://localhost:8443/v2/
```
Expected: HTTP 401 with a `WWW-Authenticate` header, which confirms the
OCI endpoint is alive and responding. An authenticated request returns
HTTP 200 with `{}`.
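
The expected responses can also be checked from a script. The following is a minimal sketch, assuming `curl` is installed; `v2_http_code` and `v2_describe` are hypothetical helper names, not part of `mcrctl`.

```bash
# Print only the HTTP status code returned by the OCI /v2/ endpoint.
# The base URL defaults to the address used elsewhere in this runbook.
v2_http_code() {
    local base="${1:-https://localhost:8443}"
    curl -k -s -o /dev/null -w '%{http_code}' "${base}/v2/"
}

# Map a status code to the interpretation given in this section.
v2_describe() {
    case "$1" in
        401) echo "alive (unauthenticated request challenged)" ;;
        200) echo "alive (authenticated request accepted)" ;;
        000) echo "no response (connection or TLS failure)" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}
```

Usage: `v2_describe "$(v2_http_code)"`. curl reports `000` when no HTTP response was received at all.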
### CLI
```bash
mcrctl status --addr https://localhost:8443
```
Expected output: `ok`
## Common Operations
### Start the Service (MCP)
1. Deploy via MCP:
```bash
mcp deploy mcr
```
2. Verify health:
```bash
curl -k https://localhost:8443/v1/health
```
### Start the Service (Docker Compose)
1. Verify config exists: `ls /srv/mcr/mcr.toml`
2. Start the containers:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml up -d
```
3. Verify health:
```bash
curl -k https://localhost:8443/v1/health
```
### Stop the Service
Via MCP:
```bash
mcp stop mcr
```
Via Docker Compose:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml stop
```
MCR handles SIGTERM gracefully: it stops accepting new connections,
drains in-flight requests (including ongoing uploads) for up to 60
seconds, then force-closes remaining connections and exits.
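
`docker compose stop` waits 10 seconds by default before killing a container, which is shorter than the 60-second drain window described above. The following is a minimal sketch of a wrapper that passes a longer timeout. The function name `stop_mcr` and the 70-second default are assumptions; whether `mcp stop` applies its own timeout is not covered here.

```bash
# Stop the Compose-managed containers, waiting longer than the drain
# window before Docker escalates from SIGTERM to SIGKILL.
stop_mcr() {
    local grace="${1:-70}"   # seconds; assumed margin above the 60 s drain
    docker compose -f deploy/docker/docker-compose-rift.yml stop --timeout "$grace"
}
```

Usage: `stop_mcr` for the default, or `stop_mcr 90` to allow more time for large uploads.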
### Restart the Service
Via MCP:
```bash
mcp restart mcr
```
Via Docker Compose:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml restart
```
Verify health after restart:
```bash
curl -k https://localhost:8443/v1/health
```
### Backup (Snapshot)
MCR backups have two parts: the SQLite database (metadata) and the blob
filesystem. The database snapshot alone is usable but incomplete --
missing blobs return 404 on pull.
1. Run the snapshot command:
```bash
mcrsrv snapshot --config /srv/mcr/mcr.toml
```
2. The snapshot is saved to `/srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:
```bash
ls -lh /srv/mcr/backups/
```
4. For a complete backup, also copy the blob directory:
```bash
rsync -a /srv/mcr/layers/ /backup/mcr/layers/
```
A systemd timer (`mcr-backup.timer`) runs the database snapshot daily
at 02:00 UTC with 5-minute jitter.
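
Because the timer runs unattended, it can help to confirm that the newest snapshot exists and is not empty. The following is a minimal sketch, assuming the `mcr-YYYYMMDD-HHMMSS.db` naming shown above; `latest_snapshot` and `snapshot_looks_sane` are hypothetical helper names.

```bash
# Print the newest snapshot in a backup directory. The timestamped
# names sort chronologically, so the last glob match is the newest.
latest_snapshot() {
    local dir="${1:-/srv/mcr/backups}"
    local newest=""
    local f
    for f in "$dir"/mcr-*.db; do
        [ -e "$f" ] || continue   # the glob matched nothing
        newest="$f"               # globs expand in sorted order
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Succeed only if the newest snapshot exists and is not empty.
snapshot_looks_sane() {
    local snap
    snap="$(latest_snapshot "$1")" || return 1
    [ -s "$snap" ]
}
```

Usage: `snapshot_looks_sane || echo "no usable snapshot in /srv/mcr/backups"`. A non-empty file is only a coarse check; it does not prove the snapshot is a valid database.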
### Restore from Snapshot
1. Stop the service (see above).
2. Back up the current database:
```bash
cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
```
3. Copy the snapshot into place:
```bash
cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
```
4. If restoring blobs as well:
```bash
rsync -a /backup/mcr/layers/ /srv/mcr/layers/
```
5. Start the service (see above).
6. Verify the service is healthy:
```bash
curl -k https://localhost:8443/v1/health
```
7. Verify an image pull works:
```bash
docker pull mcr.svc.mcp.metacircular.net:8443/<repo>:<tag>
```
### Log Inspection
Container logs (mcr-api):
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api
```
Container logs (mcr-web):
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web
```
Follow logs in real time:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web
```
Via MCP:
```bash
mcp logs mcr
```
MCR logs to stderr as structured text (slog). Log level is configured
via `[log] level` in `mcr.toml` (debug, info, warn, error).
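
When scanning a long log for problems, filtering by level is often enough. The following is a minimal sketch that assumes MCR uses the default slog text layout, in which each record carries a `level=LEVEL` attribute; `warn_and_above` is a hypothetical helper name.

```bash
# Keep only WARN and ERROR records from slog text output on stdin.
# The pattern also matches lines that Docker Compose has prefixed
# with the container name.
warn_and_above() {
    grep -E '(^| )level=(WARN|ERROR)'
}
```

Usage: `docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 1000 mcr-api | warn_and_above`.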
### Garbage Collection
Garbage collection removes unreferenced blobs -- blobs no longer
referenced by any manifest. GC acquires a registry-wide lock that
blocks new blob uploads for the duration of the mark-and-sweep phase.
Pulls are not blocked.
1. Trigger GC via CLI:
```bash
mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
```
2. Check GC status:
```bash
mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
```
3. GC can also be triggered via the REST API:
```bash
curl -k -X POST -H "Authorization: Bearer <token>" https://localhost:8443/v1/gc
```
If a previous GC crashed after the database sweep but before filesystem
cleanup, orphaned files may remain on disk. Run reconciliation to clean
them up:
```bash
mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
```
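
To make the idea of an unreferenced blob concrete, the sweep can be thought of as a set difference between the digests on disk and the digests referenced by manifests. The following sketch is an illustration only and is not how MCR implements GC. The two input files (one digest per line) and the function name `unreferenced_blobs` are hypothetical.

```bash
# Print digests that appear in the on-disk list but not in the
# referenced list. Both arguments are files with one digest per line.
unreferenced_blobs() {
    local on_disk="$1" referenced="$2"
    # -v invert the match, -x match whole lines, -F fixed strings,
    # -f read the patterns (the referenced digests) from a file.
    grep -vxF -f "$referenced" "$on_disk" | sort -u
}
```

Use `mcrctl gc` for real cleanup; removing blob files by hand bypasses the registry-wide lock described above.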
## Incident Procedures
### Database Corruption
Symptoms: server fails to start with SQLite errors, or API requests
return unexpected errors.
1. Stop the service.
2. Check for WAL/SHM files alongside the database:
```bash
ls -la /srv/mcr/mcr.db*
```
3. Attempt an integrity check:
```bash
sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
```
4. If integrity check fails, restore from the most recent snapshot:
```bash
cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
```
5. Start the service and verify health.
6. Note: blobs on the filesystem are unaffected by database corruption.
Images pushed after the snapshot was taken will be missing from
metadata. Their blobs remain on disk and will be cleaned up by GC
unless the metadata is re-created.
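
The integrity check in step 3 can be wrapped so that its result is usable in a script. The following is a minimal sketch, assuming the `sqlite3` CLI is installed; `db_is_intact` is a hypothetical helper name.

```bash
# Succeed only if SQLite reports the database as intact.
# PRAGMA integrity_check prints the single line "ok" on success and
# one or more problem descriptions otherwise.
db_is_intact() {
    local db="${1:-/srv/mcr/mcr.db}"
    [ "$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>/dev/null)" = "ok" ]
}
```

Usage: `db_is_intact || echo "integrity check failed; restore from snapshot"`. The same function can be pointed at a snapshot file to check it before copying it into place.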
### TLS Certificate Expiry
Symptoms: health check fails with TLS errors, Docker clients get
certificate verification errors on push/pull.
1. Check certificate expiry:
```bash
openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
```
2. Replace the certificate and key files at the paths configured in
`mcr.toml` (`[server] tls_cert` and `tls_key`).
3. Restart the service to load the new certificate.
4. Verify health:
```bash
curl -k https://localhost:8443/v1/health
```
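
Expiry can also be caught ahead of time with the `-checkend` option of `openssl x509`. The following is a minimal sketch; the function name `cert_valid_for_days` and the 30-day default are assumptions.

```bash
# Succeed if the certificate will still be valid N days from now.
cert_valid_for_days() {
    local cert="$1" days="${2:-30}"
    # -checkend takes a number of seconds and exits non-zero if the
    # certificate will have expired by then.
    openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}
```

Usage: `cert_valid_for_days /srv/mcr/certs/cert.pem 14 || echo "certificate expires within 14 days"`.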
### MCIAS Outage
Symptoms: push/pull fails with 401 or 502 errors. Authentication
cannot complete.
1. Confirm MCIAS is unreachable:
```bash
curl -k https://svc.metacircular.net:8443/v1/health
```
2. Cached token validation results remain valid for up to 30 seconds
after the last successful MCIAS check. Operations using
recently-validated tokens may continue briefly.
3. Once cached tokens expire, all authenticated operations (push, pull,
catalog, admin) will fail until MCIAS recovers.
4. The OCI `/v2/` version check endpoint still responds (confirms MCR
itself is running).
5. Escalate to MCIAS (see Escalation below).
### Disk Full
Symptoms: blob uploads fail, database writes fail, container may crash.
1. Check disk usage:
```bash
df -h /srv/mcr/
du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
```
2. Clean up stale uploads:
```bash
ls -la /srv/mcr/uploads/
```
Remove upload files that are old and have no matching in-progress
upload in the database.
3. Run garbage collection to reclaim unreferenced blobs:
```bash
mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
```
4. If GC does not free enough space, identify large repositories:
```bash
mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
```
5. Delete unused tags or repositories to free space, then run GC again.
6. If the disk is completely full and the service cannot start, manually
remove orphaned files from `/srv/mcr/uploads/` to free enough space
for the service to start, then run GC.
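
Candidate stale uploads can be listed by age before anything is removed. The following is a minimal sketch, assuming a `find` that supports `-mmin` (GNU and BSD both do); the function name `stale_uploads` and the 24-hour default are assumptions.

```bash
# List upload files not modified in the last N minutes (default 24 h).
# This only lists candidates; it does not delete anything.
stale_uploads() {
    local dir="${1:-/srv/mcr/uploads}" minutes="${2:-1440}"
    find "$dir" -type f -mmin "+${minutes}" -print
}
```

Age alone does not prove that a file has no in-progress upload in the database, so review the list against the database before removing files.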
### Image Push/Pull Failures
Symptoms: `docker push` or `docker pull` returns errors.
1. Verify the service is running and healthy:
```bash
curl -k https://localhost:8443/v1/health
```
2. Test OCI endpoint:
```bash
curl -k https://localhost:8443/v2/
```
Expected: HTTP 401 with a `WWW-Authenticate` header.
3. Test authentication:
```bash
curl -k -u username:password "https://localhost:8443/v2/token?service=mcr"
```
Expected: HTTP 200 with a token response.
4. Check if the issue is policy-related (403 Denied):
```bash
mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
```
Review policy rules for the affected account and repository.
5. Check audit log for denied requests:
```bash
mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
```
6. For push failures, verify all referenced blobs exist before pushing
the manifest. The error `MANIFEST_BLOB_UNKNOWN` means a layer was
not uploaded before the manifest push.
7. Check logs for detailed error information:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
```
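
The HTTP status a client reports is usually enough to choose where to start in this runbook. The following sketch encodes that mapping from the symptoms listed in this document; `triage_status` is a hypothetical helper name, and the mapping is a starting point rather than a diagnosis.

```bash
# Suggest the next runbook step for an HTTP status seen by a client.
triage_status() {
    case "$1" in
        401) echo "authentication: test the token endpoint, then check MCIAS" ;;
        403) echo "policy: review rules with mcrctl policy list and the audit log" ;;
        502) echo "upstream: check MCIAS health" ;;
        *)   echo "unclassified: check mcr-api logs" ;;
    esac
}
```

Usage: `triage_status 403`.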
## MCP Deployment
MCR is deployed via MCP as a two-component service on the `rift` node.
### Service Definition
```toml
name = "mcr"
node = "rift"
active = true
[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/mcr:v1.1.0"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.1.0"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28080:8080"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
```
### Port Mapping
| Component | Container Port | Host Port | Purpose |
|-----------|---------------|-----------|---------|
| mcr-api | 8443 | 28443 | HTTPS (OCI + admin REST) |
| mcr-api | 9443 | 29443 | gRPC admin API |
| mcr-web | 8080 | 28080 | Web UI (HTTP, behind MC-Proxy) |

Both containers share the `/srv/mcr` volume for configuration, database,
and blob storage. They are connected to the `mcpnet` Docker network.
## Escalation
Escalate when:
- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and registry operations are urgently needed.
- Disk full cannot be resolved by GC and cleanup.
- Push/pull failures persist after following the procedures above.
- Any issue not covered by this runbook.
Escalation path: Kyle (platform owner).