Add RUNBOOK.md and expand README.md

Create operational runbook covering health checks, start/stop/restart
(MCP and Docker Compose), backup/restore, garbage collection, and
incident procedures for database corruption, TLS expiry, MCIAS outage,
disk full, and push/pull failures. Includes MCP service definition
reference for the two-component deployment (mcr-api + mcr-web).

Rewrite README from 2-line stub to full project overview with
quick-start instructions, binary descriptions, port tables, and links
to ARCHITECTURE.md and RUNBOOK.md per engineering standards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
403
RUNBOOK.md
Normal file
@@ -0,0 +1,403 @@
# MCR Runbook

## Service Overview

MCR (Metacircular Container Registry) is an OCI Distribution
Spec-compliant container registry for the Metacircular platform. It
stores and serves container images, with authentication delegated to
MCIAS and a local policy engine for fine-grained access control.

MCR runs as two containers:

- **mcr-api** -- the registry server. Exposes OCI Distribution endpoints
  and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on
  port 9443. Handles blob storage, manifest management, and
  token-based authentication via MCIAS.
- **mcr-web** -- the web UI. Communicates with mcr-api via gRPC on port
  9443. Provides repository/tag browsing and ACL policy management for
  administrators. Listens on port 8080.

Both are fronted by MC-Proxy for TLS routing. Metadata is stored in
SQLite; blobs are stored as content-addressed files on the filesystem
under `/srv/mcr/layers/`.

## Health Checks

### REST

```bash
curl -k https://localhost:8443/v1/health
```

Expected: HTTP 200.

### gRPC

Use the `AdminService.Health` RPC on port 9443. This method is public
(no auth required).

### OCI Version Check

```bash
curl -k https://localhost:8443/v2/
```

Expected: HTTP 401 with `WWW-Authenticate` header (confirms the OCI
endpoint is alive and responding). An authenticated request returns
HTTP 200 with `{}`.

### CLI

```bash
mcrctl status --addr https://localhost:8443
```

Expected output: `ok`

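The REST and OCI checks above can be combined into a small script that compares status codes. This is an illustrative sketch, not part of MCR: the helper names `http_status` and `expect_status` are hypothetical, and only the URLs and expected codes come from this runbook.

```bash
#!/usr/bin/env bash
# Sketch: compare HTTP status codes against the values this runbook expects.
# http_status and expect_status are hypothetical helper names.

# Print the HTTP status code for a URL. curl prints 000 when no HTTP
# response was received (for example, connection refused).
http_status() {
  curl -k -s -o /dev/null -w '%{http_code}' "$1"
}

# Succeed only if the actual status equals the expected one.
expect_status() {
  local expected="$1" actual="$2" label="$3"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $label returned $actual"
  else
    echo "FAIL: $label returned $actual, expected $expected" >&2
    return 1
  fi
}

# Example invocation (run on the registry host):
#   expect_status 200 "$(http_status https://localhost:8443/v1/health)" "/v1/health"
#   expect_status 401 "$(http_status https://localhost:8443/v2/)" "/v2/ (unauthenticated)"
```

Checking the status code explicitly matters because plain `curl` exits 0 on a 401 or 500 response, so the exit code alone does not distinguish a healthy endpoint from a failing one.
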
## Common Operations

### Start the Service (MCP)

1. Deploy via MCP:
   ```bash
   mcp deploy mcr
   ```
2. Verify health:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### Start the Service (Docker Compose)

1. Verify config exists: `ls /srv/mcr/mcr.toml`
2. Start the containers:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml up -d
   ```
3. Verify health:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### Stop the Service

Via MCP:

```bash
mcp stop mcr
```

Via Docker Compose:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml stop
```

MCR handles SIGTERM gracefully: it stops accepting new connections,
drains in-flight requests (including ongoing uploads) for up to 60
seconds, then force-closes remaining connections and exits.

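One thing to watch on the Docker Compose path: unless the compose file sets `stop_grace_period`, `docker compose stop` waits 10 seconds after SIGTERM before sending SIGKILL, which is shorter than the 60-second drain window described above. The sketch below builds a stop command with a longer `--timeout`. The 10-second margin and the `compose_stop_cmd` helper name are assumptions, and whether `mcp stop` applies a similar grace period is not stated in this runbook.

```bash
#!/usr/bin/env bash
# Sketch: build a stop command whose timeout covers MCR's drain window.
# compose_stop_cmd is a hypothetical helper name.
COMPOSE_FILE="deploy/docker/docker-compose-rift.yml"
DRAIN_SECONDS=60   # drain window stated in this runbook
MARGIN_SECONDS=10  # assumed safety margin

# Print the command rather than running it, so it can be reviewed first.
compose_stop_cmd() {
  local timeout=$((DRAIN_SECONDS + MARGIN_SECONDS))
  echo "docker compose -f ${COMPOSE_FILE} stop --timeout ${timeout}"
}

compose_stop_cmd
```
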
### Restart the Service

Via MCP:

```bash
mcp restart mcr
```

Via Docker Compose:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml restart
```

Verify health after restart:

```bash
curl -k https://localhost:8443/v1/health
```

### Backup (Snapshot)

MCR backups have two parts: the SQLite database (metadata) and the blob
filesystem. The database snapshot alone is usable but incomplete --
missing blobs return 404 on pull.

1. Run the snapshot command:
   ```bash
   mcrsrv snapshot --config /srv/mcr/mcr.toml
   ```
2. The snapshot is saved to `/srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:
   ```bash
   ls -lh /srv/mcr/backups/
   ```
4. For a complete backup, also copy the blob directory:
   ```bash
   rsync -a /srv/mcr/layers/ /backup/mcr/layers/
   ```

A systemd timer (`mcr-backup.timer`) runs the database snapshot daily
at 02:00 UTC with 5-minute jitter.

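Step 3 above can be automated with a short check that finds the newest snapshot and confirms it is not empty. This is a sketch only: `latest_snapshot` and `snapshot_ok` are hypothetical helper names, and "non-empty" is a weaker test than the "reasonable size" the runbook asks for.

```bash
#!/usr/bin/env bash
# Sketch: find the newest snapshot and confirm it is not empty.
# latest_snapshot and snapshot_ok are hypothetical helper names.

# Print the path of the newest mcr-YYYYMMDD-HHMMSS.db file in a directory.
# The timestamp format sorts lexicographically, so a plain sort is enough.
latest_snapshot() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f -name 'mcr-*.db' | sort | tail -n 1
}

# Succeed if the newest snapshot exists and has a non-zero size.
snapshot_ok() {
  local snap
  snap="$(latest_snapshot "$1")"
  [ -n "$snap" ] && [ -s "$snap" ]
}

# Example:
#   snapshot_ok /srv/mcr/backups || echo "no usable snapshot found" >&2
```
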
### Restore from Snapshot

1. Stop the service (see above).
2. Back up the current database:
   ```bash
   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
   ```
3. Copy the snapshot into place:
   ```bash
   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
   ```
4. If restoring blobs as well:
   ```bash
   rsync -a /backup/mcr/layers/ /srv/mcr/layers/
   ```
5. Start the service (see above).
6. Verify the service is healthy:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```
7. Verify an image pull works:
   ```bash
   docker pull mcr.svc.mcp.metacircular.net:8443/<repo>:<tag>
   ```

### Log Inspection

Container logs (mcr-api):

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api
```

Container logs (mcr-web):

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web
```

Follow logs in real time:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web
```

Via MCP:

```bash
mcp logs mcr
```

MCR logs to stderr as structured text (slog). Log level is configured
via `[log] level` in `mcr.toml` (debug, info, warn, error).

### Garbage Collection

Garbage collection removes unreferenced blobs -- blobs no longer
referenced by any manifest. GC acquires a registry-wide lock that
blocks new blob uploads for the duration of the mark-and-sweep phase.
Pulls are not blocked.

1. Trigger GC via CLI:
   ```bash
   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
2. Check GC status:
   ```bash
   mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
3. GC can also be triggered via the REST API:
   ```bash
   curl -k -X POST -H "Authorization: Bearer <token>" https://localhost:8443/v1/gc
   ```

If a previous GC crashed after the database sweep but before filesystem
cleanup, orphaned files may remain on disk. Run reconciliation to clean
them up:

```bash
mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
```

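To see how much space a GC run actually reclaimed, the blob directory can be measured before and after. The sketch below is a convenience wrapper with hypothetical helper names (`dir_kib`, `gc_and_report`). It assumes `mcrctl gc` has finished its sweep by the time it returns; if the command only triggers GC, wait for `mcrctl gc status` to report completion before taking the second measurement.

```bash
#!/usr/bin/env bash
# Sketch: report how much space a GC run reclaimed from the blob directory.
# dir_kib and gc_and_report are hypothetical helper names.

# Print the size of a directory tree in KiB.
dir_kib() {
  du -sk "$1" | awk '{print $1}'
}

# Measure the blob directory, run GC, then measure again.
# Assumption: mcrctl gc returns only after the sweep has completed.
gc_and_report() {
  local layers="/srv/mcr/layers"
  local addr="https://mcr.svc.mcp.metacircular.net:8443"
  local before after
  before="$(dir_kib "$layers")"
  mcrctl gc --addr "$addr" || return 1
  after="$(dir_kib "$layers")"
  echo "reclaimed $((before - after)) KiB"
}
```
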
## Incident Procedures

### Database Corruption

Symptoms: server fails to start with SQLite errors, or API requests
return unexpected errors.

1. Stop the service.
2. Check for WAL/SHM files alongside the database:
   ```bash
   ls -la /srv/mcr/mcr.db*
   ```
3. Attempt an integrity check:
   ```bash
   sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
   ```
4. If integrity check fails, restore from the most recent snapshot:
   ```bash
   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
   ```
5. Start the service and verify health.
6. Note: blobs on the filesystem are unaffected by database corruption.
   Images pushed after the snapshot was taken will be missing from
   metadata. Their blobs remain on disk and will be cleaned up by GC
   unless the metadata is re-created.

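For step 3, SQLite prints a single line `ok` when `PRAGMA integrity_check` finds no problems and one line per problem otherwise. A minimal sketch that turns that output into a pass/fail result follows; `integrity_output_ok` and `db_is_healthy` are hypothetical helper names, not MCR tooling.

```bash
#!/usr/bin/env bash
# Sketch: treat anything other than a single "ok" line as corruption.
# integrity_output_ok and db_is_healthy are hypothetical helper names.

# PRAGMA integrity_check prints exactly "ok" when no problems are found;
# otherwise it prints one line per problem.
integrity_output_ok() {
  [ "$1" = "ok" ]
}

# Run the check against a database file (stop the service first).
db_is_healthy() {
  local out
  out="$(sqlite3 "$1" 'PRAGMA integrity_check;' 2>&1)" || return 1
  integrity_output_ok "$out"
}

# Example:
#   db_is_healthy /srv/mcr/mcr.db || echo "restore from snapshot" >&2
```
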
### TLS Certificate Expiry

Symptoms: health check fails with TLS errors, Docker clients get
certificate verification errors on push/pull.

1. Check certificate expiry:
   ```bash
   openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
   ```
2. Replace the certificate and key files at the paths configured in
   `mcr.toml` (`[server] tls_cert` and `tls_key`).
3. Restart the service to load the new certificate.
4. Verify health:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```

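Because `curl -k` skips certificate verification, the health check above will not notice an expired or soon-to-expire certificate. One way to catch expiry ahead of time is OpenSSL's `-checkend` option, sketched below. The `cert_expires_within` helper name and the 30-day threshold in the example are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: warn before the certificate expires.
# cert_expires_within is a hypothetical helper name.

# Succeed (exit 0) if the certificate expires within the given number of days.
# openssl's -checkend exits 0 when the certificate is still valid at that
# point in the future and 1 when it will have expired, so the result is
# inverted here. Returns 2 if the file cannot be read.
cert_expires_within() {
  local cert="$1" days="$2"
  [ -r "$cert" ] || return 2
  ! openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null
}

# Example:
#   cert_expires_within /srv/mcr/certs/cert.pem 30 && echo "renew soon"
```
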
### MCIAS Outage

Symptoms: push/pull fails with 401 or 502 errors. Authentication
cannot complete.

1. Confirm MCIAS is unreachable:
   ```bash
   curl -k https://svc.metacircular.net:8443/v1/health
   ```
2. Cached token validation results remain valid for up to 30 seconds
   after the last successful MCIAS check. Operations using
   recently-validated tokens may continue briefly.
3. Once cached tokens expire, all authenticated operations (push, pull,
   catalog, admin) will fail until MCIAS recovers.
4. The OCI `/v2/` version check endpoint still responds (confirms MCR
   itself is running).
5. Escalate to MCIAS (see Escalation below).

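While waiting for MCIAS to recover, a simple retry loop can report when its health endpoint starts answering again. The sketch below is a generic helper, not part of MCR or MCIAS: `wait_until` is a hypothetical name, and the attempt count and delay in the example are arbitrary.

```bash
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or attempts run out.
# wait_until is a hypothetical helper name.

# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: poll the MCIAS health endpoint every 10 seconds, up to 30 times.
# curl -f makes HTTP errors (status 400 and above) count as failures.
#   wait_until 30 10 curl -k -s -f -o /dev/null https://svc.metacircular.net:8443/v1/health
```
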
### Disk Full

Symptoms: blob uploads fail, database writes fail, container may crash.

1. Check disk usage:
   ```bash
   df -h /srv/mcr/
   du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
   ```
2. Clean up stale uploads:
   ```bash
   ls -la /srv/mcr/uploads/
   ```
   Remove upload files that are old and have no matching in-progress
   upload in the database.
3. Run garbage collection to reclaim unreferenced blobs:
   ```bash
   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
4. If GC does not free enough space, identify large repositories:
   ```bash
   mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
5. Delete unused tags or repositories to free space, then run GC again.
6. If the disk is completely full and the service cannot start, manually
   remove orphaned files from `/srv/mcr/uploads/` to free enough space
   for the service to start, then run GC.

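For step 2, `find` can narrow the upload directory down to files that have not been modified recently. The sketch below only lists candidates and deletes nothing. `list_stale_uploads` is a hypothetical helper name and the one-day threshold in the example is an assumption. Age alone is not sufficient: each candidate still needs to be checked against in-progress uploads in the database, as the step describes.

```bash
#!/usr/bin/env bash
# Sketch: list upload files not modified for more than a given number of days.
# list_stale_uploads is a hypothetical helper name. It only lists files;
# nothing is deleted.

list_stale_uploads() {
  local dir="$1" days="$2"
  # -mtime +N matches files whose age in whole days is greater than N
  # (find rounds age down, so +1 means at least two full days old).
  find "$dir" -type f -mtime +"$days" | sort
}

# Example:
#   list_stale_uploads /srv/mcr/uploads 1
```
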
### Image Push/Pull Failures

Symptoms: `docker push` or `docker pull` returns errors.

1. Verify the service is running and healthy:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```
2. Test OCI endpoint:
   ```bash
   curl -k https://localhost:8443/v2/
   ```
   Expected: HTTP 401 with `WWW-Authenticate` header.
3. Test authentication (the URL is quoted so the shell does not treat
   `?` as a glob character):
   ```bash
   curl -k -u username:password "https://localhost:8443/v2/token?service=mcr"
   ```
   Expected: HTTP 200 with a token response.
4. Check if the issue is policy-related (403 Denied):
   ```bash
   mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
   Review policy rules for the affected account and repository.
5. Check audit log for denied requests:
   ```bash
   mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
6. For push failures, verify all referenced blobs exist before pushing
   the manifest. The error `MANIFEST_BLOB_UNKNOWN` means a layer was
   not uploaded before the manifest push.
7. Check logs for detailed error information:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
   ```

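The steps above amount to a decision tree keyed on the HTTP status the client received. A rough sketch of that mapping follows. `triage_status` is a hypothetical helper, and the mapping paraphrases this runbook rather than describing behavior built into MCR.

```bash
#!/usr/bin/env bash
# Sketch: map an HTTP status from a failed push/pull to a next step.
# triage_status is a hypothetical helper; the mapping paraphrases this runbook.

triage_status() {
  case "$1" in
    000) echo "no HTTP response: check that the service is running" ;;
    401) echo "authentication: test the token endpoint, then see MCIAS Outage" ;;
    403) echo "policy: review mcrctl policy list and the audit log" ;;
    502) echo "upstream failure: see MCIAS Outage" ;;
    5??) echo "server error: check mcr-api logs" ;;
    *)   echo "unmapped status $1: check mcr-api logs" ;;
  esac
}

# Example:
#   triage_status 403
```
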
## MCP Deployment

MCR is deployed via MCP as a two-component service on the `rift` node.

### Service Definition

```toml
name = "mcr"
node = "rift"
active = true

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/mcr:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28080:8080"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
```

### Port Mapping

| Component | Container Port | Host Port | Purpose |
|-----------|----------------|-----------|---------|
| mcr-api | 8443 | 28443 | HTTPS (OCI + admin REST) |
| mcr-api | 9443 | 29443 | gRPC admin API |
| mcr-web | 8080 | 28080 | Web UI (HTTP, behind MC-Proxy) |

Both containers share the `/srv/mcr` volume for configuration, database,
and blob storage. They are connected to the `mcpnet` Docker network.

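Because the service definition publishes container port 8443 on host port 28443, bound to 127.0.0.1, a health check that bypasses MC-Proxy on the `rift` node would target the host port. The sketch below derives that URL from the table. The `host_health_url` helper is hypothetical, and whether `localhost:8443` is also reachable on the node depends on how MC-Proxy is configured, which this runbook does not state.

```bash
#!/usr/bin/env bash
# Sketch: build the direct (non-proxied) health URL from the port table.
# host_health_url is a hypothetical helper name.
API_HOST_PORT=28443  # host side of the 127.0.0.1:28443:8443 mapping

host_health_url() {
  echo "https://127.0.0.1:${API_HOST_PORT}/v1/health"
}

# Example (run on the rift node):
#   curl -k "$(host_health_url)"
```
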
## Escalation

Escalate when:

- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and registry operations are urgently needed.
- Disk full cannot be resolved by GC and cleanup.
- Push/pull failures persist after following the procedures above.
- Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).