# MCR Runbook

## Service Overview

MCR (Metacircular Container Registry) is an OCI Distribution Spec-compliant container registry for the Metacircular platform. It stores and serves container images, with authentication delegated to MCIAS and a local policy engine for fine-grained access control.

MCR runs as two containers:

- **mcr-api** -- the registry server. Exposes OCI Distribution endpoints and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on port 9443. Handles blob storage, manifest management, and token-based authentication via MCIAS.
- **mcr-web** -- the web UI. Communicates with mcr-api via gRPC on port 9443. Provides repository/tag browsing and ACL policy management for administrators. Listens on port 8080. Guest accounts are blocked at login; only `admin` and `user` roles can access the web interface.

Both are fronted by MC-Proxy for TLS routing. Metadata is stored in SQLite; blobs are stored as content-addressed files on the filesystem under `/srv/mcr/layers/`.

## Health Checks

### REST

```bash
curl -k https://localhost:8443/v1/health
```

Expected: HTTP 200.

### gRPC

Use the `AdminService.Health` RPC on port 9443. This method is public (no auth required).

### OCI Version Check

```bash
curl -k https://localhost:8443/v2/
```

Expected: HTTP 401 with `WWW-Authenticate` header (confirms the OCI endpoint is alive and responding). An authenticated request returns HTTP 200 with `{}`.

### CLI

```bash
mcrctl status --addr https://localhost:8443
```

Expected output: `ok`

## Common Operations

### Start the Service (MCP)

1. Deploy via MCP:

   ```bash
   mcp deploy mcr
   ```

2. Verify health:

   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### Start the Service (Docker Compose)

1. Verify config exists: `ls /srv/mcr/mcr.toml`
2. Start the containers:

   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml up -d
   ```

3.
   Verify health:

   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### Stop the Service

Via MCP:

```bash
mcp stop mcr
```

Via Docker Compose:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml stop
```

MCR handles SIGTERM gracefully: it stops accepting new connections, drains in-flight requests (including ongoing uploads) for up to 60 seconds, then force-closes remaining connections and exits.

### Restart the Service

Via MCP:

```bash
mcp restart mcr
```

Via Docker Compose:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml restart
```

Verify health after restart:

```bash
curl -k https://localhost:8443/v1/health
```

### Backup (Snapshot)

MCR backups have two parts: the SQLite database (metadata) and the blob filesystem. The database snapshot alone is usable but incomplete -- missing blobs return 404 on pull.

1. Run the snapshot command:

   ```bash
   mcrsrv snapshot --config /srv/mcr/mcr.toml
   ```

2. The snapshot is saved to `/srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:

   ```bash
   ls -lh /srv/mcr/backups/
   ```

4. For a complete backup, also copy the blob directory:

   ```bash
   rsync -a /srv/mcr/layers/ /backup/mcr/layers/
   ```

A systemd timer (`mcr-backup.timer`) runs the database snapshot daily at 02:00 UTC with 5-minute jitter.

### Restore from Snapshot

1. Stop the service (see above).
2. Back up the current database:

   ```bash
   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
   ```

3. Copy the snapshot into place:

   ```bash
   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
   ```

4. If restoring blobs as well:

   ```bash
   rsync -a /backup/mcr/layers/ /srv/mcr/layers/
   ```

5. Start the service (see above).
6. Verify the service is healthy:

   ```bash
   curl -k https://localhost:8443/v1/health
   ```

7.
   Verify an image pull works:

   ```bash
   docker pull mcr.svc.mcp.metacircular.net:8443/:
   ```

### Log Inspection

Container logs (mcr-api):

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api
```

Container logs (mcr-web):

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web
```

Follow logs in real time:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web
```

Via MCP:

```bash
mcp logs mcr
```

MCR logs to stderr as structured text (slog). Log level is configured via `[log] level` in `mcr.toml` (debug, info, warn, error).

### Garbage Collection

Garbage collection removes unreferenced blobs -- blobs no longer referenced by any manifest. GC acquires a registry-wide lock that blocks new blob uploads for the duration of the mark-and-sweep phase. Pulls are not blocked.

1. Trigger GC via CLI:

   ```bash
   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
   ```

2. Check GC status:

   ```bash
   mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
   ```

3. GC can also be triggered via the REST API:

   ```bash
   curl -k -X POST -H "Authorization: Bearer " https://localhost:8443/v1/gc
   ```

If a previous GC crashed after the database sweep but before filesystem cleanup, orphaned files may remain on disk. Run reconciliation to clean them up:

```bash
mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
```

## Incident Procedures

### Database Corruption

Symptoms: server fails to start with SQLite errors, or API requests return unexpected errors.

1. Stop the service.
2. Check for WAL/SHM files alongside the database:

   ```bash
   ls -la /srv/mcr/mcr.db*
   ```

3. Attempt an integrity check:

   ```bash
   sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
   ```

4. If integrity check fails, restore from the most recent snapshot:

   ```bash
   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
   ```

5. Start the service and verify health.
6.
   Note: blobs on the filesystem are unaffected by database corruption. Images pushed after the snapshot was taken will be missing from metadata. Their blobs remain on disk and will be cleaned up by GC unless the metadata is re-created.

### TLS Certificate Expiry

Symptoms: health check fails with TLS errors, Docker clients get certificate verification errors on push/pull.

1. Check certificate expiry:

   ```bash
   openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
   ```

2. Replace the certificate and key files at the paths configured in `mcr.toml` (`[server] tls_cert` and `tls_key`).
3. Restart the service to load the new certificate.
4. Verify health:

   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### MCIAS Outage

Symptoms: push/pull fails with 401 or 502 errors. Authentication cannot complete.

1. Confirm MCIAS is unreachable:

   ```bash
   curl -k https://svc.metacircular.net:8443/v1/health
   ```

2. Cached token validation results remain valid for up to 30 seconds after the last successful MCIAS check. Operations using recently-validated tokens may continue briefly.
3. Once cached tokens expire, all authenticated operations (push, pull, catalog, admin) will fail until MCIAS recovers.
4. The OCI `/v2/` version check endpoint still responds (confirms MCR itself is running).
5. Escalate to MCIAS (see Escalation below).

### Disk Full

Symptoms: blob uploads fail, database writes fail, container may crash.

1. Check disk usage:

   ```bash
   df -h /srv/mcr/
   du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
   ```

2. Clean up stale uploads:

   ```bash
   ls -la /srv/mcr/uploads/
   ```

   Remove upload files that are old and have no matching in-progress upload in the database.
3. Run garbage collection to reclaim unreferenced blobs:

   ```bash
   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
   ```

4. If GC does not free enough space, identify large repositories:

   ```bash
   mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
   ```

5.
   Delete unused tags or repositories to free space, then run GC again.
6. If the disk is completely full and the service cannot start, manually remove orphaned files from `/srv/mcr/uploads/` to free enough space for the service to start, then run GC.

### Image Push/Pull Failures

Symptoms: `docker push` or `docker pull` returns errors.

1. Verify the service is running and healthy:

   ```bash
   curl -k https://localhost:8443/v1/health
   ```

2. Test OCI endpoint:

   ```bash
   curl -k https://localhost:8443/v2/
   ```

   Expected: HTTP 401 with `WWW-Authenticate` header.
3. Test authentication (the URL is quoted so the shell does not treat `?` as a glob character):

   ```bash
   curl -k -u username:password "https://localhost:8443/v2/token?service=mcr"
   ```

   Expected: HTTP 200 with a token response.
4. Check if the issue is policy-related (403 Denied):

   ```bash
   mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
   ```

   Review policy rules for the affected account and repository.
5. Check audit log for denied requests:

   ```bash
   mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
   ```

6. For push failures, verify all referenced blobs exist before pushing the manifest. The error `MANIFEST_BLOB_UNKNOWN` means a layer was not uploaded before the manifest push.
7. Check logs for detailed error information:

   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
   ```

## MCP Deployment

MCR is deployed via MCP as a two-component service on the `rift` node.
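
After a deploy, the `/v2/` check described under Health Checks can serve as a quick smoke test that the OCI endpoint came back. The following is a minimal sketch under stated assumptions: `interpret_v2_status` and `check_v2` are hypothetical helper names (not part of `mcrctl` or `mcp`), the base URL is passed in by the operator, and the status meanings are only those this runbook documents (401 unauthenticated, 200 authenticated). `000` is what curl's `%{http_code}` reports when no HTTP response was received.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a post-deploy smoke check; not part of MCR tooling.

# Map the HTTP status of GET /v2/ to a one-line diagnosis.
interpret_v2_status() {
  case "$1" in
    200) echo "ok: authenticated request accepted" ;;
    401) echo "ok: endpoint alive, authentication required" ;;
    000) echo "fail: no HTTP response (connection or TLS error)" ;;
    *)   echo "fail: unexpected status $1" ;;
  esac
}

# Query <base_url>/v2/ and print the diagnosis.
# -k mirrors the runbook's curl usage; --max-time keeps the check bounded.
check_v2() {
  local base_url="$1" code
  code="$(curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' "${base_url}/v2/")" || true
  interpret_v2_status "$code"
}
```

A possible invocation, once the functions are sourced, is `check_v2 https://localhost:8443`. Keeping the interpretation separate from the network call makes the mapping easy to adjust if other status codes turn out to matter in practice.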
### Service Definition

```toml
name = "mcr"
node = "rift"
active = true

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/mcr:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28080:8080"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
```

### Port Mapping

| Component | Container Port | Host Port | Purpose |
|-----------|----------------|-----------|---------|
| mcr-api | 8443 | 28443 | HTTPS (OCI + admin REST) |
| mcr-api | 9443 | 29443 | gRPC admin API |
| mcr-web | 8080 | 28080 | Web UI (HTTP, behind MC-Proxy) |

Both containers share the `/srv/mcr` volume for configuration, database, and blob storage. They are connected to the `mcpnet` Docker network.

## Escalation

Escalate when:

- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and registry operations are urgently needed.
- Disk full cannot be resolved by GC and cleanup.
- Push/pull failures persist after following the procedures above.
- Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).
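
When escalating a database corruption incident, it helps to report which snapshot was tried. Because snapshot names follow `mcr-YYYYMMDD-HHMMSS.db`, string order matches chronological order, so the newest snapshot can be picked by comparing names. The following is a minimal sketch: `latest_snapshot` is a hypothetical helper, not part of `mcrsrv` or `mcrctl`, and it assumes every snapshot in the directory follows that naming pattern.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of MCR tooling): print the newest snapshot
# in a backup directory, relying on the mcr-YYYYMMDD-HHMMSS.db naming so
# that string comparison of file names equals chronological comparison.
latest_snapshot() {
  local dir="${1:-/srv/mcr/backups}" newest="" f
  for f in "$dir"/mcr-????????-??????.db; do
    [ -e "$f" ] || continue          # the glob matched nothing
    if [[ -z "$newest" || "$f" > "$newest" ]]; then
      newest="$f"
    fi
  done
  [ -n "$newest" ] || return 1       # no snapshot found
  printf '%s\n' "$newest"
}
```

Called without arguments it looks in `/srv/mcr/backups`; a non-zero exit status means no snapshot matching the pattern was found, which is itself worth mentioning when escalating.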