mcr/RUNBOOK.md
Kyle Isom acc4851549 Update RUNBOOK MCP example to use pinned version tags
Replace :latest with :v1.1.0 in the MCP service definition example
to match the new platform convention of explicit version pinning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:31:02 -07:00


MCR Runbook

Service Overview

MCR (Metacircular Container Registry) is an OCI Distribution Spec-compliant container registry for the Metacircular platform. It stores and serves container images, with authentication delegated to MCIAS and a local policy engine for fine-grained access control.

MCR runs as two containers:

  • mcr-api -- the registry server. Exposes OCI Distribution endpoints and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on port 9443. Handles blob storage, manifest management, and token-based authentication via MCIAS.
  • mcr-web -- the web UI. Communicates with mcr-api via gRPC on port 9443. Provides repository/tag browsing and ACL policy management for administrators. Listens on port 8080. Guest accounts are blocked at login; only admin and user roles can access the web interface.

Both are fronted by MC-Proxy for TLS routing. Metadata is stored in SQLite; blobs are stored as content-addressed files on the filesystem under /srv/mcr/layers/.

Health Checks

REST

curl -k https://localhost:8443/v1/health

Expected: HTTP 200.

gRPC

Use the AdminService.Health RPC on port 9443. This method is public (no auth required).

OCI Version Check

curl -k https://localhost:8443/v2/

Expected: HTTP 401 with a WWW-Authenticate header, which confirms the OCI endpoint is alive and responding. An authenticated request returns HTTP 200 with an empty JSON object ({}).

CLI

mcrctl status --addr https://localhost:8443

Expected output: ok
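
The REST and OCI checks above can be scripted. The following is a minimal sketch, not part of MCR: the function names http_status and classify_v2 are hypothetical, and it assumes curl is installed on the host.

```sh
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of MCR.

# Print only the HTTP status code for a URL. curl prints 000 when the
# connection itself fails. -k skips certificate verification, as in
# the commands above.
http_status() {
    curl -k -s -o /dev/null -w '%{http_code}' "$1"
}

# Interpret the status code returned by the OCI /v2/ endpoint.
classify_v2() {
    case "$1" in
        401) echo "alive (unauthenticated request)" ;;
        200) echo "alive (authenticated request)" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}

# Usage:
#   classify_v2 "$(http_status https://localhost:8443/v2/)"
```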

Common Operations

Start the Service (MCP)

  1. Deploy via MCP:
    mcp deploy mcr
    
  2. Verify health:
    curl -k https://localhost:8443/v1/health
    

Start the Service (Docker Compose)

  1. Verify config exists: ls /srv/mcr/mcr.toml
  2. Start the containers:
    docker compose -f deploy/docker/docker-compose-rift.yml up -d
    
  3. Verify health:
    curl -k https://localhost:8443/v1/health
    

Stop the Service

Via MCP:

mcp stop mcr

Via Docker Compose:

docker compose -f deploy/docker/docker-compose-rift.yml stop

MCR handles SIGTERM gracefully: it stops accepting new connections, drains in-flight requests (including ongoing uploads) for up to 60 seconds, then force-closes remaining connections and exits.
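
One thing worth checking: docker compose stop waits 10 seconds by default before sending SIGKILL, unless the compose file sets stop_grace_period. Whether docker-compose-rift.yml does so is not stated here, so the sketch below prints a stop command with an explicit timeout that covers the 60-second drain. The function name and the 10-second margin are assumptions.

```sh
#!/bin/sh
# Print a stop command whose timeout covers MCR's drain window.
#   $1: seconds MCR may spend draining (60 per this runbook)
#   $2: extra seconds before Docker sends SIGKILL (assumed value)
compose_stop_cmd() {
    drain=$1
    margin=$2
    echo "docker compose -f deploy/docker/docker-compose-rift.yml stop -t $((drain + margin))"
}

# Prints the command with "-t 70"; run the printed command to stop the service.
compose_stop_cmd 60 10
```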

Restart the Service

Via MCP:

mcp restart mcr

Via Docker Compose:

docker compose -f deploy/docker/docker-compose-rift.yml restart

Verify health after restart:

curl -k https://localhost:8443/v1/health

Backup (Snapshot)

MCR backups have two parts: the SQLite database (metadata) and the blob filesystem. A database snapshot alone is usable but incomplete: pulls of any image whose blobs are missing from the filesystem return 404.

  1. Run the snapshot command:
    mcrsrv snapshot --config /srv/mcr/mcr.toml
    
  2. The snapshot is saved to /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db.
  3. Verify the snapshot file exists and has a reasonable size:
    ls -lh /srv/mcr/backups/
    
  4. For a complete backup, also copy the blob directory:
    rsync -a /srv/mcr/layers/ /backup/mcr/layers/
    

A systemd timer (mcr-backup.timer) runs the database snapshot daily at 02:00 UTC with 5-minute jitter.
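
To confirm that the timer is producing snapshots, it helps to find the newest non-empty snapshot file. The sketch below is one way to do that; latest_snapshot is a hypothetical helper, and it relies only on the mcr-YYYYMMDD-HHMMSS.db naming described above.

```sh
#!/bin/sh
# Print the newest non-empty snapshot in a directory; return non-zero
# if there is none. Because names follow mcr-YYYYMMDD-HHMMSS.db, the
# shell's sorted glob expansion is also chronological order.
latest_snapshot() {
    dir=$1
    newest=""
    for f in "$dir"/mcr-*.db; do
        [ -s "$f" ] || continue   # skips empty files and an unmatched glob
        newest=$f
    done
    [ -n "$newest" ] && echo "$newest"
}

# Usage:
#   latest_snapshot /srv/mcr/backups
```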

Restore from Snapshot

  1. Stop the service (see above).
  2. Back up the current database:
    cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
    
  3. Copy the snapshot into place:
    cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
    
  4. If restoring blobs as well:
    rsync -a /backup/mcr/layers/ /srv/mcr/layers/
    
  5. Start the service (see above).
  6. Verify the service is healthy:
    curl -k https://localhost:8443/v1/health
    
  7. Verify an image pull works:
    docker pull mcr.svc.mcp.metacircular.net:8443/<repo>:<tag>
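
Steps 2 and 3 above can be wrapped in a small function so that the pre-restore copy is never skipped. This is a sketch under assumptions: restore_snapshot is a hypothetical name, and moving aside leftover -wal and -shm files is an added precaution that this runbook does not prescribe (a stale WAL file belongs to the old database, not the restored one).

```sh
#!/bin/sh
# Restore a snapshot over the live database. Run only with the service stopped.
#   $1: directory holding mcr.db (/srv/mcr in this runbook)
#   $2: path to a mcr-YYYYMMDD-HHMMSS.db snapshot
restore_snapshot() {
    data_dir=$1
    snapshot=$2
    [ -s "$snapshot" ] || { echo "snapshot missing or empty: $snapshot" >&2; return 1; }
    # Step 2: keep a copy of the current database, if there is one.
    if [ -f "$data_dir/mcr.db" ]; then
        cp "$data_dir/mcr.db" "$data_dir/mcr.db.pre-restore" || return 1
    fi
    # Added precaution (assumption): set aside WAL/SHM files left by the
    # old database so they are not paired with the restored one.
    for suffix in wal shm; do
        if [ -f "$data_dir/mcr.db-$suffix" ]; then
            mv "$data_dir/mcr.db-$suffix" "$data_dir/mcr.db-$suffix.pre-restore" || return 1
        fi
    done
    # Step 3: copy the snapshot into place.
    cp "$snapshot" "$data_dir/mcr.db"
}

# Usage:
#   restore_snapshot /srv/mcr /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db
```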
    

Log Inspection

Container logs (mcr-api):

docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api

Container logs (mcr-web):

docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web

Follow logs in real time:

docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web

Via MCP:

mcp logs mcr

MCR logs to stderr as structured text (slog). Log level is configured via [log] level in mcr.toml (debug, info, warn, error).
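
When scanning a long log for problems, filtering by level helps. The sketch below assumes MCR uses slog's standard text format, in which each record carries a level=WARN or level=ERROR field; the exact field layout is not confirmed by this runbook, and warn_and_error is a hypothetical name.

```sh
#!/bin/sh
# Keep only WARN and ERROR records from slog text output on stdin.
# Assumes key=value records containing a "level=..." field.
warn_and_error() {
    grep -E '(^| )level=(WARN|ERROR)( |$)'
}

# Usage:
#   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 500 mcr-api 2>&1 | warn_and_error
```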

Garbage Collection

Garbage collection removes unreferenced blobs -- blobs no longer referenced by any manifest. GC acquires a registry-wide lock that blocks new blob uploads for the duration of the mark-and-sweep phase. Pulls are not blocked.

  1. Trigger GC via CLI:
    mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
    
  2. Check GC status:
    mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
    
  3. GC can also be triggered via the REST API:
    curl -k -X POST -H "Authorization: Bearer <token>" https://localhost:8443/v1/gc
    

If a previous GC crashed after the database sweep but before filesystem cleanup, orphaned files may remain on disk. Run reconciliation to clean them up:

mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
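
To see how much space a GC run reclaimed, measure the blob directory before and after. This is a minimal sketch with hypothetical helper names; it assumes the GC run has finished before the second measurement, which the status command above should indicate.

```sh
#!/bin/sh
# Print the size of a directory tree in KiB.
kib_used() {
    du -sk "$1" | cut -f1
}

# Print how many KiB were freed between two measurements.
#   $1: KiB before, $2: KiB after
kib_freed() {
    echo $(( $1 - $2 ))
}

# Usage:
#   before=$(kib_used /srv/mcr/layers)
#   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
#   (wait until the GC run has finished)
#   after=$(kib_used /srv/mcr/layers)
#   echo "freed $(kib_freed "$before" "$after") KiB"
```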

Incident Procedures

Database Corruption

Symptoms: server fails to start with SQLite errors, or API requests return unexpected errors.

  1. Stop the service.
  2. Check for WAL/SHM files alongside the database:
    ls -la /srv/mcr/mcr.db*
    
  3. Attempt an integrity check:
    sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
    
  4. If integrity check fails, restore from the most recent snapshot:
    cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
    cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
    
  5. Start the service and verify health.
  6. Note: blobs on the filesystem are unaffected by database corruption. Images pushed after the snapshot was taken will be missing from metadata. Their blobs remain on disk and will be cleaned up by GC unless the metadata is re-created.
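
The decision in steps 3 and 4 can be scripted. sqlite3 prints the single line ok when PRAGMA integrity_check finds no problems and a list of errors otherwise; the helper name integrity_ok below is hypothetical.

```sh
#!/bin/sh
# Return 0 only if the integrity_check output is exactly "ok".
#   $1: output captured from sqlite3 ... "PRAGMA integrity_check;"
integrity_ok() {
    [ "$1" = "ok" ]
}

# Usage (with the service stopped):
#   result=$(sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;")
#   if integrity_ok "$result"; then
#       echo "database is clean"
#   else
#       echo "integrity check failed; restore from snapshot"
#   fi
```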

TLS Certificate Expiry

Symptoms: health check fails with TLS errors, Docker clients get certificate verification errors on push/pull.

  1. Check certificate expiry:
    openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
    
  2. Replace the certificate and key files at the paths configured in mcr.toml ([server] tls_cert and tls_key).
  3. Restart the service to load the new certificate.
  4. Verify health:
    curl -k https://localhost:8443/v1/health
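
Expiry can also be caught before it causes an incident. The sketch below uses the -checkend option of openssl x509, which takes a number of seconds; the function names and the 30-day window in the usage line are assumptions, not platform policy.

```sh
#!/bin/sh
# Convert days to seconds for -checkend.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Return 0 if the certificate expires within the given number of days,
# 1 if it stays valid past that window, 2 if the file cannot be read.
cert_expires_within() {
    cert=$1
    days=$2
    [ -r "$cert" ] || { echo "cannot read $cert" >&2; return 2; }
    # -checkend exits 0 when the certificate is still valid after N seconds.
    if openssl x509 -in "$cert" -noout -checkend "$(days_to_seconds "$days")" >/dev/null; then
        return 1
    fi
    return 0
}

# Usage:
#   if cert_expires_within /srv/mcr/certs/cert.pem 30; then echo "renew soon"; fi
```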
    

MCIAS Outage

Symptoms: push/pull fails with 401 or 502 errors because authentication cannot complete.

  1. Confirm MCIAS is unreachable:
    curl -k https://svc.metacircular.net:8443/v1/health
    
  2. Cached token validation results remain valid for up to 30 seconds after the last successful MCIAS check. Operations using recently-validated tokens may continue briefly.
  3. Once cached tokens expire, all authenticated operations (push, pull, catalog, admin) will fail until MCIAS recovers.
  4. The OCI /v2/ version check endpoint still responds (confirms MCR itself is running).
  5. Escalate to MCIAS (see Escalation below).
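
While waiting for MCIAS to recover, a retry loop saves re-running the health check by hand. This is a generic sketch: wait_until is a hypothetical helper, and the 10-second interval and 30 attempts in the usage line are arbitrary choices.

```sh
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
#   $1: maximum attempts, $2: seconds between attempts, rest: the command
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage: poll the MCIAS health endpoint every 10 seconds, up to 30 times.
# curl -f makes HTTP error responses count as failures.
#   wait_until 30 10 curl -k -s -f -o /dev/null https://svc.metacircular.net:8443/v1/health
```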

Disk Full

Symptoms: blob uploads fail, database writes fail, container may crash.

  1. Check disk usage:
    df -h /srv/mcr/
    du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
    
  2. Clean up stale uploads:
    ls -la /srv/mcr/uploads/
    
    Remove upload files that are old and have no matching in-progress upload in the database.
  3. Run garbage collection to reclaim unreferenced blobs:
    mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
    
  4. If GC does not free enough space, identify large repositories:
    mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
    
  5. Delete unused tags or repositories to free space, then run GC again.
  6. If the disk is completely full and the service cannot start, manually remove orphaned files from /srv/mcr/uploads/ to free enough space for the service to start, then run GC.
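
For step 2, age is a useful first filter for stale uploads. The sketch below only lists candidates by modification time; it does not check the database for in-progress uploads, so treat its output as a list to review, not a list to delete. The function name is hypothetical.

```sh
#!/bin/sh
# List upload files by age.
#   $1: uploads directory, $2: N for find's -mtime +N
# find counts whole 24-hour periods, so +1 matches files at least
# two days old.
stale_upload_candidates() {
    find "$1" -type f -mtime +"$2"
}

# Usage:
#   stale_upload_candidates /srv/mcr/uploads 1
```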

Image Push/Pull Failures

Symptoms: docker push or docker pull returns errors.

  1. Verify the service is running and healthy:
    curl -k https://localhost:8443/v1/health
    
  2. Test OCI endpoint:
    curl -k https://localhost:8443/v2/
    
    Expected: HTTP 401 with WWW-Authenticate header.
  3. Test authentication:
    curl -k -u username:password "https://localhost:8443/v2/token?service=mcr"
    
    Expected: HTTP 200 with a token response.
  4. Check if the issue is policy-related (403 Denied):
    mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
    
    Review policy rules for the affected account and repository.
  5. Check audit log for denied requests:
    mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
    
  6. For push failures, verify all referenced blobs exist before pushing the manifest. The error MANIFEST_BLOB_UNKNOWN means a layer was not uploaded before the manifest push.
  7. Check logs for detailed error information:
    docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
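
As a quick reference, the status codes mentioned in this runbook map to procedures roughly as follows. The mapping is a sketch assembled from the sections above, not an exhaustive list of MCR error behavior, and triage_status is a hypothetical name.

```sh
#!/bin/sh
# Suggest where to look first, given the HTTP status of a failed push or pull.
# 000 is what curl's %{http_code} reports when the connection fails.
triage_status() {
    case "$1" in
        401) echo "authentication: test the token endpoint, then check MCIAS health" ;;
        403) echo "policy: review the policy list and the audit log" ;;
        502) echo "upstream: check MCIAS health" ;;
        000) echo "unreachable: check that mcr-api is running" ;;
        *)   echo "other: check the mcr-api logs" ;;
    esac
}

# Usage:
#   triage_status 403
```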
    

MCP Deployment

MCR is deployed via MCP as a two-component service on the rift node.

Service Definition

name = "mcr"
node = "rift"
active = true

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/mcr:v1.1.0"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.1.0"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28080:8080"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]

Port Mapping

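| Component | Container Port | Host Port | Purpose                        |
|-----------|----------------|-----------|--------------------------------|
| mcr-api   | 8443           | 28443     | HTTPS (OCI + admin REST)       |
| mcr-api   | 9443           | 29443     | gRPC admin API                 |
| mcr-web   | 8080           | 28080     | Web UI (HTTP, behind MC-Proxy) |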

Both containers share the /srv/mcr volume for configuration, database, and blob storage. They are connected to the mcpnet Docker network.
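
Each ports entry in the service definition has the form host_ip:host_port:container_port. The sketch below splits such an entry so that scripts on the rift host can derive the host-side port instead of hard-coding it; the helper names are hypothetical, and it assumes the entry always has all three fields.

```sh
#!/bin/sh
# Extract fields from a "host_ip:host_port:container_port" mapping.
host_port() {
    echo "$1" | cut -d: -f2
}

container_port() {
    echo "$1" | cut -d: -f3
}

# Usage: check mcr-api health through its host-side port.
#   curl -k "https://127.0.0.1:$(host_port 127.0.0.1:28443:8443)/v1/health"
```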

Escalation

Escalate when:

  • Database corruption cannot be resolved by restoring a snapshot.
  • MCIAS is down and registry operations are urgently needed.
  • Disk full cannot be resolved by GC and cleanup.
  • Push/pull failures persist after following the procedures above.
  • Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).