diff --git a/README.md b/README.md
index 77f835a..174a031 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,59 @@
-MCR is the Metacircular Container Registry
+# MCR
 
-This is a container registry integrated with MCIAS.
\ No newline at end of file
+Metacircular Container Registry -- an OCI Distribution Spec-compliant
+container registry for the Metacircular platform. MCR stores and serves
+container images with authentication delegated to MCIAS and a local
+policy engine for fine-grained access control. Metadata is stored in
+SQLite; blobs are stored as content-addressed files on the filesystem.
+
+## Quick Start
+
+Build the binaries:
+
+```bash
+make all
+```
+
+This produces three binaries:
+
+| Binary | Purpose |
+|--------|---------|
+| `mcrsrv` | Registry server (OCI + admin REST + gRPC) |
+| `mcr-web` | Web UI (htmx, communicates with mcrsrv via gRPC) |
+| `mcrctl` | Admin CLI |
+
+Copy and edit the example configuration:
+
+```bash
+cp deploy/examples/mcr.toml /srv/mcr/mcr.toml
+# Edit TLS paths, database path, storage paths, MCIAS URL
+```
+
+Run the server:
+
+```bash
+./mcrsrv server --config /srv/mcr/mcr.toml
+```
+
+The server starts two listeners:
+
+| Port | Protocol | Purpose |
+|------|----------|---------|
+| 8443 | TCP | HTTPS -- OCI Distribution endpoints + admin REST API |
+| 9443 | TCP | gRPC admin API (TLS, MCIAS auth) |
+
+Run the web UI:
+
+```bash
+./mcr-web server --config /srv/mcr/mcr.toml
+```
+
+| Port | Protocol | Purpose |
+|------|----------|---------|
+| 8080 | TCP | HTTP -- web UI (repository browsing, policy management) |
+
+## Documentation
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) -- full technical specification, OCI compliance details, database schema, policy engine, and security model.
+- [RUNBOOK.md](RUNBOOK.md) -- operational procedures, health checks, backup/restore, incident response, and MCP deployment.
+- [CLAUDE.md](CLAUDE.md) -- context for AI-assisted development.
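The listener tables in the README hunk above give three default ports. A quick way to confirm that `mcrsrv` and `mcr-web` came up is to probe each port. This is a minimal sketch, assuming the services run on the local host with those default ports; `probe_port` and `listener_report` are hypothetical helper names, and the probe relies on bash's `/dev/tcp` redirection rather than on any MCR tooling.

```bash
#!/usr/bin/env bash
# Probe the default MCR listeners named in the README tables.
# Assumption: services run locally on the default ports 8443, 9443, 8080.

# probe_port HOST PORT -> exit status 0 if a TCP connection can be opened.
probe_port() {
  local host=$1 port=$2
  # /dev/tcp is handled by bash itself; the subshell closes fd 3 on exit.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# listener_report [HOST] -> print one "PORT up|down LABEL" line per listener.
listener_report() {
  local host=${1:-localhost} entry port label
  for entry in \
    "8443:HTTPS (OCI + admin REST)" \
    "9443:gRPC admin API" \
    "8080:web UI"; do
    port=${entry%%:*}   # text before the first colon
    label=${entry#*:}   # text after the first colon
    if probe_port "$host" "$port"; then
      echo "${port} up   ${label}"
    else
      echo "${port} down ${label}"
    fi
  done
}
```

Running `listener_report` with no argument probes `localhost`. An open port only shows that something is listening; it does not confirm that TLS or authentication is configured correctly.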
diff --git a/RUNBOOK.md b/RUNBOOK.md
new file mode 100644
index 0000000..541aba6
--- /dev/null
+++ b/RUNBOOK.md
@@ -0,0 +1,403 @@
+# MCR Runbook
+
+## Service Overview
+
+MCR (Metacircular Container Registry) is an OCI Distribution
+Spec-compliant container registry for the Metacircular platform. It
+stores and serves container images, with authentication delegated to
+MCIAS and a local policy engine for fine-grained access control.
+
+MCR runs as two containers:
+
+- **mcr-api** -- the registry server. Exposes OCI Distribution endpoints
+  and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on
+  port 9443. Handles blob storage, manifest management, and
+  token-based authentication via MCIAS.
+- **mcr-web** -- the web UI. Communicates with mcr-api via gRPC on port
+  9443. Provides repository/tag browsing and ACL policy management for
+  administrators. Listens on port 8080.
+
+Both are fronted by MC-Proxy for TLS routing. Metadata is stored in
+SQLite; blobs are stored as content-addressed files on the filesystem
+under `/srv/mcr/layers/`.
+
+## Health Checks
+
+### REST
+
+```bash
+curl -k https://localhost:8443/v1/health
+```
+
+Expected: HTTP 200.
+
+### gRPC
+
+Use the `AdminService.Health` RPC on port 9443. This method is public
+(no auth required).
+
+### OCI Version Check
+
+```bash
+curl -k https://localhost:8443/v2/
+```
+
+Expected: HTTP 401 with `WWW-Authenticate` header (confirms the OCI
+endpoint is alive and responding). An authenticated request returns
+HTTP 200 with `{}`.
+
+### CLI
+
+```bash
+mcrctl status --addr https://localhost:8443
+```
+
+Expected output: `ok`
+
+## Common Operations
+
+### Start the Service (MCP)
+
+1. Deploy via MCP:
+   ```bash
+   mcp deploy mcr
+   ```
+2. Verify health:
+   ```bash
+   curl -k https://localhost:8443/v1/health
+   ```
+
+### Start the Service (Docker Compose)
+
+1. Verify config exists: `ls /srv/mcr/mcr.toml`
+2. Start the containers:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml up -d
+   ```
+3. Verify health:
+   ```bash
+   curl -k https://localhost:8443/v1/health
+   ```
+
+### Stop the Service
+
+Via MCP:
+
+```bash
+mcp stop mcr
+```
+
+Via Docker Compose:
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml stop
+```
+
+MCR handles SIGTERM gracefully: it stops accepting new connections,
+drains in-flight requests (including ongoing uploads) for up to 60
+seconds, then force-closes remaining connections and exits.
+
+### Restart the Service
+
+Via MCP:
+
+```bash
+mcp restart mcr
+```
+
+Via Docker Compose:
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml restart
+```
+
+Verify health after restart:
+
+```bash
+curl -k https://localhost:8443/v1/health
+```
+
+### Backup (Snapshot)
+
+MCR backups have two parts: the SQLite database (metadata) and the blob
+filesystem. The database snapshot alone is usable but incomplete --
+missing blobs return 404 on pull.
+
+1. Run the snapshot command:
+   ```bash
+   mcrsrv snapshot --config /srv/mcr/mcr.toml
+   ```
+2. The snapshot is saved to `/srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db`.
+3. Verify the snapshot file exists and has a reasonable size:
+   ```bash
+   ls -lh /srv/mcr/backups/
+   ```
+4. For a complete backup, also copy the blob directory:
+   ```bash
+   rsync -a /srv/mcr/layers/ /backup/mcr/layers/
+   ```
+
+A systemd timer (`mcr-backup.timer`) runs the database snapshot daily
+at 02:00 UTC with 5-minute jitter.
+
+### Restore from Snapshot
+
+1. Stop the service (see above).
+2. Back up the current database:
+   ```bash
+   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
+   ```
+3. Copy the snapshot into place:
+   ```bash
+   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
+   ```
+4. If restoring blobs as well:
+   ```bash
+   rsync -a /backup/mcr/layers/ /srv/mcr/layers/
+   ```
+5. Start the service (see above).
+6. Verify the service is healthy:
+   ```bash
+   curl -k https://localhost:8443/v1/health
+   ```
+7. Verify an image pull works:
+   ```bash
+   docker pull mcr.svc.mcp.metacircular.net:8443/:
+   ```
+
+### Log Inspection
+
+Container logs (mcr-api):
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api
+```
+
+Container logs (mcr-web):
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web
+```
+
+Follow logs in real time:
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web
+```
+
+Via MCP:
+
+```bash
+mcp logs mcr
+```
+
+MCR logs to stderr as structured text (slog). Log level is configured
+via `[log] level` in `mcr.toml` (debug, info, warn, error).
+
+### Garbage Collection
+
+Garbage collection removes unreferenced blobs -- blobs no longer
+referenced by any manifest. GC acquires a registry-wide lock that
+blocks new blob uploads for the duration of the mark-and-sweep phase.
+Pulls are not blocked.
+
+1. Trigger GC via CLI:
+   ```bash
+   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
+   ```
+2. Check GC status:
+   ```bash
+   mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
+   ```
+3. GC can also be triggered via the REST API:
+   ```bash
+   curl -k -X POST -H "Authorization: Bearer " https://localhost:8443/v1/gc
+   ```
+
+If a previous GC crashed after the database sweep but before filesystem
+cleanup, orphaned files may remain on disk. Run reconciliation to clean
+them up:
+
+```bash
+mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
+```
+
+## Incident Procedures
+
+### Database Corruption
+
+Symptoms: server fails to start with SQLite errors, or API requests
+return unexpected errors.
+
+1. Stop the service.
+2. Check for WAL/SHM files alongside the database:
+   ```bash
+   ls -la /srv/mcr/mcr.db*
+   ```
+3. Attempt an integrity check:
+   ```bash
+   sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
+   ```
+4. If integrity check fails, restore from the most recent snapshot:
+   ```bash
+   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
+   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
+   ```
+5. Start the service and verify health.
+6. Note: blobs on the filesystem are unaffected by database corruption.
+   Images pushed after the snapshot was taken will be missing from
+   metadata. Their blobs remain on disk and will be cleaned up by GC
+   unless the metadata is re-created.
+
+### TLS Certificate Expiry
+
+Symptoms: health check fails with TLS errors, Docker clients get
+certificate verification errors on push/pull.
+
+1. Check certificate expiry:
+   ```bash
+   openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
+   ```
+2. Replace the certificate and key files at the paths configured in
+   `mcr.toml` (`[server] tls_cert` and `tls_key`).
+3. Restart the service to load the new certificate.
+4. Verify health:
+   ```bash
+   curl -k https://localhost:8443/v1/health
+   ```
+
+### MCIAS Outage
+
+Symptoms: push/pull fails with 401 or 502 errors. Authentication
+cannot complete.
+
+1. Confirm MCIAS is unreachable:
+   ```bash
+   curl -k https://svc.metacircular.net:8443/v1/health
+   ```
+2. Cached token validation results remain valid for up to 30 seconds
+   after the last successful MCIAS check. Operations using
+   recently-validated tokens may continue briefly.
+3. Once cached tokens expire, all authenticated operations (push, pull,
+   catalog, admin) will fail until MCIAS recovers.
+4. The OCI `/v2/` version check endpoint still responds (confirms MCR
+   itself is running).
+5. Escalate to MCIAS (see Escalation below).
+
+### Disk Full
+
+Symptoms: blob uploads fail, database writes fail, container may crash.
+
+1. Check disk usage:
+   ```bash
+   df -h /srv/mcr/
+   du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
+   ```
+2. Clean up stale uploads:
+   ```bash
+   ls -la /srv/mcr/uploads/
+   ```
+   Remove upload files that are old and have no matching in-progress
+   upload in the database.
+3. Run garbage collection to reclaim unreferenced blobs:
+   ```bash
+   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
+   ```
+4. If GC does not free enough space, identify large repositories:
+   ```bash
+   mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
+   ```
+5. Delete unused tags or repositories to free space, then run GC again.
+6. If the disk is completely full and the service cannot start, manually
+   remove orphaned files from `/srv/mcr/uploads/` to free enough space
+   for the service to start, then run GC.
+
+### Image Push/Pull Failures
+
+Symptoms: `docker push` or `docker pull` returns errors.
+
+1. Verify the service is running and healthy:
+   ```bash
+   curl -k https://localhost:8443/v1/health
+   ```
+2. Test OCI endpoint:
+   ```bash
+   curl -k https://localhost:8443/v2/
+   ```
+   Expected: HTTP 401 with `WWW-Authenticate` header.
+3. Test authentication:
+   ```bash
+   curl -k -u username:password https://localhost:8443/v2/token?service=mcr
+   ```
+   Expected: HTTP 200 with a token response.
+4. Check if the issue is policy-related (403 Denied):
+   ```bash
+   mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
+   ```
+   Review policy rules for the affected account and repository.
+5. Check audit log for denied requests:
+   ```bash
+   mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
+   ```
+6. For push failures, verify all referenced blobs exist before pushing
+   the manifest. The error `MANIFEST_BLOB_UNKNOWN` means a layer was
+   not uploaded before the manifest push.
+7. Check logs for detailed error information:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
+   ```
+
+## MCP Deployment
+
+MCR is deployed via MCP as a two-component service on the `rift` node.
+
+### Service Definition
+
+```toml
+name = "mcr"
+node = "rift"
+active = true
+
+[[components]]
+name = "api"
+image = "mcr.svc.mcp.metacircular.net:8443/mcr:latest"
+network = "mcpnet"
+user = "0:0"
+restart = "unless-stopped"
+ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
+volumes = ["/srv/mcr:/srv/mcr"]
+cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
+
+[[components]]
+name = "web"
+image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:latest"
+network = "mcpnet"
+user = "0:0"
+restart = "unless-stopped"
+ports = ["127.0.0.1:28080:8080"]
+volumes = ["/srv/mcr:/srv/mcr"]
+cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
+```
+
+### Port Mapping
+
+| Component | Container Port | Host Port | Purpose |
+|-----------|---------------|-----------|---------|
+| mcr-api | 8443 | 28443 | HTTPS (OCI + admin REST) |
+| mcr-api | 9443 | 29443 | gRPC admin API |
+| mcr-web | 8080 | 28080 | Web UI (HTTP, behind MC-Proxy) |
+
+Both containers share the `/srv/mcr` volume for configuration, database,
+and blob storage. They are connected to the `mcpnet` Docker network.
+
+## Escalation
+
+Escalate when:
+
+- Database corruption cannot be resolved by restoring a snapshot.
+- MCIAS is down and registry operations are urgently needed.
+- Disk full cannot be resolved by GC and cleanup.
+- Push/pull failures persist after following the procedures above.
+- Any issue not covered by this runbook.
+
+Escalation path: Kyle (platform owner).
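Both the Restore from Snapshot and the Database Corruption procedures in the runbook call for the most recent snapshot. Because `mcr-YYYYMMDD-HHMMSS.db` names sort chronologically when sorted as plain text, the newest file can be picked without parsing dates. A minimal sketch, assuming snapshots live in `/srv/mcr/backups/` as the runbook describes; `latest_snapshot` is a hypothetical helper, not part of `mcrsrv` or `mcrctl`.

```bash
#!/usr/bin/env bash
# latest_snapshot [DIR] -> print the path of the newest snapshot in DIR.
# Assumption: files follow the mcr-YYYYMMDD-HHMMSS.db naming pattern, so
# lexicographic order matches chronological order.
latest_snapshot() {
  local dir=${1:-/srv/mcr/backups} f newest=""
  for f in "$dir"/mcr-????????-??????.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # The shell expands globs in sorted order, so the last match is newest.
    newest=$f
  done
  [ -n "$newest" ] || return 1  # no snapshot found
  printf '%s\n' "$newest"
}
```

With the service stopped and the current database copied aside, the restore step could then read `cp "$(latest_snapshot)" /srv/mcr/mcr.db`. The function returns a non-zero status when no snapshot exists, so a calling script can stop before overwriting anything.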