Add RUNBOOK.md and expand README.md

Create operational runbook covering health checks, start/stop/restart
(MCP and Docker Compose), backup/restore, garbage collection, and
incident procedures for database corruption, TLS expiry, MCIAS outage,
disk full, and push/pull failures. Includes MCP service definition
reference for the two-component deployment (mcr-api + mcr-web).

Rewrite README from 2-line stub to full project overview with
quick-start instructions, binary descriptions, port tables, and links
to ARCHITECTURE.md and RUNBOOK.md per engineering standards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:11:21 -07:00
parent ad2af6df57
commit 3d36c58d0d
2 changed files with 461 additions and 2 deletions


@@ -1,3 +1,59 @@
MCR is the Metacircular Container Registry
# MCR
This is a container registry integrated with MCIAS.
Metacircular Container Registry -- an OCI Distribution Spec-compliant
container registry for the Metacircular platform. MCR stores and serves
container images with authentication delegated to MCIAS and a local
policy engine for fine-grained access control. Metadata is stored in
SQLite; blobs are stored as content-addressed files on the filesystem.
## Quick Start
Build the binaries:
```bash
make all
```
This produces three binaries:
| Binary | Purpose |
|--------|---------|
| `mcrsrv` | Registry server (OCI + admin REST + gRPC) |
| `mcr-web` | Web UI (htmx, communicates with mcrsrv via gRPC) |
| `mcrctl` | Admin CLI |
Copy and edit the example configuration:
```bash
cp deploy/examples/mcr.toml /srv/mcr/mcr.toml
# Edit TLS paths, database path, storage paths, MCIAS URL
```
Run the server:
```bash
./mcrsrv server --config /srv/mcr/mcr.toml
```
The server starts two listeners:
| Port | Protocol | Purpose |
|------|----------|---------|
| 8443 | TCP | HTTPS -- OCI Distribution endpoints + admin REST API |
| 9443 | TCP | gRPC admin API (TLS, MCIAS auth) |
Run the web UI:
```bash
./mcr-web server --config /srv/mcr/mcr.toml
```
| Port | Protocol | Purpose |
|------|----------|---------|
| 8080 | TCP | HTTP -- web UI (repository browsing, policy management) |
## Documentation
- [ARCHITECTURE.md](ARCHITECTURE.md) -- full technical specification, OCI compliance details, database schema, policy engine, and security model.
- [RUNBOOK.md](RUNBOOK.md) -- operational procedures, health checks, backup/restore, incident response, and MCP deployment.
- [CLAUDE.md](CLAUDE.md) -- context for AI-assisted development.

RUNBOOK.md Normal file

@@ -0,0 +1,403 @@
# MCR Runbook
## Service Overview
MCR (Metacircular Container Registry) is an OCI Distribution
Spec-compliant container registry for the Metacircular platform. It
stores and serves container images, with authentication delegated to
MCIAS and a local policy engine for fine-grained access control.
MCR runs as two containers:
- **mcr-api** -- the registry server. Exposes OCI Distribution endpoints
and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on
port 9443. Handles blob storage, manifest management, and
token-based authentication via MCIAS.
- **mcr-web** -- the web UI. Communicates with mcr-api via gRPC on port
9443. Provides repository/tag browsing and ACL policy management for
administrators. Listens on port 8080.
Both are fronted by MC-Proxy for TLS routing. Metadata is stored in
SQLite; blobs are stored as content-addressed files on the filesystem
under `/srv/mcr/layers/`.
## Health Checks
### REST
```bash
curl -k https://localhost:8443/v1/health
```
Expected: HTTP 200.
### gRPC
Use the `AdminService.Health` RPC on port 9443. This method is public
(no auth required).
### OCI Version Check
```bash
curl -k https://localhost:8443/v2/
```
Expected: HTTP 401 with `WWW-Authenticate` header (confirms the OCI
endpoint is alive and responding). An authenticated request returns
HTTP 200 with `{}`.
### CLI
```bash
mcrctl status --addr https://localhost:8443
```
Expected output: `ok`
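The REST and OCI checks above can be combined into a single probe. The following is a minimal sketch, not something shipped with MCR: the function names `http_status`, `expect_status`, and `check_mcr` and the `MCR_ADDR` variable are hypothetical, and it assumes `curl` is installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of MCR.
MCR_ADDR="${MCR_ADDR:-https://localhost:8443}"

# Print the HTTP status code for URL ("000" if the connection fails).
http_status() {
  curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# Compare an observed status code with the expected one.
expect_status() {
  local name="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    printf 'ok   %s (%s)\n' "$name" "$actual"
  else
    printf 'FAIL %s (expected %s, got %s)\n' "$name" "$expected" "$actual"
    return 1
  fi
}

# Run both probes; return non-zero if either one fails.
check_mcr() {
  local rc=0
  expect_status "REST /v1/health" 200 "$(http_status "$MCR_ADDR/v1/health")" || rc=1
  # An unauthenticated /v2/ request should be rejected with 401.
  expect_status "OCI /v2/" 401 "$(http_status "$MCR_ADDR/v2/")" || rc=1
  return "$rc"
}
```

After sourcing the file, `check_mcr` prints one line per probe and returns non-zero on any mismatch, which makes it usable from a monitoring script.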
## Common Operations
### Start the Service (MCP)
1. Deploy via MCP:
```bash
mcp deploy mcr
```
2. Verify health:
```bash
curl -k https://localhost:8443/v1/health
```
### Start the Service (Docker Compose)
1. Verify config exists: `ls /srv/mcr/mcr.toml`
2. Start the containers:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml up -d
```
3. Verify health:
```bash
curl -k https://localhost:8443/v1/health
```
### Stop the Service
Via MCP:
```bash
mcp stop mcr
```
Via Docker Compose:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml stop
```
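MCR handles SIGTERM gracefully: it stops accepting new connections,
drains in-flight requests (including ongoing uploads) for up to 60
seconds, then force-closes remaining connections and exits. Note that
`docker compose stop` sends SIGKILL after 10 seconds by default; unless
the compose file sets `stop_grace_period`, pass `--timeout 60` or higher
so the drain can complete.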
### Restart the Service
Via MCP:
```bash
mcp restart mcr
```
Via Docker Compose:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml restart
```
Verify health after restart:
```bash
curl -k https://localhost:8443/v1/health
```
### Backup (Snapshot)
MCR backups have two parts: the SQLite database (metadata) and the blob
filesystem. A database snapshot alone restores metadata but not image
content: pulls of any image whose blobs are missing return 404.
1. Run the snapshot command:
```bash
mcrsrv snapshot --config /srv/mcr/mcr.toml
```
2. The snapshot is saved to `/srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:
```bash
ls -lh /srv/mcr/backups/
```
4. For a complete backup, also copy the blob directory:
```bash
rsync -a /srv/mcr/layers/ /backup/mcr/layers/
```
A systemd timer (`mcr-backup.timer`) runs the database snapshot daily
at 02:00 UTC with 5-minute jitter.
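Because snapshot names embed a `YYYYMMDD-HHMMSS` timestamp, they sort chronologically, so the newest snapshot can be found without parsing dates. A minimal sketch, assuming the naming shown in step 2; the function name `latest_snapshot` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest snapshot in a backup directory.
# Relies on the mcr-YYYYMMDD-HHMMSS.db naming, which sorts chronologically.
latest_snapshot() {
  local dir="${1:-/srv/mcr/backups}" f latest=""
  for f in "$dir"/mcr-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9].db; do
    [ -e "$f" ] || continue   # the pattern stays literal when nothing matches
    latest="$f"               # globs expand in sorted order, so the last match is newest
  done
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}
```

For example, `ls -lh "$(latest_snapshot)"` shows the size and age of the most recent snapshot; the function returns non-zero when the directory holds no snapshots.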
### Restore from Snapshot
1. Stop the service (see above).
2. Back up the current database:
```bash
cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
```
3. Copy the snapshot into place:
```bash
cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
```
4. If restoring blobs as well:
```bash
rsync -a /backup/mcr/layers/ /srv/mcr/layers/
```
5. Start the service (see above).
6. Verify the service is healthy:
```bash
curl -k https://localhost:8443/v1/health
```
7. Verify an image pull works:
```bash
docker pull mcr.svc.mcp.metacircular.net:8443/<repo>:<tag>
```
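If MCR runs SQLite in WAL mode (an assumption: the Database Corruption procedure mentions WAL/SHM files, but this runbook does not state the journal mode), any `mcr.db-wal` or `mcr.db-shm` file left next to a restored database belongs to the old database and may be replayed against the restored file. A minimal sketch of a variant of step 2 that moves the database and its sidecar files aside together; the function name `stash_db` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: move a SQLite database and its -wal/-shm sidecars aside.
# Run only while the service is stopped.
stash_db() {
  local db="$1" suffix="$2" f
  [ -e "$db" ] || return 1
  for f in "$db" "$db-wal" "$db-shm"; do
    if [ -e "$f" ]; then
      mv -- "$f" "$f.$suffix"
    fi
  done
}
```

Running `stash_db /srv/mcr/mcr.db pre-restore` leaves no database files at the original path, so the snapshot copied in step 3 starts without leftover sidecars.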
### Log Inspection
Container logs (mcr-api):
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api
```
Container logs (mcr-web):
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web
```
Follow logs in real time:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web
```
Via MCP:
```bash
mcp logs mcr
```
MCR logs to stderr as structured text (slog). Log level is configured
via `[log] level` in `mcr.toml` (debug, info, warn, error).
### Garbage Collection
Garbage collection removes unreferenced blobs -- blobs no longer
referenced by any manifest. GC acquires a registry-wide lock that
blocks new blob uploads for the duration of the mark-and-sweep phase.
Pulls are not blocked.
1. Trigger GC via CLI:
```bash
mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
```
2. Check GC status:
```bash
mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
```
3. GC can also be triggered via the REST API:
```bash
curl -k -X POST -H "Authorization: Bearer <token>" https://localhost:8443/v1/gc
```
If a previous GC crashed after the database sweep but before filesystem
cleanup, orphaned files may remain on disk. Run reconciliation to clean
them up:
```bash
mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
```
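Since GC blocks new uploads while it runs, it can help to record how long a run took and how much space it freed. The following is a rough sketch using only `du`; the names `dir_kib` and `report_reclaimed` are hypothetical. It assumes the wrapped command blocks until the sweep has finished; if `mcrctl gc` returns immediately, measure again once `mcrctl gc status` reports completion.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper, not part of MCR.

# Print the size of DIR in KiB.
dir_kib() {
  du -sk "$1" | awk '{print $1}'
}

# Usage: report_reclaimed DIR COMMAND [ARGS...]
# Runs COMMAND, then prints how much DIR shrank and how long COMMAND took.
report_reclaimed() {
  local dir="$1"; shift
  local before after start=$SECONDS
  before="$(dir_kib "$dir")"
  "$@" || return   # propagate the command's failure status
  after="$(dir_kib "$dir")"
  printf 'reclaimed %d KiB in %d s\n' "$((before - after))" "$((SECONDS - start))"
}
```

A possible invocation on the registry host is `report_reclaimed /srv/mcr/layers mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443`.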
## Incident Procedures
### Database Corruption
Symptoms: server fails to start with SQLite errors, or API requests
return unexpected errors.
1. Stop the service.
2. Check for WAL/SHM files alongside the database:
```bash
ls -la /srv/mcr/mcr.db*
```
3. Attempt an integrity check:
```bash
sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
```
4. If integrity check fails, restore from the most recent snapshot:
```bash
cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
```
5. Start the service and verify health.
6. Note: blobs on the filesystem are unaffected by database corruption.
Images pushed after the snapshot was taken will be missing from the
restored metadata. Their blobs remain on disk but are now unreferenced,
so the next GC run will delete them unless the images are pushed again
first.
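When scripting step 3, note that `sqlite3` typically exits 0 even when `PRAGMA integrity_check` reports problems, because the findings are returned as ordinary result rows. A script therefore has to compare the report text rather than the exit code. A minimal sketch; the function names are hypothetical, and it assumes the `sqlite3` CLI is installed:

```bash
#!/usr/bin/env bash
# A healthy database yields exactly one line of output: "ok".
is_clean_report() {
  [ "$1" = "ok" ]
}

# Hypothetical helper: succeed only if DB passes PRAGMA integrity_check.
# Run only while the service is stopped.
db_is_healthy() {
  local db="$1" report
  # A file too damaged to open makes sqlite3 itself fail.
  report="$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>&1)" || return 1
  is_clean_report "$report"
}
```

For example, `db_is_healthy /srv/mcr/mcr.db || echo "restore from snapshot"` turns the check into a decision point for step 4.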
### TLS Certificate Expiry
Symptoms: health check fails with TLS errors, Docker clients get
certificate verification errors on push/pull.
1. Check certificate expiry:
```bash
openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
```
2. Replace the certificate and key files at the paths configured in
`mcr.toml` (`[server] tls_cert` and `tls_key`).
3. Restart the service to load the new certificate.
4. Verify health:
```bash
curl -k https://localhost:8443/v1/health
```
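Expiry can also be caught before it causes an incident. The sketch below uses the `-checkend` option of `openssl x509`, which exits non-zero when the certificate expires within the given number of seconds. The function name `cert_expires_within` and any threshold passed to it are illustrative choices, not part of MCR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if CERT expires within DAYS days.
# Returns 2 if the certificate file cannot be read.
cert_expires_within() {
  local cert="$1" days="$2"
  [ -r "$cert" ] || return 2
  # -checkend exits 0 when the certificate is still valid after N seconds.
  if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null; then
    return 1
  fi
  return 0
}
```

For example, `cert_expires_within /srv/mcr/certs/cert.pem 14 && echo "renew soon"` could run from a daily timer.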
### MCIAS Outage
Symptoms: push/pull fails with 401 or 502 errors. Authentication
cannot complete.
1. Confirm MCIAS is unreachable:
```bash
curl -k https://svc.metacircular.net:8443/v1/health
```
2. Cached token validation results remain valid for up to 30 seconds
after the last successful MCIAS check. Operations using
recently-validated tokens may continue briefly.
3. Once cached tokens expire, all authenticated operations (push, pull,
catalog, admin) will fail until MCIAS recovers.
4. The OCI `/v2/` version check endpoint still responds to
unauthenticated requests with HTTP 401, which confirms that MCR itself
is running.
5. Escalate to MCIAS (see Escalation below).
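While waiting for MCIAS to recover, a small retry loop can report when its health endpoint is back. This is a generic sketch, not an MCR tool: `wait_until` is a hypothetical name, and the timeout and interval in the usage example are arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry COMMAND every INTERVAL seconds until it succeeds
# or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$((SECONDS + timeout))
  until "$@"; do
    [ "$SECONDS" -lt "$deadline" ] || return 1
    sleep "$interval"
  done
}
```

For example, `wait_until 600 10 curl -k -s -f -o /dev/null https://svc.metacircular.net:8443/v1/health && echo "MCIAS is back"` polls every 10 seconds for up to 10 minutes; `-f` makes `curl` fail on HTTP error statuses.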
### Disk Full
Symptoms: blob uploads fail, database writes fail, container may crash.
1. Check disk usage:
```bash
df -h /srv/mcr/
du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
```
2. Clean up stale uploads:
```bash
ls -la /srv/mcr/uploads/
```
Remove upload files that are old and have no matching in-progress
upload in the database.
3. Run garbage collection to reclaim unreferenced blobs:
```bash
mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
```
4. If GC does not free enough space, identify large repositories:
```bash
mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
```
5. Delete unused tags or repositories to free space, then run GC again.
6. If the disk is completely full and the service cannot start, manually
remove orphaned files from `/srv/mcr/uploads/` to free enough space
for the service to start, then run GC.
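Step 2 leaves open which upload files count as old. One way to list candidates is by modification time. A sketch under stated assumptions: the 24-hour default is arbitrary, `stale_uploads` is a hypothetical name, and the function only lists files. Whether a file still belongs to an in-progress upload must be checked against the database before anything is removed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list files under DIR not modified in the last HOURS hours.
# This only lists candidates; it deletes nothing.
stale_uploads() {
  local dir="${1:-/srv/mcr/uploads}" hours="${2:-24}"
  find "$dir" -type f -mmin "+$((hours * 60))" -print
}
```

Calling `stale_uploads /srv/mcr/uploads 48` prints upload files untouched for more than two days.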
### Image Push/Pull Failures
Symptoms: `docker push` or `docker pull` returns errors.
1. Verify the service is running and healthy:
```bash
curl -k https://localhost:8443/v1/health
```
2. Test OCI endpoint:
```bash
curl -k https://localhost:8443/v2/
```
Expected: HTTP 401 with `WWW-Authenticate` header.
3. Test authentication:
```bash
curl -k -u username:password "https://localhost:8443/v2/token?service=mcr"
```
Expected: HTTP 200 with a token response.
4. Check if the issue is policy-related (HTTP 403, OCI error code `DENIED`):
```bash
mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
```
Review policy rules for the affected account and repository.
5. Check audit log for denied requests:
```bash
mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
```
6. For push failures, verify that every blob the manifest references
(config and layers) was uploaded before the manifest push. The error
`MANIFEST_BLOB_UNKNOWN` means the manifest references a blob the
registry does not have.
7. Check logs for detailed error information:
```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
```
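The steps above can be condensed into a first-pass triage keyed on the HTTP status a client reports. The mapping below is a hypothetical sketch that follows this runbook's own symptom descriptions (401 and 502 for authentication and MCIAS problems, 403 for policy). The function name and message wording are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper: map an HTTP status from a failed push/pull
# to the runbook step most likely to explain it.
triage_status() {
  case "$1" in
    401) echo "auth: test the token endpoint (step 3); if MCIAS is down, see MCIAS Outage" ;;
    403) echo "policy: review policy rules and the audit log (steps 4 and 5)" ;;
    502) echo "upstream: confirm MCIAS is reachable (see MCIAS Outage)" ;;
    *)   echo "other: check the mcr-api logs (step 7)" ;;
  esac
}
```

For example, `triage_status 403` points at the policy and audit log checks.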
## MCP Deployment
MCR is deployed via MCP as a two-component service on the `rift` node.
### Service Definition
```toml
name = "mcr"
node = "rift"
active = true
[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/mcr:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28080:8080"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
```
### Port Mapping
| Component | Container Port | Host Port | Purpose |
|-----------|---------------|-----------|---------|
| mcr-api | 8443 | 28443 | HTTPS (OCI + admin REST) |
| mcr-api | 9443 | 29443 | gRPC admin API |
| mcr-web | 8080 | 28080 | Web UI (HTTP, behind MC-Proxy) |
Both containers share the `/srv/mcr` volume for configuration, database,
and blob storage. They are connected to the `mcpnet` Docker network.
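Under MCP the published host ports differ from the container ports, so health checks run from the host need the mapped port. A small sketch that extracts the host port from a `ports` entry in the service definition; the function name `host_port` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: given a ports entry such as "127.0.0.1:28443:8443",
# print the host port (the field before the container port).
host_port() {
  local mapping="$1" rest
  rest="${mapping%:*}"          # drop the trailing container port
  printf '%s\n' "${rest##*:}"   # drop the bind address, if any
}
```

For example, `curl -k "https://127.0.0.1:$(host_port "127.0.0.1:28443:8443")/v1/health"` probes mcr-api through its published port.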
## Escalation
Escalate when:
- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and registry operations are urgently needed.
- Disk full cannot be resolved by GC and cleanup.
- Push/pull failures persist after following the procedures above.
- Any issue not covered by this runbook.
Escalation path: Kyle (platform owner).