The web UI now validates the MCIAS token after login and rejects accounts with the guest role before setting the session cookie. This is defense-in-depth alongside the env:restricted MCIAS tag.

The webserver.New() constructor takes a new ValidateFunc parameter that inspects token roles post-authentication. MCIAS login does not return roles, so this requires an extra ValidateToken round-trip at login time (result is cached for 30s).

Security: guest role accounts are denied web UI access

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# MCR Runbook

## Service Overview

MCR (Metacircular Container Registry) is an OCI Distribution
Spec-compliant container registry for the Metacircular platform. It
stores and serves container images, with authentication delegated to
MCIAS and a local policy engine for fine-grained access control.

MCR runs as two containers:

- **mcr-api** -- the registry server. Exposes OCI Distribution endpoints
  and an admin REST API on port 8443 (HTTPS), plus a gRPC admin API on
  port 9443. Handles blob storage, manifest management, and
  token-based authentication via MCIAS.
- **mcr-web** -- the web UI. Communicates with mcr-api via gRPC on port
  9443. Provides repository/tag browsing and ACL policy management for
  administrators. Listens on port 8080. Guest accounts are blocked at
  login; only `admin` and `user` roles can access the web interface.

Both are fronted by MC-Proxy for TLS routing. Metadata is stored in
SQLite; blobs are stored as content-addressed files on the filesystem
under `/srv/mcr/layers/`.

## Health Checks

### REST

```bash
curl -k https://localhost:8443/v1/health
```

Expected: HTTP 200.

### gRPC

Use the `AdminService.Health` RPC on port 9443. This method is public
(no auth required).

### OCI Version Check

```bash
curl -k https://localhost:8443/v2/
```

Expected: HTTP 401 with `WWW-Authenticate` header (confirms the OCI
endpoint is alive and responding). An authenticated request returns
HTTP 200 with `{}`.

### CLI

```bash
mcrctl status --addr https://localhost:8443
```

Expected output: `ok`

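The REST and OCI checks above can be combined into one script. The following is a minimal sketch, assuming `curl` is installed on the host; `http_status`, `expect_status`, and `check_mcr` are hypothetical helper names, not part of MCR.

```bash
# Sketch: compare the HTTP status of each endpoint with the value the runbook expects.
# http_status, expect_status and check_mcr are illustrative names, not MCR tooling.

# Print the HTTP status code for a URL (curl prints 000 if the connection fails).
http_status() {
    curl -k -s -o /dev/null -w '%{http_code}' "$1"
}

# expect_status NAME ACTUAL EXPECTED: print a verdict, return non-zero on mismatch.
expect_status() {
    if [ "$2" = "$3" ]; then
        echo "ok: $1 returned $2"
    else
        echo "FAIL: $1 returned $2, expected $3"
        return 1
    fi
}

# Run both HTTP checks against a base URL (defaults to the local API port).
check_mcr() {
    local base="${1:-https://localhost:8443}" rc=0
    expect_status "/v1/health" "$(http_status "$base/v1/health")" 200 || rc=1
    # An unauthenticated /v2/ request should answer 401, proving the OCI endpoint is alive.
    expect_status "/v2/" "$(http_status "$base/v2/")" 401 || rc=1
    return "$rc"
}
```

Running `check_mcr` with no arguments checks `https://localhost:8443`; a non-zero exit status means at least one endpoint did not return the expected code.
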
## Common Operations

### Start the Service (MCP)

1. Deploy via MCP:
   ```bash
   mcp deploy mcr
   ```
2. Verify health:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### Start the Service (Docker Compose)

1. Verify config exists: `ls /srv/mcr/mcr.toml`
2. Start the containers:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml up -d
   ```
3. Verify health:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```

### Stop the Service

Via MCP:

```bash
mcp stop mcr
```

Via Docker Compose:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml stop
```

MCR handles SIGTERM gracefully: it stops accepting new connections,
drains in-flight requests (including ongoing uploads) for up to 60
seconds, then force-closes remaining connections and exits.

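Note that `docker compose stop` waits 10 seconds by default before sending SIGKILL, which is shorter than the 60-second drain window described above. The following is a minimal sketch of a stop wrapper that leaves room for the drain, assuming the compose file does not already set `stop_grace_period`; the 10-second margin and the function names are illustrative choices.

```bash
# Drain window from the runbook (seconds) plus an illustrative safety margin.
DRAIN_SECONDS=60
MARGIN_SECONDS=10

# Print the stop timeout to pass to docker compose.
stop_timeout() {
    echo $(( DRAIN_SECONDS + MARGIN_SECONDS ))
}

# Stop MCR, giving in-flight uploads time to drain before SIGKILL is sent.
stop_mcr() {
    docker compose -f deploy/docker/docker-compose-rift.yml stop -t "$(stop_timeout)"
}
```
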
### Restart the Service

Via MCP:

```bash
mcp restart mcr
```

Via Docker Compose:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml restart
```

Verify health after restart:

```bash
curl -k https://localhost:8443/v1/health
```

### Backup (Snapshot)

MCR backups have two parts: the SQLite database (metadata) and the blob
filesystem. The database snapshot alone is usable but incomplete --
missing blobs return 404 on pull.

1. Run the snapshot command:
   ```bash
   mcrsrv snapshot --config /srv/mcr/mcr.toml
   ```
2. The snapshot is saved to `/srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:
   ```bash
   ls -lh /srv/mcr/backups/
   ```
4. For a complete backup, also copy the blob directory:
   ```bash
   rsync -a /srv/mcr/layers/ /backup/mcr/layers/
   ```

A systemd timer (`mcr-backup.timer`) runs the database snapshot daily
at 02:00 UTC with 5-minute jitter.

### Restore from Snapshot

1. Stop the service (see above).
2. Back up the current database:
   ```bash
   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.pre-restore
   ```
3. Copy the snapshot into place:
   ```bash
   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
   ```
4. If restoring blobs as well:
   ```bash
   rsync -a /backup/mcr/layers/ /srv/mcr/layers/
   ```
5. Start the service (see above).
6. Verify the service is healthy:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```
7. Verify an image pull works:
   ```bash
   docker pull mcr.svc.mcp.metacircular.net:8443/<repo>:<tag>
   ```

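Because snapshot names embed a `YYYYMMDD-HHMMSS` timestamp, sorting them lexically also sorts them chronologically. The following is a minimal sketch for picking the newest snapshot to use in step 3; `latest_snapshot` is a hypothetical helper, and it assumes snapshot filenames contain no whitespace.

```bash
# Print the path of the newest mcr-*.db snapshot in a directory.
# Returns non-zero if the directory holds no snapshots.
latest_snapshot() {
    local dir="${1:-/srv/mcr/backups}" newest
    # The timestamped names sort chronologically, so the last entry is the newest.
    newest="$(printf '%s\n' "$dir"/mcr-*.db | sort | tail -n 1)"
    # If the glob matched nothing it stays literal and this test fails.
    [ -f "$newest" ] || return 1
    echo "$newest"
}
```

A possible use in the restore procedure is `snap="$(latest_snapshot)" && cp "$snap" /srv/mcr/mcr.db`, which copies nothing if no snapshot is found.
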
### Log Inspection

Container logs (mcr-api):

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-api
```

Container logs (mcr-web):

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcr-web
```

Follow logs in real time:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcr-api mcr-web
```

Via MCP:

```bash
mcp logs mcr
```

MCR logs to stderr as structured text (slog). Log level is configured
via `[log] level` in `mcr.toml` (debug, info, warn, error).

### Garbage Collection

Garbage collection removes unreferenced blobs -- blobs no longer
referenced by any manifest. GC acquires a registry-wide lock that
blocks new blob uploads for the duration of the mark-and-sweep phase.
Pulls are not blocked.

1. Trigger GC via CLI:
   ```bash
   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
2. Check GC status:
   ```bash
   mcrctl gc status --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
3. GC can also be triggered via the REST API:
   ```bash
   curl -k -X POST -H "Authorization: Bearer <token>" https://localhost:8443/v1/gc
   ```

If a previous GC crashed after the database sweep but before filesystem
cleanup, orphaned files may remain on disk. Run reconciliation to clean
them up:

```bash
mcrctl gc --reconcile --addr https://mcr.svc.mcp.metacircular.net:8443
```

## Incident Procedures

### Database Corruption

Symptoms: server fails to start with SQLite errors, or API requests
return unexpected errors.

1. Stop the service.
2. Check for WAL/SHM files alongside the database:
   ```bash
   ls -la /srv/mcr/mcr.db*
   ```
3. Attempt an integrity check:
   ```bash
   sqlite3 /srv/mcr/mcr.db "PRAGMA integrity_check;"
   ```
4. If integrity check fails, restore from the most recent snapshot:
   ```bash
   cp /srv/mcr/mcr.db /srv/mcr/mcr.db.corrupt
   cp /srv/mcr/backups/mcr-YYYYMMDD-HHMMSS.db /srv/mcr/mcr.db
   ```
5. Start the service and verify health.
6. Note: blobs on the filesystem are unaffected by database corruption.
   Images pushed after the snapshot was taken will be missing from
   metadata. Their blobs remain on disk and will be cleaned up by GC
   unless the metadata is re-created.

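Step 2 looks for `-wal` and `-shm` files because SQLite in WAL mode keeps them next to the database. If a snapshot is copied into place while a stale WAL from the corrupt database is still present, SQLite may try to apply it to the restored file. The following is a minimal sketch that moves the database and any sidecar files aside together before the restore in step 4; `quarantine_db` and the suffix naming are illustrative, and it assumes the service is already stopped.

```bash
# Move a SQLite database and its WAL/SHM sidecar files aside with a common suffix.
# quarantine_db is an illustrative helper, not part of MCR.
quarantine_db() {
    local db="$1" suffix="${2:-corrupt}" f
    [ -f "$db" ] || { echo "no database at $db" >&2; return 1; }
    for f in "$db" "$db-wal" "$db-shm"; do
        # Sidecar files only exist in WAL mode, so skip any that are absent.
        [ -e "$f" ] && mv -- "$f" "$f.$suffix"
    done
    return 0
}
```

For example, `quarantine_db /srv/mcr/mcr.db` leaves `/srv/mcr/mcr.db.corrupt` (plus `mcr.db-wal.corrupt` and `mcr.db-shm.corrupt` if they existed), after which the snapshot can be copied to `/srv/mcr/mcr.db`.
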
### TLS Certificate Expiry

Symptoms: health check fails with TLS errors, Docker clients get
certificate verification errors on push/pull.

1. Check certificate expiry:
   ```bash
   openssl x509 -in /srv/mcr/certs/cert.pem -noout -enddate
   ```
2. Replace the certificate and key files at the paths configured in
   `mcr.toml` (`[server] tls_cert` and `tls_key`).
3. Restart the service to load the new certificate.
4. Verify health:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```

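Expiry can also be caught before it causes an outage. `openssl x509 -checkend N` exits non-zero if the certificate expires within the next N seconds. The following is a minimal sketch built on that option, assuming the certificate path from step 1; the 14-day default and the function names are illustrative.

```bash
# Convert a number of days to seconds for openssl's -checkend option.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Return 0 if the certificate is still valid DAYS from now, non-zero otherwise.
cert_valid_for() {
    local cert="$1" days="${2:-14}"
    openssl x509 -in "$cert" -noout -checkend "$(days_to_seconds "$days")" >/dev/null
}
```

A possible use from a scheduled job: `cert_valid_for /srv/mcr/certs/cert.pem 14 || echo "MCR certificate expires within 14 days"`.
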
### MCIAS Outage

Symptoms: push/pull fails with 401 or 502 errors. Authentication
cannot complete.

1. Confirm MCIAS is unreachable:
   ```bash
   curl -k https://svc.metacircular.net:8443/v1/health
   ```
2. Cached token validation results remain valid for up to 30 seconds
   after the last successful MCIAS check. Operations using
   recently-validated tokens may continue briefly.
3. Once cached tokens expire, all authenticated operations (push, pull,
   catalog, admin) will fail until MCIAS recovers.
4. The OCI `/v2/` version check endpoint still responds (confirms MCR
   itself is running).
5. Escalate to MCIAS (see Escalation below).

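While waiting for MCIAS to recover, the health check from step 1 can be polled instead of run by hand. The following is a minimal sketch of a generic retry helper; `retry_until`, `wait_for_mcias`, the attempt count, and the delay are illustrative and not part of MCR or MCIAS.

```bash
# retry_until ATTEMPTS DELAY CMD...: run CMD until it succeeds or attempts run out.
retry_until() {
    local attempts="$1" delay="$2" i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        "$@" && return 0
        # Sleep between attempts, but not after the final one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Poll the MCIAS health endpoint every 10 seconds, for roughly 5 minutes in total.
wait_for_mcias() {
    retry_until 30 10 curl -k -s -f -o /dev/null https://svc.metacircular.net:8443/v1/health
}
```

`wait_for_mcias` returns 0 as soon as the endpoint answers with a success status, and 1 if every attempt fails.
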
### Disk Full

Symptoms: blob uploads fail, database writes fail, container may crash.

1. Check disk usage:
   ```bash
   df -h /srv/mcr/
   du -sh /srv/mcr/layers/ /srv/mcr/uploads/ /srv/mcr/mcr.db
   ```
2. Clean up stale uploads:
   ```bash
   ls -la /srv/mcr/uploads/
   ```
   Remove upload files that are old and have no matching in-progress
   upload in the database.
3. Run garbage collection to reclaim unreferenced blobs:
   ```bash
   mcrctl gc --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
4. If GC does not free enough space, identify large repositories:
   ```bash
   mcrctl repo list --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
5. Delete unused tags or repositories to free space, then run GC again.
6. If the disk is completely full and the service cannot start, manually
   remove orphaned files from `/srv/mcr/uploads/` to free enough space
   for the service to start, then run GC.

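Step 2 does not define how old an upload must be to count as stale. The following is a minimal sketch that lists upload files by age as deletion candidates; the default threshold is an assumption, `stale_uploads` is a hypothetical helper, and each candidate still has to be checked against in-progress uploads in the database before it is removed.

```bash
# List files under the uploads directory that have not been modified recently.
# find ignores fractional days, so -mtime +1 matches files last modified
# at least two days ago. This only prints candidates; it never deletes anything.
stale_uploads() {
    local dir="${1:-/srv/mcr/uploads}" days="${2:-1}"
    find "$dir" -type f -mtime +"$days" -print
}
```
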
### Image Push/Pull Failures

Symptoms: `docker push` or `docker pull` returns errors.

1. Verify the service is running and healthy:
   ```bash
   curl -k https://localhost:8443/v1/health
   ```
2. Test OCI endpoint:
   ```bash
   curl -k https://localhost:8443/v2/
   ```
   Expected: HTTP 401 with `WWW-Authenticate` header.
3. Test authentication (the URL is quoted so the shell does not treat
   `?` as a glob character):
   ```bash
   curl -k -u username:password "https://localhost:8443/v2/token?service=mcr"
   ```
   Expected: HTTP 200 with a token response.
4. Check if the issue is policy-related (403 Denied):
   ```bash
   mcrctl policy list --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
   Review policy rules for the affected account and repository.
5. Check audit log for denied requests:
   ```bash
   mcrctl audit tail --n 20 --addr https://mcr.svc.mcp.metacircular.net:8443
   ```
6. For push failures, verify all referenced blobs exist before pushing
   the manifest. The error `MANIFEST_BLOB_UNKNOWN` means a layer was
   not uploaded before the manifest push.
7. Check logs for detailed error information:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcr-api
   ```

## MCP Deployment

MCR is deployed via MCP as a two-component service on the `rift` node.

### Service Definition

```toml
name = "mcr"
node = "rift"
active = true

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/mcr:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28443:8443", "127.0.0.1:29443:9443"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/mcr-web:latest"
network = "mcpnet"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:28080:8080"]
volumes = ["/srv/mcr:/srv/mcr"]
cmd = ["server", "--config", "/srv/mcr/mcr.toml"]
```

### Port Mapping

| Component | Container Port | Host Port | Purpose |
|-----------|---------------|-----------|---------|
| mcr-api | 8443 | 28443 | HTTPS (OCI + admin REST) |
| mcr-api | 9443 | 29443 | gRPC admin API |
| mcr-web | 8080 | 28080 | Web UI (HTTP, behind MC-Proxy) |

Both containers share the `/srv/mcr` volume for configuration, database,
and blob storage. They are connected to the `mcpnet` Docker network.

## Escalation

Escalate when:

- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and registry operations are urgently needed.
- Disk full cannot be resolved by GC and cleanup.
- Push/pull failures persist after following the procedures above.
- Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).