diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9ee470d
--- /dev/null
+++ b/README.md
@@ -0,0 +1,42 @@
+# MCNS
+
+Metacircular Networking Service -- an authoritative DNS server for the
+Metacircular platform. MCNS serves DNS zones backed by SQLite, forwards
+non-authoritative queries to upstream resolvers, and exposes a gRPC and
+REST management API authenticated through MCIAS. Records are updated
+dynamically via the API and visible to DNS immediately on commit.
+
+## Quick Start
+
+Build the binary:
+
+```bash
+make mcns
+```
+
+Copy and edit the example configuration:
+
+```bash
+cp deploy/examples/mcns.toml /srv/mcns/mcns.toml
+# Edit TLS paths, database path, MCIAS URL, upstream resolvers
+```
+
+Run the server:
+
+```bash
+./mcns server --config /srv/mcns/mcns.toml
+```
+
+The server starts three listeners:
+
+| Port | Protocol | Purpose |
+|------|----------|---------|
+| 53 | UDP + TCP | DNS (no auth) |
+| 8443 | TCP | REST management API (TLS, MCIAS auth) |
+| 9443 | TCP | gRPC management API (TLS, MCIAS auth) |
+
+## Documentation
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) -- full technical specification, database schema, API surface, and security model.
+- [RUNBOOK.md](RUNBOOK.md) -- operational procedures and incident response for operators.
+- [CLAUDE.md](CLAUDE.md) -- context for AI-assisted development.
diff --git a/RUNBOOK.md b/RUNBOOK.md
new file mode 100644
index 0000000..89f48df
--- /dev/null
+++ b/RUNBOOK.md
@@ -0,0 +1,242 @@
+# MCNS Runbook
+
+## Service Overview
+
+MCNS is an authoritative DNS server for the Metacircular platform. It
+listens on port 53 (UDP+TCP) for DNS queries, port 8443 for the REST
+management API, and port 9443 for the gRPC management API. Zone and
+record data is stored in SQLite. All management operations require MCIAS
+authentication; DNS queries are unauthenticated.
+
+## Health Checks
+
+### CLI
+
+```bash
+mcns status --addr https://localhost:8443
+```
+
+With a custom CA certificate:
+
+```bash
+mcns status --addr https://localhost:8443 --ca-cert /srv/mcns/certs/ca.pem
+```
+
+Expected output: `ok`
+
+### REST
+
+```bash
+curl -k https://localhost:8443/v1/health
+```
+
+Expected: HTTP 200.
+
+### gRPC
+
+Use the `AdminService.Health` RPC on port 9443. This method is public
+(no auth required).
+
+### DNS
+
+```bash
+dig @localhost svc.mcp.metacircular.net SOA +short
+```
+
+A valid SOA response confirms the DNS listener and database are working.
+
+## Common Operations
+
+### Start the Service
+
+1. Verify config exists: `ls /srv/mcns/mcns.toml`
+2. Start the container:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml up -d
+   ```
+3. Verify health:
+   ```bash
+   mcns status --addr https://localhost:8443
+   ```
+
+### Stop the Service
+
+1. Stop the container:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml stop mcns
+   ```
+2. MCNS handles SIGTERM gracefully and drains in-flight requests (30s timeout).
+
+### Restart the Service
+
+1. Restart the container:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml restart mcns
+   ```
+2. Verify health:
+   ```bash
+   mcns status --addr https://localhost:8443
+   ```
+
+### Backup (Snapshot)
+
+1. Run the snapshot command:
+   ```bash
+   mcns snapshot --config /srv/mcns/mcns.toml
+   ```
+2. The snapshot is saved to `/srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db`.
+3. Verify the snapshot file exists and has a reasonable size:
+   ```bash
+   ls -lh /srv/mcns/backups/
+   ```
+
+### Restore from Snapshot
+
+1. Stop the service (see above).
+2. Back up the current database:
+   ```bash
+   cp /srv/mcns/mcns.db /srv/mcns/mcns.db.pre-restore
+   ```
+3. Copy the snapshot into place:
+   ```bash
+   cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db
+   ```
+4. Start the service (see above).
+5. Verify the service is healthy:
+   ```bash
+   mcns status --addr https://localhost:8443
+   ```
+6. Verify zones are accessible by querying DNS:
+   ```bash
+   dig @localhost svc.mcp.metacircular.net SOA +short
+   ```
+
+### Log Inspection
+
+Container logs:
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcns
+```
+
+Follow logs in real time:
+
+```bash
+docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcns
+```
+
+MCNS logs to stderr as structured text (slog). Log level is configured
+via `[log] level` in `mcns.toml` (debug, info, warn, error).
+
+## Incident Procedures
+
+### Database Corruption
+
+Symptoms: server fails to start with SQLite errors, or queries return
+unexpected errors.
+
+1. Stop the service.
+2. Check for WAL/SHM files alongside the database:
+   ```bash
+   ls -la /srv/mcns/mcns.db*
+   ```
+3. Attempt an integrity check:
+   ```bash
+   sqlite3 /srv/mcns/mcns.db "PRAGMA integrity_check;"
+   ```
+4. If integrity check fails, restore from the most recent snapshot:
+   ```bash
+   cp /srv/mcns/mcns.db /srv/mcns/mcns.db.corrupt
+   cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db
+   ```
+5. Start the service and verify health.
+6. Re-create any records added after the snapshot was taken.
+
+### Certificate Expiry
+
+Symptoms: health check fails with TLS errors, API clients get
+certificate errors.
+
+1. Check certificate expiry:
+   ```bash
+   openssl x509 -in /srv/mcns/certs/cert.pem -noout -enddate
+   ```
+2. Replace the certificate and key files at the paths in `mcns.toml`.
+3. Restart the service to load the new certificate.
+4. Verify health:
+   ```bash
+   mcns status --addr https://localhost:8443
+   ```
+
+### MCIAS Outage
+
+Symptoms: management API returns 502 or authentication errors. DNS
+continues to work normally (DNS has no auth dependency).
+
+1. Confirm MCIAS is unreachable:
+   ```bash
+   curl -k https://svc.metacircular.net:8443/v1/health
+   ```
+2. DNS resolution is unaffected -- no immediate action needed for DNS.
+3. Management operations (zone/record create/update/delete) will fail
+   until MCIAS recovers.
+4. Escalate to MCIAS (see Escalation below).
+
+### DNS Not Resolving
+
+Symptoms: `dig` queries against MCNS return SERVFAIL or time out.
+
+1. Verify the service is running:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml ps mcns
+   ```
+2. Check that port 53 is listening:
+   ```bash
+   ss -ulnp | grep ':53'
+   ss -tlnp | grep ':53'
+   ```
+3. Test an authoritative query:
+   ```bash
+   dig @localhost svc.mcp.metacircular.net SOA
+   ```
+4. Test a forwarded query:
+   ```bash
+   dig @localhost example.com A
+   ```
+5. If authoritative queries fail but forwarding works, the database may
+   be corrupt (see Database Corruption above).
+6. If forwarding fails, check upstream connectivity:
+   ```bash
+   dig @1.1.1.1 example.com A
+   ```
+7. Check logs for errors:
+   ```bash
+   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcns
+   ```
+
+### Port 53 Already in Use
+
+Symptoms: MCNS fails to start with "address already in use" on port 53.
+
+1. Identify what is using the port:
+   ```bash
+   ss -ulnp | grep ':53'
+   ss -tlnp | grep ':53'
+   ```
+2. Common culprit: `systemd-resolved` listening on `127.0.0.53:53`.
+   - If on a system with systemd-resolved, either disable it or bind
+     MCNS to a specific IP instead of `0.0.0.0:53`.
+3. If another DNS server is running, stop it or change the MCNS
+   `[dns] listen_addr` in `mcns.toml` to a different address.
+4. Restart MCNS and verify DNS is responding.
+
+## Escalation
+
+Escalate when:
+
+- Database corruption cannot be resolved by restoring a snapshot.
+- MCIAS is down and management operations are urgently needed.
+- DNS resolution failures persist after following the procedures above.
+- Any issue not covered by this runbook.
+
+Escalation path: Kyle (platform owner).
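
The restore steps in the runbook ask for "the most recent snapshot" but leave picking the file to the operator. Because snapshot names follow `mcns-YYYYMMDD-HHMMSS.db`, sorting the names lexicographically also sorts them chronologically. Below is a minimal sketch of a helper built on that property. The function name `latest_snapshot` is hypothetical and not part of the `mcns` CLI, and the sketch assumes every file matching `mcns-*.db` in the directory follows that naming pattern.

```shell
# Hypothetical helper, not part of mcns: print the newest snapshot in a
# backup directory, or return non-zero if the directory holds none.
# The function body runs in a subshell so its variables stay local.
latest_snapshot() (
    dir="${1:-/srv/mcns/backups}"
    newest=""
    # Pathname expansion yields names in sorted order, and the
    # YYYYMMDD-HHMMSS timestamp makes that order chronological,
    # so the last match is the most recent snapshot.
    for f in "$dir"/mcns-*.db; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        newest="$f"
    done
    [ -n "$newest" ] || exit 1
    printf '%s\n' "$newest"
)
```

With the service stopped and the current database backed up, `cp "$(latest_snapshot)" /srv/mcns/mcns.db` could stand in for the copy step of the restore procedure.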