Kyle Isom 96b5a0fa1b Add README.md and RUNBOOK.md per engineering standards

Create the two required documentation files for MCNS:
- README.md: project overview, quick-start (build/configure/run), links
  to ARCHITECTURE.md and RUNBOOK.md
- RUNBOOK.md: operator-focused procedures with numbered steps covering
  health checks, start/stop/restart, backup/restore, log inspection,
  and incident playbooks for database corruption, certificate expiry,
  MCIAS outage, DNS resolution failures, and port conflicts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-26 21:04:03 -07:00

MCNS Runbook

Service Overview

MCNS is an authoritative DNS server for the Metacircular platform. It listens on port 53 (UDP+TCP) for DNS queries, port 8443 for the REST management API, and port 9443 for the gRPC management API. Zone and record data is stored in SQLite. All management operations require MCIAS authentication; DNS queries are unauthenticated.

Health Checks

CLI

mcns status --addr https://localhost:8443

With a custom CA certificate:

mcns status --addr https://localhost:8443 --ca-cert /srv/mcns/certs/ca.pem

Expected output: ok

REST

curl -k https://localhost:8443/v1/health

Expected: HTTP 200.
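
The `curl -k` command above prints the response body, not the status code, so the expected HTTP 200 has to be checked separately. The following is a minimal sketch of turning it into an exit code using standard curl options (`-s`, `-o`, `-w '%{http_code}'`). The function names `is_healthy_status` and `rest_health` are illustrative and not part of MCNS.

```sh
# Hypothetical helpers; the names are illustrative, not part of MCNS.

# Succeeds only when the given HTTP status code is exactly 200.
is_healthy_status() {
    [ "$1" = "200" ]
}

# Fetches the health endpoint and checks the status code.
# -k skips TLS verification, matching the curl example in this runbook.
rest_health() {
    url="${1:-https://localhost:8443/v1/health}"
    code=$(curl -k -s -o /dev/null -w '%{http_code}' "$url") || code="000"
    is_healthy_status "$code"
}
```

Usage: `rest_health && echo healthy`, or pass a different URL as the first argument.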

gRPC

Use the AdminService.Health RPC on port 9443. This method is public (no auth required).

DNS

dig @localhost svc.mcp.metacircular.net SOA +short

A valid SOA response confirms the DNS listener and database are working.
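
For scripted checks, "a valid SOA response" can be approximated by checking the shape of the `+short` output, which for an SOA record is seven whitespace-separated fields (mname, rname, serial, refresh, retry, expire, minimum). This is a sketch only; `looks_like_soa` is a hypothetical helper name, and it checks the shape of the answer, not its correctness.

```sh
# Hypothetical helper: succeeds when the text contains a line with the
# seven SOA RDATA fields and a numeric serial in the third position.
looks_like_soa() {
    printf '%s\n' "$1" |
        awk 'NF == 7 && $3 ~ /^[0-9]+$/ { ok = 1 } END { exit !ok }'
}
```

Usage: `looks_like_soa "$(dig @localhost svc.mcp.metacircular.net SOA +short)"`.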

Common Operations

Start the Service

Verify the config exists:

ls /srv/mcns/mcns.toml

Start the container:

docker compose -f deploy/docker/docker-compose-rift.yml up -d

Verify health:

mcns status --addr https://localhost:8443
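
The container may need a moment before it accepts connections, so a single health check right after `up -d` can fail spuriously. One way to handle that is a small retry wrapper. This is a sketch; `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        # Sleep between attempts, but not after the last one.
        [ "$n" -lt "$attempts" ] && sleep "$delay"
        n=$((n + 1))
    done
    return 1
}
```

Usage: `retry 10 3 mcns status --addr https://localhost:8443`.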

Stop the Service

Stop the container:

docker compose -f deploy/docker/docker-compose-rift.yml stop mcns

MCNS handles SIGTERM gracefully and drains in-flight requests (30s timeout).
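
Note that Docker's default stop grace period is 10 seconds, after which the container is killed. Unless the compose file already sets a longer `stop_grace_period` (not verified here), the 30s drain may be cut short. The sketch below passes an explicit timeout with the `-t` option of `docker compose stop`. The `stop_mcns` function, the `DOCKER` override, and the 35-second default are assumptions for illustration.

```sh
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"
COMPOSE_FILE="deploy/docker/docker-compose-rift.yml"

# Stops the mcns container, waiting up to $1 seconds (default 35,
# an arbitrary margin above the 30s drain) before Docker sends SIGKILL.
stop_mcns() {
    "$DOCKER" compose -f "$COMPOSE_FILE" stop -t "${1:-35}" mcns
}
```

Running `DOCKER=echo stop_mcns` prints the command that would be executed without touching the container.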

Restart the Service

Restart the container:

docker compose -f deploy/docker/docker-compose-rift.yml restart mcns

Verify health:

mcns status --addr https://localhost:8443

Backup (Snapshot)

Run the snapshot command:

mcns snapshot --config /srv/mcns/mcns.toml

The snapshot is saved to /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db.
Verify the snapshot file exists and has a reasonable size:
```
ls -lh /srv/mcns/backups/
```
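
Because the `mcns-YYYYMMDD-HHMMSS.db` naming sorts chronologically as plain text, the newest snapshot can be picked without parsing dates. A minimal sketch, where `latest_snapshot` is a hypothetical helper name:

```sh
# Prints the path of the newest non-empty snapshot in the given directory
# (default /srv/mcns/backups). Prints nothing if there is none.
latest_snapshot() {
    dir="${1:-/srv/mcns/backups}"
    for f in "$dir"/mcns-*.db; do
        # -s is true only for files that exist and are larger than zero bytes.
        [ -s "$f" ] && printf '%s\n' "$f"
    done | sort | tail -n 1
}
```

Usage: `snap=$(latest_snapshot)`, then check that `$snap` is non-empty before relying on it.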

Restore from Snapshot

Stop the service (see above).

Back up the current database:

cp /srv/mcns/mcns.db /srv/mcns/mcns.db.pre-restore

Copy the snapshot into place:

cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db

Start the service (see above).

Verify the service is healthy:

mcns status --addr https://localhost:8443

Verify zones are accessible by querying DNS:

dig @localhost svc.mcp.metacircular.net SOA +short
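
The two copy steps of the restore can be sketched as a single function that refuses to proceed when the snapshot is missing or empty. `restore_snapshot` is a hypothetical helper, not an MCNS command. It assumes the service has already been stopped and does not start it again.

```sh
# restore_snapshot SNAPSHOT DB
# Saves the current DB as DB.pre-restore, then copies SNAPSHOT into place.
# Run only while the service is stopped.
restore_snapshot() {
    snapshot=$1
    db=$2
    if [ ! -s "$snapshot" ]; then
        echo "snapshot missing or empty: $snapshot" >&2
        return 1
    fi
    if [ -e "$db" ]; then
        cp "$db" "$db.pre-restore" || return 1
    fi
    cp "$snapshot" "$db"
}
```

Usage: `restore_snapshot /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db`, substituting the real snapshot name.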

Log Inspection

Container logs:

docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcns

Follow logs in real time:

docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcns

MCNS logs to stderr as structured text (slog). Log level is configured via [log] level in mcns.toml (debug, info, warn, error).
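
To narrow a long log to problems only, the output can be filtered by level. The sketch below assumes MCNS uses Go's default slog text format, where each line carries a `level=INFO`-style field. That format is an assumption here, so adjust the pattern if the real lines differ. `warnings_and_errors` is a hypothetical helper name.

```sh
# Reads log lines on stdin and prints only WARN and ERROR entries.
# Assumes slog text output such as: time=... level=ERROR msg="..."
warnings_and_errors() {
    grep -E '(^| )level=(WARN|ERROR)( |$)'
}
```

Usage: `docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 500 mcns | warnings_and_errors`.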

Incident Procedures

Database Corruption

Symptoms: server fails to start with SQLite errors, or queries return unexpected errors.

Stop the service.
Check for WAL/SHM files alongside the database:
```
ls -la /srv/mcns/mcns.db*
```

Attempt an integrity check:

sqlite3 /srv/mcns/mcns.db "PRAGMA integrity_check;"

If integrity check fails, restore from the most recent snapshot:

cp /srv/mcns/mcns.db /srv/mcns/mcns.db.corrupt
cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db

Start the service and verify health.
Re-create any records added after the snapshot was taken.
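
`PRAGMA integrity_check` returns the single row `ok` when it finds no problems; anything else is a list of errors. The sketch below wraps that rule so the same check can also be run against a snapshot before copying it into place. Both function names are hypothetical, and the second requires the `sqlite3` CLI.

```sh
# Succeeds only when the integrity_check output is exactly "ok".
integrity_output_ok() {
    [ "$1" = "ok" ]
}

# Runs the check against a database file (requires the sqlite3 CLI).
db_integrity_ok() {
    result=$(sqlite3 "$1" "PRAGMA integrity_check;" 2>&1) || return 1
    integrity_output_ok "$result"
}
```

Usage: `db_integrity_ok /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db && echo "snapshot looks intact"`.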

Certificate Expiry

Symptoms: health check fails with TLS errors, API clients get certificate errors.

Check certificate expiry:

openssl x509 -in /srv/mcns/certs/cert.pem -noout -enddate

Replace the certificate and key files at the paths in mcns.toml.
Restart the service to load the new certificate.

Verify health:

mcns status --addr https://localhost:8443
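
Expiry can also be caught ahead of time. `openssl x509 -checkend N` exits non-zero if the certificate expires within the next N seconds, so it works as a periodic check. A minimal sketch; the function names and the 30-day default are assumptions, not an MCNS convention.

```sh
# Converts days to seconds for openssl's -checkend option.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Succeeds when the certificate is still valid N days from now (default 30).
cert_valid_for_days() {
    cert=$1
    days=${2:-30}
    openssl x509 -in "$cert" -noout \
        -checkend "$(days_to_seconds "$days")" >/dev/null
}
```

Usage: `cert_valid_for_days /srv/mcns/certs/cert.pem 14 || echo "certificate expires within 14 days"`.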

MCIAS Outage

Symptoms: management API returns 502 or authentication errors. DNS continues to work normally (DNS has no auth dependency).

Confirm MCIAS is unreachable:

curl -k https://svc.metacircular.net:8443/v1/health

DNS resolution is unaffected, so no immediate action is needed for DNS.
Management operations (zone/record create/update/delete) will fail until MCIAS recovers.
Escalate to MCIAS (see Escalation below).

DNS Not Resolving

Symptoms: dig @<server> <name> returns SERVFAIL or times out.

Verify the service is running:

docker compose -f deploy/docker/docker-compose-rift.yml ps mcns

Check that port 53 is listening:

ss -ulnp | grep ':53'
ss -tlnp | grep ':53'

Test an authoritative query:

dig @localhost svc.mcp.metacircular.net SOA

Test a forwarded query:
```
dig @localhost example.com A
```
If authoritative queries fail but forwarding works, the database may be corrupt (see Database Corruption above).
If forwarding fails, check upstream connectivity:
```
dig @1.1.1.1 example.com A
```

Check logs for errors:

docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcns
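
The decision rule in this procedure (authoritative versus forwarded results) can be sketched as two small helpers. `dig_status` extracts the response status from full (non `+short`) dig output, and `triage` maps the two statuses to the next step named above. Both names are hypothetical, and the sketch treats anything other than `NOERROR` as a failure.

```sh
# Reads full dig output on stdin and prints the response status
# (NOERROR, SERVFAIL, ...). Prints nothing if no response arrived.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# triage AUTH_STATUS FORWARD_STATUS
# Prints which part of this procedure to follow next.
triage() {
    if [ "$2" != "NOERROR" ]; then
        echo "upstream"      # forwarding fails: check upstream connectivity
    elif [ "$1" != "NOERROR" ]; then
        echo "database"      # authoritative fails, forwarding works
    else
        echo "ok"
    fi
}
```

Usage: `triage "$(dig @localhost svc.mcp.metacircular.net SOA | dig_status)" "$(dig @localhost example.com A | dig_status)"`.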

Port 53 Already in Use

Symptoms: MCNS fails to start with "address already in use" on port 53.

Identify what is using the port:

ss -ulnp | grep ':53'
ss -tlnp | grep ':53'

Common culprit: systemd-resolved listening on 127.0.0.53:53.
If on a system with systemd-resolved, either disable it or bind MCNS to a specific IP instead of 0.0.0.0:53.
If another DNS server is running, stop it or change the MCNS [dns] listen_addr in mcns.toml to a different address.
Restart MCNS and verify DNS is responding.
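
A plain `grep ':53'` also matches other ports that begin with 53, such as 5353 (mDNS) or 5355 (LLMNR), which can make the port look busy when it is not. The sketch below filters `ss` output down to sockets bound to port 53 exactly. `port53_listeners` is a hypothetical helper name.

```sh
# Reads `ss -ulnp` or `ss -tlnp` output on stdin and prints only the
# sockets whose address field ends in ":53" (port 53 exactly).
port53_listeners() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /:53$/) { print; next } }'
}
```

Usage: `ss -ulnp | port53_listeners` and `ss -tlnp | port53_listeners`.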

Escalation

Escalate when:

Database corruption cannot be resolved by restoring a snapshot.
MCIAS is down and management operations are urgently needed.
DNS resolution failures persist after following the procedures above.
Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).
