# MCNS Runbook

## Service Overview

MCNS is an authoritative DNS server for the Metacircular platform. It
listens on port 53 (UDP+TCP) for DNS queries, port 8443 for the REST
management API, and port 9443 for the gRPC management API. Zone and
record data is stored in SQLite. All management operations require MCIAS
authentication; DNS queries are unauthenticated.

## Health Checks

### CLI

```bash
mcns status --addr https://localhost:8443
```

With a custom CA certificate:

```bash
mcns status --addr https://localhost:8443 --ca-cert /srv/mcns/certs/ca.pem
```

Expected output: `ok`

### REST

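```bash
curl -k -s -o /dev/null -w '%{http_code}\n' https://localhost:8443/v1/health
```

Expected: HTTP 200 (the command prints only the status code). The `-k`
flag skips certificate verification; use `--cacert` with the CA
certificate instead when one is available.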

### gRPC

Use the `AdminService.Health` RPC on port 9443. This method is public
(no auth required).

### DNS

```bash
dig @localhost svc.mcp.metacircular.net SOA +short
```

A valid SOA response confirms the DNS listener and database are working.
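
The REST and DNS checks above can be combined into one probe for cron or a monitoring agent. The following is a minimal sketch, assuming `curl` and `dig` are installed; the function names (`rest_healthy`, `dns_healthy`, `mcns_health`) are illustrative and not part of MCNS.

```bash
#!/usr/bin/env bash
# Sketch: combine the REST and DNS health checks from this section.
# Function names are hypothetical; endpoint and zone come from the checks above.

# Succeeds if the REST health endpoint returns HTTP 200.
rest_healthy() {
  local url="${1:-https://localhost:8443/v1/health}"
  local code
  # -k skips certificate verification, as in the curl example above.
  code="$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
  [ "$code" = "200" ]
}

# Succeeds if the server returns a non-empty SOA answer for the zone.
dns_healthy() {
  local server="${1:-localhost}" zone="${2:-svc.mcp.metacircular.net}"
  local answer
  # Drop dig's ";;" diagnostic lines so a timeout message is not
  # mistaken for an answer.
  answer="$(dig "@${server}" "$zone" SOA +short +time=2 +tries=1 | grep -v '^;')"
  [ -n "$answer" ]
}

# Prints one line per check and returns non-zero if either check fails.
mcns_health() {
  local rc=0
  if rest_healthy; then echo "rest: ok"; else echo "rest: FAIL"; rc=1; fi
  if dns_healthy; then echo "dns: ok"; else echo "dns: FAIL"; rc=1; fi
  return "$rc"
}
```

After sourcing the file, `mcns_health || echo "MCNS unhealthy"` runs both checks with the defaults.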

## Common Operations

### Start the Service

1. Verify config exists: `ls /srv/mcns/mcns.toml`
2. Start the container:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml up -d
   ```
3. Verify health:
   ```bash
   mcns status --addr https://localhost:8443
   ```

### Stop the Service

1. Stop the container:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml stop mcns
   ```
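2. MCNS handles SIGTERM gracefully and drains in-flight requests (30s
   timeout). `docker compose stop` sends SIGKILL after 10 seconds by
   default, so add `-t 30` to the command above unless the compose file
   already sets `stop_grace_period`.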

### Restart the Service

1. Restart the container:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml restart mcns
   ```
2. Verify health:
   ```bash
   mcns status --addr https://localhost:8443
   ```

### Backup (Snapshot)

1. Run the snapshot command:
   ```bash
   mcns snapshot --config /srv/mcns/mcns.toml
   ```
2. The snapshot is saved to `/srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:
   ```bash
   ls -lh /srv/mcns/backups/
   ```
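
Step 3 can be scripted so that a backup job fails loudly when no usable snapshot was produced. This is a sketch only: the helper names are hypothetical, and it relies on the `mcns-YYYYMMDD-HHMMSS.db` naming described above, where lexical order matches chronological order.

```bash
#!/usr/bin/env bash
# Sketch: locate and sanity-check the newest snapshot. Helper names are
# hypothetical; the directory and file naming come from the steps above.

# Prints the path of the newest snapshot, or fails if none exist.
latest_snapshot() {
  local dir="${1:-/srv/mcns/backups}"
  local newest=""
  local f
  for f in "$dir"/mcns-*.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest="$f"               # globs expand in sorted order; the last is newest
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Succeeds if the snapshot file exists and is not empty.
snapshot_nonempty() {
  [ -s "$1" ]
}
```

For example, `snap="$(latest_snapshot)" && snapshot_nonempty "$snap" || echo "no usable snapshot"`.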

### Restore from Snapshot

1. Stop the service (see above).
2. Back up the current database:
   ```bash
   cp /srv/mcns/mcns.db /srv/mcns/mcns.db.pre-restore
   ```
3. Copy the snapshot into place:
   ```bash
   cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db
   ```
4. Start the service (see above).
5. Verify the service is healthy:
   ```bash
   mcns status --addr https://localhost:8443
   ```
6. Verify zones are accessible by querying DNS:
   ```bash
   dig @localhost svc.mcp.metacircular.net SOA +short
   ```
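
Steps 2 and 3 can be wrapped in a helper that refuses to touch the live database when the snapshot is missing or empty. This is a minimal sketch, not an MCNS command; `restore_snapshot` is a hypothetical name, and it must only be run while the service is stopped.

```bash
#!/usr/bin/env bash
# Sketch: guarded version of restore steps 2 and 3. The function name is
# hypothetical; the default database path comes from the steps above.

# Replaces the live database with a snapshot, keeping a copy of the old file.
restore_snapshot() {
  local snapshot="$1" db="${2:-/srv/mcns/mcns.db}"
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  # Keep the current database so the restore can be undone.
  if [ -e "$db" ]; then
    cp -p "$db" "$db.pre-restore" || return 1
  fi
  cp "$snapshot" "$db"
}
```

Usage: `restore_snapshot /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db`, then start the service and run the health checks.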

### Log Inspection

Container logs:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcns
```

Follow logs in real time:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcns
```

MCNS logs to stderr as structured text (slog). Log level is configured
via `[log] level` in `mcns.toml` (debug, info, warn, error).
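
When scanning for problems it helps to drop everything below warning level. The sketch below assumes the log lines follow Go's `slog` text format, with a `level=WARN` or `level=ERROR` field on each line; `warn_and_error` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: filter slog text output by level. Assumes each line carries a
# level=LEVEL field, as Go's slog text handler emits; the name is hypothetical.

# Reads log lines on stdin and prints only WARN and ERROR entries.
warn_and_error() {
  # Require start-of-line or a space before "level=" so that values
  # embedded in other fields do not match.
  grep -E '(^| )level=(WARN|ERROR)'
}
```

Usage: `docker compose -f deploy/docker/docker-compose-rift.yml logs --no-color --tail 500 mcns | warn_and_error`.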

## Incident Procedures

### Database Corruption

Symptoms: server fails to start with SQLite errors, or queries return
unexpected errors.

1. Stop the service.
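2. Check for WAL/SHM files (`mcns.db-wal`, `mcns.db-shm`) alongside the
   database. SQLite normally removes them on a clean shutdown, so
   leftovers suggest the process was killed mid-write: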
   ```bash
   ls -la /srv/mcns/mcns.db*
   ```
3. Attempt an integrity check:
   ```bash
   sqlite3 /srv/mcns/mcns.db "PRAGMA integrity_check;"
   ```
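4. If integrity check fails, restore from the most recent snapshot.
   Remove any leftover WAL/SHM files first: they belong to the old
   database and must not be replayed against the restored one.
   ```bash
   cp /srv/mcns/mcns.db /srv/mcns/mcns.db.corrupt
   rm -f /srv/mcns/mcns.db-wal /srv/mcns/mcns.db-shm
   cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db
   ```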
5. Start the service and verify health.
6. Re-create any records added after the snapshot was taken.

### Certificate Expiry

Symptoms: health check fails with TLS errors, API clients get
certificate errors.

1. Check certificate expiry:
   ```bash
   openssl x509 -in /srv/mcns/certs/cert.pem -noout -enddate
   ```
2. Replace the certificate and key files at the paths in `mcns.toml`.
3. Restart the service to load the new certificate.
4. Verify health:
   ```bash
   mcns status --addr https://localhost:8443
   ```
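
Expiry can be caught before it causes an incident by checking the certificate on a schedule. The sketch below uses `openssl x509 -checkend`, which tests whether a certificate expires within a given number of seconds; the helper name `cert_expires_within` and the 30-day threshold in the usage line are illustrative choices.

```bash
#!/usr/bin/env bash
# Sketch: early warning for certificate expiry. The function name is
# hypothetical; the certificate path in the usage note comes from step 1.

# Succeeds if the certificate expires within the given number of days.
# Also succeeds if the file cannot be read, so a missing certificate
# raises the same alarm.
cert_expires_within() {
  local cert="$1" days="$2"
  # -checkend exits 0 when the certificate is still valid N seconds
  # from now and non-zero otherwise, so the result is inverted here.
  ! openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}
```

Usage: `cert_expires_within /srv/mcns/certs/cert.pem 30 && echo "renew the MCNS certificate"`.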

### MCIAS Outage

Symptoms: management API returns 502 or authentication errors. DNS
continues to work normally (DNS has no auth dependency).

1. Confirm MCIAS is unreachable:
   ```bash
   curl -k https://svc.metacircular.net:8443/v1/health
   ```
2. DNS resolution is unaffected -- no immediate action needed for DNS.
3. Management operations (zone/record create/update/delete) will fail
   until MCIAS recovers.
4. Escalate the MCIAS outage (see Escalation below).

### DNS Not Resolving

Symptoms: `dig @<server> <name>` returns SERVFAIL or times out.

1. Verify the service is running:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml ps mcns
   ```
2. Check that port 53 is listening:
   ```bash
   ss -ulnp | grep ':53 '
   ss -tlnp | grep ':53 '
   ```
3. Test an authoritative query:
   ```bash
   dig @localhost svc.mcp.metacircular.net SOA
   ```
4. Test a forwarded query:
   ```bash
   dig @localhost example.com A
   ```
5. If authoritative queries fail but forwarding works, the database may
   be corrupt (see Database Corruption above).
6. If forwarding fails, check upstream connectivity:
   ```bash
   dig @1.1.1.1 example.com A
   ```
7. Check logs for errors:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcns
   ```
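
The branching in steps 3 to 6 can be sketched as a small triage helper. It reads the `status:` field from the `dig` response header and maps the authoritative and forwarded results onto the next step. The function names are hypothetical, and the "both failing" branch is one reading of the procedure rather than something it states outright.

```bash
#!/usr/bin/env bash
# Sketch: automate the authoritative-vs-forwarded comparison.
# Function names are hypothetical; the test names come from steps 3 and 4.

# Prints the response status (e.g. NOERROR, SERVFAIL), or TIMEOUT if
# dig received no reply.
dig_status() {
  local server="$1" name="$2" type="$3" status
  status="$(dig "@${server}" "$name" "$type" +time=2 +tries=1 \
    | sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1)"
  printf '%s\n' "${status:-TIMEOUT}"
}

# Maps the two statuses onto the next step of the procedure.
triage() {
  local auth="$1" fwd="$2"
  if [ "$auth" = "NOERROR" ] && [ "$fwd" = "NOERROR" ]; then
    echo "resolving normally"
  elif [ "$fwd" = "NOERROR" ]; then
    echo "authoritative failing: check the database"
  elif [ "$auth" = "NOERROR" ]; then
    echo "forwarding failing: check upstream connectivity"
  else
    echo "both failing: check the service, port 53, and logs"
  fi
}
```

Usage: `triage "$(dig_status localhost svc.mcp.metacircular.net SOA)" "$(dig_status localhost example.com A)"`.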

### Port 53 Already in Use

Symptoms: MCNS fails to start with "address already in use" on port 53.

1. Identify what is using the port:
   ```bash
   ss -ulnp | grep ':53 '
   ss -tlnp | grep ':53 '
   ```
2. Common culprit: `systemd-resolved` listening on `127.0.0.53:53`.
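   - If on a system with systemd-resolved, either disable its stub
     listener (`DNSStubListener=no` in `/etc/systemd/resolved.conf`, then
     restart `systemd-resolved`) or bind MCNS to a specific IP instead of
     `0.0.0.0:53`.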
3. If another DNS server is running, stop it or change the MCNS
   `[dns] listen_addr` in `mcns.toml` to a different address.
4. Restart MCNS and verify DNS is responding.
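
Step 1 can be reduced to a list of process names. This sketch parses `ss -lnp` style output and prints the first process attached to each socket bound to port 53, ignoring other ports such as 5353. `port53_owners` is a hypothetical helper name, and `ss` needs root privileges to show processes owned by other users.

```bash
#!/usr/bin/env bash
# Sketch: extract the owners of port 53 from ss output.
# The function name is hypothetical.

# Reads `ss -lnp` output on stdin and prints the unique process names
# bound to port 53 (first process per socket only).
port53_owners() {
  # ":53" must be followed by whitespace so ":5353" does not match.
  sed -n 's/.*:53[[:space:]].*users:(("\([^"]*\)".*/\1/p' | sort -u
}
```

Usage: `sudo ss -tulnp | port53_owners`.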

## Escalation

Escalate when:

- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and management operations are urgently needed.
- DNS resolution failures persist after following the procedures above.
- Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).