# MCNS Runbook

## Service Overview

MCNS is an authoritative DNS server for the Metacircular platform. It
listens on port 53 (UDP+TCP) for DNS queries, port 8443 for the REST
management API, and port 9443 for the gRPC management API. Zone and
record data is stored in SQLite. All management operations require MCIAS
authentication; DNS queries are unauthenticated.

## Health Checks

### CLI

```bash
mcns status --addr https://localhost:8443
```

With a custom CA certificate:

```bash
mcns status --addr https://localhost:8443 --ca-cert /srv/mcns/certs/ca.pem
```

Expected output: `ok`

### REST

```bash
curl -k https://localhost:8443/v1/health
```

Expected: HTTP 200.

### gRPC

Use the `AdminService.Health` RPC on port 9443. This method is public
(no auth required).

### DNS

```bash
dig @localhost svc.mcp.metacircular.net SOA +short
```

A valid SOA response confirms the DNS listener and database are working.
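
The CLI, REST, and DNS checks above can be chained into one pass/fail run. This is a minimal sketch, not part of the MCNS tooling: the `check` and `run_health_checks` names and the `PASS`/`FAIL` output format are hypothetical. The gRPC check is left out because this runbook does not name a client for it.

```bash
# Hypothetical helper: run one check command quietly and report PASS or FAIL.
# Returns non-zero on failure so callers can count failed checks.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# Run the runbook's CLI, REST, and DNS checks; exit status is the failure count.
run_health_checks() {
    failures=0
    check cli mcns status --addr https://localhost:8443 \
        || failures=$((failures + 1))
    # -f makes curl exit non-zero on an HTTP error status.
    check rest curl -kfsS https://localhost:8443/v1/health \
        || failures=$((failures + 1))
    # dig exits 0 even when the answer is empty, so require non-empty output.
    check dns sh -c 'test -n "$(dig @localhost svc.mcp.metacircular.net SOA +short)"' \
        || failures=$((failures + 1))
    return "$failures"
}
```

Calling `run_health_checks` on the MCNS host prints one line per check and returns 0 only if all three pass.
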

## Common Operations

### Start the Service

1. Verify config exists: `ls /srv/mcns/mcns.toml`
2. Start the container:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml up -d
   ```
3. Verify health:
   ```bash
   mcns status --addr https://localhost:8443
   ```

### Stop the Service

1. Stop the container:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml stop mcns
   ```
2. MCNS handles SIGTERM gracefully and drains in-flight requests (30s timeout).

### Restart the Service

1. Restart the container:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml restart mcns
   ```
2. Verify health:
   ```bash
   mcns status --addr https://localhost:8443
   ```

### Backup (Snapshot)

1. Run the snapshot command:
   ```bash
   mcns snapshot --config /srv/mcns/mcns.toml
   ```
2. The snapshot is saved to `/srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db`.
3. Verify the snapshot file exists and has a reasonable size:
   ```bash
   ls -lh /srv/mcns/backups/
   ```

### Restore from Snapshot

1. Stop the service (see above).
2. Back up the current database:
   ```bash
   cp /srv/mcns/mcns.db /srv/mcns/mcns.db.pre-restore
   ```
3. Copy the snapshot into place:
   ```bash
   cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db
   ```
4. Start the service (see above).
5. Verify the service is healthy:
   ```bash
   mcns status --addr https://localhost:8443
   ```
6. Verify zones are accessible by querying DNS:
   ```bash
   dig @localhost svc.mcp.metacircular.net SOA +short
   ```

### Log Inspection

Container logs:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 100 mcns
```

Follow logs in real time:

```bash
docker compose -f deploy/docker/docker-compose-rift.yml logs -f mcns
```

MCNS logs to stderr as structured text (slog). Log level is configured
via `[log] level` in `mcns.toml` (debug, info, warn, error).

## Incident Procedures

### Database Corruption

Symptoms: server fails to start with SQLite errors, or queries return
unexpected errors.

1. Stop the service.
2. Check for WAL/SHM files (`mcns.db-wal`, `mcns.db-shm`) alongside the database:
   ```bash
   ls -la /srv/mcns/mcns.db*
   ```
3. Attempt an integrity check:
   ```bash
   sqlite3 /srv/mcns/mcns.db "PRAGMA integrity_check;"
   ```
4. If the integrity check fails, restore from the most recent snapshot:
   ```bash
   cp /srv/mcns/mcns.db /srv/mcns/mcns.db.corrupt
   cp /srv/mcns/backups/mcns-YYYYMMDD-HHMMSS.db /srv/mcns/mcns.db
   ```
5. Start the service and verify health.
6. Re-create any records added after the snapshot was taken.
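
The integrity check in step 3 prints a single `ok` line for a sound database, which makes it easy to turn into a yes/no test. The sketch below does that. The `db_is_sound` helper is hypothetical and requires the `sqlite3` command-line tool on the host.

```bash
# Hypothetical helper: succeed only if SQLite reports the database as sound.
db_is_sound() {
    db=$1
    # sqlite3 would silently create a missing file, so require it to exist.
    if [ ! -f "$db" ]; then
        return 1
    fi
    result=$(sqlite3 "$db" "PRAGMA integrity_check;" 2>&1) || return 1
    # A healthy database yields exactly one row containing "ok".
    [ "$result" = "ok" ]
}
```

For example, `db_is_sound /srv/mcns/mcns.db || echo "restore from snapshot"` would flag a database that needs step 4.
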

### Certificate Expiry

Symptoms: health check fails with TLS errors, API clients get
certificate errors.

1. Check certificate expiry:
   ```bash
   openssl x509 -in /srv/mcns/certs/cert.pem -noout -enddate
   ```
2. Replace the certificate and key files at the paths in `mcns.toml`.
3. Restart the service to load the new certificate.
4. Verify health:
   ```bash
   mcns status --addr https://localhost:8443
   ```

### MCIAS Outage

Symptoms: management API returns 502 or authentication errors. DNS
continues to work normally (DNS has no auth dependency).

1. Confirm MCIAS is unreachable:
   ```bash
   curl -k https://svc.metacircular.net:8443/v1/health
   ```
2. DNS resolution is unaffected -- no immediate action needed for DNS.
3. Management operations (zone/record create/update/delete) will fail
   until MCIAS recovers.
4. If management operations are urgently needed, escalate the MCIAS
   outage (see Escalation below).

### DNS Not Resolving

Symptoms: `dig @<server> <name>` returns SERVFAIL or times out.

1. Verify the service is running:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml ps mcns
   ```
2. Check that port 53 is listening:
   ```bash
   ss -ulnp | grep ':53'
   ss -tlnp | grep ':53'
   ```
3. Test an authoritative query:
   ```bash
   dig @localhost svc.mcp.metacircular.net SOA
   ```
4. Test a forwarded query:
   ```bash
   dig @localhost example.com A
   ```
5. If authoritative queries fail but forwarding works, the database may
   be corrupt (see Database Corruption above).
6. If forwarding fails, check upstream connectivity:
   ```bash
   dig @1.1.1.1 example.com A
   ```
7. Check logs for errors:
   ```bash
   docker compose -f deploy/docker/docker-compose-rift.yml logs --tail 50 mcns
   ```

### Port 53 Already in Use

Symptoms: MCNS fails to start with "address already in use" on port 53.

1. Identify what is using the port:
   ```bash
   ss -ulnp | grep ':53'
   ss -tlnp | grep ':53'
   ```
2. Common culprit: `systemd-resolved` listening on `127.0.0.53:53`.
   - If on a system with systemd-resolved, either disable it or bind
     MCNS to a specific IP instead of `0.0.0.0:53`.
3. If another DNS server is running, stop it or change the MCNS
   `[dns] listen_addr` in `mcns.toml` to a different address.
4. Restart MCNS and verify DNS is responding.
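
The `grep ':53'` in step 1 also matches other ports that begin with 53, such as 5353 (mDNS), which can hide the real conflict among unrelated lines. A stricter filter is sketched below. The `only_port_53` name is hypothetical, and the pattern assumes the usual `ss` layout where the local `address:port` column is followed by whitespace.

```bash
# Hypothetical filter: keep only ss lines with a port of exactly 53.
# Requires whitespace right after ":53", so ":5353" does not match.
only_port_53() {
    grep -E ':53[[:space:]]'
}
```

Usage is the same as in step 1: `ss -ulnp | only_port_53` for UDP and `ss -tlnp | only_port_53` for TCP.
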

## Deployment with MCP

MCNS runs on rift as a single container managed by MCP. The service
definition lives at `~/.config/mcp/services/mcns.toml` on the operator's
machine. A reference copy is maintained at `deploy/mcns-rift.toml` in
this repository.

The container image is pulled from MCR. The container mounts `/srv/mcns`
and runs as root (`--user 0:0`). DNS listens on port 53 (UDP+TCP) on both
192.168.88.181 and 100.95.252.120, with the management APIs on 8443
(REST) and 9443 (gRPC).

Note: the operator's `~/.config/mcp/services/mcns.toml` may still
reference the old CoreDNS image; if so, update it to the new MCNS image.

### Key Operations

1. Deploy or update: `mcp deploy mcns`
2. Restart: `mcp restart mcns`
3. Stop: `mcp stop mcns` (WARNING: stops DNS for all internal zones)
4. Check status: `mcp ps` or `mcp status mcns`
5. View logs: `ssh rift 'doas su - mcp -s /bin/sh -c "podman logs mcns"'`

## Escalation

Escalate when:

- Database corruption cannot be resolved by restoring a snapshot.
- MCIAS is down and management operations are urgently needed.
- DNS resolution failures persist after following the procedures above.
- Any issue not covered by this runbook.

Escalation path: Kyle (platform owner).