Add docker-compose, RUNBOOK.md, and docker Makefile target

- docker-compose.yml: single service with data volume, ports 8443/9443/8080
- RUNBOOK.md: health checks, common operations (start/stop/backup/init), FIDO2 key registration, incident procedures (won't start, DB corruption, cert expiry, disk full, sync failures), escalation path
- Makefile: added docker target

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
198
RUNBOOK.md
Normal file
@@ -0,0 +1,198 @@
# RUNBOOK.md — eng-pad-server

## 1. Service Overview

eng-pad-server receives engineering notebook data from the Engineering
Pad Android app via gRPC, stores it in SQLite, and serves read-only
views through a web UI. Single authenticated user.

**Ports**: 8443 (REST/HTTPS), 9443 (gRPC/TLS), 8080 (Web UI)
**Data**: `/srv/eng-pad-server/`
**Config**: `/srv/eng-pad-server/eng-pad-server.toml`
**Binary**: `/usr/local/bin/eng-pad-server`

## 2. Health Checks

1. Check service is running:
   ```
   systemctl status eng-pad-server
   ```

2. Check database health:
   ```
   eng-pad-server status -c /srv/eng-pad-server/eng-pad-server.toml
   ```

3. Check web UI responds:
   ```
   curl -k https://localhost:8443/login
   ```

4. Check gRPC responds:
   ```
   grpcurl -insecure localhost:9443 list
   ```
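
The four checks above can be chained into a single pass/fail report. The following is a minimal sketch, not part of eng-pad-server: `run_check` and `health_checks` are hypothetical helper names, and it assumes `curl` and `grpcurl` are installed on the host.

```shell
# Hypothetical helper: run one command quietly, print OK/FAIL with a label,
# and return non-zero if the command failed.
run_check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'OK    %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# Run the four health checks from this section.
# The return status is the number of failed checks (0 means all passed).
health_checks() {
    local conf=/srv/eng-pad-server/eng-pad-server.toml
    local failed=0
    run_check "service running" systemctl is-active --quiet eng-pad-server || failed=$((failed + 1))
    run_check "database status" eng-pad-server status -c "$conf" || failed=$((failed + 1))
    run_check "web UI /login" curl -ksf -o /dev/null https://localhost:8443/login || failed=$((failed + 1))
    run_check "gRPC list" grpcurl -insecure localhost:9443 list || failed=$((failed + 1))
    return "$failed"
}
```

Calling `health_checks` prints one line per check. `curl -f` treats HTTP status 400 and above as failure, which a plain `curl -k` does not.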

## 3. Common Operations

### Start / Stop / Restart

```
systemctl start eng-pad-server
systemctl stop eng-pad-server
systemctl restart eng-pad-server
```

### View Logs

```
journalctl -u eng-pad-server -f
```

### Manual Backup

```
eng-pad-server snapshot -c /srv/eng-pad-server/eng-pad-server.toml
```

Backup saved to `/srv/eng-pad-server/backups/`.

### Check Backup Timer

```
systemctl list-timers eng-pad-server-backup.timer
```

### Initialize (First Time)

1. Install the binary and config:
   ```
   sudo deploy/scripts/install.sh
   ```

2. Edit the config file:
   ```
   sudo -u engpad vi /srv/eng-pad-server/eng-pad-server.toml
   ```

3. Generate TLS certificates (or copy existing ones):
   ```
   # Self-signed for development:
   openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
     -keyout /srv/eng-pad-server/certs/key.pem \
     -out /srv/eng-pad-server/certs/cert.pem \
     -days 3650 -nodes -subj '/CN=pad.metacircular.net'
   chown engpad:engpad /srv/eng-pad-server/certs/*.pem
   chmod 600 /srv/eng-pad-server/certs/key.pem
   ```

4. Create the admin user:
   ```
   eng-pad-server init -c /srv/eng-pad-server/eng-pad-server.toml
   ```

5. Start the service:
   ```
   systemctl enable --now eng-pad-server
   systemctl enable --now eng-pad-server-backup.timer
   ```

### Register a FIDO2/U2F Security Key

1. Log in to the web UI with password.
2. Navigate to `/keys`.
3. Enter a name for the key (e.g., "YubiKey 5").
4. Click "Register" and touch the key when prompted.

### Docker Deployment

```
cd deploy/docker
docker compose up -d
```

First-time setup inside the container:
```
docker compose exec eng-pad-server eng-pad-server init -c /srv/eng-pad-server/eng-pad-server.toml
```

## 4. Alerting

No automated alerting is configured. Monitor via:
- `systemctl status eng-pad-server` — process health
- `journalctl -u eng-pad-server --since "1 hour ago" | grep ERROR` — errors
- Backup age: `ls -lt /srv/eng-pad-server/backups/ | head`
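
Until alerting exists, the backup-age check can at least be made scriptable instead of eyeballing `ls` output. This is a sketch under stated assumptions: `backup_is_fresh` is a hypothetical helper, and the threshold you pass depends on how often `eng-pad-server-backup.timer` fires, which this runbook does not state.

```shell
# Hypothetical helper: succeed if `dir` holds at least one regular file
# modified within the last `max_age_min` minutes; otherwise warn and return 1.
backup_is_fresh() {
    local dir="$1"
    local max_age_min="$2"
    local recent
    # -mmin -N matches files modified less than N minutes ago.
    recent=$(find "$dir" -maxdepth 1 -type f -mmin "-$max_age_min" 2>/dev/null | head -n 1)
    if [ -n "$recent" ]; then
        echo "fresh backup found: $recent"
    else
        echo "WARNING: no backup newer than $max_age_min minutes in $dir"
        return 1
    fi
}
```

For example, `backup_is_fresh /srv/eng-pad-server/backups 1500` would flag a gap of more than 25 hours, which suits a daily timer (an assumption).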

## 5. Incident Procedures

### Service Won't Start

1. Check logs:
   ```
   journalctl -u eng-pad-server -n 50 --no-pager
   ```
2. Common causes:
   - Config file missing or invalid → fix config
   - TLS cert/key missing → regenerate or copy
   - Port already in use → `ss -tlnp | grep -E ':(8443|9443|8080)'`
   - Database locked → check for stale processes holding the file: `fuser /srv/eng-pad-server/eng-pad-server.db`
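
The first two causes can be ruled out quickly before reading logs. This is a minimal sketch: `preflight` is a hypothetical helper, and it assumes the config points at the `certs/cert.pem` and `certs/key.pem` paths used in the initialization steps.

```shell
# Hypothetical helper: report whether the files the service needs at startup
# exist and are readable. Returns 1 if any of them is missing.
preflight() {
    local base="${1:-/srv/eng-pad-server}"
    local missing=0
    local f
    for f in "$base/eng-pad-server.toml" "$base/certs/cert.pem" "$base/certs/key.pem"; do
        if [ -r "$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as the service account so that the readability test reflects what the service itself can see. Root can read everything, so a root run would hide permission problems on `key.pem`.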

### Database Corruption

1. Stop the service:
   ```
   systemctl stop eng-pad-server
   ```
2. Check integrity:
   ```
   sqlite3 /srv/eng-pad-server/eng-pad-server.db "PRAGMA integrity_check"
   ```
3. If corrupted, restore from backup:
   ```
   cp /srv/eng-pad-server/backups/eng-pad-server-LATEST.db /srv/eng-pad-server/eng-pad-server.db
   chown engpad:engpad /srv/eng-pad-server/eng-pad-server.db
   ```
4. Restart:
   ```
   systemctl start eng-pad-server
   ```
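
The restore step overwrites the damaged file, which removes any chance of inspecting it later. A more cautious variant is sketched below. `restore_latest` is a hypothetical helper; it assumes the backups directory contains only database snapshots and picks the newest by modification time. It also moves aside any `-wal` and `-shm` sidecar files, which only exist if the database runs in WAL mode (not stated in this runbook).

```shell
# Hypothetical helper: set the current database (and any SQLite sidecar
# files) aside under a timestamped name, then copy the newest backup in place.
restore_latest() {
    local backup_dir="$1"
    local db_path="$2"
    local latest stamp suffix
    latest=$(ls -t "$backup_dir" 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "no backups found in $backup_dir" >&2
        return 1
    fi
    stamp=$(date +%Y%m%d%H%M%S)
    for suffix in "" -wal -shm; do
        if [ -e "$db_path$suffix" ]; then
            mv "$db_path$suffix" "$db_path$suffix.corrupt.$stamp"
        fi
    done
    cp "$backup_dir/$latest" "$db_path"
    echo "restored $latest"
}
```

With the service stopped, `restore_latest /srv/eng-pad-server/backups /srv/eng-pad-server/eng-pad-server.db` would replace step 3. The `chown` and a second `PRAGMA integrity_check` are still needed before restarting.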

### Certificate Expiry

1. Check expiry:
   ```
   openssl x509 -in /srv/eng-pad-server/certs/cert.pem -noout -dates
   ```
2. Regenerate or renew the certificate.
3. Restart the service (picks up new certs on start).
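
Expiry can also be caught before it becomes an incident. `openssl x509` has a `-checkend` option that exits non-zero when the certificate expires within a given number of seconds. The wrapper below is a sketch; `cert_valid_for` is a hypothetical name.

```shell
# Hypothetical helper: succeed if the certificate remains valid for at
# least `days` more days, fail if it expires sooner (or has already expired).
cert_valid_for() {
    local cert="$1"
    local days="$2"
    openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}
```

For example, `cert_valid_for /srv/eng-pad-server/certs/cert.pem 30 || echo "renew soon"` warns a month ahead of expiry.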

### Disk Full

1. Check disk usage:
   ```
   df -h /srv/eng-pad-server/
   du -sh /srv/eng-pad-server/*
   ```
2. Prune old backups (this keeps the 7 most recent and deletes the rest):
   ```
   ls -t /srv/eng-pad-server/backups/ | tail -n +8 | xargs -I{} rm /srv/eng-pad-server/backups/{}
   ```
3. Compact the database. `VACUUM` rebuilds the file through a temporary copy and can need up to twice the database size in free space, so prune backups first:
   ```
   sqlite3 /srv/eng-pad-server/eng-pad-server.db "VACUUM"
   ```
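
Because the prune one-liner deletes immediately, a dry-run mode is useful during an incident. The sketch below uses the same `ls -t | tail` selection as step 2; `prune_backups` is a hypothetical helper and assumes backup file names contain no newlines.

```shell
# Hypothetical helper: delete all but the `keep` newest files in `dir`.
# Pass "dry-run" as the third argument to only print what would be removed.
prune_backups() {
    local dir="$1"
    local keep="$2"
    local mode="${3:-delete}"
    local name
    # ls -t sorts newest first; tail -n +K starts output at line K.
    ls -t "$dir" | tail -n "+$((keep + 1))" | while IFS= read -r name; do
        if [ "$mode" = "dry-run" ]; then
            echo "would remove $dir/$name"
        else
            rm -- "$dir/$name"
        fi
    done
}
```

Run `prune_backups /srv/eng-pad-server/backups 7 dry-run` first, review the list, then repeat without `dry-run`.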

### Sync Fails from Android App

1. Verify the server is reachable from the device's network.
2. Check that the gRPC port is listening on the server: `ss -tlnp | grep 9443`
3. Check that the TLS cert is valid and trusted by the device.
4. Check credentials: verify the user exists via `eng-pad-server status`.
5. Check server logs for auth failures: `journalctl -u eng-pad-server | grep UNAUTHENTICATED`

## 6. Escalation

If the runbook doesn't resolve the issue:
1. Check ARCHITECTURE.md for system design context.
2. Check AUDIT.md for known security considerations.
3. Review recent commits for changes that may have introduced the issue.