eng-pad-server/RUNBOOK.md
Kyle Isom 2185bbe563 Add passwd command, fix template rendering, update deployment docs
- Add `passwd` CLI command to reset user passwords
- Fix web UI templates: parse each page template with layout so blocks
  render correctly (was outputting empty pages)
- Add login error logging for debugging auth failures
- Update README with deploy workflow and container management commands
- Update RUNBOOK for Docker-on-deimos deployment (replaces systemd refs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 08:27:31 -07:00


# RUNBOOK.md — eng-pad-server
## 1. Service Overview
eng-pad-server receives engineering notebook data from the Engineering
Pad Android app via gRPC, stores it in SQLite, and serves read-only
views through a web UI. Single authenticated user.
**Host**: deimos.wntrmute.net
**URL**: https://pad.metacircular.net
**Ports**: 443 (nginx → 127.0.0.1:8090 → container port 8080, web UI), 8443 (REST/TLS), 9443 (gRPC/TLS)
**Data**: `/srv/eng-pad-server/`
**Config**: `/srv/eng-pad-server/eng-pad-server.toml`
**TLS**: Let's Encrypt (`/etc/letsencrypt/live/pad.metacircular.net/`), copied to `/srv/eng-pad-server/certs/`
**Container**: `eng-pad-server` (Docker, `--restart unless-stopped`)
## 2. Health Checks
1. Check container is running:
```
docker ps | grep eng-pad-server
```
2. Check web UI responds:
```
curl -s https://pad.metacircular.net/login | head -1
```
3. Check container logs:
```
docker logs eng-pad-server --tail 20
```
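The checks above can be combined into one function for quick triage. This is a minimal sketch, not part of the repository: the name `check_health` is hypothetical, it assumes `docker` and `curl` are available to the user running it, and it assumes `/login` answers with HTTP 200 when the service is healthy.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs the first two checks from this section in order.
# Container name and URL are taken from Section 1.

CONTAINER="eng-pad-server"
URL="https://pad.metacircular.net/login"

check_health() {
    local running status

    # 1. Is the container running? `docker inspect` prints "true" or "false".
    running=$(docker inspect -f '{{.State.Running}}' "$CONTAINER" 2>/dev/null)
    if [ "$running" != "true" ]; then
        echo "FAIL: container $CONTAINER is not running"
        return 1
    fi

    # 2. Does the web UI answer? Capture only the HTTP status code.
    #    Assumption: a healthy /login page returns 200.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$URL")
    if [ "$status" != "200" ]; then
        echo "FAIL: $URL returned HTTP $status"
        return 1
    fi

    echo "OK: container running, web UI returned HTTP 200"
}
```

After sourcing the file, run `check_health`; on failure, follow up with the log check in step 3.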
## 3. Common Operations
### Start / Stop / Restart
```
docker start eng-pad-server
docker stop eng-pad-server
docker restart eng-pad-server
```
### View Logs
```
docker logs eng-pad-server -f
```
### Deploy New Version
```bash
# From local machine:
rsync -az --exclude='.git' --exclude='srv/' . deimos.wntrmute.net:/tmp/eng-pad-server-build/
ssh deimos.wntrmute.net "cd /tmp/eng-pad-server-build && \
docker build -t eng-pad-server . && \
docker stop eng-pad-server && docker rm eng-pad-server && \
docker run -d --name eng-pad-server --restart unless-stopped \
-p 127.0.0.1:8090:8080 -p 8443:8443 -p 9443:9443 \
-v /srv/eng-pad-server:/srv/eng-pad-server eng-pad-server"
```
### Create User
```
docker exec -it eng-pad-server \
eng-pad-server init -c /srv/eng-pad-server/eng-pad-server.toml
```
### Reset User Password
```
docker exec -it eng-pad-server \
eng-pad-server passwd <username> -c /srv/eng-pad-server/eng-pad-server.toml
```
### Manual Backup
```
docker exec eng-pad-server \
eng-pad-server snapshot -c /srv/eng-pad-server/eng-pad-server.toml
```
Backup saved to `/srv/eng-pad-server/backups/`.
### Renew TLS Certificates
After certbot renews the Let's Encrypt cert:
```
sudo cp /etc/letsencrypt/live/pad.metacircular.net/{fullchain,privkey}.pem \
/srv/eng-pad-server/certs/
docker restart eng-pad-server
```
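The copy-and-restart step above could be automated with a certbot deploy hook. The following is a sketch under stated assumptions, not the deployed configuration: the install path and the function name `deploy_certs` are hypothetical. It relies on certbot setting `RENEWED_LINEAGE` to the live directory of the renewed certificate when it runs deploy hooks.

```bash
#!/usr/bin/env bash
# Sketch of a certbot deploy hook. The install path is an assumption:
#   /etc/letsencrypt/renewal-hooks/deploy/eng-pad-server.sh
# certbot sets RENEWED_LINEAGE to the live directory of the renewed cert.

CERT_NAME="pad.metacircular.net"
DEST="/srv/eng-pad-server/certs"

deploy_certs() {
    local lineage="$1" dest="$2"

    # Ignore renewals of other certificates on this host.
    [ "$(basename "$lineage")" = "$CERT_NAME" ] || return 0

    # -L makes explicit that the symlink targets in live/ are copied,
    # not the symlinks themselves.
    cp -L "$lineage/fullchain.pem" "$lineage/privkey.pem" "$dest/" || return 1

    # The server reads its certificates at startup, so restart it.
    docker restart eng-pad-server
}

# Run only when invoked by certbot (RENEWED_LINEAGE is set).
if [ -n "${RENEWED_LINEAGE:-}" ]; then
    deploy_certs "$RENEWED_LINEAGE" "$DEST"
fi
```

The hook runs as the user certbot runs as (typically root), so no `sudo` is needed inside it.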
### Register a FIDO2/U2F Security Key
1. Log in to the web UI at https://pad.metacircular.net with password.
2. Navigate to `/keys`.
3. Enter a name for the key (e.g., "YubiKey 5").
4. Click "Register" and touch the key when prompted.
## 4. Alerting
No automated alerting is configured. Monitor via:
- Process health: `docker ps --filter name=eng-pad-server`
- Errors: `docker logs eng-pad-server --since 1h 2>&1 | grep ERROR`
- Backup age: `ls -lt /srv/eng-pad-server/backups/ | head`
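Since there is no automated alerting, the backup-age check is the one most easily forgotten. Below is a minimal sketch of a function that reports the age of the newest backup. The name `backup_age_hours` is hypothetical, and it assumes GNU `stat` (as found on Linux hosts).

```bash
#!/usr/bin/env bash
# Hypothetical check: prints the age, in whole hours, of the newest file
# in a backup directory. Assumes GNU stat (`stat -c %Y`).

backup_age_hours() {
    local dir="$1" newest now mtime

    # Newest entry by modification time; empty if the directory has none.
    newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no backups found in $dir" >&2
        return 1
    fi

    now=$(date +%s)
    mtime=$(stat -c %Y "$dir/$newest")
    echo $(( (now - mtime) / 3600 ))
}
```

For example, `backup_age_hours /srv/eng-pad-server/backups` prints a single number that a cron job could compare against a threshold of your choosing.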
## 5. Incident Procedures
### Service Won't Start
1. Check logs:
```
docker logs eng-pad-server --tail 50
```
2. Common causes:
- Config file missing or invalid → fix `/srv/eng-pad-server/eng-pad-server.toml`
- TLS cert/key missing → re-copy from Let's Encrypt (see Renew TLS above)
- Port already in use → `ss -tlnp | grep -E '8443|9443|8090'`
- Database locked → check for zombie processes: `fuser /srv/eng-pad-server/eng-pad-server.db`
### Database Corruption
1. Stop the service:
```
docker stop eng-pad-server
```
2. Check integrity:
```
sqlite3 /srv/eng-pad-server/eng-pad-server.db "PRAGMA integrity_check"
```
3. If corrupted, restore from backup:
```
cp /srv/eng-pad-server/backups/eng-pad-server-LATEST.db /srv/eng-pad-server/eng-pad-server.db
chown engpad:engpad /srv/eng-pad-server/eng-pad-server.db
```
4. Restart:
```
docker start eng-pad-server
```
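The restore step uses `LATEST` as a placeholder and copies the backup without checking it first. The sketch below shows one way to resolve the newest backup and verify it before restoring. Both function names are hypothetical, and it assumes backups are SQLite files ending in `.db`, which this runbook does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for step 3: find the newest backup and check the
# backup itself before copying it over the live database.
# Assumption: backups are SQLite files named *.db.

latest_backup() {
    local dir="$1" newest

    # Newest *.db by modification time; empty if nothing matches.
    newest=$(ls -t "$dir"/*.db 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    echo "$newest"
}

backup_is_sound() {
    # PRAGMA integrity_check prints exactly "ok" for a healthy database.
    [ "$(sqlite3 "$1" 'PRAGMA integrity_check' 2>/dev/null)" = "ok" ]
}
```

With the container stopped, a restore could then look like `src=$(latest_backup /srv/eng-pad-server/backups) && backup_is_sound "$src" && cp "$src" /srv/eng-pad-server/eng-pad-server.db`.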
### Certificate Expiry
1. Check expiry:
```
openssl x509 -in /srv/eng-pad-server/certs/fullchain.pem -noout -dates
```
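2. If the certificate has expired or is about to, renew it and copy the new files into place (see Renew TLS Certificates above).
3. Restart the container with `docker restart eng-pad-server`; it picks up new certs on start.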
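To catch expiry before it becomes an incident, the date check can be turned into a pass/fail test. This is a minimal sketch using `openssl x509 -checkend`, which takes a number of seconds; the function name `cert_valid_for_days` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeeds (exit 0) if the certificate is still valid
# `days` days from now, fails otherwise.

cert_valid_for_days() {
    local cert="$1" days="$2"

    # -checkend takes seconds, so convert days first.
    openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null
}
```

Pass it the certificate path from step 1 and a margin, for example `cert_valid_for_days <cert> 14 || echo "renew soon"`.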
### Disk Full
1. Check disk usage:
```
df -h /srv/eng-pad-server/
du -sh /srv/eng-pad-server/*
```
2. Prune old backups:
```
ls -t /srv/eng-pad-server/backups/ | tail -n +8 | xargs -I{} rm /srv/eng-pad-server/backups/{}
```
3. Compact the database:
```
sqlite3 /srv/eng-pad-server/eng-pad-server.db "VACUUM"
```
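The prune one-liner in step 2 hard-codes the retention count (it keeps the seven newest files). The sketch below is a hypothetical variant with the count as a parameter; like the original, it assumes file names without newlines.

```bash
#!/usr/bin/env bash
# Hypothetical variant of step 2: keeps the `keep` newest files in `dir`
# (default 7, matching the one-liner) and deletes the rest.

prune_backups() {
    local dir="$1" keep="${2:-7}"

    # `ls -t` lists newest first; `tail -n +K` prints from line K onward,
    # so everything after the first `keep` entries is removed.
    ls -t "$dir" | tail -n +"$(( keep + 1 ))" | while IFS= read -r name; do
        rm -- "$dir/$name"
    done
}
```

For example, `prune_backups /srv/eng-pad-server/backups 14` would keep two weeks of backups if they are taken daily.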
### Sync Fails from Android App
1. Verify server is reachable from the device's network.
2. Check gRPC port is open: `ss -tlnp | grep 9443`
3. Check TLS cert is valid and trusted by the device.
4. Check credentials: verify the user exists via `eng-pad-server status` (run it inside the container with `docker exec`, as in Section 3).
5. Check server logs for auth failures: `docker logs eng-pad-server 2>&1 | grep UNAUTHENTICATED`
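For step 5, a count over a recent time window is often more useful than the raw matching lines. A minimal sketch follows; the function name `recent_auth_failures` is hypothetical, and it assumes auth failures appear in the container logs with the string `UNAUTHENTICATED`, as the step above implies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: counts log lines containing UNAUTHENTICATED in a
# recent window (default: the last hour).

recent_auth_failures() {
    local since="${1:-1h}"

    # `docker logs` replays the container's stdout and stderr separately,
    # so merge them before grepping. `grep -c` prints the match count.
    docker logs --since "$since" eng-pad-server 2>&1 | grep -c UNAUTHENTICATED
}
```

Running `recent_auth_failures 24h` right after a failed sync attempt shows whether the request reached the server at all: a count of zero points back to steps 1 through 3.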
## 6. Escalation
If the runbook doesn't resolve the issue:
1. Check ARCHITECTURE.md for system design context.
2. Check AUDIT.md for known security considerations.
3. Review recent commits for changes that may have introduced the issue.