# RUNBOOK.md — eng-pad-server

## 1. Service Overview

eng-pad-server receives engineering notebook data from the Engineering
Pad Android app via gRPC, stores it in SQLite, and serves read-only
views through a web UI. Single authenticated user.

**Host**: deimos.wntrmute.net
**URL**: https://pad.metacircular.net
**Ports**: 443 (nginx → 127.0.0.1:8090 on the host → 8080 in the container, web UI), 8443 (REST/TLS), 9443 (gRPC/TLS)
**Data**: `/srv/eng-pad-server/`
**Config**: `/srv/eng-pad-server/eng-pad-server.toml`
**TLS**: Let's Encrypt (`/etc/letsencrypt/live/pad.metacircular.net/`), copied to `/srv/eng-pad-server/certs/`
**Container**: `eng-pad-server` (Docker, `--restart unless-stopped`)

## 2. Health Checks

1. Check container is running:
```
docker ps | grep eng-pad-server
```

2. Check web UI responds:
```
curl -s https://pad.metacircular.net/login | head -1
```

3. Check container logs:
```
docker logs eng-pad-server --tail 20
```
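
The three checks above can be combined into one script. This is a minimal sketch and not part of the repository: the function name `check_health`, the 10-second timeout, and the convention of returning the number of failed checks are all assumptions. It uses only `docker ps`, `curl`, and `docker logs`, as in the manual steps.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs the three manual health checks in sequence.
# CONTAINER and URL are taken from the Service Overview; everything else is illustrative.
CONTAINER="eng-pad-server"
URL="https://pad.metacircular.net/login"

check_health() {
    local failures=0

    # 1. Container is running (exact name match among running containers).
    if docker ps --format '{{.Names}}' | grep -qx "$CONTAINER"; then
        echo "OK: container $CONTAINER is running"
    else
        echo "FAIL: container $CONTAINER is not running"
        failures=$((failures + 1))
    fi

    # 2. Web UI responds; --fail makes curl exit non-zero on HTTP error codes.
    if curl --silent --fail --max-time 10 --output /dev/null "$URL"; then
        echo "OK: web UI responds at $URL"
    else
        echo "FAIL: web UI did not respond at $URL"
        failures=$((failures + 1))
    fi

    # 3. Print recent log lines for a human to review.
    docker logs "$CONTAINER" --tail 20 2>&1 || true

    # Exit status is the number of failed checks (0 means healthy).
    return "$failures"
}
```

Source the file and run `check_health`; a non-zero status means at least one check failed.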

## 3. Common Operations

### Start / Stop / Restart

```
docker start eng-pad-server
docker stop eng-pad-server
docker restart eng-pad-server
```

### View Logs

```
docker logs eng-pad-server -f
```

### Deploy New Version

```bash
# From local machine:
rsync -az --exclude='.git' --exclude='srv/' . deimos.wntrmute.net:/tmp/eng-pad-server-build/
ssh deimos.wntrmute.net "cd /tmp/eng-pad-server-build && \
  docker build -t eng-pad-server . && \
  docker stop eng-pad-server && docker rm eng-pad-server && \
  docker run -d --name eng-pad-server --restart unless-stopped \
    -p 127.0.0.1:8090:8080 -p 8443:8443 -p 9443:9443 \
    -v /srv/eng-pad-server:/srv/eng-pad-server eng-pad-server"
```

### Create User

```
docker exec -it eng-pad-server \
  eng-pad-server init -c /srv/eng-pad-server/eng-pad-server.toml
```

### Reset User Password

```
docker exec -it eng-pad-server \
  eng-pad-server passwd <username> -c /srv/eng-pad-server/eng-pad-server.toml
```

### Manual Backup

```
docker exec eng-pad-server \
  eng-pad-server snapshot -c /srv/eng-pad-server/eng-pad-server.toml
```

Backup saved to `/srv/eng-pad-server/backups/`.

### Renew TLS Certificates

After certbot renews the Let's Encrypt cert:
```
sudo cp /etc/letsencrypt/live/pad.metacircular.net/{fullchain,privkey}.pem \
  /srv/eng-pad-server/certs/
docker restart eng-pad-server
```

### Register a FIDO2/U2F Security Key

1. Log in to the web UI at https://pad.metacircular.net with password.
2. Navigate to `/keys`.
3. Enter a name for the key (e.g., "YubiKey 5").
4. Click "Register" and touch the key when prompted.

## 4. Alerting

No automated alerting is configured. Monitor via:
- Container health: `docker ps --filter name=eng-pad-server`
- Errors: `docker logs eng-pad-server --since 1h 2>&1 | grep ERROR`
- Backup age: `ls -lt /srv/eng-pad-server/backups/ | head`
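
Because nothing alerts automatically, the backup-age check is easy to forget. The sketch below shows one way to turn it into a pass/fail check that could run from cron. The function names and the 48-hour default threshold are assumptions, not project policy, and `stat -c %Y` assumes GNU coreutils on the host.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: report whether the newest backup is recent enough.

# Prints the age, in whole hours, of the newest file in DIR.
# Returns 1 if DIR is empty or missing.
newest_backup_age_hours() {
    local dir="$1" newest now mtime
    # ls -t sorts by modification time, newest first.
    newest="$(ls -t "$dir" 2>/dev/null | head -n 1)"
    [ -n "$newest" ] || return 1
    now="$(date +%s)"
    mtime="$(stat -c %Y "$dir/$newest")"   # GNU stat: mtime as epoch seconds
    echo $(( (now - mtime) / 3600 ))
}

# Usage: check_backup_age [DIR] [MAX_HOURS]
# The default directory is from this runbook; the 48h limit is an assumed value.
check_backup_age() {
    local dir="${1:-/srv/eng-pad-server/backups}" max_hours="${2:-48}" age
    if ! age="$(newest_backup_age_hours "$dir")"; then
        echo "FAIL: no backups found in $dir"
        return 1
    fi
    if [ "$age" -gt "$max_hours" ]; then
        echo "WARN: newest backup is ${age}h old (limit ${max_hours}h)"
        return 1
    fi
    echo "OK: newest backup is ${age}h old"
}
```

Running `check_backup_age` with no arguments checks the default backup directory; a non-zero exit status can be used to trigger a notification of your choice.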

## 5. Incident Procedures

### Service Won't Start

1. Check logs:
```
docker logs eng-pad-server --tail 50
```
2. Common causes:
- Config file missing or invalid → fix `/srv/eng-pad-server/eng-pad-server.toml`
- TLS cert/key missing → re-copy from Let's Encrypt (see Renew TLS above)
- Port already in use → `ss -tlnp | grep -E '8443|9443|8090'`
- Database locked → check for zombie processes: `fuser /srv/eng-pad-server/eng-pad-server.db`

### Database Corruption

1. Stop the service:
```
docker stop eng-pad-server
```
2. Check integrity:
```
sqlite3 /srv/eng-pad-server/eng-pad-server.db "PRAGMA integrity_check"
```
3. If corrupted, restore from backup:
```
cp /srv/eng-pad-server/backups/eng-pad-server-LATEST.db /srv/eng-pad-server/eng-pad-server.db
chown engpad:engpad /srv/eng-pad-server/eng-pad-server.db
```
4. Restart:
```
docker start eng-pad-server
```

### Certificate Expiry

1. Check expiry:
```
openssl x509 -in /srv/eng-pad-server/certs/fullchain.pem -noout -dates
```
2. Renew the certificate and copy it into place (see Renew TLS Certificates above).
3. Restart the container, which picks up new certs on start: `docker restart eng-pad-server`
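
Reading the dates by eye works during an incident, but the same check can be made pass/fail with `openssl x509 -checkend`, which tests whether a certificate will still be valid a given number of seconds from now. The function name and the 14-day default below are assumptions. The certificate path in the usage line assumes the file is named `fullchain.pem`, as copied in Renew TLS Certificates.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if CERT is valid for at least DAYS more days.
# Usage: check_cert_validity CERT [DAYS]
check_cert_validity() {
    local cert="$1" days="${2:-14}"   # 14 days is an assumed warning window
    # -checkend N exits 0 if the cert is still valid N seconds from now, 1 otherwise.
    if openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null 2>&1; then
        echo "OK: $cert is valid for at least $days more days"
        return 0
    fi
    echo "WARN: $cert expires within $days days (or could not be read)"
    return 1
}
```

For example, `check_cert_validity /srv/eng-pad-server/certs/fullchain.pem 14` returns non-zero once the certificate is within two weeks of expiry.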

### Disk Full

1. Check disk usage:
```
df -h /srv/eng-pad-server/
du -sh /srv/eng-pad-server/*
```
2. Prune old backups (keeps the 7 most recent and removes the rest):
```
ls -t /srv/eng-pad-server/backups/ | tail -n +8 | xargs -I{} rm /srv/eng-pad-server/backups/{}
```
3. Compact the database:
```
sqlite3 /srv/eng-pad-server/eng-pad-server.db "VACUUM"
```
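
The prune one-liner in step 2 deletes immediately. A variant with a dry-run default can make it safer to use under pressure. The following is a sketch under assumptions: the function name, argument order, and the `yes`/`no` dry-run flag are illustrative, and like the one-liner it assumes backup file names contain no newlines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keeps the KEEP newest files in DIR and removes the rest.
# Usage: prune_backups DIR [KEEP] [DRY_RUN]   (DRY_RUN defaults to "yes")
prune_backups() {
    local dir="$1" keep="${2:-7}" dry_run="${3:-yes}" name
    # ls -t lists newest first; tail -n +K starts output at line K,
    # so the first $keep entries (the newest) are skipped.
    ls -t "$dir" | tail -n "+$((keep + 1))" | while IFS= read -r name; do
        if [ "$dry_run" = "yes" ]; then
            echo "would remove $dir/$name"
        else
            rm -- "$dir/$name" && echo "removed $dir/$name"
        fi
    done
}
```

Run `prune_backups /srv/eng-pad-server/backups 7` first to review the list, then repeat with `no` as the third argument to delete.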

### Sync Fails from Android App

1. Verify server is reachable from the device's network.
2. Check gRPC port is open: `ss -tlnp | grep 9443`
3. Check TLS cert is valid and trusted by the device.
4. Check credentials: verify the user exists via `eng-pad-server status` (run inside the container with `docker exec`).
5. Check server logs for auth failures: `docker logs eng-pad-server 2>&1 | grep UNAUTHENTICATED`

## 6. Escalation

If the runbook doesn't resolve the issue:
1. Check ARCHITECTURE.md for system design context.
2. Check AUDIT.md for known security considerations.
3. Review recent commits for changes that may have introduced the issue.