Fix ECDH zeroization, add audit logging, and remediate high findings

- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored
  privBytes instead of calling Bytes() (which returns a copy). New
  state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event
  recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-17 14:04:39 -07:00
parent b33d1f99a0
commit 5c5d7e184e
24 changed files with 1699 additions and 72 deletions

View File

@@ -561,19 +561,702 @@ subsystems (barrier/seal vs engines/server).
---
## Post-Remediation
## Post-Remediation (High)
After all eight findings are resolved:
All eight High findings (#39, #40, #48, #49, #61, #62, #68, #69) have
been resolved. See commit `a80323e`.
1. **Update AUDIT.md** — mark #39, #40, #48, #49, #61, #62, #68, #69 as
RESOLVED with resolution summaries.
---
---
# Remediation Plan — Medium-Priority Audit Findings
**Date**: 2026-03-17
**Scope**: AUDIT.md findings #7, #10, #21, #41, #42, #43, #46, #50, #51,
#53, #54, #58, #59, #63, #64, #65, #66, #67, #70, #72, #73, #74, #78, #79
This plan addresses all medium-severity findings. Findings are grouped into
eight work items by subsystem. Two findings (#72, #79) are already resolved
or trivial and are noted inline.
**Previously resolved by High remediation**:
- #72 (policy rule ID path traversal) — covered by #48's `ValidateName`
on `CreateRule` and barrier-level `validatePath`.
---
## Work Item 5: Default-Deny in Engine Policy Checks (#54, #58)
**Risk**: The SSH CA and transit engines default to **allow** when no policy
rule matches a non-admin user's request. This contradicts the default-deny
principle in ARCHITECTURE.md and the engineering standards. Now that #69 is
resolved (typed REST handlers pass `CheckPolicy`), the engines receive a
policy checker — but when `CheckPolicy` returns no match, they still allow.
**Root cause**: Both engines treat "no matching policy rule" as implicit
allow for authenticated users.
### Fix
Change the policy evaluation result handling in both engines.
**SSH CA** (`sshca.go`, `handleSignHost` ~line 309, `handleSignUser`
~line 443):
```go
// Before (default-allow):
if matched && effect == "deny" {
return nil, ErrForbidden
}
// After (default-deny):
if req.CheckPolicy != nil {
effect, matched := req.CheckPolicy(resource, "sign")
if !matched || effect != "allow" {
return nil, ErrForbidden
}
}
```
**Transit** (`transit.go`, `requireUserWithPolicy` ~line 330):
```go
// Before (default-allow):
if matched {
if effect == "deny" {
return ErrForbidden
}
return nil
}
return nil // no match → allow
// After (default-deny):
if req.CheckPolicy == nil {
return ErrForbidden
}
effect, matched := req.CheckPolicy(resource, action)
if !matched || effect != "allow" {
return ErrForbidden
}
return nil
```
Apply the same pattern to any other policy check sites in these engines.
### Files
| File | Change |
|------|--------|
| `internal/engine/sshca/sshca.go` | Default-deny in `handleSignHost`, `handleSignUser` |
| `internal/engine/transit/transit.go` | Default-deny in `requireUserWithPolicy` |
| `internal/engine/sshca/sshca_test.go` | Add test: no policy rules → user request denied |
| `internal/engine/transit/transit_test.go` | Add test: no policy rules → user request denied |
---
## Work Item 6: User Engine Concurrency and Safety (#63, #64, #65, #66, #67)
Five findings in the user engine share a root cause: the engine was
implemented after the others and does not follow the same concurrency
patterns.
### #63 — Missing role checks on decrypt, re-encrypt, rotate-key
**Fix**: Add `IsUser()` check to `handleDecrypt`, `handleReEncrypt`, and
`handleRotateKey`, matching the pattern used in `handleEncrypt`:
```go
if !req.CallerInfo.IsUser() {
return nil, ErrForbidden
}
```
### #64 — Initialize holds no mutex
**Fix**: Acquire the write lock at the top of `Initialize`:
```go
func (e *UserEngine) Initialize(ctx context.Context, ...) error {
e.mu.Lock()
defer e.mu.Unlock()
// ... existing body ...
}
```
This matches the pattern in the SSH CA and transit engines.
### #65 — handleEncrypt releases lock before using state
**Fix**: Hold `RLock` (not write lock) through the ECDH and encryption
operations. After auto-provisioning (which needs the write lock), downgrade
to a read lock for the crypto work:
```go
e.mu.Lock()
// ... auto-provisioning ...
// Capture references while holding write lock.
senderState := e.users[sender]
recipientStates := ...
e.mu.Unlock()
// Re-acquire RLock for crypto operations.
e.mu.RLock()
defer e.mu.RUnlock()
// verify states are still valid (not nil)
if senderState.privKey == nil {
return nil, ErrSealed
}
// ... ECDH, wrap DEK, encrypt ...
```
Alternatively, hold the write lock through the entire operation (simpler
but serializes all encrypts). Given that encryption is the hot path, the
downgrade approach is preferred.
### #66 — handleReEncrypt manual lock without defer
**Fix**: Restructure to use `defer` for all lock releases. Split the
method into a provisioning phase (write lock) and a crypto phase (read
lock), each with its own `defer`:
```go
func (e *UserEngine) handleReEncrypt(...) {
// Phase 1: read envelope, resolve users.
e.mu.RLock()
// ... read state ...
e.mu.RUnlock()
// Phase 2: crypto operations.
e.mu.RLock()
defer e.mu.RUnlock()
// ... decrypt, re-wrap, encrypt ...
}
```
### #67 — No sealed-state check in HandleRequest
**Fix**: Add a sealed-state guard at the top of `HandleRequest`, before
the operation dispatch:
```go
func (e *UserEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
e.mu.RLock()
sealed := e.config == nil
e.mu.RUnlock()
if sealed {
return nil, ErrSealed
}
switch req.Operation {
// ...
}
}
```
### Files
| File | Change |
|------|--------|
| `internal/engine/user/user.go` | All five fixes above |
| `internal/engine/user/user_test.go` | Add tests: guest role denied; concurrent encrypt/seal; sealed-state rejection |
### Verification
```bash
go test -race ./internal/engine/user/
```
---
## Work Item 7: Barrier and Seal Fixes (#41, #42, #43)
### #41 — `loadKeys` errors silently swallowed during unseal
**Risk**: If `loadKeys` fails to decrypt DEK entries (corrupt data, wrong
MEK), errors are swallowed and the keys map is incomplete. Subsequent
operations fail with confusing `ErrKeyNotFound` errors.
**Fix**: Return the error from `loadKeys` instead of ignoring it. Distinguish
between "table doesn't exist" (acceptable on first run before migration)
and genuine decryption failures:
```go
func (b *AESGCMBarrier) Unseal(mek []byte) error {
// ...
if err := b.loadKeys(); err != nil {
// Check if this is a "no such table" error (pre-migration).
if !isTableMissing(err) {
b.mek = nil
return fmt.Errorf("barrier: load keys: %w", err)
}
// Pre-migration: no barrier_keys table yet, proceed with empty map.
}
return nil
}
func isTableMissing(err error) bool {
return strings.Contains(err.Error(), "no such table")
}
```
### #42 — No AAD binding on MEK encryption with KWK
**Risk**: The MEK ciphertext in `seal_config` has no AAD, so there is no
cryptographic binding to its storage purpose.
**Fix**: Pass a constant AAD string when encrypting/decrypting the MEK:
```go
var mekAAD = []byte("metacrypt/seal_config/mek")
// In Init:
encryptedMEK, err := crypto.Encrypt(kwk, mek, mekAAD)
// In Unseal:
mek, err := crypto.Decrypt(kwk, encryptedMEK, mekAAD)
// In RotateMEK:
newEncMEK, err := crypto.Encrypt(kwk, newMEK, mekAAD)
```
**Migration note**: Existing databases have MEKs encrypted with `nil` AAD.
The unseal path must try `mekAAD` first, then fall back to `nil` for
backward compatibility. After successful unseal with `nil`, re-encrypt with
`mekAAD` and update `seal_config`.
### #43 — Barrier `List` SQL LIKE with unescaped prefix
**Risk**: If a barrier path prefix contains `%` or `_`, the LIKE query
matches unintended entries.
**Fix**: Escape SQL LIKE wildcards in the prefix:
```go
func escapeLIKE(s string) string {
s = strings.ReplaceAll(s, `\`, `\\`)
s = strings.ReplaceAll(s, `%`, `\%`)
s = strings.ReplaceAll(s, `_`, `\_`)
return s
}
// In List:
rows, err := b.db.QueryContext(ctx,
"SELECT path FROM barrier_entries WHERE path LIKE ? ESCAPE '\\'",
escapeLIKE(prefix)+"%")
```
### Files
| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Fix `loadKeys` error handling; add `escapeLIKE` to `List` |
| `internal/seal/seal.go` | Add `mekAAD` constant; use in Init, Unseal, RotateMEK with fallback |
| `internal/barrier/barrier_test.go` | Add test: `List` with `%` and `_` in paths |
| `internal/seal/seal_test.go` | Add test: unseal with AAD-encrypted MEK |
---
## Work Item 8: CA Engine Fixes (#50, #51, #70)
### #50 — Non-admin users can override key usages
**Risk**: A non-admin user can request `cert sign` or `crl sign` key usage
on a leaf certificate, creating a de facto CA certificate.
**Fix**: Only allow admin users to override `key_usages` and
`ext_key_usages`. Non-admin users get the profile defaults:
```go
// In handleIssue, after profile lookup:
if req.CallerInfo.IsAdmin {
if v, ok := req.Data["key_usages"].([]interface{}); ok {
profile.KeyUse = toStringSlice(v)
}
if v, ok := req.Data["ext_key_usages"].([]interface{}); ok {
profile.ExtKeyUsages = toStringSlice(v)
}
}
```
### #51 — Certificate renewal does not revoke original
**Fix**: After issuing the renewed certificate, revoke the original by
serial. The original serial is available from the request data:
```go
// In handleRenew, after successful issuance:
oldSerial := req.Data["serial"].(string)
if err := e.revokeCertBySerial(ctx, issuerName, oldSerial, req.CallerInfo.Username); err != nil {
// Log but do not fail — the new cert is already issued.
e.logger.Warn("failed to revoke original during renewal",
"serial", oldSerial, "error", err)
}
```
### #70 — `RenewCert` has no REST route (API sync violation)
**Fix**: Add a dedicated REST route for certificate renewal:
```go
r.Post("/v1/ca/{mount}/cert/{serial}/renew", s.requireAuth(s.handleRenewCert))
```
Implement `handleRenewCert` following the typed handler pattern (with
`CheckPolicy`).
### Files
| File | Change |
|------|--------|
| `internal/engine/ca/ca.go` | Guard key usage overrides with `IsAdmin`; revoke original in `handleRenew` |
| `internal/server/routes.go` | Add `POST /v1/ca/{mount}/cert/{serial}/renew` route and handler |
| `internal/engine/ca/ca_test.go` | Add test: non-admin cannot set `cert sign` usage; renewal revokes original |
---
## Work Item 9: SSH CA Lock Granularity (#53)
**Risk**: `HandleRequest` acquires a write lock for all operations including
reads (`get-ca-pubkey`, `get-cert`, `list-certs`, `get-profile`,
`list-profiles`, `get-krl`). This serializes all operations unnecessarily.
**Fix**: Move locking into individual handlers. Read-only operations use
`RLock`; mutating operations use `Lock`:
```go
func (e *SSHCAEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
// No lock here — each handler manages its own.
switch req.Operation {
case "get-ca-pubkey":
return e.handleGetCAPubkey(ctx, req) // uses RLock internally
case "sign-host":
return e.handleSignHost(ctx, req) // uses Lock internally
// ...
}
}
```
Read-only handlers:
```go
func (e *SSHCAEngine) handleGetCAPubkey(ctx context.Context, req *engine.Request) (*engine.Response, error) {
e.mu.RLock()
defer e.mu.RUnlock()
// ...
}
```
Mutating handlers (`sign-host`, `sign-user`, `create-profile`,
`update-profile`, `delete-profile`, `revoke-cert`, `delete-cert`):
```go
func (e *SSHCAEngine) handleSignHost(ctx context.Context, req *engine.Request) (*engine.Response, error) {
e.mu.Lock()
defer e.mu.Unlock()
// ...
}
```
### Files
| File | Change |
|------|--------|
| `internal/engine/sshca/sshca.go` | Remove top-level lock from `HandleRequest`; add per-handler locks |
---
## Work Item 10: Transit Ciphertext Validation (#59)
**Risk**: `parseVersionedData` accepts negative version numbers. A crafted
ciphertext `metacrypt:v-1:...` parses as version -1, which fails the key
lookup but produces a confusing error.
**Fix**: Add a bounds check after parsing:
```go
func parseVersionedData(s string) (int, []byte, error) {
// ... existing parse logic ...
version, err := strconv.Atoi(parts[1][1:])
if err != nil {
return 0, nil, ErrInvalidFormat
}
if version < 1 {
return 0, nil, fmt.Errorf("transit: %w: version must be >= 1", ErrInvalidFormat)
}
// ...
}
```
### Files
| File | Change |
|------|--------|
| `internal/engine/transit/transit.go` | Add `version < 1` check in `parseVersionedData` |
| `internal/engine/transit/transit_test.go` | Add test: `metacrypt:v0:...` and `metacrypt:v-1:...` rejected |
---
## Work Item 11: Database and Auth (#46, #74)
### #46 — SQLite PRAGMAs only applied to first connection
**Risk**: `database/sql` may open additional connections that don't receive
the `journal_mode`, `foreign_keys`, and `busy_timeout` PRAGMAs.
**Fix**: Use the `_pragma` DSN parameter supported by `modernc.org/sqlite`:
```go
dsn := path + "?_pragma=journal_mode(WAL)&_pragma=foreign_keys(ON)&_pragma=busy_timeout(5000)"
db, err := sql.Open("sqlite", dsn)
```
Remove the manual PRAGMA execution loop. This ensures every connection
opened by the pool receives the PRAGMAs.
### #74 — Token validation cache grows without bound
**Risk**: The `cache` map in `auth.go` grows indefinitely. Expired entries
are never proactively removed.
**Fix**: Add periodic eviction. Run a background goroutine that sweeps
expired entries every minute:
```go
func (a *Authenticator) startEviction(ctx context.Context) {
ticker := time.NewTicker(time.Minute)
go func() {
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
a.evictExpired()
}
}
}()
}
func (a *Authenticator) evictExpired() {
now := time.Now()
a.mu.Lock()
defer a.mu.Unlock()
for key, entry := range a.cache {
if now.After(entry.expiresAt) {
delete(a.cache, key)
}
}
}
```
Call `startEviction` from the `Authenticator` constructor. Accept a
`context.Context` to allow cancellation on shutdown.
### Files
| File | Change |
|------|--------|
| `internal/db/db.go` | Switch to `_pragma` DSN parameters; remove manual PRAGMA loop |
| `internal/auth/auth.go` | Add `startEviction` goroutine; call from constructor |
| `internal/db/db_test.go` | Verify PRAGMAs are active on a fresh connection from the pool |
| `internal/auth/auth_test.go` | Add test: expired entries are evicted after sweep |
---
## Work Item 12: Policy Engine Glob Matching (#73)
**Risk**: `filepath.Match` does not support `**` for recursive directory
matching. Administrators writing rules like `engine/**/certs/*` will find
they don't match paths with multiple segments.
**Fix**: Replace `filepath.Match` with `path.Match` (POSIX-style, no
OS-specific behavior) and add support for `**` by splitting the pattern
and value on `/` and matching segments:
```go
import "path"
func matchesAnyGlob(patterns []string, value string) bool {
for _, p := range patterns {
if matchGlob(p, value) {
return true
}
}
return false
}
func matchGlob(pattern, value string) bool {
// Handle ** by checking if any suffix of value matches the rest of pattern.
if strings.Contains(pattern, "**") {
parts := strings.SplitN(pattern, "**", 2)
prefix := parts[0]
suffix := parts[1]
if !strings.HasPrefix(value, prefix) {
return false
}
remainder := value[len(prefix):]
// Try matching suffix against every suffix of remainder.
for i := 0; i <= len(remainder); i++ {
if matched, _ := path.Match(suffix, remainder[i:]); matched {
return true
}
// Advance to next separator.
next := strings.IndexByte(remainder[i:], '/')
if next < 0 {
break
}
i += next
}
return false
}
matched, _ := path.Match(pattern, value)
return matched
}
```
Document in POLICY.md that `**` matches zero or more path segments, while
`*` matches within a single segment.
### Files
| File | Change |
|------|--------|
| `internal/policy/policy.go` | Replace `filepath.Match` with `path.Match` + `**` support |
| `internal/policy/policy_test.go` | Add test: `engine/**/certs/*` matches `engine/ca/prod/certs/abc123` |
---
## Work Item 13: Deployment Fixes (#78, #79)
### #78 — systemd `ExecReload` sends SIGHUP with no handler
**Fix**: Two options:
**Option A** (recommended): Remove `ExecReload` entirely. Metacrypt does
not support graceful reload — configuration changes require a restart. The
`ExecReload` line creates a false expectation.
**Option B**: Implement SIGHUP handling that re-reads the TOML config and
applies non-breaking changes (log level, TLS cert reload). This is
significant new functionality and should be a separate feature.
For now, remove `ExecReload` from both service units:
```diff
-ExecReload=/bin/kill -HUP $MAINPID
```
### #79 — Dockerfiles use Go 1.23 but module requires Go 1.25
**Fix**: Update the builder base image:
```diff
-FROM golang:1.23-alpine AS builder
+FROM golang:1.25-alpine AS builder
```
Apply to both `Dockerfile.api` and `Dockerfile.web`. Also remove the
unnecessary `gcc` and `musl-dev` installation since `CGO_ENABLED=0`.
```diff
-RUN apk add --no-cache gcc musl-dev
```
### #7 — No audit logging
**Note**: This is the only medium finding that requires significant new
functionality rather than a targeted fix. Audit logging should be
implemented as a dedicated subsystem:
1. Define an `AuditEvent` struct with: timestamp, caller, operation,
resource, outcome (success/denied/error), and metadata.
2. Write events to a structured log sink (slog with JSON output to a
dedicated file at `/srv/metacrypt/audit.log`).
3. Instrument every engine `HandleRequest` to emit an event on completion.
4. Instrument policy `CreateRule`/`DeleteRule`.
5. Instrument seal/unseal operations.
This is a substantial feature. Track separately from the quick-fix items
in this plan.
### #10 — No extension allowlist for SSH host certificates
**Note**: Host certificates typically don't use extensions (extensions are
for user certificates). The fix is to ignore the `extensions` field on
host signing requests rather than passing it through:
```go
// In handleSignHost, after building the certificate:
cert.Permissions.Extensions = nil // Host certs should not carry extensions
```
### #21 — No rate limiting on transit cryptographic operations
**Note**: This requires a rate-limiting middleware or per-caller token
bucket. Best implemented as server-level middleware applied to engine
request handlers:
```go
func (s *Server) requireRateLimit(next http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
info := TokenInfoFromContext(r.Context())
if !s.rateLimiter.Allow(info.Username) {
writeJSONError(w, "rate limit exceeded", http.StatusTooManyRequests)
return
}
next(w, r)
}
}
```
Track separately — this affects API design decisions (limits, quotas,
per-user vs per-token).
### Files
| File | Change |
|------|--------|
| `deploy/systemd/metacrypt.service` | Remove `ExecReload` line |
| `deploy/systemd/metacrypt-web.service` | Remove `ExecReload` line |
| `Dockerfile.api` | Update to `golang:1.25-alpine`; remove `gcc musl-dev` |
| `Dockerfile.web` | Update to `golang:1.25-alpine`; remove `gcc musl-dev` |
---
## Implementation Order
```
5. #54, #58 Default-deny in engines (security-critical, do first)
6. #63#67 User engine concurrency (5 coupled fixes, one change)
7. #41#43 Barrier/seal fixes (3 independent fixes)
8. #50, #51, #70 CA engine fixes (key usage + renewal + API sync)
9. #53 SSH CA lock granularity (standalone refactor)
10. #59 Transit version validation (one-line fix)
11. #46, #74 DB pragmas + auth cache (independent, no overlap)
12. #73 Policy glob matching (standalone)
13. #78, #79 Deployment fixes (non-code, standalone)
#7 Audit logging (new feature, track separately)
#10 SSH host extensions (one-line fix, ship with #9)
#21 Transit rate limiting (new feature, track separately)
```
Work items 510 can be parallelized across engineers:
- Engineer A: #5 (default-deny) + #9 (SSH CA locks) + #10 (host extensions)
- Engineer B: #6 (user concurrency) + #8 (CA fixes)
- Engineer C: #7 (barrier/seal) + #10 (transit version) + #11 (DB/auth)
- Independent: #12 (policy), #13 (deployment)
Work items #7 (audit logging) and #21 (rate limiting) are substantial new
features and should be tracked as separate engineering tasks, not quick
fixes.
---
## Post-Remediation (Medium)
After all medium findings are resolved:
1. **Update AUDIT.md** — mark each finding as RESOLVED with summary.
2. **Run the full pipeline**: `make all` (vet, lint, test, build).
3. **Run race detector**: `go test -race ./...`
4. **Address related medium findings** that interact with these fixes:
- #54 (SSH CA default-allow) and #58 (transit default-allow) — once
#69 is fixed, the typed handlers will pass policy checkers to the
engines, but the engines still default-allow when `CheckPolicy`
returns no match. Consider changing the engine-level default to deny
for non-admin callers.
- #72 (policy ID path traversal) — already covered by #48's
`ValidateName` fix on `CreateRule`.
3. **Run race detector**: `go test -race ./...` — especially important
for work items 6 and 9 (concurrency changes).
4. **Address remaining low findings** — 14 low-severity items remain,
mostly zeroization gaps and documentation drift.