Files
metacrypt/REMEDIATION.md
Kyle Isom 5c5d7e184e Fix ECDH zeroization, add audit logging, and remediate high findings
- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored
  privBytes instead of calling Bytes() (which returns a copy). New
  state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event
  recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:04:39 -07:00

1263 lines
40 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Remediation Plan — High-Priority Audit Findings
**Date**: 2026-03-17
**Scope**: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69
This plan addresses all eight High-severity findings from the 2026-03-17
full system audit. Findings are grouped into four work items by shared root
cause or affected subsystem. The order reflects dependency chains: #68 is a
standalone fix that should ship first; #48 is a prerequisite for safe
operation across all engines; #39/#40 affect the storage core; the remaining
four affect specific engines.
---
## Work Item 1: JSON Injection in REST Error Responses (#68)
**Risk**: An error message containing `"` or `\` breaks the JSON response
structure. If the error contains attacker-controlled input (e.g., a mount
name or key name that triggers a downstream error), this enables JSON
injection in API responses.
**Root cause**: 13 locations in `internal/server/routes.go` construct JSON
error responses via string concatenation:
```go
http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)
```
The `writeEngineError` helper (line 1704) is the most common entry point;
most typed handlers call it.
### Fix
1. **Replace `writeEngineError`** with a safe JSON encoder:
```go
func writeJSONError(w http.ResponseWriter, msg string, code int) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(code)
_ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
}
```
2. **Replace all 13 call sites** that use string concatenation with
`writeJSONError(w, grpcMessage(err), status)` or
`writeJSONError(w, err.Error(), status)`.
The `grpcMessage` helper already exists in the webserver package and
extracts human-readable messages from gRPC errors. Add an equivalent
to the REST server, and prefer it over raw `err.Error()` to avoid
leaking internal error details.
3. **Grep for the pattern** `"error":"` in `routes.go` to confirm no
remaining string-concatenated JSON.
### Files
| File | Change |
|------|--------|
| `internal/server/routes.go` | Replace `writeEngineError` and all 13 inline error sites |
### Verification
- `go vet ./internal/server/`
- `go test ./internal/server/`
- Manual test: mount an engine with a name containing `"`, trigger an error,
verify the response is valid JSON.
---
## Work Item 2: Path Traversal via Unsanitized Names (#48)
**Risk**: User-controlled strings (issuer names, key names, profile names,
usernames, mount names) are concatenated directly into barrier storage
paths. An input containing `../` traverses the barrier namespace, allowing
reads and writes to arbitrary paths. This affects all four engines and the
engine registry.
**Root cause**: No validation exists at any layer — neither the barrier's
`Put`/`Get`/`Delete` methods nor the engines sanitize path components.
### Vulnerable locations
| File | Input | Path Pattern |
|------|-------|-------------|
| `ca/ca.go` | issuer `name` | `mountPath + "issuers/" + name + "/"` |
| `sshca/sshca.go` | profile `name` | `mountPath + "profiles/" + name + ".json"` |
| `transit/transit.go` | key `name` | `mountPath + "keys/" + name + "/"` |
| `user/user.go` | `username` | `mountPath + "users/" + username + "/"` |
| `engine/engine.go` | mount `name` | `engine/{type}/{name}/` |
| `policy/policy.go` | rule `ID` | `policy/rules/{id}` |
### Fix
Enforce validation at **two layers** (defense in depth):
1. **Barrier layer** — reject paths containing `..` segments.
Add a `validatePath` check at the top of `Get`, `Put`, `Delete`, and
`List` in `barrier.go`:
```go
var ErrInvalidPath = errors.New("barrier: invalid path")
func validatePath(p string) error {
for _, seg := range strings.Split(p, "/") {
if seg == ".." {
return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
}
}
return nil
}
```
Call `validatePath` at the entry of `Get`, `Put`, `Delete`, `List`.
Return `ErrInvalidPath` on failure.
2. **Engine/registry layer** — validate entity names at input boundaries.
Add a `ValidateName` helper to `internal/engine/`:
```go
var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
func ValidateName(name string) error {
if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
"dot, hyphen, or underscore characters", name)
}
return nil
}
```
Call `ValidateName` in:
| Location | Input validated |
|----------|----------------|
| `engine.go` `Mount()` | mount name |
| `ca.go` `handleCreateIssuer` | issuer name |
| `sshca.go` `handleCreateProfile` | profile name |
| `transit.go` `handleCreateKey` | key name |
| `user.go` `handleRegister`, `handleProvision` | username |
| `user.go` `handleEncrypt` | recipient usernames |
| `policy.go` `CreateRule` | rule ID |
Note: certificate serials are generated server-side from `crypto/rand`
and hex-encoded, so they are safe. Validate anyway for defense in depth.
### Files
| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Add `validatePath`, call from Get/Put/Delete/List |
| `internal/engine/engine.go` | Add `ValidateName`, call from `Mount` |
| `internal/engine/ca/ca.go` | Call `ValidateName` on issuer name |
| `internal/engine/sshca/sshca.go` | Call `ValidateName` on profile name |
| `internal/engine/transit/transit.go` | Call `ValidateName` on key name |
| `internal/engine/user/user.go` | Call `ValidateName` on usernames |
| `internal/policy/policy.go` | Call `ValidateName` on rule ID |
### Verification
- Add `TestValidatePath` to `barrier_test.go`: confirm `../` and `..` are
rejected; confirm normal paths pass.
- Add `TestValidateName` to `engine_test.go`: confirm `../evil`, empty
string, and overlong names are rejected; confirm valid names pass.
- `go test ./internal/barrier/ ./internal/engine/... ./internal/policy/`
---
## Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)
These two findings share the barrier/seal subsystem and should be addressed
together.
### #39 — TOCTOU Race in Barrier Get/Put
**Risk**: `Get` and `Put` copy the `mek` slice header and `keys` map
reference under `RLock`, release the lock, then use the copied references
for encryption/decryption. A concurrent `Seal()` zeroizes the underlying
byte slices in place before nil-ing the fields, so a concurrent reader
uses zeroized key material.
**Root cause**: The lock does not cover the crypto operation. The "copy"
is a shallow reference copy (slice header), not a deep byte copy. `Seal()`
zeroizes the backing array, which is shared.
**Current locking pattern** (`barrier.go`):
```
Get: RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put: RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock
```
**Fix**: Hold `RLock` through the entire crypto operation:
```go
func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
if err := validatePath(path); err != nil {
return nil, err
}
b.mu.RLock()
defer b.mu.RUnlock()
if b.mek == nil {
return nil, ErrSealed
}
// query DB, resolve key, decrypt — all under RLock
// ...
}
```
This is the minimal, safest change. `RLock` permits concurrent readers, so
there is no throughput regression for parallel `Get`/`Put` operations. The
only serialization point is `Seal()`, which acquires the exclusive `Lock`
and waits for all readers to drain — exactly the semantics we want.
Apply the same pattern to `Put`, `Delete`, and `List`.
**Alternative considered**: Atomic pointer swap (`atomic.Pointer[keyState]`).
This eliminates the lock from the hot path entirely, but introduces
complexity around deferred zeroization of the old state (readers may still
hold references). The `RLock`-through-crypto approach is simpler and
sufficient for Metacrypt's concurrency profile.
### #40 — Crash During `ReWrapKeys` Loses All Data
**Risk**: `RotateMEK` calls `barrier.ReWrapKeys(newMEK)` which commits a
transaction re-wrapping all DEKs, then separately updates `seal_config`
with the new encrypted MEK. A crash between these two database operations
leaves DEKs wrapped with a MEK that is not persisted — all data is
irrecoverable.
**Current flow** (`seal.go` lines 245313):
```
1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK) ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil) ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ? ← separate statement, not in transaction
*** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK
```
**Fix**: Unify steps 24 into a single database transaction.
Refactor `ReWrapKeys` to accept an optional `*sql.Tx`:
```go
// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
// Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
// ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}
// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
b.mu.Lock()
defer b.mu.Unlock()
mcrypto.Zeroize(b.mek)
b.mek = newMEK
}
```
Then in `RotateMEK`:
```go
func (m *Manager) RotateMEK(ctx context.Context, password string) error {
// ... derive KWK, generate newMEK ...
tx, err := m.db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback()
// Re-wrap all DEKs within the transaction.
if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
return err
}
// Update seal_config within the same transaction.
encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
if err != nil {
return err
}
if _, err := tx.ExecContext(ctx,
"UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
encNewMEK,
); err != nil {
return err
}
if err := tx.Commit(); err != nil {
return err
}
// Only after commit: update in-memory state.
m.barrier.SwapMEK(newMEK)
return nil
}
```
SQLite in WAL mode handles this correctly — the transaction is atomic
regardless of process crash. The `barrier_keys` and `seal_config` updates
either both commit or neither does.
### Files
| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Extend RLock scope in Get/Put/Delete/List; add `ReWrapKeysTx`, `SwapMEK` |
| `internal/seal/seal.go` | Wrap ReWrapKeysTx + seal_config UPDATE in single transaction |
| `internal/barrier/barrier_test.go` | Add concurrent Get/Seal stress test |
### Verification
- `go test -race ./internal/barrier/ ./internal/seal/`
- Add `TestConcurrentGetSeal`: spawn goroutines doing Get while another
goroutine calls Seal. Run with `-race`. Verify no panics or data races.
- Add `TestRotateMEKAtomic`: verify that `barrier_keys` and `seal_config`
are updated in the same transaction (mock the DB to detect transaction
boundaries, or verify via rollback behavior).
---
## Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)
These four findings touch separate files with no overlap and can be
addressed in parallel.
### #49 — No TTL Ceiling in CA Certificate Issuance
**Risk**: A non-admin user can request an arbitrarily long certificate
lifetime. The issuer's `MaxTTL` exists in config but is not enforced
during `handleIssue` or `handleSignCSR`.
**Root cause**: The CA engine applies the user's requested TTL directly
to the certificate without comparing it against `issuerConfig.MaxTTL`.
The SSH CA engine correctly enforces this via `resolveTTL` — the CA
engine does not.
**Fix**: Add a `resolveTTL` method to the CA engine, following the SSH
CA engine's pattern (`sshca.go` lines 902932):
```go
func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
if err != nil {
maxTTL = 2160 * time.Hour // 90 days fallback
}
if requested != "" {
ttl, err := time.ParseDuration(requested)
if err != nil {
return 0, fmt.Errorf("invalid TTL: %w", err)
}
if ttl > maxTTL {
return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
}
return ttl, nil
}
return maxTTL, nil
}
```
Call this in `handleIssue` and `handleSignCSR` before constructing the
certificate. Replace the raw TTL string with the validated duration.
| File | Change |
|------|--------|
| `internal/engine/ca/ca.go` | Add `resolveTTL`, call from `handleIssue` and `handleSignCSR` |
| `internal/engine/ca/ca_test.go` | Add test: issue cert with TTL > MaxTTL, verify rejection |
### #61 — Ineffective ECDH Key Zeroization
**Risk**: `privKey.Bytes()` returns a copy of the private key bytes.
Zeroizing the copy leaves the original inside `*ecdh.PrivateKey`. Go's
`crypto/ecdh` API does not expose the internal byte slice.
**Root cause**: Language/API limitation in Go's `crypto/ecdh` package.
**Fix**: Store the raw private key bytes alongside the parsed key in
`userState`, and zeroize those bytes on seal:
```go
type userState struct {
privKey *ecdh.PrivateKey
privBytes []byte // raw key bytes, retained for zeroization
pubKey *ecdh.PublicKey
config *UserKeyConfig
}
```
On **load from barrier** (Unseal, auto-provision):
```go
raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw // retain for zeroization
state.privKey = priv
```
On **Seal**:
```go
mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil
```
Document the limitation: the parsed `*ecdh.PrivateKey` struct's internal
copy cannot be zeroized from Go code. Setting `privKey = nil` makes it
eligible for GC, but does not guarantee immediate byte overwrite. This is
an accepted Go runtime limitation.
| File | Change |
|------|--------|
| `internal/engine/user/user.go` | Add `privBytes` to `userState`, populate on load, zeroize on Seal |
| `internal/engine/user/types.go` | Update `userState` struct |
### #62 — User Engine Policy Path Uses `mountPath` Instead of Mount Name
**Risk**: Policy checks construct the resource path using `e.mountPath`
(which is `engine/user/{name}/`) instead of just the mount name. Policy
rules match against `user/{name}/recipient/{username}`, so the full mount
path creates a mismatch like `user/engine/user/myengine//recipient/alice`.
No policy rule will ever match.
**Root cause**: Line 358 of `user.go` uses `e.mountPath` directly. The
SSH CA and transit engines correctly use a `mountName()` helper.
**Fix**: Add a `mountName()` method to the user engine:
```go
func (e *UserEngine) mountName() string {
// mountPath is "engine/user/{name}/"
parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
if len(parts) >= 3 {
return parts[2]
}
return e.mountPath
}
```
Change line 358:
```go
resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)
```
Audit all other resource path constructions in the user engine to confirm
they also use the correct mount name.
| File | Change |
|------|--------|
| `internal/engine/user/user.go` | Add `mountName()`, fix resource path on line 358 |
| `internal/engine/user/user_test.go` | Add test: verify policy resource path format |
### #69 — Typed REST Handlers Bypass Policy Engine
**Risk**: 18 typed REST handlers pass `nil` for `CheckPolicy` in the
`engine.Request`, skipping service-level policy evaluation. The generic
`/v1/engine/request` endpoint correctly passes a `policyChecker`. Since
engines #54 and #58 default to allow when no policy matches, typed routes
are effectively unprotected by policy.
**Root cause**: Typed handlers were modeled after admin-only operations
(which don't need policy) but applied to user-accessible operations.
**Fix**: Extract the policy checker construction from
`handleEngineRequest` into a shared helper:
```go
func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
return func(resource, action string) (string, bool) {
effect, matched, err := s.policy.Check(
info.Username, info.Roles, resource, action,
)
if err != nil || !matched {
return "deny", false
}
return effect, matched
}
}
```
Then in each typed handler, set `CheckPolicy` on the request:
```go
req := &engine.Request{
Operation: "get-cert",
Data: data,
CallerInfo: callerInfo,
CheckPolicy: s.newPolicyChecker(callerInfo),
}
```
**18 handlers to update**:
| Handler | Operation |
|---------|-----------|
| `handleGetCert` | `get-cert` |
| `handleRevokeCert` | `revoke-cert` |
| `handleDeleteCert` | `delete-cert` |
| `handleSSHCASignHost` | `sign-host` |
| `handleSSHCASignUser` | `sign-user` |
| `handleSSHCAGetProfile` | `get-profile` |
| `handleSSHCAListProfiles` | `list-profiles` |
| `handleSSHCADeleteProfile` | `delete-profile` |
| `handleSSHCAGetCert` | `get-cert` |
| `handleSSHCAListCerts` | `list-certs` |
| `handleSSHCARevokeCert` | `revoke-cert` |
| `handleSSHCADeleteCert` | `delete-cert` |
| `handleUserRegister` | `register` |
| `handleUserProvision` | `provision` |
| `handleUserListUsers` | `list-users` |
| `handleUserGetPublicKey` | `get-public-key` |
| `handleUserDeleteUser` | `delete-user` |
| `handleUserDecrypt` | `decrypt` |
Note: `handleUserEncrypt` already passes a policy checker — verify it
uses the same shared helper after refactoring. Admin-only handlers
(behind `requireAdmin` wrapper) do not need a policy checker since admin
bypasses policy.
| File | Change |
|------|--------|
| `internal/server/routes.go` | Add `newPolicyChecker`, pass to all 18 typed handlers |
| `internal/server/server_test.go` | Add test: policy-denied user is rejected by typed route |
### Verification (Work Item 4, all findings)
```bash
go test ./internal/engine/ca/
go test ./internal/engine/user/
go test ./internal/server/
go vet ./...
```
---
## Implementation Order
```
1. #68 JSON injection (standalone, ship immediately)
2. #48 Path traversal (standalone, blocks safe engine operation)
3. #39 Barrier TOCTOU race ─┐
#40 ReWrapKeys crash safety ┘ (coupled, requires careful testing)
4. #49 CA TTL enforcement ─┐
#61 ECDH zeroization │
#62 User policy path │ (independent fixes, parallelizable)
#69 Policy bypass ─┘
```
Items 1 and 2 have no dependencies and can be done in parallel by
different engineers.
Items 3 and 4 can also be done in parallel since they touch different
subsystems (barrier/seal vs engines/server).
---
## Post-Remediation (High)
All eight High findings (#39, #40, #48, #49, #61, #62, #68, #69) have
been resolved. See commit `a80323e`.
---
---
# Remediation Plan — Medium-Priority Audit Findings
**Date**: 2026-03-17
**Scope**: AUDIT.md findings #7, #10, #21, #41, #42, #43, #46, #50, #51,
#53, #54, #58, #59, #63, #64, #65, #66, #67, #70, #72, #73, #74, #78, #79
This plan addresses all medium-severity findings. Findings are grouped into
eight work items by subsystem. Two findings (#72, #79) are already resolved
or trivial and are noted inline.
**Previously resolved by High remediation**:
- #72 (policy rule ID path traversal) — covered by #48's `ValidateName`
on `CreateRule` and barrier-level `validatePath`.
---
## Work Item 5: Default-Deny in Engine Policy Checks (#54, #58)
**Risk**: The SSH CA and transit engines default to **allow** when no policy
rule matches a non-admin user's request. This contradicts the default-deny
principle in ARCHITECTURE.md and the engineering standards. Now that #69 is
resolved (typed REST handlers pass `CheckPolicy`), the engines receive a
policy checker — but when `CheckPolicy` returns no match, they still allow.
**Root cause**: Both engines treat "no matching policy rule" as implicit
allow for authenticated users.
### Fix
Change the policy evaluation result handling in both engines.
**SSH CA** (`sshca.go`, `handleSignHost` ~line 309, `handleSignUser`
~line 443):
```go
// Before (default-allow):
if matched && effect == "deny" {
return nil, ErrForbidden
}
// After (default-deny):
if req.CheckPolicy != nil {
effect, matched := req.CheckPolicy(resource, "sign")
if !matched || effect != "allow" {
return nil, ErrForbidden
}
}
```
**Transit** (`transit.go`, `requireUserWithPolicy` ~line 330):
```go
// Before (default-allow):
if matched {
if effect == "deny" {
return ErrForbidden
}
return nil
}
return nil // no match → allow
// After (default-deny):
if req.CheckPolicy == nil {
return ErrForbidden
}
effect, matched := req.CheckPolicy(resource, action)
if !matched || effect != "allow" {
return ErrForbidden
}
return nil
```
Apply the same pattern to any other policy check sites in these engines.
### Files
| File | Change |
|------|--------|
| `internal/engine/sshca/sshca.go` | Default-deny in `handleSignHost`, `handleSignUser` |
| `internal/engine/transit/transit.go` | Default-deny in `requireUserWithPolicy` |
| `internal/engine/sshca/sshca_test.go` | Add test: no policy rules → user request denied |
| `internal/engine/transit/transit_test.go` | Add test: no policy rules → user request denied |
---
## Work Item 6: User Engine Concurrency and Safety (#63, #64, #65, #66, #67)
Five findings in the user engine share a root cause: the engine was
implemented after the others and does not follow the same concurrency
patterns.
### #63 — Missing role checks on decrypt, re-encrypt, rotate-key
**Fix**: Add `IsUser()` check to `handleDecrypt`, `handleReEncrypt`, and
`handleRotateKey`, matching the pattern used in `handleEncrypt`:
```go
if !req.CallerInfo.IsUser() {
return nil, ErrForbidden
}
```
### #64 — Initialize holds no mutex
**Fix**: Acquire the write lock at the top of `Initialize`:
```go
func (e *UserEngine) Initialize(ctx context.Context, ...) error {
e.mu.Lock()
defer e.mu.Unlock()
// ... existing body ...
}
```
This matches the pattern in the SSH CA and transit engines.
### #65 — handleEncrypt releases lock before using state
**Fix**: Hold `RLock` (not write lock) through the ECDH and encryption
operations. After auto-provisioning (which needs the write lock), downgrade
to a read lock for the crypto work:
```go
e.mu.Lock()
// ... auto-provisioning ...
// Capture references while holding write lock.
senderState := e.users[sender]
recipientStates := ...
e.mu.Unlock()
// Re-acquire RLock for crypto operations.
e.mu.RLock()
defer e.mu.RUnlock()
// verify states are still valid (not nil)
if senderState.privKey == nil {
return nil, ErrSealed
}
// ... ECDH, wrap DEK, encrypt ...
```
Alternatively, hold the write lock through the entire operation (simpler
but serializes all encrypts). Given that encryption is the hot path, the
downgrade approach is preferred.
### #66 — handleReEncrypt manual lock without defer
**Fix**: Restructure to use `defer` for all lock releases. Split the
method into a provisioning phase (write lock) and a crypto phase (read
lock), each with its own `defer`:
```go
func (e *UserEngine) handleReEncrypt(...) {
// Phase 1: read envelope, resolve users.
e.mu.RLock()
// ... read state ...
e.mu.RUnlock()
// Phase 2: crypto operations.
e.mu.RLock()
defer e.mu.RUnlock()
// ... decrypt, re-wrap, encrypt ...
}
```
### #67 — No sealed-state check in HandleRequest
**Fix**: Add a sealed-state guard at the top of `HandleRequest`, before
the operation dispatch:
```go
func (e *UserEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
e.mu.RLock()
sealed := e.config == nil
e.mu.RUnlock()
if sealed {
return nil, ErrSealed
}
switch req.Operation {
// ...
}
}
```
### Files
| File | Change |
|------|--------|
| `internal/engine/user/user.go` | All five fixes above |
| `internal/engine/user/user_test.go` | Add tests: guest role denied; concurrent encrypt/seal; sealed-state rejection |
### Verification
```bash
go test -race ./internal/engine/user/
```
---
## Work Item 7: Barrier and Seal Fixes (#41, #42, #43)
### #41 — `loadKeys` errors silently swallowed during unseal
**Risk**: If `loadKeys` fails to decrypt DEK entries (corrupt data, wrong
MEK), errors are swallowed and the keys map is incomplete. Subsequent
operations fail with confusing `ErrKeyNotFound` errors.
**Fix**: Return the error from `loadKeys` instead of ignoring it. Distinguish
between "table doesn't exist" (acceptable on first run before migration)
and genuine decryption failures:
```go
func (b *AESGCMBarrier) Unseal(mek []byte) error {
// ...
if err := b.loadKeys(); err != nil {
// Check if this is a "no such table" error (pre-migration).
if !isTableMissing(err) {
b.mek = nil
return fmt.Errorf("barrier: load keys: %w", err)
}
// Pre-migration: no barrier_keys table yet, proceed with empty map.
}
return nil
}
func isTableMissing(err error) bool {
return strings.Contains(err.Error(), "no such table")
}
```
### #42 — No AAD binding on MEK encryption with KWK
**Risk**: The MEK ciphertext in `seal_config` has no AAD, so there is no
cryptographic binding to its storage purpose.
**Fix**: Pass a constant AAD string when encrypting/decrypting the MEK:
```go
var mekAAD = []byte("metacrypt/seal_config/mek")
// In Init:
encryptedMEK, err := crypto.Encrypt(kwk, mek, mekAAD)
// In Unseal:
mek, err := crypto.Decrypt(kwk, encryptedMEK, mekAAD)
// In RotateMEK:
newEncMEK, err := crypto.Encrypt(kwk, newMEK, mekAAD)
```
**Migration note**: Existing databases have MEKs encrypted with `nil` AAD.
The unseal path must try `mekAAD` first, then fall back to `nil` for
backward compatibility. After successful unseal with `nil`, re-encrypt with
`mekAAD` and update `seal_config`.
### #43 — Barrier `List` SQL LIKE with unescaped prefix
**Risk**: If a barrier path prefix contains `%` or `_`, the LIKE query
matches unintended entries.
**Fix**: Escape SQL LIKE wildcards in the prefix:
```go
func escapeLIKE(s string) string {
s = strings.ReplaceAll(s, `\`, `\\`)
s = strings.ReplaceAll(s, `%`, `\%`)
s = strings.ReplaceAll(s, `_`, `\_`)
return s
}
// In List:
rows, err := b.db.QueryContext(ctx,
"SELECT path FROM barrier_entries WHERE path LIKE ? ESCAPE '\\'",
escapeLIKE(prefix)+"%")
```
### Files
| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Fix `loadKeys` error handling; add `escapeLIKE` to `List` |
| `internal/seal/seal.go` | Add `mekAAD` constant; use in Init, Unseal, RotateMEK with fallback |
| `internal/barrier/barrier_test.go` | Add test: `List` with `%` and `_` in paths |
| `internal/seal/seal_test.go` | Add test: unseal with AAD-encrypted MEK |
---
## Work Item 8: CA Engine Fixes (#50, #51, #70)
### #50 — Non-admin users can override key usages
**Risk**: A non-admin user can request `cert sign` or `crl sign` key usage
on a leaf certificate, creating a de facto CA certificate.
**Fix**: Only allow admin users to override `key_usages` and
`ext_key_usages`. Non-admin users get the profile defaults:
```go
// In handleIssue, after profile lookup:
if req.CallerInfo.IsAdmin {
if v, ok := req.Data["key_usages"].([]interface{}); ok {
profile.KeyUse = toStringSlice(v)
}
if v, ok := req.Data["ext_key_usages"].([]interface{}); ok {
profile.ExtKeyUsages = toStringSlice(v)
}
}
```
### #51 — Certificate renewal does not revoke original
**Fix**: After issuing the renewed certificate, revoke the original by
serial. The original serial is available from the request data:
```go
// In handleRenew, after successful issuance:
oldSerial := req.Data["serial"].(string)
if err := e.revokeCertBySerial(ctx, issuerName, oldSerial, req.CallerInfo.Username); err != nil {
// Log but do not fail — the new cert is already issued.
e.logger.Warn("failed to revoke original during renewal",
"serial", oldSerial, "error", err)
}
```
### #70 — `RenewCert` has no REST route (API sync violation)
**Fix**: Add a dedicated REST route for certificate renewal:
```go
r.Post("/v1/ca/{mount}/cert/{serial}/renew", s.requireAuth(s.handleRenewCert))
```
Implement `handleRenewCert` following the typed handler pattern (with
`CheckPolicy`).
### Files
| File | Change |
|------|--------|
| `internal/engine/ca/ca.go` | Guard key usage overrides with `IsAdmin`; revoke original in `handleRenew` |
| `internal/server/routes.go` | Add `POST /v1/ca/{mount}/cert/{serial}/renew` route and handler |
| `internal/engine/ca/ca_test.go` | Add test: non-admin cannot set `cert sign` usage; renewal revokes original |
---
## Work Item 9: SSH CA Lock Granularity (#53)
**Risk**: `HandleRequest` acquires a write lock for all operations including
reads (`get-ca-pubkey`, `get-cert`, `list-certs`, `get-profile`,
`list-profiles`, `get-krl`). This serializes all operations unnecessarily.
**Fix**: Move locking into individual handlers. Read-only operations use
`RLock`; mutating operations use `Lock`:
```go
func (e *SSHCAEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
// No lock here — each handler manages its own.
switch req.Operation {
case "get-ca-pubkey":
return e.handleGetCAPubkey(ctx, req) // uses RLock internally
case "sign-host":
return e.handleSignHost(ctx, req) // uses Lock internally
// ...
}
}
```
Read-only handlers:
```go
func (e *SSHCAEngine) handleGetCAPubkey(ctx context.Context, req *engine.Request) (*engine.Response, error) {
e.mu.RLock()
defer e.mu.RUnlock()
// ...
}
```
Mutating handlers (`sign-host`, `sign-user`, `create-profile`,
`update-profile`, `delete-profile`, `revoke-cert`, `delete-cert`):
```go
func (e *SSHCAEngine) handleSignHost(ctx context.Context, req *engine.Request) (*engine.Response, error) {
e.mu.Lock()
defer e.mu.Unlock()
// ...
}
```
### Files
| File | Change |
|------|--------|
| `internal/engine/sshca/sshca.go` | Remove top-level lock from `HandleRequest`; add per-handler locks |
---
## Work Item 10: Transit Ciphertext Validation (#59)
**Risk**: `parseVersionedData` accepts negative version numbers. A crafted
ciphertext `metacrypt:v-1:...` parses as version -1, which fails the key
lookup but produces a confusing error.
**Fix**: Add a bounds check after parsing:
```go
func parseVersionedData(s string) (int, []byte, error) {
// ... existing parse logic ...
version, err := strconv.Atoi(parts[1][1:])
if err != nil {
return 0, nil, ErrInvalidFormat
}
if version < 1 {
return 0, nil, fmt.Errorf("transit: %w: version must be >= 1", ErrInvalidFormat)
}
// ...
}
```
### Files
| File | Change |
|------|--------|
| `internal/engine/transit/transit.go` | Add `version < 1` check in `parseVersionedData` |
| `internal/engine/transit/transit_test.go` | Add test: `metacrypt:v0:...` and `metacrypt:v-1:...` rejected |
---
## Work Item 11: Database and Auth (#46, #74)
### #46 — SQLite PRAGMAs only applied to first connection
**Risk**: `database/sql` may open additional connections that don't receive
the `journal_mode`, `foreign_keys`, and `busy_timeout` PRAGMAs.
**Fix**: Use the `_pragma` DSN parameter supported by `modernc.org/sqlite`:
```go
dsn := path + "?_pragma=journal_mode(WAL)&_pragma=foreign_keys(ON)&_pragma=busy_timeout(5000)"
db, err := sql.Open("sqlite", dsn)
```
Remove the manual PRAGMA execution loop. This ensures every connection
opened by the pool receives the PRAGMAs.
### #74 — Token validation cache grows without bound
**Risk**: The `cache` map in `auth.go` grows indefinitely. Expired entries
are never proactively removed.
**Fix**: Add periodic eviction. Run a background goroutine that sweeps
expired entries every minute:
```go
func (a *Authenticator) startEviction(ctx context.Context) {
ticker := time.NewTicker(time.Minute)
go func() {
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
a.evictExpired()
}
}
}()
}
func (a *Authenticator) evictExpired() {
now := time.Now()
a.mu.Lock()
defer a.mu.Unlock()
for key, entry := range a.cache {
if now.After(entry.expiresAt) {
delete(a.cache, key)
}
}
}
```
Call `startEviction` from the `Authenticator` constructor. Accept a
`context.Context` to allow cancellation on shutdown.
### Files
| File | Change |
|------|--------|
| `internal/db/db.go` | Switch to `_pragma` DSN parameters; remove manual PRAGMA loop |
| `internal/auth/auth.go` | Add `startEviction` goroutine; call from constructor |
| `internal/db/db_test.go` | Verify PRAGMAs are active on a fresh connection from the pool |
| `internal/auth/auth_test.go` | Add test: expired entries are evicted after sweep |
---
## Work Item 12: Policy Engine Glob Matching (#73)
**Risk**: `filepath.Match` does not support `**` for recursive directory
matching. Administrators writing rules like `engine/**/certs/*` will find
they don't match paths with multiple segments.
**Fix**: Replace `filepath.Match` with `path.Match` (POSIX-style, no
OS-specific behavior) and add support for `**` by splitting the pattern
and value on `/` and matching segments:
```go
import "path"
func matchesAnyGlob(patterns []string, value string) bool {
for _, p := range patterns {
if matchGlob(p, value) {
return true
}
}
return false
}
func matchGlob(pattern, value string) bool {
// Handle ** by checking if any suffix of value matches the rest of pattern.
if strings.Contains(pattern, "**") {
parts := strings.SplitN(pattern, "**", 2)
prefix := parts[0]
suffix := parts[1]
if !strings.HasPrefix(value, prefix) {
return false
}
remainder := value[len(prefix):]
// Try matching suffix against every suffix of remainder.
for i := 0; i <= len(remainder); i++ {
if matched, _ := path.Match(suffix, remainder[i:]); matched {
return true
}
// Advance to next separator.
next := strings.IndexByte(remainder[i:], '/')
if next < 0 {
break
}
i += next
}
return false
}
matched, _ := path.Match(pattern, value)
return matched
}
```
Document in POLICY.md that `**` matches zero or more path segments, while
`*` matches within a single segment.
### Files
| File | Change |
|------|--------|
| `internal/policy/policy.go` | Replace `filepath.Match` with `path.Match` + `**` support |
| `internal/policy/policy_test.go` | Add test: `engine/**/certs/*` matches `engine/ca/prod/certs/abc123` |
---
## Work Item 13: Deployment Fixes (#78, #79)
### #78 — systemd `ExecReload` sends SIGHUP with no handler
**Fix**: Two options:
**Option A** (recommended): Remove `ExecReload` entirely. Metacrypt does
not support graceful reload — configuration changes require a restart. The
`ExecReload` line creates a false expectation.
**Option B**: Implement SIGHUP handling that re-reads the TOML config and
applies non-breaking changes (log level, TLS cert reload). This is
significant new functionality and should be a separate feature.
For now, remove `ExecReload` from both service units:
```diff
-ExecReload=/bin/kill -HUP $MAINPID
```
### #79 — Dockerfiles use Go 1.23 but module requires Go 1.25
**Fix**: Update the builder base image:
```diff
-FROM golang:1.23-alpine AS builder
+FROM golang:1.25-alpine AS builder
```
Apply to both `Dockerfile.api` and `Dockerfile.web`. Also remove the
unnecessary `gcc` and `musl-dev` installation since `CGO_ENABLED=0`.
```diff
-RUN apk add --no-cache gcc musl-dev
```
### #7 — No audit logging
**Note**: This is the only medium finding that requires significant new
functionality rather than a targeted fix. Audit logging should be
implemented as a dedicated subsystem:
1. Define an `AuditEvent` struct with: timestamp, caller, operation,
resource, outcome (success/denied/error), and metadata.
2. Write events to a structured log sink (slog with JSON output to a
dedicated file at `/srv/metacrypt/audit.log`).
3. Instrument every engine `HandleRequest` to emit an event on completion.
4. Instrument policy `CreateRule`/`DeleteRule`.
5. Instrument seal/unseal operations.
This is a substantial feature. Track separately from the quick-fix items
in this plan.
### #10 — No extension allowlist for SSH host certificates
**Note**: Host certificates typically don't use extensions (extensions are
for user certificates). The fix is to ignore the `extensions` field on
host signing requests rather than passing it through:
```go
// In handleSignHost, after building the certificate:
cert.Permissions.Extensions = nil // Host certs should not carry extensions
```
### #21 — No rate limiting on transit cryptographic operations
**Note**: This requires a rate-limiting middleware or per-caller token
bucket. Best implemented as server-level middleware applied to engine
request handlers:
```go
func (s *Server) requireRateLimit(next http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
info := TokenInfoFromContext(r.Context())
if !s.rateLimiter.Allow(info.Username) {
writeJSONError(w, "rate limit exceeded", http.StatusTooManyRequests)
return
}
next(w, r)
}
}
```
Track separately — this affects API design decisions (limits, quotas,
per-user vs per-token).
### Files
| File | Change |
|------|--------|
| `deploy/systemd/metacrypt.service` | Remove `ExecReload` line |
| `deploy/systemd/metacrypt-web.service` | Remove `ExecReload` line |
| `Dockerfile.api` | Update to `golang:1.25-alpine`; remove `gcc musl-dev` |
| `Dockerfile.web` | Update to `golang:1.25-alpine`; remove `gcc musl-dev` |
---
## Implementation Order
```
5. #54, #58 Default-deny in engines (security-critical, do first)
6. #63#67 User engine concurrency (5 coupled fixes, one change)
7. #41#43 Barrier/seal fixes (3 independent fixes)
8. #50, #51, #70 CA engine fixes (key usage + renewal + API sync)
9. #53 SSH CA lock granularity (standalone refactor)
10. #59 Transit version validation (one-line fix)
11. #46, #74 DB pragmas + auth cache (independent, no overlap)
12. #73 Policy glob matching (standalone)
13. #78, #79 Deployment fixes (non-code, standalone)
#7 Audit logging (new feature, track separately)
#10 SSH host extensions (one-line fix, ship with #9)
#21 Transit rate limiting (new feature, track separately)
```
Work items 510 can be parallelized across engineers:
- Engineer A: #5 (default-deny) + #9 (SSH CA locks) + #10 (host extensions)
- Engineer B: #6 (user concurrency) + #8 (CA fixes)
- Engineer C: #7 (barrier/seal) + #10 (transit version) + #11 (DB/auth)
- Independent: #12 (policy), #13 (deployment)
Work items #7 (audit logging) and #21 (rate limiting) are substantial new
features and should be tracked as separate engineering tasks, not quick
fixes.
---
## Post-Remediation (Medium)
After all medium findings are resolved:
1. **Update AUDIT.md** — mark each finding as RESOLVED with summary.
2. **Run the full pipeline**: `make all` (vet, lint, test, build).
3. **Run race detector**: `go test -race ./...` — especially important
for work items 6 and 9 (concurrency changes).
4. **Address remaining low findings** — 14 low-severity items remain,
mostly zeroization gaps and documentation drift.