metacrypt/REMEDIATION.md
Kyle Isom 5c5d7e184e Fix ECDH zeroization, add audit logging, and remediate high findings
- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored
  privBytes instead of calling Bytes() (which returns a copy). New
  state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event
  recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:04:39 -07:00


Remediation Plan — High-Priority Audit Findings

Date: 2026-03-17
Scope: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69

This plan addresses all eight High-severity findings from the 2026-03-17 full system audit. Findings are grouped into four work items by shared root cause or affected subsystem. The order reflects dependency chains: #68 is a standalone fix that should ship first; #48 is a prerequisite for safe operation across all engines; #39/#40 affect the storage core; the remaining four affect specific engines.


Work Item 1: JSON Injection in REST Error Responses (#68)

Risk: An error message containing " or \ breaks the JSON response structure. If the error contains attacker-controlled input (e.g., a mount name or key name that triggers a downstream error), this enables JSON injection in API responses.

Root cause: 13 locations in internal/server/routes.go construct JSON error responses via string concatenation:

http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)

The writeEngineError helper (line 1704) is the most common entry point; most typed handlers call it.

Fix

  1. Replace writeEngineError with a safe JSON encoder:

    func writeJSONError(w http.ResponseWriter, msg string, code int) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(code)
        _ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
    }
    
  2. Replace all 13 call sites that use string concatenation with writeJSONError(w, grpcMessage(err), status) or writeJSONError(w, err.Error(), status).

    The grpcMessage helper already exists in the webserver package and extracts human-readable messages from gRPC errors. Add an equivalent to the REST server, and prefer it over raw err.Error() to avoid leaking internal error details.

  3. Grep for the pattern "error":" in routes.go to confirm no remaining string-concatenated JSON.

Files

File Change
internal/server/routes.go Replace writeEngineError and all 13 inline error sites

Verification

  • go vet ./internal/server/
  • go test ./internal/server/
  • Manual test: mount an engine with a name containing ", trigger an error, verify the response is valid JSON.

Work Item 2: Path Traversal via Unsanitized Names (#48)

Risk: User-controlled strings (issuer names, key names, profile names, usernames, mount names) are concatenated directly into barrier storage paths. An input containing ../ traverses the barrier namespace, allowing reads and writes to arbitrary paths. This affects all four engines and the engine registry.

Root cause: No validation exists at any layer — neither the barrier's Put/Get/Delete methods nor the engines sanitize path components.

Vulnerable locations

File | Input | Path Pattern
ca/ca.go | issuer name | mountPath + "issuers/" + name + "/"
sshca/sshca.go | profile name | mountPath + "profiles/" + name + ".json"
transit/transit.go | key name | mountPath + "keys/" + name + "/"
user/user.go | username | mountPath + "users/" + username + "/"
engine/engine.go | mount name | engine/{type}/{name}/
policy/policy.go | rule ID | policy/rules/{id}

Fix

Enforce validation at two layers (defense in depth):

  1. Barrier layer — reject paths containing .. segments.

    Add a validatePath check at the top of Get, Put, Delete, and List in barrier.go:

    var ErrInvalidPath = errors.New("barrier: invalid path")
    
    func validatePath(p string) error {
        for _, seg := range strings.Split(p, "/") {
            if seg == ".." {
                return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
            }
        }
        return nil
    }
    

    Call validatePath at the entry of Get, Put, Delete, List. Return ErrInvalidPath on failure.

  2. Engine/registry layer — validate entity names at input boundaries.

    Add a ValidateName helper to internal/engine/:

    var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
    
    func ValidateName(name string) error {
        if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
            return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
                "dot, hyphen, or underscore characters", name)
        }
        return nil
    }
    

    Call ValidateName in:

    Location | Input validated
    engine.go Mount() | mount name
    ca.go handleCreateIssuer | issuer name
    sshca.go handleCreateProfile | profile name
    transit.go handleCreateKey | key name
    user.go handleRegister, handleProvision | username
    user.go handleEncrypt | recipient usernames
    policy.go CreateRule | rule ID

    Note: certificate serials are generated server-side from crypto/rand and hex-encoded, so they are safe. Validate anyway for defense in depth.
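
The helper can be exercised standalone to confirm traversal inputs never reach the barrier (sketch; validName reimplements the proposed check inline):

```go
package main

import (
    "fmt"
    "regexp"
)

// Same pattern as the proposed helper: '/' is not in the character class,
// so any path-traversal attempt fails the match outright.
var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

func validName(name string) bool {
    return name != "" && len(name) <= 128 && namePattern.MatchString(name)
}

func main() {
    for _, n := range []string{"prod-ca", "web_v1.2", "../evil", "..", ""} {
        fmt.Printf("%q -> %v\n", n, validName(n))
    }
}
```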

Files

File Change
internal/barrier/barrier.go Add validatePath, call from Get/Put/Delete/List
internal/engine/engine.go Add ValidateName, call from Mount
internal/engine/ca/ca.go Call ValidateName on issuer name
internal/engine/sshca/sshca.go Call ValidateName on profile name
internal/engine/transit/transit.go Call ValidateName on key name
internal/engine/user/user.go Call ValidateName on usernames
internal/policy/policy.go Call ValidateName on rule ID

Verification

  • Add TestValidatePath to barrier_test.go: confirm ../ and .. are rejected; confirm normal paths pass.
  • Add TestValidateName to engine_test.go: confirm ../evil, empty string, and overlong names are rejected; confirm valid names pass.
  • go test ./internal/barrier/ ./internal/engine/... ./internal/policy/

Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)

These two findings share the barrier/seal subsystem and should be addressed together.

#39 — TOCTOU Race in Barrier Get/Put

Risk: Get and Put copy the mek slice header and keys map reference under RLock, release the lock, then use the copied references for encryption/decryption. A concurrent Seal() zeroizes the underlying byte slices in place before nil-ing the fields, so a concurrent reader uses zeroized key material.

Root cause: The lock does not cover the crypto operation. The "copy" is a shallow reference copy (slice header), not a deep byte copy. Seal() zeroizes the backing array, which is shared.

Current locking pattern (barrier.go):

Get:  RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put:  RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock  → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock

Fix: Hold RLock through the entire crypto operation:

func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
    if err := validatePath(path); err != nil {
        return nil, err
    }
    b.mu.RLock()
    defer b.mu.RUnlock()
    if b.mek == nil {
        return nil, ErrSealed
    }
    // query DB, resolve key, decrypt — all under RLock
    // ...
}

This is the minimal, safest change. RLock permits concurrent readers, so there is no throughput regression for parallel Get/Put operations. The only serialization point is Seal(), which acquires the exclusive Lock and waits for all readers to drain — exactly the semantics we want.

Apply the same pattern to Put, Delete, and List.

Alternative considered: Atomic pointer swap (atomic.Pointer[keyState]). This eliminates the lock from the hot path entirely, but introduces complexity around deferred zeroization of the old state (readers may still hold references). The RLock-through-crypto approach is simpler and sufficient for Metacrypt's concurrency profile.
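
For reference, a sketch of that alternative (type and method names invented here) shows where the zeroization difficulty comes from:

```go
package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

type keyState struct {
    mek  []byte
    keys map[string][]byte
}

type barrier struct {
    state atomic.Pointer[keyState]
}

// load returns the current key state, or an error if sealed.
func (b *barrier) load() (*keyState, error) {
    st := b.state.Load()
    if st == nil {
        return nil, errors.New("barrier sealed")
    }
    return st, nil
}

// seal swaps in nil. The old state cannot be zeroized here: a reader that
// called load() a moment ago may still hold the returned pointer. Safe
// zeroization would need reference counting or epoch-based reclamation.
func (b *barrier) seal() *keyState {
    return b.state.Swap(nil)
}

func main() {
    b := &barrier{}
    b.state.Store(&keyState{mek: []byte("k"), keys: map[string][]byte{}})
    st, _ := b.load()
    old := b.seal()
    _, err := b.load()
    fmt.Println(st != nil, old != nil, err != nil) // true true true
}
```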

#40 — Crash During ReWrapKeys Loses All Data

Risk: RotateMEK calls barrier.ReWrapKeys(newMEK) which commits a transaction re-wrapping all DEKs, then separately updates seal_config with the new encrypted MEK. A crash between these two database operations leaves DEKs wrapped with a MEK that is not persisted — all data is irrecoverable.

Current flow (seal.go lines 245313):

1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK)   ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil)  ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ?  ← separate statement, not in transaction
   *** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK

Fix: Unify steps 2–4 into a single database transaction.

Refactor ReWrapKeys to accept an optional *sql.Tx:

// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
    // Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
    rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
    // ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}

// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
    b.mu.Lock()
    defer b.mu.Unlock()
    mcrypto.Zeroize(b.mek)
    b.mek = newMEK
}

Then in RotateMEK:

func (m *Manager) RotateMEK(ctx context.Context, password string) error {
    // ... derive KWK, generate newMEK ...

    tx, err := m.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Re-wrap all DEKs within the transaction.
    if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
        return err
    }

    // Update seal_config within the same transaction.
    encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
    if err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx,
        "UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
        encNewMEK,
    ); err != nil {
        return err
    }

    if err := tx.Commit(); err != nil {
        return err
    }

    // Only after commit: update in-memory state.
    m.barrier.SwapMEK(newMEK)
    return nil
}

SQLite in WAL mode handles this correctly — the transaction is atomic regardless of process crash. The barrier_keys and seal_config updates either both commit or neither does.

Files

File Change
internal/barrier/barrier.go Extend RLock scope in Get/Put/Delete/List; add ReWrapKeysTx, SwapMEK
internal/seal/seal.go Wrap ReWrapKeysTx + seal_config UPDATE in single transaction
internal/barrier/barrier_test.go Add concurrent Get/Seal stress test

Verification

  • go test -race ./internal/barrier/ ./internal/seal/
  • Add TestConcurrentGetSeal: spawn goroutines doing Get while another goroutine calls Seal. Run with -race. Verify no panics or data races.
  • Add TestRotateMEKAtomic: verify that barrier_keys and seal_config are updated in the same transaction (mock the DB to detect transaction boundaries, or verify via rollback behavior).

Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)

These four findings touch separate files with no overlap and can be addressed in parallel.

#49 — No TTL Ceiling in CA Certificate Issuance

Risk: A non-admin user can request an arbitrarily long certificate lifetime. The issuer's MaxTTL exists in config but is not enforced during handleIssue or handleSignCSR.

Root cause: The CA engine applies the user's requested TTL directly to the certificate without comparing it against issuerConfig.MaxTTL. The SSH CA engine correctly enforces this via resolveTTL — the CA engine does not.

Fix: Add a resolveTTL method to the CA engine, following the SSH CA engine's pattern (sshca.go lines 902–932):

func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
    maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
    if err != nil {
        maxTTL = 2160 * time.Hour // 90 days fallback
    }

    if requested != "" {
        ttl, err := time.ParseDuration(requested)
        if err != nil {
            return 0, fmt.Errorf("invalid TTL: %w", err)
        }
        if ttl > maxTTL {
            return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
        }
        return ttl, nil
    }

    return maxTTL, nil
}

Call this in handleIssue and handleSignCSR before constructing the certificate. Replace the raw TTL string with the validated duration.

File Change
internal/engine/ca/ca.go Add resolveTTL, call from handleIssue and handleSignCSR
internal/engine/ca/ca_test.go Add test: issue cert with TTL > MaxTTL, verify rejection

#61 — Ineffective ECDH Key Zeroization

Risk: privKey.Bytes() returns a copy of the private key bytes. Zeroizing the copy leaves the original inside *ecdh.PrivateKey. Go's crypto/ecdh API does not expose the internal byte slice.

Root cause: Language/API limitation in Go's crypto/ecdh package.

Fix: Store the raw private key bytes alongside the parsed key in userState, and zeroize those bytes on seal:

type userState struct {
    privKey    *ecdh.PrivateKey
    privBytes  []byte            // raw key bytes, retained for zeroization
    pubKey     *ecdh.PublicKey
    config     *UserKeyConfig
}

On load from barrier (Unseal, auto-provision):

raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw  // retain for zeroization
state.privKey = priv

On Seal:

mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil

Document the limitation: the parsed *ecdh.PrivateKey struct's internal copy cannot be zeroized from Go code. Setting privKey = nil makes it eligible for GC, but does not guarantee immediate byte overwrite. This is an accepted Go runtime limitation.

File Change
internal/engine/user/user.go Add privBytes to userState, populate on load, zeroize on Seal
internal/engine/user/types.go Update userState struct

#62 — User Engine Policy Path Uses mountPath Instead of Mount Name

Risk: Policy checks construct the resource path using e.mountPath (which is engine/user/{name}/) instead of just the mount name. Policy rules match against user/{name}/recipient/{username}, so the full mount path creates a mismatch like user/engine/user/myengine//recipient/alice. No policy rule will ever match.

Root cause: Line 358 of user.go uses e.mountPath directly. The SSH CA and transit engines correctly use a mountName() helper.

Fix: Add a mountName() method to the user engine:

func (e *UserEngine) mountName() string {
    // mountPath is "engine/user/{name}/"
    parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return e.mountPath
}

Change line 358:

resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)

Audit all other resource path constructions in the user engine to confirm they also use the correct mount name.
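
A quick check of the extraction and the resulting resource string (mount name "myengine" and recipient "alice" are hypothetical examples):

```go
package main

import (
    "fmt"
    "strings"
)

// mountName extracts the mount name from "engine/user/{name}/".
func mountName(mountPath string) string {
    parts := strings.Split(strings.TrimSuffix(mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return mountPath
}

func main() {
    mp := "engine/user/myengine/" // hypothetical mount path
    resource := fmt.Sprintf("user/%s/recipient/%s", mountName(mp), "alice")
    fmt.Println(resource) // user/myengine/recipient/alice
}
```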

File Change
internal/engine/user/user.go Add mountName(), fix resource path on line 358
internal/engine/user/user_test.go Add test: verify policy resource path format

#69 — Typed REST Handlers Bypass Policy Engine

Risk: 18 typed REST handlers pass nil for CheckPolicy in the engine.Request, skipping service-level policy evaluation. The generic /v1/engine/request endpoint correctly passes a policyChecker. Since the engines default to allow when no policy rule matches (findings #54 and #58), typed routes are effectively unprotected by policy.

Root cause: Typed handlers were modeled after admin-only operations (which don't need policy) but applied to user-accessible operations.

Fix: Extract the policy checker construction from handleEngineRequest into a shared helper:

func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
    return func(resource, action string) (string, bool) {
        effect, matched, err := s.policy.Check(
            info.Username, info.Roles, resource, action,
        )
        if err != nil || !matched {
            return "deny", false
        }
        return effect, matched
    }
}

Then in each typed handler, set CheckPolicy on the request:

req := &engine.Request{
    Operation:   "get-cert",
    Data:        data,
    CallerInfo:  callerInfo,
    CheckPolicy: s.newPolicyChecker(callerInfo),
}

18 handlers to update:

Handler Operation
handleGetCert get-cert
handleRevokeCert revoke-cert
handleDeleteCert delete-cert
handleSSHCASignHost sign-host
handleSSHCASignUser sign-user
handleSSHCAGetProfile get-profile
handleSSHCAListProfiles list-profiles
handleSSHCADeleteProfile delete-profile
handleSSHCAGetCert get-cert
handleSSHCAListCerts list-certs
handleSSHCARevokeCert revoke-cert
handleSSHCADeleteCert delete-cert
handleUserRegister register
handleUserProvision provision
handleUserListUsers list-users
handleUserGetPublicKey get-public-key
handleUserDeleteUser delete-user
handleUserDecrypt decrypt

Note: handleUserEncrypt already passes a policy checker — verify it uses the same shared helper after refactoring. Admin-only handlers (behind requireAdmin wrapper) do not need a policy checker since admin bypasses policy.

File Change
internal/server/routes.go Add newPolicyChecker, pass to all 18 typed handlers
internal/server/server_test.go Add test: policy-denied user is rejected by typed route

Verification (Work Item 4, all findings)

  • go test ./internal/engine/ca/
  • go test ./internal/engine/user/
  • go test ./internal/server/
  • go vet ./...

Implementation Order

1. #68  JSON injection             (standalone, ship immediately)
2. #48  Path traversal             (standalone, blocks safe engine operation)
3. #39  Barrier TOCTOU race    ─┐
   #40  ReWrapKeys crash safety ┘  (coupled, requires careful testing)
4. #49  CA TTL enforcement     ─┐
   #61  ECDH zeroization       │
   #62  User policy path       │  (independent fixes, parallelizable)
   #69  Policy bypass          ─┘

Items 1 and 2 have no dependencies and can be done in parallel by different engineers.

Items 3 and 4 can also be done in parallel since they touch different subsystems (barrier/seal vs engines/server).


Post-Remediation (High)

All eight High findings (#39, #40, #48, #49, #61, #62, #68, #69) have been resolved. See commit a80323e.



Remediation Plan — Medium-Priority Audit Findings

Date: 2026-03-17
Scope: AUDIT.md findings #7, #10, #21, #41, #42, #43, #46, #50, #51, #53, #54, #58, #59, #63, #64, #65, #66, #67, #70, #72, #73, #74, #78, #79

This plan addresses all medium-severity findings. Findings are grouped into eight work items by subsystem. Two findings (#72, #79) are already resolved or trivial and are noted inline.

Previously resolved by High remediation:

  • #72 (policy rule ID path traversal) — covered by #48's ValidateName on CreateRule and barrier-level validatePath.

Work Item 5: Default-Deny in Engine Policy Checks (#54, #58)

Risk: The SSH CA and transit engines default to allow when no policy rule matches a non-admin user's request. This contradicts the default-deny principle in ARCHITECTURE.md and the engineering standards. Now that #69 is resolved (typed REST handlers pass CheckPolicy), the engines receive a policy checker — but when CheckPolicy returns no match, they still allow.

Root cause: Both engines treat "no matching policy rule" as implicit allow for authenticated users.

Fix

Change the policy evaluation result handling in both engines.

SSH CA (sshca.go, handleSignHost ~line 309, handleSignUser ~line 443):

// Before (default-allow):
if matched && effect == "deny" {
    return nil, ErrForbidden
}

// After (default-deny):
if req.CheckPolicy != nil {
    effect, matched := req.CheckPolicy(resource, "sign")
    if !matched || effect != "allow" {
        return nil, ErrForbidden
    }
}

Transit (transit.go, requireUserWithPolicy ~line 330):

// Before (default-allow):
if matched {
    if effect == "deny" {
        return ErrForbidden
    }
    return nil
}
return nil  // no match → allow

// After (default-deny):
if req.CheckPolicy == nil {
    return ErrForbidden
}
effect, matched := req.CheckPolicy(resource, action)
if !matched || effect != "allow" {
    return ErrForbidden
}
return nil

Apply the same pattern to any other policy check sites in these engines.
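
The allow condition in both engines reduces to a single predicate; a truth-table check makes the semantics explicit:

```go
package main

import "fmt"

// allowed is the default-deny predicate: only a matched,
// explicit "allow" rule permits the operation.
func allowed(matched bool, effect string) bool {
    return matched && effect == "allow"
}

func main() {
    fmt.Println(allowed(false, ""))     // false: no matching rule, deny
    fmt.Println(allowed(true, "deny"))  // false: explicit deny
    fmt.Println(allowed(true, "allow")) // true: explicit allow
}
```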

Files

File Change
internal/engine/sshca/sshca.go Default-deny in handleSignHost, handleSignUser
internal/engine/transit/transit.go Default-deny in requireUserWithPolicy
internal/engine/sshca/sshca_test.go Add test: no policy rules → user request denied
internal/engine/transit/transit_test.go Add test: no policy rules → user request denied

Work Item 6: User Engine Concurrency and Safety (#63, #64, #65, #66, #67)

Five findings in the user engine share a root cause: the engine was implemented after the others and does not follow the same concurrency patterns.

#63 — Missing role checks on decrypt, re-encrypt, rotate-key

Fix: Add IsUser() check to handleDecrypt, handleReEncrypt, and handleRotateKey, matching the pattern used in handleEncrypt:

if !req.CallerInfo.IsUser() {
    return nil, ErrForbidden
}

#64 — Initialize holds no mutex

Fix: Acquire the write lock at the top of Initialize:

func (e *UserEngine) Initialize(ctx context.Context, ...) error {
    e.mu.Lock()
    defer e.mu.Unlock()
    // ... existing body ...
}

This matches the pattern in the SSH CA and transit engines.

#65 — handleEncrypt releases lock before using state

Fix: Hold RLock (not write lock) through the ECDH and encryption operations. After auto-provisioning (which needs the write lock), downgrade to a read lock for the crypto work:

e.mu.Lock()
// ... auto-provisioning ...
// Capture references while holding write lock.
senderState := e.users[sender]
recipientStates := ...
e.mu.Unlock()

// Re-acquire RLock for crypto operations.
e.mu.RLock()
defer e.mu.RUnlock()
// verify states are still valid (not nil)
if senderState.privKey == nil {
    return nil, ErrSealed
}
// ... ECDH, wrap DEK, encrypt ...

Alternatively, hold the write lock through the entire operation (simpler but serializes all encrypts). Given that encryption is the hot path, the downgrade approach is preferred.

#66 — handleReEncrypt manual lock without defer

Fix: Restructure to use defer for all lock releases. Split the method into a provisioning phase (write lock) and a crypto phase (read lock), each with its own defer:

func (e *UserEngine) handleReEncrypt(...) {
    // Phase 1: read envelope, resolve users.
    e.mu.RLock()
    // ... read state ...
    e.mu.RUnlock()

    // Phase 2: crypto operations.
    e.mu.RLock()
    defer e.mu.RUnlock()
    // ... decrypt, re-wrap, encrypt ...
}

#67 — No sealed-state check in HandleRequest

Fix: Add a sealed-state guard at the top of HandleRequest, before the operation dispatch:

func (e *UserEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    e.mu.RLock()
    sealed := e.config == nil
    e.mu.RUnlock()
    if sealed {
        return nil, ErrSealed
    }

    switch req.Operation {
    // ...
    }
}

Files

File Change
internal/engine/user/user.go All five fixes above
internal/engine/user/user_test.go Add tests: guest role denied; concurrent encrypt/seal; sealed-state rejection

Verification

go test -race ./internal/engine/user/

Work Item 7: Barrier and Seal Fixes (#41, #42, #43)

#41 — loadKeys errors silently swallowed during unseal

Risk: If loadKeys fails to decrypt DEK entries (corrupt data, wrong MEK), errors are swallowed and the keys map is incomplete. Subsequent operations fail with confusing ErrKeyNotFound errors.

Fix: Return the error from loadKeys instead of ignoring it. Distinguish between "table doesn't exist" (acceptable on first run before migration) and genuine decryption failures:

func (b *AESGCMBarrier) Unseal(mek []byte) error {
    // ...
    if err := b.loadKeys(); err != nil {
        // Check if this is a "no such table" error (pre-migration).
        if !isTableMissing(err) {
            b.mek = nil
            return fmt.Errorf("barrier: load keys: %w", err)
        }
        // Pre-migration: no barrier_keys table yet, proceed with empty map.
    }
    return nil
}

func isTableMissing(err error) bool {
    return strings.Contains(err.Error(), "no such table")
}

#42 — No AAD binding on MEK encryption with KWK

Risk: The MEK ciphertext in seal_config has no AAD, so there is no cryptographic binding to its storage purpose.

Fix: Pass a constant AAD string when encrypting/decrypting the MEK:

var mekAAD = []byte("metacrypt/seal_config/mek")

// In Init:
encryptedMEK, err := crypto.Encrypt(kwk, mek, mekAAD)

// In Unseal:
mek, err := crypto.Decrypt(kwk, encryptedMEK, mekAAD)

// In RotateMEK:
newEncMEK, err := crypto.Encrypt(kwk, newMEK, mekAAD)

Migration note: Existing databases have MEKs encrypted with nil AAD. The unseal path must try mekAAD first, then fall back to nil for backward compatibility. After successful unseal with nil, re-encrypt with mekAAD and update seal_config.

#43 — Barrier List SQL LIKE with unescaped prefix

Risk: If a barrier path prefix contains % or _, the LIKE query matches unintended entries.

Fix: Escape SQL LIKE wildcards in the prefix:

func escapeLIKE(s string) string {
    s = strings.ReplaceAll(s, `\`, `\\`)
    s = strings.ReplaceAll(s, `%`, `\%`)
    s = strings.ReplaceAll(s, `_`, `\_`)
    return s
}

// In List:
rows, err := b.db.QueryContext(ctx,
    "SELECT path FROM barrier_entries WHERE path LIKE ? ESCAPE '\\'",
    escapeLIKE(prefix)+"%")
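
A worked example of the escaping, showing that the trailing % appended by List remains the only live wildcard:

```go
package main

import (
    "fmt"
    "strings"
)

func escapeLIKE(s string) string {
    s = strings.ReplaceAll(s, `\`, `\\`)
    s = strings.ReplaceAll(s, `%`, `\%`)
    s = strings.ReplaceAll(s, `_`, `\_`)
    return s
}

func main() {
    // A prefix containing both LIKE wildcards.
    fmt.Println(escapeLIKE("keys/a_b%") + "%")
    // prints keys/a\_b\%% — only the trailing % is unescaped
}
```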

Files

File Change
internal/barrier/barrier.go Fix loadKeys error handling; add escapeLIKE to List
internal/seal/seal.go Add mekAAD constant; use in Init, Unseal, RotateMEK with fallback
internal/barrier/barrier_test.go Add test: List with % and _ in paths
internal/seal/seal_test.go Add test: unseal with AAD-encrypted MEK

Work Item 8: CA Engine Fixes (#50, #51, #70)

#50 — Non-admin users can override key usages

Risk: A non-admin user can request cert sign or crl sign key usage on a leaf certificate, creating a de facto CA certificate.

Fix: Only allow admin users to override key_usages and ext_key_usages. Non-admin users get the profile defaults:

// In handleIssue, after profile lookup:
if req.CallerInfo.IsAdmin {
    if v, ok := req.Data["key_usages"].([]interface{}); ok {
        profile.KeyUse = toStringSlice(v)
    }
    if v, ok := req.Data["ext_key_usages"].([]interface{}); ok {
        profile.ExtKeyUsages = toStringSlice(v)
    }
}

#51 — Certificate renewal does not revoke original

Fix: After issuing the renewed certificate, revoke the original by serial. The original serial is available from the request data:

// In handleRenew, after successful issuance:
oldSerial := req.Data["serial"].(string)
if err := e.revokeCertBySerial(ctx, issuerName, oldSerial, req.CallerInfo.Username); err != nil {
    // Log but do not fail — the new cert is already issued.
    e.logger.Warn("failed to revoke original during renewal",
        "serial", oldSerial, "error", err)
}

#70 — RenewCert has no REST route (API sync violation)

Fix: Add a dedicated REST route for certificate renewal:

r.Post("/v1/ca/{mount}/cert/{serial}/renew", s.requireAuth(s.handleRenewCert))

Implement handleRenewCert following the typed handler pattern (with CheckPolicy).

Files

File Change
internal/engine/ca/ca.go Guard key usage overrides with IsAdmin; revoke original in handleRenew
internal/server/routes.go Add POST /v1/ca/{mount}/cert/{serial}/renew route and handler
internal/engine/ca/ca_test.go Add test: non-admin cannot set cert sign usage; renewal revokes original

Work Item 9: SSH CA Lock Granularity (#53)

Risk: HandleRequest acquires a write lock for all operations including reads (get-ca-pubkey, get-cert, list-certs, get-profile, list-profiles, get-krl). This serializes all operations unnecessarily.

Fix: Move locking into individual handlers. Read-only operations use RLock; mutating operations use Lock:

func (e *SSHCAEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    // No lock here — each handler manages its own.
    switch req.Operation {
    case "get-ca-pubkey":
        return e.handleGetCAPubkey(ctx, req) // uses RLock internally
    case "sign-host":
        return e.handleSignHost(ctx, req)     // uses Lock internally
    // ...
    }
}

Read-only handlers:

func (e *SSHCAEngine) handleGetCAPubkey(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()
    // ...
}

Mutating handlers (sign-host, sign-user, create-profile, update-profile, delete-profile, revoke-cert, delete-cert):

func (e *SSHCAEngine) handleSignHost(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    e.mu.Lock()
    defer e.mu.Unlock()
    // ...
}

Files

File Change
internal/engine/sshca/sshca.go Remove top-level lock from HandleRequest; add per-handler locks

Work Item 10: Transit Ciphertext Validation (#59)

Risk: parseVersionedData accepts negative version numbers. A crafted ciphertext metacrypt:v-1:... parses as version -1, which fails the key lookup but produces a confusing error.

Fix: Add a bounds check after parsing:

func parseVersionedData(s string) (int, []byte, error) {
    // ... existing parse logic ...
    version, err := strconv.Atoi(parts[1][1:])
    if err != nil {
        return 0, nil, ErrInvalidFormat
    }
    if version < 1 {
        return 0, nil, fmt.Errorf("transit: %w: version must be >= 1", ErrInvalidFormat)
    }
    // ...
}

Files

File Change
internal/engine/transit/transit.go Add version < 1 check in parseVersionedData
internal/engine/transit/transit_test.go Add test: metacrypt:v0:... and metacrypt:v-1:... rejected

Work Item 11: Database and Auth (#46, #74)

#46 — SQLite PRAGMAs only applied to first connection

Risk: database/sql may open additional connections that don't receive the journal_mode, foreign_keys, and busy_timeout PRAGMAs.

Fix: Use the _pragma DSN parameter supported by modernc.org/sqlite:

dsn := path + "?_pragma=journal_mode(WAL)&_pragma=foreign_keys(ON)&_pragma=busy_timeout(5000)"
db, err := sql.Open("sqlite", dsn)

Remove the manual PRAGMA execution loop. This ensures every connection opened by the pool receives the PRAGMAs.

#74 — Token validation cache grows without bound

Risk: The cache map in auth.go grows indefinitely. Expired entries are never proactively removed.

Fix: Add periodic eviction. Run a background goroutine that sweeps expired entries every minute:

func (a *Authenticator) startEviction(ctx context.Context) {
    ticker := time.NewTicker(time.Minute)
    go func() {
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                a.evictExpired()
            }
        }
    }()
}

func (a *Authenticator) evictExpired() {
    now := time.Now()
    a.mu.Lock()
    defer a.mu.Unlock()
    for key, entry := range a.cache {
        if now.After(entry.expiresAt) {
            delete(a.cache, key)
        }
    }
}

Call startEviction from the Authenticator constructor. Accept a context.Context to allow cancellation on shutdown.

Files

File Change
internal/db/db.go Switch to _pragma DSN parameters; remove manual PRAGMA loop
internal/auth/auth.go Add startEviction goroutine; call from constructor
internal/db/db_test.go Verify PRAGMAs are active on a fresh connection from the pool
internal/auth/auth_test.go Add test: expired entries are evicted after sweep

Work Item 12: Policy Engine Glob Matching (#73)

Risk: filepath.Match does not support ** for recursive directory matching. Administrators writing rules like engine/**/certs/* will find they don't match paths with multiple segments.

Fix: Replace filepath.Match with path.Match (POSIX-style, no OS-specific behavior) and add support for ** by splitting the pattern and value on / and matching segments:

import (
    "path"
    "strings"
)

func matchesAnyGlob(patterns []string, value string) bool {
    for _, p := range patterns {
        if matchGlob(p, value) {
            return true
        }
    }
    return false
}

func matchGlob(pattern, value string) bool {
    return matchSegments(strings.Split(pattern, "/"), strings.Split(value, "/"))
}

// matchSegments walks pattern and value segment by segment. "**" matches
// zero or more whole segments; every other pattern segment is matched
// with path.Match, so "*" stays within a single segment.
func matchSegments(pat, val []string) bool {
    for len(pat) > 0 {
        if pat[0] == "**" {
            // Try consuming zero or more value segments.
            for i := 0; i <= len(val); i++ {
                if matchSegments(pat[1:], val[i:]) {
                    return true
                }
            }
            return false
        }
        if len(val) == 0 {
            return false
        }
        ok, err := path.Match(pat[0], val[0])
        if err != nil || !ok {
            return false
        }
        pat, val = pat[1:], val[1:]
    }
    return len(val) == 0
}

Document in POLICY.md that ** matches zero or more path segments, while * matches within a single segment.

Files

| File | Change |
| --- | --- |
| internal/policy/policy.go | Replace filepath.Match with path.Match + ** support |
| internal/policy/policy_test.go | Add test: engine/**/certs/* matches engine/ca/prod/certs/abc123 |

Work Item 13: Deployment Fixes (#78, #79)

#78 — systemd ExecReload sends SIGHUP with no handler

Fix: Two options:

Option A (recommended): Remove ExecReload entirely. Metacrypt does not support graceful reload — configuration changes require a restart. The ExecReload line creates a false expectation.

Option B: Implement SIGHUP handling that re-reads the TOML config and applies non-breaking changes (log level, TLS cert reload). This is significant new functionality and should be a separate feature.

For now, remove ExecReload from both service units:

-ExecReload=/bin/kill -HUP $MAINPID

#79 — Dockerfiles use Go 1.23 but module requires Go 1.25

Fix: Update the builder base image:

-FROM golang:1.23-alpine AS builder
+FROM golang:1.25-alpine AS builder

Apply to both Dockerfile.api and Dockerfile.web. Also remove the unnecessary gcc and musl-dev installation since CGO_ENABLED=0.

-RUN apk add --no-cache gcc musl-dev

#7 — No audit logging

Note: This finding requires significant new functionality rather than a targeted fix (as does #21 below). Audit logging should be implemented as a dedicated subsystem:

  1. Define an AuditEvent struct with: timestamp, caller, operation, resource, outcome (success/denied/error), and metadata.
  2. Write events to a structured log sink (slog with JSON output to a dedicated file at /srv/metacrypt/audit.log).
  3. Instrument every engine HandleRequest to emit an event on completion.
  4. Instrument policy CreateRule/DeleteRule.
  5. Instrument seal/unseal operations.

This is a substantial feature. Track separately from the quick-fix items in this plan.

#10 — No extension allowlist for SSH host certificates

Note: Host certificates typically don't use extensions (extensions are for user certificates). The fix is to ignore the extensions field on host signing requests rather than passing it through:

// In handleSignHost, after building the certificate:
cert.Permissions.Extensions = nil // Host certs should not carry extensions

#21 — No rate limiting on transit cryptographic operations

Note: This requires a rate-limiting middleware or per-caller token bucket. Best implemented as server-level middleware applied to engine request handlers:

func (s *Server) requireRateLimit(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        info := TokenInfoFromContext(r.Context())
        if !s.rateLimiter.Allow(info.Username) {
            writeJSONError(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next(w, r)
    }
}

Track separately — this affects API design decisions (limits, quotas, per-user vs per-token).

Files

| File | Change |
| --- | --- |
| deploy/systemd/metacrypt.service | Remove ExecReload line |
| deploy/systemd/metacrypt-web.service | Remove ExecReload line |
| Dockerfile.api | Update to golang:1.25-alpine; remove gcc musl-dev |
| Dockerfile.web | Update to golang:1.25-alpine; remove gcc musl-dev |

Implementation Order

 5. #54, #58  Default-deny in engines     (security-critical, do first)
 6. #63–#67   User engine concurrency     (5 coupled fixes, one change)
 7. #41–#43   Barrier/seal fixes          (3 independent fixes)
 8. #50, #51, #70  CA engine fixes        (key usage + renewal + API sync)
 9. #53       SSH CA lock granularity      (standalone refactor)
10. #59       Transit version validation   (one-line fix)
11. #46, #74  DB pragmas + auth cache      (independent, no overlap)
12. #73       Policy glob matching         (standalone)
13. #78, #79  Deployment fixes             (non-code, standalone)
    #7        Audit logging                (new feature, track separately)
    #10       SSH host extensions          (one-line fix, ship with work item 9)
    #21       Transit rate limiting        (new feature, track separately)

Work items 5–10 can be parallelized across engineers:

  • Engineer A: work items 5 (default-deny) and 9 (SSH CA locks), plus finding #10 (host extensions)
  • Engineer B: work items 6 (user concurrency) and 8 (CA fixes)
  • Engineer C: work items 7 (barrier/seal), 10 (transit version), and 11 (DB/auth)
  • Independent: work items 12 (policy) and 13 (deployment)

Findings #7 (audit logging) and #21 (rate limiting) are substantial new features and should be tracked as separate engineering tasks, not quick fixes.


Post-Remediation (Medium)

After all medium findings are resolved:

  1. Update AUDIT.md — mark each finding as RESOLVED with summary.
  2. Run the full pipeline: make all (vet, lint, test, build).
  3. Run race detector: go test -race ./... — especially important for work items 6 and 9 (concurrency changes).
  4. Address remaining low findings — 14 low-severity items remain, mostly zeroization gaps and documentation drift.