metacrypt/REMEDIATION.md
Kyle Isom 5c5d7e184e Fix ECDH zeroization, add audit logging, and remediate high findings
- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored
  privBytes instead of calling Bytes() (which returns a copy). New
  state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event
  recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:04:39 -07:00


Remediation Plan — High-Priority Audit Findings

Date: 2026-03-17
Scope: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69

This plan addresses all eight High-severity findings from the 2026-03-17 full system audit. Findings are grouped into four work items by shared root cause or affected subsystem. The order reflects dependency chains: #68 is a standalone fix that should ship first; #48 is a prerequisite for safe operation across all engines; #39/#40 affect the storage core; the remaining four affect specific engines.


Work Item 1: JSON Injection in REST Error Responses (#68)

Risk: An error message containing " or \ breaks the JSON response structure. If the error contains attacker-controlled input (e.g., a mount name or key name that triggers a downstream error), this enables JSON injection in API responses.

Root cause: 13 locations in internal/server/routes.go construct JSON error responses via string concatenation:

http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)

The writeEngineError helper (line 1704) is the most common entry point; most typed handlers call it.

Fix

  1. Replace writeEngineError with a safe JSON encoder:

    func writeJSONError(w http.ResponseWriter, msg string, code int) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(code)
        _ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
    }
    
  2. Replace all 13 call sites that use string concatenation with writeJSONError(w, grpcMessage(err), status) or writeJSONError(w, err.Error(), status).

    The grpcMessage helper already exists in the webserver package and extracts human-readable messages from gRPC errors. Add an equivalent to the REST server, and prefer it over raw err.Error() to avoid leaking internal error details.

  3. Grep for the pattern "error":" in routes.go to confirm no remaining string-concatenated JSON.

Files

File Change
internal/server/routes.go Replace writeEngineError and all 13 inline error sites

Verification

  • go vet ./internal/server/
  • go test ./internal/server/
  • Manual test: mount an engine with a name containing ", trigger an error, verify the response is valid JSON.

Work Item 2: Path Traversal via Unsanitized Names (#48)

Risk: User-controlled strings (issuer names, key names, profile names, usernames, mount names) are concatenated directly into barrier storage paths. An input containing ../ traverses the barrier namespace, allowing reads and writes to arbitrary paths. This affects all four engines and the engine registry.

Root cause: No validation exists at any layer — neither the barrier's Put/Get/Delete methods nor the engines sanitize path components.

Vulnerable locations

File | Input | Path Pattern
ca/ca.go | issuer name | mountPath + "issuers/" + name + "/"
sshca/sshca.go | profile name | mountPath + "profiles/" + name + ".json"
transit/transit.go | key name | mountPath + "keys/" + name + "/"
user/user.go | username | mountPath + "users/" + username + "/"
engine/engine.go | mount name | engine/{type}/{name}/
policy/policy.go | rule ID | policy/rules/{id}

Fix

Enforce validation at two layers (defense in depth):

  1. Barrier layer — reject paths containing .. segments.

    Add a validatePath check at the top of Get, Put, Delete, and List in barrier.go:

    var ErrInvalidPath = errors.New("barrier: invalid path")
    
    func validatePath(p string) error {
        for _, seg := range strings.Split(p, "/") {
            if seg == ".." {
                return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
            }
        }
        return nil
    }
    

    Call validatePath at the entry of Get, Put, Delete, List. Return ErrInvalidPath on failure.

  2. Engine/registry layer — validate entity names at input boundaries.

    Add a ValidateName helper to internal/engine/:

    var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
    
    func ValidateName(name string) error {
        if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
            return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
                "dot, hyphen, or underscore characters", name)
        }
        return nil
    }
    

    Call ValidateName in:

    Location | Input validated
    engine.go Mount() | mount name
    ca.go handleCreateIssuer | issuer name
    sshca.go handleCreateProfile | profile name
    transit.go handleCreateKey | key name
    user.go handleRegister, handleProvision | username
    user.go handleEncrypt | recipient usernames
    policy.go CreateRule | rule ID

    Note: certificate serials are generated server-side from crypto/rand and hex-encoded, so they are safe. Validate anyway for defense in depth.
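
The helper can be exercised standalone to confirm traversal inputs never reach the barrier (sketch; validName reimplements the proposed check inline):

```go
package main

import (
    "fmt"
    "regexp"
)

// Same pattern as the proposed helper: '/' is not in the character class,
// so any path-traversal attempt fails the match outright.
var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

func validName(name string) bool {
    return name != "" && len(name) <= 128 && namePattern.MatchString(name)
}

func main() {
    for _, n := range []string{"prod-ca", "web_v1.2", "../evil", "..", ""} {
        fmt.Printf("%q -> %v\n", n, validName(n))
    }
}
```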

Files

File Change
internal/barrier/barrier.go Add validatePath, call from Get/Put/Delete/List
internal/engine/engine.go Add ValidateName, call from Mount
internal/engine/ca/ca.go Call ValidateName on issuer name
internal/engine/sshca/sshca.go Call ValidateName on profile name
internal/engine/transit/transit.go Call ValidateName on key name
internal/engine/user/user.go Call ValidateName on usernames
internal/policy/policy.go Call ValidateName on rule ID

Verification

  • Add TestValidatePath to barrier_test.go: confirm ../ and .. are rejected; confirm normal paths pass.
  • Add TestValidateName to engine_test.go: confirm ../evil, empty string, and overlong names are rejected; confirm valid names pass.
  • go test ./internal/barrier/ ./internal/engine/... ./internal/policy/

Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)

These two findings share the barrier/seal subsystem and should be addressed together.

#39 — TOCTOU Race in Barrier Get/Put

Risk: Get and Put copy the mek slice header and keys map reference under RLock, release the lock, then use the copied references for encryption/decryption. A concurrent Seal() zeroizes the underlying byte slices in place before nil-ing the fields, so a concurrent reader uses zeroized key material.

Root cause: The lock does not cover the crypto operation. The "copy" is a shallow reference copy (slice header), not a deep byte copy. Seal() zeroizes the backing array, which is shared.

Current locking pattern (barrier.go):

Get:  RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put:  RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock  → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock

Fix: Hold RLock through the entire crypto operation:

func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
    if err := validatePath(path); err != nil {
        return nil, err
    }
    b.mu.RLock()
    defer b.mu.RUnlock()
    if b.mek == nil {
        return nil, ErrSealed
    }
    // query DB, resolve key, decrypt — all under RLock
    // ...
}

This is the minimal, safest change. RLock permits concurrent readers, so there is no throughput regression for parallel Get/Put operations. The only serialization point is Seal(), which acquires the exclusive Lock and waits for all readers to drain — exactly the semantics we want.

Apply the same pattern to Put, Delete, and List.

Alternative considered: Atomic pointer swap (atomic.Pointer[keyState]). This eliminates the lock from the hot path entirely, but introduces complexity around deferred zeroization of the old state (readers may still hold references). The RLock-through-crypto approach is simpler and sufficient for Metacrypt's concurrency profile.
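
For reference, a sketch of that alternative (type and method names invented here) shows where the zeroization difficulty comes from:

```go
package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

type keyState struct {
    mek  []byte
    keys map[string][]byte
}

type barrier struct {
    state atomic.Pointer[keyState]
}

// load returns the current key state, or an error if sealed.
func (b *barrier) load() (*keyState, error) {
    st := b.state.Load()
    if st == nil {
        return nil, errors.New("barrier sealed")
    }
    return st, nil
}

// seal swaps in nil. The old state cannot be zeroized here: a reader that
// called load() a moment ago may still hold the returned pointer. Safe
// zeroization would need reference counting or epoch-based reclamation.
func (b *barrier) seal() *keyState {
    return b.state.Swap(nil)
}

func main() {
    b := &barrier{}
    b.state.Store(&keyState{mek: []byte("k"), keys: map[string][]byte{}})
    st, _ := b.load()
    old := b.seal()
    _, err := b.load()
    fmt.Println(st != nil, old != nil, err != nil) // true true true
}
```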

#40 — Crash During ReWrapKeys Loses All Data

Risk: RotateMEK calls barrier.ReWrapKeys(newMEK) which commits a transaction re-wrapping all DEKs, then separately updates seal_config with the new encrypted MEK. A crash between these two database operations leaves DEKs wrapped with a MEK that is not persisted — all data is irrecoverable.

Current flow (seal.go lines 245313):

1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK)   ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil)  ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ?  ← separate statement, not in transaction
   *** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK

Fix: Unify steps 2–4 into a single database transaction.

Refactor ReWrapKeys to accept an optional *sql.Tx:

// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
    // Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
    rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
    // ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}

// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
    b.mu.Lock()
    defer b.mu.Unlock()
    mcrypto.Zeroize(b.mek)
    b.mek = newMEK
}

Then in RotateMEK:

func (m *Manager) RotateMEK(ctx context.Context, password string) error {
    // ... derive KWK, generate newMEK ...

    tx, err := m.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Re-wrap all DEKs within the transaction.
    if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
        return err
    }

    // Update seal_config within the same transaction.
    encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
    if err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx,
        "UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
        encNewMEK,
    ); err != nil {
        return err
    }

    if err := tx.Commit(); err != nil {
        return err
    }

    // Only after commit: update in-memory state.
    m.barrier.SwapMEK(newMEK)
    return nil
}

SQLite in WAL mode handles this correctly — the transaction is atomic regardless of process crash. The barrier_keys and seal_config updates either both commit or neither does.

Files

File Change
internal/barrier/barrier.go Extend RLock scope in Get/Put/Delete/List; add ReWrapKeysTx, SwapMEK
internal/seal/seal.go Wrap ReWrapKeysTx + seal_config UPDATE in single transaction
internal/barrier/barrier_test.go Add concurrent Get/Seal stress test

Verification

  • go test -race ./internal/barrier/ ./internal/seal/
  • Add TestConcurrentGetSeal: spawn goroutines doing Get while another goroutine calls Seal. Run with -race. Verify no panics or data races.
  • Add TestRotateMEKAtomic: verify that barrier_keys and seal_config are updated in the same transaction (mock the DB to detect transaction boundaries, or verify via rollback behavior).

Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)

These four findings touch separate files with no overlap and can be addressed in parallel.

#49 — No TTL Ceiling in CA Certificate Issuance

Risk: A non-admin user can request an arbitrarily long certificate lifetime. The issuer's MaxTTL exists in config but is not enforced during handleIssue or handleSignCSR.

Root cause: The CA engine applies the user's requested TTL directly to the certificate without comparing it against issuerConfig.MaxTTL. The SSH CA engine correctly enforces this via resolveTTL — the CA engine does not.

Fix: Add a resolveTTL method to the CA engine, following the SSH CA engine's pattern (sshca.go lines 902–932):

func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
    maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
    if err != nil {
        maxTTL = 2160 * time.Hour // 90 days fallback
    }

    if requested != "" {
        ttl, err := time.ParseDuration(requested)
        if err != nil {
            return 0, fmt.Errorf("invalid TTL: %w", err)
        }
        if ttl > maxTTL {
            return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
        }
        return ttl, nil
    }

    return maxTTL, nil
}

Call this in handleIssue and handleSignCSR before constructing the certificate. Replace the raw TTL string with the validated duration.

File Change
internal/engine/ca/ca.go Add resolveTTL, call from handleIssue and handleSignCSR
internal/engine/ca/ca_test.go Add test: issue cert with TTL > MaxTTL, verify rejection

#61 — Ineffective ECDH Key Zeroization

Risk: privKey.Bytes() returns a copy of the private key bytes. Zeroizing the copy leaves the original inside *ecdh.PrivateKey. Go's crypto/ecdh API does not expose the internal byte slice.

Root cause: Language/API limitation in Go's crypto/ecdh package.

Fix: Store the raw private key bytes alongside the parsed key in userState, and zeroize those bytes on seal:

type userState struct {
    privKey    *ecdh.PrivateKey
    privBytes  []byte            // raw key bytes, retained for zeroization
    pubKey     *ecdh.PublicKey
    config     *UserKeyConfig
}

On load from barrier (Unseal, auto-provision):

raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw  // retain for zeroization
state.privKey = priv

On Seal:

mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil

Document the limitation: the parsed *ecdh.PrivateKey struct's internal copy cannot be zeroized from Go code. Setting privKey = nil makes it eligible for GC, but does not guarantee immediate byte overwrite. This is an accepted Go runtime limitation.

File Change
internal/engine/user/user.go Add privBytes to userState, populate on load, zeroize on Seal
internal/engine/user/types.go Update userState struct

#62 — User Engine Policy Path Uses mountPath Instead of Mount Name

Risk: Policy checks construct the resource path using e.mountPath (which is engine/user/{name}/) instead of just the mount name. Policy rules match against user/{name}/recipient/{username}, so the full mount path creates a mismatch like user/engine/user/myengine//recipient/alice. No policy rule will ever match.

Root cause: Line 358 of user.go uses e.mountPath directly. The SSH CA and transit engines correctly use a mountName() helper.

Fix: Add a mountName() method to the user engine:

func (e *UserEngine) mountName() string {
    // mountPath is "engine/user/{name}/"
    parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return e.mountPath
}

Change line 358:

resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)

Audit all other resource path constructions in the user engine to confirm they also use the correct mount name.
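
A quick check of the extraction and the resulting resource string (mount name "myengine" and recipient "alice" are hypothetical examples):

```go
package main

import (
    "fmt"
    "strings"
)

// mountName extracts the mount name from "engine/user/{name}/".
func mountName(mountPath string) string {
    parts := strings.Split(strings.TrimSuffix(mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return mountPath
}

func main() {
    mp := "engine/user/myengine/" // hypothetical mount path
    resource := fmt.Sprintf("user/%s/recipient/%s", mountName(mp), "alice")
    fmt.Println(resource) // user/myengine/recipient/alice
}
```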

File Change
internal/engine/user/user.go Add mountName(), fix resource path on line 358
internal/engine/user/user_test.go Add test: verify policy resource path format

#69 — Typed REST Handlers Bypass Policy Engine

Risk: 18 typed REST handlers pass nil for CheckPolicy in the engine.Request, skipping service-level policy evaluation. The generic /v1/engine/request endpoint correctly passes a policyChecker. Since the engines default to allow when no policy rule matches (findings #54 and #58), typed routes are effectively unprotected by policy.

Root cause: Typed handlers were modeled after admin-only operations (which don't need policy) but applied to user-accessible operations.

Fix: Extract the policy checker construction from handleEngineRequest into a shared helper:

func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
    return func(resource, action string) (string, bool) {
        effect, matched, err := s.policy.Check(
            info.Username, info.Roles, resource, action,
        )
        if err != nil || !matched {
            return "deny", false
        }
        return effect, matched
    }
}

Then in each typed handler, set CheckPolicy on the request:

req := &engine.Request{
    Operation:   "get-cert",
    Data:        data,
    CallerInfo:  callerInfo,
    CheckPolicy: s.newPolicyChecker(callerInfo),
}

18 handlers to update:

Handler Operation
handleGetCert get-cert
handleRevokeCert revoke-cert
handleDeleteCert delete-cert
handleSSHCASignHost sign-host
handleSSHCASignUser sign-user
handleSSHCAGetProfile get-profile
handleSSHCAListProfiles list-profiles
handleSSHCADeleteProfile delete-profile
handleSSHCAGetCert get-cert
handleSSHCAListCerts list-certs
handleSSHCARevokeCert revoke-cert
handleSSHCADeleteCert delete-cert
handleUserRegister register
handleUserProvision provision
handleUserListUsers list-users
handleUserGetPublicKey get-public-key
handleUserDeleteUser delete-user
handleUserDecrypt decrypt

Note: handleUserEncrypt already passes a policy checker — verify it uses the same shared helper after refactoring. Admin-only handlers (behind requireAdmin wrapper) do not need a policy checker since admin bypasses policy.

File Change
internal/server/routes.go Add newPolicyChecker, pass to all 18 typed handlers
internal/server/server_test.go Add test: policy-denied user is rejected by typed route

Verification (Work Item 4, all findings)

  • go test ./internal/engine/ca/
  • go test ./internal/engine/user/
  • go test ./internal/server/
  • go vet ./...

Implementation Order

1. #68  JSON injection             (standalone, ship immediately)
2. #48  Path traversal             (standalone, blocks safe engine operation)
3. #39  Barrier TOCTOU race    ─┐
   #40  ReWrapKeys crash safety ┘  (coupled, requires careful testing)
4. #49  CA TTL enforcement     ─┐
   #61  ECDH zeroization       │
   #62  User policy path       │  (independent fixes, parallelizable)
   #69  Policy bypass          ─┘

Items 1 and 2 have no dependencies and can be done in parallel by different engineers.

Items 3 and 4 can also be done in parallel since they touch different subsystems (barrier/seal vs engines/server).


Post-Remediation (High)

All eight High findings (#39, #40, #48, #49, #61, #62, #68, #69) have been resolved. See commit a80323e.



Remediation Plan — Medium-Priority Audit Findings

Date: 2026-03-17
Scope: AUDIT.md findings #7, #10, #21, #41, #42, #43, #46, #50, #51, #53, #54, #58, #59, #63, #64, #65, #66, #67, #70, #72, #73, #74, #78, #79

This plan addresses all medium-severity findings. Findings are grouped into eight work items by subsystem. Two findings (#72, #79) are already resolved or trivial and are noted inline.

Previously resolved by High remediation:

  • #72 (policy rule ID path traversal) — covered by #48's ValidateName on CreateRule and barrier-level validatePath.

Work Item 5: Default-Deny in Engine Policy Checks (#54, #58)

Risk: The SSH CA and transit engines default to allow when no policy rule matches a non-admin user's request. This contradicts the default-deny principle in ARCHITECTURE.md and the engineering standards. Now that #69 is resolved (typed REST handlers pass CheckPolicy), the engines receive a policy checker — but when CheckPolicy returns no match, they still allow.

Root cause: Both engines treat "no matching policy rule" as implicit allow for authenticated users.

Fix

Change the policy evaluation result handling in both engines.

SSH CA (sshca.go, handleSignHost ~line 309, handleSignUser ~line 443):

// Before (default-allow):
if matched && effect == "deny" {
    return nil, ErrForbidden
}

// After (default-deny):
if req.CheckPolicy != nil {
    effect, matched := req.CheckPolicy(resource, "sign")
    if !matched || effect != "allow" {
        return nil, ErrForbidden
    }
}

Transit (transit.go, requireUserWithPolicy ~line 330):

// Before (default-allow):
if matched {
    if effect == "deny" {
        return ErrForbidden
    }
    return nil
}
return nil  // no match → allow

// After (default-deny):
if req.CheckPolicy == nil {
    return ErrForbidden
}
effect, matched := req.CheckPolicy(resource, action)
if !matched || effect != "allow" {
    return ErrForbidden
}
return nil

Apply the same pattern to any other policy check sites in these engines.
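
The allow condition in both engines reduces to a single predicate; a truth-table check makes the semantics explicit:

```go
package main

import "fmt"

// allowed is the default-deny predicate: only a matched,
// explicit "allow" rule permits the operation.
func allowed(matched bool, effect string) bool {
    return matched && effect == "allow"
}

func main() {
    fmt.Println(allowed(false, ""))     // false: no matching rule, deny
    fmt.Println(allowed(true, "deny"))  // false: explicit deny
    fmt.Println(allowed(true, "allow")) // true: explicit allow
}
```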

Files

File Change
internal/engine/sshca/sshca.go Default-deny in handleSignHost, handleSignUser
internal/engine/transit/transit.go Default-deny in requireUserWithPolicy
internal/engine/sshca/sshca_test.go Add test: no policy rules → user request denied
internal/engine/transit/transit_test.go Add test: no policy rules → user request denied

Work Item 6: User Engine Concurrency and Safety (#63, #64, #65, #66, #67)

Five findings in the user engine share a root cause: the engine was implemented after the others and does not follow the same concurrency patterns.

#63 — Missing role checks on decrypt, re-encrypt, rotate-key

Fix: Add IsUser() check to handleDecrypt, handleReEncrypt, and handleRotateKey, matching the pattern used in handleEncrypt:

if !req.CallerInfo.IsUser() {
    return nil, ErrForbidden
}

#64 — Initialize holds no mutex

Fix: Acquire the write lock at the top of Initialize:

func (e *UserEngine) Initialize(ctx context.Context, ...) error {
    e.mu.Lock()
    defer e.mu.Unlock()
    // ... existing body ...
}

This matches the pattern in the SSH CA and transit engines.

#65 — handleEncrypt releases lock before using state

Fix: Hold RLock (not write lock) through the ECDH and encryption operations. After auto-provisioning (which needs the write lock), downgrade to a read lock for the crypto work:

e.mu.Lock()
// ... auto-provisioning ...
// Capture references while holding write lock.
senderState := e.users[sender]
recipientStates := ...
e.mu.Unlock()

// Re-acquire RLock for crypto operations.
e.mu.RLock()
defer e.mu.RUnlock()
// verify states are still valid (not nil)
if senderState.privKey == nil {
    return nil, ErrSealed
}
// ... ECDH, wrap DEK, encrypt ...

Alternatively, hold the write lock through the entire operation (simpler but serializes all encrypts). Given that encryption is the hot path, the downgrade approach is preferred.

#66 — handleReEncrypt manual lock without defer

Fix: Restructure to use defer for all lock releases. Split the method into a provisioning phase (write lock) and a crypto phase (read lock), each with its own defer:

func (e *UserEngine) handleReEncrypt(...) {
    // Phase 1: read envelope, resolve users.
    e.mu.RLock()
    // ... read state ...
    e.mu.RUnlock()

    // Phase 2: crypto operations.
    e.mu.RLock()
    defer e.mu.RUnlock()
    // ... decrypt, re-wrap, encrypt ...
}

#67 — No sealed-state check in HandleRequest

Fix: Add a sealed-state guard at the top of HandleRequest, before the operation dispatch:

func (e *UserEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    e.mu.RLock()
    sealed := e.config == nil
    e.mu.RUnlock()
    if sealed {
        return nil, ErrSealed
    }

    switch req.Operation {
    // ...
    }
}

Files

File Change
internal/engine/user/user.go All five fixes above
internal/engine/user/user_test.go Add tests: guest role denied; concurrent encrypt/seal; sealed-state rejection

Verification

go test -race ./internal/engine/user/

Work Item 7: Barrier and Seal Fixes (#41, #42, #43)

#41 — loadKeys errors silently swallowed during unseal

Risk: If loadKeys fails to decrypt DEK entries (corrupt data, wrong MEK), errors are swallowed and the keys map is incomplete. Subsequent operations fail with confusing ErrKeyNotFound errors.

Fix: Return the error from loadKeys instead of ignoring it. Distinguish between "table doesn't exist" (acceptable on first run before migration) and genuine decryption failures:

func (b *AESGCMBarrier) Unseal(mek []byte) error {
    // ...
    if err := b.loadKeys(); err != nil {
        // Check if this is a "no such table" error (pre-migration).
        if !isTableMissing(err) {
            b.mek = nil
            return fmt.Errorf("barrier: load keys: %w", err)
        }
        // Pre-migration: no barrier_keys table yet, proceed with empty map.
    }
    return nil
}

func isTableMissing(err error) bool {
    return strings.Contains(err.Error(), "no such table")
}

#42 — No AAD binding on MEK encryption with KWK

Risk: The MEK ciphertext in seal_config has no AAD, so there is no cryptographic binding to its storage purpose.

Fix: Pass a constant AAD string when encrypting/decrypting the MEK:

var mekAAD = []byte("metacrypt/seal_config/mek")

// In Init:
encryptedMEK, err := crypto.Encrypt(kwk, mek, mekAAD)

// In Unseal:
mek, err := crypto.Decrypt(kwk, encryptedMEK, mekAAD)

// In RotateMEK:
newEncMEK, err := crypto.Encrypt(kwk, newMEK, mekAAD)

Migration note: Existing databases have MEKs encrypted with nil AAD. The unseal path must try mekAAD first, then fall back to nil for backward compatibility. After successful unseal with nil, re-encrypt with mekAAD and update seal_config.

#43 — Barrier List SQL LIKE with unescaped prefix

Risk: If a barrier path prefix contains % or _, the LIKE query matches unintended entries.

Fix: Escape SQL LIKE wildcards in the prefix:

func escapeLIKE(s string) string {
    s = strings.ReplaceAll(s, `\`, `\\`)
    s = strings.ReplaceAll(s, `%`, `\%`)
    s = strings.ReplaceAll(s, `_`, `\_`)
    return s
}

// In List:
rows, err := b.db.QueryContext(ctx,
    "SELECT path FROM barrier_entries WHERE path LIKE ? ESCAPE '\\'",
    escapeLIKE(prefix)+"%")
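
A worked example of the escaping, showing that the trailing % appended by List remains the only live wildcard:

```go
package main

import (
    "fmt"
    "strings"
)

func escapeLIKE(s string) string {
    s = strings.ReplaceAll(s, `\`, `\\`)
    s = strings.ReplaceAll(s, `%`, `\%`)
    s = strings.ReplaceAll(s, `_`, `\_`)
    return s
}

func main() {
    // A prefix containing both LIKE wildcards.
    fmt.Println(escapeLIKE("keys/a_b%") + "%")
    // prints keys/a\_b\%% — only the trailing % is unescaped
}
```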

Files

File Change
internal/barrier/barrier.go Fix loadKeys error handling; add escapeLIKE to List
internal/seal/seal.go Add mekAAD constant; use in Init, Unseal, RotateMEK with fallback
internal/barrier/barrier_test.go Add test: List with % and _ in paths
internal/seal/seal_test.go Add test: unseal with AAD-encrypted MEK

Work Item 8: CA Engine Fixes (#50, #51, #70)

#50 — Non-admin users can override key usages

Risk: A non-admin user can request cert sign or crl sign key usage on a leaf certificate, creating a de facto CA certificate.

Fix: Only allow admin users to override key_usages and ext_key_usages. Non-admin users get the profile defaults:

// In handleIssue, after profile lookup:
if req.CallerInfo.IsAdmin {
    if v, ok := req.Data["key_usages"].([]interface{}); ok {
        profile.KeyUse = toStringSlice(v)
    }
    if v, ok := req.Data["ext_key_usages"].([]interface{}); ok {
        profile.ExtKeyUsages = toStringSlice(v)
    }
}

#51 — Certificate renewal does not revoke original

Fix: After issuing the renewed certificate, revoke the original by serial. The original serial is available from the request data:

// In handleRenew, after successful issuance:
oldSerial := req.Data["serial"].(string)
if err := e.revokeCertBySerial(ctx, issuerName, oldSerial, req.CallerInfo.Username); err != nil {
    // Log but do not fail — the new cert is already issued.
    e.logger.Warn("failed to revoke original during renewal",
        "serial", oldSerial, "error", err)
}

#70 — RenewCert has no REST route (API sync violation)

Fix: Add a dedicated REST route for certificate renewal:

r.Post("/v1/ca/{mount}/cert/{serial}/renew", s.requireAuth(s.handleRenewCert))

Implement handleRenewCert following the typed handler pattern (with CheckPolicy).

Files

File Change
internal/engine/ca/ca.go Guard key usage overrides with IsAdmin; revoke original in handleRenew
internal/server/routes.go Add POST /v1/ca/{mount}/cert/{serial}/renew route and handler
internal/engine/ca/ca_test.go Add test: non-admin cannot set cert sign usage; renewal revokes original

Work Item 9: SSH CA Lock Granularity (#53)

Risk: HandleRequest acquires a write lock for all operations including reads (get-ca-pubkey, get-cert, list-certs, get-profile, list-profiles, get-krl). This serializes all operations unnecessarily.

Fix: Move locking into individual handlers. Read-only operations use RLock; mutating operations use Lock:

func (e *SSHCAEngine) HandleRequest(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    // No lock here — each handler manages its own.
    switch req.Operation {
    case "get-ca-pubkey":
        return e.handleGetCAPubkey(ctx, req) // uses RLock internally
    case "sign-host":
        return e.handleSignHost(ctx, req)     // uses Lock internally
    // ...
    }
}

Read-only handlers:

func (e *SSHCAEngine) handleGetCAPubkey(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()
    // ...
}

Mutating handlers (sign-host, sign-user, create-profile, update-profile, delete-profile, revoke-cert, delete-cert):

func (e *SSHCAEngine) handleSignHost(ctx context.Context, req *engine.Request) (*engine.Response, error) {
    e.mu.Lock()
    defer e.mu.Unlock()
    // ...
}

Files

File Change
internal/engine/sshca/sshca.go Remove top-level lock from HandleRequest; add per-handler locks

Work Item 10: Transit Ciphertext Validation (#59)

Risk: parseVersionedData accepts negative version numbers. A crafted ciphertext metacrypt:v-1:... parses as version -1, which fails the key lookup but produces a confusing error.

Fix: Add a bounds check after parsing:

func parseVersionedData(s string) (int, []byte, error) {
    // ... existing parse logic ...
    version, err := strconv.Atoi(parts[1][1:])
    if err != nil {
        return 0, nil, ErrInvalidFormat
    }
    if version < 1 {
        return 0, nil, fmt.Errorf("transit: %w: version must be >= 1", ErrInvalidFormat)
    }
    // ...
}

Files

File Change
internal/engine/transit/transit.go Add version < 1 check in parseVersionedData
internal/engine/transit/transit_test.go Add test: metacrypt:v0:... and metacrypt:v-1:... rejected

Work Item 11: Database and Auth (#46, #74)

#46 — SQLite PRAGMAs only applied to first connection

Risk: database/sql may open additional connections that don't receive the journal_mode, foreign_keys, and busy_timeout PRAGMAs.

Fix: Use the _pragma DSN parameter supported by modernc.org/sqlite:

dsn := path + "?_pragma=journal_mode(WAL)&_pragma=foreign_keys(ON)&_pragma=busy_timeout(5000)"
db, err := sql.Open("sqlite", dsn)

Remove the manual PRAGMA execution loop. This ensures every connection opened by the pool receives the PRAGMAs.

#74 — Token validation cache grows without bound

Risk: The cache map in auth.go grows indefinitely. Expired entries are never proactively removed.

Fix: Add periodic eviction. Run a background goroutine that sweeps expired entries every minute:

func (a *Authenticator) startEviction(ctx context.Context) {
    ticker := time.NewTicker(time.Minute)
    go func() {
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                a.evictExpired()
            }
        }
    }()
}

func (a *Authenticator) evictExpired() {
    now := time.Now()
    a.mu.Lock()
    defer a.mu.Unlock()
    for key, entry := range a.cache {
        if now.After(entry.expiresAt) {
            delete(a.cache, key)
        }
    }
}

Call startEviction from the Authenticator constructor. Accept a context.Context to allow cancellation on shutdown.

Files

File Change
internal/db/db.go Switch to _pragma DSN parameters; remove manual PRAGMA loop
internal/auth/auth.go Add startEviction goroutine; call from constructor
internal/db/db_test.go Verify PRAGMAs are active on a fresh connection from the pool
internal/auth/auth_test.go Add test: expired entries are evicted after sweep

Work Item 12: Policy Engine Glob Matching (#73)

Risk: filepath.Match does not support ** for recursive directory matching. Administrators writing rules like engine/**/certs/* will find they don't match paths with multiple segments.

Fix: Replace filepath.Match with path.Match (POSIX-style, no OS-specific behavior) and add support for ** by splitting the pattern and value on / and matching segments:

import (
    "path"
    "strings"
)

func matchesAnyGlob(patterns []string, value string) bool {
    for _, p := range patterns {
        if matchGlob(p, value) {
            return true
        }
    }
    return false
}

func matchGlob(pattern, value string) bool {
    return matchSegments(strings.Split(pattern, "/"), strings.Split(value, "/"))
}

// matchSegments walks pattern and value segment by segment. "**" matches
// zero or more whole segments; every other pattern segment is matched
// with path.Match, so "*" stays within a single segment.
func matchSegments(pat, val []string) bool {
    for len(pat) > 0 {
        if pat[0] == "**" {
            // Try consuming zero or more value segments.
            for i := 0; i <= len(val); i++ {
                if matchSegments(pat[1:], val[i:]) {
                    return true
                }
            }
            return false
        }
        if len(val) == 0 {
            return false
        }
        ok, err := path.Match(pat[0], val[0])
        if err != nil || !ok {
            return false
        }
        pat, val = pat[1:], val[1:]
    }
    return len(val) == 0
}

Document in POLICY.md that ** matches zero or more path segments, while * matches within a single segment.

Files

| File | Change |
| --- | --- |
| internal/policy/policy.go | Replace filepath.Match with path.Match + ** support |
| internal/policy/policy_test.go | Add test: engine/**/certs/* matches engine/ca/prod/certs/abc123 |

Work Item 13: Deployment Fixes (#78, #79)

#78 — systemd ExecReload sends SIGHUP with no handler

Fix: Two options:

Option A (recommended): Remove ExecReload entirely. Metacrypt does not support graceful reload — configuration changes require a restart. The ExecReload line creates a false expectation.

Option B: Implement SIGHUP handling that re-reads the TOML config and applies non-breaking changes (log level, TLS cert reload). This is significant new functionality and should be a separate feature.

For now, remove ExecReload from both service units:

-ExecReload=/bin/kill -HUP $MAINPID

#79 — Dockerfiles use Go 1.23 but module requires Go 1.25

Fix: Update the builder base image:

-FROM golang:1.23-alpine AS builder
+FROM golang:1.25-alpine AS builder

Apply to both Dockerfile.api and Dockerfile.web. Also remove the unnecessary gcc and musl-dev installation since CGO_ENABLED=0.

-RUN apk add --no-cache gcc musl-dev

#7 — No audit logging

Note: This finding requires significant new functionality rather than a targeted fix (as does #21 below). Audit logging should be implemented as a dedicated subsystem:

  1. Define an AuditEvent struct with: timestamp, caller, operation, resource, outcome (success/denied/error), and metadata.
  2. Write events to a structured log sink (slog with JSON output to a dedicated file at /srv/metacrypt/audit.log).
  3. Instrument every engine HandleRequest to emit an event on completion.
  4. Instrument policy CreateRule/DeleteRule.
  5. Instrument seal/unseal operations.

This is a substantial feature. Track separately from the quick-fix items in this plan.

#10 — No extension allowlist for SSH host certificates

Note: Host certificates typically don't use extensions (extensions are for user certificates). The fix is to ignore the extensions field on host signing requests rather than passing it through:

// In handleSignHost, after building the certificate:
cert.Permissions.Extensions = nil // Host certs should not carry extensions

#21 — No rate limiting on transit cryptographic operations

Note: This requires a rate-limiting middleware or per-caller token bucket. Best implemented as server-level middleware applied to engine request handlers:

func (s *Server) requireRateLimit(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        info := TokenInfoFromContext(r.Context())
        if !s.rateLimiter.Allow(info.Username) {
            writeJSONError(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next(w, r)
    }
}

Track separately — this affects API design decisions (limits, quotas, per-user vs per-token).

Files

| File | Change |
| --- | --- |
| deploy/systemd/metacrypt.service | Remove ExecReload line |
| deploy/systemd/metacrypt-web.service | Remove ExecReload line |
| Dockerfile.api | Update to golang:1.25-alpine; remove gcc musl-dev |
| Dockerfile.web | Update to golang:1.25-alpine; remove gcc musl-dev |

Implementation Order

 5. #54, #58  Default-deny in engines     (security-critical, do first)
 6. #63–#67   User engine concurrency     (5 coupled fixes, one change)
 7. #41–#43   Barrier/seal fixes          (3 independent fixes)
 8. #50, #51, #70  CA engine fixes        (key usage + renewal + API sync)
 9. #53       SSH CA lock granularity      (standalone refactor)
10. #59       Transit version validation   (one-line fix)
11. #46, #74  DB pragmas + auth cache      (independent, no overlap)
12. #73       Policy glob matching         (standalone)
13. #78, #79  Deployment fixes             (non-code, standalone)
    #7        Audit logging                (new feature, track separately)
    #10       SSH host extensions          (one-line fix, ship with work item 9)
    #21       Transit rate limiting        (new feature, track separately)

Work items 5–10 can be parallelized across engineers:

  • Engineer A: work items 5 (default-deny) and 9 (SSH CA locks), plus finding #10 (host extensions)
  • Engineer B: work items 6 (user concurrency) and 8 (CA fixes)
  • Engineer C: work items 7 (barrier/seal), 10 (transit version), and 11 (DB/auth)
  • Independent: work items 12 (policy) and 13 (deployment)

Findings #7 (audit logging) and #21 (rate limiting) are substantial new features and should be tracked as separate engineering tasks, not quick fixes.


Post-Remediation (Medium)

After all medium findings are resolved:

  1. Update AUDIT.md — mark each finding as RESOLVED with summary.
  2. Run the full pipeline: make all (vet, lint, test, build).
  3. Run race detector: go test -race ./... — especially important for work items 6 and 9 (concurrency changes).
  4. Address remaining low findings — 14 low-severity items remain, mostly zeroization gaps and documentation drift.