metacrypt/REMEDIATION.md
Kyle Isom a80323e320 Add web UI for SSH CA, Transit, and User engines; full security audit and remediation
Web UI: Added browser-based management for all three remaining engines
(SSH CA, Transit, User E2E). Includes gRPC client wiring, handler files,
7 HTML templates, dashboard mount forms, and conditional navigation links.
Fixed REST API routes to match design specs (SSH CA cert singular paths,
Transit PATCH for update-key-config).

Security audit: Conducted full-system audit covering crypto core, all
engine implementations, API servers, policy engine, auth, deployment,
and documentation. Identified 42 new findings (#39-#80) across all
severity levels.

Remediation of all 8 High findings:
- #68: Replaced 14 JSON-injection-vulnerable error responses with safe
  json.Encoder via writeJSONError helper
- #48: Added two-layer path traversal defense (barrier validatePath
  rejects ".." segments; engine ValidateName enforces safe name pattern)
- #39: Extended RLock through entire crypto operations in barrier
  Get/Put/Delete/List to eliminate TOCTOU race with Seal
- #40: Unified ReWrapKeys and seal_config UPDATE into single SQLite
  transaction to prevent irrecoverable data loss on crash during MEK
  rotation
- #49: Added resolveTTL to CA engine enforcing issuer MaxTTL ceiling
  on handleIssue and handleSignCSR
- #61: Store raw ECDH private key bytes in userState for effective
  zeroization on Seal
- #62: Fixed user engine policy resource path from mountPath to
  mountName() so policy rules match correctly
- #69: Added newPolicyChecker helper and passed service-level policy
  evaluation to all 25 typed REST handler engine.Request structs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 22:02:06 -07:00


Remediation Plan — High-Priority Audit Findings

Date: 2026-03-17
Scope: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69

This plan addresses all eight High-severity findings from the 2026-03-17 full system audit. Findings are grouped into four work items by shared root cause or affected subsystem. The order reflects dependency chains: #68 is a standalone fix that should ship first; #48 is a prerequisite for safe operation across all engines; #39/#40 affect the storage core; the remaining four affect specific engines.


Work Item 1: JSON Injection in REST Error Responses (#68)

Risk: An error message containing " or \ breaks the JSON response structure. If the error contains attacker-controlled input (e.g., a mount name or key name that triggers a downstream error), this enables JSON injection in API responses.

Root cause: 13 locations in internal/server/routes.go construct JSON error responses via string concatenation:

http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)

The writeEngineError helper (line 1704) is the most common entry point; most typed handlers call it.

Fix

  1. Replace writeEngineError with a safe JSON encoder:

    func writeJSONError(w http.ResponseWriter, msg string, code int) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(code)
        _ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
    }
    
  2. Replace all 13 call sites that use string concatenation with writeJSONError(w, grpcMessage(err), status) or writeJSONError(w, err.Error(), status).

    The grpcMessage helper already exists in the webserver package and extracts human-readable messages from gRPC errors. Add an equivalent to the REST server, and prefer it over raw err.Error() to avoid leaking internal error details.

  3. Grep for the pattern "error":" in routes.go to confirm no remaining string-concatenated JSON.

Files

| File | Change |
| ---- | ------ |
| internal/server/routes.go | Replace writeEngineError and all 13 inline error sites |

Verification

  • go vet ./internal/server/
  • go test ./internal/server/
  • Manual test: mount an engine with a name containing ", trigger an error, verify the response is valid JSON.

Work Item 2: Path Traversal via Unsanitized Names (#48)

Risk: User-controlled strings (issuer names, key names, profile names, usernames, mount names) are concatenated directly into barrier storage paths. An input containing ../ traverses the barrier namespace, allowing reads and writes to arbitrary paths. This affects all four engines and the engine registry.

Root cause: No validation exists at any layer — neither the barrier's Put/Get/Delete methods nor the engines sanitize path components.

Vulnerable locations

| File | Input | Path Pattern |
| ---- | ----- | ------------ |
| ca/ca.go | issuer name | mountPath + "issuers/" + name + "/" |
| sshca/sshca.go | profile name | mountPath + "profiles/" + name + ".json" |
| transit/transit.go | key name | mountPath + "keys/" + name + "/" |
| user/user.go | username | mountPath + "users/" + username + "/" |
| engine/engine.go | mount name | engine/{type}/{name}/ |
| policy/policy.go | rule ID | policy/rules/{id} |

Fix

Enforce validation at two layers (defense in depth):

  1. Barrier layer — reject paths containing .. segments.

    Add a validatePath check at the top of Get, Put, Delete, and List in barrier.go:

    var ErrInvalidPath = errors.New("barrier: invalid path")
    
    func validatePath(p string) error {
        for _, seg := range strings.Split(p, "/") {
            if seg == ".." {
                return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
            }
        }
        return nil
    }
    

    Call validatePath at the entry of Get, Put, Delete, List. Return ErrInvalidPath on failure.

  2. Engine/registry layer — validate entity names at input boundaries.

    Add a ValidateName helper to internal/engine/:

    var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
    
    func ValidateName(name string) error {
        if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
            return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
                "dot, hyphen, or underscore characters", name)
        }
        return nil
    }
    

    Call ValidateName in:

    | Location | Input validated |
    | -------- | --------------- |
    | engine.go Mount() | mount name |
    | ca.go handleCreateIssuer | issuer name |
    | sshca.go handleCreateProfile | profile name |
    | transit.go handleCreateKey | key name |
    | user.go handleRegister, handleProvision | username |
    | user.go handleEncrypt | recipient usernames |
    | policy.go CreateRule | rule ID |

    Note: certificate serials are generated server-side from crypto/rand and hex-encoded, so they are safe. Validate anyway for defense in depth.
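Both layers can be exercised together with a quick table-driven sketch. These are standalone copies of the two helpers above, shown only to illustrate the boundary between them (note that ValidateName's leading-alphanumeric requirement also rules out "." and ".." as names):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

var ErrInvalidPath = errors.New("barrier: invalid path")

// Barrier layer: any ".." segment is rejected before storage is touched.
func validatePath(p string) error {
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
		}
	}
	return nil
}

var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

// Engine layer: names must start with an alphanumeric, so "." and ".."
// can never be valid entity names either.
func ValidateName(name string) error {
	if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
		return fmt.Errorf("invalid name %q", name)
	}
	return nil
}

func main() {
	for _, bad := range []string{"..", "../etc", "engine/user/../../seal"} {
		if validatePath(bad) == nil {
			panic("validatePath accepted: " + bad)
		}
	}
	if err := validatePath("engine/user/alice/priv.key"); err != nil {
		panic(err)
	}
	for _, bad := range []string{"", "../evil", ".hidden", "a/b", string(make([]byte, 129))} {
		if ValidateName(bad) == nil {
			panic("ValidateName accepted bad name")
		}
	}
	for _, ok := range []string{"alice", "web-ca.v2", "issuer_01"} {
		if err := ValidateName(ok); err != nil {
			panic(err)
		}
	}
	fmt.Println("ok")
}
```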

Files

| File | Change |
| ---- | ------ |
| internal/barrier/barrier.go | Add validatePath, call from Get/Put/Delete/List |
| internal/engine/engine.go | Add ValidateName, call from Mount |
| internal/engine/ca/ca.go | Call ValidateName on issuer name |
| internal/engine/sshca/sshca.go | Call ValidateName on profile name |
| internal/engine/transit/transit.go | Call ValidateName on key name |
| internal/engine/user/user.go | Call ValidateName on usernames |
| internal/policy/policy.go | Call ValidateName on rule ID |

Verification

  • Add TestValidatePath to barrier_test.go: confirm ../ and .. are rejected; confirm normal paths pass.
  • Add TestValidateName to engine_test.go: confirm ../evil, empty string, and overlong names are rejected; confirm valid names pass.
  • go test ./internal/barrier/ ./internal/engine/... ./internal/policy/

Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)

These two findings share the barrier/seal subsystem and should be addressed together.

#39 — TOCTOU Race in Barrier Get/Put

Risk: Get and Put copy the mek slice header and keys map reference under RLock, release the lock, then use the copied references for encryption/decryption. A concurrent Seal() zeroizes the underlying byte slices in place before nil-ing the fields, so a concurrent reader uses zeroized key material.

Root cause: The lock does not cover the crypto operation. The "copy" is a shallow reference copy (slice header), not a deep byte copy. Seal() zeroizes the backing array, which is shared.

Current locking pattern (barrier.go):

Get:  RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put:  RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock  → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock

Fix: Hold RLock through the entire crypto operation:

func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
    if err := validatePath(path); err != nil {
        return nil, err
    }
    b.mu.RLock()
    defer b.mu.RUnlock()
    if b.mek == nil {
        return nil, ErrSealed
    }
    // query DB, resolve key, decrypt — all under RLock
    // ...
}

This is the minimal, safest change. RLock permits concurrent readers, so there is no throughput regression for parallel Get/Put operations. The only serialization point is Seal(), which acquires the exclusive Lock and waits for all readers to drain — exactly the semantics we want.

Apply the same pattern to Put, Delete, and List.

Alternative considered: Atomic pointer swap (atomic.Pointer[keyState]). This eliminates the lock from the hot path entirely, but introduces complexity around deferred zeroization of the old state (readers may still hold references). The RLock-through-crypto approach is simpler and sufficient for Metacrypt's concurrency profile.

#40 — Crash During ReWrapKeys Loses All Data

Risk: RotateMEK calls barrier.ReWrapKeys(newMEK) which commits a transaction re-wrapping all DEKs, then separately updates seal_config with the new encrypted MEK. A crash between these two database operations leaves DEKs wrapped with a MEK that is not persisted — all data is irrecoverable.

Current flow (seal.go lines 245313):

1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK)   ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil)  ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ?  ← separate statement, not in transaction
   *** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK

Fix: Unify steps 2–4 into a single database transaction.

Refactor ReWrapKeys to accept an optional *sql.Tx:

// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
    // Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
    rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
    // ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}

// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
    b.mu.Lock()
    defer b.mu.Unlock()
    mcrypto.Zeroize(b.mek)
    b.mek = newMEK
}

Then in RotateMEK:

func (m *Manager) RotateMEK(ctx context.Context, password string) error {
    // ... derive KWK, generate newMEK ...

    tx, err := m.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Re-wrap all DEKs within the transaction.
    if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
        return err
    }

    // Update seal_config within the same transaction.
    encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
    if err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx,
        "UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
        encNewMEK,
    ); err != nil {
        return err
    }

    if err := tx.Commit(); err != nil {
        return err
    }

    // Only after commit: update in-memory state.
    m.barrier.SwapMEK(newMEK)
    return nil
}

SQLite in WAL mode handles this correctly — the transaction is atomic regardless of process crash. The barrier_keys and seal_config updates either both commit or neither does.

Files

| File | Change |
| ---- | ------ |
| internal/barrier/barrier.go | Extend RLock scope in Get/Put/Delete/List; add ReWrapKeysTx, SwapMEK |
| internal/seal/seal.go | Wrap ReWrapKeysTx + seal_config UPDATE in single transaction |
| internal/barrier/barrier_test.go | Add concurrent Get/Seal stress test |

Verification

  • go test -race ./internal/barrier/ ./internal/seal/
  • Add TestConcurrentGetSeal: spawn goroutines doing Get while another goroutine calls Seal. Run with -race. Verify no panics or data races.
  • Add TestRotateMEKAtomic: verify that barrier_keys and seal_config are updated in the same transaction (mock the DB to detect transaction boundaries, or verify via rollback behavior).

Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)

These four findings touch separate files with no overlap and can be addressed in parallel.

#49 — No TTL Ceiling in CA Certificate Issuance

Risk: A non-admin user can request an arbitrarily long certificate lifetime. The issuer's MaxTTL exists in config but is not enforced during handleIssue or handleSignCSR.

Root cause: The CA engine applies the user's requested TTL directly to the certificate without comparing it against issuerConfig.MaxTTL. The SSH CA engine correctly enforces this via resolveTTL — the CA engine does not.

Fix: Add a resolveTTL method to the CA engine, following the SSH CA engine's pattern (sshca.go lines 902–932):

func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
    maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
    if err != nil {
        maxTTL = 2160 * time.Hour // 90 days fallback
    }

    if requested != "" {
        ttl, err := time.ParseDuration(requested)
        if err != nil {
            return 0, fmt.Errorf("invalid TTL: %w", err)
        }
        if ttl > maxTTL {
            return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
        }
        return ttl, nil
    }

    return maxTTL, nil
}

Call this in handleIssue and handleSignCSR before constructing the certificate. Replace the raw TTL string with the validated duration.

| File | Change |
| ---- | ------ |
| internal/engine/ca/ca.go | Add resolveTTL, call from handleIssue and handleSignCSR |
| internal/engine/ca/ca_test.go | Add test: issue cert with TTL > MaxTTL, verify rejection |

#61 — Ineffective ECDH Key Zeroization

Risk: privKey.Bytes() returns a copy of the private key bytes. Zeroizing the copy leaves the original inside *ecdh.PrivateKey. Go's crypto/ecdh API does not expose the internal byte slice.

Root cause: Language/API limitation in Go's crypto/ecdh package.

Fix: Store the raw private key bytes alongside the parsed key in userState, and zeroize those bytes on seal:

type userState struct {
    privKey    *ecdh.PrivateKey
    privBytes  []byte            // raw key bytes, retained for zeroization
    pubKey     *ecdh.PublicKey
    config     *UserKeyConfig
}

On load from barrier (Unseal, auto-provision):

raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw  // retain for zeroization
state.privKey = priv

On Seal:

mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil

Document the limitation: the parsed *ecdh.PrivateKey struct's internal copy cannot be zeroized from Go code. Setting privKey = nil makes it eligible for GC, but does not guarantee immediate byte overwrite. This is an accepted Go runtime limitation.

| File | Change |
| ---- | ------ |
| internal/engine/user/user.go | Add privBytes to userState, populate on load, zeroize on Seal |
| internal/engine/user/types.go | Update userState struct |

#62 — User Engine Policy Path Uses mountPath Instead of Mount Name

Risk: Policy checks construct the resource path using e.mountPath (which is engine/user/{name}/) instead of just the mount name. Policy rules match against user/{name}/recipient/{username}, so the full mount path creates a mismatch like user/engine/user/myengine//recipient/alice. No policy rule will ever match.

Root cause: Line 358 of user.go uses e.mountPath directly. The SSH CA and transit engines correctly use a mountName() helper.

Fix: Add a mountName() method to the user engine:

func (e *UserEngine) mountName() string {
    // mountPath is "engine/user/{name}/"
    parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return e.mountPath
}

Change line 358:

resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)

Audit all other resource path constructions in the user engine to confirm they also use the correct mount name.

| File | Change |
| ---- | ------ |
| internal/engine/user/user.go | Add mountName(), fix resource path on line 358 |
| internal/engine/user/user_test.go | Add test: verify policy resource path format |
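The before/after resource paths can be checked with a small sketch. mountName is reproduced here as a free function over the stored mount path (the real version is a method on the engine):

```go
package main

import (
	"fmt"
	"strings"
)

// mountName mirrors the helper above.
func mountName(mountPath string) string {
	// mountPath is "engine/user/{name}/"
	parts := strings.Split(strings.TrimSuffix(mountPath, "/"), "/")
	if len(parts) >= 3 {
		return parts[2]
	}
	return mountPath
}

func main() {
	if got := mountName("engine/user/myengine/"); got != "myengine" {
		panic("got " + got)
	}

	// The old bug: interpolating the full mount path yields a resource
	// that no policy rule can ever match.
	bad := fmt.Sprintf("user/%s/recipient/%s", "engine/user/myengine/", "alice")
	good := fmt.Sprintf("user/%s/recipient/%s", mountName("engine/user/myengine/"), "alice")
	fmt.Println(bad)  // user/engine/user/myengine//recipient/alice
	fmt.Println(good) // user/myengine/recipient/alice
}
```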

#69 — Typed REST Handlers Bypass Policy Engine

Risk: 18 typed REST handlers pass nil for CheckPolicy in the engine.Request, skipping service-level policy evaluation. The generic /v1/engine/request endpoint correctly passes a policyChecker. Because the engines default to allow when no policy matches (findings #54 and #58), the typed routes are effectively unprotected by policy.

Root cause: Typed handlers were modeled after admin-only operations (which don't need policy) but applied to user-accessible operations.

Fix: Extract the policy checker construction from handleEngineRequest into a shared helper:

func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
    return func(resource, action string) (string, bool) {
        effect, matched, err := s.policy.Check(
            info.Username, info.Roles, resource, action,
        )
        if err != nil || !matched {
            return "deny", false
        }
        return effect, matched
    }
}

Then in each typed handler, set CheckPolicy on the request:

req := &engine.Request{
    Operation:   "get-cert",
    Data:        data,
    CallerInfo:  callerInfo,
    CheckPolicy: s.newPolicyChecker(callerInfo),
}

18 handlers to update:

| Handler | Operation |
| ------- | --------- |
| handleGetCert | get-cert |
| handleRevokeCert | revoke-cert |
| handleDeleteCert | delete-cert |
| handleSSHCASignHost | sign-host |
| handleSSHCASignUser | sign-user |
| handleSSHCAGetProfile | get-profile |
| handleSSHCAListProfiles | list-profiles |
| handleSSHCADeleteProfile | delete-profile |
| handleSSHCAGetCert | get-cert |
| handleSSHCAListCerts | list-certs |
| handleSSHCARevokeCert | revoke-cert |
| handleSSHCADeleteCert | delete-cert |
| handleUserRegister | register |
| handleUserProvision | provision |
| handleUserListUsers | list-users |
| handleUserGetPublicKey | get-public-key |
| handleUserDeleteUser | delete-user |
| handleUserDecrypt | decrypt |
Note: handleUserEncrypt already passes a policy checker — verify it uses the same shared helper after refactoring. Admin-only handlers (behind requireAdmin wrapper) do not need a policy checker since admin bypasses policy.

| File | Change |
| ---- | ------ |
| internal/server/routes.go | Add newPolicyChecker, pass to all 18 typed handlers |
| internal/server/server_test.go | Add test: policy-denied user is rejected by typed route |
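The fail-closed behavior of the shared helper (error or non-match both become an explicit deny) can be sketched with a stub policy store. The stubPolicy type and its toy ruleset are inventions for illustration; only the closure's shape mirrors the helper above:

```go
package main

import (
	"errors"
	"fmt"
)

// PolicyChecker matches the engine-side signature used above.
type PolicyChecker func(resource, action string) (string, bool)

// stubPolicy stands in for the real policy store. In this toy ruleset
// only "alice" may perform get-public-key.
type stubPolicy struct{}

func (stubPolicy) Check(user string, roles []string, resource, action string) (string, bool, error) {
	if user == "alice" && action == "get-public-key" {
		return "allow", true, nil
	}
	if user == "error" {
		return "", false, errors.New("policy store unavailable")
	}
	return "", false, nil // no rule matched
}

// newPolicyChecker mirrors the shared helper: errors and non-matches
// both surface as an explicit deny.
func newPolicyChecker(p stubPolicy, user string, roles []string) PolicyChecker {
	return func(resource, action string) (string, bool) {
		effect, matched, err := p.Check(user, roles, resource, action)
		if err != nil || !matched {
			return "deny", false
		}
		return effect, matched
	}
}

func main() {
	var p stubPolicy

	check := newPolicyChecker(p, "alice", nil)
	if effect, ok := check("user/myengine/pub/alice", "get-public-key"); !ok || effect != "allow" {
		panic("matching rule should allow")
	}

	check = newPolicyChecker(p, "bob", nil)
	if effect, _ := check("user/myengine/pub/alice", "get-public-key"); effect != "deny" {
		panic("non-match should deny")
	}

	check = newPolicyChecker(p, "error", nil)
	if effect, _ := check("x", "y"); effect != "deny" {
		panic("policy error should deny")
	}
	fmt.Println("ok")
}
```

Note that this deny-on-no-match behavior lives in the checker; whether the engine then honors or overrides it is the subject of findings #54 and #58.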

Verification (Work Item 4, all findings)

  • go test ./internal/engine/ca/
  • go test ./internal/engine/user/
  • go test ./internal/server/
  • go vet ./...

Implementation Order

1. #68  JSON injection             (standalone, ship immediately)
2. #48  Path traversal             (standalone, blocks safe engine operation)
3. #39  Barrier TOCTOU race    ─┐
   #40  ReWrapKeys crash safety ┘  (coupled, requires careful testing)
4. #49  CA TTL enforcement     ─┐
   #61  ECDH zeroization       │
   #62  User policy path       │  (independent fixes, parallelizable)
   #69  Policy bypass          ─┘

Items 1 and 2 have no dependencies and can be done in parallel by different engineers.

Items 3 and 4 can also be done in parallel since they touch different subsystems (barrier/seal vs engines/server).


Post-Remediation

After all eight findings are resolved:

  1. Update AUDIT.md — mark #39, #40, #48, #49, #61, #62, #68, #69 as RESOLVED with resolution summaries.
  2. Run the full pipeline: make all (vet, lint, test, build).
  3. Run race detector: go test -race ./...
  4. Address related medium findings that interact with these fixes:
    • #54 (SSH CA default-allow) and #58 (transit default-allow) — once #69 is fixed, the typed handlers will pass policy checkers to the engines, but the engines still default-allow when CheckPolicy returns no match. Consider changing the engine-level default to deny for non-admin callers.
    • #72 (policy ID path traversal) — already covered by #48's ValidateName fix on CreateRule.