metacrypt/REMEDIATION.md
Kyle Isom a80323e320 Add web UI for SSH CA, Transit, and User engines; full security audit and remediation
Web UI: Added browser-based management for all three remaining engines
(SSH CA, Transit, User E2E). Includes gRPC client wiring, handler files,
7 HTML templates, dashboard mount forms, and conditional navigation links.
Fixed REST API routes to match design specs (SSH CA cert singular paths,
Transit PATCH for update-key-config).

Security audit: Conducted full-system audit covering crypto core, all
engine implementations, API servers, policy engine, auth, deployment,
and documentation. Identified 42 new findings (#39-#80) across all
severity levels.

Remediation of all 8 High findings:
- #68: Replaced 14 JSON-injection-vulnerable error responses with safe
  json.Encoder via writeJSONError helper
- #48: Added two-layer path traversal defense (barrier validatePath
  rejects ".." segments; engine ValidateName enforces safe name pattern)
- #39: Extended RLock through entire crypto operations in barrier
  Get/Put/Delete/List to eliminate TOCTOU race with Seal
- #40: Unified ReWrapKeys and seal_config UPDATE into single SQLite
  transaction to prevent irrecoverable data loss on crash during MEK
  rotation
- #49: Added resolveTTL to CA engine enforcing issuer MaxTTL ceiling
  on handleIssue and handleSignCSR
- #61: Store raw ECDH private key bytes in userState for effective
  zeroization on Seal
- #62: Fixed user engine policy resource path from mountPath to
  mountName() so policy rules match correctly
- #69: Added newPolicyChecker helper and passed service-level policy
  evaluation to all 25 typed REST handler engine.Request structs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 22:02:06 -07:00


Remediation Plan — High-Priority Audit Findings

Date: 2026-03-17
Scope: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69

This plan addresses all eight High-severity findings from the 2026-03-17 full system audit. Findings are grouped into four work items by shared root cause or affected subsystem. The order reflects dependency chains: #68 is a standalone fix that should ship first; #48 is a prerequisite for safe operation across all engines; #39/#40 affect the storage core; the remaining four affect specific engines.


Work Item 1: JSON Injection in REST Error Responses (#68)

Risk: An error message containing " or \ breaks the JSON response structure. If the error contains attacker-controlled input (e.g., a mount name or key name that triggers a downstream error), this enables JSON injection in API responses.

Root cause: 13 locations in internal/server/routes.go construct JSON error responses via string concatenation:

http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)

The writeEngineError helper (line 1704) is the most common entry point; most typed handlers call it.

Fix

  1. Replace writeEngineError with a safe JSON encoder:

    func writeJSONError(w http.ResponseWriter, msg string, code int) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(code)
        _ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
    }
    
  2. Replace all 13 call sites that use string concatenation with writeJSONError(w, grpcMessage(err), status) or writeJSONError(w, err.Error(), status).

    The grpcMessage helper already exists in the webserver package and extracts human-readable messages from gRPC errors. Add an equivalent to the REST server, and prefer it over raw err.Error() to avoid leaking internal error details.

  3. Grep for the pattern "error":" in routes.go to confirm no remaining string-concatenated JSON.

Files

| File | Change |
| ---- | ------ |
| internal/server/routes.go | Replace writeEngineError and all 13 inline error sites |

Verification

  • go vet ./internal/server/
  • go test ./internal/server/
  • Manual test: mount an engine with a name containing ", trigger an error, verify the response is valid JSON.

Work Item 2: Path Traversal via Unsanitized Names (#48)

Risk: User-controlled strings (issuer names, key names, profile names, usernames, mount names) are concatenated directly into barrier storage paths. An input containing ../ traverses the barrier namespace, allowing reads and writes to arbitrary paths. This affects all four engines and the engine registry.

Root cause: No validation exists at any layer — neither the barrier's Put/Get/Delete methods nor the engines sanitize path components.

Vulnerable locations

| File | Input | Path Pattern |
| ---- | ----- | ------------ |
| ca/ca.go | issuer name | mountPath + "issuers/" + name + "/" |
| sshca/sshca.go | profile name | mountPath + "profiles/" + name + ".json" |
| transit/transit.go | key name | mountPath + "keys/" + name + "/" |
| user/user.go | username | mountPath + "users/" + username + "/" |
| engine/engine.go | mount name | engine/{type}/{name}/ |
| policy/policy.go | rule ID | policy/rules/{id} |

Fix

Enforce validation at two layers (defense in depth):

  1. Barrier layer — reject paths containing .. segments.

    Add a validatePath check at the top of Get, Put, Delete, and List in barrier.go:

    var ErrInvalidPath = errors.New("barrier: invalid path")
    
    func validatePath(p string) error {
        for _, seg := range strings.Split(p, "/") {
            if seg == ".." {
                return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
            }
        }
        return nil
    }
    

    Call validatePath at the entry of Get, Put, Delete, List. Return ErrInvalidPath on failure.

  2. Engine/registry layer — validate entity names at input boundaries.

    Add a ValidateName helper to internal/engine/:

    var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
    
    func ValidateName(name string) error {
        if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
            return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
                "dot, hyphen, or underscore characters", name)
        }
        return nil
    }
    

    Call ValidateName in:

    | Location | Input validated |
    | -------- | --------------- |
    | engine.go Mount() | mount name |
    | ca.go handleCreateIssuer | issuer name |
    | sshca.go handleCreateProfile | profile name |
    | transit.go handleCreateKey | key name |
    | user.go handleRegister, handleProvision | username |
    | user.go handleEncrypt | recipient usernames |
    | policy.go CreateRule | rule ID |

    Note: certificate serials are generated server-side from crypto/rand and hex-encoded, so they are safe. Validate anyway for defense in depth.
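Both layers can be exercised together with a quick table-driven sketch. These are standalone copies of the two helpers above, shown only to illustrate the boundary between them (note that ValidateName's leading-alphanumeric requirement also rules out "." and ".." as names):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

var ErrInvalidPath = errors.New("barrier: invalid path")

// Barrier layer: any ".." segment is rejected before storage is touched.
func validatePath(p string) error {
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
		}
	}
	return nil
}

var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

// Engine layer: names must start with an alphanumeric, so "." and ".."
// can never be valid entity names either.
func ValidateName(name string) error {
	if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
		return fmt.Errorf("invalid name %q", name)
	}
	return nil
}

func main() {
	for _, bad := range []string{"..", "../etc", "engine/user/../../seal"} {
		if validatePath(bad) == nil {
			panic("validatePath accepted: " + bad)
		}
	}
	if err := validatePath("engine/user/alice/priv.key"); err != nil {
		panic(err)
	}
	for _, bad := range []string{"", "../evil", ".hidden", "a/b", string(make([]byte, 129))} {
		if ValidateName(bad) == nil {
			panic("ValidateName accepted bad name")
		}
	}
	for _, ok := range []string{"alice", "web-ca.v2", "issuer_01"} {
		if err := ValidateName(ok); err != nil {
			panic(err)
		}
	}
	fmt.Println("ok")
}
```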

Files

| File | Change |
| ---- | ------ |
| internal/barrier/barrier.go | Add validatePath, call from Get/Put/Delete/List |
| internal/engine/engine.go | Add ValidateName, call from Mount |
| internal/engine/ca/ca.go | Call ValidateName on issuer name |
| internal/engine/sshca/sshca.go | Call ValidateName on profile name |
| internal/engine/transit/transit.go | Call ValidateName on key name |
| internal/engine/user/user.go | Call ValidateName on usernames |
| internal/policy/policy.go | Call ValidateName on rule ID |

Verification

  • Add TestValidatePath to barrier_test.go: confirm ../ and .. are rejected; confirm normal paths pass.
  • Add TestValidateName to engine_test.go: confirm ../evil, empty string, and overlong names are rejected; confirm valid names pass.
  • go test ./internal/barrier/ ./internal/engine/... ./internal/policy/

Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)

These two findings share the barrier/seal subsystem and should be addressed together.

#39 — TOCTOU Race in Barrier Get/Put

Risk: Get and Put copy the mek slice header and keys map reference under RLock, release the lock, then use the copied references for encryption/decryption. A concurrent Seal() zeroizes the underlying byte slices in place before nil-ing the fields, so a concurrent reader uses zeroized key material.

Root cause: The lock does not cover the crypto operation. The "copy" is a shallow reference copy (slice header), not a deep byte copy. Seal() zeroizes the backing array, which is shared.

Current locking pattern (barrier.go):

Get:  RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put:  RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock  → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock

Fix: Hold RLock through the entire crypto operation:

func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
    if err := validatePath(path); err != nil {
        return nil, err
    }
    b.mu.RLock()
    defer b.mu.RUnlock()
    if b.mek == nil {
        return nil, ErrSealed
    }
    // query DB, resolve key, decrypt — all under RLock
    // ...
}

This is the minimal, safest change. RLock permits concurrent readers, so there is no throughput regression for parallel Get/Put operations. The only serialization point is Seal(), which acquires the exclusive Lock and waits for all readers to drain — exactly the semantics we want.

Apply the same pattern to Put, Delete, and List.

Alternative considered: Atomic pointer swap (atomic.Pointer[keyState]). This eliminates the lock from the hot path entirely, but introduces complexity around deferred zeroization of the old state (readers may still hold references). The RLock-through-crypto approach is simpler and sufficient for Metacrypt's concurrency profile.

#40 — Crash During ReWrapKeys Loses All Data

Risk: RotateMEK calls barrier.ReWrapKeys(newMEK) which commits a transaction re-wrapping all DEKs, then separately updates seal_config with the new encrypted MEK. A crash between these two database operations leaves DEKs wrapped with a MEK that is not persisted — all data is irrecoverable.

Current flow (seal.go lines 245313):

1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK)   ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil)  ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ?  ← separate statement, not in transaction
   *** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK

Fix: Unify steps 2–4 into a single database transaction.

Refactor ReWrapKeys to accept an optional *sql.Tx:

// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
    // Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
    rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
    // ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}

// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
    b.mu.Lock()
    defer b.mu.Unlock()
    mcrypto.Zeroize(b.mek)
    b.mek = newMEK
}

Then in RotateMEK:

func (m *Manager) RotateMEK(ctx context.Context, password string) error {
    // ... derive KWK, generate newMEK ...

    tx, err := m.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Re-wrap all DEKs within the transaction.
    if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
        return err
    }

    // Update seal_config within the same transaction.
    encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
    if err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx,
        "UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
        encNewMEK,
    ); err != nil {
        return err
    }

    if err := tx.Commit(); err != nil {
        return err
    }

    // Only after commit: update in-memory state.
    m.barrier.SwapMEK(newMEK)
    return nil
}

SQLite in WAL mode handles this correctly — the transaction is atomic regardless of process crash. The barrier_keys and seal_config updates either both commit or neither does.

Files

| File | Change |
| ---- | ------ |
| internal/barrier/barrier.go | Extend RLock scope in Get/Put/Delete/List; add ReWrapKeysTx, SwapMEK |
| internal/seal/seal.go | Wrap ReWrapKeysTx + seal_config UPDATE in single transaction |
| internal/barrier/barrier_test.go | Add concurrent Get/Seal stress test |

Verification

  • go test -race ./internal/barrier/ ./internal/seal/
  • Add TestConcurrentGetSeal: spawn goroutines doing Get while another goroutine calls Seal. Run with -race. Verify no panics or data races.
  • Add TestRotateMEKAtomic: verify that barrier_keys and seal_config are updated in the same transaction (mock the DB to detect transaction boundaries, or verify via rollback behavior).

Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)

These four findings touch separate files with no overlap and can be addressed in parallel.

#49 — No TTL Ceiling in CA Certificate Issuance

Risk: A non-admin user can request an arbitrarily long certificate lifetime. The issuer's MaxTTL exists in config but is not enforced during handleIssue or handleSignCSR.

Root cause: The CA engine applies the user's requested TTL directly to the certificate without comparing it against issuerConfig.MaxTTL. The SSH CA engine correctly enforces this via resolveTTL — the CA engine does not.

Fix: Add a resolveTTL method to the CA engine, following the SSH CA engine's pattern (sshca.go lines 902–932):

func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
    maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
    if err != nil {
        maxTTL = 2160 * time.Hour // 90 days fallback
    }

    if requested != "" {
        ttl, err := time.ParseDuration(requested)
        if err != nil {
            return 0, fmt.Errorf("invalid TTL: %w", err)
        }
        if ttl > maxTTL {
            return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
        }
        return ttl, nil
    }

    return maxTTL, nil
}

Call this in handleIssue and handleSignCSR before constructing the certificate. Replace the raw TTL string with the validated duration.

| File | Change |
| ---- | ------ |
| internal/engine/ca/ca.go | Add resolveTTL, call from handleIssue and handleSignCSR |
| internal/engine/ca/ca_test.go | Add test: issue cert with TTL > MaxTTL, verify rejection |

#61 — Ineffective ECDH Key Zeroization

Risk: privKey.Bytes() returns a copy of the private key bytes. Zeroizing the copy leaves the original inside *ecdh.PrivateKey. Go's crypto/ecdh API does not expose the internal byte slice.

Root cause: Language/API limitation in Go's crypto/ecdh package.

Fix: Store the raw private key bytes alongside the parsed key in userState, and zeroize those bytes on seal:

type userState struct {
    privKey    *ecdh.PrivateKey
    privBytes  []byte            // raw key bytes, retained for zeroization
    pubKey     *ecdh.PublicKey
    config     *UserKeyConfig
}

On load from barrier (Unseal, auto-provision):

raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw  // retain for zeroization
state.privKey = priv

On Seal:

mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil

Document the limitation: the parsed *ecdh.PrivateKey struct's internal copy cannot be zeroized from Go code. Setting privKey = nil makes it eligible for GC, but does not guarantee immediate byte overwrite. This is an accepted Go runtime limitation.

| File | Change |
| ---- | ------ |
| internal/engine/user/user.go | Add privBytes to userState, populate on load, zeroize on Seal |
| internal/engine/user/types.go | Update userState struct |

#62 — User Engine Policy Path Uses mountPath Instead of Mount Name

Risk: Policy checks construct the resource path using e.mountPath (which is engine/user/{name}/) instead of just the mount name. Policy rules match against user/{name}/recipient/{username}, so the full mount path creates a mismatch like user/engine/user/myengine//recipient/alice. No policy rule will ever match.

Root cause: Line 358 of user.go uses e.mountPath directly. The SSH CA and transit engines correctly use a mountName() helper.

Fix: Add a mountName() method to the user engine:

func (e *UserEngine) mountName() string {
    // mountPath is "engine/user/{name}/"
    parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return e.mountPath
}

Change line 358:

resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)

Audit all other resource path constructions in the user engine to confirm they also use the correct mount name.

| File | Change |
| ---- | ------ |
| internal/engine/user/user.go | Add mountName(), fix resource path on line 358 |
| internal/engine/user/user_test.go | Add test: verify policy resource path format |
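The before/after resource paths can be checked with a small sketch. mountName is reproduced here as a free function over the stored mount path (the real version is a method on the engine):

```go
package main

import (
	"fmt"
	"strings"
)

// mountName mirrors the helper above.
func mountName(mountPath string) string {
	// mountPath is "engine/user/{name}/"
	parts := strings.Split(strings.TrimSuffix(mountPath, "/"), "/")
	if len(parts) >= 3 {
		return parts[2]
	}
	return mountPath
}

func main() {
	if got := mountName("engine/user/myengine/"); got != "myengine" {
		panic("got " + got)
	}

	// The old bug: interpolating the full mount path yields a resource
	// that no policy rule can ever match.
	bad := fmt.Sprintf("user/%s/recipient/%s", "engine/user/myengine/", "alice")
	good := fmt.Sprintf("user/%s/recipient/%s", mountName("engine/user/myengine/"), "alice")
	fmt.Println(bad)  // user/engine/user/myengine//recipient/alice
	fmt.Println(good) // user/myengine/recipient/alice
}
```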

#69 — Typed REST Handlers Bypass Policy Engine

Risk: 18 typed REST handlers pass nil for CheckPolicy in the engine.Request, skipping service-level policy evaluation. The generic /v1/engine/request endpoint correctly passes a policyChecker. Because the engines default to allow when no policy matches (findings #54 and #58), the typed routes are effectively unprotected by policy.

Root cause: Typed handlers were modeled after admin-only operations (which don't need policy) but applied to user-accessible operations.

Fix: Extract the policy checker construction from handleEngineRequest into a shared helper:

func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
    return func(resource, action string) (string, bool) {
        effect, matched, err := s.policy.Check(
            info.Username, info.Roles, resource, action,
        )
        if err != nil || !matched {
            return "deny", false
        }
        return effect, matched
    }
}

Then in each typed handler, set CheckPolicy on the request:

req := &engine.Request{
    Operation:   "get-cert",
    Data:        data,
    CallerInfo:  callerInfo,
    CheckPolicy: s.newPolicyChecker(callerInfo),
}

18 handlers to update:

| Handler | Operation |
| ------- | --------- |
| handleGetCert | get-cert |
| handleRevokeCert | revoke-cert |
| handleDeleteCert | delete-cert |
| handleSSHCASignHost | sign-host |
| handleSSHCASignUser | sign-user |
| handleSSHCAGetProfile | get-profile |
| handleSSHCAListProfiles | list-profiles |
| handleSSHCADeleteProfile | delete-profile |
| handleSSHCAGetCert | get-cert |
| handleSSHCAListCerts | list-certs |
| handleSSHCARevokeCert | revoke-cert |
| handleSSHCADeleteCert | delete-cert |
| handleUserRegister | register |
| handleUserProvision | provision |
| handleUserListUsers | list-users |
| handleUserGetPublicKey | get-public-key |
| handleUserDeleteUser | delete-user |
| handleUserDecrypt | decrypt |
Note: handleUserEncrypt already passes a policy checker — verify it uses the same shared helper after refactoring. Admin-only handlers (behind requireAdmin wrapper) do not need a policy checker since admin bypasses policy.

| File | Change |
| ---- | ------ |
| internal/server/routes.go | Add newPolicyChecker, pass to all 18 typed handlers |
| internal/server/server_test.go | Add test: policy-denied user is rejected by typed route |
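The fail-closed behavior of the shared helper (error or non-match both become an explicit deny) can be sketched with a stub policy store. The stubPolicy type and its toy ruleset are inventions for illustration; only the closure's shape mirrors the helper above:

```go
package main

import (
	"errors"
	"fmt"
)

// PolicyChecker matches the engine-side signature used above.
type PolicyChecker func(resource, action string) (string, bool)

// stubPolicy stands in for the real policy store. In this toy ruleset
// only "alice" may perform get-public-key.
type stubPolicy struct{}

func (stubPolicy) Check(user string, roles []string, resource, action string) (string, bool, error) {
	if user == "alice" && action == "get-public-key" {
		return "allow", true, nil
	}
	if user == "error" {
		return "", false, errors.New("policy store unavailable")
	}
	return "", false, nil // no rule matched
}

// newPolicyChecker mirrors the shared helper: errors and non-matches
// both surface as an explicit deny.
func newPolicyChecker(p stubPolicy, user string, roles []string) PolicyChecker {
	return func(resource, action string) (string, bool) {
		effect, matched, err := p.Check(user, roles, resource, action)
		if err != nil || !matched {
			return "deny", false
		}
		return effect, matched
	}
}

func main() {
	var p stubPolicy

	check := newPolicyChecker(p, "alice", nil)
	if effect, ok := check("user/myengine/pub/alice", "get-public-key"); !ok || effect != "allow" {
		panic("matching rule should allow")
	}

	check = newPolicyChecker(p, "bob", nil)
	if effect, _ := check("user/myengine/pub/alice", "get-public-key"); effect != "deny" {
		panic("non-match should deny")
	}

	check = newPolicyChecker(p, "error", nil)
	if effect, _ := check("x", "y"); effect != "deny" {
		panic("policy error should deny")
	}
	fmt.Println("ok")
}
```

Note that this deny-on-no-match behavior lives in the checker; whether the engine then honors or overrides it is the subject of findings #54 and #58.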

Verification (Work Item 4, all findings)

  • go test ./internal/engine/ca/
  • go test ./internal/engine/user/
  • go test ./internal/server/
  • go vet ./...

Implementation Order

1. #68  JSON injection             (standalone, ship immediately)
2. #48  Path traversal             (standalone, blocks safe engine operation)
3. #39  Barrier TOCTOU race    ─┐
   #40  ReWrapKeys crash safety ┘  (coupled, requires careful testing)
4. #49  CA TTL enforcement     ─┐
   #61  ECDH zeroization       │
   #62  User policy path       │  (independent fixes, parallelizable)
   #69  Policy bypass          ─┘

Items 1 and 2 have no dependencies and can be done in parallel by different engineers.

Items 3 and 4 can also be done in parallel since they touch different subsystems (barrier/seal vs engines/server).


Post-Remediation

After all eight findings are resolved:

  1. Update AUDIT.md — mark #39, #40, #48, #49, #61, #62, #68, #69 as RESOLVED with resolution summaries.
  2. Run the full pipeline: make all (vet, lint, test, build).
  3. Run race detector: go test -race ./...
  4. Address related medium findings that interact with these fixes:
    • #54 (SSH CA default-allow) and #58 (transit default-allow) — once #69 is fixed, the typed handlers will pass policy checkers to the engines, but the engines still default-allow when CheckPolicy returns no match. Consider changing the engine-level default to deny for non-admin callers.
    • #72 (policy ID path traversal) — already covered by #48's ValidateName fix on CreateRule.