metacrypt/REMEDIATION.md

# Remediation Plan — High-Priority Audit Findings

**Date**: 2026-03-17
**Scope**: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69

This plan addresses all eight High-severity findings from the 2026-03-17
full system audit. Findings are grouped into four work items by shared root
cause or affected subsystem. The order reflects dependency chains: #68 is a
standalone fix that should ship first; #48 is a prerequisite for safe
operation across all engines; #39/#40 affect the storage core; the remaining
four affect specific engines.

---

## Work Item 1: JSON Injection in REST Error Responses (#68)

**Risk**: An error message containing `"` or `\` breaks the JSON response
structure. If the error contains attacker-controlled input (e.g., a mount
name or key name that triggers a downstream error), this enables JSON
injection in API responses.

**Root cause**: 13 locations in `internal/server/routes.go` construct JSON
error responses via string concatenation:

```go
http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)
```

The `writeEngineError` helper (line 1704) is the most common entry point;
most typed handlers call it.

### Fix

1. **Replace `writeEngineError`** with a safe JSON encoder:

   ```go
   func writeJSONError(w http.ResponseWriter, msg string, code int) {
       w.Header().Set("Content-Type", "application/json")
       w.WriteHeader(code)
       _ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
   }
   ```

2. **Replace all 13 call sites** that use string concatenation with
   `writeJSONError(w, grpcMessage(err), status)` or
   `writeJSONError(w, err.Error(), status)`.

   The `grpcMessage` helper already exists in the webserver package and
   extracts human-readable messages from gRPC errors. Add an equivalent
   to the REST server, and prefer it over raw `err.Error()` to avoid
   leaking internal error details.

3. **Grep for the pattern** `"error":"` in `routes.go` to confirm no
   remaining string-concatenated JSON.

### Files

| File | Change |
|------|--------|
| `internal/server/routes.go` | Replace `writeEngineError` and all 13 inline error sites |

### Verification

- `go vet ./internal/server/`
- `go test ./internal/server/`
- Manual test: mount an engine with a name containing `"`, trigger an error,
  verify the response is valid JSON.

---

## Work Item 2: Path Traversal via Unsanitized Names (#48)

**Risk**: User-controlled strings (issuer names, key names, profile names,
usernames, mount names) are concatenated directly into barrier storage
paths. An input containing `../` traverses the barrier namespace, allowing
reads and writes to arbitrary paths. This affects all four engines and the
engine registry.

**Root cause**: No validation exists at any layer — neither the barrier's
`Put`/`Get`/`Delete` methods nor the engines sanitize path components.

### Vulnerable locations

| File | Input | Path Pattern |
|------|-------|-------------|
| `ca/ca.go` | issuer `name` | `mountPath + "issuers/" + name + "/"` |
| `sshca/sshca.go` | profile `name` | `mountPath + "profiles/" + name + ".json"` |
| `transit/transit.go` | key `name` | `mountPath + "keys/" + name + "/"` |
| `user/user.go` | `username` | `mountPath + "users/" + username + "/"` |
| `engine/engine.go` | mount `name` | `engine/{type}/{name}/` |
| `policy/policy.go` | rule `ID` | `policy/rules/{id}` |

### Fix

Enforce validation at **two layers** (defense in depth):

1. **Barrier layer** — reject paths containing `..` segments.

   Add a `validatePath` check at the top of `Get`, `Put`, `Delete`, and
   `List` in `barrier.go`:

   ```go
   var ErrInvalidPath = errors.New("barrier: invalid path")

   func validatePath(p string) error {
       for _, seg := range strings.Split(p, "/") {
           if seg == ".." {
               return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
           }
       }
       return nil
   }
   ```

   Call `validatePath` at the entry of `Get`, `Put`, `Delete`, `List`.
   Return `ErrInvalidPath` on failure.

2. **Engine/registry layer** — validate entity names at input boundaries.

   Add a `ValidateName` helper to `internal/engine/`:

   ```go
   var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

   func ValidateName(name string) error {
       if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
           return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
               "dot, hyphen, or underscore characters", name)
       }
       return nil
   }
   ```

   Call `ValidateName` in:

   | Location | Input validated |
   |----------|----------------|
   | `engine.go` `Mount()` | mount name |
   | `ca.go` `handleCreateIssuer` | issuer name |
   | `sshca.go` `handleCreateProfile` | profile name |
   | `transit.go` `handleCreateKey` | key name |
   | `user.go` `handleRegister`, `handleProvision` | username |
   | `user.go` `handleEncrypt` | recipient usernames |
   | `policy.go` `CreateRule` | rule ID |

   Note: certificate serials are generated server-side from `crypto/rand`
   and hex-encoded, so they are safe. Validate anyway for defense in depth.

### Files

| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Add `validatePath`, call from Get/Put/Delete/List |
| `internal/engine/engine.go` | Add `ValidateName`, call from `Mount` |
| `internal/engine/ca/ca.go` | Call `ValidateName` on issuer name |
| `internal/engine/sshca/sshca.go` | Call `ValidateName` on profile name |
| `internal/engine/transit/transit.go` | Call `ValidateName` on key name |
| `internal/engine/user/user.go` | Call `ValidateName` on usernames |
| `internal/policy/policy.go` | Call `ValidateName` on rule ID |

### Verification

- Add `TestValidatePath` to `barrier_test.go`: confirm `../` and `..` are
  rejected; confirm normal paths pass.
- Add `TestValidateName` to `engine_test.go`: confirm `../evil`, empty
  string, and overlong names are rejected; confirm valid names pass.
- `go test ./internal/barrier/ ./internal/engine/... ./internal/policy/`

---

## Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)

These two findings share the barrier/seal subsystem and should be addressed
together.

### #39 — TOCTOU Race in Barrier Get/Put

**Risk**: `Get` and `Put` copy the `mek` slice header and `keys` map
reference under `RLock`, release the lock, then use the copied references
for encryption/decryption. A concurrent `Seal()` zeroizes the underlying
byte slices in place before nil-ing the fields, so a concurrent reader
uses zeroized key material.

**Root cause**: The lock does not cover the crypto operation. The "copy"
is a shallow reference copy (slice header), not a deep byte copy. `Seal()`
zeroizes the backing array, which is shared.

**Current locking pattern** (`barrier.go`):

```
Get:  RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put:  RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock  → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock
```

**Fix**: Hold `RLock` through the entire crypto operation:

```go
func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
    if err := validatePath(path); err != nil {
        return nil, err
    }
    b.mu.RLock()
    defer b.mu.RUnlock()
    if b.mek == nil {
        return nil, ErrSealed
    }
    // query DB, resolve key, decrypt — all under RLock
    // ...
}
```

This is the minimal, safest change. `RLock` permits concurrent readers, so
there is no throughput regression for parallel `Get`/`Put` operations. The
only serialization point is `Seal()`, which acquires the exclusive `Lock`
and waits for all readers to drain — exactly the semantics we want.

Apply the same pattern to `Put`, `Delete`, and `List`.

**Alternative considered**: Atomic pointer swap (`atomic.Pointer[keyState]`).
This eliminates the lock from the hot path entirely, but introduces
complexity around deferred zeroization of the old state (readers may still
hold references). The `RLock`-through-crypto approach is simpler and
sufficient for Metacrypt's concurrency profile.

### #40 — Crash During `ReWrapKeys` Loses All Data

**Risk**: `RotateMEK` calls `barrier.ReWrapKeys(newMEK)` which commits a
transaction re-wrapping all DEKs, then separately updates `seal_config`
with the new encrypted MEK. A crash between these two database operations
leaves DEKs wrapped with a MEK that is not persisted — all data is
irrecoverable.

**Current flow** (`seal.go` lines 245–313):

```
1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK)   ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil)  ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ?  ← separate statement, not in transaction
   *** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK
```

**Fix**: Unify steps 2–4 into a single database transaction.

Refactor `ReWrapKeys` to accept an optional `*sql.Tx`:

```go
// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
    // Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
    rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
    // ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}

// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
    b.mu.Lock()
    defer b.mu.Unlock()
    mcrypto.Zeroize(b.mek)
    b.mek = newMEK
}
```

Then in `RotateMEK`:

```go
func (m *Manager) RotateMEK(ctx context.Context, password string) error {
    // ... derive KWK, generate newMEK ...

    tx, err := m.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Re-wrap all DEKs within the transaction.
    if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
        return err
    }

    // Update seal_config within the same transaction.
    encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
    if err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx,
        "UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
        encNewMEK,
    ); err != nil {
        return err
    }

    if err := tx.Commit(); err != nil {
        return err
    }

    // Only after commit: update in-memory state.
    m.barrier.SwapMEK(newMEK)
    return nil
}
```

SQLite in WAL mode handles this correctly — the transaction is atomic
regardless of process crash. The `barrier_keys` and `seal_config` updates
either both commit or neither does.

### Files

| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Extend RLock scope in Get/Put/Delete/List; add `ReWrapKeysTx`, `SwapMEK` |
| `internal/seal/seal.go` | Wrap ReWrapKeysTx + seal_config UPDATE in single transaction |
| `internal/barrier/barrier_test.go` | Add concurrent Get/Seal stress test |

### Verification

- `go test -race ./internal/barrier/ ./internal/seal/`
- Add `TestConcurrentGetSeal`: spawn goroutines doing Get while another
  goroutine calls Seal. Run with `-race`. Verify no panics or data races.
- Add `TestRotateMEKAtomic`: verify that `barrier_keys` and `seal_config`
  are updated in the same transaction (mock the DB to detect transaction
  boundaries, or verify via rollback behavior).

---

## Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)

These four findings touch separate files with no overlap and can be
addressed in parallel.

### #49 — No TTL Ceiling in CA Certificate Issuance

**Risk**: A non-admin user can request an arbitrarily long certificate
lifetime. The issuer's `MaxTTL` exists in config but is not enforced
during `handleIssue` or `handleSignCSR`.

**Root cause**: The CA engine applies the user's requested TTL directly
to the certificate without comparing it against `issuerConfig.MaxTTL`.
The SSH CA engine correctly enforces this via `resolveTTL` — the CA
engine does not.

**Fix**: Add a `resolveTTL` method to the CA engine, following the SSH
CA engine's pattern (`sshca.go` lines 902–932):

```go
func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
    maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
    if err != nil {
        maxTTL = 2160 * time.Hour // 90 days fallback
    }

    if requested != "" {
        ttl, err := time.ParseDuration(requested)
        if err != nil {
            return 0, fmt.Errorf("invalid TTL: %w", err)
        }
        if ttl > maxTTL {
            return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
        }
        return ttl, nil
    }

    return maxTTL, nil
}
```

Call this in `handleIssue` and `handleSignCSR` before constructing the
certificate. Replace the raw TTL string with the validated duration.

| File | Change |
|------|--------|
| `internal/engine/ca/ca.go` | Add `resolveTTL`, call from `handleIssue` and `handleSignCSR` |
| `internal/engine/ca/ca_test.go` | Add test: issue cert with TTL > MaxTTL, verify rejection |

### #61 — Ineffective ECDH Key Zeroization

**Risk**: `privKey.Bytes()` returns a copy of the private key bytes.
Zeroizing the copy leaves the original inside `*ecdh.PrivateKey`. Go's
`crypto/ecdh` API does not expose the internal byte slice.

**Root cause**: Language/API limitation in Go's `crypto/ecdh` package.

**Fix**: Store the raw private key bytes alongside the parsed key in
`userState`, and zeroize those bytes on seal:

```go
type userState struct {
    privKey    *ecdh.PrivateKey
    privBytes  []byte            // raw key bytes, retained for zeroization
    pubKey     *ecdh.PublicKey
    config     *UserKeyConfig
}
```

On **load from barrier** (Unseal, auto-provision):
```go
raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw  // retain for zeroization
state.privKey = priv
```

On **Seal**:
```go
mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil
```

Document the limitation: the parsed `*ecdh.PrivateKey` struct's internal
copy cannot be zeroized from Go code. Setting `privKey = nil` makes it
eligible for GC, but does not guarantee immediate byte overwrite. This is
an accepted Go runtime limitation.

| File | Change |
|------|--------|
| `internal/engine/user/user.go` | Add `privBytes` to `userState`, populate on load, zeroize on Seal |
| `internal/engine/user/types.go` | Update `userState` struct |

### #62 — User Engine Policy Path Uses `mountPath` Instead of Mount Name

**Risk**: Policy checks construct the resource path using `e.mountPath`
(which is `engine/user/{name}/`) instead of just the mount name. Policy
rules match against `user/{name}/recipient/{username}`, so the full mount
path creates a mismatch like `user/engine/user/myengine//recipient/alice`.
No policy rule will ever match.

**Root cause**: Line 358 of `user.go` uses `e.mountPath` directly. The
SSH CA and transit engines correctly use a `mountName()` helper.

**Fix**: Add a `mountName()` method to the user engine:

```go
func (e *UserEngine) mountName() string {
    // mountPath is "engine/user/{name}/"
    parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
    if len(parts) >= 3 {
        return parts[2]
    }
    return e.mountPath
}
```

Change line 358:

```go
resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)
```

Audit all other resource path constructions in the user engine to confirm
they also use the correct mount name.

| File | Change |
|------|--------|
| `internal/engine/user/user.go` | Add `mountName()`, fix resource path on line 358 |
| `internal/engine/user/user_test.go` | Add test: verify policy resource path format |

### #69 — Typed REST Handlers Bypass Policy Engine

**Risk**: 18 typed REST handlers pass `nil` for `CheckPolicy` in the
`engine.Request`, skipping service-level policy evaluation. The generic
`/v1/engine/request` endpoint correctly passes a `policyChecker`. Since
engines #54 and #58 default to allow when no policy matches, typed routes
are effectively unprotected by policy.

**Root cause**: Typed handlers were modeled after admin-only operations
(which don't need policy) but applied to user-accessible operations.

**Fix**: Extract the policy checker construction from
`handleEngineRequest` into a shared helper:

```go
func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
    return func(resource, action string) (string, bool) {
        effect, matched, err := s.policy.Check(
            info.Username, info.Roles, resource, action,
        )
        if err != nil || !matched {
            return "deny", false
        }
        return effect, matched
    }
}
```

Then in each typed handler, set `CheckPolicy` on the request:

```go
req := &engine.Request{
    Operation:   "get-cert",
    Data:        data,
    CallerInfo:  callerInfo,
    CheckPolicy: s.newPolicyChecker(callerInfo),
}
```

**18 handlers to update**:

| Handler | Operation |
|---------|-----------|
| `handleGetCert` | `get-cert` |
| `handleRevokeCert` | `revoke-cert` |
| `handleDeleteCert` | `delete-cert` |
| `handleSSHCASignHost` | `sign-host` |
| `handleSSHCASignUser` | `sign-user` |
| `handleSSHCAGetProfile` | `get-profile` |
| `handleSSHCAListProfiles` | `list-profiles` |
| `handleSSHCADeleteProfile` | `delete-profile` |
| `handleSSHCAGetCert` | `get-cert` |
| `handleSSHCAListCerts` | `list-certs` |
| `handleSSHCARevokeCert` | `revoke-cert` |
| `handleSSHCADeleteCert` | `delete-cert` |
| `handleUserRegister` | `register` |
| `handleUserProvision` | `provision` |
| `handleUserListUsers` | `list-users` |
| `handleUserGetPublicKey` | `get-public-key` |
| `handleUserDeleteUser` | `delete-user` |
| `handleUserDecrypt` | `decrypt` |

Note: `handleUserEncrypt` already passes a policy checker — verify it
uses the same shared helper after refactoring. Admin-only handlers
(behind `requireAdmin` wrapper) do not need a policy checker since admin
bypasses policy.

| File | Change |
|------|--------|
| `internal/server/routes.go` | Add `newPolicyChecker`, pass to all 18 typed handlers |
| `internal/server/server_test.go` | Add test: policy-denied user is rejected by typed route |

### Verification (Work Item 4, all findings)

```bash
go test ./internal/engine/ca/
go test ./internal/engine/user/
go test ./internal/server/
go vet ./...
```

---

## Implementation Order

```
1. #68  JSON injection             (standalone, ship immediately)
2. #48  Path traversal             (standalone, blocks safe engine operation)
3. #39  Barrier TOCTOU race    ─┐
   #40  ReWrapKeys crash safety ┘  (coupled, requires careful testing)
4. #49  CA TTL enforcement     ─┐
   #61  ECDH zeroization       │
   #62  User policy path       │  (independent fixes, parallelizable)
   #69  Policy bypass          ─┘
```

Items 1 and 2 have no dependencies and can be done in parallel by
different engineers.

Items 3 and 4 can also be done in parallel since they touch different
subsystems (barrier/seal vs engines/server).

---

## Post-Remediation

After all eight findings are resolved:

1. **Update AUDIT.md** — mark #39, #40, #48, #49, #61, #62, #68, #69 as
   RESOLVED with resolution summaries.
2. **Run the full pipeline**: `make all` (vet, lint, test, build).
3. **Run race detector**: `go test -race ./...`
4. **Address related medium findings** that interact with these fixes:
   - #54 (SSH CA default-allow) and #58 (transit default-allow) — once
     #69 is fixed, the typed handlers will pass policy checkers to the
     engines, but the engines still default-allow when `CheckPolicy`
     returns no match. Consider changing the engine-level default to deny
     for non-admin callers.
   - #72 (policy ID path traversal) — already covered by #48's
     `ValidateName` fix on `CreateRule`.