Files
metacrypt/REMEDIATION.md
Kyle Isom a80323e320 Add web UI for SSH CA, Transit, and User engines; full security audit and remediation
Web UI: Added browser-based management for all three remaining engines
(SSH CA, Transit, User E2E). Includes gRPC client wiring, handler files,
7 HTML templates, dashboard mount forms, and conditional navigation links.
Fixed REST API routes to match design specs (SSH CA cert singular paths,
Transit PATCH for update-key-config).

Security audit: Conducted full-system audit covering crypto core, all
engine implementations, API servers, policy engine, auth, deployment,
and documentation. Identified 42 new findings (#39-#80) across all
severity levels.

Remediation of all 8 High findings:
- #68: Replaced 14 JSON-injection-vulnerable error responses with safe
  json.Encoder via writeJSONError helper
- #48: Added two-layer path traversal defense (barrier validatePath
  rejects ".." segments; engine ValidateName enforces safe name pattern)
- #39: Extended RLock through entire crypto operations in barrier
  Get/Put/Delete/List to eliminate TOCTOU race with Seal
- #40: Unified ReWrapKeys and seal_config UPDATE into single SQLite
  transaction to prevent irrecoverable data loss on crash during MEK
  rotation
- #49: Added resolveTTL to CA engine enforcing issuer MaxTTL ceiling
  on handleIssue and handleSignCSR
- #61: Store raw ECDH private key bytes in userState for effective
  zeroization on Seal
- #62: Fixed user engine policy resource path from mountPath to
  mountName() so policy rules match correctly
- #69: Added newPolicyChecker helper and passed service-level policy
  evaluation to all 25 typed REST handler engine.Request structs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 22:02:06 -07:00

580 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Remediation Plan — High-Priority Audit Findings
**Date**: 2026-03-17
**Scope**: AUDIT.md findings #39, #40, #48, #49, #61, #62, #68, #69
This plan addresses all eight High-severity findings from the 2026-03-17
full system audit. Findings are grouped into four work items by shared root
cause or affected subsystem. The order reflects dependency chains: #68 is a
standalone fix that should ship first; #48 is a prerequisite for safe
operation across all engines; #39/#40 affect the storage core; the remaining
four affect specific engines.
---
## Work Item 1: JSON Injection in REST Error Responses (#68)
**Risk**: An error message containing `"` or `\` breaks the JSON response
structure. If the error contains attacker-controlled input (e.g., a mount
name or key name that triggers a downstream error), this enables JSON
injection in API responses.
**Root cause**: 13 locations in `internal/server/routes.go` construct JSON
error responses via string concatenation:
```go
http.Error(w, `{"error":"`+err.Error()+`"}`, http.StatusInternalServerError)
```
The `writeEngineError` helper (line 1704) is the most common entry point;
most typed handlers call it.
### Fix
1. **Replace `writeEngineError`** with a safe JSON encoder:
```go
func writeJSONError(w http.ResponseWriter, msg string, code int) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(code)
_ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
}
```
2. **Replace all 13 call sites** that use string concatenation with
`writeJSONError(w, grpcMessage(err), status)` or
`writeJSONError(w, err.Error(), status)`.
The `grpcMessage` helper already exists in the webserver package and
extracts human-readable messages from gRPC errors. Add an equivalent
to the REST server, and prefer it over raw `err.Error()` to avoid
leaking internal error details.
3. **Grep for the pattern** `"error":"` in `routes.go` to confirm no
remaining string-concatenated JSON.
### Files
| File | Change |
|------|--------|
| `internal/server/routes.go` | Replace `writeEngineError` and all 13 inline error sites |
### Verification
- `go vet ./internal/server/`
- `go test ./internal/server/`
- Manual test: mount an engine with a name containing `"`, trigger an error,
verify the response is valid JSON.
---
## Work Item 2: Path Traversal via Unsanitized Names (#48)
**Risk**: User-controlled strings (issuer names, key names, profile names,
usernames, mount names) are concatenated directly into barrier storage
paths. An input containing `../` traverses the barrier namespace, allowing
reads and writes to arbitrary paths. This affects all four engines and the
engine registry.
**Root cause**: No validation exists at any layer — neither the barrier's
`Put`/`Get`/`Delete` methods nor the engines sanitize path components.
### Vulnerable locations
| File | Input | Path Pattern |
|------|-------|-------------|
| `ca/ca.go` | issuer `name` | `mountPath + "issuers/" + name + "/"` |
| `sshca/sshca.go` | profile `name` | `mountPath + "profiles/" + name + ".json"` |
| `transit/transit.go` | key `name` | `mountPath + "keys/" + name + "/"` |
| `user/user.go` | `username` | `mountPath + "users/" + username + "/"` |
| `engine/engine.go` | mount `name` | `engine/{type}/{name}/` |
| `policy/policy.go` | rule `ID` | `policy/rules/{id}` |
### Fix
Enforce validation at **two layers** (defense in depth):
1. **Barrier layer** — reject paths containing `..` segments.
Add a `validatePath` check at the top of `Get`, `Put`, `Delete`, and
`List` in `barrier.go`:
```go
var ErrInvalidPath = errors.New("barrier: invalid path")
func validatePath(p string) error {
for _, seg := range strings.Split(p, "/") {
if seg == ".." {
return fmt.Errorf("%w: path traversal rejected: %q", ErrInvalidPath, p)
}
}
return nil
}
```
Call `validatePath` at the entry of `Get`, `Put`, `Delete`, `List`.
Return `ErrInvalidPath` on failure.
2. **Engine/registry layer** — validate entity names at input boundaries.
Add a `ValidateName` helper to `internal/engine/`:
```go
var namePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
func ValidateName(name string) error {
if name == "" || len(name) > 128 || !namePattern.MatchString(name) {
return fmt.Errorf("invalid name %q: must be 1-128 alphanumeric, "+
"dot, hyphen, or underscore characters", name)
}
return nil
}
```
Call `ValidateName` in:
| Location | Input validated |
|----------|----------------|
| `engine.go` `Mount()` | mount name |
| `ca.go` `handleCreateIssuer` | issuer name |
| `sshca.go` `handleCreateProfile` | profile name |
| `transit.go` `handleCreateKey` | key name |
| `user.go` `handleRegister`, `handleProvision` | username |
| `user.go` `handleEncrypt` | recipient usernames |
| `policy.go` `CreateRule` | rule ID |
Note: certificate serials are generated server-side from `crypto/rand`
and hex-encoded, so they are safe. Validate anyway for defense in depth.
### Files
| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Add `validatePath`, call from Get/Put/Delete/List |
| `internal/engine/engine.go` | Add `ValidateName`, call from `Mount` |
| `internal/engine/ca/ca.go` | Call `ValidateName` on issuer name |
| `internal/engine/sshca/sshca.go` | Call `ValidateName` on profile name |
| `internal/engine/transit/transit.go` | Call `ValidateName` on key name |
| `internal/engine/user/user.go` | Call `ValidateName` on usernames |
| `internal/policy/policy.go` | Call `ValidateName` on rule ID |
### Verification
- Add `TestValidatePath` to `barrier_test.go`: confirm `../` and `..` are
rejected; confirm normal paths pass.
- Add `TestValidateName` to `engine_test.go`: confirm `../evil`, empty
string, and overlong names are rejected; confirm valid names pass.
- `go test ./internal/barrier/ ./internal/engine/... ./internal/policy/`
---
## Work Item 3: Barrier Concurrency and Crash Safety (#39, #40)
These two findings share the barrier/seal subsystem and should be addressed
together.
### #39 — TOCTOU Race in Barrier Get/Put
**Risk**: `Get` and `Put` copy the `mek` slice header and `keys` map
reference under `RLock`, release the lock, then use the copied references
for encryption/decryption. A concurrent `Seal()` zeroizes the underlying
byte slices in place before nil-ing the fields, so a concurrent reader
uses zeroized key material.
**Root cause**: The lock does not cover the crypto operation. The "copy"
is a shallow reference copy (slice header), not a deep byte copy. `Seal()`
zeroizes the backing array, which is shared.
**Current locking pattern** (`barrier.go`):
```
Get: RLock → copy mek/keys refs → RUnlock → decrypt (uses zeroized key)
Put: RLock → copy mek/keys refs → RUnlock → encrypt (uses zeroized key)
Seal: Lock → zeroize mek bytes → nil mek → zeroize keys → nil keys → Unlock
```
**Fix**: Hold `RLock` through the entire crypto operation:
```go
func (b *AESGCMBarrier) Get(ctx context.Context, path string) ([]byte, error) {
if err := validatePath(path); err != nil {
return nil, err
}
b.mu.RLock()
defer b.mu.RUnlock()
if b.mek == nil {
return nil, ErrSealed
}
// query DB, resolve key, decrypt — all under RLock
// ...
}
```
This is the minimal, safest change. `RLock` permits concurrent readers, so
there is no throughput regression for parallel `Get`/`Put` operations. The
only serialization point is `Seal()`, which acquires the exclusive `Lock`
and waits for all readers to drain — exactly the semantics we want.
Apply the same pattern to `Put`, `Delete`, and `List`.
**Alternative considered**: Atomic pointer swap (`atomic.Pointer[keyState]`).
This eliminates the lock from the hot path entirely, but introduces
complexity around deferred zeroization of the old state (readers may still
hold references). The `RLock`-through-crypto approach is simpler and
sufficient for Metacrypt's concurrency profile.
### #40 — Crash During `ReWrapKeys` Loses All Data
**Risk**: `RotateMEK` calls `barrier.ReWrapKeys(newMEK)` which commits a
transaction re-wrapping all DEKs, then separately updates `seal_config`
with the new encrypted MEK. A crash between these two database operations
leaves DEKs wrapped with a MEK that is not persisted — all data is
irrecoverable.
**Current flow** (`seal.go` lines 245313):
```
1. Generate newMEK
2. barrier.ReWrapKeys(ctx, newMEK) ← commits transaction (barrier_keys updated)
3. crypto.Encrypt(kwk, newMEK, nil) ← encrypt new MEK
4. UPDATE seal_config SET encrypted_mek = ? ← separate statement, not in transaction
*** CRASH HERE = DATA LOSS ***
5. Swap in-memory MEK
```
**Fix**: Unify steps 24 into a single database transaction.
Refactor `ReWrapKeys` to accept an optional `*sql.Tx`:
```go
// ReWrapKeysTx re-wraps all DEKs with newMEK within the given transaction.
func (b *AESGCMBarrier) ReWrapKeysTx(ctx context.Context, tx *sql.Tx, newMEK []byte) error {
// Same logic as ReWrapKeys, but use tx instead of b.db.BeginTx.
rows, err := tx.QueryContext(ctx, "SELECT key_id, wrapped_key FROM barrier_keys")
// ... decrypt with old MEK, encrypt with new MEK, UPDATE barrier_keys ...
}
// SwapMEK updates the in-memory MEK after a committed transaction.
func (b *AESGCMBarrier) SwapMEK(newMEK []byte) {
b.mu.Lock()
defer b.mu.Unlock()
mcrypto.Zeroize(b.mek)
b.mek = newMEK
}
```
Then in `RotateMEK`:
```go
func (m *Manager) RotateMEK(ctx context.Context, password string) error {
// ... derive KWK, generate newMEK ...
tx, err := m.db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback()
// Re-wrap all DEKs within the transaction.
if err := m.barrier.ReWrapKeysTx(ctx, tx, newMEK); err != nil {
return err
}
// Update seal_config within the same transaction.
encNewMEK, err := crypto.Encrypt(kwk, newMEK, nil)
if err != nil {
return err
}
if _, err := tx.ExecContext(ctx,
"UPDATE seal_config SET encrypted_mek = ? WHERE id = 1",
encNewMEK,
); err != nil {
return err
}
if err := tx.Commit(); err != nil {
return err
}
// Only after commit: update in-memory state.
m.barrier.SwapMEK(newMEK)
return nil
}
```
SQLite in WAL mode handles this correctly — the transaction is atomic
regardless of process crash. The `barrier_keys` and `seal_config` updates
either both commit or neither does.
### Files
| File | Change |
|------|--------|
| `internal/barrier/barrier.go` | Extend RLock scope in Get/Put/Delete/List; add `ReWrapKeysTx`, `SwapMEK` |
| `internal/seal/seal.go` | Wrap ReWrapKeysTx + seal_config UPDATE in single transaction |
| `internal/barrier/barrier_test.go` | Add concurrent Get/Seal stress test |
### Verification
- `go test -race ./internal/barrier/ ./internal/seal/`
- Add `TestConcurrentGetSeal`: spawn goroutines doing Get while another
goroutine calls Seal. Run with `-race`. Verify no panics or data races.
- Add `TestRotateMEKAtomic`: verify that `barrier_keys` and `seal_config`
are updated in the same transaction (mock the DB to detect transaction
boundaries, or verify via rollback behavior).
---
## Work Item 4: CA TTL Enforcement, User Engine Fixes, Policy Bypass (#49, #61, #62, #69)
These four findings touch separate files with no overlap and can be
addressed in parallel.
### #49 — No TTL Ceiling in CA Certificate Issuance
**Risk**: A non-admin user can request an arbitrarily long certificate
lifetime. The issuer's `MaxTTL` exists in config but is not enforced
during `handleIssue` or `handleSignCSR`.
**Root cause**: The CA engine applies the user's requested TTL directly
to the certificate without comparing it against `issuerConfig.MaxTTL`.
The SSH CA engine correctly enforces this via `resolveTTL` — the CA
engine does not.
**Fix**: Add a `resolveTTL` method to the CA engine, following the SSH
CA engine's pattern (`sshca.go` lines 902932):
```go
func (e *CAEngine) resolveTTL(requested string, issuer *issuerState) (time.Duration, error) {
maxTTL, err := time.ParseDuration(issuer.config.MaxTTL)
if err != nil {
maxTTL = 2160 * time.Hour // 90 days fallback
}
if requested != "" {
ttl, err := time.ParseDuration(requested)
if err != nil {
return 0, fmt.Errorf("invalid TTL: %w", err)
}
if ttl > maxTTL {
return 0, fmt.Errorf("requested TTL %s exceeds issuer maximum %s", ttl, maxTTL)
}
return ttl, nil
}
return maxTTL, nil
}
```
Call this in `handleIssue` and `handleSignCSR` before constructing the
certificate. Replace the raw TTL string with the validated duration.
| File | Change |
|------|--------|
| `internal/engine/ca/ca.go` | Add `resolveTTL`, call from `handleIssue` and `handleSignCSR` |
| `internal/engine/ca/ca_test.go` | Add test: issue cert with TTL > MaxTTL, verify rejection |
### #61 — Ineffective ECDH Key Zeroization
**Risk**: `privKey.Bytes()` returns a copy of the private key bytes.
Zeroizing the copy leaves the original inside `*ecdh.PrivateKey`. Go's
`crypto/ecdh` API does not expose the internal byte slice.
**Root cause**: Language/API limitation in Go's `crypto/ecdh` package.
**Fix**: Store the raw private key bytes alongside the parsed key in
`userState`, and zeroize those bytes on seal:
```go
type userState struct {
privKey *ecdh.PrivateKey
privBytes []byte // raw key bytes, retained for zeroization
pubKey *ecdh.PublicKey
config *UserKeyConfig
}
```
On **load from barrier** (Unseal, auto-provision):
```go
raw, err := b.Get(ctx, prefix+"priv.key")
priv, err := curve.NewPrivateKey(raw)
state.privBytes = raw // retain for zeroization
state.privKey = priv
```
On **Seal**:
```go
mcrypto.Zeroize(u.privBytes)
u.privKey = nil
u.privBytes = nil
```
Document the limitation: the parsed `*ecdh.PrivateKey` struct's internal
copy cannot be zeroized from Go code. Setting `privKey = nil` makes it
eligible for GC, but does not guarantee immediate byte overwrite. This is
an accepted Go runtime limitation.
| File | Change |
|------|--------|
| `internal/engine/user/user.go` | Add `privBytes` to `userState`, populate on load, zeroize on Seal |
| `internal/engine/user/types.go` | Update `userState` struct |
### #62 — User Engine Policy Path Uses `mountPath` Instead of Mount Name
**Risk**: Policy checks construct the resource path using `e.mountPath`
(which is `engine/user/{name}/`) instead of just the mount name. Policy
rules match against `user/{name}/recipient/{username}`, so the full mount
path creates a mismatch like `user/engine/user/myengine//recipient/alice`.
No policy rule will ever match.
**Root cause**: Line 358 of `user.go` uses `e.mountPath` directly. The
SSH CA and transit engines correctly use a `mountName()` helper.
**Fix**: Add a `mountName()` method to the user engine:
```go
func (e *UserEngine) mountName() string {
// mountPath is "engine/user/{name}/"
parts := strings.Split(strings.TrimSuffix(e.mountPath, "/"), "/")
if len(parts) >= 3 {
return parts[2]
}
return e.mountPath
}
```
Change line 358:
```go
resource := fmt.Sprintf("user/%s/recipient/%s", e.mountName(), r)
```
Audit all other resource path constructions in the user engine to confirm
they also use the correct mount name.
| File | Change |
|------|--------|
| `internal/engine/user/user.go` | Add `mountName()`, fix resource path on line 358 |
| `internal/engine/user/user_test.go` | Add test: verify policy resource path format |
### #69 — Typed REST Handlers Bypass Policy Engine
**Risk**: 18 typed REST handlers pass `nil` for `CheckPolicy` in the
`engine.Request`, skipping service-level policy evaluation. The generic
`/v1/engine/request` endpoint correctly passes a `policyChecker`. Since
engines #54 and #58 default to allow when no policy matches, typed routes
are effectively unprotected by policy.
**Root cause**: Typed handlers were modeled after admin-only operations
(which don't need policy) but applied to user-accessible operations.
**Fix**: Extract the policy checker construction from
`handleEngineRequest` into a shared helper:
```go
func (s *Server) newPolicyChecker(info *CallerInfo) engine.PolicyChecker {
return func(resource, action string) (string, bool) {
effect, matched, err := s.policy.Check(
info.Username, info.Roles, resource, action,
)
if err != nil || !matched {
return "deny", false
}
return effect, matched
}
}
```
Then in each typed handler, set `CheckPolicy` on the request:
```go
req := &engine.Request{
Operation: "get-cert",
Data: data,
CallerInfo: callerInfo,
CheckPolicy: s.newPolicyChecker(callerInfo),
}
```
**18 handlers to update**:
| Handler | Operation |
|---------|-----------|
| `handleGetCert` | `get-cert` |
| `handleRevokeCert` | `revoke-cert` |
| `handleDeleteCert` | `delete-cert` |
| `handleSSHCASignHost` | `sign-host` |
| `handleSSHCASignUser` | `sign-user` |
| `handleSSHCAGetProfile` | `get-profile` |
| `handleSSHCAListProfiles` | `list-profiles` |
| `handleSSHCADeleteProfile` | `delete-profile` |
| `handleSSHCAGetCert` | `get-cert` |
| `handleSSHCAListCerts` | `list-certs` |
| `handleSSHCARevokeCert` | `revoke-cert` |
| `handleSSHCADeleteCert` | `delete-cert` |
| `handleUserRegister` | `register` |
| `handleUserProvision` | `provision` |
| `handleUserListUsers` | `list-users` |
| `handleUserGetPublicKey` | `get-public-key` |
| `handleUserDeleteUser` | `delete-user` |
| `handleUserDecrypt` | `decrypt` |
Note: `handleUserEncrypt` already passes a policy checker — verify it
uses the same shared helper after refactoring. Admin-only handlers
(behind `requireAdmin` wrapper) do not need a policy checker since admin
bypasses policy.
| File | Change |
|------|--------|
| `internal/server/routes.go` | Add `newPolicyChecker`, pass to all 18 typed handlers |
| `internal/server/server_test.go` | Add test: policy-denied user is rejected by typed route |
### Verification (Work Item 4, all findings)
```bash
go test ./internal/engine/ca/
go test ./internal/engine/user/
go test ./internal/server/
go vet ./...
```
---
## Implementation Order
```
1. #68 JSON injection (standalone, ship immediately)
2. #48 Path traversal (standalone, blocks safe engine operation)
3. #39 Barrier TOCTOU race ─┐
#40 ReWrapKeys crash safety ┘ (coupled, requires careful testing)
4. #49 CA TTL enforcement ─┐
#61 ECDH zeroization │
#62 User policy path │ (independent fixes, parallelizable)
#69 Policy bypass ─┘
```
Items 1 and 2 have no dependencies and can be done in parallel by
different engineers.
Items 3 and 4 can also be done in parallel since they touch different
subsystems (barrier/seal vs engines/server).
---
## Post-Remediation
After all eight findings are resolved:
1. **Update AUDIT.md** — mark #39, #40, #48, #49, #61, #62, #68, #69 as
RESOLVED with resolution summaries.
2. **Run the full pipeline**: `make all` (vet, lint, test, build).
3. **Run race detector**: `go test -race ./...`
4. **Address related medium findings** that interact with these fixes:
- #54 (SSH CA default-allow) and #58 (transit default-allow) — once
#69 is fixed, the typed handlers will pass policy checkers to the
engines, but the engines still default-allow when `CheckPolicy`
returns no match. Consider changing the engine-level default to deny
for non-admin callers.
- #72 (policy ID path traversal) — already covered by #48's
`ValidateName` fix on `CreateRule`.