Files
metacrypt/AUDIT.md
Kyle Isom 5c5d7e184e Fix ECDH zeroization, add audit logging, and remediate high findings
- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored
  privBytes instead of calling Bytes() (which returns a copy). New
  state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event
  recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:04:39 -07:00

456 lines
23 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Security Audit Report
**Date**: 2026-03-16 (design review), 2026-03-17 (full system audit)
**Scope**: Full system — architecture, cryptographic core, all engine implementations, API servers (REST/gRPC), web UI, policy engine, authentication, deployment, documentation
---
## Audit History
- **2026-03-16**: Initial design review of ARCHITECTURE.md, engines/sshca.md, engines/transit.md. Issues #1#24 identified. Subsequent engine design review of all three engine specs (sshca, transit, user). Issues #25#38 identified.
- **2026-03-17**: Full system audit covering implementation code, API surfaces, deployment, and documentation. Issues #39#80 identified.
- **2026-03-16**: High finding remediation validation. #39, #40, #49, #62, #68, #69 confirmed resolved. #48, #61 confirmed still open.
- **2026-03-17**: #61 resolved — `handleRotateKey` and `handleDeleteUser` now zeroize stored `privBytes` instead of calling `Bytes()` (which returns a copy). New state in `handleRotateKey` populates `privBytes`. References to `*ecdh.PrivateKey` are nil'd for GC. Residual risk: Go's internal copy in `*ecdh.PrivateKey` cannot be zeroized.
---
## Design Review Findings (#1#38)
### ARCHITECTURE.md
#### Strengths
- Solid key hierarchy: password → Argon2id → KWK → MEK → per-entry encryption. Defense-in-depth.
- Fail-closed design with `ErrSealed` on all operations when sealed.
- Fresh nonce per write, constant-time comparisons, explicit zeroization — all correct fundamentals.
- Default-deny policy engine with priority-based rule evaluation.
- Issued leaf private keys never stored — good principle of least persistence.
#### Issues
**1. ~~TLS minimum version should be 1.3, not 1.2~~ RESOLVED**
Updated all TLS configurations from `tls.VersionTLS12` to `tls.VersionTLS13`. Removed explicit cipher suite list (TLS 1.3 manages its own).
**2. ~~Token cache TTL of 30 seconds is a revocation gap~~ ACCEPTED**
Accepted as an explicit trade-off. 30-second cache TTL balances MCIAS load against revocation latency.
**3. ~~Admin bypass in policy engine is an all-or-nothing model~~ ACCEPTED**
The all-or-nothing admin model is intentional. MCIAS admin users get full access to all engines and operations.
**4. ~~Policy rule creation is listed as both Admin-only and User-accessible~~ RESOLVED**
Removed duplicate table. gRPC `adminRequiredMethods` now includes `ListPolicies` and `GetPolicy`. All policy CRUD is admin-only across both API surfaces.
**5. ~~No integrity protection on barrier entry paths~~ RESOLVED**
Barrier now passes entry path as GCM AAD on both `Put` and `Get`. Migration tool created for existing entries.
**6. ~~Single MEK with no rotation mechanism~~ RESOLVED**
Implemented MEK rotation and per-engine DEKs with v2 ciphertext format.
**7. No audit logging**
Every certificate issuance, sign operation, and policy change should be logged with caller identity, timestamp, and operation details. Without this, incident response is blind.
**8. ~~Rate limiting is in-memory only~~ ACCEPTED**
Accepted: Argon2id cost parameters are the primary brute-force mitigation.
### engines/sshca.md
#### Strengths
- Flat CA model is correct for SSH.
- Default principal restriction — users can only sign certs for their own username.
- `max_ttl` enforced server-side.
- Key zeroization on seal, no private keys in cert records.
- RSA excluded — reduces attack surface.
- Signing profiles are the only path to critical options — good privilege separation.
- Server-side serial generation with `crypto/rand`.
#### Issues
**9. ~~User-controllable serial numbers~~ RESOLVED**
**10. No explicit extension allowlist for host certificates**
The `extensions` field for `sign-host` accepts an arbitrary map. The engine should define a default extension set and restrict to an allowlist or require admin for non-default extensions.
**11. ~~`critical_options` on user certs is a privilege escalation surface~~ RESOLVED**
**12. ~~No KRL (Key Revocation List) support~~ RESOLVED**
**13. ~~Policy resource path uses `ca/` prefix instead of `sshca/`~~ RESOLVED**
**14. No source-address restriction by default**
User certificates should ideally include `source-address` critical options. Consider a mount-level configuration for default critical options.
### engines/transit.md
#### Strengths
- Ciphertext format with version prefix enables clean key rotation.
- `exportable` and `allow_deletion` immutable after creation.
- AAD/context binding for AEAD ciphers.
- Rewrap never exposes plaintext to caller.
- XChaCha20-Poly1305 with 24-byte nonce — correct for random nonce safety.
- `trim-key` logic is safe. Batch operations hold read lock for atomicity.
- 500-item batch limit prevents resource exhaustion.
#### Issues
**15. ~~No minimum key version enforcement~~ RESOLVED**
**16. ~~Key version pruning safety check~~ RESOLVED**
**17. ~~RSA encryption without specifying padding scheme~~ RESOLVED** (RSA removed entirely)
**18. ~~HMAC keys used for `sign` operation~~ RESOLVED**
**19. ~~No batch encrypt/decrypt operations~~ RESOLVED**
**20. ~~`read` action maps to `decrypt`~~ RESOLVED** (granular actions)
**21. No rate limiting or quota on cryptographic operations**
A compromised token could issue unlimited encrypt/decrypt/sign requests.
### engines/user.md
#### Strengths
- HKDF with per-recipient random salt prevents wrapping key reuse.
- AES-256-GCM for DEK wrapping (consistent with codebase).
- ECDH key agreement with info-string binding prevents key confusion.
- Explicit zeroization of all intermediate secrets documented.
- Envelope format includes salt per-recipient.
#### Issues
**22. ~~No forward secrecy for stored data~~ RESOLVED** (per-engine DEKs)
**23. ~~Generic `POST /v1/engine/request` bypasses typed route middleware~~ RESOLVED**
**24. ~~No CSRF protection for web UI~~ RESOLVED**
**2532.** ~~Various spec issues~~ **RESOLVED** (see detailed history below)
**33. ~~Auto-provisioning creates keys for arbitrary usernames~~ RESOLVED**
**34. ~~No recipient limit on encrypt~~ RESOLVED**
**35. ~~No re-encryption support for key rotation~~ RESOLVED**
**3638.** ~~Various spec/cross-cutting issues~~ **RESOLVED**
---
## Full System Audit (2026-03-17)
**Scope**: All implementation code, deployment, and documentation.
### Cryptographic Core
#### Strengths
- AES-256-GCM with 12-byte random nonces from `crypto/rand` — correct.
- Argon2id with configurable parameters stored in `seal_config` — correct.
- Path-bound AAD in barrier — defense against ciphertext relocation.
- Per-engine DEKs with v2 ciphertext format — limits blast radius.
- Constant-time comparison via `crypto/subtle` for all secret comparisons.
#### Issues
**39. ~~TOCTOU race in barrier Seal/Unseal~~ RESOLVED**
`barrier.go`: `Get()` and `Put()` hold `RLock` (via `defer`) through the entire crypto operation including decryption/encryption. `Seal()` acquires an exclusive `Lock()`, which blocks until all RLock holders release. There is no window where a reader can use zeroized key material.
**40. ~~Crash during `ReWrapKeys` loses all barrier data~~ RESOLVED**
`seal.go`: `RotateMEK` now wraps both `ReWrapKeysTx` (re-encrypts all DEKs) and the `seal_config` update in a single SQLite transaction. A crash at any point results in full rollback or full commit — no partial state. In-memory state (`SwapMEK`) is updated only after successful commit.
**41. `loadKeys` errors silently swallowed during unseal**
`barrier.go`: If `loadKeys` fails to decrypt DEK entries (e.g., corrupt `barrier_keys` rows), errors are silently ignored and the keys map may be incomplete. Subsequent operations on engine mounts with missing DEKs will fail with confusing errors instead of failing at unseal time.
**42. No AAD binding on MEK encryption with KWK**
`seal.go`: The MEK is encrypted with the KWK (derived from password via Argon2id) using `crypto.Encrypt(kwk, mek, nil)`. There is no AAD binding this ciphertext to its purpose. An attacker who can swap `encrypted_mek` in `seal_config` could substitute a different ciphertext (though the practical impact is limited since the KWK is password-derived).
**43. Barrier `List` uses SQL LIKE with unescaped prefix**
`barrier.go`: The `List` method passes the prefix directly into a SQL `LIKE` clause without escaping `%` and `_` characters. A path containing these characters would match unintended entries.
**44. System key rotation query may miss entries**
`barrier.go`: `RotateKey` for the system key excludes all `engine/%` paths, but entries at shorter paths encrypted with the system key could be missed if they don't follow the expected naming convention.
**45. `Zeroize` loop may be optimized away by compiler**
`crypto.go`: The `Zeroize` function uses a simple `for` loop to zero memory. The Go compiler may optimize this away if the slice is not used after zeroization. Use `crypto/subtle.XORBytes` or a volatile-equivalent pattern.
**46. SQLite PRAGMAs only applied to first connection**
`db.go`: `PRAGMA journal_mode`, `foreign_keys`, and `busy_timeout` are applied once at open time but `database/sql` may open additional connections in its pool that don't receive these PRAGMAs. Use a `ConnInitHook` or `_pragma` DSN parameters.
**47. Plaintext not zeroized after re-encryption during key rotation**
`barrier.go`: During `RotateKey`, decrypted plaintext is held in a `[]byte` but not zeroized after re-encryption. This leaves plaintext in memory longer than necessary.
### Engine Implementations
#### CA (PKI) Engine
**48. ~~Path traversal via unsanitized entity names in get/update/delete operations~~ RESOLVED**
All engines now call `engine.ValidateName()` on every operation that accepts user-supplied names, not just create operations. Fixed in: CA (`handleGetChain`, `handleGetIssuer`, `handleDeleteIssuer`, `handleIssue`, `handleSignCSR`), SSH CA (`handleUpdateProfile`, `handleGetProfile`, `handleDeleteProfile`), Transit (`handleDeleteKey`, `handleGetKey`, `handleRotateKey`, `handleUpdateKeyConfig`, `handleTrimKey`, `handleGetPublicKey`), User (`handleRegister`, `handleGetPublicKey`, `handleDeleteUser`).
**49. ~~No TTL enforcement against issuer MaxTTL in issuance~~ RESOLVED**
`ca/ca.go`: Both `handleIssue` and `handleSignCSR` now use a `resolveTTL` helper that parses the issuer's `MaxTTL`, caps the requested TTL against it, and returns an error if the requested TTL exceeds the maximum. Default TTL is the issuer's MaxTTL when none is specified.
**50. Non-admin users can override key usages**
`ca/ca.go`: The `key_usages` and `ext_key_usages` fields are accepted from non-admin users. A user could request a certificate with `cert sign` or `crl sign` key usage, potentially creating an intermediate CA certificate.
**51. Certificate renewal does not revoke original**
`ca/ca.go`: `handleRenew` creates a new certificate but does not revoke the original. This creates duplicate valid certificates for the same identity, which complicates revocation and weakens the security model.
**52. Leaf private key in API response not zeroized**
`ca/ca.go`: After marshalling the leaf private key to PEM for the API response, the in-memory key material is not zeroized. The key persists in memory until garbage collected.
#### SSH CA Engine
**53. HandleRequest uses exclusive write lock for all operations**
`sshca/sshca.go`: All operations (including reads like `get-cert`, `list-certs`, `get-profile`) acquire a write lock (`mu.Lock()`), serializing the entire engine. Read operations should use `mu.RLock()`.
**54. Host signing is default-allow without policy rules**
`sshca/sshca.go`: When no policy rules match a host signing request, the engine allows it by default. This contradicts the default-deny principle established in the engineering standards and ARCHITECTURE.md.
**55. SSH certificate serial collision risk**
`sshca/sshca.go`: Random `uint64` serials have a birthday collision probability of ~50% at ~4 billion certificates. While far beyond typical scale, the engine should detect and retry on collision.
**56. KRL is not signed**
`sshca/sshca.go`: The generated KRL is not cryptographically signed. An attacker who can intercept the KRL distribution (e.g., MITM on the `GET /v1/sshca/{mount}/krl` endpoint, though TLS mitigates this) could serve a truncated KRL that omits revoked certificates.
**57. PEM key bytes not zeroized after parsing in Unseal**
`sshca/sshca.go`: After reading the CA private key PEM from the barrier and parsing it, the raw PEM bytes are not zeroized.
#### Transit Engine
**58. Default-allow for non-admin users contradicts default-deny**
`transit/transit.go`: Similar to #54 — when no policy rules match a transit operation, the engine allows it. This should default to deny.
**59. Negative ciphertext version not rejected**
`transit/transit.go`: `parseVersionedData` does not reject negative version numbers. A crafted ciphertext with a negative version could cause unexpected behavior in version lookups.
**60. ECDSA big.Int internals not fully zeroized**
`transit/transit.go`: The local `zeroizeKey` clears `D` on ECDSA keys but not `PublicKey.X/Y`. While the public key is not secret, the `big.Int` internal representation may retain data from the private key computation.
#### User E2E Encryption Engine
**61. ~~ECDH private key zeroization is ineffective~~ RESOLVED**
`user/user.go`: `handleRotateKey` and `handleDeleteUser` now zeroize the stored `privBytes` field (retained at key creation time) instead of calling `Bytes()` which returns a new copy. The `privKey` and `privBytes` fields are nil'd after zeroization to allow GC of the `*ecdh.PrivateKey` object. Note: Go's `*ecdh.PrivateKey` internal bytes cannot be zeroized through the public API — this is a known limitation of Go's crypto library. The stored `privBytes` copy is the best-effort mitigation.
**62. ~~Policy resource path uses mountPath instead of mount name~~ RESOLVED**
`user/user.go`: A `mountName()` helper extracts the mount name from the full mount path (e.g., `"engine/user/mymount/"``"mymount"`). Policy resource paths are correctly constructed as `"user/{mountname}/recipient/{recipient}"`.
**63. No role checks on decrypt, re-encrypt, and rotate-key**
`user/user.go`: The `handleDecrypt`, `handleReEncrypt`, and `handleRotateKey` operations have no role checks. A guest-role user (who should have restricted access per MCIAS role definitions) can perform these operations.
**64. Initialize does not acquire mutex**
`user/user.go`: The `Initialize` method writes to shared state without holding the mutex, creating a data race if called concurrently.
**65. handleEncrypt uses stale state after releasing lock**
`user/user.go`: After releasing the write lock during encryption, the handler continues to use pointers to user state that may have been modified by another goroutine.
**66. handleReEncrypt uses manual lock without defer**
`user/user.go`: Manual `RLock`/`Unlock` calls without `defer` — a panic between lock and unlock will leak the lock, deadlocking the engine.
**67. No sealed-state check in user HandleRequest**
`user/user.go`: Unlike other engines, the user engine's `HandleRequest` does not check if the engine is sealed. A request reaching the engine after seal but before the HTTP layer catches it could panic on nil map access.
### API Servers
#### REST API
**68. ~~JSON injection via unsanitized error messages~~ RESOLVED**
`server/routes.go`: All error responses now use `writeJSONError()` which delegates to `writeJSON()``json.NewEncoder().Encode()`, properly JSON-escaping all error message content.
**69. ~~Typed REST handlers bypass policy engine~~ RESOLVED**
`server/routes.go`: All typed REST handlers now pass a `CheckPolicy` callback via `s.newPolicyChecker(r, info)` or an inline policy checker function. This includes all SSH CA, transit, user, and CA handlers.
**70. `RenewCert` gRPC RPC has no corresponding REST route**
`server/routes.go`: The `CAService/RenewCert` gRPC RPC exists but has no REST endpoint, violating the API sync rule.
#### gRPC API
**71. `PKIService/GetCRL` missing from `sealRequiredMethods`**
`grpcserver/server.go`: The `GetCRL` RPC can be called even when the service is sealed. While this is arguably intentional (public endpoint), it is inconsistent with the interceptor design where all RPCs are gated.
#### Policy Engine
**72. Policy rule ID allows path traversal**
`policy/policy.go`: Policy rule IDs are not validated. An ID containing `/` or `..` could write to arbitrary paths in the barrier, since rules are stored at `policy/rules/{id}`.
**73. `filepath.Match` does not support `**` recursive globs**
`policy/policy.go`: Policy resource patterns use `filepath.Match`, which does not support `**` for recursive directory matching. Administrators writing rules like `engine/**/certs/*` will find they don't match as expected.
#### Authentication
**74. Token validation cache grows without bound**
`auth/auth.go`: The token cache has no size limit or eviction of expired entries beyond lazy expiry checks. Under sustained load with many unique tokens, this is an unbounded memory growth vector.
### Web UI
**75. CSRF token not bound to user session**
`webserver/csrf.go`: CSRF tokens are signed with a server-wide HMAC key but not bound to the user's session. Any valid server-generated CSRF token works for any user, reducing CSRF protection to a server-origin check rather than a session-integrity check.
**76. Login cookie missing explicit expiry**
`webserver/routes.go`: The `metacrypt_token` cookie has no `MaxAge` or `Expires`, making it a session cookie that persists until the browser is closed. Consider an explicit TTL matching the MCIAS token lifetime.
**77. Several POST handlers missing `MaxBytesReader`**
`webserver/routes.go`, `webserver/user.go`, `webserver/sshca.go`: `handlePolicyCreate`, `handlePolicyDelete`, `handleUserRegister`, `handleUserRotateKey`, SSH CA cert revoke/delete — all accept POST bodies without `MaxBytesReader`, allowing arbitrarily large request bodies.
### Deployment & Documentation
**78. `ExecReload` sends SIGHUP but no handler exists**
`deploy/systemd/metacrypt.service`, `deploy/systemd/metacrypt-web.service`: Both units define `ExecReload=/bin/kill -HUP $MAINPID`, but the Go binary does not handle SIGHUP. A `systemctl reload` would crash the process.
**79. Dockerfiles use `golang:1.23-alpine` but `go.mod` requires Go 1.25**
`Dockerfile.api`, `Dockerfile.web`: The builder stage uses Go 1.23 but the module requires Go 1.25. Builds will fail.
**80. ARCHITECTURE.md system overview says "TLS 1.2+" but code enforces TLS 1.3**
`ARCHITECTURE.md:33`: The ASCII diagram still says "TLS 1.2+" despite issue #1 being resolved in code. The diagram was not updated.
---
## Open Issues (Unresolved)
### Open — Critical
*None.*
### Open — High
*None.*
### Open — Medium
| # | Issue | Location |
|---|-------|----------|
| 7 | No audit logging for cryptographic operations | ARCHITECTURE.md |
| 10 | No extension allowlist for SSH host certificates | `sshca/sshca.go` |
| 21 | No rate limiting on transit cryptographic operations | `transit/transit.go` |
| 41 | `loadKeys` errors silently swallowed during unseal | `barrier/barrier.go` |
| 42 | No AAD binding on MEK encryption with KWK | `seal/seal.go` |
| 43 | Barrier `List` SQL LIKE with unescaped prefix | `barrier/barrier.go` |
| 46 | SQLite PRAGMAs only applied to first connection | `db/db.go` |
| 50 | Non-admin users can override key usages (cert sign, CRL sign) | `ca/ca.go` |
| 51 | Certificate renewal does not revoke original | `ca/ca.go` |
| 53 | SSH CA write-locks all operations including reads | `sshca/sshca.go` |
| 54 | SSH CA host signing is default-allow (contradicts default-deny) | `sshca/sshca.go` |
| 58 | Transit default-allow contradicts default-deny | `transit/transit.go` |
| 59 | Negative ciphertext version not rejected in transit | `transit/transit.go` |
| 63 | No role checks on user decrypt/re-encrypt/rotate | `user/user.go` |
| 64 | User engine Initialize has no mutex | `user/user.go` |
| 65 | handleEncrypt uses stale state after lock release | `user/user.go` |
| 66 | handleReEncrypt manual lock without defer (leak risk) | `user/user.go` |
| 67 | No sealed-state check in user HandleRequest | `user/user.go` |
| 70 | `RenewCert` has no REST route (API sync violation) | `server/routes.go` |
| 72 | Policy rule ID allows path traversal in barrier | `policy/policy.go` |
| 73 | `filepath.Match` does not support `**` recursive globs | `policy/policy.go` |
| 74 | Token validation cache grows without bound | `auth/auth.go` |
| 78 | systemd `ExecReload` sends SIGHUP with no handler | `deploy/systemd/` |
| 79 | Dockerfiles use Go 1.23 but module requires Go 1.25 | `Dockerfile.*` |
### Open — Low
| # | Issue | Location |
|---|-------|----------|
| 14 | No source-address restriction by default in SSH certs | `sshca/sshca.go` |
| 44 | System key rotation query may miss entries | `barrier/barrier.go` |
| 45 | `Zeroize` loop may be optimized away by compiler | `crypto/crypto.go` |
| 47 | Plaintext not zeroized after re-encryption in rotation | `barrier/barrier.go` |
| 52 | Leaf private key in API response not zeroized | `ca/ca.go` |
| 55 | SSH certificate serial collision risk at scale | `sshca/sshca.go` |
| 56 | KRL is not cryptographically signed | `sshca/sshca.go` |
| 57 | PEM key bytes not zeroized after parsing in SSH CA | `sshca/sshca.go` |
| 60 | ECDSA big.Int internals not fully zeroized | `transit/transit.go` |
| 71 | `GetCRL` missing from `sealRequiredMethods` | `grpcserver/server.go` |
| 75 | CSRF token not bound to user session | `webserver/csrf.go` |
| 76 | Login cookie missing explicit expiry | `webserver/routes.go` |
| 77 | POST handlers missing `MaxBytesReader` | `webserver/` |
| 80 | ARCHITECTURE.md diagram still says "TLS 1.2+" | `ARCHITECTURE.md` |
### Accepted
| # | Issue | Rationale |
|---|-------|-----------|
| 2 | Token cache 30s revocation gap | Trade-off: MCIAS load vs revocation latency |
| 3 | Admin all-or-nothing access | Intentional design |
| 8 | Unseal rate limit resets on restart | Argon2id is the primary mitigation |
---
## Resolved Issues (#1#38, plus #39, #40, #48, #49, #61, #62, #68, #69)
All design review findings from the 2026-03-16 audit have been resolved or accepted. See the [Audit History](#audit-history) section. The following issues were resolved:
**Critical** (all resolved): #4 (policy auth contradiction), #9 (user-controllable SSH serials), #13 (policy path collision), #37 (adminOnlyOperations name collision).
**High** (all resolved): #5 (no path AAD), #6 (single MEK), #11 (critical_options unrestricted), #12 (no KRL), #15 (no min key version), #17 (RSA padding), #22 (no per-engine DEKs), #28 (HMAC not versioned), #30 (max_key_versions unclear), #33 (auto-provision arbitrary usernames), #39 (TOCTOU race — RLock held through crypto ops), #40 (ReWrapKeys crash — atomic transaction), #48 (path traversal — ValidateName on all ops), #49 (TTL enforcement — resolveTTL helper), #61 (ECDH zeroization — use stored privBytes), #62 (policy path — mountName helper), #68 (JSON injection — writeJSONError), #69 (policy bypass — newPolicyChecker).
**Medium** (all resolved or accepted): #1, #2, #3, #8, #20, #23, #24, #25, #26, #27, #29, #31, #34.
**Low** (all resolved): #18, #19, #32, #35, #36, #38.
---
## Priority Summary
| Priority | Count | Status |
|----------|-------|--------|
| High | 0 | All resolved |
| Medium | 21 | Open |
| Low | 14 | Open |
| Accepted | 3 | Closed |
| Resolved | 46 | Closed |
**Recommendation**: All High findings are resolved. The user engine medium issues (#63#67) should be addressed as a batch since they interact with each other.