Fix ECDH zeroization, add audit logging, and remediate high findings

- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored privBytes instead of calling Bytes() (which returns a copy). New state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:04:39 -07:00
parent b33d1f99a0
commit 5c5d7e184e
24 changed files with 1699 additions and 72 deletions
--- a/engines/auditlog.md
+++ b/engines/auditlog.md
@@ -0,0 +1,513 @@
+# Audit Logging Design
+
+## Overview
+
+Metacrypt is a cryptographic service for a homelab/personal infrastructure
+platform. Audit logging gives the operator visibility into what happened,
+when, and by whom — essential for a service that issues certificates, signs
+SSH keys, and manages encryption keys, even at homelab scale.
+
+The design prioritizes simplicity and operational clarity over enterprise
+features. There is one operator. There is no SIEM. The audit log should be
+a structured, append-only file that can be read with `jq`, followed with
+`tail -f` (or `journalctl` when written to stdout), and rotated with
+`logrotate`. It should not require a database, a separate service, or
+additional infrastructure.
+
+## Goals
+
+1. **Record all security-relevant operations** — who did what, when, and
+   whether it succeeded.
+2. **Separate audit events from operational logs** — operational logs
+   (`slog.Info`) are for debugging; audit events are for accountability.
+3. **Zero additional dependencies** — use Go's `log/slog` with a dedicated
+   handler writing to a file or stdout.
+4. **No performance overhead that matters at homelab scale** — synchronous
+   writes are fine. This is not a high-throughput system.
+5. **Queryable with standard tools** — one JSON object per line, greppable,
+   `jq`-friendly.
+
+## Non-Goals
+
+- Tamper-evident chaining (hash chains, Merkle trees). The operator has
+  root access to the machine; tamper evidence against the operator is
+  theatre. If the threat model changes, this can be added later.
+- Remote log shipping. If needed, `journalctl` or `filebeat` can ship
+  the file externally.
+- Log aggregation across services. Each Metacircular service logs
+  independently.
+- Structured querying (SQL, full-text search). `jq` and `grep` are
+  sufficient.
+
+## Event Model
+
+Every audit event is a single JSON line with these fields:
+
+```json
+{
+  "time":      "2026-03-17T04:15:42.577Z",
+  "level":     "AUDIT",
+  "msg":       "operation completed",
+  "caller":    "kyle",
+  "roles":     ["admin"],
+  "operation": "issue",
+  "engine":    "ca",
+  "mount":     "pki",
+  "resource":  "ca/pki/id/example.com",
+  "outcome":   "success",
+  "detail":    {"serial": "01:02:03", "issuer": "default", "cn": "example.com"}
+}
+```
+
+### Required Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `time` | RFC 3339 | When the event occurred |
+| `level` | string | Always `"AUDIT"` — distinguishes from operational logs |
+| `msg` | string | Human-readable summary |
+| `caller` | string | MCIAS username; `"anonymous"` for unauthenticated API ops, `"operator"` for unauthenticated seal-layer ops (e.g., `unseal`) |
+| `operation` | string | Engine operation name (e.g., `issue`, `sign-user`, `encrypt`) |
+| `outcome` | string | `"success"`, `"denied"`, or `"error"` |
+
+### Optional Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `roles` | []string | Caller's MCIAS roles |
+| `engine` | string | Engine type (`ca`, `sshca`, `transit`, `user`) |
+| `mount` | string | Mount name |
+| `resource` | string | Policy resource path evaluated |
+| `detail` | object | Operation-specific metadata (see below) |
+| `error` | string | Error message on `"error"` or `"denied"` outcomes |
+
+### Detail Fields by Operation Category
+
+**Certificate operations** (CA):
+- `serial`, `issuer`, `cn`, `profile`, `ttl`
+
+**SSH CA operations**:
+- `serial`, `cert_type` (`user`/`host`), `principals`, `profile`, `key_id`
+
+**Transit operations**:
+- `key` (key name), `key_version`, `batch_size` (for batch ops)
+
+**User E2E operations**:
+- `recipients` (list), `sender`
+
+**Policy operations**:
+- `rule_id`, `effect`
+
+**System operations** (seal/unseal/init):
+- No detail fields; the operation name is sufficient.
+
+### What NOT to Log
+
+- Plaintext, ciphertext, signatures, HMACs, envelopes, or any
+  cryptographic material.
+- Private keys, public keys, or key bytes.
+- Passwords, tokens, or credentials.
+- Full request/response bodies.
+
+The audit log records **what happened**, not **what the data was**.
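+
+One way to enforce this rule mechanically is to filter `detail` through an
+allowlist before it reaches the logger. The sketch below is an assumption,
+not part of the design: `allowedDetailKeys` and `sanitizeDetail` are
+hypothetical names, and the key set is simply the detail fields listed in
+this document.
+
+```go
+package main
+
+import "fmt"
+
+// allowedDetailKeys lists the detail fields named in this document.
+// The allowlist itself is an illustrative assumption.
+var allowedDetailKeys = map[string]bool{
+    "serial": true, "issuer": true, "cn": true, "profile": true, "ttl": true,
+    "cert_type": true, "principals": true, "key_id": true,
+    "key": true, "key_version": true, "batch_size": true,
+    "recipients": true, "sender": true,
+    "rule_id": true, "effect": true,
+}
+
+// sanitizeDetail (hypothetical helper) returns a copy of detail that
+// contains only allowlisted keys, so payload fields cannot leak into the log.
+func sanitizeDetail(detail map[string]interface{}) map[string]interface{} {
+    out := make(map[string]interface{}, len(detail))
+    for k, v := range detail {
+        if allowedDetailKeys[k] {
+            out[k] = v
+        }
+    }
+    return out
+}
+
+func main() {
+    d := sanitizeDetail(map[string]interface{}{
+        "serial":    "01:02:03",
+        "plaintext": "secret", // dropped: not an allowlisted key
+    })
+    fmt.Println(d) // prints map[serial:01:02:03]
+}
+```
+
+An allowlist fails closed: a new detail field has to be added deliberately
+before it can appear in the log.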
+
+## Architecture
+
+### Audit Logger
+
+A thin wrapper around `slog.Logger` with a dedicated handler:
+
+```go
+// Package audit provides structured audit event logging.
+package audit
+
+import (
+    "context"
+    "log/slog"
+)
+
+// Logger writes structured audit events.
+type Logger struct {
+    logger *slog.Logger
+}
+
+// New creates an audit logger that writes to the given handler.
+func New(h slog.Handler) *Logger {
+    return &Logger{logger: slog.New(h)}
+}
+
+// Event represents a single audit event.
+type Event struct {
+    Caller    string
+    Roles     []string
+    Operation string
+    Engine    string
+    Mount     string
+    Resource  string
+    Outcome   string // "success", "denied", "error"
+    Error     string
+    Detail    map[string]interface{}
+}
+
+// Log writes an audit event.
+func (l *Logger) Log(ctx context.Context, e Event) {
+    attrs := []slog.Attr{
+        slog.String("caller", e.Caller),
+        slog.String("operation", e.Operation),
+        slog.String("outcome", e.Outcome),
+    }
+    if len(e.Roles) > 0 {
+        attrs = append(attrs, slog.Any("roles", e.Roles))
+    }
+    if e.Engine != "" {
+        attrs = append(attrs, slog.String("engine", e.Engine))
+    }
+    if e.Mount != "" {
+        attrs = append(attrs, slog.String("mount", e.Mount))
+    }
+    if e.Resource != "" {
+        attrs = append(attrs, slog.String("resource", e.Resource))
+    }
+    if e.Error != "" {
+        attrs = append(attrs, slog.String("error", e.Error))
+    }
+    if len(e.Detail) > 0 {
+        attrs = append(attrs, slog.Any("detail", e.Detail))
+    }
+
+    // Use a custom level that sorts above Error so level filtering never
+    // drops it. slog renders it as "ERROR+4" unless the handler's
+    // ReplaceAttr option relabels it "AUDIT".
+    l.logger.LogAttrs(ctx, LevelAudit, "operation completed", attrs...)
+}
+
+// LevelAudit is a custom slog level for audit events.
+const LevelAudit = slog.Level(12) // above Error (8); Warn is 4
+```
+
+The custom level ensures audit events are never suppressed by log level
+filtering (operators may set `level = "warn"` to quiet debug noise, but
+audit events must always be emitted).
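+
+By default `slog` prints a level of 12 as `"ERROR+4"`, so producing the
+`"AUDIT"` label shown in the event model needs a `ReplaceAttr` hook on the
+handler. The following is a minimal sketch of such a handler;
+`newAuditHandler` is a hypothetical name, and setting the handler's minimum
+level to `LevelAudit` is an assumption about how a dedicated audit handler
+might be configured.
+
+```go
+package main
+
+import (
+    "context"
+    "fmt"
+    "io"
+    "log/slog"
+    "strings"
+)
+
+// LevelAudit mirrors the custom level from the design above.
+const LevelAudit = slog.Level(12)
+
+// newAuditHandler (hypothetical name) returns a JSON handler that renders
+// LevelAudit as "AUDIT" instead of slog's default "ERROR+4".
+func newAuditHandler(w io.Writer) slog.Handler {
+    return slog.NewJSONHandler(w, &slog.HandlerOptions{
+        // Assumption: the handler is dedicated to audit events, so
+        // anything below LevelAudit is dropped.
+        Level: LevelAudit,
+        ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
+            // Only rewrite the built-in top-level "level" attribute.
+            if len(groups) == 0 && a.Key == slog.LevelKey {
+                if lvl, ok := a.Value.Any().(slog.Level); ok && lvl == LevelAudit {
+                    a.Value = slog.StringValue("AUDIT")
+                }
+            }
+            return a
+        },
+    })
+}
+
+func main() {
+    var out strings.Builder // stands in for the audit file or os.Stdout
+    logger := slog.New(newAuditHandler(&out))
+    logger.LogAttrs(context.Background(), LevelAudit, "operation completed",
+        slog.String("caller", "kyle"),
+        slog.String("operation", "issue"),
+        slog.String("outcome", "success"),
+    )
+    fmt.Print(out.String()) // one JSON line whose "level" is "AUDIT"
+}
+```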
+
+### Output Configuration
+
+Two modes, controlled by a config option:
+
+```toml
+[audit]
+# "file" writes to a dedicated audit log file.
+# "stdout" writes to stdout alongside operational logs (for journalctl).
+# Empty string disables audit logging.
+mode = "file"
+path = "/srv/metacrypt/audit.log"
+```
+
+**File mode**: Opens the file append-only with `0600` permissions. Uses
+`slog.NewJSONHandler` writing to the file. The logger keeps one file
+descriptor open and does not re-open the file on its own, so rotation
+must truncate the file in place (logrotate's `copytruncate`) rather than
+rename it. Because the file is opened in append mode, writes after a
+truncation land at the new end of the file. Go's `slog.JSONHandler` does
+not buffer: each event is issued as a single `Write` call.
+
+**Stdout mode**: Uses `slog.NewJSONHandler` writing to `os.Stdout`. Events
+are interleaved with operational logs but distinguishable by the `"AUDIT"`
+level. Suitable for systemd/journalctl capture where all output goes to
+the journal.
+
+**Disabled**: No audit logger is created. The `Logger` is nil-safe — all
+methods are no-ops on a nil receiver.
+
+```go
+func (l *Logger) Log(ctx context.Context, e Event) {
+    if l == nil {
+        return
+    }
+    // ...
+}
+```
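+
+The three modes could map onto an `io.Writer` roughly as follows. This is a
+sketch under stated assumptions: `openAuditWriter` is a hypothetical helper,
+and the `AuditConfig` here is trimmed to the two fields it needs. A nil
+writer signals the disabled mode, in which case no `Logger` is created.
+
+```go
+package main
+
+import (
+    "fmt"
+    "io"
+    "os"
+)
+
+// AuditConfig is a trimmed copy of the config struct for this sketch.
+type AuditConfig struct {
+    Mode string // "file", "stdout", or "" (disabled)
+    Path string // required when Mode is "file"
+}
+
+// openAuditWriter (hypothetical helper) returns the destination for audit
+// events, or a nil writer when audit logging is disabled.
+func openAuditWriter(cfg AuditConfig) (io.Writer, error) {
+    switch cfg.Mode {
+    case "":
+        return nil, nil
+    case "stdout":
+        return os.Stdout, nil
+    case "file":
+        if cfg.Path == "" {
+            return nil, fmt.Errorf("audit: path is required when mode is %q", cfg.Mode)
+        }
+        // O_APPEND keeps writes at the end of the file even after an
+        // in-place truncation; 0600 restricts access to the service user.
+        f, err := os.OpenFile(cfg.Path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
+        if err != nil {
+            return nil, fmt.Errorf("audit: open %s: %w", cfg.Path, err)
+        }
+        return f, nil
+    default:
+        return nil, fmt.Errorf("audit: unknown mode %q", cfg.Mode)
+    }
+}
+
+func main() {
+    w, err := openAuditWriter(AuditConfig{Mode: "stdout"})
+    if err != nil {
+        fmt.Fprintln(os.Stderr, err)
+        os.Exit(1)
+    }
+    if w == nil {
+        return // disabled: the caller would keep a nil *audit.Logger
+    }
+    fmt.Fprintln(w, `{"msg":"audit writer ready"}`)
+}
+```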
+
+### Integration Points
+
+The audit logger is created at startup and injected into the components
+that need it:
+
+```
+cmd/metacrypt/server.go
+  └── audit.New(handler)
+        ├── server.Server        (REST handlers)
+        ├── grpcserver.GRPCServer (gRPC interceptor)
+        ├── seal.Manager         (seal/unseal/init)
+        └── policy.Engine        (rule create/delete)
+```
+
+Engine operations are logged at the **server layer** (REST handlers and
+gRPC interceptors), not inside the engines themselves. This keeps the
+engines focused on business logic and avoids threading the audit logger
+through every engine method.
+
+### Instrumentation
+
+#### REST API (`internal/server/`)
+
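+Instrument `handleEngineRequest` and every typed handler that mutates
+state. The audit event is emitted **after** the operation completes
+(success or failure). The example uses `get-cert`, a read operation that
+is only audited when `include_reads` is enabled; mutating handlers follow
+the same pattern: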
+
+```go
+func (s *Server) handleGetCert(w http.ResponseWriter, r *http.Request) {
+    // ... existing handler logic ...
+
+    s.audit.Log(r.Context(), audit.Event{
+        Caller:    info.Username,
+        Roles:     info.Roles,
+        Operation: "get-cert",
+        Engine:    "ca",
+        Mount:     mountName,
+        Outcome:   "success",
+        Detail:    map[string]interface{}{"serial": serial},
+    })
+}
+```
+
+On error:
+
+```go
+s.audit.Log(r.Context(), audit.Event{
+    Caller:    info.Username,
+    Roles:     info.Roles,
+    Operation: "get-cert",
+    Engine:    "ca",
+    Mount:     mountName,
+    Outcome:   "error",
+    Error:     err.Error(),
+})
+```
+
+To avoid duplicating this in every handler, use a helper:
+
+```go
+func (s *Server) auditEngineOp(r *http.Request, info *auth.TokenInfo,
+    op, engineType, mount, outcome string, detail map[string]interface{}, err error) {
+    e := audit.Event{
+        Caller:    "anonymous", // replaced below when the caller is authenticated
+        Operation: op,
+        Engine:    engineType,
+        Mount:     mount,
+        Outcome:   outcome,
+        Detail:    detail,
+    }
+    // info is nil for unauthenticated requests; avoid a nil dereference.
+    if info != nil {
+        e.Caller = info.Username
+        e.Roles = info.Roles
+    }
+    if err != nil {
+        e.Error = err.Error()
+    }
+    s.audit.Log(r.Context(), e)
+}
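+
+The helper takes `outcome` as a parameter, so each call site has to decide
+between `"success"`, `"denied"`, and `"error"`. One way to keep that
+decision consistent is to derive it from the error. The sketch below
+assumes a sentinel error for policy denials; `ErrDenied` and `outcomeFor`
+are hypothetical names, not existing Metacrypt identifiers.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// ErrDenied is a hypothetical sentinel returned when policy rejects a request.
+var ErrDenied = errors.New("policy: access denied")
+
+// outcomeFor maps an operation's error to one of the three audit outcomes.
+func outcomeFor(err error) string {
+    switch {
+    case err == nil:
+        return "success"
+    case errors.Is(err, ErrDenied):
+        return "denied"
+    default:
+        return "error"
+    }
+}
+
+func main() {
+    fmt.Println(outcomeFor(nil))                                // success
+    fmt.Println(outcomeFor(fmt.Errorf("issue: %w", ErrDenied))) // denied
+    fmt.Println(outcomeFor(errors.New("barrier: sealed")))      // error
+}
+```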
+
+#### gRPC API (`internal/grpcserver/`)
+
+Add an audit interceptor that fires after each RPC completes. This is
+cleaner than instrumenting every handler individually:
+
+```go
+func (g *GRPCServer) auditInterceptor(
+    ctx context.Context,
+    req interface{},
+    info *grpc.UnaryServerInfo,
+    handler grpc.UnaryHandler,
+) (interface{}, error) {
+    resp, err := handler(ctx, req)
+
+    // Extract caller info from context (set by auth interceptor).
+    caller := callerFromContext(ctx)
+
+    outcome := "success"
+    if err != nil {
+        outcome = "error"
+    }
+
+    g.audit.Log(ctx, audit.Event{
+        Caller:    caller.Username,
+        Roles:     caller.Roles,
+        Operation: path.Base(info.FullMethod), // e.g., "IssueCert"
+        Resource:  info.FullMethod,
+        Outcome:   outcome,
+        Error:     errString(err),
+    })
+
+    return resp, err
+}
+```
+
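+Register this interceptor **after** the auth interceptor in the chain so
+that caller info is available. Two consequences follow from that
+ordering. Requests rejected by the auth interceptor never reach this
+interceptor, so authentication failures must be audited in the auth
+interceptor itself. And because the interceptor sees every RPC, it must
+skip read-only methods unless `include_reads` is set.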
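+
+The interceptor above relies on `errString`, which the snippet does not
+define, and on `path.Base` to shorten the method name. A minimal sketch of
+both follows; `operationName` is a hypothetical wrapper, and the service
+name in the example is made up for illustration.
+
+```go
+package main
+
+import (
+    "fmt"
+    "path"
+)
+
+// errString returns err's message, or "" for a nil error, so the audit
+// event's Error field stays empty on success.
+func errString(err error) string {
+    if err == nil {
+        return ""
+    }
+    return err.Error()
+}
+
+// operationName (hypothetical wrapper) extracts the method from a gRPC
+// full method string of the form "/package.Service/Method".
+func operationName(fullMethod string) string {
+    return path.Base(fullMethod)
+}
+
+func main() {
+    // The service name below is illustrative only.
+    fmt.Println(operationName("/metacrypt.v1.CAService/IssueCert")) // IssueCert
+    fmt.Printf("%q\n", errString(nil))                              // ""
+}
+```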
+
+#### Seal/Unseal (`internal/seal/`)
+
+Instrument `Init`, `Unseal`, `Seal`, and `RotateMEK`:
+
+```go
+// In Manager.Unseal, after success:
+m.audit.Log(ctx, audit.Event{
+    Caller:    "operator", // unseal is not authenticated
+    Operation: "unseal",
+    Outcome:   "success",
+})
+
+// On failure:
+m.audit.Log(ctx, audit.Event{
+    Caller:    "operator",
+    Operation: "unseal",
+    Outcome:   "denied",
+    Error:     "invalid password",
+})
+```
+
+#### Policy (`internal/policy/`)
+
+Instrument `CreateRule` and `DeleteRule`:
+
+```go
+// In Engine.CreateRule, after success:
+e.audit.Log(ctx, audit.Event{
+    Caller:    callerUsername, // passed from the handler
+    Operation: "create-policy",
+    Outcome:   "success",
+    Detail:    map[string]interface{}{"rule_id": rule.ID, "effect": rule.Effect},
+})
+```
+
+### Operations to Audit
+
+| Category | Operations | Outcome on deny |
+|----------|------------|-----------------|
+| System | `init`, `unseal`, `seal`, `rotate-mek`, `rotate-key`, `migrate` | `denied` or `error` |
+| CA | `import-root`, `create-issuer`, `delete-issuer`, `issue`, `sign-csr`, `renew`, `revoke-cert`, `delete-cert` | `denied` |
+| SSH CA | `sign-host`, `sign-user`, `create-profile`, `update-profile`, `delete-profile`, `revoke-cert`, `delete-cert` | `denied` |
+| Transit | `create-key`, `delete-key`, `rotate-key`, `update-key-config`, `trim-key`, `encrypt`, `decrypt`, `rewrap`, `sign`, `verify`, `hmac` | `denied` |
+| User | `register`, `provision`, `encrypt`, `decrypt`, `re-encrypt`, `rotate-key`, `delete-user` | `denied` |
+| Policy | `create-policy`, `delete-policy` | N/A (admin-only) |
+| Auth | `login` (success and failure) | `denied` |
+
+**Read-only operations** (`get-cert`, `list-certs`, `get-profile`,
+`list-profiles`, `get-key`, `list-keys`, `list-users`, `get-public-key`,
+`status`) are **not audited** by default. They generate operational log
+entries via the existing HTTP/gRPC logging middleware but do not produce
+audit events. This keeps the audit log focused on state-changing operations.
+
+If the operator wants read auditing, a config flag can enable it:
+
+```toml
+[audit]
+include_reads = false  # default
+```
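+
+A sketch of how the flag might gate event emission, using the read-only
+operation names listed above. `readOps` and `shouldAudit` are hypothetical
+names; where the check lives (helper, interceptor, or logger) is left open.
+
+```go
+package main
+
+import "fmt"
+
+// readOps is the set of read-only operations listed in this section.
+var readOps = map[string]bool{
+    "get-cert": true, "list-certs": true,
+    "get-profile": true, "list-profiles": true,
+    "get-key": true, "list-keys": true,
+    "list-users": true, "get-public-key": true,
+    "status": true,
+}
+
+// shouldAudit reports whether an operation produces an audit event.
+// State-changing operations always do; reads only when includeReads is set.
+func shouldAudit(op string, includeReads bool) bool {
+    return includeReads || !readOps[op]
+}
+
+func main() {
+    fmt.Println(shouldAudit("issue", false))    // true
+    fmt.Println(shouldAudit("get-cert", false)) // false
+    fmt.Println(shouldAudit("get-cert", true))  // true
+}
+```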
+
+## File Layout
+
+```
+internal/
+  audit/
+    audit.go          # Logger, Event, LevelAudit
+    audit_test.go     # Tests
+```
+
+One source file, two types (`Logger` and `Event`), no interfaces. The
+audit logger is a concrete struct passed by pointer. Nil-safe for
+disabled mode.
+
+## Configuration
+
+Add to `config.go`:
+
+```go
+type AuditConfig struct {
+    Mode         string `toml:"mode"`          // "file", "stdout", ""
+    Path         string `toml:"path"`          // file path (mode=file)
+    IncludeReads bool   `toml:"include_reads"` // audit read operations
+}
+```
+
+Add to example config:
+
+```toml
+[audit]
+mode = "file"
+path = "/srv/metacrypt/audit.log"
+include_reads = false
+```
+
+## Implementation Steps
+
+1. **Create `internal/audit/audit.go`** — `Logger`, `Event`, `LevelAudit`,
+   `New(handler)`, nil-safe `Log` method.
+
+2. **Add `AuditConfig` to config** — mode, path, include_reads. Validate
+   that `path` is set when `mode = "file"`.
+
+3. **Create audit logger in `cmd/metacrypt/server.go`** — based on config,
+   open file or use stdout. Pass to Server, GRPCServer, SealManager,
+   PolicyEngine.
+
+4. **Add `audit *audit.Logger` field** to `Server`, `GRPCServer`,
+   `seal.Manager`, `policy.Engine`. Update constructors.
+
+5. **Instrument REST handlers** — add `auditEngineOp` helper to `Server`.
+   Call after every mutating operation in typed handlers and
+   `handleEngineRequest`.
+
+6. **Instrument gRPC** — add audit interceptor to the interceptor chain.
+
+7. **Instrument seal/unseal** — emit events in `Init`, `Unseal`, `Seal`,
+   `RotateMEK`.
+
+8. **Instrument policy** — emit events in `CreateRule`, `DeleteRule`.
+
+9. **Instrument login** — emit events in the auth login handler (both
+   REST and gRPC).
+
+10. **Update ARCHITECTURE.md** — document audit logging in the Security
+    Model section. Remove from Future Work.
+
+11. **Update example configs** — add `[audit]` section.
+
+12. **Add tests** — verify events are emitted for success, denied, and
+    error outcomes. Verify nil logger is safe. Verify read operations are
+    excluded by default.
+
+## Querying the Audit Log
+
+```bash
+# All events for a user:
+jq 'select(.caller == "kyle")' /srv/metacrypt/audit.log
+
+# All certificate issuances:
+jq 'select(.operation == "issue")' /srv/metacrypt/audit.log
+
+# All denied operations:
+jq 'select(.outcome == "denied")' /srv/metacrypt/audit.log
+
+# All SSH CA events since a given time (string comparison works only
+# while every timestamp uses the same UTC offset):
+jq 'select(.engine == "sshca" and .time > "2026-03-17T03:00:00Z")' /srv/metacrypt/audit.log
+
+# Count operations by type:
+jq -r '.operation' /srv/metacrypt/audit.log | sort | uniq -c | sort -rn
+
+# Failed unseal attempts:
+jq 'select(.operation == "unseal" and .outcome == "denied")' /srv/metacrypt/audit.log
+```
+
+## Rotation
+
+For file mode, use logrotate:
+
+```
+/srv/metacrypt/audit.log {
+    daily
+    rotate 90
+    compress
+    delaycompress
+    missingok
+    notifempty
+    copytruncate
+}
+```
+
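+`copytruncate` avoids the need for a signal-based reopen mechanism. The
+Go `slog.JSONHandler` does not buffer writes, so nothing is held back in
+the process, but events written in the short window between logrotate's
+copy and truncate steps can be lost. At homelab write rates that window
+is negligible.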
+
+At homelab scale with moderate usage, 90 days of uncompressed audit logs
+will be well under 100 MB.