Kyle Isom 5c5d7e184e Fix ECDH zeroization, add audit logging, and remediate high findings

- Fix #61: handleRotateKey and handleDeleteUser now zeroize stored
  privBytes instead of calling Bytes() (which returns a copy). New
  state populates privBytes; old references nil'd for GC.
- Add audit logging subsystem (internal/audit) with structured event
  recording for cryptographic operations.
- Add audit log engine spec (engines/auditlog.md).
- Add ValidateName checks across all engines for path traversal (#48).
- Update AUDIT.md: all High findings resolved (0 open).
- Add REMEDIATION.md with detailed remediation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-17 14:04:39 -07:00

Audit Logging Design

Overview

Metacrypt is a cryptographic service for a homelab/personal infrastructure platform. Audit logging gives the operator visibility into what happened, when, and by whom — essential for a service that issues certificates, signs SSH keys, and manages encryption keys, even at homelab scale.

The design prioritizes simplicity and operational clarity over enterprise features. There is one operator. There is no SIEM. The audit log should be a structured, append-only stream of JSON lines that can be read with jq, followed with tail -f (or journalctl in stdout mode), and rotated with logrotate. It should not require a database, a separate service, or additional infrastructure.

Goals

Record all security-relevant operations — who did what, when, and whether it succeeded.
Separate audit events from operational logs — operational logs (slog.Info) are for debugging; audit events are for accountability.
Zero additional dependencies — use Go's log/slog with a dedicated handler writing to a file or stdout.
No performance overhead that matters at homelab scale — synchronous writes are fine. This is not a high-throughput system.
Queryable with standard tools — one JSON object per line, greppable, jq-friendly.

Non-Goals

Tamper-evident chaining (hash chains, Merkle trees). The operator has root access to the machine; tamper evidence against the operator is theatre. If the threat model changes, this can be added later.
Remote log shipping. If needed, an external shipper such as filebeat can forward the file, or the journal can be forwarded when running in stdout mode.
Log aggregation across services. Each Metacircular service logs independently.
Structured querying (SQL, full-text search). jq and grep are sufficient.

Event Model

Every audit event is a single JSON line with these fields:

{
  "time":      "2026-03-17T04:15:42.577Z",
  "level":     "AUDIT",
  "msg":       "operation completed",
  "caller":    "kyle",
  "roles":     ["admin"],
  "operation": "issue",
  "engine":    "ca",
  "mount":     "pki",
  "resource":  "ca/pki/id/example.com",
  "outcome":   "success",
  "detail":    {"serial": "01:02:03", "issuer": "default", "cn": "example.com"}
}

Required Fields

| Field | Type | Description |
|---|---|---|
| `time` | RFC 3339 | When the event occurred |
| `level` | string | Always `"AUDIT"`; distinguishes audit events from operational logs |
| `msg` | string | Human-readable summary |
| `caller` | string | MCIAS username, or `"anonymous"` for unauthenticated ops |
| `operation` | string | Engine operation name (e.g., `issue`, `sign-user`, `encrypt`) |
| `outcome` | string | `"success"`, `"denied"`, or `"error"` |

Optional Fields

| Field | Type | Description |
|---|---|---|
| `roles` | []string | Caller's MCIAS roles |
| `engine` | string | Engine type (`ca`, `sshca`, `transit`, `user`) |
| `mount` | string | Mount name |
| `resource` | string | Policy resource path evaluated |
| `detail` | object | Operation-specific metadata (see below) |
| `error` | string | Error message on `"error"` or `"denied"` outcomes |
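
The field tables above can be mirrored in a small decoder, which is handy in tests or ad-hoc tooling. The following is a minimal sketch, not part of Metacrypt: the `Record` type and `parseRecord` function are hypothetical names introduced here, and the checks only cover the required fields and the three documented outcomes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Record mirrors the audit event fields listed above. The type name and
// parseRecord are hypothetical helpers, not part of Metacrypt.
type Record struct {
	Time      time.Time      `json:"time"`
	Level     string         `json:"level"`
	Msg       string         `json:"msg"`
	Caller    string         `json:"caller"`
	Roles     []string       `json:"roles,omitempty"`
	Operation string         `json:"operation"`
	Engine    string         `json:"engine,omitempty"`
	Mount     string         `json:"mount,omitempty"`
	Resource  string         `json:"resource,omitempty"`
	Outcome   string         `json:"outcome"`
	Detail    map[string]any `json:"detail,omitempty"`
	Error     string         `json:"error,omitempty"`
}

// parseRecord decodes one log line and checks the required fields.
func parseRecord(line string) (Record, error) {
	var r Record
	if err := json.Unmarshal([]byte(line), &r); err != nil {
		return r, fmt.Errorf("decode audit line: %w", err)
	}
	switch {
	case r.Time.IsZero():
		return r, fmt.Errorf("missing time")
	case r.Level != "AUDIT":
		return r, fmt.Errorf("unexpected level %q", r.Level)
	case r.Msg == "" || r.Caller == "" || r.Operation == "":
		return r, fmt.Errorf("missing msg, caller, or operation")
	}
	switch r.Outcome {
	case "success", "denied", "error":
		return r, nil
	}
	return r, fmt.Errorf("invalid outcome %q", r.Outcome)
}

func main() {
	line := `{"time":"2026-03-17T04:15:42.577Z","level":"AUDIT","msg":"operation completed","caller":"kyle","operation":"issue","outcome":"success"}`
	r, err := parseRecord(line)
	fmt.Println(r.Caller, r.Operation, r.Outcome, err) // kyle issue success <nil>
}
```

Optional fields are left unchecked on purpose, since the tables above only require six of them.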

Detail Fields by Operation Category

Certificate operations (CA):

serial, issuer, cn, profile, ttl

SSH CA operations:

serial, cert_type (user/host), principals, profile, key_id

Transit operations:

key (key name), key_version, batch_size (for batch ops)

User E2E operations:

recipients (list), sender

Policy operations:

rule_id, effect

System operations (seal/unseal/init):

No detail fields; the operation name is sufficient.

What NOT to Log

Plaintext, ciphertext, signatures, HMACs, envelopes, or any cryptographic material.
Private keys, public keys, or key bytes.
Passwords, tokens, or credentials.
Full request/response bodies.

The audit log records what happened, not what the data was.
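
One way to enforce the rule above mechanically is an allowlist of detail keys per category, so that a handler cannot leak a field such as a plaintext by accident. This is a sketch under assumptions: `allowedDetail` and `filterDetail` are hypothetical names, and the key sets are copied from the "Detail Fields by Operation Category" list rather than from any existing code.

```go
package main

import "fmt"

// allowedDetail lists the detail keys this document names for each
// category. The map and filterDetail are hypothetical helpers.
var allowedDetail = map[string]map[string]bool{
	"ca":      {"serial": true, "issuer": true, "cn": true, "profile": true, "ttl": true},
	"sshca":   {"serial": true, "cert_type": true, "principals": true, "profile": true, "key_id": true},
	"transit": {"key": true, "key_version": true, "batch_size": true},
	"user":    {"recipients": true, "sender": true},
	"policy":  {"rule_id": true, "effect": true},
}

// filterDetail returns a copy of detail holding only the allowlisted keys
// for the category. Unknown categories keep nothing.
func filterDetail(category string, detail map[string]any) map[string]any {
	allowed := allowedDetail[category]
	out := make(map[string]any)
	for k, v := range detail {
		if allowed[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	d := filterDetail("transit", map[string]any{
		"key": "backup", "key_version": 3, "plaintext": "secret",
	})
	fmt.Println(len(d), d["key"], d["plaintext"]) // 2 backup <nil>
}
```

An allowlist fails closed: a new detail key is dropped until someone adds it deliberately.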

Architecture

Audit Logger

A thin wrapper around slog.Logger with a dedicated handler:

// Package audit provides structured audit event logging.
package audit

import (
    "context"
    "log/slog"
)

// Logger writes structured audit events.
type Logger struct {
    logger *slog.Logger
}

// New creates an audit logger that writes to the given handler.
func New(h slog.Handler) *Logger {
    return &Logger{logger: slog.New(h)}
}

// Event represents a single audit event.
type Event struct {
    Caller    string
    Roles     []string
    Operation string
    Engine    string
    Mount     string
    Resource  string
    Outcome   string // "success", "denied", "error"
    Error     string
    Detail    map[string]interface{}
}

// Log writes an audit event.
func (l *Logger) Log(ctx context.Context, e Event) {
    attrs := []slog.Attr{
        slog.String("caller", e.Caller),
        slog.String("operation", e.Operation),
        slog.String("outcome", e.Outcome),
    }
    if len(e.Roles) > 0 {
        attrs = append(attrs, slog.Any("roles", e.Roles))
    }
    if e.Engine != "" {
        attrs = append(attrs, slog.String("engine", e.Engine))
    }
    if e.Mount != "" {
        attrs = append(attrs, slog.String("mount", e.Mount))
    }
    if e.Resource != "" {
        attrs = append(attrs, slog.String("resource", e.Resource))
    }
    if e.Error != "" {
        attrs = append(attrs, slog.String("error", e.Error))
    }
    if len(e.Detail) > 0 {
        attrs = append(attrs, slog.Any("detail", e.Detail))
    }

    // LevelAudit sorts above Error, so level filtering never drops it. The
    // "AUDIT" label itself must come from the handler (e.g. a ReplaceAttr hook).
    l.logger.LogAttrs(ctx, LevelAudit, "operation completed", attrs...)
}

// LevelAudit is a custom slog level for audit events.
const LevelAudit = slog.Level(12) // above Error (8), so no standard level filter suppresses it

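Because LevelAudit (12) is higher than slog.LevelError (8), audit events are never suppressed by log level filtering: operators may set level = "warn" to quiet debug noise, and audit events are still emitted. Note that slog's built-in handlers print an unnamed level 12 as "ERROR+4"; producing the "AUDIT" label shown in the event model requires a ReplaceAttr function in the handler options.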
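
slog's built-in handlers render level 12 as "ERROR+4", so the "AUDIT" label in the event model has to be supplied by the handler. The sketch below shows one way to do that with `slog.HandlerOptions.ReplaceAttr`. The names `newAuditHandler` and `emitLevel` are hypothetical, and the buffer stands in for the real output.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// LevelAudit matches the constant defined above.
const LevelAudit = slog.Level(12)

// newAuditHandler is a hypothetical constructor: it renders LevelAudit as
// "AUDIT" and applies minLevel as the ordinary filtering threshold.
func newAuditHandler(buf *bytes.Buffer, minLevel slog.Level) slog.Handler {
	return slog.NewJSONHandler(buf, &slog.HandlerOptions{
		Level: minLevel,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			// Only rewrite the top-level built-in level attribute.
			if len(groups) == 0 && a.Key == slog.LevelKey {
				if lvl, ok := a.Value.Any().(slog.Level); ok && lvl == LevelAudit {
					a.Value = slog.StringValue("AUDIT")
				}
			}
			return a
		},
	})
}

// emitLevel logs one audit event through a handler filtered at minLevel and
// returns the "level" string found in the JSON output ("" if nothing valid
// was written).
func emitLevel(minLevel slog.Level) string {
	var buf bytes.Buffer
	logger := slog.New(newAuditHandler(&buf, minLevel))
	logger.LogAttrs(context.Background(), LevelAudit, "operation completed",
		slog.String("caller", "kyle"))

	var rec struct {
		Level string `json:"level"`
	}
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return ""
	}
	return rec.Level
}

func main() {
	// The event survives both a "warn" and an "error" threshold.
	fmt.Println(emitLevel(slog.LevelWarn), emitLevel(slog.LevelError)) // AUDIT AUDIT
}
```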

Output Configuration

Two modes, controlled by a config option:

[audit]
# "file" writes to a dedicated audit log file.
# "stdout" writes to stdout alongside operational logs (for journalctl).
# Empty string disables audit logging.
mode = "file"
path = "/srv/metacrypt/audit.log"

File mode: Opens the file with O_APPEND and 0600 permissions and writes through slog.NewJSONHandler. The logger does not re-open the file. Rotation relies on logrotate's copytruncate (see Rotation), which works because O_APPEND sends every write to the current end of the file, even after it has been truncated. slog.JSONHandler does not buffer: each event reaches the file in a single Write call.

Stdout mode: Uses slog.NewJSONHandler writing to os.Stdout. Events are interleaved with operational logs but distinguishable by the "AUDIT" level. Suitable for systemd/journalctl capture where all output goes to the journal.

Disabled: No audit logger is created. The Logger is nil-safe — all methods are no-ops on a nil receiver.

func (l *Logger) Log(ctx context.Context, e Event) {
    if l == nil {
        return
    }
    // ...
}
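
The three modes described above could be mapped to a handler roughly as follows. This is a sketch, not the actual startup code: `openAuditHandler` and `demo` are hypothetical names, and the caller is assumed to leave its `*audit.Logger` nil when the returned handler is nil instead of calling `audit.New`.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
	"os"
	"path/filepath"
)

// openAuditHandler is a hypothetical helper that maps the [audit] mode to a
// slog handler. It returns a nil handler when auditing is disabled and a
// non-nil io.Closer only in file mode.
func openAuditHandler(mode, path string) (slog.Handler, io.Closer, error) {
	switch mode {
	case "":
		return nil, nil, nil
	case "stdout":
		return slog.NewJSONHandler(os.Stdout, nil), nil, nil
	case "file":
		// O_APPEND keeps every write at the current end of the file.
		f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
		if err != nil {
			return nil, nil, fmt.Errorf("open audit log: %w", err)
		}
		return slog.NewJSONHandler(f, nil), f, nil
	default:
		return nil, nil, fmt.Errorf("unknown audit mode %q", mode)
	}
}

// demo exercises file mode against a temporary directory.
func demo() error {
	dir, err := os.MkdirTemp("", "audit")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "audit.log")
	h, c, err := openAuditHandler("file", path)
	if err != nil {
		return err
	}
	slog.New(h).Info("probe")
	if err := c.Close(); err != nil {
		return err
	}
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if info.Size() == 0 {
		return fmt.Errorf("nothing was written to %s", path)
	}
	return nil
}

func main() {
	fmt.Println(demo())
}
```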

Integration Points

The audit logger is created at startup and injected into the components that need it:

cmd/metacrypt/server.go
  └── audit.New(handler)
        ├── server.Server        (REST handlers)
        ├── grpcserver.GRPCServer (gRPC interceptor)
        ├── seal.Manager         (seal/unseal/init)
        └── policy.Engine        (rule create/delete)

Engine operations are logged at the server layer (REST handlers and gRPC interceptors), not inside the engines themselves. This keeps the engines focused on business logic and avoids threading the audit logger through every engine method.

Instrumentation

REST API (`internal/server/`)

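Instrument handleEngineRequest and every typed handler. The audit event is emitted after the operation completes (success or failure). The example below uses get-cert for brevity; as a read-only operation it would only be audited when include_reads is enabled: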

func (s *Server) handleGetCert(w http.ResponseWriter, r *http.Request) {
    // ... existing handler logic ...

    s.audit.Log(r.Context(), audit.Event{
        Caller:    info.Username,
        Roles:     info.Roles,
        Operation: "get-cert",
        Engine:    "ca",
        Mount:     mountName,
        Outcome:   "success",
        Detail:    map[string]interface{}{"serial": serial},
    })
}

On error:

s.audit.Log(r.Context(), audit.Event{
    Caller:    info.Username,
    Roles:     info.Roles,
    Operation: "get-cert",
    Engine:    "ca",
    Mount:     mountName,
    Outcome:   "error",
    Error:     err.Error(),
})

To avoid duplicating this in every handler, use a helper:

func (s *Server) auditEngineOp(r *http.Request, info *auth.TokenInfo,
    op, engineType, mount, outcome string, detail map[string]interface{}, err error) {
    e := audit.Event{
        Caller:    info.Username,
        Roles:     info.Roles,
        Operation: op,
        Engine:    engineType,
        Mount:     mount,
        Outcome:   outcome,
        Detail:    detail,
    }
    if err != nil {
        e.Error = err.Error()
    }
    s.audit.Log(r.Context(), e)
}

gRPC API (`internal/grpcserver/`)

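Add an audit interceptor that fires after each RPC completes. This is cleaner than instrumenting every handler individually. As sketched, it records every unary RPC, including read-only ones, and reports every failure as "error"; filtering reads and mapping permission failures to "denied" still has to be added to match the Operations to Audit table: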

func (g *GRPCServer) auditInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (interface{}, error) {
    resp, err := handler(ctx, req)

    // Extract caller info from context (set by auth interceptor).
    caller := callerFromContext(ctx)

    outcome := "success"
    if err != nil {
        outcome = "error"
    }

    g.audit.Log(ctx, audit.Event{
        Caller:    caller.Username,
        Roles:     caller.Roles,
        Operation: path.Base(info.FullMethod), // e.g., "IssueCert"
        Resource:  info.FullMethod,
        Outcome:   outcome,
        Error:     errString(err),
    })

    return resp, err
}
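
The interceptor above calls an `errString` helper that is not defined in this document and labels every failure "error". A minimal sketch of both pieces follows, using only the standard library. `errDenied` and `outcomeFor` are hypothetical names. In the real interceptor the denied check would more likely inspect the gRPC status code (for example `status.Code(err) == codes.PermissionDenied`) than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
)

// errDenied is a hypothetical sentinel that a policy check could wrap when
// it rejects a request.
var errDenied = errors.New("permission denied")

// errString returns err.Error(), or "" for a nil error, so the result can
// be assigned to Event.Error unconditionally.
func errString(err error) string {
	if err == nil {
		return ""
	}
	return err.Error()
}

// outcomeFor maps a handler error to one of the three audit outcomes.
func outcomeFor(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, errDenied):
		return "denied"
	default:
		return "error"
	}
}

func main() {
	wrapped := fmt.Errorf("issue cert: %w", errDenied)
	fmt.Println(outcomeFor(nil), outcomeFor(wrapped), outcomeFor(errors.New("disk full")))
	// success denied error
	fmt.Printf("%q %q\n", errString(nil), errString(wrapped))
	// "" "issue cert: permission denied"
}
```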

Seal/Unseal (`internal/seal/`)

Instrument Init, Unseal, Seal, and RotateMEK:

// In Manager.Unseal, after success:
m.audit.Log(ctx, audit.Event{
    Caller:    "operator", // unseal is not authenticated
    Operation: "unseal",
    Outcome:   "success",
})

// On failure:
m.audit.Log(ctx, audit.Event{
    Caller:    "operator",
    Operation: "unseal",
    Outcome:   "denied",
    Error:     "invalid password",
})

Policy (`internal/policy/`)

Instrument CreateRule and DeleteRule:

// In Engine.CreateRule, after success:
e.audit.Log(ctx, audit.Event{
    Caller:    callerUsername, // passed from the handler
    Operation: "create-policy",
    Outcome:   "success",
    Detail:    map[string]interface{}{"rule_id": rule.ID, "effect": rule.Effect},
})

Operations to Audit

| Category | Operations | Outcome on deny |
|---|---|---|
| System | `init`, `unseal`, `seal`, `rotate-mek`, `rotate-key`, `migrate` | `denied` or `error` |
| CA | `import-root`, `create-issuer`, `delete-issuer`, `issue`, `sign-csr`, `renew`, `revoke-cert`, `delete-cert` | `denied` |
| SSH CA | `sign-host`, `sign-user`, `create-profile`, `update-profile`, `delete-profile`, `revoke-cert`, `delete-cert` | `denied` |
| Transit | `create-key`, `delete-key`, `rotate-key`, `update-key-config`, `trim-key`, `encrypt`, `decrypt`, `rewrap`, `sign`, `verify`, `hmac` | `denied` |
| User | `register`, `provision`, `encrypt`, `decrypt`, `re-encrypt`, `rotate-key`, `delete-user` | `denied` |
| Policy | `create-policy`, `delete-policy` | N/A (admin-only) |
| Auth | `login` (success and failure) | `denied` |

Read-only operations (get-cert, list-certs, get-profile, list-profiles, get-key, list-keys, list-users, get-public-key, status) are not audited by default. They generate operational log entries via the existing HTTP/gRPC logging middleware but do not produce audit events. This keeps the audit log focused on state-changing operations.

If the operator wants read auditing, a config flag can enable it:

[audit]
include_reads = false  # default
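
The read filter can be a single lookup at the point where an event is about to be emitted. A minimal sketch, assuming the read-only operation names listed above are the complete set; `readOnlyOps` and `shouldAudit` are hypothetical names.

```go
package main

import "fmt"

// readOnlyOps holds the read-only operation names listed above.
var readOnlyOps = map[string]bool{
	"get-cert": true, "list-certs": true, "get-profile": true,
	"list-profiles": true, "get-key": true, "list-keys": true,
	"list-users": true, "get-public-key": true, "status": true,
}

// shouldAudit reports whether op produces an audit event. Operations that
// are not known to be read-only are always audited.
func shouldAudit(op string, includeReads bool) bool {
	return includeReads || !readOnlyOps[op]
}

func main() {
	fmt.Println(
		shouldAudit("issue", false),    // true
		shouldAudit("get-cert", false), // false
		shouldAudit("get-cert", true),  // true
	)
}
```

Treating unknown operation names as auditable means a newly added operation is logged until someone classifies it as a read.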

File Layout

internal/
  audit/
    audit.go          # Logger, Event, LevelAudit
    audit_test.go     # Tests

One file, two types (Logger and Event), no interfaces. The audit logger is a concrete struct passed by pointer. Nil-safe for disabled mode.

Configuration

Add to config.go:

type AuditConfig struct {
    Mode         string `toml:"mode"`          // "file", "stdout", ""
    Path         string `toml:"path"`          // file path (mode=file)
    IncludeReads bool   `toml:"include_reads"` // audit read operations
}

Add to example config:

[audit]
mode = "file"
path = "/srv/metacrypt/audit.log"
include_reads = false
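
The implementation steps call for validating that path is set when mode = "file". A sketch of such a check is below. The `Validate` method is hypothetical, and rejecting unknown modes is an added assumption, not something the design states.

```go
package main

import (
	"errors"
	"fmt"
)

// AuditConfig repeats the struct shown above so the example is self-contained.
type AuditConfig struct {
	Mode         string `toml:"mode"`          // "file", "stdout", ""
	Path         string `toml:"path"`          // file path (mode=file)
	IncludeReads bool   `toml:"include_reads"` // audit read operations
}

// Validate is a hypothetical method that checks the mode/path combination.
func (c AuditConfig) Validate() error {
	switch c.Mode {
	case "", "stdout":
		return nil
	case "file":
		if c.Path == "" {
			return errors.New(`audit: path is required when mode = "file"`)
		}
		return nil
	default:
		return fmt.Errorf("audit: unknown mode %q", c.Mode)
	}
}

func main() {
	fmt.Println(AuditConfig{Mode: "file"}.Validate())
	fmt.Println(AuditConfig{Mode: "file", Path: "/srv/metacrypt/audit.log"}.Validate())
}
```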

Implementation Steps

1. Create internal/audit/audit.go: Logger, Event, LevelAudit, New(handler), nil-safe Log method.
2. Add AuditConfig to config: mode, path, include_reads. Validate that path is set when mode = "file".
3. Create the audit logger in cmd/metacrypt/server.go: based on config, open the file or use stdout. Pass it to server.Server, grpcserver.GRPCServer, seal.Manager, and policy.Engine.
4. Add an audit *audit.Logger field to server.Server, grpcserver.GRPCServer, seal.Manager, and policy.Engine. Update constructors.
5. Instrument REST handlers: add the auditEngineOp helper to Server. Call it after every mutating operation in typed handlers and handleEngineRequest.
6. Instrument gRPC: add the audit interceptor to the interceptor chain.
7. Instrument seal/unseal: emit events in Init, Unseal, Seal, RotateMEK.
8. Instrument policy: emit events in CreateRule, DeleteRule.
9. Instrument login: emit events in the auth login handler (both REST and gRPC).
10. Update ARCHITECTURE.md: document audit logging in the Security Model section. Remove it from Future Work.
11. Update example configs: add the [audit] section.
12. Add tests: verify events are emitted for success, denied, and error outcomes. Verify the nil logger is safe. Verify read operations are excluded by default.
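
The test step can be approached by pointing the JSON handler at an in-memory buffer and decoding what was written. The sketch below uses a trimmed copy of Logger and Event so that it stands alone. `capture` and `nilLoggerIsSafe` are hypothetical helpers, not part of the planned package.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// Trimmed copies of the types described in this document.
type Logger struct{ logger *slog.Logger }

type Event struct{ Caller, Operation, Outcome string }

const LevelAudit = slog.Level(12)

// Log writes an audit event and is a no-op on a nil receiver.
func (l *Logger) Log(ctx context.Context, e Event) {
	if l == nil {
		return
	}
	l.logger.LogAttrs(ctx, LevelAudit, "operation completed",
		slog.String("caller", e.Caller),
		slog.String("operation", e.Operation),
		slog.String("outcome", e.Outcome))
}

// capture logs e through a buffer-backed handler and returns the decoded
// JSON object, or nil if nothing valid was written.
func capture(e Event) map[string]any {
	var buf bytes.Buffer
	l := &Logger{logger: slog.New(slog.NewJSONHandler(&buf, nil))}
	l.Log(context.Background(), e)

	rec := map[string]any{}
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return nil
	}
	return rec
}

// nilLoggerIsSafe reports whether logging through a nil *Logger returns
// without panicking.
func nilLoggerIsSafe() (ok bool) {
	defer func() { ok = recover() == nil }()
	var l *Logger
	l.Log(context.Background(), Event{Operation: "seal"})
	return true
}

func main() {
	rec := capture(Event{Caller: "kyle", Operation: "issue", Outcome: "denied"})
	fmt.Println(rec["operation"], rec["outcome"], nilLoggerIsSafe()) // issue denied true
}
```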

Querying the Audit Log

# All events for a user:
jq 'select(.caller == "kyle")' /srv/metacrypt/audit.log

# All certificate issuances:
jq 'select(.operation == "issue")' /srv/metacrypt/audit.log

# All denied operations:
jq 'select(.outcome == "denied")' /srv/metacrypt/audit.log

# All SSH CA events since a given time (string comparison; assumes every timestamp uses the same zone, e.g. UTC):
jq 'select(.engine == "sshca" and .time > "2026-03-17T03:00:00Z")' /srv/metacrypt/audit.log

# Count operations by type:
jq -r '.operation' /srv/metacrypt/audit.log | sort | uniq -c | sort -rn

# Failed unseal attempts:
jq 'select(.operation == "unseal" and .outcome == "denied")' /srv/metacrypt/audit.log

Rotation

For file mode, use logrotate:

/srv/metacrypt/audit.log {
    daily
    rotate 90
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

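copytruncate avoids the need for a signal-based reopen mechanism: the process keeps its open descriptor, and because the file is opened with O_APPEND, writes resume at the start of the truncated file. slog.JSONHandler does not buffer, so nothing is held back inside the process, but logrotate documents a small window between the copy and the truncate in which newly written events can be lost. At homelab write rates that risk is small; if it is unacceptable, a reopen-on-signal mechanism would be needed instead.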
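
Whether copytruncate leaves a clean file depends on the descriptor being opened with O_APPEND. The following sketch demonstrates that behaviour on a POSIX system by truncating a file behind an open descriptor, as logrotate does after copying. `sizeAfterTruncate` is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sizeAfterTruncate writes first through an O_APPEND descriptor, truncates
// the file by path, writes second through the same descriptor, and returns
// the final file size.
func sizeAfterTruncate(first, second string) (int64, error) {
	dir, err := os.MkdirTemp("", "audit")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "audit.log")
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	if _, err := f.WriteString(first); err != nil {
		return 0, err
	}
	// Stand-in for the truncate half of copytruncate.
	if err := os.Truncate(path, 0); err != nil {
		return 0, err
	}
	if _, err := f.WriteString(second); err != nil {
		return 0, err
	}

	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := sizeAfterTruncate("before rotation\n", "after\n")
	// On Linux this prints "6 <nil>": only the second write remains.
	fmt.Println(size, err)
}
```

Without O_APPEND the second write would land at the old offset and leave a run of zero bytes at the start of the file.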

At homelab scale with moderate usage, 90 days of uncompressed audit logs will be well under 100 MB.
