Files
mcp/DESIGN_AUDIT.md
Kyle Isom 12d8d733be Add DESIGN_AUDIT.md from second review pass
Engineering and security review of the rewritten ARCHITECTURE.md.
14 issues identified and resolved: node registry in CLI config,
active/desired_state semantics, container naming convention
(<service>-<component>), exec-style alert commands to prevent
injection, agent binds to overlay IP only (not behind MC-Proxy),
RPC audit logging, /srv/ owned by mcp user, and others.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:58:04 -07:00

7.7 KiB

MCP Design Audit

Second review of ARCHITECTURE.md after the rewrite incorporating the first review's findings. Two categories: engineering (cohesion/completeness) and security.

Engineering Issues

E1: Node registry is missing — RESOLVED

The CLI has mcp node add/list/remove commands but has no database. The CLI needs to know that "rift" is at 100.95.252.120:9444 to connect.

Resolution: [[nodes]] section in the CLI config file (mcp.toml). mcp node add/remove edits the config. mcp node list reads it. For v1, nodes are just name + address. Future versions add a state field (active/inactive/down) for maintenance and health tracking; CLI filters by state when routing commands.

E2: Duplicate events table definition — RESOLVED

The events table appears twice: in the Monitoring section and in the Database Schema section. The monitoring section's index is missing the service column.

Resolution: Remove the DDL from the Monitoring section. Define the table once in the Database Schema section (which has the correct index). Monitoring section references it narratively.

E3: active: false relationship to desired_state is undefined — RESOLVED

Service definition has active = true/false. Registry has per-component desired_state = running/stopped/ignore. What does active = false mean?

Resolution: active is the authoritative desired state for the service.

  • active = true → CLI tells agent: components should be running
  • active = false → CLI tells agent: components should be stopped
  • mcp stop <service> sets active = false in the local service definition file and tells the agent to stop
  • mcp start <service> sets active = true and tells the agent to start
  • mcp sync pushes all definitions — agent stops inactive services, keeps active ones running

The service definition file is always the source of truth. Lifecycle commands modify it. The agent receives the resolved intent and acts.

E4: mcp adopt CLI missing component name argument — RESOLVED

CLI: mcp adopt <container> <service> (two args). Proto: AdoptRequest has container, service, component (three fields).

Resolution: mcp adopt <service> adopts all containers matching <service>-* as components, stripping the service prefix to derive the component name: metacrypt-api → component api, metacrypt-web → component web.

Container naming convention: <service>-<component>. No bare service names as containers. This should be noted in the architecture doc as a platform convention. Existing containers (e.g., metacrypt without a component suffix) should be renamed to metacrypt-api before or during MCP bootstrap.

For single-component services (e.g., mc-proxy), the container name is the service name and the component name matches: mc-proxy → service mc-proxy, component mc-proxy.

E5: EventInfo proto missing service field — RESOLVED

EventInfo has component but no service. Component names aren't unique across services.

Resolution: Add string service = 2 to EventInfo, shift other fields.

E6: version field undefined — RESOLVED

ComponentInfo and the schema have version but nothing sets it.

Resolution: Extract version from the image tag. At deploy time, the agent parses the tag from the image reference (e.g., mcr.../metacrypt:v1.7.0v1.7.0). During reconciliation, podman inspect provides the image reference. If the tag is latest, version reports latest.

Build discipline (enforced in mcdeploy/CI, not MCP): latest should always be an alias for a concrete version tag, so the version is recoverable from the image metadata.

E7: No snapshot/backup for agent database — RESOLVED

Platform convention: every service has a snapshot command and backup timer. Agent has SQLite but no snapshot command and no backup timer in the systemd artifacts.

Resolution: Add mcp-agent snapshot command and backup timer. Systemd artifacts: mcp-agent-backup.service + mcp-agent-backup.timer (daily, 02:00 UTC, 5-minute jitter, per platform convention). Add to project structure and deployment section.

E8: CLI flag overrides were dropped — RESOLVED

Original design had CLI flags overriding service definition fields. The rewrite has only file > registry. Per-component deploy replaces --image but other overrides (e.g., temporary node change) are gone.

Resolution: Explicitly drop flag overrides from v1 scope. Per-component deploy covers the primary use case. For one-off changes, edit the service definition and deploy. Flag overrides are a future convenience feature.

Security Issues

S1: Alert command injection — RESOLVED

Alert command is a shell string with env vars (MCP_SERVICE, MCP_COMPONENT, etc.). Crafted container names with shell metacharacters could inject commands.

Resolution: Alert command is an argv array, not a shell string. The agent calls exec with the array and sets MCP_* env vars on the child process. No sh -c, no shell interpretation, no injection vector.

Config format changes from:

alert_command = "ntfy publish mcp ..."

to:

alert_command = ["/usr/local/bin/notify", "--channel", "mcp"]

S2: Agent binds to all interfaces — RESOLVED

grpc_addr = ":9444" binds to all interfaces including LAN. The agent should bind to the overlay interface only.

Resolution: Config example should show the overlay IP explicitly: grpc_addr = "100.95.252.120:9444". The agent does NOT sit behind MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy, so the agent must be reachable without MC-Proxy). The agent is in the same category as MC-Proxy itself: infrastructure directly reachable on the overlay.

S3: Unattended auth model unclear — RESOLVED

The old watch config had password_file for unattended auth. With the agent merge, unattended CLI use (scripting, cron) needs a clear token acquisition path.

Resolution: CLI config supports optional system account credentials:

[auth]
token_path = "~/.config/mcp/token"
username = "mcp-operator"
password_file = "~/.config/mcp/credentials"

If set, CLI auto-authenticates to MCIAS when the token is expired/missing. MCP_TOKEN env var remains as an override. Unattended CLI use is rare for v1 — interactive mcp login is the primary path.

S4: Podman rootless UID mapping with volume mounts — RESOLVED

Rootless podman maps container UIDs via subuid. user = "0:0" in the container + volumes from /srv/<service>/ means the actual host UID accessing files depends on subuid config.

Resolution: Documentation item. The deployment section should note that the mcp user needs subuid/subgid ranges configured and that files in /srv/<service>/ must be accessible to the mapped UIDs. NixOS handles subuid provisioning as part of user creation.

S5: No RPC audit log — RESOLVED

Agent records container state-change events but not RPC calls. Deploy and PushFile are significant actions with no record of who initiated them or when.

Resolution: Log every RPC at info level via the gRPC auth interceptor (after token validation succeeds): method name, caller identity (from MCIAS token), timestamp. Uses log/slog per platform convention. Minimal implementation — one log line in the interceptor.

S6: File push creates arbitrary directory trees — RESOLVED

mcp push cert.pem mcr deeply/nested/path/cert.pem creates /srv/mcr/deeply/nested/path/.

Resolution: This is intentional and necessary. The agent needs to create /srv/<service>/ directories for new deployments and arbitrary subdirectories within them (certs/, backups/, etc.). /srv/ should be owned by the mcp user to support this. The path traversal restriction (no .., no symlink escape) remains the security boundary.