diff --git a/DESIGN_AUDIT.md b/DESIGN_AUDIT.md new file mode 100644 index 0000000..d2ba924 --- /dev/null +++ b/DESIGN_AUDIT.md @@ -0,0 +1,190 @@ +# MCP Design Audit + +Second review of ARCHITECTURE.md after the rewrite incorporating the first +review's findings. Two categories: engineering (cohesion/completeness) and +security. + +## Engineering Issues + +### E1: Node registry is missing — RESOLVED + +The CLI has `mcp node add/list/remove` commands but has no database. The +CLI needs to know that "rift" is at `100.95.252.120:9444` to connect. + +**Resolution:** `[[nodes]]` section in the CLI config file (`mcp.toml`). +`mcp node add/remove` edits the config. `mcp node list` reads it. For v1, +nodes are just name + address. Future versions add a `state` field +(active/inactive/down) for maintenance and health tracking; CLI filters +by state when routing commands. + +### E2: Duplicate events table definition — RESOLVED + +The events table appears twice: in the Monitoring section and in the +Database Schema section. The monitoring section's index is missing the +`service` column. + +**Resolution:** Remove the DDL from the Monitoring section. Define the +table once in the Database Schema section (which has the correct index). +Monitoring section references it narratively. + +### E3: `active: false` relationship to `desired_state` is undefined — RESOLVED + +Service definition has `active = true/false`. Registry has per-component +`desired_state = running/stopped/ignore`. What does `active = false` mean? + +**Resolution:** `active` is the authoritative desired state for the service. +- `active = true` → CLI tells agent: components should be `running` +- `active = false` → CLI tells agent: components should be `stopped` +- `mcp stop ` sets `active = false` in the local service + definition file and tells the agent to stop +- `mcp start ` sets `active = true` and tells the agent to start +- `mcp sync` pushes all definitions — agent stops inactive services, + keeps active ones running + +The service definition file is always the source of truth. Lifecycle +commands modify it. The agent receives the resolved intent and acts. + +### E4: `mcp adopt` CLI missing component name argument — RESOLVED + +CLI: `mcp adopt ` (two args). +Proto: `AdoptRequest` has `container`, `service`, `component` (three fields). + +**Resolution:** `mcp adopt ` adopts all containers matching +`-*` as components, stripping the service prefix to derive the +component name: `metacrypt-api` → component `api`, `metacrypt-web` → +component `web`. + +Container naming convention: `-`. No bare service +names as containers. This should be noted in the architecture doc as a +platform convention. Existing containers (e.g., `metacrypt` without a +component suffix) should be renamed to `metacrypt-api` before or during +MCP bootstrap. + +For single-component services (e.g., mc-proxy), the container name is +the service name and the component name matches: `mc-proxy` → service +`mc-proxy`, component `mc-proxy`. + +### E5: EventInfo proto missing service field — RESOLVED + +`EventInfo` has `component` but no `service`. Component names aren't unique +across services. + +**Resolution:** Add `string service = 2` to EventInfo, shift other fields. + +### E6: `version` field undefined — RESOLVED + +`ComponentInfo` and the schema have `version` but nothing sets it. + +**Resolution:** Extract version from the image tag. At deploy time, the +agent parses the tag from the image reference (e.g., +`mcr.../metacrypt:v1.7.0` → `v1.7.0`). During reconciliation, `podman +inspect` provides the image reference. If the tag is `latest`, version +reports `latest`. + +Build discipline (enforced in mcdeploy/CI, not MCP): `latest` should +always be an alias for a concrete version tag, so the version is +recoverable from the image metadata. + +### E7: No snapshot/backup for agent database — RESOLVED + +Platform convention: every service has a `snapshot` command and backup timer. +Agent has SQLite but no snapshot command and no backup timer in the systemd +artifacts. + +**Resolution:** Add `mcp-agent snapshot` command and backup timer. Systemd +artifacts: `mcp-agent-backup.service` + `mcp-agent-backup.timer` (daily, +02:00 UTC, 5-minute jitter, per platform convention). Add to project +structure and deployment section. + +### E8: CLI flag overrides were dropped — RESOLVED + +Original design had CLI flags overriding service definition fields. The +rewrite has only file > registry. Per-component deploy replaces `--image` +but other overrides (e.g., temporary node change) are gone. + +**Resolution:** Explicitly drop flag overrides from v1 scope. Per-component +deploy covers the primary use case. For one-off changes, edit the service +definition and deploy. Flag overrides are a future convenience feature. + +## Security Issues + +### S1: Alert command injection — RESOLVED + +Alert command is a shell string with env vars (`MCP_SERVICE`, +`MCP_COMPONENT`, etc.). Crafted container names with shell metacharacters +could inject commands. + +**Resolution:** Alert command is an argv array, not a shell string. The +agent calls `exec` with the array and sets `MCP_*` env vars on the child +process. No `sh -c`, no shell interpretation, no injection vector. + +Config format changes from: +```toml +alert_command = "ntfy publish mcp ..." +``` +to: +```toml +alert_command = ["/usr/local/bin/notify", "--channel", "mcp"] +``` + +### S2: Agent binds to all interfaces — RESOLVED + +`grpc_addr = ":9444"` binds to all interfaces including LAN. The agent +should bind to the overlay interface only. + +**Resolution:** Config example should show the overlay IP explicitly: +`grpc_addr = "100.95.252.120:9444"`. The agent does NOT sit behind +MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy, +so the agent must be reachable without MC-Proxy). The agent is in the +same category as MC-Proxy itself: infrastructure directly reachable on +the overlay. + +### S3: Unattended auth model unclear — RESOLVED + +The old watch config had `password_file` for unattended auth. With the +agent merge, unattended CLI use (scripting, cron) needs a clear token +acquisition path. + +**Resolution:** CLI config supports optional system account credentials: +```toml +[auth] +token_path = "~/.config/mcp/token" +username = "mcp-operator" +password_file = "~/.config/mcp/credentials" +``` +If set, CLI auto-authenticates to MCIAS when the token is expired/missing. +`MCP_TOKEN` env var remains as an override. Unattended CLI use is rare +for v1 — interactive `mcp login` is the primary path. + +### S4: Podman rootless UID mapping with volume mounts — RESOLVED + +Rootless podman maps container UIDs via subuid. `user = "0:0"` in the +container + volumes from `/srv//` means the actual host UID +accessing files depends on subuid config. + +**Resolution:** Documentation item. The deployment section should note +that the `mcp` user needs subuid/subgid ranges configured and that files +in `/srv//` must be accessible to the mapped UIDs. NixOS handles +subuid provisioning as part of user creation. + +### S5: No RPC audit log — RESOLVED + +Agent records container state-change events but not RPC calls. Deploy +and PushFile are significant actions with no record of who initiated +them or when. + +**Resolution:** Log every RPC at `info` level via the gRPC auth +interceptor (after token validation succeeds): method name, caller +identity (from MCIAS token), timestamp. Uses `log/slog` per platform +convention. Minimal implementation — one log line in the interceptor. + +### S6: File push creates arbitrary directory trees — RESOLVED + +`mcp push cert.pem mcr deeply/nested/path/cert.pem` creates +`/srv/mcr/deeply/nested/path/`. + +**Resolution:** This is intentional and necessary. The agent needs to +create `/srv//` directories for new deployments and arbitrary +subdirectories within them (certs/, backups/, etc.). `/srv/` should be +owned by the `mcp` user to support this. The path traversal restriction +(no `..`, no symlink escape) remains the security boundary.