# MCP Design Audit Second review of ARCHITECTURE.md after the rewrite incorporating the first review's findings. Two categories: engineering (cohesion/completeness) and security. ## Engineering Issues ### E1: Node registry is missing — RESOLVED The CLI has `mcp node add/list/remove` commands but has no database. The CLI needs to know that "rift" is at `100.95.252.120:9444` to connect. **Resolution:** `[[nodes]]` section in the CLI config file (`mcp.toml`). `mcp node add/remove` edits the config. `mcp node list` reads it. For v1, nodes are just name + address. Future versions add a `state` field (active/inactive/down) for maintenance and health tracking; CLI filters by state when routing commands. ### E2: Duplicate events table definition — RESOLVED The events table appears twice: in the Monitoring section and in the Database Schema section. The monitoring section's index is missing the `service` column. **Resolution:** Remove the DDL from the Monitoring section. Define the table once in the Database Schema section (which has the correct index). Monitoring section references it narratively. ### E3: `active: false` relationship to `desired_state` is undefined — RESOLVED Service definition has `active = true/false`. Registry has per-component `desired_state = running/stopped/ignore`. What does `active = false` mean? **Resolution:** `active` is the authoritative desired state for the service. - `active = true` → CLI tells agent: components should be `running` - `active = false` → CLI tells agent: components should be `stopped` - `mcp stop ` sets `active = false` in the local service definition file and tells the agent to stop - `mcp start ` sets `active = true` and tells the agent to start - `mcp sync` pushes all definitions — agent stops inactive services, keeps active ones running The service definition file is always the source of truth. Lifecycle commands modify it. The agent receives the resolved intent and acts. ### E4: `mcp adopt` CLI missing component name argument — RESOLVED CLI: `mcp adopt ` (two args). Proto: `AdoptRequest` has `container`, `service`, `component` (three fields). **Resolution:** `mcp adopt ` adopts all containers matching `-*` as components, stripping the service prefix to derive the component name: `metacrypt-api` → component `api`, `metacrypt-web` → component `web`. Container naming convention: `-`. No bare service names as containers. This should be noted in the architecture doc as a platform convention. Existing containers (e.g., `metacrypt` without a component suffix) should be renamed to `metacrypt-api` before or during MCP bootstrap. For single-component services (e.g., mc-proxy), the container name is the service name and the component name matches: `mc-proxy` → service `mc-proxy`, component `mc-proxy`. ### E5: EventInfo proto missing service field — RESOLVED `EventInfo` has `component` but no `service`. Component names aren't unique across services. **Resolution:** Add `string service = 2` to EventInfo, shift other fields. ### E6: `version` field undefined — RESOLVED `ComponentInfo` and the schema have `version` but nothing sets it. **Resolution:** Extract version from the image tag. At deploy time, the agent parses the tag from the image reference (e.g., `mcr.../metacrypt:v1.7.0` → `v1.7.0`). During reconciliation, `podman inspect` provides the image reference. If the tag is `latest`, version reports `latest`. Build discipline (enforced in mcdeploy/CI, not MCP): `latest` should always be an alias for a concrete version tag, so the version is recoverable from the image metadata. ### E7: No snapshot/backup for agent database — RESOLVED Platform convention: every service has a `snapshot` command and backup timer. Agent has SQLite but no snapshot command and no backup timer in the systemd artifacts. **Resolution:** Add `mcp-agent snapshot` command and backup timer. Systemd artifacts: `mcp-agent-backup.service` + `mcp-agent-backup.timer` (daily, 02:00 UTC, 5-minute jitter, per platform convention). Add to project structure and deployment section. ### E8: CLI flag overrides were dropped — RESOLVED Original design had CLI flags overriding service definition fields. The rewrite has only file > registry. Per-component deploy replaces `--image` but other overrides (e.g., temporary node change) are gone. **Resolution:** Explicitly drop flag overrides from v1 scope. Per-component deploy covers the primary use case. For one-off changes, edit the service definition and deploy. Flag overrides are a future convenience feature. ## Security Issues ### S1: Alert command injection — RESOLVED Alert command is a shell string with env vars (`MCP_SERVICE`, `MCP_COMPONENT`, etc.). Crafted container names with shell metacharacters could inject commands. **Resolution:** Alert command is an argv array, not a shell string. The agent calls `exec` with the array and sets `MCP_*` env vars on the child process. No `sh -c`, no shell interpretation, no injection vector. Config format changes from: ```toml alert_command = "ntfy publish mcp ..." ``` to: ```toml alert_command = ["/usr/local/bin/notify", "--channel", "mcp"] ``` ### S2: Agent binds to all interfaces — RESOLVED `grpc_addr = ":9444"` binds to all interfaces including LAN. The agent should bind to the overlay interface only. **Resolution:** Config example should show the overlay IP explicitly: `grpc_addr = "100.95.252.120:9444"`. The agent does NOT sit behind MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy, so the agent must be reachable without MC-Proxy). The agent is in the same category as MC-Proxy itself: infrastructure directly reachable on the overlay. ### S3: Unattended auth model unclear — RESOLVED The old watch config had `password_file` for unattended auth. With the agent merge, unattended CLI use (scripting, cron) needs a clear token acquisition path. **Resolution:** CLI config supports optional system account credentials: ```toml [auth] token_path = "~/.config/mcp/token" username = "mcp-operator" password_file = "~/.config/mcp/credentials" ``` If set, CLI auto-authenticates to MCIAS when the token is expired/missing. `MCP_TOKEN` env var remains as an override. Unattended CLI use is rare for v1 — interactive `mcp login` is the primary path. ### S4: Podman rootless UID mapping with volume mounts — RESOLVED Rootless podman maps container UIDs via subuid. `user = "0:0"` in the container + volumes from `/srv//` means the actual host UID accessing files depends on subuid config. **Resolution:** Documentation item. The deployment section should note that the `mcp` user needs subuid/subgid ranges configured and that files in `/srv//` must be accessible to the mapped UIDs. NixOS handles subuid provisioning as part of user creation. ### S5: No RPC audit log — RESOLVED Agent records container state-change events but not RPC calls. Deploy and PushFile are significant actions with no record of who initiated them or when. **Resolution:** Log every RPC at `info` level via the gRPC auth interceptor (after token validation succeeds): method name, caller identity (from MCIAS token), timestamp. Uses `log/slog` per platform convention. Minimal implementation — one log line in the interceptor. ### S6: File push creates arbitrary directory trees — RESOLVED `mcp push cert.pem mcr deeply/nested/path/cert.pem` creates `/srv/mcr/deeply/nested/path/`. **Resolution:** This is intentional and necessary. The agent needs to create `/srv//` directories for new deployments and arbitrary subdirectories within them (certs/, backups/, etc.). `/srv/` should be owned by the `mcp` user to support this. The path traversal restriction (no `..`, no symlink escape) remains the security boundary.