Engineering and security review of the rewritten ARCHITECTURE.md. 14 issues identified and resolved: node registry in CLI config, active/desired_state semantics, container naming convention (<service>-<component>), exec-style alert commands to prevent injection, agent binds to overlay IP only (not behind MC-Proxy), RPC audit logging, /srv/ owned by mcp user, and others. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7.7 KiB
MCP Design Audit
Second review of ARCHITECTURE.md after the rewrite incorporating the first review's findings. Two categories: engineering (cohesion/completeness) and security.
Engineering Issues
E1: Node registry is missing — RESOLVED
The CLI has mcp node add/list/remove commands but has no database. The
CLI needs to know that "rift" is at 100.95.252.120:9444 to connect.
Resolution: [[nodes]] section in the CLI config file (mcp.toml).
mcp node add/remove edits the config. mcp node list reads it. For v1,
nodes are just name + address. Future versions add a state field
(active/inactive/down) for maintenance and health tracking; CLI filters
by state when routing commands.
E2: Duplicate events table definition — RESOLVED
The events table appears twice: in the Monitoring section and in the
Database Schema section. The monitoring section's index is missing the
service column.
Resolution: Remove the DDL from the Monitoring section. Define the table once in the Database Schema section (which has the correct index). Monitoring section references it narratively.
E3: active: false relationship to desired_state is undefined — RESOLVED
Service definition has active = true/false. Registry has per-component
desired_state = running/stopped/ignore. What does active = false mean?
Resolution: active is the authoritative desired state for the service.
active = true→ CLI tells agent: components should berunningactive = false→ CLI tells agent: components should bestoppedmcp stop <service>setsactive = falsein the local service definition file and tells the agent to stopmcp start <service>setsactive = trueand tells the agent to startmcp syncpushes all definitions — agent stops inactive services, keeps active ones running
The service definition file is always the source of truth. Lifecycle commands modify it. The agent receives the resolved intent and acts.
E4: mcp adopt CLI missing component name argument — RESOLVED
CLI: mcp adopt <container> <service> (two args).
Proto: AdoptRequest has container, service, component (three fields).
Resolution: mcp adopt <service> adopts all containers matching
<service>-* as components, stripping the service prefix to derive the
component name: metacrypt-api → component api, metacrypt-web →
component web.
Container naming convention: <service>-<component>. No bare service
names as containers. This should be noted in the architecture doc as a
platform convention. Existing containers (e.g., metacrypt without a
component suffix) should be renamed to metacrypt-api before or during
MCP bootstrap.
For single-component services (e.g., mc-proxy), the container name is
the service name and the component name matches: mc-proxy → service
mc-proxy, component mc-proxy.
E5: EventInfo proto missing service field — RESOLVED
EventInfo has component but no service. Component names aren't unique
across services.
Resolution: Add string service = 2 to EventInfo, shift other fields.
E6: version field undefined — RESOLVED
ComponentInfo and the schema have version but nothing sets it.
Resolution: Extract version from the image tag. At deploy time, the
agent parses the tag from the image reference (e.g.,
mcr.../metacrypt:v1.7.0 → v1.7.0). During reconciliation, podman inspect provides the image reference. If the tag is latest, version
reports latest.
Build discipline (enforced in mcdeploy/CI, not MCP): latest should
always be an alias for a concrete version tag, so the version is
recoverable from the image metadata.
E7: No snapshot/backup for agent database — RESOLVED
Platform convention: every service has a snapshot command and backup timer.
Agent has SQLite but no snapshot command and no backup timer in the systemd
artifacts.
Resolution: Add mcp-agent snapshot command and backup timer. Systemd
artifacts: mcp-agent-backup.service + mcp-agent-backup.timer (daily,
02:00 UTC, 5-minute jitter, per platform convention). Add to project
structure and deployment section.
E8: CLI flag overrides were dropped — RESOLVED
Original design had CLI flags overriding service definition fields. The
rewrite has only file > registry. Per-component deploy replaces --image
but other overrides (e.g., temporary node change) are gone.
Resolution: Explicitly drop flag overrides from v1 scope. Per-component deploy covers the primary use case. For one-off changes, edit the service definition and deploy. Flag overrides are a future convenience feature.
Security Issues
S1: Alert command injection — RESOLVED
Alert command is a shell string with env vars (MCP_SERVICE,
MCP_COMPONENT, etc.). Crafted container names with shell metacharacters
could inject commands.
Resolution: Alert command is an argv array, not a shell string. The
agent calls exec with the array and sets MCP_* env vars on the child
process. No sh -c, no shell interpretation, no injection vector.
Config format changes from:
alert_command = "ntfy publish mcp ..."
to:
alert_command = ["/usr/local/bin/notify", "--channel", "mcp"]
S2: Agent binds to all interfaces — RESOLVED
grpc_addr = ":9444" binds to all interfaces including LAN. The agent
should bind to the overlay interface only.
Resolution: Config example should show the overlay IP explicitly:
grpc_addr = "100.95.252.120:9444". The agent does NOT sit behind
MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy,
so the agent must be reachable without MC-Proxy). The agent is in the
same category as MC-Proxy itself: infrastructure directly reachable on
the overlay.
S3: Unattended auth model unclear — RESOLVED
The old watch config had password_file for unattended auth. With the
agent merge, unattended CLI use (scripting, cron) needs a clear token
acquisition path.
Resolution: CLI config supports optional system account credentials:
[auth]
token_path = "~/.config/mcp/token"
username = "mcp-operator"
password_file = "~/.config/mcp/credentials"
If set, CLI auto-authenticates to MCIAS when the token is expired/missing.
MCP_TOKEN env var remains as an override. Unattended CLI use is rare
for v1 — interactive mcp login is the primary path.
S4: Podman rootless UID mapping with volume mounts — RESOLVED
Rootless podman maps container UIDs via subuid. user = "0:0" in the
container + volumes from /srv/<service>/ means the actual host UID
accessing files depends on subuid config.
Resolution: Documentation item. The deployment section should note
that the mcp user needs subuid/subgid ranges configured and that files
in /srv/<service>/ must be accessible to the mapped UIDs. NixOS handles
subuid provisioning as part of user creation.
S5: No RPC audit log — RESOLVED
Agent records container state-change events but not RPC calls. Deploy and PushFile are significant actions with no record of who initiated them or when.
Resolution: Log every RPC at info level via the gRPC auth
interceptor (after token validation succeeds): method name, caller
identity (from MCIAS token), timestamp. Uses log/slog per platform
convention. Minimal implementation — one log line in the interceptor.
S6: File push creates arbitrary directory trees — RESOLVED
mcp push cert.pem mcr deeply/nested/path/cert.pem creates
/srv/mcr/deeply/nested/path/.
Resolution: This is intentional and necessary. The agent needs to
create /srv/<service>/ directories for new deployments and arbitrary
subdirectories within them (certs/, backups/, etc.). /srv/ should be
owned by the mcp user to support this. The path traversal restriction
(no .., no symlink escape) remains the security boundary.