Add DESIGN_AUDIT.md from second review pass

Engineering and security review of the rewritten ARCHITECTURE.md.
14 issues identified and resolved: node registry in CLI config,
active/desired_state semantics, container naming convention
(<service>-<component>), exec-style alert commands to prevent
injection, agent binds to overlay IP only (not behind MC-Proxy),
RPC audit logging, /srv/ owned by mcp user, and others.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-26 10:58:04 -07:00
parent ea7a9dcf4d
commit 12d8d733be

190
DESIGN_AUDIT.md Normal file
View File

@@ -0,0 +1,190 @@
# MCP Design Audit
Second review of ARCHITECTURE.md after the rewrite incorporating the first
review's findings. Two categories: engineering (cohesion/completeness) and
security.
## Engineering Issues
### E1: Node registry is missing — RESOLVED
The CLI has `mcp node add/list/remove` commands but has no database. The
CLI needs to know that "rift" is at `100.95.252.120:9444` to connect.
**Resolution:** `[[nodes]]` section in the CLI config file (`mcp.toml`).
`mcp node add/remove` edits the config. `mcp node list` reads it. For v1,
nodes are just name + address. Future versions add a `state` field
(active/inactive/down) for maintenance and health tracking; CLI filters
by state when routing commands.
### E2: Duplicate events table definition — RESOLVED
The events table appears twice: in the Monitoring section and in the
Database Schema section. The monitoring section's index is missing the
`service` column.
**Resolution:** Remove the DDL from the Monitoring section. Define the
table once in the Database Schema section (which has the correct index).
Monitoring section references it narratively.
### E3: `active: false` relationship to `desired_state` is undefined — RESOLVED
Service definition has `active = true/false`. Registry has per-component
`desired_state = running/stopped/ignore`. What does `active = false` mean?
**Resolution:** `active` is the authoritative desired state for the service.
- `active = true` → CLI tells agent: components should be `running`
- `active = false` → CLI tells agent: components should be `stopped`
- `mcp stop <service>` sets `active = false` in the local service
definition file and tells the agent to stop
- `mcp start <service>` sets `active = true` and tells the agent to start
- `mcp sync` pushes all definitions — agent stops inactive services,
keeps active ones running
The service definition file is always the source of truth. Lifecycle
commands modify it. The agent receives the resolved intent and acts.
### E4: `mcp adopt` CLI missing component name argument — RESOLVED
CLI: `mcp adopt <container> <service>` (two args).
Proto: `AdoptRequest` has `container`, `service`, `component` (three fields).
**Resolution:** `mcp adopt <service>` adopts all containers matching
`<service>-*` as components, stripping the service prefix to derive the
component name: `metacrypt-api` → component `api`, `metacrypt-web`
component `web`.
Container naming convention: `<service>-<component>`. No bare service
names as containers. This should be noted in the architecture doc as a
platform convention. Existing containers (e.g., `metacrypt` without a
component suffix) should be renamed to `metacrypt-api` before or during
MCP bootstrap.
For single-component services (e.g., mc-proxy), the container name is
the service name and the component name matches: `mc-proxy` → service
`mc-proxy`, component `mc-proxy`.
### E5: EventInfo proto missing service field — RESOLVED
`EventInfo` has `component` but no `service`. Component names aren't unique
across services.
**Resolution:** Add `string service = 2` to EventInfo, shift other fields.
### E6: `version` field undefined — RESOLVED
`ComponentInfo` and the schema have `version` but nothing sets it.
**Resolution:** Extract version from the image tag. At deploy time, the
agent parses the tag from the image reference (e.g.,
`mcr.../metacrypt:v1.7.0``v1.7.0`). During reconciliation, `podman
inspect` provides the image reference. If the tag is `latest`, version
reports `latest`.
Build discipline (enforced in mcdeploy/CI, not MCP): `latest` should
always be an alias for a concrete version tag, so the version is
recoverable from the image metadata.
### E7: No snapshot/backup for agent database — RESOLVED
Platform convention: every service has a `snapshot` command and backup timer.
Agent has SQLite but no snapshot command and no backup timer in the systemd
artifacts.
**Resolution:** Add `mcp-agent snapshot` command and backup timer. Systemd
artifacts: `mcp-agent-backup.service` + `mcp-agent-backup.timer` (daily,
02:00 UTC, 5-minute jitter, per platform convention). Add to project
structure and deployment section.
### E8: CLI flag overrides were dropped — RESOLVED
Original design had CLI flags overriding service definition fields. The
rewrite has only file > registry. Per-component deploy replaces `--image`
but other overrides (e.g., temporary node change) are gone.
**Resolution:** Explicitly drop flag overrides from v1 scope. Per-component
deploy covers the primary use case. For one-off changes, edit the service
definition and deploy. Flag overrides are a future convenience feature.
## Security Issues
### S1: Alert command injection — RESOLVED
Alert command is a shell string with env vars (`MCP_SERVICE`,
`MCP_COMPONENT`, etc.). Crafted container names with shell metacharacters
could inject commands.
**Resolution:** Alert command is an argv array, not a shell string. The
agent calls `exec` with the array and sets `MCP_*` env vars on the child
process. No `sh -c`, no shell interpretation, no injection vector.
Config format changes from:
```toml
alert_command = "ntfy publish mcp ..."
```
to:
```toml
alert_command = ["/usr/local/bin/notify", "--channel", "mcp"]
```
### S2: Agent binds to all interfaces — RESOLVED
`grpc_addr = ":9444"` binds to all interfaces including LAN. The agent
should bind to the overlay interface only.
**Resolution:** Config example should show the overlay IP explicitly:
`grpc_addr = "100.95.252.120:9444"`. The agent does NOT sit behind
MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy,
so the agent must be reachable without MC-Proxy). The agent is in the
same category as MC-Proxy itself: infrastructure directly reachable on
the overlay.
### S3: Unattended auth model unclear — RESOLVED
The old watch config had `password_file` for unattended auth. With the
agent merge, unattended CLI use (scripting, cron) needs a clear token
acquisition path.
**Resolution:** CLI config supports optional system account credentials:
```toml
[auth]
token_path = "~/.config/mcp/token"
username = "mcp-operator"
password_file = "~/.config/mcp/credentials"
```
If set, CLI auto-authenticates to MCIAS when the token is expired/missing.
`MCP_TOKEN` env var remains as an override. Unattended CLI use is rare
for v1 — interactive `mcp login` is the primary path.
### S4: Podman rootless UID mapping with volume mounts — RESOLVED
Rootless podman maps container UIDs via subuid. `user = "0:0"` in the
container + volumes from `/srv/<service>/` means the actual host UID
accessing files depends on subuid config.
**Resolution:** Documentation item. The deployment section should note
that the `mcp` user needs subuid/subgid ranges configured and that files
in `/srv/<service>/` must be accessible to the mapped UIDs. NixOS handles
subuid provisioning as part of user creation.
### S5: No RPC audit log — RESOLVED
Agent records container state-change events but not RPC calls. Deploy
and PushFile are significant actions with no record of who initiated
them or when.
**Resolution:** Log every RPC at `info` level via the gRPC auth
interceptor (after token validation succeeds): method name, caller
identity (from MCIAS token), timestamp. Uses `log/slog` per platform
convention. Minimal implementation — one log line in the interceptor.
### S6: File push creates arbitrary directory trees — RESOLVED
`mcp push cert.pem mcr deeply/nested/path/cert.pem` creates
`/srv/mcr/deeply/nested/path/`.
**Resolution:** This is intentional and necessary. The agent needs to
create `/srv/<service>/` directories for new deployments and arbitrary
subdirectories within them (certs/, backups/, etc.). `/srv/` should be
owned by the `mcp` user to support this. The path traversal restriction
(no `..`, no symlink escape) remains the security boundary.