Engineering and security review of the rewritten ARCHITECTURE.md. 14 issues identified and resolved: node registry in CLI config, active/desired_state semantics, container naming convention (<service>-<component>), exec-style alert commands to prevent injection, agent binds to overlay IP only (not behind MC-Proxy), RPC audit logging, /srv/ owned by mcp user, and others. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
191 lines
7.7 KiB
Markdown
191 lines
7.7 KiB
Markdown
# MCP Design Audit
|
|
|
|
Second review of ARCHITECTURE.md after the rewrite incorporating the first
|
|
review's findings. Two categories: engineering (cohesion/completeness) and
|
|
security.
|
|
|
|
## Engineering Issues
|
|
|
|
### E1: Node registry is missing — RESOLVED
|
|
|
|
The CLI has `mcp node add/list/remove` commands but has no database. The
|
|
CLI needs to know that "rift" is at `100.95.252.120:9444` to connect.
|
|
|
|
**Resolution:** `[[nodes]]` section in the CLI config file (`mcp.toml`).
|
|
`mcp node add/remove` edits the config. `mcp node list` reads it. For v1,
|
|
nodes are just name + address. Future versions add a `state` field
|
|
(active/inactive/down) for maintenance and health tracking; CLI filters
|
|
by state when routing commands.
|
|
|
|
### E2: Duplicate events table definition — RESOLVED
|
|
|
|
The events table appears twice: in the Monitoring section and in the
|
|
Database Schema section. The monitoring section's index is missing the
|
|
`service` column.
|
|
|
|
**Resolution:** Remove the DDL from the Monitoring section. Define the
|
|
table once in the Database Schema section (which has the correct index).
|
|
Monitoring section references it narratively.
|
|
|
|
### E3: `active: false` relationship to `desired_state` is undefined — RESOLVED
|
|
|
|
Service definition has `active = true/false`. Registry has per-component
|
|
`desired_state = running/stopped/ignore`. What does `active = false` mean?
|
|
|
|
**Resolution:** `active` is the authoritative desired state for the service.
|
|
- `active = true` → CLI tells agent: components should be `running`
|
|
- `active = false` → CLI tells agent: components should be `stopped`
|
|
- `mcp stop <service>` sets `active = false` in the local service
|
|
definition file and tells the agent to stop
|
|
- `mcp start <service>` sets `active = true` and tells the agent to start
|
|
- `mcp sync` pushes all definitions — agent stops inactive services,
|
|
keeps active ones running
|
|
|
|
The service definition file is always the source of truth. Lifecycle
|
|
commands modify it. The agent receives the resolved intent and acts.
|
|
|
|
### E4: `mcp adopt` CLI missing component name argument — RESOLVED
|
|
|
|
CLI: `mcp adopt <container> <service>` (two args).
|
|
Proto: `AdoptRequest` has `container`, `service`, `component` (three fields).
|
|
|
|
**Resolution:** `mcp adopt <service>` adopts all containers matching
|
|
`<service>-*` as components, stripping the service prefix to derive the
|
|
component name: `metacrypt-api` → component `api`, `metacrypt-web` →
|
|
component `web`.
|
|
|
|
Container naming convention: `<service>-<component>`. No bare service
|
|
names as containers. This should be noted in the architecture doc as a
|
|
platform convention. Existing containers (e.g., `metacrypt` without a
|
|
component suffix) should be renamed to `metacrypt-api` before or during
|
|
MCP bootstrap.
|
|
|
|
For single-component services (e.g., mc-proxy), the container name is
|
|
the service name and the component name matches: `mc-proxy` → service
|
|
`mc-proxy`, component `mc-proxy`.
|
|
|
|
### E5: EventInfo proto missing service field — RESOLVED
|
|
|
|
`EventInfo` has `component` but no `service`. Component names aren't unique
|
|
across services.
|
|
|
|
**Resolution:** Add `string service = 2` to EventInfo, shift other fields.
|
|
|
|
### E6: `version` field undefined — RESOLVED
|
|
|
|
`ComponentInfo` and the schema have `version` but nothing sets it.
|
|
|
|
**Resolution:** Extract version from the image tag. At deploy time, the
|
|
agent parses the tag from the image reference (e.g.,
|
|
`mcr.../metacrypt:v1.7.0` → `v1.7.0`). During reconciliation, `podman
|
|
inspect` provides the image reference. If the tag is `latest`, version
|
|
reports `latest`.
|
|
|
|
Build discipline (enforced in mcdeploy/CI, not MCP): `latest` should
|
|
always be an alias for a concrete version tag, so the version is
|
|
recoverable from the image metadata.
|
|
|
|
### E7: No snapshot/backup for agent database — RESOLVED
|
|
|
|
Platform convention: every service has a `snapshot` command and backup timer.
|
|
Agent has SQLite but no snapshot command and no backup timer in the systemd
|
|
artifacts.
|
|
|
|
**Resolution:** Add `mcp-agent snapshot` command and backup timer. Systemd
|
|
artifacts: `mcp-agent-backup.service` + `mcp-agent-backup.timer` (daily,
|
|
02:00 UTC, 5-minute jitter, per platform convention). Add to project
|
|
structure and deployment section.
|
|
|
|
### E8: CLI flag overrides were dropped — RESOLVED
|
|
|
|
Original design had CLI flags overriding service definition fields. The
|
|
rewrite has only file > registry. Per-component deploy replaces `--image`
|
|
but other overrides (e.g., temporary node change) are gone.
|
|
|
|
**Resolution:** Explicitly drop flag overrides from v1 scope. Per-component
|
|
deploy covers the primary use case. For one-off changes, edit the service
|
|
definition and deploy. Flag overrides are a future convenience feature.
|
|
|
|
## Security Issues
|
|
|
|
### S1: Alert command injection — RESOLVED
|
|
|
|
Alert command is a shell string with env vars (`MCP_SERVICE`,
|
|
`MCP_COMPONENT`, etc.). Crafted container names with shell metacharacters
|
|
could inject commands.
|
|
|
|
**Resolution:** Alert command is an argv array, not a shell string. The
|
|
agent calls `exec` with the array and sets `MCP_*` env vars on the child
|
|
process. No `sh -c`, no shell interpretation, no injection vector.
|
|
|
|
Config format changes from:
|
|
```toml
|
|
alert_command = "ntfy publish mcp ..."
|
|
```
|
|
to:
|
|
```toml
|
|
alert_command = ["/usr/local/bin/notify", "--channel", "mcp"]
|
|
```
|
|
|
|
### S2: Agent binds to all interfaces — RESOLVED
|
|
|
|
`grpc_addr = ":9444"` binds to all interfaces including LAN. The agent
|
|
should bind to the overlay interface only.
|
|
|
|
**Resolution:** Config example should show the overlay IP explicitly:
|
|
`grpc_addr = "100.95.252.120:9444"`. The agent does NOT sit behind
|
|
MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy,
|
|
so the agent must be reachable without MC-Proxy). The agent is in the
|
|
same category as MC-Proxy itself: infrastructure directly reachable on
|
|
the overlay.
|
|
|
|
### S3: Unattended auth model unclear — RESOLVED
|
|
|
|
The old watch config had `password_file` for unattended auth. With the
|
|
agent merge, unattended CLI use (scripting, cron) needs a clear token
|
|
acquisition path.
|
|
|
|
**Resolution:** CLI config supports optional system account credentials:
|
|
```toml
|
|
[auth]
|
|
token_path = "~/.config/mcp/token"
|
|
username = "mcp-operator"
|
|
password_file = "~/.config/mcp/credentials"
|
|
```
|
|
If set, CLI auto-authenticates to MCIAS when the token is expired/missing.
|
|
`MCP_TOKEN` env var remains as an override. Unattended CLI use is rare
|
|
for v1 — interactive `mcp login` is the primary path.
|
|
|
|
### S4: Podman rootless UID mapping with volume mounts — RESOLVED
|
|
|
|
Rootless podman maps container UIDs via subuid. `user = "0:0"` in the
|
|
container + volumes from `/srv/<service>/` means the actual host UID
|
|
accessing files depends on subuid config.
|
|
|
|
**Resolution:** Documentation item. The deployment section should note
|
|
that the `mcp` user needs subuid/subgid ranges configured and that files
|
|
in `/srv/<service>/` must be accessible to the mapped UIDs. NixOS handles
|
|
subuid provisioning as part of user creation.
|
|
|
|
### S5: No RPC audit log — RESOLVED
|
|
|
|
Agent records container state-change events but not RPC calls. Deploy
|
|
and PushFile are significant actions with no record of who initiated
|
|
them or when.
|
|
|
|
**Resolution:** Log every RPC at `info` level via the gRPC auth
|
|
interceptor (after token validation succeeds): method name, caller
|
|
identity (from MCIAS token), timestamp. Uses `log/slog` per platform
|
|
convention. Minimal implementation — one log line in the interceptor.
|
|
|
|
### S6: File push creates arbitrary directory trees — RESOLVED
|
|
|
|
`mcp push cert.pem mcr deeply/nested/path/cert.pem` creates
|
|
`/srv/mcr/deeply/nested/path/`.
|
|
|
|
**Resolution:** This is intentional and necessary. The agent needs to
|
|
create `/srv/<service>/` directories for new deployments and arbitrary
|
|
subdirectories within them (certs/, backups/, etc.). `/srv/` should be
|
|
owned by the `mcp` user to support this. The path traversal restriction
|
|
(no `..`, no symlink escape) remains the security boundary.
|