mcp/DESIGN_AUDIT.md

# MCP Design Audit

Second review of ARCHITECTURE.md after the rewrite incorporating the first
review's findings. Two categories: engineering (cohesion/completeness) and
security.

## Engineering Issues

### E1: Node registry is missing — RESOLVED

The CLI has `mcp node add/list/remove` commands but has no database. The
CLI needs to know that "rift" is at `100.95.252.120:9444` to connect.

**Resolution:** `[[nodes]]` section in the CLI config file (`mcp.toml`).
`mcp node add/remove` edits the config. `mcp node list` reads it. For v1,
nodes are just name + address. Future versions add a `state` field
(active/inactive/down) for maintenance and health tracking; CLI filters
by state when routing commands.

### E2: Duplicate events table definition — RESOLVED

The events table appears twice: in the Monitoring section and in the
Database Schema section. The monitoring section's index is missing the
`service` column.

**Resolution:** Remove the DDL from the Monitoring section. Define the
table once in the Database Schema section (which has the correct index).
Monitoring section references it narratively.

### E3: `active: false` relationship to `desired_state` is undefined — RESOLVED

Service definition has `active = true/false`. Registry has per-component
`desired_state = running/stopped/ignore`. What does `active = false` mean?

**Resolution:** `active` is the authoritative desired state for the service.
- `active = true` → CLI tells agent: components should be `running`
- `active = false` → CLI tells agent: components should be `stopped`
- `mcp stop <service>` sets `active = false` in the local service
  definition file and tells the agent to stop
- `mcp start <service>` sets `active = true` and tells the agent to start
- `mcp sync` pushes all definitions — agent stops inactive services,
  keeps active ones running

The service definition file is always the source of truth. Lifecycle
commands modify it. The agent receives the resolved intent and acts.

### E4: `mcp adopt` CLI missing component name argument — RESOLVED

CLI: `mcp adopt <container> <service>` (two args).
Proto: `AdoptRequest` has `container`, `service`, `component` (three fields).

**Resolution:** `mcp adopt <service>` adopts all containers matching
`<service>-*` as components, stripping the service prefix to derive the
component name: `metacrypt-api` → component `api`, `metacrypt-web` →
component `web`.

Container naming convention: `<service>-<component>`. No bare service
names as containers. This should be noted in the architecture doc as a
platform convention. Existing containers (e.g., `metacrypt` without a
component suffix) should be renamed to `metacrypt-api` before or during
MCP bootstrap.

For single-component services (e.g., mc-proxy), the container name is
the service name and the component name matches: `mc-proxy` → service
`mc-proxy`, component `mc-proxy`.

### E5: EventInfo proto missing service field — RESOLVED

`EventInfo` has `component` but no `service`. Component names aren't unique
across services.

**Resolution:** Add `string service = 2` to EventInfo, shift other fields.

### E6: `version` field undefined — RESOLVED

`ComponentInfo` and the schema have `version` but nothing sets it.

**Resolution:** Extract version from the image tag. At deploy time, the
agent parses the tag from the image reference (e.g.,
`mcr.../metacrypt:v1.7.0` → `v1.7.0`). During reconciliation, `podman
inspect` provides the image reference. If the tag is `latest`, version
reports `latest`.

Build discipline (enforced in mcdeploy/CI, not MCP): `latest` should
always be an alias for a concrete version tag, so the version is
recoverable from the image metadata.

### E7: No snapshot/backup for agent database — RESOLVED

Platform convention: every service has a `snapshot` command and backup timer.
Agent has SQLite but no snapshot command and no backup timer in the systemd
artifacts.

**Resolution:** Add `mcp-agent snapshot` command and backup timer. Systemd
artifacts: `mcp-agent-backup.service` + `mcp-agent-backup.timer` (daily,
02:00 UTC, 5-minute jitter, per platform convention). Add to project
structure and deployment section.

### E8: CLI flag overrides were dropped — RESOLVED

Original design had CLI flags overriding service definition fields. The
rewrite has only file > registry. Per-component deploy replaces `--image`
but other overrides (e.g., temporary node change) are gone.

**Resolution:** Explicitly drop flag overrides from v1 scope. Per-component
deploy covers the primary use case. For one-off changes, edit the service
definition and deploy. Flag overrides are a future convenience feature.

## Security Issues

### S1: Alert command injection — RESOLVED

Alert command is a shell string with env vars (`MCP_SERVICE`,
`MCP_COMPONENT`, etc.). Crafted container names with shell metacharacters
could inject commands.

**Resolution:** Alert command is an argv array, not a shell string. The
agent calls `exec` with the array and sets `MCP_*` env vars on the child
process. No `sh -c`, no shell interpretation, no injection vector.

Config format changes from:
```toml
alert_command = "ntfy publish mcp ..."
```
to:
```toml
alert_command = ["/usr/local/bin/notify", "--channel", "mcp"]
```

### S2: Agent binds to all interfaces — RESOLVED

`grpc_addr = ":9444"` binds to all interfaces including LAN. The agent
should bind to the overlay interface only.

**Resolution:** Config example should show the overlay IP explicitly:
`grpc_addr = "100.95.252.120:9444"`. The agent does NOT sit behind
MC-Proxy — that would create a circular dependency (MCP manages MC-Proxy,
so the agent must be reachable without MC-Proxy). The agent is in the
same category as MC-Proxy itself: infrastructure directly reachable on
the overlay.

### S3: Unattended auth model unclear — RESOLVED

The old watch config had `password_file` for unattended auth. With the
agent merge, unattended CLI use (scripting, cron) needs a clear token
acquisition path.

**Resolution:** CLI config supports optional system account credentials:
```toml
[auth]
token_path = "~/.config/mcp/token"
username = "mcp-operator"
password_file = "~/.config/mcp/credentials"
```
If set, CLI auto-authenticates to MCIAS when the token is expired/missing.
`MCP_TOKEN` env var remains as an override. Unattended CLI use is rare
for v1 — interactive `mcp login` is the primary path.

### S4: Podman rootless UID mapping with volume mounts — RESOLVED

Rootless podman maps container UIDs via subuid. `user = "0:0"` in the
container + volumes from `/srv/<service>/` means the actual host UID
accessing files depends on subuid config.

**Resolution:** Documentation item. The deployment section should note
that the `mcp` user needs subuid/subgid ranges configured and that files
in `/srv/<service>/` must be accessible to the mapped UIDs. NixOS handles
subuid provisioning as part of user creation.

### S5: No RPC audit log — RESOLVED

Agent records container state-change events but not RPC calls. Deploy
and PushFile are significant actions with no record of who initiated
them or when.

**Resolution:** Log every RPC at `info` level via the gRPC auth
interceptor (after token validation succeeds): method name, caller
identity (from MCIAS token), timestamp. Uses `log/slog` per platform
convention. Minimal implementation — one log line in the interceptor.

### S6: File push creates arbitrary directory trees — RESOLVED

`mcp push cert.pem mcr deeply/nested/path/cert.pem` creates
`/srv/mcr/deeply/nested/path/`.

**Resolution:** This is intentional and necessary. The agent needs to
create `/srv/<service>/` directories for new deployments and arbitrary
subdirectories within them (certs/, backups/, etc.). `/srv/` should be
owned by the `mcp` user to support this. The path traversal restriction
(no `..`, no symlink escape) remains the security boundary.