mcp/REVIEW.md
Kyle Isom c8d0d42ea8 Add REVIEW.md from architecture review session
Documents 12 issues found during critical review of ARCHITECTURE.md
and their resolutions: merged agent/watcher into single smart daemon,
components model for independent deploy within services, database
lives on agent not CLI, TLS+bearer (not mTLS), desired_state=ignore
for unmanaged containers, and other clarifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:27:57 -07:00


# MCP ARCHITECTURE.md Review
Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to
incorporate the design changes.
## Issues
### 1. Database location — RESOLVED
**Problem:** Doc says `~/.config/mcp/mcp.db` on vade and `/srv/mcp/mcp.db`
on rift. Two databases would diverge.
**Resolution:** No database on vade. Service definition files on vade are
the operator's intent (with `active: true/false`). The registry database
lives only on rift, owned by `mcp watch`. The CLI writes desired state and
deployed specs to the rift registry during deploy/stop/start. The watch
process reads from it.

| Location | What | Purpose |
|----------|------|---------|
| vade: `~/.config/mcp/services/*.toml` | Service definitions | Operator intent |
| rift: `/srv/mcp/mcp.db` | Registry | Operational truth (observed state, events, deployed specs) |
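The operator-intent side of this split can be sketched as a service definition file. `active` comes from the Design Changes section below; the other key names are illustrative assumptions, not confirmed by the review:

```toml
# ~/.config/mcp/services/metacrypt.toml — operator intent only.
# Observed state, events, and deployed specs live in the registry on rift.
active = true    # replaces the old `state` field (see Design Changes)
node = "rift"    # illustrative key: which node the service runs on
```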
**Open sub-question:** How does the CLI write to the rift registry? Options:
(a) CLI talks to the watch process via gRPC, (b) CLI talks to the agent
which writes to the registry, (c) CLI SSHes to rift. Needs decision.
### 2. "mTLS" claim is inaccurate — RESOLVED
**Problem:** Communication section says "gRPC with mutual TLS" but the
actual auth model is server-side TLS (agent has a cert) + bearer token
(master presents MCIAS token in gRPC metadata). That's not mTLS.
**Resolution:** Fix wording to "server-side TLS + MCIAS bearer token."
mTLS was discussed and deemed not worth the complexity for v1: the agent
is behind the overlay network, tokens are short-lived, and the scenarios
where client certs help (stolen token, MCIAS compromise) already imply
broader platform compromise. Note mTLS as a potential future hardening
measure subject to security review.
### 3. Agent system user can't manage containers — RESOLVED
**Problem:** Agent systemd unit runs as `User=mcp`. But podman/docker CLI
requires root, docker group membership, or rootless podman setup. Doc
doesn't address how the `mcp` user gets container runtime permissions.
**Resolution:** Provision a dedicated `mcp` user via NixOS config. Agent
runs as `mcp`. Podman runs rootless under that user. All containers are
owned by `mcp`. Existing containers on rift (currently under kyle) will
need to be re-created under `mcp`'s podman instance as a one-time
migration during MCP bootstrap. `/srv/` directories need to be accessible
to the `mcp` user.
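A hedged sketch of what the NixOS provisioning might look like. The option names below are real NixOS options, but the exact module layout for MCP is an assumption:

```nix
# Sketch: dedicated mcp user for rootless podman on rift.
users.users.mcp = {
  isNormalUser = true;        # rootless podman needs subuid/subgid ranges
  autoSubUidGidRange = true;  # allocate them automatically
  home = "/srv/mcp-home";     # illustrative path
  linger = true;              # keep the user's podman alive without a login
};
virtualisation.podman.enable = true;
```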
### 4. No way to target individual containers from CLI — RESOLVED
**Problem:** Doc says "lifecycle commands can target individual containers"
but CLI only shows `mcp stop <service>`, not per-container syntax.
**Resolution:** CLI operates at service level only (like compose). Drop the
per-container claim. Per-container operations are the agent's internal
concern — the agent decides how to manage individual containers within a
service. If an operator needs per-container control, they use podman
directly on the node.
### 5. `--image` flag ambiguous for multi-container services — RESOLVED
**Problem:** `mcp deploy metacrypt --image v2.0.0` — metacrypt has two
containers with different images. Does `--image` override both? Which one?
**Resolution:** Rename `[[containers]]` to `[[components]]`. Components
are independently deployable within a service. Deploy targets:
- `mcp deploy metacrypt` — all components (default)
- `mcp deploy metacrypt/web` — just the web component
No `--image` flag needed. To update an image, edit the service definition
(or `mcp service edit metacrypt`) and deploy the specific component.
Components share the service's node and `/srv/<service>/` directory. This
models the real constraint: api and web are co-located, share state, but
have independent lifecycles (e.g., restarting metacrypt-api requires
unsealing, but web can be redeployed independently).
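A sketch of what the `[[components]]` form might look like for metacrypt under this resolution. `restart` follows issue 7; the image tags and other key names are illustrative:

```toml
# metacrypt service definition (sketch). Both components share the service's
# node and /srv/metacrypt/, but deploy independently:
#   mcp deploy metacrypt        # all components
#   mcp deploy metacrypt/web    # just the web component

[[components]]
name = "api"
image = "metacrypt-api:v1.4.2"   # illustrative tag
restart = "always"               # `restart`, not `restart_policy` (issue 7)

[[components]]
name = "web"
image = "metacrypt-web:v2.0.0"
restart = "always"
```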
### 6. Unmanaged desired_state contradicts schema — RESOLVED
**Problem:** Doc says "Unmanaged containers have no desired state set
(desired_state is empty)" but schema has `desired_state TEXT NOT NULL
DEFAULT 'running'`. Empty string is not a valid state value.
**Resolution:** Add `ignore` as a desired state. Unmanaged containers
discovered during sync get `desired_state = 'ignore'`. The agent sees
them, reports them in status, but doesn't alert on drift or try to
reconcile. `mcp adopt` changes desired_state from `ignore` to `running`.
Desired states: `running`, `stopped`, `ignore`.
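The state transitions above can be sketched in a few lines. The function names are illustrative; the three states, the `ignore` default for discovered containers, and the adopt transition are from the review:

```python
# Sketch of the desired_state model from issue 6.
VALID_STATES = {"running", "stopped", "ignore"}

def on_discovered_unmanaged(record: dict) -> dict:
    """Containers found during sync that MCP doesn't manage get 'ignore'."""
    record["desired_state"] = "ignore"
    return record

def adopt(record: dict) -> dict:
    """`mcp adopt` promotes an unmanaged container to managed/running."""
    if record["desired_state"] != "ignore":
        raise ValueError("adopt only applies to unmanaged containers")
    record["desired_state"] = "running"
    return record

def should_alert_on_drift(record: dict, observed: str) -> bool:
    """The agent reports 'ignore' containers but never alerts on them."""
    desired = record["desired_state"]
    if desired == "ignore":
        return False
    return desired != observed
```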
### 7. `restart` vs `restart_policy` naming inconsistency — RESOLVED
**Problem:** Service definition TOML uses `restart`. Proto and schema use
`restart_policy`. Pick one.
**Resolution:** Use `restart` everywhere — TOML, proto, schema. Shorter,
matches compose convention.
### 8. `mcp ps` vs `mcp status` vs `mcp sync` unclear boundaries — RESOLVED
**Problem:** Status and sync both query agents and reconcile. What's the
difference?
**Resolution:** Four distinct commands with clear semantics:
- `mcp list` — read the agent's registry. No runtime query. Fast. Shows
services, desired state, last-known observed state, version.
- `mcp ps` — force a live runtime query, update observed state. Shows
what's actually running with uptime and version.
- `mcp status` — full picture: live query + drift detection + recent
events. "What do I need to worry about?"
- `mcp sync` — push service definitions from CLI to agent. Updates the
agent's desired state without deploying. "Here's what should be
running on your node."
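The boundaries above reduce to which data source each command touches. Purely an illustrative summary; MCP's real CLI internals are not specified here:

```python
# Issue 8 command boundaries as a lookup table (sketch).
COMMANDS = {
    # command:   reads registry / queries runtime / pushes definitions
    "list":   {"registry": True, "runtime": False, "push": False},
    "ps":     {"registry": True, "runtime": True,  "push": False},
    "status": {"registry": True, "runtime": True,  "push": False},
    "sync":   {"registry": True, "runtime": False, "push": True},
}

def is_fast(cmd: str) -> bool:
    """Commands that skip the live runtime query return immediately."""
    return not COMMANDS[cmd]["runtime"]
```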
### 9. No REST on agent — unstated deviation from engineering standards — RESOLVED
**Problem:** Engineering standards require REST+gRPC parity. Agent is
gRPC-only.
**Resolution:** The REST+gRPC parity rule applies to user-facing services,
not internal infrastructure. MCP's agent is a C2 channel with no
meaningful REST use case. Call out the exception in ARCHITECTURE.md, and
update engineering-standards.md to clarify the rule applies where both
interfaces provide value (user services with external consumers), not
universally.
### 10. Master-side deploy flow undocumented — RESOLVED
**Problem:** The agent-side deploy flow is documented step by step. The
master-side flow is fragmented across sections.
**Resolution:** Resolved by the agent merge. The CLI is now a thin client
that pushes service definitions to the agent. The agent handles the full
deploy flow (resolve spec, pull image, stop/start containers, update
registry). The ARCHITECTURE.md rewrite will document this as one coherent
flow.
### 11. `mcp watch` connecting to local agent — RESOLVED
**Problem:** Watch and agent on same machine — unclear wiring.
**Resolution:** Eliminated by the agent merge. Watch is now a subsystem
within the agent, not a separate process. No cross-process wiring needed.
### 12. Container name uniqueness across nodes — RESOLVED
**Problem:** Container `name` is the PK. Works for v1 with one node.
Would collide with multi-node.
**Resolution:** With the merged agent model, each agent has its own
SQLite database on its own node. No cross-node name collision. The
unique identity of a component is `node/service/component`. The CLI
aggregates across agents in output and qualifies by node.
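The qualified identity can be sketched as a tuple type. Names here are illustrative; the `node/service/component` shape is from the resolution above:

```python
# Sketch of issue 12's component identity: with one SQLite registry per
# agent, names only need to be unique per node, and the CLI qualifies
# output by node.
from typing import NamedTuple

class ComponentId(NamedTuple):
    node: str
    service: str
    component: str

    def qualified(self) -> str:
        return f"{self.node}/{self.service}/{self.component}"

# Two nodes may each run a "web" component under the same service name
# without colliding, because identity includes the node.
a = ComponentId("rift", "metacrypt", "web")
b = ComponentId("other-node", "metacrypt", "web")
assert a.qualified() != b.qualified()
```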
## Design Changes Agreed During Review
- `active: true/false` in service definitions replaces `state` field
- No database on vade; registry lives only on rift
- Service definition files are operator-local (vade), not replicated to rift
- CLI writes desired state + deployed spec to rift registry during actions
### Major: Merge agent and watcher into a single smart node daemon
The original design had three components: CLI (master), dumb agent, watch
process. This is now simplified to two:
**CLI** (`mcp`) — thin client on operator's workstation:
- Reads local service definition files (`~/.config/mcp/services/`)
- Pushes desired state to the agent on each node
- Reads status, events, drift back from agents
- Provides operator UX (ps, status, deploy, service show/edit/export)
- No database, no daemon
**Agent** (`mcp-agent`) — smart per-node daemon:
- Receives desired state from CLI (service definitions)
- Observes actual state via container runtime
- Acts on the difference (deploy, stop, start)
- Stores the registry (SQLite: desired state, observed state, deployed
specs, events)
- Monitors continuously (watch loop, event recording, drift detection)
- Alerts on drift/flapping (configurable alert command)
- Handles file transfers (push/pull within `/srv/<service>/`)
- Reports node resources (disk, memory, CPU — for future scheduling)
The "dumb" container runtime interaction (podman/docker exec) is an
internal subcomponent of the agent, not a separate concern.
This sets up for the declarative future: the primary operation is "here's
what should be running on your node" and the agent works to make reality
match intent. v1 reconciliation is operator-triggered (deploy/sync). v2
can add continuous auto-reconciliation.
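The reconcile step described above can be sketched as a pure function over desired and observed state. All names are illustrative; in v1 this runs when the operator triggers deploy/sync, not continuously:

```python
# Minimal sketch of the agent's reconcile step: compare desired state
# (pushed by the CLI) with observed state (from the container runtime)
# and act on the difference.
def reconcile(desired: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Return the actions the agent would take, as human-readable strings."""
    actions = []
    for name, want in desired.items():
        have = observed.get(name, "absent")
        if want == "ignore" or want == have:
            continue  # unmanaged, or already converged
        if want == "running":
            actions.append(f"start {name}")
        elif want == "stopped" and have == "running":
            actions.append(f"stop {name}")
    return actions
```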
Implications for other review items:
- #1 (database location): fully resolved — database is on the agent
- #2 (mTLS): still needs fixing (terminology)
- #3 (agent user permissions): still needs answer
- #10 (master-side deploy flow): changes significantly — CLI pushes to
agent, agent executes
- #11 (watch connecting to agent): eliminated — they're the same process