mcp/REVIEW.md
Kyle Isom c8d0d42ea8 Add REVIEW.md from architecture review session
Documents 12 issues found during critical review of ARCHITECTURE.md
and their resolutions: merged agent/watcher into single smart daemon,
components model for independent deploy within services, database
lives on agent not CLI, TLS+bearer (not mTLS), desired_state=ignore
for unmanaged containers, and other clarifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:27:57 -07:00


MCP ARCHITECTURE.md Review

Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to incorporate the design changes.

Issues

1. Database location — RESOLVED

Problem: Doc says ~/.config/mcp/mcp.db on vade and /srv/mcp/mcp.db on rift. Two databases would diverge.

Resolution: No database on vade. Service definition files on vade are the operator's intent (with active: true/false). The registry database lives only on rift, owned by mcp watch. The CLI writes desired state and deployed specs to the rift registry during deploy/stop/start. The watch process reads from it.

| Location | What | Purpose |
|----------|------|---------|
| vade: ~/.config/mcp/services/*.toml | Service definitions | Operator intent |
| rift: /srv/mcp/mcp.db | Registry | Operational truth (observed state, events, deployed specs) |

Open sub-question: How does the CLI write to the rift registry? Options: (a) CLI talks to the watch process via gRPC, (b) CLI talks to the agent which writes to the registry, (c) CLI SSHes to rift. Needs decision.

2. "mTLS" claim is inaccurate — RESOLVED

Problem: Communication section says "gRPC with mutual TLS" but the actual auth model is server-side TLS (agent has a cert) + bearer token (master presents MCIAS token in gRPC metadata). That's not mTLS.

Resolution: Fix wording to "server-side TLS + MCIAS bearer token." mTLS was discussed and deemed not worth the complexity for v1: the agent is behind the overlay network, tokens are short-lived, and the scenarios where client certs help (stolen token, MCIAS compromise) already imply broader platform compromise. Note mTLS as a potential future hardening measure subject to security review.
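A minimal sketch of the corrected model from the client side: the CLI verifies the agent's server certificate but presents no client certificate, and attaches the MCIAS token as bearer metadata. Function names here are illustrative, not MCP's actual API.

```python
import ssl

def agent_tls_context(ca_file=None):
    """TLS context that verifies the agent's server certificate only."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    # Deliberately no ctx.load_cert_chain(...): the client presents no
    # certificate, so this is server-side TLS, not mTLS.
    return ctx

def auth_metadata(mcias_token):
    """gRPC metadata carrying the short-lived MCIAS bearer token."""
    return [("authorization", f"Bearer {mcias_token}")]
```

Adding mTLS later would mean loading a client cert chain into this context and having the agent require it; nothing else in the flow changes.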

3. Agent system user can't manage containers — RESOLVED

Problem: Agent systemd unit runs as User=mcp. But podman/docker CLI requires root, docker group membership, or rootless podman setup. Doc doesn't address how the mcp user gets container runtime permissions.

Resolution: Provision a dedicated mcp user via NixOS config. Agent runs as mcp. Podman runs rootless under that user. All containers are owned by mcp. Existing containers on rift (currently under kyle) will need to be re-created under mcp's podman instance as a one-time migration during MCP bootstrap. /srv/ directories need to be accessible to the mcp user.
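A hedged sketch of what the NixOS provisioning could look like (option names should be checked against the NixOS manual for the deployed release; uid ranges are illustrative):

```nix
# Dedicated mcp user with subuid/subgid ranges for rootless podman.
users.users.mcp = {
  isNormalUser = true;
  home = "/home/mcp";
  subUidRanges = [ { startUid = 100000; count = 65536; } ];
  subGidRanges = [ { startGid = 100000; count = 65536; } ];
};
virtualisation.podman.enable = true;
```

The mcp user will also need lingering enabled (so its user services survive logout) and ownership or group access to the relevant /srv/ directories.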

4. No way to target individual containers from CLI — RESOLVED

Problem: Doc says "lifecycle commands can target individual containers" but CLI only shows mcp stop <service>, not per-container syntax.

Resolution: CLI operates at service level only (like compose). Drop the per-container claim. Per-container operations are the agent's internal concern — the agent decides how to manage individual containers within a service. If an operator needs per-container control, they use podman directly on the node.

5. --image flag ambiguous for multi-container services — RESOLVED

Problem: mcp deploy metacrypt --image v2.0.0 — metacrypt has two containers with different images. Does --image override both? Which one?

Resolution: Rename [[containers]] to [[components]]. Components are independently deployable within a service. Deploy targets:

  • mcp deploy metacrypt — all components (default)
  • mcp deploy metacrypt/web — just the web component

No --image flag needed. To update an image, edit the service definition (or mcp service edit metacrypt) and deploy the specific component.

Components share the service's node and /srv/<service>/ directory. This models the real constraint: api and web are co-located, share state, but have independent lifecycles (e.g., restarting metacrypt-api requires unsealing, but web can be redeployed independently).

6. Unmanaged desired_state contradicts schema — RESOLVED

Problem: Doc says "Unmanaged containers have no desired state set (desired_state is empty)" but schema has desired_state TEXT NOT NULL DEFAULT 'running'. Empty string is not a valid state value.

Resolution: Add ignore as a desired state. Unmanaged containers discovered during sync get desired_state = 'ignore'. The agent sees them, reports them in status, but doesn't alert on drift or try to reconcile. mcp adopt changes desired_state from ignore to running.

Desired states: running, stopped, ignore.
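The schema fix and the adopt flow can be sketched against SQLite directly; table and column names here are illustrative, not the actual registry schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# 'ignore' as a first-class desired state, enforced by a CHECK constraint
# so the NOT NULL DEFAULT 'running' column never holds an empty string.
db.execute("""
    CREATE TABLE containers (
        name          TEXT PRIMARY KEY,
        desired_state TEXT NOT NULL DEFAULT 'running'
                      CHECK (desired_state IN ('running', 'stopped', 'ignore'))
    )
""")

# An unmanaged container discovered during sync:
db.execute("INSERT INTO containers VALUES ('legacy-cache', 'ignore')")

# `mcp adopt` flips it to managed:
db.execute(
    "UPDATE containers SET desired_state = 'running' WHERE name = 'legacy-cache'"
)

(state,) = db.execute(
    "SELECT desired_state FROM containers WHERE name = 'legacy-cache'"
).fetchone()
```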

7. restart vs restart_policy naming inconsistency — RESOLVED

Problem: Service definition TOML uses restart. Proto and schema use restart_policy. Pick one.

Resolution: Use restart everywhere — TOML, proto, schema. Shorter, matches compose convention.

8. mcp ps vs mcp status vs mcp sync unclear boundaries — RESOLVED

Problem: Status and sync both query agents and reconcile. What's the difference?

Resolution: Four distinct commands with clear semantics:

  • mcp list — read the agent's registry. No runtime query. Fast. Shows services, desired state, last-known observed state, version.
  • mcp ps — force a live runtime query, update observed state. Shows what's actually running with uptime and version.
  • mcp status — full picture: live query + drift detection + recent events. "What do I need to worry about?"
  • mcp sync — push service definitions from CLI to agent. Updates the agent's desired state without deploying. "Here's what should be running on your node."
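The read-path difference between list and ps can be shown with a toy registry and runtime (stand-ins for the agent's SQLite database and podman):

```python
# Cached observed state vs. what the runtime actually reports.
registry = {"metacrypt": {"desired": "running", "observed": "stopped"}}
runtime = {"metacrypt": "running"}

def mcp_list():
    """Fast: read the registry only, no runtime query."""
    return {svc: rec["observed"] for svc, rec in registry.items()}

def mcp_ps():
    """Live: query the runtime, refresh observed state in the registry."""
    for svc, state in runtime.items():
        registry[svc]["observed"] = state
    return {svc: rec["observed"] for svc, rec in registry.items()}
```

After one mcp_ps(), mcp_list() returns the refreshed state, which is the intended interaction between the two commands.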

9. No REST on agent — unstated deviation from engineering standards — RESOLVED

Problem: Engineering standards require REST+gRPC parity. Agent is gRPC-only.

Resolution: The REST+gRPC parity rule applies to user-facing services, not internal infrastructure. MCP's agent is a C2 channel with no meaningful REST use case. Call out the exception in ARCHITECTURE.md, and update engineering-standards.md to clarify the rule applies where both interfaces provide value (user services with external consumers), not universally.

10. Master-side deploy flow undocumented — RESOLVED

Problem: The agent-side deploy flow is documented step by step. The master-side flow is fragmented across sections.

Resolution: Resolved by the agent merge. The CLI is now a thin client that pushes service definitions to the agent. The agent handles the full deploy flow (resolve spec, pull image, stop/start containers, update registry). The ARCHITECTURE.md rewrite will document this as one coherent flow.
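One hedged sketch of the merged flow, with runtime calls stubbed out and all names illustrative rather than MCP's actual API:

```python
events = []

def pull_image(image):
    events.append(f"pull {image}")

def restart_container(name, image):
    events.append(f"restart {name} ({image})")

def handle_deploy(spec, registry):
    """Agent-side deploy: resolve spec, pull, stop/start, update registry."""
    for comp in spec["components"]:
        pull_image(comp["image"])
        restart_container(f'{spec["name"]}-{comp["name"]}', comp["image"])
        registry[(spec["name"], comp["name"])] = comp["image"]

registry = {}
handle_deploy(
    {"name": "metacrypt", "components": [{"name": "web", "image": "web:v2"}]},
    registry,
)
```

The CLI's only contribution to this flow is pushing `spec`; everything after that, including the registry write, happens on the node.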

11. mcp watch connecting to local agent — RESOLVED

Problem: Watch and agent on same machine — unclear wiring.

Resolution: Eliminated by the agent merge. Watch is now a subsystem within the agent, not a separate process. No cross-process wiring needed.

12. Container name uniqueness across nodes — RESOLVED

Problem: Container name is the PK. Works for v1 with one node. Would collide with multi-node.

Resolution: With the merged agent model, each agent has its own SQLite database on its own node. No cross-node name collision. The unique identity of a component is node/service/component. The CLI aggregates across agents in output and qualifies by node.
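A small sketch of the node-qualified identity (node and component names here, including the second node, are hypothetical):

```python
# Each agent's registry keys components by (service, component);
# uniqueness across nodes comes from the CLI qualifying by node.
agents = {
    "rift": [("metacrypt", "api"), ("metacrypt", "web")],
    "spire": [("metacrypt", "api")],  # same name, different node: no collision
}

def qualified(node, service, component):
    return f"{node}/{service}/{component}"

inventory = sorted(
    qualified(node, svc, comp)
    for node, comps in agents.items()
    for svc, comp in comps
)
```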

Design Changes Agreed During Review

  • active: true/false in service definitions replaces state field
  • No database on vade; registry lives only on rift
  • Service definition files are operator-local (vade), not replicated to rift
  • CLI writes desired state + deployed spec to rift registry during actions

Major: Merge agent and watcher into a single smart node daemon

The original design had three components: CLI (master), dumb agent, watch process. This is now simplified to two:

CLI (mcp) — thin client on operator's workstation:

  • Reads local service definition files (~/.config/mcp/services/)
  • Pushes desired state to the agent on each node
  • Reads status, events, drift back from agents
  • Provides operator UX (ps, status, deploy, service show/edit/export)
  • No database, no daemon

Agent (mcp-agent) — smart per-node daemon:

  • Receives desired state from CLI (service definitions)
  • Observes actual state via container runtime
  • Acts on the difference (deploy, stop, start)
  • Stores the registry (SQLite: desired state, observed state, deployed specs, events)
  • Monitors continuously (watch loop, event recording, drift detection)
  • Alerts on drift/flapping (configurable alert command)
  • Handles file transfers (push/pull within /srv/<service>/)
  • Reports node resources (disk, memory, CPU — for future scheduling)

The "dumb" container runtime interaction (podman/docker exec) is an internal subcomponent of the agent, not a separate concern.

This sets up for the declarative future: the primary operation is "here's what should be running on your node" and the agent works to make reality match intent. v1 reconciliation is operator-triggered (deploy/sync). v2 can add continuous auto-reconciliation.
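The core of "make reality match intent" reduces to a diff between desired and observed state. A minimal sketch, with illustrative function and state names (not MCP's implementation):

```python
def reconcile(desired, observed):
    """Return {component: action} needed to make observed match desired."""
    actions = {}
    for comp, want in desired.items():
        if want == "ignore":
            continue  # seen and reported, never reconciled
        have = observed.get(comp, "absent")
        if want == "running" and have != "running":
            actions[comp] = "start"
        elif want == "stopped" and have == "running":
            actions[comp] = "stop"
    return actions

plan = reconcile(
    desired={"api": "running", "web": "running", "legacy": "ignore"},
    observed={"api": "running", "web": "stopped", "legacy": "running"},
)
```

In v1 the operator triggers this computation via deploy/sync; v2 auto-reconciliation is the same function run on a timer, which is why the merge sets up the declarative future cheaply.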

Implications for other review items:

  • #1 (database location): fully resolved — database is on the agent
  • #2 (mTLS): still needs fixing (terminology)
  • #3 (agent user permissions): still needs answer
  • #10 (master-side deploy flow): changes significantly — CLI pushes to agent, agent executes
  • #11 (watch connecting to agent): eliminated — they're the same process