# MCP ARCHITECTURE.md Review

Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to incorporate the design changes.

## Issues

### 1. Database location — RESOLVED

**Problem:** Doc says `~/.config/mcp/mcp.db` on vade and `/srv/mcp/mcp.db` on rift. Two databases would diverge.

**Resolution:** No database on vade. Service definition files on vade are the operator's intent (with `active: true/false`). The registry database lives only on rift, owned by `mcp watch`. The CLI writes desired state and deployed specs to the rift registry during deploy/stop/start. The watch process reads from it.

| Location | What | Purpose |
|----------|------|---------|
| vade: `~/.config/mcp/services/*.toml` | Service definitions | Operator intent |
| rift: `/srv/mcp/mcp.db` | Registry | Operational truth (observed state, events, deployed specs) |

**Open sub-question:** How does the CLI write to the rift registry? Options: (a) CLI talks to the watch process via gRPC, (b) CLI talks to the agent, which writes to the registry, (c) CLI SSHes to rift. Needs decision.

### 2. "mTLS" claim is inaccurate — RESOLVED

**Problem:** Communication section says "gRPC with mutual TLS," but the actual auth model is server-side TLS (agent has a cert) + bearer token (master presents MCIAS token in gRPC metadata). That's not mTLS.

**Resolution:** Fix wording to "server-side TLS + MCIAS bearer token." mTLS was discussed and deemed not worth the complexity for v1: the agent is behind the overlay network, tokens are short-lived, and the scenarios where client certs help (stolen token, MCIAS compromise) already imply broader platform compromise. Note mTLS as a potential future hardening measure, subject to security review.

### 3. Agent system user can't manage containers — RESOLVED

**Problem:** Agent systemd unit runs as `User=mcp`, but the podman/docker CLI requires root, docker group membership, or a rootless podman setup. Doc doesn't address how the `mcp` user gets container runtime permissions.
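For reference, a minimal sketch of what rootless-podman provisioning for a dedicated `mcp` user could look like in NixOS. The option names are real NixOS options, but the values and module layout here are illustrative assumptions, not the decided config:

```nix
# Sketch only — assumes NixOS with systemd user lingering support.
{
  users.users.mcp = {
    isNormalUser = true;
    home = "/var/lib/mcp";          # placeholder home; actual path TBD
    autoSubUidGidRange = true;      # allocate subuid/subgid ranges for rootless podman
    linger = true;                  # keep user services (and containers) alive without a login session
  };

  virtualisation.podman.enable = true;
}
```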
**Resolution:** Provision a dedicated `mcp` user via NixOS config. Agent runs as `mcp`. Podman runs rootless under that user. All containers are owned by `mcp`. Existing containers on rift (currently under kyle) will need to be re-created under `mcp`'s podman instance as a one-time migration during MCP bootstrap. `/srv/` directories need to be accessible to the `mcp` user.

### 4. No way to target individual containers from CLI — RESOLVED

**Problem:** Doc says "lifecycle commands can target individual containers," but the CLI only shows `mcp stop `, not per-container syntax.

**Resolution:** CLI operates at the service level only (like compose). Drop the per-container claim. Per-container operations are the agent's internal concern — the agent decides how to manage individual containers within a service. If an operator needs per-container control, they use podman directly on the node.

### 5. `--image` flag ambiguous for multi-container services — RESOLVED

**Problem:** `mcp deploy metacrypt --image v2.0.0` — metacrypt has two containers with different images. Does `--image` override both? Which one?

**Resolution:** Rename `[[containers]]` to `[[components]]`. Components are independently deployable within a service. Deploy targets:

- `mcp deploy metacrypt` — all components (default)
- `mcp deploy metacrypt/web` — just the web component

No `--image` flag needed. To update an image, edit the service definition (or `mcp service edit metacrypt`) and deploy the specific component. Components share the service's node and `/srv//` directory. This models the real constraint: api and web are co-located and share state, but have independent lifecycles (e.g., restarting metacrypt-api requires unsealing, but web can be redeployed independently).

### 6. Unmanaged desired_state contradicts schema — RESOLVED

**Problem:** Doc says "Unmanaged containers have no desired state set (desired_state is empty)," but the schema has `desired_state TEXT NOT NULL DEFAULT 'running'`.
An empty string is not a valid state value.

**Resolution:** Add `ignore` as a desired state. Unmanaged containers discovered during sync get `desired_state = 'ignore'`. The agent sees them, reports them in status, but doesn't alert on drift or try to reconcile. `mcp adopt` changes desired_state from `ignore` to `running`.

Desired states: `running`, `stopped`, `ignore`.

### 7. `restart` vs `restart_policy` naming inconsistency — RESOLVED

**Problem:** Service definition TOML uses `restart`. Proto and schema use `restart_policy`. Pick one.

**Resolution:** Use `restart` everywhere — TOML, proto, schema. Shorter, matches compose convention.

### 8. `mcp ps` vs `mcp status` vs `mcp sync` unclear boundaries — RESOLVED

**Problem:** Status and sync both query agents and reconcile. What's the difference?

**Resolution:** Four distinct commands with clear semantics:

- `mcp list` — read the agent's registry. No runtime query. Fast. Shows services, desired state, last-known observed state, version.
- `mcp ps` — force a live runtime query, update observed state. Shows what's actually running, with uptime and version.
- `mcp status` — full picture: live query + drift detection + recent events. "What do I need to worry about?"
- `mcp sync` — push service definitions from CLI to agent. Updates the agent's desired state without deploying. "Here's what should be running on your node."

### 9. No REST on agent — unstated deviation from engineering standards — RESOLVED

**Problem:** Engineering standards require REST+gRPC parity. Agent is gRPC-only.

**Resolution:** The REST+gRPC parity rule applies to user-facing services, not internal infrastructure. MCP's agent is a C2 channel with no meaningful REST use case. Call out the exception in ARCHITECTURE.md, and update engineering-standards.md to clarify that the rule applies where both interfaces provide value (user services with external consumers), not universally.

### 10. Master-side deploy flow undocumented — RESOLVED

**Problem:** The agent-side deploy flow is documented step by step. The master-side flow is fragmented across sections.

**Resolution:** Resolved by the agent merge. The CLI is now a thin client that pushes service definitions to the agent. The agent handles the full deploy flow (resolve spec, pull image, stop/start containers, update registry). The ARCHITECTURE.md rewrite will document this as one coherent flow.

### 11. `mcp watch` connecting to local agent — RESOLVED

**Problem:** Watch and agent on the same machine — unclear wiring.

**Resolution:** Eliminated by the agent merge. Watch is now a subsystem within the agent, not a separate process. No cross-process wiring needed.

### 12. Container name uniqueness across nodes — RESOLVED

**Problem:** Container `name` is the PK. Works for v1 with one node. Would collide with multi-node.

**Resolution:** With the merged agent model, each agent has its own SQLite database on its own node. No cross-node name collision. The unique identity of a component is `node/service/component`. The CLI aggregates across agents in output and qualifies by node.

## Design Changes Agreed During Review

- `active: true/false` in service definitions replaces the `state` field
- No database on vade; registry lives only on rift
- Service definition files are operator-local (vade), not replicated to rift
- CLI writes desired state + deployed spec to the rift registry during actions

### Major: Merge agent and watcher into a single smart node daemon

The original design had three components: CLI (master), dumb agent, watch process.
This is now simplified to two:

**CLI** (`mcp`) — thin client on the operator's workstation:

- Reads local service definition files (`~/.config/mcp/services/`)
- Pushes desired state to the agent on each node
- Reads status, events, drift back from agents
- Provides operator UX (ps, status, deploy, service show/edit/export)
- No database, no daemon

**Agent** (`mcp-agent`) — smart per-node daemon:

- Receives desired state from CLI (service definitions)
- Observes actual state via container runtime
- Acts on the difference (deploy, stop, start)
- Stores the registry (SQLite: desired state, observed state, deployed specs, events)
- Monitors continuously (watch loop, event recording, drift detection)
- Alerts on drift/flapping (configurable alert command)
- Handles file transfers (push/pull within /srv//)
- Reports node resources (disk, memory, CPU — for future scheduling)

The "dumb" container runtime interaction (podman/docker exec) is an internal subcomponent of the agent, not a separate concern.

This sets up the declarative future: the primary operation is "here's what should be running on your node," and the agent works to make reality match intent. v1 reconciliation is operator-triggered (deploy/sync). v2 can add continuous auto-reconciliation.

Implications for other review items:

- #1 (database location): fully resolved — database is on the agent
- #2 (mTLS): still needs fixing (terminology)
- #3 (agent user permissions): still needs answer
- #10 (master-side deploy flow): changes significantly — CLI pushes to agent, agent executes
- #11 (watch connecting to agent): eliminated — they're the same process
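To make the agreed service-definition shape concrete, a hedged sketch combining the decisions above (`active` replacing `state`, `restart` naming, `[[components]]` replacing `[[containers]]`). Only those three elements come from the review; the remaining field names (`name`, `node`, `image`) and the image references are illustrative assumptions:

```toml
# Sketch of a service definition under the agreed model — not the final schema.
name = "metacrypt"
node = "rift"
active = true        # replaces the old `state` field

[[components]]       # components are independently deployable within the service
name = "api"
image = "registry.example.com/metacrypt-api:v1.0.0"   # placeholder ref
restart = "on-failure"   # api requires manual unsealing after restart

[[components]]
name = "web"
image = "registry.example.com/metacrypt-web:v2.0.0"   # placeholder ref
restart = "always"
```

Under this shape, `mcp deploy metacrypt/web` would redeploy only the second component, leaving api (and its unseal state) untouched.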