# MCP ARCHITECTURE.md Review

Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to
incorporate the design changes.

## Issues

### 1. Database location — RESOLVED

**Problem:** Doc says `~/.config/mcp/mcp.db` on vade and `/srv/mcp/mcp.db`
on rift. Two databases would diverge.

**Resolution:** No database on vade. Service definition files on vade are
the operator's intent (with `active: true/false`). The registry database
lives only on rift, owned by `mcp watch`. The CLI writes desired state and
deployed specs to the rift registry during deploy/stop/start. The watch
process reads from it.

| Location | What | Purpose |
|----------|------|---------|
| vade: `~/.config/mcp/services/*.toml` | Service definitions | Operator intent |
| rift: `/srv/mcp/mcp.db` | Registry | Operational truth (observed state, events, deployed specs) |

**Open sub-question:** How does the CLI write to the rift registry? Options:
(a) the CLI talks to the watch process via gRPC, (b) the CLI talks to the
agent, which writes to the registry, (c) the CLI SSHes to rift. Needs a
decision.

### 2. "mTLS" claim is inaccurate — RESOLVED

**Problem:** The Communication section says "gRPC with mutual TLS," but the
actual auth model is server-side TLS (the agent has a cert) plus a bearer
token (the master presents an MCIAS token in gRPC metadata). That's not
mTLS.

**Resolution:** Fix the wording to "server-side TLS + MCIAS bearer token."
mTLS was discussed and deemed not worth the complexity for v1: the agent
is behind the overlay network, tokens are short-lived, and the scenarios
where client certs help (stolen token, MCIAS compromise) already imply
broader platform compromise. Note mTLS as a potential future hardening
measure, subject to security review.

### 3. Agent system user can't manage containers — RESOLVED

**Problem:** The agent systemd unit runs as `User=mcp`, but the
podman/docker CLI requires root, docker group membership, or a rootless
podman setup. The doc doesn't address how the `mcp` user gets container
runtime permissions.

**Resolution:** Provision a dedicated `mcp` user via NixOS config. The
agent runs as `mcp`. Podman runs rootless under that user. All containers
are owned by `mcp`. Existing containers on rift (currently under kyle)
will need to be re-created under `mcp`'s podman instance as a one-time
migration during MCP bootstrap. `/srv/` directories need to be accessible
to the `mcp` user.

### 4. No way to target individual containers from CLI — RESOLVED

**Problem:** The doc says "lifecycle commands can target individual
containers," but the CLI only shows `mcp stop <service>`, with no
per-container syntax.

**Resolution:** The CLI operates at the service level only (like compose).
Drop the per-container claim. Per-container operations are the agent's
internal concern — the agent decides how to manage individual containers
within a service. If an operator needs per-container control, they use
podman directly on the node.

### 5. `--image` flag ambiguous for multi-container services — RESOLVED

**Problem:** `mcp deploy metacrypt --image v2.0.0` — metacrypt has two
containers with different images. Does `--image` override both? Which one?

**Resolution:** Rename `[[containers]]` to `[[components]]`. Components
are independently deployable within a service. Deploy targets:

- `mcp deploy metacrypt` — all components (default)
- `mcp deploy metacrypt/web` — just the web component

No `--image` flag needed. To update an image, edit the service definition
(or `mcp service edit metacrypt`) and deploy the specific component.

Components share the service's node and `/srv/<service>/` directory.
This models the real constraint: api and web are co-located and share
state, but have independent lifecycles (e.g., restarting metacrypt-api
requires unsealing, but web can be redeployed independently).

### 6. Unmanaged desired_state contradicts schema — RESOLVED

**Problem:** The doc says "Unmanaged containers have no desired state set
(desired_state is empty)," but the schema has `desired_state TEXT NOT NULL
DEFAULT 'running'`. An empty string is not a valid state value.

**Resolution:** Add `ignore` as a desired state. Unmanaged containers
discovered during sync get `desired_state = 'ignore'`. The agent sees
them and reports them in status, but doesn't alert on drift or try to
reconcile. `mcp adopt` changes desired_state from `ignore` to `running`.

Desired states: `running`, `stopped`, `ignore`.

### 7. `restart` vs `restart_policy` naming inconsistency — RESOLVED

**Problem:** The service definition TOML uses `restart`. The proto and
schema use `restart_policy`. Pick one.

**Resolution:** Use `restart` everywhere — TOML, proto, schema. It's
shorter and matches the compose convention.

### 8. `mcp ps` vs `mcp status` vs `mcp sync` unclear boundaries — RESOLVED

**Problem:** Status and sync both query agents and reconcile. What's the
difference?

**Resolution:** Four distinct commands with clear semantics:

- `mcp list` — read the agent's registry. No runtime query. Fast. Shows
  services, desired state, last-known observed state, version.
- `mcp ps` — force a live runtime query and update observed state. Shows
  what's actually running, with uptime and version.
- `mcp status` — full picture: live query + drift detection + recent
  events. "What do I need to worry about?"
- `mcp sync` — push service definitions from the CLI to the agent.
  Updates the agent's desired state without deploying. "Here's what
  should be running on your node."

### 9. No REST on agent — unstated deviation from engineering standards — RESOLVED

**Problem:** Engineering standards require REST+gRPC parity. The agent is
gRPC-only.

**Resolution:** The REST+gRPC parity rule applies to user-facing services,
not internal infrastructure. MCP's agent is a C2 channel with no
meaningful REST use case. Call out the exception in ARCHITECTURE.md, and
update engineering-standards.md to clarify that the rule applies where
both interfaces provide value (user services with external consumers),
not universally.

### 10. Master-side deploy flow undocumented — RESOLVED

**Problem:** The agent-side deploy flow is documented step by step. The
master-side flow is fragmented across sections.

**Resolution:** Resolved by the agent merge. The CLI is now a thin client
that pushes service definitions to the agent. The agent handles the full
deploy flow (resolve spec, pull image, stop/start containers, update
registry). The ARCHITECTURE.md rewrite will document this as one coherent
flow.

### 11. `mcp watch` connecting to local agent — RESOLVED

**Problem:** Watch and agent on the same machine — unclear wiring.

**Resolution:** Eliminated by the agent merge. Watch is now a subsystem
within the agent, not a separate process. No cross-process wiring needed.

### 12. Container name uniqueness across nodes — RESOLVED

**Problem:** Container `name` is the PK. This works for v1 with one node
but would collide with multi-node.

**Resolution:** With the merged agent model, each agent has its own
SQLite database on its own node. No cross-node name collision. The
unique identity of a component is `node/service/component`. The CLI
aggregates across agents in output and qualifies by node.
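To make the component model concrete, here is a hypothetical service
definition reflecting the decisions above: `[[components]]` from issue 5,
`restart` from issue 7, and `active` from issue 1. The image names and any
other fields are illustrative assumptions, not part of the agreed design.

```toml
# Hypothetical: ~/.config/mcp/services/metacrypt.toml
# Only `active`, `restart`, and `[[components]]` are agreed names;
# everything else here is an illustrative assumption.
active = true    # operator intent (replaces the old `state` field)
node = "rift"    # both components are co-located on this node

[[components]]
name = "api"
image = "registry.example.com/metacrypt-api:v1.4.2"
restart = "always"   # `restart`, not `restart_policy` (issue 7)

[[components]]
name = "web"
image = "registry.example.com/metacrypt-web:v2.0.0"
restart = "always"
```

Under this model, `mcp deploy metacrypt` deploys both components, while
`mcp deploy metacrypt/web` touches only web.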
## Design Changes Agreed During Review

- `active: true/false` in service definitions replaces the `state` field
- No database on vade; the registry lives only on rift
- Service definition files are operator-local (vade), not replicated to rift
- The CLI writes desired state + deployed spec to the rift registry during
  actions

### Major: Merge agent and watcher into a single smart node daemon

The original design had three components: CLI (master), dumb agent, watch
process. This is now simplified to two:

**CLI** (`mcp`) — thin client on the operator's workstation:
- Reads local service definition files (`~/.config/mcp/services/`)
- Pushes desired state to the agent on each node
- Reads status, events, and drift back from agents
- Provides operator UX (ps, status, deploy, service show/edit/export)
- No database, no daemon

**Agent** (`mcp-agent`) — smart per-node daemon:
- Receives desired state from the CLI (service definitions)
- Observes actual state via the container runtime
- Acts on the difference (deploy, stop, start)
- Stores the registry (SQLite: desired state, observed state, deployed
  specs, events)
- Monitors continuously (watch loop, event recording, drift detection)
- Alerts on drift/flapping (configurable alert command)
- Handles file transfers (push/pull within `/srv/<service>/`)
- Reports node resources (disk, memory, CPU — for future scheduling)

The "dumb" container runtime interaction (podman/docker exec) is an
internal subcomponent of the agent, not a separate concern.

This sets up the declarative future: the primary operation is "here's
what should be running on your node," and the agent works to make reality
match intent. v1 reconciliation is operator-triggered (deploy/sync). v2
can add continuous auto-reconciliation.
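The reconcile step this implies can be sketched minimally. Here state is
modeled as plain name-to-state maps; the function name and representation
are hypothetical, not taken from the design:

```python
# Sketch of the agent's reconcile step: compare desired state against
# observed state and emit actions. Containers with desired_state 'ignore'
# (issue 6) are seen but never reconciled.

def reconcile(desired: dict[str, str], observed: dict[str, str]) -> list[tuple[str, str]]:
    """Return (action, component) pairs needed to make reality match intent."""
    actions = []
    for name, want in desired.items():
        have = observed.get(name, "absent")
        if want == "ignore":
            continue  # discovered/unmanaged: report in status, don't touch
        if want == "running" and have != "running":
            actions.append(("start", name))
        elif want == "stopped" and have == "running":
            actions.append(("stop", name))
    return actions
```

In v1 this runs when the operator triggers deploy/sync; v2 would call it
continuously from the watch loop.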
Implications for other review items:

- #1 (database location): fully resolved — the database is on the agent
- #2 (mTLS): still needs fixing (terminology)
- #3 (agent user permissions): still needs an answer
- #10 (master-side deploy flow): changes significantly — the CLI pushes to
  the agent, which executes
- #11 (watch connecting to agent): eliminated — they're the same process
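As a sketch, the per-node registry implied by issues 6, 7, and 12 could
encode the `ignore` state and the `restart` naming directly in the schema.
The table and column layout beyond `desired_state` and `restart` is an
assumption for illustration:

```python
# Hypothetical per-node agent registry (SQLite). One database per node,
# so (service, component) suffices as the key; the node is implicit
# (issue 12). desired_state admits 'ignore' (issue 6) and the column is
# `restart`, not `restart_policy` (issue 7).
import sqlite3

SCHEMA = """
CREATE TABLE components (
    service       TEXT NOT NULL,
    component     TEXT NOT NULL,
    desired_state TEXT NOT NULL DEFAULT 'running'
        CHECK (desired_state IN ('running', 'stopped', 'ignore')),
    restart       TEXT NOT NULL DEFAULT 'always',
    PRIMARY KEY (service, component)
);
"""

def open_registry(path: str) -> sqlite3.Connection:
    """Open (and initialize) the agent's registry database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The `CHECK` constraint makes the three-state model enforceable at the
database layer, so an empty-string desired_state can never reappear.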