mcp/REVIEW.md
Kyle Isom c8d0d42ea8 Add REVIEW.md from architecture review session
Documents 12 issues found during critical review of ARCHITECTURE.md
and their resolutions: merged agent/watcher into single smart daemon,
components model for independent deploy within services, database
lives on agent not CLI, TLS+bearer (not mTLS), desired_state=ignore
for unmanaged containers, and other clarifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:27:57 -07:00


# MCP ARCHITECTURE.md Review
Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to
incorporate the design changes.
## Issues
### 1. Database location — RESOLVED
**Problem:** Doc says `~/.config/mcp/mcp.db` on vade and `/srv/mcp/mcp.db`
on rift. Two databases would diverge.
**Resolution:** No database on vade. Service definition files on vade are
the operator's intent (with `active: true/false`). The registry database
lives only on rift, owned by `mcp watch`. The CLI writes desired state and
deployed specs to the rift registry during deploy/stop/start. The watch
process reads from it.

| Location | What | Purpose |
|----------|------|---------|
| vade: `~/.config/mcp/services/*.toml` | Service definitions | Operator intent |
| rift: `/srv/mcp/mcp.db` | Registry | Operational truth (observed state, events, deployed specs) |
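The operator-intent side of this split can be sketched as a service definition file. `active` comes from the Design Changes section below; the other key names are illustrative assumptions, not confirmed by the review:

```toml
# ~/.config/mcp/services/metacrypt.toml — operator intent only.
# Observed state, events, and deployed specs live in the registry on rift.
active = true    # replaces the old `state` field (see Design Changes)
node = "rift"    # illustrative key: which node the service runs on
```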
**Open sub-question:** How does the CLI write to the rift registry? Options:
(a) CLI talks to the watch process via gRPC, (b) CLI talks to the agent
which writes to the registry, (c) CLI SSHes to rift. Needs decision.
### 2. "mTLS" claim is inaccurate — RESOLVED
**Problem:** Communication section says "gRPC with mutual TLS" but the
actual auth model is server-side TLS (agent has a cert) + bearer token
(master presents MCIAS token in gRPC metadata). That's not mTLS.
**Resolution:** Fix wording to "server-side TLS + MCIAS bearer token."
mTLS was discussed and deemed not worth the complexity for v1: the agent
is behind the overlay network, tokens are short-lived, and the scenarios
where client certs help (stolen token, MCIAS compromise) already imply
broader platform compromise. Note mTLS as a potential future hardening
measure subject to security review.
### 3. Agent system user can't manage containers — RESOLVED
**Problem:** Agent systemd unit runs as `User=mcp`. But podman/docker CLI
requires root, docker group membership, or rootless podman setup. Doc
doesn't address how the `mcp` user gets container runtime permissions.
**Resolution:** Provision a dedicated `mcp` user via NixOS config. Agent
runs as `mcp`. Podman runs rootless under that user. All containers are
owned by `mcp`. Existing containers on rift (currently under kyle) will
need to be re-created under `mcp`'s podman instance as a one-time
migration during MCP bootstrap. `/srv/` directories need to be accessible
to the `mcp` user.
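A hedged sketch of what the NixOS provisioning might look like. The option names below are real NixOS options, but the exact module layout for MCP is an assumption:

```nix
# Sketch: dedicated mcp user for rootless podman on rift.
users.users.mcp = {
  isNormalUser = true;        # rootless podman needs subuid/subgid ranges
  autoSubUidGidRange = true;  # allocate them automatically
  home = "/srv/mcp-home";     # illustrative path
  linger = true;              # keep the user's podman alive without a login
};
virtualisation.podman.enable = true;
```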
### 4. No way to target individual containers from CLI — RESOLVED
**Problem:** Doc says "lifecycle commands can target individual containers"
but CLI only shows `mcp stop <service>`, not per-container syntax.
**Resolution:** CLI operates at service level only (like compose). Drop the
per-container claim. Per-container operations are the agent's internal
concern — the agent decides how to manage individual containers within a
service. If an operator needs per-container control, they use podman
directly on the node.
### 5. `--image` flag ambiguous for multi-container services — RESOLVED
**Problem:** `mcp deploy metacrypt --image v2.0.0` — metacrypt has two
containers with different images. Does `--image` override both? Which one?
**Resolution:** Rename `[[containers]]` to `[[components]]`. Components
are independently deployable within a service. Deploy targets:
- `mcp deploy metacrypt` — all components (default)
- `mcp deploy metacrypt/web` — just the web component
No `--image` flag needed. To update an image, edit the service definition
(or `mcp service edit metacrypt`) and deploy the specific component.
Components share the service's node and `/srv/<service>/` directory. This
models the real constraint: api and web are co-located, share state, but
have independent lifecycles (e.g., restarting metacrypt-api requires
unsealing, but web can be redeployed independently).
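A sketch of what the `[[components]]` form might look like for metacrypt under this resolution. `restart` follows issue 7; the image tags and other key names are illustrative:

```toml
# metacrypt service definition (sketch). Both components share the service's
# node and /srv/metacrypt/, but deploy independently:
#   mcp deploy metacrypt        # all components
#   mcp deploy metacrypt/web    # just the web component

[[components]]
name = "api"
image = "metacrypt-api:v1.4.2"   # illustrative tag
restart = "always"               # `restart`, not `restart_policy` (issue 7)

[[components]]
name = "web"
image = "metacrypt-web:v2.0.0"
restart = "always"
```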
### 6. Unmanaged desired_state contradicts schema — RESOLVED
**Problem:** Doc says "Unmanaged containers have no desired state set
(desired_state is empty)" but schema has `desired_state TEXT NOT NULL
DEFAULT 'running'`. Empty string is not a valid state value.
**Resolution:** Add `ignore` as a desired state. Unmanaged containers
discovered during sync get `desired_state = 'ignore'`. The agent sees
them, reports them in status, but doesn't alert on drift or try to
reconcile. `mcp adopt` changes desired_state from `ignore` to `running`.
Desired states: `running`, `stopped`, `ignore`.
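The state transitions above can be sketched in a few lines. The function names are illustrative; the three states, the `ignore` default for discovered containers, and the adopt transition are from the review:

```python
# Sketch of the desired_state model from issue 6.
VALID_STATES = {"running", "stopped", "ignore"}

def on_discovered_unmanaged(record: dict) -> dict:
    """Containers found during sync that MCP doesn't manage get 'ignore'."""
    record["desired_state"] = "ignore"
    return record

def adopt(record: dict) -> dict:
    """`mcp adopt` promotes an unmanaged container to managed/running."""
    if record["desired_state"] != "ignore":
        raise ValueError("adopt only applies to unmanaged containers")
    record["desired_state"] = "running"
    return record

def should_alert_on_drift(record: dict, observed: str) -> bool:
    """The agent reports 'ignore' containers but never alerts on them."""
    desired = record["desired_state"]
    if desired == "ignore":
        return False
    return desired != observed
```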
### 7. `restart` vs `restart_policy` naming inconsistency — RESOLVED
**Problem:** Service definition TOML uses `restart`. Proto and schema use
`restart_policy`. Pick one.
**Resolution:** Use `restart` everywhere — TOML, proto, schema. Shorter,
matches compose convention.
### 8. `mcp ps` vs `mcp status` vs `mcp sync` unclear boundaries — RESOLVED
**Problem:** Status and sync both query agents and reconcile. What's the
difference?
**Resolution:** Four distinct commands with clear semantics:
- `mcp list` — read the agent's registry. No runtime query. Fast. Shows
services, desired state, last-known observed state, version.
- `mcp ps` — force a live runtime query, update observed state. Shows
what's actually running with uptime and version.
- `mcp status` — full picture: live query + drift detection + recent
events. "What do I need to worry about?"
- `mcp sync` — push service definitions from CLI to agent. Updates the
agent's desired state without deploying. "Here's what should be
running on your node."
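The boundaries above reduce to which data source each command touches. Purely an illustrative summary; MCP's real CLI internals are not specified here:

```python
# Issue 8 command boundaries as a lookup table (sketch).
COMMANDS = {
    # command:   reads registry / queries runtime / pushes definitions
    "list":   {"registry": True, "runtime": False, "push": False},
    "ps":     {"registry": True, "runtime": True,  "push": False},
    "status": {"registry": True, "runtime": True,  "push": False},
    "sync":   {"registry": True, "runtime": False, "push": True},
}

def is_fast(cmd: str) -> bool:
    """Commands that skip the live runtime query return immediately."""
    return not COMMANDS[cmd]["runtime"]
```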
### 9. No REST on agent — unstated deviation from engineering standards — RESOLVED
**Problem:** Engineering standards require REST+gRPC parity. Agent is
gRPC-only.
**Resolution:** The REST+gRPC parity rule applies to user-facing services,
not internal infrastructure. MCP's agent is a C2 channel with no
meaningful REST use case. Call out the exception in ARCHITECTURE.md, and
update engineering-standards.md to clarify the rule applies where both
interfaces provide value (user services with external consumers), not
universally.
### 10. Master-side deploy flow undocumented — RESOLVED
**Problem:** The agent-side deploy flow is documented step by step. The
master-side flow is fragmented across sections.
**Resolution:** Resolved by the agent merge. The CLI is now a thin client
that pushes service definitions to the agent. The agent handles the full
deploy flow (resolve spec, pull image, stop/start containers, update
registry). The ARCHITECTURE.md rewrite will document this as one coherent
flow.
### 11. `mcp watch` connecting to local agent — RESOLVED
**Problem:** Watch and agent on same machine — unclear wiring.
**Resolution:** Eliminated by the agent merge. Watch is now a subsystem
within the agent, not a separate process. No cross-process wiring needed.
### 12. Container name uniqueness across nodes — RESOLVED
**Problem:** Container `name` is the PK. Works for v1 with one node.
Would collide with multi-node.
**Resolution:** With the merged agent model, each agent has its own
SQLite database on its own node. No cross-node name collision. The
unique identity of a component is `node/service/component`. The CLI
aggregates across agents in output and qualifies by node.
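The qualified identity can be sketched as a tuple type. Names here are illustrative; the `node/service/component` shape is from the resolution above:

```python
# Sketch of issue 12's component identity: with one SQLite registry per
# agent, names only need to be unique per node, and the CLI qualifies
# output by node.
from typing import NamedTuple

class ComponentId(NamedTuple):
    node: str
    service: str
    component: str

    def qualified(self) -> str:
        return f"{self.node}/{self.service}/{self.component}"

# Two nodes may each run a "web" component under the same service name
# without colliding, because identity includes the node.
a = ComponentId("rift", "metacrypt", "web")
b = ComponentId("other-node", "metacrypt", "web")
assert a.qualified() != b.qualified()
```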
## Design Changes Agreed During Review
- `active: true/false` in service definitions replaces `state` field
- No database on vade; registry lives only on rift
- Service definition files are operator-local (vade), not replicated to rift
- CLI writes desired state + deployed spec to rift registry during actions
### Major: Merge agent and watcher into a single smart node daemon
The original design had three components: CLI (master), dumb agent, watch
process. This is now simplified to two:
**CLI** (`mcp`) — thin client on operator's workstation:
- Reads local service definition files (`~/.config/mcp/services/`)
- Pushes desired state to the agent on each node
- Reads status, events, drift back from agents
- Provides operator UX (ps, status, deploy, service show/edit/export)
- No database, no daemon
**Agent** (`mcp-agent`) — smart per-node daemon:
- Receives desired state from CLI (service definitions)
- Observes actual state via container runtime
- Acts on the difference (deploy, stop, start)
- Stores the registry (SQLite: desired state, observed state, deployed
specs, events)
- Monitors continuously (watch loop, event recording, drift detection)
- Alerts on drift/flapping (configurable alert command)
- Handles file transfers (push/pull within `/srv/<service>/`)
- Reports node resources (disk, memory, CPU — for future scheduling)
The "dumb" container runtime interaction (podman/docker exec) is an
internal subcomponent of the agent, not a separate concern.
This sets up for the declarative future: the primary operation is "here's
what should be running on your node" and the agent works to make reality
match intent. v1 reconciliation is operator-triggered (deploy/sync). v2
can add continuous auto-reconciliation.
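The reconcile step described above can be sketched as a pure function over desired and observed state. All names are illustrative; in v1 this runs when the operator triggers deploy/sync, not continuously:

```python
# Minimal sketch of the agent's reconcile step: compare desired state
# (pushed by the CLI) with observed state (from the container runtime)
# and act on the difference.
def reconcile(desired: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Return the actions the agent would take, as human-readable strings."""
    actions = []
    for name, want in desired.items():
        have = observed.get(name, "absent")
        if want == "ignore" or want == have:
            continue  # unmanaged, or already converged
        if want == "running":
            actions.append(f"start {name}")
        elif want == "stopped" and have == "running":
            actions.append(f"stop {name}")
    return actions
```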
Implications for other review items:
- #1 (database location): fully resolved — database is on the agent
- #2 (mTLS): still needs fixing (terminology)
- #3 (agent user permissions): still needs answer
- #10 (master-side deploy flow): changes significantly — CLI pushes to
agent, agent executes
- #11 (watch connecting to agent): eliminated — they're the same process