Documents 12 issues found during critical review of ARCHITECTURE.md and their resolutions: merged agent/watcher into single smart daemon, components model for independent deploy within services, database lives on agent not CLI, TLS+bearer (not mTLS), desired_state=ignore for unmanaged containers, and other clarifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MCP ARCHITECTURE.md Review
Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to incorporate the design changes.
Issues
1. Database location — RESOLVED
Problem: Doc says ~/.config/mcp/mcp.db on vade and /srv/mcp/mcp.db
on rift. Two databases would diverge.
Resolution: No database on vade. Service definition files on vade are
the operator's intent (with active: true/false). The registry database
lives only on rift, owned by mcp watch. The CLI writes desired state and
deployed specs to the rift registry during deploy/stop/start. The watch
process reads from it.
| Location | What | Purpose |
|---|---|---|
| vade: ~/.config/mcp/services/*.toml | Service definitions | Operator intent |
| rift: /srv/mcp/mcp.db | Registry | Operational truth (observed state, events, deployed specs) |
Open sub-question: How does the CLI write to the rift registry? Options: (a) CLI talks to the watch process via gRPC, (b) CLI talks to the agent which writes to the registry, (c) CLI SSHes to rift. Needs decision.
2. "mTLS" claim is inaccurate — RESOLVED
Problem: Communication section says "gRPC with mutual TLS" but the actual auth model is server-side TLS (agent has a cert) + bearer token (master presents MCIAS token in gRPC metadata). That's not mTLS.
Resolution: Fix wording to "server-side TLS + MCIAS bearer token." mTLS was discussed and deemed not worth the complexity for v1: the agent is behind the overlay network, tokens are short-lived, and the scenarios where client certs help (stolen token, MCIAS compromise) already imply broader platform compromise. Note mTLS as a potential future hardening measure subject to security review.
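The resolved auth shape (server-side TLS on the agent, MCIAS bearer token from the master) can be sketched with Python's grpcio; the agent address and token value here are hypothetical, only the auth model comes from the review:

```python
import grpc

# Hypothetical endpoint and token; only the auth shape (server-side TLS
# + bearer token in gRPC metadata) comes from this review.
AGENT_ADDR = "rift:50051"
MCIAS_TOKEN = "short-lived-token"

# Server-side TLS: the client verifies the agent's cert but presents no
# client cert of its own -- which is exactly why this is not mTLS.
tls = grpc.ssl_channel_credentials()

# Bearer token attached as `authorization: Bearer <token>` call metadata.
bearer = grpc.access_token_call_credentials(MCIAS_TOKEN)

channel = grpc.secure_channel(
    AGENT_ADDR, grpc.composite_channel_credentials(tls, bearer)
)
```

Channel creation is lazy, so no connection is attempted until the first RPC; the token is re-sent per call, which suits short-lived MCIAS tokens.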
3. Agent system user can't manage containers — RESOLVED
Problem: Agent systemd unit runs as User=mcp. But podman/docker CLI
requires root, docker group membership, or rootless podman setup. Doc
doesn't address how the mcp user gets container runtime permissions.
Resolution: Provision a dedicated mcp user via NixOS config. Agent
runs as mcp. Podman runs rootless under that user. All containers are
owned by mcp. Existing containers on rift (currently under kyle) will
need to be re-created under mcp's podman instance as a one-time
migration during MCP bootstrap. /srv/ directories need to be accessible
to the mcp user.
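A minimal sketch of the NixOS provisioning, assuming standard module options; the home directory and exact settings are illustrative, not a tested module:

```nix
# Hypothetical NixOS fragment for the dedicated mcp user.
{
  users.users.mcp = {
    isNormalUser = true;
    home = "/srv/mcp";
    # Lingering keeps the user's systemd services (and rootless podman
    # containers) running without an active login session.
    linger = true;
  };
  virtualisation.podman.enable = true;
}
```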
4. No way to target individual containers from CLI — RESOLVED
Problem: Doc says "lifecycle commands can target individual containers"
but CLI only shows mcp stop <service>, not per-container syntax.
Resolution: CLI operates at service level only (like compose). Drop the per-container claim. Per-container operations are the agent's internal concern — the agent decides how to manage individual containers within a service. If an operator needs per-container control, they use podman directly on the node.
5. --image flag ambiguous for multi-container services — RESOLVED
Problem: mcp deploy metacrypt --image v2.0.0 — metacrypt has two
containers with different images. Does --image override both? Which one?
Resolution: Rename [[containers]] to [[components]]. Components
are independently deployable within a service. Deploy targets:
- mcp deploy metacrypt — all components (default)
- mcp deploy metacrypt/web — just the web component
No --image flag needed. To update an image, edit the service definition
(or mcp service edit metacrypt) and deploy the specific component.
Components share the service's node and /srv/<service>/ directory. This
models the real constraint: api and web are co-located, share state, but
have independent lifecycles (e.g., restarting metacrypt-api requires
unsealing, but web can be redeployed independently).
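A sketch of what a service definition with components might look like; [[components]] and restart come from this review, while the other field names and image references are assumptions:

```toml
# ~/.config/mcp/services/metacrypt.toml -- hypothetical sketch
active = true
node = "rift"

[[components]]
name = "api"
image = "registry.example/metacrypt-api:v1.4.2"
restart = "always"

[[components]]
name = "web"
image = "registry.example/metacrypt-web:v2.0.0"
restart = "always"
```

To roll just the web image, edit its component's image and run mcp deploy metacrypt/web; the api component is untouched.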
6. Unmanaged desired_state contradicts schema — RESOLVED
Problem: Doc says "Unmanaged containers have no desired state set
(desired_state is empty)" but schema has desired_state TEXT NOT NULL DEFAULT 'running'. Empty string is not a valid state value.
Resolution: Add ignore as a desired state. Unmanaged containers
discovered during sync get desired_state = 'ignore'. The agent sees
them, reports them in status, but doesn't alert on drift or try to
reconcile. mcp adopt changes desired_state from ignore to running.
Desired states: running, stopped, ignore.
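With ignore added, the schema can enforce the closed set of states; only the desired_state column definition is quoted from the doc, the table shape and CHECK constraint are illustrative:

```sql
-- Hypothetical registry fragment (per-agent SQLite).
CREATE TABLE containers (
    name          TEXT PRIMARY KEY,
    service       TEXT NOT NULL,
    desired_state TEXT NOT NULL DEFAULT 'running'
                  CHECK (desired_state IN ('running', 'stopped', 'ignore'))
);
```

mcp adopt then becomes a one-row UPDATE from 'ignore' to 'running'.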
7. restart vs restart_policy naming inconsistency — RESOLVED
Problem: Service definition TOML uses restart. Proto and schema use
restart_policy. Pick one.
Resolution: Use restart everywhere — TOML, proto, schema. Shorter,
matches compose convention.
8. mcp ps vs mcp status vs mcp sync unclear boundaries — RESOLVED
Problem: Status and sync both query agents and reconcile. What's the difference?
Resolution: Three distinct commands with clear semantics:
- mcp list — read the agent's registry. No runtime query. Fast. Shows services, desired state, last-known observed state, version.
- mcp ps — force a live runtime query, update observed state. Shows what's actually running with uptime and version.
- mcp status — full picture: live query + drift detection + recent events. "What do I need to worry about."
- mcp sync — push service definitions from CLI to agent. Updates the agent's desired state without deploying. "Here's what should be running on your node."
9. No REST on agent — unstated deviation from engineering standards — RESOLVED
Problem: Engineering standards require REST+gRPC parity. Agent is gRPC-only.
Resolution: The REST+gRPC parity rule applies to user-facing services, not internal infrastructure. MCP's agent is a C2 channel with no meaningful REST use case. Call out the exception in ARCHITECTURE.md, and update engineering-standards.md to clarify the rule applies where both interfaces provide value (user services with external consumers), not universally.
10. Master-side deploy flow undocumented — RESOLVED
Problem: The agent-side deploy flow is documented step by step. The master-side flow is fragmented across sections.
Resolution: Resolved by the agent merge. The CLI is now a thin client that pushes service definitions to the agent. The agent handles the full deploy flow (resolve spec, pull image, stop/start containers, update registry). The ARCHITECTURE.md rewrite will document this as one coherent flow.
11. mcp watch connecting to local agent — RESOLVED
Problem: Watch and agent on same machine — unclear wiring.
Resolution: Eliminated by the agent merge. Watch is now a subsystem within the agent, not a separate process. No cross-process wiring needed.
12. Container name uniqueness across nodes — RESOLVED
Problem: Container name is the PK. Works for v1 with one node.
Would collide with multi-node.
Resolution: With the merged agent model, each agent has its own
SQLite database on its own node. No cross-node name collision. The
unique identity of a component is node/service/component. The CLI
aggregates across agents in output and qualifies by node.
Design Changes Agreed During Review
- active: true/false in service definitions replaces the state field
- No database on vade; registry lives only on rift
- Service definition files are operator-local (vade), not replicated to rift
- CLI writes desired state + deployed spec to rift registry during actions
Major: Merge agent and watcher into a single smart node daemon
The original design had three components: CLI (master), dumb agent, watch process. This is now simplified to two:
CLI (mcp) — thin client on operator's workstation:
- Reads local service definition files (~/.config/mcp/services/)
- Pushes desired state to the agent on each node
- Reads status, events, drift back from agents
- Provides operator UX (ps, status, deploy, service show/edit/export)
- No database, no daemon
Agent (mcp-agent) — smart per-node daemon:
- Receives desired state from CLI (service definitions)
- Observes actual state via container runtime
- Acts on the difference (deploy, stop, start)
- Stores the registry (SQLite: desired state, observed state, deployed specs, events)
- Monitors continuously (watch loop, event recording, drift detection)
- Alerts on drift/flapping (configurable alert command)
- Handles file transfers (push/pull within /srv/&lt;service&gt;/)
- Reports node resources (disk, memory, CPU — for future scheduling)
The "dumb" container runtime interaction (podman/docker exec) is an internal subcomponent of the agent, not a separate concern.
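The agent surface implied by the list above could be sketched as a gRPC service; every RPC and message name here is hypothetical, not the project's actual proto:

```proto
// Hypothetical mcp-agent surface; names are illustrative only.
syntax = "proto3";

service Agent {
  rpc Sync (SyncRequest) returns (SyncResponse);       // push desired state
  rpc Status (StatusRequest) returns (StatusResponse); // live state, drift, events
  rpc Deploy (DeployRequest) returns (DeployResponse); // deploy service/component
}

message SyncRequest { repeated string service_toml = 1; }
message SyncResponse { int32 accepted = 1; }
message StatusRequest {}
message StatusResponse { string report = 1; }
message DeployRequest { string service = 1; string component = 2; }
message DeployResponse { bool ok = 1; }
```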
This sets up for the declarative future: the primary operation is "here's what should be running on your node" and the agent works to make reality match intent. v1 reconciliation is operator-triggered (deploy/sync). v2 can add continuous auto-reconciliation.
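The make-reality-match-intent loop reduces to a pure decision per component; the states come from issue 6 (running, stopped, ignore), while the function and action names are hypothetical:

```python
from typing import Optional

# Hypothetical sketch of the agent's per-component reconcile decision.
def reconcile_action(desired: str, observed: str) -> Optional[str]:
    """Return the action to take, or None if already in sync."""
    if desired == "ignore":
        return None  # unmanaged: observe and report, never act
    if desired == observed:
        return None  # reality matches intent
    if desired == "running":
        return "start"
    if desired == "stopped":
        return "stop"
    raise ValueError(f"unknown desired state: {desired}")
```

In v1 this function runs only when the operator triggers deploy/sync; v2 would call it from the agent's continuous watch loop.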
Implications for other review items:
- #1 (database location): fully resolved — database is on the agent
- #2 (mTLS): still needs fixing (terminology)
- #3 (agent user permissions): still needs answer
- #10 (master-side deploy flow): changes significantly — CLI pushes to agent, agent executes
- #11 (watch connecting to agent): eliminated — they're the same process