Documents 12 issues found during critical review of ARCHITECTURE.md and their resolutions: merged agent/watcher into single smart daemon, components model for independent deploy within services, database lives on agent not CLI, TLS+bearer (not mTLS), desired_state=ignore for unmanaged containers, and other clarifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# MCP ARCHITECTURE.md Review

Status: all 12 issues resolved. ARCHITECTURE.md needs a rewrite to
incorporate the design changes.

## Issues

### 1. Database location — RESOLVED

**Problem:** Doc says `~/.config/mcp/mcp.db` on vade and `/srv/mcp/mcp.db`
on rift. Two databases would diverge.

**Resolution:** No database on vade. Service definition files on vade are
the operator's intent (with `active: true/false`). The registry database
lives only on rift, owned by `mcp watch`. The CLI writes desired state and
deployed specs to the rift registry during deploy/stop/start. The watch
process reads from it.

| Location | What | Purpose |
|----------|------|---------|
| vade: `~/.config/mcp/services/*.toml` | Service definitions | Operator intent |
| rift: `/srv/mcp/mcp.db` | Registry | Operational truth (observed state, events, deployed specs) |

**Open sub-question:** How does the CLI write to the rift registry? Options:
(a) CLI talks to the watch process via gRPC, (b) CLI talks to the agent
which writes to the registry, (c) CLI SSHes to rift. Needs decision.
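
To make the split concrete, a service definition on vade might look like
this. This is a sketch only: `active` and `[[components]]` (issue 5) and
`restart` (issue 7) come from this review, while the `node`, `name`, and
`image` fields are illustrative assumptions, not a settled schema.

```toml
# vade: ~/.config/mcp/services/metacrypt.toml — operator intent only.
# No runtime state lives here; observed state stays in rift's registry.
active = true   # replaces the old `state` field
node = "rift"   # hypothetical placement field

[[components]]
name = "api"
image = "registry.example/metacrypt-api:v1.4.2"
restart = "always"

[[components]]
name = "web"
image = "registry.example/metacrypt-web:v2.0.0"
restart = "always"
```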

### 2. "mTLS" claim is inaccurate — RESOLVED

**Problem:** Communication section says "gRPC with mutual TLS" but the
actual auth model is server-side TLS (agent has a cert) + bearer token
(master presents MCIAS token in gRPC metadata). That's not mTLS.

**Resolution:** Fix wording to "server-side TLS + MCIAS bearer token."
mTLS was discussed and deemed not worth the complexity for v1: the agent
is behind the overlay network, tokens are short-lived, and the scenarios
where client certs help (stolen token, MCIAS compromise) already imply
broader platform compromise. Note mTLS as a potential future hardening
measure subject to security review.
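
As a minimal sketch of what "bearer token in gRPC metadata" means in
practice, the CLI could attach the MCIAS token like this. The metadata key
and `Bearer` scheme are conventional assumptions, not a settled wire
format:

```python
# Sketch: how the master might present the MCIAS token to the agent.
# gRPC metadata keys must be lowercase ASCII; "authorization" is the
# conventional key for token-based auth.

def auth_metadata(token: str) -> list[tuple[str, str]]:
    """Build per-call gRPC metadata carrying the bearer token."""
    return [("authorization", f"Bearer {token}")]

# With grpc-python this would pair with a server-side-TLS channel, e.g.:
#   channel = grpc.secure_channel(target, grpc.ssl_channel_credentials())
#   stub.Deploy(request, metadata=auth_metadata(token))
```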

### 3. Agent system user can't manage containers — RESOLVED

**Problem:** Agent systemd unit runs as `User=mcp`. But podman/docker CLI
requires root, docker group membership, or rootless podman setup. Doc
doesn't address how the `mcp` user gets container runtime permissions.

**Resolution:** Provision a dedicated `mcp` user via NixOS config. Agent
runs as `mcp`. Podman runs rootless under that user. All containers are
owned by `mcp`. Existing containers on rift (currently under kyle) will
need to be re-created under `mcp`'s podman instance as a one-time
migration during MCP bootstrap. `/srv/` directories need to be accessible
to the `mcp` user.

### 4. No way to target individual containers from CLI — RESOLVED

**Problem:** Doc says "lifecycle commands can target individual containers"
but CLI only shows `mcp stop <service>`, not per-container syntax.

**Resolution:** CLI operates at service level only (like compose). Drop the
per-container claim. Per-container operations are the agent's internal
concern — the agent decides how to manage individual containers within a
service. If an operator needs per-container control, they use podman
directly on the node.

### 5. `--image` flag ambiguous for multi-container services — RESOLVED

**Problem:** `mcp deploy metacrypt --image v2.0.0` — metacrypt has two
containers with different images. Does `--image` override both? Which one?

**Resolution:** Rename `[[containers]]` to `[[components]]`. Components
are independently deployable within a service. Deploy targets:

- `mcp deploy metacrypt` — all components (default)
- `mcp deploy metacrypt/web` — just the web component

No `--image` flag needed. To update an image, edit the service definition
(or `mcp service edit metacrypt`) and deploy the specific component.

Components share the service's node and `/srv/<service>/` directory. This
models the real constraint: api and web are co-located, share state, but
have independent lifecycles (e.g., restarting metacrypt-api requires
unsealing, but web can be redeployed independently).

### 6. Unmanaged desired_state contradicts schema — RESOLVED

**Problem:** Doc says "Unmanaged containers have no desired state set
(desired_state is empty)" but schema has `desired_state TEXT NOT NULL
DEFAULT 'running'`. Empty string is not a valid state value.

**Resolution:** Add `ignore` as a desired state. Unmanaged containers
discovered during sync get `desired_state = 'ignore'`. The agent sees
them, reports them in status, but doesn't alert on drift or try to
reconcile. `mcp adopt` changes desired_state from `ignore` to `running`.

Desired states: `running`, `stopped`, `ignore`.
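
The schema fix can be expressed as a CHECK constraint. A sketch using an
in-memory SQLite database; the table name and columns beyond
`desired_state` are assumptions about the registry schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE components (
        name TEXT PRIMARY KEY,
        desired_state TEXT NOT NULL DEFAULT 'running'
            CHECK (desired_state IN ('running', 'stopped', 'ignore'))
    )
""")

# Unmanaged container discovered during sync: recorded, never reconciled.
conn.execute("INSERT INTO components VALUES ('legacy-nginx', 'ignore')")

# `mcp adopt` flips ignore -> running.
conn.execute(
    "UPDATE components SET desired_state = 'running' WHERE name = ?",
    ("legacy-nginx",),
)
```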

### 7. `restart` vs `restart_policy` naming inconsistency — RESOLVED

**Problem:** Service definition TOML uses `restart`. Proto and schema use
`restart_policy`. Pick one.

**Resolution:** Use `restart` everywhere — TOML, proto, schema. Shorter,
matches compose convention.

### 8. `mcp ps` vs `mcp status` vs `mcp sync` unclear boundaries — RESOLVED

**Problem:** Status and sync both query agents and reconcile. What's the
difference?

**Resolution:** Four distinct commands with clear semantics:

- `mcp list` — read the agent's registry. No runtime query. Fast. Shows
  services, desired state, last-known observed state, version.
- `mcp ps` — force a live runtime query, update observed state. Shows
  what's actually running with uptime and version.
- `mcp status` — full picture: live query + drift detection + recent
  events. "What do I need to worry about?"
- `mcp sync` — push service definitions from CLI to agent. Updates the
  agent's desired state without deploying. "Here's what should be
  running on your node."
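
The drift check behind `mcp status` could be as simple as comparing the
desired and observed state columns, skipping `ignore` rows. A sketch; the
dict-based row shape is illustrative:

```python
def detect_drift(rows: list[dict]) -> list[str]:
    """Return names of components whose observed state diverges from intent.

    Components with desired_state 'ignore' are shown by status but never
    counted as drift (see issue 6).
    """
    return [
        r["name"]
        for r in rows
        if r["desired_state"] != "ignore"
        and r["observed_state"] != r["desired_state"]
    ]
```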

### 9. No REST on agent — unstated deviation from engineering standards — RESOLVED

**Problem:** Engineering standards require REST+gRPC parity. Agent is
gRPC-only.

**Resolution:** The REST+gRPC parity rule applies to user-facing services,
not internal infrastructure. MCP's agent is a C2 channel with no
meaningful REST use case. Call out the exception in ARCHITECTURE.md, and
update engineering-standards.md to clarify the rule applies where both
interfaces provide value (user services with external consumers), not
universally.

### 10. Master-side deploy flow undocumented — RESOLVED

**Problem:** The agent-side deploy flow is documented step by step. The
master-side flow is fragmented across sections.

**Resolution:** Resolved by the agent merge. The CLI is now a thin client
that pushes service definitions to the agent. The agent handles the full
deploy flow (resolve spec, pull image, stop/start containers, update
registry). The ARCHITECTURE.md rewrite will document this as one coherent
flow.

### 11. `mcp watch` connecting to local agent — RESOLVED

**Problem:** Watch and agent on same machine — unclear wiring.

**Resolution:** Eliminated by the agent merge. Watch is now a subsystem
within the agent, not a separate process. No cross-process wiring needed.

### 12. Container name uniqueness across nodes — RESOLVED

**Problem:** Container `name` is the PK. Works for v1 with one node.
Would collide with multi-node.

**Resolution:** With the merged agent model, each agent has its own
SQLite database on its own node. No cross-node name collision. The
unique identity of a component is `node/service/component`. The CLI
aggregates across agents in output and qualifies by node.

## Design Changes Agreed During Review

- `active: true/false` in service definitions replaces `state` field
- No database on vade; registry lives only on rift
- Service definition files are operator-local (vade), not replicated to rift
- CLI writes desired state + deployed spec to rift registry during actions

### Major: Merge agent and watcher into a single smart node daemon

The original design had three components: CLI (master), dumb agent, watch
process. This is now simplified to two:

**CLI** (`mcp`) — thin client on operator's workstation:

- Reads local service definition files (`~/.config/mcp/services/`)
- Pushes desired state to the agent on each node
- Reads status, events, drift back from agents
- Provides operator UX (ps, status, deploy, service show/edit/export)
- No database, no daemon

**Agent** (`mcp-agent`) — smart per-node daemon:

- Receives desired state from CLI (service definitions)
- Observes actual state via container runtime
- Acts on the difference (deploy, stop, start)
- Stores the registry (SQLite: desired state, observed state, deployed
  specs, events)
- Monitors continuously (watch loop, event recording, drift detection)
- Alerts on drift/flapping (configurable alert command)
- Handles file transfers (push/pull within `/srv/<service>/`)
- Reports node resources (disk, memory, CPU — for future scheduling)

The "dumb" container runtime interaction (podman/docker exec) is an
internal subcomponent of the agent, not a separate concern.

This sets up for the declarative future: the primary operation is "here's
what should be running on your node" and the agent works to make reality
match intent. v1 reconciliation is operator-triggered (deploy/sync). v2
can add continuous auto-reconciliation.
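
The "make reality match intent" step might be sketched like this. The
function and action names are assumptions; under v1 this would run only on
deploy/sync, and v2 could call it from a continuous loop:

```python
def plan_actions(
    desired: dict[str, str], observed: dict[str, str]
) -> list[tuple[str, str]]:
    """Compute (action, component) pairs to reconcile observed with desired.

    desired/observed map component name -> state. A desired state of
    'ignore' produces no action, and components only present in observed
    are left alone (they become 'ignore' rows on discovery, per issue 6).
    """
    actions = []
    for name, want in desired.items():
        have = observed.get(name, "absent")
        if want == "ignore" or have == want:
            continue
        if want == "running":
            actions.append(("start", name))  # covers deploy and restart
        elif want == "stopped" and have == "running":
            actions.append(("stop", name))
    return actions
```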

Implications for other review items:

- #1 (database location): fully resolved — database is on the agent
- #2 (mTLS): still needs fixing (terminology)
- #3 (agent user permissions): still needs an answer
- #10 (master-side deploy flow): changes significantly — CLI pushes to
  agent, agent executes
- #11 (watch connecting to agent): eliminated — they're the same process