- docs/architecture-v2.md: Full MCP v2 multi-node architecture covering master/agent topology, security model, service placement, edge routing, snapshots, migration, and 6-phase implementation plan. Reviewed and ready for implementation. - docs/architecture-v2-review.md: External architecture review with all items resolved. - docs/phase-e-plan.md: Updated to match v2 decisions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
211 lines
8.1 KiB
Markdown
211 lines
8.1 KiB
Markdown
# ARCHITECTURE_V2.md Follow-Up Review
|
|
|
|
Reviewer: Claude (automated architecture review)
|
|
Date: 2026-04-02
|
|
Document reviewed: `mcp/ARCHITECTURE_V2.md` (revised)
|
|
|
|
---
|
|
|
|
## Review Disposition
|
|
|
|
The revised document addresses 20 of 23 original items. The remaining
|
|
gaps are minor and don't block implementation. This document is **ready
|
|
to move to implementation**.
|
|
|
|
---
|
|
|
|
## Resolution of Original High-Priority Items
|
|
|
|
### #3 ExportServiceData mode (live vs migration) -- RESOLVED
|
|
|
|
The document takes the "master owns stop/start" approach: for migration,
|
|
the master stops the container (step 2) before calling
|
|
`ExportServiceData` (step 3). For scheduled snapshots, the master calls
|
|
`ExportServiceData` without stopping. The agent behaves the same either
|
|
way -- runs the configured snapshot method, tars, streams back. No proto
|
|
change needed.
|
|
|
|
One edge case to handle during implementation: if the agent's persisted
|
|
snapshot method is `grpc` or `cli` and the container is already stopped
|
|
(migration case), the agent can't call the service or exec into it.
|
|
The shutdown handler's vacuum provides consistency, so the agent should
|
|
**detect the container is stopped and skip the method step**, falling
|
|
back to a direct tar with auto-vacuum. This doesn't need a proto change
|
|
-- just a behavioral note in the agent implementation.
|
|
|
|
### #4 Snapshot method config propagation -- PARTIALLY RESOLVED
|
|
|
|
The `ExportServiceDataRequest` comment (line 1093) now states: "Snapshot
|
|
config is stored in the agent's registry at deploy time." This is the
|
|
right approach (option b from the original review).
|
|
|
|
**Remaining gap:** `ServiceSpec` (lines 357-363) has no snapshot config
|
|
fields. The deploy flow is: CLI reads TOML (which has `[snapshot]`) →
|
|
converts to `ServiceSpec` proto → sends to master → master forwards to
|
|
agent. If `ServiceSpec` doesn't carry snapshot config, it can't reach
|
|
the agent's registry.
|
|
|
|
**Fix needed:** Add snapshot fields to `ServiceSpec`:
|
|
|
|
```protobuf
|
|
message ServiceSpec {
|
|
string name = 1;
|
|
bool active = 2;
|
|
repeated ComponentSpec components = 3;
|
|
string tier = 4;
|
|
string node = 5;
|
|
SnapshotConfig snapshot = 6; // snapshot method and excludes
|
|
}
|
|
|
|
message SnapshotConfig {
|
|
string method = 1; // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
|
|
repeated string excludes = 2; // paths to skip
|
|
}
|
|
```
|
|
|
|
This is a one-line proto addition -- not a design issue, just a gap
|
|
in the spec.
|
|
|
|
### #12 backend_tls protobuf default -- RESOLVED
|
|
|
|
Line 520: `bool backend_tls = 4; // MUST be true; agent rejects false`
|
|
|
|
The agent validates the field and rejects `false`. This is the strongest
|
|
fix -- the proto default doesn't matter because the agent enforces the
|
|
invariant. No silent cleartext is possible.
|
|
|
|
### #16 Public DNS registration -- RESOLVED
|
|
|
|
Lines 852-856: New step 6 documents that public DNS records are
|
|
pre-provisioned manually at Hurricane Electric. The master resolves the
|
|
hostname as a validation check, warns if it fails, but continues
|
|
(pragmatic for parallel setup). Clear and complete.
|
|
|
|
---
|
|
|
|
## Resolution of Original Medium-Priority Items
|
|
|
|
### #1 Bootstrap circularity -- RESOLVED
|
|
|
|
Lines 745-760: New "Bootstrap (first boot)" subsection covers image
|
|
pre-staging for stages 1-2 via `podman load`/`podman pull`, documents
|
|
that boot sequence config contains full service definitions, and notes
|
|
this is the only place definitions live on the agent. Thorough.
|
|
|
|
### #2 Destructive sync -- RESOLVED
|
|
|
|
Line 1404: `mcp sync --dry-run` added. The destructive-by-default
|
|
behavior is a deliberate design choice -- the services directory is the
|
|
source of truth, and sync enforces it. The dry-run flag provides
|
|
adequate safety. Acceptable.
|
|
|
|
### #6 MasterDeployResponse success semantics -- RESOLVED
|
|
|
|
Lines 403-407: `success` is now documented as "true only if ALL steps
|
|
succeeded," with per-step results showing partial failures. Clear.
|
|
|
|
### #8 Migration destination directory check -- RESOLVED
|
|
|
|
Line 1106: `bool force = 3` added to `ImportServiceDataChunk`. The
|
|
agent handles the directory check and enforces the overwrite guard.
|
|
Clean solution -- the check is where the filesystem is.
|
|
|
|
### #10 Master trust boundary -- RESOLVED
|
|
|
|
Lines 173-180: New "Trust Assumptions" subsection explicitly states the
|
|
master is fully trusted, documents the blast radius, and lists
|
|
mitigations. Exactly what was recommended.
|
|
|
|
### #11 Agent-to-master TLS verification -- RESOLVED
|
|
|
|
Lines 183-196: New "TLS Verification" subsection documents CA cert
|
|
usage, pre-provisioning, and startup retry behavior. Covers the edge
|
|
case where Metacrypt isn't up yet (CA cert is static, pre-provisioned).
|
|
|
|
---
|
|
|
|
## Resolution of Original Low-Priority Items
|
|
|
|
| # | Item | Status |
|
|
|---|------|--------|
|
|
| 5 | --direct mode caveat | RESOLVED (lines 1407-1412) |
|
|
| 7 | MigrateRequest validation | RESOLVED (line 1249) |
|
|
| 14 | Proto completeness | MOSTLY RESOLVED -- Snapshot RPCs added to service def, renamed to CreateSnapshot. Empty response messages unchanged (acceptable). |
|
|
| 15 | Agent schema for snapshot config | IMPLICITLY RESOLVED by line 1093 comment; schema is implementation detail |
|
|
| 17 | mc-proxy binding addresses | NOT ADDRESSED -- minor; implementable from context |
|
|
| 18 | Phase plan for snapshots | RESOLVED -- new Phase 5 (lines 1496-1503), cut over moved to Phase 6 |
|
|
| 23 | Snapshot proto naming | RESOLVED -- renamed to CreateSnapshot (line 389), disambiguation comment added (lines 1256-1258) |
|
|
|
|
---
|
|
|
|
## New Observations
|
|
|
|
### A. Boot config drift potential
|
|
|
|
Lines 741-742, 757-760: Boot sequence config contains full service
|
|
definitions for stage 1-3 services. When the operator bumps an image
|
|
version via `mcp deploy`, the boot config is NOT automatically updated.
|
|
On next reboot, the agent starts the old version from boot config; the
|
|
master then deploys the new version.
|
|
|
|
This is acceptable for a personal platform (brief window of old version
|
|
on reboot) and self-correcting (the master's placement takes over). But
|
|
the operator should know to update the boot config for foundation
|
|
services (MCIAS, MCNS) where running the wrong version could cause
|
|
authentication or DNS failures before the master is even up.
|
|
|
|
**Recommendation:** Add a note to the boot sequencing section: "When
|
|
updating foundation service images, also update the boot sequence
|
|
config. The master corrects worker service versions after startup, but
|
|
foundation services run before the master exists."
|
|
|
|
### B. Migration step 3 vs snapshot method
|
|
|
|
When the master calls `ExportServiceData` after stopping the container
|
|
(migration step 3), the agent can't execute `grpc` or `cli` snapshot
|
|
methods because the container isn't running. The agent should fall back
|
|
to a direct tar with auto-vacuum of `.db` files. This is correct
|
|
behavior (the container already vacuumed on shutdown), but should be
|
|
documented as an implementation rule: "If the container is not running
|
|
when `ExportServiceData` is called, skip the snapshot method and tar
|
|
directly."
|
|
|
|
This is the same point as item #3 above -- noting it here as an
|
|
implementation requirement rather than a design gap.
|
|
|
|
---
|
|
|
|
## Security Assessment
|
|
|
|
No new security concerns. The revised document strengthens the security
|
|
posture:
|
|
|
|
- **Trust Assumptions** section (new) makes the threat model explicit.
|
|
- **TLS Verification** section (new) closes the gap on inter-component
|
|
TLS validation.
|
|
- **backend_tls rejection** ensures no accidental cleartext.
|
|
- **Public DNS validation** (warn-and-continue) prevents silent
|
|
misconfiguration without blocking legitimate parallel setup.
|
|
|
|
The existing security model (identity-bound registration, cert SAN
|
|
restrictions, Tailscale ACLs, rate limiting) is unchanged and remains
|
|
sound.
|
|
|
|
---
|
|
|
|
## Readiness for Implementation
|
|
|
|
The document is ready for implementation. Summary of what remains:
|
|
|
|
| Item | Action | When |
|
|
|------|--------|------|
|
|
| Add `SnapshotConfig` to `ServiceSpec` proto | Proto file edit | Phase 5 (when implementing snapshots) |
|
|
| Agent fallback when container is stopped during export | Implementation detail | Phase 5 |
|
|
| Boot config drift note | Optional doc edit | Any time |
|
|
| mc-proxy binding addresses | Optional doc edit | Any time |
|
|
|
|
None of these block starting Phase 1 (agent on svc) or Phase 2 (edge
|
|
routing RPCs). The snapshot-related items only matter at Phase 5.
|
|
|
|
**Verdict: proceed to implementation.**
|