Remove completed architecture review

Review is complete — all items resolved, document approved for
implementation. The review served its purpose and is no longer needed
in the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-02 11:09:44 -07:00
parent f3a608c4ae
commit f9eea00a6c

View File

@@ -1,210 +0,0 @@
# ARCHITECTURE_V2.md Follow-Up Review
Reviewer: Claude (automated architecture review)
Date: 2026-04-02
Document reviewed: `mcp/ARCHITECTURE_V2.md` (revised)
---
## Review Disposition
The revised document addresses 20 of 23 original items. The remaining
gaps are minor and don't block implementation. This document is **ready
to move to implementation**.
---
## Resolution of Original High-Priority Items
### #3 ExportServiceData mode (live vs migration) -- RESOLVED
The document takes the "master owns stop/start" approach: for migration,
the master stops the container (step 2) before calling
`ExportServiceData` (step 3). For scheduled snapshots, the master calls
`ExportServiceData` without stopping. The agent behaves the same either
way -- runs the configured snapshot method, tars, streams back. No proto
change needed.
One edge case to handle during implementation: if the agent's persisted
snapshot method is `grpc` or `cli` and the container is already stopped
(migration case), the agent can't call the service or exec into it.
The shutdown handler's vacuum provides consistency, so the agent should
**detect the container is stopped and skip the method step**, falling
back to a direct tar with auto-vacuum. This doesn't need a proto change
-- just a behavioral note in the agent implementation.
### #4 Snapshot method config propagation -- PARTIALLY RESOLVED
The `ExportServiceDataRequest` comment (line 1093) now states: "Snapshot
config is stored in the agent's registry at deploy time." This is the
right approach (option b from the original review).
**Remaining gap:** `ServiceSpec` (lines 357-363) has no snapshot config
fields. The deploy flow is: CLI reads TOML (which has `[snapshot]`) →
converts to `ServiceSpec` proto → sends to master → master forwards to
agent. If `ServiceSpec` doesn't carry snapshot config, it can't reach
the agent's registry.
**Fix needed:** Add snapshot fields to `ServiceSpec`:
```protobuf
message ServiceSpec {
string name = 1;
bool active = 2;
repeated ComponentSpec components = 3;
string tier = 4;
string node = 5;
SnapshotConfig snapshot = 6; // snapshot method and excludes
}
message SnapshotConfig {
string method = 1; // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
repeated string excludes = 2; // paths to skip
}
```
This is a one-line proto addition -- not a design issue, just a gap
in the spec.
### #12 backend_tls protobuf default -- RESOLVED
Line 520: `bool backend_tls = 4; // MUST be true; agent rejects false`
The agent validates the field and rejects `false`. This is the strongest
fix -- the proto default doesn't matter because the agent enforces the
invariant. No silent cleartext is possible.
### #16 Public DNS registration -- RESOLVED
Lines 852-856: New step 6 documents that public DNS records are
pre-provisioned manually at Hurricane Electric. The master resolves the
hostname as a validation check, warns if it fails, but continues
(pragmatic for parallel setup). Clear and complete.
---
## Resolution of Original Medium-Priority Items
### #1 Bootstrap circularity -- RESOLVED
Lines 745-760: New "Bootstrap (first boot)" subsection covers image
pre-staging for stages 1-2 via `podman load`/`podman pull`, documents
that boot sequence config contains full service definitions, and notes
this is the only place definitions live on the agent. Thorough.
### #2 Destructive sync -- RESOLVED
Line 1404: `mcp sync --dry-run` added. The destructive-by-default
behavior is a deliberate design choice -- the services directory is the
source of truth, and sync enforces it. The dry-run flag provides
adequate safety. Acceptable.
### #6 MasterDeployResponse success semantics -- RESOLVED
Lines 403-407: `success` is now documented as "true only if ALL steps
succeeded," with per-step results showing partial failures. Clear.
### #8 Migration destination directory check -- RESOLVED
Line 1106: `bool force = 3` added to `ImportServiceDataChunk`. The
agent handles the directory check and enforces the overwrite guard.
Clean solution -- the check is where the filesystem is.
### #10 Master trust boundary -- RESOLVED
Lines 173-180: New "Trust Assumptions" subsection explicitly states the
master is fully trusted, documents the blast radius, and lists
mitigations. Exactly what was recommended.
### #11 Agent-to-master TLS verification -- RESOLVED
Lines 183-196: New "TLS Verification" subsection documents CA cert
usage, pre-provisioning, and startup retry behavior. Covers the edge
case where Metacrypt isn't up yet (CA cert is static, pre-provisioned).
---
## Resolution of Original Low-Priority Items
| # | Item | Status |
|---|------|--------|
| 5 | --direct mode caveat | RESOLVED (lines 1407-1412) |
| 7 | MigrateRequest validation | RESOLVED (line 1249) |
| 14 | Proto completeness | MOSTLY RESOLVED -- Snapshot RPCs added to service def, renamed to CreateSnapshot. Empty response messages unchanged (acceptable). |
| 15 | Agent schema for snapshot config | IMPLICITLY RESOLVED by line 1093 comment; schema is implementation detail |
| 17 | mc-proxy binding addresses | NOT ADDRESSED -- minor; implementable from context |
| 18 | Phase plan for snapshots | RESOLVED -- new Phase 5 (lines 1496-1503), cut over moved to Phase 6 |
| 23 | Snapshot proto naming | RESOLVED -- renamed to CreateSnapshot (line 389), disambiguation comment added (lines 1256-1258) |
---
## New Observations
### A. Boot config drift potential
Lines 741-742, 757-760: Boot sequence config contains full service
definitions for stage 1-3 services. When the operator bumps an image
version via `mcp deploy`, the boot config is NOT automatically updated.
On next reboot, the agent starts the old version from boot config; the
master then deploys the new version.
This is acceptable for a personal platform (brief window of old version
on reboot) and self-correcting (the master's placement takes over). But
the operator should know to update the boot config for foundation
services (MCIAS, MCNS) where running the wrong version could cause
authentication or DNS failures before the master is even up.
**Recommendation:** Add a note to the boot sequencing section: "When
updating foundation service images, also update the boot sequence
config. The master corrects worker service versions after startup, but
foundation services run before the master exists."
### B. Migration step 3 vs snapshot method
When the master calls `ExportServiceData` after stopping the container
(migration step 3), the agent can't execute `grpc` or `cli` snapshot
methods because the container isn't running. The agent should fall back
to a direct tar with auto-vacuum of `.db` files. This is correct
behavior (the container already vacuumed on shutdown), but should be
documented as an implementation rule: "If the container is not running
when `ExportServiceData` is called, skip the snapshot method and tar
directly."
This is the same point as item #3 above -- noting it here as an
implementation requirement rather than a design gap.
---
## Security Assessment
No new security concerns. The revised document strengthens the security
posture:
- **Trust Assumptions** section (new) makes the threat model explicit.
- **TLS Verification** section (new) closes the gap on inter-component
TLS validation.
- **backend_tls rejection** ensures no accidental cleartext.
- **Public DNS validation** (warn-and-continue) prevents silent
misconfiguration without blocking legitimate parallel setup.
The existing security model (identity-bound registration, cert SAN
restrictions, Tailscale ACLs, rate limiting) is unchanged and remains
sound.
---
## Readiness for Implementation
The document is ready for implementation. Summary of what remains:
| Item | Action | When |
|------|--------|------|
| Add `SnapshotConfig` to `ServiceSpec` proto | Proto file edit | Phase 5 (when implementing snapshots) |
| Agent fallback when container is stopped during export | Implementation detail | Phase 5 |
| Boot config drift note | Optional doc edit | Any time |
| mc-proxy binding addresses | Optional doc edit | Any time |
None of these block starting Phase 1 (agent on svc) or Phase 2 (edge
routing RPCs). The snapshot-related items only matter at Phase 5.
**Verdict: proceed to implementation.**