Add v2 architecture and review docs

- docs/architecture-v2.md: Full MCP v2 multi-node architecture
  covering master/agent topology, security model, service placement,
  edge routing, snapshots, migration, and 6-phase implementation plan.
  Reviewed and ready for implementation.
- docs/architecture-v2-review.md: External architecture review with
  all items resolved.
- docs/phase-e-plan.md: Updated to match v2 decisions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:09:36 -07:00
parent 8fb6374257
commit f3a608c4ae
3 changed files with 1885 additions and 37 deletions

docs/architecture-v2-review.md

@@ -0,0 +1,210 @@
# ARCHITECTURE_V2.md Follow-Up Review
Reviewer: Claude (automated architecture review)
Date: 2026-04-02
Document reviewed: `mcp/ARCHITECTURE_V2.md` (revised)

---
## Review Disposition
The revised document addresses 20 of 23 original items. The remaining
gaps are minor and don't block implementation. This document is **ready
to move to implementation**.

---
## Resolution of Original High-Priority Items
### #3 ExportServiceData mode (live vs migration) -- RESOLVED
The document takes the "master owns stop/start" approach: for migration,
the master stops the container (step 2) before calling
`ExportServiceData` (step 3). For scheduled snapshots, the master calls
`ExportServiceData` without stopping. The agent behaves the same either
way -- runs the configured snapshot method, tars, streams back. No proto
change needed.
One edge case to handle during implementation: if the agent's persisted
snapshot method is `grpc` or `cli` and the container is already stopped
(migration case), the agent can't call the service or exec into it.
The shutdown handler's vacuum provides consistency, so the agent should
**detect the container is stopped and skip the method step**, falling
back to a direct tar with auto-vacuum. This doesn't need a proto change
-- just a behavioral note in the agent implementation.
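The rule can be sketched as a small decision function on the agent (function and value names here are illustrative, not the real agent API):

```go
package main

import "fmt"

// snapshotStrategy sketches the behavioral note from item #3: methods
// that need a running container ("grpc", "cli") are skipped when the
// container is already stopped (the migration case), falling back to a
// direct tar with auto-vacuum. The shutdown handler's vacuum already
// left the data consistent, so the plain tar is safe.
func snapshotStrategy(method string, containerRunning bool) string {
	needsContainer := method == "grpc" || method == "cli"
	if needsContainer && !containerRunning {
		return "tar-direct"
	}
	if method == "" {
		return "tar-direct" // default when no method is configured
	}
	return method
}

func main() {
	fmt.Println(snapshotStrategy("grpc", false)) // migration: container stopped
	fmt.Println(snapshotStrategy("grpc", true))  // scheduled snapshot: live
}
```

For scheduled snapshots the container is running, so the configured method is used unchanged.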
### #4 Snapshot method config propagation -- PARTIALLY RESOLVED
The `ExportServiceDataRequest` comment (line 1093) now states: "Snapshot
config is stored in the agent's registry at deploy time." This is the
right approach (option b from the original review).
**Remaining gap:** `ServiceSpec` (lines 357-363) has no snapshot config
fields. The deploy flow is: CLI reads TOML (which has `[snapshot]`) →
converts to `ServiceSpec` proto → sends to master → master forwards to
agent. If `ServiceSpec` doesn't carry snapshot config, it can't reach
the agent's registry.
**Fix needed:** Add snapshot fields to `ServiceSpec`:
```protobuf
message ServiceSpec {
  string name = 1;
  bool active = 2;
  repeated ComponentSpec components = 3;
  string tier = 4;
  string node = 5;
  SnapshotConfig snapshot = 6; // snapshot method and excludes
}

message SnapshotConfig {
  string method = 1;            // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
  repeated string excludes = 2; // paths to skip
}
```
This is a small proto addition -- not a design issue, just a gap
in the spec.
### #12 backend_tls protobuf default -- RESOLVED
Line 520: `bool backend_tls = 4; // MUST be true; agent rejects false`
The agent validates the field and rejects `false`. This is the strongest
fix -- the proto default doesn't matter because the agent enforces the
invariant. No silent cleartext is possible.
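A minimal sketch of the agent-side check, with a stand-in struct for the real proto message:

```go
package main

import (
	"errors"
	"fmt"
)

// EdgeRouteRequest stands in for the real proto message; only the
// field relevant to item #12 is shown.
type EdgeRouteRequest struct {
	BackendTLS bool
}

// validateEdgeRoute sketches the agent-side invariant: backend_tls
// MUST be true, so the proto3 default (false) can never silently
// enable a cleartext backend -- it is rejected instead.
func validateEdgeRoute(req EdgeRouteRequest) error {
	if !req.BackendTLS {
		return errors.New("backend_tls must be true; refusing cleartext backend")
	}
	return nil
}

func main() {
	fmt.Println(validateEdgeRoute(EdgeRouteRequest{BackendTLS: false}))
	fmt.Println(validateEdgeRoute(EdgeRouteRequest{BackendTLS: true}))
}
```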
### #16 Public DNS registration -- RESOLVED
Lines 852-856: New step 6 documents that public DNS records are
pre-provisioned manually at Hurricane Electric. The master resolves the
hostname as a validation check, warns if it fails, but continues
(pragmatic for parallel setup). Clear and complete.

---
## Resolution of Original Medium-Priority Items
### #1 Bootstrap circularity -- RESOLVED
Lines 745-760: New "Bootstrap (first boot)" subsection covers image
pre-staging for stages 1-2 via `podman load`/`podman pull`, documents
that boot sequence config contains full service definitions, and notes
this is the only place definitions live on the agent. Thorough.
### #2 Destructive sync -- RESOLVED
Line 1404: `mcp sync --dry-run` added. The destructive-by-default
behavior is a deliberate design choice -- the services directory is the
source of truth, and sync enforces it. The dry-run flag provides
adequate safety. Acceptable.
### #6 MasterDeployResponse success semantics -- RESOLVED
Lines 403-407: `success` is now documented as "true only if ALL steps
succeeded," with per-step results showing partial failures. Clear.
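The documented semantics reduce to a fold over the step results; a sketch with illustrative types:

```go
package main

import "fmt"

// StepResult mirrors the per-step results described for
// MasterDeployResponse; field names are illustrative.
type StepResult struct {
	Name    string
	Success bool
}

// overallSuccess sketches item #6: success is true only if ALL steps
// succeeded, so a partial failure is visible in the step list but is
// never reported as overall success.
func overallSuccess(steps []StepResult) bool {
	for _, s := range steps {
		if !s.Success {
			return false
		}
	}
	return true
}

func main() {
	steps := []StepResult{{"place", true}, {"deploy", true}, {"edge-route", false}}
	fmt.Println(overallSuccess(steps)) // one step failed
}
```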
### #8 Migration destination directory check -- RESOLVED
Line 1106: `bool force = 3` added to `ImportServiceDataChunk`. The
agent handles the directory check and enforces the overwrite guard.
Clean solution -- the check is where the filesystem is.
### #10 Master trust boundary -- RESOLVED
Lines 173-180: New "Trust Assumptions" subsection explicitly states the
master is fully trusted, documents the blast radius, and lists
mitigations. Exactly what was recommended.
### #11 Agent-to-master TLS verification -- RESOLVED
Lines 183-196: New "TLS Verification" subsection documents CA cert
usage, pre-provisioning, and startup retry behavior. Covers the edge
case where Metacrypt isn't up yet (CA cert is static, pre-provisioned).

---
## Resolution of Original Low-Priority Items
| # | Item | Status |
|---|------|--------|
| 5 | --direct mode caveat | RESOLVED (lines 1407-1412) |
| 7 | MigrateRequest validation | RESOLVED (line 1249) |
| 14 | Proto completeness | MOSTLY RESOLVED -- Snapshot RPCs added to service def, renamed to CreateSnapshot. Empty response messages unchanged (acceptable). |
| 15 | Agent schema for snapshot config | IMPLICITLY RESOLVED by line 1093 comment; schema is implementation detail |
| 17 | mc-proxy binding addresses | NOT ADDRESSED -- minor; implementable from context |
| 18 | Phase plan for snapshots | RESOLVED -- new Phase 5 (lines 1496-1503), cutover moved to Phase 6 |
| 23 | Snapshot proto naming | RESOLVED -- renamed to CreateSnapshot (line 389), disambiguation comment added (lines 1256-1258) |

---
## New Observations
### A. Boot config drift potential
Lines 741-742, 757-760: Boot sequence config contains full service
definitions for stage 1-3 services. When the operator bumps an image
version via `mcp deploy`, the boot config is NOT automatically updated.
On next reboot, the agent starts the old version from boot config; the
master then deploys the new version.
This is acceptable for a personal platform (brief window of old version
on reboot) and self-correcting (the master's placement takes over). But
the operator should know to update the boot config for foundation
services (MCIAS, MCNS) where running the wrong version could cause
authentication or DNS failures before the master is even up.
**Recommendation:** Add a note to the boot sequencing section: "When
updating foundation service images, also update the boot sequence
config. The master corrects worker service versions after startup, but
foundation services run before the master exists."
### B. Migration step 3 vs snapshot method
When the master calls `ExportServiceData` after stopping the container
(migration step 3), the agent can't execute `grpc` or `cli` snapshot
methods because the container isn't running. The agent should fall back
to a direct tar with auto-vacuum of `.db` files. This is correct
behavior (the container already vacuumed on shutdown), but should be
documented as an implementation rule: "If the container is not running
when `ExportServiceData` is called, skip the snapshot method and tar
directly."
This is the same point as item #3 above -- noting it here as an
implementation requirement rather than a design gap.

---
## Security Assessment
No new security concerns. The revised document strengthens the security
posture:
- **Trust Assumptions** section (new) makes the threat model explicit.
- **TLS Verification** section (new) closes the gap on inter-component
TLS validation.
- **backend_tls rejection** ensures no accidental cleartext.
- **Public DNS validation** (warn-and-continue) prevents silent
misconfiguration without blocking legitimate parallel setup.
The existing security model (identity-bound registration, cert SAN
restrictions, Tailscale ACLs, rate limiting) is unchanged and remains
sound.

---
## Readiness for Implementation
The document is ready for implementation. Summary of what remains:
| Item | Action | When |
|------|--------|------|
| Add `SnapshotConfig` to `ServiceSpec` proto | Proto file edit | Phase 5 (when implementing snapshots) |
| Agent fallback when container is stopped during export | Implementation detail | Phase 5 |
| Boot config drift note | Optional doc edit | Any time |
| mc-proxy binding addresses | Optional doc edit | Any time |
None of these block starting Phase 1 (agent on svc) or Phase 2 (edge
routing RPCs). The snapshot-related items only matter at Phase 5.
**Verdict: proceed to implementation.**

docs/architecture-v2.md (new file, 1614 lines)

File diff suppressed because it is too large.

docs/phase-e-plan.md

```diff
@@ -15,25 +15,59 @@ same as today on rift, but across the fleet.
 | Node | OS | Arch | Role |
 |------|----|------|------|
-| desktop (TBD) | NixOS | amd64 | Control plane -- runs master + MCIAS + MCNS |
-| rift | NixOS | amd64 | Compute -- application services |
-| orion | NixOS | amd64 | Compute |
-| hyperborea | Debian | arm64 | Compute (Raspberry Pi) |
+| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
+| orion | NixOS | amd64 | Worker |
+| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
 | svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
 
 Tailnet is the interconnect between all nodes. Public traffic enters via
-mc-proxy on svc, which forwards over Tailnet to compute nodes.
+mc-proxy on svc, which forwards over Tailnet to worker nodes.
+
+## Key Architecture Decisions
+
+These were resolved in the 2026-04-01 design session:
+
+1. **Rift is the master node.** No separate straylight machine. Core infra
+   stays on rift, which gains mcp-master alongside its existing agent.
+2. **Master-mediated coordination.** Agents never talk to each other. All
+   cross-node operations go through the master. Agents only dial the master
+   (for registration and heartbeats) and respond to master RPCs.
+3. **Agent self-registration.** Agents register with the master on startup
+   (name, role, address, arch). The master maintains the live node registry.
+   No static `[[nodes]]` config required except for bootstrap.
+4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
+   (with resource data). If the master misses 3 heartbeats (90s), it
+   actively probes the agent. Failed probe marks the node unhealthy.
+5. **Tier-based placement.** `tier = "core"` runs on the master node.
+   `tier = "worker"` (default) is auto-placed on a worker with capacity.
+   Explicit `node = "orion"` overrides tier for pinned services.
+6. **Two separate certs for public services.** Internal cert
+   (`svc.mcp.metacircular.net`) issued by worker agent. Public cert
+   (`metacircular.net`) issued by edge agent. Internal names never
+   appear on edge certs.
+7. **`public = true` on routes.** Public routes declare intent with a
+   boolean flag. The master assigns the route to an edge node (currently
+   always svc). No explicit `edge` field in service definitions.
 
 ## Components
 
 ### Master (`mcp-master`)
 
-Long-lived orchestrator on the control plane node. Responsibilities:
+Long-lived orchestrator on rift. Responsibilities:
 
 - Accept CLI commands and dispatch to the correct agent
+- Maintain node registry from agent self-registration
+- Place services based on tier, explicit node, and resource availability
+- Detect `public = true` routes and coordinate edge setup
+- Validate public hostnames against allowed domain list
 - Aggregate status from all agents (fleet-wide view)
-- Node selection when `node` is omitted from a service definition
+- Probe agents on missed heartbeats
+- Health-aware scheduling using agent heartbeat data
 
 The master is stateless in the durable sense -- it rebuilds its world view
 from agents on startup. If the master goes down, running services continue
```
```diff
@@ -54,50 +88,40 @@ inputs don't work as a universal update mechanism.
 - All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
   `mcp-agent.service`
 
+Upgrades must be coordinated -- new RPCs cause `Unimplemented` errors on
+old agents.
+
 ### Edge agents
 
 svc runs an agent but does NOT run containers. Its agent manages mc-proxy
-routing only: when the master provisions a service on a compute node, svc's
-agent updates mc-proxy routes to point at the compute node's Tailnet
-address.
+routing only: when the master tells it to set up an edge route, it
+provisions a TLS cert from Metacrypt and registers the route in its local
+mc-proxy via the gRPC admin API.
 
-### MCIAS migration
+## Migration Plan
 
-MCIAS moves from the svc VPS to the control plane node, running as an
-MCP-managed container with an independent lifecycle. Bootstrap order:
+### Phase 1: Agent on svc
+Deploy mcp-agent to svc. Verify with `mcp node list`.
 
-1. MCIAS image pre-staged or pulled unauthenticated
-2. MCIAS starts (L4 passthrough through mc-proxy -- manages its own TLS)
-3. All other services bootstrap after MCIAS is up
+### Phase 2: Edge routing RPCs
+Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
+Test by calling directly from CLI.
 
-## Scheduling
+### Phase 3: Build mcp-master
+Core loop: registration, heartbeats, deploy routing, placement, edge
+coordination.
 
-Three placement modes, in order of specificity:
+### Phase 4: Agent registration and health
+Self-registration, heartbeat loop, master probe fallback, fleet status.
 
-1. `node = "rift"` -- explicit placement on a named node
-2. `node = "pi-pool"` -- master picks within a named cluster
-3. `node` omitted -- master picks any compute node with capacity
+### Phase 5: Cut over
+Point CLI at master, add tier fields to service defs, deploy agents to
+orion and hyperborea.
 
-Resource-aware placement via agent heartbeats (CPU, memory, disk). RPis
-with 4-8 GB RAM need resource tracking more than beefy servers.
-
-## Open Questions
-
-- **Control plane machine**: which desktop becomes the always-on node?
-- **Heartbeat model**: agent push vs. master poll?
-- **Cluster definition**: explicit pool config in master vs. node labels/tags?
-- **MCIAS migration timeline**: when to cut over from svc to control plane?
-- **Agent on svc**: what subset of agent RPCs does an edge-only agent need?
-
 ## What Phase E Does NOT Include
 
 These remain future work:
 
 - Auto-reconciliation (agent auto-restarting drifted containers)
-- Migration (snapshot streaming between nodes)
+- Live migration (snapshot streaming between nodes)
 - Web UI for fleet management
 - Observability / log aggregation
 - Object store
+- Multiple edge nodes / master HA
```