Add v2 architecture and review docs

- docs/architecture-v2.md: Full MCP v2 multi-node architecture
  covering master/agent topology, security model, service placement,
  edge routing, snapshots, migration, and 6-phase implementation plan.
  Reviewed and ready for implementation.
- docs/architecture-v2-review.md: External architecture review with
  all items resolved.
- docs/phase-e-plan.md: Updated to match v2 decisions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:09:36 -07:00
parent 8fb6374257
commit f3a608c4ae
3 changed files with 1885 additions and 37 deletions

docs/architecture-v2-review.md

@@ -0,0 +1,210 @@
# ARCHITECTURE_V2.md Follow-Up Review
Reviewer: Claude (automated architecture review)
Date: 2026-04-02
Document reviewed: `mcp/ARCHITECTURE_V2.md` (revised)

---
## Review Disposition
The revised document addresses 20 of 23 original items. The remaining
gaps are minor and don't block implementation. This document is **ready
to move to implementation**.

---
## Resolution of Original High-Priority Items
### #3 ExportServiceData mode (live vs migration) -- RESOLVED
The document takes the "master owns stop/start" approach: for migration,
the master stops the container (step 2) before calling
`ExportServiceData` (step 3). For scheduled snapshots, the master calls
`ExportServiceData` without stopping. The agent behaves the same either
way -- runs the configured snapshot method, tars, streams back. No proto
change needed.
One edge case to handle during implementation: if the agent's persisted
snapshot method is `grpc` or `cli` and the container is already stopped
(migration case), the agent can't call the service or exec into it.
The shutdown handler's vacuum provides consistency, so the agent should
**detect the container is stopped and skip the method step**, falling
back to a direct tar with auto-vacuum. This doesn't need a proto change
-- just a behavioral note in the agent implementation.
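The rule can be sketched as a small decision function on the agent (function and value names here are illustrative, not the real agent API):

```go
package main

import "fmt"

// snapshotStrategy sketches the behavioral note from item #3: methods
// that need a running container ("grpc", "cli") are skipped when the
// container is already stopped (the migration case), falling back to a
// direct tar with auto-vacuum. The shutdown handler's vacuum already
// left the data consistent, so the plain tar is safe.
func snapshotStrategy(method string, containerRunning bool) string {
	needsContainer := method == "grpc" || method == "cli"
	if needsContainer && !containerRunning {
		return "tar-direct"
	}
	if method == "" {
		return "tar-direct" // default when no method is configured
	}
	return method
}

func main() {
	fmt.Println(snapshotStrategy("grpc", false)) // migration: container stopped
	fmt.Println(snapshotStrategy("grpc", true))  // scheduled snapshot: live
}
```

For scheduled snapshots the container is running, so the configured method is used unchanged.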
### #4 Snapshot method config propagation -- PARTIALLY RESOLVED
The `ExportServiceDataRequest` comment (line 1093) now states: "Snapshot
config is stored in the agent's registry at deploy time." This is the
right approach (option b from the original review).
**Remaining gap:** `ServiceSpec` (lines 357-363) has no snapshot config
fields. The deploy flow is: CLI reads TOML (which has `[snapshot]`) →
converts to `ServiceSpec` proto → sends to master → master forwards to
agent. If `ServiceSpec` doesn't carry snapshot config, it can't reach
the agent's registry.
**Fix needed:** Add snapshot fields to `ServiceSpec`:
```protobuf
message ServiceSpec {
  string name = 1;
  bool active = 2;
  repeated ComponentSpec components = 3;
  string tier = 4;
  string node = 5;
  SnapshotConfig snapshot = 6; // snapshot method and excludes
}

message SnapshotConfig {
  string method = 1;            // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
  repeated string excludes = 2; // paths to skip
}
```
This is a small proto addition -- not a design issue, just a gap
in the spec.
### #12 backend_tls protobuf default -- RESOLVED
Line 520: `bool backend_tls = 4; // MUST be true; agent rejects false`
The agent validates the field and rejects `false`. This is the strongest
fix -- the proto default doesn't matter because the agent enforces the
invariant. No silent cleartext is possible.
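A minimal sketch of the agent-side check, with a stand-in struct for the real proto message:

```go
package main

import (
	"errors"
	"fmt"
)

// EdgeRouteRequest stands in for the real proto message; only the
// field relevant to item #12 is shown.
type EdgeRouteRequest struct {
	BackendTLS bool
}

// validateEdgeRoute sketches the agent-side invariant: backend_tls
// MUST be true, so the proto3 default (false) can never silently
// enable a cleartext backend -- it is rejected instead.
func validateEdgeRoute(req EdgeRouteRequest) error {
	if !req.BackendTLS {
		return errors.New("backend_tls must be true; refusing cleartext backend")
	}
	return nil
}

func main() {
	fmt.Println(validateEdgeRoute(EdgeRouteRequest{BackendTLS: false}))
	fmt.Println(validateEdgeRoute(EdgeRouteRequest{BackendTLS: true}))
}
```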
### #16 Public DNS registration -- RESOLVED
Lines 852-856: New step 6 documents that public DNS records are
pre-provisioned manually at Hurricane Electric. The master resolves the
hostname as a validation check, warns if it fails, but continues
(pragmatic for parallel setup). Clear and complete.

---
## Resolution of Original Medium-Priority Items
### #1 Bootstrap circularity -- RESOLVED
Lines 745-760: New "Bootstrap (first boot)" subsection covers image
pre-staging for stages 1-2 via `podman load`/`podman pull`, documents
that boot sequence config contains full service definitions, and notes
this is the only place definitions live on the agent. Thorough.
### #2 Destructive sync -- RESOLVED
Line 1404: `mcp sync --dry-run` added. The destructive-by-default
behavior is a deliberate design choice -- the services directory is the
source of truth, and sync enforces it. The dry-run flag provides
adequate safety. Acceptable.
### #6 MasterDeployResponse success semantics -- RESOLVED
Lines 403-407: `success` is now documented as "true only if ALL steps
succeeded," with per-step results showing partial failures. Clear.
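The documented semantics reduce to a fold over the step results; a sketch with illustrative types:

```go
package main

import "fmt"

// StepResult mirrors the per-step results described for
// MasterDeployResponse; field names are illustrative.
type StepResult struct {
	Name    string
	Success bool
}

// overallSuccess sketches item #6: success is true only if ALL steps
// succeeded, so a partial failure is visible in the step list but is
// never reported as overall success.
func overallSuccess(steps []StepResult) bool {
	for _, s := range steps {
		if !s.Success {
			return false
		}
	}
	return true
}

func main() {
	steps := []StepResult{{"place", true}, {"deploy", true}, {"edge-route", false}}
	fmt.Println(overallSuccess(steps)) // one step failed
}
```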
### #8 Migration destination directory check -- RESOLVED
Line 1106: `bool force = 3` added to `ImportServiceDataChunk`. The
agent handles the directory check and enforces the overwrite guard.
Clean solution -- the check is where the filesystem is.
### #10 Master trust boundary -- RESOLVED
Lines 173-180: New "Trust Assumptions" subsection explicitly states the
master is fully trusted, documents the blast radius, and lists
mitigations. Exactly what was recommended.
### #11 Agent-to-master TLS verification -- RESOLVED
Lines 183-196: New "TLS Verification" subsection documents CA cert
usage, pre-provisioning, and startup retry behavior. Covers the edge
case where Metacrypt isn't up yet (CA cert is static, pre-provisioned).

---
## Resolution of Original Low-Priority Items
| # | Item | Status |
|---|------|--------|
| 5 | --direct mode caveat | RESOLVED (lines 1407-1412) |
| 7 | MigrateRequest validation | RESOLVED (line 1249) |
| 14 | Proto completeness | MOSTLY RESOLVED -- Snapshot RPCs added to service def, renamed to CreateSnapshot. Empty response messages unchanged (acceptable). |
| 15 | Agent schema for snapshot config | IMPLICITLY RESOLVED by line 1093 comment; schema is implementation detail |
| 17 | mc-proxy binding addresses | NOT ADDRESSED -- minor; implementable from context |
| 18 | Phase plan for snapshots | RESOLVED -- new Phase 5 (lines 1496-1503), cutover moved to Phase 6 |
| 23 | Snapshot proto naming | RESOLVED -- renamed to CreateSnapshot (line 389), disambiguation comment added (lines 1256-1258) |

---
## New Observations
### A. Boot config drift potential
Lines 741-742, 757-760: Boot sequence config contains full service
definitions for stage 1-3 services. When the operator bumps an image
version via `mcp deploy`, the boot config is NOT automatically updated.
On next reboot, the agent starts the old version from boot config; the
master then deploys the new version.
This is acceptable for a personal platform (brief window of old version
on reboot) and self-correcting (the master's placement takes over). But
the operator should know to update the boot config for foundation
services (MCIAS, MCNS) where running the wrong version could cause
authentication or DNS failures before the master is even up.
**Recommendation:** Add a note to the boot sequencing section: "When
updating foundation service images, also update the boot sequence
config. The master corrects worker service versions after startup, but
foundation services run before the master exists."
### B. Migration step 3 vs snapshot method
When the master calls `ExportServiceData` after stopping the container
(migration step 3), the agent can't execute `grpc` or `cli` snapshot
methods because the container isn't running. The agent should fall back
to a direct tar with auto-vacuum of `.db` files. This is correct
behavior (the container already vacuumed on shutdown), but should be
documented as an implementation rule: "If the container is not running
when `ExportServiceData` is called, skip the snapshot method and tar
directly."
This is the same point as item #3 above -- noting it here as an
implementation requirement rather than a design gap.

---
## Security Assessment
No new security concerns. The revised document strengthens the security
posture:
- **Trust Assumptions** section (new) makes the threat model explicit.
- **TLS Verification** section (new) closes the gap on inter-component
TLS validation.
- **backend_tls rejection** ensures no accidental cleartext.
- **Public DNS validation** (warn-and-continue) prevents silent
misconfiguration without blocking legitimate parallel setup.
The existing security model (identity-bound registration, cert SAN
restrictions, Tailscale ACLs, rate limiting) is unchanged and remains
sound.

---
## Readiness for Implementation
The document is ready for implementation. Summary of what remains:
| Item | Action | When |
|------|--------|------|
| Add `SnapshotConfig` to `ServiceSpec` proto | Proto file edit | Phase 5 (when implementing snapshots) |
| Agent fallback when container is stopped during export | Implementation detail | Phase 5 |
| Boot config drift note | Optional doc edit | Any time |
| mc-proxy binding addresses | Optional doc edit | Any time |
None of these block starting Phase 1 (agent on svc) or Phase 2 (edge
routing RPCs). The snapshot-related items only matter at Phase 5.
**Verdict: proceed to implementation.**

docs/architecture-v2.md (new file, 1614 lines)

File diff suppressed because it is too large.

docs/phase-e-plan.md

```diff
@@ -15,25 +15,59 @@ same as today on rift, but across the fleet.
 | Node | OS | Arch | Role |
 |------|----|------|------|
-| desktop (TBD) | NixOS | amd64 | Control plane -- runs master + MCIAS + MCNS |
-| rift | NixOS | amd64 | Compute -- application services |
-| orion | NixOS | amd64 | Compute |
-| hyperborea | Debian | arm64 | Compute (Raspberry Pi) |
+| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
+| orion | NixOS | amd64 | Worker |
+| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
 | svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
 
 Tailnet is the interconnect between all nodes. Public traffic enters via
-mc-proxy on svc, which forwards over Tailnet to compute nodes.
+mc-proxy on svc, which forwards over Tailnet to worker nodes.
+
+## Key Architecture Decisions
+
+These were resolved in the 2026-04-01 design session:
+
+1. **Rift is the master node.** No separate straylight machine. Core infra
+   stays on rift, which gains mcp-master alongside its existing agent.
+2. **Master-mediated coordination.** Agents never talk to each other. All
+   cross-node operations go through the master. Agents only dial the master
+   (for registration and heartbeats) and respond to master RPCs.
+3. **Agent self-registration.** Agents register with the master on startup
+   (name, role, address, arch). The master maintains the live node registry.
+   No static `[[nodes]]` config required except for bootstrap.
+4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
+   (with resource data). If the master misses 3 heartbeats (90s), it
+   actively probes the agent. Failed probe marks the node unhealthy.
+5. **Tier-based placement.** `tier = "core"` runs on the master node.
+   `tier = "worker"` (default) is auto-placed on a worker with capacity.
+   Explicit `node = "orion"` overrides tier for pinned services.
+6. **Two separate certs for public services.** Internal cert
+   (`svc.mcp.metacircular.net`) issued by worker agent. Public cert
+   (`metacircular.net`) issued by edge agent. Internal names never
+   appear on edge certs.
+7. **`public = true` on routes.** Public routes declare intent with a
+   boolean flag. The master assigns the route to an edge node (currently
+   always svc). No explicit `edge` field in service definitions.
 
 ## Components
 
 ### Master (`mcp-master`)
 
-Long-lived orchestrator on the control plane node. Responsibilities:
+Long-lived orchestrator on rift. Responsibilities:
 
 - Accept CLI commands and dispatch to the correct agent
+- Maintain node registry from agent self-registration
+- Place services based on tier, explicit node, and resource availability
+- Detect `public = true` routes and coordinate edge setup
+- Validate public hostnames against allowed domain list
 - Aggregate status from all agents (fleet-wide view)
-- Node selection when `node` is omitted from a service definition
+- Probe agents on missed heartbeats
+- Health-aware scheduling using agent heartbeat data
 
 The master is stateless in the durable sense -- it rebuilds its world view
 from agents on startup. If the master goes down, running services continue
```
```diff
@@ -54,50 +88,40 @@ inputs don't work as a universal update mechanism.
 - All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
   `mcp-agent.service`
 
+Upgrades must be coordinated -- new RPCs cause `Unimplemented` errors on
+old agents.
+
 ### Edge agents
 
 svc runs an agent but does NOT run containers. Its agent manages mc-proxy
-routing only: when the master provisions a service on a compute node, svc's
-agent updates mc-proxy routes to point at the compute node's Tailnet
-address.
+routing only: when the master tells it to set up an edge route, it
+provisions a TLS cert from Metacrypt and registers the route in its local
+mc-proxy via the gRPC admin API.
 
-### MCIAS migration
+## Migration Plan
 
-MCIAS moves from the svc VPS to the control plane node, running as an
-MCP-managed container with an independent lifecycle. Bootstrap order:
+### Phase 1: Agent on svc
+Deploy mcp-agent to svc. Verify with `mcp node list`.
 
-1. MCIAS image pre-staged or pulled unauthenticated
-2. MCIAS starts (L4 passthrough through mc-proxy -- manages its own TLS)
-3. All other services bootstrap after MCIAS is up
+### Phase 2: Edge routing RPCs
+Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
+Test by calling directly from CLI.
 
-## Scheduling
+### Phase 3: Build mcp-master
+Core loop: registration, heartbeats, deploy routing, placement, edge
+coordination.
 
-Three placement modes, in order of specificity:
+### Phase 4: Agent registration and health
+Self-registration, heartbeat loop, master probe fallback, fleet status.
 
-1. `node = "rift"` -- explicit placement on a named node
-2. `node = "pi-pool"` -- master picks within a named cluster
-3. `node` omitted -- master picks any compute node with capacity
+### Phase 5: Cut over
+Point CLI at master, add tier fields to service defs, deploy agents to
+orion and hyperborea.
 
-Resource-aware placement via agent heartbeats (CPU, memory, disk). RPis
-with 4-8 GB RAM need resource tracking more than beefy servers.
-
-## Open Questions
-
-- **Control plane machine**: which desktop becomes the always-on node?
-- **Heartbeat model**: agent push vs. master poll?
-- **Cluster definition**: explicit pool config in master vs. node labels/tags?
-- **MCIAS migration timeline**: when to cut over from svc to control plane?
-- **Agent on svc**: what subset of agent RPCs does an edge-only agent need?
-
 ## What Phase E Does NOT Include
 
 These remain future work:
 
 - Auto-reconciliation (agent auto-restarting drifted containers)
-- Migration (snapshot streaming between nodes)
+- Live migration (snapshot streaming between nodes)
 - Web UI for fleet management
 - Observability / log aggregation
 - Object store
+- Multiple edge nodes / master HA
```