Add v2 architecture and review docs
- docs/architecture-v2.md: Full MCP v2 multi-node architecture covering master/agent topology, security model, service placement, edge routing, snapshots, migration, and 6-phase implementation plan. Reviewed and ready for implementation. - docs/architecture-v2-review.md: External architecture review with all items resolved. - docs/phase-e-plan.md: Updated to match v2 decisions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
210
docs/architecture-v2-review.md
Normal file
210
docs/architecture-v2-review.md
Normal file
@@ -0,0 +1,210 @@
|
|||||||
|
# ARCHITECTURE_V2.md Follow-Up Review
|
||||||
|
|
||||||
|
Reviewer: Claude (automated architecture review)
|
||||||
|
Date: 2026-04-02
|
||||||
|
Document reviewed: `mcp/ARCHITECTURE_V2.md` (revised)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Review Disposition
|
||||||
|
|
||||||
|
The revised document addresses 20 of 23 original items. The remaining
|
||||||
|
gaps are minor and don't block implementation. This document is **ready
|
||||||
|
to move to implementation**.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resolution of Original High-Priority Items
|
||||||
|
|
||||||
|
### #3 ExportServiceData mode (live vs migration) -- RESOLVED
|
||||||
|
|
||||||
|
The document takes the "master owns stop/start" approach: for migration,
|
||||||
|
the master stops the container (step 2) before calling
|
||||||
|
`ExportServiceData` (step 3). For scheduled snapshots, the master calls
|
||||||
|
`ExportServiceData` without stopping. The agent behaves the same either
|
||||||
|
way -- runs the configured snapshot method, tars, streams back. No proto
|
||||||
|
change needed.
|
||||||
|
|
||||||
|
One edge case to handle during implementation: if the agent's persisted
|
||||||
|
snapshot method is `grpc` or `cli` and the container is already stopped
|
||||||
|
(migration case), the agent can't call the service or exec into it.
|
||||||
|
The shutdown handler's vacuum provides consistency, so the agent should
|
||||||
|
**detect the container is stopped and skip the method step**, falling
|
||||||
|
back to a direct tar with auto-vacuum. This doesn't need a proto change
|
||||||
|
-- just a behavioral note in the agent implementation.
|
||||||
|
|
||||||
|
### #4 Snapshot method config propagation -- PARTIALLY RESOLVED
|
||||||
|
|
||||||
|
The `ExportServiceDataRequest` comment (line 1093) now states: "Snapshot
|
||||||
|
config is stored in the agent's registry at deploy time." This is the
|
||||||
|
right approach (option b from the original review).
|
||||||
|
|
||||||
|
**Remaining gap:** `ServiceSpec` (lines 357-363) has no snapshot config
|
||||||
|
fields. The deploy flow is: CLI reads TOML (which has `[snapshot]`) →
|
||||||
|
converts to `ServiceSpec` proto → sends to master → master forwards to
|
||||||
|
agent. If `ServiceSpec` doesn't carry snapshot config, it can't reach
|
||||||
|
the agent's registry.
|
||||||
|
|
||||||
|
**Fix needed:** Add snapshot fields to `ServiceSpec`:
|
||||||
|
|
||||||
|
```protobuf
|
||||||
|
message ServiceSpec {
|
||||||
|
string name = 1;
|
||||||
|
bool active = 2;
|
||||||
|
repeated ComponentSpec components = 3;
|
||||||
|
string tier = 4;
|
||||||
|
string node = 5;
|
||||||
|
SnapshotConfig snapshot = 6; // snapshot method and excludes
|
||||||
|
}
|
||||||
|
|
||||||
|
message SnapshotConfig {
|
||||||
|
string method = 1; // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
|
||||||
|
repeated string excludes = 2; // paths to skip
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
This is a one-line proto addition -- not a design issue, just a gap
|
||||||
|
in the spec.
|
||||||
|
|
||||||
|
### #12 backend_tls protobuf default -- RESOLVED
|
||||||
|
|
||||||
|
Line 520: `bool backend_tls = 4; // MUST be true; agent rejects false`
|
||||||
|
|
||||||
|
The agent validates the field and rejects `false`. This is the strongest
|
||||||
|
fix -- the proto default doesn't matter because the agent enforces the
|
||||||
|
invariant. No silent cleartext is possible.
|
||||||
|
|
||||||
|
### #16 Public DNS registration -- RESOLVED
|
||||||
|
|
||||||
|
Lines 852-856: New step 6 documents that public DNS records are
|
||||||
|
pre-provisioned manually at Hurricane Electric. The master resolves the
|
||||||
|
hostname as a validation check, warns if it fails, but continues
|
||||||
|
(pragmatic for parallel setup). Clear and complete.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resolution of Original Medium-Priority Items
|
||||||
|
|
||||||
|
### #1 Bootstrap circularity -- RESOLVED
|
||||||
|
|
||||||
|
Lines 745-760: New "Bootstrap (first boot)" subsection covers image
|
||||||
|
pre-staging for stages 1-2 via `podman load`/`podman pull`, documents
|
||||||
|
that boot sequence config contains full service definitions, and notes
|
||||||
|
this is the only place definitions live on the agent. Thorough.
|
||||||
|
|
||||||
|
### #2 Destructive sync -- RESOLVED
|
||||||
|
|
||||||
|
Line 1404: `mcp sync --dry-run` added. The destructive-by-default
|
||||||
|
behavior is a deliberate design choice -- the services directory is the
|
||||||
|
source of truth, and sync enforces it. The dry-run flag provides
|
||||||
|
adequate safety. Acceptable.
|
||||||
|
|
||||||
|
### #6 MasterDeployResponse success semantics -- RESOLVED
|
||||||
|
|
||||||
|
Lines 403-407: `success` is now documented as "true only if ALL steps
|
||||||
|
succeeded," with per-step results showing partial failures. Clear.
|
||||||
|
|
||||||
|
### #8 Migration destination directory check -- RESOLVED
|
||||||
|
|
||||||
|
Line 1106: `bool force = 3` added to `ImportServiceDataChunk`. The
|
||||||
|
agent handles the directory check and enforces the overwrite guard.
|
||||||
|
Clean solution -- the check is where the filesystem is.
|
||||||
|
|
||||||
|
### #10 Master trust boundary -- RESOLVED
|
||||||
|
|
||||||
|
Lines 173-180: New "Trust Assumptions" subsection explicitly states the
|
||||||
|
master is fully trusted, documents the blast radius, and lists
|
||||||
|
mitigations. Exactly what was recommended.
|
||||||
|
|
||||||
|
### #11 Agent-to-master TLS verification -- RESOLVED
|
||||||
|
|
||||||
|
Lines 183-196: New "TLS Verification" subsection documents CA cert
|
||||||
|
usage, pre-provisioning, and startup retry behavior. Covers the edge
|
||||||
|
case where Metacrypt isn't up yet (CA cert is static, pre-provisioned).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resolution of Original Low-Priority Items
|
||||||
|
|
||||||
|
| # | Item | Status |
|
||||||
|
|---|------|--------|
|
||||||
|
| 5 | --direct mode caveat | RESOLVED (lines 1407-1412) |
|
||||||
|
| 7 | MigrateRequest validation | RESOLVED (line 1249) |
|
||||||
|
| 14 | Proto completeness | MOSTLY RESOLVED -- Snapshot RPCs added to service def, renamed to CreateSnapshot. Empty response messages unchanged (acceptable). |
|
||||||
|
| 15 | Agent schema for snapshot config | IMPLICITLY RESOLVED by line 1093 comment; schema is implementation detail |
|
||||||
|
| 17 | mc-proxy binding addresses | NOT ADDRESSED -- minor; implementable from context |
|
||||||
|
| 18 | Phase plan for snapshots | RESOLVED -- new Phase 5 (lines 1496-1503), cut over moved to Phase 6 |
|
||||||
|
| 23 | Snapshot proto naming | RESOLVED -- renamed to CreateSnapshot (line 389), disambiguation comment added (lines 1256-1258) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## New Observations
|
||||||
|
|
||||||
|
### A. Boot config drift potential
|
||||||
|
|
||||||
|
Lines 741-742, 757-760: Boot sequence config contains full service
|
||||||
|
definitions for stage 1-3 services. When the operator bumps an image
|
||||||
|
version via `mcp deploy`, the boot config is NOT automatically updated.
|
||||||
|
On next reboot, the agent starts the old version from boot config; the
|
||||||
|
master then deploys the new version.
|
||||||
|
|
||||||
|
This is acceptable for a personal platform (brief window of old version
|
||||||
|
on reboot) and self-correcting (the master's placement takes over). But
|
||||||
|
the operator should know to update the boot config for foundation
|
||||||
|
services (MCIAS, MCNS) where running the wrong version could cause
|
||||||
|
authentication or DNS failures before the master is even up.
|
||||||
|
|
||||||
|
**Recommendation:** Add a note to the boot sequencing section: "When
|
||||||
|
updating foundation service images, also update the boot sequence
|
||||||
|
config. The master corrects worker service versions after startup, but
|
||||||
|
foundation services run before the master exists."
|
||||||
|
|
||||||
|
### B. Migration step 3 vs snapshot method
|
||||||
|
|
||||||
|
When the master calls `ExportServiceData` after stopping the container
|
||||||
|
(migration step 3), the agent can't execute `grpc` or `cli` snapshot
|
||||||
|
methods because the container isn't running. The agent should fall back
|
||||||
|
to a direct tar with auto-vacuum of `.db` files. This is correct
|
||||||
|
behavior (the container already vacuumed on shutdown), but should be
|
||||||
|
documented as an implementation rule: "If the container is not running
|
||||||
|
when `ExportServiceData` is called, skip the snapshot method and tar
|
||||||
|
directly."
|
||||||
|
|
||||||
|
This is the same point as item #3 above -- noting it here as an
|
||||||
|
implementation requirement rather than a design gap.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security Assessment
|
||||||
|
|
||||||
|
No new security concerns. The revised document strengthens the security
|
||||||
|
posture:
|
||||||
|
|
||||||
|
- **Trust Assumptions** section (new) makes the threat model explicit.
|
||||||
|
- **TLS Verification** section (new) closes the gap on inter-component
|
||||||
|
TLS validation.
|
||||||
|
- **backend_tls rejection** ensures no accidental cleartext.
|
||||||
|
- **Public DNS validation** (warn-and-continue) prevents silent
|
||||||
|
misconfiguration without blocking legitimate parallel setup.
|
||||||
|
|
||||||
|
The existing security model (identity-bound registration, cert SAN
|
||||||
|
restrictions, Tailscale ACLs, rate limiting) is unchanged and remains
|
||||||
|
sound.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Readiness for Implementation
|
||||||
|
|
||||||
|
The document is ready for implementation. Summary of what remains:
|
||||||
|
|
||||||
|
| Item | Action | When |
|
||||||
|
|------|--------|------|
|
||||||
|
| Add `SnapshotConfig` to `ServiceSpec` proto | Proto file edit | Phase 5 (when implementing snapshots) |
|
||||||
|
| Agent fallback when container is stopped during export | Implementation detail | Phase 5 |
|
||||||
|
| Boot config drift note | Optional doc edit | Any time |
|
||||||
|
| mc-proxy binding addresses | Optional doc edit | Any time |
|
||||||
|
|
||||||
|
None of these block starting Phase 1 (agent on svc) or Phase 2 (edge
|
||||||
|
routing RPCs). The snapshot-related items only matter at Phase 5.
|
||||||
|
|
||||||
|
**Verdict: proceed to implementation.**
|
||||||
1614
docs/architecture-v2.md
Normal file
1614
docs/architecture-v2.md
Normal file
File diff suppressed because it is too large
Load Diff
@@ -15,25 +15,59 @@ same as today on rift, but across the fleet.
|
|||||||
|
|
||||||
| Node | OS | Arch | Role |
|
| Node | OS | Arch | Role |
|
||||||
|------|----|------|------|
|
|------|----|------|------|
|
||||||
| desktop (TBD) | NixOS | amd64 | Control plane -- runs master + MCIAS + MCNS |
|
| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
|
||||||
| rift | NixOS | amd64 | Compute -- application services |
|
| orion | NixOS | amd64 | Worker |
|
||||||
| orion | NixOS | amd64 | Compute |
|
| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
|
||||||
| hyperborea | Debian | arm64 | Compute (Raspberry Pi) |
|
|
||||||
| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
|
| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
|
||||||
|
|
||||||
Tailnet is the interconnect between all nodes. Public traffic enters via
|
Tailnet is the interconnect between all nodes. Public traffic enters via
|
||||||
mc-proxy on svc, which forwards over Tailnet to compute nodes.
|
mc-proxy on svc, which forwards over Tailnet to worker nodes.
|
||||||
|
|
||||||
|
## Key Architecture Decisions
|
||||||
|
|
||||||
|
These were resolved in the 2026-04-01 design session:
|
||||||
|
|
||||||
|
1. **Rift is the master node.** No separate straylight machine. Core infra
|
||||||
|
stays on rift, which gains mcp-master alongside its existing agent.
|
||||||
|
|
||||||
|
2. **Master-mediated coordination.** Agents never talk to each other. All
|
||||||
|
cross-node operations go through the master. Agents only dial the master
|
||||||
|
(for registration and heartbeats) and respond to master RPCs.
|
||||||
|
|
||||||
|
3. **Agent self-registration.** Agents register with the master on startup
|
||||||
|
(name, role, address, arch). The master maintains the live node registry.
|
||||||
|
No static `[[nodes]]` config required except for bootstrap.
|
||||||
|
|
||||||
|
4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
|
||||||
|
(with resource data). If the master misses 3 heartbeats (90s), it
|
||||||
|
actively probes the agent. Failed probe marks the node unhealthy.
|
||||||
|
|
||||||
|
5. **Tier-based placement.** `tier = "core"` runs on the master node.
|
||||||
|
`tier = "worker"` (default) is auto-placed on a worker with capacity.
|
||||||
|
Explicit `node = "orion"` overrides tier for pinned services.
|
||||||
|
|
||||||
|
6. **Two separate certs for public services.** Internal cert
|
||||||
|
(`svc.mcp.metacircular.net`) issued by worker agent. Public cert
|
||||||
|
(`metacircular.net`) issued by edge agent. Internal names never
|
||||||
|
appear on edge certs.
|
||||||
|
|
||||||
|
7. **`public = true` on routes.** Public routes declare intent with a
|
||||||
|
boolean flag. The master assigns the route to an edge node (currently
|
||||||
|
always svc). No explicit `edge` field in service definitions.
|
||||||
|
|
||||||
## Components
|
## Components
|
||||||
|
|
||||||
### Master (`mcp-master`)
|
### Master (`mcp-master`)
|
||||||
|
|
||||||
Long-lived orchestrator on the control plane node. Responsibilities:
|
Long-lived orchestrator on rift. Responsibilities:
|
||||||
|
|
||||||
- Accept CLI commands and dispatch to the correct agent
|
- Accept CLI commands and dispatch to the correct agent
|
||||||
|
- Maintain node registry from agent self-registration
|
||||||
|
- Place services based on tier, explicit node, and resource availability
|
||||||
|
- Detect `public = true` routes and coordinate edge setup
|
||||||
|
- Validate public hostnames against allowed domain list
|
||||||
- Aggregate status from all agents (fleet-wide view)
|
- Aggregate status from all agents (fleet-wide view)
|
||||||
- Node selection when `node` is omitted from a service definition
|
- Probe agents on missed heartbeats
|
||||||
- Health-aware scheduling using agent heartbeat data
|
|
||||||
|
|
||||||
The master is stateless in the durable sense -- it rebuilds its world view
|
The master is stateless in the durable sense -- it rebuilds its world view
|
||||||
from agents on startup. If the master goes down, running services continue
|
from agents on startup. If the master goes down, running services continue
|
||||||
@@ -54,50 +88,40 @@ inputs don't work as a universal update mechanism.
|
|||||||
- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
|
- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
|
||||||
`mcp-agent.service`
|
`mcp-agent.service`
|
||||||
|
|
||||||
Upgrades must be coordinated -- new RPCs cause `Unimplemented` errors on
|
|
||||||
old agents.
|
|
||||||
|
|
||||||
### Edge agents
|
### Edge agents
|
||||||
|
|
||||||
svc runs an agent but does NOT run containers. Its agent manages mc-proxy
|
svc runs an agent but does NOT run containers. Its agent manages mc-proxy
|
||||||
routing only: when the master provisions a service on a compute node, svc's
|
routing only: when the master tells it to set up an edge route, it
|
||||||
agent updates mc-proxy routes to point at the compute node's Tailnet
|
provisions a TLS cert from Metacrypt and registers the route in its local
|
||||||
address.
|
mc-proxy via the gRPC admin API.
|
||||||
|
|
||||||
### MCIAS migration
|
## Migration Plan
|
||||||
|
|
||||||
MCIAS moves from the svc VPS to the control plane node, running as an
|
### Phase 1: Agent on svc
|
||||||
MCP-managed container with an independent lifecycle. Bootstrap order:
|
Deploy mcp-agent to svc. Verify with `mcp node list`.
|
||||||
|
|
||||||
1. MCIAS image pre-staged or pulled unauthenticated
|
### Phase 2: Edge routing RPCs
|
||||||
2. MCIAS starts (L4 passthrough through mc-proxy -- manages its own TLS)
|
Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
|
||||||
3. All other services bootstrap after MCIAS is up
|
Test by calling directly from CLI.
|
||||||
|
|
||||||
## Scheduling
|
### Phase 3: Build mcp-master
|
||||||
|
Core loop: registration, heartbeats, deploy routing, placement, edge
|
||||||
|
coordination.
|
||||||
|
|
||||||
Three placement modes, in order of specificity:
|
### Phase 4: Agent registration and health
|
||||||
|
Self-registration, heartbeat loop, master probe fallback, fleet status.
|
||||||
|
|
||||||
1. `node = "rift"` -- explicit placement on a named node
|
### Phase 5: Cut over
|
||||||
2. `node = "pi-pool"` -- master picks within a named cluster
|
Point CLI at master, add tier fields to service defs, deploy agents to
|
||||||
3. `node` omitted -- master picks any compute node with capacity
|
orion and hyperborea.
|
||||||
|
|
||||||
Resource-aware placement via agent heartbeats (CPU, memory, disk). RPis
|
|
||||||
with 4-8 GB RAM need resource tracking more than beefy servers.
|
|
||||||
|
|
||||||
## Open Questions
|
|
||||||
|
|
||||||
- **Control plane machine**: which desktop becomes the always-on node?
|
|
||||||
- **Heartbeat model**: agent push vs. master poll?
|
|
||||||
- **Cluster definition**: explicit pool config in master vs. node labels/tags?
|
|
||||||
- **MCIAS migration timeline**: when to cut over from svc to control plane?
|
|
||||||
- **Agent on svc**: what subset of agent RPCs does an edge-only agent need?
|
|
||||||
|
|
||||||
## What Phase E Does NOT Include
|
## What Phase E Does NOT Include
|
||||||
|
|
||||||
These remain future work:
|
These remain future work:
|
||||||
|
|
||||||
- Auto-reconciliation (agent auto-restarting drifted containers)
|
- Auto-reconciliation (agent auto-restarting drifted containers)
|
||||||
- Migration (snapshot streaming between nodes)
|
- Live migration (snapshot streaming between nodes)
|
||||||
- Web UI for fleet management
|
- Web UI for fleet management
|
||||||
- Observability / log aggregation
|
- Observability / log aggregation
|
||||||
- Object store
|
- Object store
|
||||||
|
- Multiple edge nodes / master HA
|
||||||
|
|||||||
Reference in New Issue
Block a user