diff --git a/docs/architecture-v2-review.md b/docs/architecture-v2-review.md
new file mode 100644
index 0000000..8195cde
--- /dev/null
+++ b/docs/architecture-v2-review.md
@@ -0,0 +1,210 @@
+# ARCHITECTURE_V2.md Follow-Up Review
+
+Reviewer: Claude (automated architecture review)
+Date: 2026-04-02
+Document reviewed: `mcp/ARCHITECTURE_V2.md` (revised)
+
+---
+
+## Review Disposition
+
+The revised document addresses 20 of 23 original items. The remaining
+gaps are minor and don't block implementation. The architecture is
+**ready to move to implementation**.
+
+---
+
+## Resolution of Original High-Priority Items
+
+### #3 ExportServiceData mode (live vs migration) -- RESOLVED
+
+The document takes the "master owns stop/start" approach: for migration,
+the master stops the container (step 2) before calling
+`ExportServiceData` (step 3). For scheduled snapshots, the master calls
+`ExportServiceData` without stopping. The agent behaves the same either
+way -- runs the configured snapshot method, tars, streams back. No proto
+change needed.
+
+One edge case to handle during implementation: if the agent's persisted
+snapshot method is `grpc` or `cli` and the container is already stopped
+(migration case), the agent can't call the service or exec into it.
+The shutdown handler's vacuum provides consistency, so the agent should
+**detect the container is stopped and skip the method step**, falling
+back to a direct tar with auto-vacuum. This doesn't need a proto change
+-- just a behavioral note in the agent implementation.
+
+### #4 Snapshot method config propagation -- PARTIALLY RESOLVED
+
+The `ExportServiceDataRequest` comment (line 1093) now states: "Snapshot
+config is stored in the agent's registry at deploy time." This is the
+right approach (option b from the original review).
+
+**Remaining gap:** `ServiceSpec` (lines 357-363) has no snapshot config
+fields. The deploy flow is: CLI reads TOML (which has `[snapshot]`) →
+converts to `ServiceSpec` proto → sends to master → master forwards to
+agent. If `ServiceSpec` doesn't carry snapshot config, it can't reach
+the agent's registry.
+
+**Fix needed:** Add snapshot fields to `ServiceSpec`:
+
+```protobuf
+message ServiceSpec {
+  string name    = 1;
+  bool   active  = 2;
+  repeated ComponentSpec components = 3;
+  string tier    = 4;
+  string node    = 5;
+  SnapshotConfig snapshot = 6;  // snapshot method and excludes
+}
+
+message SnapshotConfig {
+  string method            = 1;  // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
+  repeated string excludes = 2;  // paths to skip
+}
+```
+
+This is a small proto addition (one field on `ServiceSpec` plus one new
+message) -- not a design issue, just a gap in the spec.
+
+### #12 backend_tls protobuf default -- RESOLVED
+
+Line 520: `bool backend_tls = 4; // MUST be true; agent rejects false`
+
+The agent validates the field and rejects `false`. This is the strongest
+fix -- the proto default doesn't matter because the agent enforces the
+invariant. No silent cleartext is possible.
+
+### #16 Public DNS registration -- RESOLVED
+
+Lines 852-856: New step 6 documents that public DNS records are
+pre-provisioned manually at Hurricane Electric. The master resolves the
+hostname as a validation check, warns if it fails, but continues
+(pragmatic for parallel setup). Clear and complete.
+
+---
+
+## Resolution of Original Medium-Priority Items
+
+### #1 Bootstrap circularity -- RESOLVED
+
+Lines 745-760: New "Bootstrap (first boot)" subsection covers image
+pre-staging for stages 1-2 via `podman load`/`podman pull`, documents
+that boot sequence config contains full service definitions, and notes
+this is the only place definitions live on the agent. Thorough.
+
+### #2 Destructive sync -- RESOLVED
+
+Line 1404: `mcp sync --dry-run` added. The destructive-by-default
+behavior is a deliberate design choice -- the services directory is the
+source of truth, and sync enforces it. The dry-run flag provides
+adequate safety. Acceptable.
+
+### #6 MasterDeployResponse success semantics -- RESOLVED
+
+Lines 403-407: `success` is now documented as "true only if ALL steps
+succeeded," with per-step results showing partial failures. Clear.
+
+### #8 Migration destination directory check -- RESOLVED
+
+Line 1106: `bool force = 3` added to `ImportServiceDataChunk`. The
+agent handles the directory check and enforces the overwrite guard.
+Clean solution -- the check is where the filesystem is.
+
+### #10 Master trust boundary -- RESOLVED
+
+Lines 173-180: New "Trust Assumptions" subsection explicitly states the
+master is fully trusted, documents the blast radius, and lists
+mitigations. Exactly what was recommended.
+
+### #11 Agent-to-master TLS verification -- RESOLVED
+
+Lines 183-196: New "TLS Verification" subsection documents CA cert
+usage, pre-provisioning, and startup retry behavior. Covers the edge
+case where Metacrypt isn't up yet (CA cert is static, pre-provisioned).
+
+---
+
+## Resolution of Original Low-Priority Items
+
+| # | Item | Status |
+|---|------|--------|
+| 5 | --direct mode caveat | RESOLVED (lines 1407-1412) |
+| 7 | MigrateRequest validation | RESOLVED (line 1249) |
+| 14 | Proto completeness | MOSTLY RESOLVED -- Snapshot RPCs added to service def, renamed to CreateSnapshot. Empty response messages unchanged (acceptable). |
+| 15 | Agent schema for snapshot config | IMPLICITLY RESOLVED by line 1093 comment; schema is implementation detail |
+| 17 | mc-proxy binding addresses | NOT ADDRESSED -- minor; implementable from context |
+| 18 | Phase plan for snapshots | RESOLVED -- new Phase 5 (lines 1496-1503), cutover moved to Phase 6 |
+| 23 | Snapshot proto naming | RESOLVED -- renamed to CreateSnapshot (line 389), disambiguation comment added (lines 1256-1258) |
+
+---
+
+## New Observations
+
+### A. Boot config drift potential
+
+Lines 741-742, 757-760: Boot sequence config contains full service
+definitions for stage 1-3 services. When the operator bumps an image
+version via `mcp deploy`, the boot config is NOT automatically updated.
+On next reboot, the agent starts the old version from boot config; the
+master then deploys the new version.
+
+This is acceptable for a personal platform (brief window of old version
+on reboot) and self-correcting (the master's placement takes over). But
+the operator should know to update the boot config for foundation
+services (MCIAS, MCNS) where running the wrong version could cause
+authentication or DNS failures before the master is even up.
+
+**Recommendation:** Add a note to the boot sequencing section: "When
+updating foundation service images, also update the boot sequence
+config. The master corrects worker service versions after startup, but
+foundation services run before the master exists."
+
+### B. Migration step 3 vs snapshot method
+
+When the master calls `ExportServiceData` after stopping the container
+(migration step 3), the agent can't execute `grpc` or `cli` snapshot
+methods because the container isn't running. The agent should fall back
+to a direct tar with auto-vacuum of `.db` files. This is correct
+behavior (the container already vacuumed on shutdown), but should be
+documented as an implementation rule: "If the container is not running
+when `ExportServiceData` is called, skip the snapshot method and tar
+directly."
+
+This is the same point as item #3 above -- noting it here as an
+implementation requirement rather than a design gap.
+
+---
+
+## Security Assessment
+
+No new security concerns. The revised document strengthens the security
+posture:
+
+- **Trust Assumptions** section (new) makes the threat model explicit.
+- **TLS Verification** section (new) closes the gap on inter-component
+  TLS validation.
+- **backend_tls rejection** ensures no accidental cleartext.
+- **Public DNS validation** (warn-and-continue) prevents silent
+  misconfiguration without blocking legitimate parallel setup.
+
+The existing security model (identity-bound registration, cert SAN
+restrictions, Tailscale ACLs, rate limiting) is unchanged and remains
+sound.
+
+---
+
+## Readiness for Implementation
+
+The document is ready for implementation. Summary of what remains:
+
+| Item | Action | When |
+|------|--------|------|
+| Add `SnapshotConfig` to `ServiceSpec` proto | Proto file edit | Phase 5 (when implementing snapshots) |
+| Agent fallback when container is stopped during export | Implementation detail | Phase 5 |
+| Boot config drift note | Optional doc edit | Any time |
+| mc-proxy binding addresses | Optional doc edit | Any time |
+
+None of these block starting Phase 1 (agent on svc) or Phase 2 (edge
+routing RPCs). The snapshot-related items only matter at Phase 5.
+
+**Verdict: proceed to implementation.**
diff --git a/docs/architecture-v2.md b/docs/architecture-v2.md
new file mode 100644
index 0000000..07a6278
--- /dev/null
+++ b/docs/architecture-v2.md
@@ -0,0 +1,1614 @@
+# MCP v2 -- Multi-Node Control Plane
+
+## Overview
+
+MCP v2 introduces multi-node orchestration with a master/agent topology.
+The CLI no longer dials agents directly. A dedicated **mcp-master** daemon
+coordinates deployments across nodes, handles cross-node concerns (edge
+routing, certificate provisioning, DNS), and serves as the single control
+point for the platform.
+
+### Motivation
+
+v1 deployed successfully on a single node (rift) but exposed operational
+pain points as services needed public-facing routes through svc:
+
+- **Manual edge routing**: Exposing mcq.metacircular.net required hand-editing
+  mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating
+  the SQLite database when the config and database diverged, and debugging
+  silent failures. Every redeployment risked breaking the public route.
+
+- **Dynamic port instability**: The route system assigns ephemeral host ports
+  that change on every deploy. svc's mc-proxy pointed at a specific port
+  (e.g., `100.95.252.120:48080`), which went stale after redeployment.
+  Container ports are also localhost-only under rootless podman, requiring
+  explicit Tailscale IP bindings for external access.
+
+- **$PORT env override conflict**: The mcdsl config loader overrides
+  `listen_addr` from `$PORT` when routes are present. This meant containers
+  ignored their configured port and listened on the route-allocated one
+  instead, breaking explicit port mappings that expected the config port.
+
+- **Cert chain issues**: mc-proxy requires full certificate chains (leaf +
+  intermediates). Certs provisioned outside the standard metacrypt flow
+  were leaf-only and caused silent TLS handshake failures (`client_bytes=7
+  backend_bytes=0` with no error logged).
+
+- **mc-proxy database divergence**: mc-proxy persists routes in SQLite.
+  Routes added via the admin API override the TOML config. Editing the TOML
+  alone had no effect until the database was manually updated -- a failure
+  mode that took hours to diagnose.
+
+- **No cross-node coordination**: The v1 CLI talks directly to individual
+  agents. There is no mechanism for one agent to tell another "set up a
+  route for this service." Every cross-node operation was manual.
+
+v2 addresses all of these by making the master the single coordination
+point for deployments, with agents handling local concerns (containers,
+mc-proxy routes, cert provisioning) on instruction from the master.
+
+### What Changes from v1
+
+| Concern | v1 | v2 |
+|---------|----|----|
+| CLI target | CLI dials agents directly | CLI dials the master |
+| Node awareness | CLI routes by `node` field in service defs | Master owns the node registry |
+| Service placement | Explicit `node` required | `tier` field; master auto-places workers |
+| Edge routing | Manual mc-proxy config on svc | Master coordinates edge setup |
+| Cert provisioning | Agent provisions for local mc-proxy only | Edge agent provisions its own public certs |
+| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones |
+| Auth model | Token validation only | Per-RPC role-based authorization |
+
+### What Stays the Same
+
+The agent's core responsibilities are unchanged: it manages containers via
+podman, stores its local registry in SQLite, monitors for drift, and alerts
+the operator. The agent gains new RPCs for edge routing and health reporting
+but does not become aware of other nodes -- the master handles all
+cross-node coordination. Agents never communicate with each other.
+
+---
+
+## Topology
+
+```
+Operator workstation (vade)
+  ┌──────────────────────────┐
+  │  mcp (CLI)               │
+  │                          │
+  │  gRPC ───────────────────┼─── Tailnet ──┐
+  └──────────────────────────┘              │
+                                            ▼
+Master + worker node (rift)
+  ┌──────────────────────────────────────────────────────┐
+  │  mcp-master                                          │
+  │    ├── node registry (agents self-register)          │
+  │    ├── service placement (tier-aware)                │
+  │    ├── edge routing coordinator                      │
+  │    └── SQLite state (edge routes, placements)        │
+  │                                                      │
+  │  mcp-agent                                           │
+  │    ├── mcias container                               │
+  │    ├── mcns container                                │
+  │    ├── metacrypt container                           │
+  │    ├── mcr container                                 │
+  │    ├── mcq, mcdoc, exo, sgard, kls ...               │
+  │    └── mc-proxy (rift)                               │
+  └──────────┬──────────────────┬───────────┬────────────┘
+             │                  │           │
+          Tailnet           Tailnet      Tailnet
+             │                  │           │
+             ▼                  ▼           ▼
+Worker (orion)                  Edge (svc)
+  ┌──────────────────┐           ┌─────────────────────┐
+  │  mcp-agent       │           │  mcp-agent          │
+  │    ├── services  │           │    ├── mc-proxy     │
+  │    └── mc-proxy  │           │    └── (routes only)│
+  └──────────────────┘           └─────────────────────┘
+  NixOS / amd64                  Debian / amd64
+```
+
+### Node Roles
+
+| Role | Purpose | Nodes |
+|------|---------|-------|
+| **master** | Runs mcp-master + mcp-agent. Hosts core infrastructure. Single coordination point. | rift |
+| **worker** | Runs mcp-agent. Hosts application services. | orion |
+| **edge** | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc |
+
+Every node runs an mcp-agent. Rift also runs mcp-master. The master's
+local agent manages the infrastructure services (MCIAS, mcns, metacrypt,
+mcr) the same way other agents manage application services.
+
+### mc-proxy Mesh
+
+Each node runs its own mc-proxy instance. They form a routing mesh:
+
+```
+mc-proxy (rift)
+  ├── :443  L7 routes for internal .svc.mcp hostnames
+  ├── :8443 L4 passthrough for API servers (MCIAS, metacrypt, mcr)
+  └── :9443 L4 passthrough for gRPC services
+
+mc-proxy (orion)
+  ├── :443  L7 routes for services hosted on this node
+  └── :8443 L4/L7 routes for internal APIs
+
+mc-proxy (svc)
+  └── :443  L7 termination for public hostnames
+            → forwards to internal .svc.mcp endpoints over Tailnet
+```
+
+---
+
+## Security Model
+
+### Authentication and Authorization
+
+All gRPC channels (CLI↔master, master↔agent, agent→master) use TLS 1.3
+with MCIAS bearer tokens. Every entity has a distinct MCIAS identity:
+
+| Entity | MCIAS Identity | Account Type |
+|--------|---------------|--------------|
+| Operator CLI | `kyle` (or personal account) | human |
+| mcp-master | `mcp-master` | service |
+| Agent on rift | `agent-rift` | service |
+| Agent on orion | `agent-orion` | service |
+| Agent on svc | `agent-svc` | service |
+
+RPCs are authorized by **caller role**, not just authentication:
+
+| RPC Category | Allowed Callers | Rejected |
+|--------------|-----------------|----------|
+| CLI→master (Deploy, Undeploy, Status, Sync) | human accounts, `mcp-master` (for self-management) | agent service accounts |
+| Agent→master (Register, Heartbeat) | `agent-*` service accounts | human accounts, `mcp-master` |
+| Master→agent (Deploy, SetupEdgeRoute, HealthCheck) | `mcp-master` only | all others |
+
+The auth interceptor on both master and agent validates the bearer token
+via MCIAS, then checks the caller's account type and service name against
+the RPC's allowed-caller list. Unauthorized calls return
+`PermissionDenied`.
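+
+A minimal sketch of that allowed-caller check, written in Python purely
+for illustration. The category keys and the `is_authorized` helper are
+hypothetical names, not part of the MCP codebase; only the rules
+themselves come from the table above.
+
+```python
+from fnmatch import fnmatchcase
+
+# Allowed callers per RPC category, transcribed from the table above.
+# Each entry is (account_type, name_pattern); None means "any name".
+# The category keys are illustrative labels, not real identifiers.
+ALLOWED_CALLERS = {
+    "cli_to_master": [("human", None), ("service", "mcp-master")],
+    "agent_to_master": [("service", "agent-*")],
+    "master_to_agent": [("service", "mcp-master")],
+}
+
+
+def is_authorized(category, account_type, name):
+    """Return True if the caller may invoke RPCs in this category."""
+    for allowed_type, pattern in ALLOWED_CALLERS.get(category, []):
+        if account_type != allowed_type:
+            continue
+        if pattern is None or fnmatchcase(name, pattern):
+            return True
+    return False  # the interceptor would answer PermissionDenied
+```
+
+Unknown categories fall through to a deny, so a newly added RPC stays
+closed until it is listed explicitly.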
+
+### Trust Assumptions
+
+The master is a **fully trusted** component. A compromised master can
+control the entire fleet: deploy arbitrary containers, exfiltrate data
+via snapshots, redirect traffic via edge routes. This is inherent to
+the master/agent topology and acceptable for a single-operator personal
+platform. Mitigations: the master runs on the operator's always-on
+machine (rift) behind Tailscale, authenticates to MCIAS with its own
+service identity, and all communication is TLS 1.3.
+
+### TLS Verification
+
+All gRPC connections verify the peer's TLS certificate against the
+Metacrypt CA cert. Agents configure the CA cert path in their config:
+
+```toml
+[tls]
+ca_cert = "/srv/mcp/certs/metacircular-ca.pem"
+```
+
+When an agent starts before the master is available (e.g., svc's agent
+starts before rift's boot sequence completes), the TLS connection fails
+and the agent retries with exponential backoff. The CA cert itself is
+pre-provisioned on all nodes — it does not depend on Metacrypt being
+running.
+
+### Registration Security
+
+Agents self-register with the master, but registration is **identity-bound**:
+
+1. The master extracts the caller's MCIAS service name from the validated
+   token (e.g., `agent-rift`).
+2. The expected node name is derived by stripping the `agent-` prefix.
+3. The `RegisterRequest.name` must match. `agent-rift` can only register
+   `name = "rift"`. A rogue agent cannot impersonate another node.
+4. The master maintains an allowlist of permitted agent identities:
+
+```toml
+[registration]
+allowed_agents = ["agent-rift", "agent-svc", "agent-orion"]
+```
+
+Registration from unknown identities is rejected. Re-registration from the
+same identity updates the entry (handles restarts) and logs a warning with
+the previous address for audit.
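+
+The identity-bound check can be sketched as follows. This is an
+illustrative Python outline, not the master's actual code: the
+`check_registration` helper and its return convention are assumptions,
+while the prefix rule and allowlist mirror the steps above.
+
+```python
+ALLOWED_AGENTS = {"agent-rift", "agent-svc", "agent-orion"}
+AGENT_PREFIX = "agent-"
+
+
+def check_registration(identity, requested_name):
+    """Return None if registration is accepted, else a rejection reason.
+
+    identity is the MCIAS service name taken from the validated token;
+    requested_name is RegisterRequest.name.
+    """
+    if identity not in ALLOWED_AGENTS:
+        return "identity not in allowlist"
+    if not identity.startswith(AGENT_PREFIX):
+        return "identity is not an agent account"
+    expected = identity[len(AGENT_PREFIX):]
+    if requested_name != expected:
+        return "name does not match identity"
+    return None
+```
+
+Because the node name is derived from the token rather than read from
+the request, a valid agent credential can only ever claim its own node.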
+
+### Edge Route Validation
+
+When the master sets up an edge route, it validates both ends:
+
+- **Public hostname**: must fall under an allowed domain
+  (`metacircular.net`, `wntrmute.net`). Validation uses proper domain
+  label matching, so `evilmetacircular.net` is rejected. Implementation:
+  the hostname must equal an allowed domain or end with `.` followed by
+  one (e.g., `mcq.metacircular.net` matches, `metacircular.net` matches,
+  `xmetacircular.net` does not).
+
+- **Backend hostname**: must end with `.svc.mcp.metacircular.net`
+  (the internal DNS zone). The edge agent resolves it and verifies the
+  result is a Tailnet IP (100.64.0.0/10). Non-Tailnet backends are
+  rejected.
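+
+Both checks are small enough to sketch. The Python below is a
+hypothetical illustration: the function names are invented, and
+lowercasing the hostname and stripping a trailing dot are assumptions
+this document does not specify. DNS resolution is left out; the last
+helper only classifies an already-resolved address.
+
+```python
+import ipaddress
+
+ALLOWED_DOMAINS = ["metacircular.net", "wntrmute.net"]
+INTERNAL_SUFFIX = ".svc.mcp.metacircular.net"
+TAILNET = ipaddress.ip_network("100.64.0.0/10")
+
+
+def _normalize(hostname):
+    # Assumption: compare case-insensitively and ignore a trailing dot.
+    return hostname.lower().rstrip(".")
+
+
+def public_hostname_allowed(hostname):
+    """Label-aware match: equal to a domain, or ends with '.' + domain."""
+    host = _normalize(hostname)
+    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
+
+
+def backend_hostname_allowed(hostname):
+    """Backend names must live in the internal DNS zone."""
+    return _normalize(hostname).endswith(INTERNAL_SUFFIX)
+
+
+def is_tailnet_ip(addr):
+    """True if the resolved address falls inside 100.64.0.0/10."""
+    return ipaddress.ip_address(addr) in TAILNET
+```
+
+A plain suffix test on the bare domain would accept
+`evilmetacircular.net`; requiring the leading `.` is what makes the
+match respect label boundaries.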
+
+### Certificate Issuance Policies
+
+Per-identity restrictions in Metacrypt limit what each agent can issue:
+
+| Agent | Allowed SANs | Denied SANs |
+|-------|-------------|-------------|
+| `agent-rift`, `agent-orion` | `*.svc.mcp.metacircular.net` | public domains |
+| `agent-svc` | `*.metacircular.net`, `*.wntrmute.net` | `.svc.mcp.` names |
+
+This ensures a compromised edge agent cannot issue certs for internal
+names, and a compromised worker agent cannot issue certs for public
+names. The Metacrypt CA is not publicly trusted, which limits blast
+radius further.
+
+### Rate Limiting
+
+The master rate-limits agent RPCs:
+
+- `Register`: 1 per minute per identity.
+- `Heartbeat`: 1 per 10 seconds per identity.
+- Maximum registered nodes: 16 (configurable).
+
+Excess calls return `ResourceExhausted`.
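+
+One way to read these limits is as a minimum spacing between accepted
+calls per identity. The Python sketch below takes that interpretation,
+which is an assumption (a token bucket would also satisfy the wording);
+the `RateLimiter` class is hypothetical.
+
+```python
+# Minimum seconds between accepted calls, per RPC, from the list above.
+MIN_INTERVAL = {"Register": 60.0, "Heartbeat": 10.0}
+
+
+class RateLimiter:
+    def __init__(self):
+        # (rpc, identity) -> time of the last accepted call
+        self._last = {}
+
+    def allow(self, rpc, identity, now):
+        """Return True if the call is accepted at time `now` (seconds)."""
+        key = (rpc, identity)
+        last = self._last.get(key)
+        if last is not None and now - last < MIN_INTERVAL[rpc]:
+            return False  # the master would return ResourceExhausted
+        self._last[key] = now
+        return True
+```
+
+Rejected calls do not reset the timer, so a misbehaving agent cannot
+lock itself out indefinitely by retrying too fast.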
+
+### Tailscale ACLs
+
+Network-level restriction (configured in Tailscale admin, not MCP):
+
+- rift (master): can reach all agent gRPC ports (9444) on all nodes.
+  The master process needs this to forward deploys and set up edge
+  routes.
+- svc: can reach master gRPC (9555), backend service ports (443, 8443,
+  9443), and Metacrypt (8443). Blocked from MCIAS management, MCR push,
+  and agent gRPC on other nodes.
+- Workers: can reach master gRPC, MCR (pull), Metacrypt, MCIAS. Blocked
+  from other workers' agent ports and svc's agent port.
+
+---
+
+## Service Placement
+
+Services declare a **tier** that determines where they run:
+
+- **`tier = "core"`** — scheduled on the master node. Used for platform
+  infrastructure: MCIAS, metacrypt, mcr, mcns.
+- **`tier = "worker"`** (default) — auto-placed on a worker node. The
+  master selects the node based on container count and health.
+
+Explicit node pinning is still supported via `node = "orion"` for cases
+where a service must run on a specific machine. When `node` is set, it
+overrides `tier`.
+
+### Placement Algorithm
+
+Worker placement is deliberately simple:
+
+1. Filter eligible nodes: healthy workers.
+2. Select the node with the fewest running containers.
+3. Break ties alphabetically by node name (deterministic).
+
+All v2 nodes are amd64, so architecture filtering is not needed.
+Services do not declare resource requirements for v2. The heartbeat
+reports available resources (CPU, memory, disk) which the master uses
+for health assessment, but placement is container-count based. Resource-
+aware bin-packing is future work.
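+
+The placement rules, including the `node` override and the `tier`
+default, fit in a few lines. This Python sketch is illustrative only:
+the `place` function, the dict-based node records, and returning `None`
+when no worker is eligible are assumptions, not the master's real
+interface.
+
+```python
+def place(spec, nodes, master="rift"):
+    """Pick a node name for a service, or None if no worker is eligible.
+
+    spec:  dict with optional "node" and "tier" keys.
+    nodes: list of dicts with "name", "role", "status", "containers".
+    """
+    if spec.get("node"):
+        return spec["node"]  # explicit pin overrides tier
+    tier = spec.get("tier") or "worker"  # absent tier means worker
+    if tier == "core":
+        return master
+    eligible = [
+        n for n in nodes
+        if n["role"] == "worker" and n["status"] == "healthy"
+    ]
+    if not eligible:
+        return None
+    # Fewest containers first; ties broken alphabetically by name.
+    best = min(eligible, key=lambda n: (n["containers"], n["name"]))
+    return best["name"]
+```
+
+Sorting on the `(containers, name)` tuple gives the deterministic
+tie-break in a single comparison.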
+
+### Service Definition
+
+```toml
+name   = "mcq"
+tier   = "worker"                    # default; placed by master
+active = true
+
+[[components]]
+name    = "mcq"
+image   = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
+volumes = ["/srv/mcq:/srv/mcq"]
+cmd     = ["server", "--config", "/srv/mcq/mcq.toml"]
+
+# Internal route: handled by the local node's mc-proxy.
+[[components.routes]]
+name     = "internal"
+port     = 8443
+mode     = "l7"
+
+# Public route: master sets up edge routing on svc.
+[[components.routes]]
+name     = "public"
+port     = 8443
+mode     = "l7"
+hostname = "mcq.metacircular.net"
+public   = true
+```
+
+Core service example:
+
+```toml
+name   = "mcias"
+tier   = "core"                      # always on master node
+active = true
+
+[[components]]
+name    = "mcias"
+image   = "mcr.svc.mcp.metacircular.net:8443/mcias:v1.10.5"
+volumes = ["/srv/mcias:/srv/mcias"]
+cmd     = ["mciassrv", "-config", "/srv/mcias/mcias.toml"]
+```
+
+### v1 Compatibility
+
+Existing v1 service definitions with `node = "rift"` continue to work
+(explicit pinning). New v2 fields (`tier`, `public`) default to
+`"worker"` and `false` respectively when absent. The validation rule
+changes from "node required" to "either node or tier must be set;
+tier defaults to worker if both are empty."
+
+---
+
+## Proto Definitions
+
+### ServiceSpec and RouteSpec Updates
+
+```protobuf
+message ServiceSpec {
+  string name    = 1;
+  bool   active  = 2;
+  repeated ComponentSpec components = 3;  // unchanged from v1
+  string tier    = 4;  // "core" or "worker" (default: "worker")
+  string node    = 5;  // explicit node pin (overrides tier)
+  SnapshotConfig snapshot = 6;  // snapshot method and excludes
+}
+
+message SnapshotConfig {
+  string method            = 1;  // "grpc", "cli", "exec: <cmd>", "full", or "" (default)
+  repeated string excludes = 2;  // paths relative to /srv/<service>/ to skip
+}
+
+message RouteSpec {
+  string name     = 1;
+  int32  port     = 2;
+  string mode     = 3;  // "l4" or "l7"
+  string hostname = 4;
+  bool   public   = 5;  // triggers edge routing
+}
+```
+
+### McpMasterService
+
+```protobuf
+service McpMasterService {
+  // CLI operations.
+  rpc Deploy(MasterDeployRequest) returns (MasterDeployResponse);
+  rpc Undeploy(MasterUndeployRequest) returns (MasterUndeployResponse);
+  rpc Status(MasterStatusRequest) returns (MasterStatusResponse);
+  rpc Sync(MasterSyncRequest) returns (MasterSyncResponse);
+  rpc Migrate(MigrateRequest) returns (MigrateResponse);
+
+  // Fleet management.
+  rpc ListNodes(ListNodesRequest) returns (ListNodesResponse);
+
+  // Snapshots (CLI-triggered).
+  rpc CreateSnapshot(CreateSnapshotRequest) returns (CreateSnapshotResponse);
+  rpc ListSnapshots(ListSnapshotsRequest) returns (ListSnapshotsResponse);
+
+  // Agent registration and health (called by agents).
+  rpc Register(RegisterRequest) returns (RegisterResponse);
+  rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);
+}
+
+message MasterDeployRequest {
+  ServiceSpec service = 1;
+}
+
+message MasterDeployResponse {
+  string node    = 1;  // node the service was placed on
+  bool   success = 2;  // true only if ALL steps succeeded
+  string error   = 3;
+  // Per-step results for operator visibility. Partial failure is
+  // possible: deploy succeeds but edge routing fails. The CLI shows
+  // exactly what worked and what didn't.
+  StepResult deploy_result     = 4;
+  StepResult edge_route_result = 5;
+  StepResult dns_result        = 6;
+}
+
+message StepResult {
+  string step    = 1;
+  bool   success = 2;
+  string error   = 3;
+}
+
+message MasterUndeployRequest {
+  string service_name = 1;
+}
+
+message MasterUndeployResponse {
+  bool   success = 1;
+  string error   = 2;
+}
+
+message MasterStatusRequest {
+  string service_name = 1;  // empty = all services
+}
+
+message MasterStatusResponse {
+  repeated ServiceStatus services = 1;
+}
+
+message ServiceStatus {
+  string name   = 1;
+  string node   = 2;
+  string tier   = 3;
+  string status = 4;  // "running", "stopped", "unhealthy", "unknown"
+  repeated EdgeRouteStatus edge_routes = 5;
+}
+
+message EdgeRouteStatus {
+  string hostname   = 1;
+  string edge_node  = 2;
+  string cert_expires = 3;
+}
+
+message MasterSyncRequest {
+  repeated ServiceSpec services = 1;
+}
+
+message MasterSyncResponse {
+  repeated StepResult results = 1;
+}
+
+message ListNodesRequest {}
+
+message ListNodesResponse {
+  repeated NodeInfo nodes = 1;
+}
+
+message NodeInfo {
+  string name     = 1;
+  string role     = 2;
+  string address  = 3;
+  string arch     = 4;
+  string status   = 5;  // "healthy", "unhealthy", "unknown"
+  int32  containers = 6;
+  string last_heartbeat = 7;  // RFC3339
+}
+
+message RegisterRequest {
+  string name    = 1;
+  string role    = 2;
+  string address = 3;
+  string arch    = 4;
+}
+
+message RegisterResponse {
+  bool accepted = 1;
+}
+
+message HeartbeatRequest {
+  string name           = 1;
+  int64  cpu_millicores = 2;
+  int64  memory_bytes   = 3;
+  int64  disk_bytes     = 4;
+  int32  containers     = 5;
+}
+
+message HeartbeatResponse {
+  bool acknowledged = 1;
+}
+```
+
+### Agent RPC Additions
+
+```protobuf
+// Health probe -- called by master on missed heartbeats.
+rpc HealthCheck(HealthCheckRequest) returns (HealthCheckResponse);
+
+// Edge routing -- called by master on edge nodes.
+rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
+rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
+rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);
+
+message HealthCheckRequest {}
+
+message HealthCheckResponse {
+  string status     = 1;  // "healthy" or "degraded"
+  int32  containers = 2;
+}
+
+message SetupEdgeRouteRequest {
+  string hostname         = 1;  // public hostname
+  string backend_hostname = 2;  // internal .svc.mcp hostname
+  int32  backend_port     = 3;  // port on worker's mc-proxy
+  bool   backend_tls      = 4;  // MUST be true; agent rejects false
+}
+
+message SetupEdgeRouteResponse {}
+
+message RemoveEdgeRouteRequest {
+  string hostname = 1;
+}
+
+message RemoveEdgeRouteResponse {}
+
+message ListEdgeRoutesRequest {}
+
+message ListEdgeRoutesResponse {
+  repeated EdgeRoute routes = 1;
+}
+
+message EdgeRoute {
+  string hostname         = 1;
+  string backend_hostname = 2;
+  int32  backend_port     = 3;
+  string cert_serial      = 4;
+  string cert_expires     = 5;
+}
+```
+
+---
+
+## Agent Registration and Health
+
+### Registration
+
+Agents self-register with the master on startup by calling
+`McpMasterService.Register`. The master validates the caller's MCIAS
+identity (see Security Model) and adds the node to its registry (SQLite).
+
+If the master is unreachable at startup, the agent retries with
+exponential backoff (1s, 2s, 4s, ... capped at 60s). Running containers
+are unaffected — registration is a management concern, not a runtime one.
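+
+The retry schedule described here is a capped doubling sequence. A
+small Python generator shows the delays; the `backoff_delays` name is
+hypothetical, and jitter is omitted because this document does not
+mention it.
+
+```python
+from itertools import islice
+
+
+def backoff_delays(base=1.0, cap=60.0):
+    """Yield retry delays in seconds: base, 2*base, 4*base, ... up to cap."""
+    delay = base
+    while True:
+        yield delay
+        delay = min(delay * 2, cap)
+
+
+# First eight delays an agent would wait between Register attempts.
+first_eight = list(islice(backoff_delays(), 8))
+```
+
+With the defaults, `first_eight` is `[1.0, 2.0, 4.0, 8.0, 16.0, 32.0,
+60.0, 60.0]`: the seventh attempt would be 64s uncapped, so the cap
+takes over from there.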
+
+### Heartbeats
+
+Agents send heartbeats every 30 seconds via `McpMasterService.Heartbeat`.
+Each heartbeat includes resource data (CPU, memory, disk, container count).
+The master derives the agent's node name from the authenticated MCIAS
+identity (same as registration) — the `name` field in the heartbeat is
+verified against the token, not trusted blindly.
+
+If the master has not received a heartbeat from an agent in 90 seconds
+(3 missed intervals), it probes the agent with `HealthCheck`. If the
+probe fails (5-second timeout), the agent is marked unhealthy. Unhealthy
+nodes are excluded from placement but their services continue running.
+
+When a previously unhealthy agent sends a heartbeat, the master marks it
+healthy again.
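+
+The health transitions can be sketched as two small functions. This is
+an illustrative Python outline under stated assumptions: the function
+names are invented, a node that answers the probe keeps its current
+status, and exactly 90 seconds of silence counts as overdue.
+
+```python
+HEARTBEAT_INTERVAL = 30.0  # seconds between heartbeats
+MISSED_LIMIT = 3           # missed intervals before probing
+
+
+def assess(status, last_heartbeat, now, probe):
+    """Return the node's status after a periodic check.
+
+    probe is a callable that performs HealthCheck with the 5-second
+    timeout and returns True on success.
+    """
+    overdue = now - last_heartbeat >= HEARTBEAT_INTERVAL * MISSED_LIMIT
+    if not overdue:
+        return status
+    # Assumption: a successful probe leaves the status unchanged.
+    return status if probe() else "unhealthy"
+
+
+def on_heartbeat(status):
+    """Any received heartbeat marks the node healthy again."""
+    return "healthy"
+```
+
+The probe runs only when heartbeats are overdue, so a healthy fleet
+generates no master-initiated traffic.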
+
+### Node Identity
+
+Each agent authenticates to MCIAS as a distinct service user:
+`agent-rift`, `agent-svc`, `agent-orion`. Benefits:
+
+- **Audit**: logs show which node performed an action.
+- **Least privilege**: edge agents don't need image pull access.
+- **Revocation**: a compromised node's credentials can be revoked
+  without affecting the fleet.
+
+---
+
+## mcp-master
+
+### Responsibilities
+
+1. **Accept CLI commands** via gRPC (deploy, undeploy, status, sync).
+2. **Maintain node registry** from agent self-registration (SQLite).
+3. **Place services** on nodes based on tier, explicit node, and
+   container count.
+4. **Detect public routes** (`public = true`) and coordinate edge routing.
+5. **Validate public hostnames** against allowed domain list.
+6. **Assign edge nodes** for public routes (currently always svc).
+7. **Coordinate undeploy** across nodes.
+8. **Aggregate status** from all agents for fleet-wide views.
+
+### What the Master Does NOT Do
+
+- Store container state (agents own their registries).
+- Manage container lifecycle directly (agents do this).
+- Run containers (the co-located agent does).
+- Replace the agent on any node.
+- Talk to agents on behalf of other agents.
+
+### Master State (SQLite)
+
+The master maintains a SQLite database at `/srv/mcp-master/master.db`
+with four tables:
+
+```sql
+-- Registered nodes. Populated by agent Register RPCs.
+-- Rebuilt from agent re-registration on master restart.
+CREATE TABLE nodes (
+    name        TEXT PRIMARY KEY,
+    role        TEXT NOT NULL,
+    address     TEXT NOT NULL,
+    arch        TEXT NOT NULL,
+    status      TEXT NOT NULL DEFAULT 'unknown',
+    containers  INTEGER NOT NULL DEFAULT 0,
+    last_heartbeat TEXT
+);
+
+-- Service placements. Records which node hosts which service.
+-- Populated on deploy, removed on undeploy.
+CREATE TABLE placements (
+    service_name TEXT PRIMARY KEY,
+    node         TEXT NOT NULL REFERENCES nodes(name),
+    tier         TEXT NOT NULL,
+    deployed_at  TEXT NOT NULL
+);
+
+-- Edge routes. Records public routes for undeploy cleanup.
+CREATE TABLE edge_routes (
+    hostname         TEXT PRIMARY KEY,
+    service_name     TEXT NOT NULL REFERENCES placements(service_name),
+    edge_node        TEXT NOT NULL REFERENCES nodes(name),
+    backend_hostname TEXT NOT NULL,
+    backend_port     INTEGER NOT NULL,
+    created_at       TEXT NOT NULL
+);
+
+-- Snapshot metadata. The archive files live on disk at
+-- /srv/mcp-master/snapshots/<service>/<timestamp>.tar.zst.
+CREATE TABLE snapshots (
+    id           INTEGER PRIMARY KEY AUTOINCREMENT,
+    service_name TEXT    NOT NULL,
+    node         TEXT    NOT NULL,
+    filename     TEXT    NOT NULL,
+    size_bytes   INTEGER NOT NULL,
+    created_at   TEXT    NOT NULL
+);
+CREATE INDEX idx_snapshots_service ON snapshots(service_name, created_at DESC);
+```
+
+On master restart, the node registry is rebuilt as agents re-register
+(within 30s via heartbeat). Placements and edge routes persist across
+restarts. The master reconciles placements against actual agent state
+on startup (see Reconciliation).
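+
+As a rough illustration of how the placement and edge-route tables
+relate, the following Python sketch loads a trimmed copy of the schema
+into an in-memory SQLite database and looks up the edge routes an
+undeploy would need to clean up. The helper names (`open_db`,
+`routes_for_service`) and the sample rows are hypothetical. One detail
+worth noting: SQLite only enforces the `REFERENCES` clauses when
+`PRAGMA foreign_keys = ON` is set on the connection.
+
+```python
+import sqlite3
+
+# Trimmed copy of the schema above (snapshots table omitted).
+SCHEMA = """
+CREATE TABLE nodes (
+    name TEXT PRIMARY KEY, role TEXT NOT NULL, address TEXT NOT NULL,
+    arch TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'unknown',
+    containers INTEGER NOT NULL DEFAULT 0, last_heartbeat TEXT
+);
+CREATE TABLE placements (
+    service_name TEXT PRIMARY KEY,
+    node TEXT NOT NULL REFERENCES nodes(name),
+    tier TEXT NOT NULL, deployed_at TEXT NOT NULL
+);
+CREATE TABLE edge_routes (
+    hostname TEXT PRIMARY KEY,
+    service_name TEXT NOT NULL REFERENCES placements(service_name),
+    edge_node TEXT NOT NULL REFERENCES nodes(name),
+    backend_hostname TEXT NOT NULL, backend_port INTEGER NOT NULL,
+    created_at TEXT NOT NULL
+);
+"""
+
+def open_db(path=":memory:"):
+    """Open the database with foreign-key enforcement switched on."""
+    conn = sqlite3.connect(path)
+    conn.execute("PRAGMA foreign_keys = ON")
+    conn.executescript(SCHEMA)
+    return conn
+
+def routes_for_service(conn, service_name):
+    """Edge routes that must be removed when the service is undeployed."""
+    rows = conn.execute(
+        "SELECT hostname, edge_node FROM edge_routes "
+        "WHERE service_name = ? ORDER BY hostname",
+        (service_name,),
+    )
+    return rows.fetchall()
+
+# Hypothetical sample data.
+conn = open_db()
+conn.executemany(
+    "INSERT INTO nodes (name, role, address, arch) VALUES (?, ?, ?, ?)",
+    [("orion", "worker", "100.x.x.x:9444", "amd64"),
+     ("svc", "edge", "100.x.x.x:9444", "amd64")],
+)
+conn.execute(
+    "INSERT INTO placements VALUES (?, ?, ?, ?)",
+    ("mcq", "orion", "worker", "2026-04-01T00:00:00Z"),
+)
+conn.execute(
+    "INSERT INTO edge_routes VALUES (?, ?, ?, ?, ?, ?)",
+    ("mcq.metacircular.net", "mcq", "svc",
+     "mcq.svc.mcp.metacircular.net", 8443, "2026-04-01T00:00:00Z"),
+)
+print(routes_for_service(conn, "mcq"))
+```
+
+With enforcement on, inserting an edge route for a service that has no
+placement row fails with `sqlite3.IntegrityError`, which is the
+behavior the `REFERENCES` clauses suggest is intended.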
+
+### Master Configuration
+
+```toml
+[server]
+grpc_addr = "100.x.x.x:9555"     # master listens on Tailnet
+tls_cert  = "/srv/mcp-master/certs/cert.pem"
+tls_key   = "/srv/mcp-master/certs/key.pem"
+
+[database]
+path = "/srv/mcp-master/master.db"
+
+[mcias]
+server_url   = "https://mcias.metacircular.net:8443"
+service_name = "mcp-master"
+
+[edge]
+allowed_domains = ["metacircular.net", "wntrmute.net"]
+
+[registration]
+allowed_agents = ["agent-rift", "agent-svc", "agent-orion"]
+max_nodes      = 16
+
+[timeouts]
+deploy       = "5m"
+edge_route   = "30s"
+health_check = "5s"
+undeploy     = "2m"
+snapshot     = "10m"
+
+# Bootstrap: master's own agent (can't self-register before master starts).
+[[nodes]]
+name    = "rift"
+address = "100.95.252.120:9444"
+role    = "master"
+```
+
+### Boot Sequencing
+
+The master node runs core infrastructure that other services depend on.
+On boot, these services must start in dependency order. Only the master
+needs sequencing -- worker and edge nodes start their agent and wait
+for registration with the master.
+
+The master's agent config declares boot stages:
+
+```toml
+[[boot.sequence]]
+name     = "foundation"
+services = ["mcias", "mcns"]
+timeout  = "120s"
+health   = "tcp"
+
+[[boot.sequence]]
+name     = "core"
+services = ["metacrypt", "mcr"]
+timeout  = "60s"
+health   = "tcp"
+
+[[boot.sequence]]
+name     = "management"
+services = ["mcp-master"]
+timeout  = "30s"
+health   = "grpc"
+```
+
+**Stage 1 -- Foundation**: MCIAS and MCNS start first. Every other
+service needs authentication (MCIAS) and DNS resolution (MCNS).
+
+**Stage 2 -- Core**: Metacrypt and MCR start once auth and DNS are
+available. Agents need Metacrypt for cert provisioning and MCR for
+image pulls.
+
+**Stage 3 -- Management**: MCP-Master starts last. It requires all
+infrastructure services to be running before it can coordinate the fleet.
+
+mcp-master runs as a container managed by the agent, just like any
+other service. This means updates are a normal `mcp deploy` (or image
+bump in the bootstrap config), and the agent handles restarts via
+podman's `--restart unless-stopped` policy.
+
+**Bootstrap (first boot):** On initial cluster setup, no images exist
+in MCR yet. The boot sequence config references container images, but
+MCR doesn't start until stage 2. Resolution:
+
+- Stage 1 and 2 images (MCIAS, MCNS, Metacrypt, MCR) must be
+  **pre-staged** into the local podman image store before first boot
+  (`podman load` or `podman pull` from an external source).
+- Once MCR is running (stage 2), stage 3 (mcp-master) can pull its
+  image from MCR normally.
+- Subsequent boots use cached images. Image updates go through the
+  normal `mcp deploy` flow (which pulls from MCR).
+
+The boot sequence config contains full service definitions (image,
+volumes, cmd, routes) — not just service names. This is the only
+place where service definitions live on the agent rather than being
+pushed from the CLI via the master.
+
+**Health check types:**
+- `tcp` — connect to the container's mapped port. Success = connection
+  accepted. Used for most services.
+- `grpc` — call the gRPC health endpoint. Used for services with gRPC.
+- `http` — GET a health endpoint. Future option.
+
+**Timeout behavior:** Depends on the stage:
+- **Foundation** (MCIAS, MCNS): failure **blocks** boot. The agent
+  retries indefinitely with backoff and alerts the operator. All
+  downstream services depend on auth and DNS — proceeding is futile.
+- **Core and management**: failure logs an error and proceeds. The
+  operator can fix the failed service manually. Partial boot is
+  better than no boot for non-foundation services.
+
+The agent treats boot sequencing as a startup concern only. Once all
+stages complete, normal operations proceed. If a foundation service
+crashes at runtime, the agent restarts it independently via the
+`--restart unless-stopped` podman policy.
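+
+The timeout rules above could be sketched roughly as follows. This is
+not the agent's actual code: `tcp_healthy` and `wait_for_stage` are
+hypothetical names, the backoff constants are arbitrary, and operator
+alerting is left out. The `sleep` and `clock` parameters exist only so
+the loop can be exercised without real delays.
+
+```python
+import socket
+import time
+
+def tcp_healthy(host, port, timeout=5.0):
+    """`tcp` health check: success means the connection was accepted."""
+    try:
+        with socket.create_connection((host, port), timeout=timeout):
+            return True
+    except OSError:
+        return False
+
+def wait_for_stage(check, timeout_s, blocking,
+                   sleep=time.sleep, clock=time.monotonic):
+    """Poll `check` until it passes.
+
+    Non-blocking stages (core, management) give up once timeout_s has
+    elapsed and return False so boot can proceed. Blocking stages
+    (foundation) keep retrying with capped exponential backoff and
+    only ever return True.
+    """
+    deadline = clock() + timeout_s
+    delay = 1.0
+    while True:
+        if check():
+            return True
+        if not blocking and clock() >= deadline:
+            return False
+        sleep(delay)
+        delay = min(delay * 2, 30.0)
+```
+
+A foundation stage would call `wait_for_stage(check, 120, blocking=True)`;
+core and management stages would pass `blocking=False` and log an error
+when the result is `False`.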
+
+**Boot config drift:** The boot sequence config contains pinned image
+versions. When the operator updates a service via `mcp deploy`, the
+boot config is NOT automatically updated. On reboot, the agent starts
+the old version; the master then deploys the current version. This is
+self-correcting for core and management services, but foundation
+services (MCIAS, MCNS) run before the master exists. **When updating
+foundation service images, also update the boot sequence config.**
+
+### Reconciliation
+
+On startup, the master actively probes all nodes it knows about from
+its persisted `nodes` table — it does not wait for agents to
+re-register. This means the master has a fleet-wide view within seconds
+of starting, rather than waiting up to 30s per agent heartbeat cycle.
+
+The initial probe cycle is a **warm-up** phase: the master builds its
+fleet view but does not emit health alerts. Once all known nodes have
+been probed (or the probe timeout expires), the master transitions to
+**ready** and begins normal health alerting. This avoids noisy
+"unhealthy" warnings for agents that simply haven't started yet.
+
+1. **Probe known nodes**: For each node in the `nodes` table, the
+   master calls `HealthCheck` (5s timeout). Nodes that respond are
+   marked healthy; nodes that don't respond are marked unhealthy.
+   Agent self-registration still runs in the background and updates
+   addresses or adds new nodes, but reconciliation does not depend
+   on it.
+2. **Check placements**: For each placement in the database, query
+   the hosting agent's `Status` RPC (bulk — one call per agent, not
+   per service). If the agent reports a service is not running, mark
+   the placement as stale (log warning, do not auto-redeploy).
+3. **Detect orphans**: For each service running on an agent that has
+   no matching placement record, log it as an orphan. Orphans may
+   result from failed deploys, manual `podman run`, or v1 leftovers.
+4. **Check edge routes**: For each edge route in the database, query
+   the edge agent for route status.
+5. **Check snapshot freshness**: Flag any service whose latest
+   snapshot is older than 2x the snapshot interval (e.g., older than
+   48 hours with a 24-hour cycle). Stale snapshots are a disaster
+   recovery risk.
+6. **Report**: All discrepancies (stale placements, orphans, missing
+   edge routes, unhealthy nodes, stale snapshots) are reported via
+   `mcp status` and structured logs.
+
+Reconciliation is read-only — it detects drift but does not
+auto-remediate. The operator reviews `mcp status` output and takes
+action. Auto-reconciliation is future work.
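+
+The drift checks in steps 2, 3, and 5 reduce to set comparisons. A
+minimal sketch, assuming placements are available as a
+`{service: node}` mapping and each agent's bulk `Status` reply has been
+reduced to a set of running service names (both shapes are assumptions
+made for illustration, as is treating a missing snapshot as stale):
+
+```python
+from datetime import datetime, timedelta, timezone
+
+def reconcile(placements, running_by_node):
+    """Read-only drift detection.
+
+    placements:      {service_name: node}
+    running_by_node: {node: set of running service names}
+    Returns (stale_placements, orphans); nothing is modified.
+    """
+    stale = sorted(
+        svc for svc, node in placements.items()
+        if svc not in running_by_node.get(node, set())
+    )
+    orphans = sorted(
+        (node, svc)
+        for node, services in running_by_node.items()
+        for svc in services
+        if placements.get(svc) != node
+    )
+    return stale, orphans
+
+def snapshot_is_stale(latest, now, interval=timedelta(hours=24)):
+    """Flag a snapshot older than 2x the snapshot interval."""
+    return latest is None or now - latest > 2 * interval
+```
+
+Both functions only report; acting on the result stays with the
+operator, in line with the read-only rule above.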
+
+---
+
+## Edge Routing
+
+The core v2 feature: when a service declares `public = true` on a route,
+the master automatically provisions the edge route.
+
+### Deploy Flow with Edge Routing
+
+When the master receives `Deploy(mcq)`:
+
+1. **Place service**: Master selects the target node based on tier, node,
+   and container count. For mcq (tier=worker), master picks the least-loaded
+   healthy worker.
+
+2. **Deploy to worker**: Master sends `Deploy` RPC to the worker's agent
+   (timeout: 5m). The agent deploys the container, provisions a TLS cert
+   for `mcq.svc.mcp.metacircular.net` from Metacrypt, and registers the
+   internal mc-proxy route.
+
+3. **Register DNS**: Master registers an A record for the internal
+   hostname (`mcq.svc.mcp.metacircular.net`) pointing to the worker's
+   Tailnet IP via MCNS. This is the backend address that edge and
+   internal clients resolve.
+
+4. **Detect public routes**: Master inspects the service spec for routes
+   with `public = true`.
+
+5. **Validate hostname**: Master checks that `mcq.metacircular.net` falls
+   under an allowed domain using proper domain label matching.
+
+6. **Check public DNS**: Master resolves `mcq.metacircular.net` to
+   verify it points to the edge node's public IP. Public DNS records
+   are pre-provisioned manually at Hurricane Electric. If the hostname
+   does not resolve, the master warns but continues — the operator
+   may be setting up DNS in parallel.
+
+7. **Validate backend hostname**: Master verifies the internal hostname
+   (`mcq.svc.mcp.metacircular.net`) ends with `.svc.mcp.metacircular.net`.
+   The internal hostname is derived from the service and component name
+   using the convention `<component>.svc.mcp.metacircular.net`.
+
+8. **Assign edge node**: Master selects an edge node (currently svc).
+
+9. **Set up edge route**: Master sends `SetupEdgeRoute` RPC to svc's
+   agent (timeout: 30s):
+   ```
+   SetupEdgeRoute(
+     hostname:         "mcq.metacircular.net"
+     backend_hostname: "mcq.svc.mcp.metacircular.net"
+     backend_port:     8443
+     backend_tls:      true
+   )
+   ```
+
+10. **Svc agent provisions**: On receiving `SetupEdgeRoute`, svc's agent:
+    a. Validates that `backend_hostname` ends with `.svc.mcp.metacircular.net`.
+    b. Resolves `backend_hostname` — verifies result is a Tailnet IP
+       (100.64.0.0/10).
+    c. Provisions a TLS certificate from Metacrypt for the **public**
+       hostname `mcq.metacircular.net` only. Internal names never appear
+       on edge certs.
+    d. Registers an L7 route in its local mc-proxy:
+       `mcq.metacircular.net:443 → <worker-tailnet-ip>:8443`
+       with `backend_tls = true`.
+
+11. **Master records the edge route** in its SQLite database.
+
+12. **Master returns structured result** to CLI with per-step status.
+
+**Failure handling:** If any step fails, the master returns the error
+to the CLI with the step that failed. If the deploy succeeded but
+edge routing failed, the service is running internally but not publicly
+reachable. The operator can retry with `mcp deploy` (idempotent) or
+fix the issue and run `mcp sync`.
+
+If cert provisioning fails during deploy (step 2 or 10), the deploy
+**fails** — the agent does not register an mc-proxy route pointing to
+a nonexistent cert. This prevents the silent TLS failure from v1.
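+
+The hostname checks in steps 5, 7, and 10 can be sketched as below.
+"Proper domain label matching" is read here as comparing whole DNS
+labels rather than raw string suffixes, so `evilmetacircular.net` does
+not pass as a subdomain of `metacircular.net`. The function names are
+hypothetical, and requiring a strict subdomain (rather than also
+accepting the bare domain) is an assumption.
+
+```python
+import ipaddress
+
+TAILNET = ipaddress.ip_network("100.64.0.0/10")
+BACKEND_SUFFIX = ".svc.mcp.metacircular.net"
+
+def under_allowed_domain(hostname, allowed_domains):
+    """True if hostname is a strict subdomain of an allowed domain."""
+    labels = hostname.lower().rstrip(".").split(".")
+    for domain in allowed_domains:
+        suffix = domain.lower().split(".")
+        if len(labels) > len(suffix) and labels[-len(suffix):] == suffix:
+            return True
+    return False
+
+def valid_backend(backend_hostname, resolved_ip):
+    """Backend must use the internal suffix and resolve to a Tailnet IP."""
+    return (backend_hostname.lower().endswith(BACKEND_SUFFIX)
+            and ipaddress.ip_address(resolved_ip) in TAILNET)
+```
+
+A plain `endswith("metacircular.net")` would accept
+`evilmetacircular.net`; comparing label lists avoids that class of
+mistake.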
+
+### Undeploy Flow
+
+1. **Undeploy on worker first**: Master sends `Undeploy` RPC to the
+   worker agent (timeout: 2m). The agent tears down the container,
+   routes, DNS, and certs. This stops the backend, ensuring no traffic
+   is served during edge cleanup.
+2. **Remove edge route**: Master sends `RemoveEdgeRoute` to svc's agent.
+   Svc removes the mc-proxy route and cleans up the cert.
+3. **Master removes records** from placements and edge_routes tables.
+
+Ordering rationale: undeploy the backend first so that if edge cleanup
+fails, the service is already stopped and the edge route returns a
+502 rather than serving stale content.
+
+### Certificate Model
+
+Two separate certs per public service — internal names never appear on
+edge certs:
+
+| Cert | Provisioned by | SAN | Used on |
+|------|---------------|-----|---------|
+| Internal | Worker agent → Metacrypt | `mcq.svc.mcp.metacircular.net` | Worker's mc-proxy |
+| Public | Edge agent → Metacrypt | `mcq.metacircular.net` | Edge's mc-proxy |
+
+Edge cert renewal is the edge agent's responsibility. The agent runs
+the same `renewWindow` check as worker agents, renewing certs before
+they expire (90-day TTL, renew at 30 days remaining).
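+
+The renewal rule is simple date arithmetic. A small sketch, where
+`needs_renewal` is a hypothetical stand-in for the `renewWindow` check
+and the certificate's expiry time is assumed to be already parsed:
+
+```python
+from datetime import datetime, timedelta, timezone
+
+CERT_TTL = timedelta(days=90)
+RENEW_WINDOW = timedelta(days=30)
+
+def needs_renewal(not_after, now, window=RENEW_WINDOW):
+    """Renew once the remaining lifetime is within the renewal window."""
+    return not_after - now <= window
+
+issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
+not_after = issued + CERT_TTL  # 2026-04-01
+print(needs_renewal(not_after, datetime(2026, 3, 1, tzinfo=timezone.utc)))
+print(needs_renewal(not_after, datetime(2026, 3, 2, tzinfo=timezone.utc)))
+```
+
+With these dates the first check (31 days remaining) is negative and
+the second (30 days remaining) triggers renewal.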
+
+---
+
+## Snapshots
+
+The master maintains periodic snapshots of every service's data.
+Snapshots are the foundation for both migration and disaster recovery —
+if a node dies, the master can restore a service to a new node from its
+latest snapshot without the source node being alive.
+
+All nodes have LUKS-encrypted disks. Snapshots are stored on the
+master's encrypted disk, so service data is encrypted at rest at both
+source and destination. An existing backup service on rift replicates
+to external storage, covering the case where rift itself is lost.
+
+### Snapshot Mechanism
+
+The service definition declares how the agent should trigger a
+consistent snapshot via the `method` field:
+
+```toml
+[snapshot]
+method  = "grpc"                       # preferred for Metacircular services
+exclude = ["layers/", "uploads/"]      # paths to skip (optional)
+```
+
+**Methods:**
+
+| Method | How it works | Best for |
+|--------|-------------|----------|
+| `grpc` | Agent calls the standard `Snapshot` gRPC RPC on the service's gRPC port. The service vacuums databases and confirms. Agent then tars. | Metacircular services with gRPC servers |
+| `cli` | Agent runs `podman exec <container> <service> snapshot` (the engineering standard's snapshot CLI command). Agent then tars. | Metacircular services without gRPC |
+| `exec: <cmd>` | Agent runs `podman exec <container> <cmd>`. Agent then tars. | Non-standard services with custom backup scripts |
+| `full` | Agent tars the entire `/srv/<service>/` directory, auto-vacuuming any `.db` files found. | Services that need everything backed up |
+| *(omitted)* | Agent collects only `*.toml`, `*.db`, and `*.pem` files from `/srv/<service>/` — config, database, and certs. `.db` files are auto-vacuumed. | Default — covers the essentials without configuration |
+
+The **default** (no `[snapshot]` section) captures the minimum needed
+to restore a service: config, database, and TLS certs. This keeps
+snapshot sizes small and predictable. Services that need more data
+(e.g., file uploads, state directories) opt into `full` or specify
+paths explicitly.
+
+**`exclude`** works with any method. MCR uses `exclude` to skip layer
+blobs (which can be rebuilt from git) while still capturing its
+database and config.
+
+**Database consistency:** For `grpc` and `cli` methods, the service
+owns its own vacuum logic. For `full` and the default, the agent
+detects `.db` files and runs `VACUUM INTO` to a temp copy before
+including them in the tar. WAL and SHM files are excluded (the
+vacuumed copy is self-contained).
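+
+For the default method, file selection and the vacuum step might look
+roughly like this. The helper names are hypothetical, and two details
+are assumptions: that patterns match on the file's base name at any
+depth, and that `exclude` entries are path prefixes. `VACUUM INTO`
+itself is a standard SQLite statement (available from SQLite 3.27) that
+requires the destination file not to exist.
+
+```python
+import fnmatch
+import os
+import sqlite3
+
+DEFAULT_PATTERNS = ("*.toml", "*.db", "*.pem")
+
+def select_default_files(names, exclude=()):
+    """Pick files for the default snapshot.
+
+    names are paths relative to /srv/<service>/. WAL and SHM files
+    (e.g. mcq.db-wal) fall through because they do not match *.db.
+    """
+    picked = []
+    for name in names:
+        if any(name.startswith(prefix) for prefix in exclude):
+            continue
+        base = os.path.basename(name)
+        if any(fnmatch.fnmatch(base, p) for p in DEFAULT_PATTERNS):
+            picked.append(name)
+    return picked
+
+def vacuum_copy(db_path, dest_path):
+    """Write a self-contained copy of db_path using VACUUM INTO."""
+    # Autocommit mode: VACUUM cannot run inside a transaction.
+    conn = sqlite3.connect(db_path, isolation_level=None)
+    try:
+        conn.execute("VACUUM INTO ?", (dest_path,))
+    finally:
+        conn.close()
+```
+
+The agent would then place the vacuumed copy in the archive in place
+of the live `.db` file.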
+
+### Standard Snapshot gRPC Service (mcdsl)
+
+The `grpc` snapshot method uses a standard RPC that Metacircular
+services implement via the `mcdsl/snapshot` package — same pattern as
+`mcdsl/health`:
+
+```protobuf
+service SnapshotService {
+  rpc Snapshot(SnapshotRequest) returns (SnapshotResponse);
+}
+
+message SnapshotRequest {}
+
+message SnapshotResponse {
+  bool   success = 1;
+  string error   = 2;
+  string path    = 3;  // path to the vacuumed backup (e.g. /srv/mcq/backups/...)
+}
+```
+
+Services register the `SnapshotService` on their gRPC server. The
+`mcdsl/snapshot` package provides a default implementation that reads
+the database path from the service's config, runs `VACUUM INTO`, and
+returns the backup path. Services with custom snapshot needs can
+override the handler.
+
+### Service Definition Examples
+
+Metacircular service with gRPC (preferred):
+```toml
+[snapshot]
+method = "grpc"
+```
+
+MCR (skip layer blobs):
+```toml
+[snapshot]
+method = "grpc"
+exclude = ["layers/", "uploads/"]
+```
+
+Non-Metacircular service with custom backup:
+```toml
+[snapshot]
+method = "exec: /usr/local/bin/backup.sh"
+```
+
+Service with no snapshot config (default: captures `*.toml`, `*.db`, `*.pem`):
+```toml
+# No [snapshot] section needed
+```
+
+### Snapshot Storage
+
+Snapshots are stored as flat files on the master node:
+
+```
+/srv/mcp-master/snapshots/
+  mcq/
+    2026-04-01T00:00:00Z.tar.zst
+  mcias/
+    2026-04-01T00:00:00Z.tar.zst
+```
+
+Format: tar.zst (tar archive with zstandard compression). One file per
+snapshot, named by UTC timestamp.
+
+### Snapshot Scheduling
+
+The master runs a scheduled job that snapshots all services every 24
+hours. The master iterates over all placements and for each one:
+
+1. Acquires a per-service lock (skips if deploy/migrate/undeploy is
+   in progress).
+2. Sends `ExportServiceData(service_name)` to the hosting agent
+   (timeout: 10m).
+3. The agent runs the snapshot command (if configured), creates a
+   tar.zst archive of `/srv/<service>/` (respecting excludes), and
+   streams it back.
+4. The master writes the archive to the snapshots directory.
+5. The master prunes old snapshots (keep last N, configurable).
+
+Scheduled snapshots are **live** — the service keeps running. Database
+consistency is ensured by the vacuum step, not by stopping the
+container. Migration snapshots use a different flow (stop first, then
+tar) for perfect consistency.
+
+**Agent fallback rule:** If `ExportServiceData` is called and the
+container is not running (migration case), the agent skips the
+configured snapshot method (`grpc`/`cli`/`exec`) and falls back to a
+direct tar with auto-vacuum of `.db` files. This is correct because
+the container already vacuumed on shutdown (SIGTERM handler).
+
+For v2, the master always requests a full snapshot, with no change
+detection. Intelligence about dirty vs. clean services is a future
+optimization.
+
+### Concurrency
+
+The master holds a per-service lock for all operations that touch a
+service (deploy, undeploy, migrate, snapshot). If a scheduled snapshot
+overlaps with a deploy or migration, the snapshot waits. This prevents
+capturing partial state during multi-step operations.
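+
+One way to picture the per-service lock is a small registry of
+mutexes keyed by service name. This is only a sketch: `ServiceLocks`
+is a hypothetical name, and the `wait` flag is an assumption that lets
+a caller choose between blocking until the lock is free and giving up
+immediately.
+
+```python
+import threading
+from collections import defaultdict
+from contextlib import contextmanager
+
+class ServiceLocks:
+    """One mutex per service name, created on first use."""
+
+    def __init__(self):
+        self._guard = threading.Lock()  # protects the dict itself
+        self._locks = defaultdict(threading.Lock)
+
+    def _lock_for(self, service):
+        with self._guard:
+            return self._locks[service]
+
+    @contextmanager
+    def hold(self, service, wait=True):
+        """Yield True if the lock was acquired, False otherwise."""
+        lock = self._lock_for(service)
+        acquired = lock.acquire(blocking=wait)
+        try:
+            yield acquired
+        finally:
+            if acquired:
+                lock.release()
+```
+
+Operations on different services never contend, since each service
+name maps to its own lock.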
+
+### Snapshot RPCs
+
+```protobuf
+// Service data export -- called by master on any agent.
+// Authorization: mcp-master only.
+rpc ExportServiceData(ExportServiceDataRequest)
+    returns (stream DataChunk);
+
+// Service data import -- called by master on any agent.
+// Authorization: mcp-master only.
+rpc ImportServiceData(stream ImportServiceDataChunk)
+    returns (ImportServiceDataResponse);
+
+message ExportServiceDataRequest {
+  string service_name = 1;
+  // Snapshot config is stored in the agent's registry at deploy time.
+  // The agent uses its persisted config to determine the snapshot method
+  // (grpc, cli, exec, full, default) and exclude patterns.
+}
+
+message DataChunk {
+  bytes data = 1;
+}
+
+message ImportServiceDataChunk {
+  // First message sets the service name; subsequent messages carry data.
+  string service_name = 1;
+  bytes  data         = 2;
+  bool   force        = 3;  // overwrite existing /srv/<service>/ (first message only)
+}
+
+message ImportServiceDataResponse {
+  int64 bytes_written = 1;
+}
+```
+
+Note: `ExportServiceData`/`ImportServiceData` transfer full directory
+archives. The existing `PushFile`/`PullFile` RPCs transfer individual
+files and serve a different purpose (config distribution, cert
+provisioning).
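+
+The client-streaming shape of `ImportServiceData` can be illustrated
+with a generator that emits one header message followed by data
+messages. `ImportChunk` is a plain dataclass standing in for the
+generated `ImportServiceDataChunk` type, the 64 KiB chunk size is
+arbitrary, and sending the header without payload is an assumption
+(the proto comment only says the first message sets the service name).
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class ImportChunk:
+    """Stand-in for the ImportServiceDataChunk proto message."""
+    service_name: str = ""
+    data: bytes = b""
+    force: bool = False
+
+def import_chunks(service_name, archive, force=False, chunk_size=64 * 1024):
+    """Yield the header message, then data-only messages."""
+    yield ImportChunk(service_name=service_name, force=force)
+    for offset in range(0, len(archive), chunk_size):
+        yield ImportChunk(data=archive[offset:offset + chunk_size])
+
+for msg in import_chunks("mcq", b"abcdefghij", force=True, chunk_size=4):
+    print(msg)
+```
+
+Concatenating the `data` fields in order reproduces the original
+archive bytes on the receiving side.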
+
+### Master Snapshot Config
+
+```toml
+[snapshots]
+dir      = "/srv/mcp-master/snapshots"
+interval = "24h"
+retain   = 7    # keep last 7 snapshots per service
+```
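+
+Retention pruning with `retain = 7` can be as small as a sort and a
+slice. The sketch below assumes the snapshot filenames follow the
+`<timestamp>.tar.zst` layout shown earlier; `snapshots_to_prune` is a
+hypothetical helper name.
+
+```python
+def snapshots_to_prune(filenames, retain=7):
+    """Return the filenames to delete, keeping the `retain` newest.
+
+    Names like 2026-04-01T00:00:00Z.tar.zst carry fixed-width UTC
+    timestamps, so sorting them as strings is chronological.
+    """
+    newest_first = sorted(filenames, reverse=True)
+    return newest_first[retain:]
+
+names = [f"2026-04-{day:02d}T00:00:00Z.tar.zst" for day in range(1, 10)]
+print(snapshots_to_prune(names))
+```
+
+With nine daily snapshots and `retain = 7`, the two oldest files are
+selected for deletion.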
+
+---
+
+## Service Migration
+
+Services can be migrated between nodes with `mcp migrate`. This is
+essential for moving workloads off rift (which starts as both master
+and worker) onto dedicated workers like orion as they come online.
+
+Migration uses snapshots for data transfer. This means migration works
+even if the source node is down (disaster recovery).
+
+### Constraints
+
+- **Core services cannot be migrated.** `tier = "core"` services are
+  bound to the master node. Moving core services means designating a
+  new master — a manual, deliberate operation outside the scope of
+  `mcp migrate`.
+- **Edge nodes are not migration targets.** Edge nodes run mc-proxy
+  only, not application containers.
+
+### Migration Flow
+
+```
+mcp migrate mcq --to orion
+```
+
+When the master receives `Migrate(mcq, orion)`:
+
+1. **Validate**: Master verifies `orion` is a healthy worker. Rejects
+   migration of `tier = "core"` services and migration to edge nodes.
+
+2. **Stop on source** (if source is alive): Master sends `Stop` RPC
+   to the source agent. The agent gracefully stops the container
+   (SIGTERM). The service runs its shutdown handler, which vacuums
+   databases per the engineering standard. If the source is down,
+   skip this step.
+
+3. **Snapshot** (if source is alive): Agent tars `/srv/<service>/`
+   (now consistent — the service vacuumed on shutdown) and streams
+   it to the master. If the source is down, the master uses the most
+   recent stored snapshot.
+
+4. **Push snapshot to destination**: Master streams the snapshot to
+   the destination agent via `ImportServiceData`. The agent creates
+   `/srv/<service>/` (with correct permissions) and extracts the
+   archive.
+
+5. **Deploy on destination**: Master sends `Deploy` RPC to the
+   destination agent (orion). The agent deploys the container using
+   the restored data. Provisions internal TLS cert and registers
+   mc-proxy route on the new node.
+
+6. **Update DNS**: Master updates the internal A record
+   (`mcq.svc.mcp.metacircular.net`) to point to orion's Tailnet IP.
+
+7. **Update edge route** (if public): Master sends `SetupEdgeRoute`
+   to svc's agent with the updated backend. The edge agent updates
+   the mc-proxy route. No new cert needed — the public hostname
+   hasn't changed.
+
+8. **Clean up source** (if source is alive): Master sends `Undeploy`
+   to the source agent to remove the stopped container, old routes,
+   old certs, and old DNS records.
+
+9. **Update placement**: Master updates the `placements` table to
+   reflect the new node. This step runs regardless of whether source
+   cleanup succeeded.
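+
+Which of the steps above run depends on two facts: whether the source
+node is alive and whether the service has a public route. A sketch of
+that branching, with hypothetical step labels:
+
+```python
+def migration_steps(source_alive, public):
+    """Ordered step names for one migration, following the flow above."""
+    steps = ["validate"]
+    if source_alive:
+        steps += ["stop_on_source", "snapshot_from_source"]
+    else:
+        steps += ["use_latest_stored_snapshot"]
+    steps += ["import_on_destination", "deploy_on_destination", "update_dns"]
+    if public:
+        steps.append("update_edge_route")
+    if source_alive:
+        steps.append("cleanup_source")
+    # Recorded regardless of whether source cleanup succeeded.
+    steps.append("update_placement")
+    return steps
+
+print(migration_steps(source_alive=False, public=True))
+```
+
+The disaster-recovery case is the same flow with the stop and cleanup
+steps dropped and the stored snapshot substituted for a fresh one.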
+
+### Disaster Recovery
+
+If a node dies, the operator migrates its services to another node:
+
+```
+mcp migrate mcq --to orion          # source is down, uses latest snapshot
+mcp migrate --all --from rift --to orion  # evacuate all services
+```
+
+The master detects the source is unreachable (unhealthy in node
+registry), skips the stop and cleanup steps, and restores from
+the stored snapshot. Data loss is bounded by the snapshot interval
+(24 hours), provided the most recent scheduled snapshot succeeded.
+
+### Batch Migration
+
+Full node evacuation for decommissioning or disaster recovery:
+
+```
+mcp migrate --all --from rift --to orion
+```
+
+The master migrates each service sequentially. Core services are
+skipped (they cannot be migrated). The operator sees per-service
+progress. If any migration fails, the master stops and reports which
+service failed — the operator can fix the issue and resume with
+`--all` (already-migrated services are skipped since they no longer
+have placements on the source node).
+
+### Migration Safety
+
+- The source data is not deleted until step 8 (cleanup). If migration
+  fails mid-transfer, the source still has the complete data and the
+  operator can retry or roll back.
+- The master rejects migration if the destination already has a
+  `/srv/<service>/` directory (prevents accidental overwrite).
+  Use `--force` to override.
+- Downtime window: from stop (step 2) to the new container starting
+  (step 5). For a personal platform this is acceptable.
+- Migration snapshots use stop-then-tar for perfect consistency.
+  Scheduled daily snapshots use live vacuum (no downtime).
+
+### Migration Proto
+
+```protobuf
+rpc Migrate(MigrateRequest) returns (MigrateResponse);
+
+message MigrateRequest {
+  string service_name = 1;
+  string target_node  = 2;
+  bool   force        = 3;  // overwrite existing /srv/<service>/ on target
+  bool   all          = 4;  // migrate all services from source
+  string source_node  = 5;  // required when all=true
+  // Validation: reject if all=true AND service_name is set (ambiguous).
+}
+
+message MigrateResponse {
+  repeated StepResult results = 1;
+}
+
+// Note: CreateSnapshot/ListSnapshots are master RPCs, invoked by the
+// `mcp snapshot` CLI commands. The mcdsl SnapshotService.Snapshot RPC is
+// a separate, service-level RPC called by agents on individual services.
+rpc CreateSnapshot(CreateSnapshotRequest) returns (CreateSnapshotResponse);
+rpc ListSnapshots(ListSnapshotsRequest) returns (ListSnapshotsResponse);
+
+message CreateSnapshotRequest {
+  string service_name = 1;
+}
+
+message CreateSnapshotResponse {
+  string filename   = 1;
+  int64  size_bytes = 2;
+}
+
+message ListSnapshotsRequest {
+  string service_name = 1;
+}
+
+message ListSnapshotsResponse {
+  repeated SnapshotInfo snapshots = 1;
+}
+
+message SnapshotInfo {
+  string service_name = 1;
+  string node         = 2;  // node the snapshot was taken from
+  string filename     = 3;
+  int64  size_bytes   = 4;
+  string created_at   = 5;  // RFC3339
+}
+```
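+
+The validation comment on `MigrateRequest` could translate into a
+check like the one below. Only the first two rules come from the proto
+comments; requiring `target_node` always and `service_name` when `all`
+is unset are assumptions inferred from the CLI forms, and the function
+name is hypothetical.
+
+```python
+def validate_migrate_request(service_name="", target_node="",
+                             all_services=False, source_node=""):
+    """Return a list of validation errors (empty means acceptable)."""
+    errors = []
+    if all_services and service_name:
+        errors.append("all and service_name are mutually exclusive")
+    if all_services and not source_node:
+        errors.append("source_node is required when all is set")
+    if not all_services and not service_name:
+        errors.append("service_name is required unless all is set")
+    if not target_node:
+        errors.append("target_node is required")
+    return errors
+
+print(validate_migrate_request(service_name="mcq", target_node="orion"))
+print(validate_migrate_request(service_name="mcq", target_node="orion",
+                               all_services=True, source_node="rift"))
+```
+
+Returning every error at once, rather than stopping at the first, lets
+the CLI show the operator everything that is wrong with a request.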
+
+### CLI
+
+```
+mcp migrate <service> --to <node>              # migrate single service
+mcp migrate <service> --to <node> --force      # overwrite existing data
+mcp migrate --all --from <node> --to <node>    # evacuate all services
+mcp snapshot <service>                         # take an on-demand snapshot
+mcp snapshot list <service>                    # list available snapshots
+```
+
+---
+
+## Agent Changes for v2
+
+### New RPCs
+
+See Proto Definitions section above for full message definitions.
+
+- `HealthCheck` — called by master on missed heartbeats.
+- `SetupEdgeRoute` — called by master on edge nodes.
+- `RemoveEdgeRoute` — called by master on edge nodes.
+- `ListEdgeRoutes` — called by master on edge nodes.
+
+All new RPCs require the caller to be `mcp-master` (authorization check).
+
+### Cert Provisioning on All Agents
+
+All agents need Metacrypt configuration:
+
+```toml
+[metacrypt]
+server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
+ca_cert    = "/srv/mcp/certs/metacircular-ca.pem"
+mount      = "pki"
+issuer     = "infra"
+token_path = "/srv/mcp/metacrypt-token"
+```
+
+Worker agents provision certs for internal hostnames. Edge agents
+provision certs for public hostnames. Both use the same Metacrypt API
+but with different identity-scoped policies.
+
+### mc-proxy Management
+
+The agent is the sole manager of mc-proxy routes via the gRPC admin API.
+TOML config is not used for route management — this avoids the
+database/config divergence problem from v1. mc-proxy's TOML config
+only sets listener addresses and TLS defaults.
+
+On mc-proxy restart, routes survive in mc-proxy's own SQLite database.
+If mc-proxy's database is lost, the agent detects missing routes during
+its monitoring cycle and re-registers them.
+
+### Deploy Failure on Cert Error
+
+If cert provisioning fails during deploy, the agent **must** fail the
+deploy — do not register an mc-proxy route pointing to a nonexistent
+cert. Return an error to the master, which reports it to the CLI. The
+current v1 behavior (log warning, continue) is a bug.
+
+---
+
+## CLI Changes for v2
+
+The CLI gains a `[master]` section and retains `[[nodes]]` for direct
+access:
+
+```toml
+[master]
+address = "100.x.x.x:9555"
+
+# Retained for --direct mode (bypass master when it's down).
+[[nodes]]
+name    = "rift"
+address = "100.95.252.120:9444"
+
+[[nodes]]
+name    = "svc"
+address = "100.x.x.x:9444"
+
+[mcias]
+server_url   = "https://mcias.metacircular.net:8443"
+service_name = "mcp"
+
+[auth]
+token_path = "/home/kyle/.config/mcp/token"
+
+[services]
+dir = "/home/kyle/.config/mcp/services"
+```
+
+By default, all commands go through the master. The `--direct` flag
+bypasses the master and dials agents directly (v1 behavior):
+
+```
+mcp deploy mcq              # → master
+mcp deploy mcq --direct -n rift  # → agent on rift (v1 mode)
+mcp ps                      # → master aggregates all agents
+mcp ps --direct             # → each agent individually (v1 mode)
+```
+
+`--direct` is the escape hatch when the master is down. In direct mode,
+deploy requires an explicit `--node` flag (the CLI cannot auto-place
+without the master).
+
+### Sync Semantics
+
+`mcp sync` is **declarative**: the service definitions on the operator's
+workstation are the source of truth. The master converges the fleet:
+
+- New definitions → deploy.
+- Changed definitions → redeploy.
+- Services present in the master's placement table but absent from
+  the sync request → undeploy.
+
+This makes the services directory a complete, auditable declaration of
+what should be running. Use `mcp sync --dry-run` to preview what sync
+would do without executing.
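+
+The convergence rules amount to a three-way diff, which is also what
+`--dry-run` would need to print. A minimal sketch, assuming each
+definition can be reduced to a comparable fingerprint such as a content
+hash (the fingerprint and the `sync_plan` name are assumptions):
+
+```python
+def sync_plan(desired, deployed):
+    """Compute sync actions.
+
+    desired:  {service_name: fingerprint} from the workstation
+    deployed: {service_name: fingerprint} from the placement table
+    """
+    return {
+        "deploy": sorted(s for s in desired if s not in deployed),
+        "redeploy": sorted(
+            s for s in desired
+            if s in deployed and desired[s] != deployed[s]
+        ),
+        "undeploy": sorted(s for s in deployed if s not in desired),
+    }
+
+print(sync_plan(
+    desired={"mcq": "v2", "mcr": "v1", "new": "v1"},
+    deployed={"mcq": "v1", "mcr": "v1", "old": "v1"},
+))
+```
+
+Unchanged services (here `mcr`) appear in none of the three lists, so
+a sync with no edits is a no-op.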
+
+### Direct Mode Caveat
+
+Services deployed via `--direct` (bypassing the master) are invisible
+to the master — no placement record exists. Reconciliation detects
+them as orphans. To bring a directly-deployed service under master
+management, redeploy it through the master.
+
+### New Commands
+
+```
+mcp edge list               # list all public edge routes
+mcp edge status             # health of edge routes (cert expiry, backend reachable)
+mcp node list               # fleet status from master
+```
+
+Service definition files remain on the operator's workstation. The CLI
+pushes them to the master on `mcp deploy` and `mcp sync`.
+
+---
+
+## Agent Upgrades
+
+The fleet is heterogeneous (NixOS + Debian, amd64 + arm64). NixOS flake
+inputs don't work as a universal update mechanism.
+
+MCP owns the binary at `/srv/mcp/mcp-agent` on all nodes.
+
+```
+mcp agent upgrade [node]    # cross-compile, SCP, restart via SSH
+```
+
+- CLI cross-compiles for the target's GOARCH.
+- Copies via SCP to `/srv/mcp/mcp-agent.new`.
+- Restarts via SSH. The restart command is OS-aware: `doas` on NixOS
+  (rift, orion), `sudo` on Debian (svc). Configurable per node.
+- Running containers survive the restart — rootless podman containers
+  are independent of the agent process. `--restart unless-stopped` means
+  podman handles liveness.
+- The upgrade window (agent down for ~2s) only affects management
+  operations. The master marks the agent as temporarily unhealthy until
+  the next heartbeat.
+
+All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
+`mcp-agent.service`.
+
+---
+
+## Migration Plan
+
+### Phase 1: Agent on svc
+
+Deploy mcp-agent to svc (Debian):
+
+- Create `mcp` user, install binary via SCP, configure systemd.
+- Configure with Metacrypt access and mc-proxy gRPC socket access.
+- Migrate existing mc-proxy TOML routes to agent-managed routes:
+  export current routes from mc-proxy SQLite, import via agent
+  `AddProxyRoute` RPCs.
+- Verify with `mcp node list` (svc shows up).
+
+### Phase 2: Edge routing RPCs
+
+Implement `SetupEdgeRoute`, `RemoveEdgeRoute`, `ListEdgeRoutes` on the
+agent. Test by calling directly from the CLI (temporary `mcp edge setup`
+scaffolding command, removed after phase 3).
+
+### Phase 3: Build mcp-master
+
+Core coordination loop. Uses bootstrap `[[nodes]]` config for agent
+addresses (dynamic registration comes in phase 4):
+
+1. gRPC server with `McpMasterService`.
+2. SQLite database for placements and edge routes.
+3. Accept `Deploy` / `Undeploy` from CLI.
+4. Place service on a node (tier / container-count).
+5. Forward deploy to the correct agent.
+6. Register DNS via MCNS.
+7. Detect `public = true` routes, validate, call `SetupEdgeRoute`.
+8. Return structured per-step results to CLI.
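+
+Step 4's placement rule (least-loaded healthy node of the right role)
+could start out as simply as the sketch below. The dict keys mirror the
+`nodes` table columns; `pick_node`, the `"healthy"` status string, and
+breaking ties by name are assumptions.
+
+```python
+def pick_node(nodes, role="worker"):
+    """Return the name of the least-loaded healthy node, or None."""
+    candidates = [
+        n for n in nodes
+        if n["role"] == role and n["status"] == "healthy"
+    ]
+    if not candidates:
+        return None
+    best = min(candidates, key=lambda n: (n["containers"], n["name"]))
+    return best["name"]
+
+# Hypothetical fleet snapshot.
+fleet = [
+    {"name": "rift", "role": "master", "status": "healthy", "containers": 9},
+    {"name": "orion", "role": "worker", "status": "healthy", "containers": 3},
+    {"name": "vega", "role": "worker", "status": "unhealthy", "containers": 0},
+    {"name": "svc", "role": "edge", "status": "healthy", "containers": 1},
+]
+print(pick_node(fleet))
+```
+
+Returning `None` when no candidate qualifies leaves it to the caller
+to fail the deploy with a clear placement error.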
+
+### Phase 4: Agent registration and health
+
+- Agents self-register on startup (identity-bound).
+- Heartbeat loop (30s interval, resource data).
+- Master probe on missed heartbeats (90s threshold, 5s timeout).
+- Fleet status aggregation for `mcp ps` and `mcp node list`.
+- Reconciliation on master startup.
+- Master transitions from bootstrap `[[nodes]]` to dynamic registry.
+
+### Phase 5: Snapshots and migration
+
+- Implement `ExportServiceData` / `ImportServiceData` on agents.
+- Implement `mcdsl/snapshot` standard gRPC service.
+- Add snapshot scheduling to master (24h cycle, retention pruning).
+- Implement `CreateSnapshot`, `ListSnapshots`, `Migrate` on master.
+- Add `mcp snapshot`, `mcp snapshot list`, `mcp migrate` CLI commands.
+- Test migration between rift and orion.
+
+### Phase 6: Cut over
+
+- Update CLI config to add `[master]` section.
+- Update service definitions with `tier` and `public` fields.
+- Deploy agent to orion.
+- Verify all services via `mcp ps` and public endpoint tests.
+- Keep `[[nodes]]` config and `--direct` flag as escape hatch.
+
+---
+
+## Hostname Convention for Public Services
+
+Services with public routes have two hostnames:
+
+| Hostname | Purpose | Example |
+|----------|---------|---------|
+| `<svc>.metacircular.net` | Public — browser access, SSO login | `mcq.metacircular.net` |
+| `<svc>.svc.mcp.metacircular.net` | Internal — API clients, service-to-service | `mcq.svc.mcp.metacircular.net` |
+
+**SSO always uses the public hostname.** The service's `[sso].redirect_uri`
+and the MCIAS SSO client registration both point to the public hostname
+(e.g., `https://mcq.metacircular.net/sso/callback`). SSO state cookies
+are bound to the domain they are set on, so the entire browser-based
+login flow must stay on a single hostname.
+
+**API clients use the internal hostname.** Service-to-service calls,
+CLI tools, and MCP server communication authenticate with bearer tokens
+(not SSO) and use the internal `.svc.mcp.` hostname. These do not
+involve browser cookies and are unaffected by the SSO hostname
+constraint.
+
+This means:
+- Human users bookmark `mcq.metacircular.net`, not the `.svc.mcp.` URL.
+- The web UI's SSO "Sign in" button always initiates the flow on the
+  public hostname.
+- API endpoints on both hostnames accept the same bearer tokens —
+  the hostname distinction is a routing and cookie concern, not an
+  auth concern.
+
+---
+
+## Superseded Documents
+
+`docs/edge-routing-design.md` is superseded by this document. It used
+agent-to-agent communication, a single shared cert, private key
+transmission over gRPC, and an `edge` field instead of `public`. None
+of these design choices carried forward to v2.
+
+---
+
+## Open Questions
+
+1. **Master HA**: mcp-master is a single point of failure. For v2, this
+   is acceptable — the operator can use `--direct` to bypass the master.
+   Future work could add master replication.
+
+2. **Auto-reconciliation**: The master detects drift but does not
+   auto-remediate. Future work could add automatic redeploy on drift.
+
+## v2 Scope
+
+v2 targets amd64 nodes only: rift (master+worker), orion (worker),
+svc (edge). All images are single-arch amd64.
+
+## Fast-Follow: arm64 Support
+
+Immediate follow-up after v2 to onboard Raspberry Pi workers
+(hyperborea and others):
+
+1. **MCR manifest list support**: Accept and serve OCI image indexes
+   (`application/vnd.oci.image.index.v1+json`) so a single tag
+   references both amd64 and arm64 variants.
+2. **`mcp build` multi-arch**: Build `linux/amd64` + `linux/arm64`
+   images and push manifest lists to MCR.
+3. **Onboard RPi workers**: Deploy agents, add to registration
+   allowlist. Placement remains arch-agnostic — podman pulls the
+   correct variant automatically.
+
+## What v2 Does NOT Include
+
+These remain future work beyond the arm64 fast-follow:
+
+- Auto-reconciliation (master-driven redeploy on drift)
+- Zero-downtime live migration (v2 migration stops the service)
+- Web UI for fleet management
+- Observability / log aggregation
+- Object store
+- Multiple edge nodes with load-based assignment
+- Master replication / HA
+- Resource-aware bin-packing (requires resource declarations in service defs)
diff --git a/docs/phase-e-plan.md b/docs/phase-e-plan.md
index 035b961..53083d7 100644
--- a/docs/phase-e-plan.md
+++ b/docs/phase-e-plan.md
@@ -15,25 +15,59 @@ same as today on rift, but across the fleet.
 
 | Node | OS | Arch | Role |
 |------|----|------|------|
-| desktop (TBD) | NixOS | amd64 | Control plane -- runs master + MCIAS + MCNS |
-| rift | NixOS | amd64 | Compute -- application services |
-| orion | NixOS | amd64 | Compute |
-| hyperborea | Debian | arm64 | Compute (Raspberry Pi) |
+| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
+| orion | NixOS | amd64 | Worker |
+| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
 | svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
 
 Tailnet is the interconnect between all nodes. Public traffic enters via
-mc-proxy on svc, which forwards over Tailnet to compute nodes.
+mc-proxy on svc, which forwards over Tailnet to worker nodes.
+
+## Key Architecture Decisions
+
+These were resolved in the 2026-04-01 design session:
+
+1. **Rift is the master node.** No separate straylight machine. Core infra
+   stays on rift, which gains mcp-master alongside its existing agent.
+
+2. **Master-mediated coordination.** Agents never talk to each other. All
+   cross-node operations go through the master. Agents only dial the master
+   (for registration and heartbeats) and respond to master RPCs.
+
+3. **Agent self-registration.** Agents register with the master on startup
+   (name, role, address, arch). The master maintains the live node registry.
+   No static `[[nodes]]` config required except for bootstrap.
+
+4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
+   (with resource data). If the master misses 3 heartbeats (90s), it
+   actively probes the agent. Failed probe marks the node unhealthy.
+
+5. **Tier-based placement.** `tier = "core"` runs on the master node.
+   `tier = "worker"` (default) is auto-placed on a worker with capacity.
+   Explicit `node = "orion"` overrides tier for pinned services.
+
+6. **Two separate certs for public services.** The internal cert
+   (`svc.mcp.metacircular.net`) is provisioned by the worker agent; the
+   public cert (`metacircular.net`) is provisioned by the edge agent.
+   Internal names never appear on edge certs.
+
+7. **`public = true` on routes.** Public routes declare intent with a
+   boolean flag. The master assigns the route to an edge node (currently
+   always svc). No explicit `edge` field in service definitions.
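+
+As a rough illustration of decisions 5 and 7, a service definition might
+look like the sketch below. Only `tier`, `node`, and `public` are named in
+this plan; the `name` key, the `[[routes]]` table, and its `hostname` key
+are assumptions made for the example, not the confirmed schema.
+
+```toml
+# Hypothetical service definition. Only `tier`, `node`, and `public`
+# are named in this plan; every other key is an illustrative assumption.
+name = "mcq"
+
+# "core" runs on the master node (rift); "worker" is the default and
+# lets the master auto-place the service on a worker with capacity.
+tier = "worker"
+
+# Uncomment to pin the service; an explicit node overrides tier.
+# node = "orion"
+
+[[routes]]
+# Assumed key name for the route's hostname.
+hostname = "mcq.metacircular.net"
+# Declares intent only: the master picks the edge node (currently svc).
+public = true
+```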
 
 ## Components
 
 ### Master (`mcp-master`)
 
-Long-lived orchestrator on the control plane node. Responsibilities:
+Long-lived orchestrator on rift. Responsibilities:
 
 - Accept CLI commands and dispatch to the correct agent
+- Maintain node registry from agent self-registration
+- Place services based on tier, explicit node, and resource availability
+- Detect `public = true` routes and coordinate edge setup
+- Validate public hostnames against allowed domain list
 - Aggregate status from all agents (fleet-wide view)
-- Node selection when `node` is omitted from a service definition
-- Health-aware scheduling using agent heartbeat data
+- Probe agents on missed heartbeats
 
 The master is stateless in the durable sense -- it rebuilds its world view
 from agents on startup. If the master goes down, running services continue
@@ -54,50 +88,40 @@ inputs don't work as a universal update mechanism.
 - All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
   `mcp-agent.service`
 
-Upgrades must be coordinated -- new RPCs cause `Unimplemented` errors on
-old agents.
-
 ### Edge agents
 
 svc runs an agent but does NOT run containers. Its agent manages mc-proxy
-routing only: when the master provisions a service on a compute node, svc's
-agent updates mc-proxy routes to point at the compute node's Tailnet
-address.
+routing only: when the master tells it to set up an edge route, it
+provisions a TLS cert from Metacrypt and registers the route in its local
+mc-proxy via the gRPC admin API.
 
-### MCIAS migration
+## Migration Plan
 
-MCIAS moves from the svc VPS to the control plane node, running as an
-MCP-managed container with an independent lifecycle. Bootstrap order:
+### Phase 1: Agent on svc
+Deploy mcp-agent to svc. Verify with `mcp node list`.
 
-1. MCIAS image pre-staged or pulled unauthenticated
-2. MCIAS starts (L4 passthrough through mc-proxy -- manages its own TLS)
-3. All other services bootstrap after MCIAS is up
+### Phase 2: Edge routing RPCs
+Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
+Test by calling directly from CLI.
 
-## Scheduling
+### Phase 3: Build mcp-master
+Core loop: deploy routing, placement, DNS registration, and edge
+coordination, using bootstrap `[[nodes]]` config for agent addresses.
 
-Three placement modes, in order of specificity:
+### Phase 4: Agent registration and health
+Self-registration, heartbeat loop, master probe fallback, fleet status.
 
-1. `node = "rift"` -- explicit placement on a named node
-2. `node = "pi-pool"` -- master picks within a named cluster
-3. `node` omitted -- master picks any compute node with capacity
-
-Resource-aware placement via agent heartbeats (CPU, memory, disk). RPis
-with 4-8 GB RAM need resource tracking more than beefy servers.
-
-## Open Questions
-
-- **Control plane machine**: which desktop becomes the always-on node?
-- **Heartbeat model**: agent push vs. master poll?
-- **Cluster definition**: explicit pool config in master vs. node labels/tags?
-- **MCIAS migration timeline**: when to cut over from svc to control plane?
-- **Agent on svc**: what subset of agent RPCs does an edge-only agent need?
+### Phase 5: Cut over
+Point CLI at master, add tier fields to service defs, deploy agents to
+orion and hyperborea.
 
 ## What Phase E Does NOT Include
 
 These remain future work:
 
 - Auto-reconciliation (agent auto-restarting drifted containers)
-- Migration (snapshot streaming between nodes)
+- Live migration (snapshot streaming between nodes)
 - Web UI for fleet management
 - Observability / log aggregation
 - Object store
+- Multiple edge nodes / master HA