Add ARCHITECTURE_V2.md for multi-node master/agent topology

Documents the planned v2 architecture: mcp-master on straylight
coordinates deployments across worker (rift) and edge (svc) nodes.
Includes edge routing flow, agent RPCs, migration plan, and
operational issues from v1 that motivate the redesign.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:37:24 -07:00
parent d8f45ca520
commit 4ac8a6d60b

ARCHITECTURE_V2.md Normal file

@@ -0,0 +1,489 @@
# MCP v2 -- Multi-Node Control Plane
## Overview
MCP v2 introduces multi-node orchestration with a master/agent topology.
The CLI no longer dials agents directly. A dedicated **mcp-master** daemon
coordinates deployments across nodes, handles cross-node concerns (edge
routing, certificate provisioning, DNS), and serves as the single control
point for the platform.
### Motivation
v1 deployed successfully on a single node (rift) but exposed operational
pain points as services needed public-facing routes through svc:
- **Manual edge routing**: Exposing mcq.metacircular.net required hand-editing
mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating
the SQLite database when the config and database diverged, and debugging
silent failures. Every redeployment risked breaking the public route.
- **Dynamic port instability**: The route system assigns ephemeral host ports
that change on every deploy. svc's mc-proxy pointed at a specific port
(e.g., `100.95.252.120:48080`), which went stale after redeployment.
Container ports are also localhost-only under rootless podman, requiring
explicit Tailscale IP bindings for external access.
- **$PORT env override conflict**: The mcdsl config loader overrides
`listen_addr` from `$PORT` when routes are present. This meant containers
ignored their configured port and listened on the route-allocated one
instead, breaking explicit port mappings that expected the config port.
- **Cert chain issues**: mc-proxy requires full certificate chains (leaf +
intermediates). Certs provisioned outside the standard metacrypt flow
were leaf-only and caused silent TLS handshake failures (`client_bytes=7
backend_bytes=0` with no error logged).
- **mc-proxy database divergence**: mc-proxy persists routes in SQLite.
Routes added via the admin API override the TOML config. Editing the TOML
alone had no effect until the database was manually updated -- a failure
mode that took hours to diagnose.
- **No cross-node coordination**: The v1 CLI talks directly to individual
agents. There is no mechanism for one agent to tell another "set up a
route for this service." Every cross-node operation was manual.
v2 addresses all of these by making the master the single coordination
point for deployments, with agents handling local concerns (containers,
mc-proxy routes, cert provisioning) on instruction from the master.
### What Changes from v1
| Concern | v1 | v2 |
|---------|----|----|
| CLI target | CLI dials agents directly | CLI dials the master |
| Node awareness | CLI routes by `node` field in service defs | Master owns the node registry |
| Service definitions | Live on operator workstation | Pushed to master, which distributes to agents |
| Edge routing | Manual mc-proxy config on svc | Master coordinates agent-to-agent setup |
| Cert provisioning | Agent provisions for local mc-proxy only | Any agent can provision certs (edge included) |
| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones |
### What Stays the Same
The agent's core responsibilities are unchanged: it manages containers via
podman, stores its local registry in SQLite, monitors for drift, and alerts
the operator. The agent gains new RPCs for edge routing but does not become
aware of other nodes -- the master handles all cross-node coordination.
---
## Topology
```
Operator workstation (vade)
┌──────────────────────────┐
│ mcp (CLI) │
│ │
│ gRPC ───────────────────┼─── overlay ───┐
└──────────────────────────┘ │
Master node (straylight)
┌──────────────────────────────────────────────────────┐
│ mcp-master │
│ ├── node registry (all nodes, roles, addresses) │
│ ├── service definitions (pushed from CLI) │
│ └── deployment coordinator │
│ │
│ mcp-agent │
│ ├── mcns container │
│ ├── metacrypt container │
│ ├── mcr container │
│ └── mc-proxy (straylight) │
└──────────┬──────────────────────────┬────────────────┘
│ │
overlay overlay
│ │
▼ ▼
Worker node (rift) Edge node (svc)
┌─────────────────────┐ ┌─────────────────────────┐
│ mcp-agent │ │ mcp-agent │
│ ├── exo │ │ ├── mc-proxy (svc) │
│ ├── mcq │ │ └── (edge routes only) │
│ ├── mcdoc │ │ │
│ ├── sgard │ │ Edge routes: │
│ ├── kls │ │ mcq.metacircular.net │
│ └── mc-proxy │ │ mcdoc.metacircular.net │
│ (rift) │ │ exo.metacircular.net │
└─────────────────────┘ │ sgard.metacircular.net │
└─────────────────────────┘
```
### Node Roles
| Role | Purpose | Nodes |
|------|---------|-------|
| **master** | Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point. | straylight |
| **worker** | Runs mcp-agent. Hosts application services. | rift |
| **edge** | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc |
Every node runs an mcp-agent. The master node also runs mcp-master.
The master's local agent manages the infrastructure services (mcns,
metacrypt, mcr) the same way rift's agent manages application services.
### mc-proxy Mesh
Each node runs its own mc-proxy instance. They form a routing mesh:
```
mc-proxy (straylight)
├── :443 L7 routes for metacrypt-web, mcr-web
├── :8443 L4 passthrough for metacrypt-api, mcr-api
└── :9443 L4 passthrough for gRPC services
mc-proxy (rift)
├── :443 L7 routes for internal .svc.mcp hostnames
└── :8443 L4/L7 routes for internal APIs
mc-proxy (svc)
└── :443 L7 termination for public hostnames
→ forwards to internal .svc.mcp endpoints
```
---
## mcp-master
The master is a new binary that coordinates cross-node operations. It is
**not** a replacement for the agent -- it sits above agents and orchestrates
them.
### Responsibilities
1. **Accept CLI commands** via gRPC (deploy, undeploy, status, sync).
2. **Route deployments** to the correct agent based on the service
definition's `node` field.
3. **Detect public hostnames** in service definitions and coordinate edge
routing with the edge node's agent.
4. **Validate public hostnames** against a configured allowlist of domains
(e.g., `metacircular.net`, `wntrmute.net`).
5. **Resolve edge nodes** by checking DNS CNAME records to determine which
node handles public traffic for a given hostname.
6. **Coordinate undeploy** across nodes: tear down the service on the
worker, then clean up edge routes on the edge node.
### What the Master Does NOT Do
- Store container state (agents own their registries).
- Manage container lifecycle directly (agents do this).
- Run containers (the co-located agent does).
- Replace the agent on any node.
### Master Configuration
```toml
[server]
grpc_addr = "100.x.x.x:9555" # master listens on overlay
tls_cert = "/srv/mcp-master/certs/cert.pem"
tls_key = "/srv/mcp-master/certs/key.pem"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp-master"
# Allowed public domains. Hostnames in service definitions must fall
# under one of these suffixes.
[edge]
allowed_domains = ["metacircular.net", "wntrmute.net"]
# Node registry. The master knows about all nodes.
[[nodes]]
name = "straylight"
address = "100.x.x.x:9444"
role = "master"
[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
role = "worker"
[[nodes]]
name = "svc"
address = "100.x.x.x:9444"
role = "edge"
```
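A minimal sketch of how the master might represent this config and build its node registry, assuming hypothetical struct and field names and the BurntSushi TOML decoder (the `[mcias]` section is omitted for brevity):
```go
package master

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

// Node is one [[nodes]] entry from the master config.
type Node struct {
	Name    string `toml:"name"`
	Address string `toml:"address"` // overlay address of the node's mcp-agent
	Role    string `toml:"role"`    // "master", "worker", or "edge"
}

// Config mirrors the TOML layout above; field names are illustrative.
type Config struct {
	Server struct {
		GRPCAddr string `toml:"grpc_addr"`
		TLSCert  string `toml:"tls_cert"`
		TLSKey   string `toml:"tls_key"`
	} `toml:"server"`
	Edge struct {
		AllowedDomains []string `toml:"allowed_domains"`
	} `toml:"edge"`
	Nodes []Node `toml:"nodes"`
}

// LoadConfig parses the file and indexes nodes by name for routing decisions.
func LoadConfig(path string) (*Config, map[string]Node, error) {
	var cfg Config
	if _, err := toml.DecodeFile(path, &cfg); err != nil {
		return nil, nil, fmt.Errorf("parse %s: %w", path, err)
	}
	registry := make(map[string]Node, len(cfg.Nodes))
	for _, n := range cfg.Nodes {
		registry[n.Name] = n
	}
	return &cfg, registry, nil
}
```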
---
## Edge Routing
The core v2 feature: when a service declares a public hostname, the
master automatically provisions the edge route.
### Service Definition
Public hostnames are declared in the route's `hostname` field. The
master distinguishes public from internal hostnames by checking whether
they fall under a `.svc.mcp.` subdomain:
- `mcq.svc.mcp.metacircular.net` → internal (handled by local mc-proxy)
- `mcq.metacircular.net` → public (requires edge routing)
```toml
name = "mcq"
node = "rift"
active = true
[[components]]
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
volumes = ["/srv/mcq:/srv/mcq"]
cmd = ["server", "--config", "/srv/mcq/mcq.toml"]
# Internal route: handled by rift's mc-proxy.
[[components.routes]]
name = "internal"
port = 8443
mode = "l7"
# Public route: master detects this and sets up edge routing on svc.
[[components.routes]]
name = "public"
port = 8443
mode = "l7"
hostname = "mcq.metacircular.net"
```
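A minimal sketch of the internal/public distinction and the allowed-domain check, using illustrative function names rather than the actual mcp-master code:
```go
package master

import (
	"fmt"
	"strings"
)

// isInternal reports whether a route hostname is handled by the worker's
// local mc-proxy: internal routes use the ".svc.mcp." subdomain convention.
func isInternal(hostname string) bool {
	return strings.Contains(hostname, ".svc.mcp.")
}

// validatePublic checks a public hostname against [edge] allowed_domains.
// For example, "mcq.metacircular.net" is allowed under "metacircular.net".
func validatePublic(hostname string, allowedDomains []string) error {
	for _, domain := range allowedDomains {
		if hostname == domain || strings.HasSuffix(hostname, "."+domain) {
			return nil
		}
	}
	return fmt.Errorf("hostname %q is not under an allowed public domain", hostname)
}
```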
### Deploy Flow with Edge Routing
When the master receives `Deploy(mcq)`:
1. **Route to worker**: Master sends `Deploy` RPC to rift's agent with
the full service spec. Rift's agent deploys the container and
registers mc-proxy routes for all hostnames (both internal and public)
on its local mc-proxy.
2. **Detect public hostnames**: Master inspects the service spec for
hostnames that are not `.svc.mcp.` subdomains.
3. **Validate domains**: Master checks that `mcq.metacircular.net` falls
under an allowed domain (`metacircular.net` ✓).
4. **Resolve edge node**: Master performs a DNS lookup for
`mcq.metacircular.net`. If it's a CNAME to `svc.metacircular.net`,
the master resolves `svc.metacircular.net` to identify the edge node
as `svc`. If DNS is not yet configured (no CNAME), the master uses
the default edge node from config.
5. **Set up edge route**: Master sends a `SetupEdgeRoute` RPC to svc's
agent:
```
SetupEdgeRoute(
hostname: "mcq.metacircular.net"
backend_hostname: "mcq.svc.mcp.metacircular.net"
backend_port: 8443
)
```
6. **Svc agent provisions**: On receiving `SetupEdgeRoute`, svc's agent:
a. Provisions a TLS certificate from Metacrypt for
`mcq.metacircular.net`.
b. Registers an L7 route in its local mc-proxy:
`mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443`
with the provisioned cert.
7. **Master records the edge route** in its own registry for undeploy
cleanup.
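A skeleton of this coordination loop, continuing the sketches above; the `ServiceSpec`, `AgentClient`, and `Master` shapes are assumptions, not the actual mcp-master types:
```go
package master

import (
	"context"
	"fmt"
)

// AgentClient abstracts the agent RPCs the master needs for this flow.
type AgentClient interface {
	Deploy(ctx context.Context, spec *ServiceSpec) error
	SetupEdgeRoute(ctx context.Context, hostname, backendHostname string, backendPort int32) error
}

// Route carries the fields the master cares about; BackendHostname is the
// service's internal .svc.mcp. name, derived from its internal route.
type Route struct {
	Hostname        string
	BackendHostname string
	BackendPort     int32
}

type ServiceSpec struct {
	Name   string
	Node   string
	Routes []Route
}

type Master struct {
	cfg         *Config
	registry    map[string]Node        // node name -> [[nodes]] entry
	agents      map[string]AgentClient // node name -> agent client
	edgeRoutes  map[string][]string    // service name -> public hostnames (for undeploy)
	defaultEdge Node
}

func (m *Master) deployService(ctx context.Context, spec *ServiceSpec) error {
	// 1. Forward the full spec to the worker named in the service definition.
	worker, ok := m.registry[spec.Node]
	if !ok {
		return fmt.Errorf("unknown node %q", spec.Node)
	}
	if err := m.agents[worker.Name].Deploy(ctx, spec); err != nil {
		return err
	}
	for _, r := range spec.Routes {
		// 2. Skip internal hostnames; the worker's mc-proxy already has them.
		if r.Hostname == "" || isInternal(r.Hostname) {
			continue
		}
		// 3. Validate the public hostname against the allowlist.
		if err := validatePublic(r.Hostname, m.cfg.Edge.AllowedDomains); err != nil {
			return err
		}
		// 4. Resolve the edge node from DNS (sketched in the next section).
		edge := m.resolveEdgeNode(r.Hostname)
		// 5-6. Ask the edge agent to provision the cert and mc-proxy route.
		if err := m.agents[edge.Name].SetupEdgeRoute(ctx, r.Hostname, r.BackendHostname, r.BackendPort); err != nil {
			return err
		}
		// 7. Record the edge route for undeploy cleanup.
		m.edgeRoutes[spec.Name] = append(m.edgeRoutes[spec.Name], r.Hostname)
	}
	return nil
}
```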
### Undeploy Flow
When the master receives `Undeploy(mcq)`:
1. **Look up edge routes**: Master checks its registry for edge routes
associated with mcq.
2. **Remove edge route**: Master sends `RemoveEdgeRoute(mcq.metacircular.net)`
to svc's agent. Svc's agent removes the mc-proxy route and cleans up
the cert.
3. **Undeploy on worker**: Master sends `Undeploy` RPC to rift's agent.
Rift's agent tears down the container, routes, DNS, and certs as in v1.
### Edge Node DNS Resolution
The master determines which edge node handles a public hostname by
checking DNS:
1. Look up `mcq.metacircular.net` → CNAME `svc.metacircular.net`
2. Look up `svc.metacircular.net` → IP address
3. Match the IP against known edge nodes
If no CNAME exists yet (operator hasn't set it up), the master warns but
does not fail. The operator sets up DNS manually at Hurricane Electric.
The master can provide an `mcp dns check` command that verifies all public
hostnames resolve correctly.
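Continuing the same sketch, the master's resolution step might look like the following; it uses standard-library lookups, and the node matching and default fallback follow the description above:
```go
package master

import (
	"log"
	"net"
	"strings"
)

// resolveEdgeNode maps a public hostname to an edge node by following its
// CNAME and matching the resolved IPs against known edge nodes. If nothing
// matches (e.g. the operator has not set up DNS yet), it warns and falls
// back to the default edge node from the config.
func (m *Master) resolveEdgeNode(hostname string) Node {
	cname, err := net.LookupCNAME(hostname) // e.g. "svc.metacircular.net."
	if err != nil {
		log.Printf("dns: cannot look up %s, using default edge node", hostname)
		return m.defaultEdge
	}
	addrs, err := net.LookupHost(strings.TrimSuffix(cname, "."))
	if err != nil {
		log.Printf("dns: cannot resolve %s, using default edge node", cname)
		return m.defaultEdge
	}
	for _, n := range m.cfg.Nodes {
		if n.Role != "edge" {
			continue
		}
		host, _, err := net.SplitHostPort(n.Address)
		if err != nil {
			continue
		}
		for _, a := range addrs {
			if a == host {
				return n
			}
		}
	}
	log.Printf("dns: %s does not resolve to a known edge node, using default", hostname)
	return m.defaultEdge
}
```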
---
## Agent Changes for v2
### New RPCs
```protobuf
// Edge routing -- called by master on edge nodes.
rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);
message SetupEdgeRouteRequest {
string hostname = 1; // public hostname (e.g. "mcq.metacircular.net")
string backend_hostname = 2; // internal hostname (e.g. "mcq.svc.mcp.metacircular.net")
int32 backend_port = 3; // port on the worker's mc-proxy (e.g. 8443)
}
message SetupEdgeRouteResponse {}
message RemoveEdgeRouteRequest {
string hostname = 1;
}
message RemoveEdgeRouteResponse {}
message ListEdgeRoutesRequest {}
message ListEdgeRoutesResponse {
repeated EdgeRoute routes = 1;
}
message EdgeRoute {
string hostname = 1;
string backend_hostname = 2;
int32 backend_port = 3;
string cert_serial = 4;
string cert_expires = 5;
}
```
### SetupEdgeRoute Implementation
When the agent receives `SetupEdgeRoute`:
1. **Resolve backend address**: The agent resolves `backend_hostname` to
an IP address (the worker node's overlay IP). It uses the port from
the request to form the backend address (e.g., `100.95.252.120:8443`).
2. **Provision TLS cert**: The agent calls Metacrypt's CA API to issue a
certificate for the public hostname. The cert and key are written to
the mc-proxy cert directory.
3. **Register mc-proxy route**: The agent adds an L7 route to its local
mc-proxy:
- Listener: `:443`
- Hostname: `mcq.metacircular.net`
- Backend: `100.95.252.120:8443`
- Mode: `l7`
- TLS cert/key: the provisioned cert
- Backend TLS: `true` (worker's mc-proxy serves TLS)
4. **Record the edge route** in the agent's local registry for listing
and cleanup.
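A sketch of the agent-side handler for these steps; the Metacrypt, mc-proxy, and registry interfaces below are placeholders, and the `EnsureCert` signature is an assumption rather than the agent's real API:
```go
package agent

import (
	"context"
	"fmt"
	"net"
	"strconv"
)

// Placeholder interfaces standing in for the agent's Metacrypt client,
// mc-proxy admin client, and local registry.
type certProvisioner interface {
	EnsureCert(ctx context.Context, hostname string) (certPath, keyPath string, err error)
}
type proxyAdmin interface {
	AddRoute(ctx context.Context, r ProxyRoute) error
}
type edgeRegistry interface {
	SaveEdgeRoute(hostname, backendHostname string, backendPort int32) error
}

type ProxyRoute struct {
	Listener, Hostname, Backend, Mode string
	CertFile, KeyFile                 string
	BackendTLS                        bool
}

type Agent struct {
	metacrypt certProvisioner
	mcproxy   proxyAdmin
	registry  edgeRegistry
}

func (a *Agent) SetupEdgeRoute(ctx context.Context, req *SetupEdgeRouteRequest) (*SetupEdgeRouteResponse, error) {
	// 1. Resolve the backend hostname to the worker's overlay IP.
	addrs, err := net.DefaultResolver.LookupHost(ctx, req.BackendHostname)
	if err != nil {
		return nil, fmt.Errorf("resolve backend %s: %w", req.BackendHostname, err)
	}
	if len(addrs) == 0 {
		return nil, fmt.Errorf("no addresses for backend %s", req.BackendHostname)
	}
	backend := net.JoinHostPort(addrs[0], strconv.Itoa(int(req.BackendPort)))

	// 2. Provision a cert for the public hostname into the mc-proxy cert dir.
	certPath, keyPath, err := a.metacrypt.EnsureCert(ctx, req.Hostname)
	if err != nil {
		return nil, fmt.Errorf("provision cert for %s: %w", req.Hostname, err)
	}

	// 3. Register the L7 route with the local mc-proxy admin API.
	route := ProxyRoute{
		Listener:   ":443",
		Hostname:   req.Hostname,
		Backend:    backend,
		Mode:       "l7",
		CertFile:   certPath,
		KeyFile:    keyPath,
		BackendTLS: true, // the worker's mc-proxy serves TLS
	}
	if err := a.mcproxy.AddRoute(ctx, route); err != nil {
		return nil, fmt.Errorf("register mc-proxy route: %w", err)
	}

	// 4. Record the route locally for ListEdgeRoutes and later cleanup.
	if err := a.registry.SaveEdgeRoute(req.Hostname, req.BackendHostname, req.BackendPort); err != nil {
		return nil, err
	}
	return &SetupEdgeRouteResponse{}, nil
}
```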
### Cert Provisioning on All Agents
All agents need Metacrypt configuration to provision certs:
```toml
[metacrypt]
server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
ca_cert = "/srv/mcp/certs/metacircular-ca.pem"
mount = "pki"
issuer = "infra"
token_path = "/srv/mcp/metacrypt-token"
```
The svc agent provisions certs for public hostnames. The rift agent
provisions certs for internal hostnames. Both use the same Metacrypt API.
---
## CLI Changes for v2
The CLI's `[[nodes]]` config is replaced by a single master address:
```toml
[master]
address = "100.x.x.x:9555"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp"
[auth]
token_path = "/home/kyle/.config/mcp/token"
[services]
dir = "/home/kyle/.config/mcp/services"
```
Commands that currently iterate over nodes (`mcp ps`, `mcp list`,
`mcp node list`) instead query the master, which aggregates results from
all agents.
Service definition files remain on the operator's workstation. The CLI
pushes them to the master on `mcp deploy` and `mcp sync`.
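A sketch of the CLI side of this change, assuming a hypothetical `CLIConfig` shape and a plain gRPC dial; the real CLI would also wire in mcias token auth and pin the metacrypt CA:
```go
package cli

import (
	"crypto/tls"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

type CLIConfig struct {
	Master struct {
		Address string `toml:"address"`
	} `toml:"master"`
}

// dialMaster replaces v1's per-node dialing: every command talks to
// mcp-master, which fans out to agents as needed.
func dialMaster(cfg *CLIConfig) (*grpc.ClientConn, error) {
	// This sketch uses system roots; a real deployment would set RootCAs.
	creds := credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12})
	conn, err := grpc.NewClient(cfg.Master.Address, grpc.WithTransportCredentials(creds))
	if err != nil {
		return nil, fmt.Errorf("dial master %s: %w", cfg.Master.Address, err)
	}
	return conn, nil
}
```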
---
## Migration Plan
### Phase 1: Deploy mcp-agent on svc
svc currently has no mcp-agent. Install and configure one:
- Create `mcp` user on svc (Debian: `install-agent.sh`)
- Install mcp-agent binary
- Configure with Metacrypt and mc-proxy socket access
- Verify with `mcp node list` (svc shows up)
### Phase 2: Add edge routing RPCs to agents
Implement `SetupEdgeRoute`, `RemoveEdgeRoute`, `ListEdgeRoutes` on the
agent. Test by calling them directly from the CLI before the master exists.
### Phase 3: Build mcp-master
Start with the core coordination loop:
1. Accept `Deploy` from CLI
2. Forward to the correct agent
3. Detect public hostnames
4. Call `SetupEdgeRoute` on the edge agent
### Phase 4: Provision straylight
New node (straylight) takes over as master and hosts core infrastructure:
1. Deploy mcp-agent on straylight
2. Migrate mcns, metacrypt, mcr from rift to straylight
3. Deploy mcp-master on straylight
4. Update CLI config to point at master
### Phase 5: Cut over
- Update DNS to point `*.svc.mcp.metacircular.net` at straylight
- Update service definitions to use new node assignments
- Verify all services via `mcp ps` and public endpoint tests
---
## Open Questions
1. **Master HA**: mcp-master is a single point of failure. For v2, this
is acceptable (operator can SSH to agents directly if master is down).
v3 could add master replication or make agents self-sufficient for
local operations when the master is unreachable.
2. **Service placement**: v2 still requires explicit `node` assignment
in service definitions. Automatic placement based on resource
availability is a future concern.
3. **Cert renewal on edge**: Edge certs have a 90-day TTL. The edge
agent needs a renewal loop (similar to the existing `renewWindow`
check in `EnsureCert`) or the master needs to periodically re-check
edge routes.
4. **mc-proxy database vs config**: mc-proxy persists routes in SQLite,
which can diverge from the TOML config. The agent should be the sole
manager of mc-proxy routes via the gRPC admin API, not the TOML file.
This avoids the stale-database problem encountered during v1
operations on svc.
5. **straylight hardware**: What hardware is straylight? Does it run
NixOS or Debian? Does it use rootless podman like rift?