Add ARCHITECTURE_V2.md for multi-node master/agent topology
Documents the planned v2 architecture: mcp-master on straylight coordinates deployments across worker (rift) and edge (svc) nodes. Includes edge routing flow, agent RPCs, migration plan, and operational issues from v1 that motivate the redesign. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# MCP v2 -- Multi-Node Control Plane

## Overview

MCP v2 introduces multi-node orchestration with a master/agent topology.
The CLI no longer dials agents directly. A dedicated **mcp-master** daemon
coordinates deployments across nodes, handles cross-node concerns (edge
routing, certificate provisioning, DNS), and serves as the single control
point for the platform.

### Motivation

v1 deployed successfully on a single node (rift) but exposed operational
pain points as services needed public-facing routes through svc:

- **Manual edge routing**: Exposing mcq.metacircular.net required hand-editing
  mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating
  the SQLite database when the config and database diverged, and debugging
  silent failures. Every redeployment risked breaking the public route.

- **Dynamic port instability**: The route system assigns ephemeral host ports
  that change on every deploy. svc's mc-proxy pointed at a specific port
  (e.g., `100.95.252.120:48080`), which went stale after redeployment.
  Container ports are also localhost-only under rootless podman, requiring
  explicit Tailscale IP bindings for external access.

- **$PORT env override conflict**: The mcdsl config loader overrides
  `listen_addr` from `$PORT` when routes are present. This meant containers
  ignored their configured port and listened on the route-allocated one
  instead, breaking explicit port mappings that expected the config port.

- **Cert chain issues**: mc-proxy requires full certificate chains (leaf +
  intermediates). Certs provisioned outside the standard metacrypt flow
  were leaf-only and caused silent TLS handshake failures (`client_bytes=7
  backend_bytes=0` with no error logged).

- **mc-proxy database divergence**: mc-proxy persists routes in SQLite.
  Routes added via the admin API override the TOML config. Editing the TOML
  alone had no effect until the database was manually updated -- a failure
  mode that took hours to diagnose.

- **No cross-node coordination**: The v1 CLI talks directly to individual
  agents. There is no mechanism for one agent to tell another "set up a
  route for this service." Every cross-node operation was manual.

v2 addresses all of these by making the master the single coordination
point for deployments, with agents handling local concerns (containers,
mc-proxy routes, cert provisioning) on instruction from the master.

### What Changes from v1

| Concern | v1 | v2 |
|---------|----|----|
| CLI target | CLI dials agents directly | CLI dials the master |
| Node awareness | CLI routes by `node` field in service defs | Master owns the node registry |
| Service definitions | Live on operator workstation | Pushed to master, which distributes to agents |
| Edge routing | Manual mc-proxy config on svc | Master coordinates agent-to-agent setup |
| Cert provisioning | Agent provisions for local mc-proxy only | Any agent can provision certs (edge included) |
| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones |

### What Stays the Same

The agent's core responsibilities are unchanged: it manages containers via
podman, stores its local registry in SQLite, monitors for drift, and alerts
the operator. The agent gains new RPCs for edge routing but does not become
aware of other nodes -- the master handles all cross-node coordination.

---

## Topology

```
Operator workstation (vade)
┌──────────────────────────┐
│ mcp (CLI)                │
│                          │
│ gRPC ────────────────────┼─── overlay ───┐
└──────────────────────────┘               │
                                           ▼
                Master node (straylight)
┌──────────────────────────────────────────────────────┐
│ mcp-master                                           │
│  ├── node registry (all nodes, roles, addresses)     │
│  ├── service definitions (pushed from CLI)           │
│  └── deployment coordinator                          │
│                                                      │
│ mcp-agent                                            │
│  ├── mcns container                                  │
│  ├── metacrypt container                             │
│  ├── mcr container                                   │
│  └── mc-proxy (straylight)                           │
└──────────┬──────────────────────────┬────────────────┘
           │                          │
        overlay                    overlay
           │                          │
           ▼                          ▼
Worker node (rift)            Edge node (svc)
┌─────────────────────┐      ┌─────────────────────────┐
│ mcp-agent           │      │ mcp-agent               │
│  ├── exo            │      │  ├── mc-proxy (svc)     │
│  ├── mcq            │      │  └── (edge routes only) │
│  ├── mcdoc          │      │                         │
│  ├── sgard          │      │ Edge routes:            │
│  ├── kls            │      │   mcq.metacircular.net  │
│  └── mc-proxy       │      │   mcdoc.metacircular.net│
│      (rift)         │      │   exo.metacircular.net  │
└─────────────────────┘      │   sgard.metacircular.net│
                             └─────────────────────────┘
```

### Node Roles

| Role | Purpose | Nodes |
|------|---------|-------|
| **master** | Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point. | straylight |
| **worker** | Runs mcp-agent. Hosts application services. | rift |
| **edge** | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc |

Every node runs an mcp-agent. The master node also runs mcp-master.
The master's local agent manages the infrastructure services (mcns,
metacrypt, mcr) the same way rift's agent manages application services.

### mc-proxy Mesh

Each node runs its own mc-proxy instance. They form a routing mesh:

```
mc-proxy (straylight)
  ├── :443   L7 routes for metacrypt-web, mcr-web
  ├── :8443  L4 passthrough for metacrypt-api, mcr-api
  └── :9443  L4 passthrough for gRPC services

mc-proxy (rift)
  ├── :443   L7 routes for internal .svc.mcp hostnames
  └── :8443  L4/L7 routes for internal APIs

mc-proxy (svc)
  └── :443   L7 termination for public hostnames
             → forwards to internal .svc.mcp endpoints
```

---

## mcp-master

The master is a new binary that coordinates cross-node operations. It is
**not** a replacement for the agent -- it sits above agents and orchestrates
them.

### Responsibilities

1. **Accept CLI commands** via gRPC (deploy, undeploy, status, sync).
2. **Route deployments** to the correct agent based on the service
   definition's `node` field.
3. **Detect public hostnames** in service definitions and coordinate edge
   routing with the edge node's agent.
4. **Validate public hostnames** against a configured allowlist of domains
   (e.g., `metacircular.net`, `wntrmute.net`).
5. **Resolve edge nodes** by checking DNS CNAME records to determine which
   node handles public traffic for a given hostname.
6. **Coordinate undeploy** across nodes: tear down the service on the
   worker, then clean up edge routes on the edge node.

### What the Master Does NOT Do

- Store container state (agents own their registries).
- Manage container lifecycle directly (agents do this).
- Run containers (the co-located agent does).
- Replace the agent on any node.

### Master Configuration

```toml
[server]
grpc_addr = "100.x.x.x:9555"  # master listens on overlay
tls_cert = "/srv/mcp-master/certs/cert.pem"
tls_key = "/srv/mcp-master/certs/key.pem"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp-master"

# Allowed public domains. Hostnames in service definitions must fall
# under one of these suffixes.
[edge]
allowed_domains = ["metacircular.net", "wntrmute.net"]

# Node registry. The master knows about all nodes.
[[nodes]]
name = "straylight"
address = "100.x.x.x:9444"
role = "master"

[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
role = "worker"

[[nodes]]
name = "svc"
address = "100.x.x.x:9444"
role = "edge"
```

---

## Edge Routing

The core v2 feature: when a service declares a public hostname, the
master automatically provisions the edge route.

### Service Definition

Public hostnames are declared in the route's `hostname` field. The
master distinguishes public from internal hostnames by checking whether
they fall under a `.svc.mcp.` subdomain:

- `mcq.svc.mcp.metacircular.net` → internal (handled by local mc-proxy)
- `mcq.metacircular.net` → public (requires edge routing)
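
This classification, together with the master's allowlist check, can be sketched in Go. The function names (`isPublicHostname`, `underAllowedDomain`) are illustrative assumptions, not names from the codebase:

```go
package main

import (
	"fmt"
	"strings"
)

// isPublicHostname reports whether a route hostname needs edge routing:
// internal hostnames fall under a ".svc.mcp." subdomain, everything
// else is public.
func isPublicHostname(hostname string) bool {
	return !strings.Contains(hostname, ".svc.mcp.")
}

// underAllowedDomain checks a public hostname against the master's
// [edge] allowed_domains list, matching whole labels so that
// "badmetacircular.net" does not pass for "metacircular.net".
func underAllowedDomain(hostname string, allowed []string) bool {
	for _, d := range allowed {
		if hostname == d || strings.HasSuffix(hostname, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"metacircular.net", "wntrmute.net"}
	for _, h := range []string{
		"mcq.svc.mcp.metacircular.net",
		"mcq.metacircular.net",
		"mcq.example.com",
	} {
		fmt.Printf("%-30s public=%-5v allowed=%v\n",
			h, isPublicHostname(h), underAllowedDomain(h, allowed))
	}
}
```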

```toml
name = "mcq"
node = "rift"
active = true

[[components]]
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
volumes = ["/srv/mcq:/srv/mcq"]
cmd = ["server", "--config", "/srv/mcq/mcq.toml"]

# Internal route: handled by rift's mc-proxy.
[[components.routes]]
name = "internal"
port = 8443
mode = "l7"

# Public route: master detects this and sets up edge routing on svc.
[[components.routes]]
name = "public"
port = 8443
mode = "l7"
hostname = "mcq.metacircular.net"
```

### Deploy Flow with Edge Routing

When the master receives `Deploy(mcq)`:

1. **Route to worker**: Master sends `Deploy` RPC to rift's agent with
   the full service spec. Rift's agent deploys the container and
   registers mc-proxy routes for all hostnames (both internal and public)
   on its local mc-proxy.

2. **Detect public hostnames**: Master inspects the service spec for
   hostnames that are not `.svc.mcp.` subdomains.

3. **Validate domains**: Master checks that `mcq.metacircular.net` falls
   under an allowed domain (`metacircular.net` ✓).

4. **Resolve edge node**: Master performs a DNS lookup for
   `mcq.metacircular.net`. If it's a CNAME to `svc.metacircular.net`,
   the master resolves `svc.metacircular.net` to identify the edge node
   as `svc`. If DNS is not yet configured (no CNAME), the master uses
   the default edge node from config.

5. **Set up edge route**: Master sends a `SetupEdgeRoute` RPC to svc's
   agent:

   ```
   SetupEdgeRoute(
     hostname:         "mcq.metacircular.net"
     backend_hostname: "mcq.svc.mcp.metacircular.net"
     backend_port:     8443
   )
   ```

6. **Svc agent provisions**: On receiving `SetupEdgeRoute`, svc's agent:
   a. Provisions a TLS certificate from Metacrypt for
      `mcq.metacircular.net`.
   b. Registers an L7 route in its local mc-proxy:
      `mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443`
      with the provisioned cert.

7. **Master records the edge route** in its own registry for undeploy
   cleanup.
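
The steps above can be sketched as a single coordination function. This is an illustrative Go sketch, not the real mcp-master: the `AgentClient` shape, `deployService`, and the backend-hostname derivation are all assumptions (the real client would be generated gRPC code).

```go
package main

import (
	"fmt"
	"strings"
)

// AgentClient bundles the per-node agent RPCs this flow needs;
// function fields stand in for the real gRPC client methods.
type AgentClient struct {
	Deploy         func(spec ServiceSpec) error
	SetupEdgeRoute func(hostname, backendHostname string, backendPort int) error
}

// Route and ServiceSpec are simplified shapes of the service definition.
type Route struct {
	Hostname string // empty when the route declares no hostname
	Port     int
}

type ServiceSpec struct {
	Name   string
	Node   string
	Routes []Route
}

// deployService mirrors the flow: deploy on the assigned worker, then
// set up an edge route for each public hostname in the spec.
func deployService(spec ServiceSpec, agents map[string]AgentClient, edgeNode string) error {
	worker, ok := agents[spec.Node]
	if !ok {
		return fmt.Errorf("unknown node %q", spec.Node)
	}
	if err := worker.Deploy(spec); err != nil { // step 1
		return err
	}
	edge, ok := agents[edgeNode]
	if !ok {
		return fmt.Errorf("unknown edge node %q", edgeNode)
	}
	for _, r := range spec.Routes {
		// Step 2: skip routes with no hostname and internal hostnames.
		if r.Hostname == "" || strings.Contains(r.Hostname, ".svc.mcp.") {
			continue
		}
		// Derive the internal backend hostname, e.g.
		// mcq.metacircular.net -> mcq.svc.mcp.metacircular.net.
		domain := r.Hostname[strings.Index(r.Hostname, ".")+1:]
		backend := spec.Name + ".svc.mcp." + domain
		// Step 5: instruct the edge agent.
		if err := edge.SetupEdgeRoute(r.Hostname, backend, r.Port); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	agents := map[string]AgentClient{
		"rift": {
			Deploy:         func(s ServiceSpec) error { fmt.Println("deploy", s.Name, "on rift"); return nil },
			SetupEdgeRoute: func(h, b string, p int) error { return nil },
		},
		"svc": {
			Deploy:         func(s ServiceSpec) error { return nil },
			SetupEdgeRoute: func(h, b string, p int) error { fmt.Printf("edge %s -> %s:%d\n", h, b, p); return nil },
		},
	}
	spec := ServiceSpec{Name: "mcq", Node: "rift", Routes: []Route{
		{Port: 8443},
		{Hostname: "mcq.metacircular.net", Port: 8443},
	}}
	if err := deployService(spec, agents, "svc"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Error handling here is deliberately simple; validation and DNS-based edge resolution (steps 3-4) would run before `SetupEdgeRoute`.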

### Undeploy Flow

When the master receives `Undeploy(mcq)`:

1. **Look up edge routes**: Master checks its registry for edge routes
   associated with mcq.
2. **Remove edge route**: Master sends `RemoveEdgeRoute(mcq.metacircular.net)`
   to svc's agent. Svc's agent removes the mc-proxy route and cleans up
   the cert.
3. **Undeploy on worker**: Master sends `Undeploy` RPC to rift's agent.
   Rift's agent tears down the container, routes, DNS, and certs as in v1.

### Edge Node DNS Resolution

The master determines which edge node handles a public hostname by
checking DNS:

1. Look up `mcq.metacircular.net` → CNAME `svc.metacircular.net`
2. Look up `svc.metacircular.net` → IP address
3. Match the IP against known edge nodes

If no CNAME exists yet (operator hasn't set it up), the master warns but
does not fail. The operator sets up DNS manually at Hurricane Electric.
The master can provide an `mcp dns check` command that verifies all public
hostnames resolve correctly.
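
A sketch of this resolution logic, with the DNS lookups injected as functions so it runs without live DNS. In the real master these would presumably be `net.LookupCNAME` and `net.LookupIP`; all other names here are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveEdgeNode performs the three steps above. edgeNodesByIP maps
// known edge node addresses to node names; defaultEdge is the fallback
// when no CNAME is configured yet.
func resolveEdgeNode(
	hostname string,
	lookupCNAME func(string) (string, error),
	lookupIPs func(string) ([]string, error),
	edgeNodesByIP map[string]string,
	defaultEdge string,
) (string, error) {
	// Step 1: follow the CNAME. net.LookupCNAME returns the name
	// itself when no CNAME record exists, so treat that as "no CNAME".
	target, err := lookupCNAME(hostname)
	if err != nil || strings.TrimSuffix(target, ".") == hostname {
		fmt.Printf("warning: no CNAME for %s; using default edge %q\n", hostname, defaultEdge)
		return defaultEdge, nil
	}
	// Step 2: resolve the CNAME target to addresses.
	ips, err := lookupIPs(strings.TrimSuffix(target, "."))
	if err != nil {
		return "", err
	}
	// Step 3: match an address against the known edge nodes.
	for _, ip := range ips {
		if node, ok := edgeNodesByIP[ip]; ok {
			return node, nil
		}
	}
	return "", fmt.Errorf("%s resolves to %v, matching no known edge node", hostname, ips)
}

func main() {
	// Stub DNS: mcq.metacircular.net CNAMEs to svc.metacircular.net.
	node, err := resolveEdgeNode("mcq.metacircular.net",
		func(string) (string, error) { return "svc.metacircular.net.", nil },
		func(string) ([]string, error) { return []string{"203.0.113.7"}, nil },
		map[string]string{"203.0.113.7": "svc"}, "svc")
	fmt.Println(node, err)
}
```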

---

## Agent Changes for v2

### New RPCs

```protobuf
// Edge routing -- called by master on edge nodes.
rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);

message SetupEdgeRouteRequest {
  string hostname = 1;          // public hostname (e.g. "mcq.metacircular.net")
  string backend_hostname = 2;  // internal hostname (e.g. "mcq.svc.mcp.metacircular.net")
  int32 backend_port = 3;       // port on the worker's mc-proxy (e.g. 8443)
}

message SetupEdgeRouteResponse {}

message RemoveEdgeRouteRequest {
  string hostname = 1;
}

message RemoveEdgeRouteResponse {}

message ListEdgeRoutesRequest {}

message ListEdgeRoutesResponse {
  repeated EdgeRoute routes = 1;
}

message EdgeRoute {
  string hostname = 1;
  string backend_hostname = 2;
  int32 backend_port = 3;
  string cert_serial = 4;
  string cert_expires = 5;
}
```

### SetupEdgeRoute Implementation

When the agent receives `SetupEdgeRoute`:

1. **Resolve backend address**: The agent resolves `backend_hostname` to
   an IP address (the worker node's overlay IP). It uses the port from
   the request to form the backend address (e.g., `100.95.252.120:8443`).

2. **Provision TLS cert**: The agent calls Metacrypt's CA API to issue a
   certificate for the public hostname. The cert and key are written to
   the mc-proxy cert directory.

3. **Register mc-proxy route**: The agent adds an L7 route to its local
   mc-proxy:
   - Listener: `:443`
   - Hostname: `mcq.metacircular.net`
   - Backend: `100.95.252.120:8443`
   - Mode: `l7`
   - TLS cert/key: the provisioned cert
   - Backend TLS: `true` (worker's mc-proxy serves TLS)

4. **Record the edge route** in the agent's local registry for listing
   and cleanup.
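
These steps can be sketched with the agent's collaborators injected as functions. `EdgeRouteDeps`, `setupEdgeRoute`, and every field name here are illustrative stand-ins; the real agent would call its DNS resolver, Metacrypt client, mc-proxy admin API, and SQLite registry at these points:

```go
package main

import (
	"fmt"
	"net"
)

// EdgeRouteDeps collects the agent's collaborators for this RPC.
type EdgeRouteDeps struct {
	ResolveHost   func(host string) ([]string, error) // e.g. net.LookupHost
	ProvisionCert func(hostname string) (certPath, keyPath string, err error)
	AddProxyRoute func(listener, hostname, backend, certPath, keyPath string) error
	RecordRoute   func(hostname, backend string) error
}

// setupEdgeRoute follows steps 1-4 above.
func setupEdgeRoute(deps EdgeRouteDeps, hostname, backendHostname string, backendPort int) error {
	// 1. Resolve the backend hostname to the worker's overlay IP.
	ips, err := deps.ResolveHost(backendHostname)
	if err != nil || len(ips) == 0 {
		return fmt.Errorf("resolve %s: %v", backendHostname, err)
	}
	backend := net.JoinHostPort(ips[0], fmt.Sprintf("%d", backendPort))

	// 2. Provision a TLS cert for the public hostname.
	cert, key, err := deps.ProvisionCert(hostname)
	if err != nil {
		return fmt.Errorf("provision cert for %s: %w", hostname, err)
	}

	// 3. Register the L7 route on the local mc-proxy.
	if err := deps.AddProxyRoute(":443", hostname, backend, cert, key); err != nil {
		return fmt.Errorf("register mc-proxy route: %w", err)
	}

	// 4. Record the edge route for listing and cleanup.
	return deps.RecordRoute(hostname, backend)
}

func main() {
	deps := EdgeRouteDeps{
		ResolveHost:   func(string) ([]string, error) { return []string{"100.95.252.120"}, nil },
		ProvisionCert: func(string) (string, string, error) { return "/certs/cert.pem", "/certs/key.pem", nil },
		AddProxyRoute: func(l, h, b, c, k string) error { fmt.Printf("route %s %s -> %s\n", l, h, b); return nil },
		RecordRoute:   func(h, b string) error { return nil },
	}
	if err := setupEdgeRoute(deps, "mcq.metacircular.net", "mcq.svc.mcp.metacircular.net", 8443); err != nil {
		fmt.Println("error:", err)
	}
}
```

Each step fails fast, so a half-provisioned route surfaces as an RPC error rather than the silent failures seen in v1.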

### Cert Provisioning on All Agents

All agents need Metacrypt configuration to provision certs:

```toml
[metacrypt]
server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
ca_cert = "/srv/mcp/certs/metacircular-ca.pem"
mount = "pki"
issuer = "infra"
token_path = "/srv/mcp/metacrypt-token"
```

The svc agent provisions certs for public hostnames. The rift agent
provisions certs for internal hostnames. Both use the same Metacrypt API.

---

## CLI Changes for v2

The CLI's `[[nodes]]` config is replaced by a single master address:

```toml
[master]
address = "100.x.x.x:9555"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp"

[auth]
token_path = "/home/kyle/.config/mcp/token"

[services]
dir = "/home/kyle/.config/mcp/services"
```

Commands that currently iterate over nodes (`mcp ps`, `mcp list`,
`mcp node list`) instead query the master, which aggregates from all
agents.

Service definition files remain on the operator's workstation. The CLI
pushes them to the master on `mcp deploy` and `mcp sync`.

---

## Migration Plan

### Phase 1: Deploy mcp-agent on svc

svc currently has no mcp-agent. Install and configure one:

- Create the `mcp` user on svc (Debian: `install-agent.sh`)
- Install the mcp-agent binary
- Configure with Metacrypt and mc-proxy socket access
- Verify with `mcp node list` (svc shows up)

### Phase 2: Add edge routing RPCs to agents

Implement `SetupEdgeRoute`, `RemoveEdgeRoute`, and `ListEdgeRoutes` on the
agent. Test by calling them directly from the CLI before the master exists.

### Phase 3: Build mcp-master

Start with the core coordination loop:

1. Accept `Deploy` from CLI
2. Forward to the correct agent
3. Detect public hostnames
4. Call `SetupEdgeRoute` on the edge agent

### Phase 4: Provision straylight

The new node (straylight) takes over as master and hosts core infrastructure:

1. Deploy mcp-agent on straylight
2. Migrate mcns, metacrypt, mcr from rift to straylight
3. Deploy mcp-master on straylight
4. Update the CLI config to point at the master

### Phase 5: Cut over

- Update DNS to point `*.svc.mcp.metacircular.net` at straylight
- Update service definitions to use new node assignments
- Verify all services via `mcp ps` and public endpoint tests

---

## Open Questions

1. **Master HA**: mcp-master is a single point of failure. For v2 this
   is acceptable (the operator can SSH to agents directly if the master
   is down). v3 could add master replication or make agents
   self-sufficient for local operations when the master is unreachable.

2. **Service placement**: v2 still requires explicit `node` assignment
   in service definitions. Automatic placement based on resource
   availability is a future concern.

3. **Cert renewal on edge**: Edge certs have a 90-day TTL. The edge
   agent needs a renewal loop (similar to the existing `renewWindow`
   check in `EnsureCert`) or the master needs to periodically re-check
   edge routes.

4. **mc-proxy database vs config**: mc-proxy persists routes in SQLite,
   which can diverge from the TOML config. The agent should be the sole
   manager of mc-proxy routes via the gRPC admin API, not the TOML file.
   This avoids the stale-database problem encountered during v1
   operations on svc.

5. **straylight hardware**: What hardware is straylight? Does it run
   NixOS or Debian? Does it use rootless podman like rift?