MCP v2 -- Multi-Node Control Plane
Overview
MCP v2 introduces multi-node orchestration with a master/agent topology. The CLI no longer dials agents directly. A dedicated mcp-master daemon coordinates deployments across nodes, handles cross-node concerns (edge routing, certificate provisioning, DNS), and serves as the single control point for the platform.
Motivation
v1 deployed successfully on a single node (rift) but exposed operational pain points as services needed public-facing routes through svc:
- Manual edge routing: Exposing mcq.metacircular.net required hand-editing mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating the SQLite database when the config and database diverged, and debugging silent failures. Every redeployment risked breaking the public route.
- Dynamic port instability: The route system assigns ephemeral host ports that change on every deploy. svc's mc-proxy pointed at a specific port (e.g., 100.95.252.120:48080), which went stale after redeployment. Container ports are also localhost-only under rootless podman, requiring explicit Tailscale IP bindings for external access.
- $PORT env override conflict: The mcdsl config loader overrides listen_addr from $PORT when routes are present. This meant containers ignored their configured port and listened on the route-allocated one instead, breaking explicit port mappings that expected the config port.
- Cert chain issues: mc-proxy requires full certificate chains (leaf + intermediates). Certs provisioned outside the standard metacrypt flow were leaf-only and caused silent TLS handshake failures (client_bytes=7 backend_bytes=0 with no error logged).
- mc-proxy database divergence: mc-proxy persists routes in SQLite. Routes added via the admin API override the TOML config. Editing the TOML alone had no effect until the database was manually updated -- a failure mode that took hours to diagnose.
- No cross-node coordination: The v1 CLI talks directly to individual agents. There is no mechanism for one agent to tell another "set up a route for this service." Every cross-node operation was manual.
v2 addresses all of these by making the master the single coordination point for deployments, with agents handling local concerns (containers, mc-proxy routes, cert provisioning) on instruction from the master.
What Changes from v1
| Concern | v1 | v2 |
|---|---|---|
| CLI target | CLI dials agents directly | CLI dials the master |
| Node awareness | CLI routes by node field in service defs | Master owns the node registry |
| Service definitions | Live on operator workstation | Pushed to master, which distributes to agents |
| Edge routing | Manual mc-proxy config on svc | Master coordinates agent-to-agent setup |
| Cert provisioning | Agent provisions for local mc-proxy only | Any agent can provision certs (edge included) |
| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones |
What Stays the Same
The agent's core responsibilities are unchanged: it manages containers via podman, stores its local registry in SQLite, monitors for drift, and alerts the operator. The agent gains new RPCs for edge routing but does not become aware of other nodes -- the master handles all cross-node coordination.
Topology
Operator workstation (vade)
┌──────────────────────────┐
│ mcp (CLI) │
│ │
│ gRPC ───────────────────┼─── overlay ───┐
└──────────────────────────┘ │
▼
Master node (straylight)
┌──────────────────────────────────────────────────────┐
│ mcp-master │
│ ├── node registry (all nodes, roles, addresses) │
│ ├── service definitions (pushed from CLI) │
│ └── deployment coordinator │
│ │
│ mcp-agent │
│ ├── mcns container │
│ ├── metacrypt container │
│ ├── mcr container │
│ └── mc-proxy (straylight) │
└──────────┬──────────────────────────┬────────────────┘
│ │
overlay overlay
│ │
▼ ▼
Worker node (rift) Edge node (svc)
┌─────────────────────┐ ┌─────────────────────────┐
│ mcp-agent │ │ mcp-agent │
│ ├── exo │ │ ├── mc-proxy (svc) │
│ ├── mcq │ │ └── (edge routes only) │
│ ├── mcdoc │ │ │
│ ├── sgard │ │ Edge routes: │
│ ├── kls │ │ mcq.metacircular.net │
│ └── mc-proxy │ │ mcdoc.metacircular.net │
│ (rift) │ │ exo.metacircular.net │
└─────────────────────┘ │ sgard.metacircular.net │
└─────────────────────────┘
Node Roles
| Role | Purpose | Nodes |
|---|---|---|
| master | Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point. | straylight |
| worker | Runs mcp-agent. Hosts application services. | rift |
| edge | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc |
Every node runs an mcp-agent. The master node also runs mcp-master. The master's local agent manages the infrastructure services (mcns, metacrypt, mcr) the same way rift's agent manages application services.
mc-proxy Mesh
Each node runs its own mc-proxy instance. They form a routing mesh:
mc-proxy (straylight)
├── :443 L7 routes for metacrypt-web, mcr-web
├── :8443 L4 passthrough for metacrypt-api, mcr-api
└── :9443 L4 passthrough for gRPC services
mc-proxy (rift)
├── :443 L7 routes for internal .svc.mcp hostnames
└── :8443 L4/L7 routes for internal APIs
mc-proxy (svc)
└── :443 L7 termination for public hostnames
→ forwards to internal .svc.mcp endpoints
mcp-master
The master is a new binary that coordinates cross-node operations. It is not a replacement for the agent -- it sits above agents and orchestrates them.
Responsibilities
- Accept CLI commands via gRPC (deploy, undeploy, status, sync).
- Route deployments to the correct agent based on the service definition's node field.
- Detect public hostnames in service definitions and coordinate edge routing with the edge node's agent.
- Validate public hostnames against a configured allowlist of domains (e.g., metacircular.net, wntrmute.net).
- Resolve edge nodes by checking DNS CNAME records to determine which node handles public traffic for a given hostname.
- Coordinate undeploy across nodes: tear down the service on the worker, then clean up edge routes on the edge node.
What the Master Does NOT Do
- Store container state (agents own their registries).
- Manage container lifecycle directly (agents do this).
- Run containers (the co-located agent does).
- Replace the agent on any node.
Master Configuration
[server]
grpc_addr = "100.x.x.x:9555" # master listens on overlay
tls_cert = "/srv/mcp-master/certs/cert.pem"
tls_key = "/srv/mcp-master/certs/key.pem"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp-master"
# Allowed public domains. Hostnames in service definitions must fall
# under one of these suffixes.
[edge]
allowed_domains = ["metacircular.net", "wntrmute.net"]
# Node registry. The master knows about all nodes.
[[nodes]]
name = "straylight"
address = "100.x.x.x:9444"
role = "master"
[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
role = "worker"
[[nodes]]
name = "svc"
address = "100.x.x.x:9444"
role = "edge"
Edge Routing
The core v2 feature: when a service declares a public hostname, the master automatically provisions the edge route.
Service Definition
Public hostnames are declared in the route's hostname field. The master distinguishes public from internal hostnames by checking whether they fall under a .svc.mcp. subdomain:

- mcq.svc.mcp.metacircular.net → internal (handled by local mc-proxy)
- mcq.metacircular.net → public (requires edge routing)
name = "mcq"
node = "rift"
active = true
[[components]]
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
volumes = ["/srv/mcq:/srv/mcq"]
cmd = ["server", "--config", "/srv/mcq/mcq.toml"]
# Internal route: handled by rift's mc-proxy.
[[components.routes]]
name = "internal"
port = 8443
mode = "l7"
# Public route: master detects this and sets up edge routing on svc.
[[components.routes]]
name = "public"
port = 8443
mode = "l7"
hostname = "mcq.metacircular.net"
Deploy Flow with Edge Routing
When the master receives Deploy(mcq):
1. Route to worker: Master sends a Deploy RPC to rift's agent with the full service spec. Rift's agent deploys the container and registers mc-proxy routes for all hostnames (both internal and public) on its local mc-proxy.
2. Detect public hostnames: Master inspects the service spec for hostnames that are not .svc.mcp. subdomains.
3. Validate domains: Master checks that mcq.metacircular.net falls under an allowed domain (metacircular.net ✓).
4. Resolve edge node: Master performs a DNS lookup for mcq.metacircular.net. If it's a CNAME to svc.metacircular.net, the master resolves svc.metacircular.net to identify the edge node as svc. If DNS is not yet configured (no CNAME), the master uses the default edge node from config.
5. Set up edge route: Master sends a SetupEdgeRoute RPC to svc's agent:

       SetupEdgeRoute(
         hostname: "mcq.metacircular.net"
         backend_hostname: "mcq.svc.mcp.metacircular.net"
         backend_port: 8443
       )

6. Svc agent provisions: On receiving SetupEdgeRoute, svc's agent:
   a. Provisions a TLS certificate from Metacrypt for mcq.metacircular.net.
   b. Registers an L7 route in its local mc-proxy: mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443 with the provisioned cert.
7. Master records the edge route in its own registry for undeploy cleanup.
Undeploy Flow
When the master receives Undeploy(mcq):
1. Look up edge routes: Master checks its registry for edge routes associated with mcq.
2. Remove edge route: Master sends RemoveEdgeRoute(mcq.metacircular.net) to svc's agent. Svc's agent removes the mc-proxy route and cleans up the cert.
3. Undeploy on worker: Master sends an Undeploy RPC to rift's agent. Rift's agent tears down the container, routes, DNS, and certs as in v1.
Edge Node DNS Resolution
The master determines which edge node handles a public hostname by checking DNS:
1. Look up mcq.metacircular.net → CNAME svc.metacircular.net
2. Look up svc.metacircular.net → IP address
3. Match the IP against known edge nodes
If no CNAME exists yet (operator hasn't set it up), the master warns but
does not fail. The operator sets up DNS manually at Hurricane Electric.
The master can provide a mcp dns check command that verifies all public
hostnames resolve correctly.
Agent Changes for v2
New RPCs
// Edge routing -- called by master on edge nodes.
rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);
message SetupEdgeRouteRequest {
string hostname = 1; // public hostname (e.g. "mcq.metacircular.net")
string backend_hostname = 2; // internal hostname (e.g. "mcq.svc.mcp.metacircular.net")
int32 backend_port = 3; // port on the worker's mc-proxy (e.g. 8443)
}
message SetupEdgeRouteResponse {}
message RemoveEdgeRouteRequest {
string hostname = 1;
}
message RemoveEdgeRouteResponse {}
message ListEdgeRoutesRequest {}
message ListEdgeRoutesResponse {
repeated EdgeRoute routes = 1;
}
message EdgeRoute {
string hostname = 1;
string backend_hostname = 2;
int32 backend_port = 3;
string cert_serial = 4;
string cert_expires = 5;
}
SetupEdgeRoute Implementation
When the agent receives SetupEdgeRoute:
1. Resolve backend address: The agent resolves backend_hostname to an IP address (the worker node's overlay IP). It uses the port from the request to form the backend address (e.g., 100.95.252.120:8443).
2. Provision TLS cert: The agent calls Metacrypt's CA API to issue a certificate for the public hostname. The cert and key are written to the mc-proxy cert directory.
3. Register mc-proxy route: The agent adds an L7 route to its local mc-proxy:
   - Listener: :443
   - Hostname: mcq.metacircular.net
   - Backend: 100.95.252.120:8443
   - Mode: l7
   - TLS cert/key: the provisioned cert
   - Backend TLS: true (the worker's mc-proxy serves TLS)
4. Record the edge route in the agent's local registry for listing and cleanup.
Cert Provisioning on All Agents
All agents need Metacrypt configuration to provision certs:
[metacrypt]
server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
ca_cert = "/srv/mcp/certs/metacircular-ca.pem"
mount = "pki"
issuer = "infra"
token_path = "/srv/mcp/metacrypt-token"
The svc agent provisions certs for public hostnames. The rift agent provisions certs for internal hostnames. Both use the same Metacrypt API.
CLI Changes for v2
The CLI's [[nodes]] config is replaced by a single master address:
[master]
address = "100.x.x.x:9555"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp"
[auth]
token_path = "/home/kyle/.config/mcp/token"
[services]
dir = "/home/kyle/.config/mcp/services"
Commands that currently iterate over nodes (mcp ps, mcp list,
mcp node list) instead query the master, which aggregates from all
agents.
Service definition files remain on the operator's workstation. The CLI
pushes them to the master on mcp deploy and mcp sync.
Migration Plan
Phase 1: Deploy mcp-agent on svc
svc currently has no mcp-agent. Install and configure one:
- Create mcp user on svc (Debian: install-agent.sh)
- Install mcp-agent binary
- Configure with Metacrypt and mc-proxy socket access
- Verify with mcp node list (svc shows up)
Phase 2: Add edge routing RPCs to agents
Implement SetupEdgeRoute, RemoveEdgeRoute, ListEdgeRoutes on the
agent. Test by calling them directly from the CLI before the master exists.
Phase 3: Build mcp-master
Start with the core coordination loop:
- Accept Deploy from CLI
- Forward to the correct agent
- Detect public hostnames
- Call SetupEdgeRoute on the edge agent
Phase 4: Provision straylight
New node (straylight) takes over as master and hosts core infrastructure:
- Deploy mcp-agent on straylight
- Migrate mcns, metacrypt, mcr from rift to straylight
- Deploy mcp-master on straylight
- Update CLI config to point at master
Phase 5: Cut over
- Update DNS to point *.svc.mcp.metacircular.net at straylight
- Update service definitions to use new node assignments
- Verify all services via mcp ps and public endpoint tests
Open Questions
- Master HA: mcp-master is a single point of failure. For v2, this is acceptable (the operator can SSH to agents directly if the master is down). v3 could add master replication or make agents self-sufficient for local operations when the master is unreachable.
- Service placement: v2 still requires explicit node assignment in service definitions. Automatic placement based on resource availability is a future concern.
- Cert renewal on edge: Edge certs have a 90-day TTL. The edge agent needs a renewal loop (similar to the existing renewWindow check in EnsureCert) or the master needs to periodically re-check edge routes.
- mc-proxy database vs config: mc-proxy persists routes in SQLite, which can diverge from the TOML config. The agent should be the sole manager of mc-proxy routes via the gRPC admin API, not the TOML file. This avoids the stale-database problem encountered during v1 operations on svc.
- straylight hardware: What hardware is straylight? Does it run NixOS or Debian? Does it use rootless podman like rift?
- Mono-repo for core infrastructure: The current layout has each service as a separate git repo under ~/src/metacircular/. A mono-repo for core infrastructure (mcp, mcp-master, mcns, metacrypt, mcr, mc-proxy, mcdsl) would simplify coordinated changes (e.g., a proto change that touches agent + CLI + mc-proxy client), eliminate the uses_mcdsl build flag / vendoring, enable a single CI pipeline, and allow atomic platform versioning (one tag per release). Non-core application services (exo, mcq, mcdoc, sgard, kls, mcat) would remain as separate repos with independent release cadences. This is a large migration best tackled after straylight is running and the master exists, when the build/deploy pipeline is already being reorganized.