# MCP v2 -- Multi-Node Control Plane

## Overview

MCP v2 introduces multi-node orchestration with a master/agent topology. The CLI no longer dials agents directly. A dedicated **mcp-master** daemon coordinates deployments across nodes, handles cross-node concerns (edge routing, certificate provisioning, DNS), and serves as the single control point for the platform.

### Motivation

v1 deployed successfully on a single node (rift) but exposed operational pain points as services needed public-facing routes through svc:

- **Manual edge routing**: Exposing mcq.metacircular.net required hand-editing mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating the SQLite database when the config and database diverged, and debugging silent failures. Every redeployment risked breaking the public route.
- **Dynamic port instability**: The route system assigns ephemeral host ports that change on every deploy. svc's mc-proxy pointed at a specific port (e.g., `100.95.252.120:48080`), which went stale after redeployment. Container ports are also localhost-only under rootless podman, requiring explicit Tailscale IP bindings for external access.
- **$PORT env override conflict**: The mcdsl config loader overrides `listen_addr` from `$PORT` when routes are present. This meant containers ignored their configured port and listened on the route-allocated one instead, breaking explicit port mappings that expected the config port.
- **Cert chain issues**: mc-proxy requires full certificate chains (leaf + intermediates). Certs provisioned outside the standard metacrypt flow were leaf-only and caused silent TLS handshake failures (`client_bytes=7 backend_bytes=0` with no error logged).
- **mc-proxy database divergence**: mc-proxy persists routes in SQLite. Routes added via the admin API override the TOML config. Editing the TOML alone had no effect until the database was manually updated -- a failure mode that took hours to diagnose.
- **No cross-node coordination**: The v1 CLI talks directly to individual agents. There is no mechanism for one agent to tell another "set up a route for this service." Every cross-node operation was manual.

v2 addresses all of these by making the master the single coordination point for deployments, with agents handling local concerns (containers, mc-proxy routes, cert provisioning) on instruction from the master.

### What Changes from v1

| Concern | v1 | v2 |
|---------|----|----|
| CLI target | CLI dials agents directly | CLI dials the master |
| Node awareness | CLI routes by `node` field in service defs | Master owns the node registry |
| Service definitions | Live on operator workstation | Pushed to master, which distributes to agents |
| Edge routing | Manual mc-proxy config on svc | Master coordinates agent-to-agent setup |
| Cert provisioning | Agent provisions for local mc-proxy only | Any agent can provision certs (edge included) |
| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones |

### What Stays the Same

The agent's core responsibilities are unchanged: it manages containers via podman, stores its local registry in SQLite, monitors for drift, and alerts the operator. The agent gains new RPCs for edge routing but does not become aware of other nodes -- the master handles all cross-node coordination.
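As a rough sketch of the node registry the master owns, the following Go types mirror the `[[nodes]]` entries in the master config. The type and field names here are hypothetical illustrations, not the actual mcp-master source.

```go
package main

import "fmt"

// Role distinguishes the three node types: master, worker, edge.
type Role string

const (
	RoleMaster Role = "master"
	RoleWorker Role = "worker"
	RoleEdge   Role = "edge"
)

// Node mirrors one [[nodes]] entry in the master's config.
type Node struct {
	Name    string
	Address string // overlay address of the node's mcp-agent
	Role    Role
}

// Registry is the master's view of all nodes, keyed by name.
type Registry map[string]Node

// ByRole returns every node with the given role, e.g. all edge nodes.
func (r Registry) ByRole(role Role) []Node {
	var out []Node
	for _, n := range r {
		if n.Role == role {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	reg := Registry{
		"straylight": {Name: "straylight", Address: "100.x.x.x:9444", Role: RoleMaster},
		"rift":       {Name: "rift", Address: "100.95.252.120:9444", Role: RoleWorker},
		"svc":        {Name: "svc", Address: "100.x.x.x:9444", Role: RoleEdge},
	}
	fmt.Println(reg.ByRole(RoleEdge)[0].Name) // prints "svc"
}
```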
---

## Topology

```
Operator workstation (vade)
┌──────────────────────────┐
│ mcp (CLI)                │
│                          │
│ gRPC ────────────────────┼─── overlay ───┐
└──────────────────────────┘               │
                                           ▼
Master node (straylight)
┌──────────────────────────────────────────────────────┐
│ mcp-master                                           │
│  ├── node registry (all nodes, roles, addresses)     │
│  ├── service definitions (pushed from CLI)           │
│  └── deployment coordinator                          │
│                                                      │
│ mcp-agent                                            │
│  ├── mcns container                                  │
│  ├── metacrypt container                             │
│  ├── mcr container                                   │
│  └── mc-proxy (straylight)                           │
└──────────┬──────────────────────────┬────────────────┘
           │                          │
        overlay                    overlay
           │                          │
           ▼                          ▼
Worker node (rift)            Edge node (svc)
┌─────────────────────┐       ┌─────────────────────────┐
│ mcp-agent           │       │ mcp-agent               │
│  ├── exo            │       │  ├── mc-proxy (svc)     │
│  ├── mcq            │       │  └── (edge routes only) │
│  ├── mcdoc          │       │                         │
│  ├── sgard          │       │ Edge routes:            │
│  ├── kls            │       │  mcq.metacircular.net   │
│  └── mc-proxy       │       │  mcdoc.metacircular.net │
│      (rift)         │       │  exo.metacircular.net   │
└─────────────────────┘       │  sgard.metacircular.net │
                              └─────────────────────────┘
```

### Node Roles

| Role | Purpose | Nodes |
|------|---------|-------|
| **master** | Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point. | straylight |
| **worker** | Runs mcp-agent. Hosts application services. | rift |
| **edge** | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc |

Every node runs an mcp-agent. The master node also runs mcp-master. The master's local agent manages the infrastructure services (mcns, metacrypt, mcr) the same way rift's agent manages application services.

### mc-proxy Mesh

Each node runs its own mc-proxy instance.
They form a routing mesh:

```
mc-proxy (straylight)
├── :443   L7 routes for metacrypt-web, mcr-web
├── :8443  L4 passthrough for metacrypt-api, mcr-api
└── :9443  L4 passthrough for gRPC services

mc-proxy (rift)
├── :443   L7 routes for internal .svc.mcp hostnames
└── :8443  L4/L7 routes for internal APIs

mc-proxy (svc)
└── :443   L7 termination for public hostnames
           → forwards to internal .svc.mcp endpoints
```

---

## mcp-master

The master is a new binary that coordinates cross-node operations. It is **not** a replacement for the agent -- it sits above agents and orchestrates them.

### Responsibilities

1. **Accept CLI commands** via gRPC (deploy, undeploy, status, sync).
2. **Route deployments** to the correct agent based on the service definition's `node` field.
3. **Detect public hostnames** in service definitions and coordinate edge routing with the edge node's agent.
4. **Validate public hostnames** against a configured allowlist of domains (e.g., `metacircular.net`, `wntrmute.net`).
5. **Resolve edge nodes** by checking DNS CNAME records to determine which node handles public traffic for a given hostname.
6. **Coordinate undeploy** across nodes: tear down the service on the worker, then clean up edge routes on the edge node.

### What the Master Does NOT Do

- Store container state (agents own their registries).
- Manage container lifecycle directly (agents do this).
- Run containers (the co-located agent does).
- Replace the agent on any node.

### Master Configuration

```toml
[server]
grpc_addr = "100.x.x.x:9555"  # master listens on overlay
tls_cert = "/srv/mcp-master/certs/cert.pem"
tls_key = "/srv/mcp-master/certs/key.pem"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp-master"

# Allowed public domains. Hostnames in service definitions must fall
# under one of these suffixes.
[edge]
allowed_domains = ["metacircular.net", "wntrmute.net"]

# Node registry. The master knows about all nodes.
[[nodes]]
name = "straylight"
address = "100.x.x.x:9444"
role = "master"

[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
role = "worker"

[[nodes]]
name = "svc"
address = "100.x.x.x:9444"
role = "edge"
```

---

## Edge Routing

The core v2 feature: when a service declares a public hostname, the master automatically provisions the edge route.

### Service Definition

Public hostnames are declared in the route's `hostname` field. The master distinguishes public from internal hostnames by checking whether they fall under a `.svc.mcp.` subdomain:

- `mcq.svc.mcp.metacircular.net` → internal (handled by local mc-proxy)
- `mcq.metacircular.net` → public (requires edge routing)

```toml
name = "mcq"
node = "rift"
active = true

[[components]]
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
volumes = ["/srv/mcq:/srv/mcq"]
cmd = ["server", "--config", "/srv/mcq/mcq.toml"]

# Internal route: handled by rift's mc-proxy.
[[components.routes]]
name = "internal"
port = 8443
mode = "l7"

# Public route: master detects this and sets up edge routing on svc.
[[components.routes]]
name = "public"
port = 8443
mode = "l7"
hostname = "mcq.metacircular.net"
```

### Deploy Flow with Edge Routing

When the master receives `Deploy(mcq)`:

1. **Route to worker**: The master sends a `Deploy` RPC to rift's agent with the full service spec. Rift's agent deploys the container and registers mc-proxy routes for all hostnames (both internal and public) on its local mc-proxy.
2. **Detect public hostnames**: The master inspects the service spec for hostnames that are not `.svc.mcp.` subdomains.
3. **Validate domains**: The master checks that `mcq.metacircular.net` falls under an allowed domain (`metacircular.net` ✓).
4. **Resolve edge node**: The master performs a DNS lookup for `mcq.metacircular.net`. If it's a CNAME to `svc.metacircular.net`, the master resolves `svc.metacircular.net` to identify the edge node as `svc`.
   If DNS is not yet configured (no CNAME), the master uses the default edge node from config.
5. **Set up edge route**: The master sends a `SetupEdgeRoute` RPC to svc's agent:

   ```
   SetupEdgeRoute(
     hostname: "mcq.metacircular.net"
     backend_hostname: "mcq.svc.mcp.metacircular.net"
     backend_port: 8443
   )
   ```

6. **Svc agent provisions**: On receiving `SetupEdgeRoute`, svc's agent:
   a. Provisions a TLS certificate from Metacrypt for `mcq.metacircular.net`.
   b. Registers an L7 route in its local mc-proxy: `mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443` with the provisioned cert.
7. **Master records the edge route** in its own registry for undeploy cleanup.

### Undeploy Flow

When the master receives `Undeploy(mcq)`:

1. **Look up edge routes**: The master checks its registry for edge routes associated with mcq.
2. **Remove edge route**: The master sends `RemoveEdgeRoute(mcq.metacircular.net)` to svc's agent. Svc's agent removes the mc-proxy route and cleans up the cert.
3. **Undeploy on worker**: The master sends an `Undeploy` RPC to rift's agent. Rift's agent tears down the container, routes, DNS, and certs as in v1.

### Edge Node DNS Resolution

The master determines which edge node handles a public hostname by checking DNS:

1. Look up `mcq.metacircular.net` → CNAME `svc.metacircular.net`
2. Look up `svc.metacircular.net` → IP address
3. Match the IP against known edge nodes

If no CNAME exists yet (operator hasn't set it up), the master warns but does not fail. The operator sets up DNS manually at Hurricane Electric. The master can provide an `mcp dns check` command that verifies all public hostnames resolve correctly.

---

## Agent Changes for v2

### New RPCs

```protobuf
// Edge routing -- called by master on edge nodes.
rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);

message SetupEdgeRouteRequest {
  string hostname = 1;          // public hostname (e.g. "mcq.metacircular.net")
  string backend_hostname = 2;  // internal hostname (e.g. "mcq.svc.mcp.metacircular.net")
  int32 backend_port = 3;       // port on the worker's mc-proxy (e.g. 8443)
}

message SetupEdgeRouteResponse {}

message RemoveEdgeRouteRequest {
  string hostname = 1;
}

message RemoveEdgeRouteResponse {}

message ListEdgeRoutesRequest {}

message ListEdgeRoutesResponse {
  repeated EdgeRoute routes = 1;
}

message EdgeRoute {
  string hostname = 1;
  string backend_hostname = 2;
  int32 backend_port = 3;
  string cert_serial = 4;
  string cert_expires = 5;
}
```

### SetupEdgeRoute Implementation

When the agent receives `SetupEdgeRoute`:

1. **Resolve backend address**: The agent resolves `backend_hostname` to an IP address (the worker node's overlay IP). It uses the port from the request to form the backend address (e.g., `100.95.252.120:8443`).
2. **Provision TLS cert**: The agent calls Metacrypt's CA API to issue a certificate for the public hostname. The cert and key are written to the mc-proxy cert directory.
3. **Register mc-proxy route**: The agent adds an L7 route to its local mc-proxy:
   - Listener: `:443`
   - Hostname: `mcq.metacircular.net`
   - Backend: `100.95.252.120:8443`
   - Mode: `l7`
   - TLS cert/key: the provisioned cert
   - Backend TLS: `true` (worker's mc-proxy serves TLS)
4. **Record the edge route** in the agent's local registry for listing and cleanup.
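The master-side checks that precede a `SetupEdgeRoute` call (classifying a hostname as public via the `.svc.mcp.` rule, then validating it against `allowed_domains`) can be sketched as pure helpers. This is a sketch under the rules stated above, not the actual mcp-master code; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isPublicHostname reports whether a route hostname needs edge routing.
// Anything under a ".svc.mcp." subdomain is internal; everything else
// is public and must be validated against the allowlist.
func isPublicHostname(hostname string) bool {
	return !strings.Contains(hostname, ".svc.mcp.")
}

// underAllowedDomain checks a public hostname against the master's
// [edge] allowed_domains list: the hostname must equal an allowed
// domain or fall under it as a subdomain.
func underAllowedDomain(hostname string, allowed []string) bool {
	for _, d := range allowed {
		if hostname == d || strings.HasSuffix(hostname, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"metacircular.net", "wntrmute.net"}
	fmt.Println(isPublicHostname("mcq.svc.mcp.metacircular.net")) // false: internal
	fmt.Println(isPublicHostname("mcq.metacircular.net"))         // true: public
	fmt.Println(underAllowedDomain("mcq.metacircular.net", allowed)) // true: allowed
}
```

Note the suffix check prepends a dot, so `evilmetacircular.net` does not sneak past the `metacircular.net` allowlist entry.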
### Cert Provisioning on All Agents

All agents need Metacrypt configuration to provision certs:

```toml
[metacrypt]
server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
ca_cert = "/srv/mcp/certs/metacircular-ca.pem"
mount = "pki"
issuer = "infra"
token_path = "/srv/mcp/metacrypt-token"
```

The svc agent provisions certs for public hostnames. The rift agent provisions certs for internal hostnames. Both use the same Metacrypt API.

---

## CLI Changes for v2

The CLI's `[[nodes]]` config is replaced by a single master address:

```toml
[master]
address = "100.x.x.x:9555"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp"

[auth]
token_path = "/home/kyle/.config/mcp/token"

[services]
dir = "/home/kyle/.config/mcp/services"
```

Commands that currently iterate over nodes (`mcp ps`, `mcp list`, `mcp node list`) instead query the master, which aggregates from all agents. Service definition files remain on the operator's workstation. The CLI pushes them to the master on `mcp deploy` and `mcp sync`.

---

## Migration Plan

### Phase 1: Deploy mcp-agent on svc

svc currently has no mcp-agent. Install and configure one:

- Create `mcp` user on svc (Debian: `install-agent.sh`)
- Install mcp-agent binary
- Configure with Metacrypt and mc-proxy socket access
- Verify with `mcp node list` (svc shows up)

### Phase 2: Add edge routing RPCs to agents

Implement `SetupEdgeRoute`, `RemoveEdgeRoute`, `ListEdgeRoutes` on the agent. Test by calling them directly from the CLI before the master exists.

### Phase 3: Build mcp-master

Start with the core coordination loop:

1. Accept `Deploy` from CLI
2. Forward to the correct agent
3. Detect public hostnames
4. Call `SetupEdgeRoute` on the edge agent

### Phase 4: Provision straylight

New node (straylight) takes over as master and hosts core infrastructure:

1. Deploy mcp-agent on straylight
2. Migrate mcns, metacrypt, mcr from rift to straylight
3. Deploy mcp-master on straylight
4. Update CLI config to point at master

### Phase 5: Cut over

- Update DNS to point `*.svc.mcp.metacircular.net` at straylight
- Update service definitions to use new node assignments
- Verify all services via `mcp ps` and public endpoint tests

---

## Open Questions

1. **Master HA**: mcp-master is a single point of failure. For v2, this is acceptable (operator can SSH to agents directly if master is down). v3 could add master replication or make agents self-sufficient for local operations when the master is unreachable.
2. **Service placement**: v2 still requires explicit `node` assignment in service definitions. Automatic placement based on resource availability is a future concern.
3. **Cert renewal on edge**: Edge certs have a 90-day TTL. The edge agent needs a renewal loop (similar to the existing `renewWindow` check in `EnsureCert`) or the master needs to periodically re-check edge routes.
4. **mc-proxy database vs config**: mc-proxy persists routes in SQLite, which can diverge from the TOML config. The agent should be the sole manager of mc-proxy routes via the gRPC admin API, not the TOML file. This avoids the stale-database problem encountered during v1 operations on svc.
5. **straylight hardware**: What hardware is straylight? Does it run NixOS or Debian? Does it use rootless podman like rift?
6. **Mono-repo for core infrastructure**: The current layout has each service as a separate git repo under `~/src/metacircular/`. A mono-repo for core infrastructure (mcp, mcp-master, mcns, metacrypt, mcr, mc-proxy, mcdsl) would simplify coordinated changes (e.g., a proto change that touches agent + CLI + mc-proxy client), eliminate the `uses_mcdsl` build flag / vendoring, enable a single CI pipeline, and allow atomic platform versioning (one tag per release). Non-core application services (exo, mcq, mcdoc, sgard, kls, mcat) would remain as separate repos with independent release cadences.
This is a large migration best tackled after straylight is running and the master exists, when the build/deploy pipeline is already being reorganized.