mcp/ARCHITECTURE_V2.md (2026-03-29)

MCP v2 -- Multi-Node Control Plane

Overview

MCP v2 introduces multi-node orchestration with a master/agent topology. The CLI no longer dials agents directly. A dedicated mcp-master daemon coordinates deployments across nodes, handles cross-node concerns (edge routing, certificate provisioning, DNS), and serves as the single control point for the platform.

Motivation

v1 deployed successfully on a single node (rift) but exposed operational pain points once services needed public-facing routes through svc:

  • Manual edge routing: Exposing mcq.metacircular.net required hand-editing mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating the SQLite database when the config and database diverged, and debugging silent failures. Every redeployment risked breaking the public route.

  • Dynamic port instability: The route system assigns ephemeral host ports that change on every deploy. svc's mc-proxy pointed at a specific port (e.g., 100.95.252.120:48080), which went stale after redeployment. Container ports are also localhost-only under rootless podman, requiring explicit Tailscale IP bindings for external access.

  • $PORT env override conflict: The mcdsl config loader overrides listen_addr from $PORT when routes are present. This meant containers ignored their configured port and listened on the route-allocated one instead, breaking explicit port mappings that expected the config port.

  • Cert chain issues: mc-proxy requires full certificate chains (leaf + intermediates). Certs provisioned outside the standard metacrypt flow were leaf-only and caused silent TLS handshake failures (client_bytes=7 backend_bytes=0 with no error logged).

  • mc-proxy database divergence: mc-proxy persists routes in SQLite. Routes added via the admin API override the TOML config. Editing the TOML alone had no effect until the database was manually updated -- a failure mode that took hours to diagnose.

  • No cross-node coordination: The v1 CLI talks directly to individual agents. There is no mechanism for one agent to tell another "set up a route for this service." Every cross-node operation was manual.

v2 addresses all of these by making the master the single coordination point for deployments, with agents handling local concerns (containers, mc-proxy routes, cert provisioning) on instruction from the master.

What Changes from v1

Concern              v1                                          v2
CLI target           CLI dials agents directly                   CLI dials the master
Node awareness       CLI routes by node field in service defs    Master owns the node registry
Service definitions  Live on operator workstation                Pushed to master, which distributes to agents
Edge routing         Manual mc-proxy config on svc               Master coordinates agent-to-agent setup
Cert provisioning    Agent provisions for local mc-proxy only    Any agent can provision certs (edge included)
DNS registration     Agent registers records on deploy           Master coordinates DNS across zones

What Stays the Same

The agent's core responsibilities are unchanged: it manages containers via podman, stores its local registry in SQLite, monitors for drift, and alerts the operator. The agent gains new RPCs for edge routing but does not become aware of other nodes -- the master handles all cross-node coordination.


Topology

Operator workstation (vade)
  ┌──────────────────────────┐
  │  mcp (CLI)               │
  │                          │
  │  gRPC ───────────────────┼─── overlay ───┐
  └──────────────────────────┘               │
                                             ▼
Master node (straylight)
  ┌──────────────────────────────────────────────────────┐
  │  mcp-master                                          │
  │    ├── node registry (all nodes, roles, addresses)   │
  │    ├── service definitions (pushed from CLI)         │
  │    └── deployment coordinator                        │
  │                                                      │
  │  mcp-agent                                           │
  │    ├── mcns container                                │
  │    ├── metacrypt container                           │
  │    ├── mcr container                                 │
  │    └── mc-proxy (straylight)                         │
  └──────────┬──────────────────────────┬────────────────┘
             │                          │
         overlay                    overlay
             │                          │
             ▼                          ▼
Worker node (rift)              Edge node (svc)
  ┌─────────────────────┐      ┌───────────────────────────┐
  │  mcp-agent          │      │  mcp-agent                │
  │    ├── exo          │      │    ├── mc-proxy (svc)     │
  │    ├── mcq          │      │    └── (edge routes only) │
  │    ├── mcdoc        │      │                           │
  │    ├── sgard        │      │  Edge routes:             │
  │    ├── kls          │      │    mcq.metacircular.net   │
  │    └── mc-proxy     │      │    mcdoc.metacircular.net │
  │        (rift)       │      │    exo.metacircular.net   │
  └─────────────────────┘      │    sgard.metacircular.net │
                               └───────────────────────────┘

Node Roles

Role    Purpose                                                                                                    Nodes
master  Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point.  straylight
worker  Runs mcp-agent. Hosts application services.                                                                rift
edge    Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers.           svc

Every node runs an mcp-agent. The master node also runs mcp-master. The master's local agent manages the infrastructure services (mcns, metacrypt, mcr) the same way rift's agent manages application services.

mc-proxy Mesh

Each node runs its own mc-proxy instance. They form a routing mesh:

mc-proxy (straylight)
  ├── :443  L7 routes for metacrypt-web, mcr-web
  ├── :8443 L4 passthrough for metacrypt-api, mcr-api
  └── :9443 L4 passthrough for gRPC services

mc-proxy (rift)
  ├── :443  L7 routes for internal .svc.mcp hostnames
  └── :8443 L4/L7 routes for internal APIs

mc-proxy (svc)
  └── :443  L7 termination for public hostnames
            → forwards to internal .svc.mcp endpoints

mcp-master

The master is a new binary that coordinates cross-node operations. It is not a replacement for the agent -- it sits above agents and orchestrates them.

Responsibilities

  1. Accept CLI commands via gRPC (deploy, undeploy, status, sync).
  2. Route deployments to the correct agent based on the service definition's node field.
  3. Detect public hostnames in service definitions and coordinate edge routing with the edge node's agent.
  4. Validate public hostnames against a configured allowlist of domains (e.g., metacircular.net, wntrmute.net).
  5. Resolve edge nodes by checking DNS CNAME records to determine which node handles public traffic for a given hostname.
  6. Coordinate undeploy across nodes: tear down the service on the worker, then clean up edge routes on the edge node.

What the Master Does NOT Do

  • Store container state (agents own their registries).
  • Manage container lifecycle directly (agents do this).
  • Run containers (the co-located agent does).
  • Replace the agent on any node.

Master Configuration

[server]
grpc_addr = "100.x.x.x:9555"     # master listens on overlay
tls_cert  = "/srv/mcp-master/certs/cert.pem"
tls_key   = "/srv/mcp-master/certs/key.pem"

[mcias]
server_url   = "https://mcias.metacircular.net:8443"
service_name = "mcp-master"

# Allowed public domains. Hostnames in service definitions must fall
# under one of these suffixes.
[edge]
allowed_domains = ["metacircular.net", "wntrmute.net"]

# Node registry. The master knows about all nodes.
[[nodes]]
name    = "straylight"
address = "100.x.x.x:9444"
role    = "master"

[[nodes]]
name    = "rift"
address = "100.95.252.120:9444"
role    = "worker"

[[nodes]]
name    = "svc"
address = "100.x.x.x:9444"
role    = "edge"

Edge Routing

The core v2 feature: when a service declares a public hostname, the master automatically provisions the edge route.

Service Definition

Public hostnames are declared in the route's hostname field. The master distinguishes public from internal hostnames by checking whether they fall under a .svc.mcp. subdomain:

  • mcq.svc.mcp.metacircular.net → internal (handled by local mc-proxy)
  • mcq.metacircular.net → public (requires edge routing)

name = "mcq"
node = "rift"
active = true

[[components]]
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
volumes = ["/srv/mcq:/srv/mcq"]
cmd = ["server", "--config", "/srv/mcq/mcq.toml"]

# Internal route: handled by rift's mc-proxy.
[[components.routes]]
name = "internal"
port = 8443
mode = "l7"

# Public route: master detects this and sets up edge routing on svc.
[[components.routes]]
name = "public"
port = 8443
mode = "l7"
hostname = "mcq.metacircular.net"
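
The hostname classification and the domain allowlist described above can be sketched as two small predicates. This is an illustrative Go sketch; the function names and exact matching rules are assumptions, not the actual codebase:

```go
package main

import (
	"fmt"
	"strings"
)

// Allowed public domains, mirroring the master's [edge] allowed_domains.
var allowedDomains = []string{"metacircular.net", "wntrmute.net"}

// isInternal reports whether a route hostname is an internal .svc.mcp.
// name, handled by the worker's local mc-proxy rather than an edge node.
func isInternal(hostname string) bool {
	return strings.Contains(hostname, ".svc.mcp.")
}

// isAllowedPublic reports whether a public hostname falls under one of
// the allowed domain suffixes. Matching on "."+domain keeps the check on
// label boundaries, so "evilmetacircular.net" does not match.
func isAllowedPublic(hostname string) bool {
	for _, d := range allowedDomains {
		if hostname == d || strings.HasSuffix(hostname, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	for _, h := range []string{
		"mcq.svc.mcp.metacircular.net",
		"mcq.metacircular.net",
		"mcq.example.com",
	} {
		fmt.Printf("%s internal=%v allowed=%v\n", h, isInternal(h), isAllowedPublic(h))
	}
}
```

A route with no hostname at all (like the "internal" route above) needs neither check; the master only inspects routes that declare one.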

Deploy Flow with Edge Routing

When the master receives Deploy(mcq):

  1. Route to worker: Master sends Deploy RPC to rift's agent with the full service spec. Rift's agent deploys the container and registers mc-proxy routes for all hostnames (both internal and public) on its local mc-proxy.

  2. Detect public hostnames: Master inspects the service spec for hostnames that are not .svc.mcp. subdomains.

  3. Validate domains: Master checks that mcq.metacircular.net falls under an allowed domain (metacircular.net ✓).

  4. Resolve edge node: Master performs a DNS lookup for mcq.metacircular.net. If it's a CNAME to svc.metacircular.net, the master resolves svc.metacircular.net to identify the edge node as svc. If DNS is not yet configured (no CNAME), the master uses the default edge node from config.

  5. Set up edge route: Master sends a SetupEdgeRoute RPC to svc's agent:

    SetupEdgeRoute(
      hostname:         "mcq.metacircular.net"
      backend_hostname: "mcq.svc.mcp.metacircular.net"
      backend_port:     8443
    )
    
  6. Svc agent provisions: On receiving SetupEdgeRoute, svc's agent:
     a. Provisions a TLS certificate from Metacrypt for mcq.metacircular.net.
     b. Registers an L7 route in its local mc-proxy: mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443 with the provisioned cert.

  7. Master records the edge route in its own registry for undeploy cleanup.
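
The deploy flow above can be sketched as a coordination loop on the master. Everything here is illustrative: the type and method names (AgentClient, DeployService, the fake agents) are assumptions, and gRPC plumbing, DNS-based edge resolution, allowlist validation, and error recovery are elided:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the service spec.
type Route struct {
	Hostname string // empty for routes without a hostname
	Port     int32
}

type Service struct {
	Name   string
	Node   string
	Routes []Route
}

// AgentClient abstracts the per-node agent RPCs the deploy flow needs.
type AgentClient interface {
	Deploy(svc Service) error
	SetupEdgeRoute(hostname, backend string, port int32) error
}

// Master holds the node registry and records edge routes for cleanup.
type Master struct {
	Agents      map[string]AgentClient // node name -> agent client
	DefaultEdge string                 // fallback edge node from config
	EdgeRoutes  map[string]string      // public hostname -> edge node
}

func isInternal(h string) bool { return strings.Contains(h, ".svc.mcp.") }

// DeployService: deploy on the worker, then set up an edge route for
// every public (non-.svc.mcp.) hostname in the spec.
func (m *Master) DeployService(svc Service) error {
	// Step 1: the worker's agent deploys containers and local routes.
	if err := m.Agents[svc.Node].Deploy(svc); err != nil {
		return err
	}
	for _, r := range svc.Routes {
		// Steps 2-3: detect public hostnames (allowlist check elided).
		if r.Hostname == "" || isInternal(r.Hostname) {
			continue
		}
		// Step 4: DNS-based edge resolution elided; use the default.
		edge := m.DefaultEdge
		// Step 5: the edge agent provisions the cert and mc-proxy route.
		backend := svc.Name + ".svc.mcp.metacircular.net"
		if err := m.Agents[edge].SetupEdgeRoute(r.Hostname, backend, r.Port); err != nil {
			return err
		}
		// Step 7: record for undeploy cleanup.
		m.EdgeRoutes[r.Hostname] = edge
	}
	return nil
}

// fakeAgent logs calls; a real client would speak gRPC over the overlay.
type fakeAgent struct{ name string }

func (a fakeAgent) Deploy(svc Service) error {
	fmt.Printf("[%s] deploy %s\n", a.name, svc.Name)
	return nil
}

func (a fakeAgent) SetupEdgeRoute(h, b string, p int32) error {
	fmt.Printf("[%s] edge route %s -> %s:%d\n", a.name, h, b, p)
	return nil
}

func main() {
	m := &Master{
		Agents:      map[string]AgentClient{"rift": fakeAgent{"rift"}, "svc": fakeAgent{"svc"}},
		DefaultEdge: "svc",
		EdgeRoutes:  map[string]string{},
	}
	mcq := Service{Name: "mcq", Node: "rift", Routes: []Route{
		{Port: 8443},
		{Hostname: "mcq.metacircular.net", Port: 8443},
	}}
	if err := m.DeployService(mcq); err != nil {
		panic(err)
	}
}
```

A real implementation would carry contexts and deadlines on every agent call so one unreachable node cannot hang a deploy.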

Undeploy Flow

When the master receives Undeploy(mcq):

  1. Look up edge routes: Master checks its registry for edge routes associated with mcq.
  2. Remove edge route: Master sends RemoveEdgeRoute(mcq.metacircular.net) to svc's agent. Svc's agent removes the mc-proxy route and cleans up the cert.
  3. Undeploy on worker: Master sends Undeploy RPC to rift's agent. Rift's agent tears down the container, routes, DNS, and certs as in v1.

Edge Node DNS Resolution

The master determines which edge node handles a public hostname by checking DNS:

  1. Look up mcq.metacircular.net → CNAME svc.metacircular.net
  2. Look up svc.metacircular.net → IP address
  3. Match the IP against known edge nodes

If no CNAME exists yet (the operator hasn't set it up), the master warns but does not fail. The operator sets up DNS manually at Hurricane Electric. The master can provide an mcp dns check command that verifies all public hostnames resolve correctly.
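
The three lookup steps can be sketched with injectable resolver functions, which keeps the fallback logic testable without live DNS. Names and shapes are assumptions; a real implementation would wrap net.LookupCNAME and net.LookupHost (note that LookupCNAME returns the canonical name with a trailing dot):

```go
package main

import (
	"fmt"
	"strings"
)

// EdgeResolver maps a public hostname to an edge node name.
// lookupCNAME and lookupIP stand in for net.LookupCNAME / net.LookupHost.
type EdgeResolver struct {
	lookupCNAME func(host string) (string, bool)
	lookupIP    func(host string) (string, bool)
	edgeByIP    map[string]string // known edge node IPs -> node name
	defaultEdge string
}

// Resolve follows hostname -> CNAME -> IP -> known edge node. If no
// CNAME exists yet, it warns and falls back to the configured default.
func (r EdgeResolver) Resolve(hostname string) string {
	target, ok := r.lookupCNAME(hostname)
	if !ok {
		fmt.Printf("warning: no CNAME for %s; using default edge %q\n", hostname, r.defaultEdge)
		return r.defaultEdge
	}
	// Strip the trailing dot from the canonical DNS name.
	if ip, ok := r.lookupIP(strings.TrimSuffix(target, ".")); ok {
		if node, ok := r.edgeByIP[ip]; ok {
			return node
		}
	}
	return r.defaultEdge
}

func main() {
	r := EdgeResolver{
		lookupCNAME: func(h string) (string, bool) {
			if h == "mcq.metacircular.net" {
				return "svc.metacircular.net.", true
			}
			return "", false
		},
		lookupIP: func(h string) (string, bool) {
			if h == "svc.metacircular.net" {
				return "203.0.113.10", true // documentation-range placeholder IP
			}
			return "", false
		},
		edgeByIP:    map[string]string{"203.0.113.10": "svc"},
		defaultEdge: "svc",
	}
	fmt.Println(r.Resolve("mcq.metacircular.net")) // svc
}
```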


Agent Changes for v2

New RPCs

// Edge routing -- called by master on edge nodes.
rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);

message SetupEdgeRouteRequest {
  string hostname = 1;          // public hostname (e.g. "mcq.metacircular.net")
  string backend_hostname = 2;  // internal hostname (e.g. "mcq.svc.mcp.metacircular.net")
  int32 backend_port = 3;       // port on the worker's mc-proxy (e.g. 8443)
}

message SetupEdgeRouteResponse {}

message RemoveEdgeRouteRequest {
  string hostname = 1;
}

message RemoveEdgeRouteResponse {}

message ListEdgeRoutesRequest {}

message ListEdgeRoutesResponse {
  repeated EdgeRoute routes = 1;
}

message EdgeRoute {
  string hostname = 1;
  string backend_hostname = 2;
  int32 backend_port = 3;
  string cert_serial = 4;
  string cert_expires = 5;
}

SetupEdgeRoute Implementation

When the agent receives SetupEdgeRoute:

  1. Resolve backend address: The agent resolves backend_hostname to an IP address (the worker node's overlay IP). It uses the port from the request to form the backend address (e.g., 100.95.252.120:8443).

  2. Provision TLS cert: The agent calls Metacrypt's CA API to issue a certificate for the public hostname. The cert and key are written to the mc-proxy cert directory.

  3. Register mc-proxy route: The agent adds an L7 route to its local mc-proxy:

    • Listener: :443
    • Hostname: mcq.metacircular.net
    • Backend: 100.95.252.120:8443
    • Mode: l7
    • TLS cert/key: the provisioned cert
    • Backend TLS: true (worker's mc-proxy serves TLS)

  4. Record the edge route in the agent's local registry for listing and cleanup.
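
The four steps can be sketched as an agent-side handler. The CertIssuer and ProxyAdmin interfaces below are placeholders for the Metacrypt CA client and the mc-proxy gRPC admin API; the paths and the hard-coded IP are illustrative only, not the actual agent code:

```go
package main

import "fmt"

type EdgeRouteSpec struct {
	Hostname        string
	BackendHostname string
	BackendPort     int32
}

// CertIssuer stands in for the Metacrypt CA client.
type CertIssuer interface {
	Issue(hostname string) (certPath, keyPath string, err error)
}

// ProxyAdmin stands in for mc-proxy's gRPC admin API. Routes go through
// the API, never the TOML file, so the SQLite store stays authoritative.
type ProxyAdmin interface {
	AddL7Route(listener, hostname, backend, certPath, keyPath string, backendTLS bool) error
}

// resolveIP stands in for a real overlay DNS lookup (e.g. net.LookupHost).
func resolveIP(hostname string) string { return "100.95.252.120" } // placeholder

// SetupEdgeRoute implements the four steps: resolve, provision, register, record.
func SetupEdgeRoute(req EdgeRouteSpec, ca CertIssuer, proxy ProxyAdmin, registry map[string]EdgeRouteSpec) error {
	// 1. Resolve the backend hostname to the worker's overlay IP.
	backend := fmt.Sprintf("%s:%d", resolveIP(req.BackendHostname), req.BackendPort)
	// 2. Provision a TLS cert for the public hostname from Metacrypt.
	cert, key, err := ca.Issue(req.Hostname)
	if err != nil {
		return err
	}
	// 3. Register the L7 route; backend TLS is on because the worker's
	//    mc-proxy itself serves TLS.
	if err := proxy.AddL7Route(":443", req.Hostname, backend, cert, key, true); err != nil {
		return err
	}
	// 4. Record the route in the agent's local registry.
	registry[req.Hostname] = req
	return nil
}

// Fakes for demonstration; a real agent wires in Metacrypt and mc-proxy.
type fakeCA struct{}

func (fakeCA) Issue(h string) (string, string, error) {
	return "/srv/certs/" + h + ".pem", "/srv/certs/" + h + ".key", nil // illustrative paths
}

type fakeProxy struct{}

func (fakeProxy) AddL7Route(l, h, b, c, k string, tls bool) error {
	fmt.Printf("route %s%s -> %s (backend_tls=%v)\n", h, l, b, tls)
	return nil
}

func main() {
	reg := map[string]EdgeRouteSpec{}
	req := EdgeRouteSpec{"mcq.metacircular.net", "mcq.svc.mcp.metacircular.net", 8443}
	if err := SetupEdgeRoute(req, fakeCA{}, fakeProxy{}, reg); err != nil {
		panic(err)
	}
}
```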

Cert Provisioning on All Agents

All agents need Metacrypt configuration to provision certs:

[metacrypt]
server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
ca_cert    = "/srv/mcp/certs/metacircular-ca.pem"
mount      = "pki"
issuer     = "infra"
token_path = "/srv/mcp/metacrypt-token"

The svc agent provisions certs for public hostnames. The rift agent provisions certs for internal hostnames. Both use the same Metacrypt API.


CLI Changes for v2

The CLI's [[nodes]] config is replaced by a single master address:

[master]
address = "100.x.x.x:9555"

[mcias]
server_url   = "https://mcias.metacircular.net:8443"
service_name = "mcp"

[auth]
token_path = "/home/kyle/.config/mcp/token"

[services]
dir = "/home/kyle/.config/mcp/services"

Commands that iterate over nodes in v1 (mcp ps, mcp list, mcp node list) now query the master, which aggregates results from all agents.
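
Aggregation on the master can be a straightforward concurrent fan-out to every registered agent, so one slow or down node does not serialize the whole query. A sketch; the ListContainers RPC and its payload are assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// StatusClient stands in for the agent RPC the master fans out to.
type StatusClient interface {
	ListContainers() ([]string, error)
}

// aggregate queries every agent concurrently and collects results per
// node. An unreachable node gets a marker instead of failing the query.
func aggregate(agents map[string]StatusClient) map[string][]string {
	var mu sync.Mutex
	var wg sync.WaitGroup
	out := make(map[string][]string)
	for name, c := range agents {
		wg.Add(1)
		go func(name string, c StatusClient) {
			defer wg.Done()
			containers, err := c.ListContainers()
			if err != nil {
				containers = []string{"<unreachable>"}
			}
			mu.Lock()
			out[name] = containers
			mu.Unlock()
		}(name, c)
	}
	wg.Wait()
	return out
}

// fakeNode returns canned data; a real client would call the agent's gRPC API.
type fakeNode struct{ containers []string }

func (f fakeNode) ListContainers() ([]string, error) { return f.containers, nil }

func main() {
	agents := map[string]StatusClient{
		"rift":       fakeNode{[]string{"mcq", "mcdoc"}},
		"straylight": fakeNode{[]string{"mcns", "metacrypt", "mcr"}},
	}
	for node, cs := range aggregate(agents) {
		fmt.Println(node, cs)
	}
}
```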

Service definition files remain on the operator's workstation. The CLI pushes them to the master on mcp deploy and mcp sync.


Migration Plan

Phase 1: Deploy mcp-agent on svc

svc currently has no mcp-agent. Install and configure one:

  • Create mcp user on svc (Debian: install-agent.sh)
  • Install mcp-agent binary
  • Configure with Metacrypt and mc-proxy socket access
  • Verify with mcp node list (svc shows up)

Phase 2: Add edge routing RPCs to agents

Implement SetupEdgeRoute, RemoveEdgeRoute, ListEdgeRoutes on the agent. Test by calling them directly from the CLI before the master exists.

Phase 3: Build mcp-master

Start with the core coordination loop:

  1. Accept Deploy from CLI
  2. Forward to the correct agent
  3. Detect public hostnames
  4. Call SetupEdgeRoute on the edge agent

Phase 4: Provision straylight

New node (straylight) takes over as master and hosts core infrastructure:

  1. Deploy mcp-agent on straylight
  2. Migrate mcns, metacrypt, mcr from rift to straylight
  3. Deploy mcp-master on straylight
  4. Update CLI config to point at master

Phase 5: Cut over

  • Update DNS to point *.svc.mcp.metacircular.net at straylight
  • Update service definitions to use new node assignments
  • Verify all services via mcp ps and public endpoint tests

Open Questions

  1. Master HA: mcp-master is a single point of failure. For v2, this is acceptable (operator can SSH to agents directly if master is down). v3 could add master replication or make agents self-sufficient for local operations when the master is unreachable.

  2. Service placement: v2 still requires explicit node assignment in service definitions. Automatic placement based on resource availability is a future concern.

  3. Cert renewal on edge: Edge certs have a 90-day TTL. The edge agent needs a renewal loop (similar to the existing renewWindow check in EnsureCert) or the master needs to periodically re-check edge routes.

  4. mc-proxy database vs config: mc-proxy persists routes in SQLite, which can diverge from the TOML config. The agent should be the sole manager of mc-proxy routes via the gRPC admin API, not the TOML file. This avoids the stale-database problem encountered during v1 operations on svc.

  5. straylight hardware: What hardware is straylight? Does it run NixOS or Debian? Does it use rootless podman like rift?

  6. Mono-repo for core infrastructure: The current layout has each service as a separate git repo under ~/src/metacircular/. A mono-repo for core infrastructure (mcp, mcp-master, mcns, metacrypt, mcr, mc-proxy, mcdsl) would simplify coordinated changes (e.g., a proto change that touches agent + CLI + mc-proxy client), eliminate the uses_mcdsl build flag / vendoring, enable a single CI pipeline, and allow atomic platform versioning (one tag per release). Non-core application services (exo, mcq, mcdoc, sgard, kls, mcat) would remain as separate repos with independent release cadences. This is a large migration best tackled after straylight is running and the master exists, when the build/deploy pipeline is already being reorganized.