From fa8ba6fac16f0e32022387bf9033657d04f72d70 Mon Sep 17 00:00:00 2001 From: Kyle Isom Date: Thu, 2 Apr 2026 11:09:09 -0700 Subject: [PATCH] Move ARCHITECTURE_V2.md to metacircular docs The v2 architecture doc is platform-wide (covers master, agents, edge routing, snapshots, migration across all nodes). Moved to docs/architecture-v2.md in the metacircular workspace repo. Co-Authored-By: Claude Opus 4.6 (1M context) --- ARCHITECTURE_V2.md | 502 --------------------------------------------- 1 file changed, 502 deletions(-) delete mode 100644 ARCHITECTURE_V2.md diff --git a/ARCHITECTURE_V2.md b/ARCHITECTURE_V2.md deleted file mode 100644 index 1234432..0000000 --- a/ARCHITECTURE_V2.md +++ /dev/null @@ -1,502 +0,0 @@ -# MCP v2 -- Multi-Node Control Plane - -## Overview - -MCP v2 introduces multi-node orchestration with a master/agent topology. -The CLI no longer dials agents directly. A dedicated **mcp-master** daemon -coordinates deployments across nodes, handles cross-node concerns (edge -routing, certificate provisioning, DNS), and serves as the single control -point for the platform. - -### Motivation - -v1 deployed successfully on a single node (rift) but exposed operational -pain points as services needed public-facing routes through svc: - -- **Manual edge routing**: Exposing mcq.metacircular.net required hand-editing - mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating - the SQLite database when the config and database diverged, and debugging - silent failures. Every redeployment risked breaking the public route. - -- **Dynamic port instability**: The route system assigns ephemeral host ports - that change on every deploy. svc's mc-proxy pointed at a specific port - (e.g., `100.95.252.120:48080`), which went stale after redeployment. - Container ports are also localhost-only under rootless podman, requiring - explicit Tailscale IP bindings for external access. 
- -- **$PORT env override conflict**: The mcdsl config loader overrides - `listen_addr` from `$PORT` when routes are present. This meant containers - ignored their configured port and listened on the route-allocated one - instead, breaking explicit port mappings that expected the config port. - -- **Cert chain issues**: mc-proxy requires full certificate chains (leaf + - intermediates). Certs provisioned outside the standard metacrypt flow - were leaf-only and caused silent TLS handshake failures (`client_bytes=7 - backend_bytes=0` with no error logged). - -- **mc-proxy database divergence**: mc-proxy persists routes in SQLite. - Routes added via the admin API override the TOML config. Editing the TOML - alone had no effect until the database was manually updated -- a failure - mode that took hours to diagnose. - -- **No cross-node coordination**: The v1 CLI talks directly to individual - agents. There is no mechanism for one agent to tell another "set up a - route for this service." Every cross-node operation was manual. - -v2 addresses all of these by making the master the single coordination -point for deployments, with agents handling local concerns (containers, -mc-proxy routes, cert provisioning) on instruction from the master. 
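
The coordination split described above can be sketched in miniature. The Go below is illustrative only -- `Service`, `coordinateDeploy`, and the function parameters standing in for agent RPCs are hypothetical names, not the real mcp-master API; the `.svc.mcp.` check mirrors the public-vs-internal hostname rule this document defines under Edge Routing:

```go
package main

import (
	"fmt"
	"strings"
)

// Service is a hypothetical, pared-down stand-in for a service
// definition: explicit node placement plus its route hostnames.
type Service struct {
	Name      string
	Node      string
	Hostnames []string
}

// isPublic mirrors the rule used in this document: hostnames under a
// ".svc.mcp." subdomain are internal; anything else is public.
func isPublic(hostname string) bool {
	return !strings.Contains(hostname, ".svc.mcp.")
}

// coordinateDeploy sketches the master's role: forward the deploy to
// the owning node's agent, then request edge routes for any public
// hostnames. The two function parameters stand in for gRPC calls.
func coordinateDeploy(svc Service,
	deploy func(node string, svc Service),
	setupEdgeRoute func(hostname string)) {
	deploy(svc.Node, svc)
	for _, h := range svc.Hostnames {
		if isPublic(h) {
			setupEdgeRoute(h)
		}
	}
}

func main() {
	svc := Service{
		Name:      "mcq",
		Node:      "rift",
		Hostnames: []string{"mcq.svc.mcp.metacircular.net", "mcq.metacircular.net"},
	}
	coordinateDeploy(svc,
		func(node string, s Service) { fmt.Printf("deploy %s -> %s\n", s.Name, node) },
		func(h string) { fmt.Printf("edge route for %s\n", h) },
	)
	// Output:
	// deploy mcq -> rift
	// edge route for mcq.metacircular.net
}
```

The real master also validates hostnames against the allowed-domain list and resolves the edge node via DNS, as described later in this document.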
- -### What Changes from v1 - -| Concern | v1 | v2 | -|---------|----|----| -| CLI target | CLI dials agents directly | CLI dials the master | -| Node awareness | CLI routes by `node` field in service defs | Master owns the node registry | -| Service definitions | Live on operator workstation | Pushed to master, which distributes to agents | -| Edge routing | Manual mc-proxy config on svc | Master coordinates agent-to-agent setup | -| Cert provisioning | Agent provisions for local mc-proxy only | Any agent can provision certs (edge included) | -| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones | - -### What Stays the Same - -The agent's core responsibilities are unchanged: it manages containers via -podman, stores its local registry in SQLite, monitors for drift, and alerts -the operator. The agent gains new RPCs for edge routing but does not become -aware of other nodes -- the master handles all cross-node coordination. - ---- - -## Topology - -``` -Operator workstation (vade) - ┌──────────────────────────┐ - │ mcp (CLI) │ - │ │ - │ gRPC ───────────────────┼─── overlay ───┐ - └──────────────────────────┘ │ - ▼ -Master node (straylight) - ┌──────────────────────────────────────────────────────┐ - │ mcp-master │ - │ ├── node registry (all nodes, roles, addresses) │ - │ ├── service definitions (pushed from CLI) │ - │ └── deployment coordinator │ - │ │ - │ mcp-agent │ - │ ├── mcns container │ - │ ├── metacrypt container │ - │ ├── mcr container │ - │ └── mc-proxy (straylight) │ - └──────────┬──────────────────────────┬────────────────┘ - │ │ - overlay overlay - │ │ - ▼ ▼ -Worker node (rift) Edge node (svc) - ┌─────────────────────┐ ┌─────────────────────────┐ - │ mcp-agent │ │ mcp-agent │ - │ ├── exo │ │ ├── mc-proxy (svc) │ - │ ├── mcq │ │ └── (edge routes only) │ - │ ├── mcdoc │ │ │ - │ ├── sgard │ │ Edge routes: │ - │ ├── kls │ │ mcq.metacircular.net │ - │ └── mc-proxy │ │ mcdoc.metacircular.net │ - │ (rift) │ │ 
exo.metacircular.net │ - └─────────────────────┘ │ sgard.metacircular.net │ - └─────────────────────────┘ -``` - -### Node Roles - -| Role | Purpose | Nodes | -|------|---------|-------| -| **master** | Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point. | straylight | -| **worker** | Runs mcp-agent. Hosts application services. | rift | -| **edge** | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc | - -Every node runs an mcp-agent. The master node also runs mcp-master. -The master's local agent manages the infrastructure services (mcns, -metacrypt, mcr) the same way rift's agent manages application services. - -### mc-proxy Mesh - -Each node runs its own mc-proxy instance. They form a routing mesh: - -``` -mc-proxy (straylight) - ├── :443 L7 routes for metacrypt-web, mcr-web - ├── :8443 L4 passthrough for metacrypt-api, mcr-api - └── :9443 L4 passthrough for gRPC services - -mc-proxy (rift) - ├── :443 L7 routes for internal .svc.mcp hostnames - └── :8443 L4/L7 routes for internal APIs - -mc-proxy (svc) - └── :443 L7 termination for public hostnames - → forwards to internal .svc.mcp endpoints -``` - ---- - -## mcp-master - -The master is a new binary that coordinates cross-node operations. It is -**not** a replacement for the agent -- it sits above agents and orchestrates -them. - -### Responsibilities - -1. **Accept CLI commands** via gRPC (deploy, undeploy, status, sync). -2. **Route deployments** to the correct agent based on the service - definition's `node` field. -3. **Detect public hostnames** in service definitions and coordinate edge - routing with the edge node's agent. -4. **Validate public hostnames** against a configured allowlist of domains - (e.g., `metacircular.net`, `wntrmute.net`). -5. **Resolve edge nodes** by checking DNS CNAME records to determine which - node handles public traffic for a given hostname. -6. 
**Coordinate undeploy** across nodes: tear down the service on the - worker, then clean up edge routes on the edge node. - -### What the Master Does NOT Do - -- Store container state (agents own their registries). -- Manage container lifecycle directly (agents do this). -- Run containers (the co-located agent does). -- Replace the agent on any node. - -### Master Configuration - -```toml -[server] -grpc_addr = "100.x.x.x:9555" # master listens on overlay -tls_cert = "/srv/mcp-master/certs/cert.pem" -tls_key = "/srv/mcp-master/certs/key.pem" - -[mcias] -server_url = "https://mcias.metacircular.net:8443" -service_name = "mcp-master" - -# Allowed public domains. Hostnames in service definitions must fall -# under one of these suffixes. -[edge] -allowed_domains = ["metacircular.net", "wntrmute.net"] - -# Node registry. The master knows about all nodes. -[[nodes]] -name = "straylight" -address = "100.x.x.x:9444" -role = "master" - -[[nodes]] -name = "rift" -address = "100.95.252.120:9444" -role = "worker" - -[[nodes]] -name = "svc" -address = "100.x.x.x:9444" -role = "edge" -``` - ---- - -## Edge Routing - -The core v2 feature: when a service declares a public hostname, the -master automatically provisions the edge route. - -### Service Definition - -Public hostnames are declared in the route's `hostname` field. The -master distinguishes public from internal hostnames by checking whether -they fall under a `.svc.mcp.` subdomain: - -- `mcq.svc.mcp.metacircular.net` → internal (handled by local mc-proxy) -- `mcq.metacircular.net` → public (requires edge routing) - -```toml -name = "mcq" -node = "rift" -active = true - -[[components]] -name = "mcq" -image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0" -volumes = ["/srv/mcq:/srv/mcq"] -cmd = ["server", "--config", "/srv/mcq/mcq.toml"] - -# Internal route: handled by rift's mc-proxy. -[[components.routes]] -name = "internal" -port = 8443 -mode = "l7" - -# Public route: master detects this and sets up edge routing on svc. 
-[[components.routes]]
-name = "public"
-port = 8443
-mode = "l7"
-hostname = "mcq.metacircular.net"
-```
-
-### Deploy Flow with Edge Routing
-
-When the master receives `Deploy(mcq)`:
-
-1. **Route to worker**: Master sends `Deploy` RPC to rift's agent with
-   the full service spec. Rift's agent deploys the container and
-   registers mc-proxy routes for all hostnames (both internal and public)
-   on its local mc-proxy.
-
-2. **Detect public hostnames**: Master inspects the service spec for
-   hostnames that are not `.svc.mcp.` subdomains.
-
-3. **Validate domains**: Master checks that `mcq.metacircular.net` falls
-   under an allowed domain (`metacircular.net` ✓).
-
-4. **Resolve edge node**: Master performs a DNS lookup for
-   `mcq.metacircular.net`. If it's a CNAME to `svc.metacircular.net`,
-   the master resolves `svc.metacircular.net` to identify the edge node
-   as `svc`. If DNS is not yet configured (no CNAME), the master uses
-   the default edge node from config.
-
-5. **Set up edge route**: Master sends a `SetupEdgeRoute` RPC to svc's
-   agent:
-   ```
-   SetupEdgeRoute(
-     hostname: "mcq.metacircular.net"
-     backend_hostname: "mcq.svc.mcp.metacircular.net"
-     backend_port: 8443
-   )
-   ```
-
-6. **Svc agent provisions**: On receiving `SetupEdgeRoute`, svc's agent:
-   a. Provisions a TLS certificate from Metacrypt for
-      `mcq.metacircular.net`.
-   b. Registers an L7 route in its local mc-proxy:
-      `mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443`
-      with the provisioned cert.
-
-7. **Master records the edge route** in its own registry for undeploy
-   cleanup.
-
-### Undeploy Flow
-
-When the master receives `Undeploy(mcq)`:
-
-1. **Look up edge routes**: Master checks its registry for edge routes
-   associated with mcq.
-2. **Remove edge route**: Master sends `RemoveEdgeRoute(mcq.metacircular.net)`
-   to svc's agent. Svc's agent removes the mc-proxy route and cleans up
-   the cert.
-3. **Undeploy on worker**: Master sends `Undeploy` RPC to rift's agent.
- Rift's agent tears down the container, routes, DNS, and certs as in v1. - -### Edge Node DNS Resolution - -The master determines which edge node handles a public hostname by -checking DNS: - -1. Look up `mcq.metacircular.net` → CNAME `svc.metacircular.net` -2. Look up `svc.metacircular.net` → IP address -3. Match the IP against known edge nodes - -If no CNAME exists yet (operator hasn't set it up), the master warns but -does not fail. The operator sets up DNS manually at Hurricane Electric. -The master can provide a `mcp dns check` command that verifies all public -hostnames resolve correctly. - ---- - -## Agent Changes for v2 - -### New RPCs - -```protobuf -// Edge routing -- called by master on edge nodes. -rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse); -rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse); -rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse); - -message SetupEdgeRouteRequest { - string hostname = 1; // public hostname (e.g. "mcq.metacircular.net") - string backend_hostname = 2; // internal hostname (e.g. "mcq.svc.mcp.metacircular.net") - int32 backend_port = 3; // port on the worker's mc-proxy (e.g. 8443) -} - -message SetupEdgeRouteResponse {} - -message RemoveEdgeRouteRequest { - string hostname = 1; -} - -message RemoveEdgeRouteResponse {} - -message ListEdgeRoutesRequest {} - -message ListEdgeRoutesResponse { - repeated EdgeRoute routes = 1; -} - -message EdgeRoute { - string hostname = 1; - string backend_hostname = 2; - int32 backend_port = 3; - string cert_serial = 4; - string cert_expires = 5; -} -``` - -### SetupEdgeRoute Implementation - -When the agent receives `SetupEdgeRoute`: - -1. **Resolve backend address**: The agent resolves `backend_hostname` to - an IP address (the worker node's overlay IP). It uses the port from - the request to form the backend address (e.g., `100.95.252.120:8443`). - -2. 
**Provision TLS cert**: The agent calls Metacrypt's CA API to issue a - certificate for the public hostname. The cert and key are written to - the mc-proxy cert directory. - -3. **Register mc-proxy route**: The agent adds an L7 route to its local - mc-proxy: - - Listener: `:443` - - Hostname: `mcq.metacircular.net` - - Backend: `100.95.252.120:8443` - - Mode: `l7` - - TLS cert/key: the provisioned cert - - Backend TLS: `true` (worker's mc-proxy serves TLS) - -4. **Record the edge route** in the agent's local registry for listing - and cleanup. - -### Cert Provisioning on All Agents - -All agents need Metacrypt configuration to provision certs: - -```toml -[metacrypt] -server_url = "https://metacrypt.svc.mcp.metacircular.net:8443" -ca_cert = "/srv/mcp/certs/metacircular-ca.pem" -mount = "pki" -issuer = "infra" -token_path = "/srv/mcp/metacrypt-token" -``` - -The svc agent provisions certs for public hostnames. The rift agent -provisions certs for internal hostnames. Both use the same Metacrypt API. - ---- - -## CLI Changes for v2 - -The CLI's `[[nodes]]` config is replaced by a single master address: - -```toml -[master] -address = "100.x.x.x:9555" - -[mcias] -server_url = "https://mcias.metacircular.net:8443" -service_name = "mcp" - -[auth] -token_path = "/home/kyle/.config/mcp/token" - -[services] -dir = "/home/kyle/.config/mcp/services" -``` - -Commands that currently iterate over nodes (`mcp ps`, `mcp list`, -`mcp node list`) instead query the master, which aggregates from all -agents. - -Service definition files remain on the operator's workstation. The CLI -pushes them to the master on `mcp deploy` and `mcp sync`. - ---- - -## Migration Plan - -### Phase 1: Deploy mcp-agent on svc - -svc currently has no mcp-agent. 
Install and configure one: - -- Create `mcp` user on svc (Debian: `install-agent.sh`) -- Install mcp-agent binary -- Configure with Metacrypt and mc-proxy socket access -- Verify with `mcp node list` (svc shows up) - -### Phase 2: Add edge routing RPCs to agents - -Implement `SetupEdgeRoute`, `RemoveEdgeRoute`, `ListEdgeRoutes` on the -agent. Test by calling them directly from the CLI before the master exists. - -### Phase 3: Build mcp-master - -Start with the core coordination loop: - -1. Accept `Deploy` from CLI -2. Forward to the correct agent -3. Detect public hostnames -4. Call `SetupEdgeRoute` on the edge agent - -### Phase 4: Provision straylight - -New node (straylight) takes over as master and hosts core infrastructure: - -1. Deploy mcp-agent on straylight -2. Migrate mcns, metacrypt, mcr from rift to straylight -3. Deploy mcp-master on straylight -4. Update CLI config to point at master - -### Phase 5: Cut over - -- Update DNS to point `*.svc.mcp.metacircular.net` at straylight -- Update service definitions to use new node assignments -- Verify all services via `mcp ps` and public endpoint tests - ---- - -## Open Questions - -1. **Master HA**: mcp-master is a single point of failure. For v2, this - is acceptable (operator can SSH to agents directly if master is down). - v3 could add master replication or make agents self-sufficient for - local operations when the master is unreachable. - -2. **Service placement**: v2 still requires explicit `node` assignment - in service definitions. Automatic placement based on resource - availability is a future concern. - -3. **Cert renewal on edge**: Edge certs have a 90-day TTL. The edge - agent needs a renewal loop (similar to the existing `renewWindow` - check in `EnsureCert`) or the master needs to periodically re-check - edge routes. - -4. **mc-proxy database vs config**: mc-proxy persists routes in SQLite, - which can diverge from the TOML config. 
The agent should be the sole - manager of mc-proxy routes via the gRPC admin API, not the TOML file. - This avoids the stale-database problem encountered during v1 - operations on svc. - -5. **straylight hardware**: What hardware is straylight? Does it run - NixOS or Debian? Does it use rootless podman like rift? - -6. **Mono-repo for core infrastructure**: The current layout has each - service as a separate git repo under `~/src/metacircular/`. A - mono-repo for core infrastructure (mcp, mcp-master, mcns, metacrypt, - mcr, mc-proxy, mcdsl) would simplify coordinated changes (e.g., a - proto change that touches agent + CLI + mc-proxy client), eliminate - the `uses_mcdsl` build flag / vendoring, enable a single CI pipeline, - and allow atomic platform versioning (one tag per release). Non-core - application services (exo, mcq, mcdoc, sgard, kls, mcat) would - remain as separate repos with independent release cadences. This is - a large migration best tackled after straylight is running and the - master exists, when the build/deploy pipeline is already being - reorganized.