4 Commits

Author SHA1 Message Date
52914d50b0 Pass mode, backend-tls, and tls cert/key through route add
The --mode flag was defined but never wired through to the RPC.
Add tls_cert and tls_key fields to AddProxyRouteRequest so L7
routes can be created via mcp route add.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:44:44 -07:00
bb4bee51ba Add mono-repo consideration to ARCHITECTURE_V2.md open questions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:40:32 -07:00
4ac8a6d60b Add ARCHITECTURE_V2.md for multi-node master/agent topology
Documents the planned v2 architecture: mcp-master on straylight
coordinates deployments across worker (rift) and edge (svc) nodes.
Includes edge routing flow, agent RPCs, migration plan, and
operational issues from v1 that motivate the redesign.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:37:24 -07:00
d8f45ca520 Merge explicit ports with route-allocated ports during deploy
Previously, explicit port mappings from the service definition were
ignored when routes were present. Now both are included, allowing
services to have stable external port bindings alongside dynamic
route-allocated ports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 19:28:40 -07:00
6 changed files with 547 additions and 9 deletions

ARCHITECTURE_V2.md (new file, 502 lines)

@@ -0,0 +1,502 @@
# MCP v2 -- Multi-Node Control Plane
## Overview
MCP v2 introduces multi-node orchestration with a master/agent topology.
The CLI no longer dials agents directly. A dedicated **mcp-master** daemon
coordinates deployments across nodes, handles cross-node concerns (edge
routing, certificate provisioning, DNS), and serves as the single control
point for the platform.
### Motivation
v1 deployed successfully on a single node (rift) but exposed operational
pain points as services needed public-facing routes through svc:
- **Manual edge routing**: Exposing mcq.metacircular.net required hand-editing
mc-proxy's TOML config on svc, provisioning a TLS cert manually, updating
the SQLite database when the config and database diverged, and debugging
silent failures. Every redeployment risked breaking the public route.
- **Dynamic port instability**: The route system assigns ephemeral host ports
that change on every deploy. svc's mc-proxy pointed at a specific port
(e.g., `100.95.252.120:48080`), which went stale after redeployment.
Container ports are also localhost-only under rootless podman, requiring
explicit Tailscale IP bindings for external access.
- **$PORT env override conflict**: The mcdsl config loader overrides
`listen_addr` from `$PORT` when routes are present. This meant containers
ignored their configured port and listened on the route-allocated one
instead, breaking explicit port mappings that expected the config port.
- **Cert chain issues**: mc-proxy requires full certificate chains (leaf +
intermediates). Certs provisioned outside the standard metacrypt flow
were leaf-only and caused silent TLS handshake failures (`client_bytes=7
backend_bytes=0` with no error logged).
- **mc-proxy database divergence**: mc-proxy persists routes in SQLite.
Routes added via the admin API override the TOML config. Editing the TOML
alone had no effect until the database was manually updated -- a failure
mode that took hours to diagnose.
- **No cross-node coordination**: The v1 CLI talks directly to individual
agents. There is no mechanism for one agent to tell another "set up a
route for this service." Every cross-node operation was manual.
v2 addresses all of these by making the master the single coordination
point for deployments, with agents handling local concerns (containers,
mc-proxy routes, cert provisioning) on instruction from the master.
### What Changes from v1
| Concern | v1 | v2 |
|---------|----|----|
| CLI target | CLI dials agents directly | CLI dials the master |
| Node awareness | CLI routes by `node` field in service defs | Master owns the node registry |
| Service definitions | Live on operator workstation | Pushed to master, which distributes to agents |
| Edge routing | Manual mc-proxy config on svc | Master coordinates agent-to-agent setup |
| Cert provisioning | Agent provisions for local mc-proxy only | Any agent can provision certs (edge included) |
| DNS registration | Agent registers records on deploy | Master coordinates DNS across zones |
### What Stays the Same
The agent's core responsibilities are unchanged: it manages containers via
podman, stores its local registry in SQLite, monitors for drift, and alerts
the operator. The agent gains new RPCs for edge routing but does not become
aware of other nodes -- the master handles all cross-node coordination.
---
## Topology
```
Operator workstation (vade)
┌──────────────────────────┐
│ mcp (CLI) │
│ │
│ gRPC ───────────────────┼─── overlay ───┐
└──────────────────────────┘ │
Master node (straylight)
┌──────────────────────────────────────────────────────┐
│ mcp-master │
│ ├── node registry (all nodes, roles, addresses) │
│ ├── service definitions (pushed from CLI) │
│ └── deployment coordinator │
│ │
│ mcp-agent │
│ ├── mcns container │
│ ├── metacrypt container │
│ ├── mcr container │
│ └── mc-proxy (straylight) │
└──────────┬──────────────────────────┬────────────────┘
│ │
overlay overlay
│ │
▼ ▼
Worker node (rift) Edge node (svc)
┌─────────────────────┐ ┌─────────────────────────┐
│ mcp-agent │ │ mcp-agent │
│ ├── exo │ │ ├── mc-proxy (svc) │
│ ├── mcq │ │ └── (edge routes only) │
│ ├── mcdoc │ │ │
│ ├── sgard │ │ Edge routes: │
│ ├── kls │ │ mcq.metacircular.net │
│ └── mc-proxy │ │ mcdoc.metacircular.net │
│ (rift) │ │ exo.metacircular.net │
└─────────────────────┘ │ sgard.metacircular.net │
└─────────────────────────┘
```
### Node Roles
| Role | Purpose | Nodes |
|------|---------|-------|
| **master** | Runs mcp-master + mcp-agent. Hosts core infrastructure (mcns, metacrypt, mcr). Single coordination point. | straylight |
| **worker** | Runs mcp-agent. Hosts application services. | rift |
| **edge** | Runs mcp-agent. Terminates public TLS, forwards to internal services. No application containers. | svc |
Every node runs an mcp-agent. The master node also runs mcp-master.
The master's local agent manages the infrastructure services (mcns,
metacrypt, mcr) the same way rift's agent manages application services.
### mc-proxy Mesh
Each node runs its own mc-proxy instance. They form a routing mesh:
```
mc-proxy (straylight)
├── :443 L7 routes for metacrypt-web, mcr-web
├── :8443 L4 passthrough for metacrypt-api, mcr-api
└── :9443 L4 passthrough for gRPC services
mc-proxy (rift)
├── :443 L7 routes for internal .svc.mcp hostnames
└── :8443 L4/L7 routes for internal APIs
mc-proxy (svc)
└── :443 L7 termination for public hostnames
→ forwards to internal .svc.mcp endpoints
```
---
## mcp-master
The master is a new binary that coordinates cross-node operations. It is
**not** a replacement for the agent -- it sits above agents and orchestrates
them.
### Responsibilities
1. **Accept CLI commands** via gRPC (deploy, undeploy, status, sync).
2. **Route deployments** to the correct agent based on the service
definition's `node` field.
3. **Detect public hostnames** in service definitions and coordinate edge
routing with the edge node's agent.
4. **Validate public hostnames** against a configured allowlist of domains
(e.g., `metacircular.net`, `wntrmute.net`).
5. **Resolve edge nodes** by checking DNS CNAME records to determine which
node handles public traffic for a given hostname.
6. **Coordinate undeploy** across nodes: tear down the service on the
worker, then clean up edge routes on the edge node.
### What the Master Does NOT Do
- Store container state (agents own their registries).
- Manage container lifecycle directly (agents do this).
- Run containers (the co-located agent does).
- Replace the agent on any node.
### Master Configuration
```toml
[server]
grpc_addr = "100.x.x.x:9555" # master listens on overlay
tls_cert = "/srv/mcp-master/certs/cert.pem"
tls_key = "/srv/mcp-master/certs/key.pem"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp-master"
# Allowed public domains. Hostnames in service definitions must fall
# under one of these suffixes.
[edge]
allowed_domains = ["metacircular.net", "wntrmute.net"]
# Node registry. The master knows about all nodes.
[[nodes]]
name = "straylight"
address = "100.x.x.x:9444"
role = "master"
[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
role = "worker"
[[nodes]]
name = "svc"
address = "100.x.x.x:9444"
role = "edge"
```
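The `allowed_domains` check reduces to a suffix match on the hostname. A minimal sketch, assuming a hypothetical `validatePublicHostname` helper (the master's actual function names are not specified here):

```go
package main

import (
	"fmt"
	"strings"
)

// validatePublicHostname checks a hostname against [edge] allowed_domains.
// A hostname passes if it equals an allowed domain or falls under it as a
// subdomain. Hypothetical helper; names are illustrative.
func validatePublicHostname(hostname string, allowed []string) error {
	for _, domain := range allowed {
		if hostname == domain || strings.HasSuffix(hostname, "."+domain) {
			return nil
		}
	}
	return fmt.Errorf("hostname %q is not under an allowed domain", hostname)
}

func main() {
	allowed := []string{"metacircular.net", "wntrmute.net"}
	fmt.Println(validatePublicHostname("mcq.metacircular.net", allowed))
	fmt.Println(validatePublicHostname("mcq.example.com", allowed))
}
```

Matching on `"."+domain` rather than the bare suffix keeps a hostname like `evilmetacircular.net` from slipping past the `metacircular.net` entry.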
---
## Edge Routing
The core v2 feature: when a service declares a public hostname, the
master automatically provisions the edge route.
### Service Definition
Public hostnames are declared in the route's `hostname` field. The
master distinguishes public from internal hostnames by checking whether
they fall under a `.svc.mcp.` subdomain:
- `mcq.svc.mcp.metacircular.net` → internal (handled by local mc-proxy)
- `mcq.metacircular.net` → public (requires edge routing)
```toml
name = "mcq"
node = "rift"
active = true
[[components]]
name = "mcq"
image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.0"
volumes = ["/srv/mcq:/srv/mcq"]
cmd = ["server", "--config", "/srv/mcq/mcq.toml"]
# Internal route: handled by rift's mc-proxy.
[[components.routes]]
name = "internal"
port = 8443
mode = "l7"
# Public route: master detects this and sets up edge routing on svc.
[[components.routes]]
name = "public"
port = 8443
mode = "l7"
hostname = "mcq.metacircular.net"
```
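The public/internal split above comes down to checking for the `.svc.mcp.` label in the hostname. A minimal sketch (the helper name is an assumption, not the master's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// isPublicHostname reports whether a route hostname needs edge routing.
// Internal hostnames live under a .svc.mcp. subdomain; everything else
// is public. Hypothetical helper mirroring the classification above.
func isPublicHostname(hostname string) bool {
	return !strings.Contains(hostname, ".svc.mcp.")
}

func main() {
	fmt.Println(isPublicHostname("mcq.svc.mcp.metacircular.net")) // internal
	fmt.Println(isPublicHostname("mcq.metacircular.net"))         // public
}
```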
### Deploy Flow with Edge Routing
When the master receives `Deploy(mcq)`:
1. **Route to worker**: Master sends `Deploy` RPC to rift's agent with
the full service spec. Rift's agent deploys the container and
registers mc-proxy routes for all hostnames (both internal and public)
on its local mc-proxy.
2. **Detect public hostnames**: Master inspects the service spec for
hostnames that are not `.svc.mcp.` subdomains.
3. **Validate domains**: Master checks that `mcq.metacircular.net` falls
under an allowed domain (`metacircular.net` ✓).
4. **Resolve edge node**: Master performs a DNS lookup for
`mcq.metacircular.net`. If it's a CNAME to `svc.metacircular.net`,
the master resolves `svc.metacircular.net` to identify the edge node
as `svc`. If DNS is not yet configured (no CNAME), the master uses
the default edge node from config.
5. **Set up edge route**: Master sends a `SetupEdgeRoute` RPC to svc's
agent:
```
SetupEdgeRoute(
hostname: "mcq.metacircular.net"
backend_hostname: "mcq.svc.mcp.metacircular.net"
backend_port: 8443
)
```
6. **Svc agent provisions**: On receiving `SetupEdgeRoute`, svc's agent:
a. Provisions a TLS certificate from Metacrypt for
`mcq.metacircular.net`.
b. Registers an L7 route in its local mc-proxy:
`mcq.metacircular.net:443 → mcq.svc.mcp.metacircular.net:8443`
with the provisioned cert.
7. **Master records the edge route** in its own registry for undeploy
cleanup.
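The ordering in the flow above matters: the worker must be serving before the edge route points at it. A sketch of the coordination sequence with stubbed agent clients — the interface and function names are assumptions, not the actual master API:

```go
package main

import "fmt"

// agentClient is a hypothetical abstraction over a node's gRPC client.
type agentClient interface {
	Deploy(service string) error
	SetupEdgeRoute(hostname, backendHostname string, backendPort int) error
}

// deployWithEdge deploys to the worker first, then provisions the edge
// route, mirroring steps 1 and 5 of the flow above.
func deployWithEdge(worker, edge agentClient, service, publicHost, backendHost string, backendPort int) error {
	if err := worker.Deploy(service); err != nil {
		return fmt.Errorf("deploy on worker: %w", err)
	}
	if err := edge.SetupEdgeRoute(publicHost, backendHost, backendPort); err != nil {
		return fmt.Errorf("edge route: %w", err)
	}
	return nil
}

// logAgent records calls so the sequence is visible.
type logAgent struct {
	node  string
	calls *[]string
}

func (a logAgent) Deploy(service string) error {
	*a.calls = append(*a.calls, a.node+": Deploy "+service)
	return nil
}

func (a logAgent) SetupEdgeRoute(host, backend string, port int) error {
	*a.calls = append(*a.calls, fmt.Sprintf("%s: SetupEdgeRoute %s -> %s:%d", a.node, host, backend, port))
	return nil
}

func main() {
	var calls []string
	worker := logAgent{node: "rift", calls: &calls}
	edge := logAgent{node: "svc", calls: &calls}
	_ = deployWithEdge(worker, edge, "mcq", "mcq.metacircular.net", "mcq.svc.mcp.metacircular.net", 8443)
	for _, c := range calls {
		fmt.Println(c)
	}
}
```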
### Undeploy Flow
When the master receives `Undeploy(mcq)`:
1. **Look up edge routes**: Master checks its registry for edge routes
associated with mcq.
2. **Remove edge route**: Master sends `RemoveEdgeRoute(mcq.metacircular.net)`
to svc's agent. Svc's agent removes the mc-proxy route and cleans up
the cert.
3. **Undeploy on worker**: Master sends `Undeploy` RPC to rift's agent.
Rift's agent tears down the container, routes, DNS, and certs as in v1.
### Edge Node DNS Resolution
The master determines which edge node handles a public hostname by
checking DNS:
1. Look up `mcq.metacircular.net` → CNAME `svc.metacircular.net`
2. Look up `svc.metacircular.net` → IP address
3. Match the IP against known edge nodes
If no CNAME exists yet (operator hasn't set it up), the master warns but
does not fail. The operator sets up DNS manually at Hurricane Electric.
The master can provide an `mcp dns check` command that verifies all public
hostnames resolve correctly.
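The IP-matching in step 3 is a lookup against the node registry. A sketch assuming the DNS answers are already in hand (`net.LookupCNAME` / `net.LookupHost` would supply them; the helper name and addresses are illustrative):

```go
package main

import "fmt"

// edgeNodeForIPs matches resolved IPs for a public hostname against the
// registry's edge-node addresses. Returns the node name, or false if no
// edge node matches (the warn-but-continue case).
func edgeNodeForIPs(ips []string, edgeNodes map[string]string) (string, bool) {
	for _, ip := range ips {
		for name, addr := range edgeNodes {
			if addr == ip {
				return name, true
			}
		}
	}
	return "", false
}

func main() {
	// Addresses are placeholders; the real registry comes from [[nodes]].
	edges := map[string]string{"svc": "203.0.113.7"}
	node, ok := edgeNodeForIPs([]string{"203.0.113.7"}, edges)
	fmt.Println(node, ok)
}
```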
---
## Agent Changes for v2
### New RPCs
```protobuf
// Edge routing -- called by master on edge nodes.
rpc SetupEdgeRoute(SetupEdgeRouteRequest) returns (SetupEdgeRouteResponse);
rpc RemoveEdgeRoute(RemoveEdgeRouteRequest) returns (RemoveEdgeRouteResponse);
rpc ListEdgeRoutes(ListEdgeRoutesRequest) returns (ListEdgeRoutesResponse);
message SetupEdgeRouteRequest {
string hostname = 1; // public hostname (e.g. "mcq.metacircular.net")
string backend_hostname = 2; // internal hostname (e.g. "mcq.svc.mcp.metacircular.net")
int32 backend_port = 3; // port on the worker's mc-proxy (e.g. 8443)
}
message SetupEdgeRouteResponse {}
message RemoveEdgeRouteRequest {
string hostname = 1;
}
message RemoveEdgeRouteResponse {}
message ListEdgeRoutesRequest {}
message ListEdgeRoutesResponse {
repeated EdgeRoute routes = 1;
}
message EdgeRoute {
string hostname = 1;
string backend_hostname = 2;
int32 backend_port = 3;
string cert_serial = 4;
string cert_expires = 5;
}
```
### SetupEdgeRoute Implementation
When the agent receives `SetupEdgeRoute`:
1. **Resolve backend address**: The agent resolves `backend_hostname` to
an IP address (the worker node's overlay IP). It uses the port from
the request to form the backend address (e.g., `100.95.252.120:8443`).
2. **Provision TLS cert**: The agent calls Metacrypt's CA API to issue a
certificate for the public hostname. The cert and key are written to
the mc-proxy cert directory.
3. **Register mc-proxy route**: The agent adds an L7 route to its local
mc-proxy:
- Listener: `:443`
- Hostname: `mcq.metacircular.net`
- Backend: `100.95.252.120:8443`
- Mode: `l7`
- TLS cert/key: the provisioned cert
- Backend TLS: `true` (worker's mc-proxy serves TLS)
4. **Record the edge route** in the agent's local registry for listing
and cleanup.
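Step 1's address formation can use `net.JoinHostPort`. A sketch with an injected resolver so the lookup can be stubbed or routed through mcns — the function names are assumptions:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// resolveBackend turns backend_hostname + backend_port from a
// SetupEdgeRouteRequest into a dialable address. The lookup function is
// injected; in the real agent it would resolve via DNS.
func resolveBackend(lookup func(string) ([]string, error), host string, port int) (string, error) {
	ips, err := lookup(host)
	if err != nil {
		return "", fmt.Errorf("resolve %s: %w", host, err)
	}
	if len(ips) == 0 {
		return "", fmt.Errorf("resolve %s: no addresses", host)
	}
	return net.JoinHostPort(ips[0], strconv.Itoa(port)), nil
}

func main() {
	lookup := func(string) ([]string, error) { return []string{"100.95.252.120"}, nil }
	addr, _ := resolveBackend(lookup, "mcq.svc.mcp.metacircular.net", 8443)
	fmt.Println(addr) // 100.95.252.120:8443
}
```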
### Cert Provisioning on All Agents
All agents need Metacrypt configuration to provision certs:
```toml
[metacrypt]
server_url = "https://metacrypt.svc.mcp.metacircular.net:8443"
ca_cert = "/srv/mcp/certs/metacircular-ca.pem"
mount = "pki"
issuer = "infra"
token_path = "/srv/mcp/metacrypt-token"
```
The svc agent provisions certs for public hostnames. The rift agent
provisions certs for internal hostnames. Both use the same Metacrypt API.
---
## CLI Changes for v2
The CLI's `[[nodes]]` config is replaced by a single master address:
```toml
[master]
address = "100.x.x.x:9555"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
service_name = "mcp"
[auth]
token_path = "/home/kyle/.config/mcp/token"
[services]
dir = "/home/kyle/.config/mcp/services"
```
Commands that currently iterate over nodes (`mcp ps`, `mcp list`,
`mcp node list`) instead query the master, which aggregates from all
agents.
Service definition files remain on the operator's workstation. The CLI
pushes them to the master on `mcp deploy` and `mcp sync`.
---
## Migration Plan
### Phase 1: Deploy mcp-agent on svc
svc currently has no mcp-agent. Install and configure one:
- Create `mcp` user on svc (Debian: `install-agent.sh`)
- Install mcp-agent binary
- Configure with Metacrypt and mc-proxy socket access
- Verify with `mcp node list` (svc shows up)
### Phase 2: Add edge routing RPCs to agents
Implement `SetupEdgeRoute`, `RemoveEdgeRoute`, `ListEdgeRoutes` on the
agent. Test by calling them directly from the CLI before the master exists.
### Phase 3: Build mcp-master
Start with the core coordination loop:
1. Accept `Deploy` from CLI
2. Forward to the correct agent
3. Detect public hostnames
4. Call `SetupEdgeRoute` on the edge agent
### Phase 4: Provision straylight
New node (straylight) takes over as master and hosts core infrastructure:
1. Deploy mcp-agent on straylight
2. Migrate mcns, metacrypt, mcr from rift to straylight
3. Deploy mcp-master on straylight
4. Update CLI config to point at master
### Phase 5: Cut over
- Update DNS to point `*.svc.mcp.metacircular.net` at straylight
- Update service definitions to use new node assignments
- Verify all services via `mcp ps` and public endpoint tests
---
## Open Questions
1. **Master HA**: mcp-master is a single point of failure. For v2, this
is acceptable (operator can SSH to agents directly if master is down).
v3 could add master replication or make agents self-sufficient for
local operations when the master is unreachable.
2. **Service placement**: v2 still requires explicit `node` assignment
in service definitions. Automatic placement based on resource
availability is a future concern.
3. **Cert renewal on edge**: Edge certs have a 90-day TTL. The edge
agent needs a renewal loop (similar to the existing `renewWindow`
check in `EnsureCert`) or the master needs to periodically re-check
edge routes.
4. **mc-proxy database vs config**: mc-proxy persists routes in SQLite,
which can diverge from the TOML config. The agent should be the sole
manager of mc-proxy routes via the gRPC admin API, not the TOML file.
This avoids the stale-database problem encountered during v1
operations on svc.
5. **straylight hardware**: What hardware is straylight? Does it run
NixOS or Debian? Does it use rootless podman like rift?
6. **Mono-repo for core infrastructure**: The current layout has each
service as a separate git repo under `~/src/metacircular/`. A
mono-repo for core infrastructure (mcp, mcp-master, mcns, metacrypt,
mcr, mc-proxy, mcdsl) would simplify coordinated changes (e.g., a
proto change that touches agent + CLI + mc-proxy client), eliminate
the `uses_mcdsl` build flag / vendoring, enable a single CI pipeline,
and allow atomic platform versioning (one tag per release). Non-core
application services (exo, mcq, mcdoc, sgard, kls, mcat) would
remain as separate repos with independent release cadences. This is
a large migration best tackled after straylight is running and the
master exists, when the build/deploy pipeline is already being
reorganized.


@@ -28,17 +28,26 @@ func routeCmd() *cobra.Command {
},
}
var (
routeMode string
backendTLS bool
tlsCert string
tlsKey string
)
add := &cobra.Command{
Use: "add <listener> <hostname> <backend>",
Short: "Add a route to mc-proxy",
Long: "Add a route. Example: mcp route add -n rift :443 mcq.metacircular.net 100.95.252.120:443",
Long: "Add a route. Example: mcp route add -n rift :443 mcq.svc.mcp.metacircular.net 127.0.0.1:48080 --mode l7 --tls-cert /srv/mc-proxy/certs/mcq.pem --tls-key /srv/mc-proxy/certs/mcq.key",
Args: cobra.ExactArgs(3),
RunE: func(_ *cobra.Command, args []string) error {
return runRouteAdd(nodeName, args)
return runRouteAdd(nodeName, args, routeMode, backendTLS, tlsCert, tlsKey)
},
}
add.Flags().String("mode", "l4", "route mode (l4 or l7)")
add.Flags().Bool("backend-tls", false, "re-encrypt traffic to backend")
add.Flags().StringVar(&routeMode, "mode", "l4", "route mode (l4 or l7)")
add.Flags().BoolVar(&backendTLS, "backend-tls", false, "re-encrypt traffic to backend")
add.Flags().StringVar(&tlsCert, "tls-cert", "", "path to TLS cert on the node (required for l7)")
add.Flags().StringVar(&tlsKey, "tls-key", "", "path to TLS key on the node (required for l7)")
remove := &cobra.Command{
Use: "remove <listener> <hostname>",
@@ -138,7 +147,7 @@ func printRoutes(nodeName string, resp *mcpv1.ListProxyRoutesResponse) {
}
}
func runRouteAdd(nodeName string, args []string) error {
func runRouteAdd(nodeName string, args []string, mode string, backendTLS bool, tlsCert, tlsKey string) error {
if nodeName == "" {
return fmt.Errorf("--node is required")
}
@@ -166,12 +175,16 @@ func runRouteAdd(nodeName string, args []string) error {
ListenerAddr: args[0],
Hostname: args[1],
Backend: args[2],
Mode: mode,
BackendTls: backendTLS,
TlsCert: tlsCert,
TlsKey: tlsKey,
})
if err != nil {
return fmt.Errorf("add route: %w", err)
}
fmt.Printf("Added route: %s → %s on %s (%s)\n", args[1], args[2], args[0], nodeName)
fmt.Printf("Added route: %s %s → %s on %s (%s)\n", mode, args[1], args[2], args[0], nodeName)
return nil
}


@@ -2815,6 +2815,8 @@ type AddProxyRouteRequest struct {
Backend string `protobuf:"bytes,3,opt,name=backend,proto3" json:"backend,omitempty"`
Mode string `protobuf:"bytes,4,opt,name=mode,proto3" json:"mode,omitempty"` // "l4" or "l7"
BackendTls bool `protobuf:"varint,5,opt,name=backend_tls,json=backendTls,proto3" json:"backend_tls,omitempty"`
TlsCert string `protobuf:"bytes,6,opt,name=tls_cert,json=tlsCert,proto3" json:"tls_cert,omitempty"` // path to TLS cert (required for l7)
TlsKey string `protobuf:"bytes,7,opt,name=tls_key,json=tlsKey,proto3" json:"tls_key,omitempty"` // path to TLS key (required for l7)
unknownFields protoimpl.UnknownFields
sizeCache protoimpl.SizeCache
}
@@ -2884,6 +2886,20 @@ func (x *AddProxyRouteRequest) GetBackendTls() bool {
return false
}
func (x *AddProxyRouteRequest) GetTlsCert() string {
if x != nil {
return x.TlsCert
}
return ""
}
func (x *AddProxyRouteRequest) GetTlsKey() string {
if x != nil {
return x.TlsKey
}
return ""
}
type AddProxyRouteResponse struct {
state protoimpl.MessageState `protogen:"open.v1"`
unknownFields protoimpl.UnknownFields
@@ -3198,14 +3214,16 @@ const file_proto_mcp_v1_mcp_proto_rawDesc = "" +
"\x11total_connections\x18\x02 \x01(\x03R\x10totalConnections\x129\n" +
"\n" +
"started_at\x18\x03 \x01(\v2\x1a.google.protobuf.TimestampR\tstartedAt\x127\n" +
"\tlisteners\x18\x04 \x03(\v2\x19.mcp.v1.ProxyListenerInfoR\tlisteners\"\xa6\x01\n" +
"\tlisteners\x18\x04 \x03(\v2\x19.mcp.v1.ProxyListenerInfoR\tlisteners\"\xda\x01\n" +
"\x14AddProxyRouteRequest\x12#\n" +
"\rlistener_addr\x18\x01 \x01(\tR\flistenerAddr\x12\x1a\n" +
"\bhostname\x18\x02 \x01(\tR\bhostname\x12\x18\n" +
"\abackend\x18\x03 \x01(\tR\abackend\x12\x12\n" +
"\x04mode\x18\x04 \x01(\tR\x04mode\x12\x1f\n" +
"\vbackend_tls\x18\x05 \x01(\bR\n" +
"backendTls\"\x17\n" +
"backendTls\x12\x19\n" +
"\btls_cert\x18\x06 \x01(\tR\atlsCert\x12\x17\n" +
"\atls_key\x18\a \x01(\tR\x06tlsKey\"\x17\n" +
"\x15AddProxyRouteResponse\"Z\n" +
"\x17RemoveProxyRouteRequest\x12#\n" +
"\rlistener_addr\x18\x01 \x01(\tR\flistenerAddr\x12\x1a\n" +


@@ -134,7 +134,8 @@ func (a *Agent) deployComponent(ctx context.Context, serviceName string, cs *mcp
Error: fmt.Sprintf("allocate route ports: %v", err),
}
}
runSpec.Ports = ports
// Merge explicit ports from the spec with route-allocated ports.
runSpec.Ports = append(cs.GetPorts(), ports...)
runSpec.Env = append(runSpec.Env, env...)
} else {
// Legacy: use ports directly from the spec.


@@ -69,6 +69,8 @@ func (a *Agent) AddProxyRoute(ctx context.Context, req *mcpv1.AddProxyRouteReque
Backend: req.GetBackend(),
Mode: req.GetMode(),
BackendTLS: req.GetBackendTls(),
TLSCert: req.GetTlsCert(),
TLSKey: req.GetTlsKey(),
}
if err := a.Proxy.AddRoute(ctx, req.GetListenerAddr(), route); err != nil {


@@ -362,6 +362,8 @@ message AddProxyRouteRequest {
string backend = 3;
string mode = 4; // "l4" or "l7"
bool backend_tls = 5;
string tls_cert = 6; // path to TLS cert (required for l7)
string tls_key = 7; // path to TLS key (required for l7)
}
message AddProxyRouteResponse {}