commit 6b99937a69aabce9eb37403d6907a1396bfcaf9e
Author: Kyle Isom
Date:   Thu Mar 26 09:42:41 2026 -0700

    Add MCP v1 architecture specification

    Design spec for the Metacircular Control Plane covering master/agent
    architecture, service registry with desired/observed state tracking,
    container lifecycle management, service definition files, single-file
    transfer scoped to /srv/<service>/, and continuous monitoring via
    mcp watch with event logging and alerting.

    Co-Authored-By: Claude Opus 4.6 (1M context)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..3c1aee2
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,1098 @@
+# MCP -- Metacircular Control Plane
+
+## Overview
+
+MCP is the orchestrator for the Metacircular platform. It manages container
+lifecycle, tracks what services run where, and transfers files between the
+operator's workstation and managed nodes.
+
+MCP uses a **master/agent architecture**:
+
+- **The agent** owns **"what is"**. It runs on each managed node, queries the
+  local container runtime, and reports facts. It manages containers and files
+  on behalf of the master. It has no opinion about what should be running.
+
+- **The master** owns **"what should be"**. It maintains the service registry
+  (desired state), compares it against agent-reported observed state, records
+  events, detects drift, and alerts the operator. The master runs in two
+  modes: as an interactive CLI for operator commands, and as a long-lived
+  watch process for continuous monitoring.
+
+Agents keep their containers running independently. The master is needed to
+issue commands, track desired state, and monitor for drift.
+
+### v1 Scope
+
+v1 targets a single-node deployment (one agent on rift, master on vade). The
+core operations are:
+
+- **Deploy** -- pull a container image and start (or restart) a service on a
+  specified node.
+- **Container lifecycle** -- stop, start, restart individual containers.
+- **Status** -- query live container state from agents and reconcile the
+  master's service registry.
+- **Watch** -- continuous monitoring. Periodically syncs with agents, records
+  state-change events, detects drift, and alerts the operator.
+- **File transfer** -- push or pull individual files between master and nodes
+  (config files, certificates).
+
+Explicitly **not in v1**: migration (snapshot/tar.zst transfer), automatic
+scheduling/placement, certificate provisioning from Metacrypt, DNS updates to
+MCNS, multi-node orchestration, auto-reconciliation (restarting drifted
+containers).
+
+---
+
+## Architecture
+
+```
+Operator workstation (vade)
+  ┌──────────────────────────────┐
+  │ mcp (master CLI)             │
+  │                              │
+  │  ┌────────────────────────┐  │
+  │  │ Service Registry       │  │
+  │  │ (SQLite)               │  │
+  │  └────────────────────────┘  │
+  │                              │
+  │  gRPC client ───────────────┼──── overlay ────┐
+  └──────────────────────────────┘                 │
+                                                   │
+MC Node (rift)                                     │
+  ┌────────────────────────────────────────────────┼──┐
+  │                                                │  │
+  │  ┌──────────────────────┐                      │  │
+  │  │ mcp-agent            │◄─────────────────────┘  │
+  │  │ (gRPC server)        │                         │
+  │  │                      │                         │
+  │  │ MCIAS token          │                         │
+  │  │ validation           │                         │
+  │  └──────────┬───────────┘                         │
+  │             │                                     │
+  │             │ exec                                │
+  │             ▼                                     │
+  │  ┌──────────────────────┐                         │
+  │  │ Container runtime    │                         │
+  │  │ (podman / docker)    │                         │
+  │  │                      │    ┌──────────────────┐ │
+  │  │ ┌─────┐ ┌─────┐      │    │ /srv/<svc>/      │ │
+  │  │ │ svc │ │ svc │ ...  │    │  config, db,     │ │
+  │  │ │  α  │ │  β  │      │    │  certs, backups  │ │
+  │  │ └─────┘ └─────┘      │    └──────────────────┘ │
+  │  └──────────────────────┘                         │
+  └───────────────────────────────────────────────────┘
+```
+
+### Components
+
+| Component | Binary | Role |
+|-----------|--------|------|
+| Master | `mcp` | Operator CLI. Maintains the service registry. Issues commands to agents. |
+| Agent | `mcp-agent` | Node daemon. Manages local containers and files. Authenticates callers via MCIAS. |
+
+### Communication
+
+Master and agent communicate over gRPC with mutual TLS. 
The transport is the
+encrypted overlay network (Tailscale/WireGuard). The master authenticates to
+the agent by presenting an MCIAS bearer token. The agent validates the token
+by calling MCIAS and checking for the `admin` role.
+
+---
+
+## Authentication and Authorization
+
+MCP follows the platform authentication model: all auth is delegated to MCIAS.
+
+### Agent Authentication
+
+The agent is a gRPC server with a unary interceptor that enforces
+authentication on every RPC:
+
+1. Master includes an MCIAS bearer token in the gRPC metadata
+   (`authorization: Bearer <token>`).
+2. Agent extracts the token and validates it against MCIAS (cached 30s by
+   SHA-256 of the token, per platform convention).
+3. Agent checks that the caller has the `admin` role. All MCP operations
+   require admin -- there is no unprivileged MCP access.
+4. If validation fails, the RPC returns `UNAUTHENTICATED` (invalid/expired
+   token) or `PERMISSION_DENIED` (valid token, not admin).
+
+### Master Authentication
+
+The master authenticates to MCIAS before issuing commands. The token can be
+obtained by:
+
+1. `mcp login` -- interactive login, stores the token locally.
+2. Environment variable (`MCP_TOKEN`) for scripted use.
+3. System account credentials in the master config file.
+
+The stored token is used for all subsequent agent RPCs until it expires.
+
+### Agent Identity
+
+The agent itself authenticates to MCIAS as a system account (for future
+agent-initiated flows like health reporting). For v1, the agent is purely
+reactive -- it only responds to master-initiated RPCs and does not make
+outbound calls beyond MCIAS token validation.
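
The agent-side checks above can be sketched as plain Go functions. This is a sketch only: `bearerToken`, `cacheKey`, and `authorize` are illustrative names, not MCP's actual API, and the real interceptor would wire these into gRPC metadata handling and an MCIAS client.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Sentinel errors standing in for the gRPC status codes the agent returns.
var (
	errUnauthenticated  = errors.New("UNAUTHENTICATED")
	errPermissionDenied = errors.New("PERMISSION_DENIED")
)

// bearerToken extracts the token from an "authorization: Bearer <token>"
// metadata value (step 1).
func bearerToken(header string) (string, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) || len(header) == len(prefix) {
		return "", errUnauthenticated
	}
	return strings.TrimPrefix(header, prefix), nil
}

// cacheKey derives the 30s validation-cache key from the SHA-256 of the
// token, so raw tokens are never held in the cache (step 2).
func cacheKey(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// authorize applies steps 3-4: an invalid token is UNAUTHENTICATED; a valid
// token without the admin role is PERMISSION_DENIED.
func authorize(valid bool, roles []string) error {
	if !valid {
		return errUnauthenticated
	}
	for _, r := range roles {
		if r == "admin" {
			return nil
		}
	}
	return errPermissionDenied
}

func main() {
	tok, _ := bearerToken("Bearer abc123")
	fmt.Println(tok, len(cacheKey(tok)))          // abc123 64
	fmt.Println(authorize(true, []string{"ops"})) // PERMISSION_DENIED
}
```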
+
+---
+
+## Master
+
+### CLI Commands
+
+```
+mcp login                               Authenticate to MCIAS, store token
+
+mcp deploy <service>                    Deploy from service definition file
+mcp deploy <service> -f <file>          Deploy from explicit file
+mcp deploy <service> --image <image>    Deploy with image override
+mcp stop <service>                      Stop all containers for a service
+mcp start <service>                     Start all containers for a service
+mcp restart <service>                   Restart all containers for a service
+
+mcp ps                                  List all services in the registry
+mcp status [service]                    Query live state from agents, reconcile registry
+mcp sync                                Full rescan: agents report all containers,
+                                        registry is reconciled against reality
+
+mcp adopt <container> <service>         Claim an unmanaged container into a service
+
+mcp service show <service>              Print current spec from registry
+mcp service edit <service>              Open service definition in $EDITOR
+mcp service export <service>            Write registry spec to local service file
+mcp service export <service> -f <file>  Write to explicit path
+
+mcp push <local-file> <service> [path]  Copy a local file into /srv/<service>/[path]
+mcp pull <service> <path> [local-file]  Copy a file from /srv/<service>/<path> to local
+
+mcp watch                               Run continuous monitoring (long-lived)
+mcp watch --interval 60s                Custom poll interval
+
+mcp node list                           List registered nodes
+mcp node add <name> <address>           Register a node
+mcp node remove <name>                  Deregister a node
+```
+
+### Service Registry
+
+The master maintains a SQLite database that tracks:
+
+- **Nodes** -- name, overlay address, agent port, state.
+- **Services** -- name, assigned node, current state.
+- **Containers** -- name, parent service, image reference, container runtime
+  config (network, ports, volumes, user, restart policy, command), observed
+  state.
+
+A service (e.g., "metacrypt") has one or more containers (e.g., "metacrypt"
+API server + "metacrypt-web" UI). Deploy operates on services; lifecycle
+commands can target individual containers.
+
+### Desired State vs. Observed State
+
+The registry tracks two separate pieces of information for each container:
+
+- **Desired state** -- what the operator wants: `running` or `stopped`. Set
+  explicitly by `mcp deploy`, `mcp stop`, `mcp start`, etc.
+- **Observed state** -- what the agent actually sees on the node: `running`,
+  `stopped`, `exited`, `removed`, or `unknown`.
+
+These can diverge. A container with desired=`running` and observed=`exited`
+has crashed. A container with desired=`stopped` and observed=`stopped` is
+fine. `mcp status` shows both states and flags **drift** -- any container
+where observed state does not match desired state.
+
+| Desired | Observed | Status |
+|---------|----------|--------|
+| running | running | OK |
+| running | stopped | **DRIFT** -- stopped unexpectedly |
+| running | exited | **DRIFT** -- crashed |
+| running | removed | **DRIFT** -- container gone |
+| stopped | stopped | OK |
+| stopped | removed | OK |
+| stopped | running | **DRIFT** -- running when it shouldn't be |
+
+For v1, MCP reports drift but does not auto-reconcile. The operator decides
+whether to `mcp start`, `mcp deploy`, or investigate. Auto-reconciliation
+(restart crashed containers, enforce desired state) is a v2 concern.
+
+### Registry Reconciliation
+
+`mcp sync` (and implicitly, `mcp status`) works as follows:
+
+1. For each registered node, master calls `ListContainers` on the agent.
+2. Agent queries the container runtime and returns the full list of
+   containers with their configuration and state.
+3. Master compares the agent's report against the registry:
+   - **Container in registry, seen by agent**: update observed state.
+   - **Container in registry, not seen by agent**: set observed state to
+     `removed`.
+   - **Container seen by agent, not in registry**: add to registry as
+     `unmanaged` (desired state unset, observed state from agent).
+4. Master prints a summary: OK containers, drifted containers, and newly
+   discovered unmanaged containers.
+
+### Adopting Unmanaged Containers
+
+On first `mcp sync`, every container on rift will appear as `unmanaged` --
+MCP didn't deploy them and doesn't know their intended service grouping.
+
+`mcp adopt <container> <service>` claims an unmanaged container:
+
+1. If the service doesn't exist, create it (assigned to the container's
+   node).
+2. Set the container's service to the specified service.
+3. Set `managed = true` and desired state to `running` (or `stopped` if
+   the container is currently stopped).
+
+This lets the operator incrementally bring existing containers under MCP
+management without redeploying them.
+
+### Service Definitions
+
+A service definition is a TOML file that declares the containers for a
+service. These files live in `~/.config/mcp/services/` by default, one
+per service.
+
+Example: `~/.config/mcp/services/metacrypt.toml`
+
+```toml
+name = "metacrypt"
+node = "rift"
+
+[[containers]]
+name = "metacrypt"
+image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest"
+network = "docker_default"
+user = "0:0"
+restart = "unless-stopped"
+ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"]
+volumes = ["/srv/metacrypt:/srv/metacrypt"]
+
+[[containers]]
+name = "metacrypt-web"
+image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest"
+network = "docker_default"
+user = "0:0"
+restart = "unless-stopped"
+ports = ["127.0.0.1:18080:8080"]
+volumes = ["/srv/metacrypt:/srv/metacrypt"]
+cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"]
+```
+
+#### Deploy Resolution
+
+`mcp deploy <service>` resolves the container spec through a precedence
+chain:
+
+1. **Command-line flags** (highest) -- `--image`, `--node`, etc. override
+   individual fields.
+2. **Service definition file** -- if `-f <file>` is specified, use that
+   file. Otherwise look for `~/.config/mcp/services/<service>.toml`.
+3. **Registry** (lowest) -- if no file exists, use the spec from the last
+   successful deploy stored in the registry.
+
+If none of these exist (first deploy, no file), the deploy fails with an
+error telling the operator to create a service definition.
+
+On a successful deploy, the resolved spec (after all overrides) is written
+to the registry. The file on disk is **not** modified -- it represents the
+operator's declared intent, not the deployed state. To sync the file with
+reality after overrides, use `mcp service export`.
+
+#### Spec Lifecycle
+
+```
+              ┌─────────────┐
+   write      │ Service     │   mcp deploy
+  ──────────► │ definition  │ ──────────────┐
+              │ (.toml)     │               │
+              └─────────────┘               ▼
+                     ▲             ┌─────────────────┐
+                     │             │ Registry        │
+     mcp service     │             │ (deployed       │
+     export          │             │  spec)          │
+                     │             └─────────────────┘
+                     │                      │
+                     └──────────────────────┘
+```
+
+- **Operator writes** the service definition file (or copies one from
+  the service's repo).
+- **`mcp deploy`** reads the file, applies overrides, sends to agent,
+  records the resolved spec in the registry.
+- **`mcp service export`** reads the registry and writes it back to the
+  local file, incorporating any overrides or drift that occurred.
+
+This keeps the file and registry loosely coupled. The file is the starting
+point; the registry is the truth; export reconciles them.
+
+`mcp service edit <service>` opens the service definition in `$EDITOR`
+(falling back to `$VISUAL`, then `vi`). If no file exists yet, it exports
+the current registry spec to a temp file first, so the operator starts
+from the deployed state rather than a blank file. After the editor exits,
+the file is saved to the standard path in the services directory.
+
+#### Where Definition Files Come From
+
+Service definition files can be:
+
+- **Written by hand** by the operator.
+- **Copied from the service's repo** (a service could ship a
+  `deploy/mcp-service.toml` as a starting point).
+- **Generated by `mcp adopt` + `mcp service export`** -- adopt existing
+  containers, then export to get a file matching the running config.
+- **Generated by converting from mcdeploy.toml** during initial MCP
+  migration (one-time).
+
+### Monitoring (`mcp watch`)
+
+`mcp watch` is a long-lived master process that continuously monitors the
+platform. It runs the same sync logic as `mcp status` on a loop, records
+state-change events, and alerts the operator when drift is detected.
+
+#### Event Log
+
+Every state transition is recorded in the `events` table:
+
+```sql
+CREATE TABLE events (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    container TEXT NOT NULL,
+    node TEXT NOT NULL,
+    prev_state TEXT NOT NULL,
+    new_state TEXT NOT NULL,
+    timestamp TEXT NOT NULL DEFAULT (datetime('now'))
+);
+```
+
+Events accumulate over time and support rate queries:
+
+```sql
+-- How many times has metacrypt exited in the last hour?
+SELECT COUNT(*) FROM events +WHERE container = 'metacrypt' + AND new_state = 'exited' + AND timestamp > datetime('now', '-1 hour'); +``` + +Old events are pruned periodically (default: retain 30 days). + +#### Watch Loop + +Each iteration of the watch loop: + +1. Sync with all agents (same as `mcp sync`). +2. For each observed state change since the last iteration, insert an event. +3. Evaluate alert conditions against the current state and event history. +4. If an alert fires, execute the configured alert command. +5. Sleep for the configured interval. + +#### Alert Conditions + +The watch process evaluates two types of alert: + +- **Drift alert**: a managed container's observed state does not match its + desired state. Fires on the transition, not on every poll. +- **Rate alert**: a container has changed state more than N times within a + window (flapping). Default threshold: 3 transitions in 10 minutes. + +Each alert has a **cooldown** per container. Once an alert fires for a +container, it is suppressed for the cooldown period regardless of further +transitions. This prevents notification spam from a flapping service. 
+
+```toml
+[watch]
+interval = "60s"
+alert_command = ""   # command to run on alert; empty = log only
+cooldown = "15m"     # suppress repeat alerts per container
+flap_threshold = 3   # state changes within flap_window = flapping
+flap_window = "10m"
+retention = "30d"    # event log retention
+```
+
+#### Alert Command
+
+When an alert fires, the master executes the configured command with
+context passed via environment variables:
+
+| Variable | Value |
+|----------|-------|
+| `MCP_CONTAINER` | Container name |
+| `MCP_SERVICE` | Parent service name |
+| `MCP_NODE` | Node name |
+| `MCP_DESIRED` | Desired state |
+| `MCP_OBSERVED` | Observed state |
+| `MCP_PREV_STATE` | Previous observed state |
+| `MCP_ALERT_TYPE` | `drift` or `flapping` |
+| `MCP_TRANSITIONS` | Number of transitions in the flap window (for flapping alerts) |
+
+The alert command is the operator's choice. MCP does not ship with or depend
+on any notification system.
+
+```bash
+# Push notification (double quotes so the shell expands the variables)
+alert_command = "ntfy publish mcp \"$MCP_ALERT_TYPE: $MCP_CONTAINER ($MCP_DESIRED -> $MCP_OBSERVED)\""
+
+# Webhook
+alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."
+
+# Syslog
+alert_command = "logger -t mcp \"$MCP_ALERT_TYPE: $MCP_CONTAINER on $MCP_NODE\""
+```
+
+#### Deployment
+
+`mcp watch` is intended to run as a systemd service on a node (typically
+rift for v1), not on the operator's workstation. It uses the same binary,
+config, and database as the CLI -- the difference is the execution mode.
+
+```ini
+[Unit]
+Description=MCP Watch (continuous monitoring)
+After=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
+Restart=on-failure
+RestartSec=10
+User=mcp
+Group=mcp
+```
+
+This means the master's SQLite database lives on the node running `mcp
+watch` (rift for v1), not on the operator's workstation. 
The operator can +still run CLI commands from vade -- they connect to the same database via +a configurable path, or the CLI operates statelessly by querying agents +directly and reading the registry from a shared location. + +**Alternatively**, the operator runs `mcp watch` on vade when at their desk +and relies on container restart policies for unattended operation. The +design supports both -- `mcp watch` is an optional monitoring layer, not +a required component. + +--- + +## Agent + +### gRPC Service Definition + +The agent exposes a single gRPC service. All RPCs require admin authentication. + +```protobuf +syntax = "proto3"; +package mcp.v1; + +import "google/protobuf/timestamp.proto"; + +service McpAgent { + // Container lifecycle + rpc Deploy(DeployRequest) returns (DeployResponse); + rpc StartContainer(ContainerRequest) returns (ContainerResponse); + rpc StopContainer(ContainerRequest) returns (ContainerResponse); + rpc RestartContainer(ContainerRequest) returns (ContainerResponse); + rpc RemoveContainer(ContainerRequest) returns (ContainerResponse); + + // Status + rpc ListContainers(ListContainersRequest) returns (ListContainersResponse); + + // File transfer + rpc PushFile(PushFileRequest) returns (PushFileResponse); + rpc PullFile(PullFileRequest) returns (PullFileResponse); + + // Node + rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse); +} + +// --- Container lifecycle --- + +message ContainerSpec { + string name = 1; + string image = 2; + string network = 3; + string user = 4; + string restart_policy = 5; + repeated string ports = 6; // "host:container" mappings + repeated string volumes = 7; // "host:container" mount specs + repeated string cmd = 8; // command and arguments +} + +message DeployRequest { + repeated ContainerSpec containers = 1; +} + +message DeployResponse { + repeated ContainerResult results = 1; +} + +message ContainerRequest { + string name = 1; +} + +message ContainerResponse { + string name = 1; + bool success = 
2;
+  string error = 3;
+}
+
+message ContainerResult {
+  string name = 1;
+  bool success = 2;
+  string error = 3;
+}
+
+// --- Status ---
+
+message ListContainersRequest {}
+
+message ContainerInfo {
+  string name = 1;
+  string image = 2;
+  string state = 3;             // "running", "stopped", "exited", etc.
+  string network = 4;
+  string user = 5;
+  string restart_policy = 6;
+  repeated string ports = 7;
+  repeated string volumes = 8;
+  repeated string cmd = 9;
+  google.protobuf.Timestamp created = 10;
+  google.protobuf.Timestamp started = 11;
+}
+
+message ListContainersResponse {
+  repeated ContainerInfo containers = 1;
+}
+
+// --- File transfer ---
+// All file paths are relative to /srv/<service>/ on the node.
+// The agent resolves the full path and rejects traversal attempts.
+
+message PushFileRequest {
+  string service = 1;   // service name (-> /srv/<service>/)
+  string path = 2;      // relative path within service dir
+  bytes content = 3;
+  uint32 mode = 4;      // file permissions (e.g. 0600)
+}
+
+message PushFileResponse {
+  bool success = 1;
+  string error = 2;
+}
+
+message PullFileRequest {
+  string service = 1;   // service name (-> /srv/<service>/)
+  string path = 2;      // relative path within service dir
+}
+
+message PullFileResponse {
+  bytes content = 1;
+  uint32 mode = 2;
+  string error = 3;
+}
+
+// --- Node ---
+
+message NodeStatusRequest {}
+
+message NodeStatusResponse {
+  string node_name = 1;
+  string runtime = 2;           // "podman", "docker"
+  string runtime_version = 3;
+  uint32 container_count = 4;
+  google.protobuf.Timestamp uptime_since = 5;
+}
+```
+
+### Container Runtime
+
+The agent manages containers by executing the local container runtime CLI
+(`podman` or `docker`). The runtime is configured in the agent's config file.
+The agent does not use the Docker socket API -- it shells out to the CLI for
+simplicity and debuggability (the operator can always run the same commands
+manually).
+
+#### Deploy Flow
+
+When the agent receives a `Deploy` RPC:
+
+1. For each container in the request:
+   a. Pull the image: `<runtime> pull <image>`
+   b. Stop and remove the existing container (if any):
+      `<runtime> stop <name>` and `<runtime> rm <name>`
+   c. Start the new container: `<runtime> run -d --name <name> [flags] <image> [cmd]`
+   d. Verify the container is running: `<runtime> inspect <name>`
+2. Return success/failure per container.
+
+The flags passed to `run` are derived from the `ContainerSpec`:
+
+| Spec field | Runtime flag |
+|------------|-------------|
+| `network` | `--network <name>` |
+| `user` | `--user <user>` |
+| `restart_policy` | `--restart <policy>` |
+| `ports` | `-p <mapping>` (repeated) |
+| `volumes` | `-v <mapping>` (repeated) |
+| `cmd` | appended after the image name |
+
+### File Transfer
+
+The agent supports single-file push and pull, scoped to a specific service's
+data directory. This is the mechanism for deploying config files and
+certificates to nodes.
+
+Every file operation specifies a **service name** and a **relative path**.
+The agent resolves the full path as `/srv/<service>/<path>`. This scoping
+ensures that a deploy for service A cannot accidentally write into service
+B's directory.
+
+**Push**: master sends the service name, relative path, file content, and
+permissions. The agent resolves the path, validates it (no `..` traversal,
+no symlinks escaping the service directory), creates intermediate directories
+if needed, and writes the file atomically (write to temp file, then rename).
+
+**Pull**: master sends the service name and relative path. The agent resolves
+the path, validates it, reads the file, and returns the content and
+permissions.
+
+```
+# Push mcr.toml into /srv/mcr/mcr.toml
+mcp push mcr.toml mcr
+
+# Push a cert into /srv/mcr/certs/mcr.pem
+mcp push cert.pem mcr certs/mcr.pem
+
+# Pull a config file back
+mcp pull mcr mcr.toml ./mcr.toml
+```
+
+When the relative path is omitted from `mcp push`, the basename of the local
+file is used.
+
+File size is bounded by gRPC message limits. For v1, the default 4MB gRPC
+message size is sufficient -- config files and certificates are kilobytes.
+If larger transfers are needed in the future, streaming RPCs or the v2 +tar.zst archive transfer will handle them. + +--- + +## Database Schema + +The master's SQLite database stores the service registry. + +```sql +CREATE TABLE nodes ( + name TEXT PRIMARY KEY, + address TEXT NOT NULL, + agent_port INTEGER NOT NULL DEFAULT 9444, + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE TABLE services ( + name TEXT PRIMARY KEY, + node TEXT NOT NULL REFERENCES nodes(name), + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE TABLE containers ( + name TEXT PRIMARY KEY, + service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE, + image TEXT NOT NULL, + network TEXT NOT NULL DEFAULT 'bridge', + user_spec TEXT NOT NULL DEFAULT '', + restart_policy TEXT NOT NULL DEFAULT 'unless-stopped', + desired_state TEXT NOT NULL DEFAULT 'running', + observed_state TEXT NOT NULL DEFAULT 'unknown', + managed INTEGER NOT NULL DEFAULT 1, + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE TABLE container_ports ( + container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, + mapping TEXT NOT NULL, + PRIMARY KEY (container, mapping) +); + +CREATE TABLE container_volumes ( + container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, + mapping TEXT NOT NULL, + PRIMARY KEY (container, mapping) +); + +CREATE TABLE container_cmd ( + container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, + position INTEGER NOT NULL, + arg TEXT NOT NULL, + PRIMARY KEY (container, position) +); + +-- Event log for monitoring and alerting +CREATE TABLE events ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + container TEXT NOT NULL, + node TEXT NOT NULL, + prev_state TEXT NOT NULL, + new_state TEXT NOT NULL, + timestamp TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE 
INDEX idx_events_container_time ON events(container, timestamp); +``` + +### State Values + +**Desired state** (set by operator actions): + +| State | Meaning | +|-------|---------| +| `running` | Operator wants this container running | +| `stopped` | Operator deliberately stopped this container | + +**Observed state** (set by agent reports): + +| State | Meaning | +|-------|---------| +| `running` | Container is running (confirmed by agent) | +| `stopped` | Container exists but is not running | +| `exited` | Container exited (crashed or completed) | +| `removed` | Container no longer exists on the node | +| `unknown` | State has not been queried since registration | + +The `managed` flag distinguishes MCP-deployed containers from pre-existing +ones discovered during sync. Unmanaged containers have no desired state set +(desired_state is empty). They appear in status output but MCP will not +modify them unless the operator explicitly adopts them with `mcp adopt`. + +--- + +## Configuration + +### Master Config + +When running as a CLI on the operator's workstation: + +```toml +[database] +path = "/home/kyle/.config/mcp/mcp.db" + +[services] +dir = "/home/kyle/.config/mcp/services" + +[mcias] +server_url = "https://mcias.metacircular.net:8443" +ca_cert = "" +service_name = "mcp" + +[auth] +token_path = "/home/kyle/.config/mcp/token" +``` + +Directory layout on the operator's workstation: + +``` +~/.config/mcp/ +├── mcp.toml Master config +├── mcp.db Service registry +├── token Cached MCIAS bearer token (0600) +└── services/ Service definition files + ├── metacrypt.toml + ├── mcr.toml + ├── mc-proxy.toml + └── ... 
+``` + +When running as a service on a node (for `mcp watch`): + +```toml +[database] +path = "/srv/mcp/mcp.db" + +[mcias] +server_url = "https://mcias.metacircular.net:8443" +ca_cert = "" +service_name = "mcp" +# System account credentials for unattended operation +username = "mcp" +password_file = "/srv/mcp/credentials" + +[watch] +interval = "60s" +alert_command = "" +cooldown = "15m" +flap_threshold = 3 +flap_window = "10m" +retention = "30d" + +[log] +level = "info" +``` + +The master binary is the same in both cases. When running from the +operator's workstation, it uses `~/.config/mcp/` for local state and +interactive login. When running as a systemd service on a node, it uses +`/srv/mcp/` following the platform convention and authenticates with a +system account. + +### Agent Config + +```toml +[server] +grpc_addr = ":9444" +tls_cert = "/srv/mcp/certs/cert.pem" +tls_key = "/srv/mcp/certs/key.pem" + +[mcias] +server_url = "https://mcias.metacircular.net:8443" +ca_cert = "" +service_name = "mcp-agent" + +[agent] +node_name = "rift" +container_runtime = "podman" + +[log] +level = "info" +``` + +The agent config lives in `/srv/mcp/` following the platform convention. The +agent's data directory contains only its config and TLS material -- it has no +database of its own. The container runtime is the source of truth for local +container state. + +--- + +## Deployment + +### Agent Deployment (on nodes) + +The agent is deployed like any other Metacircular service: + +1. Create `/srv/mcp/` directory and config file. +2. Provision TLS certificate from Metacrypt. +3. Create an MCIAS system account for the agent (`mcp-agent`). +4. Install the `mcp-agent` binary. +5. Start via systemd unit. + +The agent runs as a systemd service (container-first deployment is a v2 +concern -- MCP needs to be running before it can manage its own agent). 
+
+```ini
+[Unit]
+Description=MCP Agent
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/mcp-agent server --config /srv/mcp/mcp-agent.toml
+Restart=on-failure
+RestartSec=5
+
+User=mcp
+Group=mcp
+
+NoNewPrivileges=true
+ProtectSystem=strict
+ProtectHome=true
+PrivateTmp=true
+PrivateDevices=true
+ProtectKernelTunables=true
+ProtectKernelModules=true
+ProtectControlGroups=true
+RestrictSUIDSGID=true
+RestrictNamespaces=true
+LockPersonality=true
+MemoryDenyWriteExecute=true
+RestrictRealtime=true
+ReadWritePaths=/srv
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes files
+to any service's `/srv/<service>/` directory on behalf of the master.
+
+### Master Installation (on operator workstation)
+
+For interactive CLI use:
+
+1. Install the `mcp` binary to `~/.local/bin/` or `/usr/local/bin/`.
+2. Create `~/.config/mcp/mcp.toml` with MCIAS and registry config.
+3. Run `mcp login` to authenticate.
+4. Run `mcp node add rift <address>` to register the first node.
+5. Run `mcp sync` to discover existing containers on the node.
+
+### Master Watch Deployment (on a node)
+
+For continuous monitoring, deploy `mcp watch` as a systemd service on a node
+(rift for v1). This runs alongside the agent on the same machine.
+
+1. Install the `mcp` binary to `/usr/local/bin/`.
+2. Create `/srv/mcp/mcp.toml` with database, MCIAS system account, and
+   `[watch]` settings.
+3. Create an MCIAS system account for the master (`mcp`).
+
+```ini
+[Unit]
+Description=MCP Watch
+After=network-online.target mcp-agent.service
+Wants=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
+Restart=on-failure
+RestartSec=10
+
+User=mcp
+Group=mcp
+
+NoNewPrivileges=true
+ProtectSystem=strict
+ProtectHome=true
+PrivateTmp=true
+PrivateDevices=true
+ProtectKernelTunables=true
+ProtectKernelModules=true
+ProtectControlGroups=true
+RestrictSUIDSGID=true
+RestrictNamespaces=true
+LockPersonality=true
+MemoryDenyWriteExecute=true
+RestrictRealtime=true
+ReadWritePaths=/srv/mcp
+
+[Install]
+WantedBy=multi-user.target
+```
+
+When `mcp watch` runs on a node and the operator uses the CLI from vade,
+they share the same registry database. The operator can either point their
+CLI config at the node's database (via the overlay) or maintain a separate
+local database and sync it. For v1 with one node, the simplest option is
+running all `mcp` commands on rift (via SSH) where the database lives.
+
+---
+
+## Security Model
+
+### Threat Mitigations
+
+| Threat | Mitigation |
+|--------|------------|
+| Unauthorized C2 commands | Agent requires admin MCIAS token on every RPC |
+| Token theft | Tokens have short expiry; cached validation keyed by SHA-256 |
+| Agent impersonation | Master verifies agent TLS certificate against known CA |
+| Arbitrary file write via push | Agent restricts writes to `/srv/<service>/` for the named service |
+| Arbitrary file read via pull | Agent restricts reads to `/srv/<service>/` for the named service |
+| Cross-service file access | File ops require a service name; agent resolves to that service's directory only |
+| Container runtime escape | Agent runs as non-root where possible; containers follow platform hardening |
+| Network eavesdropping | All C2 traffic is gRPC over mTLS over encrypted overlay |
+
+### Security Invariants
+
+1. Every agent RPC requires a valid MCIAS admin token. No anonymous or
+   unprivileged access.
+2. File operations are scoped to `/srv/<service>/` for the named service.
+   Path traversal attempts (`../`, symlinks outside the service directory)
+   are rejected.
+3. The agent never executes arbitrary commands. It only runs container runtime
+   operations and file I/O through well-defined code paths.
+4. TLS 1.3 minimum on the agent's gRPC listener.
+5. The master's stored token is file-permission protected (0600).
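
The path-scoping invariant can be enforced with a small resolver. This is a sketch under stated assumptions: `resolveServicePath` is an illustrative name, Linux path separators are assumed, and the symlink check (e.g. via `filepath.EvalSymlinks` on the final path) is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

var errPathEscape = errors.New("path escapes service directory")

// resolveServicePath maps (service, relative path) to /srv/<service>/<path>
// and rejects traversal. filepath.Join lexically cleans ".." components, so
// any escape shows up as a result outside the service base directory.
func resolveServicePath(service, rel string) (string, error) {
	if service == "" || strings.ContainsAny(service, "/") || service == ".." {
		return "", errPathEscape // service name must be a single path element
	}
	base := filepath.Join("/srv", service)
	full := filepath.Join(base, rel)
	if full != base && !strings.HasPrefix(full, base+"/") {
		return "", errPathEscape
	}
	return full, nil
}

func main() {
	p, _ := resolveServicePath("mcr", "certs/mcr.pem")
	fmt.Println(p) // /srv/mcr/certs/mcr.pem
	_, err := resolveServicePath("mcr", "../metacrypt/metacrypt.toml")
	fmt.Println(err) // path escapes service directory
}
```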
+ +--- + +## Project Structure + +``` +mcp/ +├── cmd/ +│ ├── mcp/ Master CLI +│ │ ├── main.go +│ │ ├── login.go +│ │ ├── deploy.go +│ │ ├── lifecycle.go stop, start, restart +│ │ ├── status.go ps, status, sync +│ │ ├── adopt.go adopt unmanaged containers +│ │ ├── service.go service show/export +│ │ ├── transfer.go push, pull +│ │ ├── watch.go watch (long-lived monitoring) +│ │ └── node.go node add/list/remove +│ └── mcp-agent/ Agent daemon +│ ├── main.go +│ └── server.go +├── internal/ +│ ├── agent/ Agent logic +│ │ ├── agent.go Agent struct, setup +│ │ ├── containers.go Container runtime operations +│ │ ├── files.go File push/pull +│ │ └── runtime.go Runtime abstraction (podman/docker) +│ ├── registry/ Service registry +│ │ ├── db.go Schema, migrations +│ │ ├── nodes.go Node CRUD +│ │ ├── services.go Service CRUD +│ │ ├── containers.go Container CRUD +│ │ ├── events.go Event log +│ │ └── sync.go Registry reconciliation +│ ├── watch/ Monitoring subsystem +│ │ ├── watch.go Watch loop +│ │ └── alerting.go Alert evaluation and command execution +│ ├── servicedef/ Service definition file parsing +│ │ └── servicedef.go Load, parse, write TOML service defs +│ ├── auth/ MCIAS integration +│ │ └── auth.go Token validation, login +│ ├── config/ Configuration loading +│ │ ├── master.go +│ │ └── agent.go +│ └── grpcserver/ Agent gRPC server wiring +│ ├── server.go +│ ├── interceptors.go Auth interceptor +│ └── handlers.go RPC implementations +├── proto/mcp/ +│ └── v1/ +│ └── mcp.proto +├── gen/mcp/ +│ └── v1/ Generated Go code +├── deploy/ +│ ├── systemd/ +│ │ ├── mcp-agent.service +│ │ └── mcp-watch.service +│ ├── examples/ +│ │ ├── mcp.toml +│ │ └── mcp-agent.toml +│ └── scripts/ +│ └── install-agent.sh +├── Makefile +├── buf.yaml +├── .golangci.yaml +├── CLAUDE.md +└── ARCHITECTURE.md +``` + +--- + +## Future Work (v2+) + +These are explicitly out of scope for v1 but inform the design: + +- **Auto-reconciliation**: `mcp watch` detects drift but does not act on it + in v1. 
v2 adds configurable auto-restart for drifted containers (with + backoff to avoid restart storms). +- **Migration**: snapshot `/srv//` as tar.zst (with VACUUM INTO for + clean DB copies), stream to destination node, restore. Requires streaming + gRPC and archive assembly logic. +- **Scheduling**: automatic node selection based on resource availability and + operator constraints. Requires node resource reporting from agents. +- **Certificate provisioning**: MCP provisions TLS certs from Metacrypt + during deploy via the ACME client library. +- **DNS updates**: MCP pushes record updates to MCNS after deploy/migrate. + Requires MCNS to have an API (or, as a stopgap, zone file editing). +- **Multi-node orchestration**: deploy across multiple nodes, rolling + updates, health-aware placement. +- **Web UI**: master exposes a web interface for registry browsing and + operations. Would add REST+gRPC on the master side.