# MCP -- Metacircular Control Plane

## Overview

MCP is the orchestrator for the Metacircular platform. It manages container lifecycle, tracks what services run where, and transfers files between the operator's workstation and managed nodes.

MCP uses a **master/agent architecture**:

- **The agent** owns **"what is"**. It runs on each managed node, queries the local container runtime, and reports facts. It manages containers and files on behalf of the master. It has no opinion about what should be running.
- **The master** owns **"what should be"**. It maintains the service registry (desired state), compares it against agent-reported observed state, records events, detects drift, and alerts the operator. The master runs in two modes: as an interactive CLI for operator commands, and as a long-lived watch process for continuous monitoring.

Agents keep their containers running independently. The master is needed to issue commands, track desired state, and monitor for drift.

### v1 Scope

v1 targets a single-node deployment (one agent on rift, master on vade). The core operations are:

- **Deploy** -- pull a container image and start (or restart) a service on a specified node.
- **Container lifecycle** -- stop, start, restart individual containers.
- **Status** -- query live container state from agents and reconcile the master's service registry.
- **Watch** -- continuous monitoring. Periodically syncs with agents, records state-change events, detects drift, and alerts the operator.
- **File transfer** -- push or pull individual files between master and nodes (config files, certificates).

Explicitly **not in v1**: migration (snapshot/tar.zst transfer), automatic scheduling/placement, certificate provisioning from Metacrypt, DNS updates to MCNS, multi-node orchestration, auto-reconciliation (restarting drifted containers).
---

## Architecture

```
Operator workstation (vade)
┌──────────────────────────────┐
│ mcp (master CLI)             │
│                              │
│ ┌────────────────────────┐   │
│ │ Service Registry       │   │
│ │ (SQLite)               │   │
│ └────────────────────────┘   │
│                              │
│ gRPC client ─────────────────┼──── overlay ────┐
└──────────────────────────────┘                 │
                                                 │
MC Node (rift)                                   │
┌────────────────────────────────────────────────┼──┐
│                                                │  │
│ ┌──────────────────────┐                       │  │
│ │ mcp-agent            │◄──────────────────────┘  │
│ │ (gRPC server)        │                          │
│ │                      │                          │
│ │ MCIAS token          │                          │
│ │ validation           │                          │
│ └──────────┬───────────┘                          │
│            │                                      │
│            │ exec                                 │
│            ▼                                      │
│ ┌──────────────────────┐                          │
│ │ Container runtime    │                          │
│ │ (podman / docker)    │                          │
│ │                      │    ┌──────────────────┐  │
│ │ ┌─────┐ ┌─────┐      │    │ /srv/<service>/  │  │
│ │ │ svc │ │ svc │ ...  │    │ config, db,      │  │
│ │ │  α  │ │  β  │      │    │ certs, backups   │  │
│ │ └─────┘ └─────┘      │    └──────────────────┘  │
│ └──────────────────────┘                          │
└───────────────────────────────────────────────────┘
```

### Components

| Component | Binary | Role |
|-----------|--------|------|
| Master | `mcp` | Operator CLI. Maintains the service registry. Issues commands to agents. |
| Agent | `mcp-agent` | Node daemon. Manages local containers and files. Authenticates callers via MCIAS. |

### Communication

Master and agent communicate over gRPC with mutual TLS. The transport is the encrypted overlay network (Tailscale/WireGuard). The master authenticates to the agent by presenting an MCIAS bearer token. The agent validates the token by calling MCIAS and checking for the `admin` role.

---

## Authentication and Authorization

MCP follows the platform authentication model: all auth is delegated to MCIAS.

### Agent Authentication

The agent is a gRPC server with a unary interceptor that enforces authentication on every RPC:

1. The master includes an MCIAS bearer token in the gRPC metadata (`authorization: Bearer <token>`).
2. The agent extracts the token and validates it against MCIAS (cached for 30s, keyed by SHA-256 of the token, per platform convention).
3. The agent checks that the caller has the `admin` role. All MCP operations require admin -- there is no unprivileged MCP access.
4. If validation fails, the RPC returns `UNAUTHENTICATED` (invalid/expired token) or `PERMISSION_DENIED` (valid token, not admin).

### Master Authentication

The master authenticates to MCIAS before issuing commands. The token can be obtained by:

1. `mcp login` -- interactive login, stores the token locally.
2. Environment variable (`MCP_TOKEN`) for scripted use.
3. System account credentials in the master config file.

The stored token is used for all subsequent agent RPCs until it expires.

### Agent Identity

The agent itself authenticates to MCIAS as a system account (for future agent-initiated flows like health reporting). For v1, the agent is purely reactive -- it only responds to master-initiated RPCs and does not make outbound calls beyond MCIAS token validation.

---

## Master

### CLI Commands

```
mcp login                                 Authenticate to MCIAS, store token
mcp deploy <service>                      Deploy from service definition file
mcp deploy <service> -f <file>            Deploy from explicit file
mcp deploy <service> --image <image>      Deploy with image override
mcp stop <service>                        Stop all containers for a service
mcp start <service>                       Start all containers for a service
mcp restart <service>                     Restart all containers for a service
mcp ps                                    List all services in the registry
mcp status [service]                      Query live state from agents, reconcile registry
mcp sync                                  Full rescan: agents report all containers,
                                          registry is reconciled against reality
mcp adopt <container> <service>           Claim an unmanaged container into a service
mcp service show <service>                Print current spec from registry
mcp service edit <service>                Open service definition in $EDITOR
mcp service export <service>              Write registry spec to local service file
mcp service export <service> -f <path>    Write to explicit path
mcp push <file> <service> [path]          Copy a local file into /srv/<service>/[path]
mcp pull <service> <path> [local-file]    Copy a file from /srv/<service>/ to local
mcp watch                                 Run continuous monitoring (long-lived)
mcp watch --interval 60s                  Custom poll interval
mcp node list                             List registered nodes
mcp node add <name> <address>             Register a node
mcp node remove <name>                    Deregister a node
```

### Service Registry

The master maintains a SQLite database that tracks:

- **Nodes** -- name, overlay address, agent port, state.
- **Services** -- name, assigned node, current state.
- **Containers** -- name, parent service, image reference, container runtime config (network, ports, volumes, user, restart policy, command), observed state.

A service (e.g., "metacrypt") has one or more containers (e.g., "metacrypt" API server + "metacrypt-web" UI). Deploy operates on services; lifecycle commands can target individual containers.

### Desired State vs. Observed State

The registry tracks two separate pieces of information for each container:

- **Desired state** -- what the operator wants: `running` or `stopped`. Set explicitly by `mcp deploy`, `mcp stop`, `mcp start`, etc.
- **Observed state** -- what the agent actually sees on the node: `running`, `stopped`, `exited`, `removed`, or `unknown`.

These can diverge. A container with desired=`running` and observed=`exited` has crashed. A container with desired=`stopped` and observed=`stopped` is fine. `mcp status` shows both states and flags **drift** -- any container where observed state does not match desired state.

| Desired | Observed | Status |
|---------|----------|--------|
| running | running | OK |
| running | stopped | **DRIFT** -- stopped unexpectedly |
| running | exited | **DRIFT** -- crashed |
| running | removed | **DRIFT** -- container gone |
| stopped | stopped | OK |
| stopped | removed | OK |
| stopped | running | **DRIFT** -- running when it shouldn't be |

For v1, MCP reports drift but does not auto-reconcile. The operator decides whether to `mcp start`, `mcp deploy`, or investigate. Auto-reconciliation (restart crashed containers, enforce desired state) is a v2 concern.

### Registry Reconciliation

`mcp sync` (and implicitly, `mcp status`) works as follows:

1. For each registered node, the master calls `ListContainers` on the agent.
2. The agent queries the container runtime and returns the full list of containers with their configuration and state.
3. The master compares the agent's report against the registry:
   - **Container in registry, seen by agent**: update observed state.
   - **Container in registry, not seen by agent**: set observed state to `removed`.
   - **Container seen by agent, not in registry**: add to registry as `unmanaged` (desired state unset, observed state from agent).
4. The master prints a summary: OK containers, drifted containers, and newly discovered unmanaged containers.

### Adopting Unmanaged Containers

On first `mcp sync`, every container on rift will appear as `unmanaged` -- MCP didn't deploy them and doesn't know their intended service grouping.

`mcp adopt <container> <service>` claims an unmanaged container:

1. If the service doesn't exist, create it (assigned to the container's node).
2. Set the container's service to the specified service.
3. Set `managed = true` and desired state to `running` (or `stopped` if the container is currently stopped).

This lets the operator incrementally bring existing containers under MCP management without redeploying them.

### Service Definitions

A service definition is a TOML file that declares the containers for a service. These files live in `~/.config/mcp/services/` by default, one per service.
Example: `~/.config/mcp/services/metacrypt.toml`

```toml
name = "metacrypt"
node = "rift"

[[containers]]
name = "metacrypt"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]

[[containers]]
name = "metacrypt-web"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18080:8080"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]
cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"]
```

#### Deploy Resolution

`mcp deploy <service>` resolves the container spec through a precedence chain:

1. **Command-line flags** (highest) -- `--image`, `--node`, etc. override individual fields.
2. **Service definition file** -- if `-f <file>` is specified, use that file. Otherwise look for `~/.config/mcp/services/<service>.toml`.
3. **Registry** (lowest) -- if no file exists, use the spec from the last successful deploy stored in the registry.

If none of these exist (first deploy, no file), the deploy fails with an error telling the operator to create a service definition.

On a successful deploy, the resolved spec (after all overrides) is written to the registry. The file on disk is **not** modified -- it represents the operator's declared intent, not the deployed state. To sync the file with reality after overrides, use `mcp service export`.

#### Spec Lifecycle

```
           write   ┌─────────────┐
Operator ────────► │  Service    │    mcp deploy
                   │  definition │ ──────────────┐
                   │  (.toml)    │               │
                   └─────────────┘               ▼
                         ▲              ┌─────────────────┐
                         │              │    Registry     │
                         │ mcp service  │   (deployed     │
                         │ export       │     spec)       │
                         │              └─────────────────┘
                         │                       │
                         └───────────────────────┘
```

- **Operator writes** the service definition file (or copies one from the service's repo).
- **`mcp deploy`** reads the file, applies overrides, sends to agent, records the resolved spec in the registry.
- **`mcp service export`** reads the registry and writes it back to the local file, incorporating any overrides or drift that occurred.

This keeps the file and registry loosely coupled. The file is the starting point; the registry is the truth; export reconciles them.

`mcp service edit <service>` opens the service definition in `$EDITOR` (falling back to `$VISUAL`, then `vi`). If no file exists yet, it exports the current registry spec to a temp file first, so the operator starts from the deployed state rather than a blank file. After the editor exits, the file is saved to the standard path in the services directory.

#### Where Definition Files Come From

Service definition files can be:

- **Written by hand** by the operator.
- **Copied from the service's repo** (a service could ship a `deploy/mcp-service.toml` as a starting point).
- **Generated by `mcp adopt` + `mcp service export`** -- adopt existing containers, then export to get a file matching the running config.
- **Generated by converting from mcdeploy.toml** during initial MCP migration (one-time).

### Monitoring (`mcp watch`)

`mcp watch` is a long-lived master process that continuously monitors the platform. It runs the same sync logic as `mcp status` on a loop, records state-change events, and alerts the operator when drift is detected.

#### Event Log

Every state transition is recorded in the `events` table:

```sql
CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    container TEXT NOT NULL,
    node TEXT NOT NULL,
    prev_state TEXT NOT NULL,
    new_state TEXT NOT NULL,
    timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
```

Events accumulate over time and support rate queries:

```sql
-- How many times has metacrypt exited in the last hour?
SELECT COUNT(*) FROM events
WHERE container = 'metacrypt'
  AND new_state = 'exited'
  AND timestamp > datetime('now', '-1 hour');
```

Old events are pruned periodically (default: retain 30 days).

#### Watch Loop

Each iteration of the watch loop:

1. Sync with all agents (same as `mcp sync`).
2. For each observed state change since the last iteration, insert an event.
3. Evaluate alert conditions against the current state and event history.
4. If an alert fires, execute the configured alert command.
5. Sleep for the configured interval.

#### Alert Conditions

The watch process evaluates two types of alert:

- **Drift alert**: a managed container's observed state does not match its desired state. Fires on the transition, not on every poll.
- **Rate alert**: a container has changed state more than N times within a window (flapping). Default threshold: 3 transitions in 10 minutes.

Each alert has a **cooldown** per container. Once an alert fires for a container, it is suppressed for the cooldown period regardless of further transitions. This prevents notification spam from a flapping service.

```toml
[watch]
interval = "60s"
alert_command = ""   # command to run on alert; empty = log only
cooldown = "15m"     # suppress repeat alerts per container
flap_threshold = 3   # state changes within flap_window = flapping
flap_window = "10m"
retention = "30d"    # event log retention
```

#### Alert Command

When an alert fires, the master executes the configured command with context passed via environment variables:

| Variable | Value |
|----------|-------|
| `MCP_CONTAINER` | Container name |
| `MCP_SERVICE` | Parent service name |
| `MCP_NODE` | Node name |
| `MCP_DESIRED` | Desired state |
| `MCP_OBSERVED` | Observed state |
| `MCP_PREV_STATE` | Previous observed state |
| `MCP_ALERT_TYPE` | `drift` or `flapping` |
| `MCP_TRANSITIONS` | Number of transitions in the flap window (for flapping alerts) |

The alert command is the operator's choice. MCP does not ship with or depend on any notification system.

```bash
# Push notification
alert_command = "ntfy publish mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER ($MCP_DESIRED -> $MCP_OBSERVED)'"

# Webhook
alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."

# Syslog
alert_command = "logger -t mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER on $MCP_NODE'"
```

#### Deployment

`mcp watch` is intended to run as a systemd service on a node (typically rift for v1), not on the operator's workstation. It uses the same binary, config, and database as the CLI -- the difference is the execution mode.

```ini
[Unit]
Description=MCP Watch (continuous monitoring)
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
Restart=on-failure
RestartSec=10
User=mcp
Group=mcp
```

This means the master's SQLite database lives on the node running `mcp watch` (rift for v1), not on the operator's workstation. The operator can still run CLI commands from vade -- they connect to the same database via a configurable path, or the CLI operates statelessly by querying agents directly and reading the registry from a shared location.

**Alternatively**, the operator runs `mcp watch` on vade when at their desk and relies on container restart policies for unattended operation. The design supports both -- `mcp watch` is an optional monitoring layer, not a required component.

---

## Agent

### gRPC Service Definition

The agent exposes a single gRPC service. All RPCs require admin authentication.
```protobuf
syntax = "proto3";

package mcp.v1;

import "google/protobuf/timestamp.proto";

service McpAgent {
  // Container lifecycle
  rpc Deploy(DeployRequest) returns (DeployResponse);
  rpc StartContainer(ContainerRequest) returns (ContainerResponse);
  rpc StopContainer(ContainerRequest) returns (ContainerResponse);
  rpc RestartContainer(ContainerRequest) returns (ContainerResponse);
  rpc RemoveContainer(ContainerRequest) returns (ContainerResponse);

  // Status
  rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);

  // File transfer
  rpc PushFile(PushFileRequest) returns (PushFileResponse);
  rpc PullFile(PullFileRequest) returns (PullFileResponse);

  // Node
  rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse);
}

// --- Container lifecycle ---

message ContainerSpec {
  string name = 1;
  string image = 2;
  string network = 3;
  string user = 4;
  string restart_policy = 5;
  repeated string ports = 6;    // "host:container" mappings
  repeated string volumes = 7;  // "host:container" mount specs
  repeated string cmd = 8;      // command and arguments
}

message DeployRequest {
  repeated ContainerSpec containers = 1;
}

message DeployResponse {
  repeated ContainerResult results = 1;
}

message ContainerRequest {
  string name = 1;
}

message ContainerResponse {
  string name = 1;
  bool success = 2;
  string error = 3;
}

message ContainerResult {
  string name = 1;
  bool success = 2;
  string error = 3;
}

// --- Status ---

message ListContainersRequest {}

message ContainerInfo {
  string name = 1;
  string image = 2;
  string state = 3;  // "running", "stopped", "exited", etc.
  string network = 4;
  string user = 5;
  string restart_policy = 6;
  repeated string ports = 7;
  repeated string volumes = 8;
  repeated string cmd = 9;
  google.protobuf.Timestamp created = 10;
  google.protobuf.Timestamp started = 11;
}

message ListContainersResponse {
  repeated ContainerInfo containers = 1;
}

// --- File transfer ---
// All file paths are relative to /srv/<service>/ on the node.
// The agent resolves the full path and rejects traversal attempts.

message PushFileRequest {
  string service = 1;  // service name (-> /srv/<service>/)
  string path = 2;     // relative path within service dir
  bytes content = 3;
  uint32 mode = 4;     // file permissions (e.g. 0600)
}

message PushFileResponse {
  bool success = 1;
  string error = 2;
}

message PullFileRequest {
  string service = 1;  // service name (-> /srv/<service>/)
  string path = 2;     // relative path within service dir
}

message PullFileResponse {
  bytes content = 1;
  uint32 mode = 2;
  string error = 3;
}

// --- Node ---

message NodeStatusRequest {}

message NodeStatusResponse {
  string node_name = 1;
  string runtime = 2;  // "podman", "docker"
  string runtime_version = 3;
  uint32 container_count = 4;
  google.protobuf.Timestamp uptime_since = 5;
}
```

### Container Runtime

The agent manages containers by executing the local container runtime CLI (`podman` or `docker`). The runtime is configured in the agent's config file. The agent does not use the Docker socket API -- it shells out to the CLI for simplicity and debuggability (the operator can always run the same commands manually).

#### Deploy Flow

When the agent receives a `Deploy` RPC:

1. For each container in the request:
   a. Pull the image: `<runtime> pull <image>`
   b. Stop and remove the existing container (if any): `<runtime> stop <name>` and `<runtime> rm <name>`
   c. Start the new container: `<runtime> run -d --name <name> [flags] <image> [cmd]`
   d. Verify the container is running: `<runtime> inspect <name>`
2. Return success/failure per container.

The flags passed to `run` are derived from the `ContainerSpec`:

| Spec field | Runtime flag |
|------------|--------------|
| `network` | `--network <network>` |
| `user` | `--user <user>` |
| `restart_policy` | `--restart <policy>` |
| `ports` | `-p <mapping>` (repeated) |
| `volumes` | `-v <mapping>` (repeated) |
| `cmd` | appended after the image name |

### File Transfer

The agent supports single-file push and pull, scoped to a specific service's data directory. This is the mechanism for deploying config files and certificates to nodes.
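The service-directory scoping might look like the following sketch. `resolveServicePath` is a hypothetical helper, not the actual agent code; it omits the symlink check (which would need `filepath.EvalSymlinks` against the real filesystem at runtime) and only handles lexical `..` traversal:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveServicePath maps (service, relative path) to an absolute path under
// /srv/<service>/ and rejects anything that would escape that directory.
func resolveServicePath(service, rel string) (string, error) {
	if service == "" || strings.ContainsAny(service, "/\\") {
		return "", fmt.Errorf("invalid service name %q", service)
	}
	base := filepath.Join("/srv", service)
	full := filepath.Join(base, rel) // Join cleans ".." components
	// After cleaning, the result must still be the base dir or inside it.
	if full != base && !strings.HasPrefix(full, base+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes %s", rel, base)
	}
	return full, nil
}

func main() {
	p, _ := resolveServicePath("mcr", "certs/mcr.pem")
	fmt.Println(p) // /srv/mcr/certs/mcr.pem

	_, err := resolveServicePath("mcr", "../metacrypt/secret")
	fmt.Println(err != nil) // true: traversal rejected
}
```

The lexical check works because `filepath.Join` collapses `..` segments before the prefix comparison, so `../metacrypt/secret` resolves to `/srv/metacrypt/secret` and fails the `/srv/mcr/` prefix test.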
Every file operation specifies a **service name** and a **relative path**. The agent resolves the full path as `/srv/<service>/<path>`. This scoping ensures that a deploy for service A cannot accidentally write into service B's directory.

**Push**: the master sends the service name, relative path, file content, and permissions. The agent resolves the path, validates it (no `..` traversal, no symlinks escaping the service directory), creates intermediate directories if needed, and writes the file atomically (write to temp file, then rename).

**Pull**: the master sends the service name and relative path. The agent resolves the path, validates it, reads the file, and returns the content and permissions.

```
# Push mcr.toml into /srv/mcr/mcr.toml
mcp push mcr.toml mcr

# Push a cert into /srv/mcr/certs/mcr.pem
mcp push cert.pem mcr certs/mcr.pem

# Pull a config file back
mcp pull mcr mcr.toml ./mcr.toml
```

When the relative path is omitted from `mcp push`, the basename of the local file is used.

File size is bounded by gRPC message limits. For v1, the default 4MB gRPC message size is sufficient -- config files and certificates are kilobytes. If larger transfers are needed in the future, streaming RPCs or the v2 tar.zst archive transfer will handle them.

---

## Database Schema

The master's SQLite database stores the service registry.
```sql
CREATE TABLE nodes (
    name TEXT PRIMARY KEY,
    address TEXT NOT NULL,
    agent_port INTEGER NOT NULL DEFAULT 9444,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE services (
    name TEXT PRIMARY KEY,
    node TEXT NOT NULL REFERENCES nodes(name),
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE containers (
    name TEXT PRIMARY KEY,
    service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE,
    image TEXT NOT NULL,
    network TEXT NOT NULL DEFAULT 'bridge',
    user_spec TEXT NOT NULL DEFAULT '',
    restart_policy TEXT NOT NULL DEFAULT 'unless-stopped',
    desired_state TEXT NOT NULL DEFAULT 'running',
    observed_state TEXT NOT NULL DEFAULT 'unknown',
    managed INTEGER NOT NULL DEFAULT 1,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE container_ports (
    container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
    mapping TEXT NOT NULL,
    PRIMARY KEY (container, mapping)
);

CREATE TABLE container_volumes (
    container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
    mapping TEXT NOT NULL,
    PRIMARY KEY (container, mapping)
);

CREATE TABLE container_cmd (
    container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
    position INTEGER NOT NULL,
    arg TEXT NOT NULL,
    PRIMARY KEY (container, position)
);

-- Event log for monitoring and alerting
CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    container TEXT NOT NULL,
    node TEXT NOT NULL,
    prev_state TEXT NOT NULL,
    new_state TEXT NOT NULL,
    timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE INDEX idx_events_container_time ON events(container, timestamp);
```

### State Values

**Desired state** (set by operator actions):

| State | Meaning |
|-------|---------|
| `running` | Operator wants this container running |
| `stopped` | Operator deliberately stopped this container |

**Observed state** (set by agent reports):

| State | Meaning |
|-------|---------|
| `running` | Container is running (confirmed by agent) |
| `stopped` | Container exists but is not running |
| `exited` | Container exited (crashed or completed) |
| `removed` | Container no longer exists on the node |
| `unknown` | State has not been queried since registration |

The `managed` flag distinguishes MCP-deployed containers from pre-existing ones discovered during sync. Unmanaged containers have no desired state set (`desired_state` is empty). They appear in status output but MCP will not modify them unless the operator explicitly adopts them with `mcp adopt`.

---

## Configuration

### Master Config

When running as a CLI on the operator's workstation:

```toml
[database]
path = "/home/kyle/.config/mcp/mcp.db"

[services]
dir = "/home/kyle/.config/mcp/services"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp"

[auth]
token_path = "/home/kyle/.config/mcp/token"
```

Directory layout on the operator's workstation:

```
~/.config/mcp/
├── mcp.toml        Master config
├── mcp.db          Service registry
├── token           Cached MCIAS bearer token (0600)
└── services/       Service definition files
    ├── metacrypt.toml
    ├── mcr.toml
    ├── mc-proxy.toml
    └── ...
```

When running as a service on a node (for `mcp watch`):

```toml
[database]
path = "/srv/mcp/mcp.db"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp"
# System account credentials for unattended operation
username = "mcp"
password_file = "/srv/mcp/credentials"

[watch]
interval = "60s"
alert_command = ""
cooldown = "15m"
flap_threshold = 3
flap_window = "10m"
retention = "30d"

[log]
level = "info"
```

The master binary is the same in both cases. When running from the operator's workstation, it uses `~/.config/mcp/` for local state and interactive login. When running as a systemd service on a node, it uses `/srv/mcp/` following the platform convention and authenticates with a system account.
### Agent Config

```toml
[server]
grpc_addr = ":9444"
tls_cert = "/srv/mcp/certs/cert.pem"
tls_key = "/srv/mcp/certs/key.pem"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp-agent"

[agent]
node_name = "rift"
container_runtime = "podman"

[log]
level = "info"
```

The agent config lives in `/srv/mcp/` following the platform convention. The agent's data directory contains only its config and TLS material -- it has no database of its own. The container runtime is the source of truth for local container state.

---

## Deployment

### Agent Deployment (on nodes)

The agent is deployed like any other Metacircular service:

1. Create the `/srv/mcp/` directory and config file.
2. Provision a TLS certificate from Metacrypt.
3. Create an MCIAS system account for the agent (`mcp-agent`).
4. Install the `mcp-agent` binary.
5. Start via systemd unit.

The agent runs as a systemd service (container-first deployment is a v2 concern -- MCP needs to be running before it can manage its own agent).

```ini
[Unit]
Description=MCP Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcp-agent server --config /srv/mcp/mcp-agent.toml
Restart=on-failure
RestartSec=5
User=mcp
Group=mcp
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv

[Install]
WantedBy=multi-user.target
```

Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes files to any service's `/srv/<service>/` directory on behalf of the master.

### Master Installation (on operator workstation)

For interactive CLI use:

1. Install the `mcp` binary to `~/.local/bin/` or `/usr/local/bin/`.
2. Create `~/.config/mcp/mcp.toml` with MCIAS and registry config.
3. Run `mcp login` to authenticate.
4. Run `mcp node add rift <address>` to register the first node.
5. Run `mcp sync` to discover existing containers on the node.

### Master Watch Deployment (on a node)

For continuous monitoring, deploy `mcp watch` as a systemd service on a node (rift for v1). This runs alongside the agent on the same machine.

1. Install the `mcp` binary to `/usr/local/bin/`.
2. Create `/srv/mcp/mcp.toml` with database, MCIAS system account, and `[watch]` settings.
3. Create an MCIAS system account for the master (`mcp`).

```ini
[Unit]
Description=MCP Watch
After=network-online.target mcp-agent.service
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
Restart=on-failure
RestartSec=10
User=mcp
Group=mcp
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv/mcp

[Install]
WantedBy=multi-user.target
```

When `mcp watch` runs on a node and the operator uses the CLI from vade, they share the same registry database. The operator can either point their CLI config at the node's database (via the overlay) or maintain a separate local database and sync it. For v1 with one node, the simplest option is running all `mcp` commands on rift (via SSH) where the database lives.
---

## Security Model

### Threat Mitigations

| Threat | Mitigation |
|--------|------------|
| Unauthorized C2 commands | Agent requires admin MCIAS token on every RPC |
| Token theft | Tokens have short expiry; cached validation keyed by SHA-256 |
| Agent impersonation | Master verifies agent TLS certificate against known CA |
| Arbitrary file write via push | Agent restricts writes to `/srv/<service>/` for the named service |
| Arbitrary file read via pull | Agent restricts reads to `/srv/<service>/` for the named service |
| Cross-service file access | File ops require a service name; agent resolves to that service's directory only |
| Container runtime escape | Agent runs as non-root where possible; containers follow platform hardening |
| Network eavesdropping | All C2 traffic is gRPC over mTLS over encrypted overlay |

### Security Invariants

1. Every agent RPC requires a valid MCIAS admin token. No anonymous or unprivileged access.
2. File operations are scoped to `/srv/<service>/` for the named service. Path traversal attempts (`../`, symlinks outside the service directory) are rejected.
3. The agent never executes arbitrary commands. It only runs container runtime operations and file I/O through well-defined code paths.
4. TLS 1.3 minimum on the agent's gRPC listener.
5. The master's stored token is file-permission protected (0600).
---

## Project Structure

```
mcp/
├── cmd/
│   ├── mcp/                 Master CLI
│   │   ├── main.go
│   │   ├── login.go
│   │   ├── deploy.go
│   │   ├── lifecycle.go     stop, start, restart
│   │   ├── status.go        ps, status, sync
│   │   ├── adopt.go         adopt unmanaged containers
│   │   ├── service.go       service show/export
│   │   ├── transfer.go      push, pull
│   │   ├── watch.go         watch (long-lived monitoring)
│   │   └── node.go          node add/list/remove
│   └── mcp-agent/           Agent daemon
│       ├── main.go
│       └── server.go
├── internal/
│   ├── agent/               Agent logic
│   │   ├── agent.go         Agent struct, setup
│   │   ├── containers.go    Container runtime operations
│   │   ├── files.go         File push/pull
│   │   └── runtime.go       Runtime abstraction (podman/docker)
│   ├── registry/            Service registry
│   │   ├── db.go            Schema, migrations
│   │   ├── nodes.go         Node CRUD
│   │   ├── services.go      Service CRUD
│   │   ├── containers.go    Container CRUD
│   │   ├── events.go        Event log
│   │   └── sync.go          Registry reconciliation
│   ├── watch/               Monitoring subsystem
│   │   ├── watch.go         Watch loop
│   │   └── alerting.go      Alert evaluation and command execution
│   ├── servicedef/          Service definition file parsing
│   │   └── servicedef.go    Load, parse, write TOML service defs
│   ├── auth/                MCIAS integration
│   │   └── auth.go          Token validation, login
│   ├── config/              Configuration loading
│   │   ├── master.go
│   │   └── agent.go
│   └── grpcserver/          Agent gRPC server wiring
│       ├── server.go
│       ├── interceptors.go  Auth interceptor
│       └── handlers.go      RPC implementations
├── proto/mcp/
│   └── v1/
│       └── mcp.proto
├── gen/mcp/
│   └── v1/                  Generated Go code
├── deploy/
│   ├── systemd/
│   │   ├── mcp-agent.service
│   │   └── mcp-watch.service
│   ├── examples/
│   │   ├── mcp.toml
│   │   └── mcp-agent.toml
│   └── scripts/
│       └── install-agent.sh
├── Makefile
├── buf.yaml
├── .golangci.yaml
├── CLAUDE.md
└── ARCHITECTURE.md
```

---

## Future Work (v2+)

These are explicitly out of scope for v1 but inform the design:

- **Auto-reconciliation**: `mcp watch` detects drift but does not act on it in v1. v2 adds configurable auto-restart for drifted containers (with backoff to avoid restart storms).
- **Migration**: snapshot `/srv/<service>/` as tar.zst (with VACUUM INTO for clean DB copies), stream to the destination node, restore. Requires streaming gRPC and archive assembly logic.
- **Scheduling**: automatic node selection based on resource availability and operator constraints. Requires node resource reporting from agents.
- **Certificate provisioning**: MCP provisions TLS certs from Metacrypt during deploy via the ACME client library.
- **DNS updates**: MCP pushes record updates to MCNS after deploy/migrate. Requires MCNS to have an API (or, as a stopgap, zone file editing).
- **Multi-node orchestration**: deploy across multiple nodes, rolling updates, health-aware placement.
- **Web UI**: master exposes a web interface for registry browsing and operations. Would add REST+gRPC on the master side.