mcp/ARCHITECTURE.md
Kyle Isom 6b99937a69 Add MCP v1 architecture specification
Design spec for the Metacircular Control Plane covering master/agent
architecture, service registry with desired/observed state tracking,
container lifecycle management, service definition files, single-file
transfer scoped to /srv/<service>/, and continuous monitoring via
mcp watch with event logging and alerting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:42:41 -07:00


MCP -- Metacircular Control Plane

Overview

MCP is the orchestrator for the Metacircular platform. It manages container lifecycle, tracks what services run where, and transfers files between the operator's workstation and managed nodes.

MCP uses a master/agent architecture:

  • The agent owns "what is". It runs on each managed node, queries the local container runtime, and reports facts. It manages containers and files on behalf of the master. It has no opinion about what should be running.

  • The master owns "what should be". It maintains the service registry (desired state), compares it against agent-reported observed state, records events, detects drift, and alerts the operator. The master runs in two modes: as an interactive CLI for operator commands, and as a long-lived watch process for continuous monitoring.

Agents keep their containers running independently. The master is needed to issue commands, track desired state, and monitor for drift.

v1 Scope

v1 targets a single-node deployment (one agent on rift, master on vade). The core operations are:

  • Deploy -- pull a container image and start (or restart) a service on a specified node.
  • Container lifecycle -- stop, start, restart individual containers.
  • Status -- query live container state from agents and reconcile the master's service registry.
  • Watch -- continuous monitoring. Periodically syncs with agents, records state-change events, detects drift, and alerts the operator.
  • File transfer -- push or pull individual files between master and nodes (config files, certificates).

Explicitly not in v1: migration (snapshot/tar.zst transfer), automatic scheduling/placement, certificate provisioning from Metacrypt, DNS updates to MCNS, multi-node orchestration, auto-reconciliation (restarting drifted containers).


Architecture

Operator workstation (vade)
  ┌──────────────────────────────┐
  │  mcp (master CLI)            │
  │                              │
  │  ┌────────────────────────┐  │
  │  │ Service Registry       │  │
  │  │ (SQLite)               │  │
  │  └────────────────────────┘  │
  │                              │
  │  gRPC client ────────────────┼──── overlay ────┐
  └──────────────────────────────┘                 │
                                                   │
MC Node (rift)                                     │
  ┌────────────────────────────────────────────────┼──┐
  │                                                │  │
  │  ┌──────────────────────┐                      │  │
  │  │ mcp-agent            │◄─────────────────────┘  │
  │  │ (gRPC server)        │                         │
  │  │                      │                         │
  │  │ MCIAS token          │                         │
  │  │ validation           │                         │
  │  └──────────┬───────────┘                         │
  │             │                                     │
  │             │ exec                                │
  │             ▼                                     │
  │  ┌──────────────────────┐                         │
  │  │ Container runtime    │                         │
  │  │ (podman / docker)    │                         │
  │  │                      │                         │
  │  │  ┌─────┐ ┌─────┐     │    ┌──────────────────┐ │
  │  │  │ svc │ │ svc │ ... │    │ /srv/<service>/  │ │
  │  │  │  α  │ │  β  │     │    │ config, db,      │ │
  │  │  └─────┘ └─────┘     │    │ certs, backups   │ │
  │  └──────────────────────┘    └──────────────────┘ │
  └───────────────────────────────────────────────────┘

Components

Component   Binary      Role
Master      mcp         Operator CLI. Maintains the service registry. Issues commands to agents.
Agent       mcp-agent   Node daemon. Manages local containers and files. Authenticates callers via MCIAS.

Communication

Master and agent communicate over gRPC with mutual TLS. The transport is the encrypted overlay network (Tailscale/WireGuard). The master authenticates to the agent by presenting an MCIAS bearer token. The agent validates the token by calling MCIAS and checking for the admin role.


Authentication and Authorization

MCP follows the platform authentication model: all auth is delegated to MCIAS.

Agent Authentication

The agent is a gRPC server with a unary interceptor that enforces authentication on every RPC:

  1. Master includes an MCIAS bearer token in the gRPC metadata (authorization: Bearer <token>).
  2. Agent extracts the token and validates it against MCIAS (cached 30s by SHA-256 of the token, per platform convention).
  3. Agent checks that the caller has the admin role. All MCP operations require admin -- there is no unprivileged MCP access.
  4. If validation fails, the RPC returns UNAUTHENTICATED (invalid/expired token) or PERMISSION_DENIED (valid token, not admin).

Master Authentication

The master authenticates to MCIAS before issuing commands. The token can be obtained by:

  1. mcp login -- interactive login, stores the token locally.
  2. Environment variable (MCP_TOKEN) for scripted use.
  3. System account credentials in the master config file.

The stored token is used for all subsequent agent RPCs until it expires.

Agent Identity

The agent itself authenticates to MCIAS as a system account (for future agent-initiated flows like health reporting). For v1, the agent is purely reactive -- it only responds to master-initiated RPCs and does not make outbound calls beyond MCIAS token validation.


Master

CLI Commands

mcp login                              Authenticate to MCIAS, store token

mcp deploy <service>                   Deploy from service definition file
mcp deploy <service> -f <file>         Deploy from explicit file
mcp deploy <service> --image <tag>     Deploy with image override
mcp stop <service>                     Stop all containers for a service
mcp start <service>                    Start all containers for a service
mcp restart <service>                  Restart all containers for a service

mcp ps                                 List all services in the registry
mcp status [service]                   Query live state from agents, reconcile registry
mcp sync                               Full rescan: agents report all containers,
                                         registry is reconciled against reality

mcp adopt <container> <service>        Claim an unmanaged container into a service

mcp service show <service>             Print current spec from registry
mcp service edit <service>             Open service definition in $EDITOR
mcp service export <service>           Write registry spec to local service file
mcp service export <service> -f <file> Write to explicit path

mcp push <local-file> <service> [path] Copy a local file into /srv/<service>/[path]
mcp pull <service> <path> [local-file] Copy a file from /srv/<service>/<path> to local

mcp watch                              Run continuous monitoring (long-lived)
mcp watch --interval 60s               Custom poll interval

mcp node list                          List registered nodes
mcp node add <name> <address>          Register a node
mcp node remove <name>                 Deregister a node

Service Registry

The master maintains a SQLite database that tracks:

  • Nodes -- name, overlay address, agent port, state.
  • Services -- name, assigned node, current state.
  • Containers -- name, parent service, image reference, container runtime config (network, ports, volumes, user, restart policy, command), observed state.

A service (e.g., "metacrypt") has one or more containers (e.g., "metacrypt" API server + "metacrypt-web" UI). Deploy operates on services; lifecycle commands can target individual containers.

Desired State vs. Observed State

The registry tracks two separate pieces of information for each container:

  • Desired state -- what the operator wants: running or stopped. Set explicitly by mcp deploy, mcp stop, mcp start, etc.
  • Observed state -- what the agent actually sees on the node: running, stopped, exited, removed, or unknown.

These can diverge. A container with desired=running and observed=exited has crashed. A container with desired=stopped and observed=stopped is fine. mcp status shows both states and flags drift -- any container where observed state does not match desired state.

Desired   Observed   Status
running   running    OK
running   stopped    DRIFT -- stopped unexpectedly
running   exited     DRIFT -- crashed
running   removed    DRIFT -- container gone
stopped   stopped    OK
stopped   removed    OK
stopped   running    DRIFT -- running when it shouldn't be

For v1, MCP reports drift but does not auto-reconcile. The operator decides whether to mcp start, mcp deploy, or investigate. Auto-reconciliation (restart crashed containers, enforce desired state) is a v2 concern.

Registry Reconciliation

mcp sync (and implicitly, mcp status) works as follows:

  1. For each registered node, master calls ListContainers on the agent.
  2. Agent queries the container runtime and returns the full list of containers with their configuration and state.
  3. Master compares the agent's report against the registry:
    • Container in registry, seen by agent: update observed state.
    • Container in registry, not seen by agent: set observed state to removed.
    • Container seen by agent, not in registry: add to registry as unmanaged (desired state unset, observed state from agent).
  4. Master prints a summary: OK containers, drifted containers, and newly discovered unmanaged containers.

Adopting Unmanaged Containers

On first mcp sync, every container on rift will appear as unmanaged -- MCP didn't deploy them and doesn't know their intended service grouping.

mcp adopt <container> <service> claims an unmanaged container:

  1. If the service doesn't exist, create it (assigned to the container's node).
  2. Set the container's service to the specified service.
  3. Set managed = true and desired state to running (or stopped if the container is currently stopped).

This lets the operator incrementally bring existing containers under MCP management without redeploying them.

Service Definitions

A service definition is a TOML file that declares the containers for a service. These files live in ~/.config/mcp/services/ by default, one per service.

Example: ~/.config/mcp/services/metacrypt.toml

name = "metacrypt"
node = "rift"

[[containers]]
name = "metacrypt"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]

[[containers]]
name = "metacrypt-web"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18080:8080"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]
cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"]

Deploy Resolution

mcp deploy <service> resolves the container spec through a precedence chain:

  1. Command-line flags (highest) -- --image, --node, etc. override individual fields.
  2. Service definition file -- if -f <file> is specified, use that file. Otherwise look for ~/.config/mcp/services/<service>.toml.
  3. Registry (lowest) -- if no file exists, use the spec from the last successful deploy stored in the registry.

If none of these exist (first deploy, no file), the deploy fails with an error telling the operator to create a service definition.

On a successful deploy, the resolved spec (after all overrides) is written to the registry. The file on disk is not modified -- it represents the operator's declared intent, not the deployed state. To sync the file with reality after overrides, use mcp service export.

Spec Lifecycle

                    ┌─────────────┐
         write      │  Service    │     mcp deploy
        ──────────► │  definition │ ───────────────┐
                    │  (.toml)    │                │
                    └─────────────┘                ▼
                          ▲              ┌─────────────────┐
                          │              │    Registry     │
           mcp service    │              │    (deployed    │
           export         │              │     spec)       │
                          │              └─────────────────┘
                          │                       │
                          └───────────────────────┘
  • Operator writes the service definition file (or copies one from the service's repo).
  • mcp deploy reads the file, applies overrides, sends to agent, records the resolved spec in the registry.
  • mcp service export reads the registry and writes it back to the local file, incorporating any overrides or drift that occurred.

This keeps the file and registry loosely coupled. The file is the starting point; the registry is the truth; export reconciles them.

mcp service edit <service> opens the service definition in $EDITOR (falling back to $VISUAL, then vi). If no file exists yet, it exports the current registry spec to a temp file first, so the operator starts from the deployed state rather than a blank file. After the editor exits, the file is saved to the standard path in the services directory.

Where Definition Files Come From

Service definition files can be:

  • Written by hand by the operator.
  • Copied from the service's repo (a service could ship a deploy/mcp-service.toml as a starting point).
  • Generated by mcp adopt + mcp service export -- adopt existing containers, then export to get a file matching the running config.
  • Generated by converting from mcdeploy.toml during initial MCP migration (one-time).

Monitoring (mcp watch)

mcp watch is a long-lived master process that continuously monitors the platform. It runs the same sync logic as mcp status on a loop, records state-change events, and alerts the operator when drift is detected.

Event Log

Every state transition is recorded in the events table:

CREATE TABLE events (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    container  TEXT NOT NULL,
    node       TEXT NOT NULL,
    prev_state TEXT NOT NULL,
    new_state  TEXT NOT NULL,
    timestamp  TEXT NOT NULL DEFAULT (datetime('now'))
);

Events accumulate over time and support rate queries:

-- How many times has metacrypt exited in the last hour?
SELECT COUNT(*) FROM events
WHERE container = 'metacrypt'
  AND new_state = 'exited'
  AND timestamp > datetime('now', '-1 hour');

Old events are pruned periodically (default: retain 30 days).

Watch Loop

Each iteration of the watch loop:

  1. Sync with all agents (same as mcp sync).
  2. For each observed state change since the last iteration, insert an event.
  3. Evaluate alert conditions against the current state and event history.
  4. If an alert fires, execute the configured alert command.
  5. Sleep for the configured interval.

Alert Conditions

The watch process evaluates two types of alert:

  • Drift alert: a managed container's observed state does not match its desired state. Fires on the transition, not on every poll.
  • Rate alert: a container has changed state more than N times within a window (flapping). Default threshold: 3 transitions in 10 minutes.

Each alert has a cooldown per container. Once an alert fires for a container, it is suppressed for the cooldown period regardless of further transitions. This prevents notification spam from a flapping service.

[watch]
interval = "60s"
alert_command = ""               # command to run on alert; empty = log only
cooldown = "15m"                 # suppress repeat alerts per container
flap_threshold = 3               # state changes within flap_window = flapping
flap_window = "10m"
retention = "30d"                # event log retention

Alert Command

When an alert fires, the master executes the configured command with context passed via environment variables:

Variable          Value
MCP_CONTAINER     Container name
MCP_SERVICE       Parent service name
MCP_NODE          Node name
MCP_DESIRED       Desired state
MCP_OBSERVED      Observed state
MCP_PREV_STATE    Previous observed state
MCP_ALERT_TYPE    drift or flapping
MCP_TRANSITIONS   Number of transitions in the flap window (for flapping alerts)

The alert command is the operator's choice. MCP does not ship with or depend on any notification system.

# Push notification
alert_command = "ntfy publish mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER ($MCP_DESIRED -> $MCP_OBSERVED)'"

# Webhook
alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."

# Syslog
alert_command = "logger -t mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER on $MCP_NODE'"

Deployment

mcp watch is intended to run as a systemd service on a node (typically rift for v1), not on the operator's workstation. It uses the same binary, config, and database as the CLI -- the difference is the execution mode.

[Unit]
Description=MCP Watch (continuous monitoring)
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
Restart=on-failure
RestartSec=10
User=mcp
Group=mcp

This means the master's SQLite database lives on the node running mcp watch (rift for v1), not on the operator's workstation. The operator can still run CLI commands from vade -- they connect to the same database via a configurable path, or the CLI operates statelessly by querying agents directly and reading the registry from a shared location.

Alternatively, the operator runs mcp watch on vade when at their desk and relies on container restart policies for unattended operation. The design supports both -- mcp watch is an optional monitoring layer, not a required component.


Agent

gRPC Service Definition

The agent exposes a single gRPC service. All RPCs require admin authentication.

syntax = "proto3";
package mcp.v1;

import "google/protobuf/timestamp.proto";

service McpAgent {
  // Container lifecycle
  rpc Deploy(DeployRequest) returns (DeployResponse);
  rpc StartContainer(ContainerRequest) returns (ContainerResponse);
  rpc StopContainer(ContainerRequest) returns (ContainerResponse);
  rpc RestartContainer(ContainerRequest) returns (ContainerResponse);
  rpc RemoveContainer(ContainerRequest) returns (ContainerResponse);

  // Status
  rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);

  // File transfer
  rpc PushFile(PushFileRequest) returns (PushFileResponse);
  rpc PullFile(PullFileRequest) returns (PullFileResponse);

  // Node
  rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse);
}

// --- Container lifecycle ---

message ContainerSpec {
  string name = 1;
  string image = 2;
  string network = 3;
  string user = 4;
  string restart_policy = 5;
  repeated string ports = 6;       // "host:container" mappings
  repeated string volumes = 7;     // "host:container" mount specs
  repeated string cmd = 8;         // command and arguments
}

message DeployRequest {
  repeated ContainerSpec containers = 1;
}

message DeployResponse {
  repeated ContainerResult results = 1;
}

message ContainerRequest {
  string name = 1;
}

message ContainerResponse {
  string name = 1;
  bool success = 2;
  string error = 3;
}

message ContainerResult {
  string name = 1;
  bool success = 2;
  string error = 3;
}

// --- Status ---

message ListContainersRequest {}

message ContainerInfo {
  string name = 1;
  string image = 2;
  string state = 3;              // "running", "stopped", "exited", etc.
  string network = 4;
  string user = 5;
  string restart_policy = 6;
  repeated string ports = 7;
  repeated string volumes = 8;
  repeated string cmd = 9;
  google.protobuf.Timestamp created = 10;
  google.protobuf.Timestamp started = 11;
}

message ListContainersResponse {
  repeated ContainerInfo containers = 1;
}

// --- File transfer ---
// All file paths are relative to /srv/<service>/ on the node.
// The agent resolves the full path and rejects traversal attempts.

message PushFileRequest {
  string service = 1;            // service name (-> /srv/<service>/)
  string path = 2;               // relative path within service dir
  bytes content = 3;
  uint32 mode = 4;               // file permissions (e.g. 0600)
}

message PushFileResponse {
  bool success = 1;
  string error = 2;
}

message PullFileRequest {
  string service = 1;            // service name (-> /srv/<service>/)
  string path = 2;               // relative path within service dir
}

message PullFileResponse {
  bytes content = 1;
  uint32 mode = 2;
  string error = 3;
}

// --- Node ---

message NodeStatusRequest {}

message NodeStatusResponse {
  string node_name = 1;
  string runtime = 2;            // "podman", "docker"
  string runtime_version = 3;
  uint32 container_count = 4;
  google.protobuf.Timestamp uptime_since = 5;
}

Container Runtime

The agent manages containers by executing the local container runtime CLI (podman or docker). The runtime is configured in the agent's config file. The agent does not use the Docker socket API -- it shells out to the CLI for simplicity and debuggability (the operator can always run the same commands manually).

Deploy Flow

When the agent receives a Deploy RPC:

  1. For each container in the request:
     a. Pull the image: <runtime> pull <image>
     b. Stop and remove the existing container (if any): <runtime> stop <name> and <runtime> rm <name>
     c. Start the new container: <runtime> run -d --name <name> [flags] <image> [cmd]
     d. Verify the container is running: <runtime> inspect <name>
  2. Return success/failure per container.

The flags passed to run are derived from the ContainerSpec:

Spec field       Runtime flag
network          --network <network>
user             --user <user>
restart_policy   --restart <policy>
ports            -p <mapping> (repeated)
volumes          -v <mapping> (repeated)
cmd              appended after the image name

File Transfer

The agent supports single-file push and pull, scoped to a specific service's data directory. This is the mechanism for deploying config files and certificates to nodes.

Every file operation specifies a service name and a relative path. The agent resolves the full path as /srv/<service>/<path>. This scoping ensures that a deploy for service A cannot accidentally write into service B's directory.

Push: master sends the service name, relative path, file content, and permissions. The agent resolves the path, validates it (no .. traversal, no symlinks escaping the service directory), creates intermediate directories if needed, and writes the file atomically (write to temp file, then rename).

Pull: master sends the service name and relative path. The agent resolves the path, validates it, reads the file, and returns the content and permissions.

# Push mcr.toml into /srv/mcr/mcr.toml
mcp push mcr.toml mcr

# Push a cert into /srv/mcr/certs/mcr.pem
mcp push cert.pem mcr certs/mcr.pem

# Pull a config file back
mcp pull mcr mcr.toml ./mcr.toml

When the relative path is omitted from mcp push, the basename of the local file is used.

File size is bounded by gRPC message limits. For v1, the default 4MB gRPC message size is sufficient -- config files and certificates are kilobytes. If larger transfers are needed in the future, streaming RPCs or the v2 tar.zst archive transfer will handle them.


Database Schema

The master's SQLite database stores the service registry.

CREATE TABLE nodes (
    name       TEXT PRIMARY KEY,
    address    TEXT NOT NULL,
    agent_port INTEGER NOT NULL DEFAULT 9444,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE services (
    name       TEXT PRIMARY KEY,
    node       TEXT NOT NULL REFERENCES nodes(name),
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE containers (
    name           TEXT PRIMARY KEY,
    service        TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE,
    image          TEXT NOT NULL,
    network        TEXT NOT NULL DEFAULT 'bridge',
    user_spec      TEXT NOT NULL DEFAULT '',
    restart_policy TEXT NOT NULL DEFAULT 'unless-stopped',
    desired_state  TEXT NOT NULL DEFAULT 'running',
    observed_state TEXT NOT NULL DEFAULT 'unknown',
    managed        INTEGER NOT NULL DEFAULT 1,
    created_at     TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at     TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE container_ports (
    container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
    mapping   TEXT NOT NULL,
    PRIMARY KEY (container, mapping)
);

CREATE TABLE container_volumes (
    container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
    mapping   TEXT NOT NULL,
    PRIMARY KEY (container, mapping)
);

CREATE TABLE container_cmd (
    container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
    position  INTEGER NOT NULL,
    arg       TEXT NOT NULL,
    PRIMARY KEY (container, position)
);

-- Event log for monitoring and alerting
CREATE TABLE events (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    container  TEXT NOT NULL,
    node       TEXT NOT NULL,
    prev_state TEXT NOT NULL,
    new_state  TEXT NOT NULL,
    timestamp  TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE INDEX idx_events_container_time ON events(container, timestamp);

State Values

Desired state (set by operator actions):

State     Meaning
running   Operator wants this container running
stopped   Operator deliberately stopped this container

Observed state (set by agent reports):

State     Meaning
running   Container is running (confirmed by agent)
stopped   Container exists but is not running
exited    Container exited (crashed or completed)
removed   Container no longer exists on the node
unknown   State has not been queried since registration

The managed flag distinguishes MCP-deployed containers from pre-existing ones discovered during sync. Unmanaged containers have no desired state set (desired_state is empty). They appear in status output but MCP will not modify them unless the operator explicitly adopts them with mcp adopt.


Configuration

Master Config

When running as a CLI on the operator's workstation:

[database]
path = "/home/kyle/.config/mcp/mcp.db"

[services]
dir = "/home/kyle/.config/mcp/services"

[mcias]
server_url   = "https://mcias.metacircular.net:8443"
ca_cert      = ""
service_name = "mcp"

[auth]
token_path = "/home/kyle/.config/mcp/token"

Directory layout on the operator's workstation:

~/.config/mcp/
├── mcp.toml                    Master config
├── mcp.db                      Service registry
├── token                       Cached MCIAS bearer token (0600)
└── services/                   Service definition files
    ├── metacrypt.toml
    ├── mcr.toml
    ├── mc-proxy.toml
    └── ...

When running as a service on a node (for mcp watch):

[database]
path = "/srv/mcp/mcp.db"

[mcias]
server_url   = "https://mcias.metacircular.net:8443"
ca_cert      = ""
service_name = "mcp"
# System account credentials for unattended operation
username     = "mcp"
password_file = "/srv/mcp/credentials"

[watch]
interval       = "60s"
alert_command  = ""
cooldown       = "15m"
flap_threshold = 3
flap_window    = "10m"
retention      = "30d"

[log]
level = "info"

The master binary is the same in both cases. When running from the operator's workstation, it uses ~/.config/mcp/ for local state and interactive login. When running as a systemd service on a node, it uses /srv/mcp/ following the platform convention and authenticates with a system account.

Agent Config

[server]
grpc_addr = ":9444"
tls_cert  = "/srv/mcp/certs/cert.pem"
tls_key   = "/srv/mcp/certs/key.pem"

[mcias]
server_url   = "https://mcias.metacircular.net:8443"
ca_cert      = ""
service_name = "mcp-agent"

[agent]
node_name         = "rift"
container_runtime = "podman"

[log]
level = "info"

The agent config lives in /srv/mcp/ following the platform convention. The agent's data directory contains only its config and TLS material -- it has no database of its own. The container runtime is the source of truth for local container state.


Deployment

Agent Deployment (on nodes)

The agent is deployed like any other Metacircular service:

  1. Create /srv/mcp/ directory and config file.
  2. Provision TLS certificate from Metacrypt.
  3. Create an MCIAS system account for the agent (mcp-agent).
  4. Install the mcp-agent binary.
  5. Start via systemd unit.

The agent runs as a systemd service (container-first deployment is a v2 concern -- MCP needs to be running before it can manage its own agent).

[Unit]
Description=MCP Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcp-agent server --config /srv/mcp/mcp-agent.toml
Restart=on-failure
RestartSec=5

User=mcp
Group=mcp

NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv

[Install]
WantedBy=multi-user.target

Note: ReadWritePaths=/srv (not /srv/mcp) because the agent writes files to any service's /srv/<service>/ directory on behalf of the master.

Master Installation (on operator workstation)

For interactive CLI use:

  1. Install the mcp binary to ~/.local/bin/ or /usr/local/bin/.
  2. Create ~/.config/mcp/mcp.toml with MCIAS and registry config.
  3. Run mcp login to authenticate.
  4. Run mcp node add rift <address> to register the first node.
  5. Run mcp sync to discover existing containers on the node.

Master Watch Deployment (on a node)

For continuous monitoring, deploy mcp watch as a systemd service on a node (rift for v1). This runs alongside the agent on the same machine.

  1. Install the mcp binary to /usr/local/bin/.
  2. Create /srv/mcp/mcp.toml with database, MCIAS system account, and [watch] settings.
  3. Create an MCIAS system account for the master (mcp).
  4. Install and enable the systemd unit:

[Unit]
Description=MCP Watch
After=network-online.target mcp-agent.service
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
Restart=on-failure
RestartSec=10

User=mcp
Group=mcp

NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv/mcp

[Install]
WantedBy=multi-user.target

When mcp watch runs on a node and the operator uses the CLI from vade, they share the same registry database. The operator can either point their CLI config at the node's database (via the overlay) or maintain a separate local database and sync it. For v1 with one node, the simplest option is running all mcp commands on rift (via SSH) where the database lives.


Security Model

Threat Mitigations

Threat                          Mitigation
Unauthorized C2 commands        Agent requires admin MCIAS token on every RPC
Token theft                     Tokens have short expiry; cached validation keyed by SHA-256
Agent impersonation             Master verifies agent TLS certificate against known CA
Arbitrary file write via push   Agent restricts writes to /srv/<service>/ for the named service
Arbitrary file read via pull    Agent restricts reads to /srv/<service>/ for the named service
Cross-service file access       File ops require a service name; agent resolves to that service's directory only
Container runtime escape        Agent runs as non-root where possible; containers follow platform hardening
Network eavesdropping           All C2 traffic is gRPC over mTLS over encrypted overlay

Security Invariants

  1. Every agent RPC requires a valid MCIAS admin token. No anonymous or unprivileged access.
  2. File operations are scoped to /srv/<service>/ for the named service. Path traversal attempts (../, symlinks outside the service directory) are rejected.
  3. The agent never executes arbitrary commands. It only runs container runtime operations and file I/O through well-defined code paths.
  4. TLS 1.3 minimum on the agent's gRPC listener.
  5. The master's stored token is file-permission protected (0600).
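Invariant 2 amounts to a containment check after path cleaning. A sketch under assumed names (resolveServicePath is invented here, not the agent's actual API); a real implementation would additionally resolve symlinks on the result (filepath.EvalSymlinks) to catch links pointing outside the service directory:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveServicePath maps a client-supplied relative path onto the named
// service's directory under /srv, rejecting anything that would escape it.
// Illustrative sketch of invariant 2; symlink resolution is omitted.
func resolveServicePath(service, rel string) (string, error) {
	// A service name containing separators could itself escape /srv.
	if service == "" || strings.ContainsAny(service, `/\`) {
		return "", fmt.Errorf("invalid service name %q", service)
	}
	root := filepath.Join("/srv", service)
	// Join cleans the result, collapsing any ../ components the client sent.
	p := filepath.Join(root, rel)
	if p != root && !strings.HasPrefix(p, root+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes %s", rel, root)
	}
	return p, nil
}

func main() {
	p, err := resolveServicePath("web", "config/app.toml")
	fmt.Println(p, err) // /srv/web/config/app.toml <nil>
	_, err = resolveServicePath("web", "../mcias/secrets.db")
	fmt.Println(err != nil) // true: traversal rejected
}
```

Checking containment after Join (which cleans the path) is what defeats ../ sequences: the escape is collapsed into an absolute path outside root and then rejected, rather than pattern-matched in the raw input.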

Project Structure

mcp/
├── cmd/
│   ├── mcp/                  Master CLI
│   │   ├── main.go
│   │   ├── login.go
│   │   ├── deploy.go
│   │   ├── lifecycle.go      stop, start, restart
│   │   ├── status.go         ps, status, sync
│   │   ├── adopt.go          adopt unmanaged containers
│   │   ├── service.go        service show/export
│   │   ├── transfer.go       push, pull
│   │   ├── watch.go          watch (long-lived monitoring)
│   │   └── node.go           node add/list/remove
│   └── mcp-agent/            Agent daemon
│       ├── main.go
│       └── server.go
├── internal/
│   ├── agent/                Agent logic
│   │   ├── agent.go          Agent struct, setup
│   │   ├── containers.go     Container runtime operations
│   │   ├── files.go          File push/pull
│   │   └── runtime.go        Runtime abstraction (podman/docker)
│   ├── registry/             Service registry
│   │   ├── db.go             Schema, migrations
│   │   ├── nodes.go          Node CRUD
│   │   ├── services.go       Service CRUD
│   │   ├── containers.go     Container CRUD
│   │   ├── events.go         Event log
│   │   └── sync.go           Registry reconciliation
│   ├── watch/                Monitoring subsystem
│   │   ├── watch.go          Watch loop
│   │   └── alerting.go       Alert evaluation and command execution
│   ├── servicedef/           Service definition file parsing
│   │   └── servicedef.go     Load, parse, write TOML service defs
│   ├── auth/                 MCIAS integration
│   │   └── auth.go           Token validation, login
│   ├── config/               Configuration loading
│   │   ├── master.go
│   │   └── agent.go
│   └── grpcserver/           Agent gRPC server wiring
│       ├── server.go
│       ├── interceptors.go   Auth interceptor
│       └── handlers.go       RPC implementations
├── proto/mcp/
│   └── v1/
│       └── mcp.proto
├── gen/mcp/
│   └── v1/                   Generated Go code
├── deploy/
│   ├── systemd/
│   │   ├── mcp-agent.service
│   │   └── mcp-watch.service
│   ├── examples/
│   │   ├── mcp.toml
│   │   └── mcp-agent.toml
│   └── scripts/
│       └── install-agent.sh
├── Makefile
├── buf.yaml
├── .golangci.yaml
├── CLAUDE.md
└── ARCHITECTURE.md

Future Work (v2+)

These are explicitly out of scope for v1 but inform the design:

  • Auto-reconciliation: mcp watch detects drift but does not act on it in v1. v2 adds configurable auto-restart for drifted containers (with backoff to avoid restart storms).
  • Migration: snapshot /srv/<service>/ as tar.zst (with VACUUM INTO for clean DB copies), stream to destination node, restore. Requires streaming gRPC and archive assembly logic.
  • Scheduling: automatic node selection based on resource availability and operator constraints. Requires node resource reporting from agents.
  • Certificate provisioning: MCP provisions TLS certs from Metacrypt during deploy via the ACME client library.
  • DNS updates: MCP pushes record updates to MCNS after deploy/migrate. Requires MCNS to have an API (or, as a stopgap, zone file editing).
  • Multi-node orchestration: deploy across multiple nodes, rolling updates, health-aware placement.
  • Web UI: master exposes a web interface for registry browsing and operations. Would add REST+gRPC on the master side.