MCP -- Metacircular Control Plane
Overview
MCP is the orchestrator for the Metacircular platform. It manages container lifecycle, tracks what services run where, and transfers files between the operator's workstation and managed nodes.
MCP uses a master/agent architecture:
- The agent owns "what is". It runs on each managed node, queries the local container runtime, and reports facts. It manages containers and files on behalf of the master. It has no opinion about what should be running.
- The master owns "what should be". It maintains the service registry (desired state), compares it against agent-reported observed state, records events, detects drift, and alerts the operator. The master runs in two modes: as an interactive CLI for operator commands, and as a long-lived watch process for continuous monitoring.
Agents keep their containers running independently. The master is needed to issue commands, track desired state, and monitor for drift.
v1 Scope
v1 targets a single-node deployment (one agent on rift, master on vade). The core operations are:
- Deploy -- pull a container image and start (or restart) a service on a specified node.
- Container lifecycle -- stop, start, restart individual containers.
- Status -- query live container state from agents and reconcile the master's service registry.
- Watch -- continuous monitoring. Periodically syncs with agents, records state-change events, detects drift, and alerts the operator.
- File transfer -- push or pull individual files between master and nodes (config files, certificates).
Explicitly not in v1: migration (snapshot/tar.zst transfer), automatic scheduling/placement, certificate provisioning from Metacrypt, DNS updates to MCNS, multi-node orchestration, auto-reconciliation (restarting drifted containers).
Architecture
Operator workstation (vade)
┌──────────────────────────────┐
│ mcp (master CLI) │
│ │
│ ┌────────────────────────┐ │
│ │ Service Registry │ │
│ │ (SQLite) │ │
│ └────────────────────────┘ │
│ │
│ gRPC client ────────────────┼──── overlay ────┐
└──────────────────────────────┘ │
│
MC Node (rift) │
┌────────────────────────────────────────────────┼──┐
│ │ │
│ ┌──────────────────────┐ │ │
│ │ mcp-agent │◄─────────────────────┘ │
│ │ (gRPC server) │ │
│ │ │ │
│ │ MCIAS token │ │
│ │ validation │ │
│ └──────────┬───────────┘ │
│ │ │
│ │ exec │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Container runtime │ │
│ │ (podman / docker) │ │
│ │ │ │
│ │ ┌─────┐ ┌─────┐ │ ┌──────────────────┐ │
│ │ │ svc │ │ svc │ ...│ │ /srv/<service>/ │ │
│ │ │ α │ │ β │ │ │ config, db, │ │
│ │ └─────┘ └─────┘ │ │ certs, backups │ │
│ └──────────────────────┘ └──────────────────┘ │
└───────────────────────────────────────────────────┘
Components
| Component | Binary | Role |
|---|---|---|
| Master | mcp | Operator CLI. Maintains the service registry. Issues commands to agents. |
| Agent | mcp-agent | Node daemon. Manages local containers and files. Authenticates callers via MCIAS. |
Communication
Master and agent communicate over gRPC with mutual TLS. The transport is the
encrypted overlay network (Tailscale/WireGuard). The master authenticates to
the agent by presenting an MCIAS bearer token. The agent validates the token
by calling MCIAS and checking for the admin role.
Authentication and Authorization
MCP follows the platform authentication model: all auth is delegated to MCIAS.
Agent Authentication
The agent is a gRPC server with a unary interceptor that enforces authentication on every RPC:
- Master includes an MCIAS bearer token in the gRPC metadata
  (authorization: Bearer <token>).
- Agent extracts the token and validates it against MCIAS (cached 30s by
  SHA-256 of the token, per platform convention).
- Agent checks that the caller has the admin role. All MCP operations
  require admin -- there is no unprivileged MCP access.
- If validation fails, the RPC returns UNAUTHENTICATED (invalid/expired
  token) or PERMISSION_DENIED (valid token, not admin).
Master Authentication
The master authenticates to MCIAS before issuing commands. The token can be obtained by:
- mcp login -- interactive login, stores the token locally.
- Environment variable (MCP_TOKEN) for scripted use.
- System account credentials in the master config file.
The stored token is used for all subsequent agent RPCs until it expires.
Agent Identity
The agent itself authenticates to MCIAS as a system account (for future agent-initiated flows like health reporting). For v1, the agent is purely reactive -- it only responds to master-initiated RPCs and does not make outbound calls beyond MCIAS token validation.
Master
CLI Commands
mcp login Authenticate to MCIAS, store token
mcp deploy <service> Deploy from service definition file
mcp deploy <service> -f <file> Deploy from explicit file
mcp deploy <service> --image <tag> Deploy with image override
mcp stop <service> Stop all containers for a service
mcp start <service> Start all containers for a service
mcp restart <service> Restart all containers for a service
mcp ps List all services in the registry
mcp status [service] Query live state from agents, reconcile registry
mcp sync Full rescan: agents report all containers,
registry is reconciled against reality
mcp adopt <container> <service> Claim an unmanaged container into a service
mcp service show <service> Print current spec from registry
mcp service edit <service> Open service definition in $EDITOR
mcp service export <service> Write registry spec to local service file
mcp service export <service> -f <file> Write to explicit path
mcp push <local-file> <service> [path] Copy a local file into /srv/<service>/[path]
mcp pull <service> <path> [local-file] Copy a file from /srv/<service>/<path> to local
mcp watch Run continuous monitoring (long-lived)
mcp watch --interval 60s Custom poll interval
mcp node list List registered nodes
mcp node add <name> <address> Register a node
mcp node remove <name> Deregister a node
Service Registry
The master maintains a SQLite database that tracks:
- Nodes -- name, overlay address, agent port, state.
- Services -- name, assigned node, current state.
- Containers -- name, parent service, image reference, container runtime config (network, ports, volumes, user, restart policy, command), observed state.
A service (e.g., "metacrypt") has one or more containers (e.g., "metacrypt" API server + "metacrypt-web" UI). Deploy operates on services; lifecycle commands can target individual containers.
Desired State vs. Observed State
The registry tracks two separate pieces of information for each container:
- Desired state -- what the operator wants: running or stopped. Set explicitly by mcp deploy, mcp stop, mcp start, etc.
- Observed state -- what the agent actually sees on the node: running, stopped, exited, removed, or unknown.
These can diverge. A container with desired=running and observed=exited
has crashed. A container with desired=stopped and observed=stopped is
fine. mcp status shows both states and flags drift -- any container
where observed state does not match desired state.
| Desired | Observed | Status |
|---|---|---|
| running | running | OK |
| running | stopped | DRIFT -- stopped unexpectedly |
| running | exited | DRIFT -- crashed |
| running | removed | DRIFT -- container gone |
| stopped | stopped | OK |
| stopped | removed | OK |
| stopped | running | DRIFT -- running when it shouldn't be |
For v1, MCP reports drift but does not auto-reconcile. The operator decides
whether to mcp start, mcp deploy, or investigate. Auto-reconciliation
(restart crashed containers, enforce desired state) is a v2 concern.
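The drift table above reduces to a small classifier. A minimal sketch with a hypothetical driftStatus helper (not the actual mcp code); an empty desired state stands in for an unmanaged container:

```go
package main

import "fmt"

// driftStatus classifies a container per the desired/observed table:
// matching states are OK, stopped/removed is OK, and unmanaged
// containers (empty desired state) are never flagged as drift.
func driftStatus(desired, observed string) string {
	switch {
	case desired == "":
		return "UNMANAGED"
	case desired == observed:
		return "OK"
	case desired == "stopped" && observed == "removed":
		return "OK"
	default:
		return "DRIFT"
	}
}

func main() {
	pairs := [][2]string{
		{"running", "running"},
		{"running", "exited"},
		{"stopped", "removed"},
		{"stopped", "running"},
	}
	for _, p := range pairs {
		fmt.Printf("desired=%s observed=%s -> %s\n", p[0], p[1], driftStatus(p[0], p[1]))
	}
}
```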
Registry Reconciliation
mcp sync (and implicitly, mcp status) works as follows:
- For each registered node, master calls ListContainers on the agent.
- Agent queries the container runtime and returns the full list of containers with their configuration and state.
- Master compares the agent's report against the registry:
  - Container in registry, seen by agent: update observed state.
  - Container in registry, not seen by agent: set observed state to removed.
  - Container seen by agent, not in registry: add to registry as unmanaged (desired state unset, observed state from agent).
- Master prints a summary: OK containers, drifted containers, and newly discovered unmanaged containers.
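The three-way comparison can be sketched over two maps of container name to observed state. The reconcile function and syncResult type are illustrative names, not the actual registry API:

```go
package main

import "fmt"

// syncResult summarizes one reconciliation pass.
type syncResult struct {
	updated   map[string]string // in both registry and agent report: new observed state
	removed   []string          // in registry, not seen by agent -> observed state "removed"
	unmanaged []string          // seen by agent, not in registry -> added as unmanaged
}

// reconcile compares the registry's view against an agent's report.
// Both maps are keyed by container name with observed state as values.
func reconcile(registry, agent map[string]string) syncResult {
	r := syncResult{updated: map[string]string{}}
	for name := range registry {
		if state, ok := agent[name]; ok {
			r.updated[name] = state
		} else {
			r.removed = append(r.removed, name)
		}
	}
	for name := range agent {
		if _, ok := registry[name]; !ok {
			// A real implementation would also record the full
			// ContainerInfo (image, ports, volumes) from the report.
			r.unmanaged = append(r.unmanaged, name)
		}
	}
	return r
}

func main() {
	res := reconcile(
		map[string]string{"metacrypt": "running", "mcr": "running"},
		map[string]string{"metacrypt": "exited", "legacy-app": "running"},
	)
	fmt.Println("updated:", res.updated)
	fmt.Println("removed:", res.removed)
	fmt.Println("unmanaged:", res.unmanaged)
}
```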
Adopting Unmanaged Containers
On first mcp sync, every container on rift will appear as unmanaged --
MCP didn't deploy them and doesn't know their intended service grouping.
mcp adopt <container> <service> claims an unmanaged container:
- If the service doesn't exist, create it (assigned to the container's node).
- Set the container's service to the specified service.
- Set managed = true and desired state to running (or stopped if the container is currently stopped).
This lets the operator incrementally bring existing containers under MCP management without redeploying them.
Service Definitions
A service definition is a TOML file that declares the containers for a
service. These files live in ~/.config/mcp/services/ by default, one
per service.
Example: ~/.config/mcp/services/metacrypt.toml
name = "metacrypt"
node = "rift"
[[containers]]
name = "metacrypt"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]
[[containers]]
name = "metacrypt-web"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18080:8080"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]
cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"]
Deploy Resolution
mcp deploy <service> resolves the container spec through a precedence
chain:
- Command-line flags (highest) -- --image, --node, etc. override individual fields.
- Service definition file -- if -f <file> is specified, use that file. Otherwise look for ~/.config/mcp/services/<service>.toml.
- Registry (lowest) -- if no file exists, use the spec from the last successful deploy stored in the registry.
If none of these exist (first deploy, no file), the deploy fails with an error telling the operator to create a service definition.
On a successful deploy, the resolved spec (after all overrides) is written
to the registry. The file on disk is not modified -- it represents the
operator's declared intent, not the deployed state. To sync the file with
reality after overrides, use mcp service export.
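The precedence chain can be sketched as field-wise merging, with an error when neither file nor registry has a spec. The resolveSpec function and the two-field spec struct are hypothetical simplifications of the real container spec:

```go
package main

import "fmt"

// spec holds a subset of container-spec fields, enough to show the
// precedence rule. An empty field means "not specified at this level".
type spec struct {
	Image string
	Node  string
}

// resolveSpec applies the chain: flags override fields on top of the
// service definition file; the registry is the fallback when no file
// exists. nil means "this source is absent".
func resolveSpec(flags, file, registry *spec) (*spec, error) {
	base := registry
	if file != nil {
		base = file
	}
	if base == nil {
		return nil, fmt.Errorf("no service definition file and no prior deploy in registry; create a service definition")
	}
	out := *base
	if flags != nil {
		if flags.Image != "" {
			out.Image = flags.Image
		}
		if flags.Node != "" {
			out.Node = flags.Node
		}
	}
	return &out, nil
}

func main() {
	resolved, err := resolveSpec(
		&spec{Image: "registry.example/metacrypt:v2"}, // --image override
		&spec{Image: "registry.example/metacrypt:latest", Node: "rift"},
		nil,
	)
	if err != nil {
		fmt.Println("deploy failed:", err)
		return
	}
	fmt.Println("image:", resolved.Image, "node:", resolved.Node)
}
```

The resolved result, not the file, is what gets written back to the registry on a successful deploy.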
Spec Lifecycle
┌─────────────┐
write │ Service │ mcp deploy
──────────► │ definition │ ──────────────┐
│ (.toml) │ │
└─────────────┘ ▼
▲ ┌─────────────────┐
│ │ Registry │
mcp service │ │ (deployed │
export │ │ spec) │
│ └─────────────────┘
│ │
└───────────────────────┘
- Operator writes the service definition file (or copies one from the service's repo).
- mcp deploy reads the file, applies overrides, sends to agent, records the resolved spec in the registry.
- mcp service export reads the registry and writes it back to the local file, incorporating any overrides or drift that occurred.
This keeps the file and registry loosely coupled. The file is the starting point; the registry is the truth; export reconciles them.
mcp service edit <service> opens the service definition in $EDITOR
(falling back to $VISUAL, then vi). If no file exists yet, it exports
the current registry spec to a temp file first, so the operator starts
from the deployed state rather than a blank file. After the editor exits,
the file is saved to the standard path in the services directory.
Where Definition Files Come From
Service definition files can be:
- Written by hand by the operator.
- Copied from the service's repo (a service could ship a deploy/mcp-service.toml as a starting point).
- Generated by mcp adopt + mcp service export -- adopt existing containers, then export to get a file matching the running config.
- Generated by converting from mcdeploy.toml during initial MCP migration (one-time).
Monitoring (mcp watch)
mcp watch is a long-lived master process that continuously monitors the
platform. It runs the same sync logic as mcp status on a loop, records
state-change events, and alerts the operator when drift is detected.
Event Log
Every state transition is recorded in the events table:
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
container TEXT NOT NULL,
node TEXT NOT NULL,
prev_state TEXT NOT NULL,
new_state TEXT NOT NULL,
timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
Events accumulate over time and support rate queries:
-- How many times has metacrypt exited in the last hour?
SELECT COUNT(*) FROM events
WHERE container = 'metacrypt'
AND new_state = 'exited'
AND timestamp > datetime('now', '-1 hour');
Old events are pruned periodically (default: retain 30 days).
Watch Loop
Each iteration of the watch loop:
- Sync with all agents (same as mcp sync).
- For each observed state change since the last iteration, insert an event.
- Evaluate alert conditions against the current state and event history.
- If an alert fires, execute the configured alert command.
- Sleep for the configured interval.
Alert Conditions
The watch process evaluates two types of alert:
- Drift alert: a managed container's observed state does not match its desired state. Fires on the transition, not on every poll.
- Rate alert: a container has changed state more than N times within a window (flapping). Default threshold: 3 transitions in 10 minutes.
Each alert has a cooldown per container. Once an alert fires for a container, it is suppressed for the cooldown period regardless of further transitions. This prevents notification spam from a flapping service.
[watch]
interval = "60s"
alert_command = "" # command to run on alert; empty = log only
cooldown = "15m" # suppress repeat alerts per container
flap_threshold = 3 # state changes within flap_window = flapping
flap_window = "10m"
retention = "30d" # event log retention
Alert Command
When an alert fires, the master executes the configured command with context passed via environment variables:
| Variable | Value |
|---|---|
| MCP_CONTAINER | Container name |
| MCP_SERVICE | Parent service name |
| MCP_NODE | Node name |
| MCP_DESIRED | Desired state |
| MCP_OBSERVED | Observed state |
| MCP_PREV_STATE | Previous observed state |
| MCP_ALERT_TYPE | drift or flapping |
| MCP_TRANSITIONS | Number of transitions in the flap window (for flapping alerts) |
The alert command is the operator's choice. MCP does not ship with or depend on any notification system.
# Push notification
alert_command = "ntfy publish mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER ($MCP_DESIRED -> $MCP_OBSERVED)'"
# Webhook
alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."
# Syslog
alert_command = "logger -t mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER on $MCP_NODE'"
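Executing the alert command amounts to populating the MCP_* variables and running the string through the shell. A sketch assuming a POSIX /bin/sh; the alertEvent type and function names are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// alertEvent carries the context for one alert; in the real master
// this would be populated from the registry row and the transition.
type alertEvent struct {
	Container, Service, Node     string
	Desired, Observed, PrevState string
	AlertType                    string // "drift" or "flapping"
	Transitions                  int    // for flapping alerts
}

// alertEnv builds the environment for the alert command: the parent
// environment plus the MCP_* variables from the table above.
func alertEnv(e alertEvent) []string {
	return append(os.Environ(),
		"MCP_CONTAINER="+e.Container,
		"MCP_SERVICE="+e.Service,
		"MCP_NODE="+e.Node,
		"MCP_DESIRED="+e.Desired,
		"MCP_OBSERVED="+e.Observed,
		"MCP_PREV_STATE="+e.PrevState,
		"MCP_ALERT_TYPE="+e.AlertType,
		fmt.Sprintf("MCP_TRANSITIONS=%d", e.Transitions),
	)
}

// runAlertCommand runs the configured command via the shell so the
// operator's command string can reference $MCP_* directly. An empty
// command means "log only".
func runAlertCommand(command string, e alertEvent) error {
	if command == "" {
		return nil
	}
	cmd := exec.Command("/bin/sh", "-c", command)
	cmd.Env = alertEnv(e)
	return cmd.Run()
}

func main() {
	e := alertEvent{
		Container: "metacrypt", Service: "metacrypt", Node: "rift",
		Desired: "running", Observed: "exited", PrevState: "running",
		AlertType: "drift",
	}
	// A command that succeeds only if the variables were passed through.
	if err := runAlertCommand(`test "$MCP_ALERT_TYPE" = drift`, e); err != nil {
		fmt.Println("alert command failed:", err)
		return
	}
	fmt.Println("alert command ran")
}
```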
Deployment
mcp watch is intended to run as a systemd service on a node (typically
rift for v1), not on the operator's workstation. It uses the same binary,
config, and database as the CLI -- the difference is the execution mode.
[Unit]
Description=MCP Watch (continuous monitoring)
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
Restart=on-failure
RestartSec=10
User=mcp
Group=mcp
This means the master's SQLite database lives on the node running mcp watch (rift for v1), not on the operator's workstation. The operator can
still run CLI commands from vade -- they connect to the same database via
a configurable path, or the CLI operates statelessly by querying agents
directly and reading the registry from a shared location.
Alternatively, the operator runs mcp watch on vade when at their desk
and relies on container restart policies for unattended operation. The
design supports both -- mcp watch is an optional monitoring layer, not
a required component.
Agent
gRPC Service Definition
The agent exposes a single gRPC service. All RPCs require admin authentication.
syntax = "proto3";
package mcp.v1;
import "google/protobuf/timestamp.proto";
service McpAgent {
// Container lifecycle
rpc Deploy(DeployRequest) returns (DeployResponse);
rpc StartContainer(ContainerRequest) returns (ContainerResponse);
rpc StopContainer(ContainerRequest) returns (ContainerResponse);
rpc RestartContainer(ContainerRequest) returns (ContainerResponse);
rpc RemoveContainer(ContainerRequest) returns (ContainerResponse);
// Status
rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);
// File transfer
rpc PushFile(PushFileRequest) returns (PushFileResponse);
rpc PullFile(PullFileRequest) returns (PullFileResponse);
// Node
rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse);
}
// --- Container lifecycle ---
message ContainerSpec {
string name = 1;
string image = 2;
string network = 3;
string user = 4;
string restart_policy = 5;
repeated string ports = 6; // "host:container" mappings
repeated string volumes = 7; // "host:container" mount specs
repeated string cmd = 8; // command and arguments
}
message DeployRequest {
repeated ContainerSpec containers = 1;
}
message DeployResponse {
repeated ContainerResult results = 1;
}
message ContainerRequest {
string name = 1;
}
message ContainerResponse {
string name = 1;
bool success = 2;
string error = 3;
}
message ContainerResult {
string name = 1;
bool success = 2;
string error = 3;
}
// --- Status ---
message ListContainersRequest {}
message ContainerInfo {
string name = 1;
string image = 2;
string state = 3; // "running", "stopped", "exited", etc.
string network = 4;
string user = 5;
string restart_policy = 6;
repeated string ports = 7;
repeated string volumes = 8;
repeated string cmd = 9;
google.protobuf.Timestamp created = 10;
google.protobuf.Timestamp started = 11;
}
message ListContainersResponse {
repeated ContainerInfo containers = 1;
}
// --- File transfer ---
// All file paths are relative to /srv/<service>/ on the node.
// The agent resolves the full path and rejects traversal attempts.
message PushFileRequest {
string service = 1; // service name (-> /srv/<service>/)
string path = 2; // relative path within service dir
bytes content = 3;
uint32 mode = 4; // file permissions (e.g. 0600)
}
message PushFileResponse {
bool success = 1;
string error = 2;
}
message PullFileRequest {
string service = 1; // service name (-> /srv/<service>/)
string path = 2; // relative path within service dir
}
message PullFileResponse {
bytes content = 1;
uint32 mode = 2;
string error = 3;
}
// --- Node ---
message NodeStatusRequest {}
message NodeStatusResponse {
string node_name = 1;
string runtime = 2; // "podman", "docker"
string runtime_version = 3;
uint32 container_count = 4;
google.protobuf.Timestamp uptime_since = 5;
}
Container Runtime
The agent manages containers by executing the local container runtime CLI
(podman or docker). The runtime is configured in the agent's config file.
The agent does not use the Docker socket API -- it shells out to the CLI for
simplicity and debuggability (the operator can always run the same commands
manually).
Deploy Flow
When the agent receives a Deploy RPC:
- For each container in the request:
  a. Pull the image: <runtime> pull <image>
  b. Stop and remove the existing container (if any): <runtime> stop <name> and <runtime> rm <name>
  c. Start the new container: <runtime> run -d --name <name> [flags] <image> [cmd]
  d. Verify the container is running: <runtime> inspect <name>
- Return success/failure per container.
The flags passed to run are derived from the ContainerSpec:
| Spec field | Runtime flag |
|---|---|
| network | --network <network> |
| user | --user <user> |
| restart_policy | --restart <policy> |
| ports | -p <mapping> (repeated) |
| volumes | -v <mapping> (repeated) |
| cmd | appended after the image name |
File Transfer
The agent supports single-file push and pull, scoped to a specific service's data directory. This is the mechanism for deploying config files and certificates to nodes.
Every file operation specifies a service name and a relative path.
The agent resolves the full path as /srv/<service>/<path>. This scoping
ensures that a deploy for service A cannot accidentally write into service
B's directory.
Push: master sends the service name, relative path, file content, and
permissions. The agent resolves the path, validates it (no .. traversal,
no symlinks escaping the service directory), creates intermediate directories
if needed, and writes the file atomically (write to temp file, then rename).
Pull: master sends the service name and relative path. The agent resolves the path, validates it, reads the file, and returns the content and permissions.
# Push mcr.toml into /srv/mcr/mcr.toml
mcp push mcr.toml mcr
# Push a cert into /srv/mcr/certs/mcr.pem
mcp push cert.pem mcr certs/mcr.pem
# Pull a config file back
mcp pull mcr mcr.toml ./mcr.toml
When the relative path is omitted from mcp push, the basename of the local
file is used.
File size is bounded by gRPC message limits. For v1, the default 4MB gRPC message size is sufficient -- config files and certificates are kilobytes. If larger transfers are needed in the future, streaming RPCs or the v2 tar.zst archive transfer will handle them.
Database Schema
The master's SQLite database stores the service registry.
CREATE TABLE nodes (
name TEXT PRIMARY KEY,
address TEXT NOT NULL,
agent_port INTEGER NOT NULL DEFAULT 9444,
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE services (
name TEXT PRIMARY KEY,
node TEXT NOT NULL REFERENCES nodes(name),
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE containers (
name TEXT PRIMARY KEY,
service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE,
image TEXT NOT NULL,
network TEXT NOT NULL DEFAULT 'bridge',
user_spec TEXT NOT NULL DEFAULT '',
restart_policy TEXT NOT NULL DEFAULT 'unless-stopped',
desired_state TEXT NOT NULL DEFAULT 'running',
observed_state TEXT NOT NULL DEFAULT 'unknown',
managed INTEGER NOT NULL DEFAULT 1,
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE container_ports (
container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
mapping TEXT NOT NULL,
PRIMARY KEY (container, mapping)
);
CREATE TABLE container_volumes (
container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
mapping TEXT NOT NULL,
PRIMARY KEY (container, mapping)
);
CREATE TABLE container_cmd (
container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE,
position INTEGER NOT NULL,
arg TEXT NOT NULL,
PRIMARY KEY (container, position)
);
-- Event log for monitoring and alerting
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
container TEXT NOT NULL,
node TEXT NOT NULL,
prev_state TEXT NOT NULL,
new_state TEXT NOT NULL,
timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX idx_events_container_time ON events(container, timestamp);
State Values
Desired state (set by operator actions):
| State | Meaning |
|---|---|
| running | Operator wants this container running |
| stopped | Operator deliberately stopped this container |
Observed state (set by agent reports):
| State | Meaning |
|---|---|
| running | Container is running (confirmed by agent) |
| stopped | Container exists but is not running |
| exited | Container exited (crashed or completed) |
| removed | Container no longer exists on the node |
| unknown | State has not been queried since registration |
The managed flag distinguishes MCP-deployed containers from pre-existing
ones discovered during sync. Unmanaged containers have no desired state set
(desired_state is empty). They appear in status output but MCP will not
modify them unless the operator explicitly adopts them with mcp adopt.
Configuration
Master Config
When running as a CLI on the operator's workstation:
[database]
path = "/home/kyle/.config/mcp/mcp.db"
[services]
dir = "/home/kyle/.config/mcp/services"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp"
[auth]
token_path = "/home/kyle/.config/mcp/token"
Directory layout on the operator's workstation:
~/.config/mcp/
├── mcp.toml Master config
├── mcp.db Service registry
├── token Cached MCIAS bearer token (0600)
└── services/ Service definition files
├── metacrypt.toml
├── mcr.toml
├── mc-proxy.toml
└── ...
When running as a service on a node (for mcp watch):
[database]
path = "/srv/mcp/mcp.db"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp"
# System account credentials for unattended operation
username = "mcp"
password_file = "/srv/mcp/credentials"
[watch]
interval = "60s"
alert_command = ""
cooldown = "15m"
flap_threshold = 3
flap_window = "10m"
retention = "30d"
[log]
level = "info"
The master binary is the same in both cases. When running from the
operator's workstation, it uses ~/.config/mcp/ for local state and
interactive login. When running as a systemd service on a node, it uses
/srv/mcp/ following the platform convention and authenticates with a
system account.
Agent Config
[server]
grpc_addr = ":9444"
tls_cert = "/srv/mcp/certs/cert.pem"
tls_key = "/srv/mcp/certs/key.pem"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp-agent"
[agent]
node_name = "rift"
container_runtime = "podman"
[log]
level = "info"
The agent config lives in /srv/mcp/ following the platform convention. The
agent's data directory contains only its config and TLS material -- it has no
database of its own. The container runtime is the source of truth for local
container state.
Deployment
Agent Deployment (on nodes)
The agent is deployed like any other Metacircular service:
- Create /srv/mcp/ directory and config file.
- Provision TLS certificate from Metacrypt.
- Create an MCIAS system account for the agent (mcp-agent).
- Install the mcp-agent binary.
- Start via systemd unit.
The agent runs as a systemd service (container-first deployment is a v2 concern -- MCP needs to be running before it can manage its own agent).
[Unit]
Description=MCP Agent
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mcp-agent server --config /srv/mcp/mcp-agent.toml
Restart=on-failure
RestartSec=5
User=mcp
Group=mcp
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv
[Install]
WantedBy=multi-user.target
Note: ReadWritePaths=/srv (not /srv/mcp) because the agent writes files
to any service's /srv/<service>/ directory on behalf of the master.
Master Installation (on operator workstation)
For interactive CLI use:
- Install the mcp binary to ~/.local/bin/ or /usr/local/bin/.
- Create ~/.config/mcp/mcp.toml with MCIAS and registry config.
- Run mcp login to authenticate.
- Run mcp node add rift <address> to register the first node.
- Run mcp sync to discover existing containers on the node.
Master Watch Deployment (on a node)
For continuous monitoring, deploy mcp watch as a systemd service on a node
(rift for v1). This runs alongside the agent on the same machine.
- Install the mcp binary to /usr/local/bin/.
- Create /srv/mcp/mcp.toml with database, MCIAS system account, and [watch] settings.
- Create an MCIAS system account for the master (mcp).
[Unit]
Description=MCP Watch
After=network-online.target mcp-agent.service
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml
Restart=on-failure
RestartSec=10
User=mcp
Group=mcp
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv/mcp
[Install]
WantedBy=multi-user.target
When mcp watch runs on a node and the operator uses the CLI from vade,
they share the same registry database. The operator can either point their
CLI config at the node's database (via the overlay) or maintain a separate
local database and sync it. For v1 with one node, the simplest option is
running all mcp commands on rift (via SSH) where the database lives.
Security Model
Threat Mitigations
| Threat | Mitigation |
|---|---|
| Unauthorized C2 commands | Agent requires admin MCIAS token on every RPC |
| Token theft | Tokens have short expiry; cached validation keyed by SHA-256 |
| Agent impersonation | Master verifies agent TLS certificate against known CA |
| Arbitrary file write via push | Agent restricts writes to /srv/<service>/ for the named service |
| Arbitrary file read via pull | Agent restricts reads to /srv/<service>/ for the named service |
| Cross-service file access | File ops require a service name; agent resolves to that service's directory only |
| Container runtime escape | Agent runs as non-root where possible; containers follow platform hardening |
| Network eavesdropping | All C2 traffic is gRPC over mTLS over encrypted overlay |
Security Invariants
- Every agent RPC requires a valid MCIAS admin token. No anonymous or unprivileged access.
- File operations are scoped to /srv/<service>/ for the named service. Path traversal attempts (../, symlinks outside the service directory) are rejected.
- The agent never executes arbitrary commands. It only runs container runtime operations and file I/O through well-defined code paths.
- TLS 1.3 minimum on the agent's gRPC listener.
- The master's stored token is file-permission protected (0600).
Project Structure
mcp/
├── cmd/
│ ├── mcp/ Master CLI
│ │ ├── main.go
│ │ ├── login.go
│ │ ├── deploy.go
│ │ ├── lifecycle.go stop, start, restart
│ │ ├── status.go ps, status, sync
│ │ ├── adopt.go adopt unmanaged containers
│ │ ├── service.go service show/export
│ │ ├── transfer.go push, pull
│ │ ├── watch.go watch (long-lived monitoring)
│ │ └── node.go node add/list/remove
│ └── mcp-agent/ Agent daemon
│ ├── main.go
│ └── server.go
├── internal/
│ ├── agent/ Agent logic
│ │ ├── agent.go Agent struct, setup
│ │ ├── containers.go Container runtime operations
│ │ ├── files.go File push/pull
│ │ └── runtime.go Runtime abstraction (podman/docker)
│ ├── registry/ Service registry
│ │ ├── db.go Schema, migrations
│ │ ├── nodes.go Node CRUD
│ │ ├── services.go Service CRUD
│ │ ├── containers.go Container CRUD
│ │ ├── events.go Event log
│ │ └── sync.go Registry reconciliation
│ ├── watch/ Monitoring subsystem
│ │ ├── watch.go Watch loop
│ │ └── alerting.go Alert evaluation and command execution
│ ├── servicedef/ Service definition file parsing
│ │ └── servicedef.go Load, parse, write TOML service defs
│ ├── auth/ MCIAS integration
│ │ └── auth.go Token validation, login
│ ├── config/ Configuration loading
│ │ ├── master.go
│ │ └── agent.go
│ └── grpcserver/ Agent gRPC server wiring
│ ├── server.go
│ ├── interceptors.go Auth interceptor
│ └── handlers.go RPC implementations
├── proto/mcp/
│ └── v1/
│ └── mcp.proto
├── gen/mcp/
│ └── v1/ Generated Go code
├── deploy/
│ ├── systemd/
│ │ ├── mcp-agent.service
│ │ └── mcp-watch.service
│ ├── examples/
│ │ ├── mcp.toml
│ │ └── mcp-agent.toml
│ └── scripts/
│ └── install-agent.sh
├── Makefile
├── buf.yaml
├── .golangci.yaml
├── CLAUDE.md
└── ARCHITECTURE.md
Future Work (v2+)
These are explicitly out of scope for v1 but inform the design:
- Auto-reconciliation: mcp watch detects drift but does not act on it in v1. v2 adds configurable auto-restart for drifted containers (with backoff to avoid restart storms).
- Migration: snapshot /srv/<service>/ as tar.zst (with VACUUM INTO for clean DB copies), stream to destination node, restore. Requires streaming gRPC and archive assembly logic.
- Scheduling: automatic node selection based on resource availability and operator constraints. Requires node resource reporting from agents.
- Certificate provisioning: MCP provisions TLS certs from Metacrypt during deploy via the ACME client library.
- DNS updates: MCP pushes record updates to MCNS after deploy/migrate. Requires MCNS to have an API (or, as a stopgap, zone file editing).
- Multi-node orchestration: deploy across multiple nodes, rolling updates, health-aware placement.
- Web UI: master exposes a web interface for registry browsing and operations. Would add REST+gRPC on the master side.