MCP -- Metacircular Control Plane
Overview
MCP is the orchestrator for the Metacircular platform. It manages container lifecycle, tracks what services run where, and transfers files between the operator's workstation and managed nodes.
MCP has two components:
- The CLI (mcp) is a thin client that runs on the operator's workstation. It reads local service definition files -- the operator's declaration of what should be running -- and pushes that intent to agents. It has no database and no daemon process.
- The agent (mcp-agent) is a smart per-node daemon. It receives desired state from the CLI, manages containers via the local runtime, stores the node's registry (desired state, observed state, deployed specs, events), monitors for drift, and alerts the operator. The agent owns the full loop: it knows what should be running, observes what is running, and can act on the difference.
The agent's container runtime interaction (podman/docker CLI) is an internal subcomponent — the "dumb" part. The agent itself is the smart coordinator that wraps it with state tracking, monitoring, and a gRPC API.
v1 Scope
v1 targets a single-node deployment (one agent on rift, CLI on vade). The core operations are:
- Deploy -- push service definitions to the agent; agent pulls images and starts (or restarts) containers.
- Component-level deploy -- deploy individual components within a service without disrupting others (e.g., update the web UI without restarting the API server).
- Container lifecycle -- stop, start, restart services.
- Monitoring -- agent continuously watches container state, records events, detects drift and flapping, alerts the operator.
- Status -- query live container state, view drift, review events.
- File transfer -- push or pull individual files between CLI and nodes (config files, certificates), scoped to service directories.
- Sync -- push service definitions to the agent to update desired state without deploying.
Explicitly not in v1: migration (snapshot/tar.zst transfer), automatic scheduling/placement, certificate provisioning from Metacrypt, DNS updates to MCNS, multi-node orchestration, auto-reconciliation (agent restarting drifted containers without operator action).
Architecture
Operator workstation (vade)
┌──────────────────────────────┐
│ mcp (CLI) │
│ │
│ ~/.config/mcp/services/ │
│ metacrypt.toml │
│ mcr.toml │
│ mc-proxy.toml │
│ │
│ gRPC client ────────────────┼──── overlay ────┐
└──────────────────────────────┘ │
│
MC Node (rift) │
┌────────────────────────────────────────────────┼──┐
│ │ │
│ ┌──────────────────────────────────────────┐ │ │
│ │ mcp-agent │◄─┘ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────────┐ │ │
│ │ │ Registry │ │ Monitor │ │ │
│ │ │ (SQLite) │ │ (watch loop, │ │ │
│ │ │ │ │ events, │ │ │
│ │ │ desired │ │ alerting) │ │ │
│ │ │ observed │ │ │ │ │
│ │ │ specs │ │ │ │ │
│ │ │ events │ │ │ │ │
│ │ └─────────────┘ └──────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────┐ │ │
│ │ │ Container runtime (podman) │ │ │
│ │ │ │ │ │
│ │ │ ┌───────┐ ┌───────┐ ┌───────┐ │ │ │
│ │ │ │ svc α │ │ svc β │ │ svc γ │ │ │ │
│ │ │ └───────┘ └───────┘ └───────┘ │ │ │
│ │ └──────────────────────────────────┘ │ │
│ └──────────────────────────────────────────┘ │
│ │
│ /srv/<service>/ (config, db, certs, backups) │
└───────────────────────────────────────────────────┘
Components
| Component | Binary | Where | Role |
|---|---|---|---|
| CLI | mcp | Operator workstation (vade) | Thin client. Reads service definitions, pushes intent to agents, queries status. |
| Agent | mcp-agent | Each managed node (rift) | Smart daemon. Manages containers, stores registry, monitors, alerts. |
Communication
The CLI communicates with agents over gRPC with server-side TLS. The
transport is the encrypted overlay network (Tailscale/WireGuard). The CLI
authenticates by presenting an MCIAS bearer token in gRPC metadata. The
agent validates the token by calling MCIAS and checking for the admin
role.
Client certificates (mTLS) are not used. The overlay network restricts network access to platform participants, MCIAS tokens are short-lived with role enforcement, and the agent's TLS certificate is verified against the Metacrypt CA. The scenarios where mTLS adds value (stolen token, MCIAS compromise) already imply broader platform compromise. mTLS remains an option for future security hardening.
Authentication and Authorization
MCP follows the platform authentication model: all auth is delegated to MCIAS.
Agent Authentication
The agent is a gRPC server with a unary interceptor that enforces authentication on every RPC:
- CLI includes an MCIAS bearer token in the gRPC metadata (authorization: Bearer <token>).
- Agent extracts the token and validates it against MCIAS (cached 30s by SHA-256 of the token, per platform convention).
- Agent checks that the caller has the admin role. All MCP operations require admin -- there is no unprivileged MCP access.
- If validation fails, the RPC returns UNAUTHENTICATED (invalid/expired token) or PERMISSION_DENIED (valid token, not admin).
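The validation cache in the steps above can be sketched as follows. This is a minimal illustration, not the agent's actual interceptor: validateFunc is a hypothetical stand-in for the MCIAS call, the two error values stand in for the gRPC status codes, and both caching negative results and failing closed when MCIAS is unreachable are assumptions.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"sync"
	"time"
)

// cacheTTL is the 30-second validation cache lifetime named in the text.
const cacheTTL = 30 * time.Second

// Stand-ins for the gRPC status codes returned by the interceptor.
var (
	errUnauthenticated  = errors.New("UNAUTHENTICATED")
	errPermissionDenied = errors.New("PERMISSION_DENIED")
)

// validateFunc is a hypothetical stand-in for the MCIAS validation call.
type validateFunc func(token string) (valid, admin bool, err error)

type cacheEntry struct {
	valid, admin bool
	expires      time.Time
}

// tokenCache keys results by SHA-256 of the token, so raw tokens are never
// held as map keys.
type tokenCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	validate validateFunc
	entries  map[[sha256.Size]byte]cacheEntry
}

func newTokenCache(ttl time.Duration, v validateFunc) *tokenCache {
	return &tokenCache{ttl: ttl, validate: v, entries: map[[sha256.Size]byte]cacheEntry{}}
}

// check returns nil only for a valid token that carries the admin role.
// The lock is held across validation, which serializes lookups; that keeps
// the sketch short and is not a tuned design.
func (c *tokenCache) check(token string) error {
	key := sha256.Sum256([]byte(token))
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expires) {
		valid, admin, err := c.validate(token)
		if err != nil {
			return errUnauthenticated // assumption: fail closed, do not cache
		}
		e = cacheEntry{valid: valid, admin: admin, expires: time.Now().Add(c.ttl)}
		c.entries[key] = e
	}
	switch {
	case !e.valid:
		return errUnauthenticated
	case !e.admin:
		return errPermissionDenied
	}
	return nil
}

func main() {
	calls := 0
	cache := newTokenCache(cacheTTL, func(token string) (bool, bool, error) {
		calls++
		return token != "", token == "admin-token", nil
	})
	fmt.Println(cache.check("admin-token"))
	fmt.Println(cache.check("admin-token")) // served from the cache
	fmt.Println(cache.check("user-token"))
	fmt.Println("validation calls:", calls)
}
```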
CLI Authentication
The CLI authenticates to MCIAS before issuing commands. The token can be obtained by:
- mcp login -- interactive login, stores the token locally.
- Environment variable (MCP_TOKEN) for scripted use.
- System account credentials in the CLI config file.
The stored token is used for all subsequent agent RPCs until it expires.
Services and Components
A service is a logical unit of the platform (e.g., "metacrypt"). A service has one or more components -- the containers that make it up (e.g., "api" and "web"). Components within a service:
- Share the same node.
- Share the same /srv/<service>/ data directory.
- Are deployed together by default, but can be deployed independently.
This models the real constraint that components like an API server and its web UI are co-located and share state, but have different operational characteristics. For example, restarting Metacrypt's API server requires unsealing the vault, but the web UI can be redeployed independently without disrupting the API.
Services with a single component (e.g., mc-proxy) simply have one
[[components]] block.
The unique identity of a component is node/service/component.
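As an illustration of this naming scheme, the sketch below splits a CLI target such as metacrypt/web into its service and component parts and builds the node/service/component identity. The function names parseTarget and componentID are hypothetical, and rejecting a trailing slash or nested path is an assumption about how strict the CLI should be.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTarget splits "service" or "service/component". An empty component
// means "all components", mirroring the empty component field in DeployRequest.
func parseTarget(arg string) (string, string, error) {
	service, component, hasSlash := strings.Cut(arg, "/")
	if service == "" || (hasSlash && component == "") || strings.Contains(component, "/") {
		return "", "", fmt.Errorf("invalid target %q: want <service> or <service>/<component>", arg)
	}
	return service, component, nil
}

// componentID renders the unique identity node/service/component.
func componentID(node, service, component string) string {
	return strings.Join([]string{node, service, component}, "/")
}

func main() {
	for _, arg := range []string{"metacrypt", "metacrypt/web", "metacrypt/"} {
		service, component, err := parseTarget(arg)
		fmt.Printf("%q -> service=%q component=%q err=%v\n", arg, service, component, err)
	}
	fmt.Println(componentID("rift", "metacrypt", "web"))
}
```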
CLI
Commands
mcp login Authenticate to MCIAS, store token
mcp deploy <service> Deploy all components from service definition
mcp deploy <service>/<component> Deploy a single component
mcp deploy <service> -f <file> Deploy from explicit file
mcp stop <service> Stop all components
mcp start <service> Start all components
mcp restart <service> Restart all components
mcp list List services from agent registry (no runtime query)
mcp ps Live check: query runtime, show running containers
with uptime and version
mcp status [service] Full picture: live query + drift + recent events
mcp sync Push service definitions to agent (update desired
state without deploying)
mcp adopt <container> <service> Claim an unmanaged container into a service
mcp service show <service> Print current spec from agent registry
mcp service edit <service> Open service definition in $EDITOR
mcp service export <service> Write agent registry spec to local service file
mcp service export <service> -f <file> Write to explicit path
mcp push <local-file> <service> [path] Copy a local file into /srv/<service>/[path]
mcp pull <service> <path> [local-file] Copy a file from /srv/<service>/<path> to local
mcp node list List registered nodes
mcp node add <name> <address> Register a node
mcp node remove <name> Deregister a node
Service Definition Files
A service definition is a TOML file that declares the components for a
service. These files live in ~/.config/mcp/services/ by default, one
per service. They are the operator's declaration of intent -- what should
exist, with what spec, in what state.
Example: ~/.config/mcp/services/metacrypt.toml
name = "metacrypt"
node = "rift"
active = true
[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]
[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest"
network = "docker_default"
user = "0:0"
restart = "unless-stopped"
ports = ["127.0.0.1:18080:8080"]
volumes = ["/srv/metacrypt:/srv/metacrypt"]
cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"]
Deploy Resolution
mcp deploy <service> resolves the component spec through a precedence
chain:
- Service definition file -- if -f <file> is specified, use that file. Otherwise look for ~/.config/mcp/services/<service>.toml.
- Agent registry (fallback) -- if no file exists, use the spec from the last successful deploy stored in the agent's registry.
If neither exists (first deploy, no file), the deploy fails with an error telling the operator to create a service definition.
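The precedence chain can be sketched roughly as below. This is an illustration under assumptions: resolveSpec and registryLookup are hypothetical names (the latter stands in for asking the agent for the last deployed spec), the spec is treated as raw bytes, and an explicit -f file that cannot be read is treated as an error rather than falling through to the next step.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errNoSpec = errors.New("no service definition found; create one before the first deploy")

// resolveSpec returns the raw spec and a label describing its source.
func resolveSpec(service, explicitFile, servicesDir string,
	registryLookup func(service string) ([]byte, bool)) ([]byte, string, error) {
	// 1a. An explicit -f file wins outright.
	if explicitFile != "" {
		data, err := os.ReadFile(explicitFile)
		if err != nil {
			return nil, "", fmt.Errorf("read %s: %w", explicitFile, err)
		}
		return data, "file:" + explicitFile, nil
	}
	// 1b. Otherwise use the conventional path in the services directory.
	path := filepath.Join(servicesDir, service+".toml")
	data, err := os.ReadFile(path)
	switch {
	case err == nil:
		return data, "file:" + path, nil
	case !errors.Is(err, os.ErrNotExist):
		return nil, "", fmt.Errorf("read %s: %w", path, err)
	}
	// 2. Fall back to the last deployed spec in the agent registry.
	if spec, ok := registryLookup(service); ok {
		return spec, "registry", nil
	}
	return nil, "", errNoSpec
}

func main() {
	lookup := func(service string) ([]byte, bool) {
		if service == "mcr" {
			return []byte(`name = "mcr"`), true
		}
		return nil, false
	}
	for _, svc := range []string{"mcr", "unknown"} {
		_, source, err := resolveSpec(svc, "", "/nonexistent/services", lookup)
		fmt.Printf("service=%s source=%q err=%v\n", svc, source, err)
	}
}
```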
The CLI pushes the resolved spec to the agent. The agent records it in its
registry and executes the deploy. The service definition file on disk is
not modified -- it represents the operator's declared intent, not the
deployed state. To sync the file with reality, use mcp service export.
Spec Lifecycle
┌─────────────┐
write │ Service │ mcp deploy
──────────► │ definition │ ──────────────┐
│ (.toml) │ │
└─────────────┘ ▼
▲ ┌─────────────────┐
│ │ Agent registry │
mcp service │ │ (deployed │
export │ │ spec) │
│ └─────────────────┘
│ │
└───────────────────────┘
- Operator writes the service definition file (or copies one from the service's repo).
- mcp deploy reads the file, pushes to the agent, agent records the spec in its registry and deploys.
- mcp service export reads the agent's registry and writes it back to the local file, incorporating any changes since the file was last edited.
mcp service edit <service> opens the service definition in $EDITOR
(falling back to $VISUAL, then vi). If no file exists yet, it exports
the current spec from the agent's registry first, so the operator starts
from the deployed state rather than a blank file. After the editor exits,
the file is saved to the standard path in the services directory.
Where Definition Files Come From
Service definition files can be:
- Written by hand by the operator.
- Copied from the service's repo (a service could ship a deploy/mcp-service.toml as a starting point).
- Generated by mcp adopt + mcp service export -- adopt existing containers, then export to get a file matching the running config.
- Generated by converting from mcdeploy.toml during initial MCP migration (one-time).
Agent
The agent is the smart per-node daemon. It owns the full lifecycle: receives desired state, manages containers, stores the registry, monitors for drift, and alerts the operator.
gRPC Service Definition
The agent exposes a single gRPC service. All RPCs require admin authentication. The agent is gRPC-only -- it is internal C2 infrastructure, not a user-facing service, so the platform's REST+gRPC parity rule does not apply.
syntax = "proto3";
package mcp.v1;
import "google/protobuf/timestamp.proto";
service McpAgent {
// Service lifecycle
rpc Deploy(DeployRequest) returns (DeployResponse);
rpc StopService(ServiceRequest) returns (ServiceResponse);
rpc StartService(ServiceRequest) returns (ServiceResponse);
rpc RestartService(ServiceRequest) returns (ServiceResponse);
// Desired state
rpc SyncDesiredState(SyncRequest) returns (SyncResponse);
// Status and registry
rpc ListServices(ListServicesRequest) returns (ListServicesResponse);
rpc GetServiceStatus(ServiceStatusRequest) returns (ServiceStatusResponse);
rpc LiveCheck(LiveCheckRequest) returns (LiveCheckResponse);
// Adopt
rpc AdoptContainer(AdoptRequest) returns (AdoptResponse);
// File transfer
rpc PushFile(PushFileRequest) returns (PushFileResponse);
rpc PullFile(PullFileRequest) returns (PullFileResponse);
// Node
rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse);
}
// --- Service lifecycle ---
message ComponentSpec {
string name = 1;
string image = 2;
string network = 3;
string user = 4;
string restart = 5;
repeated string ports = 6; // "host:container" mappings
repeated string volumes = 7; // "host:container" mount specs
repeated string cmd = 8; // command and arguments
}
message ServiceSpec {
string name = 1;
bool active = 2;
repeated ComponentSpec components = 3;
}
message DeployRequest {
ServiceSpec service = 1;
string component = 2; // deploy single component (empty = all)
}
message DeployResponse {
repeated ComponentResult results = 1;
}
message ComponentResult {
string name = 1;
bool success = 2;
string error = 3;
}
message ServiceRequest {
string name = 1;
}
message ServiceResponse {
repeated ComponentResult results = 1;
}
// --- Desired state ---
message SyncRequest {
repeated ServiceSpec services = 1; // all services for this node
}
message SyncResponse {
repeated ServiceSyncResult results = 1;
}
message ServiceSyncResult {
string name = 1;
bool changed = 2; // desired state was updated
string summary = 3;
}
// --- Status and registry ---
message ListServicesRequest {}
message ServiceInfo {
string name = 1;
bool active = 2;
repeated ComponentInfo components = 3;
}
message ComponentInfo {
string name = 1;
string image = 2;
string desired_state = 3; // "running", "stopped", "ignore"
string observed_state = 4; // "running", "stopped", "exited", "removed", "unknown"
string version = 5;
google.protobuf.Timestamp started = 6;
}
message ListServicesResponse {
repeated ServiceInfo services = 1;
}
message ServiceStatusRequest {
string name = 1; // empty = all services
}
message DriftInfo {
string service = 1;
string component = 2;
string desired_state = 3;
string observed_state = 4;
}
message EventInfo {
string component = 1;
string prev_state = 2;
string new_state = 3;
google.protobuf.Timestamp timestamp = 4;
}
message ServiceStatusResponse {
repeated ServiceInfo services = 1;
repeated DriftInfo drift = 2;
repeated EventInfo recent_events = 3;
}
message LiveCheckRequest {}
message LiveCheckResponse {
repeated ServiceInfo services = 1; // with freshly observed state
}
// --- Adopt ---
message AdoptRequest {
string container = 1; // runtime container name
string service = 2; // service to adopt into
string component = 3; // component name within the service
}
message AdoptResponse {
bool success = 1;
string error = 2;
}
// --- File transfer ---
// All file paths are relative to /srv/<service>/ on the node.
// The agent resolves the full path and rejects traversal attempts.
message PushFileRequest {
string service = 1; // service name (-> /srv/<service>/)
string path = 2; // relative path within service dir
bytes content = 3;
uint32 mode = 4; // file permissions (e.g. 0600)
}
message PushFileResponse {
bool success = 1;
string error = 2;
}
message PullFileRequest {
string service = 1; // service name (-> /srv/<service>/)
string path = 2; // relative path within service dir
}
message PullFileResponse {
bytes content = 1;
uint32 mode = 2;
string error = 3;
}
// --- Node ---
message NodeStatusRequest {}
message NodeStatusResponse {
string node_name = 1;
string runtime = 2; // "podman", "docker"
string runtime_version = 3;
uint32 service_count = 4;
uint32 component_count = 5;
uint64 disk_total_bytes = 6;
uint64 disk_free_bytes = 7;
uint64 memory_total_bytes = 8;
uint64 memory_free_bytes = 9;
double cpu_usage_percent = 10;
google.protobuf.Timestamp uptime_since = 11;
}
Container Runtime
The agent manages containers by executing the local container runtime CLI
(podman). The runtime is configured in the agent's config file. The agent
shells out to the CLI for simplicity and debuggability -- the operator can
always run the same commands manually.
The agent runs as a dedicated mcp system user. Podman runs rootless under
this user. All containers are owned by mcp. The NixOS configuration
provisions the mcp user with podman access.
Deploy Flow
When the agent receives a Deploy RPC:
- Record the service spec in the registry (desired state, component specs).
- For each component being deployed (all, or the one named in the request):
  a. Pull the image: podman pull <image>
  b. Stop and remove the existing container (if any): podman stop <name> and podman rm <name>
  c. Start the new container: podman run -d --name <name> [flags] <image> [cmd]
  d. Verify the container is running: podman inspect <name>
  e. Update observed state in the registry.
- Set desired state to running for deployed components.
- Return success/failure per component.
The flags passed to podman run are derived from the ComponentSpec:
| Spec field | Runtime flag |
|---|---|
| network | --network <network> |
| user | --user <user> |
| restart | --restart <restart> |
| ports | -p <mapping> (repeated) |
| volumes | -v <mapping> (repeated) |
| cmd | appended after the image name |
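The mapping in the table can be sketched as a small argument builder. This is an illustration only: the Go struct mirrors the ComponentSpec message, runArgs and commandLine are hypothetical helper names, the container name is supplied by the caller because the naming scheme is not specified here, and skipping empty fields is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ComponentSpec mirrors the fields of the protobuf message of the same name.
type ComponentSpec struct {
	Name, Image, Network, User, Restart string
	Ports, Volumes, Cmd                 []string
}

// runArgs builds the argument vector for "podman run" from a spec.
// Empty optional fields are skipped so the runtime defaults apply (assumption).
func runArgs(containerName string, c ComponentSpec) []string {
	args := []string{"run", "-d", "--name", containerName}
	if c.Network != "" {
		args = append(args, "--network", c.Network)
	}
	if c.User != "" {
		args = append(args, "--user", c.User)
	}
	if c.Restart != "" {
		args = append(args, "--restart", c.Restart)
	}
	for _, p := range c.Ports {
		args = append(args, "-p", p)
	}
	for _, v := range c.Volumes {
		args = append(args, "-v", v)
	}
	args = append(args, c.Image)
	return append(args, c.Cmd...) // cmd goes after the image name
}

// commandLine renders the arguments for logging; it does no shell quoting.
func commandLine(args []string) string {
	return "podman " + strings.Join(args, " ")
}

func main() {
	web := ComponentSpec{
		Name:    "web",
		Image:   "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest",
		Network: "docker_default",
		User:    "0:0",
		Restart: "unless-stopped",
		Ports:   []string{"127.0.0.1:18080:8080"},
		Volumes: []string{"/srv/metacrypt:/srv/metacrypt"},
		Cmd:     []string{"server", "--config", "/srv/metacrypt/metacrypt.toml"},
	}
	// "metacrypt-web" is a hypothetical container name.
	fmt.Println(commandLine(runArgs("metacrypt-web", web)))
}
```

Passing the arguments as a slice to the runtime, rather than as one shell string, avoids quoting problems with values that contain spaces.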
File Transfer
The agent supports single-file push and pull, scoped to a specific service's data directory. This is the mechanism for deploying config files and certificates to nodes.
Every file operation specifies a service name and a relative path.
The agent resolves the full path as /srv/<service>/<path>. This scoping
ensures that a file operation for service A cannot write into service B's
directory.
Push: CLI sends the service name, relative path, file content, and
permissions. The agent resolves the path, validates it (no .. traversal,
no symlinks escaping the service directory), creates intermediate
directories if needed, and writes the file atomically (write to temp file,
then rename).
Pull: CLI sends the service name and relative path. The agent resolves the path, validates it, reads the file, and returns the content and permissions.
# Push mcr.toml into /srv/mcr/mcr.toml
mcp push mcr.toml mcr
# Push a cert into /srv/mcr/certs/mcr.pem
mcp push cert.pem mcr certs/mcr.pem
# Pull a config file back
mcp pull mcr mcr.toml ./mcr.toml
When the relative path is omitted from mcp push, the basename of the
local file is used.
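The path scoping and atomic write described above might look like the following sketch. The names resolvePath and atomicWrite are hypothetical, the 0750 mode for intermediate directories is an assumption, and the symlink check mentioned in the text is deliberately left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

var errTraversal = errors.New("path escapes service directory")

// resolvePath maps (service, rel) to a path under root/<service>/ and rejects
// absolute paths and ".." traversal. Symlinks are not checked in this sketch.
func resolvePath(root, service, rel string) (string, error) {
	if service == "" || service == "." || service == ".." || strings.ContainsAny(service, "/\\") {
		return "", fmt.Errorf("invalid service name %q", service)
	}
	if rel == "" || filepath.IsAbs(rel) {
		return "", errTraversal
	}
	base := filepath.Join(root, service)
	full := filepath.Join(base, rel) // Join cleans the path, collapsing ".."
	if !strings.HasPrefix(full, base+string(filepath.Separator)) {
		return "", errTraversal
	}
	return full, nil
}

// atomicWrite writes to a temp file in the target directory, then renames it
// into place so readers never observe a partial file.
func atomicWrite(path string, content []byte, mode os.FileMode) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o750); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(dir, ".mcp-push-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	root, err := os.MkdirTemp("", "srv")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(root)
	path, err := resolvePath(root, "mcr", "certs/mcr.pem")
	if err == nil {
		err = atomicWrite(path, []byte("example"), 0o600)
	}
	fmt.Println("push:", path, err)
	_, err = resolvePath(root, "mcr", "../metacrypt/metacrypt.toml")
	fmt.Println("traversal:", err)
}
```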
File size is bounded by gRPC message limits. For v1, the default 4MB gRPC message size is sufficient -- config files and certificates are kilobytes. If larger transfers are needed in the future, streaming RPCs or the v2 tar.zst archive transfer will handle them.
Desired State vs. Observed State
The agent's registry tracks two separate pieces of information for each component:
- Desired state -- what the operator wants: running, stopped, or ignore. Set by the CLI via deploy, stop, start, sync, or adopt.
- Observed state -- what the container runtime reports: running, stopped, exited, removed, or unknown.
These can diverge. A component with desired=running and observed=exited
has crashed. The agent flags this as drift. Components with
desired=ignore are tracked but never flagged as drifting.
| Desired | Observed | Status |
|---|---|---|
| running | running | OK |
| running | stopped | DRIFT -- stopped unexpectedly |
| running | exited | DRIFT -- crashed |
| running | removed | DRIFT -- container gone |
| stopped | stopped | OK |
| stopped | removed | OK |
| stopped | running | DRIFT -- running when it shouldn't be |
| ignore | (any) | OK -- not managed |
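The table translates directly into a lookup. The sketch below encodes exactly the rows listed and reports combinations the table does not cover (for example, desired=running with observed=unknown) as unspecified instead of guessing; the names statusFor and isDrift are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

type statePair struct{ desired, observed string }

// statusTable holds the rows of the table, minus the "ignore" wildcard row.
var statusTable = map[statePair]string{
	{"running", "running"}: "OK",
	{"running", "stopped"}: "DRIFT -- stopped unexpectedly",
	{"running", "exited"}:  "DRIFT -- crashed",
	{"running", "removed"}: "DRIFT -- container gone",
	{"stopped", "stopped"}: "OK",
	{"stopped", "removed"}: "OK",
	{"stopped", "running"}: "DRIFT -- running when it shouldn't be",
}

// statusFor returns the status text and whether the table defines this pair.
func statusFor(desired, observed string) (string, bool) {
	if desired == "ignore" {
		return "OK -- not managed", true
	}
	s, ok := statusTable[statePair{desired, observed}]
	return s, ok
}

// isDrift reports drift only for pairs the table explicitly marks as DRIFT.
func isDrift(desired, observed string) bool {
	s, ok := statusFor(desired, observed)
	return ok && strings.HasPrefix(s, "DRIFT")
}

func main() {
	pairs := []statePair{
		{"running", "exited"},
		{"stopped", "removed"},
		{"ignore", "exited"},
		{"running", "unknown"},
	}
	for _, p := range pairs {
		s, ok := statusFor(p.desired, p.observed)
		fmt.Printf("%s/%s: %q defined=%v drift=%v\n", p.desired, p.observed, s, ok, isDrift(p.desired, p.observed))
	}
}
```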
For v1, the agent reports drift but does not auto-reconcile. The operator
decides whether to mcp start, mcp deploy, or investigate.
Auto-reconciliation (agent restarting drifted containers without operator
action) is a v2 concern.
Registry Reconciliation
The agent reconciles its registry against the container runtime on three
occasions: during the monitor loop (continuous), on mcp ps / mcp status
(on demand), and on mcp sync (when new desired state is pushed).
Reconciliation:
- Agent queries the container runtime for all containers.
- Compares the runtime's report against the registry:
  - Component in registry, seen in runtime: update observed state.
  - Component in registry, not in runtime: set observed state to removed.
  - Container in runtime, not in registry: add to registry with desired state ignore. These are containers the agent sees but MCP didn't deploy.
- Record state-change events for any transitions.
Adopting Unmanaged Containers
On first sync, every container on rift will appear with desired state
ignore -- MCP didn't deploy them and doesn't know their intended service
grouping.
mcp adopt <container> <service> claims an unmanaged container:
- If the service doesn't exist in the registry, create it.
- Assign the container as a component of the specified service.
- Set desired state to running (or stopped if the container is currently stopped).
This lets the operator incrementally bring existing containers under MCP
management without redeploying them. The typical bootstrap flow:
mcp sync to discover containers, mcp adopt to group them into services,
mcp service export to generate service definition files from the adopted
state.
Monitoring
The agent runs a continuous monitor loop that watches container state and alerts the operator when problems are detected. Monitoring is a core function of the agent, not a separate process.
Event Log
Every state transition is recorded in the events table:
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
component TEXT NOT NULL,
service TEXT NOT NULL,
prev_state TEXT NOT NULL,
new_state TEXT NOT NULL,
timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX idx_events_component_time ON events(service, component, timestamp);
Events accumulate over time and support rate queries:
-- How many times has metacrypt/api exited in the last hour?
SELECT COUNT(*) FROM events
WHERE component = 'api' AND service = 'metacrypt'
AND new_state = 'exited'
AND timestamp > datetime('now', '-1 hour');
Old events are pruned at the start of each monitor iteration (default: retain 30 days).
Monitor Loop
Each iteration of the monitor loop:
- Query the container runtime for all container states.
- Reconcile against the registry (update observed states).
- For each state transition since the last iteration, insert an event.
- Evaluate alert conditions against the current state and event history.
- If an alert fires, execute the configured alert command.
- Sleep for the configured interval.
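A skeleton of this loop is sketched below. The hooks struct and its fields are hypothetical placeholders for the real runtime, registry, and alerting code; the sketch only shows the ordering of the steps and a cancellable sleep between iterations.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hooks bundles the per-iteration actions; each field is a hypothetical
// placeholder for the corresponding subsystem.
type hooks struct {
	observe   func() map[string]string                // query the container runtime
	reconcile func(states map[string]string) []string // update registry, return transitions
	record    func(transitions []string)              // insert events
	evaluate  func() []string                         // alerts that should fire
	alert     func(a string)                          // run the alert command
}

// iterate performs one pass of the monitor loop in the order listed above.
func iterate(h hooks) {
	states := h.observe()
	transitions := h.reconcile(states)
	h.record(transitions)
	for _, a := range h.evaluate() {
		h.alert(a)
	}
}

// run repeats iterate until ctx is cancelled, sleeping interval between passes.
func run(ctx context.Context, interval time.Duration, h hooks) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		iterate(h)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	passes := 0
	h := hooks{
		observe:   func() map[string]string { passes++; return map[string]string{"api": "exited"} },
		reconcile: func(map[string]string) []string { return []string{"api: running -> exited"} },
		record:    func(t []string) { fmt.Println("events:", t) },
		evaluate:  func() []string { return []string{"drift: metacrypt/api"} },
		alert:     func(a string) { fmt.Println("alert:", a) },
	}
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
	defer cancel()
	run(ctx, 10*time.Millisecond, h)
	fmt.Println("passes:", passes)
}
```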
Alert Conditions
The monitor evaluates two types of alert:
- Drift alert: a managed component's observed state does not match its desired state. Fires on the transition, not on every iteration.
- Flap alert: a component has changed state at least flap_threshold times within flap_window. Default: 3 transitions in 10 minutes.
Each alert has a cooldown per component. Once an alert fires for a component, it is suppressed for the cooldown period regardless of further transitions. This prevents notification spam from a flapping service.
[monitor]
interval = "60s"
alert_command = "" # command to run on alert; empty = log only
cooldown = "15m" # suppress repeat alerts per component
flap_threshold = 3 # state changes within flap_window = flapping
flap_window = "10m"
retention = "30d" # event log retention
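Flap detection with a per-component cooldown can be sketched as a sliding window over transition timestamps. This is an illustration under assumptions: reaching flap_threshold transitions inside flap_window counts as flapping, only the flap alert is modeled, and the alerter type and helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// alerter tracks recent transitions and the last alert time per component.
type alerter struct {
	threshold        int
	window, cooldown time.Duration
	transitions      map[string][]time.Time
	lastAlert        map[string]time.Time
}

// newAlerter uses the defaults from the [monitor] block above.
func newAlerter() *alerter {
	return &alerter{
		threshold:   3,
		window:      10 * time.Minute,
		cooldown:    15 * time.Minute,
		transitions: map[string][]time.Time{},
		lastAlert:   map[string]time.Time{},
	}
}

// observe records a transition for id at time t and reports whether a flap
// alert should fire now.
func (a *alerter) observe(id string, t time.Time) bool {
	cutoff := t.Add(-a.window)
	kept := a.transitions[id][:0] // filter in place, dropping old entries
	for _, ts := range a.transitions[id] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	kept = append(kept, t)
	a.transitions[id] = kept

	if len(kept) < a.threshold {
		return false
	}
	if last, ok := a.lastAlert[id]; ok && t.Sub(last) < a.cooldown {
		return false // suppressed by cooldown
	}
	a.lastAlert[id] = t
	return true
}

// at returns a fixed reference time plus the given number of minutes.
func at(minutes int) time.Time {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(minutes) * time.Minute)
}

func main() {
	a := newAlerter()
	for _, m := range []int{0, 1, 2, 3, 20, 21, 22} {
		fmt.Printf("t+%02dm fire=%v\n", m, a.observe("metacrypt/api", at(m)))
	}
}
```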
Alert Command
When an alert fires, the agent executes the configured command with context passed via environment variables:
| Variable | Value |
|---|---|
| MCP_COMPONENT | Component name |
| MCP_SERVICE | Parent service name |
| MCP_NODE | Node name |
| MCP_DESIRED | Desired state |
| MCP_OBSERVED | Observed state |
| MCP_PREV_STATE | Previous observed state |
| MCP_ALERT_TYPE | drift or flapping |
| MCP_TRANSITIONS | Number of transitions in the flap window (for flap alerts) |
The alert command is the operator's choice. MCP does not ship with or depend on any notification system.
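One way the agent could pass this context to the command is sketched below. How the command is actually invoked is not stated here, so running it through sh -c is an assumption, as are the 30-second timeout and the Alert type and function names.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"time"
)

// Alert is a hypothetical carrier for the values in the table above.
type Alert struct {
	Component, Service, Node     string
	Desired, Observed, PrevState string
	Type                         string // "drift" or "flapping"
	Transitions                  int
}

// alertEnv renders the alert as KEY=value pairs. MCP_TRANSITIONS is only set
// for flap alerts.
func alertEnv(a Alert) []string {
	env := []string{
		"MCP_COMPONENT=" + a.Component,
		"MCP_SERVICE=" + a.Service,
		"MCP_NODE=" + a.Node,
		"MCP_DESIRED=" + a.Desired,
		"MCP_OBSERVED=" + a.Observed,
		"MCP_PREV_STATE=" + a.PrevState,
		"MCP_ALERT_TYPE=" + a.Type,
	}
	if a.Type == "flapping" {
		env = append(env, "MCP_TRANSITIONS="+strconv.Itoa(a.Transitions))
	}
	return env
}

// runAlert executes the configured command via "sh -c" (assumption) with the
// alert variables added to the agent's own environment.
func runAlert(ctx context.Context, command string, a Alert) ([]byte, error) {
	if command == "" {
		return nil, nil // empty command means log only
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "sh", "-c", command)
	cmd.Env = append(os.Environ(), alertEnv(a)...)
	return cmd.CombinedOutput()
}

func main() {
	a := Alert{
		Component: "api", Service: "metacrypt", Node: "rift",
		Desired: "running", Observed: "exited", PrevState: "running",
		Type: "drift",
	}
	out, err := runAlert(context.Background(),
		`echo "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT on $MCP_NODE"`, a)
	fmt.Printf("output=%q err=%v\n", out, err)
}
```

The timeout keeps a hung notification command from stalling the monitor loop.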
# Push notification
alert_command = 'ntfy publish mcp "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT ($MCP_DESIRED -> $MCP_OBSERVED)"'
# Webhook
alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."
# Syslog
alert_command = 'logger -t mcp "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT on $MCP_NODE"'
Database Schema
The agent's SQLite database stores the node-local registry. Each agent has its own database. Component identity is scoped to the node -- there are no cross-node name collisions because each node has a separate database.
CREATE TABLE services (
name TEXT PRIMARY KEY,
active INTEGER NOT NULL DEFAULT 1,
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE components (
name TEXT NOT NULL,
service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE,
image TEXT NOT NULL,
network TEXT NOT NULL DEFAULT 'bridge',
user_spec TEXT NOT NULL DEFAULT '',
restart TEXT NOT NULL DEFAULT 'unless-stopped',
desired_state TEXT NOT NULL DEFAULT 'running',
observed_state TEXT NOT NULL DEFAULT 'unknown',
version TEXT NOT NULL DEFAULT '',
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now')),
PRIMARY KEY (service, name)
);
CREATE TABLE component_ports (
service TEXT NOT NULL,
component TEXT NOT NULL,
mapping TEXT NOT NULL,
PRIMARY KEY (service, component, mapping),
FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE
);
CREATE TABLE component_volumes (
service TEXT NOT NULL,
component TEXT NOT NULL,
mapping TEXT NOT NULL,
PRIMARY KEY (service, component, mapping),
FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE
);
CREATE TABLE component_cmd (
service TEXT NOT NULL,
component TEXT NOT NULL,
position INTEGER NOT NULL,
arg TEXT NOT NULL,
PRIMARY KEY (service, component, position),
FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE
);
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
service TEXT NOT NULL,
component TEXT NOT NULL,
prev_state TEXT NOT NULL,
new_state TEXT NOT NULL,
timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX idx_events_component_time ON events(service, component, timestamp);
State Values
Desired state (set by operator actions via CLI):
| State | Meaning |
|---|---|
| running | Operator wants this component running |
| stopped | Operator deliberately stopped this component |
| ignore | Unmanaged -- MCP sees it but is not responsible for it |
Observed state (set by container runtime queries):
| State | Meaning |
|---|---|
| running | Container is running |
| stopped | Container exists but is not running |
| exited | Container exited (crashed or completed) |
| removed | Container no longer exists |
| unknown | State has not been queried yet |
Configuration
CLI Config
[services]
dir = "/home/kyle/.config/mcp/services"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp"
[auth]
token_path = "/home/kyle/.config/mcp/token"
Directory layout on the operator's workstation:
~/.config/mcp/
├── mcp.toml CLI config
├── token Cached MCIAS bearer token (0600)
└── services/ Service definition files
├── metacrypt.toml
├── mcr.toml
├── mc-proxy.toml
└── ...
The CLI has no database. Service definition files are the operator's source of truth for desired state. The agent's registry is the operational truth.
Agent Config
[server]
grpc_addr = ":9444"
tls_cert = "/srv/mcp/certs/cert.pem"
tls_key = "/srv/mcp/certs/key.pem"
[database]
path = "/srv/mcp/mcp.db"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = ""
service_name = "mcp-agent"
[agent]
node_name = "rift"
container_runtime = "podman"
[monitor]
interval = "60s"
alert_command = ""
cooldown = "15m"
flap_threshold = 3
flap_window = "10m"
retention = "30d"
[log]
level = "info"
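One detail worth noting for a Go implementation: time.ParseDuration accepts units up to hours ("ns", "us", "ms", "s", "m", "h") but not days, so a value like retention = "30d" needs a small wrapper. The sketch below is one possible approach; parseDuration is a hypothetical helper and treating a day as exactly 24 hours is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseDuration extends time.ParseDuration with a whole-day suffix, e.g. "30d".
// A day is taken to be exactly 24 hours.
func parseDuration(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid day count in duration %q", s)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s)
}

func main() {
	for _, s := range []string{"60s", "15m", "10m", "30d", "x"} {
		d, err := parseDuration(s)
		fmt.Printf("%-4s -> %v (err=%v)\n", s, d, err)
	}
}
```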
The agent's data directory follows the platform convention:
/srv/mcp/
├── mcp-agent.toml Agent config
├── mcp.db Registry database
├── certs/
│ ├── cert.pem Agent TLS certificate
│ └── key.pem Agent TLS key
└── backups/ Database snapshots
Deployment
Agent Deployment (on nodes)
The agent is deployed like any other Metacircular service:
- Provision the mcp system user via NixOS config (with podman access).
- Create /srv/mcp/ directory and config file.
- Provision TLS certificate from Metacrypt.
- Create an MCIAS system account for the agent (mcp-agent).
- Install the mcp-agent binary.
- Start via systemd unit.
The agent runs as a systemd service. Container-first deployment is a v2 concern -- MCP needs to be running before it can manage its own agent.
[Unit]
Description=MCP Agent
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mcp-agent server --config /srv/mcp/mcp-agent.toml
Restart=on-failure
RestartSec=5
User=mcp
Group=mcp
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv
[Install]
WantedBy=multi-user.target
Note: ReadWritePaths=/srv (not /srv/mcp) because the agent writes
files to any service's /srv/<service>/ directory on behalf of the CLI.
CLI Installation (on operator workstation)
The CLI is a standalone binary with no daemon.
- Install the mcp binary to ~/.local/bin/ or /usr/local/bin/.
- Create ~/.config/mcp/mcp.toml.
- Create ~/.config/mcp/services/ directory.
- Run mcp login to authenticate.
- Run mcp sync to push service definitions and discover existing containers.
MCP Bootstrap (first time)
When bringing MCP up on a node that already has running containers:
- Deploy the agent (steps above).
- mcp sync with no service definition files -- the agent discovers all running containers and adds them to its registry with desired state ignore.
- mcp adopt <container> <service> for each container -- group them into services and set desired state to running.
- mcp service export <service> for each service -- generate service definition files from the adopted state.
- Review and edit the generated files as needed.
From this point, the service definition files are the source of truth and
mcp deploy manages the containers.
Existing containers on rift currently run under kyle's podman instance.
As part of MCP bootstrap, they will need to be re-created under the mcp
user's rootless podman. This is a one-time migration.
Security Model
Threat Mitigations
| Threat | Mitigation |
|---|---|
| Unauthorized C2 commands | Agent requires admin MCIAS token on every RPC |
| Token theft | Tokens have short expiry; cached validation keyed by SHA-256 |
| Agent impersonation | CLI verifies agent TLS certificate against Metacrypt CA |
| Arbitrary file write via push | Agent restricts writes to /srv/<service>/ for the named service |
| Arbitrary file read via pull | Agent restricts reads to /srv/<service>/ for the named service |
| Cross-service file access | File ops require a service name; agent resolves to that service's directory only |
| Container runtime escape | Rootless podman under mcp user; containers follow platform hardening |
| Network eavesdropping | All C2 traffic is gRPC over TLS over encrypted overlay |
Security Invariants
- Every agent RPC requires a valid MCIAS admin token. No anonymous or unprivileged access.
- File operations are scoped to /srv/<service>/ for the named service. Path traversal attempts (../, symlinks outside the service directory) are rejected.
- The agent never executes arbitrary commands. It only runs container runtime operations and file I/O through well-defined code paths.
- TLS 1.3 minimum on the agent's gRPC listener.
- The CLI's stored token is file-permission protected (0600).
- The agent runs as a dedicated mcp user with rootless podman. No root access required.
Project Structure
mcp/
├── cmd/
│ ├── mcp/ CLI
│ │ ├── main.go
│ │ ├── login.go
│ │ ├── deploy.go
│ │ ├── lifecycle.go stop, start, restart
│ │ ├── status.go list, ps, status
│ │ ├── sync.go sync desired state
│ │ ├── adopt.go adopt unmanaged containers
│ │ ├── service.go service show/edit/export
│ │ ├── transfer.go push, pull
│ │ └── node.go node add/list/remove
│ └── mcp-agent/ Agent daemon
│ └── main.go
├── internal/
│ ├── agent/ Agent core
│ │ ├── agent.go Agent struct, setup, gRPC server
│ │ ├── deploy.go Deploy flow
│ │ ├── lifecycle.go Stop, start, restart
│ │ ├── files.go File push/pull with path validation
│ │ ├── sync.go Desired state sync, reconciliation
│ │ ├── adopt.go Container adoption
│ │ └── status.go Status queries
│ ├── runtime/ Container runtime abstraction
│ │ ├── runtime.go Interface
│ │ └── podman.go Podman implementation
│ ├── registry/ Node-local registry
│ │ ├── db.go Schema, migrations
│ │ ├── services.go Service CRUD
│ │ ├── components.go Component CRUD
│ │ └── events.go Event log
│ ├── monitor/ Monitoring subsystem
│ │ ├── monitor.go Watch loop
│ │ └── alerting.go Alert evaluation and command execution
│ ├── servicedef/ Service definition file parsing
│ │ └── servicedef.go Load, parse, write TOML service defs
│ ├── auth/ MCIAS integration
│ │ └── auth.go Token validation, interceptor
│ └── config/ Configuration loading
│ ├── cli.go
│ └── agent.go
├── proto/mcp/
│ └── v1/
│ └── mcp.proto
├── gen/mcp/
│ └── v1/ Generated Go code
├── deploy/
│ ├── systemd/
│ │ └── mcp-agent.service
│ ├── examples/
│ │ ├── mcp.toml CLI config example
│ │ └── mcp-agent.toml Agent config example
│ └── scripts/
│ └── install-agent.sh
├── Makefile
├── buf.yaml
├── .golangci.yaml
├── CLAUDE.md
└── ARCHITECTURE.md
Future Work (v2+)
These are explicitly out of scope for v1 but inform the design:
- Auto-reconciliation: the agent detects drift but does not act on it in v1. v2 adds configurable auto-restart for drifted components (with backoff to avoid restart storms). This is the path to fully declarative operation -- the agent continuously reconciles toward desired state.
- Migration: snapshot /srv/<service>/ as tar.zst (with VACUUM INTO for clean DB copies), stream to destination node, restore. Requires streaming gRPC and archive assembly logic.
- Scheduling: automatic node selection based on resource availability and operator constraints. The agent already reports disk, memory, and CPU in NodeStatus to support this.
- Certificate provisioning: MCP provisions TLS certs from Metacrypt during deploy via the ACME client library.
- DNS updates: MCP pushes record updates to MCNS after deploy/migrate. Requires MCNS to have an API (or, as a stopgap, zone file editing).
- Multi-node orchestration: deploy across multiple nodes, rolling updates, health-aware placement.
- Web UI: a web interface for registry browsing and operations. Would be a separate binary communicating with agents via gRPC, following the platform's web UI pattern.