From ea7a9dcf4d381e3f3c4d111eeba988a04685336c Mon Sep 17 00:00:00 2001 From: Kyle Isom Date: Thu, 26 Mar 2026 10:31:48 -0700 Subject: [PATCH] Rewrite ARCHITECTURE.md incorporating review findings Major design changes from the review: - Merge agent and watcher into a single smart per-node daemon - CLI is a thin client with no database; service definition files are the operator's source of truth for desired state - Registry database lives on the agent, not the CLI - Rename containers to components; components are independently deployable within a service (mcp deploy metacrypt/web) - active: true/false in service definitions; desired_state values are running/stopped/ignore - Server-side TLS + bearer token (not mTLS) - Dedicated mcp user with rootless podman - CLI commands: list (registry), ps (live), status (drift+events), sync (push desired state) - Agent reports node resources (disk, memory, CPU) for future scheduling - Agent is gRPC-only (deliberate exception to REST+gRPC parity rule) Co-Authored-By: Claude Opus 4.6 (1M context) --- ARCHITECTURE.md | 1245 ++++++++++++++++++++++++----------------------- 1 file changed, 646 insertions(+), 599 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 3c1aee2..f1bc6cf 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -6,40 +6,47 @@ MCP is the orchestrator for the Metacircular platform. It manages container lifecycle, tracks what services run where, and transfers files between the operator's workstation and managed nodes. -MCP uses a **master/agent architecture**: +MCP has two components: -- **The agent** owns **"what is"**. It runs on each managed node, queries the - local container runtime, and reports facts. It manages containers and files - on behalf of the master. It has no opinion about what should be running. +- **The CLI** (`mcp`) is a thin client that runs on the operator's + workstation. 
It reads local service definition files — the operator's + declaration of what should be running — and pushes that intent to agents. + It has no database and no daemon process. -- **The master** owns **"what should be"**. It maintains the service registry - (desired state), compares it against agent-reported observed state, records - events, detects drift, and alerts the operator. The master runs in two - modes: as an interactive CLI for operator commands, and as a long-lived - watch process for continuous monitoring. +- **The agent** (`mcp-agent`) is a smart per-node daemon. It receives + desired state from the CLI, manages containers via the local runtime, + stores the node's registry (desired state, observed state, deployed specs, + events), monitors for drift, and alerts the operator. The agent owns the + full loop: it knows what should be running, observes what is running, and + can act on the difference. -Agents keep their containers running independently. The master is needed to -issue commands, track desired state, and monitor for drift. +The agent's container runtime interaction (podman/docker CLI) is an internal +subcomponent — the "dumb" part. The agent itself is the smart coordinator +that wraps it with state tracking, monitoring, and a gRPC API. ### v1 Scope -v1 targets a single-node deployment (one agent on rift, master on vade). The +v1 targets a single-node deployment (one agent on rift, CLI on vade). The core operations are: -- **Deploy** -- pull a container image and start (or restart) a service on a - specified node. -- **Container lifecycle** -- stop, start, restart individual containers. -- **Status** -- query live container state from agents and reconcile the - master's service registry. -- **Watch** -- continuous monitoring. Periodically syncs with agents, records - state-change events, detects drift, and alerts the operator. -- **File transfer** -- push or pull individual files between master and nodes - (config files, certificates). 
+- **Deploy** -- push service definitions to the agent; agent pulls images + and starts (or restarts) containers. +- **Component-level deploy** -- deploy individual components within a + service without disrupting others (e.g., update the web UI without + restarting the API server). +- **Container lifecycle** -- stop, start, restart services. +- **Monitoring** -- agent continuously watches container state, records + events, detects drift and flapping, alerts the operator. +- **Status** -- query live container state, view drift, review events. +- **File transfer** -- push or pull individual files between CLI and nodes + (config files, certificates), scoped to service directories. +- **Sync** -- push service definitions to the agent to update desired state + without deploying. Explicitly **not in v1**: migration (snapshot/tar.zst transfer), automatic -scheduling/placement, certificate provisioning from Metacrypt, DNS updates to -MCNS, multi-node orchestration, auto-reconciliation (restarting drifted -containers). +scheduling/placement, certificate provisioning from Metacrypt, DNS updates +to MCNS, multi-node orchestration, auto-reconciliation (agent restarting +drifted containers without operator action). --- @@ -48,12 +55,12 @@ containers). 
``` Operator workstation (vade) ┌──────────────────────────────┐ - │ mcp (master CLI) │ + │ mcp (CLI) │ │ │ - │ ┌────────────────────────┐ │ - │ │ Service Registry │ │ - │ │ (SQLite) │ │ - │ └────────────────────────┘ │ + │ ~/.config/mcp/services/ │ + │ metacrypt.toml │ + │ mcr.toml │ + │ mc-proxy.toml │ │ │ │ gRPC client ────────────────┼──── overlay ────┐ └──────────────────────────────┘ │ @@ -61,54 +68,67 @@ Operator workstation (vade) MC Node (rift) │ ┌────────────────────────────────────────────────┼──┐ │ │ │ - │ ┌──────────────────────┐ │ │ - │ │ mcp-agent │◄─────────────────────┘ │ - │ │ (gRPC server) │ │ - │ │ │ │ - │ │ MCIAS token │ │ - │ │ validation │ │ - │ └──────────┬───────────┘ │ - │ │ │ - │ │ exec │ - │ ▼ │ - │ ┌──────────────────────┐ │ - │ │ Container runtime │ │ - │ │ (podman / docker) │ │ - │ │ │ │ - │ │ ┌─────┐ ┌─────┐ │ ┌──────────────────┐ │ - │ │ │ svc │ │ svc │ ...│ │ /srv// │ │ - │ │ │ α │ │ β │ │ │ config, db, │ │ - │ │ └─────┘ └─────┘ │ │ certs, backups │ │ - │ └──────────────────────┘ └──────────────────┘ │ + │ ┌──────────────────────────────────────────┐ │ │ + │ │ mcp-agent │◄─┘ │ + │ │ │ │ + │ │ ┌─────────────┐ ┌──────────────────┐ │ │ + │ │ │ Registry │ │ Monitor │ │ │ + │ │ │ (SQLite) │ │ (watch loop, │ │ │ + │ │ │ │ │ events, │ │ │ + │ │ │ desired │ │ alerting) │ │ │ + │ │ │ observed │ │ │ │ │ + │ │ │ specs │ │ │ │ │ + │ │ │ events │ │ │ │ │ + │ │ └─────────────┘ └──────────────────┘ │ │ + │ │ │ │ + │ │ ┌──────────────────────────────────┐ │ │ + │ │ │ Container runtime (podman) │ │ │ + │ │ │ │ │ │ + │ │ │ ┌───────┐ ┌───────┐ ┌───────┐ │ │ │ + │ │ │ │ svc α │ │ svc β │ │ svc γ │ │ │ │ + │ │ │ └───────┘ └───────┘ └───────┘ │ │ │ + │ │ └──────────────────────────────────┘ │ │ + │ └──────────────────────────────────────────┘ │ + │ │ + │ /srv// (config, db, certs, backups) │ └───────────────────────────────────────────────────┘ ``` ### Components -| Component | Binary | Role | -|-----------|--------|------| -| Master | `mcp` | Operator 
CLI. Maintains the service registry. Issues commands to agents. | -| Agent | `mcp-agent` | Node daemon. Manages local containers and files. Authenticates callers via MCIAS. | +| Component | Binary | Where | Role | +|-----------|--------|-------|------| +| CLI | `mcp` | Operator workstation (vade) | Thin client. Reads service definitions, pushes intent to agents, queries status. | +| Agent | `mcp-agent` | Each managed node (rift) | Smart daemon. Manages containers, stores registry, monitors, alerts. | ### Communication -Master and agent communicate over gRPC with mutual TLS. The transport is the -encrypted overlay network (Tailscale/WireGuard). The master authenticates to -the agent by presenting an MCIAS bearer token. The agent validates the token -by calling MCIAS and checking for the `admin` role. +The CLI communicates with agents over gRPC with server-side TLS. The +transport is the encrypted overlay network (Tailscale/WireGuard). The CLI +authenticates by presenting an MCIAS bearer token in gRPC metadata. The +agent validates the token by calling MCIAS and checking for the `admin` +role. + +Client certificates (mTLS) are not used. The overlay network restricts +network access to platform participants, MCIAS tokens are short-lived with +role enforcement, and the agent's TLS certificate is verified against the +Metacrypt CA. The scenarios where mTLS adds value (stolen token, MCIAS +compromise) already imply broader platform compromise. mTLS remains an +option for future security hardening. --- ## Authentication and Authorization -MCP follows the platform authentication model: all auth is delegated to MCIAS. +MCP follows the platform authentication model: all auth is delegated to +MCIAS. ### Agent Authentication The agent is a gRPC server with a unary interceptor that enforces authentication on every RPC: -1. Master includes an MCIAS bearer token in the gRPC metadata +1. CLI includes an MCIAS bearer token in the gRPC metadata (`authorization: Bearer `). 2. 
Agent extracts the token and validates it against MCIAS (cached 30s by SHA-256 of the token, per platform convention). @@ -117,151 +137,94 @@ authentication on every RPC: 4. If validation fails, the RPC returns `UNAUTHENTICATED` (invalid/expired token) or `PERMISSION_DENIED` (valid token, not admin). -### Master Authentication +### CLI Authentication -The master authenticates to MCIAS before issuing commands. The token can be +The CLI authenticates to MCIAS before issuing commands. The token can be obtained by: 1. `mcp login` -- interactive login, stores the token locally. 2. Environment variable (`MCP_TOKEN`) for scripted use. -3. System account credentials in the master config file. +3. System account credentials in the CLI config file. The stored token is used for all subsequent agent RPCs until it expires. -### Agent Identity +--- -The agent itself authenticates to MCIAS as a system account (for future -agent-initiated flows like health reporting). For v1, the agent is purely -reactive -- it only responds to master-initiated RPCs and does not make -outbound calls beyond MCIAS token validation. +## Services and Components + +A **service** is a logical unit of the platform (e.g., "metacrypt"). A +service has one or more **components** -- the containers that make it up +(e.g., "api" and "web"). Components within a service: + +- Share the same node. +- Share the same `/srv//` data directory. +- Are deployed together by default, but can be deployed independently. + +This models the real constraint that components like an API server and its +web UI are co-located and share state, but have different operational +characteristics. For example, restarting Metacrypt's API server requires +unsealing the vault, but the web UI can be redeployed independently without +disrupting the API. + +Services with a single component (e.g., mc-proxy) simply have one +`[[components]]` block. + +The unique identity of a component is `node/service/component`. 
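
For example, a single-component service definition needs only one `[[components]]` block. A minimal sketch, following the metacrypt example's format -- the image, port, and volume values here are illustrative assumptions, not mc-proxy's actual configuration:

```toml
name = "mc-proxy"
node = "rift"
active = true

[[components]]
name = "proxy"
image = "mcr.svc.mcp.metacircular.net:8443/mc-proxy:latest"
network = "docker_default"
restart = "unless-stopped"
ports = ["127.0.0.1:8443:8443"]
volumes = ["/srv/mc-proxy:/srv/mc-proxy"]
```

The component's unique identity is then `rift/mc-proxy/proxy`.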
---

-## Master
+## CLI

-### CLI Commands
+### Commands

 ```
 mcp login                            Authenticate to MCIAS, store token

-mcp deploy <service>                 Deploy from service definition file
+mcp deploy <service>                 Deploy all components from service definition
+mcp deploy <service>/<component>     Deploy a single component
 mcp deploy -f <file>                 Deploy from explicit file
-mcp deploy --image <image>           Deploy with image override

-mcp stop <service>                   Stop all containers for a service
-mcp start <service>                  Start all containers for a service
-mcp restart <service>                Restart all containers for a service
+mcp stop <service>                   Stop all components
+mcp start <service>                  Start all components
+mcp restart <service>                Restart all components

-mcp ps                               List all services in the registry
-mcp status [service]                 Query live state from agents, reconcile registry
-mcp sync                             Full rescan: agents report all containers,
-                                     registry is reconciled against reality
+mcp list                             List services from agent registry (no runtime query)
+mcp ps                               Live check: query runtime, show running containers
+                                     with uptime and version
+mcp status [service]                 Full picture: live query + drift + recent events
+mcp sync                             Push service definitions to agent (update desired
+                                     state without deploying)

 mcp adopt <container> <service>      Claim an unmanaged container into a service

-mcp service show <service>           Print current spec from registry
+mcp service show <service>           Print current spec from agent registry
 mcp service edit <service>           Open service definition in $EDITOR
-mcp service export <service>         Write registry spec to local service file
+mcp service export <service>         Write agent registry spec to local service file
 mcp service export -f <path>         Write to explicit path

 mcp push <file> <service> [path]     Copy a local file into /srv/<service>/[path]
-mcp pull <service> <path> [local-file]   Copy a file from /srv/<service>/<path> to local
-
-mcp watch                            Run continuous monitoring (long-lived)
-mcp watch --interval 60s             Custom poll interval
+mcp pull <service> <path> [local-file]   Copy a file from /srv/<service>/<path> to local

 mcp node list                        List registered nodes
 mcp node add <name> <address>
Register a node mcp node remove Deregister a node ``` -### Service Registry +### Service Definition Files -The master maintains a SQLite database that tracks: - -- **Nodes** -- name, overlay address, agent port, state. -- **Services** -- name, assigned node, current state. -- **Containers** -- name, parent service, image reference, container runtime - config (network, ports, volumes, user, restart policy, command), observed - state. - -A service (e.g., "metacrypt") has one or more containers (e.g., "metacrypt" -API server + "metacrypt-web" UI). Deploy operates on services; lifecycle -commands can target individual containers. - -### Desired State vs. Observed State - -The registry tracks two separate pieces of information for each container: - -- **Desired state** -- what the operator wants: `running` or `stopped`. Set - explicitly by `mcp deploy`, `mcp stop`, `mcp start`, etc. -- **Observed state** -- what the agent actually sees on the node: `running`, - `stopped`, `exited`, `removed`, or `unknown`. - -These can diverge. A container with desired=`running` and observed=`exited` -has crashed. A container with desired=`stopped` and observed=`stopped` is -fine. `mcp status` shows both states and flags **drift** -- any container -where observed state does not match desired state. - -| Desired | Observed | Status | -|---------|----------|--------| -| running | running | OK | -| running | stopped | **DRIFT** -- stopped unexpectedly | -| running | exited | **DRIFT** -- crashed | -| running | removed | **DRIFT** -- container gone | -| stopped | stopped | OK | -| stopped | removed | OK | -| stopped | running | **DRIFT** -- running when it shouldn't be | - -For v1, MCP reports drift but does not auto-reconcile. The operator decides -whether to `mcp start`, `mcp deploy`, or investigate. Auto-reconciliation -(restart crashed containers, enforce desired state) is a v2 concern. - -### Registry Reconciliation - -`mcp sync` (and implicitly, `mcp status`) works as follows: - -1. 
For each registered node, master calls `ListContainers` on the agent. -2. Agent queries the container runtime and returns the full list of - containers with their configuration and state. -3. Master compares the agent's report against the registry: - - **Container in registry, seen by agent**: update observed state. - - **Container in registry, not seen by agent**: set observed state to - `removed`. - - **Container seen by agent, not in registry**: add to registry as - `unmanaged` (desired state unset, observed state from agent). -4. Master prints a summary: OK containers, drifted containers, and newly - discovered unmanaged containers. - -### Adopting Unmanaged Containers - -On first `mcp sync`, every container on rift will appear as `unmanaged` -- -MCP didn't deploy them and doesn't know their intended service grouping. - -`mcp adopt ` claims an unmanaged container: - -1. If the service doesn't exist, create it (assigned to the container's - node). -2. Set the container's service to the specified service. -3. Set `managed = true` and desired state to `running` (or `stopped` if - the container is currently stopped). - -This lets the operator incrementally bring existing containers under MCP -management without redeploying them. - -### Service Definitions - -A service definition is a TOML file that declares the containers for a +A service definition is a TOML file that declares the components for a service. These files live in `~/.config/mcp/services/` by default, one -per service. +per service. They are the operator's declaration of intent -- what should +exist, with what spec, in what state. 
Example: `~/.config/mcp/services/metacrypt.toml` ```toml name = "metacrypt" node = "rift" +active = true -[[containers]] -name = "metacrypt" +[[components]] +name = "api" image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest" network = "docker_default" user = "0:0" @@ -269,8 +232,8 @@ restart = "unless-stopped" ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"] volumes = ["/srv/metacrypt:/srv/metacrypt"] -[[containers]] -name = "metacrypt-web" +[[components]] +name = "web" image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest" network = "docker_default" user = "0:0" @@ -280,27 +243,25 @@ volumes = ["/srv/metacrypt:/srv/metacrypt"] cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"] ``` -#### Deploy Resolution +### Deploy Resolution -`mcp deploy ` resolves the container spec through a precedence +`mcp deploy ` resolves the component spec through a precedence chain: -1. **Command-line flags** (highest) -- `--image`, `--node`, etc. override - individual fields. -2. **Service definition file** -- if `-f ` is specified, use that +1. **Service definition file** -- if `-f ` is specified, use that file. Otherwise look for `~/.config/mcp/services/.toml`. -3. **Registry** (lowest) -- if no file exists, use the spec from the last - successful deploy stored in the registry. +2. **Agent registry** (fallback) -- if no file exists, use the spec from + the last successful deploy stored in the agent's registry. -If none of these exist (first deploy, no file), the deploy fails with an -error telling the operator to create a service definition. +If neither exists (first deploy, no file), the deploy fails with an error +telling the operator to create a service definition. -On a successful deploy, the resolved spec (after all overrides) is written -to the registry. The file on disk is **not** modified -- it represents the -operator's declared intent, not the deployed state. To sync the file with -reality after overrides, use `mcp service export`. 
+The CLI pushes the resolved spec to the agent. The agent records it in its +registry and executes the deploy. The service definition file on disk is +**not** modified -- it represents the operator's declared intent, not the +deployed state. To sync the file with reality, use `mcp service export`. -#### Spec Lifecycle +### Spec Lifecycle ``` ┌─────────────┐ @@ -309,9 +270,9 @@ reality after overrides, use `mcp service export`. │ (.toml) │ │ └─────────────┘ ▼ ▲ ┌─────────────────┐ - │ │ Registry │ - mcp service │ │ (deployed │ - export │ │ spec) │ + │ │ Agent registry │ + mcp service │ │ (deployed │ + export │ │ spec) │ │ └─────────────────┘ │ │ └───────────────────────┘ @@ -319,21 +280,18 @@ reality after overrides, use `mcp service export`. - **Operator writes** the service definition file (or copies one from the service's repo). -- **`mcp deploy`** reads the file, applies overrides, sends to agent, - records the resolved spec in the registry. -- **`mcp service export`** reads the registry and writes it back to the - local file, incorporating any overrides or drift that occurred. - -This keeps the file and registry loosely coupled. The file is the starting -point; the registry is the truth; export reconciles them. +- **`mcp deploy`** reads the file, pushes to the agent, agent records the + spec in its registry and deploys. +- **`mcp service export`** reads the agent's registry and writes it back to + the local file, incorporating any changes since the file was last edited. `mcp service edit ` opens the service definition in `$EDITOR` (falling back to `$VISUAL`, then `vi`). If no file exists yet, it exports -the current registry spec to a temp file first, so the operator starts +the current spec from the agent's registry first, so the operator starts from the deployed state rather than a blank file. After the editor exits, the file is saved to the standard path in the services directory. 
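
The resolution precedence can be sketched in Go. The function name and the stubbed registry lookup below are illustrative assumptions -- in the real CLI the fallback would be a gRPC call to the agent, not a local callback:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// resolveSpec sketches the deploy resolution chain: an explicit -f file
// wins, then the standard services directory, then the spec recorded in
// the agent's registry from the last successful deploy.
func resolveSpec(servicesDir, service, explicitFile string, registryHasSpec func(string) bool) (string, error) {
	if explicitFile != "" {
		return explicitFile, nil // explicit -f path always wins
	}
	standard := filepath.Join(servicesDir, service+".toml")
	if _, err := os.Stat(standard); err == nil {
		return standard, nil // standard service definition file
	}
	if registryHasSpec(service) {
		return "agent-registry:" + service, nil // fall back to last deployed spec
	}
	return "", errors.New("no service definition for " + service + "; create one first")
}

func main() {
	// No file on disk, but the agent has a previously deployed spec.
	src, err := resolveSpec("/nonexistent", "metacrypt", "", func(string) bool { return true })
	fmt.Println(src, err)
}
```

The failure branch matches the documented behavior: with no file and no prior deploy, the operator is told to create a service definition.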
-#### Where Definition Files Come From +### Where Definition Files Come From Service definition files can be: @@ -345,140 +303,20 @@ Service definition files can be: - **Generated by converting from mcdeploy.toml** during initial MCP migration (one-time). -### Monitoring (`mcp watch`) - -`mcp watch` is a long-lived master process that continuously monitors the -platform. It runs the same sync logic as `mcp status` on a loop, records -state-change events, and alerts the operator when drift is detected. - -#### Event Log - -Every state transition is recorded in the `events` table: - -```sql -CREATE TABLE events ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - container TEXT NOT NULL, - node TEXT NOT NULL, - prev_state TEXT NOT NULL, - new_state TEXT NOT NULL, - timestamp TEXT NOT NULL DEFAULT (datetime('now')) -); -``` - -Events accumulate over time and support rate queries: - -```sql --- How many times has metacrypt exited in the last hour? -SELECT COUNT(*) FROM events -WHERE container = 'metacrypt' - AND new_state = 'exited' - AND timestamp > datetime('now', '-1 hour'); -``` - -Old events are pruned periodically (default: retain 30 days). - -#### Watch Loop - -Each iteration of the watch loop: - -1. Sync with all agents (same as `mcp sync`). -2. For each observed state change since the last iteration, insert an event. -3. Evaluate alert conditions against the current state and event history. -4. If an alert fires, execute the configured alert command. -5. Sleep for the configured interval. - -#### Alert Conditions - -The watch process evaluates two types of alert: - -- **Drift alert**: a managed container's observed state does not match its - desired state. Fires on the transition, not on every poll. -- **Rate alert**: a container has changed state more than N times within a - window (flapping). Default threshold: 3 transitions in 10 minutes. - -Each alert has a **cooldown** per container. 
Once an alert fires for a -container, it is suppressed for the cooldown period regardless of further -transitions. This prevents notification spam from a flapping service. - -``` -[watch] -interval = "60s" -alert_command = "" # command to run on alert; empty = log only -cooldown = "15m" # suppress repeat alerts per container -flap_threshold = 3 # state changes within flap_window = flapping -flap_window = "10m" -retention = "30d" # event log retention -``` - -#### Alert Command - -When an alert fires, the master executes the configured command with -context passed via environment variables: - -| Variable | Value | -|----------|-------| -| `MCP_CONTAINER` | Container name | -| `MCP_SERVICE` | Parent service name | -| `MCP_NODE` | Node name | -| `MCP_DESIRED` | Desired state | -| `MCP_OBSERVED` | Observed state | -| `MCP_PREV_STATE` | Previous observed state | -| `MCP_ALERT_TYPE` | `drift` or `flapping` | -| `MCP_TRANSITIONS` | Number of transitions in the flap window (for flapping alerts) | - -The alert command is the operator's choice. MCP does not ship with or depend -on any notification system. - -```bash -# Push notification -alert_command = "ntfy publish mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER ($MCP_DESIRED -> $MCP_OBSERVED)'" - -# Webhook -alert_command = "curl -s -X POST https://hooks.example.com/mcp ..." - -# Syslog -alert_command = "logger -t mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER on $MCP_NODE'" -``` - -#### Deployment - -`mcp watch` is intended to run as a systemd service on a node (typically -rift for v1), not on the operator's workstation. It uses the same binary, -config, and database as the CLI -- the difference is the execution mode. 
- -```ini -[Unit] -Description=MCP Watch (continuous monitoring) -After=network-online.target - -[Service] -Type=simple -ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml -Restart=on-failure -RestartSec=10 -User=mcp -Group=mcp -``` - -This means the master's SQLite database lives on the node running `mcp -watch` (rift for v1), not on the operator's workstation. The operator can -still run CLI commands from vade -- they connect to the same database via -a configurable path, or the CLI operates statelessly by querying agents -directly and reading the registry from a shared location. - -**Alternatively**, the operator runs `mcp watch` on vade when at their desk -and relies on container restart policies for unattended operation. The -design supports both -- `mcp watch` is an optional monitoring layer, not -a required component. - --- ## Agent +The agent is the smart per-node daemon. It owns the full lifecycle: +receives desired state, manages containers, stores the registry, monitors +for drift, and alerts the operator. + ### gRPC Service Definition -The agent exposes a single gRPC service. All RPCs require admin authentication. +The agent exposes a single gRPC service. All RPCs require admin +authentication. The agent is gRPC-only -- it is internal C2 infrastructure, +not a user-facing service, so the platform's REST+gRPC parity rule does not +apply. 
```protobuf syntax = "proto3"; @@ -487,15 +325,22 @@ package mcp.v1; import "google/protobuf/timestamp.proto"; service McpAgent { - // Container lifecycle + // Service lifecycle rpc Deploy(DeployRequest) returns (DeployResponse); - rpc StartContainer(ContainerRequest) returns (ContainerResponse); - rpc StopContainer(ContainerRequest) returns (ContainerResponse); - rpc RestartContainer(ContainerRequest) returns (ContainerResponse); - rpc RemoveContainer(ContainerRequest) returns (ContainerResponse); + rpc StopService(ServiceRequest) returns (ServiceResponse); + rpc StartService(ServiceRequest) returns (ServiceResponse); + rpc RestartService(ServiceRequest) returns (ServiceResponse); - // Status - rpc ListContainers(ListContainersRequest) returns (ListContainersResponse); + // Desired state + rpc SyncDesiredState(SyncRequest) returns (SyncResponse); + + // Status and registry + rpc ListServices(ListServicesRequest) returns (ListServicesResponse); + rpc GetServiceStatus(ServiceStatusRequest) returns (ServiceStatusResponse); + rpc LiveCheck(LiveCheckRequest) returns (LiveCheckResponse); + + // Adopt + rpc AdoptContainer(AdoptRequest) returns (AdoptResponse); // File transfer rpc PushFile(PushFileRequest) returns (PushFileResponse); @@ -505,63 +350,128 @@ service McpAgent { rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse); } -// --- Container lifecycle --- +// --- Service lifecycle --- -message ContainerSpec { +message ComponentSpec { string name = 1; string image = 2; string network = 3; string user = 4; - string restart_policy = 5; + string restart = 5; repeated string ports = 6; // "host:container" mappings repeated string volumes = 7; // "host:container" mount specs repeated string cmd = 8; // command and arguments } +message ServiceSpec { + string name = 1; + bool active = 2; + repeated ComponentSpec components = 3; +} + message DeployRequest { - repeated ContainerSpec containers = 1; + ServiceSpec service = 1; + string component = 2; // deploy single 
component (empty = all) } message DeployResponse { - repeated ContainerResult results = 1; + repeated ComponentResult results = 1; } -message ContainerRequest { - string name = 1; -} - -message ContainerResponse { +message ComponentResult { string name = 1; bool success = 2; string error = 3; } -message ContainerResult { +message ServiceRequest { string name = 1; - bool success = 2; - string error = 3; } -// --- Status --- +message ServiceResponse { + repeated ComponentResult results = 1; +} -message ListContainersRequest {} +// --- Desired state --- -message ContainerInfo { +message SyncRequest { + repeated ServiceSpec services = 1; // all services for this node +} + +message SyncResponse { + repeated ServiceSyncResult results = 1; +} + +message ServiceSyncResult { + string name = 1; + bool changed = 2; // desired state was updated + string summary = 3; +} + +// --- Status and registry --- + +message ListServicesRequest {} + +message ServiceInfo { + string name = 1; + bool active = 2; + repeated ComponentInfo components = 3; +} + +message ComponentInfo { string name = 1; string image = 2; - string state = 3; // "running", "stopped", "exited", etc. 
- string network = 4; - string user = 5; - string restart_policy = 6; - repeated string ports = 7; - repeated string volumes = 8; - repeated string cmd = 9; - google.protobuf.Timestamp created = 10; - google.protobuf.Timestamp started = 11; + string desired_state = 3; // "running", "stopped", "ignore" + string observed_state = 4; // "running", "stopped", "exited", "removed", "unknown" + string version = 5; + google.protobuf.Timestamp started = 6; } -message ListContainersResponse { - repeated ContainerInfo containers = 1; +message ListServicesResponse { + repeated ServiceInfo services = 1; +} + +message ServiceStatusRequest { + string name = 1; // empty = all services +} + +message DriftInfo { + string service = 1; + string component = 2; + string desired_state = 3; + string observed_state = 4; +} + +message EventInfo { + string component = 1; + string prev_state = 2; + string new_state = 3; + google.protobuf.Timestamp timestamp = 4; +} + +message ServiceStatusResponse { + repeated ServiceInfo services = 1; + repeated DriftInfo drift = 2; + repeated EventInfo recent_events = 3; +} + +message LiveCheckRequest {} + +message LiveCheckResponse { + repeated ServiceInfo services = 1; // with freshly observed state +} + +// --- Adopt --- + +message AdoptRequest { + string container = 1; // runtime container name + string service = 2; // service to adopt into + string component = 3; // component name within the service +} + +message AdoptResponse { + bool success = 1; + string error = 2; } // --- File transfer --- @@ -569,10 +479,10 @@ message ListContainersResponse { // The agent resolves the full path and rejects traversal attempts. message PushFileRequest { - string service = 1; // service name (-> /srv//) - string path = 2; // relative path within service dir + string service = 1; // service name (-> /srv//) + string path = 2; // relative path within service dir bytes content = 3; - uint32 mode = 4; // file permissions (e.g. 
0600) + uint32 mode = 4; // file permissions (e.g. 0600) } message PushFileResponse { @@ -581,8 +491,8 @@ message PushFileResponse { } message PullFileRequest { - string service = 1; // service name (-> /srv//) - string path = 2; // relative path within service dir + string service = 1; // service name (-> /srv//) + string path = 2; // relative path within service dir } message PullFileResponse { @@ -597,61 +507,75 @@ message NodeStatusRequest {} message NodeStatusResponse { string node_name = 1; - string runtime = 2; // "podman", "docker" + string runtime = 2; // "podman", "docker" string runtime_version = 3; - uint32 container_count = 4; - google.protobuf.Timestamp uptime_since = 5; + uint32 service_count = 4; + uint32 component_count = 5; + uint64 disk_total_bytes = 6; + uint64 disk_free_bytes = 7; + uint64 memory_total_bytes = 8; + uint64 memory_free_bytes = 9; + double cpu_usage_percent = 10; + google.protobuf.Timestamp uptime_since = 11; } ``` ### Container Runtime The agent manages containers by executing the local container runtime CLI -(`podman` or `docker`). The runtime is configured in the agent's config file. -The agent does not use the Docker socket API -- it shells out to the CLI for -simplicity and debuggability (the operator can always run the same commands -manually). +(`podman`). The runtime is configured in the agent's config file. The agent +shells out to the CLI for simplicity and debuggability -- the operator can +always run the same commands manually. + +The agent runs as a dedicated `mcp` system user. Podman runs rootless under +this user. All containers are owned by `mcp`. The NixOS configuration +provisions the `mcp` user with podman access. #### Deploy Flow When the agent receives a `Deploy` RPC: -1. For each container in the request: - a. Pull the image: ` pull ` +1. Record the service spec in the registry (desired state, component specs). +2. For each component being deployed (all, or the one named in the request): + a. 
Pull the image: `podman pull <image>`
    b. Stop and remove the existing container (if any):
-      `<runtime> stop <name>` and `<runtime> rm <name>`
-   c. Start the new container: `<runtime> run -d --name <name> [flags] <image> [cmd]`
-   d. Verify the container is running: `<runtime> inspect <name>`
-2. Return success/failure per container.
+      `podman stop <name>` and `podman rm <name>`
+   c. Start the new container:
+      `podman run -d --name <name> [flags] <image> [cmd]`
+   d. Verify the container is running: `podman inspect <name>`
+   e. Update observed state in the registry.
+3. Set desired state to `running` for deployed components.
+4. Return success/failure per component.
 
-The flags passed to `run` are derived from the `ContainerSpec`:
+The flags passed to `podman run` are derived from the `ComponentSpec`:
 
 | Spec field | Runtime flag |
 |------------|-------------|
 | `network` | `--network <name>` |
 | `user` | `--user <user>` |
-| `restart_policy` | `--restart <policy>` |
+| `restart` | `--restart <policy>` |
 | `ports` | `-p <mapping>` (repeated) |
 | `volumes` | `-v <mapping>` (repeated) |
 | `cmd` | appended after the image name |
 
 ### File Transfer
 
-The agent supports single-file push and pull, scoped to a specific service's
-data directory. This is the mechanism for deploying config files and
-certificates to nodes.
+The agent supports single-file push and pull, scoped to a specific
+service's data directory. This is the mechanism for deploying config files
+and certificates to nodes.
 
 Every file operation specifies a **service name** and a **relative path**.
 The agent resolves the full path as `/srv/<service>/<path>`. This scoping
-ensures that a deploy for service A cannot accidentally write into service
-B's directory.
+ensures that a file operation for service A cannot write into service B's
+directory.
 
-**Push**: master sends the service name, relative path, file content, and
+**Push**: CLI sends the service name, relative path, file content, and
 permissions.
The agent resolves the path, validates it (no `..` traversal, -no symlinks escaping the service directory), creates intermediate directories -if needed, and writes the file atomically (write to temp file, then rename). +no symlinks escaping the service directory), creates intermediate +directories if needed, and writes the file atomically (write to temp file, +then rename). -**Pull**: master sends the service name and relative path. The agent resolves +**Pull**: CLI sends the service name and relative path. The agent resolves the path, validates it, reads the file, and returns the content and permissions. @@ -666,118 +590,277 @@ mcp push cert.pem mcr certs/mcr.pem mcp pull mcr mcr.toml ./mcr.toml ``` -When the relative path is omitted from `mcp push`, the basename of the local -file is used. +When the relative path is omitted from `mcp push`, the basename of the +local file is used. File size is bounded by gRPC message limits. For v1, the default 4MB gRPC message size is sufficient -- config files and certificates are kilobytes. If larger transfers are needed in the future, streaming RPCs or the v2 tar.zst archive transfer will handle them. ---- +### Desired State vs. Observed State -## Database Schema +The agent's registry tracks two separate pieces of information for each +component: -The master's SQLite database stores the service registry. +- **Desired state** -- what the operator wants: `running`, `stopped`, or + `ignore`. Set by the CLI via deploy, stop, start, sync, or adopt. +- **Observed state** -- what the container runtime reports: `running`, + `stopped`, `exited`, `removed`, or `unknown`. + +These can diverge. A component with desired=`running` and observed=`exited` +has crashed. The agent flags this as **drift**. Components with +desired=`ignore` are tracked but never flagged as drifting. 
+ +| Desired | Observed | Status | +|---------|----------|--------| +| running | running | OK | +| running | stopped | **DRIFT** -- stopped unexpectedly | +| running | exited | **DRIFT** -- crashed | +| running | removed | **DRIFT** -- container gone | +| stopped | stopped | OK | +| stopped | removed | OK | +| stopped | running | **DRIFT** -- running when it shouldn't be | +| ignore | (any) | OK -- not managed | + +For v1, the agent reports drift but does not auto-reconcile. The operator +decides whether to `mcp start`, `mcp deploy`, or investigate. +Auto-reconciliation (agent restarting drifted containers without operator +action) is a v2 concern. + +### Registry Reconciliation + +The agent reconciles its registry against the container runtime on three +occasions: during the monitor loop (continuous), on `mcp ps` / `mcp status` +(on demand), and on `mcp sync` (when new desired state is pushed). + +Reconciliation: + +1. Agent queries the container runtime for all containers. +2. Compares the runtime's report against the registry: + - **Component in registry, seen in runtime**: update observed state. + - **Component in registry, not in runtime**: set observed state to + `removed`. + - **Container in runtime, not in registry**: add to registry with + desired state `ignore`. These are containers the agent sees but + MCP didn't deploy. +3. Record state-change events for any transitions. + +### Adopting Unmanaged Containers + +On first sync, every container on rift will appear with desired state +`ignore` -- MCP didn't deploy them and doesn't know their intended service +grouping. + +`mcp adopt ` claims an unmanaged container: + +1. If the service doesn't exist in the registry, create it. +2. Assign the container as a component of the specified service. +3. Set desired state to `running` (or `stopped` if the container is + currently stopped). + +This lets the operator incrementally bring existing containers under MCP +management without redeploying them. 
The typical bootstrap flow: +`mcp sync` to discover containers, `mcp adopt` to group them into services, +`mcp service export` to generate service definition files from the adopted +state. + +### Monitoring + +The agent runs a continuous monitor loop that watches container state and +alerts the operator when problems are detected. Monitoring is a core +function of the agent, not a separate process. + +#### Event Log + +Every state transition is recorded in the `events` table: ```sql -CREATE TABLE nodes ( - name TEXT PRIMARY KEY, - address TEXT NOT NULL, - agent_port INTEGER NOT NULL DEFAULT 9444, - created_at TEXT NOT NULL DEFAULT (datetime('now')), - updated_at TEXT NOT NULL DEFAULT (datetime('now')) -); - -CREATE TABLE services ( - name TEXT PRIMARY KEY, - node TEXT NOT NULL REFERENCES nodes(name), - created_at TEXT NOT NULL DEFAULT (datetime('now')), - updated_at TEXT NOT NULL DEFAULT (datetime('now')) -); - -CREATE TABLE containers ( - name TEXT PRIMARY KEY, - service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE, - image TEXT NOT NULL, - network TEXT NOT NULL DEFAULT 'bridge', - user_spec TEXT NOT NULL DEFAULT '', - restart_policy TEXT NOT NULL DEFAULT 'unless-stopped', - desired_state TEXT NOT NULL DEFAULT 'running', - observed_state TEXT NOT NULL DEFAULT 'unknown', - managed INTEGER NOT NULL DEFAULT 1, - created_at TEXT NOT NULL DEFAULT (datetime('now')), - updated_at TEXT NOT NULL DEFAULT (datetime('now')) -); - -CREATE TABLE container_ports ( - container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, - mapping TEXT NOT NULL, - PRIMARY KEY (container, mapping) -); - -CREATE TABLE container_volumes ( - container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, - mapping TEXT NOT NULL, - PRIMARY KEY (container, mapping) -); - -CREATE TABLE container_cmd ( - container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, - position INTEGER NOT NULL, - arg TEXT NOT NULL, - PRIMARY KEY (container, position) -); - --- 
Event log for monitoring and alerting CREATE TABLE events ( id INTEGER PRIMARY KEY AUTOINCREMENT, - container TEXT NOT NULL, - node TEXT NOT NULL, + component TEXT NOT NULL, + service TEXT NOT NULL, prev_state TEXT NOT NULL, new_state TEXT NOT NULL, timestamp TEXT NOT NULL DEFAULT (datetime('now')) ); -CREATE INDEX idx_events_container_time ON events(container, timestamp); +CREATE INDEX idx_events_component_time ON events(component, timestamp); +``` + +Events accumulate over time and support rate queries: + +```sql +-- How many times has metacrypt-api exited in the last hour? +SELECT COUNT(*) FROM events +WHERE component = 'api' AND service = 'metacrypt' + AND new_state = 'exited' + AND timestamp > datetime('now', '-1 hour'); +``` + +Old events are pruned at the start of each monitor iteration (default: +retain 30 days). + +#### Monitor Loop + +Each iteration of the monitor loop: + +1. Query the container runtime for all container states. +2. Reconcile against the registry (update observed states). +3. For each state transition since the last iteration, insert an event. +4. Evaluate alert conditions against the current state and event history. +5. If an alert fires, execute the configured alert command. +6. Sleep for the configured interval. + +#### Alert Conditions + +The monitor evaluates two types of alert: + +- **Drift alert**: a managed component's observed state does not match its + desired state. Fires on the transition, not on every iteration. +- **Flap alert**: a component has changed state more than N times within a + window. Default threshold: 3 transitions in 10 minutes. + +Each alert has a **cooldown** per component. Once an alert fires for a +component, it is suppressed for the cooldown period regardless of further +transitions. This prevents notification spam from a flapping service. 
+
+```toml
+[monitor]
+interval = "60s"
+alert_command = ""   # command to run on alert; empty = log only
+cooldown = "15m"     # suppress repeat alerts per component
+flap_threshold = 3   # this many state changes within flap_window = flapping
+flap_window = "10m"
+retention = "30d"    # event log retention
+```
+
+#### Alert Command
+
+When an alert fires, the agent executes the configured command with
+context passed via environment variables:
+
+| Variable | Value |
+|----------|-------|
+| `MCP_COMPONENT` | Component name |
+| `MCP_SERVICE` | Parent service name |
+| `MCP_NODE` | Node name |
+| `MCP_DESIRED` | Desired state |
+| `MCP_OBSERVED` | Observed state |
+| `MCP_PREV_STATE` | Previous observed state |
+| `MCP_ALERT_TYPE` | `drift` or `flapping` |
+| `MCP_TRANSITIONS` | Number of transitions in the flap window (for flap alerts) |
+
+The alert command is the operator's choice. MCP does not ship with or
+depend on any notification system. The examples below use TOML literal
+strings (single quotes) so that the shell sees double quotes and expands
+the `MCP_*` variables.
+
+```bash
+# Push notification
+alert_command = 'ntfy publish mcp "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT ($MCP_DESIRED -> $MCP_OBSERVED)"'
+
+# Webhook
+alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."
+
+# Syslog
+alert_command = 'logger -t mcp "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT on $MCP_NODE"'
+```
+
+---
+
+## Database Schema
+
+The agent's SQLite database stores the node-local registry. Each agent
+has its own database. Component identity is scoped to the node -- there
+are no cross-node name collisions because each node has a separate
+database.
+ +```sql +CREATE TABLE services ( + name TEXT PRIMARY KEY, + active INTEGER NOT NULL DEFAULT 1, + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE TABLE components ( + name TEXT NOT NULL, + service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE, + image TEXT NOT NULL, + network TEXT NOT NULL DEFAULT 'bridge', + user_spec TEXT NOT NULL DEFAULT '', + restart TEXT NOT NULL DEFAULT 'unless-stopped', + desired_state TEXT NOT NULL DEFAULT 'running', + observed_state TEXT NOT NULL DEFAULT 'unknown', + version TEXT NOT NULL DEFAULT '', + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')), + PRIMARY KEY (service, name) +); + +CREATE TABLE component_ports ( + service TEXT NOT NULL, + component TEXT NOT NULL, + mapping TEXT NOT NULL, + PRIMARY KEY (service, component, mapping), + FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE +); + +CREATE TABLE component_volumes ( + service TEXT NOT NULL, + component TEXT NOT NULL, + mapping TEXT NOT NULL, + PRIMARY KEY (service, component, mapping), + FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE +); + +CREATE TABLE component_cmd ( + service TEXT NOT NULL, + component TEXT NOT NULL, + position INTEGER NOT NULL, + arg TEXT NOT NULL, + PRIMARY KEY (service, component, position), + FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE +); + +CREATE TABLE events ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + service TEXT NOT NULL, + component TEXT NOT NULL, + prev_state TEXT NOT NULL, + new_state TEXT NOT NULL, + timestamp TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE INDEX idx_events_component_time ON events(service, component, timestamp); ``` ### State Values -**Desired state** (set by operator actions): +**Desired state** (set by operator actions via CLI): | State | Meaning | 
|-------|---------| -| `running` | Operator wants this container running | -| `stopped` | Operator deliberately stopped this container | +| `running` | Operator wants this component running | +| `stopped` | Operator deliberately stopped this component | +| `ignore` | Unmanaged -- MCP sees it but is not responsible for it | -**Observed state** (set by agent reports): +**Observed state** (set by container runtime queries): | State | Meaning | |-------|---------| -| `running` | Container is running (confirmed by agent) | +| `running` | Container is running | | `stopped` | Container exists but is not running | | `exited` | Container exited (crashed or completed) | -| `removed` | Container no longer exists on the node | -| `unknown` | State has not been queried since registration | - -The `managed` flag distinguishes MCP-deployed containers from pre-existing -ones discovered during sync. Unmanaged containers have no desired state set -(desired_state is empty). They appear in status output but MCP will not -modify them unless the operator explicitly adopts them with `mcp adopt`. +| `removed` | Container no longer exists | +| `unknown` | State has not been queried yet | --- ## Configuration -### Master Config - -When running as a CLI on the operator's workstation: +### CLI Config ```toml -[database] -path = "/home/kyle/.config/mcp/mcp.db" - [services] dir = "/home/kyle/.config/mcp/services" @@ -794,8 +877,7 @@ Directory layout on the operator's workstation: ``` ~/.config/mcp/ -├── mcp.toml Master config -├── mcp.db Service registry +├── mcp.toml CLI config ├── token Cached MCIAS bearer token (0600) └── services/ Service definition files ├── metacrypt.toml @@ -804,21 +886,30 @@ Directory layout on the operator's workstation: └── ... ``` -When running as a service on a node (for `mcp watch`): +The CLI has no database. Service definition files are the operator's source +of truth for desired state. The agent's registry is the operational truth. 
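Service definition files are not fully specified in this document; for illustration, one might look like the sketch below. The `node` key, the component table layout, and the image references are assumptions mirroring `ComponentSpec` and the registry schema:

```toml
# ~/.config/mcp/services/metacrypt.toml -- illustrative sketch
name = "metacrypt"
node = "rift"        # assumed: the CLI needs a target node per service
active = true        # active = false would push desired_state = "stopped"

[components.web]
image = "registry.example.com/metacrypt-web:latest"  # hypothetical image ref
network = "bridge"
restart = "unless-stopped"
ports = ["8443:8443"]
volumes = ["/srv/metacrypt:/data"]

[components.api]
image = "registry.example.com/metacrypt-api:latest"  # hypothetical image ref
restart = "unless-stopped"
```

Under this layout, `mcp sync` would read every file under `services/` and push each definition to the named node's agent.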
+ +### Agent Config ```toml +[server] +grpc_addr = ":9444" +tls_cert = "/srv/mcp/certs/cert.pem" +tls_key = "/srv/mcp/certs/key.pem" + [database] path = "/srv/mcp/mcp.db" [mcias] server_url = "https://mcias.metacircular.net:8443" ca_cert = "" -service_name = "mcp" -# System account credentials for unattended operation -username = "mcp" -password_file = "/srv/mcp/credentials" +service_name = "mcp-agent" -[watch] +[agent] +node_name = "rift" +container_runtime = "podman" + +[monitor] interval = "60s" alert_command = "" cooldown = "15m" @@ -830,37 +921,17 @@ retention = "30d" level = "info" ``` -The master binary is the same in both cases. When running from the -operator's workstation, it uses `~/.config/mcp/` for local state and -interactive login. When running as a systemd service on a node, it uses -`/srv/mcp/` following the platform convention and authenticates with a -system account. +The agent's data directory follows the platform convention: -### Agent Config - -```toml -[server] -grpc_addr = ":9444" -tls_cert = "/srv/mcp/certs/cert.pem" -tls_key = "/srv/mcp/certs/key.pem" - -[mcias] -server_url = "https://mcias.metacircular.net:8443" -ca_cert = "" -service_name = "mcp-agent" - -[agent] -node_name = "rift" -container_runtime = "podman" - -[log] -level = "info" ``` - -The agent config lives in `/srv/mcp/` following the platform convention. The -agent's data directory contains only its config and TLS material -- it has no -database of its own. The container runtime is the source of truth for local -container state. +/srv/mcp/ +├── mcp-agent.toml Agent config +├── mcp.db Registry database +├── certs/ +│ ├── cert.pem Agent TLS certificate +│ └── key.pem Agent TLS key +└── backups/ Database snapshots +``` --- @@ -870,14 +941,15 @@ container state. The agent is deployed like any other Metacircular service: -1. Create `/srv/mcp/` directory and config file. -2. Provision TLS certificate from Metacrypt. -3. Create an MCIAS system account for the agent (`mcp-agent`). -4. 
Install the `mcp-agent` binary. -5. Start via systemd unit. +1. Provision the `mcp` system user via NixOS config (with podman access). +2. Create `/srv/mcp/` directory and config file. +3. Provision TLS certificate from Metacrypt. +4. Create an MCIAS system account for the agent (`mcp-agent`). +5. Install the `mcp-agent` binary. +6. Start via systemd unit. -The agent runs as a systemd service (container-first deployment is a v2 -concern -- MCP needs to be running before it can manage its own agent). +The agent runs as a systemd service. Container-first deployment is a v2 +concern -- MCP needs to be running before it can manage its own agent. ```ini [Unit] @@ -913,68 +985,40 @@ ReadWritePaths=/srv WantedBy=multi-user.target ``` -Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes files -to any service's `/srv//` directory on behalf of the master. +Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes +files to any service's `/srv//` directory on behalf of the CLI. -### Master Installation (on operator workstation) +### CLI Installation (on operator workstation) -For interactive CLI use: +The CLI is a standalone binary with no daemon. 1. Install the `mcp` binary to `~/.local/bin/` or `/usr/local/bin/`. -2. Create `~/.config/mcp/mcp.toml` with MCIAS and registry config. -3. Run `mcp login` to authenticate. -4. Run `mcp node add rift
` to register the first node. -5. Run `mcp sync` to discover existing containers on the node. +2. Create `~/.config/mcp/mcp.toml`. +3. Create `~/.config/mcp/services/` directory. +4. Run `mcp login` to authenticate. +5. Run `mcp sync` to push service definitions and discover existing + containers. -### Master Watch Deployment (on a node) +### MCP Bootstrap (first time) -For continuous monitoring, deploy `mcp watch` as a systemd service on a node -(rift for v1). This runs alongside the agent on the same machine. +When bringing MCP up on a node that already has running containers: -1. Install the `mcp` binary to `/usr/local/bin/`. -2. Create `/srv/mcp/mcp.toml` with database, MCIAS system account, and - `[watch]` settings. -3. Create an MCIAS system account for the master (`mcp`). +1. Deploy the agent (steps above). +2. `mcp sync` with no service definition files -- the agent discovers all + running containers and adds them to its registry with desired state + `ignore`. +3. `mcp adopt ` for each container -- group them into + services and set desired state to `running`. +4. `mcp service export ` for each service -- generate service + definition files from the adopted state. +5. Review and edit the generated files as needed. -```ini -[Unit] -Description=MCP Watch -After=network-online.target mcp-agent.service -Wants=network-online.target +From this point, the service definition files are the source of truth and +`mcp deploy` manages the containers. 
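Concretely, the bootstrap might look like this session (illustrative; the container names are hypothetical, and the `adopt` argument order -- container, service, component -- is assumed from the `AdoptRequest` fields):

```
mcp sync                                # agent discovers containers as desired_state=ignore
mcp ps                                  # live view of what the agent observed
mcp adopt metacrypt-web metacrypt web   # claim a container as metacrypt/web
mcp adopt metacrypt-api metacrypt api
mcp service export metacrypt            # write ~/.config/mcp/services/metacrypt.toml
mcp status                              # confirm desired vs. observed, no drift
```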
-[Service] -Type=simple -ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml -Restart=on-failure -RestartSec=10 - -User=mcp -Group=mcp - -NoNewPrivileges=true -ProtectSystem=strict -ProtectHome=true -PrivateTmp=true -PrivateDevices=true -ProtectKernelTunables=true -ProtectKernelModules=true -ProtectControlGroups=true -RestrictSUIDSGID=true -RestrictNamespaces=true -LockPersonality=true -MemoryDenyWriteExecute=true -RestrictRealtime=true -ReadWritePaths=/srv/mcp - -[Install] -WantedBy=multi-user.target -``` - -When `mcp watch` runs on a node and the operator uses the CLI from vade, -they share the same registry database. The operator can either point their -CLI config at the node's database (via the overlay) or maintain a separate -local database and sync it. For v1 with one node, the simplest option is -running all `mcp` commands on rift (via SSH) where the database lives. +Existing containers on rift currently run under kyle's podman instance. +As part of MCP bootstrap, they will need to be re-created under the `mcp` +user's rootless podman. This is a one-time migration. --- @@ -986,12 +1030,12 @@ running all `mcp` commands on rift (via SSH) where the database lives. 
|--------|------------| | Unauthorized C2 commands | Agent requires admin MCIAS token on every RPC | | Token theft | Tokens have short expiry; cached validation keyed by SHA-256 | -| Agent impersonation | Master verifies agent TLS certificate against known CA | +| Agent impersonation | CLI verifies agent TLS certificate against Metacrypt CA | | Arbitrary file write via push | Agent restricts writes to `/srv//` for the named service | | Arbitrary file read via pull | Agent restricts reads to `/srv//` for the named service | | Cross-service file access | File ops require a service name; agent resolves to that service's directory only | -| Container runtime escape | Agent runs as non-root where possible; containers follow platform hardening | -| Network eavesdropping | All C2 traffic is gRPC over mTLS over encrypted overlay | +| Container runtime escape | Rootless podman under `mcp` user; containers follow platform hardening | +| Network eavesdropping | All C2 traffic is gRPC over TLS over encrypted overlay | ### Security Invariants @@ -1000,10 +1044,12 @@ running all `mcp` commands on rift (via SSH) where the database lives. 2. File operations are scoped to `/srv//` for the named service. Path traversal attempts (`../`, symlinks outside the service directory) are rejected. -3. The agent never executes arbitrary commands. It only runs container runtime - operations and file I/O through well-defined code paths. +3. The agent never executes arbitrary commands. It only runs container + runtime operations and file I/O through well-defined code paths. 4. TLS 1.3 minimum on the agent's gRPC listener. -5. The master's stored token is file-permission protected (0600). +5. The CLI's stored token is file-permission protected (0600). +6. The agent runs as a dedicated `mcp` user with rootless podman. No root + access required. --- @@ -1012,47 +1058,46 @@ running all `mcp` commands on rift (via SSH) where the database lives. 
``` mcp/ ├── cmd/ -│ ├── mcp/ Master CLI +│ ├── mcp/ CLI │ │ ├── main.go │ │ ├── login.go │ │ ├── deploy.go │ │ ├── lifecycle.go stop, start, restart -│ │ ├── status.go ps, status, sync +│ │ ├── status.go list, ps, status +│ │ ├── sync.go sync desired state │ │ ├── adopt.go adopt unmanaged containers -│ │ ├── service.go service show/export +│ │ ├── service.go service show/edit/export │ │ ├── transfer.go push, pull -│ │ ├── watch.go watch (long-lived monitoring) │ │ └── node.go node add/list/remove │ └── mcp-agent/ Agent daemon -│ ├── main.go -│ └── server.go +│ └── main.go ├── internal/ -│ ├── agent/ Agent logic -│ │ ├── agent.go Agent struct, setup -│ │ ├── containers.go Container runtime operations -│ │ ├── files.go File push/pull -│ │ └── runtime.go Runtime abstraction (podman/docker) -│ ├── registry/ Service registry +│ ├── agent/ Agent core +│ │ ├── agent.go Agent struct, setup, gRPC server +│ │ ├── deploy.go Deploy flow +│ │ ├── lifecycle.go Stop, start, restart +│ │ ├── files.go File push/pull with path validation +│ │ ├── sync.go Desired state sync, reconciliation +│ │ ├── adopt.go Container adoption +│ │ └── status.go Status queries +│ ├── runtime/ Container runtime abstraction +│ │ ├── runtime.go Interface +│ │ └── podman.go Podman implementation +│ ├── registry/ Node-local registry │ │ ├── db.go Schema, migrations -│ │ ├── nodes.go Node CRUD │ │ ├── services.go Service CRUD -│ │ ├── containers.go Container CRUD -│ │ ├── events.go Event log -│ │ └── sync.go Registry reconciliation -│ ├── watch/ Monitoring subsystem -│ │ ├── watch.go Watch loop +│ │ ├── components.go Component CRUD +│ │ └── events.go Event log +│ ├── monitor/ Monitoring subsystem +│ │ ├── monitor.go Watch loop │ │ └── alerting.go Alert evaluation and command execution │ ├── servicedef/ Service definition file parsing │ │ └── servicedef.go Load, parse, write TOML service defs │ ├── auth/ MCIAS integration -│ │ └── auth.go Token validation, login -│ ├── config/ Configuration loading -│ │ ├── 
master.go -│ │ └── agent.go -│ └── grpcserver/ Agent gRPC server wiring -│ ├── server.go -│ ├── interceptors.go Auth interceptor -│ └── handlers.go RPC implementations +│ │ └── auth.go Token validation, interceptor +│ └── config/ Configuration loading +│ ├── cli.go +│ └── agent.go ├── proto/mcp/ │ └── v1/ │ └── mcp.proto @@ -1060,11 +1105,10 @@ mcp/ │ └── v1/ Generated Go code ├── deploy/ │ ├── systemd/ -│ │ ├── mcp-agent.service -│ │ └── mcp-watch.service +│ │ └── mcp-agent.service │ ├── examples/ -│ │ ├── mcp.toml -│ │ └── mcp-agent.toml +│ │ ├── mcp.toml CLI config example +│ │ └── mcp-agent.toml Agent config example │ └── scripts/ │ └── install-agent.sh ├── Makefile @@ -1080,19 +1124,22 @@ mcp/ These are explicitly out of scope for v1 but inform the design: -- **Auto-reconciliation**: `mcp watch` detects drift but does not act on it - in v1. v2 adds configurable auto-restart for drifted containers (with - backoff to avoid restart storms). -- **Migration**: snapshot `/srv//` as tar.zst (with VACUUM INTO for - clean DB copies), stream to destination node, restore. Requires streaming - gRPC and archive assembly logic. -- **Scheduling**: automatic node selection based on resource availability and - operator constraints. Requires node resource reporting from agents. +- **Auto-reconciliation**: the agent detects drift but does not act on it + in v1. v2 adds configurable auto-restart for drifted components (with + backoff to avoid restart storms). This is the path to fully declarative + operation -- the agent continuously reconciles toward desired state. +- **Migration**: snapshot `/srv//` as tar.zst (with VACUUM INTO + for clean DB copies), stream to destination node, restore. Requires + streaming gRPC and archive assembly logic. +- **Scheduling**: automatic node selection based on resource availability + and operator constraints. The agent already reports disk, memory, and CPU + in `NodeStatus` to support this. 
- **Certificate provisioning**: MCP provisions TLS certs from Metacrypt during deploy via the ACME client library. - **DNS updates**: MCP pushes record updates to MCNS after deploy/migrate. Requires MCNS to have an API (or, as a stopgap, zone file editing). - **Multi-node orchestration**: deploy across multiple nodes, rolling updates, health-aware placement. -- **Web UI**: master exposes a web interface for registry browsing and - operations. Would add REST+gRPC on the master side. +- **Web UI**: a web interface for registry browsing and operations. Would + be a separate binary communicating with agents via gRPC, following the + platform's web UI pattern.