From ea7a9dcf4d381e3f3c4d111eeba988a04685336c Mon Sep 17 00:00:00 2001 From: Kyle Isom Date: Thu, 26 Mar 2026 10:31:48 -0700 Subject: [PATCH] Rewrite ARCHITECTURE.md incorporating review findings Major design changes from the review: - Merge agent and watcher into a single smart per-node daemon - CLI is a thin client with no database; service definition files are the operator's source of truth for desired state - Registry database lives on the agent, not the CLI - Rename containers to components; components are independently deployable within a service (mcp deploy metacrypt/web) - active: true/false in service definitions; desired_state values are running/stopped/ignore - Server-side TLS + bearer token (not mTLS) - Dedicated mcp user with rootless podman - CLI commands: list (registry), ps (live), status (drift+events), sync (push desired state) - Agent reports node resources (disk, memory, CPU) for future scheduling - Agent is gRPC-only (deliberate exception to REST+gRPC parity rule) Co-Authored-By: Claude Opus 4.6 (1M context) --- ARCHITECTURE.md | 1245 ++++++++++++++++++++++++----------------------- 1 file changed, 646 insertions(+), 599 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 3c1aee2..f1bc6cf 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -6,40 +6,47 @@ MCP is the orchestrator for the Metacircular platform. It manages container lifecycle, tracks what services run where, and transfers files between the operator's workstation and managed nodes. -MCP uses a **master/agent architecture**: +MCP has two components: -- **The agent** owns **"what is"**. It runs on each managed node, queries the - local container runtime, and reports facts. It manages containers and files - on behalf of the master. It has no opinion about what should be running. +- **The CLI** (`mcp`) is a thin client that runs on the operator's + workstation. 
It reads local service definition files — the operator's + declaration of what should be running — and pushes that intent to agents. + It has no database and no daemon process. -- **The master** owns **"what should be"**. It maintains the service registry - (desired state), compares it against agent-reported observed state, records - events, detects drift, and alerts the operator. The master runs in two - modes: as an interactive CLI for operator commands, and as a long-lived - watch process for continuous monitoring. +- **The agent** (`mcp-agent`) is a smart per-node daemon. It receives + desired state from the CLI, manages containers via the local runtime, + stores the node's registry (desired state, observed state, deployed specs, + events), monitors for drift, and alerts the operator. The agent owns the + full loop: it knows what should be running, observes what is running, and + can act on the difference. -Agents keep their containers running independently. The master is needed to -issue commands, track desired state, and monitor for drift. +The agent's container runtime interaction (podman/docker CLI) is an internal +subcomponent — the "dumb" part. The agent itself is the smart coordinator +that wraps it with state tracking, monitoring, and a gRPC API. ### v1 Scope -v1 targets a single-node deployment (one agent on rift, master on vade). The +v1 targets a single-node deployment (one agent on rift, CLI on vade). The core operations are: -- **Deploy** -- pull a container image and start (or restart) a service on a - specified node. -- **Container lifecycle** -- stop, start, restart individual containers. -- **Status** -- query live container state from agents and reconcile the - master's service registry. -- **Watch** -- continuous monitoring. Periodically syncs with agents, records - state-change events, detects drift, and alerts the operator. -- **File transfer** -- push or pull individual files between master and nodes - (config files, certificates). 
+- **Deploy** -- push service definitions to the agent; agent pulls images + and starts (or restarts) containers. +- **Component-level deploy** -- deploy individual components within a + service without disrupting others (e.g., update the web UI without + restarting the API server). +- **Container lifecycle** -- stop, start, restart services. +- **Monitoring** -- agent continuously watches container state, records + events, detects drift and flapping, alerts the operator. +- **Status** -- query live container state, view drift, review events. +- **File transfer** -- push or pull individual files between CLI and nodes + (config files, certificates), scoped to service directories. +- **Sync** -- push service definitions to the agent to update desired state + without deploying. Explicitly **not in v1**: migration (snapshot/tar.zst transfer), automatic -scheduling/placement, certificate provisioning from Metacrypt, DNS updates to -MCNS, multi-node orchestration, auto-reconciliation (restarting drifted -containers). +scheduling/placement, certificate provisioning from Metacrypt, DNS updates +to MCNS, multi-node orchestration, auto-reconciliation (agent restarting +drifted containers without operator action). --- @@ -48,12 +55,12 @@ containers). 
``` Operator workstation (vade) ┌──────────────────────────────┐ - │ mcp (master CLI) │ + │ mcp (CLI) │ │ │ - │ ┌────────────────────────┐ │ - │ │ Service Registry │ │ - │ │ (SQLite) │ │ - │ └────────────────────────┘ │ + │ ~/.config/mcp/services/ │ + │ metacrypt.toml │ + │ mcr.toml │ + │ mc-proxy.toml │ │ │ │ gRPC client ────────────────┼──── overlay ────┐ └──────────────────────────────┘ │ @@ -61,54 +68,67 @@ Operator workstation (vade) MC Node (rift) │ ┌────────────────────────────────────────────────┼──┐ │ │ │ - │ ┌──────────────────────┐ │ │ - │ │ mcp-agent │◄─────────────────────┘ │ - │ │ (gRPC server) │ │ - │ │ │ │ - │ │ MCIAS token │ │ - │ │ validation │ │ - │ └──────────┬───────────┘ │ - │ │ │ - │ │ exec │ - │ ▼ │ - │ ┌──────────────────────┐ │ - │ │ Container runtime │ │ - │ │ (podman / docker) │ │ - │ │ │ │ - │ │ ┌─────┐ ┌─────┐ │ ┌──────────────────┐ │ - │ │ │ svc │ │ svc │ ...│ │ /srv// │ │ - │ │ │ α │ │ β │ │ │ config, db, │ │ - │ │ └─────┘ └─────┘ │ │ certs, backups │ │ - │ └──────────────────────┘ └──────────────────┘ │ + │ ┌──────────────────────────────────────────┐ │ │ + │ │ mcp-agent │◄─┘ │ + │ │ │ │ + │ │ ┌─────────────┐ ┌──────────────────┐ │ │ + │ │ │ Registry │ │ Monitor │ │ │ + │ │ │ (SQLite) │ │ (watch loop, │ │ │ + │ │ │ │ │ events, │ │ │ + │ │ │ desired │ │ alerting) │ │ │ + │ │ │ observed │ │ │ │ │ + │ │ │ specs │ │ │ │ │ + │ │ │ events │ │ │ │ │ + │ │ └─────────────┘ └──────────────────┘ │ │ + │ │ │ │ + │ │ ┌──────────────────────────────────┐ │ │ + │ │ │ Container runtime (podman) │ │ │ + │ │ │ │ │ │ + │ │ │ ┌───────┐ ┌───────┐ ┌───────┐ │ │ │ + │ │ │ │ svc α │ │ svc β │ │ svc γ │ │ │ │ + │ │ │ └───────┘ └───────┘ └───────┘ │ │ │ + │ │ └──────────────────────────────────┘ │ │ + │ └──────────────────────────────────────────┘ │ + │ │ + │ /srv// (config, db, certs, backups) │ └───────────────────────────────────────────────────┘ ``` ### Components -| Component | Binary | Role | -|-----------|--------|------| -| Master | `mcp` | Operator 
CLI. Maintains the service registry. Issues commands to agents. | -| Agent | `mcp-agent` | Node daemon. Manages local containers and files. Authenticates callers via MCIAS. | +| Component | Binary | Where | Role | +|-----------|--------|-------|------| +| CLI | `mcp` | Operator workstation (vade) | Thin client. Reads service definitions, pushes intent to agents, queries status. | +| Agent | `mcp-agent` | Each managed node (rift) | Smart daemon. Manages containers, stores registry, monitors, alerts. | ### Communication -Master and agent communicate over gRPC with mutual TLS. The transport is the -encrypted overlay network (Tailscale/WireGuard). The master authenticates to -the agent by presenting an MCIAS bearer token. The agent validates the token -by calling MCIAS and checking for the `admin` role. +The CLI communicates with agents over gRPC with server-side TLS. The +transport is the encrypted overlay network (Tailscale/WireGuard). The CLI +authenticates by presenting an MCIAS bearer token in gRPC metadata. The +agent validates the token by calling MCIAS and checking for the `admin` +role. + +Client certificates (mTLS) are not used. The overlay network restricts +network access to platform participants, MCIAS tokens are short-lived with +role enforcement, and the agent's TLS certificate is verified against the +Metacrypt CA. The scenarios where mTLS adds value (stolen token, MCIAS +compromise) already imply broader platform compromise. mTLS remains an +option for future security hardening. --- ## Authentication and Authorization -MCP follows the platform authentication model: all auth is delegated to MCIAS. +MCP follows the platform authentication model: all auth is delegated to +MCIAS. ### Agent Authentication The agent is a gRPC server with a unary interceptor that enforces authentication on every RPC: -1. Master includes an MCIAS bearer token in the gRPC metadata +1. CLI includes an MCIAS bearer token in the gRPC metadata (`authorization: Bearer `). 2. 
Agent extracts the token and validates it against MCIAS (cached 30s by SHA-256 of the token, per platform convention). @@ -117,151 +137,94 @@ authentication on every RPC: 4. If validation fails, the RPC returns `UNAUTHENTICATED` (invalid/expired token) or `PERMISSION_DENIED` (valid token, not admin). -### Master Authentication +### CLI Authentication -The master authenticates to MCIAS before issuing commands. The token can be +The CLI authenticates to MCIAS before issuing commands. The token can be obtained by: 1. `mcp login` -- interactive login, stores the token locally. 2. Environment variable (`MCP_TOKEN`) for scripted use. -3. System account credentials in the master config file. +3. System account credentials in the CLI config file. The stored token is used for all subsequent agent RPCs until it expires. -### Agent Identity +--- -The agent itself authenticates to MCIAS as a system account (for future -agent-initiated flows like health reporting). For v1, the agent is purely -reactive -- it only responds to master-initiated RPCs and does not make -outbound calls beyond MCIAS token validation. +## Services and Components + +A **service** is a logical unit of the platform (e.g., "metacrypt"). A +service has one or more **components** -- the containers that make it up +(e.g., "api" and "web"). Components within a service: + +- Share the same node. +- Share the same `/srv//` data directory. +- Are deployed together by default, but can be deployed independently. + +This models the real constraint that components like an API server and its +web UI are co-located and share state, but have different operational +characteristics. For example, restarting Metacrypt's API server requires +unsealing the vault, but the web UI can be redeployed independently without +disrupting the API. + +Services with a single component (e.g., mc-proxy) simply have one +`[[components]]` block. + +The unique identity of a component is `node/service/component`. 
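
For example, a single-component service definition needs only one `[[components]]` block. A minimal sketch, following the metacrypt example's format -- the image, port, and volume values here are illustrative assumptions, not mc-proxy's actual configuration:

```toml
name = "mc-proxy"
node = "rift"
active = true

[[components]]
name = "proxy"
image = "mcr.svc.mcp.metacircular.net:8443/mc-proxy:latest"
network = "docker_default"
restart = "unless-stopped"
ports = ["127.0.0.1:8443:8443"]
volumes = ["/srv/mc-proxy:/srv/mc-proxy"]
```

The component's unique identity is then `rift/mc-proxy/proxy`.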
---

-## Master
+## CLI

-### CLI Commands
+### Commands

 ```
 mcp login                            Authenticate to MCIAS, store token

-mcp deploy <service>                 Deploy from service definition file
+mcp deploy <service>                 Deploy all components from service definition
+mcp deploy <service>/<component>     Deploy a single component
 mcp deploy -f <file>                 Deploy from explicit file
-mcp deploy --image <image>           Deploy with image override

-mcp stop <service>                   Stop all containers for a service
-mcp start <service>                  Start all containers for a service
-mcp restart <service>                Restart all containers for a service
+mcp stop <service>                   Stop all components
+mcp start <service>                  Start all components
+mcp restart <service>                Restart all components

-mcp ps                               List all services in the registry
-mcp status [service]                 Query live state from agents, reconcile registry
-mcp sync                             Full rescan: agents report all containers,
-                                     registry is reconciled against reality
+mcp list                             List services from agent registry (no runtime query)
+mcp ps                               Live check: query runtime, show running containers
+                                     with uptime and version
+mcp status [service]                 Full picture: live query + drift + recent events
+mcp sync                             Push service definitions to agent (update desired
+                                     state without deploying)

 mcp adopt <container> <service>      Claim an unmanaged container into a service

-mcp service show <service>           Print current spec from registry
+mcp service show <service>           Print current spec from agent registry
 mcp service edit <service>           Open service definition in $EDITOR
-mcp service export <service>         Write registry spec to local service file
+mcp service export <service>         Write agent registry spec to local service file
 mcp service export -f <path>         Write to explicit path

 mcp push <file> <service> [path]     Copy a local file into /srv/<service>/[path]
-mcp pull <service> <path> [local-file]   Copy a file from /srv/<service>/<path> to local
-
-mcp watch                            Run continuous monitoring (long-lived)
-mcp watch --interval 60s             Custom poll interval
+mcp pull <service> <path> [local-file]   Copy a file from /srv/<service>/<path> to local

 mcp node list                        List registered nodes
 mcp node add <name> <address>
Register a node mcp node remove Deregister a node ``` -### Service Registry +### Service Definition Files -The master maintains a SQLite database that tracks: - -- **Nodes** -- name, overlay address, agent port, state. -- **Services** -- name, assigned node, current state. -- **Containers** -- name, parent service, image reference, container runtime - config (network, ports, volumes, user, restart policy, command), observed - state. - -A service (e.g., "metacrypt") has one or more containers (e.g., "metacrypt" -API server + "metacrypt-web" UI). Deploy operates on services; lifecycle -commands can target individual containers. - -### Desired State vs. Observed State - -The registry tracks two separate pieces of information for each container: - -- **Desired state** -- what the operator wants: `running` or `stopped`. Set - explicitly by `mcp deploy`, `mcp stop`, `mcp start`, etc. -- **Observed state** -- what the agent actually sees on the node: `running`, - `stopped`, `exited`, `removed`, or `unknown`. - -These can diverge. A container with desired=`running` and observed=`exited` -has crashed. A container with desired=`stopped` and observed=`stopped` is -fine. `mcp status` shows both states and flags **drift** -- any container -where observed state does not match desired state. - -| Desired | Observed | Status | -|---------|----------|--------| -| running | running | OK | -| running | stopped | **DRIFT** -- stopped unexpectedly | -| running | exited | **DRIFT** -- crashed | -| running | removed | **DRIFT** -- container gone | -| stopped | stopped | OK | -| stopped | removed | OK | -| stopped | running | **DRIFT** -- running when it shouldn't be | - -For v1, MCP reports drift but does not auto-reconcile. The operator decides -whether to `mcp start`, `mcp deploy`, or investigate. Auto-reconciliation -(restart crashed containers, enforce desired state) is a v2 concern. - -### Registry Reconciliation - -`mcp sync` (and implicitly, `mcp status`) works as follows: - -1. 
For each registered node, master calls `ListContainers` on the agent. -2. Agent queries the container runtime and returns the full list of - containers with their configuration and state. -3. Master compares the agent's report against the registry: - - **Container in registry, seen by agent**: update observed state. - - **Container in registry, not seen by agent**: set observed state to - `removed`. - - **Container seen by agent, not in registry**: add to registry as - `unmanaged` (desired state unset, observed state from agent). -4. Master prints a summary: OK containers, drifted containers, and newly - discovered unmanaged containers. - -### Adopting Unmanaged Containers - -On first `mcp sync`, every container on rift will appear as `unmanaged` -- -MCP didn't deploy them and doesn't know their intended service grouping. - -`mcp adopt ` claims an unmanaged container: - -1. If the service doesn't exist, create it (assigned to the container's - node). -2. Set the container's service to the specified service. -3. Set `managed = true` and desired state to `running` (or `stopped` if - the container is currently stopped). - -This lets the operator incrementally bring existing containers under MCP -management without redeploying them. - -### Service Definitions - -A service definition is a TOML file that declares the containers for a +A service definition is a TOML file that declares the components for a service. These files live in `~/.config/mcp/services/` by default, one -per service. +per service. They are the operator's declaration of intent -- what should +exist, with what spec, in what state. 
Example: `~/.config/mcp/services/metacrypt.toml` ```toml name = "metacrypt" node = "rift" +active = true -[[containers]] -name = "metacrypt" +[[components]] +name = "api" image = "mcr.svc.mcp.metacircular.net:8443/metacrypt:latest" network = "docker_default" user = "0:0" @@ -269,8 +232,8 @@ restart = "unless-stopped" ports = ["127.0.0.1:18443:8443", "127.0.0.1:19443:9443"] volumes = ["/srv/metacrypt:/srv/metacrypt"] -[[containers]] -name = "metacrypt-web" +[[components]] +name = "web" image = "mcr.svc.mcp.metacircular.net:8443/metacrypt-web:latest" network = "docker_default" user = "0:0" @@ -280,27 +243,25 @@ volumes = ["/srv/metacrypt:/srv/metacrypt"] cmd = ["server", "--config", "/srv/metacrypt/metacrypt.toml"] ``` -#### Deploy Resolution +### Deploy Resolution -`mcp deploy ` resolves the container spec through a precedence +`mcp deploy ` resolves the component spec through a precedence chain: -1. **Command-line flags** (highest) -- `--image`, `--node`, etc. override - individual fields. -2. **Service definition file** -- if `-f ` is specified, use that +1. **Service definition file** -- if `-f ` is specified, use that file. Otherwise look for `~/.config/mcp/services/.toml`. -3. **Registry** (lowest) -- if no file exists, use the spec from the last - successful deploy stored in the registry. +2. **Agent registry** (fallback) -- if no file exists, use the spec from + the last successful deploy stored in the agent's registry. -If none of these exist (first deploy, no file), the deploy fails with an -error telling the operator to create a service definition. +If neither exists (first deploy, no file), the deploy fails with an error +telling the operator to create a service definition. -On a successful deploy, the resolved spec (after all overrides) is written -to the registry. The file on disk is **not** modified -- it represents the -operator's declared intent, not the deployed state. To sync the file with -reality after overrides, use `mcp service export`. 
+The CLI pushes the resolved spec to the agent. The agent records it in its +registry and executes the deploy. The service definition file on disk is +**not** modified -- it represents the operator's declared intent, not the +deployed state. To sync the file with reality, use `mcp service export`. -#### Spec Lifecycle +### Spec Lifecycle ``` ┌─────────────┐ @@ -309,9 +270,9 @@ reality after overrides, use `mcp service export`. │ (.toml) │ │ └─────────────┘ ▼ ▲ ┌─────────────────┐ - │ │ Registry │ - mcp service │ │ (deployed │ - export │ │ spec) │ + │ │ Agent registry │ + mcp service │ │ (deployed │ + export │ │ spec) │ │ └─────────────────┘ │ │ └───────────────────────┘ @@ -319,21 +280,18 @@ reality after overrides, use `mcp service export`. - **Operator writes** the service definition file (or copies one from the service's repo). -- **`mcp deploy`** reads the file, applies overrides, sends to agent, - records the resolved spec in the registry. -- **`mcp service export`** reads the registry and writes it back to the - local file, incorporating any overrides or drift that occurred. - -This keeps the file and registry loosely coupled. The file is the starting -point; the registry is the truth; export reconciles them. +- **`mcp deploy`** reads the file, pushes to the agent, agent records the + spec in its registry and deploys. +- **`mcp service export`** reads the agent's registry and writes it back to + the local file, incorporating any changes since the file was last edited. `mcp service edit ` opens the service definition in `$EDITOR` (falling back to `$VISUAL`, then `vi`). If no file exists yet, it exports -the current registry spec to a temp file first, so the operator starts +the current spec from the agent's registry first, so the operator starts from the deployed state rather than a blank file. After the editor exits, the file is saved to the standard path in the services directory. 
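
The resolution precedence can be sketched in Go. The function name and the stubbed registry lookup below are illustrative assumptions -- in the real CLI the fallback would be a gRPC call to the agent, not a local callback:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// resolveSpec sketches the deploy resolution chain: an explicit -f file
// wins, then the standard services directory, then the spec recorded in
// the agent's registry from the last successful deploy.
func resolveSpec(servicesDir, service, explicitFile string, registryHasSpec func(string) bool) (string, error) {
	if explicitFile != "" {
		return explicitFile, nil // explicit -f path always wins
	}
	standard := filepath.Join(servicesDir, service+".toml")
	if _, err := os.Stat(standard); err == nil {
		return standard, nil // standard service definition file
	}
	if registryHasSpec(service) {
		return "agent-registry:" + service, nil // fall back to last deployed spec
	}
	return "", errors.New("no service definition for " + service + "; create one first")
}

func main() {
	// No file on disk, but the agent has a previously deployed spec.
	src, err := resolveSpec("/nonexistent", "metacrypt", "", func(string) bool { return true })
	fmt.Println(src, err)
}
```

The failure branch matches the documented behavior: with no file and no prior deploy, the operator is told to create a service definition.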
-#### Where Definition Files Come From +### Where Definition Files Come From Service definition files can be: @@ -345,140 +303,20 @@ Service definition files can be: - **Generated by converting from mcdeploy.toml** during initial MCP migration (one-time). -### Monitoring (`mcp watch`) - -`mcp watch` is a long-lived master process that continuously monitors the -platform. It runs the same sync logic as `mcp status` on a loop, records -state-change events, and alerts the operator when drift is detected. - -#### Event Log - -Every state transition is recorded in the `events` table: - -```sql -CREATE TABLE events ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - container TEXT NOT NULL, - node TEXT NOT NULL, - prev_state TEXT NOT NULL, - new_state TEXT NOT NULL, - timestamp TEXT NOT NULL DEFAULT (datetime('now')) -); -``` - -Events accumulate over time and support rate queries: - -```sql --- How many times has metacrypt exited in the last hour? -SELECT COUNT(*) FROM events -WHERE container = 'metacrypt' - AND new_state = 'exited' - AND timestamp > datetime('now', '-1 hour'); -``` - -Old events are pruned periodically (default: retain 30 days). - -#### Watch Loop - -Each iteration of the watch loop: - -1. Sync with all agents (same as `mcp sync`). -2. For each observed state change since the last iteration, insert an event. -3. Evaluate alert conditions against the current state and event history. -4. If an alert fires, execute the configured alert command. -5. Sleep for the configured interval. - -#### Alert Conditions - -The watch process evaluates two types of alert: - -- **Drift alert**: a managed container's observed state does not match its - desired state. Fires on the transition, not on every poll. -- **Rate alert**: a container has changed state more than N times within a - window (flapping). Default threshold: 3 transitions in 10 minutes. - -Each alert has a **cooldown** per container. 
Once an alert fires for a -container, it is suppressed for the cooldown period regardless of further -transitions. This prevents notification spam from a flapping service. - -``` -[watch] -interval = "60s" -alert_command = "" # command to run on alert; empty = log only -cooldown = "15m" # suppress repeat alerts per container -flap_threshold = 3 # state changes within flap_window = flapping -flap_window = "10m" -retention = "30d" # event log retention -``` - -#### Alert Command - -When an alert fires, the master executes the configured command with -context passed via environment variables: - -| Variable | Value | -|----------|-------| -| `MCP_CONTAINER` | Container name | -| `MCP_SERVICE` | Parent service name | -| `MCP_NODE` | Node name | -| `MCP_DESIRED` | Desired state | -| `MCP_OBSERVED` | Observed state | -| `MCP_PREV_STATE` | Previous observed state | -| `MCP_ALERT_TYPE` | `drift` or `flapping` | -| `MCP_TRANSITIONS` | Number of transitions in the flap window (for flapping alerts) | - -The alert command is the operator's choice. MCP does not ship with or depend -on any notification system. - -```bash -# Push notification -alert_command = "ntfy publish mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER ($MCP_DESIRED -> $MCP_OBSERVED)'" - -# Webhook -alert_command = "curl -s -X POST https://hooks.example.com/mcp ..." - -# Syslog -alert_command = "logger -t mcp '$MCP_ALERT_TYPE: $MCP_CONTAINER on $MCP_NODE'" -``` - -#### Deployment - -`mcp watch` is intended to run as a systemd service on a node (typically -rift for v1), not on the operator's workstation. It uses the same binary, -config, and database as the CLI -- the difference is the execution mode. 
- -```ini -[Unit] -Description=MCP Watch (continuous monitoring) -After=network-online.target - -[Service] -Type=simple -ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml -Restart=on-failure -RestartSec=10 -User=mcp -Group=mcp -``` - -This means the master's SQLite database lives on the node running `mcp -watch` (rift for v1), not on the operator's workstation. The operator can -still run CLI commands from vade -- they connect to the same database via -a configurable path, or the CLI operates statelessly by querying agents -directly and reading the registry from a shared location. - -**Alternatively**, the operator runs `mcp watch` on vade when at their desk -and relies on container restart policies for unattended operation. The -design supports both -- `mcp watch` is an optional monitoring layer, not -a required component. - --- ## Agent +The agent is the smart per-node daemon. It owns the full lifecycle: +receives desired state, manages containers, stores the registry, monitors +for drift, and alerts the operator. + ### gRPC Service Definition -The agent exposes a single gRPC service. All RPCs require admin authentication. +The agent exposes a single gRPC service. All RPCs require admin +authentication. The agent is gRPC-only -- it is internal C2 infrastructure, +not a user-facing service, so the platform's REST+gRPC parity rule does not +apply. 
```protobuf syntax = "proto3"; @@ -487,15 +325,22 @@ package mcp.v1; import "google/protobuf/timestamp.proto"; service McpAgent { - // Container lifecycle + // Service lifecycle rpc Deploy(DeployRequest) returns (DeployResponse); - rpc StartContainer(ContainerRequest) returns (ContainerResponse); - rpc StopContainer(ContainerRequest) returns (ContainerResponse); - rpc RestartContainer(ContainerRequest) returns (ContainerResponse); - rpc RemoveContainer(ContainerRequest) returns (ContainerResponse); + rpc StopService(ServiceRequest) returns (ServiceResponse); + rpc StartService(ServiceRequest) returns (ServiceResponse); + rpc RestartService(ServiceRequest) returns (ServiceResponse); - // Status - rpc ListContainers(ListContainersRequest) returns (ListContainersResponse); + // Desired state + rpc SyncDesiredState(SyncRequest) returns (SyncResponse); + + // Status and registry + rpc ListServices(ListServicesRequest) returns (ListServicesResponse); + rpc GetServiceStatus(ServiceStatusRequest) returns (ServiceStatusResponse); + rpc LiveCheck(LiveCheckRequest) returns (LiveCheckResponse); + + // Adopt + rpc AdoptContainer(AdoptRequest) returns (AdoptResponse); // File transfer rpc PushFile(PushFileRequest) returns (PushFileResponse); @@ -505,63 +350,128 @@ service McpAgent { rpc NodeStatus(NodeStatusRequest) returns (NodeStatusResponse); } -// --- Container lifecycle --- +// --- Service lifecycle --- -message ContainerSpec { +message ComponentSpec { string name = 1; string image = 2; string network = 3; string user = 4; - string restart_policy = 5; + string restart = 5; repeated string ports = 6; // "host:container" mappings repeated string volumes = 7; // "host:container" mount specs repeated string cmd = 8; // command and arguments } +message ServiceSpec { + string name = 1; + bool active = 2; + repeated ComponentSpec components = 3; +} + message DeployRequest { - repeated ContainerSpec containers = 1; + ServiceSpec service = 1; + string component = 2; // deploy single 
component (empty = all) } message DeployResponse { - repeated ContainerResult results = 1; + repeated ComponentResult results = 1; } -message ContainerRequest { - string name = 1; -} - -message ContainerResponse { +message ComponentResult { string name = 1; bool success = 2; string error = 3; } -message ContainerResult { +message ServiceRequest { string name = 1; - bool success = 2; - string error = 3; } -// --- Status --- +message ServiceResponse { + repeated ComponentResult results = 1; +} -message ListContainersRequest {} +// --- Desired state --- -message ContainerInfo { +message SyncRequest { + repeated ServiceSpec services = 1; // all services for this node +} + +message SyncResponse { + repeated ServiceSyncResult results = 1; +} + +message ServiceSyncResult { + string name = 1; + bool changed = 2; // desired state was updated + string summary = 3; +} + +// --- Status and registry --- + +message ListServicesRequest {} + +message ServiceInfo { + string name = 1; + bool active = 2; + repeated ComponentInfo components = 3; +} + +message ComponentInfo { string name = 1; string image = 2; - string state = 3; // "running", "stopped", "exited", etc. 
- string network = 4; - string user = 5; - string restart_policy = 6; - repeated string ports = 7; - repeated string volumes = 8; - repeated string cmd = 9; - google.protobuf.Timestamp created = 10; - google.protobuf.Timestamp started = 11; + string desired_state = 3; // "running", "stopped", "ignore" + string observed_state = 4; // "running", "stopped", "exited", "removed", "unknown" + string version = 5; + google.protobuf.Timestamp started = 6; } -message ListContainersResponse { - repeated ContainerInfo containers = 1; +message ListServicesResponse { + repeated ServiceInfo services = 1; +} + +message ServiceStatusRequest { + string name = 1; // empty = all services +} + +message DriftInfo { + string service = 1; + string component = 2; + string desired_state = 3; + string observed_state = 4; +} + +message EventInfo { + string component = 1; + string prev_state = 2; + string new_state = 3; + google.protobuf.Timestamp timestamp = 4; +} + +message ServiceStatusResponse { + repeated ServiceInfo services = 1; + repeated DriftInfo drift = 2; + repeated EventInfo recent_events = 3; +} + +message LiveCheckRequest {} + +message LiveCheckResponse { + repeated ServiceInfo services = 1; // with freshly observed state +} + +// --- Adopt --- + +message AdoptRequest { + string container = 1; // runtime container name + string service = 2; // service to adopt into + string component = 3; // component name within the service +} + +message AdoptResponse { + bool success = 1; + string error = 2; } // --- File transfer --- @@ -569,10 +479,10 @@ message ListContainersResponse { // The agent resolves the full path and rejects traversal attempts. message PushFileRequest { - string service = 1; // service name (-> /srv//) - string path = 2; // relative path within service dir + string service = 1; // service name (-> /srv//) + string path = 2; // relative path within service dir bytes content = 3; - uint32 mode = 4; // file permissions (e.g. 
0600) + uint32 mode = 4; // file permissions (e.g. 0600) } message PushFileResponse { @@ -581,8 +491,8 @@ message PushFileResponse { } message PullFileRequest { - string service = 1; // service name (-> /srv//) - string path = 2; // relative path within service dir + string service = 1; // service name (-> /srv//) + string path = 2; // relative path within service dir } message PullFileResponse { @@ -597,61 +507,75 @@ message NodeStatusRequest {} message NodeStatusResponse { string node_name = 1; - string runtime = 2; // "podman", "docker" + string runtime = 2; // "podman", "docker" string runtime_version = 3; - uint32 container_count = 4; - google.protobuf.Timestamp uptime_since = 5; + uint32 service_count = 4; + uint32 component_count = 5; + uint64 disk_total_bytes = 6; + uint64 disk_free_bytes = 7; + uint64 memory_total_bytes = 8; + uint64 memory_free_bytes = 9; + double cpu_usage_percent = 10; + google.protobuf.Timestamp uptime_since = 11; } ``` ### Container Runtime The agent manages containers by executing the local container runtime CLI -(`podman` or `docker`). The runtime is configured in the agent's config file. -The agent does not use the Docker socket API -- it shells out to the CLI for -simplicity and debuggability (the operator can always run the same commands -manually). +(`podman`). The runtime is configured in the agent's config file. The agent +shells out to the CLI for simplicity and debuggability -- the operator can +always run the same commands manually. + +The agent runs as a dedicated `mcp` system user. Podman runs rootless under +this user. All containers are owned by `mcp`. The NixOS configuration +provisions the `mcp` user with podman access. #### Deploy Flow When the agent receives a `Deploy` RPC: -1. For each container in the request: - a. Pull the image: ` pull ` +1. Record the service spec in the registry (desired state, component specs). +2. For each component being deployed (all, or the one named in the request): + a. 
Pull the image: `podman pull <image>`
    b. Stop and remove the existing container (if any):
-      `<runtime> stop <name>` and `<runtime> rm <name>`
-   c. Start the new container: `<runtime> run -d --name <name> [flags] <image> [cmd]`
-   d. Verify the container is running: `<runtime> inspect <name>`
-2. Return success/failure per container.
+      `podman stop <name>` and `podman rm <name>`
+   c. Start the new container:
+      `podman run -d --name <name> [flags] <image> [cmd]`
+   d. Verify the container is running: `podman inspect <name>`
+   e. Update observed state in the registry.
+3. Set desired state to `running` for deployed components.
+4. Return success/failure per component.
 
-The flags passed to `run` are derived from the `ContainerSpec`:
+The flags passed to `podman run` are derived from the `ComponentSpec`:
 
 | Spec field | Runtime flag |
 |------------|-------------|
 | `network` | `--network <name>` |
 | `user` | `--user <user>` |
-| `restart_policy` | `--restart <policy>` |
+| `restart` | `--restart <policy>` |
 | `ports` | `-p <mapping>` (repeated) |
 | `volumes` | `-v <mapping>` (repeated) |
 | `cmd` | appended after the image name |
 
 ### File Transfer
 
-The agent supports single-file push and pull, scoped to a specific service's
-data directory. This is the mechanism for deploying config files and
-certificates to nodes.
+The agent supports single-file push and pull, scoped to a specific
+service's data directory. This is the mechanism for deploying config files
+and certificates to nodes.
 
 Every file operation specifies a **service name** and a **relative path**.
 The agent resolves the full path as `/srv/<service>/<path>`. This scoping
-ensures that a deploy for service A cannot accidentally write into service
-B's directory.
+ensures that a file operation for service A cannot write into service B's
+directory.
 
-**Push**: master sends the service name, relative path, file content, and
+**Push**: CLI sends the service name, relative path, file content, and
 permissions.
The agent resolves the path, validates it (no `..` traversal, -no symlinks escaping the service directory), creates intermediate directories -if needed, and writes the file atomically (write to temp file, then rename). +no symlinks escaping the service directory), creates intermediate +directories if needed, and writes the file atomically (write to temp file, +then rename). -**Pull**: master sends the service name and relative path. The agent resolves +**Pull**: CLI sends the service name and relative path. The agent resolves the path, validates it, reads the file, and returns the content and permissions. @@ -666,118 +590,277 @@ mcp push cert.pem mcr certs/mcr.pem mcp pull mcr mcr.toml ./mcr.toml ``` -When the relative path is omitted from `mcp push`, the basename of the local -file is used. +When the relative path is omitted from `mcp push`, the basename of the +local file is used. File size is bounded by gRPC message limits. For v1, the default 4MB gRPC message size is sufficient -- config files and certificates are kilobytes. If larger transfers are needed in the future, streaming RPCs or the v2 tar.zst archive transfer will handle them. ---- +### Desired State vs. Observed State -## Database Schema +The agent's registry tracks two separate pieces of information for each +component: -The master's SQLite database stores the service registry. +- **Desired state** -- what the operator wants: `running`, `stopped`, or + `ignore`. Set by the CLI via deploy, stop, start, sync, or adopt. +- **Observed state** -- what the container runtime reports: `running`, + `stopped`, `exited`, `removed`, or `unknown`. + +These can diverge. A component with desired=`running` and observed=`exited` +has crashed. The agent flags this as **drift**. Components with +desired=`ignore` are tracked but never flagged as drifting. 
+ +| Desired | Observed | Status | +|---------|----------|--------| +| running | running | OK | +| running | stopped | **DRIFT** -- stopped unexpectedly | +| running | exited | **DRIFT** -- crashed | +| running | removed | **DRIFT** -- container gone | +| stopped | stopped | OK | +| stopped | removed | OK | +| stopped | running | **DRIFT** -- running when it shouldn't be | +| ignore | (any) | OK -- not managed | + +For v1, the agent reports drift but does not auto-reconcile. The operator +decides whether to `mcp start`, `mcp deploy`, or investigate. +Auto-reconciliation (agent restarting drifted containers without operator +action) is a v2 concern. + +### Registry Reconciliation + +The agent reconciles its registry against the container runtime on three +occasions: during the monitor loop (continuous), on `mcp ps` / `mcp status` +(on demand), and on `mcp sync` (when new desired state is pushed). + +Reconciliation: + +1. Agent queries the container runtime for all containers. +2. Compares the runtime's report against the registry: + - **Component in registry, seen in runtime**: update observed state. + - **Component in registry, not in runtime**: set observed state to + `removed`. + - **Container in runtime, not in registry**: add to registry with + desired state `ignore`. These are containers the agent sees but + MCP didn't deploy. +3. Record state-change events for any transitions. + +### Adopting Unmanaged Containers + +On first sync, every container on rift will appear with desired state +`ignore` -- MCP didn't deploy them and doesn't know their intended service +grouping. + +`mcp adopt ` claims an unmanaged container: + +1. If the service doesn't exist in the registry, create it. +2. Assign the container as a component of the specified service. +3. Set desired state to `running` (or `stopped` if the container is + currently stopped). + +This lets the operator incrementally bring existing containers under MCP +management without redeploying them. 
The typical bootstrap flow: +`mcp sync` to discover containers, `mcp adopt` to group them into services, +`mcp service export` to generate service definition files from the adopted +state. + +### Monitoring + +The agent runs a continuous monitor loop that watches container state and +alerts the operator when problems are detected. Monitoring is a core +function of the agent, not a separate process. + +#### Event Log + +Every state transition is recorded in the `events` table: ```sql -CREATE TABLE nodes ( - name TEXT PRIMARY KEY, - address TEXT NOT NULL, - agent_port INTEGER NOT NULL DEFAULT 9444, - created_at TEXT NOT NULL DEFAULT (datetime('now')), - updated_at TEXT NOT NULL DEFAULT (datetime('now')) -); - -CREATE TABLE services ( - name TEXT PRIMARY KEY, - node TEXT NOT NULL REFERENCES nodes(name), - created_at TEXT NOT NULL DEFAULT (datetime('now')), - updated_at TEXT NOT NULL DEFAULT (datetime('now')) -); - -CREATE TABLE containers ( - name TEXT PRIMARY KEY, - service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE, - image TEXT NOT NULL, - network TEXT NOT NULL DEFAULT 'bridge', - user_spec TEXT NOT NULL DEFAULT '', - restart_policy TEXT NOT NULL DEFAULT 'unless-stopped', - desired_state TEXT NOT NULL DEFAULT 'running', - observed_state TEXT NOT NULL DEFAULT 'unknown', - managed INTEGER NOT NULL DEFAULT 1, - created_at TEXT NOT NULL DEFAULT (datetime('now')), - updated_at TEXT NOT NULL DEFAULT (datetime('now')) -); - -CREATE TABLE container_ports ( - container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, - mapping TEXT NOT NULL, - PRIMARY KEY (container, mapping) -); - -CREATE TABLE container_volumes ( - container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, - mapping TEXT NOT NULL, - PRIMARY KEY (container, mapping) -); - -CREATE TABLE container_cmd ( - container TEXT NOT NULL REFERENCES containers(name) ON DELETE CASCADE, - position INTEGER NOT NULL, - arg TEXT NOT NULL, - PRIMARY KEY (container, position) -); - --- 
Event log for monitoring and alerting CREATE TABLE events ( id INTEGER PRIMARY KEY AUTOINCREMENT, - container TEXT NOT NULL, - node TEXT NOT NULL, + component TEXT NOT NULL, + service TEXT NOT NULL, prev_state TEXT NOT NULL, new_state TEXT NOT NULL, timestamp TEXT NOT NULL DEFAULT (datetime('now')) ); -CREATE INDEX idx_events_container_time ON events(container, timestamp); +CREATE INDEX idx_events_component_time ON events(component, timestamp); +``` + +Events accumulate over time and support rate queries: + +```sql +-- How many times has metacrypt-api exited in the last hour? +SELECT COUNT(*) FROM events +WHERE component = 'api' AND service = 'metacrypt' + AND new_state = 'exited' + AND timestamp > datetime('now', '-1 hour'); +``` + +Old events are pruned at the start of each monitor iteration (default: +retain 30 days). + +#### Monitor Loop + +Each iteration of the monitor loop: + +1. Query the container runtime for all container states. +2. Reconcile against the registry (update observed states). +3. For each state transition since the last iteration, insert an event. +4. Evaluate alert conditions against the current state and event history. +5. If an alert fires, execute the configured alert command. +6. Sleep for the configured interval. + +#### Alert Conditions + +The monitor evaluates two types of alert: + +- **Drift alert**: a managed component's observed state does not match its + desired state. Fires on the transition, not on every iteration. +- **Flap alert**: a component has changed state more than N times within a + window. Default threshold: 3 transitions in 10 minutes. + +Each alert has a **cooldown** per component. Once an alert fires for a +component, it is suppressed for the cooldown period regardless of further +transitions. This prevents notification spam from a flapping service. 
+
+```toml
+[monitor]
+interval = "60s"
+alert_command = ""   # command to run on alert; empty = log only
+cooldown = "15m"     # suppress repeat alerts per component
+flap_threshold = 3   # this many state changes within flap_window = flapping
+flap_window = "10m"
+retention = "30d"    # event log retention
+```
+
+#### Alert Command
+
+When an alert fires, the agent executes the configured command with
+context passed via environment variables:
+
+| Variable | Value |
+|----------|-------|
+| `MCP_COMPONENT` | Component name |
+| `MCP_SERVICE` | Parent service name |
+| `MCP_NODE` | Node name |
+| `MCP_DESIRED` | Desired state |
+| `MCP_OBSERVED` | Observed state |
+| `MCP_PREV_STATE` | Previous observed state |
+| `MCP_ALERT_TYPE` | `drift` or `flapping` |
+| `MCP_TRANSITIONS` | Number of transitions in the flap window (for flap alerts) |
+
+The alert command is the operator's choice. MCP does not ship with or
+depend on any notification system. The examples below use TOML literal
+strings (single quotes) so that the shell sees double quotes and expands
+the `MCP_*` variables.
+
+```bash
+# Push notification
+alert_command = 'ntfy publish mcp "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT ($MCP_DESIRED -> $MCP_OBSERVED)"'
+
+# Webhook
+alert_command = "curl -s -X POST https://hooks.example.com/mcp ..."
+
+# Syslog
+alert_command = 'logger -t mcp "$MCP_ALERT_TYPE: $MCP_SERVICE/$MCP_COMPONENT on $MCP_NODE"'
+```
+
+---
+
+## Database Schema
+
+The agent's SQLite database stores the node-local registry. Each agent
+has its own database. Component identity is scoped to the node -- there
+are no cross-node name collisions because each node has a separate
+database.
+ +```sql +CREATE TABLE services ( + name TEXT PRIMARY KEY, + active INTEGER NOT NULL DEFAULT 1, + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE TABLE components ( + name TEXT NOT NULL, + service TEXT NOT NULL REFERENCES services(name) ON DELETE CASCADE, + image TEXT NOT NULL, + network TEXT NOT NULL DEFAULT 'bridge', + user_spec TEXT NOT NULL DEFAULT '', + restart TEXT NOT NULL DEFAULT 'unless-stopped', + desired_state TEXT NOT NULL DEFAULT 'running', + observed_state TEXT NOT NULL DEFAULT 'unknown', + version TEXT NOT NULL DEFAULT '', + created_at TEXT NOT NULL DEFAULT (datetime('now')), + updated_at TEXT NOT NULL DEFAULT (datetime('now')), + PRIMARY KEY (service, name) +); + +CREATE TABLE component_ports ( + service TEXT NOT NULL, + component TEXT NOT NULL, + mapping TEXT NOT NULL, + PRIMARY KEY (service, component, mapping), + FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE +); + +CREATE TABLE component_volumes ( + service TEXT NOT NULL, + component TEXT NOT NULL, + mapping TEXT NOT NULL, + PRIMARY KEY (service, component, mapping), + FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE +); + +CREATE TABLE component_cmd ( + service TEXT NOT NULL, + component TEXT NOT NULL, + position INTEGER NOT NULL, + arg TEXT NOT NULL, + PRIMARY KEY (service, component, position), + FOREIGN KEY (service, component) REFERENCES components(service, name) ON DELETE CASCADE +); + +CREATE TABLE events ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + service TEXT NOT NULL, + component TEXT NOT NULL, + prev_state TEXT NOT NULL, + new_state TEXT NOT NULL, + timestamp TEXT NOT NULL DEFAULT (datetime('now')) +); + +CREATE INDEX idx_events_component_time ON events(service, component, timestamp); ``` ### State Values -**Desired state** (set by operator actions): +**Desired state** (set by operator actions via CLI): | State | Meaning | 
|-------|---------| -| `running` | Operator wants this container running | -| `stopped` | Operator deliberately stopped this container | +| `running` | Operator wants this component running | +| `stopped` | Operator deliberately stopped this component | +| `ignore` | Unmanaged -- MCP sees it but is not responsible for it | -**Observed state** (set by agent reports): +**Observed state** (set by container runtime queries): | State | Meaning | |-------|---------| -| `running` | Container is running (confirmed by agent) | +| `running` | Container is running | | `stopped` | Container exists but is not running | | `exited` | Container exited (crashed or completed) | -| `removed` | Container no longer exists on the node | -| `unknown` | State has not been queried since registration | - -The `managed` flag distinguishes MCP-deployed containers from pre-existing -ones discovered during sync. Unmanaged containers have no desired state set -(desired_state is empty). They appear in status output but MCP will not -modify them unless the operator explicitly adopts them with `mcp adopt`. +| `removed` | Container no longer exists | +| `unknown` | State has not been queried yet | --- ## Configuration -### Master Config - -When running as a CLI on the operator's workstation: +### CLI Config ```toml -[database] -path = "/home/kyle/.config/mcp/mcp.db" - [services] dir = "/home/kyle/.config/mcp/services" @@ -794,8 +877,7 @@ Directory layout on the operator's workstation: ``` ~/.config/mcp/ -├── mcp.toml Master config -├── mcp.db Service registry +├── mcp.toml CLI config ├── token Cached MCIAS bearer token (0600) └── services/ Service definition files ├── metacrypt.toml @@ -804,21 +886,30 @@ Directory layout on the operator's workstation: └── ... ``` -When running as a service on a node (for `mcp watch`): +The CLI has no database. Service definition files are the operator's source +of truth for desired state. The agent's registry is the operational truth. 
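Service definition files are not fully specified in this document; for illustration, one might look like the sketch below. The `node` key, the component table layout, and the image references are assumptions mirroring `ComponentSpec` and the registry schema:

```toml
# ~/.config/mcp/services/metacrypt.toml -- illustrative sketch
name = "metacrypt"
node = "rift"        # assumed: the CLI needs a target node per service
active = true        # active = false would push desired_state = "stopped"

[components.web]
image = "registry.example.com/metacrypt-web:latest"  # hypothetical image ref
network = "bridge"
restart = "unless-stopped"
ports = ["8443:8443"]
volumes = ["/srv/metacrypt:/data"]

[components.api]
image = "registry.example.com/metacrypt-api:latest"  # hypothetical image ref
restart = "unless-stopped"
```

Under this layout, `mcp sync` would read every file under `services/` and push each definition to the named node's agent.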
+ +### Agent Config ```toml +[server] +grpc_addr = ":9444" +tls_cert = "/srv/mcp/certs/cert.pem" +tls_key = "/srv/mcp/certs/key.pem" + [database] path = "/srv/mcp/mcp.db" [mcias] server_url = "https://mcias.metacircular.net:8443" ca_cert = "" -service_name = "mcp" -# System account credentials for unattended operation -username = "mcp" -password_file = "/srv/mcp/credentials" +service_name = "mcp-agent" -[watch] +[agent] +node_name = "rift" +container_runtime = "podman" + +[monitor] interval = "60s" alert_command = "" cooldown = "15m" @@ -830,37 +921,17 @@ retention = "30d" level = "info" ``` -The master binary is the same in both cases. When running from the -operator's workstation, it uses `~/.config/mcp/` for local state and -interactive login. When running as a systemd service on a node, it uses -`/srv/mcp/` following the platform convention and authenticates with a -system account. +The agent's data directory follows the platform convention: -### Agent Config - -```toml -[server] -grpc_addr = ":9444" -tls_cert = "/srv/mcp/certs/cert.pem" -tls_key = "/srv/mcp/certs/key.pem" - -[mcias] -server_url = "https://mcias.metacircular.net:8443" -ca_cert = "" -service_name = "mcp-agent" - -[agent] -node_name = "rift" -container_runtime = "podman" - -[log] -level = "info" ``` - -The agent config lives in `/srv/mcp/` following the platform convention. The -agent's data directory contains only its config and TLS material -- it has no -database of its own. The container runtime is the source of truth for local -container state. +/srv/mcp/ +├── mcp-agent.toml Agent config +├── mcp.db Registry database +├── certs/ +│ ├── cert.pem Agent TLS certificate +│ └── key.pem Agent TLS key +└── backups/ Database snapshots +``` --- @@ -870,14 +941,15 @@ container state. The agent is deployed like any other Metacircular service: -1. Create `/srv/mcp/` directory and config file. -2. Provision TLS certificate from Metacrypt. -3. Create an MCIAS system account for the agent (`mcp-agent`). -4. 
Install the `mcp-agent` binary. -5. Start via systemd unit. +1. Provision the `mcp` system user via NixOS config (with podman access). +2. Create `/srv/mcp/` directory and config file. +3. Provision TLS certificate from Metacrypt. +4. Create an MCIAS system account for the agent (`mcp-agent`). +5. Install the `mcp-agent` binary. +6. Start via systemd unit. -The agent runs as a systemd service (container-first deployment is a v2 -concern -- MCP needs to be running before it can manage its own agent). +The agent runs as a systemd service. Container-first deployment is a v2 +concern -- MCP needs to be running before it can manage its own agent. ```ini [Unit] @@ -913,68 +985,40 @@ ReadWritePaths=/srv WantedBy=multi-user.target ``` -Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes files -to any service's `/srv//` directory on behalf of the master. +Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes +files to any service's `/srv//` directory on behalf of the CLI. -### Master Installation (on operator workstation) +### CLI Installation (on operator workstation) -For interactive CLI use: +The CLI is a standalone binary with no daemon. 1. Install the `mcp` binary to `~/.local/bin/` or `/usr/local/bin/`. -2. Create `~/.config/mcp/mcp.toml` with MCIAS and registry config. -3. Run `mcp login` to authenticate. -4. Run `mcp node add rift
` to register the first node. -5. Run `mcp sync` to discover existing containers on the node. +2. Create `~/.config/mcp/mcp.toml`. +3. Create `~/.config/mcp/services/` directory. +4. Run `mcp login` to authenticate. +5. Run `mcp sync` to push service definitions and discover existing + containers. -### Master Watch Deployment (on a node) +### MCP Bootstrap (first time) -For continuous monitoring, deploy `mcp watch` as a systemd service on a node -(rift for v1). This runs alongside the agent on the same machine. +When bringing MCP up on a node that already has running containers: -1. Install the `mcp` binary to `/usr/local/bin/`. -2. Create `/srv/mcp/mcp.toml` with database, MCIAS system account, and - `[watch]` settings. -3. Create an MCIAS system account for the master (`mcp`). +1. Deploy the agent (steps above). +2. `mcp sync` with no service definition files -- the agent discovers all + running containers and adds them to its registry with desired state + `ignore`. +3. `mcp adopt ` for each container -- group them into + services and set desired state to `running`. +4. `mcp service export ` for each service -- generate service + definition files from the adopted state. +5. Review and edit the generated files as needed. -```ini -[Unit] -Description=MCP Watch -After=network-online.target mcp-agent.service -Wants=network-online.target +From this point, the service definition files are the source of truth and +`mcp deploy` manages the containers. 
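Concretely, the bootstrap might look like this session (illustrative; the container names are hypothetical, and the `adopt` argument order -- container, service, component -- is assumed from the `AdoptRequest` fields):

```
mcp sync                                # agent discovers containers as desired_state=ignore
mcp ps                                  # live view of what the agent observed
mcp adopt metacrypt-web metacrypt web   # claim a container as metacrypt/web
mcp adopt metacrypt-api metacrypt api
mcp service export metacrypt            # write ~/.config/mcp/services/metacrypt.toml
mcp status                              # confirm desired vs. observed, no drift
```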
-[Service] -Type=simple -ExecStart=/usr/local/bin/mcp watch --config /srv/mcp/mcp.toml -Restart=on-failure -RestartSec=10 - -User=mcp -Group=mcp - -NoNewPrivileges=true -ProtectSystem=strict -ProtectHome=true -PrivateTmp=true -PrivateDevices=true -ProtectKernelTunables=true -ProtectKernelModules=true -ProtectControlGroups=true -RestrictSUIDSGID=true -RestrictNamespaces=true -LockPersonality=true -MemoryDenyWriteExecute=true -RestrictRealtime=true -ReadWritePaths=/srv/mcp - -[Install] -WantedBy=multi-user.target -``` - -When `mcp watch` runs on a node and the operator uses the CLI from vade, -they share the same registry database. The operator can either point their -CLI config at the node's database (via the overlay) or maintain a separate -local database and sync it. For v1 with one node, the simplest option is -running all `mcp` commands on rift (via SSH) where the database lives. +Existing containers on rift currently run under kyle's podman instance. +As part of MCP bootstrap, they will need to be re-created under the `mcp` +user's rootless podman. This is a one-time migration. --- @@ -986,12 +1030,12 @@ running all `mcp` commands on rift (via SSH) where the database lives. 
|--------|------------| | Unauthorized C2 commands | Agent requires admin MCIAS token on every RPC | | Token theft | Tokens have short expiry; cached validation keyed by SHA-256 | -| Agent impersonation | Master verifies agent TLS certificate against known CA | +| Agent impersonation | CLI verifies agent TLS certificate against Metacrypt CA | | Arbitrary file write via push | Agent restricts writes to `/srv//` for the named service | | Arbitrary file read via pull | Agent restricts reads to `/srv//` for the named service | | Cross-service file access | File ops require a service name; agent resolves to that service's directory only | -| Container runtime escape | Agent runs as non-root where possible; containers follow platform hardening | -| Network eavesdropping | All C2 traffic is gRPC over mTLS over encrypted overlay | +| Container runtime escape | Rootless podman under `mcp` user; containers follow platform hardening | +| Network eavesdropping | All C2 traffic is gRPC over TLS over encrypted overlay | ### Security Invariants @@ -1000,10 +1044,12 @@ running all `mcp` commands on rift (via SSH) where the database lives. 2. File operations are scoped to `/srv//` for the named service. Path traversal attempts (`../`, symlinks outside the service directory) are rejected. -3. The agent never executes arbitrary commands. It only runs container runtime - operations and file I/O through well-defined code paths. +3. The agent never executes arbitrary commands. It only runs container + runtime operations and file I/O through well-defined code paths. 4. TLS 1.3 minimum on the agent's gRPC listener. -5. The master's stored token is file-permission protected (0600). +5. The CLI's stored token is file-permission protected (0600). +6. The agent runs as a dedicated `mcp` user with rootless podman. No root + access required. --- @@ -1012,47 +1058,46 @@ running all `mcp` commands on rift (via SSH) where the database lives. 
``` mcp/ ├── cmd/ -│ ├── mcp/ Master CLI +│ ├── mcp/ CLI │ │ ├── main.go │ │ ├── login.go │ │ ├── deploy.go │ │ ├── lifecycle.go stop, start, restart -│ │ ├── status.go ps, status, sync +│ │ ├── status.go list, ps, status +│ │ ├── sync.go sync desired state │ │ ├── adopt.go adopt unmanaged containers -│ │ ├── service.go service show/export +│ │ ├── service.go service show/edit/export │ │ ├── transfer.go push, pull -│ │ ├── watch.go watch (long-lived monitoring) │ │ └── node.go node add/list/remove │ └── mcp-agent/ Agent daemon -│ ├── main.go -│ └── server.go +│ └── main.go ├── internal/ -│ ├── agent/ Agent logic -│ │ ├── agent.go Agent struct, setup -│ │ ├── containers.go Container runtime operations -│ │ ├── files.go File push/pull -│ │ └── runtime.go Runtime abstraction (podman/docker) -│ ├── registry/ Service registry +│ ├── agent/ Agent core +│ │ ├── agent.go Agent struct, setup, gRPC server +│ │ ├── deploy.go Deploy flow +│ │ ├── lifecycle.go Stop, start, restart +│ │ ├── files.go File push/pull with path validation +│ │ ├── sync.go Desired state sync, reconciliation +│ │ ├── adopt.go Container adoption +│ │ └── status.go Status queries +│ ├── runtime/ Container runtime abstraction +│ │ ├── runtime.go Interface +│ │ └── podman.go Podman implementation +│ ├── registry/ Node-local registry │ │ ├── db.go Schema, migrations -│ │ ├── nodes.go Node CRUD │ │ ├── services.go Service CRUD -│ │ ├── containers.go Container CRUD -│ │ ├── events.go Event log -│ │ └── sync.go Registry reconciliation -│ ├── watch/ Monitoring subsystem -│ │ ├── watch.go Watch loop +│ │ ├── components.go Component CRUD +│ │ └── events.go Event log +│ ├── monitor/ Monitoring subsystem +│ │ ├── monitor.go Watch loop │ │ └── alerting.go Alert evaluation and command execution │ ├── servicedef/ Service definition file parsing │ │ └── servicedef.go Load, parse, write TOML service defs │ ├── auth/ MCIAS integration -│ │ └── auth.go Token validation, login -│ ├── config/ Configuration loading -│ │ ├── 
master.go -│ │ └── agent.go -│ └── grpcserver/ Agent gRPC server wiring -│ ├── server.go -│ ├── interceptors.go Auth interceptor -│ └── handlers.go RPC implementations +│ │ └── auth.go Token validation, interceptor +│ └── config/ Configuration loading +│ ├── cli.go +│ └── agent.go ├── proto/mcp/ │ └── v1/ │ └── mcp.proto @@ -1060,11 +1105,10 @@ mcp/ │ └── v1/ Generated Go code ├── deploy/ │ ├── systemd/ -│ │ ├── mcp-agent.service -│ │ └── mcp-watch.service +│ │ └── mcp-agent.service │ ├── examples/ -│ │ ├── mcp.toml -│ │ └── mcp-agent.toml +│ │ ├── mcp.toml CLI config example +│ │ └── mcp-agent.toml Agent config example │ └── scripts/ │ └── install-agent.sh ├── Makefile @@ -1080,19 +1124,22 @@ mcp/ These are explicitly out of scope for v1 but inform the design: -- **Auto-reconciliation**: `mcp watch` detects drift but does not act on it - in v1. v2 adds configurable auto-restart for drifted containers (with - backoff to avoid restart storms). -- **Migration**: snapshot `/srv//` as tar.zst (with VACUUM INTO for - clean DB copies), stream to destination node, restore. Requires streaming - gRPC and archive assembly logic. -- **Scheduling**: automatic node selection based on resource availability and - operator constraints. Requires node resource reporting from agents. +- **Auto-reconciliation**: the agent detects drift but does not act on it + in v1. v2 adds configurable auto-restart for drifted components (with + backoff to avoid restart storms). This is the path to fully declarative + operation -- the agent continuously reconciles toward desired state. +- **Migration**: snapshot `/srv//` as tar.zst (with VACUUM INTO + for clean DB copies), stream to destination node, restore. Requires + streaming gRPC and archive assembly logic. +- **Scheduling**: automatic node selection based on resource availability + and operator constraints. The agent already reports disk, memory, and CPU + in `NodeStatus` to support this. 
- **Certificate provisioning**: MCP provisions TLS certs from Metacrypt during deploy via the ACME client library. - **DNS updates**: MCP pushes record updates to MCNS after deploy/migrate. Requires MCNS to have an API (or, as a stopgap, zone file editing). - **Multi-node orchestration**: deploy across multiple nodes, rolling updates, health-aware placement. -- **Web UI**: master exposes a web interface for registry browsing and - operations. Would add REST+gRPC on the master side. +- **Web UI**: a web interface for registry browsing and operations. Would + be a separate binary communicating with agents via gRPC, following the + platform's web UI pattern.