Add purge design to architecture doc

Purge removes stale registry entries: components that are no longer in service definitions and have no running container. It is designed as an explicit, safe operation separate from sync: sync is additive (push desired state), purge is subtractive (remove forgotten entries). Includes safety rules (refuses to purge running containers), dry-run mode, agent RPC definition, and rationale for why sync should not be made destructive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:22:27 -07:00
parent ea8a42a696
commit 1afbf5e1f6
1 changed files with 142 additions and 0 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -207,6 +207,7 @@ mcp sync                               Push service definitions to agent (update
                                         state without deploying)

 mcp adopt <service>                    Adopt all <service>-* containers into a service
+mcp purge [service[/component]]       Remove stale registry entries (--dry-run to preview)

 mcp service show <service>             Print current spec from agent registry
 mcp service edit <service>             Open service definition in $EDITOR
@@ -1195,6 +1196,147 @@ mcp/

 ---

+## Registry Cleanup: Purge
+
+### Problem
+
+The agent's registry accumulates stale entries over time. A component
+that was replaced (e.g., `mcns/coredns` → `mcns/mcns`) or a service
+that was decommissioned remains in the registry indefinitely with
+`observed=removed` or `observed=unknown`. There is no mechanism to tell
+the agent "this component no longer exists and should not be tracked."
+
+This causes:
+- Perpetual drift alerts for components that will never return.
+- Noise in `mcp status` and `mcp list` output.
+- Confusion about what the agent is actually responsible for.
+
+The existing `mcp sync` compares local service definitions against the
+agent's registry and updates desired state for components that are
+defined. But it does not remove components or services that are *absent*
+from the local definitions — sync is additive, not declarative.
+
+### Design: `mcp purge`
+
+Purge removes registry entries that are both **unwanted** (not in any
+current service definition) and **gone** (no corresponding container in
+the runtime). It is the garbage collector for the registry.
+
+```
+mcp purge [--dry-run]               Purge all stale entries
+mcp purge <service> [--dry-run]     Purge stale entries for one service
+mcp purge <service>/<component> [--dry-run]  Purge a specific component
+```
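As an illustration only (not part of this commit), the optional `[service[/component]]` argument could be split into the two fields the agent RPC expects, with an empty string meaning "all". The names `PurgeTarget` and `parse_target` are hypothetical, and the validation rules are an assumption:

```python
from typing import NamedTuple


class PurgeTarget(NamedTuple):
    service: str    # "" means all services
    component: str  # "" means all eligible components in the service


def parse_target(arg: str = "") -> PurgeTarget:
    """Parse the optional [service[/component]] CLI argument."""
    if not arg:
        return PurgeTarget("", "")
    service, sep, component = arg.partition("/")
    # Reject "/x", "svc/", and "svc/a/b" (assumed to be invalid targets).
    if not service or (sep and not component) or "/" in component:
        raise ValueError(f"invalid purge target: {arg!r}")
    return PurgeTarget(service, component)
```

With this sketch, `parse_target("mcns/coredns")` yields service `mcns` and component `coredns`, while `parse_target()` selects everything.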
+
+#### Semantics
+
+Purge operates on the agent's registry, not on containers. It never
+stops or removes running containers. The rules:
+
+1. **Component purge**: a component is eligible for purge when:
+   - Its observed state is `removed`, `unknown`, or `exited`, AND
+   - It is not present in any current service definition file
+     (i.e., `mcp sync` would not recreate it).
+
+   Purging a component deletes its registry entry (from `components`,
+   `component_ports`, `component_volumes`, `component_cmd`) and its
+   event history.
+
+2. **Service purge**: a service is eligible for purge when all of its
+   components have been purged (or it has no components). Purging a
+   service deletes its `services` row.
+
+3. **Safety**: purge refuses to remove a component whose observed state
+   is `running` or `stopped` (i.e., a container still exists in the
+   runtime). This prevents accidentally losing track of live containers.
+   The operator must `mcp stop` and wait for the container to be removed
+   before purging, or manually remove it via podman.
+
+4. **Dry run**: `--dry-run` lists what would be purged without modifying
+   the registry. Use it to preview a purge safely before running it.
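The four rules above can be sketched as follows. This is a minimal model, not the agent's implementation (and not part of this commit): the registry is assumed to be an in-memory dict of service name to component list, and `Component`, `evaluate`, and `purge` are hypothetical names. The reason strings imitate the dry-run output shown in the example.

```python
from dataclasses import dataclass

GONE_STATES = {"removed", "unknown", "exited"}  # rule 1: eligible observed states
LIVE_STATES = {"running", "stopped"}            # rule 3: container still exists


@dataclass(frozen=True)
class Component:
    service: str
    name: str
    observed: str


def evaluate(comp, defined):
    """Return (purgeable, reason); `defined` holds (service, component) pairs."""
    if comp.observed in LIVE_STATES:
        return False, f"observed={comp.observed}, container still exists"
    if comp.observed not in GONE_STATES:
        return False, f"observed={comp.observed}, not a purgeable state"
    if (comp.service, comp.name) in defined:
        return False, "present in service definitions"
    return True, f"observed={comp.observed}, not in service definitions"


def purge(registry, defined, dry_run=False):
    """Evaluate every entry; mutate `registry` only when dry_run is False."""
    results = []
    for service in list(registry):
        kept = []
        for comp in registry[service]:
            ok, reason = evaluate(comp, defined)
            results.append((service, comp.name, ok, reason))
            if not ok:
                kept.append(comp)
        if not dry_run:  # rule 4: a dry run reports but never modifies
            if kept:
                registry[service] = kept
            else:
                del registry[service]  # rule 2: no components left
    return results
```

Running `purge(registry, defined, dry_run=True)` returns the same results as a real purge but leaves `registry` untouched, which is the property that makes the preview safe.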
+
+#### Interaction with Sync
+
+`mcp sync` pushes desired state from service definitions. `mcp purge`
+removes entries that sync would never touch. They are complementary:
+
+- `sync` answers: "what should exist?" (additive)
+- `purge` answers: "what should be forgotten?" (subtractive)
+
+A full cleanup is: `mcp sync && mcp purge`.
+
+An alternative design would make `mcp sync` itself remove entries not
+present in service definitions (fully declarative sync). This was
+rejected because:
+
+- Sync currently only operates on services that have local definition
+  files. A service without a local file is left untouched — this is
+  desirable when multiple operators or workstations manage different
+  services.
+- Making sync destructive increases the blast radius of a missing file
+  (accidentally deleting the local `mcr.toml` would cause sync to
+  purge MCR from the registry).
+- Purge as a separate, explicit command with `--dry-run` gives the
+  operator clear control over what gets cleaned up.
+
+#### Agent RPC
+
+```protobuf
+rpc PurgeComponent(PurgeRequest) returns (PurgeResponse);
+
+message PurgeRequest {
+  string service = 1;       // service name (empty = all services)
+  string component = 2;     // component name (empty = all eligible in service)
+  bool dry_run = 3;         // preview only, do not modify registry
+}
+
+message PurgeResponse {
+  repeated PurgeResult results = 1;
+}
+
+message PurgeResult {
+  string service = 1;
+  string component = 2;
+  bool purged = 3;           // true if removed (or would be, in dry-run)
+  string reason = 4;         // why eligible, or why refused
+}
+```
+
+The CLI sends the set of currently-defined service/component names
+alongside the purge request so the agent can determine what is "not in
+any current service definition" without needing access to the CLI's
+filesystem.
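One way the CLI might assemble that set is sketched below (not part of this commit). The `PurgeRequest` message shown above has no field for the defined names, so the `defined_components` key here is a hypothetical addition. The shape of a parsed service definition (a `name` plus a `components` list) is also an assumption.

```python
def defined_components(definitions):
    """Collect sorted "service/component" names from parsed definitions.

    Each definition is assumed to be a mapping with a 'name' key and an
    optional 'components' list of mappings that each carry a 'name'.
    """
    names = set()
    for svc in definitions:
        for comp in svc.get("components", []):
            names.add(f"{svc['name']}/{comp['name']}")
    return sorted(names)


def build_purge_request(definitions, service="", component="", dry_run=False):
    """Build a dict mirroring PurgeRequest plus the hypothetical extra field."""
    return {
        "service": service,      # empty = all services
        "component": component,  # empty = all eligible in service
        "dry_run": dry_run,
        # Hypothetical field: lets the agent decide what is "not defined"
        # without access to the CLI's filesystem.
        "defined_components": defined_components(definitions),
    }
```

Sending names rather than whole definitions keeps the request small and gives the agent exactly the information the eligibility check needs.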
+
+#### Example
+
+After replacing `mcns/coredns` with `mcns/mcns`:
+
+```
+$ mcp purge --dry-run
+would purge mcns/coredns (observed=removed, not in service definitions)
+
+$ mcp purge
+purged mcns/coredns
+
+$ mcp status
+SERVICE    COMPONENT  DESIRED  OBSERVED  VERSION
+mc-proxy   mc-proxy   running  running   latest
+mcns       mcns       running  running   v1.0.0
+mcr        api        running  running   latest
+mcr        web        running  running   latest
+metacrypt  api        running  running   latest
+metacrypt  web        running  running   latest
+```
+
+#### Interaction with Adopt
+
+Purge also cleans up after the `mcp adopt` workflow. When containers are
+adopted and later removed (replaced by a proper deploy), the adopted
+entries linger. Purge removes them once the containers are gone and the
+service definition no longer references them.
+
+---
+
 ## Future Work (v2+)

 These are explicitly out of scope for v1 but inform the design: