Add undeploy command: full inverse of deploy

Implements `mcp undeploy <service>` which tears down all infrastructure
for a service: removes mc-proxy routes, DNS records, TLS certificates,
stops and removes containers, releases allocated ports, and marks the
service inactive.

This fills the gap between `stop` (temporary pause) and `purge` (registry
cleanup). Undeploy is the complete teardown that returns the node to the
state before the service was deployed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 21:45:42 -07:00
parent b2eaa69619
commit f932dd64cc
8 changed files with 610 additions and 150 deletions


@@ -198,6 +198,7 @@ mcp build <service>/<image> Build and push a single image
mcp deploy <service> Deploy all components from service definition
mcp deploy <service>/<component> Deploy a single component
mcp deploy <service> -f <file> Deploy from explicit file
mcp undeploy <service> Full teardown: remove routes, DNS, certs, containers
mcp stop <service> Stop all components, set active=false
mcp start <service> Start all components, set active=true
mcp restart <service> Restart all components
@@ -453,6 +454,7 @@ import "google/protobuf/timestamp.proto";
service McpAgent {
// Service lifecycle
rpc Deploy(DeployRequest) returns (DeployResponse);
rpc UndeployService(UndeployRequest) returns (UndeployResponse);
rpc StopService(ServiceRequest) returns (ServiceResponse);
rpc StartService(ServiceRequest) returns (ServiceResponse);
rpc RestartService(ServiceRequest) returns (ServiceResponse);
@@ -714,6 +716,40 @@ The flags passed to `podman run` are derived from the `ComponentSpec`:
| `volumes` | `-v <mapping>` (repeated) |
| `cmd` | appended after the image name |

#### Undeploy Flow

`mcp undeploy <service>` is the full inverse of deploy. It tears down all
infrastructure associated with a service. When the agent receives an
`UndeployService` RPC:

1. For each component:
   a. Remove mc-proxy routes (traffic stops flowing).
   b. Remove DNS A records from MCNS.
   c. Remove TLS certificate and key files from the mc-proxy cert
      directory (for L7 routes).
   d. Stop and remove the container.
   e. Release allocated host ports back to the port allocator.
   f. Update component state to `removed` in the registry.
2. Mark the service as inactive.
3. Return success/failure per component.

The CLI also sets `active = false` in the local service definition file
to keep it in sync with the operator's intent.

Undeploy differs from `stop` in four ways:

| Aspect | `stop` | `undeploy` |
|--------|--------|------------|
| Container | Stopped (still exists) | Stopped and removed |
| TLS certs | Kept | Removed |
| Ports | Kept allocated | Released |
| Service active | Unchanged | Set to inactive |

After undeploy, the service can be redeployed with `mcp deploy`. The
registry entries are preserved (desired state `removed`) so `mcp status`
and `mcp list` still show the service existed. Use `mcp purge` to clean
up the registry entries if desired.
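
The ordering of the per-component steps matters: routes come out first so
traffic drains before anything else is touched, and bookkeeping comes last.
A minimal Go sketch of that loop, where `Component` and the helper are
illustrative stand-ins rather than the agent's actual types:

```go
package main

import "fmt"

// Component is a simplified stand-in for the agent's component record;
// the field names here are illustrative, not the actual schema.
type Component struct {
	Name  string
	Ports []int
}

// undeployComponent returns the teardown actions in the order described
// above: routes, DNS, TLS material, container, port release, registry
// update. Each string stands in for a call into the real subsystem.
func undeployComponent(c Component) []string {
	steps := []string{
		fmt.Sprintf("remove mc-proxy routes for %s", c.Name),
		fmt.Sprintf("remove DNS A records for %s", c.Name),
		fmt.Sprintf("remove TLS cert/key for %s", c.Name),
		fmt.Sprintf("stop and remove container %s", c.Name),
	}
	for _, p := range c.Ports {
		steps = append(steps, fmt.Sprintf("release port %d", p))
	}
	steps = append(steps, fmt.Sprintf("mark %s as removed in registry", c.Name))
	return steps
}

func main() {
	for _, s := range undeployComponent(Component{Name: "web", Ports: []int{18080}}) {
		fmt.Println(s)
	}
}
```

Because routes are removed before the container stops, clients see
connection refusals from mc-proxy rather than half-closed streams from a
dying container.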

### File Transfer

The agent supports single-file push and pull, scoped to a specific
@@ -1203,6 +1239,102 @@ container, the effective host UID depends on the mapping. Files in
configuration should provision appropriate subuid/subgid ranges when
creating the `mcp` user.

**Dockerfile convention**: Do not use `USER`, `VOLUME`, or `adduser`
directives in production Dockerfiles. The `user` field in the service
definition (typically `"0:0"`) controls the runtime user, and host
volumes provide the data directories. A non-root `USER` in the
Dockerfile maps to a subordinate UID under rootless podman that cannot
access files owned by the `mcp` user on the host.
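
The last point can be made concrete with a little arithmetic. Under
rootless podman with a single subordinate range, container UID 0 maps to
the invoking user, and container UID n (n ≥ 1) maps to
`subuid_start + n - 1`. A sketch using illustrative UID values, not the
actual host configuration:

```go
package main

import "fmt"

// effectiveHostUID computes which host UID a container UID lands on under
// a rootless podman user namespace with a single subuid range: container
// UID 0 maps to the invoking user, container UID n (n >= 1) maps to
// subuidStart + n - 1.
func effectiveHostUID(containerUID, invokingUID, subuidStart int) int {
	if containerUID == 0 {
		return invokingUID
	}
	return subuidStart + containerUID - 1
}

func main() {
	// Assume the mcp user is UID 1001 with subuid range 100000:65536
	// (illustrative values).
	fmt.Println(effectiveHostUID(0, 1001, 100000))    // USER 0 runs as the mcp user itself
	fmt.Println(effectiveHostUID(1000, 1001, 100000)) // USER 1000 lands deep in the subuid range
}
```

With these assumed values, a `USER 1000` directive runs the process as
host UID 100999, which cannot read files owned by UID 1001 — hence the
convention of leaving the Dockerfile at root and setting `user` in the
service definition instead.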

#### Infrastructure Boot Order and Circular Dependencies

MCR (container registry) and MCNS (DNS) are both deployed as containers
via MCP, but MCP itself depends on them:

- **MCR** is reachable through mc-proxy (L4 passthrough on `:8443`).
  The agent pulls images from MCR during `mcp deploy`.
- **MCNS** serves DNS for internal zones. Tailscale and the overlay
  network depend on DNS resolution.

This creates circular dependencies during cold-start or recovery:

```
mcp deploy → agent pulls image → needs MCR → needs mc-proxy
mcp deploy → agent dials MCR → DNS resolves hostname → needs MCNS
```
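
The cold-start order that follows is just a topological sort of these
dependency edges. A minimal sketch, with the edges as described above and
an illustrative helper (not agent code):

```go
package main

import "fmt"

// deps maps each infrastructure service to the services that must be
// running first: mc-proxy needs DNS, MCR is reached through mc-proxy,
// and the agent needs both MCR (image pulls) and MCNS (resolution).
var deps = map[string][]string{
	"mcns":     {},
	"mc-proxy": {"mcns"},
	"mcr":      {"mc-proxy", "mcns"},
	"agent":    {"mcr", "mcns"},
}

// bootOrder returns a start order in which every service comes after its
// dependencies (depth-first topological sort). Cycles are not handled:
// the cycle in the diagram is broken by starting the base services
// manually with raw podman, which removes their incoming edges.
func bootOrder(svcs []string) []string {
	var order []string
	seen := map[string]bool{}
	var visit func(string)
	visit = func(s string) {
		if seen[s] {
			return
		}
		seen[s] = true
		for _, d := range deps[s] {
			visit(d)
		}
		order = append(order, s)
	}
	for _, s := range svcs {
		visit(s)
	}
	return order
}

func main() {
	fmt.Println(bootOrder([]string{"agent"}))
	// → [mcns mc-proxy mcr agent]
}
```

This yields exactly the sequence of the cold-start procedure: MCNS, then
mc-proxy, then MCR, then the agent.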

**Cold-start procedure** (no containers running):

1. **Build images on the operator workstation** for mc-proxy, MCR, and
MCNS. Transfer them to rift via `docker save` / `scp` / `podman load`,
since the registry is not yet available:
```
docker save <image> -o /tmp/image.tar
scp /tmp/image.tar <rift-lan-ip>:/tmp/
# on rift, as mcp user:
podman load -i /tmp/image.tar
```
Use the LAN IP for scp, not a DNS name (DNS is not running yet).
2. **Start MCNS first** (DNS must come up before anything that resolves
hostnames). Run directly with podman since the MCP agent cannot reach
the registry yet:
```
podman run -d --name mcns --restart unless-stopped \
--sysctl net.ipv4.ip_unprivileged_port_start=53 \
-p <lan-ip>:53:53/tcp -p <lan-ip>:53:53/udp \
-p <overlay-ip>:53:53/tcp -p <overlay-ip>:53:53/udp \
-v /srv/mcns:/srv/mcns \
<mcns-image> server --config /srv/mcns/mcns.toml
```
3. **Start mc-proxy** (registry traffic routes through it):
```
podman run -d --name mc-proxy --network host \
--restart unless-stopped \
-v /srv/mc-proxy:/srv/mc-proxy \
<mc-proxy-image> server --config /srv/mc-proxy/mc-proxy.toml
```
4. **Start MCR** (API server, then web UI):
```
podman run -d --name mcr-api --network mcpnet \
--restart unless-stopped \
-p 127.0.0.1:28443:8443 -p 127.0.0.1:29443:9443 \
-v /srv/mcr:/srv/mcr \
<mcr-image> server --config /srv/mcr/mcr.toml
```
5. **Push images to MCR** from the operator workstation now that the
registry is reachable:
```
docker push <registry>/<image>:<tag>
```
6. **Start the MCP agent** (systemd service). It can now reach MCR for
image pulls.
7. **`mcp adopt`** the manually-started containers to bring them under
MCP management. Then `mcp service export` to generate service
definition files.

From this point, `mcp deploy` works normally. The manually-started
containers are replaced by MCP-managed ones on the next deploy.

**Recovery procedure** (mc-proxy or MCNS crashed):

If mc-proxy or MCNS goes down, the agent cannot pull images (registry
unreachable or DNS broken). Recovery:

1. Check if the required image is cached locally:
`podman images | grep <service>`
2. If cached, start the container directly with `podman run` (same
flags as the cold-start procedure above).
3. If not cached, transfer the image from the operator workstation via
`docker save` / `scp` / `podman load` using the LAN IP.
4. Once the infrastructure service is running, `mcp deploy` resumes
normal operation for other services.

---

## Security Model