Document system account auth model in ARCHITECTURE.md

Replaces the "admin required for all operations" model with the new three-tier identity model: human operators for CLI, mcp-agent system account for infrastructure automation, admin reserved for MCIAS-level administration. Documents agent-to-service token paths and per-service authorization policies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 16:11:08 -07:00
parent 86d516acf6
commit 18365cc0a8
1 changed files with 133 additions and 18 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -121,9 +121,26 @@ option for future security hardening.
 ## Authentication and Authorization

 MCP follows the platform authentication model: all auth is delegated to
-MCIAS.
+MCIAS. The auth model separates three concerns: operator intent (CLI to
+agent), infrastructure automation (agent to platform services), and
+access control (who can do what).

-### Agent Authentication
+### Identity Model
+
+| Identity | Type | Purpose |
+|----------|------|---------|
+| Human operator (e.g., `kyle`) | human | CLI operations: deploy, stop, start, build |
+| `mcp-agent` | system | Agent-to-service automation: certs, DNS, routes, image pull |
+| Per-service accounts (e.g., `mcq`) | system | Scoped self-management (own DNS records only) |
+| `admin` role | role | MCIAS account management, policy changes, zone creation |
+| `guest` role | role | Explicitly rejected by the agent |
+
+The `admin` role is reserved for MCIAS-level administrative operations
+(account creation, policy management, zone mutations). Routine MCP
+operations (deploy, stop, start, build) do not require admin — any
+authenticated non-guest user or system account is accepted.
+
+### Agent Authentication (CLI → Agent)

 The agent is a gRPC server with a unary interceptor that enforces
 authentication on every RPC:
@@ -132,10 +149,34 @@ authentication on every RPC:
   (`authorization: Bearer <token>`).
 2. Agent extracts the token and validates it against MCIAS (cached 30s by
   SHA-256 of the token, per platform convention).
-3. Agent checks that the caller has the `admin` role. All MCP operations
-   require admin -- there is no unprivileged MCP access.
+3. Agent rejects guests (`guest` role → `PERMISSION_DENIED`). All other
+   authenticated users and system accounts are accepted.
 4. If validation fails, the RPC returns `UNAUTHENTICATED` (invalid/expired
-   token) or `PERMISSION_DENIED` (valid token, not admin).
+   token) or `PERMISSION_DENIED` (valid token with the `guest` role).
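
Reviewer sketch (not part of this commit): the interceptor's decision rule from steps 2-4 can be expressed as a small pure function. The `Identity` type, the `Authorize` name, and the choice to reject any identity that carries `guest` alongside other roles are assumptions made for illustration; the real agent's types and MCIAS client are not shown in this diff.

```go
package main

import "fmt"

// Decision codes named after the gRPC status codes used in the text.
const (
	OK               = "OK"
	Unauthenticated  = "UNAUTHENTICATED"
	PermissionDenied = "PERMISSION_DENIED"
)

// Identity is a hypothetical result of validating a bearer token
// against MCIAS. Valid is false for invalid or expired tokens.
type Identity struct {
	Valid bool
	Roles []string
}

// Authorize applies the documented rule: an invalid token is
// UNAUTHENTICATED, a guest is PERMISSION_DENIED, and every other
// authenticated user or system account (including one with no roles)
// is accepted.
func Authorize(id Identity) string {
	if !id.Valid {
		return Unauthenticated
	}
	for _, r := range id.Roles {
		if r == "guest" {
			// Assumption: guest wins even if other roles are present.
			return PermissionDenied
		}
	}
	return OK
}

func main() {
	fmt.Println(Authorize(Identity{Valid: false}))
	fmt.Println(Authorize(Identity{Valid: true, Roles: []string{"guest"}}))
	fmt.Println(Authorize(Identity{Valid: true})) // e.g. a role-less system account
}
```

Note that a role-less token such as the `mcp-agent` system account passes this check, which matches the "no admin required" intent of the change.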
+
+### Agent Service Authentication (Agent → Platform Services)
+
+The agent authenticates to platform services as the `mcp-agent` system
+account, using long-lived tokens. Credentials are stored per service:
+
+| Service | Token Path | Operations |
+|---------|------------|------------|
+| Metacrypt | `/srv/mcp/metacrypt-token` | TLS cert provisioning (PKI issue) |
+| MCNS | `/srv/mcp/mcns-token` | DNS record create/delete (any name) |
+| mc-proxy | Unix socket (no auth) | Route registration/removal |
+| MCR | podman auth store | Image pull (JWT-as-password) |
+
+These tokens are issued by MCIAS for the `mcp-agent` system account.
+They carry no roles; each service enforces its own authorization
+rules:
+
+- **Metacrypt:** Policy rule grants `mcp-agent` write access to
+  `engine/pki/issue`.
+- **MCNS:** Code-level authorization: system account `mcp-agent` can
+  manage any record; other system accounts can only manage records
+  matching their username.
+- **MCR:** Default policy allows all authenticated users to push/pull.
+  MCR accepts MCIAS JWTs as passwords at the `/v2/token` endpoint.
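
Reviewer sketch (not part of this commit): the MCNS rule above could look roughly like the function below. The diff does not say what "matching their username" means precisely; this sketch assumes the first DNS label of the record name must equal the account name (case-insensitively). The function name `canManageRecord` and the example record names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// canManageRecord reports whether a system account may create or
// delete the given DNS record. mcp-agent may manage any record; any
// other account may manage only records whose first label equals its
// own name (an assumed interpretation of "matching their username").
func canManageRecord(account, record string) bool {
	if account == "mcp-agent" {
		return true
	}
	name := strings.TrimSuffix(record, ".") // tolerate a trailing root dot
	label := strings.SplitN(name, ".", 2)[0]
	return label != "" && strings.EqualFold(label, account)
}

func main() {
	fmt.Println(canManageRecord("mcp-agent", "anything.example.test"))
	fmt.Println(canManageRecord("mcq", "mcq.example.test"))
	fmt.Println(canManageRecord("mcq", "mcr.example.test"))
}
```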

 ### CLI Authentication

@@ -148,6 +189,15 @@ obtained by:

 The stored token is used for all subsequent agent RPCs until it expires.

+### MCR Registry Authentication
+
+`mcp build` auto-authenticates to MCR before pushing images. It reads
+the CLI's stored MCIAS token and uses it as the password for `podman
+login`. MCR's token endpoint accepts MCIAS JWTs as passwords (the
+personal-access-token pattern), so both human and system account tokens
+work. This eliminates the need for a separate interactive `podman login`
+step.
+
 ---

 ## Services and Components
@@ -224,6 +274,9 @@ mcp pull <service> <path> [local-file] Copy a file from /srv/<service>/<path> to
 mcp node list                          List registered nodes
 mcp node add <name> <address>          Register a node
 mcp node remove <name>                 Deregister a node
+
+mcp agent upgrade [node]               Build, push, and restart agent on all (or one) node(s)
+mcp agent status                       Show agent version on each node
 ```

 ### Service Definition Files
@@ -1144,20 +1197,84 @@ The agent's data directory follows the platform convention:

 ### Agent Deployment (on nodes)

-The agent is deployed like any other Metacircular service:
+#### Provisioning (one-time per node)

-1. Provision the `mcp` system user via NixOS config (with podman access
-   and subuid/subgid ranges for rootless containers).
+Each node needs a one-time setup before the agent can run. The steps are
+the same regardless of OS, but the mechanism differs:
+
+1. Create `mcp` system user with podman access and subuid/subgid ranges.
 2. Set `/srv/` ownership to the `mcp` user (the agent creates and manages
   `/srv/<service>/` directories for all services).
 3. Create `/srv/mcp/` directory and config file.
 4. Provision TLS certificate from Metacrypt.
 5. Create an MCIAS system account for the agent (`mcp-agent`).
-6. Install the `mcp-agent` binary.
-7. Start via systemd unit.
+6. Install the initial `mcp-agent` binary to `/srv/mcp/mcp-agent`.
+7. Install and start the systemd unit.

-The agent runs as a systemd service. Container-first deployment is a v2
-concern -- MCP needs to be running before it can manage its own agent.
+On **NixOS** (rift), provisioning is declarative via the NixOS config.
+The NixOS config owns the infrastructure (user, systemd unit, podman,
+directories, permissions) but **not** the binary. `ExecStart` points to
+`/srv/mcp/mcp-agent`, a mutable path that MCP manages. NixOS may
+bootstrap the initial binary there, but subsequent updates come from MCP.
+
+On **Debian** (hyperborea, svc), provisioning is done via a setup script
+or ansible playbook that creates the same layout.
+
+#### Binary Location
+
+The agent binary lives at `/srv/mcp/mcp-agent` on **all** nodes,
+regardless of OS. This unifies the update mechanism across the fleet.
+
+#### Agent Upgrades
+
+After initial provisioning, the agent binary is updated via
+`mcp agent upgrade`. The CLI:
+
+1. Cross-compiles the agent for each target architecture
+   (`GOARCH=amd64` for rift/svc, `GOARCH=arm64` for hyperborea).
+2. SSHs to each node, pushes the binary to `/srv/mcp/mcp-agent.new`.
+3. Atomically swaps the binary (`mv mcp-agent.new mcp-agent`).
+4. Restarts the systemd service (`systemctl restart mcp-agent`).
+
+SSH is used instead of gRPC because:
+- It works even when the agent is broken or has an incompatible version.
+- The binary is ~17 MB, which exceeds gRPC's default 4 MB message size limit.
+- No self-restart coordination needed.
+
+The CLI uses `golang.org/x/crypto/ssh` for native SSH, keeping the
+entire workflow in a single binary with no external tool dependencies.
+
+#### Node Configuration
+
+Node config includes SSH and architecture info for agent management:
+
+```toml
+[[nodes]]
+name = "rift"
+address = "100.95.252.120:9444"
+ssh = "rift"           # SSH host (from ~/.ssh/config or hostname)
+arch = "amd64"         # GOARCH for cross-compilation
+
+[[nodes]]
+name = "hyperborea"
+address = "100.x.x.x:9444"
+ssh = "hyperborea"
+arch = "arm64"
+```
+
+#### Coordinated Upgrades
+
+New MCP releases often add new RPCs. A CLI at v0.6.0 calling an agent
+at v0.5.0 fails with `Unimplemented`. Therefore agent upgrades must be
+coordinated: `mcp agent upgrade` (with no node argument) upgrades all
+nodes before the CLI is used for other operations.
+
+If a node fails to upgrade, it is reported but the others still proceed.
+The operator can retry or investigate via SSH.
+
+#### Systemd Unit
+
+The systemd unit is the same on all nodes:

 ```ini
 [Unit]
@@ -1167,7 +1284,7 @@ Wants=network-online.target

 [Service]
 Type=simple
-ExecStart=/usr/local/bin/mcp-agent server --config /srv/mcp/mcp-agent.toml
+ExecStart=/srv/mcp/mcp-agent server --config /srv/mcp/mcp-agent.toml
 Restart=on-failure
 RestartSec=5

@@ -1175,17 +1292,14 @@ User=mcp
 Group=mcp

 NoNewPrivileges=true
-ProtectSystem=strict
-ProtectHome=true
+ProtectSystem=full
+ProtectHome=false
 PrivateTmp=true
 PrivateDevices=true
 ProtectKernelTunables=true
 ProtectKernelModules=true
-ProtectControlGroups=true
 RestrictSUIDSGID=true
-RestrictNamespaces=true
 LockPersonality=true
-MemoryDenyWriteExecute=true
 RestrictRealtime=true
 ReadWritePaths=/srv

@@ -1195,6 +1309,7 @@ WantedBy=multi-user.target

 Note: `ReadWritePaths=/srv` (not `/srv/mcp`) because the agent writes
 files to any service's `/srv/<service>/` directory on behalf of the CLI.
+`ProtectHome=false` because the `mcp` user's home is `/srv/mcp`.

 ### CLI Installation (on operator workstation)