Document SSO login flow in packaging and deployment guide

Add SSO redirect flow alongside direct credentials, MCIAS client registration steps, [sso] config section, and updated service versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 23:49:36 -07:00
parent e441df34c8
commit 8fb6374257
3 changed files with 799 additions and 8 deletions
--- a/docs/packaging-and-deployment.md
+++ b/docs/packaging-and-deployment.md
@@ -385,7 +385,14 @@ tags         = []
 level = "info"
 ```

-For services with a web UI, add:
+For services with SSO-enabled web UIs, add:
+
+```toml
+[sso]
+redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
+```
+
+For services with a separate web UI binary, add:

 ```toml
 [web]
@@ -433,18 +440,72 @@ these.
 ## 6. Authentication (MCIAS Integration)

 Every service delegates authentication to MCIAS. No service maintains
-its own user database.
+its own user database. Services support two login modes: **SSO
+redirect** (recommended for web UIs) and **direct credentials**
+(fallback / API clients).

-### Auth Flow
+### SSO Login (Web UIs)
+
+SSO is the preferred login method for web UIs. The flow is an OAuth
+2.0-style authorization code exchange:
+
+1. User visits the service and is redirected to `/login`.
+2. Login page shows a "Sign in with MCIAS" button.
+3. Click redirects to MCIAS (`/sso/authorize`), which authenticates the
+   user.
+4. MCIAS redirects back to the service's `/sso/callback` with an
+   authorization code.
+5. The service exchanges the code for a JWT via a server-to-server call
+   to MCIAS `POST /v1/sso/token`.
+6. The JWT is stored in a session cookie.
+
+SSO is enabled by adding an `[sso]` section to the service config and
+registering the service as an SSO client in MCIAS.
+
+**Service config:**
+
+```toml
+[sso]
+redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
+```
+
+**MCIAS config** (add to the `[[sso_clients]]` list):
+
+```toml
+[[sso_clients]]
+client_id    = "<service>"
+redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
+service_name = "<service>"
+```
+
+The `redirect_uri` must match exactly between the service config and
+the MCIAS client registration.
+
+When `[sso].redirect_uri` is empty or absent, the service falls back to
+the direct credentials form.
+
+**Implementation:** Services use `mcdsl/sso` (v1.7.0+) which handles
+state management, CSRF-safe cookies, and the code exchange. The web
+server registers three routes:
+
+| Route | Purpose |
+|-------|---------|
+| `GET /login` | Renders landing page with "Sign in with MCIAS" button |
+| `GET /sso/redirect` | Sets state cookies, redirects to MCIAS |
+| `GET /sso/callback` | Validates state, exchanges code for JWT, sets session |
+
+### Direct Credentials (API / Fallback)

 1. Client sends credentials to the service's `POST /v1/auth/login`.
-2. Service forwards them to MCIAS via the client library
-   (`git.wntrmute.dev/mc/mcias/clients/go`).
+2. Service forwards them to MCIAS via `mcdsl/auth.Authenticator.Login()`.
 3. MCIAS validates and returns a bearer token.
 4. Subsequent requests include `Authorization: Bearer <token>`.
-5. Service validates tokens via MCIAS `ValidateToken()`, cached for 30s
+5. Service validates tokens via `ValidateToken()`, cached for 30s
   (keyed by SHA-256 of the token).

+Web UIs use this mode when SSO is not configured, presenting a
+username/password/TOTP form instead of the SSO button.
+
 ### Roles

 | Role | Access |
@@ -685,10 +746,10 @@ For reference, these services are operational on the platform:
 | Service | Version | Node | Purpose |
 |---------|---------|------|---------|
 | MCIAS | v1.9.0 | (separate) | Identity and access |
-| Metacrypt | v1.3.1 | rift | Cryptographic service, PKI/CA |
+| Metacrypt | v1.4.1 | rift | Cryptographic service, PKI/CA |
 | MC-Proxy | v1.2.1 | rift | TLS proxy and router |
 | MCR | v1.2.1 | rift | Container registry |
 | MCNS | v1.1.1 | rift | Authoritative DNS |
 | MCDoc | v0.1.0 | rift | Documentation server |
-| MCQ | v0.2.0 | rift | Document review queue |
+| MCQ | v0.4.0 | rift | Document review queue |
 | MCP | v0.7.6 | rift | Control plane agent |
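
A minimal sketch of the state handling behind steps 3 to 5 of the "SSO Login (Web UIs)" section in the hunk above, using only the Go standard library. It is not the `mcdsl/sso` implementation. The query parameter names (`client_id`, `redirect_uri`, `state`) follow common OAuth 2.0 usage and are assumptions about MCIAS, as are the base URL `https://mcias.example` and the helper names `newState`, `authorizeURL`, and `stateMatches`.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
	"net/url"
)

// newState returns a random, URL-safe value that ties the outgoing
// redirect to the later callback (hypothetical helper).
func newState() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// authorizeURL builds the redirect target on MCIAS. The path comes from
// the guide; the parameter names are assumed, not confirmed.
func authorizeURL(mciasBase, clientID, redirectURI, state string) string {
	q := url.Values{}
	q.Set("client_id", clientID)
	q.Set("redirect_uri", redirectURI)
	q.Set("state", state)
	return mciasBase + "/sso/authorize?" + q.Encode()
}

// stateMatches compares the state stored in the cookie with the state
// returned on the callback, in constant time. An empty cookie never matches.
func stateMatches(cookieState, callbackState string) bool {
	if cookieState == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(cookieState), []byte(callbackState)) == 1
}

func main() {
	state, err := newState()
	if err != nil {
		panic(err)
	}
	redirect := "https://mcq.svc.mcp.metacircular.net/sso/callback"
	fmt.Println(authorizeURL("https://mcias.example", "mcq", redirect, state))
	fmt.Println("callback accepted:", stateMatches(state, state))
}
```

The callback handler would reject the request when `stateMatches` returns false, before making the server-to-server call to `POST /v1/sso/token`.
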
--- a/docs/phase-e-plan.md
+++ b/docs/phase-e-plan.md
@@ -0,0 +1,103 @@
+# Phase E: Multi-Node Orchestration
+
+Phase D (automated DNS registration) is complete. Phase E extends MCP from
+a single-node agent on rift to a multi-node fleet with a central master
+process.
+
+## Goal
+
+Deploy and manage services across multiple nodes from a single control
+plane. The operator runs `mcp deploy` and the system places the workload on
+the right node, provisions certs, registers DNS, and configures routing --
+same as today on rift, but across the fleet.
+
+## Fleet Topology
+
+| Node | OS | Arch | Role |
+|------|----|------|------|
+| desktop (TBD) | NixOS | amd64 | Control plane -- runs master + MCIAS + MCNS |
+| rift | NixOS | amd64 | Compute -- application services |
+| orion | NixOS | amd64 | Compute |
+| hyperborea | Debian | arm64 | Compute (Raspberry Pi) |
+| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
+
+Tailnet is the interconnect between all nodes. Public traffic enters via
+mc-proxy on svc, which forwards over Tailnet to compute nodes.
+
+## Components
+
+### Master (`mcp-master`)
+
+Long-lived orchestrator on the control plane node. Responsibilities:
+
+- Accept CLI commands and dispatch to the correct agent
+- Aggregate status from all agents (fleet-wide view)
+- Node selection when `node` is omitted from a service definition
+- Health-aware scheduling using agent heartbeat data
+
+The master is stateless in the durable sense -- it rebuilds its world view
+from agents on startup. If the master goes down, running services continue
+unaffected; only new deploys and rescheduling stop.
+
+### Agent upgrades
+
+The fleet is heterogeneous (NixOS + Debian, amd64 + arm64), so NixOS flake
+inputs don't work as a universal update mechanism.
+
+**Design:** MCP owns the binary at `/srv/mcp/mcp-agent` on all nodes.
+
+- `mcp agent upgrade [node]` -- CLI cross-compiles for the target's
+  GOARCH, SCPs the binary, restarts via SSH
+- Node config gains `ssh` (user@host) and `arch` (amd64/arm64) fields
+- rift's NixOS `ExecStart` changes from nix store path to
+  `/srv/mcp/mcp-agent`
+- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
+  `mcp-agent.service`
+
+Upgrades must be coordinated -- new RPCs cause `Unimplemented` errors on
+old agents.
+
+### Edge agents
+
+svc runs an agent but does NOT run containers. Its agent manages mc-proxy
+routing only: when the master provisions a service on a compute node, svc's
+agent updates mc-proxy routes to point at the compute node's Tailnet
+address.
+
+### MCIAS migration
+
+MCIAS moves from the svc VPS to the control plane node, running as an
+MCP-managed container with an independent lifecycle. Bootstrap order:
+
+1. MCIAS image pre-staged or pulled unauthenticated
+2. MCIAS starts (L4 passthrough through mc-proxy -- manages its own TLS)
+3. All other services bootstrap after MCIAS is up
+
+## Scheduling
+
+Three placement modes, in order of specificity:
+
+1. `node = "rift"` -- explicit placement on a named node
+2. `node = "pi-pool"` -- master picks within a named cluster
+3. `node` omitted -- master picks any compute node with capacity
+
+Resource-aware placement via agent heartbeats (CPU, memory, disk). RPis
+with 4-8 GB RAM need resource tracking more than beefy servers.
+
+## Open Questions
+
+- **Control plane machine**: which desktop becomes the always-on node?
+- **Heartbeat model**: agent push vs. master poll?
+- **Cluster definition**: explicit pool config in master vs. node labels/tags?
+- **MCIAS migration timeline**: when to cut over from svc to control plane?
+- **Agent on svc**: what subset of agent RPCs does an edge-only agent need?
+
+## What Phase E Does NOT Include
+
+These remain future work:
+
+- Auto-reconciliation (agent auto-restarting drifted containers)
+- Migration (snapshot streaming between nodes)
+- Web UI for fleet management
+- Observability / log aggregation
+- Object store