Compare commits

7 Commits

Author SHA1 Message Date
95bec6a095 Sync docs/metacircular.md versions and add undeploy capability
Update version references to match current git tags: MCIAS v1.9.0,
Metacrypt v1.3.1, MCP v0.7.6. Add Phase D (DNS registration) to MCP
status, update RPC/CLI counts, and document undeploy as a first-class
capability. Also sync STATUS.md and packaging-and-deployment.md with
the same version updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:06:24 -07:00
faf58ceb72 Update packaging docs: fix stale versions, add missing CLI commands
Sync the deployed services version table with current git tags (MCIAS
v1.9.0, Metacrypt v1.3.1, MCR v1.2.1, MCNS v1.1.1, MCP v0.7.6) and
add mcp logs and mcp edit to the command reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:06:13 -07:00
bce32654e1 Sync platform docs: Phase D complete, Phase E planned, version updates
- PLATFORM_EVOLUTION: Mark Phase D (DNS) complete, add Phase E
  (multi-node agent management) planning with items #10-12
- PLATFORM_EVOLUTION: Fix stale mcdsl reference (v1.2.0 adds → added,
  consuming services now on v1.4.0)
- STATUS: Update all service versions to current, note Phase A-D
  completion and Phase E planning
- docs/packaging-and-deployment: Add agent management section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:05:37 -07:00
0123e6e29a Update metacircular.md with system account auth model
Document the three-tier identity model across MCIAS, MCR, Metacrypt,
MCNS, and MCP. Update version numbers for MCR (v1.2.1), MCNS (v1.1.1),
MCP (v0.7.2). Clarify that admin is reserved for MCIAS administration,
not routine deploy operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 16:11:21 -07:00
86bbfa640f Add CLI security standard: never echo passwords
New standard requiring mcdsl/terminal.ReadPassword for interactive
password prompts. Codifies the fix applied to MCP's login command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:11:53 -07:00
cadbb3f234 Sync packaging docs with MCP implementation reality
Fix service definition examples to require explicit image fields,
remove phantom version top-level field, document container listen
address requirement (0.0.0.0:$PORT not localhost), add undeploy
command, and update deployed service versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 21:55:23 -07:00
a777c3ff8b Mark Phase C complete in PLATFORM_EVOLUTION.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 14:59:35 -07:00
12 changed files with 93 additions and 3371 deletions

View File

@@ -17,17 +17,12 @@ Metacircular is a multi-service personal infrastructure platform. This root repo
| `mcat/` | MCIAS login policy tester — lightweight web app to test and audit login policies | Go | | `mcat/` | MCIAS login policy tester — lightweight web app to test and audit login policies | Go |
| `mcdsl/` | Standard library — shared packages for auth, db, config, HTTP/gRPC servers, CSRF, snapshots | Go | | `mcdsl/` | Standard library — shared packages for auth, db, config, HTTP/gRPC servers, CSRF, snapshots | Go |
| `mcdoc/` | Documentation server — renders markdown from Gitea, serves public docs via mc-proxy | Go | | `mcdoc/` | Documentation server — renders markdown from Gitea, serves public docs via mc-proxy | Go |
| `mcq/` | Document review queue — push docs for review, MCP server for Claude integration | Go | | `mcp/` | Control plane — operator-driven deployment, service registry, container lifecycle (master/agent) | Go |
| `mcp/` | Control plane — service deployment, container lifecycle, multi-node fleet management (CLI/agent, master in development) | Go |
| `mcns/` | Networking service — custom Go DNS server, authoritative for internal zones | Go | | `mcns/` | Networking service — custom Go DNS server, authoritative for internal zones | Go |
| `ca/` | PKI infrastructure and secrets for dev/test (not source code, gitignored) | — | | `ca/` | PKI infrastructure and secrets for dev/test (not source code, gitignored) | — |
| `docs/` | Platform-wide documentation (architecture overview, deployment guide) | Markdown |
| `log/` | Operational work logs and session notes | Markdown |
Each subproject has its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`. When working in a subproject, read its own CLAUDE.md first. Each subproject has its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`. When working in a subproject, read its own CLAUDE.md first.
Some subprojects (mcat, mcdoc, mcq) may live at sibling paths (e.g., `../mcq/`) rather than as subdirectories, depending on workspace layout.
## Service Dependencies ## Service Dependencies
MCIAS is the root dependency — every other service authenticates through it. No service maintains its own user database. The dependency graph: MCIAS is the root dependency — every other service authenticates through it. No service maintains its own user database. The dependency graph:
@@ -38,7 +33,6 @@ mcias (standalone — no MCIAS dependency)
├── mc-proxy (uses MCIAS for admin auth) ├── mc-proxy (uses MCIAS for admin auth)
├── mcr (uses MCIAS for auth + policy) ├── mcr (uses MCIAS for auth + policy)
├── mcdoc (public, no MCIAS — fetches docs from Gitea) ├── mcdoc (public, no MCIAS — fetches docs from Gitea)
├── mcq (uses MCIAS for auth; document review queue)
├── mcp (uses MCIAS for auth; orchestrates deployment and lifecycle) ├── mcp (uses MCIAS for auth; orchestrates deployment and lifecycle)
├── mcns (uses MCIAS for auth; authoritative DNS for internal zones) ├── mcns (uses MCIAS for auth; authoritative DNS for internal zones)
└── mcat (tests MCIAS login policies) └── mcat (tests MCIAS login policies)

View File

@@ -5,7 +5,7 @@ from its current manually-wired state to fully declarative deployment.
It is a living design document — not a spec, not a commitment, but a It is a living design document — not a spec, not a commitment, but a
record of where we are, where we want to be, and what's between. record of where we are, where we want to be, and what's between.
Last updated: 2026-04-02 (Phases A-D complete, Phase E in progress) Last updated: 2026-03-28 (Phases A + B + C + D complete)
--- ---
@@ -324,11 +324,9 @@ mcproxyctl, TOML editing, cert generation, or DNS zone editing.
### Immediate Next Steps ### Immediate Next Steps
1. **Phase E: Multi-node agent management** — see below. See 1. **Phase E: Multi-node agent management** — see below.
`docs/phase-e-plan.md` for the detailed Phase E design and 2. **mcdoc implementation** — fully designed, no platform evolution
`docs/architecture-v2.md` for the v2 master/agent architecture. dependency. Deployable now with the new route system.
2. **mcdoc** — deployed (v0.1.0 on rift). Serves rendered markdown
from Gitea via mc-proxy.
#### 10. Agent Binary Location Convention #### 10. Agent Binary Location Convention
@@ -382,9 +380,8 @@ Different for NixOS vs Debian.
| Node | OS | Arch | Status | | Node | OS | Arch | Status |
|------|----|------|--------| |------|----|------|--------|
| rift | NixOS | amd64 | Operational, single MCP agent | | rift | NixOS | amd64 | Operational, single MCP agent |
| hyperborea | Debian (RPi) | arm64 | Online, agent provisioning planned | | hyperborea | Debian (RPi) | arm64 | Online, needs agent provisioning |
| svc | Debian | amd64 | Runs MCIAS, needs agent for public edge services | | svc | Debian | amd64 | Runs MCIAS, needs agent for public edge services |
| orion | NixOS | amd64 | Provisioned, offline for maintenance |
--- ---

View File

@@ -25,9 +25,7 @@ lives in [docs/metacircular.md](docs/metacircular.md).
| **MC-Proxy** | Node ingress — TLS proxy and router. L4 passthrough or L7 terminating (per-route), PROXY protocol, firewall with rate limiting and GeoIP. | Implemented | | **MC-Proxy** | Node ingress — TLS proxy and router. L4 passthrough or L7 terminating (per-route), PROXY protocol, firewall with rate limiting and GeoIP. | Implemented |
| **MCNS** | Networking — authoritative DNS for internal platform zones, upstream forwarding. | Implemented | | **MCNS** | Networking — authoritative DNS for internal platform zones, upstream forwarding. | Implemented |
| **MCP** | Control plane — operator-driven deployment, service registry, data transfer, master/agent container lifecycle. | Implemented | | **MCP** | Control plane — operator-driven deployment, service registry, data transfer, master/agent container lifecycle. | Implemented |
| **MCDoc** | Documentation server — renders markdown from Gitea, serves public docs. | Implemented | | **MCDoc** | Documentation server — renders markdown from Gitea, serves public docs. | In progress |
| **MCQ** | Document review queue — push docs for review, MCP server for Claude Code integration. | Implemented |
| **MCDeploy** | Deployment CLI — tactical bridge tool, now deprecated and archived. Superseded by MCP. | Deprecated |
Shared library: **MCDSL** — standard library for all services (auth, db, Shared library: **MCDSL** — standard library for all services (auth, db,
config, TLS server, CSRF, snapshots). config, TLS server, CSRF, snapshots).
@@ -104,9 +102,7 @@ metacircular/
├── mcns/ DNS server ├── mcns/ DNS server
├── mcat/ Login policy tester ├── mcat/ Login policy tester
├── mcdsl/ Standard library (shared packages) ├── mcdsl/ Standard library (shared packages)
├── mcdeploy/ Deployment CLI (deprecated, archived)
├── mcdoc/ Documentation server ├── mcdoc/ Documentation server
├── mcq/ Document review queue
├── ca/ PKI infrastructure (dev/test, not source code) ├── ca/ PKI infrastructure (dev/test, not source code)
└── docs/ Platform-wide documentation └── docs/ Platform-wide documentation
``` ```

149
STATUS.md
View File

@@ -1,45 +1,41 @@
# Metacircular Platform Status # Metacircular Platform Status
Last updated: 2026-04-02 Last updated: 2026-03-28
## Platform Overview ## Platform Overview
Two nodes operational (**rift** + **svc**), with **orion** provisioned but One node operational (**rift**), running core infrastructure services as
offline for maintenance. Core infrastructure services run as containers on containers fronted by MC-Proxy. MCIAS runs separately (not on rift).
rift, fronted by MC-Proxy. Svc operates as an MCP edge node managing Bootstrap phases 0-4 complete (MCIAS, Metacrypt, MC-Proxy, MCR all
mc-proxy routing only (no containers); MCIAS runs on svc separately as a operational). MCP is deployed and managing all platform containers. MCNS is
systemd service. Bootstrap phases 0-4 complete (MCIAS, Metacrypt, MC-Proxy, deployed on rift, serving authoritative DNS. Platform evolution Phases A-D
MCR all operational). MCP is deployed and managing all platform containers complete (automated port assignment, route registration, TLS cert
on rift, with multi-node capability (svc as edge node). MCNS is deployed on provisioning, and DNS registration). Multi-node deployment is being planned
rift, serving authoritative DNS. Platform evolution Phases A-D complete (Phase E).
(automated port assignment, route registration, TLS cert provisioning, and
DNS registration). Phase E (multi-node expansion) is in planning, with v2
architecture in development.
## Service Status ## Service Status
| Service | Version | SDLC Phase | Deployed | Node | | Service | Version | SDLC Phase | Deployed | Node |
|---------|---------|------------|----------|------| |---------|---------|------------|----------|------|
| MCIAS | v1.10.5 | Maintenance | Yes | svc (systemd) | | MCIAS | v1.9.0 | Maintenance | Yes | (separate) |
| Metacrypt | v1.4.1 | Production | Yes | rift | | Metacrypt | v1.3.1 | Production | Yes | rift |
| MC-Proxy | v1.2.2 | Maintenance | Yes | rift | | MC-Proxy | v1.2.1 | Maintenance | Yes | rift |
| MCR | v1.3.2 | Production | Yes | rift | | MCR | v1.2.1 | Production | Yes | rift |
| MCAT | v1.2.0 | Production | Yes | rift | | MCAT | v1.1.1 | Complete | Unknown | — |
| MCDSL | v1.7.0 | Stable | N/A (library) | — | | MCDSL | v1.4.0 | Stable | N/A (library) | — |
| MCNS | v1.2.0 | Production | Yes | rift | | MCNS | v1.1.1 | Production | Yes | rift |
| MCDoc | v0.1.0 | Production | Yes | rift | | MCP | v0.7.6 | Production | Yes | rift |
| MCQ | v0.4.2 | Production | Yes | rift | | MCDoc | v0.1.0 | Active dev | No | — |
| MCP | v0.9.0 | Production | Yes | rift |
## Service Details ## Service Details
### MCIAS — Identity and Access Service ### MCIAS — Identity and Access Service
- **Version:** v1.10.5 (client library: clients/go/v0.2.0) - **Version:** v1.9.0 (client library: clients/go/v0.2.0)
- **Phase:** Maintenance. Phases 0-14 complete. Feature-complete with active - **Phase:** Maintenance. Phases 0-14 complete. Feature-complete with active
refinement. refinement.
- **Deployment:** Running in production on svc as a systemd service. All - **Deployment:** Running in production. All other services authenticate
other services authenticate against it. against it.
- **Recent work:** WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login - **Recent work:** WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login
policies, Nix flake for CLI tools. policies, Nix flake for CLI tools.
- **Artifacts:** systemd units (service + backup timer), install script, - **Artifacts:** systemd units (service + backup timer), install script,
@@ -47,7 +43,7 @@ architecture in development.
### Metacrypt — Cryptographic Service Engine ### Metacrypt — Cryptographic Service Engine
- **Version:** v1.4.1 (API v1.3.1, Web v1.4.1). - **Version:** v1.3.1.
- **Phase:** Production. All four engine types implemented (CA, SSH CA, transit, - **Phase:** Production. All four engine types implemented (CA, SSH CA, transit,
user-to-user). Active work on integration test coverage. user-to-user). Active work on integration test coverage.
- **Deployment:** Running on rift as a container, fronted by MC-Proxy on - **Deployment:** Running on rift as a container, fronted by MC-Proxy on
@@ -59,21 +55,19 @@ architecture in development.
### MC-Proxy — TLS Proxy and Router ### MC-Proxy — TLS Proxy and Router
- **Version:** v1.2.2. - **Version:** v1.2.1.
- **Phase:** Maintenance. Stable and actively routing traffic on rift and svc. - **Phase:** Maintenance. Stable and actively routing traffic on rift.
- **Deployment:** Running on rift. Fronts Metacrypt, MCR, and sgard on ports - **Deployment:** Running on rift. Fronts Metacrypt, MCR, and sgard on ports
443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091. Routes persisted 443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091. Routes persisted
in SQLite and managed via gRPC API. Svc runs its own mc-proxy on :443 with in SQLite and managed via gRPC API.
public-facing routes. - **Recent work:** MCR route additions, Nix flake, L7 backend cert handling,
- **Recent work:** Route persistence (SQLite), idempotent AddRoute (upsert), Prometheus metrics, L7 policies.
golangci-lint v2 compliance, module path migration to mc/ org.
- **Artifacts:** systemd units (service + backup timer), Docker Compose - **Artifacts:** systemd units (service + backup timer), Docker Compose
(standard + rift), install and backup scripts, rift config. (standard + rift), install and backup scripts, rift config.
### MCR — Container Registry ### MCR — Container Registry
- **Version:** v1.3.2 (API v1.2.1, Web v1.3.2). All implementation phases - **Version:** v1.2.1. All implementation phases complete.
complete.
- **Phase:** Production. Deployed on rift, serving container images. - **Phase:** Production. Deployed on rift, serving container images.
- **Deployment:** Running on rift as two containers (mcr API + mcr-web), - **Deployment:** Running on rift as two containers (mcr API + mcr-web),
fronted by MC-Proxy on ports 443 (web, L7), 8443 (API, L4), and fronted by MC-Proxy on ports 443 (web, L7), 8443 (API, L4), and
@@ -86,26 +80,26 @@ architecture in development.
### MCAT — Login Policy Tester ### MCAT — Login Policy Tester
- **Version:** v1.2.0. - **Version:** v1.1.1.
- **Phase:** Production. Deployed on rift as a container managed by MCP. - **Phase:** Complete. Diagnostic tool, not core infrastructure.
- **Deployment:** Running on rift. Lightweight tool for testing MCIAS login - **Deployment:** Available for ad-hoc use. Lightweight tool for testing
policy rules. MCIAS login policy rules.
- **Recent work:** Migrated to mcdsl for auth, config, CSRF, and web. - **Recent work:** Migrated to mcdsl for auth, config, CSRF, and web.
- **Artifacts:** systemd unit, install script, example config. - **Artifacts:** systemd unit, install script, example config.
### MCDSL — Standard Library ### MCDSL — Standard Library
- **Version:** v1.7.0. - **Version:** v1.4.0.
- **Phase:** Stable. All 9 packages implemented and tested. Being adopted - **Phase:** Stable. All 9 packages implemented and tested. Being adopted
across the platform. across the platform.
- **Deployment:** N/A (Go library, imported by other services). - **Deployment:** N/A (Go library, imported by other services).
- **Packages:** auth, db, config, httpserver, grpcserver, csrf, web, health, - **Packages:** auth, db, config, httpserver, grpcserver, csrf, web, health,
archive. archive.
- **Adoption:** All services except mcias on v1.7.0. mcias pending. - **Adoption:** All services except mcias on v1.4.0. mcias pending.
### MCNS — Networking Service ### MCNS — Networking Service
- **Version:** v1.2.0. - **Version:** v1.1.1.
- **Phase:** Production. Custom Go DNS server replacing CoreDNS precursor. - **Phase:** Production. Custom Go DNS server replacing CoreDNS precursor.
- **Deployment:** Running on rift as a container managed by MCP. Serves two - **Deployment:** Running on rift as a container managed by MCP. Serves two
authoritative zones plus upstream forwarding. REST + gRPC APIs with MCIAS authoritative zones plus upstream forwarding. REST + gRPC APIs with MCIAS
@@ -115,75 +109,52 @@ architecture in development.
- **Artifacts:** Dockerfile, Docker Compose (rift), MCP service definition, - **Artifacts:** Dockerfile, Docker Compose (rift), MCP service definition,
systemd units, install script, example config. systemd units, install script, example config.
### MCP — Control Plane
- **Version:** v0.7.6.
- **Phase:** Production. Phases A-D complete (automated port assignment, route
registration, TLS cert provisioning, DNS registration).
- **Deployment:** Running on rift. Agent as systemd service under `mcp` user
with rootless podman. Manages metacrypt, mc-proxy, mcr, and mcns containers.
- **Architecture:** Two components — `mcp` CLI (thin client on vade) and
`mcp-agent` (per-node daemon with SQLite registry, podman management,
monitoring with drift/flap detection). gRPC-only (no REST). 15 RPCs, 17+
CLI commands.
- **Recent work:** Phase C (automated TLS cert provisioning via Metacrypt CA),
Phase D (automated DNS registration via MCNS), undeploy command, logs
command, edit command, auto-login to MCR, system account auth model.
- **Artifacts:** systemd service (NixOS), TLS cert from Metacrypt, service
definition files, design docs.
### MCDoc — Documentation Server ### MCDoc — Documentation Server
- **Version:** v0.1.0. - **Version:** v0.1.0.
- **Phase:** Production. Fetches and renders markdown documentation from Gitea. - **Phase:** Active development.
- **Deployment:** Running on rift as a container, fronted by MC-Proxy on - **Deployment:** Not yet deployed.
port 443 (L7). - **Description:** Documentation server — fetches markdown from Gitea, renders
- **Recent work:** Initial implementation, Gitea content fetching, goldmark HTML, serves public docs via mc-proxy. No MCIAS auth required.
rendering with syntax highlighting, webhook-driven refresh.
- **Artifacts:** Dockerfile, MCP service definition.
### MCQ — Document Review Queue
- **Version:** v0.4.2.
- **Phase:** Production. Document review queue with MCP server for Claude
integration.
- **Deployment:** Running on rift as a container managed by MCP.
- **Recent work:** Claude MCP server integration, document review workflow.
- **Artifacts:** Dockerfile, MCP service definition.
### MCP — Control Plane
- **Version:** v0.9.0 (agent on rift: v0.8.3-dirty, agent on svc: v0.9.0).
- **Phase:** Production. Phases A-D complete. Multi-node capable with svc
operating as an edge node. V2 architecture in development, Phase E planning
underway.
- **Deployment:** Running on rift. Agent as systemd service under `mcp` user
with rootless podman. Manages metacrypt, mc-proxy, mcr, mcns, mcdoc, mcat,
mcq, and non-platform containers. Svc runs an MCP agent for edge mc-proxy
route management.
- **Architecture:** Two components — `mcp` CLI (thin client on vade) and
`mcp-agent` (per-node daemon with SQLite registry, podman management,
monitoring with drift/flap detection, route registration with mc-proxy,
automated TLS cert provisioning for L7 routes via Metacrypt CA, automated
DNS registration in MCNS). gRPC-only (no REST). 15 RPCs, 17+ CLI commands.
- **Recent work:** Phase C (automated TLS cert provisioning), Phase D
(automated DNS registration via MCNS), undeploy command, logs command,
edit command, auto-login to MCR, system account auth model, module path
migration, multi-node edge support (svc).
- **Artifacts:** systemd service (NixOS), TLS cert from Metacrypt, service
definition files, design docs.
## Node Inventory ## Node Inventory
| Node | Address (LAN) | Address (Tailscale) | Role | | Node | Address (LAN) | Address (Tailscale) | Role |
|------|---------------|---------------------|------| |------|---------------|---------------------|------|
| rift | 192.168.88.181 | 100.95.252.120 | Infrastructure services | | rift | 192.168.88.181 | 100.95.252.120 | Infrastructure services |
| svc | — | 100.106.232.4 | Edge — public mc-proxy, MCIAS (systemd) |
| orion | 192.168.88.214 | — | Worker (provisioned, offline for maintenance) |
## Rift Port Map ## Rift Port Map
Note: Services deployed via MCP receive dynamically assigned host ports Note: Services deployed via MCP receive dynamically assigned host ports
(10000-60000). The ports below are for infrastructure services with static (10000-60000). The ports below are for infrastructure services with static
assignments or well-known ports. assignments.
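The dynamic host port range can be illustrated with a small Go sketch. The docs above do not say how MCP chooses a port inside the 10000 to 60000 range, so the lowest-free policy, the `nextFreePort` name, and the `used` set are all assumptions made for illustration, not MCP's actual allocator.

```go
package main

import "fmt"

const (
	minPort = 10000 // lower bound of the dynamic host port range
	maxPort = 60000 // upper bound of the dynamic host port range
)

// nextFreePort returns the lowest port in [minPort, maxPort] that is not
// marked in used. The lowest-free policy is hypothetical; the real agent
// may pick ports differently.
func nextFreePort(used map[int]bool) (int, bool) {
	for p := minPort; p <= maxPort; p++ {
		if !used[p] {
			return p, true
		}
	}
	return 0, false // range exhausted
}

func main() {
	used := map[int]bool{10000: true, 10001: true}
	port, ok := nextFreePort(used)
	fmt.Println(port, ok)
}
```

Statically assigned ports such as 53 or 443 sit outside this range, so a dynamic allocation cannot collide with them.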
| Port | Protocol | Services | | Port | Protocol | Services |
|------|----------|----------| |------|----------|----------|
| 53 | DNS (LAN + Tailscale) | mcns | | 53 | DNS (LAN + Tailscale) | mcns |
| 443 | L7 (TLS termination) | metacrypt-web, mcr-web, mcdoc, mcat, kls | | 443 | L7 (TLS termination) | metacrypt-web, mcr-web |
| 8080 | HTTP (all interfaces) | exod | | 8080 | HTTP (all interfaces) | exod |
| 8443 | L4 (SNI passthrough) | metacrypt API, mcr API | | 8443 | L4 (SNI passthrough) | metacrypt API, mcr API |
| 9090 | HTTP (all interfaces) | exod | | 9090 | HTTP (all interfaces) | exod |
| 9443 | L4 (SNI passthrough) | metacrypt gRPC, mcr gRPC, sgard | | 9443 | L4 (SNI passthrough) | metacrypt gRPC, mcr gRPC, sgard |
| 9091 | HTTP (loopback) | MC-Proxy Prometheus metrics | | 9091 | HTTP (loopback) | MC-Proxy Prometheus metrics |
Svc runs its own mc-proxy on :443 (L7) with public-facing routes: Non-platform services also running on rift: **exod** (ports 8080/9090),
kls.metacircular.net, mcq.metacircular.net, metacrypt.metacircular.net, **sgardd** (port 19473, fronted by MC-Proxy on 9443).
docs.metacircular.net, git.metacircular.net, git.wntrmute.dev.
Non-platform services also running on rift: **exod** (v0.1.0, ports
8080/9090), **sgardd** (v3.2.0, port 19473, fronted by MC-Proxy on 9443),
**kls** (v0.2.0).

View File

@@ -1,158 +0,0 @@
# svc.metacircular.net — Phase 1 Work Log
Date: 2026-04-02
Purpose: Deploy mcp-agent to svc (edge node) for MCP v2 Phase 1.
## Changes Made
### 1. Created `mcp` system user
```
useradd --system --home-dir /srv/mcp --create-home --shell /usr/sbin/nologin mcp
usermod -aG mc-proxy mcp
```
- UID 992, GID 991
- Member of `mc-proxy` group for socket access
### 2. Created `/srv/mcp/` directory structure
```
/srv/mcp/
├── mcp-agent # binary (v0.8.3-1-gfa8ba6f, linux/amd64)
├── mcp-agent.toml # agent config
├── mcp.db # SQLite registry (created on first run)
└── certs/
├── cert.pem # TLS cert (SAN: IP:100.106.232.4, DNS:svc.svc.mcp.metacircular.net)
├── key.pem # TLS private key
└── ca.pem # Metacircular CA cert
```
- Owned by `mcp:mcp`, key file mode 0600
### 3. TLS certificate
- Issued from the Metacircular CA (`ca/ca.pem` + `ca/ca.key`)
- Subject: `CN=mcp-agent-svc`
- SANs: `IP:100.106.232.4`, `DNS:svc.svc.mcp.metacircular.net`
- Validity: 365 days
- Stored at `/srv/mcp/certs/{cert,key,ca}.pem`
### 4. Agent configuration
- File: `/srv/mcp/mcp-agent.toml`
- gRPC listen: `100.106.232.4:9555` (port 9444 in use by MCNS)
- MCIAS: `https://mcias.metacircular.net:8443`
- mc-proxy socket: `/srv/mc-proxy/mc-proxy.sock`
- Node name: `svc`
- Runtime: `podman` (not used on edge, but required by config)
### 5. systemd unit
- File: `/etc/systemd/system/mcp-agent.service`
- Runs as `mcp:mcp`
- Security hardened (NoNewPrivileges, ProtectSystem=strict, etc.)
- ReadWritePaths: `/srv/mcp`, `/srv/mc-proxy/mc-proxy.sock`
- Enabled and started
### 6. mc-proxy directory permissions
- Changed `/srv/mc-proxy/` from `drwx------` to `drwxr-x---` (group traversal)
- Changed `/srv/mc-proxy/mc-proxy.sock` from `srw-------` to `srw-rw----` (group read/write)
- Required for `mcp` user (in `mc-proxy` group) to access the socket
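The permission change above can be checked with a short Go sketch. The helper names are hypothetical; the sketch only tests permission bits and assumes the caller has already obtained the mode (for example from `os.Stat`).

```go
package main

import (
	"fmt"
	"io/fs"
)

// groupCanUseSocket reports whether the owning group has read and write
// permission, which a group member needs to connect to the socket.
func groupCanUseSocket(mode fs.FileMode) bool {
	const groupRW = 0o060
	return mode.Perm()&groupRW == groupRW
}

// groupCanTraverse reports whether the owning group has read and execute
// permission on a directory, which is needed to reach files inside it.
func groupCanTraverse(mode fs.FileMode) bool {
	const groupRX = 0o050
	return mode.Perm()&groupRX == groupRX
}

func main() {
	// Modes before and after the change described in the work log.
	fmt.Println(groupCanUseSocket(0o600), groupCanUseSocket(0o660))
	fmt.Println(groupCanTraverse(0o700), groupCanTraverse(0o750))
}
```

Both checks have to pass for the `mcp` user: the directory must be traversable by the group and the socket itself must be group read/write.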
### 7. MCP CLI config update (on rift)
- Added svc node to `~/.config/mcp/mcp.toml`:
```toml
[[nodes]]
name = "svc"
address = "100.106.232.4:9555"
```
## Verification
```
$ mcp node list
NAME ADDRESS VERSION
rift 100.95.252.120:9444 v0.8.3-dirty
svc 100.106.232.4:9555 v0.8.3-1-gfa8ba6f
$ mcp route list -n svc
NODE: svc
mc-proxy v1.2.1-2-g82fce41-dirty
:443 routes=6
l7 git.wntrmute.dev → 127.0.0.1:3000
l7 kls.metacircular.net → 100.95.252.120:58080
l7 mcq.metacircular.net → 100.95.252.120:48080
l7 metacrypt.metacircular.net → 100.95.252.120:18080 (re-encrypt)
l7 docs.metacircular.net → 100.95.252.120:38080
l7 git.metacircular.net → 127.0.0.1:3000
```
## Agent Cert Reissue (2026-04-02)
Both agent certs reissued with comprehensive SANs:
**Rift agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `rift.scylla-hammerhead.ts.net`, `mcp-agent.svc.mcp.metacircular.net`
- IP: `100.95.252.120`, `192.168.88.181`
**Svc agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `svc.scylla-hammerhead.ts.net`, `svc.svc.mcp.metacircular.net`
- IP: `100.106.232.4`
Both agents upgraded to v0.10.0 (Phase 2 edge routing RPCs + v2 proto fields).
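
A SAN list like the ones above can be verified in Go before clients are switched to a new address. This is a minimal sketch: `missingSANs` and `svcAgentCert` are hypothetical names, and the certificate is an in-memory stand-in carrying the svc agent's SANs rather than the real file parsed from `/srv/mcp/certs/cert.pem`.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net"
)

// missingSANs returns every name (DNS name or IP literal) that the
// certificate does not cover according to VerifyHostname.
func missingSANs(cert *x509.Certificate, names []string) []string {
	var missing []string
	for _, n := range names {
		if err := cert.VerifyHostname(n); err != nil {
			missing = append(missing, n)
		}
	}
	return missing
}

// svcAgentCert builds a stand-in certificate with the svc agent SANs
// listed in the work log. A real check would parse the PEM file instead.
func svcAgentCert() *x509.Certificate {
	return &x509.Certificate{
		DNSNames: []string{
			"svc.scylla-hammerhead.ts.net",
			"svc.svc.mcp.metacircular.net",
		},
		IPAddresses: []net.IP{net.ParseIP("100.106.232.4")},
	}
}

func main() {
	wanted := []string{
		"svc.scylla-hammerhead.ts.net",
		"100.106.232.4",
		"192.168.88.181", // rift's LAN address, not expected on the svc cert
	}
	fmt.Println(missingSANs(svcAgentCert(), wanted))
}
```

Any name returned by the check would fail TLS verification if a client dialed the agent by that name.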
## MCP Master Deployment (2026-04-02)
**Binary**: `/srv/mcp-master/mcp-master` (v0.10.0) on rift
**Config**: `/srv/mcp-master/mcp-master.toml`
**Database**: `/srv/mcp-master/master.db`
**Certs**: `/srv/mcp-master/certs/{cert,key,ca}.pem`
- SAN: `rift.scylla-hammerhead.ts.net`, `mcp-master.svc.mcp.metacircular.net`, IP `100.95.252.120`
**Service token**: `/srv/mcp-master/mcias-token` (MCIAS identity: `mcp-master`, expires 2027-04-03)
**Listen**: `100.95.252.120:9555`
**Bootstrap nodes**: rift (master), svc (edge)
**Status**: Running via `doas` (ad-hoc). NixOS read-only /etc prevents
direct systemd unit creation — needs NixOS config update for persistent
service.
**Tested**:
- `mcp deploy mcq` → master places on rift, forwards to agent ✓
- `mcp undeploy mcq` → master forwards to agent, cleans up placement ✓
- `mcp ps` → fleet-wide status through agents ✓
- `mcp node list` → both nodes visible with versions ✓
## CLI Config Changes (vade)
Updated `~/.config/mcp/mcp.toml`:
- Added `[master]` section: `address = "rift.scylla-hammerhead.ts.net:9555"`
- All node addresses switched to Tailscale DNS names
- Added CA cert path
## Known Limitations
- ~~mc-proxy socket permissions will reset on restart~~ **FIXED**: mc-proxy
now creates the socket with 0660 (was 0600). Committed to mc-proxy master.
- Master runs ad-hoc via `doas` on rift. Needs NixOS systemd config for
persistent service (rift has read-only /etc).
- DNS registration not configured on master (MCNS config omitted for now).
- Edge routing not yet tested end-to-end through master (svc cert provisioning
not configured).
- The TLS cert was issued from the local CA directly, not via Metacrypt API.
Should be re-issued via Metacrypt once the agent has cert provisioning.
- Container runtime is set to `podman` but podman is not installed on svc
(Docker is). Edge agents don't run containers so this is benign.
- Metacrypt and MCNS integrations not configured (not needed for edge role).
## Edge Routing E2E Test (2026-04-02)
Full edge routing flow tested successfully through the master:
1. Added `public = true` route for `mcq.metacircular.net` to mcq service def
2. `mcp deploy mcq` → master placed on rift, deployed, registered DNS, set up edge route on svc
3. Svc agent provisioned TLS cert from Metacrypt (expires 2026-06-28)
4. mc-proxy route created: `mcq.metacircular.net → 100.95.252.120:8443` (re-encrypt)
5. Edge route persisted in both master DB and svc agent registry
**Fix required**: `RouteDef` in servicedef.go was missing `Public` field —
TOML `public = true` was silently dropped. Fixed in v0.10.2 along with
`Tier` field and relaxed `Node` validation.
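
The failure mode behind this fix, a key with no matching struct field being dropped without an error, can be reproduced in a few lines of Go. The sketch below is an analogy only: it uses `encoding/json` from the standard library in place of the TOML decoder, and the struct and function names are hypothetical rather than taken from `servicedef.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// routeDefOld mirrors the bug: it has no field for the "public" key.
type routeDefOld struct {
	Hostname string `json:"hostname"`
}

// routeDefFixed adds the missing field.
type routeDefFixed struct {
	Hostname string `json:"hostname"`
	Public   bool   `json:"public"`
}

const input = `{"hostname": "mcq.metacircular.net", "public": true}`

// decodeLenient ignores unknown keys, so "public" vanishes without an error.
func decodeLenient() error {
	var r routeDefOld
	return json.Unmarshal([]byte(input), &r)
}

// decodeStrict rejects keys that have no matching struct field.
func decodeStrict() error {
	var r routeDefOld
	dec := json.NewDecoder(strings.NewReader(input))
	dec.DisallowUnknownFields()
	return dec.Decode(&r)
}

// decodeFixed decodes into the corrected struct and returns the flag.
func decodeFixed() (bool, error) {
	var r routeDefFixed
	err := json.Unmarshal([]byte(input), &r)
	return r.Public, err
}

func main() {
	fmt.Println("lenient error:", decodeLenient())
	fmt.Println("strict error: ", decodeStrict())
	public, err := decodeFixed()
	fmt.Println("fixed:", public, err)
}
```

A strict decoding mode in tests is one way to surface this class of bug early; whether the TOML library used by MCP offers an equivalent is not stated here.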
## DNS Registration Working (2026-04-02)
Master registers Tailnet IPs in MCNS (not LAN IPs). Fix in v0.10.1
resolves Tailscale DNS names to IPs before passing to MCNS.
`mcq.svc.mcp.metacircular.net → 100.95.252.120` (Tailnet IP) ✓
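
The address selection step can be sketched in Go. This is an assumption-laden illustration, not the v0.10.1 fix itself: `pickTailnetIP` is a hypothetical helper, and it assumes the candidates come from resolving the node's DNS name (for example with `net.LookupHost`). Tailscale assigns node addresses from the CGNAT range `100.64.0.0/10`, which is what the filter relies on.

```go
package main

import (
	"fmt"
	"net/netip"
)

// tailnetPrefix is the CGNAT range that Tailscale assigns node addresses from.
var tailnetPrefix = netip.MustParsePrefix("100.64.0.0/10")

// pickTailnetIP returns the first candidate inside the Tailnet range.
// Unparseable entries and LAN addresses are skipped.
func pickTailnetIP(candidates []string) (string, bool) {
	for _, c := range candidates {
		addr, err := netip.ParseAddr(c)
		if err != nil {
			continue
		}
		if tailnetPrefix.Contains(addr) {
			return addr.String(), true
		}
	}
	return "", false
}

func main() {
	// rift's LAN and Tailscale addresses from the node inventory.
	ip, ok := pickTailnetIP([]string{"192.168.88.181", "100.95.252.120"})
	fmt.Println(ip, ok)
}
```

Filtering on the prefix keeps LAN addresses such as `192.168.88.181` out of the registered record even when a name resolves to both.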

View File

@@ -8,11 +8,12 @@ Go binaries built with `CGO_ENABLED=0` and installed via Nix flakes.
| Tool | Project | Purpose | Install target | | Tool | Project | Purpose | Install target |
|------|---------|---------|---------------| |------|---------|---------|---------------|
| `mcp` | mcp | Control plane CLI — deploy, status, lifecycle, file transfer | vade, orion | | `mcp` | mcp | Control plane CLI — deploy, status, lifecycle, file transfer | vade, orion |
| `mcp-agent` | mcp | Control plane agent — per-node container management daemon | rift, svc (systemd) | | `mcp-agent` | mcp | Control plane agent — per-node container management daemon | rift (systemd) |
| `mciasctl` | mcias | MCIAS admin CLI — accounts, tokens, policies | vade, orion, rift | | `mciasctl` | mcias | MCIAS admin CLI — accounts, tokens, policies | vade, orion, rift |
| `mciasgrpcctl` | mcias | MCIAS gRPC debug CLI | vade, orion, rift | | `mciasgrpcctl` | mcias | MCIAS gRPC debug CLI | vade, orion, rift |
| `mcproxyctl` | mc-proxy | MC-Proxy admin CLI — routes, firewall, status | vade, orion, rift | | `mcproxyctl` | mc-proxy | MC-Proxy admin CLI — routes, firewall, status | vade, orion, rift |
| `mcrctl` | mcr | MCR admin CLI — repositories, policies, audit | vade, orion, rift | | `mcrctl` | mcr | MCR admin CLI — repositories, policies, audit | vade, orion, rift |
| `mcdeploy` | mcdeploy | Deployment CLI — build, push, deploy (bridge tool) | vade |
### Server-only binaries (not installed as tools) ### Server-only binaries (not installed as tools)
@@ -28,8 +29,6 @@ These run inside containers and are not installed on operator workstations:
| `mc-proxy` | mc-proxy | TLS proxy server | | `mc-proxy` | mc-proxy | TLS proxy server |
| `mcns` | mcns | DNS server | | `mcns` | mcns | DNS server |
| `mcat` | mcat | Login policy tester web app | | `mcat` | mcat | Login policy tester web app |
| `mcdoc` | mcdoc | Documentation server |
| `mcq` | mcq | Document review queue |
## Installation ## Installation
@@ -47,18 +46,15 @@ via `mcpkg.nix` in the NixOS configuration. Adding a tool:
- Build: `pkgs.buildGoModule` with `vendorHash = null` (vendored deps). - Build: `pkgs.buildGoModule` with `vendorHash = null` (vendored deps).
- ldflags: `-s -w -X main.version=${version}`. - ldflags: `-s -w -X main.version=${version}`.
- `subPackages`: list only the client binaries, not servers. - `subPackages`: list only the client binaries, not servers.
- `system`: `x86_64-linux` for rift, svc, and orion; `aarch64-linux` - `system`: `x86_64-linux` (all machines are x86_64).
for hyperborea. Flakes that target the full fleet should support both.
### MCP agent ### MCP agent
The `mcp-agent` is a special case: it runs as a systemd service on The `mcp-agent` is a special case: it runs as a systemd service on
managed nodes (not as a container, since it manages containers). Its managed nodes (not as a container, since it manages containers). Its
flake exposes both `mcp` (client) and `mcp-agent` (server). Phase E is flake exposes both `mcp` (client) and `mcp-agent` (server). The NixOS
moving the agent binary to `/srv/mcp/mcp-agent` on all nodes — NixOS config in `configs/mcp.nix` references the Nix package path for the
`ExecStart` will point there instead of a nix store path, and Debian agent binary.
nodes use the same layout. svc already follows this convention. See
`docs/phase-e-plan.md` for details.
## Flake status ## Flake status
@@ -67,10 +63,9 @@ nodes use the same layout. svc already follows this convention. See
| mcias | Yes | mciasctl, mciasgrpcctl | Yes | | | mcias | Yes | mciasctl, mciasgrpcctl | Yes | |
| mc-proxy | Yes | mcproxyctl | Yes | | | mc-proxy | Yes | mcproxyctl | Yes | |
| mcr | Yes | mcrctl | Yes | | | mcr | Yes | mcrctl | Yes | |
| mcdeploy | Yes | mcdeploy | Yes | |
| mcp | Yes | mcp, mcp-agent | Yes | Agent also used by mcp.nix systemd unit | | mcp | Yes | mcp, mcp-agent | Yes | Agent also used by mcp.nix systemd unit |
| mcns | No | — | No | Server-only, no client tool yet | | mcns | No | — | No | Server-only, no client tool yet |
| metacrypt | No | — | No | Server-only, no client tool yet | | metacrypt | No | — | No | Server-only, no client tool yet |
| mcat | No | — | No | Server-only, no client tool yet | | mcat | No | — | No | Server-only, no client tool yet |
| mcdoc | No | — | No | Server-only, deployed as container |
| mcq | No | — | No | Server-only, document review queue |
| mcdsl | No | — | No | Library, no binaries | | mcdsl | No | — | No | Library, no binaries |

File diff suppressed because it is too large


@@ -1,344 +0,0 @@
# Disaster Recovery: Bootstrap from Zero
This document covers recovering the Metacircular platform when all
services on rift are down and no containers are running. It assumes:
- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (`/srv/<service>/`) are intact on disk
- Container images are cached in podman's overlay storage
If images are NOT cached (fresh machine, disk wipe), see the
"Cold Start" section at the end.
## Prerequisites
Before starting recovery, verify:
```bash
# 1. Machine is up
hostname # should print "rift"
# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)
# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)
# 4. The agent is running
systemctl status mcp-agent
# Should be active
# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 1 (the first line of output is the column header)
```
If Tailscale is not running: `doas systemctl start tailscaled && doas tailscale up`
If the agent is not running: check `/srv/mcp/mcp-agent` exists and
`/srv/mcp/mcp-agent.toml` is correct, then `doas systemctl restart mcp-agent`.
## Recovery Order
Services must be started in dependency order. Each stage must be
healthy before the next starts.
```
Stage 1 (Foundation): MCNS → DNS works
Stage 2 (Core): mc-proxy, MCR, Metacrypt → routing + images + certs
Stage 3 (Management): mcp-master → orchestration
Stage 4 (Services): mcq, mcdoc, mcat, kls, sgard, exo → applications
```
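The rule that each stage must be healthy before the next starts can be enforced with a small retry helper. The following is a minimal sketch, not part of the platform tooling: `wait_healthy`, `WAIT_ATTEMPTS`, and `WAIT_DELAY` are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
# wait_healthy DESCRIPTION CHECK_COMMAND...
# Re-runs CHECK_COMMAND until it exits 0, or gives up after
# WAIT_ATTEMPTS tries spaced WAIT_DELAY seconds apart.
# Both variables are hypothetical knobs for this sketch.
wait_healthy() {
    local description=$1
    shift
    local attempts=${WAIT_ATTEMPTS:-30}
    local delay=${WAIT_DELAY:-2}
    local i
    for ((i = 1; i <= attempts; i++)); do
        # The check's own output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            echo "ok: ${description}"
            return 0
        fi
        # Pause between attempts, but not after the final one.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    echo "FAILED: ${description}" >&2
    return 1
}
```

For example, `wait_healthy "MCNS answers queries" dig @192.168.88.181 google.com +short` could gate Stage 2 on Stage 1. `dig` exits non-zero when no server replies, so this confirms that MCNS is reachable but does not inspect the answer itself.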
## Manual Recovery Commands
All commands run as the mcp user. Use this shell prefix:
```bash
# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'
# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift
```
For brevity, commands below show the `podman run` portion only. Prefix
with the environment setup above.
### Stage 1: MCNS (DNS)
MCNS must start first. Without it, no hostname resolution works.
```bash
podman run -d --name mcns --restart unless-stopped \
-p 192.168.88.181:53:53/tcp \
-p 192.168.88.181:53:53/udp \
-p 100.95.252.120:53:53/tcp \
-p 100.95.252.120:53:53/udp \
-p 127.0.0.1:38443:8443 \
-v /srv/mcns:/srv/mcns \
mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
server --config /srv/mcns/mcns.toml
```
**Verify:**
```bash
dig @192.168.88.181 google.com +short
# Should return an IP address
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP
```
**Note:** MCNS binds to specific IPs, not `0.0.0.0`, because
systemd-resolved holds port 53 on localhost. The explicit bindings
avoid the conflict.
### Stage 2: Core Infrastructure
#### mc-proxy (TLS routing)
```bash
podman run -d --name mc-proxy --restart unless-stopped \
--network host \
-v /srv/mc-proxy:/srv/mc-proxy \
mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
server --config /srv/mc-proxy/mc-proxy.toml
```
**Verify:** `curl -sk https://localhost:443/ 2>&1 | head -1`
(should get a response, even if 404)
#### MCR (Container Registry)
```bash
# API server
podman run -d --name mcr-api --restart unless-stopped \
-v /srv/mcr:/srv/mcr \
-p 127.0.0.1:28443:8443 \
-p 127.0.0.1:29443:9443 \
mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
server --config /srv/mcr/mcr.toml
# Web UI
podman run -d --name mcr-web --restart unless-stopped \
--user 0:0 \
-v /srv/mcr:/srv/mcr \
-p 127.0.0.1:28080:8080 \
mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
server --config /srv/mcr/mcr.toml
```
**If MCR fails with "chmod" or "readonly database":**
```bash
podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api
```
This recreates the database empty. Image blobs in `/srv/mcr/layers/`
are preserved but tag metadata is lost. Re-push images to rebuild the
registry.
#### Metacrypt (PKI / Secrets)
```bash
# API server
podman run -d --name metacrypt-api --restart unless-stopped \
-v /srv/metacrypt:/srv/metacrypt \
-p 127.0.0.1:18443:8443 \
-p 127.0.0.1:19443:9443 \
mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
server --config /srv/metacrypt/metacrypt.toml
# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
-v /srv/metacrypt:/srv/metacrypt \
-p 127.0.0.1:18080:8080 \
mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
--config /srv/metacrypt/metacrypt.toml
```
**If Metacrypt fails with "chmod" or "readonly database":**
Same fix as MCR: delete the database files. **Warning:** this loses
all encrypted secrets, the record of issued certificates, and CA state.
The CA key itself is in the sealed vault (password-protected), not in SQLite.
### Stage 3: MCP Master
```bash
podman run -d --name mcp-master --restart unless-stopped \
--network host \
-v /srv/mcp-master:/srv/mcp-master \
mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
server --config /srv/mcp-master/mcp-master.toml
```
**Verify:**
```bash
# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion
```
### Stage 4: Application Services
Once the master is running, deploy applications through MCP:
```bash
mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct
```
Or start them manually:
```bash
# MCQ
podman run -d --name mcq --restart unless-stopped \
-v /srv/mcq:/srv/mcq \
-p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
server --config /srv/mcq/mcq.toml
# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
-v /srv/mcdoc:/srv/mcdoc \
-p 127.0.0.1:38080:8080 \
mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
server --config /srv/mcdoc/mcdoc.toml
# MCAT
podman run -d --name mcat --restart unless-stopped \
-v /srv/mcat:/srv/mcat \
-p 127.0.0.1:48116:8443 \
mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
server --config /srv/mcat/mcat.toml
# KLS
podman run -d --name kls --restart unless-stopped \
-v /srv/kls:/srv/kls \
-p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
-f /srv/kls/kls.conf
# Sgard
podman run -d --name sgardd --restart unless-stopped \
-v /srv/sgard:/srv/sgard \
-p 127.0.0.1:19473:9473 \
mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
--repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
--tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key
```
## Verification Checklist
After all services are running:
```bash
# Fleet status
mcp ps
# All services should show "running"
# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# MCIAS (runs on svc, should be unaffected by rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health
# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/
# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health
# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```
## Common Errors
### "chmod: operation not permitted"
modernc.org/sqlite calls `fchmod()` on database files. This is denied
inside rootless podman user namespaces. Fix:
```bash
# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>
```
The `fchmod` error will still appear in logs as a warning but is
non-fatal for newly created databases.
### "address already in use" on port 53
systemd-resolved holds port 53 on localhost. MCNS must bind to
specific IPs, not `0.0.0.0:53`. Use explicit port bindings:
`-p 192.168.88.181:53:53 -p 100.95.252.120:53:53`
### "connection refused" to MCR
MCR is down. Images are cached locally — you can start services that
use cached images without MCR. MCR itself starts from its cached
image.
### Agent shows "error" for all nodes
Check:
1. Tailscale is running on both the CLI machine and the target node
2. The agent is listening: `ss -tlnp | grep 9444`
3. The CLI config has the correct addresses
4. TLS certs have the right SANs for the Tailnet IP
### "podman: executable file not found"
This warning appears for svc (which uses Docker, not podman). It's
benign — svc is an edge node that doesn't run containers.
## Cold Start (No Cached Images)
If the machine was wiped and no images are cached:
1. **MCIAS** runs on svc (Docker container), not rift. It should be
unaffected by a rift failure. Verify: `ssh svc.metacircular.net
"docker ps | grep mcias"`.
2. **Pre-stage images** by pulling from a backup or building locally:
```bash
# On vade (operator workstation), build and push to a temp location
cd ~/src/metacircular/mcns && make docker
podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
ssh rift "podman load"
```
Repeat for each service.
3. Alternatively, if another node has MCR access, push images there
first, then pull from the running MCR instance.
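The "repeat for each service" step in item 2 could be scripted roughly as follows. This is a sketch under assumptions: it only prints the `podman save | ssh ... podman load` pipelines rather than running them, it assumes the images have already been built locally, and `prestage_commands` and the `IMAGES` array are hypothetical names. The tags are the ones listed in the Service Reference table.

```bash
#!/usr/bin/env bash
REGISTRY="mcr.svc.mcp.metacircular.net:8443"

# Stage 1-3 images, with tags copied from the Service Reference table.
IMAGES=(
    mcns:v1.2.0
    mc-proxy:v1.2.2
    mcr:v1.2.1
    mcr-web:v1.3.2
    metacrypt:v1.3.1
    metacrypt-web:v1.4.1
    mcp-master:v0.10.3
)

# prestage_commands HOST
# Prints one save/load pipeline per image, in recovery order.
# Nothing is executed; review the output, then run it by hand.
prestage_commands() {
    local host=$1 image
    for image in "${IMAGES[@]}"; do
        echo "podman save ${REGISTRY}/${image} | ssh ${host} \"podman load\""
    done
}
```

Calling `prestage_commands rift` prints the pipelines for rift. Printing instead of executing keeps the sketch safe to run anywhere; on the real node the load may need to run as the `mcp` user so the images land in the rootless storage that the agent uses.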
## Service Reference
Quick reference for all services, their images, and critical flags:
| Service | Image | Network | Key Ports | Config Path |
|---------|-------|---------|-----------|-------------|
| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml |
| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml |
| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml |
| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml |
| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml |
| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml |
| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml |
| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml |
| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml |
| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml |
| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf |
| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) |
All images are prefixed with `mcr.svc.mcp.metacircular.net:8443/`.


@@ -9,8 +9,8 @@ the platform's internals.
## Platform Overview ## Platform Overview
Metacircular is a multi-service infrastructure platform. Services are Metacircular is a multi-service infrastructure platform. Services are
Go binaries running as containers across a fleet of Linux nodes, Go binaries running as containers on Linux nodes, managed by these core
managed by these core components: components:
| Component | Role | | Component | Role |
|-----------|------| |-----------|------|
@@ -21,27 +21,8 @@ managed by these core components:
| **MCNS** (DNS) | Authoritative DNS for `*.svc.mcp.metacircular.net` | | **MCNS** (DNS) | Authoritative DNS for `*.svc.mcp.metacircular.net` |
The operator workflow is: **build image → push to MCR → write service The operator workflow is: **build image → push to MCR → write service
definition → deploy via MCP**. MCP handles port assignment, TLS cert definition → deploy via MCP**. MCP handles port assignment, route
provisioning, route registration, DNS registration, and container registration, and container lifecycle.
lifecycle.
### Fleet Topology
The platform runs across multiple nodes connected via Tailnet:
| Node | Role | OS | Arch | Purpose |
|------|------|----|------|---------|
| **rift** | Compute + core infra | NixOS | amd64 | Runs most services (Metacrypt, MCR, MCNS, etc.) |
| **svc** | Edge | Debian | amd64 | Public-facing mc-proxy, routes traffic over Tailnet to compute nodes |
| **orion** | Compute | NixOS | amd64 | Provisioned, currently offline |
**Node roles:**
- **Compute nodes** (rift, orion, future RPis) run the full container
lifecycle via rootless Podman.
- **Edge nodes** (svc) run mc-proxy for public traffic routing only.
The MCP agent on edge nodes manages mc-proxy routes but does not run
application containers.
--- ---
@@ -53,7 +34,7 @@ The platform runs across multiple nodes connected via Tailnet:
| Container engine | Docker or Podman (for building images) | | Container engine | Docker or Podman (for building images) |
| `mcp` CLI | Installed on the operator workstation | | `mcp` CLI | Installed on the operator workstation |
| MCR access | Credentials to push images to `mcr.svc.mcp.metacircular.net:8443` | | MCR access | Credentials to push images to `mcr.svc.mcp.metacircular.net:8443` |
| MCP agent | Running on the target node (`rift` for services, `svc` for edge routing) | | MCP agent | Running on the target node (currently `rift`) |
| MCIAS account | For `mcp` CLI authentication to the agent | | MCIAS account | For `mcp` CLI authentication to the agent |
--- ---
@@ -339,20 +320,14 @@ recreates the containers.
### What Happens During Deploy ### What Happens During Deploy
1. Agent assigns a free host port (10000–60000) for each declared route. 1. Agent assigns a free host port (10000–60000) for each declared route.
2. For L7 routes, agent provisions a TLS certificate from Metacrypt CA 2. Agent starts containers with `$PORT` / `$PORT_<NAME>` environment
(via `POST /v1/engine/request`). Certs are written to
`/srv/mc-proxy/certs/<service>.pem` and `.key`. Existing valid certs
(more than 30 days from expiry) are reused.
3. Agent starts containers with `$PORT` / `$PORT_<NAME>` environment
variables set to the assigned ports. variables set to the assigned ports.
4. Agent registers routes with mc-proxy via gRPC (hostname → 3. Agent registers routes with mc-proxy (hostname → `127.0.0.1:<port>`,
`<node-address>:<port>`, mode, TLS cert paths). mode, TLS cert paths).
5. Agent registers DNS entries in MCNS for 4. Agent records the full state in its SQLite registry.
`<service>.svc.mcp.metacircular.net`.
6. Agent records the full state in its SQLite registry.
On stop (`mcp stop <service>`), the agent reverses the process: removes On stop (`mcp stop <service>`), the agent reverses the process: removes
DNS entries, removes mc-proxy routes, then stops containers. mc-proxy routes, then stops containers.
--- ---
@@ -410,14 +385,7 @@ tags = []
level = "info" level = "info"
``` ```
For services with SSO-enabled web UIs, add: For services with a web UI, add:
```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```
For services with a separate web UI binary, add:
```toml ```toml
[web] [web]
@@ -465,72 +433,18 @@ these.
## 6. Authentication (MCIAS Integration) ## 6. Authentication (MCIAS Integration)
Every service delegates authentication to MCIAS. No service maintains Every service delegates authentication to MCIAS. No service maintains
its own user database. Services support two login modes: **SSO its own user database.
redirect** (recommended for web UIs) and **direct credentials**
(fallback / API clients).
### SSO Login (Web UIs) ### Auth Flow
SSO is the preferred login method for web UIs. The flow is an OAuth
2.0-style authorization code exchange:
1. User visits the service and is redirected to `/login`.
2. Login page shows a "Sign in with MCIAS" button.
3. Click redirects to MCIAS (`/sso/authorize`), which authenticates the
user.
4. MCIAS redirects back to the service's `/sso/callback` with an
authorization code.
5. The service exchanges the code for a JWT via a server-to-server call
to MCIAS `POST /v1/sso/token`.
6. The JWT is stored in a session cookie.
SSO is enabled by adding an `[sso]` section to the service config and
registering the service as an SSO client in MCIAS.
**Service config:**
```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```
**MCIAS config** (add to the `[[sso_clients]]` list):
```toml
[[sso_clients]]
client_id = "<service>"
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
service_name = "<service>"
```
The `redirect_uri` must match exactly between the service config and
the MCIAS client registration.
When `[sso].redirect_uri` is empty or absent, the service falls back to
the direct credentials form.
**Implementation:** Services use `mcdsl/sso` (v1.7.0+) which handles
state management, CSRF-safe cookies, and the code exchange. The web
server registers three routes:
| Route | Purpose |
|-------|---------|
| `GET /login` | Renders landing page with "Sign in with MCIAS" button |
| `GET /sso/redirect` | Sets state cookies, redirects to MCIAS |
| `GET /sso/callback` | Validates state, exchanges code for JWT, sets session |
### Direct Credentials (API / Fallback)
1. Client sends credentials to the service's `POST /v1/auth/login`. 1. Client sends credentials to the service's `POST /v1/auth/login`.
2. Service forwards them to MCIAS via `mcdsl/auth.Authenticator.Login()`. 2. Service forwards them to MCIAS via the client library
(`git.wntrmute.dev/mc/mcias/clients/go`).
3. MCIAS validates and returns a bearer token. 3. MCIAS validates and returns a bearer token.
4. Subsequent requests include `Authorization: Bearer <token>`. 4. Subsequent requests include `Authorization: Bearer <token>`.
5. Service validates tokens via `ValidateToken()`, cached for 30s 5. Service validates tokens via MCIAS `ValidateToken()`, cached for 30s
(keyed by SHA-256 of the token). (keyed by SHA-256 of the token).
Web UIs use this mode when SSO is not configured, presenting a
username/password/TOTP form instead of the SSO button.
### Roles ### Roles
| Role | Access | | Role | Access |
@@ -771,10 +685,9 @@ For reference, these services are operational on the platform:
| Service | Version | Node | Purpose | | Service | Version | Node | Purpose |
|---------|---------|------|---------| |---------|---------|------|---------|
| MCIAS | v1.9.0 | (separate) | Identity and access | | MCIAS | v1.9.0 | (separate) | Identity and access |
| Metacrypt | v1.4.1 | rift | Cryptographic service, PKI/CA | | Metacrypt | v1.3.1 | rift | Cryptographic service, PKI/CA |
| MC-Proxy | v1.2.1 | rift, svc | TLS proxy and router (svc handles public edge) | | MC-Proxy | v1.2.1 | rift | TLS proxy and router |
| MCR | v1.2.1 | rift | Container registry | | MCR | v1.2.1 | rift | Container registry |
| MCNS | v1.1.1 | rift | Authoritative DNS | | MCNS | v1.1.1 | rift | Authoritative DNS |
| MCDoc | v0.1.0 | rift | Documentation server | | MCDoc | v0.1.0 | rift | Documentation server |
| MCQ | v0.4.0 | rift | Document review queue | | MCP | v0.7.6 | rift | Control plane agent |
| MCP | v0.7.6 | rift, svc | Control plane agent |


@@ -1,127 +0,0 @@
# Phase E: Multi-Node Orchestration
Phase D (automated DNS registration) is complete. Phase E extends MCP from
a single-node agent on rift to a multi-node fleet with a central master
process.
## Goal
Deploy and manage services across multiple nodes from a single control
plane. The operator runs `mcp deploy` and the system places the workload on
the right node, provisions certs, registers DNS, and configures routing --
same as today on rift, but across the fleet.
## Fleet Topology
| Node | OS | Arch | Role |
|------|----|------|------|
| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
| orion | NixOS | amd64 | Worker |
| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
Tailnet is the interconnect between all nodes. Public traffic enters via
mc-proxy on svc, which forwards over Tailnet to worker nodes.
## Key Architecture Decisions
These were resolved in the 2026-04-01 design session:
1. **Rift is the master node.** No separate straylight machine. Core infra
stays on rift, which gains mcp-master alongside its existing agent.
2. **Master-mediated coordination.** Agents never talk to each other. All
cross-node operations go through the master. Agents only dial the master
(for registration and heartbeats) and respond to master RPCs.
3. **Agent self-registration.** Agents register with the master on startup
(name, role, address, arch). The master maintains the live node registry.
No static `[[nodes]]` config required except for bootstrap.
4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
(with resource data). If the master misses 3 heartbeats (90s), it
actively probes the agent. Failed probe marks the node unhealthy.
5. **Tier-based placement.** `tier = "core"` runs on the master node.
`tier = "worker"` (default) is auto-placed on a worker with capacity.
Explicit `node = "orion"` overrides tier for pinned services.
6. **Two separate certs for public services.** Internal cert
(`svc.mcp.metacircular.net`) issued by worker agent. Public cert
(`metacircular.net`) issued by edge agent. Internal names never
appear on edge certs.
7. **`public = true` on routes.** Public routes declare intent with a
boolean flag. The master assigns the route to an edge node (currently
always svc). No explicit `edge` field in service definitions.
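Decisions 4 and 5 reduce to two small rules, sketched below. This is an illustration only, not the mcp-master implementation: the function names are hypothetical, treating exactly 90 seconds of silence as "missed" is an assumption, and capacity-based worker selection is not modeled (the sketch just prints `auto`).

```bash
#!/usr/bin/env bash
HEARTBEAT_INTERVAL=30   # seconds between agent heartbeats (decision 4)
MISSED_BEFORE_PROBE=3   # missed heartbeats before the master probes

# heartbeat_action LAST_SEEN NOW   (both in epoch seconds)
# Prints "ok" while the agent is inside the 90s window, else "probe".
heartbeat_action() {
    local last_seen=$1 now=$2
    local limit=$((HEARTBEAT_INTERVAL * MISSED_BEFORE_PROBE))
    if ((now - last_seen >= limit)); then
        echo probe
    else
        echo ok
    fi
}

# place_service TIER [NODE]
# An explicit node wins over tier (decision 5); tier "core" maps to the
# master node, rift; anything else is left to capacity-based placement.
place_service() {
    local tier=${1:-worker} node=${2:-}
    if [[ -n $node ]]; then
        echo "$node"
    elif [[ $tier == core ]]; then
        echo rift
    else
        echo auto
    fi
}
```

The ordering inside `place_service` is the point: a service pinned with `node = "orion"` lands on orion even if it also declares `tier = "core"`.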
## Components
### Master (`mcp-master`)
Long-lived orchestrator on rift. Responsibilities:
- Accept CLI commands and dispatch to the correct agent
- Maintain node registry from agent self-registration
- Place services based on tier, explicit node, and resource availability
- Detect `public = true` routes and coordinate edge setup
- Validate public hostnames against allowed domain list
- Aggregate status from all agents (fleet-wide view)
- Probe agents on missed heartbeats
The master is stateless in the durable sense -- it rebuilds its world view
from agents on startup. If the master goes down, running services continue
unaffected; only new deploys and rescheduling stop.
### Agent upgrades
The fleet is heterogeneous (NixOS + Debian, amd64 + arm64), so NixOS flake
inputs don't work as a universal update mechanism.
**Design:** MCP owns the binary at `/srv/mcp/mcp-agent` on all nodes.
- `mcp agent upgrade [node]` -- CLI cross-compiles for the target's
GOARCH, SCPs the binary, restarts via SSH
- Node config gains `ssh` (user@host) and `arch` (amd64/arm64) fields
- rift's NixOS `ExecStart` changes from nix store path to
`/srv/mcp/mcp-agent`
- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
`mcp-agent.service`
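The upgrade flow described above (cross-compile, copy, restart) might look roughly like the following when done by hand. This is a hedged sketch, not the `mcp agent upgrade` implementation: `upgrade_plan` is a hypothetical helper that only prints commands, the `./cmd/mcp-agent` package path and the `/tmp` staging path are assumptions, and privilege escalation is left out because it differs per node.

```bash
#!/usr/bin/env bash
# upgrade_plan NODE SSH_TARGET ARCH
# Prints the three steps of a manual agent upgrade without running them.
# Assumes the SSH user may write /srv/mcp and restart the unit; prefix
# the remote commands with doas or sudo as the node requires.
upgrade_plan() {
    local node=$1 ssh_target=$2 arch=$3
    local out="mcp-agent-${node}"
    # 1. Cross-compile for the target architecture (amd64 or arm64).
    echo "CGO_ENABLED=0 GOOS=linux GOARCH=${arch} go build -o ${out} ./cmd/mcp-agent"
    # 2. Copy the binary to a staging path on the node.
    echo "scp ${out} ${ssh_target}:/tmp/mcp-agent"
    # 3. Install it at the shared location and restart the systemd unit.
    echo "ssh ${ssh_target} 'install -m 0755 /tmp/mcp-agent /srv/mcp/mcp-agent && systemctl restart mcp-agent'"
}
```

The `ssh` and `arch` node-config fields map directly onto the second and third arguments, which is why the design adds them.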
### Edge agents
svc runs an agent but does NOT run containers. Its agent manages mc-proxy
routing only: when the master tells it to set up an edge route, it
provisions a TLS cert from Metacrypt and registers the route in its local
mc-proxy via the gRPC admin API.
## Migration Plan
### Phase 1: Agent on svc
Deploy mcp-agent to svc. Verify with `mcp node list`.
### Phase 2: Edge routing RPCs
Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
Test by calling directly from CLI.
### Phase 3: Build mcp-master
Core loop: registration, heartbeats, deploy routing, placement, edge
coordination.
### Phase 4: Agent registration and health
Self-registration, heartbeat loop, master probe fallback, fleet status.
### Phase 5: Cut over
Point CLI at master, add tier fields to service defs, deploy agents to
orion and hyperborea.
## What Phase E Does NOT Include
These remain future work:
- Auto-reconciliation (agent auto-restarting drifted containers)
- Live migration (snapshot streaming between nodes)
- Web UI for fleet management
- Observability / log aggregation
- Object store
- Multiple edge nodes / master HA


@@ -1,627 +0,0 @@
# Building MCQ: A Reading Queue in a Session
*A document reading queue goes from design discussion to production
deployment — including the deployment's sharp edges.*
*Written by Claude (Opus 4.6), reflecting on a collaborative session with
Kyle. This log focuses heavily on the deployment side, including what
went wrong, what was done manually versus what should have been MCP-managed,
and why `mcp ps` doesn't know mcq exists.*
---
## The Idea
Kyle was out and about, away from his tailnet, and wanted to review
platform documentation on his phone. The existing tools — mcdoc (which
renders docs from Gitea repos) and the repos themselves — require either
tailnet access or a desktop workflow.
The concept: a **document queue**. Push raw markdown from inside the
infrastructure, read rendered HTML from anywhere via a browser. Like a
self-hosted Pocket, but for internal docs you're actively iterating on.
After a design discussion, we settled on:
- **Name**: mcq (Metacircular Document Queue)
- **Data model**: Documents keyed by slug, upsert semantics (re-push
replaces content, resets read flag)
- **Auth**: MCIAS on everything — any user including guest can read, any
user including system accounts can push
- **Rendering**: Goldmark with GFM + syntax highlighting, rendered on
each page view
- **Architecture**: Single binary, REST API + gRPC + web UI
## Building the Service
### Codebase Exploration
Before writing any code, I explored the existing platform services to
understand the patterns:
- **mcat** (`~/src/metacircular/mcat/`): Reference for the web UI pattern —
chi router, CSRF, session cookies, htmx, embedded templates, cobra CLI,
config loading via `mcdsl/config`.
- **mcns** (`~/src/metacircular/mcns/`): Reference for REST + gRPC pattern —
separate `internal/server/` (REST) and `internal/grpcserver/` (gRPC),
method maps for auth interceptors, SQLite via `mcdsl/db`.
- **mcdoc** (`~/src/metacircular/mcdoc/`): Reference for goldmark rendering
and plain HTTP serving (mcdoc doesn't use mcdsl for config or HTTP — it
has its own, because it serves plain HTTP behind mc-proxy).
- **mcdsl** (`~/src/metacircular/mcdsl/`): The shared library — auth,
config, db, httpserver, grpcserver, csrf, web packages.
### Implementation (on vade, Kyle's workstation)
Created `~/src/mcq/` with the standard platform layout:
```
cmd/mcq/ main.go, server.go (cobra CLI)
internal/
config/ custom config (TLS optional, see below)
db/ SQLite schema, migrations, document CRUD
server/ REST API routes and handlers
grpcserver/ gRPC server, interceptors, service handlers
webserver/ Web UI routes, templates, session management
render/ goldmark markdown-to-HTML renderer
proto/mcq/v1/ Protobuf definitions
gen/mcq/v1/ Generated Go code
web/ Embedded templates + static files
deploy/ systemd, examples
```
Key files:
- **Proto** (`proto/mcq/v1/mcq.proto`): DocumentService (ListDocuments,
GetDocument, PutDocument, DeleteDocument, MarkRead, MarkUnread),
AuthService (Login, Logout), AdminService (Health).
- **DB** (`internal/db/documents.go`): Single `documents` table with slug
as unique key. PutDocument uses `INSERT ... ON CONFLICT(slug) DO UPDATE`.
- **REST** (`internal/server/routes.go`): All routes under `/v1/`:
`PUT /v1/documents/{slug}` for upsert, standard CRUD otherwise.
- **Web UI** (`internal/webserver/server.go`): Login page, document list
at `/`, rendered markdown reader at `/d/{slug}`.
- **gRPC** (`internal/grpcserver/`): Mirrors REST exactly. Method map puts
all document operations in `authRequiredMethods`, nothing in
`adminRequiredMethods`.
Proto generation ran on vade:
```bash
cd ~/src/mcq
protoc --go_out=. --go_opt=module=git.wntrmute.dev/mc/mcq \
--go-grpc_out=. --go-grpc_opt=module=git.wntrmute.dev/mc/mcq \
proto/mcq/v1/*.proto
```
### The .gitignore Bug
First `git add -A` missed `cmd/mcq/`, `proto/mcq/`, and `gen/mcq/`. The
`.gitignore` had:
```
mcq
srv/
```
The pattern `mcq` (without a leading slash) matches any file or directory
named `mcq` at any level — so it was ignoring `cmd/mcq/`, `gen/mcq/`, and
`proto/mcq/`. Fixed to:
```
/mcq
/srv/
```
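The difference between the two patterns can be checked in a throwaway repository with `git check-ignore`. A minimal sketch, assuming `git` and `mktemp` are available; `ignored_by` is a hypothetical helper written for this illustration:

```bash
#!/usr/bin/env bash
# ignored_by PATTERN TARGET
# Creates a temporary repository whose .gitignore contains only PATTERN,
# creates TARGET inside it, and prints "ignored" or "not ignored".
ignored_by() {
    local pattern=$1 target=$2 repo result
    repo=$(mktemp -d)
    git -C "$repo" init --quiet >/dev/null 2>&1
    printf '%s\n' "$pattern" > "$repo/.gitignore"
    mkdir -p "$repo/$(dirname "$target")"
    : > "$repo/$target"
    # check-ignore exits 0 when the path is excluded, 1 when it is not.
    if git -C "$repo" check-ignore --quiet "$target"; then
        result="ignored"
    else
        result="not ignored"
    fi
    rm -rf "$repo"
    echo "$result"
}
```

With the unanchored pattern, `ignored_by mcq cmd/mcq/main.go` reports the source file as ignored; with the anchored `/mcq`, the same path is not ignored while a top-level `mcq` binary still is.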
### The TLS Decision
This was the most consequential design decision for deployment.
The standard platform pattern (mcdsl's `httpserver`) enforces TLS 1.3
minimum. But mc-proxy on svc terminates TLS at the edge and forwards to
backends as plain HTTP (for localhost services) or HTTPS (for remote
backends like rift). Gitea on svc runs plain HTTP on port 3000 behind
mc-proxy. mcdoc on rift runs plain HTTP on port 38080 behind mc-proxy.
mcdsl's `config.Load` validates that `tls_cert` and `tls_key` are present
— they're required fields. So I couldn't use `config.Base` with empty TLS
fields.
**Solution**: Created `internal/config/config.go` — mcq's own config
package, modeled after mcdoc's. Same TOML loading, env var overrides, and
validation, but TLS fields are optional. When empty, the server uses
`http.ListenAndServe()` instead of `httpserver.ListenAndServeTLS()`.
This meant giving up the mcdsl httpserver (with its logging middleware and
TLS enforcement) for the plain HTTP path. The gRPC server was also dropped
from the svc deployment since it requires TLS. The REST API and web UI
are sufficient for the use case.
### Build and Test (on vade)
```bash
cd ~/src/mcq
go mod tidy
go build ./... # clean
go vet ./... # clean
go test ./... # 6 tests pass (all in internal/db)
# Production binary
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -trimpath -ldflags="-s -w -X main.version=v0.1.0" \
-o mcq ./cmd/mcq
# Result: 21MB static binary
```
---
## Deployment
### Why mcq is NOT in `mcp ps`
**This is the most important thing in this log.**
mcq was deployed as a **manual systemd service on svc**, not as an
MCP-managed container. This means:
- `mcp ps` doesn't know about it
- `mcp stop mcq` won't work
- `mcp deploy mcq` won't work
- There's no service definition in `~/.config/mcp/services/`
- There's no container image in MCR
- The binary was `scp`'d directly to svc and `install`'d to `/usr/local/bin/`
**Why?** Three reasons:
1. **svc has no MCP agent.** The MCP agent (`mcp-agent`) only runs on rift.
svc is a Debian VPS that hosts MCIAS, mc-proxy, MCNS, and Gitea — all
deployed as manual systemd services, not via MCP. Getting mcq into MCP
would require deploying an MCP agent to svc first (Phase E in
PLATFORM_EVOLUTION.md, items #10-#12).
2. **mcq runs as a native binary, not a container.** MCP manages containers
(podman). mcq on svc is a bare binary under systemd, like MCIAS and
mc-proxy on svc. To make it MCP-managed, it would need to be
containerized and pushed to MCR first.
3. **The deployment followed the existing svc pattern.** Every service on
svc was deployed this way: build on vade, scp to svc, install, write
config, write systemd unit, enable. This was a deliberate choice to
match the existing operational model rather than block on MCP agent
deployment.
### What MCP-managed deployment would look like
Once svc has an MCP agent, mcq could be managed like services on rift:
```toml
# ~/.config/mcp/services/mcq.toml
name = "mcq"
node = "svc"
version = "v0.1.0"
[[components]]
name = "api"
[[components.routes]]
port = 8090
mode = "l7"
hostname = "mcq.metacircular.net"
```
This would require:
- MCP agent running on svc
- mcq containerized (Dockerfile) and pushed to MCR
- Agent handles port assignment, mc-proxy route registration, lifecycle
### The Actual Deployment Steps
All commands below were run from vade (Kyle's workstation) via SSH to svc,
unless otherwise noted.
#### 1. Push repo to Gitea (from vade)
```bash
cd ~/src/mcq
git remote add origin git@git.wntrmute.dev:mc/mcq.git
git push -u origin master
```
The mc/mcq repo was created manually in Gitea (the MCP tool's API token
lacked `write:organization` scope for creating repos under the mc org).
#### 2. Copy binary to svc (from vade)
```bash
scp ~/src/mcq/mcq kyle@svc:/tmp/mcq
```
SSH to svc uses Tailscale hostname resolution: `svc` resolves to
`100.106.232.4` via Tailscale. No SSH config entry was needed. Had to
accept the host key on first connection:
```bash
ssh -o StrictHostKeyChecking=accept-new kyle@svc
```
#### 3. Create user and install binary (on svc, as root via sudo)
```bash
sudo useradd --system --no-create-home --shell /usr/sbin/nologin mcq
sudo mkdir -p /srv/mcq
sudo chown mcq:mcq /srv/mcq
sudo chmod 0700 /srv/mcq
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
```
Verified: `/usr/local/bin/mcq --version` → `mcq version v0.1.0`
#### 4. Write config (on svc)
Created `/srv/mcq/mcq.toml`:
```toml
[server]
listen_addr = "127.0.0.1:8090"
[database]
path = "/srv/mcq/mcq.db"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = "/srv/mcq/ca.pem"
service_name = "mcq"
tags = []
[log]
level = "info"
```
**Important detail**: The first attempt used `server_url = "https://127.0.0.1:8443"`
which failed because MCIAS's TLS cert has SANs for `mcias.wntrmute.dev`
and `mcias.metacircular.net` but **not** `127.0.0.1` or `localhost`. Token
validation returned "invalid or expired token" because the mcdsl auth
client couldn't establish a TLS connection to MCIAS.
Fixed by copying the pattern from MCNS on svc:
- `server_url = "https://mcias.metacircular.net:8443"` (uses the hostname
that matches the cert's SAN)
- `ca_cert = "/srv/mcq/ca.pem"` (the WNTRMUTE root CA cert, copied from
`/srv/mcns/certs/ca.pem`)
The hostname `mcias.metacircular.net` resolves to svc's public IP, so
this still connects to localhost MCIAS — it just goes through the public
IP for TLS hostname verification. (On a locked-down firewall this could
be an issue, but svc allows loopback through its public IP.)
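
The failure mode above comes down to name matching: the client compares the host in `server_url` against the certificate's SAN entries. A simplified sketch of that comparison, assuming only DNS-type SANs and single-label wildcards (real TLS libraries also handle IP-type SANs and more edge cases):

```python
def san_matches(host, san):
    """Compare one requested host against one DNS SAN entry."""
    host, san = host.lower(), san.lower()
    if san.startswith("*."):
        # A wildcard covers exactly one left-most label.
        head, sep, rest = host.partition(".")
        return bool(head) and sep == "." and rest == san[2:]
    return host == san


def cert_accepts(host, dns_sans):
    """True if any SAN on the certificate matches the requested host."""
    return any(san_matches(host, san) for san in dns_sans)


# SANs on the MCIAS certificate, as listed in this log.
MCIAS_SANS = ["mcias.wntrmute.dev", "mcias.metacircular.net"]

print(cert_accepts("mcias.metacircular.net", MCIAS_SANS))
print(cert_accepts("127.0.0.1", MCIAS_SANS))
```

Under these assumptions the public hostname is accepted while `127.0.0.1` and `localhost` are rejected, which is why the first attempt could not complete the TLS handshake.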
#### 5. Create systemd unit (on svc)
Created `/etc/systemd/system/mcq.service`:
```ini
[Unit]
Description=MCQ Document Queue
After=network-online.target mcias.service
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mcq server --config /srv/mcq/mcq.toml
WorkingDirectory=/srv/mcq
Restart=on-failure
RestartSec=5
User=mcq
Group=mcq
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/srv/mcq
PrivateTmp=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
[Install]
WantedBy=multi-user.target
```
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now mcq
```
Verified running: PID 3765144, memory 7.8MB, started cleanly.
#### 6. Generate TLS certificate for mc-proxy (on vade)
mc-proxy needs a TLS cert for the `mcq.metacircular.net` hostname (it
terminates TLS at the edge). Generated using the local WNTRMUTE root CA:
```bash
cd /tmp
openssl ecparam -name prime256v1 -genkey -noout -out mcq.key
openssl req -new -key mcq.key -out mcq.csr \
-subj "/CN=mcq.metacircular.net/O=Metacircular Dynamics" \
-addext "subjectAltName=DNS:mcq.metacircular.net"
openssl x509 -req -in mcq.csr \
-CA ~/src/metacircular/ca/ca.pem \
-CAkey ~/src/metacircular/ca/ca.key \
-CAcreateserial -out mcq.pem -days 365 -sha256 \
-extfile <(echo "subjectAltName=DNS:mcq.metacircular.net
keyUsage=digitalSignature
extendedKeyUsage=serverAuth")
```
The CA key and cert are at `~/src/metacircular/ca/`. This is the
WNTRMUTE Issuing Authority root CA, not Metacrypt's CA (Metacrypt has its
own intermediate CA for automated issuance). The existing mc-proxy certs
(docs, git, metacrypt) were all signed by this same root CA.
Copied to svc:
```bash
scp /tmp/mcq.pem /tmp/mcq.key kyle@svc:/tmp/
```
Installed on svc:
```bash
sudo cp /tmp/mcq.pem /srv/mc-proxy/certs/mcq.metacircular.net.pem
sudo cp /tmp/mcq.key /srv/mc-proxy/certs/mcq.metacircular.net.key
sudo chown mc-proxy:mc-proxy /srv/mc-proxy/certs/mcq.metacircular.net.*
sudo chmod 0600 /srv/mc-proxy/certs/mcq.metacircular.net.key
```
#### 7. Add mc-proxy route (on svc)
mc-proxy on svc uses SQLite for route persistence. The TOML config only
seeds the database on first run (`store.IsEmpty()` check). After that,
routes are loaded from SQLite. So editing the TOML alone doesn't add a
route — you must also insert into the database.
I did both (TOML for documentation/re-seeding, SQLite for immediate effect):
**TOML** (added via `sed` to `/srv/mc-proxy/mc-proxy.toml`):
```toml
[[listeners.routes]]
hostname = "mcq.metacircular.net"
backend = "127.0.0.1:8090"
mode = "l7"
tls_cert = "/srv/mc-proxy/certs/mcq.metacircular.net.pem"
tls_key = "/srv/mc-proxy/certs/mcq.metacircular.net.key"
backend_tls = false
```
**SQLite** (direct insert):
```bash
sudo sqlite3 /srv/mc-proxy/mc-proxy.db "
INSERT INTO routes (listener_id, hostname, backend, mode, tls_cert, tls_key, backend_tls)
VALUES (1, 'mcq.metacircular.net', '127.0.0.1:8090', 'l7',
'/srv/mc-proxy/certs/mcq.metacircular.net.pem',
'/srv/mc-proxy/certs/mcq.metacircular.net.key', 0);
"
```
The `listener_id = 1` is the `:443` listener (only listener on svc's
mc-proxy).
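
The seed-once behaviour described in this step can be illustrated with a small sketch. It is not mc-proxy's code: the two-column `routes` table and the `open_store` helper are simplified stand-ins, and only the "seed from TOML when the table is empty" rule is taken from the log.

```python
import sqlite3


def open_store(conn, toml_routes):
    """Seed routes from the TOML list only if the table is empty, then list them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS routes ("
        "hostname TEXT PRIMARY KEY, backend TEXT NOT NULL)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM routes").fetchone()
    if count == 0:
        # First run: the TOML config seeds the database.
        conn.executemany(
            "INSERT INTO routes (hostname, backend) VALUES (?, ?)", toml_routes
        )
        conn.commit()
    rows = conn.execute("SELECT hostname FROM routes ORDER BY hostname")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
git_route = ("git.metacircular.net", "127.0.0.1:3000")
mcq_route = ("mcq.metacircular.net", "127.0.0.1:8090")

first_start = open_store(conn, [git_route])
# Editing the TOML afterwards changes nothing: the table is no longer empty.
second_start = open_store(conn, [git_route, mcq_route])
# A direct insert is what makes the route visible on the next start.
conn.execute("INSERT INTO routes (hostname, backend) VALUES (?, ?)", mcq_route)
conn.commit()
third_start = open_store(conn, [git_route, mcq_route])
```

In this sketch `second_start` still lists only the Git route, and the mcq route appears only in `third_start`, after the direct insert.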
**Note on `backend_tls = false`**: mcq serves plain HTTP on localhost.
mc-proxy terminates TLS for the client and forwards as plain HTTP to
`127.0.0.1:8090`. This is the same pattern as Gitea (`127.0.0.1:3000`)
and mcdoc (`100.95.252.120:38080`). Only metacrypt uses `backend_tls = true`
because its backend is on rift over Tailscale.
#### 8. Restart mc-proxy (on svc)
```bash
sudo systemctl restart mc-proxy
```
This was messy. mc-proxy's graceful shutdown waits for in-flight
connections to drain, and the 30-second shutdown timeout was exceeded
(lingering connections from internet scanners hitting git.metacircular.net).
The shutdown hung for ~30 seconds before logging "shutdown timeout exceeded,
forcing close". systemd then moved to `deactivating (stop-sigterm)` state.
Had to force it:
```bash
sudo systemctl kill mc-proxy
sleep 2
sudo systemctl start mc-proxy
```
After restart: `routes=5` (was 4 before mcq). Confirmed:
```bash
curl -sk https://mcq.metacircular.net/v1/health
# {"status":"ok"}
```
#### 9. Push documents (from vade)
Used the mcp-agent service account token (from
`~/data/downloads/service-account-76d35a82-77ca-422f-85a3-b9f9360d5164.token`)
to authenticate API calls. This is a long-lived JWT issued by MCIAS with
`admin` role, `exp` in 2027.
```bash
TOKEN=$(cat ~/data/downloads/service-account-*.token)
# Push MCP Architecture
python3 -c "
import json
body = open('mcp/ARCHITECTURE.md').read()
print(json.dumps({'title': 'MCP Architecture', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/mcp-architecture \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @-
# Push Platform Evolution
python3 -c "
import json
body = open('PLATFORM_EVOLUTION.md').read()
print(json.dumps({'title': 'Platform Evolution', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/platform-evolution \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @-
# Push Packaging doc
python3 -c "
import json
body = open('docs/packaging-and-deployment.md').read()
print(json.dumps({'title': 'Packaging and Deployment', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/packaging-and-deployment \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @-
```
Used `python3` for JSON encoding because `jq` isn't installed on vade
(NixOS — would need to add it to the system config or use `nix-shell`).
All three documents pushed successfully. The token identifies as
`mcp-agent` (the service account name), so `pushed_by` shows `mcp-agent`
on each document.
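
The three near-identical `python3 -c ... | curl` pipelines could be folded into one helper. The sketch below only builds the request; the endpoint path and the `title`/`body` fields come from the commands above, while the helper name `build_push_request` is made up for illustration.

```python
import json
import urllib.request

BASE_URL = "https://mcq.metacircular.net"


def build_push_request(slug, title, body, token):
    """Build (but do not send) the PUT request for one document."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/documents/{slug}",
        data=payload,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_push_request(
    "mcp-architecture", "MCP Architecture", '# Title\n"quoted" text', "example-token"
)
```

Sending it would be a call to `urllib.request.urlopen(req, context=...)`, with an `ssl` context that trusts the WNTRMUTE root CA instead of skipping verification as `curl -k` does.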
### Subsequent Update: Tufte Theme
Kyle wanted a wider reading area (70%) and a Tufte-inspired theme. Updated
`web/static/style.css`:
- Serif font stack (Georgia, Palatino)
- Cream background (`#fffff8`)
- Italic headings, small-caps labels
- `width: 70%` on `.page-container` (was `max-width: 720px`)
- Minimal chrome — document list uses ruled lines instead of cards,
tables use bottom-borders only
- Mobile fallback: full width below 768px
Rebuilt, deployed same way:
```bash
# On vade
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -trimpath -ldflags="-s -w -X main.version=v0.1.1" \
-o mcq ./cmd/mcq
scp mcq kyle@svc:/tmp/mcq
# On svc
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
sudo systemctl restart mcq
```
---
## State After This Session
### What's running on svc
| Service | Port | Managed by | Notes |
|---------|------|------------|-------|
| MCIAS | :8443/:9443 | systemd | Identity/auth, been here longest |
| mc-proxy | :443 | systemd | L7 TLS termination, 5 routes |
| MCNS | :53/:8444/:9444 | systemd | Authoritative DNS |
| Gitea | :3000 | systemd | Git hosting |
| **mcq** | **:8090** | **systemd** | **NEW: document queue** |
None of these are MCP-managed. svc has no MCP agent.
### mc-proxy routes on svc
| Hostname | Backend | Mode | TLS Backend |
|----------|---------|------|-------------|
| metacrypt.metacircular.net | 100.95.252.120:18080 | L7 | yes (rift) |
| git.metacircular.net | 127.0.0.1:3000 | L7 | no |
| git.wntrmute.dev | 127.0.0.1:3000 | L7 | no |
| docs.metacircular.net | 100.95.252.120:38080 | L7 | no |
| **mcq.metacircular.net** | **127.0.0.1:8090** | **L7** | **no** |
### DNS
`mcq.metacircular.net` is a CNAME to `svc.metacircular.net` (set up by
Kyle at the DNS registrar before this session). mc-proxy's SNI-based
routing handles the rest.
### Documents in queue
| Slug | Title | Pushed By |
|------|-------|-----------|
| mcp-architecture | MCP Architecture | mcp-agent |
| platform-evolution | Platform Evolution | mcp-agent |
| packaging-and-deployment | Packaging and Deployment | mcp-agent |
### Git
Repo: `mc/mcq` on Gitea (`git.wntrmute.dev:mc/mcq.git`)
Commits:
1. `bc16279` — Initial implementation
2. `648e9dc` — Support plain HTTP mode for mc-proxy L7 deployment
3. `a5b90b6` — Switch to Tufte-inspired reading theme
---
## What Would Be Different with MCP
If svc had an MCP agent and mcq were containerized:
1. **No manual SSH**: `mcp deploy mcq` from vade would push the service
definition, agent would pull the image from MCR.
2. **No manual port picking** — agent assigns a free port from 10000-60000.
3. **No manual mc-proxy route** — agent calls mc-proxy's gRPC API to
register the route (Phase B, already working on rift).
4. **No manual TLS cert** — agent provisions from Metacrypt CA
(Phase C, already working on rift).
5. **No manual systemd unit** — agent manages the container lifecycle.
6. **`mcp ps` would show mcq** — because the agent tracks it in its
registry.
7. **`mcp stop mcq` / `mcp restart mcq` would work** — standard lifecycle.
The gap is: svc has no agent. That's Phase E work (items #10-#12 in
PLATFORM_EVOLUTION.md). The prerequisites are the agent binary location
convention, SSH-based upgrade tooling, and node provisioning for Debian.
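
As a rough illustration of item 2 (automatic port assignment), here is one way a free port could be chosen from the 10000-60000 range. How the MCP agent actually picks ports is not described in this log. The probe-by-binding approach and the names `port_is_free` and `pick_port` are assumptions.

```python
import random
import socket


def port_is_free(port, host="127.0.0.1"):
    """Probe a port by trying to bind a TCP socket to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_port(low=10000, high=60000, is_free=port_is_free, attempts=50):
    """Return a random port in [low, high] that the probe reports as free."""
    for _ in range(attempts):
        candidate = random.randint(low, high)
        if is_free(candidate):
            return candidate
    raise RuntimeError("no free port found in range")
```

A probe like this is inherently racy (another process can take the port between the check and the real bind), so a real agent would presumably also record assignments in its registry.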
---
## Rough Edges and Lessons
1. **MCIAS cert hostname**: Every new service on svc will hit this. The
MCIAS cert doesn't include localhost as a SAN. Services must use
`server_url = "https://mcias.metacircular.net:8443"` (which routes
through the public IP back to localhost) and include the CA cert.
Could fix by reissuing the MCIAS cert with a localhost SAN.
2. **mc-proxy route persistence**: The TOML-seeds-once-then-SQLite model
means you have to touch two places (TOML for future re-seeds, SQLite
for immediate effect). On rift this is handled by the agent's gRPC
calls. On svc without an agent, it's manual database surgery.
3. **mc-proxy shutdown timeout**: The 30-second timeout isn't enough when
internet scanners maintain persistent connections to git.metacircular.net.
Had to force-kill on restart. Should increase `shutdown_timeout` or
add a SIGKILL escalation in the systemd unit (`TimeoutStopSec=45`,
which sends SIGKILL after 45s).
4. **No jq on vade**: NixOS doesn't have jq in the default system config.
Used python3 as a workaround for JSON encoding. Minor friction.
5. **mcdsl httpserver assumes TLS**: Services behind mc-proxy L7 can't use
`mcdsl/httpserver` because it enforces TLS 1.3. mcdoc solved this with
its own config/server. mcq now does the same. This is a recurring
pattern — might warrant adding a plain HTTP mode to mcdsl httpserver,
or a separate `mcdsl/httpserver/plain` package.
6. **Session cookie Secure flag behind plain HTTP**: The mcdsl `web`
package always sets `Secure: true` on session cookies. This works
behind mc-proxy L7 because the *browser* sees HTTPS (mc-proxy
terminates TLS) — the `Secure` flag is about the browser's view of
the connection, not the backend. If mcq were ever accessed directly
(not through mc-proxy), cookies would silently fail.
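
The cookie behaviour in item 6 can be illustrated with the standard library. This is a simplified model, not the mcdsl `web` package: the cookie name `session` is hypothetical, and the browser rule ignores the exceptions some browsers make for loopback addresses.

```python
from http.cookies import SimpleCookie


def session_cookie(token):
    """Build a Set-Cookie value with the Secure flag always on."""
    cookie = SimpleCookie()
    cookie["session"] = token          # cookie name is a placeholder
    cookie["session"]["secure"] = True
    cookie["session"]["httponly"] = True
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


def browser_sends_cookie(cookie_header, url):
    """Simplified browser rule: Secure cookies only travel over https."""
    attrs = [part.strip().lower() for part in cookie_header.split(";")]
    return url.startswith("https://") or "secure" not in attrs


header = session_cookie("abc123")
via_proxy = browser_sends_cookie(header, "https://mcq.metacircular.net/")
direct = browser_sends_cookie(header, "http://mcq.internal.example:8090/")
```

Here `via_proxy` is true and `direct` is false, which matches the "cookies would silently fail" observation for direct plain-HTTP access.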


@@ -1,231 +0,0 @@
# Incident Report: UID Change Cascading Failure
**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850
## Timeline
### Background
Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix` which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:
```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```
Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.
### The UID Change
To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with `nixos-rebuild switch`.
**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild applied a config declaring `uid = 850`, but the existing
`mcp` user on both nodes kept UID 995. Manual `usermod -u 850 mcp`
was required on each node.
**Problem 2: Rootless podman caches the UID everywhere.**
- Podman's SQLite database (`db.sql`) stores absolute paths like
`/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name but the kernel
checks the actual UID
- Container storage overlay directories have file ownership based on
the old UID namespace mapping (995 → 100000)
After changing the UID, `podman` operations failed with:
```
newuidmap: write to uid_map failed: Operation not permitted
```
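
To see why ownership recorded under the old UID stops lining up, it helps to write out the usual rootless mapping: container UID 0 maps to the invoking user's host UID, and container UIDs from 1 upward map into the user's subuid range. The sketch below assumes a single subuid range starting at 100000 (the start value appears in this report; the size of 65536 is an assumption) and is not a diagnosis of the `newuidmap` error itself.

```python
def host_uid(container_uid, user_uid, subuid_start, subuid_count=65536):
    """Host UID that a container UID maps to under a rootless user namespace."""
    if container_uid == 0:
        return user_uid
    if 1 <= container_uid <= subuid_count:
        return subuid_start + container_uid - 1
    raise ValueError("container UID outside the mapped range")


def container_uid(host_file_uid, user_uid, subuid_start, subuid_count=65536):
    """Container-side owner of a host file, or None if the UID is unmapped."""
    if host_file_uid == user_uid:
        return 0
    offset = host_file_uid - subuid_start
    if 0 <= offset < subuid_count:
        return offset + 1
    return None  # unmapped: appears as the overflow ("nobody") user


# Files written by container root while mcp was UID 995 are owned by 995 on disk.
before = container_uid(995, user_uid=995, subuid_start=100000)
after = container_uid(995, user_uid=850, subuid_start=100000)
```

In this model `before` is 0 (container root owns the file) while `after` is `None`: once mcp runs as 850, files still owned by host UID 995 are unmapped inside the namespace.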
### The Reboot
Rift was rebooted to get a clean systemd user session for UID 850.
The reboot succeeded, but **all containers were gone**:
```
$ podman ps -a
(empty)
```
Podman's database was recreated fresh on boot because the old database
referenced paths under `/run/user/995/` which no longer existed. The
images were still in overlay storage but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.
### DNS Collapse
MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
ran as a container on rift. When all containers were lost, DNS
resolution broke:
- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver
Tailscale DNS (MagicDNS) was also affected because systemd-resolved's
global DNS config pointed to MCNS. Tailscale itself remained functional
(its coordination servers are external), but hostname resolution via
Tailscale DNS names failed.
The operator turned off Tailscale on vade (the workstation) because
Tailscale's MagicDNS was routing ALL DNS queries through the broken
MCNS resolver — external services including Claude Code and Gitea
were unreachable. Disabling Tailscale was the only way to restore
external DNS resolution. However, this also broke connectivity to
rift since the MCP agent binds to the Tailnet IP only
(`100.95.252.120:9444`).
### Recovery
**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for `.ts.net` names.
**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:
```bash
podman run -d --name mcns --restart unless-stopped \
-p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
-p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
-p 127.0.0.1:38443:8443 \
-v /srv/mcns:/srv/mcns \
mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
server --config /srv/mcns/mcns.toml
```
DNS resolution restored within seconds.
**Step 3**: Start remaining services manually via `podman run`. Images
were all cached. The `mcp deploy` CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
containers)
Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.
**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 through the user namespace mapping
(the subuid range covers the remaining container UIDs). But:
- `podman unshare chown -R 0:0 /srv/<service>` translated ownership
to match the container's user namespace
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
denied inside rootless podman user namespaces (even for UID 0 in the
namespace)
**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt. The fix:
```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
```
The `fchmod` error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.
**Data loss**: MCR and Metacrypt databases were deleted and recreated
empty. MCR lost its manifest/tag metadata (images still exist in
overlay storage but are unregistered). Metacrypt lost its CA state
(encrypted keys, issued certs tracking). Other services (mcq, mcdoc,
etc.) started successfully because their databases survived the
ownership changes.
## Root Causes
1. **UID collision between system users**: NixOS auto-assigns UIDs
downward from 999. Pinning UID 995 for mcp collided with sshd on
orion.
2. **Rootless podman's deep UID dependency**: Changing a user's UID
after rootless podman has been used requires all of the following,
none of which happens automatically:
- Updating podman's internal database paths
- Recreating the systemd user session
- Fixing subuid/subgid mappings
- Fixing overlay storage ownership
- Fixing service data file ownership
3. **No boot sequencing**: When rift rebooted with no running
containers, there was no mechanism to start services in dependency
order. The boot sequence feature in the v2 architecture exists
precisely for this, but had not been implemented yet.
4. **MCNS as a single point of DNS failure**: All machines used MCNS
as their DNS resolver. When MCNS went down, everything broke
including the ability to manage infrastructure.
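5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
library calls `fchmod()` on database files, which is denied inside
rootless podman user namespaces. This is a known incompatibility
that was masked by the previous UID setup. Follow-up work (see
Action Items) traced the call to our own `os.Chmod` in
`mcdsl/db/db.go` rather than to the SQLite library.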
## Lessons Learned
1. **Never change a rootless podman user's UID.** If a UID collision
exists, resolve it on the conflicting node (change sshd, not mcp)
or use a per-host UID override. Changing the UID after podman has
been used is destructive.
2. **DNS must not be a single point of failure.** All machines should
have fallback DNS resolvers that work independently of MCNS. The
NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
fallbacks, not just MCNS.
3. **Boot sequencing is critical.** The v2 architecture's boot sequence
(foundation → core → management) is not a nice-to-have. Without it,
manual recovery requires knowing the exact dependency order and the
exact `podman run` commands for each service.
4. **The MCP agent should be able to recover containers from its
registry.** After a podman database reset, the agent's SQLite
registry still knows what should be running. A `mcp agent recover`
command that recreates containers from the registry would eliminate
the manual `podman run` recovery.
5. **Service definitions must include all runtime parameters.** The
manual recovery required knowing port mappings, volume mounts,
network modes, user overrides, and command arguments for each
service. All of this is in the service definition files, but there
was no tool to translate a service definition into a `podman run`
command without the full MCP deploy pipeline.
6. **Tailscale MagicDNS amplifies DNS failures.** When MCNS is down
and MagicDNS routes through it, ALL DNS breaks — not just internal
names. Disabling Tailscale restores external DNS but loses Tailnet
connectivity. The fix is fallback resolvers that bypass MCNS, not
disabling Tailscale.
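
Lesson 5 asks for a way to turn a service definition into a `podman run` command. A minimal sketch of that translation follows. The dictionary keys are hypothetical and do not reflect the real MCP service definition schema. The `podman run` flags are the ones used in the manual recovery above, and the example reuses a subset of the MCNS command.

```python
import shlex


def podman_run_args(svc):
    """Translate a (hypothetical) service dict into a podman run argv list."""
    args = ["podman", "run", "-d", "--name", svc["name"]]
    if "restart" in svc:
        args += ["--restart", svc["restart"]]
    for port in svc.get("ports", []):
        args += ["-p", port]
    for volume in svc.get("volumes", []):
        args += ["-v", volume]
    args.append(svc["image"])
    args += svc.get("cmd", [])
    return args


mcns = {
    "name": "mcns",
    "restart": "unless-stopped",
    "ports": ["127.0.0.1:38443:8443"],
    "volumes": ["/srv/mcns:/srv/mcns"],
    "image": "mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0",
    "cmd": ["server", "--config", "/srv/mcns/mcns.toml"],
}

print(shlex.join(podman_run_args(mcns)))
```

Building an argv list and quoting it only for display keeps the command safe to pass straight to `subprocess.run` without a shell.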
## Action Items
- [x] Write disaster recovery runbook → `docs/disaster-recovery.md`
- [x] Add fallback DNS resolvers to NixOS config → all nodes now have
1.1.1.1 and 8.8.8.8 as fallbacks after MCNS
- [x] Implement `mcp agent recover` command → MCP v0.10.5. Recreates
containers from the agent registry when podman DB is lost.
- [x] Implement boot sequencing in the agent → MCP v0.10.6.
`[[boot.sequence]]` config with per-stage health checks.
- [x] Fix modernc.org/sqlite `fchmod` → was our own `os.Chmod` in
`mcdsl/db/db.go`, not sqlite. Made best-effort in mcdsl v1.8.0.
- [x] Add multi-address support to node config → MCP v0.10.4.
Fallback addresses tried in order when primary fails.
- [x] Stabilize mcp UID → pinned at 850 with NEVER CHANGE comment
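
The boot sequencing item above (stages with per-stage health checks) can be sketched as a simple loop. This is an illustration of the idea, not the MCP v0.10.6 implementation: the `boot` function, its callbacks, and the placement of services into stages are assumptions. Only the stage order foundation → core → management comes from this report.

```python
def boot(stages, start, healthy):
    """Start each stage in order; stop if any service in a stage is unhealthy."""
    started = []
    for stage_name, services in stages:
        for service in services:
            start(service)
            started.append(service)
        # A real agent would poll with a timeout; this checks once.
        unhealthy = [s for s in services if not healthy(s)]
        if unhealthy:
            raise RuntimeError(f"stage {stage_name!r} unhealthy: {unhealthy}")
    return started


# Which service belongs to which stage is a guess made for the example.
STAGES = [
    ("foundation", ["mcns"]),
    ("core", ["mcr", "metacrypt"]),
    ("management", ["mcq"]),
]

order = []
result = boot(STAGES, start=order.append, healthy=lambda service: True)
```

Gating each stage on health is what would have kept later services from starting against a missing DNS server during this incident.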