Compare commits

32 Commits

Author SHA1 Message Date
c05ed968f9 All incident action items complete (7/7)
- Disaster recovery runbook ✓
- Fallback DNS resolvers ✓
- Agent recover command (v0.10.5) ✓
- Boot sequencing (v0.10.6) ✓
- sqlite chmod fix (mcdsl v1.8.0) ✓
- Multi-address nodes (v0.10.4) ✓
- UID stabilization ✓

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 11:54:09 -07:00
02cfb27937 Update incident action items: 4/7 complete
Done: disaster recovery runbook, fallback DNS resolvers, sqlite chmod
fix (mcdsl v1.8.0), UID stabilization.
Remaining: agent recover command, boot sequencing, multi-address nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:33:41 -07:00
5aceb496e8 Fix incident report: Tailscale was disabled deliberately
MagicDNS routed all DNS through broken MCNS, making external services
(Claude, Gitea) unreachable. Disabling Tailscale was the correct
action to restore external DNS, not a mistake.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:25:49 -07:00
a474f17a8b Move incident report to log/ directory
Incident reports go in log/ (operational work logs), not docs/
(architecture and reference documentation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:22:42 -07:00
ad34904b0b Add incident report and disaster recovery runbook
incident-2026-04-03-uid-change.md: Detailed post-mortem of the UID
change cascading failure that took down all services on rift. Documents
the timeline, root causes, recovery steps, and lessons learned.

disaster-recovery.md: Step-by-step runbook for bootstrapping the
platform from zero when all containers are gone. Covers the boot
order (MCNS → mc-proxy/MCR/Metacrypt → master → apps), exact podman
run commands for each service, common errors, and verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:18:45 -07:00
cdf3d30bfe Update work log: edge routing E2E verified
Full edge routing flow works through the master: deploy with
public=true triggers cert provisioning on svc via Metacrypt,
mc-proxy route registration, and edge route persistence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:49:38 -07:00
6a6c5a0c8e Update work log: master deployed, agents upgraded
- Both agent certs reissued with Tailscale DNS names + internal names + IPs
- Both agents upgraded to v0.10.0 (edge routing RPCs, v2 proto fields)
- mcp-master v0.10.0 deployed on rift with TLS, MCIAS identity, and
  bootstrap node config
- CLI config switched to Tailscale DNS names with [master] section
- End-to-end tested: deploy, undeploy, ps all work through the master

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 17:31:59 -07:00
96abb87740 Merge branch 'worktree-agent-acf2563d' 2026-04-02 15:15:22 -07:00
4de025c071 Merge branch 'worktree-agent-aec15a54' 2026-04-02 15:15:22 -07:00
5158e6ae24 Merge branch 'worktree-agent-ae8e6d9d' 2026-04-02 15:15:22 -07:00
979a64a854 Update packaging guide for multi-node fleet topology
Reflect that the platform now spans multiple nodes (rift for compute,
svc for public edge routing, orion provisioned but offline). Add Fleet
Topology section, update deploy steps to include TLS cert provisioning
from Metacrypt CA, DNS registration in MCNS, and gRPC-based mc-proxy
route registration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:14:23 -07:00
634b9096c0 Update STATUS.md to current deployed state (2026-04-02)
- Update all service versions from mcp ps output
- Add svc as edge node, orion as offline/maintenance to node inventory
- Add MCQ service details section
- Update MCP to v0.9.0 with multi-node edge support
- Add svc port map and public-facing routes
- Update non-platform services (exod, sgardd, kls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:14:20 -07:00
24a0db0f47 Update root CLAUDE.md: remove mcdeploy, add log/, update MCP description
- Remove deprecated mcdeploy/ from project map (directory doesn't exist)
- Add log/ directory to project map
- Update MCP description to reflect multi-node fleet management
- Add note about subprojects at sibling paths (mcat, mcdoc, mcq)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:14:17 -07:00
78ce5003e1 Update PLATFORM_EVOLUTION.md and TOOLING.md for current state
Reflect Phase E in progress, add orion to fleet, mark mcdoc as
deployed, remove deprecated mcdeploy, update multi-arch fleet info,
and add mcdoc/mcq to tooling tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:14:13 -07:00
0b11899a9b Distinguish node infrastructure from deployed services
mcp-agent and mc-proxy run on every node as systemd services — they
are not placed by the master and don't belong in the placements table.
Snapshot paths are node-keyed for infrastructure (<service>/<node>/)
to avoid collisions between instances on different nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:07:59 -07:00
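The node-keyed snapshot layout described in this commit could be sketched as follows; the `snapshotPath` helper and the `/srv/snapshots` base are hypothetical illustrations, not code from the repo:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// snapshotPath returns the snapshot directory for a service instance.
// Infrastructure services (mcp-agent, mc-proxy) run on every node, so
// their snapshots are keyed by node (<service>/<node>/) to avoid
// collisions; placed services have one instance and need no node key.
// Hypothetical helper illustrating the layout in the commit message.
func snapshotPath(base, service, node string, infrastructure bool) string {
	if infrastructure {
		return filepath.Join(base, service, node)
	}
	return filepath.Join(base, service)
}

func main() {
	fmt.Println(snapshotPath("/srv/snapshots", "mc-proxy", "rift", true))
	fmt.Println(snapshotPath("/srv/snapshots", "mcdoc", "rift", false))
}
```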
fb9942db14 Add svc Phase 1 work log
Documents all changes made to svc.metacircular.net for MCP v2 Phase 1:
mcp-agent deployment, user creation, TLS cert, systemd unit, mc-proxy
socket access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:06:54 -07:00
f9eea00a6c Remove completed architecture review
Review is complete — all items resolved, document approved for
implementation. The review served its purpose and is no longer needed
in the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:09:44 -07:00
f3a608c4ae Add v2 architecture and review docs
- docs/architecture-v2.md: Full MCP v2 multi-node architecture
  covering master/agent topology, security model, service placement,
  edge routing, snapshots, migration, and 6-phase implementation plan.
  Reviewed and ready for implementation.
- docs/architecture-v2-review.md: External architecture review with
  all items resolved.
- docs/phase-e-plan.md: Updated to match v2 decisions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:09:36 -07:00
8fb6374257 Document SSO login flow in packaging and deployment guide
Add SSO redirect flow alongside direct credentials, MCIAS client
registration steps, [sso] config section, and updated service versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 23:49:36 -07:00
e441df34c8 Add MCQ to platform docs and service inventory
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:05:12 -07:00
644ded9efc Mark mcdeploy as deprecated (archived, superseded by MCP)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:29:02 -07:00
4386fb0896 Sync docs/metacircular.md versions and add undeploy capability
Update version references to match current git tags: MCIAS v1.9.0,
Metacrypt v1.3.1, MCP v0.7.6. Add Phase D (DNS registration) to MCP
status, update RPC/CLI counts, and document undeploy as a first-class
capability. Also sync STATUS.md and packaging-and-deployment.md with
the same version updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:20:30 -07:00
5c4ed5c681 Update packaging docs: fix stale versions, add missing CLI commands
Sync the deployed services version table with current git tags (MCIAS
v1.9.0, Metacrypt v1.3.1, MCR v1.2.1, MCNS v1.1.1, MCP v0.7.6) and
add mcp logs and mcp edit to the command reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:19:53 -07:00
4722fdb0da Sync platform docs: Phase D complete, Phase E planned, version updates
- PLATFORM_EVOLUTION: Mark Phase D (DNS) complete, add Phase E
  (multi-node agent management) planning with items #10-12
- PLATFORM_EVOLUTION: Fix stale mcdsl reference (v1.2.0 adds → added,
  consuming services now on v1.4.0)
- STATUS: Update all service versions to current, note Phase A-D
  completion and Phase E planning
- docs/packaging-and-deployment: Add agent management section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:19:53 -07:00
1715a53193 Update metacircular.md with system account auth model
Document the three-tier identity model across MCIAS, MCR, Metacrypt,
MCNS, and MCP. Update version numbers for MCR (v1.2.1), MCNS (v1.1.1),
MCP (v0.7.2). Clarify that admin is reserved for MCIAS administration,
not routine deploy operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:18:53 -07:00
5e7e2522d7 Add CLI security standard: never echo passwords
New standard requiring mcdsl/terminal.ReadPassword for interactive
password prompts. Codifies the fix applied to MCP's login command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:18:53 -07:00
d041df01e4 Sync packaging docs with MCP implementation reality
Fix service definition examples to require explicit image fields,
remove phantom version top-level field, document container listen
address requirement (0.0.0.0:$PORT not localhost), add undeploy
command, and update deployed service versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:18:53 -07:00
a04543028b Mark Phase C complete in PLATFORM_EVOLUTION.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:18:53 -07:00
90781014db Add mcdeploy to project maps, update MCDoc status
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 06:16:06 +00:00
24a8ae8c21 Sync STATUS.md: MC-Proxy v1.2.1, MCP v0.4.0, add MCDoc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 06:16:06 +00:00
84610339aa Update deployed services versions in packaging guide 2026-03-29 06:15:57 +00:00
8550b8c2e5 Sync docs/metacircular.md versions and capabilities to current state
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 14:59:20 -07:00
14 changed files with 3694 additions and 241 deletions


@@ -17,14 +17,17 @@ Metacircular is a multi-service personal infrastructure platform. This root repo
| `mcat/` | MCIAS login policy tester — lightweight web app to test and audit login policies | Go |
| `mcdsl/` | Standard library — shared packages for auth, db, config, HTTP/gRPC servers, CSRF, snapshots | Go |
| `mcdoc/` | Documentation server — renders markdown from Gitea, serves public docs via mc-proxy | Go |
| `mcp/` | Control plane — operator-driven deployment, service registry, container lifecycle (master/agent) | Go |
| `mcdeploy/` | Deployment CLI — tactical bridge tool for build, push, deploy operations | Go |
| `mcq/` | Document review queue — push docs for review, MCP server for Claude integration | Go |
| `mcp/` | Control plane — service deployment, container lifecycle, multi-node fleet management (CLI/agent, master in development) | Go |
| `mcns/` | Networking service — custom Go DNS server, authoritative for internal zones | Go |
| `ca/` | PKI infrastructure and secrets for dev/test (not source code, gitignored) | — |
| `docs/` | Platform-wide documentation (architecture overview, deployment guide) | Markdown |
| `log/` | Operational work logs and session notes | Markdown |
Each subproject has its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`. When working in a subproject, read its own CLAUDE.md first.
Some subprojects (mcat, mcdoc, mcq) may live at sibling paths (e.g., `../mcq/`) rather than as subdirectories, depending on workspace layout.
## Service Dependencies
MCIAS is the root dependency — every other service authenticates through it. No service maintains its own user database. The dependency graph:
@@ -35,6 +38,7 @@ mcias (standalone — no MCIAS dependency)
├── mc-proxy (uses MCIAS for admin auth)
├── mcr (uses MCIAS for auth + policy)
├── mcdoc (public, no MCIAS — fetches docs from Gitea)
├── mcq (uses MCIAS for auth; document review queue)
├── mcp (uses MCIAS for auth; orchestrates deployment and lifecycle)
├── mcns (uses MCIAS for auth; authoritative DNS for internal zones)
└── mcat (tests MCIAS login policies)


@@ -5,7 +5,7 @@ from its current manually-wired state to fully declarative deployment.
It is a living design document — not a spec, not a commitment, but a
record of where we are, where we want to be, and what's between.
Last updated: 2026-03-27 (Phases A + B complete)
Last updated: 2026-04-02 (Phases A–D complete, Phase E in progress)
---
@@ -181,9 +181,9 @@ about one node, one mc-proxy, or loopback-only backends.
#### 1. mcdsl: Proper Module Versioning — DONE
mcdsl is already properly versioned and released:
- Tagged releases: `v0.1.0`, `v1.0.0`, `v1.0.1`
- Tagged releases: `v0.1.0`, `v1.0.0`, `v1.0.1`, `v1.1.0`, `v1.2.0`
- All consuming services import by URL with pinned versions
(mcr, mcat, mcns, mc-proxy → `v1.0.0`; metacrypt → `v1.0.1`)
(all consuming services on `v1.2.0`)
- No `replace` directives anywhere
- Docker builds use standard `go mod download`
- `uses_mcdsl` eliminated from service definitions and docs
@@ -215,18 +215,14 @@ routes during deploy and stop:
- L4 routes: TLS passthrough, backend handles its own TLS
- Hostnames default to `<service>.svc.mcp.metacircular.net`
#### 4. MCP Agent: TLS Cert Provisioning
#### 4. MCP Agent: TLS Cert Provisioning — DONE
**Gap**: certs are manually provisioned and placed on disk. There is no
automated issuance flow.
**Work**:
- Agent requests certs from Metacrypt CA via its API.
- Certs are stored in a standard location
(`/srv/mc-proxy/certs/<service>.pem`).
- Cert renewal is handled automatically before expiry.
**Depends on**: Metacrypt cert issuance policy (#7).
Agent provisions TLS certificates from Metacrypt CA automatically during
deploy for L7 routes:
- ACME client library requests certs from Metacrypt CA via its API
- Certs stored in `/srv/mc-proxy/certs/<service>.pem`
- Provisioning happens during deploy before mc-proxy route registration
- L7 routes get agent-provisioned certs; L4 routes use service-managed TLS
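The cert placement and L7/L4 split above can be sketched in Go; both helpers are hypothetical illustrations of the stated convention (`/srv/mc-proxy/certs/<service>.pem`, agent certs only for L7):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// certPath returns the standard location for agent-provisioned certs.
func certPath(service string) string {
	return filepath.Join("/srv/mc-proxy/certs", service+".pem")
}

// needsAgentCert reports whether the agent should provision a cert for
// a route: L7 routes terminate TLS at mc-proxy with agent-provisioned
// certs; L4 routes are TLS passthrough with service-managed certs.
func needsAgentCert(routeType string) bool {
	return routeType == "l7"
}

func main() {
	fmt.Println(certPath("mcdoc"), needsAgentCert("l7"), needsAgentCert("l4"))
}
```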
#### 5. mc-proxy: Route Persistence — DONE
@@ -243,57 +239,49 @@ mc-proxy routes are fully persisted in SQLite and survive restarts:
bootstrap before MCP is operational. The gRPC API and mcproxyctl
are the primary route management interfaces going forward.
#### 6. MCP Agent: DNS Registration
#### 6. MCP Agent: DNS Registration — DONE
**Gap**: DNS records are manually configured in MCNS zone files.
Agent automatically manages DNS records during deploy and stop:
- Deploy: calls MCNS API to create/update A records for
`<service>.svc.mcp.metacircular.net` pointing to the node's address.
- Stop/undeploy: removes DNS records before stopping containers.
- Config: `[mcns]` section in agent config with server URL, CA cert,
token path, zone, and node address.
- Nil-safe: if MCNS not configured, silently skipped (backward compatible).
- Authorization: mcp-agent system account can manage any record name.
**Work**:
- Agent creates/updates A records in MCNS for
`<service>.svc.mcp.metacircular.net`.
- Agent removes records on service teardown.
#### 7. Metacrypt: Automated Cert Issuance Policy — DONE
**Depends on**: MCNS record management API (#8).
MCP agent has MCIAS credentials and Metacrypt policy for automated cert
issuance:
- MCP agent authenticates to Metacrypt with MCIAS service credentials
- Metacrypt policy allows cert issuance for
`*.svc.mcp.metacircular.net`
- One cert per hostname per service — no wildcard certs
#### 7. Metacrypt: Automated Cert Issuance Policy
#### 8. MCNS: Record Management API — DONE
**Gap**: no policy exists for automated cert issuance. The MCP agent
doesn't have a Metacrypt identity or permissions.
**Work**:
- MCP agent gets an MCIAS service account.
- Metacrypt policy allows this account to issue certs scoped to
`*.svc.mcp.metacircular.net` (and explicitly listed public
hostnames).
- No wildcard certs — one cert per hostname per service.
**Depends on**: MCIAS service account provisioning (exists today, just
needs the account created).
#### 8. MCNS: Record Management API
**Gap**: MCNS v1.0.0 has REST + gRPC APIs and SQLite storage, but
records are currently seeded from migrations (static). The API supports
CRUD operations but MCP does not yet call it for dynamic registration.
**Work**:
- MCP agent calls MCNS API to create/update/delete records on
deploy/stop.
- MCIAS auth scoping to allow MCP agent to manage
`*.svc.mcp.metacircular.net` records.
**Depends on**: MCNS API exists. Remaining work is MCP integration
and auth scoping.
MCNS provides full CRUD for DNS records via REST and gRPC:
- REST: POST/GET/PUT/DELETE on `/v1/zones/{zone}/records`
- gRPC: RecordService with ListRecords, CreateRecord, GetRecord,
UpdateRecord, DeleteRecord RPCs
- SQLite-backed with transactional writes, CNAME exclusivity enforcement,
and automatic SOA serial bumping on mutations
- Authorization: admin can manage any record, mcp-agent system account
can manage any record name, other system accounts scoped to own name
- MCP agent uses the REST API to register/deregister records on
deploy/stop
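The REST call an agent makes on deploy, per the route shape above (`POST /v1/zones/{zone}/records`), could be built like this; the JSON field names (`name`, `type`, `data`) and the MCNS base URL are assumptions, since only the methods and path are documented here:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newCreateRecordRequest builds the REST call an MCP agent would make to
// register a deployed service's A record in MCNS. Hypothetical sketch:
// the commit documents POST /v1/zones/{zone}/records; the payload field
// names here are illustrative.
func newCreateRecordRequest(base, zone, name, typ, data string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{
		"name": name, "type": typ, "data": data,
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/v1/zones/%s/records", base, zone)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, _ := newCreateRecordRequest("https://mcns.example:8443",
		"svc.mcp.metacircular.net", "mcdoc", "A", "192.168.88.181")
	fmt.Println(req.Method, req.URL)
}
```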
#### 9. Application $PORT Convention — DONE
mcdsl v1.1.0 adds `$PORT` and `$PORT_GRPC` env var support:
mcdsl v1.2.0 added `$PORT` and `$PORT_GRPC` env var support:
- `config.Load` checks `$PORT` → overrides `Server.ListenAddr`
- `config.Load` checks `$PORT_GRPC` → overrides `Server.GRPCAddr`
- Takes precedence over TOML and generic env overrides
(`$MCR_SERVER_LISTEN_ADDR`) — agent-assigned ports are authoritative
- Handles both `config.Base` embedding (MCR, MCNS, MCAT) and direct
`ServerConfig` embedding (Metacrypt) via struct tree walking
- MCR, Metacrypt, MCNS upgraded to mcdsl v1.1.0
- All consuming services on mcdsl v1.4.0
---
@@ -311,32 +299,92 @@ Phase A — Independent groundwork: ✓ COMPLETE
Phase B — MCP route registration: ✓ COMPLETE
#3 Agent registers routes with mc-proxy ✓ DONE
Phase C — Automated TLS:
#7 Metacrypt cert issuance policy
#4 Agent provisions certs
Phase C — Automated TLS: ✓ COMPLETE
#7 Metacrypt cert issuance policy ✓ DONE
#4 Agent provisions certs ✓ DONE
(depends on #7)
Phase D — DNS:
#8 MCNS record management API
#6 Agent registers DNS
Phase D — DNS: ✓ COMPLETE
#8 MCNS record management API ✓ DONE
#6 Agent registers DNS ✓ DONE
(depends on #8)
Phase E — Multi-node agent management:
#10 Agent binary at /srv/mcp/mcp-agent on all nodes
#11 mcp agent upgrade (SSH-based cross-compiled push)
#12 Node provisioning tooling (Debian + NixOS)
(depends on #10)
```
**Phases A and B are complete.** Services can be deployed with
agent-assigned ports, `$PORT` env vars, and automatic mc-proxy route
registration. No more manual port picking, mcproxyctl, or TOML editing.
The remaining manual steps are TLS cert provisioning (Phase C) and
DNS registration (Phase D).
**Phases A, B, C, and D are complete.** Services can be deployed with
agent-assigned ports, `$PORT` env vars, automatic mc-proxy route
registration, automated TLS cert provisioning from Metacrypt CA, and
automatic DNS registration in MCNS. No more manual port picking,
mcproxyctl, TOML editing, cert generation, or DNS zone editing.
### Immediate Next Steps
1. **Phase C: Automated TLS** — Metacrypt cert issuance policy for MCP
agent, then agent provisions certs automatically during deploy.
2. **Phase D: DNS** — MCNS record management API integration, then
agent registers DNS records during deploy.
3. **mcdoc implementation** — fully designed, no platform evolution
dependency. Deployable now with the new route system.
1. **Phase E: Multi-node agent management** — see below. See
`docs/phase-e-plan.md` for the detailed Phase E design and
`docs/architecture-v2.md` for the v2 master/agent architecture.
2. **mcdoc** — deployed (v0.1.0 on rift). Serves rendered markdown
from Gitea via mc-proxy.
#### 10. Agent Binary Location Convention
**Gap**: The agent binary is currently NixOS-managed on rift (lives in
`/nix/store/`, systemd `ExecStart` points there). This doesn't work for
Debian nodes and requires a full `nixos-rebuild` for every MCP release.
**Work**:
- Standardize agent binary at `/srv/mcp/mcp-agent` on all nodes.
- NixOS config: change `ExecStart` from nix store path to
`/srv/mcp/mcp-agent`. NixOS still owns user, systemd unit, podman,
directories — just not the binary version.
- Debian nodes: same layout, provisioned by setup script.
#### 11. Agent Upgrade via SSH Push
**Gap**: Updating the agent requires manual, OS-specific steps. On
NixOS: update flake lock, commit, push, rebuild. On Debian: build, scp,
restart. With multiple nodes and architectures (amd64 + arm64), this
doesn't scale.
**Work**:
- `mcp agent upgrade [node]` CLI command.
- Cross-compiles agent for each target arch (`GOARCH` from node config).
- Uses `golang.org/x/crypto/ssh` to push the binary and restart the
service. No external tool dependencies.
- Node config gains `ssh` (hostname) and `arch` (GOARCH) fields.
- Upgrades all nodes by default to prevent version skew. New RPCs cause
`Unimplemented` errors if agent and CLI are out of sync.
**Depends on**: #10 (binary location convention).
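The per-node cross-compile step in #11 could look like this; `Node` mirrors the `ssh` and `arch` config fields the item adds, while the hostnames, output path, and package path are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// Node mirrors the config fields #11 adds: an SSH hostname and a GOARCH.
type Node struct {
	Name string
	SSH  string
	Arch string
}

// crossCompileCmd returns the go build invocation `mcp agent upgrade`
// would run for a target node before pushing the binary over SSH.
// Hypothetical sketch of the cross-compile-per-arch approach.
func crossCompileCmd(n Node) string {
	return strings.Join([]string{
		"GOOS=linux", "GOARCH=" + n.Arch,
		"go", "build", "-o", "build/mcp-agent-" + n.Arch, "./cmd/mcp-agent",
	}, " ")
}

func main() {
	for _, n := range []Node{
		{Name: "rift", SSH: "rift.example", Arch: "amd64"},
		{Name: "hyperborea", SSH: "hyperborea.example", Arch: "arm64"},
	} {
		fmt.Println(n.Name+":", crossCompileCmd(n))
	}
}
```

Upgrading all nodes by default, as the item notes, keeps agents and CLI in lockstep so new RPCs never hit an `Unimplemented` error from a stale agent.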
#### 12. Node Provisioning Tooling
**Gap**: Setting up a new node requires manual steps: create user,
create directories, install podman, write config, create systemd unit.
Different for NixOS vs Debian.
**Work**:
- Go-based provisioning tool (part of MCP CLI) or standalone script.
- `mcp node provision <name>` SSHs to the node and runs setup:
create `mcp` user with podman access, create `/srv/mcp/`, write
systemd unit, install initial binary, start service.
- For NixOS, provisioning remains in the NixOS config (declarative).
The provisioning tool targets Debian/generic Linux.
**Depends on**: #10 (binary location convention), #11 (SSH infra).
**Current fleet**:
| Node | OS | Arch | Status |
|------|----|------|--------|
| rift | NixOS | amd64 | Operational, single MCP agent |
| hyperborea | Debian (RPi) | arm64 | Online, agent provisioning planned |
| svc | Debian | amd64 | Runs MCIAS, needs agent for public edge services |
| orion | NixOS | amd64 | Provisioned, offline for maintenance |
---


@@ -26,7 +26,8 @@ lives in [docs/metacircular.md](docs/metacircular.md).
| **MCNS** | Networking — authoritative DNS for internal platform zones, upstream forwarding. | Implemented |
| **MCP** | Control plane — operator-driven deployment, service registry, data transfer, master/agent container lifecycle. | Implemented |
| **MCDoc** | Documentation server — renders markdown from Gitea, serves public docs. | Implemented |
| **MCDeploy** | Deployment CLI — single-binary tool for build, push, deploy, cert renewal, and status. Tactical bridge tool while MCP capabilities mature. | Active dev |
| **MCQ** | Document review queue — push docs for review, MCP server for Claude Code integration. | Implemented |
| **MCDeploy** | Deployment CLI — tactical bridge tool, now deprecated and archived. Superseded by MCP. | Deprecated |
Shared library: **MCDSL** — standard library for all services (auth, db,
config, TLS server, CSRF, snapshots).
@@ -103,8 +104,9 @@ metacircular/
├── mcns/ DNS server
├── mcat/ Login policy tester
├── mcdsl/ Standard library (shared packages)
├── mcdeploy/ Deployment CLI tool
├── mcdeploy/ Deployment CLI (deprecated, archived)
├── mcdoc/ Documentation server
├── mcq/ Document review queue
├── ca/ PKI infrastructure (dev/test, not source code)
└── docs/ Platform-wide documentation
```

STATUS.md

@@ -1,39 +1,45 @@
# Metacircular Platform Status
Last updated: 2026-03-27
Last updated: 2026-04-02
## Platform Overview
One node operational (**rift**), running core infrastructure services as
containers fronted by MC-Proxy. MCIAS runs separately (not on rift).
Bootstrap phases 0–4 complete (MCIAS, Metacrypt, MC-Proxy, MCR all
operational). MCP is deployed and managing all platform containers. MCNS is
deployed on rift, serving authoritative DNS.
Two nodes operational (**rift** + **svc**), with **orion** provisioned but
offline for maintenance. Core infrastructure services run as containers on
rift, fronted by MC-Proxy. Svc operates as an MCP edge node managing
mc-proxy routing only (no containers); MCIAS runs on svc separately as a
systemd service. Bootstrap phases 0–4 complete (MCIAS, Metacrypt, MC-Proxy,
MCR all operational). MCP is deployed and managing all platform containers
on rift, with multi-node capability (svc as edge node). MCNS is deployed on
rift, serving authoritative DNS. Platform evolution Phases A–D complete
(automated port assignment, route registration, TLS cert provisioning, and
DNS registration). Phase E (multi-node expansion) is in planning, with v2
architecture in development.
## Service Status
| Service | Version | SDLC Phase | Deployed | Node |
|---------|---------|------------|----------|------|
| MCIAS | v1.8.0 | Maintenance | Yes | (separate) |
| Metacrypt | v1.1.0 | Production | Yes | rift |
| MC-Proxy | v1.2.1 | Maintenance | Yes | rift |
| MCR | v1.2.0 | Production | Yes | rift |
| MCAT | v1.1.0 | Complete | Unknown | — |
| MCDSL | v1.2.0 | Stable | N/A (library) | — |
| MCNS | v1.1.0 | Production | Yes | rift |
| MCIAS | v1.10.5 | Maintenance | Yes | svc (systemd) |
| Metacrypt | v1.4.1 | Production | Yes | rift |
| MC-Proxy | v1.2.2 | Maintenance | Yes | rift |
| MCR | v1.3.2 | Production | Yes | rift |
| MCAT | v1.2.0 | Production | Yes | rift |
| MCDSL | v1.7.0 | Stable | N/A (library) | — |
| MCNS | v1.2.0 | Production | Yes | rift |
| MCDoc | v0.1.0 | Production | Yes | rift |
| MCP | v0.4.0 | Production | Yes | rift |
| MCDeploy | v0.2.0 | Active dev | N/A (CLI tool) | — |
| MCQ | v0.4.2 | Production | Yes | rift |
| MCP | v0.9.0 | Production | Yes | rift |
## Service Details
### MCIAS — Identity and Access Service
- **Version:** v1.8.0 (client library: clients/go/v0.2.0)
- **Version:** v1.10.5 (client library: clients/go/v0.2.0)
- **Phase:** Maintenance. Phases 0-14 complete. Feature-complete with active
refinement.
- **Deployment:** Running in production. All other services authenticate
against it.
- **Deployment:** Running in production on svc as a systemd service. All
other services authenticate against it.
- **Recent work:** WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login
policies, Nix flake for CLI tools.
- **Artifacts:** systemd units (service + backup timer), install script,
@@ -41,7 +47,7 @@ deployed on rift, serving authoritative DNS.
### Metacrypt — Cryptographic Service Engine
- **Version:** v1.1.0.
- **Version:** v1.4.1 (API v1.3.1, Web v1.4.1).
- **Phase:** Production. All four engine types implemented (CA, SSH CA, transit,
user-to-user). Active work on integration test coverage.
- **Deployment:** Running on rift as a container, fronted by MC-Proxy on
@@ -53,10 +59,12 @@ deployed on rift, serving authoritative DNS.
### MC-Proxy — TLS Proxy and Router
- **Version:** v1.2.1.
- **Phase:** Maintenance. Stable and actively routing traffic on rift.
- **Version:** v1.2.2.
- **Phase:** Maintenance. Stable and actively routing traffic on rift and svc.
- **Deployment:** Running on rift. Fronts Metacrypt, MCR, and sgard on ports
443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091.
443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091. Routes persisted
in SQLite and managed via gRPC API. Svc runs its own mc-proxy on :443 with
public-facing routes.
- **Recent work:** Route persistence (SQLite), idempotent AddRoute (upsert),
golangci-lint v2 compliance, module path migration to mc/ org.
- **Artifacts:** systemd units (service + backup timer), Docker Compose
@@ -64,7 +72,8 @@ deployed on rift, serving authoritative DNS.
### MCR — Container Registry
- **Version:** v1.2.0. All implementation phases complete.
- **Version:** v1.3.2 (API v1.2.1, Web v1.3.2). All implementation phases
complete.
- **Phase:** Production. Deployed on rift, serving container images.
- **Deployment:** Running on rift as two containers (mcr API + mcr-web),
fronted by MC-Proxy on ports 443 (web, L7), 8443 (API, L4), and
@@ -77,29 +86,30 @@ deployed on rift, serving authoritative DNS.
### MCAT — Login Policy Tester
- **Version:** v1.1.0.
- **Phase:** Complete. Diagnostic tool, not core infrastructure.
- **Deployment:** Available for ad-hoc use. Lightweight tool for testing
MCIAS login policy rules.
- **Version:** v1.2.0.
- **Phase:** Production. Deployed on rift as a container managed by MCP.
- **Deployment:** Running on rift. Lightweight tool for testing MCIAS login
policy rules.
- **Recent work:** Migrated to mcdsl for auth, config, CSRF, and web.
- **Artifacts:** systemd unit, install script, example config.
### MCDSL — Standard Library
- **Version:** v1.2.0.
- **Version:** v1.7.0.
- **Phase:** Stable. All 9 packages implemented and tested. Being adopted
across the platform.
- **Deployment:** N/A (Go library, imported by other services).
- **Packages:** auth, db, config, httpserver, grpcserver, csrf, web, health,
archive.
- **Adoption:** All services except mcias on v1.2.0. mcias pending.
- **Adoption:** All services except mcias on v1.7.0. mcias pending.
### MCNS — Networking Service
- **Version:** v1.1.0.
- **Version:** v1.2.0.
- **Phase:** Production. Custom Go DNS server replacing CoreDNS precursor.
- **Deployment:** Running on rift as a container managed by MCP. Serves two
authoritative zones plus upstream forwarding.
authoritative zones plus upstream forwarding. REST + gRPC APIs with MCIAS
auth and name-scoped system account authorization.
- **Recent work:** v1.0.0 implementation (custom Go DNS server), engineering
review, deployed to rift replacing CoreDNS.
- **Artifacts:** Dockerfile, Docker Compose (rift), MCP service definition,
@@ -115,53 +125,65 @@ deployed on rift, serving authoritative DNS.
rendering with syntax highlighting, webhook-driven refresh.
- **Artifacts:** Dockerfile, MCP service definition.
### MCQ — Document Review Queue
- **Version:** v0.4.2.
- **Phase:** Production. Document review queue with MCP server for Claude
integration.
- **Deployment:** Running on rift as a container managed by MCP.
- **Recent work:** Claude MCP server integration, document review workflow.
- **Artifacts:** Dockerfile, MCP service definition.
### MCP — Control Plane
- **Version:** v0.4.0.
- **Phase:** Production. Phases 0-4 complete. Phase C (automated TLS cert
provisioning) implemented. Deployed to rift, managing all platform containers.
- **Version:** v0.9.0 (agent on rift: v0.8.3-dirty, agent on svc: v0.9.0).
- **Phase:** Production. Phases A–D complete. Multi-node capable with svc
operating as an edge node. V2 architecture in development, Phase E planning
underway.
- **Deployment:** Running on rift. Agent as systemd service under `mcp` user
with rootless podman. Manages metacrypt, mc-proxy, mcr, and mcns containers.
with rootless podman. Manages metacrypt, mc-proxy, mcr, mcns, mcdoc, mcat,
mcq, and non-platform containers. Svc runs an MCP agent for edge mc-proxy
route management.
- **Architecture:** Two components — `mcp` CLI (thin client on vade) and
`mcp-agent` (per-node daemon with SQLite registry, podman management,
monitoring with drift/flap detection, route registration with mc-proxy during
deploy/stop, automated TLS cert provisioning for L7 routes via Metacrypt CA).
gRPC-only (no REST).
- **Recent work:** Full v1 implementation (12 RPCs, 15 CLI commands),
deployment to rift, container migration from kyle→mcp user, service
definition authoring. Phase C automated TLS cert provisioning for L7 routes,
mc-proxy route registration during deploy, mc-proxy dependency updated to
v1.2.0, module path migration.
monitoring with drift/flap detection, route registration with mc-proxy,
automated TLS cert provisioning for L7 routes via Metacrypt CA, automated
DNS registration in MCNS). gRPC-only (no REST). 15 RPCs, 17+ CLI commands.
- **Recent work:** Phase C (automated TLS cert provisioning), Phase D
(automated DNS registration via MCNS), undeploy command, logs command,
edit command, auto-login to MCR, system account auth model, module path
migration, multi-node edge support (svc).
- **Artifacts:** systemd service (NixOS), TLS cert from Metacrypt, service
definition files, design docs.
### MCDeploy — Deployment CLI
- **Version:** v0.2.0.
- **Phase:** Active development. Tactical bridge tool for deploying services
while MCP is being built.
- **Deployment:** N/A (local CLI tool, not a server).
- **Recent work:** Initial implementation, Nix flake.
- **Description:** Single-binary CLI that shells out to podman/ssh/scp/git
for build, push, deploy, cert renewal, and status. TOML-configured.
## Node Inventory
| Node | Address (LAN) | Address (Tailscale) | Role |
|------|---------------|---------------------|------|
| rift | 192.168.88.181 | 100.95.252.120 | Infrastructure services |
| svc | — | 100.106.232.4 | Edge — public mc-proxy, MCIAS (systemd) |
| orion | 192.168.88.214 | — | Worker (provisioned, offline for maintenance) |
## Rift Port Map
Note: Services deployed via MCP receive dynamically assigned host ports
(10000–60000). The ports below are for infrastructure services with static
assignments or well-known ports.
| Port | Protocol | Services |
|------|----------|----------|
| 53 | DNS (LAN + Tailscale) | mcns |
| 443 | L7 (TLS termination) | metacrypt-web, mcr-web, mcdoc, mcat, kls |
| 8080 | HTTP (all interfaces) | exod |
| 8443 | L4 (SNI passthrough) | metacrypt API, mcr API |
| 9090 | HTTP (all interfaces) | exod |
| 9091 | HTTP (loopback) | MC-Proxy Prometheus metrics |
| 9443 | L4 (SNI passthrough) | metacrypt gRPC, mcr gRPC, sgard |
Svc runs its own mc-proxy on :443 (L7) with public-facing routes:
kls.metacircular.net, mcq.metacircular.net, metacrypt.metacircular.net,
docs.metacircular.net, git.metacircular.net, git.wntrmute.dev.
Non-platform services also running on rift: **exod** (v0.1.0, ports
8080/9090), **sgardd** (v3.2.0, port 19473, fronted by MC-Proxy on 9443),
**kls** (v0.2.0).

SVC_WORK.md
# svc.metacircular.net — Phase 1 Work Log
Date: 2026-04-02
Purpose: Deploy mcp-agent to svc (edge node) for MCP v2 Phase 1.
## Changes Made
### 1. Created `mcp` system user
```
useradd --system --home-dir /srv/mcp --create-home --shell /usr/sbin/nologin mcp
usermod -aG mc-proxy mcp
```
- UID 992, GID 991
- Member of `mc-proxy` group for socket access
### 2. Created `/srv/mcp/` directory structure
```
/srv/mcp/
├── mcp-agent # binary (v0.8.3-1-gfa8ba6f, linux/amd64)
├── mcp-agent.toml # agent config
├── mcp.db # SQLite registry (created on first run)
└── certs/
├── cert.pem # TLS cert (SAN: IP:100.106.232.4, DNS:svc.svc.mcp.metacircular.net)
├── key.pem # TLS private key
└── ca.pem # Metacircular CA cert
```
- Owned by `mcp:mcp`, key file mode 0600
### 3. TLS certificate
- Issued from the Metacircular CA (`ca/ca.pem` + `ca/ca.key`)
- Subject: `CN=mcp-agent-svc`
- SANs: `IP:100.106.232.4`, `DNS:svc.svc.mcp.metacircular.net`
- Validity: 365 days
- Stored at `/srv/mcp/certs/{cert,key,ca}.pem`
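The issuance can be reproduced with plain `openssl`. A minimal sketch: it builds a throwaway CA in a temp directory in place of the real `ca/ca.pem` and `ca/ca.key`, so only the subject and SANs match the production cert.

```shell
# Sketch only: throwaway CA stands in for the real ca/ca.pem + ca/ca.key.
set -e
dir=$(mktemp -d); cd "$dir"
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Metacircular Test CA" \
  -days 365 -keyout ca.key -out ca.pem
# Agent key + CSR with the svc subject
openssl req -newkey rsa:2048 -nodes -subj "/CN=mcp-agent-svc" \
  -keyout key.pem -out agent.csr
# SANs matching the agent config
printf 'subjectAltName=IP:100.106.232.4,DNS:svc.svc.mcp.metacircular.net\n' > san.ext
openssl x509 -req -in agent.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
  -days 365 -extfile san.ext -out cert.pem
chmod 600 key.pem
openssl verify -CAfile ca.pem cert.pem
```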
### 4. Agent configuration
- File: `/srv/mcp/mcp-agent.toml`
- gRPC listen: `100.106.232.4:9555` (port 9444 in use by MCNS)
- MCIAS: `https://mcias.metacircular.net:8443`
- mc-proxy socket: `/srv/mc-proxy/mc-proxy.sock`
- Node name: `svc`
- Runtime: `podman` (not used on edge, but required by config)
### 5. systemd unit
- File: `/etc/systemd/system/mcp-agent.service`
- Runs as `mcp:mcp`
- Security hardened (NoNewPrivileges, ProtectSystem=strict, etc.)
- ReadWritePaths: `/srv/mcp`, `/srv/mc-proxy/mc-proxy.sock`
- Enabled and started
### 6. mc-proxy directory permissions
- Changed `/srv/mc-proxy/` from `drwx------` to `drwxr-x---` (group traversal)
- Changed `/srv/mc-proxy/mc-proxy.sock` from `srw-------` to `srw-rw----` (group read/write)
- Required for `mcp` user (in `mc-proxy` group) to access the socket
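The permission change can be rehearsed on a scratch directory before touching the live paths (`/srv/mc-proxy` and its socket); note the real fix also requires the `mcp` user to be in the `mc-proxy` group.

```shell
# Rehearsal on a scratch directory; a plain file stands in for the socket.
set -e
dir=$(mktemp -d)/mc-proxy
mkdir -p "$dir"
: > "$dir/mc-proxy.sock"        # stand-in for the unix socket
chmod 750 "$dir"                # drwxr-x--- : group members may traverse
chmod 660 "$dir/mc-proxy.sock"  # srw-rw---- : group read/write
stat -c '%a %n' "$dir" "$dir/mc-proxy.sock"
```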
### 7. MCP CLI config update (on rift)
- Added svc node to `~/.config/mcp/mcp.toml`:
```toml
[[nodes]]
name = "svc"
address = "100.106.232.4:9555"
```
## Verification
```
$ mcp node list
NAME ADDRESS VERSION
rift 100.95.252.120:9444 v0.8.3-dirty
svc 100.106.232.4:9555 v0.8.3-1-gfa8ba6f
$ mcp route list -n svc
NODE: svc
mc-proxy v1.2.1-2-g82fce41-dirty
:443 routes=6
l7 git.wntrmute.dev → 127.0.0.1:3000
l7 kls.metacircular.net → 100.95.252.120:58080
l7 mcq.metacircular.net → 100.95.252.120:48080
l7 metacrypt.metacircular.net → 100.95.252.120:18080 (re-encrypt)
l7 docs.metacircular.net → 100.95.252.120:38080
l7 git.metacircular.net → 127.0.0.1:3000
```
## Agent Cert Reissue (2026-04-02)
Both agent certs reissued with comprehensive SANs:
**Rift agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `rift.scylla-hammerhead.ts.net`, `mcp-agent.svc.mcp.metacircular.net`
- IP: `100.95.252.120`, `192.168.88.181`
**Svc agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `svc.scylla-hammerhead.ts.net`, `svc.svc.mcp.metacircular.net`
- IP: `100.106.232.4`
Both agents upgraded to v0.10.0 (Phase 2 edge routing RPCs + v2 proto fields).
## MCP Master Deployment (2026-04-02)
**Binary**: `/srv/mcp-master/mcp-master` (v0.10.0) on rift
**Config**: `/srv/mcp-master/mcp-master.toml`
**Database**: `/srv/mcp-master/master.db`
**Certs**: `/srv/mcp-master/certs/{cert,key,ca}.pem`
- SAN: `rift.scylla-hammerhead.ts.net`, `mcp-master.svc.mcp.metacircular.net`, IP `100.95.252.120`
**Service token**: `/srv/mcp-master/mcias-token` (MCIAS identity: `mcp-master`, expires 2027-04-03)
**Listen**: `100.95.252.120:9555`
**Bootstrap nodes**: rift (master), svc (edge)
**Status**: Running via `doas` (ad-hoc). NixOS read-only /etc prevents
direct systemd unit creation — needs NixOS config update for persistent
service.
**Tested**:
- `mcp deploy mcq` → master places on rift, forwards to agent ✓
- `mcp undeploy mcq` → master forwards to agent, cleans up placement ✓
- `mcp ps` → fleet-wide status through agents ✓
- `mcp node list` → both nodes visible with versions ✓
## CLI Config Changes (vade)
Updated `~/.config/mcp/mcp.toml`:
- Added `[master]` section: `address = "rift.scylla-hammerhead.ts.net:9555"`
- All node addresses switched to Tailscale DNS names
- Added CA cert path
## Known Limitations
- ~~mc-proxy socket permissions will reset on restart~~ **FIXED**: mc-proxy
now creates the socket with 0660 (was 0600). Committed to mc-proxy master.
- Master runs ad-hoc via `doas` on rift. Needs NixOS systemd config for
persistent service (rift has read-only /etc).
- DNS registration not configured on master (MCNS config omitted for now).
- Edge routing not yet tested end-to-end through master (svc cert provisioning
not configured).
- The TLS cert was issued from the local CA directly, not via Metacrypt API.
Should be re-issued via Metacrypt once the agent has cert provisioning.
- Container runtime is set to `podman` but podman is not installed on svc
(Docker is). Edge agents don't run containers so this is benign.
- Metacrypt and MCNS integrations not configured (not needed for edge role).
## Edge Routing E2E Test (2026-04-02)
Full edge routing flow tested successfully through the master:
1. Added `public = true` route for `mcq.metacircular.net` to mcq service def
2. `mcp deploy mcq` → master placed on rift, deployed, registered DNS, set up edge route on svc
3. Svc agent provisioned TLS cert from Metacrypt (expires 2026-06-28)
4. mc-proxy route created: `mcq.metacircular.net → 100.95.252.120:8443` (re-encrypt)
5. Edge route persisted in both master DB and svc agent registry
**Fix required**: `RouteDef` in servicedef.go was missing `Public` field —
TOML `public = true` was silently dropped. Fixed in v0.10.2 along with
`Tier` field and relaxed `Node` validation.
## DNS Registration Working (2026-04-02)
Master registers Tailnet IPs in MCNS (not LAN IPs). Fix in v0.10.1
resolves Tailscale DNS names to IPs before passing to MCNS.
`mcq.svc.mcp.metacircular.net → 100.95.252.120` (Tailnet IP) ✓


Go binaries built with `CGO_ENABLED=0` and installed via Nix flakes.
| Tool | Project | Purpose | Install target |
|------|---------|---------|---------------|
| `mcp` | mcp | Control plane CLI — deploy, status, lifecycle, file transfer | vade, orion |
| `mcp-agent` | mcp | Control plane agent — per-node container management daemon | rift, svc (systemd) |
| `mciasctl` | mcias | MCIAS admin CLI — accounts, tokens, policies | vade, orion, rift |
| `mciasgrpcctl` | mcias | MCIAS gRPC debug CLI | vade, orion, rift |
| `mcproxyctl` | mc-proxy | MC-Proxy admin CLI — routes, firewall, status | vade, orion, rift |
| `mcrctl` | mcr | MCR admin CLI — repositories, policies, audit | vade, orion, rift |
| `mcdeploy` | mcdeploy | Deployment CLI — build, push, deploy (bridge tool) | vade |
### Server-only binaries (not installed as tools)
These run inside containers and are not installed on operator workstations:
| `mc-proxy` | mc-proxy | TLS proxy server |
| `mcns` | mcns | DNS server |
| `mcat` | mcat | Login policy tester web app |
| `mcdoc` | mcdoc | Documentation server |
| `mcq` | mcq | Document review queue |
## Installation
Tools are installed via `mcpkg.nix` in the NixOS configuration. Adding a tool:
- Build: `pkgs.buildGoModule` with `vendorHash = null` (vendored deps).
- ldflags: `-s -w -X main.version=${version}`.
- `subPackages`: list only the client binaries, not servers.
- `system`: `x86_64-linux` for rift, svc, and orion; `aarch64-linux`
for hyperborea. Flakes that target the full fleet should support both.
### MCP agent
The `mcp-agent` is a special case: it runs as a systemd service on
managed nodes (not as a container, since it manages containers). Its
flake exposes both `mcp` (client) and `mcp-agent` (server). Phase E is
moving the agent binary to `/srv/mcp/mcp-agent` on all nodes — NixOS
`ExecStart` will point there instead of a nix store path, and Debian
nodes use the same layout. svc already follows this convention. See
`docs/phase-e-plan.md` for details.
## Flake status
| Project | Flake | Client tools | Installed | Notes |
|---------|-------|--------------|-----------|-------|
| mcias | Yes | mciasctl, mciasgrpcctl | Yes | |
| mc-proxy | Yes | mcproxyctl | Yes | |
| mcr | Yes | mcrctl | Yes | |
| mcdeploy | Yes | mcdeploy | Yes | |
| mcp | Yes | mcp, mcp-agent | Yes | Agent also used by mcp.nix systemd unit |
| mcns | No | — | No | Server-only, no client tool yet |
| metacrypt | No | — | No | Server-only, no client tool yet |
| mcat | No | — | No | Server-only, no client tool yet |
| mcdoc | No | — | No | Server-only, deployed as container |
| mcq | No | — | No | Server-only, document review queue |
| mcdsl | No | — | No | Library, no binaries |

docs/architecture-v2.md (file diff suppressed because it is too large)

docs/disaster-recovery.md
# Disaster Recovery: Bootstrap from Zero
This document covers recovering the Metacircular platform when all
services on rift are down and no containers are running. It assumes:
- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (`/srv/<service>/`) are intact on disk
- Container images are cached in podman's overlay storage
If images are NOT cached (fresh machine, disk wipe), see the
"Cold Start" section at the end.
## Prerequisites
Before starting recovery, verify:
```bash
# 1. Machine is up
hostname # should print "rift"
# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)
# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)
# 4. The agent is running
systemctl status mcp-agent
# Should be active
# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 0
```
If Tailscale is not running: `doas systemctl start tailscaled && doas tailscale up`
If the agent is not running: check `/srv/mcp/mcp-agent` exists and
`/srv/mcp/mcp-agent.toml` is correct, then `doas systemctl restart mcp-agent`.
## Recovery Order
Services must be started in dependency order. Each stage must be
healthy before the next starts.
```
Stage 1 (Foundation): MCNS → DNS works
Stage 2 (Core): mc-proxy, MCR, Metacrypt → routing + images + certs
Stage 3 (Management): mcp-master → orchestration
Stage 4 (Services): mcq, mcdoc, mcat, kls, sgard, exo → applications
```
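The stage gating above can be sketched as a small probe loop: run a health check until it succeeds before starting the next stage. The `podman`/`dig` lines are commented out because this sketch does not assume a live rift; the probe commands are illustrative.

```shell
# wait_healthy "<probe command>" [retries] : retry until the probe passes.
wait_healthy() {
  i=0
  while [ "$i" -lt "${2:-30}" ]; do
    sh -c "$1" >/dev/null 2>&1 && return 0
    i=$((i + 1)); sleep 1
  done
  echo "probe failed: $1" >&2
  return 1
}
# Stage 1: podman start mcns
# wait_healthy 'dig @192.168.88.181 google.com +short | grep -q .'
# Stage 2: podman start mc-proxy mcr-api mcr-web metacrypt-api metacrypt-web
# wait_healthy 'curl -sk https://localhost:443/ -o /dev/null'
```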
## Manual Recovery Commands
All commands run as the mcp user. Use this shell prefix:
```bash
# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'
# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift
```
For brevity, commands below show the `podman run` portion only. Prefix
with the environment setup above.
### Stage 1: MCNS (DNS)
MCNS must start first. Without it, no hostname resolution works.
```bash
podman run -d --name mcns --restart unless-stopped \
-p 192.168.88.181:53:53/tcp \
-p 192.168.88.181:53:53/udp \
-p 100.95.252.120:53:53/tcp \
-p 100.95.252.120:53:53/udp \
-p 127.0.0.1:38443:8443 \
-v /srv/mcns:/srv/mcns \
mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
server --config /srv/mcns/mcns.toml
```
**Verify:**
```bash
dig @192.168.88.181 google.com +short
# Should return an IP address
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP
```
**Note:** MCNS binds to specific IPs, not `0.0.0.0`, because
systemd-resolved holds port 53 on localhost. The explicit bindings
avoid the conflict.
### Stage 2: Core Infrastructure
#### mc-proxy (TLS routing)
```bash
podman run -d --name mc-proxy --restart unless-stopped \
--network host \
-v /srv/mc-proxy:/srv/mc-proxy \
mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
server --config /srv/mc-proxy/mc-proxy.toml
```
**Verify:** `curl -sk https://localhost:443/ 2>&1 | head -1`
(should get a response, even if 404)
#### MCR (Container Registry)
```bash
# API server
podman run -d --name mcr-api --restart unless-stopped \
-v /srv/mcr:/srv/mcr \
-p 127.0.0.1:28443:8443 \
-p 127.0.0.1:29443:9443 \
mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
server --config /srv/mcr/mcr.toml
# Web UI
podman run -d --name mcr-web --restart unless-stopped \
--user 0:0 \
-v /srv/mcr:/srv/mcr \
-p 127.0.0.1:28080:8080 \
mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
server --config /srv/mcr/mcr.toml
```
**If MCR fails with "chmod" or "readonly database":**
```bash
podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api
```
This recreates the database empty. Image blobs in `/srv/mcr/layers/`
are preserved but tag metadata is lost. Re-push images to rebuild the
registry.
#### Metacrypt (PKI / Secrets)
```bash
# API server
podman run -d --name metacrypt-api --restart unless-stopped \
-v /srv/metacrypt:/srv/metacrypt \
-p 127.0.0.1:18443:8443 \
-p 127.0.0.1:19443:9443 \
mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
server --config /srv/metacrypt/metacrypt.toml
# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
-v /srv/metacrypt:/srv/metacrypt \
-p 127.0.0.1:18080:8080 \
mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
--config /srv/metacrypt/metacrypt.toml
```
**If Metacrypt fails with "chmod" or "readonly database":**
Same fix as MCR — delete the database files. **Warning:** this loses
all encrypted secrets, issued certs tracking, and CA state. The CA
key itself is in the sealed vault (password-protected), not in SQLite.
### Stage 3: MCP Master
```bash
podman run -d --name mcp-master --restart unless-stopped \
--network host \
-v /srv/mcp-master:/srv/mcp-master \
mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
server --config /srv/mcp-master/mcp-master.toml
```
**Verify:**
```bash
# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion
```
### Stage 4: Application Services
Once the master is running, deploy applications through MCP:
```bash
mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct
```
Or start them manually:
```bash
# MCQ
podman run -d --name mcq --restart unless-stopped \
-v /srv/mcq:/srv/mcq \
-p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
server --config /srv/mcq/mcq.toml
# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
-v /srv/mcdoc:/srv/mcdoc \
-p 127.0.0.1:38080:8080 \
mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
server --config /srv/mcdoc/mcdoc.toml
# MCAT
podman run -d --name mcat --restart unless-stopped \
-v /srv/mcat:/srv/mcat \
-p 127.0.0.1:48116:8443 \
mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
server --config /srv/mcat/mcat.toml
# KLS
podman run -d --name kls --restart unless-stopped \
-v /srv/kls:/srv/kls \
-p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
-f /srv/kls/kls.conf
# Sgard
podman run -d --name sgardd --restart unless-stopped \
-v /srv/sgard:/srv/sgard \
-p 127.0.0.1:19473:9473 \
mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
--repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
--tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key
```
## Verification Checklist
After all services are running:
```bash
# Fleet status
mcp ps
# All services should show "running"
# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# MCIAS (runs on svc, should be unaffected by rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health
# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/
# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health
# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```
## Common Errors
### "chmod: operation not permitted"
modernc.org/sqlite calls `fchmod()` on database files. This is denied
inside rootless podman user namespaces. Fix:
```bash
# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>
```
The `fchmod` error will still appear in logs as a warning but is
non-fatal for newly created databases.
### "address already in use" on port 53
systemd-resolved holds port 53 on localhost. MCNS must bind to
specific IPs, not `0.0.0.0:53`. Use explicit port bindings:
`-p 192.168.88.181:53:53 -p 100.95.252.120:53:53`
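A quick diagnostic for this error is to list what currently holds port 53 (typically systemd-resolved on 127.0.0.53). A sketch; the awk filter keeps rows whose local address ends in `:53`:

```shell
# List listeners on port 53: protocol and local address.
port53_listeners() {
  ss -tulnH 2>/dev/null | awk '$5 ~ /:53$/ {print $1, $5}'
}
port53_listeners || true
```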
### "connection refused" to MCR
MCR is down. Images are cached locally — you can start services that
use cached images without MCR. MCR itself starts from its cached
image.
### Agent shows "error" for all nodes
Check:
1. Tailscale is running on both the CLI machine and the target node
2. The agent is listening: `ss -tlnp | grep 9444`
3. The CLI config has the correct addresses
4. TLS certs have the right SANs for the Tailnet IP
### "podman: executable file not found"
This warning appears for svc (which uses Docker, not podman). It's
benign — svc is an edge node that doesn't run containers.
## Cold Start (No Cached Images)
If the machine was wiped and no images are cached:
1. **MCIAS** runs on svc (Docker container), not rift. It should be
unaffected by a rift failure. Verify: `ssh svc.metacircular.net
"docker ps | grep mcias"`.
2. **Pre-stage images** by pulling from a backup or building locally:
```bash
# On vade (operator workstation), build and push to a temp location
cd ~/src/metacircular/mcns && make docker
podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
ssh rift "podman load"
```
Repeat for each service.
3. Alternatively, if another node has MCR access, push images there
first, then pull from the running MCR instance.
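The pre-staging step can be scripted over the whole image set. A sketch with illustrative tags taken from this runbook; `DRY_RUN=1` prints each transfer pipeline instead of invoking podman/ssh.

```shell
# transfer <image:tag> : stream a cached image to rift over SSH.
registry=mcr.svc.mcp.metacircular.net:8443
transfer() {
  if [ -n "$DRY_RUN" ]; then
    echo "podman save $registry/$1 | ssh rift 'podman load'"
  else
    podman save "$registry/$1" | ssh rift "podman load"
  fi
}
DRY_RUN=1
for img in mcns:v1.2.0 mc-proxy:v1.2.2 mcr:v1.2.1 metacrypt:v1.3.1 \
           mcp-master:v0.10.3; do
  transfer "$img"
done
```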
## Service Reference
Quick reference for all services, their images, and critical flags:
| Service | Image | Network | Key Ports | Config Path |
|---------|-------|---------|-----------|-------------|
| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml |
| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml |
| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml |
| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml |
| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml |
| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml |
| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml |
| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml |
| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml |
| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml |
| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf |
| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) |
All images are prefixed with `mcr.svc.mcp.metacircular.net:8443/`.


## High-Level Overview
Metacircular infrastructure is built from six core components and a
documentation server, plus a shared standard library (**MCDSL**) that provides
the common patterns all services depend on (auth integration, database setup,
config loading, HTTP/gRPC server bootstrapping, CSRF, web session management,
health checks, snapshots, and service directory archiving):
- **MCIAS** — Identity and access. The root of trust for all other services.
Handles authentication, token issuance, role management, and login policy
accepts outside connections, and routes them to the correct service — either
as raw TCP passthrough or via TLS-terminating HTTP/2 reverse proxy.
- **MCDoc** — Documentation server. Fetches markdown from Gitea repositories,
renders HTML with syntax highlighting, serves a navigable documentation site.
Public-facing, no MCIAS authentication required.
These components form a dependency graph rooted at MCIAS.
Every service delegates authentication to MCIAS; no service maintains its
own user database.
Services validate tokens by calling back to MCIAS (cached 30s by SHA-256 of
the token).
- **Role-based access.** Three roles — `admin` (MCIAS account management,
policy changes, zone mutations — reserved for human operators), `user`
(policy-governed), `guest` (service-dependent restrictions, rejected by MCP
agent). Admin detection comes solely from the MCIAS `admin` role; services
never promote users locally. Routine operations (deploy, push, DNS updates)
do not require admin.
- **Account types.** Human accounts (interactive users) and system accounts
(service-to-service). Both produce standard JWTs validated the same way.
System accounts carry no roles — their authorization is handled by each
service's policy engine (Metacrypt policies, MCNS name-scoped access, MCR
default policies). System account tokens are long-lived (365-day default)
and do not require passwords for issuance.
- **Login policy.** Priority-based ACL rules that control who can log into
which services. Rules can target roles, account types, service names, and
@@ -204,7 +213,7 @@ MCIAS evaluates login policy against the service context, verifies credentials,
and returns a bearer token. The MCIAS Go client library
(`git.wntrmute.dev/mc/mcias/clients/go`) handles this flow.
**Status:** Implemented. v1.9.0. Feature-complete with active refinement
(WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login policies).
---
operations on which engine mounts. Priority-based evaluation, default deny,
admin bypass. See Metacrypt's `POLICY.md` for the full model.
**Status:** Implemented. v1.3.1. All four engine types complete — CA (with ACME
support), SSH CA, transit encryption, and user-to-user encryption.
---
MCR serves the container images that MCP deploys across the platform.
- **Authenticated access.** No anonymous access. MCR uses the OCI token
authentication flow: clients hit `/v2/`, receive a 401 with a token
endpoint, authenticate via MCIAS, and use the returned JWT for subsequent
requests. The token endpoint accepts both username/password (standard
login) and pre-existing MCIAS JWTs as passwords (personal-access-token
pattern), enabling non-interactive push/pull for system accounts and CI.
- **Policy-controlled push/pull.** Fine-grained ACL rules govern who can push
to or pull from which repositories. Integrated with MCIAS roles.
is scheduled, MCP tells the node's agent which image to pull and where to get
it. MCR sits behind an MC-Proxy instance for TLS routing.
**Status:** Implemented. v1.2.1. All implementation phases complete.
---
MC-Proxy can be deployed as two instances — an edge proxy on a public VPS
and an origin proxy on the
private network, connected over the overlay with PROXY protocol preserving
client IPs across the hop.
**Status:** Implemented. v1.2.1. Route state persisted in SQLite with
write-through semantics. gRPC admin API with idempotent AddRoute for runtime
route management.
---
using internal DNS names automatically resolve to the right place without
config changes.
- **Record management API.** Authenticated via MCIAS with name-scoped
authorization. Admin can manage all records and zones. The `mcp-agent`
system account can create and delete any record. Other system accounts
can only manage records matching their own name (e.g., system account
`mcq` can manage `mcq.svc.mcp.metacircular.net` but not other records).
Human users have read-only access to records. Zone mutations (create,
update, delete zones) remain admin-only.
**How it fits in:** MCNS answers "what is the address of X?" MCP answers "where
is service α running?" and pushes the answer to MCNS. This separation means
services can use stable DNS names in their configs (e.g.,
`mcias.svc.mcp.metacircular.net` in `[mcias] server_url`) that survive
migration without config changes.
**Status:** Implemented. v1.1.1. Custom Go DNS server deployed on rift,
serving two authoritative zones (`svc.mcp.metacircular.net` and
`mcp.metacircular.net`) plus upstream forwarding. REST + gRPC APIs with
MCIAS auth and name-scoped system account authorization. Records stored
in SQLite.
---
the initial config, pulls the image from MCR, starts the container, and
pushes a DNS update to MCNS (`α.svc.mcp.metacircular.net` → node address).
- **Undeploy.** Full teardown of a service. Stops the container, removes
MC-Proxy routes, deletes DNS records from MCNS, and cleans up the service
registry entry. The inverse of deploy.
- **Migrate.** Move a service from one node to another. MCP snapshots the
service's `/srv/<service>/` directory on the source node (as a tar.zst
image), transfers it to the destination, extracts it, starts the service,
- **Master/agent architecture.** MCP Master runs on the operator's machine.
Agents run on every managed node, receiving C2 (command and control) from
Master, reporting node status, and managing local workloads. The C2 channel
is authenticated via MCIAS — any authenticated non-guest user or system
account is accepted (admin role is not required for deploy operations).
The master does not need to be always-on — agents keep running their
workloads independently; the master is needed only to issue new commands.
- **System account automation.** The agent uses an `mcp-agent` system account
for all service-to-service communication: TLS cert provisioning (Metacrypt),
DNS record management (MCNS), and container image pulls (MCR). Each service
authorizes the agent through its own policy engine. Per-service system
accounts (e.g., `mcq`) can be created for scoped self-management — a service
account can only manage its own DNS records, not other services'.
- **Node management.** Track which nodes are in the platform, their health,
available resources, and running workloads.
can deploy them. The systemd unit files exist as a fallback and for bootstrap —
the long-term deployment model is MCP-managed containers.
**Status:** Implemented. v0.7.6. Deployed on rift managing all platform
containers. Route declarations with automatic port allocation (`$PORT` /
`$PORT_<NAME>` env vars passed to containers). MC-Proxy route registration
during deploy and stop. Automated TLS cert provisioning for L7 routes via
Metacrypt CA (Phase C). Automated DNS registration in MCNS during deploy
and stop (Phase D). Two components — `mcp` CLI (operator workstation) and
`mcp-agent` (per-node daemon with SQLite registry, rootless Podman,
monitoring with drift/flap detection). gRPC-only (no REST). 15 RPCs,
17+ CLI commands.
---
### How Services Get Certificates Today
For services deployed via MCP with L7 routes, certificates are provisioned
automatically during deploy — MCP uses the Metacrypt ACME client library to
obtain certs and transfers them to the node. For other services and during
bootstrap, certificates are provisioned through Metacrypt's **REST API or web
UI** and placed into each service's `/srv/<service>/certs/` directory manually.
### How MCP Automates Certificates
MCP automates certificate provisioning for deploy workflows, with renewal and
migration automation planned:
- **Initial deploy.** When MCP deploys a new service, it provisions a
certificate from Metacrypt (via the ACME client library), transfers the cert
and key to the node as part of the config push to `/srv/<service>/certs/`,
and starts the service with valid TLS material. For L7 routes, MCP also
provisions a TLS certificate for MC-Proxy's termination endpoint.
- **Renewal.** MCP knows what services are running and when their certificates
expire. It can renew certificates before expiry by re-running the ACME flow
@@ -689,10 +725,8 @@ MCP is the natural place to automate certificate provisioning:
for the new name.
- **MC-Proxy L7 routes.** MC-Proxy's L7 mode requires certificate/key pairs
for TLS termination. MCP provisions these from Metacrypt during deploy and
pushes them to the node alongside the route registration.
### Trust Distribution
@@ -793,8 +827,13 @@ Operator workstation (vade)
├── Scheduling: select Node C (best fit)
├── Port assignment: allocate a free host port for each
declared route (passed as $PORT / $PORT_<NAME> env vars)
├── Provision TLS certificate from Metacrypt CA
│ (ACME client library) for the service
│ — for L7 routes, also provision a cert for MC-Proxy
│ TLS termination
├── C2 to Node C agent:
│ 1. Create /srv/α/ directory structure
@@ -802,15 +841,15 @@ Operator workstation (vade)
│ 3. Transfer TLS cert+key → /srv/α/certs/
│ 4. Transfer root CA cert → /srv/α/certs/ca.pem
│ 5. Pull image from MCR
│ 6. Start container with $PORT / $PORT_<NAME> env vars
├── Register routes with MC-Proxy
│ (gRPC AddRoute for each declared route)
├── Update service registry: α → Node C
└── Push DNS update to MCNS:
        α.svc.mcp.metacircular.net → Node C address
```
### 4. Migration


@@ -9,8 +9,8 @@ the platform's internals.
## Platform Overview
Metacircular is a multi-service infrastructure platform. Services are
Go binaries running as containers across a fleet of Linux nodes,
managed by these core components:
| Component | Role |
|-----------|------|
@@ -21,8 +21,27 @@ components:
| **MCNS** (DNS) | Authoritative DNS for `*.svc.mcp.metacircular.net` |
The operator workflow is: **build image → push to MCR → write service
definition → deploy via MCP**. MCP handles port assignment, TLS cert
provisioning, route registration, DNS registration, and container
lifecycle.
### Fleet Topology
The platform runs across multiple nodes connected via Tailnet:
| Node | Role | OS | Arch | Purpose |
|------|------|----|------|---------|
| **rift** | Compute + core infra | NixOS | amd64 | Runs most services (Metacrypt, MCR, MCNS, etc.) |
| **svc** | Edge | Debian | amd64 | Public-facing mc-proxy, routes traffic over Tailnet to compute nodes |
| **orion** | Compute | NixOS | amd64 | Provisioned, currently offline |
**Node roles:**
- **Compute nodes** (rift, orion, future RPis) run the full container
lifecycle via rootless Podman.
- **Edge nodes** (svc) run mc-proxy for public traffic routing only.
The MCP agent on edge nodes manages mc-proxy routes but does not run
application containers.
---
@@ -34,7 +53,7 @@ registration, and container lifecycle.
| Container engine | Docker or Podman (for building images) |
| `mcp` CLI | Installed on the operator workstation |
| MCR access | Credentials to push images to `mcr.svc.mcp.metacircular.net:8443` |
| MCP agent | Running on the target node (`rift` for services, `svc` for edge routing) |
| MCIAS account | For `mcp` CLI authentication to the agent |
---
@@ -123,18 +142,38 @@ Service definitions are TOML files that tell MCP what to deploy. They
live at `~/.config/mcp/services/<service>.toml` on the operator
workstation.
### Minimal Example (Single Component, L7)
```toml
name = "myservice"
node = "rift"
[build.images]
myservice = "Dockerfile"
[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
[[components.routes]]
port = 8443
mode = "l7"
```
### API Service Example (L4, Multiple Routes)
```toml
name = "myservice"
node = "rift"
version = "v1.0.0"
[build.images]
myservice = "Dockerfile"
[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]
[[components.routes]]
name = "rest"
@@ -152,7 +191,6 @@ mode = "l4"
```toml
name = "myservice"
node = "rift"
version = "v1.0.0"
[build.images]
myservice = "Dockerfile.api"
@@ -160,6 +198,7 @@ myservice-web = "Dockerfile.web"
[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]
@@ -175,6 +214,7 @@ mode = "l4"
[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/myservice-web:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]
@@ -183,21 +223,16 @@ port = 443
mode = "l7"
```
### Conventions
A few fields are derived by the agent at deploy time:
| Field | Default | Override when... |
|-------|---------|------------------|
| Source path | `<service>` relative to workspace root | Directory name differs from service name (use `path`) |
| Hostname | `<service>.svc.mcp.metacircular.net` | Service needs a public hostname (use route `hostname`) |
All other fields must be explicit in the service definition.
### Service Definition Reference
@@ -207,7 +242,6 @@ Most fields are optional — MCP derives them from conventions:
|-------|----------|---------|
| `name` | Yes | Service name (matches project name) |
| `node` | Yes | Target node to deploy to |
| `active` | No | Whether MCP keeps this running (default: `true`) |
| `path` | No | Source directory relative to workspace (default: `name`) |
@@ -215,20 +249,20 @@ Most fields are optional — MCP derives them from conventions:
| Field | Purpose |
|-------|---------|
| `build.images.<name>` | Maps build image name to Dockerfile path. The `<name>` must match the repository name in a component's `image` field (the part after the last `/`, before the `:` tag). |
**Component fields:**
| Field | Required | Purpose |
|-------|----------|---------|
| `name` | Yes | Component name (e.g. `api`, `web`) |
| `image` | Yes | Full image reference (e.g. `mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0`) |
| `volumes` | No | Volume mounts (list of `host:container` strings) |
| `cmd` | No | Command override (list of strings) |
| `env` | No | Extra environment variables (list of `KEY=VALUE` strings) |
| `network` | No | Container network (default: none) |
| `user` | No | Container user (e.g. `0:0`) |
| `restart` | No | Restart policy (e.g. `unless-stopped`) |
**Route fields (under `[[components.routes]]`):**
@@ -248,9 +282,11 @@ Most fields are optional — MCP derives them from conventions:
### Version Pinning
Component `image` fields **must** pin an explicit semver tag (e.g.
`mcr.svc.mcp.metacircular.net:8443/myservice:v1.1.0`). Never use
`:latest`. This ensures deployments are reproducible and `mcp status`
shows the actual running version. The version is extracted from the
image tag.
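The tag-extraction rule can be sketched as a small helper — the function name is illustrative, not MCP's actual code. The key subtlety is that MCR references carry a registry port (`:8443`), so the tag is only the text after the last `:` that follows the final `/`:

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag extracts the version tag from a pinned image reference,
// e.g. "mcr.svc.mcp.metacircular.net:8443/myservice:v1.1.0" → "v1.1.0".
// The last ":" must come after the last "/" so the registry port
// (":8443") is never mistaken for a tag.
func imageTag(ref string) (string, bool) {
	slash := strings.LastIndex(ref, "/")
	colon := strings.LastIndex(ref, ":")
	if colon <= slash { // no tag present
		return "", false
	}
	return ref[colon+1:], true
}

func main() {
	tag, ok := imageTag("mcr.svc.mcp.metacircular.net:8443/myservice:v1.1.0")
	fmt.Println(tag, ok) // v1.1.0 true
}
```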
---
@@ -303,14 +339,20 @@ recreates the containers.
### What Happens During Deploy
1. Agent assigns a free host port (10000–60000) for each declared route.
2. For L7 routes, agent provisions a TLS certificate from Metacrypt CA
(via `POST /v1/engine/request`). Certs are written to
`/srv/mc-proxy/certs/<service>.pem` and `.key`. Existing valid certs
(more than 30 days from expiry) are reused.
3. Agent starts containers with `$PORT` / `$PORT_<NAME>` environment
variables set to the assigned ports.
4. Agent registers routes with mc-proxy via gRPC (hostname →
`<node-address>:<port>`, mode, TLS cert paths).
5. Agent registers DNS entries in MCNS for
`<service>.svc.mcp.metacircular.net`.
6. Agent records the full state in its SQLite registry.
On stop (`mcp stop <service>`), the agent reverses the process: removes
DNS entries, removes mc-proxy routes, then stops containers.
---
@@ -368,7 +410,14 @@ tags = []
level = "info"
```
For services with SSO-enabled web UIs, add:
```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```
For services with a separate web UI binary, add:
```toml
[web]
@@ -385,12 +434,17 @@ addresses** — they will be overridden at deploy time.
| Env var | When set |
|---------|----------|
| `$PORT` | Component has a single unnamed route |
| `$PORT_<NAME>` | Component has named routes |
Route names are uppercased: `name = "rest"` → `$PORT_REST`,
`name = "grpc"` → `$PORT_GRPC`.
**Container listen address:** Services must bind to `0.0.0.0:$PORT`
(or `:$PORT`), not `localhost:$PORT`. Podman port-forwards go through
the container's network namespace — binding to `localhost` inside the
container makes the port unreachable from outside.
Services built with **mcdsl v1.1.0+** handle this automatically —
`config.Load` checks `$PORT` → overrides `Server.ListenAddr`, and
`$PORT_GRPC` → overrides `Server.GRPCAddr`. These take precedence over
@@ -411,18 +465,72 @@ these.
## 6. Authentication (MCIAS Integration)
Every service delegates authentication to MCIAS. No service maintains
its own user database. Services support two login modes: **SSO
redirect** (recommended for web UIs) and **direct credentials**
(fallback / API clients).
### SSO Login (Web UIs)
SSO is the preferred login method for web UIs. The flow is an OAuth
2.0-style authorization code exchange:
1. User visits the service and is redirected to `/login`.
2. Login page shows a "Sign in with MCIAS" button.
3. Click redirects to MCIAS (`/sso/authorize`), which authenticates the
user.
4. MCIAS redirects back to the service's `/sso/callback` with an
authorization code.
5. The service exchanges the code for a JWT via a server-to-server call
to MCIAS `POST /v1/sso/token`.
6. The JWT is stored in a session cookie.
SSO is enabled by adding an `[sso]` section to the service config and
registering the service as an SSO client in MCIAS.
**Service config:**
```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```
**MCIAS config** (add to the `[[sso_clients]]` list):
```toml
[[sso_clients]]
client_id = "<service>"
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
service_name = "<service>"
```
The `redirect_uri` must match exactly between the service config and
the MCIAS client registration.
When `[sso].redirect_uri` is empty or absent, the service falls back to
the direct credentials form.
**Implementation:** Services use `mcdsl/sso` (v1.7.0+) which handles
state management, CSRF-safe cookies, and the code exchange. The web
server registers three routes:
| Route | Purpose |
|-------|---------|
| `GET /login` | Renders landing page with "Sign in with MCIAS" button |
| `GET /sso/redirect` | Sets state cookies, redirects to MCIAS |
| `GET /sso/callback` | Validates state, exchanges code for JWT, sets session |
### Direct Credentials (API / Fallback)
1. Client sends credentials to the service's `POST /v1/auth/login`.
2. Service forwards them to MCIAS via `mcdsl/auth.Authenticator.Login()`.
3. MCIAS validates and returns a bearer token.
4. Subsequent requests include `Authorization: Bearer <token>`.
5. Service validates tokens via `ValidateToken()`, cached for 30s
(keyed by SHA-256 of the token).
Web UIs use this mode when SSO is not configured, presenting a
username/password/TOTP form instead of the SSO button.
### Roles
| Role | Access |
@@ -475,11 +583,14 @@ co-located on the same node).
| `mcp build <service>` | Build and push images to MCR |
| `mcp sync` | Push all service definitions to agents; auto-build missing images |
| `mcp deploy <service>` | Pull image, (re)create containers, register routes |
| `mcp undeploy <service>` | Full teardown: remove routes, DNS, certs, and containers |
| `mcp stop <service>` | Remove routes, stop containers |
| `mcp start <service>` | Start previously stopped containers |
| `mcp restart <service>` | Restart containers in place |
| `mcp ps` | List all managed containers and status |
| `mcp status [service]` | Detailed status for a specific service |
| `mcp logs <service>` | Stream container logs |
| `mcp edit <service>` | Edit service definition |
---
@@ -504,13 +615,14 @@ git push origin v1.0.0
cat > ~/.config/mcp/services/myservice.toml << 'EOF'
name = "myservice"
node = "rift"
version = "v1.0.0"
[build.images]
myservice = "Dockerfile.api"
[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
[[components.routes]]
name = "rest"
@@ -584,15 +696,85 @@ Services follow a standard directory structure:
---
## 10. Agent Management
MCP manages a fleet of nodes with heterogeneous operating systems and
architectures. The agent binary lives at `/srv/mcp/mcp-agent` on every
node — this is a mutable path that MCP controls, regardless of whether
the node runs NixOS or Debian.
### Node Configuration
Each node in `~/.config/mcp/mcp.toml` includes SSH and architecture
info for agent management:
```toml
[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
ssh = "rift"
arch = "amd64"
[[nodes]]
name = "hyperborea"
address = "100.x.x.x:9444"
ssh = "hyperborea"
arch = "arm64"
```
### Upgrading Agents
After tagging a new MCP release:
```bash
# Upgrade all nodes (recommended — prevents version skew)
mcp agent upgrade
# Upgrade a single node
mcp agent upgrade rift
# Check versions across the fleet
mcp agent status
```
`mcp agent upgrade` cross-compiles the agent binary for each target
architecture, SSHs to each node, atomically replaces the binary, and
restarts the systemd service. All nodes should be upgraded together
because new CLI versions often depend on new agent RPCs.
### Provisioning New Nodes
One-time setup for a new Debian node:
```bash
# 1. Provision the node (creates user, dirs, systemd unit, installs binary)
mcp node provision <name>
# 2. Register the node
mcp node add <name> <address>
# 3. Deploy services
mcp deploy <service>
```
For NixOS nodes, provisioning is handled by the NixOS configuration.
The NixOS config creates the `mcp` user, systemd unit, and directories.
The `ExecStart` path points to `/srv/mcp/mcp-agent` so that `mcp agent
upgrade` works the same as on Debian nodes.
---
## Appendix: Currently Deployed Services
For reference, these services are operational on the platform:
| Service | Version | Node | Purpose |
|---------|---------|------|---------|
| MCIAS | v1.9.0 | (separate) | Identity and access |
| Metacrypt | v1.4.1 | rift | Cryptographic service, PKI/CA |
| MC-Proxy | v1.2.1 | rift, svc | TLS proxy and router (svc handles public edge) |
| MCR | v1.2.1 | rift | Container registry |
| MCNS | v1.1.1 | rift | Authoritative DNS |
| MCDoc | v0.1.0 | rift | Documentation server |
| MCQ | v0.4.0 | rift | Document review queue |
| MCP | v0.7.6 | rift, svc | Control plane agent |

docs/phase-e-plan.md Normal file

@@ -0,0 +1,127 @@
# Phase E: Multi-Node Orchestration
Phase D (automated DNS registration) is complete. Phase E extends MCP from
a single-node agent on rift to a multi-node fleet with a central master
process.
## Goal
Deploy and manage services across multiple nodes from a single control
plane. The operator runs `mcp deploy` and the system places the workload on
the right node, provisions certs, registers DNS, and configures routing --
same as today on rift, but across the fleet.
## Fleet Topology
| Node | OS | Arch | Role |
|------|----|------|------|
| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
| orion | NixOS | amd64 | Worker |
| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |
Tailnet is the interconnect between all nodes. Public traffic enters via
mc-proxy on svc, which forwards over Tailnet to worker nodes.
## Key Architecture Decisions
These were resolved in the 2026-04-01 design session:
1. **Rift is the master node.** No separate straylight machine. Core infra
stays on rift, which gains mcp-master alongside its existing agent.
2. **Master-mediated coordination.** Agents never talk to each other. All
cross-node operations go through the master. Agents only dial the master
(for registration and heartbeats) and respond to master RPCs.
3. **Agent self-registration.** Agents register with the master on startup
(name, role, address, arch). The master maintains the live node registry.
No static `[[nodes]]` config required except for bootstrap.
4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
(with resource data). If the master misses 3 heartbeats (90s), it
actively probes the agent. Failed probe marks the node unhealthy.
5. **Tier-based placement.** `tier = "core"` runs on the master node.
`tier = "worker"` (default) is auto-placed on a worker with capacity.
Explicit `node = "orion"` overrides tier for pinned services.
6. **Two separate certs for public services.** Internal cert
(`svc.mcp.metacircular.net`) issued by worker agent. Public cert
(`metacircular.net`) issued by edge agent. Internal names never
appear on edge certs.
7. **`public = true` on routes.** Public routes declare intent with a
boolean flag. The master assigns the route to an edge node (currently
always svc). No explicit `edge` field in service definitions.
## Components
### Master (`mcp-master`)
Long-lived orchestrator on rift. Responsibilities:
- Accept CLI commands and dispatch to the correct agent
- Maintain node registry from agent self-registration
- Place services based on tier, explicit node, and resource availability
- Detect `public = true` routes and coordinate edge setup
- Validate public hostnames against allowed domain list
- Aggregate status from all agents (fleet-wide view)
- Probe agents on missed heartbeats
The master is stateless in the durable sense -- it rebuilds its world view
from agents on startup. If the master goes down, running services continue
unaffected; only new deploys and rescheduling stop.
### Agent upgrades
The fleet is heterogeneous (NixOS + Debian, amd64 + arm64), so NixOS flake
inputs don't work as a universal update mechanism.
**Design:** MCP owns the binary at `/srv/mcp/mcp-agent` on all nodes.
- `mcp agent upgrade [node]` -- CLI cross-compiles for the target's
GOARCH, SCPs the binary, restarts via SSH
- Node config gains `ssh` (user@host) and `arch` (amd64/arm64) fields
- rift's NixOS `ExecStart` changes from nix store path to
`/srv/mcp/mcp-agent`
- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
`mcp-agent.service`
### Edge agents
svc runs an agent but does NOT run containers. Its agent manages mc-proxy
routing only: when the master tells it to set up an edge route, it
provisions a TLS cert from Metacrypt and registers the route in its local
mc-proxy via the gRPC admin API.
## Migration Plan
### Phase 1: Agent on svc
Deploy mcp-agent to svc. Verify with `mcp node list`.
### Phase 2: Edge routing RPCs
Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
Test by calling directly from CLI.
### Phase 3: Build mcp-master
Core loop: registration, heartbeats, deploy routing, placement, edge
coordination.
### Phase 4: Agent registration and health
Self-registration, heartbeat loop, master probe fallback, fleet status.
### Phase 5: Cut over
Point CLI at master, add tier fields to service defs, deploy agents to
orion and hyperborea.
## What Phase E Does NOT Include
These remain future work:
- Auto-reconciliation (agent auto-restarting drifted containers)
- Live migration (snapshot streaming between nodes)
- Web UI for fleet management
- Observability / log aggregation
- Object store
- Multiple edge nodes / master HA


@@ -1018,6 +1018,13 @@ Write these before writing code. They are the blueprint, not the afterthought.
- **Never log secrets.** Keys, passwords, tokens, and plaintext must never
appear in log output.
### CLI Security
- **Never echo passwords.** Interactive password prompts must suppress
terminal echo. Use `mcdsl/terminal.ReadPassword` — it wraps
`golang.org/x/term.ReadPassword` with proper prompt and newline handling.
Never read passwords with `bufio.Scanner` or `fmt.Scanln`.
### Web Security
- CSRF tokens on all mutating requests.


@@ -0,0 +1,627 @@
# Building MCQ: A Reading Queue in a Session
*A document reading queue goes from design discussion to production
deployment — including the deployment's sharp edges.*
*Written by Claude (Opus 4.6), reflecting on a collaborative session with
Kyle. This log focuses heavily on the deployment side, including what
went wrong, what was done manually versus what should have been MCP-managed,
and why `mcp ps` doesn't know mcq exists.*
---
## The Idea
Kyle was out and about, away from his tailnet, and wanted to review
platform documentation on his phone. The existing tools — mcdoc (which
renders docs from Gitea repos) and the repos themselves — require either
tailnet access or a desktop workflow.
The concept: a **document queue**. Push raw markdown from inside the
infrastructure, read rendered HTML from anywhere via a browser. Like a
self-hosted Pocket, but for internal docs you're actively iterating on.
After a design discussion, we settled on:
- **Name**: mcq (Metacircular Document Queue)
- **Data model**: Documents keyed by slug, upsert semantics (re-push
replaces content, resets read flag)
- **Auth**: MCIAS on everything — any user including guest can read, any
user including system accounts can push
- **Rendering**: Goldmark with GFM + syntax highlighting, rendered on
each page view
- **Architecture**: Single binary, REST API + gRPC + web UI
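The upsert semantics above can be sketched in a few lines — the real store is the SQLite `documents` table, so this in-memory map is illustrative only:

```go
package main

import "fmt"

type document struct {
	Markdown string
	Read     bool
}

// store keys documents by slug; Put has upsert semantics: re-pushing a
// slug replaces the content and resets the read flag.
type store map[string]document

func (s store) Put(slug, markdown string) {
	s[slug] = document{Markdown: markdown, Read: false}
}

func (s store) MarkRead(slug string) {
	if d, ok := s[slug]; ok {
		d.Read = true
		s[slug] = d
	}
}

func main() {
	s := store{}
	s.Put("phase-e-plan", "# Phase E")
	s.MarkRead("phase-e-plan")
	s.Put("phase-e-plan", "# Phase E (rev 2)") // re-push
	fmt.Println(s["phase-e-plan"].Read)        // read flag was reset
}
```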
## Building the Service
### Codebase Exploration
Before writing any code, I explored the existing platform services to
understand the patterns:
- **mcat** (`~/src/metacircular/mcat/`): Reference for the web UI pattern —
chi router, CSRF, session cookies, htmx, embedded templates, cobra CLI,
config loading via `mcdsl/config`.
- **mcns** (`~/src/metacircular/mcns/`): Reference for REST + gRPC pattern —
separate `internal/server/` (REST) and `internal/grpcserver/` (gRPC),
method maps for auth interceptors, SQLite via `mcdsl/db`.
- **mcdoc** (`~/src/metacircular/mcdoc/`): Reference for goldmark rendering
and plain HTTP serving (mcdoc doesn't use mcdsl for config or HTTP — it
has its own, because it serves plain HTTP behind mc-proxy).
- **mcdsl** (`~/src/metacircular/mcdsl/`): The shared library — auth,
config, db, httpserver, grpcserver, csrf, web packages.
### Implementation (on vade, Kyle's workstation)
Created `~/src/mcq/` with the standard platform layout:
```
cmd/mcq/ main.go, server.go (cobra CLI)
internal/
config/ custom config (TLS optional, see below)
db/ SQLite schema, migrations, document CRUD
server/ REST API routes and handlers
grpcserver/ gRPC server, interceptors, service handlers
webserver/ Web UI routes, templates, session management
render/ goldmark markdown-to-HTML renderer
proto/mcq/v1/ Protobuf definitions
gen/mcq/v1/ Generated Go code
web/ Embedded templates + static files
deploy/ systemd, examples
```
Key files:
- **Proto** (`proto/mcq/v1/mcq.proto`): DocumentService (ListDocuments,
GetDocument, PutDocument, DeleteDocument, MarkRead, MarkUnread),
AuthService (Login, Logout), AdminService (Health).
- **DB** (`internal/db/documents.go`): Single `documents` table with slug
as unique key. PutDocument uses `INSERT ... ON CONFLICT(slug) DO UPDATE`.
- **REST** (`internal/server/routes.go`): All routes under `/v1/`;
  `PUT /v1/documents/{slug}` for upsert, standard CRUD otherwise.
- **Web UI** (`internal/webserver/server.go`): Login page, document list
at `/`, rendered markdown reader at `/d/{slug}`.
- **gRPC** (`internal/grpcserver/`): Mirrors REST exactly. Method map puts
all document operations in `authRequiredMethods`, nothing in
`adminRequiredMethods`.
Proto generation ran on vade:
```bash
cd ~/src/mcq
protoc --go_out=. --go_opt=module=git.wntrmute.dev/mc/mcq \
--go-grpc_out=. --go-grpc_opt=module=git.wntrmute.dev/mc/mcq \
proto/mcq/v1/*.proto
```
### The .gitignore Bug
First `git add -A` missed `cmd/mcq/`, `proto/mcq/`, and `gen/mcq/`. The
`.gitignore` had:
```
mcq
srv/
```
The pattern `mcq` (without a leading slash) matches any file or directory
named `mcq` at any level — so it was ignoring `cmd/mcq/`, `gen/mcq/`, and
`proto/mcq/`. Fixed to:
```
/mcq
/srv/
```
### The TLS Decision
This was the most consequential design decision for deployment.
The standard platform pattern (mcdsl's `httpserver`) enforces TLS 1.3
minimum. But mc-proxy on svc terminates TLS at the edge and forwards to
backends as plain HTTP (for localhost services) or HTTPS (for remote
backends like rift). Gitea on svc runs plain HTTP on port 3000 behind
mc-proxy. mcdoc on rift runs plain HTTP on port 38080 behind mc-proxy.
mcdsl's `config.Load` validates that `tls_cert` and `tls_key` are present
— they're required fields. So I couldn't use `config.Base` with empty TLS
fields.
**Solution**: Created `internal/config/config.go` — mcq's own config
package, modeled after mcdoc's. Same TOML loading, env var overrides, and
validation, but TLS fields are optional. When empty, the server uses
`http.ListenAndServe()` instead of `httpserver.ListenAndServeTLS()`.
This meant giving up the mcdsl httpserver (with its logging middleware and
TLS enforcement) for the plain HTTP path. The gRPC server was also dropped
from the svc deployment since it requires TLS. The REST API and web UI
are sufficient for the use case.
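The resulting serve path can be sketched as below — config field names are illustrative, mirroring the description above rather than mcq's actual types:

```go
package main

import (
	"fmt"
	"net/http"
)

type serverConfig struct {
	ListenAddr string
	TLSCert    string // empty = plain HTTP behind mc-proxy
	TLSKey     string
}

// useTLS reports whether the optional TLS material is configured.
func useTLS(cfg serverConfig) bool {
	return cfg.TLSCert != "" && cfg.TLSKey != ""
}

// serve picks plain HTTP when TLS fields are empty (mc-proxy terminates
// TLS at the edge) and direct TLS otherwise.
func serve(cfg serverConfig, h http.Handler) error {
	if !useTLS(cfg) {
		return http.ListenAndServe(cfg.ListenAddr, h)
	}
	return http.ListenAndServeTLS(cfg.ListenAddr, cfg.TLSCert, cfg.TLSKey, h)
}

func main() {
	fmt.Println(useTLS(serverConfig{ListenAddr: "127.0.0.1:8090"})) // false
}
```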
### Build and Test (on vade)
```bash
cd ~/src/mcq
go mod tidy
go build ./... # clean
go vet ./... # clean
go test ./... # 6 tests pass (all in internal/db)
# Production binary
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -trimpath -ldflags="-s -w -X main.version=v0.1.0" \
-o mcq ./cmd/mcq
# Result: 21MB static binary
```
---
## Deployment
### Why mcq is NOT in `mcp ps`
**This is the most important thing in this log.**
mcq was deployed as a **manual systemd service on svc**, not as an
MCP-managed container. This means:
- `mcp ps` doesn't know about it
- `mcp stop mcq` won't work
- `mcp deploy mcq` won't work
- There's no service definition in `~/.config/mcp/services/`
- There's no container image in MCR
- The binary was `scp`'d directly to svc and `install`'d to `/usr/local/bin/`
**Why?** Three reasons:
1. **svc has no MCP agent.** The MCP agent (`mcp-agent`) only runs on rift.
svc is a Debian VPS that hosts MCIAS, mc-proxy, MCNS, and Gitea — all
deployed as manual systemd services, not via MCP. Getting mcq into MCP
would require deploying an MCP agent to svc first (Phase E in
PLATFORM_EVOLUTION.md, items #10-#12).
2. **mcq runs as a native binary, not a container.** MCP manages containers
(podman). mcq on svc is a bare binary under systemd, like MCIAS and
mc-proxy on svc. To make it MCP-managed, it would need to be
containerized and pushed to MCR first.
3. **The deployment followed the existing svc pattern.** Every service on
svc was deployed this way: build on vade, scp to svc, install, write
config, write systemd unit, enable. This was a deliberate choice to
match the existing operational model rather than block on MCP agent
deployment.
### What MCP-managed deployment would look like
Once svc has an MCP agent, mcq could be managed like services on rift:
```toml
# ~/.config/mcp/services/mcq.toml
name = "mcq"
node = "svc"
version = "v0.1.0"
[[components]]
name = "api"
[[components.routes]]
port = 8090
mode = "l7"
hostname = "mcq.metacircular.net"
```
This would require:
- MCP agent running on svc
- mcq containerized (Dockerfile) and pushed to MCR
- Agent handles port assignment, mc-proxy route registration, lifecycle
### The Actual Deployment Steps
All commands below were run from vade (Kyle's workstation) via SSH to svc,
unless otherwise noted.
#### 1. Push repo to Gitea (from vade)
```bash
cd ~/src/mcq
git remote add origin git@git.wntrmute.dev:mc/mcq.git
git push -u origin master
```
The mc/mcq repo was created manually in Gitea (the MCP tool's API token
lacked `write:organization` scope for creating repos under the mc org).
#### 2. Copy binary to svc (from vade)
```bash
scp ~/src/mcq/mcq kyle@svc:/tmp/mcq
```
SSH to svc uses Tailscale hostname resolution — `svc` resolves to
`100.106.232.4` via Tailscale. No SSH config entry was needed. Had to
accept the host key on first connection:
```bash
ssh -o StrictHostKeyChecking=accept-new kyle@svc
```
#### 3. Create user and install binary (on svc, as root via sudo)
```bash
sudo useradd --system --no-create-home --shell /usr/sbin/nologin mcq
sudo mkdir -p /srv/mcq
sudo chown mcq:mcq /srv/mcq
sudo chmod 0700 /srv/mcq
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
```
Verified: `/usr/local/bin/mcq --version` prints `mcq version v0.1.0`.
#### 4. Write config (on svc)
Created `/srv/mcq/mcq.toml`:
```toml
[server]
listen_addr = "127.0.0.1:8090"
[database]
path = "/srv/mcq/mcq.db"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = "/srv/mcq/ca.pem"
service_name = "mcq"
tags = []
[log]
level = "info"
```
**Important detail**: The first attempt used `server_url = "https://127.0.0.1:8443"`
which failed because MCIAS's TLS cert has SANs for `mcias.wntrmute.dev`
and `mcias.metacircular.net` but **not** `127.0.0.1` or `localhost`. Token
validation returned "invalid or expired token" because the mcdsl auth
client couldn't establish a TLS connection to MCIAS.
Fixed by copying the pattern from MCNS on svc:
- `server_url = "https://mcias.metacircular.net:8443"` (uses the hostname
that matches the cert's SAN)
- `ca_cert = "/srv/mcq/ca.pem"` (the WNTRMUTE root CA cert, copied from
`/srv/mcns/certs/ca.pem`)
The hostname `mcias.metacircular.net` resolves to svc's public IP, so
this still connects to localhost MCIAS — it just goes through the public
IP for TLS hostname verification. (On a locked-down firewall this could
be an issue, but svc allows loopback through its public IP.)
#### 5. Create systemd unit (on svc)
Created `/etc/systemd/system/mcq.service`:
```ini
[Unit]
Description=MCQ Document Queue
After=network-online.target mcias.service
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/mcq server --config /srv/mcq/mcq.toml
WorkingDirectory=/srv/mcq
Restart=on-failure
RestartSec=5
User=mcq
Group=mcq
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/srv/mcq
PrivateTmp=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
[Install]
WantedBy=multi-user.target
```
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now mcq
```
Verified running: PID 3765144, memory 7.8MB, started cleanly.
#### 6. Generate TLS certificate for mc-proxy (on vade)
mc-proxy needs a TLS cert for the `mcq.metacircular.net` hostname (it
terminates TLS at the edge). Generated using the local WNTRMUTE root CA:
```bash
cd /tmp
openssl ecparam -name prime256v1 -genkey -noout -out mcq.key
openssl req -new -key mcq.key -out mcq.csr \
-subj "/CN=mcq.metacircular.net/O=Metacircular Dynamics" \
-addext "subjectAltName=DNS:mcq.metacircular.net"
openssl x509 -req -in mcq.csr \
-CA ~/src/metacircular/ca/ca.pem \
-CAkey ~/src/metacircular/ca/ca.key \
-CAcreateserial -out mcq.pem -days 365 -sha256 \
-extfile <(echo "subjectAltName=DNS:mcq.metacircular.net
keyUsage=digitalSignature
extendedKeyUsage=serverAuth")
```
The CA key and cert are at `~/src/metacircular/ca/` — this is the
WNTRMUTE Issuing Authority root CA. Not Metacrypt (which has its own
intermediate CA for automated issuance). The existing mc-proxy certs
(docs, git, metacrypt) were all signed by this same root CA.
Copied to svc:
```bash
scp /tmp/mcq.pem /tmp/mcq.key kyle@svc:/tmp/
```
Installed on svc:
```bash
sudo cp /tmp/mcq.pem /srv/mc-proxy/certs/mcq.metacircular.net.pem
sudo cp /tmp/mcq.key /srv/mc-proxy/certs/mcq.metacircular.net.key
sudo chown mc-proxy:mc-proxy /srv/mc-proxy/certs/mcq.metacircular.net.*
sudo chmod 0600 /srv/mc-proxy/certs/mcq.metacircular.net.key
```
#### 7. Add mc-proxy route (on svc)
mc-proxy on svc uses SQLite for route persistence. The TOML config only
seeds the database on first run (`store.IsEmpty()` check). After that,
routes are loaded from SQLite. So editing the TOML alone doesn't add a
route — you must also insert into the database.
I did both (TOML for documentation/re-seeding, SQLite for immediate effect):
**TOML** (added via `sed` to `/srv/mc-proxy/mc-proxy.toml`):
```toml
[[listeners.routes]]
hostname = "mcq.metacircular.net"
backend = "127.0.0.1:8090"
mode = "l7"
tls_cert = "/srv/mc-proxy/certs/mcq.metacircular.net.pem"
tls_key = "/srv/mc-proxy/certs/mcq.metacircular.net.key"
backend_tls = false
```
**SQLite** (direct insert):
```bash
sudo sqlite3 /srv/mc-proxy/mc-proxy.db "
INSERT INTO routes (listener_id, hostname, backend, mode, tls_cert, tls_key, backend_tls)
VALUES (1, 'mcq.metacircular.net', '127.0.0.1:8090', 'l7',
'/srv/mc-proxy/certs/mcq.metacircular.net.pem',
'/srv/mc-proxy/certs/mcq.metacircular.net.key', 0);
"
```
The `listener_id = 1` is the `:443` listener (only listener on svc's
mc-proxy).
**Note on `backend_tls = false`**: mcq serves plain HTTP on localhost.
mc-proxy terminates TLS for the client and forwards as plain HTTP to
`127.0.0.1:8090`. This is the same pattern as Gitea (`127.0.0.1:3000`)
and mcdoc (`100.95.252.120:38080`). Only metacrypt uses `backend_tls = true`
because its backend is on rift over Tailscale.
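mc-proxy's seed-once logic can be sketched in Python (a hypothetical table layout mirroring the columns in the SQLite insert above; the real implementation is Go and its schema may differ):

```python
import os, sqlite3, tempfile

def seed_routes_if_empty(db_path, toml_routes):
    """Seed the routes table from TOML config, but only when the table
    is empty -- mirroring mc-proxy's store.IsEmpty() check. After the
    first run, SQLite is the source of truth and TOML edits are inert."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS routes (
        listener_id INTEGER, hostname TEXT, backend TEXT, mode TEXT,
        tls_cert TEXT, tls_key TEXT, backend_tls INTEGER)""")
    (count,) = conn.execute("SELECT COUNT(*) FROM routes").fetchone()
    if count == 0:  # fresh database: seed from config
        conn.executemany(
            "INSERT INTO routes VALUES (?, ?, ?, ?, ?, ?, ?)",
            toml_routes)
        conn.commit()
    conn.close()
    return count == 0

# First run seeds; a second run (e.g. after editing the TOML) is a no-op.
db = os.path.join(tempfile.mkdtemp(), "mc-proxy.db")
route = (1, "mcq.metacircular.net", "127.0.0.1:8090", "l7",
         "/srv/mc-proxy/certs/mcq.metacircular.net.pem",
         "/srv/mc-proxy/certs/mcq.metacircular.net.key", 0)
seeded_first = seed_routes_if_empty(db, [route])
seeded_again = seed_routes_if_empty(db, [route])
print(seeded_first, seeded_again)  # True False
```

This is why the manual deployment had to touch both places: the TOML edit only matters on a future re-seed, while the direct insert takes effect on the next restart.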
#### 8. Restart mc-proxy (on svc)
```bash
sudo systemctl restart mc-proxy
```
This was messy. mc-proxy's graceful shutdown waits for in-flight
connections to drain, and the 30-second shutdown timeout was exceeded
(lingering connections from internet scanners hitting git.metacircular.net).
The shutdown hung for ~30 seconds before logging "shutdown timeout exceeded,
forcing close". systemd then moved to `deactivating (stop-sigterm)` state.
Had to force it:
```bash
sudo systemctl kill mc-proxy
sleep 2
sudo systemctl start mc-proxy
```
After restart: `routes=5` (was 4 before mcq). Confirmed:
```bash
curl -sk https://mcq.metacircular.net/v1/health
# {"status":"ok"}
```
#### 9. Push documents (from vade)
Used the mcp-agent service account token (from
`~/data/downloads/service-account-76d35a82-77ca-422f-85a3-b9f9360d5164.token`)
to authenticate API calls. This is a long-lived JWT issued by MCIAS with
`admin` role, `exp` in 2027.
```bash
TOKEN=$(cat ~/data/downloads/service-account-*.token)
# Push MCP Architecture
python3 -c "
import json
body = open('mcp/ARCHITECTURE.md').read()
print(json.dumps({'title': 'MCP Architecture', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/mcp-architecture \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @-
# Push Platform Evolution
python3 -c "
import json
body = open('PLATFORM_EVOLUTION.md').read()
print(json.dumps({'title': 'Platform Evolution', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/platform-evolution \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @-
# Push Packaging doc
python3 -c "
import json
body = open('docs/packaging-and-deployment.md').read()
print(json.dumps({'title': 'Packaging and Deployment', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/packaging-and-deployment \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @-
```
Used `python3` for JSON encoding because `jq` isn't installed on vade
(NixOS — would need to add it to the system config or use `nix-shell`).
All three documents pushed successfully. The token identifies as
`mcp-agent` (the service account name), so `pushed_by` shows `mcp-agent`
on each document.
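The three near-identical pushes differ only in title and file path; the payload construction could be factored into a helper (hypothetical function name; the curl transport stays exactly as above):

```python
import json, tempfile

def doc_payload(title, path):
    """Build the JSON body for a PUT /v1/documents/<slug> request:
    the file's contents become the document body."""
    with open(path, encoding="utf-8") as f:
        return json.dumps({"title": title, "body": f.read()})

# Demo with a throwaway file standing in for PLATFORM_EVOLUTION.md:
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write("# Platform Evolution")
payload = doc_payload("Platform Evolution", f.name)
print(payload)
```

The output is what gets piped to `curl -d @-` with the bearer token, same as the inline `python3 -c` invocations.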
### Subsequent Update: Tufte Theme
Kyle wanted a wider reading area (70%) and a Tufte-inspired theme. Updated
`web/static/style.css`:
- Serif font stack (Georgia, Palatino)
- Cream background (`#fffff8`)
- Italic headings, small-caps labels
- `width: 70%` on `.page-container` (was `max-width: 720px`)
- Minimal chrome — document list uses ruled lines instead of cards,
tables use bottom-borders only
- Mobile fallback: full width below 768px
Rebuilt, deployed same way:
```bash
# On vade
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -trimpath -ldflags="-s -w -X main.version=v0.1.1" \
-o mcq ./cmd/mcq
scp mcq kyle@svc:/tmp/mcq
# On svc
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
sudo systemctl restart mcq
```
---
## State After This Session
### What's running on svc
| Service | Port | Managed by | Notes |
|---------|------|------------|-------|
| MCIAS | :8443/:9443 | systemd | Identity/auth, been here longest |
| mc-proxy | :443 | systemd | L7 TLS termination, 5 routes |
| MCNS | :53/:8444/:9444 | systemd | Authoritative DNS |
| Gitea | :3000 | systemd | Git hosting |
| **mcq** | **:8090** | **systemd** | **NEW: document queue** |
None of these are MCP-managed. svc has no MCP agent.
### mc-proxy routes on svc
| Hostname | Backend | Mode | TLS Backend |
|----------|---------|------|-------------|
| metacrypt.metacircular.net | 100.95.252.120:18080 | L7 | yes (rift) |
| git.metacircular.net | 127.0.0.1:3000 | L7 | no |
| git.wntrmute.dev | 127.0.0.1:3000 | L7 | no |
| docs.metacircular.net | 100.95.252.120:38080 | L7 | no |
| **mcq.metacircular.net** | **127.0.0.1:8090** | **L7** | **no** |
### DNS
`mcq.metacircular.net` is a CNAME to `svc.metacircular.net` (set up by
Kyle at the DNS registrar before this session). mc-proxy's SNI-based
routing handles the rest.
### Documents in queue
| Slug | Title | Pushed By |
|------|-------|-----------|
| mcp-architecture | MCP Architecture | mcp-agent |
| platform-evolution | Platform Evolution | mcp-agent |
| packaging-and-deployment | Packaging and Deployment | mcp-agent |
### Git
Repo: `mc/mcq` on Gitea (`git.wntrmute.dev:mc/mcq.git`)
Commits:
1. `bc16279` — Initial implementation
2. `648e9dc` — Support plain HTTP mode for mc-proxy L7 deployment
3. `a5b90b6` — Switch to Tufte-inspired reading theme
---
## What Would Be Different with MCP
If svc had an MCP agent and mcq were containerized:
1. **No manual SSH** — `mcp deploy mcq` from vade would push the service
   definition, and the agent would pull the image from MCR.
2. **No manual port picking** — agent assigns a free port from 10000-60000.
3. **No manual mc-proxy route** — agent calls mc-proxy's gRPC API to
register the route (Phase B, already working on rift).
4. **No manual TLS cert** — agent provisions from Metacrypt CA
(Phase C, already working on rift).
5. **No manual systemd unit** — agent manages the container lifecycle.
6. **`mcp ps` would show mcq** — because the agent tracks it in its
registry.
7. **`mcp stop mcq` / `mcp restart mcq` would work** — standard lifecycle.
The gap is: svc has no agent. That's Phase E work (items #10-#12 in
PLATFORM_EVOLUTION.md). The prerequisites are the agent binary location
convention, SSH-based upgrade tooling, and node provisioning for Debian.
---
## Rough Edges and Lessons
1. **MCIAS cert hostname**: Every new service on svc will hit this. The
MCIAS cert doesn't include localhost as a SAN. Services must use
`server_url = "https://mcias.metacircular.net:8443"` (which routes
through the public IP back to localhost) and include the CA cert.
Could fix by reissuing the MCIAS cert with a localhost SAN.
2. **mc-proxy route persistence**: The TOML-seeds-once-then-SQLite model
means you have to touch two places (TOML for future re-seeds, SQLite
for immediate effect). On rift this is handled by the agent's gRPC
calls. On svc without an agent, it's manual database surgery.
3. **mc-proxy shutdown timeout**: The 30-second timeout isn't enough when
internet scanners maintain persistent connections to git.metacircular.net.
Had to force-kill on restart. Should increase `shutdown_timeout` or
add a SIGKILL escalation in the systemd unit (`TimeoutStopSec=45`,
which sends SIGKILL after 45s).
4. **No jq on vade**: NixOS doesn't have jq in the default system config.
Used python3 as a workaround for JSON encoding. Minor friction.
5. **mcdsl httpserver assumes TLS**: Services behind mc-proxy L7 can't use
`mcdsl/httpserver` because it enforces TLS 1.3. mcdoc solved this with
its own config/server. mcq now does the same. This is a recurring
pattern — might warrant adding a plain HTTP mode to mcdsl httpserver,
or a separate `mcdsl/httpserver/plain` package.
6. **Session cookie Secure flag behind plain HTTP**: The mcdsl `web`
package always sets `Secure: true` on session cookies. This works
behind mc-proxy L7 because the *browser* sees HTTPS (mc-proxy
terminates TLS) — the `Secure` flag is about the browser's view of
the connection, not the backend. If mcq were ever accessed directly
(not through mc-proxy), cookies would silently fail.
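The Secure-flag semantics can be illustrated with Python's stdlib cookie type (an illustration only — the mcdsl `web` package is Go, but the flag means the same thing everywhere):

```python
from http.cookies import SimpleCookie

# The Secure flag tells the *browser* to send the cookie only over
# connections it considers HTTPS. Behind mc-proxy L7 the browser
# speaks HTTPS, so the plain-HTTP backend hop is invisible to it.
c = SimpleCookie()
c["session"] = "opaque-token"
c["session"]["secure"] = True    # browser-side HTTPS requirement
c["session"]["httponly"] = True  # no JavaScript access
header = c["session"].OutputString()
print(header)  # the Set-Cookie value, carrying HttpOnly and Secure
```

Accessed directly over plain HTTP, the browser would accept the `Set-Cookie` but refuse to send the cookie back, which is the silent failure mode described above.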

# Incident Report: UID Change Cascading Failure
**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850
## Timeline
### Background
Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix` which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:
```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```
Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.
### The UID Change
To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with `nixos-rebuild switch`.
**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild created the NixOS config with `uid = 850` but the existing
`mcp` user on both nodes kept UID 995. Manual `usermod -u 850 mcp`
was required on each node.
**Problem 2: Rootless podman caches the UID everywhere.**
- Podman's SQLite database (`db.sql`) stores absolute paths like
`/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name but the kernel
checks the actual UID
- Container storage overlay directories have file ownership based on
the old UID namespace mapping (995 → 100000)
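The subuid remapping in the last bullet can be sketched as arithmetic (assuming the standard rootless podman mapping and a subuid range starting at 100000 for this host):

```python
def host_uid(container_uid, owner_uid, subuid_start, subuid_count):
    """Host UID that a container UID appears as under rootless podman:
    container root becomes the invoking user; everything else comes
    from the user's subuid range in /etc/subuid."""
    if container_uid == 0:
        return owner_uid                      # root -> the mcp user itself
    if 1 <= container_uid <= subuid_count:
        return subuid_start + container_uid - 1
    raise ValueError("uid outside the mapped range")

# Old mapping (mcp = 995, subuids 100000..165535):
print(host_uid(0, 995, 100000, 65536))  # 995
print(host_uid(1, 995, 100000, 65536))  # 100000
```

Files written under the old mapping keep those host-side owner UIDs on disk, which is why a `usermod` alone leaves overlay storage and service data mis-owned.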
After changing the UID, `podman` operations failed with:
```
newuidmap: write to uid_map failed: Operation not permitted
```
### The Reboot
Rift was rebooted to get a clean systemd user session for UID 850.
The reboot succeeded, but **all containers were gone**:
```
$ podman ps -a
(empty)
```
Podman's database was recreated fresh on boot because the old database
referenced paths under `/run/user/995/` which no longer existed. The
images were still in overlay storage but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.
### DNS Collapse
MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
ran as a container on rift. When all containers were lost, DNS
resolution broke:
- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver
Tailscale DNS (MagicDNS) was also affected because resolved's global
DNS config pointed to MCNS. Tailscale itself remained functional
(its coordination servers are external), but hostname resolution via
Tailscale DNS names failed.
The operator turned off Tailscale on vade (the workstation) because
Tailscale's MagicDNS was routing ALL DNS queries through the broken
MCNS resolver — external services including Claude Code and Gitea
were unreachable. Disabling Tailscale was the only way to restore
external DNS resolution. However, this also broke connectivity to
rift since the MCP agent binds to the Tailnet IP only
(`100.95.252.120:9444`).
### Recovery
**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for `.ts.net` names.
**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:
```bash
podman run -d --name mcns --restart unless-stopped \
-p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
-p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
-p 127.0.0.1:38443:8443 \
-v /srv/mcns:/srv/mcns \
mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
server --config /srv/mcns/mcns.toml
```
DNS resolution restored within seconds.
**Step 3**: Start remaining services manually via `podman run`. Images
were all cached. The `mcp deploy` CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
containers)
Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.
**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via subuid. Running
`podman unshare chown -R 0:0 /srv/<service>` translated ownership to
match the container's user namespace. Two SQLite problems remained:
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
denied inside rootless podman user namespaces (even for UID 0 in the
namespace)
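The WAL requirement is easy to demonstrate with stdlib sqlite3: the side files are created next to the database, so the service needs write access to the whole directory, not just the `.db` file:

```python
import os, sqlite3, tempfile

d = tempfile.mkdtemp()
db = os.path.join(d, "svc.db")
conn = sqlite3.connect(db)
conn.execute("PRAGMA journal_mode = WAL")   # switch to write-ahead log
conn.execute("CREATE TABLE t (x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
# While the connection is open, WAL mode keeps sibling -wal and -shm
# files next to the database; a service that can write svc.db but
# cannot create files in its directory fails on the first write.
files = sorted(os.listdir(d))
print(files)  # ['svc.db', 'svc.db-shm', 'svc.db-wal']
conn.close()
```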
**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt. The fix:
```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
```
The `fchmod` error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.
**Data loss**: MCR and Metacrypt databases were deleted and recreated
empty. MCR lost its manifest/tag metadata (images still exist in
overlay storage but are unregistered). Metacrypt lost its CA state
(encrypted keys, issued certs tracking). Other services (mcq, mcdoc,
etc.) started successfully because their databases survived the
ownership changes.
## Root Causes
1. **UID collision between system users**: NixOS auto-assigns UIDs
downward from 999. Pinning UID 995 for mcp collided with sshd on
orion.
2. **Rootless podman's deep UID dependency**: Changing a user's UID
after rootless podman has been used requires:
- Updating podman's internal database paths
- Recreating the systemd user session
- Fixing subuid/subgid mappings
- Fixing overlay storage ownership
- Fixing service data file ownership
- None of these happen automatically
3. **No boot sequencing**: When rift rebooted with no running
containers, there was no mechanism to start services in dependency
order. The boot sequence feature in the v2 architecture exists
precisely for this, but wasn't implemented yet.
4. **MCNS as a single point of DNS failure**: All machines used MCNS
as their DNS resolver. When MCNS went down, everything broke
including the ability to manage infrastructure.
5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
library calls `fchmod()` on database files, which is denied inside
rootless podman user namespaces. This is a known incompatibility
that was masked by the previous UID setup.
## Lessons Learned
1. **Never change a rootless podman user's UID.** If a UID collision
exists, resolve it on the conflicting node (change sshd, not mcp)
or use a per-host UID override. Changing the UID after podman has
been used is destructive.
2. **DNS must not be a single point of failure.** All machines should
have fallback DNS resolvers that work independently of MCNS. The
NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
fallbacks, not just MCNS.
3. **Boot sequencing is critical.** The v2 architecture's boot sequence
(foundation → core → management) is not a nice-to-have. Without it,
manual recovery requires knowing the exact dependency order and the
exact `podman run` commands for each service.
4. **The MCP agent should be able to recover containers from its
registry.** After a podman database reset, the agent's SQLite
registry still knows what should be running. A `mcp agent recover`
command that recreates containers from the registry would eliminate
the manual `podman run` recovery.
5. **Service definitions must include all runtime parameters.** The
manual recovery required knowing port mappings, volume mounts,
network modes, user overrides, and command arguments for each
service. All of this is in the service definition files, but there
was no tool to translate a service definition into a `podman run`
command without the full MCP deploy pipeline.
6. **Tailscale MagicDNS amplifies DNS failures.** When MCNS is down
and MagicDNS routes through it, ALL DNS breaks — not just internal
names. Disabling Tailscale restores external DNS but loses Tailnet
connectivity. The fix is fallback resolvers that bypass MCNS, not
disabling Tailscale.
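The foundation → core → management ordering can be sketched as a dependency graph (hypothetical dependency map; the real `[[boot.sequence]]` config in the agent uses named stages with per-stage health checks):

```python
from graphlib import TopologicalSorter

def boot_order(deps):
    """Compute a start order from per-service dependencies, so every
    service comes up after everything it needs (MCNS before anything
    that needs DNS)."""
    return list(TopologicalSorter(deps).static_order())

# Hypothetical dependency map for the services on rift:
deps = {
    "mcns": [],
    "mc-proxy": ["mcns"],
    "mcr": ["mcns"],
    "metacrypt-api": ["mcns", "mcr"],
    "mcp-master": ["mc-proxy", "mcr", "metacrypt-api"],
}
order = boot_order(deps)
print(order)  # mcns first, mcp-master last
```

Encoding this once in config is exactly what the manual recovery lacked: the dependency order lived only in the operator's head.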
## Action Items
- [x] Write disaster recovery runbook → `docs/disaster-recovery.md`
- [x] Add fallback DNS resolvers to NixOS config → all nodes now have
1.1.1.1 and 8.8.8.8 as fallbacks after MCNS
- [x] Implement `mcp agent recover` command → MCP v0.10.5. Recreates
containers from the agent registry when podman DB is lost.
- [x] Implement boot sequencing in the agent → MCP v0.10.6.
[[boot.sequence]] config with per-stage health checks.
- [x] Fix modernc.org/sqlite `fchmod` → was our own `os.Chmod` in
`mcdsl/db/db.go`, not sqlite. Made best-effort in mcdsl v1.8.0.
- [x] Add multi-address support to node config → MCP v0.10.4.
Fallback addresses tried in order when primary fails.
- [x] Stabilize mcp UID → pinned at 850 with NEVER CHANGE comment