Compare commits: 25ee89455a...master (32 commits)

| SHA1 |
|---|
| c05ed968f9 |
| 02cfb27937 |
| 5aceb496e8 |
| a474f17a8b |
| ad34904b0b |
| cdf3d30bfe |
| 6a6c5a0c8e |
| 96abb87740 |
| 4de025c071 |
| 5158e6ae24 |
| 979a64a854 |
| 634b9096c0 |
| 24a0db0f47 |
| 78ce5003e1 |
| 0b11899a9b |
| fb9942db14 |
| f9eea00a6c |
| f3a608c4ae |
| 8fb6374257 |
| e441df34c8 |
| 644ded9efc |
| 4386fb0896 |
| 5c4ed5c681 |
| 4722fdb0da |
| 1715a53193 |
| 5e7e2522d7 |
| d041df01e4 |
| a04543028b |
| 90781014db |
| 24a8ae8c21 |
| 84610339aa |
| 8550b8c2e5 |
@@ -17,12 +17,17 @@ Metacircular is a multi-service personal infrastructure platform. This root repo
| `mcat/` | MCIAS login policy tester — lightweight web app to test and audit login policies | Go |
| `mcdsl/` | Standard library — shared packages for auth, db, config, HTTP/gRPC servers, CSRF, snapshots | Go |
| `mcdoc/` | Documentation server — renders markdown from Gitea, serves public docs via mc-proxy | Go |
-| `mcp/` | Control plane — operator-driven deployment, service registry, container lifecycle (master/agent) | Go |
| `mcq/` | Document review queue — push docs for review, MCP server for Claude integration | Go |
+| `mcp/` | Control plane — service deployment, container lifecycle, multi-node fleet management (CLI/agent, master in development) | Go |
| `mcns/` | Networking service — custom Go DNS server, authoritative for internal zones | Go |
| `ca/` | PKI infrastructure and secrets for dev/test (not source code, gitignored) | — |
| `docs/` | Platform-wide documentation (architecture overview, deployment guide) | Markdown |
| `log/` | Operational work logs and session notes | Markdown |

Each subproject has its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`. When working in a subproject, read its own CLAUDE.md first.

Some subprojects (mcat, mcdoc, mcq) may live at sibling paths (e.g., `../mcq/`) rather than as subdirectories, depending on workspace layout.

## Service Dependencies

MCIAS is the root dependency — every other service authenticates through it. No service maintains its own user database. The dependency graph:

@@ -33,6 +38,7 @@ mcias (standalone — no MCIAS dependency)
├── mc-proxy (uses MCIAS for admin auth)
├── mcr (uses MCIAS for auth + policy)
├── mcdoc (public, no MCIAS — fetches docs from Gitea)
├── mcq (uses MCIAS for auth; document review queue)
├── mcp (uses MCIAS for auth; orchestrates deployment and lifecycle)
├── mcns (uses MCIAS for auth; authoritative DNS for internal zones)
└── mcat (tests MCIAS login policies)

@@ -5,7 +5,7 @@ from its current manually-wired state to fully declarative deployment.
It is a living design document — not a spec, not a commitment, but a
record of where we are, where we want to be, and what's between.

-Last updated: 2026-03-27 (Phases A + B complete)
+Last updated: 2026-04-02 (Phases A–D complete, Phase E in progress)

---

@@ -181,9 +181,9 @@ about one node, one mc-proxy, or loopback-only backends.
#### 1. mcdsl: Proper Module Versioning — DONE

mcdsl is already properly versioned and released:
-- Tagged releases: `v0.1.0`, `v1.0.0`, `v1.0.1`
+- Tagged releases: `v0.1.0`, `v1.0.0`, `v1.0.1`, `v1.1.0`, `v1.2.0`
- All consuming services import by URL with pinned versions
-  (mcr, mcat, mcns, mc-proxy → `v1.0.0`; metacrypt → `v1.0.1`)
+  (all consuming services on `v1.2.0`)
- No `replace` directives anywhere
- Docker builds use standard `go mod download`
- `uses_mcdsl` eliminated from service definitions and docs

@@ -215,18 +215,14 @@ routes during deploy and stop:
- L4 routes: TLS passthrough, backend handles its own TLS
- Hostnames default to `<service>.svc.mcp.metacircular.net`

-#### 4. MCP Agent: TLS Cert Provisioning
+#### 4. MCP Agent: TLS Cert Provisioning — DONE

-**Gap**: certs are manually provisioned and placed on disk. There is no
-automated issuance flow.
-
-**Work**:
-- Agent requests certs from Metacrypt CA via its API.
-- Certs are stored in a standard location
-  (`/srv/mc-proxy/certs/<service>.pem`).
-- Cert renewal is handled automatically before expiry.
-
-**Depends on**: Metacrypt cert issuance policy (#7).
+Agent provisions TLS certificates from Metacrypt CA automatically during
+deploy for L7 routes:
+- ACME client library requests certs from Metacrypt CA via its API
+- Certs stored in `/srv/mc-proxy/certs/<service>.pem`
+- Provisioning happens during deploy before mc-proxy route registration
+- L7 routes get agent-provisioned certs; L4 routes use service-managed TLS

#### 5. mc-proxy: Route Persistence — DONE

@@ -243,57 +239,49 @@ mc-proxy routes are fully persisted in SQLite and survive restarts:
bootstrap before MCP is operational. The gRPC API and mcproxyctl
are the primary route management interfaces going forward.

-#### 6. MCP Agent: DNS Registration
+#### 6. MCP Agent: DNS Registration — DONE

-**Gap**: DNS records are manually configured in MCNS zone files.
+Agent automatically manages DNS records during deploy and stop:
+- Deploy: calls MCNS API to create/update A records for
+  `<service>.svc.mcp.metacircular.net` pointing to the node's address.
+- Stop/undeploy: removes DNS records before stopping containers.
+- Config: `[mcns]` section in agent config with server URL, CA cert,
+  token path, zone, and node address.
+- Nil-safe: if MCNS not configured, silently skipped (backward compatible).
+- Authorization: mcp-agent system account can manage any record name.

-**Work**:
-- Agent creates/updates A records in MCNS for
-  `<service>.svc.mcp.metacircular.net`.
-- Agent removes records on service teardown.
+#### 7. Metacrypt: Automated Cert Issuance Policy — DONE

-**Depends on**: MCNS record management API (#8).
+MCP agent has MCIAS credentials and Metacrypt policy for automated cert
+issuance:
+- MCP agent authenticates to Metacrypt with MCIAS service credentials
+- Metacrypt policy allows cert issuance for
+  `*.svc.mcp.metacircular.net`
+- One cert per hostname per service — no wildcard certs

-#### 7. Metacrypt: Automated Cert Issuance Policy
+#### 8. MCNS: Record Management API — DONE

-**Gap**: no policy exists for automated cert issuance. The MCP agent
-doesn't have a Metacrypt identity or permissions.
-
-**Work**:
-- MCP agent gets an MCIAS service account.
-- Metacrypt policy allows this account to issue certs scoped to
-  `*.svc.mcp.metacircular.net` (and explicitly listed public
-  hostnames).
-- No wildcard certs — one cert per hostname per service.
-
-**Depends on**: MCIAS service account provisioning (exists today, just
-needs the account created).
-
-#### 8. MCNS: Record Management API
-
-**Gap**: MCNS v1.0.0 has REST + gRPC APIs and SQLite storage, but
-records are currently seeded from migrations (static). The API supports
-CRUD operations but MCP does not yet call it for dynamic registration.
-
-**Work**:
-- MCP agent calls MCNS API to create/update/delete records on
-  deploy/stop.
-- MCIAS auth scoping to allow MCP agent to manage
-  `*.svc.mcp.metacircular.net` records.
-
-**Depends on**: MCNS API exists. Remaining work is MCP integration
-and auth scoping.
+MCNS provides full CRUD for DNS records via REST and gRPC:
+- REST: POST/GET/PUT/DELETE on `/v1/zones/{zone}/records`
+- gRPC: RecordService with ListRecords, CreateRecord, GetRecord,
+  UpdateRecord, DeleteRecord RPCs
+- SQLite-backed with transactional writes, CNAME exclusivity enforcement,
+  and automatic SOA serial bumping on mutations
+- Authorization: admin can manage any record, mcp-agent system account
+  can manage any record name, other system accounts scoped to own name
+- MCP agent uses the REST API to register/deregister records on
+  deploy/stop

#### 9. Application $PORT Convention — DONE

-mcdsl v1.1.0 adds `$PORT` and `$PORT_GRPC` env var support:
+mcdsl v1.2.0 added `$PORT` and `$PORT_GRPC` env var support:
- `config.Load` checks `$PORT` → overrides `Server.ListenAddr`
- `config.Load` checks `$PORT_GRPC` → overrides `Server.GRPCAddr`
- Takes precedence over TOML and generic env overrides
  (`$MCR_SERVER_LISTEN_ADDR`) — agent-assigned ports are authoritative
- Handles both `config.Base` embedding (MCR, MCNS, MCAT) and direct
  `ServerConfig` embedding (Metacrypt) via struct tree walking
-- MCR, Metacrypt, MCNS upgraded to mcdsl v1.1.0
+- All consuming services on mcdsl v1.4.0

---

@@ -311,32 +299,92 @@ Phase A — Independent groundwork: ✓ COMPLETE
Phase B — MCP route registration: ✓ COMPLETE
  #3 Agent registers routes with mc-proxy ✓ DONE

-Phase C — Automated TLS:
-  #7 Metacrypt cert issuance policy
-  #4 Agent provisions certs
+Phase C — Automated TLS: ✓ COMPLETE
+  #7 Metacrypt cert issuance policy ✓ DONE
+  #4 Agent provisions certs ✓ DONE
    (depends on #7)

-Phase D — DNS:
-  #8 MCNS record management API
-  #6 Agent registers DNS
+Phase D — DNS: ✓ COMPLETE
+  #8 MCNS record management API ✓ DONE
+  #6 Agent registers DNS ✓ DONE
    (depends on #8)

Phase E — Multi-node agent management:
  #10 Agent binary at /srv/mcp/mcp-agent on all nodes
  #11 mcp agent upgrade (SSH-based cross-compiled push)
  #12 Node provisioning tooling (Debian + NixOS)
    (depends on #10)
```

-**Phases A and B are complete.** Services can be deployed with
-agent-assigned ports, `$PORT` env vars, and automatic mc-proxy route
-registration. No more manual port picking, mcproxyctl, or TOML editing.
-
-The remaining manual steps are TLS cert provisioning (Phase C) and
-DNS registration (Phase D).
+**Phases A, B, C, and D are complete.** Services can be deployed with
+agent-assigned ports, `$PORT` env vars, automatic mc-proxy route
+registration, automated TLS cert provisioning from Metacrypt CA, and
+automatic DNS registration in MCNS. No more manual port picking,
+mcproxyctl, TOML editing, cert generation, or DNS zone editing.

### Immediate Next Steps

-1. **Phase C: Automated TLS** — Metacrypt cert issuance policy for MCP
-   agent, then agent provisions certs automatically during deploy.
-2. **Phase D: DNS** — MCNS record management API integration, then
-   agent registers DNS records during deploy.
-3. **mcdoc implementation** — fully designed, no platform evolution
-   dependency. Deployable now with the new route system.
+1. **Phase E: Multi-node agent management** — see below. See
+   `docs/phase-e-plan.md` for the detailed Phase E design and
+   `docs/architecture-v2.md` for the v2 master/agent architecture.
+2. **mcdoc** — deployed (v0.1.0 on rift). Serves rendered markdown
+   from Gitea via mc-proxy.

#### 10. Agent Binary Location Convention

**Gap**: The agent binary is currently NixOS-managed on rift (lives in
`/nix/store/`, systemd `ExecStart` points there). This doesn't work for
Debian nodes and requires a full `nixos-rebuild` for every MCP release.

**Work**:
- Standardize agent binary at `/srv/mcp/mcp-agent` on all nodes.
- NixOS config: change `ExecStart` from nix store path to
  `/srv/mcp/mcp-agent`. NixOS still owns user, systemd unit, podman,
  directories — just not the binary version.
- Debian nodes: same layout, provisioned by setup script.

#### 11. Agent Upgrade via SSH Push

**Gap**: Updating the agent requires manual, OS-specific steps. On
NixOS: update flake lock, commit, push, rebuild. On Debian: build, scp,
restart. With multiple nodes and architectures (amd64 + arm64), this
doesn't scale.

**Work**:
- `mcp agent upgrade [node]` CLI command.
- Cross-compiles agent for each target arch (`GOARCH` from node config).
- Uses `golang.org/x/crypto/ssh` to push the binary and restart the
  service. No external tool dependencies.
- Node config gains `ssh` (hostname) and `arch` (GOARCH) fields.
- Upgrades all nodes by default to prevent version skew. New RPCs cause
  `Unimplemented` errors if agent and CLI are out of sync.

**Depends on**: #10 (binary location convention).

#### 12. Node Provisioning Tooling

**Gap**: Setting up a new node requires manual steps: create user,
create directories, install podman, write config, create systemd unit.
Different for NixOS vs Debian.

**Work**:
- Go-based provisioning tool (part of MCP CLI) or standalone script.
- `mcp node provision <name>` SSHs to the node and runs setup:
  create `mcp` user with podman access, create `/srv/mcp/`, write
  systemd unit, install initial binary, start service.
- For NixOS, provisioning remains in the NixOS config (declarative).
  The provisioning tool targets Debian/generic Linux.

**Depends on**: #10 (binary location convention), #11 (SSH infra).

**Current fleet**:

| Node | OS | Arch | Status |
|------|----|------|--------|
| rift | NixOS | amd64 | Operational, single MCP agent |
| hyperborea | Debian (RPi) | arm64 | Online, agent provisioning planned |
| svc | Debian | amd64 | Runs MCIAS, needs agent for public edge services |
| orion | NixOS | amd64 | Provisioned, offline for maintenance |

---

@@ -25,7 +25,9 @@ lives in [docs/metacircular.md](docs/metacircular.md).
| **MC-Proxy** | Node ingress — TLS proxy and router. L4 passthrough or L7 terminating (per-route), PROXY protocol, firewall with rate limiting and GeoIP. | Implemented |
| **MCNS** | Networking — authoritative DNS for internal platform zones, upstream forwarding. | Implemented |
| **MCP** | Control plane — operator-driven deployment, service registry, data transfer, master/agent container lifecycle. | Implemented |
-| **MCDoc** | Documentation server — renders markdown from Gitea, serves public docs. | In progress |
+| **MCDoc** | Documentation server — renders markdown from Gitea, serves public docs. | Implemented |
| **MCQ** | Document review queue — push docs for review, MCP server for Claude Code integration. | Implemented |
| **MCDeploy** | Deployment CLI — tactical bridge tool, now deprecated and archived. Superseded by MCP. | Deprecated |

Shared library: **MCDSL** — standard library for all services (auth, db,
config, TLS server, CSRF, snapshots).

@@ -102,7 +104,9 @@ metacircular/
├── mcns/       DNS server
├── mcat/       Login policy tester
├── mcdsl/      Standard library (shared packages)
├── mcdeploy/   Deployment CLI (deprecated, archived)
├── mcdoc/      Documentation server
├── mcq/        Document review queue
├── ca/         PKI infrastructure (dev/test, not source code)
└── docs/       Platform-wide documentation
```

STATUS.md — 145 changed lines

@@ -1,38 +1,45 @@
# Metacircular Platform Status

-Last updated: 2026-03-26
+Last updated: 2026-04-02

## Platform Overview

-One node operational (**rift**), running core infrastructure services as
-containers fronted by MC-Proxy. MCIAS runs separately (not on rift).
-Bootstrap phases 0–4 complete (MCIAS, Metacrypt, MC-Proxy, MCR all
-operational). MCP is deployed and managing all platform containers. MCNS is
-deployed on rift, serving authoritative DNS.
+Two nodes operational (**rift** + **svc**), with **orion** provisioned but
+offline for maintenance. Core infrastructure services run as containers on
+rift, fronted by MC-Proxy. Svc operates as an MCP edge node managing
+mc-proxy routing only (no containers); MCIAS runs on svc separately as a
+systemd service. Bootstrap phases 0–4 complete (MCIAS, Metacrypt, MC-Proxy,
+MCR all operational). MCP is deployed and managing all platform containers
+on rift, with multi-node capability (svc as edge node). MCNS is deployed on
+rift, serving authoritative DNS. Platform evolution Phases A–D complete
+(automated port assignment, route registration, TLS cert provisioning, and
+DNS registration). Phase E (multi-node expansion) is in planning, with v2
+architecture in development.

## Service Status

| Service | Version | SDLC Phase | Deployed | Node |
|---------|---------|------------|----------|------|
-| MCIAS | v1.8.0 | Maintenance | Yes | (separate) |
-| Metacrypt | v1.1.0 | Production | Yes | rift |
-| MC-Proxy | v1.1.0 | Maintenance | Yes | rift |
-| MCR | v1.2.0 | Production | Yes | rift |
-| MCAT | v1.1.0 | Complete | Unknown | — |
-| MCDSL | v1.2.0 | Stable | N/A (library) | — |
-| MCNS | v1.1.0 | Production | Yes | rift |
-| MCP | v0.3.0 | Production | Yes | rift |
-| MCDeploy | v0.2.0 | Active dev | N/A (CLI tool) | — |
+| MCIAS | v1.10.5 | Maintenance | Yes | svc (systemd) |
+| Metacrypt | v1.4.1 | Production | Yes | rift |
+| MC-Proxy | v1.2.2 | Maintenance | Yes | rift |
+| MCR | v1.3.2 | Production | Yes | rift |
+| MCAT | v1.2.0 | Production | Yes | rift |
+| MCDSL | v1.7.0 | Stable | N/A (library) | — |
+| MCNS | v1.2.0 | Production | Yes | rift |
+| MCDoc | v0.1.0 | Production | Yes | rift |
+| MCQ | v0.4.2 | Production | Yes | rift |
+| MCP | v0.9.0 | Production | Yes | rift |

## Service Details

### MCIAS — Identity and Access Service

-- **Version:** v1.8.0 (client library: clients/go/v0.2.0)
+- **Version:** v1.10.5 (client library: clients/go/v0.2.0)
- **Phase:** Maintenance. Phases 0-14 complete. Feature-complete with active
  refinement.
-- **Deployment:** Running in production. All other services authenticate
-  against it.
+- **Deployment:** Running in production on svc as a systemd service. All
+  other services authenticate against it.
- **Recent work:** WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login
  policies, Nix flake for CLI tools.
- **Artifacts:** systemd units (service + backup timer), install script,

@@ -40,7 +47,7 @@ deployed on rift, serving authoritative DNS.

### Metacrypt — Cryptographic Service Engine

-- **Version:** v1.1.0.
+- **Version:** v1.4.1 (API v1.3.1, Web v1.4.1).
- **Phase:** Production. All four engine types implemented (CA, SSH CA, transit,
  user-to-user). Active work on integration test coverage.
- **Deployment:** Running on rift as a container, fronted by MC-Proxy on

@@ -52,18 +59,21 @@ deployed on rift, serving authoritative DNS.

### MC-Proxy — TLS Proxy and Router

-- **Version:** v1.1.0. Phases 1-8 complete.
-- **Phase:** Maintenance. Stable and actively routing traffic on rift.
+- **Version:** v1.2.2.
+- **Phase:** Maintenance. Stable and actively routing traffic on rift and svc.
- **Deployment:** Running on rift. Fronts Metacrypt, MCR, and sgard on ports
-  443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091.
-- **Recent work:** MCR route additions, Nix flake, L7 backend cert handling,
-  Prometheus metrics, L7 policies.
+  443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091. Routes persisted
+  in SQLite and managed via gRPC API. Svc runs its own mc-proxy on :443 with
+  public-facing routes.
+- **Recent work:** Route persistence (SQLite), idempotent AddRoute (upsert),
+  golangci-lint v2 compliance, module path migration to mc/ org.
- **Artifacts:** systemd units (service + backup timer), Docker Compose
  (standard + rift), install and backup scripts, rift config.

### MCR — Container Registry

-- **Version:** v1.2.0. All implementation phases complete.
+- **Version:** v1.3.2 (API v1.2.1, Web v1.3.2). All implementation phases
+  complete.
- **Phase:** Production. Deployed on rift, serving container images.
- **Deployment:** Running on rift as two containers (mcr API + mcr-web),
  fronted by MC-Proxy on ports 443 (web, L7), 8443 (API, L4), and

@@ -76,77 +86,104 @@ deployed on rift, serving authoritative DNS.

### MCAT — Login Policy Tester

-- **Version:** v1.1.0.
-- **Phase:** Complete. Diagnostic tool, not core infrastructure.
-- **Deployment:** Available for ad-hoc use. Lightweight tool for testing
-  MCIAS login policy rules.
+- **Version:** v1.2.0.
+- **Phase:** Production. Deployed on rift as a container managed by MCP.
+- **Deployment:** Running on rift. Lightweight tool for testing MCIAS login
+  policy rules.
- **Recent work:** Migrated to mcdsl for auth, config, CSRF, and web.
- **Artifacts:** systemd unit, install script, example config.

### MCDSL — Standard Library

-- **Version:** v1.2.0.
+- **Version:** v1.7.0.
- **Phase:** Stable. All 9 packages implemented and tested. Being adopted
  across the platform.
- **Deployment:** N/A (Go library, imported by other services).
- **Packages:** auth, db, config, httpserver, grpcserver, csrf, web, health,
  archive.
-- **Adoption:** All services except mcias on v1.2.0. mcias pending.
+- **Adoption:** All services except mcias on v1.7.0. mcias pending.

### MCNS — Networking Service

-- **Version:** v1.1.0.
+- **Version:** v1.2.0.
- **Phase:** Production. Custom Go DNS server replacing CoreDNS precursor.
- **Deployment:** Running on rift as a container managed by MCP. Serves two
-  authoritative zones plus upstream forwarding.
+  authoritative zones plus upstream forwarding. REST + gRPC APIs with MCIAS
+  auth and name-scoped system account authorization.
- **Recent work:** v1.0.0 implementation (custom Go DNS server), engineering
  review, deployed to rift replacing CoreDNS.
- **Artifacts:** Dockerfile, Docker Compose (rift), MCP service definition,
  systemd units, install script, example config.

### MCDoc — Documentation Server

- **Version:** v0.1.0.
- **Phase:** Production. Fetches and renders markdown documentation from Gitea.
- **Deployment:** Running on rift as a container, fronted by MC-Proxy on
  port 443 (L7).
- **Recent work:** Initial implementation, Gitea content fetching, goldmark
  rendering with syntax highlighting, webhook-driven refresh.
- **Artifacts:** Dockerfile, MCP service definition.

### MCQ — Document Review Queue

- **Version:** v0.4.2.
- **Phase:** Production. Document review queue with MCP server for Claude
  integration.
- **Deployment:** Running on rift as a container managed by MCP.
- **Recent work:** Claude MCP server integration, document review workflow.
- **Artifacts:** Dockerfile, MCP service definition.

### MCP — Control Plane

-- **Version:** v0.3.0.
-- **Phase:** Production. Phases 0-4 complete. Deployed to rift, managing all
-  platform containers.
+- **Version:** v0.9.0 (agent on rift: v0.8.3-dirty, agent on svc: v0.9.0).
+- **Phase:** Production. Phases A–D complete. Multi-node capable with svc
+  operating as an edge node. V2 architecture in development, Phase E planning
+  underway.
- **Deployment:** Running on rift. Agent as systemd service under `mcp` user
-  with rootless podman. Manages metacrypt, mc-proxy, mcr, and mcns containers.
+  with rootless podman. Manages metacrypt, mc-proxy, mcr, mcns, mcdoc, mcat,
+  mcq, and non-platform containers. Svc runs an MCP agent for edge mc-proxy
+  route management.
- **Architecture:** Two components — `mcp` CLI (thin client on vade) and
  `mcp-agent` (per-node daemon with SQLite registry, podman management,
-  monitoring with drift/flap detection). gRPC-only (no REST).
-- **Recent work:** Full v1 implementation (12 RPCs, 15 CLI commands),
-  deployment to rift, container migration from kyle→mcp user, service
-  definition authoring.
+  monitoring with drift/flap detection, route registration with mc-proxy,
+  automated TLS cert provisioning for L7 routes via Metacrypt CA, automated
+  DNS registration in MCNS). gRPC-only (no REST). 15 RPCs, 17+ CLI commands.
+- **Recent work:** Phase C (automated TLS cert provisioning), Phase D
+  (automated DNS registration via MCNS), undeploy command, logs command,
+  edit command, auto-login to MCR, system account auth model, module path
+  migration, multi-node edge support (svc).
- **Artifacts:** systemd service (NixOS), TLS cert from Metacrypt, service
  definition files, design docs.

### MCDeploy — Deployment CLI

- **Version:** v0.2.0.
- **Phase:** Active development. Tactical bridge tool for deploying services
  while MCP is being built.
- **Deployment:** N/A (local CLI tool, not a server).
- **Recent work:** Initial implementation, Nix flake.
- **Description:** Single-binary CLI that shells out to podman/ssh/scp/git
  for build, push, deploy, cert renewal, and status. TOML-configured.
## Node Inventory

| Node | Address (LAN) | Address (Tailscale) | Role |
|------|---------------|---------------------|------|
| rift | 192.168.88.181 | 100.95.252.120 | Infrastructure services |
| svc | — | 100.106.232.4 | Edge — public mc-proxy, MCIAS (systemd) |
| orion | 192.168.88.214 | — | Worker (provisioned, offline for maintenance) |

## Rift Port Map

Note: Services deployed via MCP receive dynamically assigned host ports
(10000–60000). The ports below are for infrastructure services with static
assignments or well-known ports.

| Port | Protocol | Services |
|------|----------|----------|
| 53 | DNS (LAN + Tailscale) | mcns |
-| 443 | L7 (TLS termination) | metacrypt-web, mcr-web |
+| 443 | L7 (TLS termination) | metacrypt-web, mcr-web, mcdoc, mcat, kls |
| 8080 | HTTP (all interfaces) | exod |
| 8443 | L4 (SNI passthrough) | metacrypt API, mcr API |
| 9090 | HTTP (all interfaces) | exod |
| 9443 | L4 (SNI passthrough) | metacrypt gRPC, mcr gRPC, sgard |
| 9091 | HTTP (loopback) | MC-Proxy Prometheus metrics |

-Non-platform services also running on rift: **exod** (ports 8080/9090),
-**sgardd** (port 19473, fronted by MC-Proxy on 9443).
+Svc runs its own mc-proxy on :443 (L7) with public-facing routes:
+kls.metacircular.net, mcq.metacircular.net, metacrypt.metacircular.net,
+docs.metacircular.net, git.metacircular.net, git.wntrmute.dev.
+
+Non-platform services also running on rift: **exod** (v0.1.0, ports
+8080/9090), **sgardd** (v3.2.0, port 19473, fronted by MC-Proxy on 9443),
+**kls** (v0.2.0).

SVC_WORK.md — new file, 158 lines

@@ -0,0 +1,158 @@
# svc.metacircular.net — Phase 1 Work Log

Date: 2026-04-02
Purpose: Deploy mcp-agent to svc (edge node) for MCP v2 Phase 1.

## Changes Made

### 1. Created `mcp` system user
```
useradd --system --home-dir /srv/mcp --create-home --shell /usr/sbin/nologin mcp
usermod -aG mc-proxy mcp
```
- UID 992, GID 991
- Member of `mc-proxy` group for socket access

### 2. Created `/srv/mcp/` directory structure
```
/srv/mcp/
├── mcp-agent        # binary (v0.8.3-1-gfa8ba6f, linux/amd64)
├── mcp-agent.toml   # agent config
├── mcp.db           # SQLite registry (created on first run)
└── certs/
    ├── cert.pem     # TLS cert (SAN: IP:100.106.232.4, DNS:svc.svc.mcp.metacircular.net)
    ├── key.pem      # TLS private key
    └── ca.pem       # Metacircular CA cert
```
- Owned by `mcp:mcp`, key file mode 0600

### 3. TLS certificate
- Issued from the Metacircular CA (`ca/ca.pem` + `ca/ca.key`)
- Subject: `CN=mcp-agent-svc`
- SANs: `IP:100.106.232.4`, `DNS:svc.svc.mcp.metacircular.net`
- Validity: 365 days
- Stored at `/srv/mcp/certs/{cert,key,ca}.pem`

### 4. Agent configuration
- File: `/srv/mcp/mcp-agent.toml`
- gRPC listen: `100.106.232.4:9555` (port 9444 in use by MCNS)
- MCIAS: `https://mcias.metacircular.net:8443`
- mc-proxy socket: `/srv/mc-proxy/mc-proxy.sock`
- Node name: `svc`
- Runtime: `podman` (not used on edge, but required by config)

### 5. systemd unit
- File: `/etc/systemd/system/mcp-agent.service`
- Runs as `mcp:mcp`
- Security hardened (NoNewPrivileges, ProtectSystem=strict, etc.)
- ReadWritePaths: `/srv/mcp`, `/srv/mc-proxy/mc-proxy.sock`
- Enabled and started

### 6. mc-proxy directory permissions
- Changed `/srv/mc-proxy/` from `drwx------` to `drwxr-x---` (group traversal)
- Changed `/srv/mc-proxy/mc-proxy.sock` from `srw-------` to `srw-rw----` (group read/write)
- Required for `mcp` user (in `mc-proxy` group) to access the socket
|
||||
|
||||
### 7. MCP CLI config update (on rift)
|
||||
- Added svc node to `~/.config/mcp/mcp.toml`:
|
||||
```toml
|
||||
[[nodes]]
|
||||
name = "svc"
|
||||
address = "100.106.232.4:9555"
|
||||
```
|
||||
|
||||
## Verification
|
||||
```
|
||||
$ mcp node list
|
||||
NAME ADDRESS VERSION
|
||||
rift 100.95.252.120:9444 v0.8.3-dirty
|
||||
svc 100.106.232.4:9555 v0.8.3-1-gfa8ba6f
|
||||
|
||||
$ mcp route list -n svc
|
||||
NODE: svc
|
||||
mc-proxy v1.2.1-2-g82fce41-dirty
|
||||
:443 routes=6
|
||||
l7 git.wntrmute.dev → 127.0.0.1:3000
|
||||
l7 kls.metacircular.net → 100.95.252.120:58080
|
||||
l7 mcq.metacircular.net → 100.95.252.120:48080
|
||||
l7 metacrypt.metacircular.net → 100.95.252.120:18080 (re-encrypt)
|
||||
l7 docs.metacircular.net → 100.95.252.120:38080
|
||||
l7 git.metacircular.net → 127.0.0.1:3000
|
||||
```
|
||||
|
||||
## Agent Cert Reissue (2026-04-02)

Both agent certs were reissued with comprehensive SANs:

**Rift agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `rift.scylla-hammerhead.ts.net`, `mcp-agent.svc.mcp.metacircular.net`
- IP: `100.95.252.120`, `192.168.88.181`

**Svc agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `svc.scylla-hammerhead.ts.net`, `svc.svc.mcp.metacircular.net`
- IP: `100.106.232.4`

Both agents upgraded to v0.10.0 (Phase 2 edge routing RPCs + v2 proto fields).
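To confirm the SANs on a reissued cert, OpenSSL (1.1.1+) can print the extension directly. Shown here against a scratch self-signed cert so the commands are self-contained; on a real node point `-in` at `/srv/mcp/certs/cert.pem`:

```shell
# Generate a throwaway cert with SANs matching the svc agent's, then read
# them back. The real cert comes from the CA, not this self-signed flow.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes \
  -keyout /tmp/agent-key.pem -out /tmp/agent-cert.pem -days 1 \
  -subj "/CN=svc" \
  -addext "subjectAltName=DNS:svc.scylla-hammerhead.ts.net,IP:100.106.232.4"
openssl x509 -in /tmp/agent-cert.pem -noout -ext subjectAltName
```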
## MCP Master Deployment (2026-04-02)

**Binary**: `/srv/mcp-master/mcp-master` (v0.10.0) on rift
**Config**: `/srv/mcp-master/mcp-master.toml`
**Database**: `/srv/mcp-master/master.db`
**Certs**: `/srv/mcp-master/certs/{cert,key,ca}.pem`
- SAN: `rift.scylla-hammerhead.ts.net`, `mcp-master.svc.mcp.metacircular.net`, IP `100.95.252.120`

**Service token**: `/srv/mcp-master/mcias-token` (MCIAS identity: `mcp-master`, expires 2027-04-03)
**Listen**: `100.95.252.120:9555`
**Bootstrap nodes**: rift (master), svc (edge)

**Status**: Running via `doas` (ad-hoc). NixOS read-only /etc prevents
direct systemd unit creation — needs a NixOS config update for a persistent
service.

**Tested**:
- `mcp deploy mcq` → master places on rift, forwards to agent ✓
- `mcp undeploy mcq` → master forwards to agent, cleans up placement ✓
- `mcp ps` → fleet-wide status through agents ✓
- `mcp node list` → both nodes visible with versions ✓
## CLI Config Changes (vade)

Updated `~/.config/mcp/mcp.toml`:
- Added a `[master]` section: `address = "rift.scylla-hammerhead.ts.net:9555"`
- Switched all node addresses to Tailscale DNS names
- Added the CA cert path
## Known Limitations

- ~~mc-proxy socket permissions will reset on restart~~ **FIXED**: mc-proxy
  now creates the socket with 0660 (was 0600). Committed to mc-proxy master.
- The master runs ad-hoc via `doas` on rift. It needs a NixOS systemd config
  for a persistent service (rift has a read-only /etc).
- DNS registration is not configured on the master (the MCNS config is
  omitted for now).
- Edge routing is not yet tested end-to-end through the master (svc cert
  provisioning is not configured).
- The TLS cert was issued from the local CA directly, not via the Metacrypt
  API. It should be reissued via Metacrypt once the agent has cert
  provisioning.
- The container runtime is set to `podman`, but podman is not installed on
  svc (Docker is). Edge agents don't run containers, so this is benign.
- Metacrypt and MCNS integrations are not configured (not needed for the
  edge role).
## Edge Routing E2E Test (2026-04-02)

The full edge routing flow was tested successfully through the master:

1. Added a `public = true` route for `mcq.metacircular.net` to the mcq service def
2. `mcp deploy mcq` → master placed on rift, deployed, registered DNS, set up the edge route on svc
3. The svc agent provisioned a TLS cert from Metacrypt (expires 2026-06-28)
4. mc-proxy route created: `mcq.metacircular.net → 100.95.252.120:8443` (re-encrypt)
5. The edge route persisted in both the master DB and the svc agent registry

**Fix required**: `RouteDef` in servicedef.go was missing the `Public` field —
TOML `public = true` was silently dropped. Fixed in v0.10.2 along with the
`Tier` field and relaxed `Node` validation.
## DNS Registration Working (2026-04-02)

The master registers Tailnet IPs in MCNS (not LAN IPs). The fix in v0.10.1
resolves Tailscale DNS names to IPs before passing them to MCNS.

`mcq.svc.mcp.metacircular.net → 100.95.252.120` (Tailnet IP) ✓
---

# TOOLING.md

Go binaries built with `CGO_ENABLED=0` and installed via Nix flakes.

| Tool | Project | Purpose | Install target |
|------|---------|---------|----------------|
| `mcp` | mcp | Control plane CLI — deploy, status, lifecycle, file transfer | vade, orion |
| `mcp-agent` | mcp | Control plane agent — per-node container management daemon | rift, svc (systemd) |
| `mciasctl` | mcias | MCIAS admin CLI — accounts, tokens, policies | vade, orion, rift |
| `mciasgrpcctl` | mcias | MCIAS gRPC debug CLI | vade, orion, rift |
| `mcproxyctl` | mc-proxy | MC-Proxy admin CLI — routes, firewall, status | vade, orion, rift |
| `mcrctl` | mcr | MCR admin CLI — repositories, policies, audit | vade, orion, rift |
| `mcdeploy` | mcdeploy | Deployment CLI — build, push, deploy (bridge tool) | vade |
### Server-only binaries (not installed as tools)

These run inside containers and are not installed on operator workstations:

| `mc-proxy` | mc-proxy | TLS proxy server |
| `mcns` | mcns | DNS server |
| `mcat` | mcat | Login policy tester web app |
| `mcdoc` | mcdoc | Documentation server |
| `mcq` | mcq | Document review queue |
## Installation

Tools are installed via `mcpkg.nix` in the NixOS configuration. Adding a tool:

- Build: `pkgs.buildGoModule` with `vendorHash = null` (vendored deps).
- ldflags: `-s -w -X main.version=${version}`.
- `subPackages`: list only the client binaries, not servers.
- `system`: `x86_64-linux` for rift, svc, and orion; `aarch64-linux`
  for hyperborea. Flakes that target the full fleet should support both.
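Put together, a new tool entry following those conventions might look like this (a sketch: the attribute name, version, and `subPackages` path are placeholders, and `src` is left elided):

```nix
# Hypothetical mcpkg.nix entry following the conventions above.
mcrctl = pkgs.buildGoModule rec {
  pname = "mcrctl";
  version = "1.2.1";
  src = ...;                       # fetched from git.wntrmute.dev
  vendorHash = null;               # vendored deps
  subPackages = [ "cmd/mcrctl" ];  # client binary only, not servers
  ldflags = [ "-s" "-w" "-X main.version=${version}" ];
};
```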
### MCP agent

The `mcp-agent` is a special case: it runs as a systemd service on
managed nodes (not as a container, since it manages containers). Its
flake exposes both `mcp` (client) and `mcp-agent` (server). Phase E is
moving the agent binary to `/srv/mcp/mcp-agent` on all nodes — NixOS
`ExecStart` will point there instead of a nix store path, and Debian
nodes use the same layout. svc already follows this convention. See
`docs/phase-e-plan.md` for details.
## Flake status

| mcias | Yes | mciasctl, mciasgrpcctl | Yes | |
| mc-proxy | Yes | mcproxyctl | Yes | |
| mcr | Yes | mcrctl | Yes | |
| mcdeploy | Yes | mcdeploy | Yes | |
| mcp | Yes | mcp, mcp-agent | Yes | Agent also used by mcp.nix systemd unit |
| mcns | No | — | No | Server-only, no client tool yet |
| metacrypt | No | — | No | Server-only, no client tool yet |
| mcat | No | — | No | Server-only, no client tool yet |
| mcdoc | No | — | No | Server-only, deployed as container |
| mcq | No | — | No | Server-only, document review queue |
| mcdsl | No | — | No | Library, no binaries |
---

# docs/architecture-v2.md

New file, 1657 lines; diff suppressed because it is too large.

---

# docs/disaster-recovery.md
# Disaster Recovery: Bootstrap from Zero

This document covers recovering the Metacircular platform when all
services on rift are down and no containers are running. It assumes:

- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (`/srv/<service>/`) are intact on disk
- Container images are cached in podman's overlay storage

If images are NOT cached (fresh machine, disk wipe), see the
"Cold Start" section at the end.
## Prerequisites

Before starting recovery, verify:

```bash
# 1. Machine is up
hostname  # should print "rift"

# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)

# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)

# 4. The agent is running
systemctl status mcp-agent
# Should be active

# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 0
```

If Tailscale is not running: `doas systemctl start tailscaled && doas tailscale up`

If the agent is not running: check that `/srv/mcp/mcp-agent` exists and
`/srv/mcp/mcp-agent.toml` is correct, then `doas systemctl restart mcp-agent`.
## Recovery Order

Services must be started in dependency order. Each stage must be
healthy before the next starts.

```
Stage 1 (Foundation):  MCNS → DNS works
Stage 2 (Core):        mc-proxy, MCR, Metacrypt → routing + images + certs
Stage 3 (Management):  mcp-master → orchestration
Stage 4 (Services):    mcq, mcdoc, mcat, kls, sgard, exo → applications
```
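The per-stage health gates can be scripted with a small polling helper (a hypothetical function, not part of the toolchain; swap in the per-stage check, e.g. `dig` for MCNS or `curl` for the HTTPS services):

```shell
# Retry a check command until it succeeds or ~60s elapse.
wait_healthy() {
  for i in $(seq 1 30); do
    if "$@" >/dev/null 2>&1; then return 0; fi
    sleep 2
  done
  echo "timed out waiting for: $*" >&2
  return 1
}

# Example gates:
#   wait_healthy dig @192.168.88.181 google.com +short
#   wait_healthy curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/
```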
## Manual Recovery Commands

All commands run as the mcp user. Use this shell prefix:

```bash
# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'
# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift
```

For brevity, the commands below show the `podman run` portion only. Prefix
them with the environment setup above.
### Stage 1: MCNS (DNS)

MCNS must start first. Without it, no hostname resolution works.

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp \
  -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp \
  -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```

**Verify:**
```bash
dig @192.168.88.181 google.com +short
# Should return an IP address
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP
```

**Note:** MCNS binds to specific IPs, not `0.0.0.0`, because
systemd-resolved holds port 53 on localhost. The explicit bindings
avoid the conflict.
### Stage 2: Core Infrastructure

#### mc-proxy (TLS routing)

```bash
podman run -d --name mc-proxy --restart unless-stopped \
  --network host \
  -v /srv/mc-proxy:/srv/mc-proxy \
  mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
  server --config /srv/mc-proxy/mc-proxy.toml
```

**Verify:** `curl -sk https://localhost:443/ 2>&1 | head -1`
(should get a response, even if 404)
#### MCR (Container Registry)

```bash
# API server
podman run -d --name mcr-api --restart unless-stopped \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28443:8443 \
  -p 127.0.0.1:29443:9443 \
  mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
  server --config /srv/mcr/mcr.toml

# Web UI
podman run -d --name mcr-web --restart unless-stopped \
  --user 0:0 \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
  server --config /srv/mcr/mcr.toml
```

**If MCR fails with "chmod" or "readonly database":**
```bash
podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api
```
This recreates the database empty. Image blobs in `/srv/mcr/layers/`
are preserved but tag metadata is lost. Re-push images to rebuild the
registry.
#### Metacrypt (PKI / Secrets)

```bash
# API server
podman run -d --name metacrypt-api --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18443:8443 \
  -p 127.0.0.1:19443:9443 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
  server --config /srv/metacrypt/metacrypt.toml

# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18080:8080 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
  --config /srv/metacrypt/metacrypt.toml
```

**If Metacrypt fails with "chmod" or "readonly database":**
Same fix as MCR — delete the database files. **Warning:** this loses
all encrypted secrets, issued-cert tracking, and CA state. The CA
key itself is in the sealed vault (password-protected), not in SQLite.
### Stage 3: MCP Master

```bash
podman run -d --name mcp-master --restart unless-stopped \
  --network host \
  -v /srv/mcp-master:/srv/mcp-master \
  mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
  server --config /srv/mcp-master/mcp-master.toml
```

**Verify:**
```bash
# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion
```
### Stage 4: Application Services

Once the master is running, deploy applications through MCP:

```bash
mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct
```

Or start them manually:

```bash
# MCQ
podman run -d --name mcq --restart unless-stopped \
  -v /srv/mcq:/srv/mcq \
  -p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
  server --config /srv/mcq/mcq.toml

# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
  -v /srv/mcdoc:/srv/mcdoc \
  -p 127.0.0.1:38080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
  server --config /srv/mcdoc/mcdoc.toml

# MCAT
podman run -d --name mcat --restart unless-stopped \
  -v /srv/mcat:/srv/mcat \
  -p 127.0.0.1:48116:8443 \
  mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
  server --config /srv/mcat/mcat.toml

# KLS
podman run -d --name kls --restart unless-stopped \
  -v /srv/kls:/srv/kls \
  -p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
  mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
  -f /srv/kls/kls.conf

# Sgard
podman run -d --name sgardd --restart unless-stopped \
  -v /srv/sgard:/srv/sgard \
  -p 127.0.0.1:19473:9473 \
  mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
  --repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
  --tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key
```
## Verification Checklist

After all services are running:

```bash
# Fleet status
mcp ps
# All services should show "running"

# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short

# MCIAS (runs on svc, should be unaffected by a rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health

# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/

# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health

# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```
## Common Errors

### "chmod: operation not permitted"

modernc.org/sqlite calls `fchmod()` on database files. This is denied
inside rootless podman user namespaces. Fix:

```bash
# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>
```

The `fchmod` error will still appear in the logs as a warning but is
non-fatal for newly created databases.
### "address already in use" on port 53

systemd-resolved holds port 53 on localhost. MCNS must bind to
specific IPs, not `0.0.0.0:53`. Use explicit port bindings:
`-p 192.168.88.181:53:53 -p 100.95.252.120:53:53`

### "connection refused" to MCR

MCR is down. Images are cached locally — you can start services that
use cached images without MCR. MCR itself starts from its cached
image.
### Agent shows "error" for all nodes

Check:
1. Tailscale is running on both the CLI machine and the target node
2. The agent is listening: `ss -tlnp | grep 9444`
3. The CLI config has the correct addresses
4. TLS certs have the right SANs for the Tailnet IP

### "podman: executable file not found"

This warning appears for svc (which uses Docker, not podman). It's
benign — svc is an edge node that doesn't run containers.
## Cold Start (No Cached Images)

If the machine was wiped and no images are cached:

1. **MCIAS** runs on svc (Docker container), not rift. It should be
   unaffected by a rift failure. Verify:
   `ssh svc.metacircular.net "docker ps | grep mcias"`.

2. **Pre-stage images** by pulling from a backup or building locally:

   ```bash
   # On vade (operator workstation), build and push to a temp location
   cd ~/src/metacircular/mcns && make docker
   podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
     ssh rift "podman load"
   ```

   Repeat for each service.

3. Alternatively, if another node has MCR access, push images there
   first, then pull from the running MCR instance.
## Service Reference

Quick reference for all services, their images, and critical flags:

| Service | Image | Network | Key Ports | Config Path |
|---------|-------|---------|-----------|-------------|
| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml |
| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml |
| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml |
| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml |
| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml |
| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml |
| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml |
| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml |
| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml |
| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml |
| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf |
| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) |

All images are prefixed with `mcr.svc.mcp.metacircular.net:8443/`.
---

…the spec disagree, one of them has a bug.

## High-Level Overview

Metacircular infrastructure is built from six core components and a
documentation server, plus a shared standard library (**MCDSL**) that provides
the common patterns all services depend on (auth integration, database setup,
config loading, HTTP/gRPC server bootstrapping, CSRF, web session management,
health checks, snapshots, and service directory archiving):
- **MCIAS** — Identity and access. The root of trust for all other services.
  Handles authentication, token issuance, role management, and login policy…

- …accepts outside connections, and routes them to the correct service — either
  as raw TCP passthrough or via TLS-terminating HTTP/2 reverse proxy.
- **MCDoc** — Documentation server. Fetches markdown from Gitea repositories,
  renders HTML with syntax highlighting, serves a navigable documentation site.
  Public-facing, no MCIAS authentication required.

These components form a dependency graph rooted at MCIAS (diagram elided
from this diff).

…delegates authentication to it; no service maintains its own user database.
Services validate tokens by calling back to MCIAS (cached 30s by SHA-256 of
the token).
- **Role-based access.** Three roles — `admin` (MCIAS account management,
  policy changes, zone mutations — reserved for human operators), `user`
  (policy-governed), `guest` (service-dependent restrictions, rejected by MCP
  agent). Admin detection comes solely from the MCIAS `admin` role; services
  never promote users locally. Routine operations (deploy, push, DNS updates)
  do not require admin.
- **Account types.** Human accounts (interactive users) and system accounts
  (service-to-service). Both produce standard JWTs validated the same way.
  System accounts carry no roles — their authorization is handled by each
  service's policy engine (Metacrypt policies, MCNS name-scoped access, MCR
  default policies). System account tokens are long-lived (365-day default)
  and do not require passwords for issuance.
- **Login policy.** Priority-based ACL rules that control who can log into
  which services. Rules can target roles, account types, service names, and…

…MCIAS evaluates login policy against the service context, verifies
credentials, and returns a bearer token. The MCIAS Go client library
(`git.wntrmute.dev/mc/mcias/clients/go`) handles this flow.

**Status:** Implemented. v1.9.0. Feature-complete with active refinement
(WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login policies).
---

…operations on which engine mounts. Priority-based evaluation, default deny,
admin bypass. See Metacrypt's `POLICY.md` for the full model.

**Status:** Implemented. v1.3.1. All four engine types complete — CA (with ACME
support), SSH CA, transit encryption, and user-to-user encryption.
---
…serves the container images that MCP deploys across the platform.

- **Authenticated access.** No anonymous access. MCR uses the OCI token
  authentication flow: clients hit `/v2/`, receive a 401 with a token
  endpoint, authenticate via MCIAS, and use the returned JWT for subsequent
  requests. The token endpoint accepts both username/password (standard
  login) and pre-existing MCIAS JWTs as passwords (personal-access-token
  pattern), enabling non-interactive push/pull for system accounts and CI.

- **Policy-controlled push/pull.** Fine-grained ACL rules govern who can push
  to or pull from which repositories. Integrated with MCIAS roles.
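The token endpoint a client must call is advertised in the `WWW-Authenticate` header of that 401; a minimal parse of such a header (the header text here is illustrative, not MCR's actual values):

```shell
# Extract realm and service from a Bearer challenge; a client then requests
#   GET "$realm?service=$service&scope=repository:<name>:pull"
# authenticating with username/password or an MCIAS JWT as the password.
hdr='Bearer realm="https://mcr.svc.mcp.metacircular.net:8443/token",service="mcr"'
realm=$(printf '%s' "$hdr" | sed 's/.*realm="\([^"]*\)".*/\1/')
service=$(printf '%s' "$hdr" | sed 's/.*service="\([^"]*\)".*/\1/')
echo "$realm"
echo "$service"
```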
…is scheduled, MCP tells the node's agent which image to pull and where to get
it. MCR sits behind an MC-Proxy instance for TLS routing.

**Status:** Implemented. v1.2.1. All implementation phases complete.

---
…two instances — an edge proxy on a public VPS and an origin proxy on the
private network, connected over the overlay with PROXY protocol preserving
client IPs across the hop.

**Status:** Implemented. v1.2.1. Route state persisted in SQLite with
write-through semantics. gRPC admin API with idempotent AddRoute for runtime
route management.
---
…into DNS records.

…using internal DNS names automatically resolve to the right place without
config changes.

- **Record management API.** Authenticated via MCIAS with name-scoped
  authorization. Admin can manage all records and zones. The `mcp-agent`
  system account can create and delete any record. Other system accounts
  can only manage records matching their own name (e.g., system account
  `mcq` can manage `mcq.svc.mcp.metacircular.net` but not other records).
  Human users have read-only access to records. Zone mutations (create,
  update, delete zones) remain admin-only.
**How it fits in:** MCNS answers "what is the address of X?" MCP answers "where
is service α running?" and pushes the answer to MCNS. This separation means
services can use stable DNS names in their configs (e.g.,
`mcias.svc.mcp.metacircular.net` in `[mcias] server_url`) that survive
migration without config changes.

**Status:** Implemented. v1.1.1. Custom Go DNS server deployed on rift,
serving two authoritative zones (`svc.mcp.metacircular.net` and
`mcp.metacircular.net`) plus upstream forwarding. REST + gRPC APIs with
MCIAS auth and name-scoped system account authorization. Records stored
in SQLite.
|
||||
---
- …the initial config, pulls the image from MCR, starts the container, and
  pushes a DNS update to MCNS (`α.svc.mcp.metacircular.net` → node address).

- **Undeploy.** Full teardown of a service. Stops the container, removes
  MC-Proxy routes, deletes DNS records from MCNS, and cleans up the service
  registry entry. The inverse of deploy.

- **Migrate.** Move a service from one node to another. MCP snapshots the
  service's `/srv/<service>/` directory on the source node (as a tar.zst
  image), transfers it to the destination, extracts it, starts the service,…
- **Master/agent architecture.** MCP Master runs on the operator's machine.
  Agents run on every managed node, receiving C2 (command and control) from
  Master, reporting node status, and managing local workloads. The C2 channel
  is authenticated via MCIAS — any authenticated non-guest user or system
  account is accepted (admin role is not required for deploy operations).
  The master does not need to be always-on — agents keep running their
  workloads independently; the master is needed only to issue new commands.

- **System account automation.** The agent uses an `mcp-agent` system account
  for all service-to-service communication: TLS cert provisioning (Metacrypt),
  DNS record management (MCNS), and container image pulls (MCR). Each service
  authorizes the agent through its own policy engine. Per-service system
  accounts (e.g., `mcq`) can be created for scoped self-management — a service
  account can only manage its own DNS records, not other services'.

- **Node management.** Track which nodes are in the platform, their health,
  available resources, and running workloads.
…can deploy them. The systemd unit files exist as a fallback and for
bootstrap — the long-term deployment model is MCP-managed containers.

**Status:** Implemented. v0.7.6. Deployed on rift managing all platform
containers. Route declarations with automatic port allocation (`$PORT` /
`$PORT_<NAME>` env vars passed to containers). MC-Proxy route registration
during deploy and stop. Automated TLS cert provisioning for L7 routes via
Metacrypt CA (Phase C). Automated DNS registration in MCNS during deploy
and stop (Phase D). Two components — `mcp` CLI (operator workstation) and
`mcp-agent` (per-node daemon with SQLite registry, rootless Podman,
monitoring with drift/flap detection). gRPC-only (no REST). 15 RPCs,
17+ CLI commands.
---
### How Services Get Certificates Today

For services deployed via MCP with L7 routes, certificates are provisioned
automatically during deploy — MCP uses the Metacrypt ACME client library to
obtain certs and transfers them to the node. For other services and during
bootstrap, certificates are provisioned through Metacrypt's **REST API or web
UI** and placed into each service's `/srv/<service>/certs/` directory manually.
### How MCP Automates Certificates

MCP automates certificate provisioning for deploy workflows, with renewal and
migration automation planned:

- **Initial deploy.** When MCP deploys a new service, it provisions a
  certificate from Metacrypt (via the ACME client library), transfers the cert
  and key to the node as part of the config push to `/srv/<service>/certs/`,
  and starts the service with valid TLS material. For L7 routes, MCP also
  provisions a TLS certificate for MC-Proxy's termination endpoint.
- **Renewal.** MCP knows what services are running and when their certificates
|
||||
expire. It can renew certificates before expiry by re-running the ACME flow
|
||||
@@ -689,10 +725,8 @@ MCP is the natural place to automate certificate provisioning:
|
||||
for the new name.
|
||||
|
||||
- **MC-Proxy L7 routes.** MC-Proxy's L7 mode requires certificate/key pairs
|
||||
for TLS termination. MCP (or the operator) can provision these from
|
||||
Metacrypt and push them to MC-Proxy's cert directory. MC-Proxy's
|
||||
architecture doc lists ACME integration and Metacrypt key storage as future
|
||||
work.
|
||||
for TLS termination. MCP provisions these from Metacrypt during deploy and
|
||||
pushes them to the node alongside the route registration.
|
||||
|
||||
### Trust Distribution
|
||||
|
||||
@@ -793,8 +827,13 @@ Operator workstation (vade)
  │
  ├── Scheduling: select Node C (best fit)
  │
  ├── Provision TLS certificate from Metacrypt
  │     (ACME flow or REST API)
  ├── Port assignment: allocate a free host port for each
  │     declared route (passed as $PORT / $PORT_<NAME> env vars)
  │
  ├── Provision TLS certificate from Metacrypt CA
  │     (ACME client library) for the service
  │     — for L7 routes, also provision a cert for MC-Proxy
  │     TLS termination
  │
  ├── C2 to Node C agent:
  │     1. Create /srv/α/ directory structure
@@ -802,15 +841,15 @@ Operator workstation (vade)
  │     3. Transfer TLS cert+key → /srv/α/certs/
  │     4. Transfer root CA cert → /srv/α/certs/ca.pem
  │     5. Pull image from MCR
  │     6. Start container
  │     6. Start container with $PORT / $PORT_<NAME> env vars
  │
  ├── Register routes with MC-Proxy
  │     (gRPC AddRoute for each declared route)
  │
  ├── Update service registry: α → Node C
  │
  ├── Push DNS update to MCNS:
  │     α.svc.mcp.metacircular.net → Node C address
  │
  └── (Optionally) update MC-Proxy route table
        if α needs external ingress
  └── Push DNS update to MCNS:
        α.svc.mcp.metacircular.net → Node C address
```

### 4. Migration

@@ -9,8 +9,8 @@ the platform's internals.
## Platform Overview

Metacircular is a multi-service infrastructure platform. Services are
Go binaries running as containers on Linux nodes, managed by these core
components:
Go binaries running as containers across a fleet of Linux nodes,
managed by these core components:

| Component | Role |
|-----------|------|
@@ -21,8 +21,27 @@ components:
| **MCNS** (DNS) | Authoritative DNS for `*.svc.mcp.metacircular.net` |

The operator workflow is: **build image → push to MCR → write service
definition → deploy via MCP**. MCP handles port assignment, route
registration, and container lifecycle.
definition → deploy via MCP**. MCP handles port assignment, TLS cert
provisioning, route registration, DNS registration, and container
lifecycle.

### Fleet Topology

The platform runs across multiple nodes connected via Tailnet:

| Node | Role | OS | Arch | Purpose |
|------|------|----|------|---------|
| **rift** | Compute + core infra | NixOS | amd64 | Runs most services (Metacrypt, MCR, MCNS, etc.) |
| **svc** | Edge | Debian | amd64 | Public-facing mc-proxy, routes traffic over Tailnet to compute nodes |
| **orion** | Compute | NixOS | amd64 | Provisioned, currently offline |

**Node roles:**

- **Compute nodes** (rift, orion, future RPis) run the full container
  lifecycle via rootless Podman.
- **Edge nodes** (svc) run mc-proxy for public traffic routing only.
  The MCP agent on edge nodes manages mc-proxy routes but does not run
  application containers.

---

@@ -123,18 +142,38 @@ Service definitions are TOML files that tell MCP what to deploy. They
live at `~/.config/mcp/services/<service>.toml` on the operator
workstation.

### Minimal Example (Single Component)
### Minimal Example (Single Component, L7)

```toml
name = "myservice"
node = "rift"

[build.images]
myservice = "Dockerfile"

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"

[[components.routes]]
port = 8443
mode = "l7"
```

### API Service Example (L4, Multiple Routes)

```toml
name = "myservice"
node = "rift"
version = "v1.0.0"

[build.images]
myservice = "Dockerfile"

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]

[[components.routes]]
name = "rest"
@@ -152,7 +191,6 @@ mode = "l4"
```toml
name = "myservice"
node = "rift"
version = "v1.0.0"

[build.images]
myservice = "Dockerfile.api"
@@ -160,6 +198,7 @@ myservice-web = "Dockerfile.web"

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]

@@ -175,6 +214,7 @@ mode = "l4"

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/myservice-web:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]

@@ -183,21 +223,16 @@ port = 443
mode = "l7"
```

### Convention-Derived Defaults
### Conventions

Most fields are optional — MCP derives them from conventions:
A few fields are derived by the agent at deploy time:

| Field | Default | Override when... |
|-------|---------|------------------|
| Image name | `<service>` (api), `<service>-<component>` (others) | Image name differs from convention |
| Image registry | `mcr.svc.mcp.metacircular.net:8443` (from global MCP config) | Never — always use MCR |
| Version | Service-level `version` field | A component needs a different version |
| Volumes | `/srv/<service>:/srv/<service>` | Additional mounts are needed |
| Network | `mcpnet` | Service needs host networking or a different network |
| User | `0:0` | Never change this for standard services |
| Restart | `unless-stopped` | Service should not auto-restart |
| Source path | `<service>` relative to workspace root | Directory name differs from service name |
| Hostname | `<service>.svc.mcp.metacircular.net` | Service needs a public hostname |
| Source path | `<service>` relative to workspace root | Directory name differs from service name (use `path`) |
| Hostname | `<service>.svc.mcp.metacircular.net` | Service needs a public hostname (use route `hostname`) |

All other fields must be explicit in the service definition.

### Service Definition Reference

@@ -207,7 +242,6 @@ Most fields are optional — MCP derives them from conventions:
|-------|----------|---------|
| `name` | Yes | Service name (matches project name) |
| `node` | Yes | Target node to deploy to |
| `version` | Yes | Image version tag (semver, e.g. `v1.0.0`) |
| `active` | No | Whether MCP keeps this running (default: `true`) |
| `path` | No | Source directory relative to workspace (default: `name`) |

@@ -215,20 +249,20 @@ Most fields are optional — MCP derives them from conventions:

| Field | Purpose |
|-------|---------|
| `build.images.<name>` | Maps image name to Dockerfile path |
| `build.images.<name>` | Maps build image name to Dockerfile path. The `<name>` must match the repository name in a component's `image` field (the part after the last `/`, before the `:` tag). |

**Component fields:**

| Field | Purpose |
|-------|---------|
| `name` | Component name (e.g. `api`, `web`) |
| `image` | Full image reference override |
| `version` | Version override for this component |
| `volumes` | Volume mounts (list of `host:container` strings) |
| `cmd` | Command override (list of strings) |
| `network` | Container network override |
| `user` | Container user override |
| `restart` | Restart policy override |
| Field | Required | Purpose |
|-------|----------|---------|
| `name` | Yes | Component name (e.g. `api`, `web`) |
| `image` | Yes | Full image reference (e.g. `mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0`) |
| `volumes` | No | Volume mounts (list of `host:container` strings) |
| `cmd` | No | Command override (list of strings) |
| `env` | No | Extra environment variables (list of `KEY=VALUE` strings) |
| `network` | No | Container network (default: none) |
| `user` | No | Container user (e.g. `0:0`) |
| `restart` | No | Restart policy (e.g. `unless-stopped`) |

**Route fields (under `[[components.routes]]`):**

@@ -248,9 +282,11 @@ Most fields are optional — MCP derives them from conventions:

### Version Pinning

Service definitions **must** pin an explicit semver tag (e.g. `v1.1.0`).
Never use `:latest`. This ensures deployments are reproducible and
`mcp status` shows the actual running version.
Component `image` fields **must** pin an explicit semver tag (e.g.
`mcr.svc.mcp.metacircular.net:8443/myservice:v1.1.0`). Never use
`:latest`. This ensures deployments are reproducible and `mcp status`
shows the actual running version. The version is extracted from the
image tag.

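The tag rule can be checked mechanically. A minimal sketch (the helper name is invented for illustration, not MCP's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// versionFromImage extracts the pinned tag from a full image reference,
// rejecting :latest and untagged references.
func versionFromImage(image string) (string, error) {
	// The tag is everything after the last ':' that follows the last '/',
	// so a registry port like :8443 is not mistaken for a tag.
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return "", fmt.Errorf("image %q has no tag", image)
	}
	tag := image[colon+1:]
	if tag == "latest" {
		return "", fmt.Errorf("image %q uses :latest; pin a semver tag", image)
	}
	return tag, nil
}

func main() {
	v, _ := versionFromImage("mcr.svc.mcp.metacircular.net:8443/myservice:v1.1.0")
	fmt.Println(v) // prints "v1.1.0"
}
```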
---

@@ -303,14 +339,20 @@ recreates the containers.
### What Happens During Deploy

1. Agent assigns a free host port (10000–60000) for each declared route.
2. Agent starts containers with `$PORT` / `$PORT_<NAME>` environment
2. For L7 routes, agent provisions a TLS certificate from Metacrypt CA
   (via `POST /v1/engine/request`). Certs are written to
   `/srv/mc-proxy/certs/<service>.pem` and `.key`. Existing valid certs
   (more than 30 days from expiry) are reused.
3. Agent starts containers with `$PORT` / `$PORT_<NAME>` environment
   variables set to the assigned ports.
3. Agent registers routes with mc-proxy (hostname → `127.0.0.1:<port>`,
   mode, TLS cert paths).
4. Agent records the full state in its SQLite registry.
4. Agent registers routes with mc-proxy via gRPC (hostname →
   `<node-address>:<port>`, mode, TLS cert paths).
5. Agent registers DNS entries in MCNS for
   `<service>.svc.mcp.metacircular.net`.
6. Agent records the full state in its SQLite registry.

On stop (`mcp stop <service>`), the agent reverses the process: removes
mc-proxy routes, then stops containers.
DNS entries, removes mc-proxy routes, then stops containers.

---

@@ -368,7 +410,14 @@ tags = []
level = "info"
```

For services with a web UI, add:
For services with SSO-enabled web UIs, add:

```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```

For services with a separate web UI binary, add:

```toml
[web]
@@ -385,12 +434,17 @@ addresses** — they will be overridden at deploy time.

| Env var | When set |
|---------|----------|
| `$PORT` | Component has a single route |
| `$PORT_<NAME>` | Component has multiple named routes |
| `$PORT` | Component has a single unnamed route |
| `$PORT_<NAME>` | Component has named routes |

Route names are uppercased: `name = "rest"` → `$PORT_REST`,
`name = "grpc"` → `$PORT_GRPC`.

**Container listen address:** Services must bind to `0.0.0.0:$PORT`
(or `:$PORT`), not `localhost:$PORT`. Podman port-forwards go through
the container's network namespace — binding to `localhost` inside the
container makes the port unreachable from outside.

Services built with **mcdsl v1.1.0+** handle this automatically —
`config.Load` checks `$PORT` → overrides `Server.ListenAddr`, and
`$PORT_GRPC` → overrides `Server.GRPCAddr`. These take precedence over
@@ -411,18 +465,72 @@ these.

## 6. Authentication (MCIAS Integration)

Every service delegates authentication to MCIAS. No service maintains
its own user database.
its own user database. Services support two login modes: **SSO
redirect** (recommended for web UIs) and **direct credentials**
(fallback / API clients).

### Auth Flow
### SSO Login (Web UIs)

SSO is the preferred login method for web UIs. The flow is an OAuth
2.0-style authorization code exchange:

1. User visits the service and is redirected to `/login`.
2. Login page shows a "Sign in with MCIAS" button.
3. Click redirects to MCIAS (`/sso/authorize`), which authenticates the
   user.
4. MCIAS redirects back to the service's `/sso/callback` with an
   authorization code.
5. The service exchanges the code for a JWT via a server-to-server call
   to MCIAS `POST /v1/sso/token`.
6. The JWT is stored in a session cookie.

SSO is enabled by adding an `[sso]` section to the service config and
registering the service as an SSO client in MCIAS.

**Service config:**

```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```

**MCIAS config** (add to the `[[sso_clients]]` list):

```toml
[[sso_clients]]
client_id = "<service>"
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
service_name = "<service>"
```

The `redirect_uri` must match exactly between the service config and
the MCIAS client registration.

When `[sso].redirect_uri` is empty or absent, the service falls back to
the direct credentials form.

**Implementation:** Services use `mcdsl/sso` (v1.7.0+) which handles
state management, CSRF-safe cookies, and the code exchange. The web
server registers three routes:

| Route | Purpose |
|-------|---------|
| `GET /login` | Renders landing page with "Sign in with MCIAS" button |
| `GET /sso/redirect` | Sets state cookies, redirects to MCIAS |
| `GET /sso/callback` | Validates state, exchanges code for JWT, sets session |

### Direct Credentials (API / Fallback)

1. Client sends credentials to the service's `POST /v1/auth/login`.
2. Service forwards them to MCIAS via the client library
   (`git.wntrmute.dev/mc/mcias/clients/go`).
2. Service forwards them to MCIAS via `mcdsl/auth.Authenticator.Login()`.
3. MCIAS validates and returns a bearer token.
4. Subsequent requests include `Authorization: Bearer <token>`.
5. Service validates tokens via MCIAS `ValidateToken()`, cached for 30s
5. Service validates tokens via `ValidateToken()`, cached for 30s
   (keyed by SHA-256 of the token).

Web UIs use this mode when SSO is not configured, presenting a
username/password/TOTP form instead of the SSO button.

||||
|
||||
| Role | Access |
|
||||
@@ -475,11 +583,14 @@ co-located on the same node).
|
||||
| `mcp build <service>` | Build and push images to MCR |
|
||||
| `mcp sync` | Push all service definitions to agents; auto-build missing images |
|
||||
| `mcp deploy <service>` | Pull image, (re)create containers, register routes |
|
||||
| `mcp undeploy <service>` | Full teardown: remove routes, DNS, certs, and containers |
|
||||
| `mcp stop <service>` | Remove routes, stop containers |
|
||||
| `mcp start <service>` | Start previously stopped containers |
|
||||
| `mcp restart <service>` | Restart containers in place |
|
||||
| `mcp ps` | List all managed containers and status |
|
||||
| `mcp status [service]` | Detailed status for a specific service |
|
||||
| `mcp logs <service>` | Stream container logs |
|
||||
| `mcp edit <service>` | Edit service definition |
|
||||
|
||||
---
|
||||
|
||||
@@ -504,13 +615,14 @@ git push origin v1.0.0
cat > ~/.config/mcp/services/myservice.toml << 'EOF'
name = "myservice"
node = "rift"
version = "v1.0.0"

[build.images]
myservice = "Dockerfile.api"

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]

[[components.routes]]
name = "rest"
@@ -584,15 +696,85 @@ Services follow a standard directory structure:

---

## 10. Agent Management

MCP manages a fleet of nodes with heterogeneous operating systems and
architectures. The agent binary lives at `/srv/mcp/mcp-agent` on every
node — this is a mutable path that MCP controls, regardless of whether
the node runs NixOS or Debian.

### Node Configuration

Each node in `~/.config/mcp/mcp.toml` includes SSH and architecture
info for agent management:

```toml
[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
ssh = "rift"
arch = "amd64"

[[nodes]]
name = "hyperborea"
address = "100.x.x.x:9444"
ssh = "hyperborea"
arch = "arm64"
```

### Upgrading Agents

After tagging a new MCP release:

```bash
# Upgrade all nodes (recommended — prevents version skew)
mcp agent upgrade

# Upgrade a single node
mcp agent upgrade rift

# Check versions across the fleet
mcp agent status
```

`mcp agent upgrade` cross-compiles the agent binary for each target
architecture, SSHs to each node, atomically replaces the binary, and
restarts the systemd service. All nodes should be upgraded together
because new CLI versions often depend on new agent RPCs.

### Provisioning New Nodes

One-time setup for a new Debian node:

```bash
# 1. Provision the node (creates user, dirs, systemd unit, installs binary)
mcp node provision <name>

# 2. Register the node
mcp node add <name> <address>

# 3. Deploy services
mcp deploy <service>
```

For NixOS nodes, provisioning is handled by the NixOS configuration.
The NixOS config creates the `mcp` user, systemd unit, and directories.
The `ExecStart` path points to `/srv/mcp/mcp-agent` so that `mcp agent
upgrade` works the same as on Debian nodes.

---

## Appendix: Currently Deployed Services

For reference, these services are operational on the platform:

| Service | Version | Node | Purpose |
|---------|---------|------|---------|
| MCIAS | v1.8.0 | (separate) | Identity and access |
| Metacrypt | v1.1.0 | rift | Cryptographic service, PKI/CA |
| MC-Proxy | v1.1.0 | rift | TLS proxy and router |
| MCR | v1.2.0 | rift | Container registry |
| MCNS | v1.1.0 | rift | Authoritative DNS |
| MCP | v0.3.0 | rift | Control plane agent |
| MCIAS | v1.9.0 | (separate) | Identity and access |
| Metacrypt | v1.4.1 | rift | Cryptographic service, PKI/CA |
| MC-Proxy | v1.2.1 | rift, svc | TLS proxy and router (svc handles public edge) |
| MCR | v1.2.1 | rift | Container registry |
| MCNS | v1.1.1 | rift | Authoritative DNS |
| MCDoc | v0.1.0 | rift | Documentation server |
| MCQ | v0.4.0 | rift | Document review queue |
| MCP | v0.7.6 | rift, svc | Control plane agent |

127
docs/phase-e-plan.md
Normal file
@@ -0,0 +1,127 @@
# Phase E: Multi-Node Orchestration

Phase D (automated DNS registration) is complete. Phase E extends MCP from
a single-node agent on rift to a multi-node fleet with a central master
process.

## Goal

Deploy and manage services across multiple nodes from a single control
plane. The operator runs `mcp deploy` and the system places the workload on
the right node, provisions certs, registers DNS, and configures routing --
same as today on rift, but across the fleet.

## Fleet Topology

| Node | OS | Arch | Role |
|------|----|------|------|
| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
| orion | NixOS | amd64 | Worker |
| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |

Tailnet is the interconnect between all nodes. Public traffic enters via
mc-proxy on svc, which forwards over Tailnet to worker nodes.

## Key Architecture Decisions

These were resolved in the 2026-04-01 design session:

1. **Rift is the master node.** No separate straylight machine. Core infra
   stays on rift, which gains mcp-master alongside its existing agent.

2. **Master-mediated coordination.** Agents never talk to each other. All
   cross-node operations go through the master. Agents only dial the master
   (for registration and heartbeats) and respond to master RPCs.

3. **Agent self-registration.** Agents register with the master on startup
   (name, role, address, arch). The master maintains the live node registry.
   No static `[[nodes]]` config required except for bootstrap.

4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
   (with resource data). If the master misses 3 heartbeats (90s), it
   actively probes the agent. A failed probe marks the node unhealthy.

5. **Tier-based placement.** `tier = "core"` runs on the master node.
   `tier = "worker"` (default) is auto-placed on a worker with capacity.
   Explicit `node = "orion"` overrides tier for pinned services.

6. **Two separate certs for public services.** Internal cert
   (`svc.mcp.metacircular.net`) issued by worker agent. Public cert
   (`metacircular.net`) issued by edge agent. Internal names never
   appear on edge certs.

7. **`public = true` on routes.** Public routes declare intent with a
   boolean flag. The master assigns the route to an edge node (currently
   always svc). No explicit `edge` field in service definitions.

## Components

### Master (`mcp-master`)

Long-lived orchestrator on rift. Responsibilities:

- Accept CLI commands and dispatch to the correct agent
- Maintain node registry from agent self-registration
- Place services based on tier, explicit node, and resource availability
- Detect `public = true` routes and coordinate edge setup
- Validate public hostnames against allowed domain list
- Aggregate status from all agents (fleet-wide view)
- Probe agents on missed heartbeats

The master is stateless in the durable sense -- it rebuilds its world view
from agents on startup. If the master goes down, running services continue
unaffected; only new deploys and rescheduling stop.

### Agent upgrades

The fleet is heterogeneous (NixOS + Debian, amd64 + arm64), so NixOS flake
inputs don't work as a universal update mechanism.

**Design:** MCP owns the binary at `/srv/mcp/mcp-agent` on all nodes.

- `mcp agent upgrade [node]` -- CLI cross-compiles for the target's
  GOARCH, SCPs the binary, restarts via SSH
- Node config gains `ssh` (user@host) and `arch` (amd64/arm64) fields
- rift's NixOS `ExecStart` changes from nix store path to
  `/srv/mcp/mcp-agent`
- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
  `mcp-agent.service`

### Edge agents

svc runs an agent but does NOT run containers. Its agent manages mc-proxy
routing only: when the master tells it to set up an edge route, it
provisions a TLS cert from Metacrypt and registers the route in its local
mc-proxy via the gRPC admin API.

## Migration Plan

### Phase 1: Agent on svc
Deploy mcp-agent to svc. Verify with `mcp node list`.

### Phase 2: Edge routing RPCs
Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
Test by calling directly from CLI.

### Phase 3: Build mcp-master
Core loop: registration, heartbeats, deploy routing, placement, edge
coordination.

### Phase 4: Agent registration and health
Self-registration, heartbeat loop, master probe fallback, fleet status.

### Phase 5: Cut over
Point CLI at master, add tier fields to service defs, deploy agents to
orion and hyperborea.

## What Phase E Does NOT Include

These remain future work:

- Auto-reconciliation (agent auto-restarting drifted containers)
- Live migration (snapshot streaming between nodes)
- Web UI for fleet management
- Observability / log aggregation
- Object store
- Multiple edge nodes / master HA

@@ -1018,6 +1018,13 @@ Write these before writing code. They are the blueprint, not the afterthought.
- **Never log secrets.** Keys, passwords, tokens, and plaintext must never
  appear in log output.

### CLI Security

- **Never echo passwords.** Interactive password prompts must suppress
  terminal echo. Use `mcdsl/terminal.ReadPassword` — it wraps
  `golang.org/x/term.ReadPassword` with proper prompt and newline handling.
  Never read passwords with `bufio.Scanner` or `fmt.Scanln`.

### Web Security

- CSRF tokens on all mutating requests.

627
log/2026-03-28-building-mcq.md
Normal file
@@ -0,0 +1,627 @@
# Building MCQ: A Reading Queue in a Session

*A document reading queue goes from design discussion to production
deployment — including the deployment's sharp edges.*

*Written by Claude (Opus 4.6), reflecting on a collaborative session with
Kyle. This log focuses heavily on the deployment side, including what
went wrong, what was done manually versus what should have been MCP-managed,
and why `mcp ps` doesn't know mcq exists.*

---

## The Idea

Kyle was out and about, away from his tailnet, and wanted to review
platform documentation on his phone. The existing tools — mcdoc (which
renders docs from Gitea repos) and the repos themselves — require either
tailnet access or a desktop workflow.

The concept: a **document queue**. Push raw markdown from inside the
infrastructure, read rendered HTML from anywhere via a browser. Like a
self-hosted Pocket, but for internal docs you're actively iterating on.

After a design discussion, we settled on:

- **Name**: mcq (Metacircular Document Queue)
- **Data model**: Documents keyed by slug, upsert semantics (re-push
  replaces content, resets read flag)
- **Auth**: MCIAS on everything — any user including guest can read, any
  user including system accounts can push
- **Rendering**: Goldmark with GFM + syntax highlighting, rendered on
  each page view
- **Architecture**: Single binary, REST API + gRPC + web UI

## Building the Service

### Codebase Exploration

Before writing any code, I explored the existing platform services to
understand the patterns:

- **mcat** (`~/src/metacircular/mcat/`): Reference for the web UI pattern —
  chi router, CSRF, session cookies, htmx, embedded templates, cobra CLI,
  config loading via `mcdsl/config`.
- **mcns** (`~/src/metacircular/mcns/`): Reference for REST + gRPC pattern —
  separate `internal/server/` (REST) and `internal/grpcserver/` (gRPC),
  method maps for auth interceptors, SQLite via `mcdsl/db`.
- **mcdoc** (`~/src/metacircular/mcdoc/`): Reference for goldmark rendering
  and plain HTTP serving (mcdoc doesn't use mcdsl for config or HTTP — it
  has its own, because it serves plain HTTP behind mc-proxy).
- **mcdsl** (`~/src/metacircular/mcdsl/`): The shared library — auth,
  config, db, httpserver, grpcserver, csrf, web packages.

### Implementation (on vade, Kyle's workstation)

Created `~/src/mcq/` with the standard platform layout:

```
cmd/mcq/        main.go, server.go (cobra CLI)
internal/
  config/      custom config (TLS optional, see below)
  db/          SQLite schema, migrations, document CRUD
  server/      REST API routes and handlers
  grpcserver/  gRPC server, interceptors, service handlers
  webserver/   Web UI routes, templates, session management
  render/      goldmark markdown-to-HTML renderer
proto/mcq/v1/   Protobuf definitions
gen/mcq/v1/     Generated Go code
web/            Embedded templates + static files
deploy/         systemd, examples
```

Key files:

- **Proto** (`proto/mcq/v1/mcq.proto`): DocumentService (ListDocuments,
  GetDocument, PutDocument, DeleteDocument, MarkRead, MarkUnread),
  AuthService (Login, Logout), AdminService (Health).
- **DB** (`internal/db/documents.go`): Single `documents` table with slug
  as unique key. PutDocument uses `INSERT ... ON CONFLICT(slug) DO UPDATE`.
- **REST** (`internal/server/routes.go`): All routes under `/v1/` —
  `PUT /v1/documents/{slug}` for upsert, standard CRUD otherwise.
- **Web UI** (`internal/webserver/server.go`): Login page, document list
  at `/`, rendered markdown reader at `/d/{slug}`.
- **gRPC** (`internal/grpcserver/`): Mirrors REST exactly. Method map puts
  all document operations in `authRequiredMethods`, nothing in
  `adminRequiredMethods`.

Proto generation ran on vade:

```bash
cd ~/src/mcq
protoc --go_out=. --go_opt=module=git.wntrmute.dev/mc/mcq \
  --go-grpc_out=. --go-grpc_opt=module=git.wntrmute.dev/mc/mcq \
  proto/mcq/v1/*.proto
```

### The .gitignore Bug

The first `git add -A` missed `cmd/mcq/`, `proto/mcq/`, and `gen/mcq/`. The
`.gitignore` had:

```
mcq
srv/
```

The pattern `mcq` (without a leading slash) matches any file or directory
named `mcq` at any level — so it was ignoring `cmd/mcq/`, `gen/mcq/`, and
`proto/mcq/`. Fixed by anchoring both patterns to the repository root:

```
/mcq
/srv/
```

### The TLS Decision

This was the most consequential design decision for deployment.

The standard platform pattern (mcdsl's `httpserver`) enforces TLS 1.3
minimum. But mc-proxy on svc terminates TLS at the edge and forwards to
backends as plain HTTP (for localhost services) or HTTPS (for remote
backends like rift). Gitea on svc runs plain HTTP on port 3000 behind
mc-proxy. mcdoc on rift runs plain HTTP on port 38080 behind mc-proxy.

mcdsl's `config.Load` validates that `tls_cert` and `tls_key` are present
— they're required fields. So I couldn't use `config.Base` with empty TLS
fields.

**Solution**: Created `internal/config/config.go` — mcq's own config
package, modeled after mcdoc's. Same TOML loading, env var overrides, and
validation, but TLS fields are optional. When empty, the server uses
`http.ListenAndServe()` instead of `httpserver.ListenAndServeTLS()`.

This meant giving up the mcdsl httpserver (with its logging middleware and
TLS enforcement) for the plain HTTP path. The gRPC server was also dropped
from the svc deployment since it requires TLS. The REST API and web UI
are sufficient for the use case.

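The conditional described above can be sketched roughly as follows (hypothetical names; the real logic lives in mcq's `internal/config` and server startup, which this log doesn't reproduce):

```go
package main

import (
	"fmt"
	"net/http"
)

// Config mirrors the relevant part of mcq's custom config; unlike
// mcdsl's config.Base, the TLS fields are optional.
type Config struct {
	ListenAddr string
	TLSCert    string
	TLSKey     string
}

// UseTLS reports whether both TLS fields are set.
func (c Config) UseTLS() bool {
	return c.TLSCert != "" && c.TLSKey != ""
}

// Serve starts TLS when certs are configured, otherwise plain HTTP
// (the mc-proxy L7 case, where TLS is terminated at the edge).
func Serve(c Config, h http.Handler) error {
	if c.UseTLS() {
		return http.ListenAndServeTLS(c.ListenAddr, c.TLSCert, c.TLSKey, h)
	}
	return http.ListenAndServe(c.ListenAddr, h)
}

func main() {
	c := Config{ListenAddr: "127.0.0.1:8090"}
	fmt.Println(c.UseTLS()) // empty TLS fields select plain HTTP
}
```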
### Build and Test (on vade)

```bash
cd ~/src/mcq
go mod tidy
go build ./...   # clean
go vet ./...     # clean
go test ./...    # 6 tests pass (all in internal/db)

# Production binary
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -trimpath -ldflags="-s -w -X main.version=v0.1.0" \
  -o mcq ./cmd/mcq

# Result: 21MB static binary
```

---

## Deployment

### Why mcq is NOT in `mcp ps`

**This is the most important thing in this log.**

mcq was deployed as a **manual systemd service on svc**, not as an
MCP-managed container. This means:

- `mcp ps` doesn't know about it
- `mcp stop mcq` won't work
- `mcp deploy mcq` won't work
- There's no service definition in `~/.config/mcp/services/`
- There's no container image in MCR
- The binary was `scp`'d directly to svc and `install`'d to `/usr/local/bin/`

**Why?** Three reasons:

1. **svc has no MCP agent.** The MCP agent (`mcp-agent`) only runs on rift.
   svc is a Debian VPS that hosts MCIAS, mc-proxy, MCNS, and Gitea — all
   deployed as manual systemd services, not via MCP. Getting mcq into MCP
   would require deploying an MCP agent to svc first (Phase E in
   PLATFORM_EVOLUTION.md, items #10-#12).

2. **mcq runs as a native binary, not a container.** MCP manages containers
   (podman). mcq on svc is a bare binary under systemd, like MCIAS and
   mc-proxy on svc. To make it MCP-managed, it would need to be
   containerized and pushed to MCR first.

3. **The deployment followed the existing svc pattern.** Every service on
   svc was deployed this way: build on vade, scp to svc, install, write
   config, write systemd unit, enable. This was a deliberate choice to
   match the existing operational model rather than block on MCP agent
   deployment.

### What MCP-managed deployment would look like

Once svc has an MCP agent, mcq could be managed like services on rift:

```toml
# ~/.config/mcp/services/mcq.toml
name = "mcq"
node = "svc"
version = "v0.1.0"

[[components]]
name = "api"

[[components.routes]]
port = 8090
mode = "l7"
hostname = "mcq.metacircular.net"
```

This would require:

- MCP agent running on svc
- mcq containerized (Dockerfile) and pushed to MCR
- Agent handles port assignment, mc-proxy route registration, lifecycle

### The Actual Deployment Steps

All commands below were run from vade (Kyle's workstation) via SSH to svc,
unless otherwise noted.

#### 1. Push repo to Gitea (from vade)

```bash
cd ~/src/mcq
git remote add origin git@git.wntrmute.dev:mc/mcq.git
git push -u origin master
```

The mc/mcq repo was created manually in Gitea (the MCP tool's API token
lacked `write:organization` scope for creating repos under the mc org).

#### 2. Copy binary to svc (from vade)

```bash
scp ~/src/mcq/mcq kyle@svc:/tmp/mcq
```

SSH to svc uses Tailscale hostname resolution — `svc` resolves to
`100.106.232.4` via tailscale. No SSH config entry was needed. Had to
accept the host key on first connection:

```bash
ssh -o StrictHostKeyChecking=accept-new kyle@svc
```

#### 3. Create user and install binary (on svc, as root via sudo)

```bash
sudo useradd --system --no-create-home --shell /usr/sbin/nologin mcq
sudo mkdir -p /srv/mcq
sudo chown mcq:mcq /srv/mcq
sudo chmod 0700 /srv/mcq
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
```

Verified: `/usr/local/bin/mcq --version` → `mcq version v0.1.0`

#### 4. Write config (on svc)

Created `/srv/mcq/mcq.toml`:

```toml
[server]
listen_addr = "127.0.0.1:8090"

[database]
path = "/srv/mcq/mcq.db"

[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = "/srv/mcq/ca.pem"
service_name = "mcq"
tags = []

[log]
level = "info"
```

**Important detail**: The first attempt used `server_url = "https://127.0.0.1:8443"`,
which failed because MCIAS's TLS cert has SANs for `mcias.wntrmute.dev`
and `mcias.metacircular.net` but **not** `127.0.0.1` or `localhost`. Token
validation returned "invalid or expired token" because the mcdsl auth
client couldn't establish a TLS connection to MCIAS.

Fixed by copying the pattern from MCNS on svc:

- `server_url = "https://mcias.metacircular.net:8443"` (uses the hostname
  that matches the cert's SAN)
- `ca_cert = "/srv/mcq/ca.pem"` (the WNTRMUTE root CA cert, copied from
  `/srv/mcns/certs/ca.pem`)

The hostname `mcias.metacircular.net` resolves to svc's public IP, so
this still connects to the localhost MCIAS — it just goes through the public
IP for TLS hostname verification. (On a locked-down firewall this could
be an issue, but svc allows loopback through its public IP.)

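The SAN failure mode is easy to demonstrate with the standard library alone. This builds a throwaway self-signed certificate carrying the SANs described above (not the real MCIAS cert) and checks hostname verification against it:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// demoCert builds a throwaway self-signed cert with the SANs the MCIAS
// cert is described as having (no 127.0.0.1, no localhost).
func demoCert() *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "mcias.metacircular.net"},
		DNSNames:     []string{"mcias.wntrmute.dev", "mcias.metacircular.net"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	cert := demoCert()
	// A name in the SAN list verifies; 127.0.0.1 does not, which is why
	// the first mcq config couldn't complete a TLS handshake to MCIAS.
	fmt.Println(cert.VerifyHostname("mcias.metacircular.net") == nil)
	fmt.Println(cert.VerifyHostname("127.0.0.1") == nil)
}
```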
#### 5. Create systemd unit (on svc)

Created `/etc/systemd/system/mcq.service`:

```ini
[Unit]
Description=MCQ Document Queue
After=network-online.target mcias.service
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mcq server --config /srv/mcq/mcq.toml
WorkingDirectory=/srv/mcq
Restart=on-failure
RestartSec=5
User=mcq
Group=mcq

NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/srv/mcq
PrivateTmp=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now mcq
```

Verified running: PID 3765144, memory 7.8MB, started cleanly.

#### 6. Generate TLS certificate for mc-proxy (on vade)

mc-proxy needs a TLS cert for the `mcq.metacircular.net` hostname (it
terminates TLS at the edge). Generated using the local WNTRMUTE root CA:

```bash
cd /tmp
openssl ecparam -name prime256v1 -genkey -noout -out mcq.key

openssl req -new -key mcq.key -out mcq.csr \
  -subj "/CN=mcq.metacircular.net/O=Metacircular Dynamics" \
  -addext "subjectAltName=DNS:mcq.metacircular.net"

openssl x509 -req -in mcq.csr \
  -CA ~/src/metacircular/ca/ca.pem \
  -CAkey ~/src/metacircular/ca/ca.key \
  -CAcreateserial -out mcq.pem -days 365 -sha256 \
  -extfile <(echo "subjectAltName=DNS:mcq.metacircular.net
keyUsage=digitalSignature
extendedKeyUsage=serverAuth")
```

The CA key and cert are at `~/src/metacircular/ca/` — this is the
WNTRMUTE Issuing Authority root CA, not Metacrypt (which has its own
intermediate CA for automated issuance). The existing mc-proxy certs
(docs, git, metacrypt) were all signed by this same root CA.

Copied to svc:

```bash
scp /tmp/mcq.pem /tmp/mcq.key kyle@svc:/tmp/
```

Installed on svc:

```bash
sudo cp /tmp/mcq.pem /srv/mc-proxy/certs/mcq.metacircular.net.pem
sudo cp /tmp/mcq.key /srv/mc-proxy/certs/mcq.metacircular.net.key
sudo chown mc-proxy:mc-proxy /srv/mc-proxy/certs/mcq.metacircular.net.*
sudo chmod 0600 /srv/mc-proxy/certs/mcq.metacircular.net.key
```

#### 7. Add mc-proxy route (on svc)

mc-proxy on svc uses SQLite for route persistence. The TOML config only
seeds the database on first run (a `store.IsEmpty()` check); after that,
routes are loaded from SQLite. So editing the TOML alone doesn't add a
route — you must also insert into the database.

I did both (TOML for documentation/re-seeding, SQLite for immediate effect):

**TOML** (added via `sed` to `/srv/mc-proxy/mc-proxy.toml`):

```toml
[[listeners.routes]]
hostname = "mcq.metacircular.net"
backend = "127.0.0.1:8090"
mode = "l7"
tls_cert = "/srv/mc-proxy/certs/mcq.metacircular.net.pem"
tls_key = "/srv/mc-proxy/certs/mcq.metacircular.net.key"
backend_tls = false
```

**SQLite** (direct insert):

```bash
sudo sqlite3 /srv/mc-proxy/mc-proxy.db "
INSERT INTO routes (listener_id, hostname, backend, mode, tls_cert, tls_key, backend_tls)
VALUES (1, 'mcq.metacircular.net', '127.0.0.1:8090', 'l7',
        '/srv/mc-proxy/certs/mcq.metacircular.net.pem',
        '/srv/mc-proxy/certs/mcq.metacircular.net.key', 0);
"
```

`listener_id = 1` is the `:443` listener (the only listener on svc's
mc-proxy).

**Note on `backend_tls = false`**: mcq serves plain HTTP on localhost.
mc-proxy terminates TLS for the client and forwards as plain HTTP to
`127.0.0.1:8090`. This is the same pattern as Gitea (`127.0.0.1:3000`)
and mcdoc (`100.95.252.120:38080`). Only metacrypt uses `backend_tls = true`,
because its backend is on rift over Tailscale.

#### 8. Restart mc-proxy (on svc)

```bash
sudo systemctl restart mc-proxy
```

This was messy. mc-proxy's graceful shutdown waits for in-flight
connections to drain, and the 30-second shutdown timeout was exceeded
(lingering connections from internet scanners hitting git.metacircular.net).
The shutdown hung for ~30 seconds before logging "shutdown timeout exceeded,
forcing close". systemd then moved to the `deactivating (stop-sigterm)` state.

Had to force it:

```bash
sudo systemctl kill mc-proxy
sleep 2
sudo systemctl start mc-proxy
```

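One way to avoid the manual kill on future restarts would be a systemd drop-in that bounds the stop phase. This is an untested sketch of the `TimeoutStopSec` idea (see "Rough Edges and Lessons" below), not something deployed in this session:

```ini
# /etc/systemd/system/mc-proxy.service.d/stop-timeout.conf
[Service]
# After 45s of graceful drain, systemd escalates to SIGKILL.
TimeoutStopSec=45
```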
After restart: `routes=5` (was 4 before mcq). Confirmed:

```bash
curl -sk https://mcq.metacircular.net/v1/health
# {"status":"ok"}
```

#### 9. Push documents (from vade)

Used the mcp-agent service account token (from
`~/data/downloads/service-account-76d35a82-77ca-422f-85a3-b9f9360d5164.token`)
to authenticate API calls. This is a long-lived JWT issued by MCIAS with
the `admin` role, `exp` in 2027.

```bash
TOKEN=$(cat ~/data/downloads/service-account-*.token)

# Push MCP Architecture
python3 -c "
import json
body = open('mcp/ARCHITECTURE.md').read()
print(json.dumps({'title': 'MCP Architecture', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/mcp-architecture \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @-

# Push Platform Evolution
python3 -c "
import json
body = open('PLATFORM_EVOLUTION.md').read()
print(json.dumps({'title': 'Platform Evolution', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/platform-evolution \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @-

# Push Packaging doc
python3 -c "
import json
body = open('docs/packaging-and-deployment.md').read()
print(json.dumps({'title': 'Packaging and Deployment', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/packaging-and-deployment \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @-
```

Used `python3` for JSON encoding because `jq` isn't installed on vade
(NixOS — it would need to be added to the system config, or pulled in
via `nix-shell`).

All three documents pushed successfully. The token identifies as
`mcp-agent` (the service account name), so `pushed_by` shows `mcp-agent`
on each document.

### Subsequent Update: Tufte Theme

Kyle wanted a wider reading area (70%) and a Tufte-inspired theme. Updated
`web/static/style.css`:

- Serif font stack (Georgia, Palatino)
- Cream background (`#fffff8`)
- Italic headings, small-caps labels
- `width: 70%` on `.page-container` (was `max-width: 720px`)
- Minimal chrome — the document list uses ruled lines instead of cards,
  tables use bottom borders only
- Mobile fallback: full width below 768px

Rebuilt and deployed the same way:

```bash
# On vade
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -trimpath -ldflags="-s -w -X main.version=v0.1.1" \
  -o mcq ./cmd/mcq
scp mcq kyle@svc:/tmp/mcq

# On svc
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
sudo systemctl restart mcq
```

---

## State After This Session

### What's running on svc

| Service  | Port            | Managed by  | Notes                          |
|----------|-----------------|-------------|--------------------------------|
| MCIAS    | :8443/:9443     | systemd     | Identity/auth, been here longest |
| mc-proxy | :443            | systemd     | L7 TLS termination, 5 routes   |
| MCNS     | :53/:8444/:9444 | systemd     | Authoritative DNS              |
| Gitea    | :3000           | systemd     | Git hosting                    |
| **mcq**  | **:8090**       | **systemd** | **NEW: document queue**        |

None of these are MCP-managed. svc has no MCP agent.

### mc-proxy routes on svc

| Hostname                   | Backend              | Mode   | TLS Backend |
|----------------------------|----------------------|--------|-------------|
| metacrypt.metacircular.net | 100.95.252.120:18080 | L7     | yes (rift)  |
| git.metacircular.net       | 127.0.0.1:3000       | L7     | no          |
| git.wntrmute.dev           | 127.0.0.1:3000       | L7     | no          |
| docs.metacircular.net      | 100.95.252.120:38080 | L7     | no          |
| **mcq.metacircular.net**   | **127.0.0.1:8090**   | **L7** | **no**      |

### DNS

`mcq.metacircular.net` is a CNAME to `svc.metacircular.net` (set up by
Kyle at the DNS registrar before this session). mc-proxy's SNI-based
routing handles the rest.

### Documents in queue

| Slug                     | Title                    | Pushed By |
|--------------------------|--------------------------|-----------|
| mcp-architecture         | MCP Architecture         | mcp-agent |
| platform-evolution       | Platform Evolution       | mcp-agent |
| packaging-and-deployment | Packaging and Deployment | mcp-agent |

### Git

Repo: `mc/mcq` on Gitea (`git.wntrmute.dev:mc/mcq.git`)

Commits:

1. `bc16279` — Initial implementation
2. `648e9dc` — Support plain HTTP mode for mc-proxy L7 deployment
3. `a5b90b6` — Switch to Tufte-inspired reading theme

---

## What Would Be Different with MCP

If svc had an MCP agent and mcq were containerized:

1. **No manual SSH** — `mcp deploy mcq` from vade would push the service
   definition, and the agent would pull the image from MCR.
2. **No manual port picking** — the agent assigns a free port from 10000-60000.
3. **No manual mc-proxy route** — the agent calls mc-proxy's gRPC API to
   register the route (Phase B, already working on rift).
4. **No manual TLS cert** — the agent provisions from the Metacrypt CA
   (Phase C, already working on rift).
5. **No manual systemd unit** — the agent manages the container lifecycle.
6. **`mcp ps` would show mcq** — because the agent tracks it in its
   registry.
7. **`mcp stop mcq` / `mcp restart mcq` would work** — standard lifecycle.

The gap is that svc has no agent. That's Phase E work (items #10-#12 in
PLATFORM_EVOLUTION.md). The prerequisites are the agent binary location
convention, SSH-based upgrade tooling, and node provisioning for Debian.

---

## Rough Edges and Lessons

1. **MCIAS cert hostname**: Every new service on svc will hit this. The
   MCIAS cert doesn't include localhost as a SAN. Services must use
   `server_url = "https://mcias.metacircular.net:8443"` (which routes
   through the public IP back to localhost) and include the CA cert.
   Could fix by reissuing the MCIAS cert with a localhost SAN.

2. **mc-proxy route persistence**: The TOML-seeds-once-then-SQLite model
   means you have to touch two places (TOML for future re-seeds, SQLite
   for immediate effect). On rift this is handled by the agent's gRPC
   calls. On svc, without an agent, it's manual database surgery.

3. **mc-proxy shutdown timeout**: The 30-second timeout isn't enough when
   internet scanners maintain persistent connections to git.metacircular.net.
   Had to force-kill on restart. Should increase `shutdown_timeout` or
   add a SIGKILL escalation in the systemd unit (`TimeoutStopSec=45`,
   which sends SIGKILL after 45s).

4. **No jq on vade**: NixOS doesn't have jq in the default system config.
   Used python3 as a workaround for JSON encoding. Minor friction.

5. **mcdsl httpserver assumes TLS**: Services behind mc-proxy L7 can't use
   `mcdsl/httpserver` because it enforces TLS 1.3. mcdoc solved this with
   its own config/server. mcq now does the same. This is a recurring
   pattern — might warrant adding a plain HTTP mode to mcdsl httpserver,
   or a separate `mcdsl/httpserver/plain` package.

6. **Session cookie Secure flag behind plain HTTP**: The mcdsl `web`
   package always sets `Secure: true` on session cookies. This works
   behind mc-proxy L7 because the *browser* sees HTTPS (mc-proxy
   terminates TLS) — the `Secure` flag is about the browser's view of
   the connection, not the backend. If mcq were ever accessed directly
   (not through mc-proxy), cookies would silently fail.

---

**New file**: `log/2026-04-03-uid-incident.md` (231 lines)

# Incident Report: UID Change Cascading Failure

**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850

## Timeline

### Background

Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix`, which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:

```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```

Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.

### The UID Change

To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with `nixos-rebuild switch`.

**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild created the NixOS config with `uid = 850`, but the existing
`mcp` user on both nodes kept UID 995. A manual `usermod -u 850 mcp`
was required on each node.

**Problem 2: Rootless podman caches the UID everywhere.**

- Podman's SQLite database (`db.sql`) stores absolute paths like
  `/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name, but the kernel
  checks the actual UID
- Container storage overlay directories have file ownership based on
  the old UID namespace mapping (995 → 100000)

After changing the UID, `podman` operations failed with:

```
newuidmap: write to uid_map failed: Operation not permitted
```

### The Reboot

Rift was rebooted to get a clean systemd user session for UID 850.
The reboot succeeded, but **all containers were gone**:

```
$ podman ps -a
(empty)
```

Podman's database was recreated fresh on boot because the old database
referenced paths under `/run/user/995/`, which no longer existed. The
images were still in overlay storage, but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.

### DNS Collapse

MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
ran as a container on rift. When all containers were lost, DNS
resolution broke:

- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver

Tailscale DNS (MagicDNS) was also affected because resolved's global
DNS config pointed to MCNS. Tailscale itself remained functional
(its coordination servers are external), but hostname resolution via
Tailscale DNS names failed.

The operator turned off Tailscale on vade (the workstation) because
Tailscale's MagicDNS was routing ALL DNS queries through the broken
MCNS resolver — external services including Claude Code and Gitea
were unreachable. Disabling Tailscale was the only way to restore
external DNS resolution. However, this also broke connectivity to
rift, since the MCP agent binds to the Tailnet IP only
(`100.95.252.120:9444`).

### Recovery

**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for `.ts.net` names.

**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```

DNS resolution was restored within seconds.

**Step 3**: Start the remaining services manually via `podman run`. The
images were all cached. The `mcp deploy` CLI couldn't work because:

- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
  containers)

Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.

**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via the user-namespace mapping. But:

- `podman unshare chown -R 0:0 /srv/<service>` translated ownership
  to match the container's user namespace
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
  in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
  denied inside rootless podman user namespaces (even for UID 0 in the
  namespace)

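For orientation, the usual rootless-podman mapping rule (container root maps to the host user's own UID; container UID n >= 1 maps into the subuid range) can be written down directly. This is the general rule, with the UIDs from this incident plugged in as an assumption; the authoritative ranges live in `/etc/subuid`:

```go
package main

import "fmt"

// hostUID translates a UID inside a rootless container to the host UID,
// given the host user's own UID and the start of their subuid range.
func hostUID(containerUID, userUID, subuidStart int) int {
	if containerUID == 0 {
		return userUID // container root is the unprivileged host user
	}
	return subuidStart + containerUID - 1
}

func main() {
	// After the change the mcp user is UID 850; the subuid range is
	// assumed to start at 100000, as in the "995 → 100000" note above.
	fmt.Println(hostUID(0, 850, 100000)) // container root → host mcp user
	fmt.Println(hostUID(1, 850, 100000)) // first subuid
}
```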
**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt. The fix:

```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
```

The `fchmod` error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.

**Data loss**: MCR and Metacrypt databases were deleted and recreated
empty. MCR lost its manifest/tag metadata (the images still exist in
overlay storage but are unregistered). Metacrypt lost its CA state
(encrypted keys, issued-cert tracking). Other services (mcq, mcdoc,
etc.) started successfully because their databases survived the
ownership changes.

## Root Causes

1. **UID collision between system users**: NixOS auto-assigns system UIDs
   downward from 999. Pinning UID 995 for mcp collided with sshd on
   orion.

2. **Rootless podman's deep UID dependency**: Changing a user's UID
   after rootless podman has been used requires:
   - Updating podman's internal database paths
   - Recreating the systemd user session
   - Fixing subuid/subgid mappings
   - Fixing overlay storage ownership
   - Fixing service data file ownership
   None of these happen automatically.

3. **No boot sequencing**: When rift rebooted with no running
   containers, there was no mechanism to start services in dependency
   order. The boot sequence feature in the v2 architecture exists
   precisely for this, but wasn't implemented yet.

4. **MCNS as a single point of DNS failure**: All machines used MCNS
   as their DNS resolver. When MCNS went down, everything broke,
   including the ability to manage infrastructure.

5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
   library calls `fchmod()` on database files, which is denied inside
   rootless podman user namespaces. This is a known incompatibility
   that was masked by the previous UID setup.

## Lessons Learned

1. **Never change a rootless podman user's UID.** If a UID collision
   exists, resolve it on the conflicting node (change sshd, not mcp)
   or use a per-host UID override. Changing the UID after podman has
   been used is destructive.

2. **DNS must not be a single point of failure.** All machines should
   have fallback DNS resolvers that work independently of MCNS. The
   NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
   fallbacks, not just MCNS.

3. **Boot sequencing is critical.** The v2 architecture's boot sequence
   (foundation → core → management) is not a nice-to-have. Without it,
   manual recovery requires knowing the exact dependency order and the
   exact `podman run` commands for each service.

4. **The MCP agent should be able to recover containers from its
   registry.** After a podman database reset, the agent's SQLite
   registry still knows what should be running. An `mcp agent recover`
   command that recreates containers from the registry would eliminate
   the manual `podman run` recovery.

5. **Service definitions must include all runtime parameters.** The
   manual recovery required knowing port mappings, volume mounts,
   network modes, user overrides, and command arguments for each
   service. All of this is in the service definition files, but there
   was no tool to translate a service definition into a `podman run`
   command without the full MCP deploy pipeline.

6. **Tailscale MagicDNS amplifies DNS failures.** When MCNS is down
   and MagicDNS routes through it, ALL DNS breaks — not just internal
   names. Disabling Tailscale restores external DNS but loses Tailnet
   connectivity. The fix is fallback resolvers that bypass MCNS, not
   disabling Tailscale.

## Action Items

- [x] Write disaster recovery runbook → `docs/disaster-recovery.md`
- [x] Add fallback DNS resolvers to NixOS config → all nodes now have
      1.1.1.1 and 8.8.8.8 as fallbacks after MCNS
- [x] Implement `mcp agent recover` command → MCP v0.10.5. Recreates
      containers from the agent registry when the podman DB is lost.
- [x] Implement boot sequencing in the agent → MCP v0.10.6.
      `[[boot.sequence]]` config with per-stage health checks.
- [x] Fix modernc.org/sqlite `fchmod` → was our own `os.Chmod` in
      `mcdsl/db/db.go`, not sqlite. Made best-effort in mcdsl v1.8.0.
- [x] Add multi-address support to node config → MCP v0.10.4.
      Fallback addresses tried in order when the primary fails.
- [x] Stabilize mcp UID → pinned at 850 with a NEVER CHANGE comment