Compare commits
32 Commits
25ee89455a...master
@@ -17,14 +17,17 @@ Metacircular is a multi-service personal infrastructure platform. This root repo
| `mcat/` | MCIAS login policy tester — lightweight web app to test and audit login policies | Go |
| `mcdsl/` | Standard library — shared packages for auth, db, config, HTTP/gRPC servers, CSRF, snapshots | Go |
| `mcdoc/` | Documentation server — renders markdown from Gitea, serves public docs via mc-proxy | Go |
| `mcq/` | Document review queue — push docs for review, MCP server for Claude integration | Go |
| `mcp/` | Control plane — service deployment, container lifecycle, multi-node fleet management (CLI/agent, master in development) | Go |
| `mcns/` | Networking service — custom Go DNS server, authoritative for internal zones | Go |
| `ca/` | PKI infrastructure and secrets for dev/test (not source code, gitignored) | — |
| `docs/` | Platform-wide documentation (architecture overview, deployment guide) | Markdown |
| `log/` | Operational work logs and session notes | Markdown |

Each subproject has its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`. When working in a subproject, read its own `CLAUDE.md` first.

Some subprojects (mcat, mcdoc, mcq) may live at sibling paths (e.g., `../mcq/`) rather than as subdirectories, depending on workspace layout.

## Service Dependencies

MCIAS is the root dependency — every other service authenticates through it. No service maintains its own user database. The dependency graph:
@@ -35,6 +38,7 @@ mcias (standalone — no MCIAS dependency)
├── mc-proxy (uses MCIAS for admin auth)
├── mcr (uses MCIAS for auth + policy)
├── mcdoc (public, no MCIAS — fetches docs from Gitea)
├── mcq (uses MCIAS for auth; document review queue)
├── mcp (uses MCIAS for auth; orchestrates deployment and lifecycle)
├── mcns (uses MCIAS for auth; authoritative DNS for internal zones)
└── mcat (tests MCIAS login policies)
@@ -5,7 +5,7 @@ from its current manually-wired state to fully declarative deployment.
It is a living design document — not a spec, not a commitment, but a
record of where we are, where we want to be, and what's between.

Last updated: 2026-04-02 (Phases A–D complete, Phase E in progress)

---
@@ -181,9 +181,9 @@ about one node, one mc-proxy, or loopback-only backends.
#### 1. mcdsl: Proper Module Versioning — DONE

mcdsl is already properly versioned and released:
- Tagged releases: `v0.1.0`, `v1.0.0`, `v1.0.1`, `v1.1.0`, `v1.2.0`
- All consuming services import by URL with pinned versions
  (all consuming services on `v1.2.0`)
- No `replace` directives anywhere
- Docker builds use standard `go mod download`
- `uses_mcdsl` eliminated from service definitions and docs
@@ -215,18 +215,14 @@ routes during deploy and stop:
- L4 routes: TLS passthrough, backend handles its own TLS
- Hostnames default to `<service>.svc.mcp.metacircular.net`

#### 4. MCP Agent: TLS Cert Provisioning — DONE

Agent provisions TLS certificates from Metacrypt CA automatically during
deploy for L7 routes:
- ACME client library requests certs from Metacrypt CA via its API
- Certs stored in `/srv/mc-proxy/certs/<service>.pem`
- Provisioning happens during deploy before mc-proxy route registration
- L7 routes get agent-provisioned certs; L4 routes use service-managed TLS

#### 5. mc-proxy: Route Persistence — DONE
@@ -243,57 +239,49 @@ mc-proxy routes are fully persisted in SQLite and survive restarts:
bootstrap before MCP is operational. The gRPC API and mcproxyctl
are the primary route management interfaces going forward.

#### 6. MCP Agent: DNS Registration — DONE

Agent automatically manages DNS records during deploy and stop:
- Deploy: calls MCNS API to create/update A records for
  `<service>.svc.mcp.metacircular.net` pointing to the node's address.
- Stop/undeploy: removes DNS records before stopping containers.
- Config: `[mcns]` section in agent config with server URL, CA cert,
  token path, zone, and node address.
- Nil-safe: if MCNS not configured, silently skipped (backward compatible).
- Authorization: mcp-agent system account can manage any record name.
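As a sketch, the `[mcns]` agent config section might look like the fragment below — only the five settings (server URL, CA cert, token path, zone, node address) come from the design above; the exact key names and values are illustrative assumptions:

```toml
# Hypothetical [mcns] section of the mcp-agent config. If the whole
# section is absent, DNS registration is silently skipped (nil-safe).
[mcns]
server_url = "https://mcns.svc.mcp.metacircular.net:8443"  # MCNS API endpoint (assumed host/port)
ca_cert    = "/srv/mcp/certs/mcns-ca.pem"                  # CA cert to verify MCNS TLS
token_path = "/srv/mcp/secrets/mcns-token"                 # MCIAS system-account token
zone       = "svc.mcp.metacircular.net"                    # zone the agent writes records into
node_addr  = "192.168.88.181"                              # A-record target for this node
```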
#### 7. Metacrypt: Automated Cert Issuance Policy — DONE

MCP agent has MCIAS credentials and Metacrypt policy for automated cert
issuance:
- MCP agent authenticates to Metacrypt with MCIAS service credentials
- Metacrypt policy allows cert issuance for `*.svc.mcp.metacircular.net`
- One cert per hostname per service — no wildcard certs
#### 8. MCNS: Record Management API — DONE

MCNS provides full CRUD for DNS records via REST and gRPC:
- REST: POST/GET/PUT/DELETE on `/v1/zones/{zone}/records`
- gRPC: RecordService with ListRecords, CreateRecord, GetRecord,
  UpdateRecord, DeleteRecord RPCs
- SQLite-backed with transactional writes, CNAME exclusivity enforcement,
  and automatic SOA serial bumping on mutations
- Authorization: admin can manage any record, mcp-agent system account
  can manage any record name, other system accounts scoped to own name
- MCP agent uses the REST API to register/deregister records on
  deploy/stop
#### 9. Application $PORT Convention — DONE

mcdsl v1.2.0 added `$PORT` and `$PORT_GRPC` env var support:
- `config.Load` checks `$PORT` → overrides `Server.ListenAddr`
- `config.Load` checks `$PORT_GRPC` → overrides `Server.GRPCAddr`
- Takes precedence over TOML and generic env overrides
  (`$MCR_SERVER_LISTEN_ADDR`) — agent-assigned ports are authoritative
- Handles both `config.Base` embedding (MCR, MCNS, MCAT) and direct
  `ServerConfig` embedding (Metacrypt) via struct tree walking
- All consuming services on mcdsl v1.4.0
---
@@ -311,32 +299,92 @@ Phase A — Independent groundwork: ✓ COMPLETE
Phase B — MCP route registration: ✓ COMPLETE
#3 Agent registers routes with mc-proxy ✓ DONE

Phase C — Automated TLS: ✓ COMPLETE
#7 Metacrypt cert issuance policy ✓ DONE
#4 Agent provisions certs ✓ DONE
(depends on #7)

Phase D — DNS: ✓ COMPLETE
#8 MCNS record management API ✓ DONE
#6 Agent registers DNS ✓ DONE
(depends on #8)

Phase E — Multi-node agent management:
#10 Agent binary at /srv/mcp/mcp-agent on all nodes
#11 mcp agent upgrade (SSH-based cross-compiled push)
#12 Node provisioning tooling (Debian + NixOS)
(depends on #10)
```

**Phases A, B, C, and D are complete.** Services can be deployed with
agent-assigned ports, `$PORT` env vars, automatic mc-proxy route
registration, automated TLS cert provisioning from Metacrypt CA, and
automatic DNS registration in MCNS. No more manual port picking,
mcproxyctl, TOML editing, cert generation, or DNS zone editing.
### Immediate Next Steps

1. **Phase E: Multi-node agent management** — see #10–#12 below;
   `docs/phase-e-plan.md` has the detailed Phase E design and
   `docs/architecture-v2.md` the v2 master/agent architecture.
2. **mcdoc** — deployed (v0.1.0 on rift). Serves rendered markdown
   from Gitea via mc-proxy.
#### 10. Agent Binary Location Convention

**Gap**: The agent binary is currently NixOS-managed on rift (lives in
`/nix/store/`, systemd `ExecStart` points there). This doesn't work for
Debian nodes and requires a full `nixos-rebuild` for every MCP release.

**Work**:
- Standardize agent binary at `/srv/mcp/mcp-agent` on all nodes.
- NixOS config: change `ExecStart` from nix store path to
  `/srv/mcp/mcp-agent`. NixOS still owns user, systemd unit, podman,
  directories — just not the binary version.
- Debian nodes: same layout, provisioned by setup script.
#### 11. Agent Upgrade via SSH Push

**Gap**: Updating the agent requires manual, OS-specific steps. On
NixOS: update flake lock, commit, push, rebuild. On Debian: build, scp,
restart. With multiple nodes and architectures (amd64 + arm64), this
doesn't scale.

**Work**:
- `mcp agent upgrade [node]` CLI command.
- Cross-compiles agent for each target arch (`GOARCH` from node config).
- Uses `golang.org/x/crypto/ssh` to push the binary and restart the
  service. No external tool dependencies.
- Node config gains `ssh` (hostname) and `arch` (GOARCH) fields.
- Upgrades all nodes by default to prevent version skew. New RPCs cause
  `Unimplemented` errors if agent and CLI are out of sync.

**Depends on**: #10 (binary location convention).
#### 12. Node Provisioning Tooling

**Gap**: Setting up a new node requires manual steps: create user,
create directories, install podman, write config, create systemd unit.
Different for NixOS vs Debian.

**Work**:
- Go-based provisioning tool (part of MCP CLI) or standalone script.
- `mcp node provision <name>` SSHs to the node and runs setup:
  create `mcp` user with podman access, create `/srv/mcp/`, write
  systemd unit, install initial binary, start service.
- For NixOS, provisioning remains in the NixOS config (declarative).
  The provisioning tool targets Debian/generic Linux.

**Depends on**: #10 (binary location convention), #11 (SSH infra).
**Current fleet**:

| Node | OS | Arch | Status |
|------|----|------|--------|
| rift | NixOS | amd64 | Operational, single MCP agent |
| hyperborea | Debian (RPi) | arm64 | Online, agent provisioning planned |
| svc | Debian | amd64 | Runs MCIAS, needs agent for public edge services |
| orion | NixOS | amd64 | Provisioned, offline for maintenance |

---
@@ -26,7 +26,8 @@ lives in [docs/metacircular.md](docs/metacircular.md).
| **MCNS** | Networking — authoritative DNS for internal platform zones, upstream forwarding. | Implemented |
| **MCP** | Control plane — operator-driven deployment, service registry, data transfer, master/agent container lifecycle. | Implemented |
| **MCDoc** | Documentation server — renders markdown from Gitea, serves public docs. | Implemented |
| **MCQ** | Document review queue — push docs for review, MCP server for Claude Code integration. | Implemented |
| **MCDeploy** | Deployment CLI — tactical bridge tool, now deprecated and archived. Superseded by MCP. | Deprecated |

Shared library: **MCDSL** — standard library for all services (auth, db,
config, TLS server, CSRF, snapshots).
@@ -103,8 +104,9 @@ metacircular/
├── mcns/        DNS server
├── mcat/        Login policy tester
├── mcdsl/       Standard library (shared packages)
├── mcdeploy/    Deployment CLI (deprecated, archived)
├── mcdoc/       Documentation server
├── mcq/         Document review queue
├── ca/          PKI infrastructure (dev/test, not source code)
└── docs/        Platform-wide documentation
```
STATUS.md
@@ -1,39 +1,45 @@
# Metacircular Platform Status

Last updated: 2026-04-02

## Platform Overview

Two nodes operational (**rift** + **svc**), with **orion** provisioned but
offline for maintenance. Core infrastructure services run as containers on
rift, fronted by MC-Proxy. Svc operates as an MCP edge node managing
mc-proxy routing only (no containers); MCIAS runs on svc separately as a
systemd service. Bootstrap phases 0–4 complete (MCIAS, Metacrypt, MC-Proxy,
MCR all operational). MCP is deployed and managing all platform containers
on rift, with multi-node capability (svc as edge node). MCNS is deployed on
rift, serving authoritative DNS. Platform evolution Phases A–D complete
(automated port assignment, route registration, TLS cert provisioning, and
DNS registration). Phase E (multi-node expansion) is in planning, with v2
architecture in development.

## Service Status
| Service | Version | SDLC Phase | Deployed | Node |
|---------|---------|------------|----------|------|
| MCIAS | v1.10.5 | Maintenance | Yes | svc (systemd) |
| Metacrypt | v1.4.1 | Production | Yes | rift |
| MC-Proxy | v1.2.2 | Maintenance | Yes | rift |
| MCR | v1.3.2 | Production | Yes | rift |
| MCAT | v1.2.0 | Production | Yes | rift |
| MCDSL | v1.7.0 | Stable | N/A (library) | — |
| MCNS | v1.2.0 | Production | Yes | rift |
| MCDoc | v0.1.0 | Production | Yes | rift |
| MCQ | v0.4.2 | Production | Yes | rift |
| MCP | v0.9.0 | Production | Yes | rift |
## Service Details

### MCIAS — Identity and Access Service

- **Version:** v1.10.5 (client library: clients/go/v0.2.0)
- **Phase:** Maintenance. Phases 0-14 complete. Feature-complete with active
  refinement.
- **Deployment:** Running in production on svc as a systemd service. All
  other services authenticate against it.
- **Recent work:** WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login
  policies, Nix flake for CLI tools.
- **Artifacts:** systemd units (service + backup timer), install script,
|
### Metacrypt — Cryptographic Service Engine

- **Version:** v1.4.1 (API v1.3.1, Web v1.4.1).
- **Phase:** Production. All four engine types implemented (CA, SSH CA, transit,
  user-to-user). Active work on integration test coverage.
- **Deployment:** Running on rift as a container, fronted by MC-Proxy on
@@ -53,10 +59,12 @@ deployed on rift, serving authoritative DNS.
### MC-Proxy — TLS Proxy and Router

- **Version:** v1.2.2.
- **Phase:** Maintenance. Stable and actively routing traffic on rift and svc.
- **Deployment:** Running on rift. Fronts Metacrypt, MCR, and sgard on ports
  443, 8443, and 9443. Prometheus metrics on 127.0.0.1:9091. Routes persisted
  in SQLite and managed via gRPC API. Svc runs its own mc-proxy on :443 with
  public-facing routes.
- **Recent work:** Route persistence (SQLite), idempotent AddRoute (upsert),
  golangci-lint v2 compliance, module path migration to mc/ org.
- **Artifacts:** systemd units (service + backup timer), Docker Compose
@@ -64,7 +72,8 @@ deployed on rift, serving authoritative DNS.
### MCR — Container Registry

- **Version:** v1.3.2 (API v1.2.1, Web v1.3.2). All implementation phases
  complete.
- **Phase:** Production. Deployed on rift, serving container images.
- **Deployment:** Running on rift as two containers (mcr API + mcr-web),
  fronted by MC-Proxy on ports 443 (web, L7), 8443 (API, L4), and
@@ -77,29 +86,30 @@ deployed on rift, serving authoritative DNS.
### MCAT — Login Policy Tester

- **Version:** v1.2.0.
- **Phase:** Production. Deployed on rift as a container managed by MCP.
- **Deployment:** Running on rift. Lightweight tool for testing MCIAS login
  policy rules.
- **Recent work:** Migrated to mcdsl for auth, config, CSRF, and web.
- **Artifacts:** systemd unit, install script, example config.

### MCDSL — Standard Library

- **Version:** v1.7.0.
- **Phase:** Stable. All 9 packages implemented and tested. Being adopted
  across the platform.
- **Deployment:** N/A (Go library, imported by other services).
- **Packages:** auth, db, config, httpserver, grpcserver, csrf, web, health,
  archive.
- **Adoption:** All services except mcias on v1.7.0. mcias pending.

### MCNS — Networking Service

- **Version:** v1.2.0.
- **Phase:** Production. Custom Go DNS server replacing CoreDNS precursor.
- **Deployment:** Running on rift as a container managed by MCP. Serves two
  authoritative zones plus upstream forwarding. REST + gRPC APIs with MCIAS
  auth and name-scoped system account authorization.
- **Recent work:** v1.0.0 implementation (custom Go DNS server), engineering
  review, deployed to rift replacing CoreDNS.
- **Artifacts:** Dockerfile, Docker Compose (rift), MCP service definition,
@@ -115,53 +125,65 @@ deployed on rift, serving authoritative DNS.
rendering with syntax highlighting, webhook-driven refresh.
- **Artifacts:** Dockerfile, MCP service definition.

### MCQ — Document Review Queue

- **Version:** v0.4.2.
- **Phase:** Production. Document review queue with MCP server for Claude
  integration.
- **Deployment:** Running on rift as a container managed by MCP.
- **Recent work:** Claude MCP server integration, document review workflow.
- **Artifacts:** Dockerfile, MCP service definition.

### MCP — Control Plane

- **Version:** v0.9.0 (agent on rift: v0.8.3-dirty, agent on svc: v0.9.0).
- **Phase:** Production. Phases A–D complete. Multi-node capable with svc
  operating as an edge node. V2 architecture in development, Phase E planning
  underway.
- **Deployment:** Running on rift. Agent as systemd service under `mcp` user
  with rootless podman. Manages metacrypt, mc-proxy, mcr, mcns, mcdoc, mcat,
  mcq, and non-platform containers. Svc runs an MCP agent for edge mc-proxy
  route management.
- **Architecture:** Two components — `mcp` CLI (thin client on vade) and
  `mcp-agent` (per-node daemon with SQLite registry, podman management,
  monitoring with drift/flap detection, route registration with mc-proxy,
  automated TLS cert provisioning for L7 routes via Metacrypt CA, automated
  DNS registration in MCNS). gRPC-only (no REST). 15 RPCs, 17+ CLI commands.
- **Recent work:** Phase C (automated TLS cert provisioning), Phase D
  (automated DNS registration via MCNS), undeploy command, logs command,
  edit command, auto-login to MCR, system account auth model, module path
  migration, multi-node edge support (svc).
- **Artifacts:** systemd service (NixOS), TLS cert from Metacrypt, service
  definition files, design docs.

## Node Inventory

| Node | Address (LAN) | Address (Tailscale) | Role |
|------|---------------|---------------------|------|
| rift | 192.168.88.181 | 100.95.252.120 | Infrastructure services |
| svc | — | 100.106.232.4 | Edge — public mc-proxy, MCIAS (systemd) |
| orion | 192.168.88.214 | — | Worker (provisioned, offline for maintenance) |
|
||||||
|
|
||||||
## Rift Port Map
|
## Rift Port Map
|
||||||
|
|
||||||
|
Note: Services deployed via MCP receive dynamically assigned host ports
|
||||||
|
(10000–60000). The ports below are for infrastructure services with static
|
||||||
|
assignments or well-known ports.
|
||||||
|
|
||||||
| Port | Protocol | Services |
|
| Port | Protocol | Services |
|
||||||
|------|----------|----------|
|
|------|----------|----------|
|
||||||
| 53 | DNS (LAN + Tailscale) | mcns |
|
| 53 | DNS (LAN + Tailscale) | mcns |
|
||||||
| 443 | L7 (TLS termination) | metacrypt-web, mcr-web |
|
| 443 | L7 (TLS termination) | metacrypt-web, mcr-web, mcdoc, mcat, kls |
|
||||||
| 8080 | HTTP (all interfaces) | exod |
|
| 8080 | HTTP (all interfaces) | exod |
|
||||||
| 8443 | L4 (SNI passthrough) | metacrypt API, mcr API |
|
| 8443 | L4 (SNI passthrough) | metacrypt API, mcr API |
|
||||||
| 9090 | HTTP (all interfaces) | exod |
|
| 9090 | HTTP (all interfaces) | exod |
|
||||||
| 9443 | L4 (SNI passthrough) | metacrypt gRPC, mcr gRPC, sgard |
|
| 9443 | L4 (SNI passthrough) | metacrypt gRPC, mcr gRPC, sgard |
|
||||||
| 9091 | HTTP (loopback) | MC-Proxy Prometheus metrics |
|
| 9091 | HTTP (loopback) | MC-Proxy Prometheus metrics |
|
||||||
|
|
||||||
Non-platform services also running on rift: **exod** (ports 8080/9090),
|
Svc runs its own mc-proxy on :443 (L7) with public-facing routes:
|
||||||
**sgardd** (port 19473, fronted by MC-Proxy on 9443).
|
kls.metacircular.net, mcq.metacircular.net, metacrypt.metacircular.net,
|
||||||
|
docs.metacircular.net, git.metacircular.net, git.wntrmute.dev.
|
||||||
|
|
||||||
|
Non-platform services also running on rift: **exod** (v0.1.0, ports
|
||||||
|
8080/9090), **sgardd** (v3.2.0, port 19473, fronted by MC-Proxy on 9443),
|
||||||
|
**kls** (v0.2.0).
|
||||||
SVC_WORK.md (new file, 158 lines)
# svc.metacircular.net — Phase 1 Work Log

Date: 2026-04-02
Purpose: Deploy mcp-agent to svc (edge node) for MCP v2 Phase 1.

## Changes Made

### 1. Created `mcp` system user

```
useradd --system --home-dir /srv/mcp --create-home --shell /usr/sbin/nologin mcp
usermod -aG mc-proxy mcp
```

- UID 992, GID 991
- Member of `mc-proxy` group for socket access

### 2. Created `/srv/mcp/` directory structure

```
/srv/mcp/
├── mcp-agent        # binary (v0.8.3-1-gfa8ba6f, linux/amd64)
├── mcp-agent.toml   # agent config
├── mcp.db           # SQLite registry (created on first run)
└── certs/
    ├── cert.pem     # TLS cert (SAN: IP:100.106.232.4, DNS:svc.svc.mcp.metacircular.net)
    ├── key.pem      # TLS private key
    └── ca.pem       # Metacircular CA cert
```

- Owned by `mcp:mcp`, key file mode 0600

### 3. TLS certificate

- Issued from the Metacircular CA (`ca/ca.pem` + `ca/ca.key`)
- Subject: `CN=mcp-agent-svc`
- SANs: `IP:100.106.232.4`, `DNS:svc.svc.mcp.metacircular.net`
- Validity: 365 days
- Stored at `/srv/mcp/certs/{cert,key,ca}.pem`
### 4. Agent configuration

- File: `/srv/mcp/mcp-agent.toml`
- gRPC listen: `100.106.232.4:9555` (port 9444 in use by MCNS)
- MCIAS: `https://mcias.metacircular.net:8443`
- mc-proxy socket: `/srv/mc-proxy/mc-proxy.sock`
- Node name: `svc`
- Runtime: `podman` (not used on edge, but required by config)

### 5. systemd unit

- File: `/etc/systemd/system/mcp-agent.service`
- Runs as `mcp:mcp`
- Security hardened (NoNewPrivileges, ProtectSystem=strict, etc.)
- ReadWritePaths: `/srv/mcp`, `/srv/mc-proxy/mc-proxy.sock`
- Enabled and started

### 6. mc-proxy directory permissions

- Changed `/srv/mc-proxy/` from `drwx------` to `drwxr-x---` (group traversal)
- Changed `/srv/mc-proxy/mc-proxy.sock` from `srw-------` to `srw-rw----` (group read/write)
- Required for `mcp` user (in `mc-proxy` group) to access the socket
### 7. MCP CLI config update (on rift)

- Added svc node to `~/.config/mcp/mcp.toml`:

```toml
[[nodes]]
name = "svc"
address = "100.106.232.4:9555"
```
## Verification

```
$ mcp node list
NAME  ADDRESS              VERSION
rift  100.95.252.120:9444  v0.8.3-dirty
svc   100.106.232.4:9555   v0.8.3-1-gfa8ba6f

$ mcp route list -n svc
NODE: svc
mc-proxy v1.2.1-2-g82fce41-dirty
:443 routes=6
  l7 git.wntrmute.dev → 127.0.0.1:3000
  l7 kls.metacircular.net → 100.95.252.120:58080
  l7 mcq.metacircular.net → 100.95.252.120:48080
  l7 metacrypt.metacircular.net → 100.95.252.120:18080 (re-encrypt)
  l7 docs.metacircular.net → 100.95.252.120:38080
  l7 git.metacircular.net → 127.0.0.1:3000
```
## Agent Cert Reissue (2026-04-02)

Both agent certs reissued with comprehensive SANs:

**Rift agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `rift.scylla-hammerhead.ts.net`, `mcp-agent.svc.mcp.metacircular.net`
- IP: `100.95.252.120`, `192.168.88.181`

**Svc agent** (`/srv/mcp/certs/cert.pem`):
- DNS: `svc.scylla-hammerhead.ts.net`, `svc.svc.mcp.metacircular.net`
- IP: `100.106.232.4`

Both agents upgraded to v0.10.0 (Phase 2 edge routing RPCs + v2 proto fields).
## MCP Master Deployment (2026-04-02)

**Binary**: `/srv/mcp-master/mcp-master` (v0.10.0) on rift
**Config**: `/srv/mcp-master/mcp-master.toml`
**Database**: `/srv/mcp-master/master.db`
**Certs**: `/srv/mcp-master/certs/{cert,key,ca}.pem`
- SAN: `rift.scylla-hammerhead.ts.net`, `mcp-master.svc.mcp.metacircular.net`, IP `100.95.252.120`
**Service token**: `/srv/mcp-master/mcias-token` (MCIAS identity: `mcp-master`, expires 2027-04-03)
**Listen**: `100.95.252.120:9555`
**Bootstrap nodes**: rift (master), svc (edge)

**Status**: Running via `doas` (ad-hoc). NixOS read-only /etc prevents
direct systemd unit creation — needs a NixOS config update for a persistent
service.

**Tested**:
- `mcp deploy mcq` → master places on rift, forwards to agent ✓
- `mcp undeploy mcq` → master forwards to agent, cleans up placement ✓
- `mcp ps` → fleet-wide status through agents ✓
- `mcp node list` → both nodes visible with versions ✓
## CLI Config Changes (vade)

Updated `~/.config/mcp/mcp.toml`:
- Added `[master]` section: `address = "rift.scylla-hammerhead.ts.net:9555"`
- All node addresses switched to Tailscale DNS names
- Added CA cert path
## Known Limitations

- ~~mc-proxy socket permissions will reset on restart~~ **FIXED**: mc-proxy
  now creates the socket with 0660 (was 0600). Committed to mc-proxy master.
- Master runs ad-hoc via `doas` on rift. Needs NixOS systemd config for a
  persistent service (rift has read-only /etc).
- DNS registration not configured on master (MCNS config omitted for now).
- Edge routing not yet tested end-to-end through master (svc cert provisioning
  not configured).
- The TLS cert was issued from the local CA directly, not via the Metacrypt
  API. Should be re-issued via Metacrypt once the agent has cert provisioning.
- Container runtime is set to `podman` but podman is not installed on svc
  (Docker is). Edge agents don't run containers, so this is benign.
- Metacrypt and MCNS integrations not configured (not needed for edge role).
## Edge Routing E2E Test (2026-04-02)

Full edge routing flow tested successfully through the master:

1. Added `public = true` route for `mcq.metacircular.net` to mcq service def
2. `mcp deploy mcq` → master placed on rift, deployed, registered DNS, set up edge route on svc
3. Svc agent provisioned TLS cert from Metacrypt (expires 2026-06-28)
4. mc-proxy route created: `mcq.metacircular.net → 100.95.252.120:8443` (re-encrypt)
5. Edge route persisted in both master DB and svc agent registry

**Fix required**: `RouteDef` in servicedef.go was missing `Public` field —
TOML `public = true` was silently dropped. Fixed in v0.10.2 along with
`Tier` field and relaxed `Node` validation.
## DNS Registration Working (2026-04-02)

Master registers Tailnet IPs in MCNS (not LAN IPs). Fix in v0.10.1
resolves Tailscale DNS names to IPs before passing to MCNS.

`mcq.svc.mcp.metacircular.net → 100.95.252.120` (Tailnet IP) ✓
TOOLING.md (19 lines changed)

@@ -8,12 +8,11 @@ Go binaries built with `CGO_ENABLED=0` and installed via Nix flakes.
 | Tool | Project | Purpose | Install target |
 |------|---------|---------|---------------|
 | `mcp` | mcp | Control plane CLI — deploy, status, lifecycle, file transfer | vade, orion |
-| `mcp-agent` | mcp | Control plane agent — per-node container management daemon | rift (systemd) |
+| `mcp-agent` | mcp | Control plane agent — per-node container management daemon | rift, svc (systemd) |
 | `mciasctl` | mcias | MCIAS admin CLI — accounts, tokens, policies | vade, orion, rift |
 | `mciasgrpcctl` | mcias | MCIAS gRPC debug CLI | vade, orion, rift |
 | `mcproxyctl` | mc-proxy | MC-Proxy admin CLI — routes, firewall, status | vade, orion, rift |
 | `mcrctl` | mcr | MCR admin CLI — repositories, policies, audit | vade, orion, rift |
-| `mcdeploy` | mcdeploy | Deployment CLI — build, push, deploy (bridge tool) | vade |

 ### Server-only binaries (not installed as tools)

@@ -29,6 +28,8 @@ These run inside containers and are not installed on operator workstations:
 | `mc-proxy` | mc-proxy | TLS proxy server |
 | `mcns` | mcns | DNS server |
 | `mcat` | mcat | Login policy tester web app |
+| `mcdoc` | mcdoc | Documentation server |
+| `mcq` | mcq | Document review queue |

 ## Installation

@@ -46,15 +47,18 @@ via `mcpkg.nix` in the NixOS configuration. Adding a tool:
 - Build: `pkgs.buildGoModule` with `vendorHash = null` (vendored deps).
 - ldflags: `-s -w -X main.version=${version}`.
 - `subPackages`: list only the client binaries, not servers.
-- `system`: `x86_64-linux` (all machines are x86_64).
+- `system`: `x86_64-linux` for rift, svc, and orion; `aarch64-linux`
+  for hyperborea. Flakes that target the full fleet should support both.

 ### MCP agent

 The `mcp-agent` is a special case: it runs as a systemd service on
 managed nodes (not as a container, since it manages containers). Its
-flake exposes both `mcp` (client) and `mcp-agent` (server). The NixOS
-config in `configs/mcp.nix` references the Nix package path for the
-agent binary.
+flake exposes both `mcp` (client) and `mcp-agent` (server). Phase E is
+moving the agent binary to `/srv/mcp/mcp-agent` on all nodes — NixOS
+`ExecStart` will point there instead of a nix store path, and Debian
+nodes use the same layout. svc already follows this convention. See
+`docs/phase-e-plan.md` for details.

 ## Flake status

@@ -63,9 +67,10 @@ agent binary.
 | mcias | Yes | mciasctl, mciasgrpcctl | Yes | |
 | mc-proxy | Yes | mcproxyctl | Yes | |
 | mcr | Yes | mcrctl | Yes | |
-| mcdeploy | Yes | mcdeploy | Yes | |
 | mcp | Yes | mcp, mcp-agent | Yes | Agent also used by mcp.nix systemd unit |
 | mcns | No | — | No | Server-only, no client tool yet |
 | metacrypt | No | — | No | Server-only, no client tool yet |
 | mcat | No | — | No | Server-only, no client tool yet |
+| mcdoc | No | — | No | Server-only, deployed as container |
+| mcq | No | — | No | Server-only, document review queue |
 | mcdsl | No | — | No | Library, no binaries |
docs/architecture-v2.md (new file, 1657 lines; diff suppressed because it is too large)
docs/disaster-recovery.md (new file, 344 lines)
# Disaster Recovery: Bootstrap from Zero

This document covers recovering the Metacircular platform when all
services on rift are down and no containers are running. It assumes:

- The machine boots and NixOS is functional
- The mcp-agent systemd service starts automatically
- Tailscale is configured and connects on boot
- Service data directories (`/srv/<service>/`) are intact on disk
- Container images are cached in podman's overlay storage

If images are NOT cached (fresh machine, disk wipe), see the
"Cold Start" section at the end.
## Prerequisites

Before starting recovery, verify:

```bash
# 1. Machine is up
hostname  # should print "rift"

# 2. Tailscale is connected
tailscale status --self
# Should show the Tailnet IP (100.95.252.120)

# 3. The mcp user exists
id mcp
# Should show uid=850(mcp) gid=850(mcp)

# 4. The agent is running
systemctl status mcp-agent
# Should be active

# 5. Images are cached
su -s /bin/sh mcp -c "XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp podman images" | wc -l
# Should be > 0
```

If Tailscale is not running: `doas systemctl start tailscaled && doas tailscale up`

If the agent is not running: check that `/srv/mcp/mcp-agent` exists and
`/srv/mcp/mcp-agent.toml` is correct, then `doas systemctl restart mcp-agent`.
## Recovery Order

Services must be started in dependency order. Each stage must be
healthy before the next starts.

```
Stage 1 (Foundation): MCNS → DNS works
Stage 2 (Core):       mc-proxy, MCR, Metacrypt → routing + images + certs
Stage 3 (Management): mcp-master → orchestration
Stage 4 (Services):   mcq, mcdoc, mcat, kls, sgard, exo → applications
```
## Manual Recovery Commands

All commands run as the mcp user. Use this shell prefix:

```bash
# Set up the environment
export PODMAN_CMD='doas sh -c "cd /srv/mcp && XDG_RUNTIME_DIR=/run/user/850 HOME=/srv/mcp su -s /bin/sh mcp -c"'
# Or SSH as mcp directly (if SSH login is enabled):
ssh mcp@rift
```

For brevity, the commands below show the `podman run` portion only. Prefix
them with the environment setup above.
### Stage 1: MCNS (DNS)

MCNS must start first. Without it, no hostname resolution works.

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp \
  -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp \
  -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```

**Verify:**
```bash
dig @192.168.88.181 google.com +short
# Should return an IP address
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short
# Should return a Tailnet IP
```

**Note:** MCNS binds to specific IPs, not `0.0.0.0`, because
systemd-resolved holds port 53 on localhost. The explicit bindings
avoid the conflict.
### Stage 2: Core Infrastructure

#### mc-proxy (TLS routing)

```bash
podman run -d --name mc-proxy --restart unless-stopped \
  --network host \
  -v /srv/mc-proxy:/srv/mc-proxy \
  mcr.svc.mcp.metacircular.net:8443/mc-proxy:v1.2.2 \
  server --config /srv/mc-proxy/mc-proxy.toml
```

**Verify:** `curl -sk https://localhost:443/ 2>&1 | head -1`
(should get a response, even if 404)

#### MCR (Container Registry)

```bash
# API server
podman run -d --name mcr-api --restart unless-stopped \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28443:8443 \
  -p 127.0.0.1:29443:9443 \
  mcr.svc.mcp.metacircular.net:8443/mcr:v1.2.1 \
  server --config /srv/mcr/mcr.toml

# Web UI
podman run -d --name mcr-web --restart unless-stopped \
  --user 0:0 \
  -v /srv/mcr:/srv/mcr \
  -p 127.0.0.1:28080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcr-web:v1.3.2 \
  server --config /srv/mcr/mcr.toml
```

**If MCR fails with "chmod" or "readonly database":**
```bash
podman stop mcr-api
rm -f /srv/mcr/mcr.db /srv/mcr/mcr.db-wal /srv/mcr/mcr.db-shm
podman start mcr-api
```

This recreates the database empty. Image blobs in `/srv/mcr/layers/`
are preserved, but tag metadata is lost. Re-push images to rebuild the
registry.
#### Metacrypt (PKI / Secrets)

```bash
# API server
podman run -d --name metacrypt-api --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18443:8443 \
  -p 127.0.0.1:19443:9443 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt:v1.3.1 \
  server --config /srv/metacrypt/metacrypt.toml

# Web UI
podman run -d --name metacrypt-web --restart unless-stopped \
  -v /srv/metacrypt:/srv/metacrypt \
  -p 127.0.0.1:18080:8080 \
  mcr.svc.mcp.metacircular.net:8443/metacrypt-web:v1.4.1 \
  --config /srv/metacrypt/metacrypt.toml
```

**If Metacrypt fails with "chmod" or "readonly database":**
Same fix as MCR — delete the database files. **Warning:** this loses
all encrypted secrets, issued-cert tracking, and CA state. The CA
key itself is in the sealed vault (password-protected), not in SQLite.
### Stage 3: MCP Master

```bash
podman run -d --name mcp-master --restart unless-stopped \
  --network host \
  -v /srv/mcp-master:/srv/mcp-master \
  mcr.svc.mcp.metacircular.net:8443/mcp-master:v0.10.3 \
  server --config /srv/mcp-master/mcp-master.toml
```

**Verify:**
```bash
# From vade (operator workstation):
mcp node list
# Should show rift, svc, orion
```
### Stage 4: Application Services

Once the master is running, deploy applications through MCP:

```bash
mcp deploy mcq --direct
mcp deploy mcdoc --direct
mcp deploy mcat --direct
mcp deploy kls --direct
```

Or start them manually:

```bash
# MCQ
podman run -d --name mcq --restart unless-stopped \
  -v /srv/mcq:/srv/mcq \
  -p 127.0.0.1:48080:8080 -p 100.95.252.120:48080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcq:v0.4.2 \
  server --config /srv/mcq/mcq.toml

# MCDoc
podman run -d --name mcdoc --restart unless-stopped \
  -v /srv/mcdoc:/srv/mcdoc \
  -p 127.0.0.1:38080:8080 \
  mcr.svc.mcp.metacircular.net:8443/mcdoc:v0.1.0 \
  server --config /srv/mcdoc/mcdoc.toml

# MCAT
podman run -d --name mcat --restart unless-stopped \
  -v /srv/mcat:/srv/mcat \
  -p 127.0.0.1:48116:8443 \
  mcr.svc.mcp.metacircular.net:8443/mcat:v1.2.0 \
  server --config /srv/mcat/mcat.toml

# KLS
podman run -d --name kls --restart unless-stopped \
  -v /srv/kls:/srv/kls \
  -p 127.0.0.1:58080:8080 -p 100.95.252.120:58080:8080 \
  mcr.svc.mcp.metacircular.net:8443/kls:v0.2.0 \
  -f /srv/kls/kls.conf

# Sgard
podman run -d --name sgardd --restart unless-stopped \
  -v /srv/sgard:/srv/sgard \
  -p 127.0.0.1:19473:9473 \
  mcr.svc.mcp.metacircular.net:8443/sgardd:v3.2.0 \
  --repo /srv/sgard --authorized-keys /srv/sgard/authorized_keys \
  --tls-cert /srv/sgard/certs/sgard.pem --tls-key /srv/sgard/certs/sgard.key
```
## Verification Checklist

After all services are running:

```bash
# Fleet status
mcp ps
# All services should show "running"

# DNS
dig @192.168.88.181 google.com +short
dig @192.168.88.181 mcq.svc.mcp.metacircular.net +short

# MCIAS (runs on svc, should be unaffected by a rift outage)
curl -sk https://mcias.metacircular.net:8443/v1/health

# MCR
curl -sk https://mcr.svc.mcp.metacircular.net:8443/v2/

# Metacrypt
curl -sk https://metacrypt.svc.mcp.metacircular.net:8443/v1/health

# Public routes via svc
curl -sk https://mcq.metacircular.net/
curl -sk https://docs.metacircular.net/
```
## Common Errors

### "chmod: operation not permitted"

modernc.org/sqlite calls `fchmod()` on database files. This is denied
inside rootless podman user namespaces. Fix:

```bash
# Delete the database and let the service recreate it
podman stop <container>
rm -f /srv/<service>/<service>.db*
podman start <container>
```

The `fchmod` error will still appear in logs as a warning, but it is
non-fatal for newly created databases.

### "address already in use" on port 53

systemd-resolved holds port 53 on localhost. MCNS must bind to
specific IPs, not `0.0.0.0:53`. Use explicit port bindings:
`-p 192.168.88.181:53:53 -p 100.95.252.120:53:53`

### "connection refused" to MCR

MCR is down. Images are cached locally — you can start services that
use cached images without MCR. MCR itself starts from its cached
image.

### Agent shows "error" for all nodes

Check:
1. Tailscale is running on both the CLI machine and the target node
2. The agent is listening: `ss -tlnp | grep 9444`
3. The CLI config has the correct addresses
4. TLS certs have the right SANs for the Tailnet IP

### "podman: executable file not found"

This warning appears for svc (which uses Docker, not podman). It's
benign — svc is an edge node that doesn't run containers.
## Cold Start (No Cached Images)

If the machine was wiped and no images are cached:

1. **MCIAS** runs on svc (Docker container), not rift. It should be
   unaffected by a rift failure. Verify: `ssh svc.metacircular.net
   "docker ps | grep mcias"`.

2. **Pre-stage images** by pulling from a backup or building locally:
   ```bash
   # On vade (operator workstation), build and push to a temp location
   cd ~/src/metacircular/mcns && make docker
   podman save mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 | \
     ssh rift "podman load"
   ```
   Repeat for each service.

3. Alternatively, if another node has MCR access, push images there
   first, then pull from the running MCR instance.
## Service Reference

Quick reference for all services, their images, and critical flags:

| Service | Image | Network | Key Ports | Config Path |
|---------|-------|---------|-----------|-------------|
| mcns | mcns:v1.2.0 | bridge | 53/tcp+udp, 38443→8443 | /srv/mcns/mcns.toml |
| mc-proxy | mc-proxy:v1.2.2 | host | 443, 8443, 9443 | /srv/mc-proxy/mc-proxy.toml |
| mcr (api) | mcr:v1.2.1 | bridge | 28443→8443, 29443→9443 | /srv/mcr/mcr.toml |
| mcr (web) | mcr-web:v1.3.2 | bridge | 28080→8080 | /srv/mcr/mcr.toml |
| metacrypt (api) | metacrypt:v1.3.1 | bridge | 18443→8443, 19443→9443 | /srv/metacrypt/metacrypt.toml |
| metacrypt (web) | metacrypt-web:v1.4.1 | bridge | 18080→8080 | /srv/metacrypt/metacrypt.toml |
| mcp-master | mcp-master:v0.10.3 | host | 9555 | /srv/mcp-master/mcp-master.toml |
| mcq | mcq:v0.4.2 | bridge | 48080→8080 | /srv/mcq/mcq.toml |
| mcdoc | mcdoc:v0.1.0 | bridge | 38080→8080 | /srv/mcdoc/mcdoc.toml |
| mcat | mcat:v1.2.0 | bridge | 48116→8443 | /srv/mcat/mcat.toml |
| kls | kls:v0.2.0 | bridge | 58080→8080 | /srv/kls/kls.conf |
| sgard | sgardd:v3.2.0 | bridge | 19473→9473 | (flags, see above) |

All images are prefixed with `mcr.svc.mcp.metacircular.net:8443/`.
@@ -48,11 +48,11 @@ the spec disagree, one of them has a bug.
|
|||||||
|
|
||||||
## High-Level Overview
|
## High-Level Overview
|
||||||
|
|
||||||
Metacircular infrastructure is built from six core components, plus a shared
|
Metacircular infrastructure is built from six core components and a
|
||||||
standard library (**MCDSL**) that provides the common patterns all services
|
documentation server, plus a shared standard library (**MCDSL**) that provides
|
||||||
depend on (auth integration, database setup, config loading, HTTP/gRPC server
|
the common patterns all services depend on (auth integration, database setup,
|
||||||
bootstrapping, CSRF, web session management, health checks, snapshots, and
|
config loading, HTTP/gRPC server bootstrapping, CSRF, web session management,
|
||||||
service directory archiving):
|
health checks, snapshots, and service directory archiving):
|
||||||
|
|
||||||
- **MCIAS** — Identity and access. The root of trust for all other services.
|
- **MCIAS** — Identity and access. The root of trust for all other services.
|
||||||
Handles authentication, token issuance, role management, and login policy
|
Handles authentication, token issuance, role management, and login policy
|
||||||
@@ -75,6 +75,10 @@ service directory archiving):
  accepts outside connections, and routes them to the correct service — either
  as raw TCP passthrough or via TLS-terminating HTTP/2 reverse proxy.

- **MCDoc** — Documentation server. Fetches markdown from Gitea repositories,
  renders HTML with syntax highlighting, serves a navigable documentation site.
  Public-facing, no MCIAS authentication required.

These components form a dependency graph rooted at MCIAS:
@@ -179,14 +183,19 @@ delegates authentication to it; no service maintains its own user database.
Services validate tokens by calling back to MCIAS (cached 30s by SHA-256 of
the token).

- **Role-based access.** Three roles — `admin` (MCIAS account management,
  policy changes, zone mutations — reserved for human operators), `user`
  (policy-governed), `guest` (service-dependent restrictions, rejected by MCP
  agent). Admin detection comes solely from the MCIAS `admin` role; services
  never promote users locally. Routine operations (deploy, push, DNS updates)
  do not require admin.

- **Account types.** Human accounts (interactive users) and system accounts
  (service-to-service). Both produce standard JWTs validated the same way.
  System accounts carry no roles — their authorization is handled by each
  service's policy engine (Metacrypt policies, MCNS name-scoped access, MCR
  default policies). System account tokens are long-lived (365-day default)
  and do not require passwords for issuance.

- **Login policy.** Priority-based ACL rules that control who can log into
  which services. Rules can target roles, account types, service names, and
@@ -204,7 +213,7 @@ MCIAS evaluates login policy against the service context, verifies credentials,
and returns a bearer token. The MCIAS Go client library
(`git.wntrmute.dev/mc/mcias/clients/go`) handles this flow.

**Status:** Implemented. v1.9.0. Feature-complete with active refinement
(WebAuthn/FIDO2 passkeys, TOTP 2FA, service-context login policies).

---
@@ -255,7 +264,7 @@ core.
  operations on which engine mounts. Priority-based evaluation, default deny,
  admin bypass. See Metacrypt's `POLICY.md` for the full model.

**Status:** Implemented. v1.3.1. All four engine types complete — CA (with ACME
support), SSH CA, transit encryption, and user-to-user encryption.
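The priority-based evaluation pattern (default deny, admin bypass) can be sketched like this. The `rule` type and matching logic are illustrative, not Metacrypt's actual types; see `POLICY.md` for the real model:

```go
package main

import "sort"

// rule is a hypothetical ACL entry: which operation on which engine mount.
type rule struct {
	priority int
	mount    string // engine mount path the rule applies to
	op       string // operation, e.g. "sign" or "encrypt"
	allow    bool
}

// evaluate scans rules in descending priority order; the first matching
// rule wins. Admins bypass policy entirely, and no match means deny.
func evaluate(rules []rule, isAdmin bool, mount, op string) bool {
	if isAdmin {
		return true
	}
	sort.Slice(rules, func(i, j int) bool {
		return rules[i].priority > rules[j].priority
	})
	for _, r := range rules {
		if r.mount == mount && r.op == op {
			return r.allow
		}
	}
	return false // default deny
}
```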

---
@@ -274,7 +283,9 @@ serves the container images that MCP deploys across the platform.

- **Authenticated access.** No anonymous access. MCR uses the OCI token
  authentication flow: clients hit `/v2/`, receive a 401 with a token
  endpoint, authenticate via MCIAS, and use the returned JWT for subsequent
  requests. The token endpoint accepts both username/password (standard
  login) and pre-existing MCIAS JWTs as passwords (personal-access-token
  pattern), enabling non-interactive push/pull for system accounts and CI.
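The first step of that dance, turning the registry's 401 challenge into a token-endpoint request, can be sketched as follows. The realm URL in the example is hypothetical; the challenge format is the standard Bearer challenge used by OCI/Docker token auth:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// tokenURL builds the token-endpoint request URL from the registry's
// WWW-Authenticate challenge. Naive comma-splitting is fine for a sketch;
// a production parser must handle quoted commas.
func tokenURL(challenge string) (string, error) {
	challenge = strings.TrimPrefix(challenge, "Bearer ")
	params := map[string]string{}
	for _, part := range strings.Split(challenge, ",") {
		kv := strings.SplitN(strings.TrimSpace(part), "=", 2)
		if len(kv) == 2 {
			params[kv[0]] = strings.Trim(kv[1], `"`)
		}
	}
	realm, ok := params["realm"]
	if !ok {
		return "", fmt.Errorf("challenge missing realm")
	}
	q := url.Values{}
	if s := params["service"]; s != "" {
		q.Set("service", s)
	}
	if s := params["scope"]; s != "" {
		q.Set("scope", s)
	}
	return realm + "?" + q.Encode(), nil
}
```

The client then requests this URL with either basic credentials or an MCIAS JWT as the password, and replays the returned token as `Authorization: Bearer` on `/v2/` requests.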

- **Policy-controlled push/pull.** Fine-grained ACL rules govern who can push
  to or pull from which repositories. Integrated with MCIAS roles.
@@ -286,7 +297,7 @@ serves the container images that MCP deploys across the platform.
is scheduled, MCP tells the node's agent which image to pull and where to get
it. MCR sits behind an MC-Proxy instance for TLS routing.

**Status:** Implemented. v1.2.1. All implementation phases complete.

---
@@ -333,7 +344,9 @@ two instances — an edge proxy on a public VPS and an origin proxy on the
private network, connected over the overlay with PROXY protocol preserving
client IPs across the hop.

**Status:** Implemented. v1.2.1. Route state persisted in SQLite with
write-through semantics. gRPC admin API with idempotent AddRoute for runtime
route management.

---
@@ -365,9 +378,13 @@ into DNS records.
using internal DNS names automatically resolve to the right place without
config changes.

- **Record management API.** Authenticated via MCIAS with name-scoped
  authorization. Admin can manage all records and zones. The `mcp-agent`
  system account can create and delete any record. Other system accounts
  can only manage records matching their own name (e.g., system account
  `mcq` can manage `mcq.svc.mcp.metacircular.net` but not other records).
  Human users have read-only access to records. Zone mutations (create,
  update, delete zones) remain admin-only.
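The name-scoping rule above can be sketched as a single predicate. This helper is illustrative, not the MCNS implementation:

```go
package main

import "strings"

// canMutateRecord sketches the rule described above: admins and the
// mcp-agent system account may mutate any record; other system accounts may
// mutate only records whose leftmost label matches their own account name;
// human non-admin accounts are read-only.
func canMutateRecord(account string, isAdmin, isSystem bool, fqdn string) bool {
	if isAdmin || account == "mcp-agent" {
		return true
	}
	if !isSystem {
		return false // human non-admin accounts are read-only
	}
	labels := strings.SplitN(fqdn, ".", 2)
	return len(labels) == 2 &&
		labels[0] == account &&
		labels[1] == "svc.mcp.metacircular.net"
}
```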

**How it fits in:** MCNS answers "what is the address of X?" MCP answers "where
is service α running?" and pushes the answer to MCNS. This separation means
@@ -375,10 +392,11 @@ services can use stable DNS names in their configs (e.g.,
`mcias.svc.mcp.metacircular.net` in `[mcias] server_url`) that survive
migration without config changes.

**Status:** Implemented. v1.1.1. Custom Go DNS server deployed on rift,
serving two authoritative zones (`svc.mcp.metacircular.net` and
`mcp.metacircular.net`) plus upstream forwarding. REST + gRPC APIs with
MCIAS auth and name-scoped system account authorization. Records stored
in SQLite.

---
@@ -403,6 +421,10 @@ each managed node.
  the initial config, pulls the image from MCR, starts the container, and
  pushes a DNS update to MCNS (`α.svc.mcp.metacircular.net` → node address).

- **Undeploy.** Full teardown of a service. Stops the container, removes
  MC-Proxy routes, deletes DNS records from MCNS, and cleans up the service
  registry entry. The inverse of deploy.

- **Migrate.** Move a service from one node to another. MCP snapshots the
  service's `/srv/<service>/` directory on the source node (as a tar.zst
  image), transfers it to the destination, extracts it, starts the service,
@@ -429,9 +451,17 @@ each managed node.
- **Master/agent architecture.** MCP Master runs on the operator's machine.
  Agents run on every managed node, receiving C2 (command and control) from
  Master, reporting node status, and managing local workloads. The C2 channel
  is authenticated via MCIAS — any authenticated non-guest user or system
  account is accepted (admin role is not required for deploy operations).
  The master does not need to be always-on — agents keep running their
  workloads independently; the master is needed only to issue new commands.

- **System account automation.** The agent uses an `mcp-agent` system account
  for all service-to-service communication: TLS cert provisioning (Metacrypt),
  DNS record management (MCNS), and container image pulls (MCR). Each service
  authorizes the agent through its own policy engine. Per-service system
  accounts (e.g., `mcq`) can be created for scoped self-management — a service
  account can only manage its own DNS records, not other services'.

- **Node management.** Track which nodes are in the platform, their health,
  available resources, and running workloads.
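The C2 admission rule above reduces to a small predicate. A sketch, with illustrative names rather than the mcp-agent implementation; note that system accounts carry no roles, so an empty role list passes:

```go
package main

// c2Allowed sketches the admission rule: any authenticated non-guest
// identity (human or system account) may issue commands over the C2 channel.
func c2Allowed(authenticated bool, roles []string) bool {
	if !authenticated {
		return false
	}
	for _, r := range roles {
		if r == "guest" {
			return false // guests are rejected by the MCP agent
		}
	}
	return true // includes system accounts, which carry no roles
}
```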
@@ -452,11 +482,15 @@ services it depends on.
can deploy them. The systemd unit files exist as a fallback and for bootstrap —
the long-term deployment model is MCP-managed containers.

**Status:** Implemented. v0.7.6. Deployed on rift managing all platform
containers. Route declarations with automatic port allocation (`$PORT` /
`$PORT_<NAME>` env vars passed to containers). MC-Proxy route registration
during deploy and stop. Automated TLS cert provisioning for L7 routes via
Metacrypt CA (Phase C). Automated DNS registration in MCNS during deploy
and stop (Phase D). Two components — `mcp` CLI (operator workstation) and
`mcp-agent` (per-node daemon with SQLite registry, rootless Podman,
monitoring with drift/flap detection). gRPC-only (no REST). 15 RPCs,
17+ CLI commands.

---
@@ -663,20 +697,22 @@ renew certificates programmatically.
### How Services Get Certificates Today

For services deployed via MCP with L7 routes, certificates are provisioned
automatically during deploy — MCP uses the Metacrypt ACME client library to
obtain certs and transfers them to the node. For other services and during
bootstrap, certificates are provisioned through Metacrypt's **REST API or web
UI** and placed into each service's `/srv/<service>/certs/` directory manually.

### How MCP Automates Certificates

MCP automates certificate provisioning for deploy workflows, with renewal and
migration automation planned:

- **Initial deploy.** When MCP deploys a new service, it provisions a
  certificate from Metacrypt (via the ACME client library), transfers the cert
  and key to the node as part of the config push to `/srv/<service>/certs/`,
  and starts the service with valid TLS material. For L7 routes, MCP also
  provisions a TLS certificate for MC-Proxy's termination endpoint.

- **Renewal.** MCP knows what services are running and when their certificates
  expire. It can renew certificates before expiry by re-running the ACME flow
@@ -689,10 +725,8 @@ MCP is the natural place to automate certificate provisioning:
  for the new name.

- **MC-Proxy L7 routes.** MC-Proxy's L7 mode requires certificate/key pairs
  for TLS termination. MCP provisions these from Metacrypt during deploy and
  pushes them to the node alongside the route registration.

### Trust Distribution
@@ -793,8 +827,13 @@ Operator workstation (vade)
│
├── Scheduling: select Node C (best fit)
│
├── Port assignment: allocate a free host port for each
│   declared route (passed as $PORT / $PORT_<NAME> env vars)
│
├── Provision TLS certificate from Metacrypt CA
│   (ACME client library) for the service
│   — for L7 routes, also provision a cert for MC-Proxy
│   TLS termination
│
├── C2 to Node C agent:
│     1. Create /srv/α/ directory structure
@@ -802,15 +841,15 @@ Operator workstation (vade)
│     3. Transfer TLS cert+key → /srv/α/certs/
│     4. Transfer root CA cert → /srv/α/certs/ca.pem
│     5. Pull image from MCR
│     6. Start container with $PORT / $PORT_<NAME> env vars
│
├── Register routes with MC-Proxy
│   (gRPC AddRoute for each declared route)
│
├── Update service registry: α → Node C
│
└── Push DNS update to MCNS:
      α.svc.mcp.metacircular.net → Node C address
```

### 4. Migration
@@ -9,8 +9,8 @@ the platform's internals.
## Platform Overview

Metacircular is a multi-service infrastructure platform. Services are
Go binaries running as containers across a fleet of Linux nodes,
managed by these core components:

| Component | Role |
|-----------|------|
@@ -21,8 +21,27 @@ components:
| **MCNS** (DNS) | Authoritative DNS for `*.svc.mcp.metacircular.net` |

The operator workflow is: **build image → push to MCR → write service
definition → deploy via MCP**. MCP handles port assignment, TLS cert
provisioning, route registration, DNS registration, and container
lifecycle.

### Fleet Topology

The platform runs across multiple nodes connected via Tailnet:

| Node | Role | OS | Arch | Purpose |
|------|------|----|------|---------|
| **rift** | Compute + core infra | NixOS | amd64 | Runs most services (Metacrypt, MCR, MCNS, etc.) |
| **svc** | Edge | Debian | amd64 | Public-facing mc-proxy, routes traffic over Tailnet to compute nodes |
| **orion** | Compute | NixOS | amd64 | Provisioned, currently offline |

**Node roles:**

- **Compute nodes** (rift, orion, future RPis) run the full container
  lifecycle via rootless Podman.
- **Edge nodes** (svc) run mc-proxy for public traffic routing only.
  The MCP agent on edge nodes manages mc-proxy routes but does not run
  application containers.

---
@@ -34,7 +53,7 @@ registration, and container lifecycle.
| Container engine | Docker or Podman (for building images) |
| `mcp` CLI | Installed on the operator workstation |
| MCR access | Credentials to push images to `mcr.svc.mcp.metacircular.net:8443` |
| MCP agent | Running on the target node (`rift` for services, `svc` for edge routing) |
| MCIAS account | For `mcp` CLI authentication to the agent |

---
@@ -123,18 +142,38 @@ Service definitions are TOML files that tell MCP what to deploy. They
live at `~/.config/mcp/services/<service>.toml` on the operator
workstation.

### Minimal Example (Single Component, L7)

```toml
name = "myservice"
node = "rift"

[build.images]
myservice = "Dockerfile"

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"

[[components.routes]]
port = 8443
mode = "l7"
```

### API Service Example (L4, Multiple Routes)

```toml
name = "myservice"
node = "rift"

[build.images]
myservice = "Dockerfile"

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]

[[components.routes]]
name = "rest"
@@ -152,7 +191,6 @@ mode = "l4"
```toml
name = "myservice"
node = "rift"

[build.images]
myservice = "Dockerfile.api"
|

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]
|

[[components]]
name = "web"
image = "mcr.svc.mcp.metacircular.net:8443/myservice-web:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]
cmd = ["server", "--config", "/srv/myservice/myservice.toml"]
@@ -183,21 +223,16 @@ port = 443
mode = "l7"
```

### Conventions

A few fields are derived by the agent at deploy time:

| Field | Default | Override when... |
|-------|---------|------------------|
| Source path | `<service>` relative to workspace root | Directory name differs from service name (use `path`) |
| Hostname | `<service>.svc.mcp.metacircular.net` | Service needs a public hostname (use route `hostname`) |

All other fields must be explicit in the service definition.

### Service Definition Reference
@@ -207,7 +242,6 @@ Most fields are optional — MCP derives them from conventions:
|-------|----------|---------|
| `name` | Yes | Service name (matches project name) |
| `node` | Yes | Target node to deploy to |
| `active` | No | Whether MCP keeps this running (default: `true`) |
| `path` | No | Source directory relative to workspace (default: `name`) |
@@ -215,20 +249,20 @@ Most fields are optional — MCP derives them from conventions:

| Field | Purpose |
|-------|---------|
| `build.images.<name>` | Maps build image name to Dockerfile path. The `<name>` must match the repository name in a component's `image` field (the part after the last `/`, before the `:` tag). |

**Component fields:**

| Field | Required | Purpose |
|-------|----------|---------|
| `name` | Yes | Component name (e.g. `api`, `web`) |
| `image` | Yes | Full image reference (e.g. `mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0`) |
| `volumes` | No | Volume mounts (list of `host:container` strings) |
| `cmd` | No | Command override (list of strings) |
| `env` | No | Extra environment variables (list of `KEY=VALUE` strings) |
| `network` | No | Container network (default: none) |
| `user` | No | Container user (e.g. `0:0`) |
| `restart` | No | Restart policy (e.g. `unless-stopped`) |

**Route fields (under `[[components.routes]]`):**
|

### Version Pinning

Component `image` fields **must** pin an explicit semver tag (e.g.
`mcr.svc.mcp.metacircular.net:8443/myservice:v1.1.0`). Never use
`:latest`. This ensures deployments are reproducible and `mcp status`
shows the actual running version. The version is extracted from the
image tag.
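Extracting and validating the pinned version can be sketched as follows. This is an illustrative helper, not the `mcp` CLI's parser; note the registry host also contains a `:`, so the tag separator must come after the last `/`:

```go
package main

import (
	"fmt"
	"strings"
)

// pinnedVersion returns the semver tag from a pinned image reference,
// rejecting ":latest" and untagged references.
func pinnedVersion(image string) (string, error) {
	i := strings.LastIndex(image, ":")
	// A ":" before the last "/" belongs to the registry's port, not a tag.
	if i == -1 || i < strings.LastIndex(image, "/") {
		return "", fmt.Errorf("image %q has no tag", image)
	}
	tag := image[i+1:]
	if tag == "latest" || !strings.HasPrefix(tag, "v") {
		return "", fmt.Errorf("image %q must pin an explicit semver tag", image)
	}
	return tag, nil
}
```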

---
@@ -303,14 +339,20 @@ recreates the containers.
### What Happens During Deploy

1. Agent assigns a free host port (10000–60000) for each declared route.
2. For L7 routes, agent provisions a TLS certificate from Metacrypt CA
   (via `POST /v1/engine/request`). Certs are written to
   `/srv/mc-proxy/certs/<service>.pem` and `.key`. Existing valid certs
   (more than 30 days from expiry) are reused.
3. Agent starts containers with `$PORT` / `$PORT_<NAME>` environment
   variables set to the assigned ports.
4. Agent registers routes with mc-proxy via gRPC (hostname →
   `<node-address>:<port>`, mode, TLS cert paths).
5. Agent registers DNS entries in MCNS for
   `<service>.svc.mcp.metacircular.net`.
6. Agent records the full state in its SQLite registry.

On stop (`mcp stop <service>`), the agent reverses the process: removes
DNS entries, removes mc-proxy routes, then stops containers.
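Step 1, finding a free host port in the agent's range, can be sketched by probing with a listen. This is illustrative only; the real allocator also consults the agent's SQLite registry to avoid racing its own reservations:

```go
package main

import (
	"fmt"
	"net"
)

// allocatePort probes lo..hi for a port the kernel will let us bind,
// returning the first free one.
func allocatePort(lo, hi int) (int, error) {
	for p := lo; p <= hi; p++ {
		l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", p))
		if err != nil {
			continue // port in use
		}
		l.Close()
		return p, nil
	}
	return 0, fmt.Errorf("no free port in %d-%d", lo, hi)
}
```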

---
@@ -368,7 +410,14 @@ tags = []
level = "info"
```

For services with SSO-enabled web UIs, add:

```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```

For services with a separate web UI binary, add:

```toml
[web]
|
| Env var | When set |
|---------|----------|
| `$PORT` | Component has a single unnamed route |
| `$PORT_<NAME>` | Component has named routes |

Route names are uppercased: `name = "rest"` → `$PORT_REST`,
`name = "grpc"` → `$PORT_GRPC`.
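The naming rule reduces to a tiny mapping. An illustrative helper, not the agent's implementation:

```go
package main

import "strings"

// portEnvName maps a route name to its environment variable: an unnamed
// single route gets plain PORT, a named route gets PORT_<NAME> uppercased.
func portEnvName(routeName string) string {
	if routeName == "" {
		return "PORT"
	}
	return "PORT_" + strings.ToUpper(routeName)
}
```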

**Container listen address:** Services must bind to `0.0.0.0:$PORT`
(or `:$PORT`), not `localhost:$PORT`. Podman port-forwards go through
the container's network namespace — binding to `localhost` inside the
container makes the port unreachable from outside.

Services built with **mcdsl v1.1.0+** handle this automatically —
`config.Load` checks `$PORT` → overrides `Server.ListenAddr`, and
`$PORT_GRPC` → overrides `Server.GRPCAddr`. These take precedence over
@@ -411,18 +465,72 @@ these.
## 6. Authentication (MCIAS Integration)

Every service delegates authentication to MCIAS. No service maintains
its own user database. Services support two login modes: **SSO
redirect** (recommended for web UIs) and **direct credentials**
(fallback / API clients).

### SSO Login (Web UIs)

SSO is the preferred login method for web UIs. The flow is an OAuth
2.0-style authorization code exchange:

1. User visits the service and is redirected to `/login`.
2. Login page shows a "Sign in with MCIAS" button.
3. Click redirects to MCIAS (`/sso/authorize`), which authenticates the
   user.
4. MCIAS redirects back to the service's `/sso/callback` with an
   authorization code.
5. The service exchanges the code for a JWT via a server-to-server call
   to MCIAS `POST /v1/sso/token`.
6. The JWT is stored in a session cookie.

SSO is enabled by adding an `[sso]` section to the service config and
registering the service as an SSO client in MCIAS.

**Service config:**

```toml
[sso]
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
```

**MCIAS config** (add to the `[[sso_clients]]` list):

```toml
[[sso_clients]]
client_id = "<service>"
redirect_uri = "https://<service>.svc.mcp.metacircular.net/sso/callback"
service_name = "<service>"
```

The `redirect_uri` must match exactly between the service config and
the MCIAS client registration.

When `[sso].redirect_uri` is empty or absent, the service falls back to
the direct credentials form.

**Implementation:** Services use `mcdsl/sso` (v1.7.0+) which handles
state management, CSRF-safe cookies, and the code exchange. The web
server registers three routes:

| Route | Purpose |
|-------|---------|
| `GET /login` | Renders landing page with "Sign in with MCIAS" button |
| `GET /sso/redirect` | Sets state cookies, redirects to MCIAS |
| `GET /sso/callback` | Validates state, exchanges code for JWT, sets session |

### Direct Credentials (API / Fallback)

1. Client sends credentials to the service's `POST /v1/auth/login`.
2. Service forwards them to MCIAS via `mcdsl/auth.Authenticator.Login()`.
3. MCIAS validates and returns a bearer token.
4. Subsequent requests include `Authorization: Bearer <token>`.
5. Service validates tokens via `ValidateToken()`, cached for 30s
   (keyed by SHA-256 of the token).

Web UIs use this mode when SSO is not configured, presenting a
username/password/TOTP form instead of the SSO button.

### Roles

| Role | Access |
@@ -475,11 +583,14 @@ co-located on the same node).
| `mcp build <service>` | Build and push images to MCR |
| `mcp sync` | Push all service definitions to agents; auto-build missing images |
| `mcp deploy <service>` | Pull image, (re)create containers, register routes |
| `mcp undeploy <service>` | Full teardown: remove routes, DNS, certs, and containers |
| `mcp stop <service>` | Remove routes, stop containers |
| `mcp start <service>` | Start previously stopped containers |
| `mcp restart <service>` | Restart containers in place |
| `mcp ps` | List all managed containers and status |
| `mcp status [service]` | Detailed status for a specific service |
| `mcp logs <service>` | Stream container logs |
| `mcp edit <service>` | Edit service definition |

---

@@ -504,13 +615,14 @@ git push origin v1.0.0
cat > ~/.config/mcp/services/myservice.toml << 'EOF'
name = "myservice"
node = "rift"

[build.images]
myservice = "Dockerfile.api"

[[components]]
name = "api"
image = "mcr.svc.mcp.metacircular.net:8443/myservice:v1.0.0"
volumes = ["/srv/myservice:/srv/myservice"]

[[components.routes]]
name = "rest"
@@ -584,15 +696,85 @@ Services follow a standard directory structure:

---

## 10. Agent Management

MCP manages a fleet of nodes with heterogeneous operating systems and
architectures. The agent binary lives at `/srv/mcp/mcp-agent` on every
node — this is a mutable path that MCP controls, regardless of whether
the node runs NixOS or Debian.

### Node Configuration

Each node in `~/.config/mcp/mcp.toml` includes SSH and architecture
info for agent management:

```toml
[[nodes]]
name = "rift"
address = "100.95.252.120:9444"
ssh = "rift"
arch = "amd64"

[[nodes]]
name = "hyperborea"
address = "100.x.x.x:9444"
ssh = "hyperborea"
arch = "arm64"
```

### Upgrading Agents

After tagging a new MCP release:

```bash
# Upgrade all nodes (recommended — prevents version skew)
mcp agent upgrade

# Upgrade a single node
mcp agent upgrade rift

# Check versions across the fleet
mcp agent status
```

`mcp agent upgrade` cross-compiles the agent binary for each target
architecture, SSHs to each node, atomically replaces the binary, and
restarts the systemd service. All nodes should be upgraded together
because new CLI versions often depend on new agent RPCs.

### Provisioning New Nodes

One-time setup for a new Debian node:

```bash
# 1. Provision the node (creates user, dirs, systemd unit, installs binary)
mcp node provision <name>

# 2. Register the node
mcp node add <name> <address>

# 3. Deploy services
mcp deploy <service>
```

For NixOS nodes, provisioning is handled by the NixOS configuration.
The NixOS config creates the `mcp` user, systemd unit, and directories.
The `ExecStart` path points to `/srv/mcp/mcp-agent` so that `mcp agent
upgrade` works the same as on Debian nodes.

---

## Appendix: Currently Deployed Services

For reference, these services are operational on the platform:

| Service | Version | Node | Purpose |
|---------|---------|------|---------|
| MCIAS | v1.9.0 | (separate) | Identity and access |
| Metacrypt | v1.4.1 | rift | Cryptographic service, PKI/CA |
| MC-Proxy | v1.2.1 | rift, svc | TLS proxy and router (svc handles public edge) |
| MCR | v1.2.1 | rift | Container registry |
| MCNS | v1.1.1 | rift | Authoritative DNS |
| MCDoc | v0.1.0 | rift | Documentation server |
| MCQ | v0.4.0 | rift | Document review queue |
| MCP | v0.7.6 | rift, svc | Control plane agent |
127 docs/phase-e-plan.md Normal file
@@ -0,0 +1,127 @@
# Phase E: Multi-Node Orchestration

Phase D (automated DNS registration) is complete. Phase E extends MCP from
a single-node agent on rift to a multi-node fleet with a central master
process.

## Goal

Deploy and manage services across multiple nodes from a single control
plane. The operator runs `mcp deploy` and the system places the workload on
the right node, provisions certs, registers DNS, and configures routing --
same as today on rift, but across the fleet.

## Fleet Topology

| Node | OS | Arch | Role |
|------|----|------|------|
| rift | NixOS | amd64 | Master + worker -- runs mcp-master, core infra, and application services |
| orion | NixOS | amd64 | Worker |
| hyperborea | Debian | arm64 | Worker (Raspberry Pi) |
| svc | Debian | amd64 | Edge -- mc-proxy for public traffic, no containers |

Tailnet is the interconnect between all nodes. Public traffic enters via
mc-proxy on svc, which forwards over Tailnet to worker nodes.

## Key Architecture Decisions

These were resolved in the 2026-04-01 design session:

1. **Rift is the master node.** No separate straylight machine. Core infra
   stays on rift, which gains mcp-master alongside its existing agent.

2. **Master-mediated coordination.** Agents never talk to each other. All
   cross-node operations go through the master. Agents only dial the master
   (for registration and heartbeats) and respond to master RPCs.

3. **Agent self-registration.** Agents register with the master on startup
   (name, role, address, arch). The master maintains the live node registry.
   No static `[[nodes]]` config required except for bootstrap.

4. **Heartbeats with fallback probe.** Agents push heartbeats every 30s
   (with resource data). If the master misses 3 heartbeats (90s), it
   actively probes the agent. Failed probe marks the node unhealthy.

5. **Tier-based placement.** `tier = "core"` runs on the master node.
   `tier = "worker"` (default) is auto-placed on a worker with capacity.
   Explicit `node = "orion"` overrides tier for pinned services.

6. **Two separate certs for public services.** Internal cert
   (`svc.mcp.metacircular.net`) issued by worker agent. Public cert
   (`metacircular.net`) issued by edge agent. Internal names never
   appear on edge certs.

7. **`public = true` on routes.** Public routes declare intent with a
   boolean flag. The master assigns the route to an edge node (currently
   always svc). No explicit `edge` field in service definitions.

## Components

### Master (`mcp-master`)

Long-lived orchestrator on rift. Responsibilities:

- Accept CLI commands and dispatch to the correct agent
- Maintain node registry from agent self-registration
- Place services based on tier, explicit node, and resource availability
- Detect `public = true` routes and coordinate edge setup
- Validate public hostnames against allowed domain list
- Aggregate status from all agents (fleet-wide view)
- Probe agents on missed heartbeats

The master is stateless in the durable sense -- it rebuilds its world view
from agents on startup. If the master goes down, running services continue
unaffected; only new deploys and rescheduling stop.

### Agent upgrades

The fleet is heterogeneous (NixOS + Debian, amd64 + arm64), so NixOS flake
inputs don't work as a universal update mechanism.

**Design:** MCP owns the binary at `/srv/mcp/mcp-agent` on all nodes.

- `mcp agent upgrade [node]` -- CLI cross-compiles for the target's
  GOARCH, SCPs the binary, restarts via SSH
- Node config gains `ssh` (user@host) and `arch` (amd64/arm64) fields
- rift's NixOS `ExecStart` changes from nix store path to
  `/srv/mcp/mcp-agent`
- All nodes: binary at `/srv/mcp/mcp-agent`, systemd unit
  `mcp-agent.service`

### Edge agents

svc runs an agent but does NOT run containers. Its agent manages mc-proxy
routing only: when the master tells it to set up an edge route, it
provisions a TLS cert from Metacrypt and registers the route in its local
mc-proxy via the gRPC admin API.

## Migration Plan

### Phase 1: Agent on svc
Deploy mcp-agent to svc. Verify with `mcp node list`.

### Phase 2: Edge routing RPCs
Implement SetupEdgeRoute/RemoveEdgeRoute/ListEdgeRoutes on the agent.
Test by calling directly from CLI.

### Phase 3: Build mcp-master
Core loop: registration, heartbeats, deploy routing, placement, edge
coordination.

### Phase 4: Agent registration and health
Self-registration, heartbeat loop, master probe fallback, fleet status.

### Phase 5: Cut over
Point CLI at master, add tier fields to service defs, deploy agents to
orion and hyperborea.

## What Phase E Does NOT Include

These remain future work:

- Auto-reconciliation (agent auto-restarting drifted containers)
- Live migration (snapshot streaming between nodes)
- Web UI for fleet management
- Observability / log aggregation
- Object store
- Multiple edge nodes / master HA
@@ -1018,6 +1018,13 @@ Write these before writing code. They are the blueprint, not the afterthought.
- **Never log secrets.** Keys, passwords, tokens, and plaintext must never
  appear in log output.

### CLI Security

- **Never echo passwords.** Interactive password prompts must suppress
  terminal echo. Use `mcdsl/terminal.ReadPassword` — it wraps
  `golang.org/x/term.ReadPassword` with proper prompt and newline handling.
  Never read passwords with `bufio.Scanner` or `fmt.Scanln`.

### Web Security

- CSRF tokens on all mutating requests.
627 log/2026-03-28-building-mcq.md Normal file
@@ -0,0 +1,627 @@
# Building MCQ: A Reading Queue in a Session

*A document reading queue goes from design discussion to production
deployment — including the deployment's sharp edges.*

*Written by Claude (Opus 4.6), reflecting on a collaborative session with
Kyle. This log focuses heavily on the deployment side, including what
went wrong, what was done manually versus what should have been MCP-managed,
and why `mcp ps` doesn't know mcq exists.*

---

## The Idea

Kyle was out and about, away from his tailnet, and wanted to review
platform documentation on his phone. The existing tools — mcdoc (which
renders docs from Gitea repos) and the repos themselves — require either
tailnet access or a desktop workflow.

The concept: a **document queue**. Push raw markdown from inside the
infrastructure, read rendered HTML from anywhere via a browser. Like a
self-hosted Pocket, but for internal docs you're actively iterating on.

After a design discussion, we settled on:

- **Name**: mcq (Metacircular Document Queue)
- **Data model**: Documents keyed by slug, upsert semantics (re-push
  replaces content, resets read flag)
- **Auth**: MCIAS on everything — any user including guest can read, any
  user including system accounts can push
- **Rendering**: Goldmark with GFM + syntax highlighting, rendered on
  each page view
- **Architecture**: Single binary, REST API + gRPC + web UI

## Building the Service

### Codebase Exploration

Before writing any code, I explored the existing platform services to
understand the patterns:

- **mcat** (`~/src/metacircular/mcat/`): Reference for the web UI pattern —
  chi router, CSRF, session cookies, htmx, embedded templates, cobra CLI,
  config loading via `mcdsl/config`.
- **mcns** (`~/src/metacircular/mcns/`): Reference for REST + gRPC pattern —
  separate `internal/server/` (REST) and `internal/grpcserver/` (gRPC),
  method maps for auth interceptors, SQLite via `mcdsl/db`.
- **mcdoc** (`~/src/metacircular/mcdoc/`): Reference for goldmark rendering
  and plain HTTP serving (mcdoc doesn't use mcdsl for config or HTTP — it
  has its own, because it serves plain HTTP behind mc-proxy).
- **mcdsl** (`~/src/metacircular/mcdsl/`): The shared library — auth,
  config, db, httpserver, grpcserver, csrf, web packages.

### Implementation (on vade, Kyle's workstation)

Created `~/src/mcq/` with the standard platform layout:

```
cmd/mcq/        main.go, server.go (cobra CLI)
internal/
  config/       custom config (TLS optional, see below)
  db/           SQLite schema, migrations, document CRUD
  server/       REST API routes and handlers
  grpcserver/   gRPC server, interceptors, service handlers
  webserver/    Web UI routes, templates, session management
  render/       goldmark markdown-to-HTML renderer
proto/mcq/v1/   Protobuf definitions
gen/mcq/v1/     Generated Go code
web/            Embedded templates + static files
deploy/         systemd, examples
```

Key files:

- **Proto** (`proto/mcq/v1/mcq.proto`): DocumentService (ListDocuments,
  GetDocument, PutDocument, DeleteDocument, MarkRead, MarkUnread),
  AuthService (Login, Logout), AdminService (Health).
- **DB** (`internal/db/documents.go`): Single `documents` table with slug
  as unique key. PutDocument uses `INSERT ... ON CONFLICT(slug) DO UPDATE`.
- **REST** (`internal/server/routes.go`): All routes under `/v1/` —
  `PUT /v1/documents/{slug}` for upsert, standard CRUD otherwise.
- **Web UI** (`internal/webserver/server.go`): Login page, document list
  at `/`, rendered markdown reader at `/d/{slug}`.
- **gRPC** (`internal/grpcserver/`): Mirrors REST exactly. Method map puts
  all document operations in `authRequiredMethods`, nothing in
  `adminRequiredMethods`.

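The upsert mentioned for `internal/db/documents.go` would look roughly like this in SQLite; column names other than `slug` are assumptions for illustration, not the actual schema:

```sql
INSERT INTO documents (slug, title, content, read, updated_at)
VALUES (?, ?, ?, 0, ?)
ON CONFLICT(slug) DO UPDATE SET
    title      = excluded.title,
    content    = excluded.content,
    read       = 0,          -- re-push resets the read flag
    updated_at = excluded.updated_at;
```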
Proto generation ran on vade:

```bash
cd ~/src/mcq
protoc --go_out=. --go_opt=module=git.wntrmute.dev/mc/mcq \
  --go-grpc_out=. --go-grpc_opt=module=git.wntrmute.dev/mc/mcq \
  proto/mcq/v1/*.proto
```

### The .gitignore Bug

First `git add -A` missed `cmd/mcq/`, `proto/mcq/`, and `gen/mcq/`. The
`.gitignore` had:

```
mcq
srv/
```

The pattern `mcq` (without a leading slash) matches any file or directory
named `mcq` at any level — so it was ignoring `cmd/mcq/`, `gen/mcq/`, and
`proto/mcq/`. Fixed to:

```
/mcq
/srv/
```

### The TLS Decision

This was the most consequential design decision for deployment.

The standard platform pattern (mcdsl's `httpserver`) enforces TLS 1.3
minimum. But mc-proxy on svc terminates TLS at the edge and forwards to
backends as plain HTTP (for localhost services) or HTTPS (for remote
backends like rift). Gitea on svc runs plain HTTP on port 3000 behind
mc-proxy. mcdoc on rift runs plain HTTP on port 38080 behind mc-proxy.

mcdsl's `config.Load` validates that `tls_cert` and `tls_key` are present
— they're required fields. So I couldn't use `config.Base` with empty TLS
fields.

**Solution**: Created `internal/config/config.go` — mcq's own config
package, modeled after mcdoc's. Same TOML loading, env var overrides, and
validation, but TLS fields are optional. When empty, the server uses
`http.ListenAndServe()` instead of `httpserver.ListenAndServeTLS()`.

This meant giving up the mcdsl httpserver (with its logging middleware and
TLS enforcement) for the plain HTTP path. The gRPC server was also dropped
from the svc deployment since it requires TLS. The REST API and web UI
are sufficient for the use case.

### Build and Test (on vade)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd ~/src/mcq
|
||||||
|
go mod tidy
|
||||||
|
go build ./... # clean
|
||||||
|
go vet ./... # clean
|
||||||
|
go test ./... # 6 tests pass (all in internal/db)
|
||||||
|
|
||||||
|
# Production binary
|
||||||
|
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
|
||||||
|
go build -trimpath -ldflags="-s -w -X main.version=v0.1.0" \
|
||||||
|
-o mcq ./cmd/mcq
|
||||||
|
|
||||||
|
# Result: 21MB static binary
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deployment
|
||||||
|
|
||||||
|
### Why mcq is NOT in `mcp ps`
|
||||||
|
|
||||||
|
**This is the most important thing in this log.**
|
||||||
|
|
||||||
|
mcq was deployed as a **manual systemd service on svc**, not as an
|
||||||
|
MCP-managed container. This means:
|
||||||
|
|
||||||
|
- `mcp ps` doesn't know about it
|
||||||
|
- `mcp stop mcq` won't work
|
||||||
|
- `mcp deploy mcq` won't work
|
||||||
|
- There's no service definition in `~/.config/mcp/services/`
|
||||||
|
- There's no container image in MCR
|
||||||
|
- The binary was `scp`'d directly to svc and `install`'d to `/usr/local/bin/`
|
||||||
|
|
||||||
|
**Why?** Three reasons:
|
||||||
|
|
||||||
|
1. **svc has no MCP agent.** The MCP agent (`mcp-agent`) only runs on rift.
|
||||||
|
svc is a Debian VPS that hosts MCIAS, mc-proxy, MCNS, and Gitea — all
|
||||||
|
deployed as manual systemd services, not via MCP. Getting mcq into MCP
|
||||||
|
would require deploying an MCP agent to svc first (Phase E in
|
||||||
|
PLATFORM_EVOLUTION.md, items #10-#12).
|
||||||
|
|
||||||
|
2. **mcq runs as a native binary, not a container.** MCP manages containers
|
||||||
|
(podman). mcq on svc is a bare binary under systemd, like MCIAS and
|
||||||
|
mc-proxy on svc. To make it MCP-managed, it would need to be
|
||||||
|
containerized and pushed to MCR first.
|
||||||
|
|
||||||
|
3. **The deployment followed the existing svc pattern.** Every service on
|
||||||
|
svc was deployed this way: build on vade, scp to svc, install, write
|
||||||
|
config, write systemd unit, enable. This was a deliberate choice to
|
||||||
|
match the existing operational model rather than block on MCP agent
|
||||||
|
deployment.
|
||||||
|
|
||||||
|
### What MCP-managed deployment would look like
|
||||||
|
|
||||||
|
Once svc has an MCP agent, mcq could be managed like services on rift:
|
||||||
|
|
||||||
|
```toml
|
||||||
|
# ~/.config/mcp/services/mcq.toml
|
||||||
|
name = "mcq"
|
||||||
|
node = "svc"
|
||||||
|
version = "v0.1.0"
|
||||||
|
|
||||||
|
[[components]]
|
||||||
|
name = "api"
|
||||||
|
|
||||||
|
[[components.routes]]
|
||||||
|
port = 8090
|
||||||
|
mode = "l7"
|
||||||
|
hostname = "mcq.metacircular.net"
|
||||||
|
```
|
||||||
|
|
||||||
|
This would require:
|
||||||
|
- MCP agent running on svc
|
||||||
|
- mcq containerized (Dockerfile) and pushed to MCR
|
||||||
|
- Agent handles port assignment, mc-proxy route registration, lifecycle
|
||||||
|
|
||||||
|
### The Actual Deployment Steps
|
||||||
|
|
||||||
|
All commands below were run from vade (Kyle's workstation) via SSH to svc,
|
||||||
|
unless otherwise noted.
|
||||||
|
|
||||||
|
#### 1. Push repo to Gitea (from vade)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd ~/src/mcq
|
||||||
|
git remote add origin git@git.wntrmute.dev:mc/mcq.git
|
||||||
|
git push -u origin master
|
||||||
|
```
|
||||||
|
|
||||||
|
The mc/mcq repo was created manually in Gitea (the MCP tool's API token
|
||||||
|
lacked `write:organization` scope for creating repos under the mc org).
|
||||||
|
|
||||||
|
#### 2. Copy binary to svc (from vade)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
scp ~/src/mcq/mcq kyle@svc:/tmp/mcq
|
||||||
|
```
|
||||||
|
|
||||||
|
SSH to svc uses Tailscale hostname resolution — `svc` resolves to
|
||||||
|
`100.106.232.4` via tailscale. No SSH config entry was needed. Had to
|
||||||
|
accept the host key on first connection:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh -o StrictHostKeyChecking=accept-new kyle@svc
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 3. Create user and install binary (on svc, as root via sudo)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo useradd --system --no-create-home --shell /usr/sbin/nologin mcq
|
||||||
|
sudo mkdir -p /srv/mcq
|
||||||
|
sudo chown mcq:mcq /srv/mcq
|
||||||
|
sudo chmod 0700 /srv/mcq
|
||||||
|
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
|
||||||
|
```
|
||||||
|
|
||||||
|
Verified: `/usr/local/bin/mcq --version` → `mcq version v0.1.0`
|
||||||
|
|
||||||
|
#### 4. Write config (on svc)
|
||||||
|
|
||||||
|
Created `/srv/mcq/mcq.toml`:
|
||||||
|
|
||||||
|
```toml
|
||||||
|
[server]
|
||||||
|
listen_addr = "127.0.0.1:8090"
|
||||||
|
|
||||||
|
[database]
|
||||||
|
path = "/srv/mcq/mcq.db"
|
||||||
|
|
||||||
|
[mcias]
|
||||||
|
server_url = "https://mcias.metacircular.net:8443"
|
||||||
|
ca_cert = "/srv/mcq/ca.pem"
|
||||||
|
service_name = "mcq"
|
||||||
|
tags = []
|
||||||
|
|
||||||
|
[log]
|
||||||
|
level = "info"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Important detail**: The first attempt used `server_url = "https://127.0.0.1:8443"`
|
||||||
|
which failed because MCIAS's TLS cert has SANs for `mcias.wntrmute.dev`
|
||||||
|
and `mcias.metacircular.net` but **not** `127.0.0.1` or `localhost`. Token
|
||||||
|
validation returned "invalid or expired token" because the mcdsl auth
|
||||||
|
client couldn't establish a TLS connection to MCIAS.
|
||||||
|
|
||||||
|
Fixed by copying the pattern from MCNS on svc:
|
||||||
|
- `server_url = "https://mcias.metacircular.net:8443"` (uses the hostname
|
||||||
|
that matches the cert's SAN)
|
||||||
|
- `ca_cert = "/srv/mcq/ca.pem"` (the WNTRMUTE root CA cert, copied from
|
||||||
|
`/srv/mcns/certs/ca.pem`)
|
||||||
|
|
||||||
|
The hostname `mcias.metacircular.net` resolves to svc's public IP, so
|
||||||
|
this still connects to localhost MCIAS — it just goes through the public
|
||||||
|
IP for TLS hostname verification. (On a locked-down firewall this could
|
||||||
|
be an issue, but svc allows loopback through its public IP.)
|
||||||
|
|
||||||
|
#### 5. Create systemd unit (on svc)
|
||||||
|
|
||||||
|
Created `/etc/systemd/system/mcq.service`:
|
||||||
|
|
||||||
|
```ini
|
||||||
|
[Unit]
|
||||||
|
Description=MCQ Document Queue
|
||||||
|
After=network-online.target mcias.service
|
||||||
|
Wants=network-online.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
ExecStart=/usr/local/bin/mcq server --config /srv/mcq/mcq.toml
|
||||||
|
WorkingDirectory=/srv/mcq
|
||||||
|
Restart=on-failure
|
||||||
|
RestartSec=5
|
||||||
|
User=mcq
|
||||||
|
Group=mcq
|
||||||
|
|
||||||
|
NoNewPrivileges=yes
|
||||||
|
ProtectSystem=strict
|
||||||
|
ProtectHome=yes
|
||||||
|
ReadWritePaths=/srv/mcq
|
||||||
|
PrivateTmp=yes
|
||||||
|
ProtectKernelTunables=yes
|
||||||
|
ProtectControlGroups=yes
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo systemctl daemon-reload
|
||||||
|
sudo systemctl enable --now mcq
|
||||||
|
```
|
||||||
|
|
||||||
|
Verified running: PID 3765144, memory 7.8MB, started cleanly.
|
||||||
|
|
||||||
|
#### 6. Generate TLS certificate for mc-proxy (on vade)

mc-proxy needs a TLS cert for the `mcq.metacircular.net` hostname (it
terminates TLS at the edge). Generated using the local WNTRMUTE root CA:

```bash
cd /tmp
openssl ecparam -name prime256v1 -genkey -noout -out mcq.key

openssl req -new -key mcq.key -out mcq.csr \
  -subj "/CN=mcq.metacircular.net/O=Metacircular Dynamics" \
  -addext "subjectAltName=DNS:mcq.metacircular.net"

openssl x509 -req -in mcq.csr \
  -CA ~/src/metacircular/ca/ca.pem \
  -CAkey ~/src/metacircular/ca/ca.key \
  -CAcreateserial -out mcq.pem -days 365 -sha256 \
  -extfile <(echo "subjectAltName=DNS:mcq.metacircular.net
keyUsage=digitalSignature
extendedKeyUsage=serverAuth")
```

The CA key and cert are at `~/src/metacircular/ca/` — this is the
WNTRMUTE Issuing Authority root CA. Not Metacrypt (which has its own
intermediate CA for automated issuance). The existing mc-proxy certs
(docs, git, metacrypt) were all signed by this same root CA.
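
It's worth sanity-checking the issued cert before copying it around. A throwaway sketch of the same issuance flow with disposable names in a temp directory (nothing here touches the real CA):

```shell
# Toy re-run of the issuance steps above, then a chain check and a SAN
# inspection. mcq.example.test is a placeholder hostname.
dir=$(mktemp -d); cd "$dir"

openssl ecparam -name prime256v1 -genkey -noout -out ca.key
openssl req -new -x509 -key ca.key -out ca.pem -days 1 -subj "/CN=Toy Root CA"

openssl ecparam -name prime256v1 -genkey -noout -out leaf.key
openssl req -new -key leaf.key -out leaf.csr -subj "/CN=mcq.example.test" \
  -addext "subjectAltName=DNS:mcq.example.test"

printf 'subjectAltName=DNS:mcq.example.test\n' > leaf.ext
openssl x509 -req -in leaf.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
  -out leaf.pem -days 1 -sha256 -extfile leaf.ext

openssl verify -CAfile ca.pem leaf.pem        # should print: leaf.pem: OK
openssl x509 -in leaf.pem -noout -ext subjectAltName
```

The same `verify -CAfile` check against the real `ca.pem` confirms a freshly issued cert chains to the WNTRMUTE root before it goes into `/srv/mc-proxy/certs/`.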

Copied to svc:

```bash
scp /tmp/mcq.pem /tmp/mcq.key kyle@svc:/tmp/
```

Installed on svc:

```bash
sudo cp /tmp/mcq.pem /srv/mc-proxy/certs/mcq.metacircular.net.pem
sudo cp /tmp/mcq.key /srv/mc-proxy/certs/mcq.metacircular.net.key
sudo chown mc-proxy:mc-proxy /srv/mc-proxy/certs/mcq.metacircular.net.*
sudo chmod 0600 /srv/mc-proxy/certs/mcq.metacircular.net.key
```

#### 7. Add mc-proxy route (on svc)

mc-proxy on svc uses SQLite for route persistence. The TOML config only
seeds the database on first run (`store.IsEmpty()` check); after that,
routes are loaded from SQLite. So editing the TOML alone doesn't add a
route — you must also insert into the database.

I did both (TOML for documentation/re-seeding, SQLite for immediate effect):

**TOML** (added via `sed` to `/srv/mc-proxy/mc-proxy.toml`):

```toml
[[listeners.routes]]
hostname = "mcq.metacircular.net"
backend = "127.0.0.1:8090"
mode = "l7"
tls_cert = "/srv/mc-proxy/certs/mcq.metacircular.net.pem"
tls_key = "/srv/mc-proxy/certs/mcq.metacircular.net.key"
backend_tls = false
```

**SQLite** (direct insert):

```bash
sudo sqlite3 /srv/mc-proxy/mc-proxy.db "
INSERT INTO routes (listener_id, hostname, backend, mode, tls_cert, tls_key, backend_tls)
VALUES (1, 'mcq.metacircular.net', '127.0.0.1:8090', 'l7',
        '/srv/mc-proxy/certs/mcq.metacircular.net.pem',
        '/srv/mc-proxy/certs/mcq.metacircular.net.key', 0);
"
```

`listener_id = 1` is the `:443` listener (the only listener on svc's
mc-proxy).

**Note on `backend_tls = false`**: mcq serves plain HTTP on localhost.
mc-proxy terminates TLS for the client and forwards plain HTTP to
`127.0.0.1:8090`. This is the same pattern as Gitea (`127.0.0.1:3000`)
and mcdoc (`100.95.252.120:38080`). Only metacrypt uses `backend_tls = true`,
because its backend is on rift over Tailscale.
#### 8. Restart mc-proxy (on svc)

```bash
sudo systemctl restart mc-proxy
```

This was messy. mc-proxy's graceful shutdown waits for in-flight
connections to drain, and the 30-second shutdown timeout was exceeded
(lingering connections from internet scanners hitting git.metacircular.net).
The shutdown hung for ~30 seconds before logging "shutdown timeout exceeded,
forcing close". systemd then moved to the `deactivating (stop-sigterm)` state.

Had to force it:

```bash
sudo systemctl kill mc-proxy
sleep 2
sudo systemctl start mc-proxy
```

After restart: `routes=5` (was 4 before mcq). Confirmed:

```bash
curl -sk https://mcq.metacircular.net/v1/health
# {"status":"ok"}
```

#### 9. Push documents (from vade)

Used the mcp-agent service account token (from
`~/data/downloads/service-account-76d35a82-77ca-422f-85a3-b9f9360d5164.token`)
to authenticate API calls. This is a long-lived JWT issued by MCIAS with
`admin` role, `exp` in 2027.

```bash
TOKEN=$(cat ~/data/downloads/service-account-*.token)

# Push MCP Architecture
python3 -c "
import json
body = open('mcp/ARCHITECTURE.md').read()
print(json.dumps({'title': 'MCP Architecture', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/mcp-architecture \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @-

# Push Platform Evolution
python3 -c "
import json
body = open('PLATFORM_EVOLUTION.md').read()
print(json.dumps({'title': 'Platform Evolution', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/platform-evolution \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @-

# Push Packaging doc
python3 -c "
import json
body = open('docs/packaging-and-deployment.md').read()
print(json.dumps({'title': 'Packaging and Deployment', 'body': body}))
" | curl -sk -X PUT https://mcq.metacircular.net/v1/documents/packaging-and-deployment \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @-
```

Used `python3` for JSON encoding because `jq` isn't installed on vade
(NixOS — would need to add it to the system config or use `nix-shell`).
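
The jq-free encoding pattern generalizes to any file; a minimal sketch (file name and title are made up):

```shell
# Build a {"title", "body"} JSON payload from a file using only python3.
# json.dumps handles the quoting and escaping that naive shell
# interpolation gets wrong (quotes, newlines, backslashes in the body).
printf '# Hello\n\nA body with "quotes" and\nnewlines.\n' > doc.md
python3 -c '
import json, sys
print(json.dumps({"title": sys.argv[1], "body": open(sys.argv[2]).read()}))
' "Sample Doc" doc.md
```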

All three documents pushed successfully. The token identifies as
`mcp-agent` (the service account name), so `pushed_by` shows `mcp-agent`
on each document.

### Subsequent Update: Tufte Theme

Kyle wanted a wider reading area (70%) and a Tufte-inspired theme. Updated
`web/static/style.css`:

- Serif font stack (Georgia, Palatino)
- Cream background (`#fffff8`)
- Italic headings, small-caps labels
- `width: 70%` on `.page-container` (was `max-width: 720px`)
- Minimal chrome — document list uses ruled lines instead of cards,
  tables use bottom borders only
- Mobile fallback: full width below 768px

Rebuilt and deployed the same way:

```bash
# On vade
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -trimpath -ldflags="-s -w -X main.version=v0.1.1" \
  -o mcq ./cmd/mcq
scp mcq kyle@svc:/tmp/mcq

# On svc
sudo install -m 0755 /tmp/mcq /usr/local/bin/mcq
sudo systemctl restart mcq
```

---

## State After This Session

### What's running on svc

| Service | Port | Managed by | Notes |
|---------|------|------------|-------|
| MCIAS | :8443/:9443 | systemd | Identity/auth, been here longest |
| mc-proxy | :443 | systemd | L7 TLS termination, 5 routes |
| MCNS | :53/:8444/:9444 | systemd | Authoritative DNS |
| Gitea | :3000 | systemd | Git hosting |
| **mcq** | **:8090** | **systemd** | **NEW: document queue** |

None of these are MCP-managed. svc has no MCP agent.

### mc-proxy routes on svc

| Hostname | Backend | Mode | TLS Backend |
|----------|---------|------|-------------|
| metacrypt.metacircular.net | 100.95.252.120:18080 | L7 | yes (rift) |
| git.metacircular.net | 127.0.0.1:3000 | L7 | no |
| git.wntrmute.dev | 127.0.0.1:3000 | L7 | no |
| docs.metacircular.net | 100.95.252.120:38080 | L7 | no |
| **mcq.metacircular.net** | **127.0.0.1:8090** | **L7** | **no** |

### DNS

`mcq.metacircular.net` is a CNAME to `svc.metacircular.net` (set up by
Kyle at the DNS registrar before this session). mc-proxy's SNI-based
routing handles the rest.

### Documents in queue

| Slug | Title | Pushed By |
|------|-------|-----------|
| mcp-architecture | MCP Architecture | mcp-agent |
| platform-evolution | Platform Evolution | mcp-agent |
| packaging-and-deployment | Packaging and Deployment | mcp-agent |

### Git

Repo: `mc/mcq` on Gitea (`git.wntrmute.dev:mc/mcq.git`)

Commits:
1. `bc16279` — Initial implementation
2. `648e9dc` — Support plain HTTP mode for mc-proxy L7 deployment
3. `a5b90b6` — Switch to Tufte-inspired reading theme

---

## What Would Be Different with MCP

If svc had an MCP agent and mcq were containerized:

1. **No manual SSH** — `mcp deploy mcq` from vade would push the service
   definition, and the agent would pull the image from MCR.
2. **No manual port picking** — the agent assigns a free port from 10000–60000.
3. **No manual mc-proxy route** — the agent calls mc-proxy's gRPC API to
   register the route (Phase B, already working on rift).
4. **No manual TLS cert** — the agent provisions from the Metacrypt CA
   (Phase C, already working on rift).
5. **No manual systemd unit** — the agent manages the container lifecycle.
6. **`mcp ps` would show mcq** — because the agent tracks it in its
   registry.
7. **`mcp stop mcq` / `mcp restart mcq` would work** — standard lifecycle.

The gap is: svc has no agent. That's Phase E work (items #10–#12 in
PLATFORM_EVOLUTION.md). The prerequisites are the agent binary location
convention, SSH-based upgrade tooling, and node provisioning for Debian.

---

## Rough Edges and Lessons

1. **MCIAS cert hostname**: Every new service on svc will hit this. The
   MCIAS cert doesn't include localhost as a SAN. Services must use
   `server_url = "https://mcias.metacircular.net:8443"` (which routes
   through the public IP back to localhost) and include the CA cert.
   Could be fixed by reissuing the MCIAS cert with a localhost SAN.

2. **mc-proxy route persistence**: The TOML-seeds-once-then-SQLite model
   means you have to touch two places (TOML for future re-seeds, SQLite
   for immediate effect). On rift this is handled by the agent's gRPC
   calls. On svc, without an agent, it's manual database surgery.

3. **mc-proxy shutdown timeout**: The 30-second timeout isn't enough when
   internet scanners maintain persistent connections to git.metacircular.net.
   Had to force-kill on restart. Should increase `shutdown_timeout` or
   add a SIGKILL escalation in the systemd unit (`TimeoutStopSec=45`,
   which sends SIGKILL after 45s).
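
   A drop-in along these lines would implement the escalation (untested
   sketch; the path and value are suggestions, not something deployed):

   ```ini
   # /etc/systemd/system/mc-proxy.service.d/stop-timeout.conf
   [Service]
   # After 45s of SIGTERM with no exit, systemd escalates to SIGKILL,
   # so a hung connection drain can no longer wedge a restart.
   TimeoutStopSec=45
   ```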

4. **No jq on vade**: NixOS doesn't have jq in the default system config.
   Used python3 as a workaround for JSON encoding. Minor friction.

5. **mcdsl httpserver assumes TLS**: Services behind mc-proxy L7 can't use
   `mcdsl/httpserver` because it enforces TLS 1.3. mcdoc solved this with
   its own config/server; mcq now does the same. This is a recurring
   pattern — it might warrant adding a plain HTTP mode to mcdsl httpserver,
   or a separate `mcdsl/httpserver/plain` package.

6. **Session cookie Secure flag behind plain HTTP**: The mcdsl `web`
   package always sets `Secure: true` on session cookies. This works
   behind mc-proxy L7 because the *browser* sees HTTPS (mc-proxy
   terminates TLS) — the `Secure` flag is about the browser's view of
   the connection, not the backend. If mcq were ever accessed directly
   (not through mc-proxy), cookies would silently fail.
231 log/2026-04-03-uid-incident.md (new file)
@@ -0,0 +1,231 @@

# Incident Report: UID Change Cascading Failure

**Date**: 2026-04-03
**Duration**: ~2 hours (08:45–10:45 PDT)
**Severity**: Full platform outage on rift (all containers lost)
**Root cause**: Changing the `mcp` system user UID from 995 to 850

## Timeline

### Background

Orion was being provisioned as a new worker node. Its NixOS config
imports `mcp.nix`, which pins the mcp user to UID 995. On orion, UID 995
was already assigned to the `sshd` user, causing a UID collision:

```
uid=995(sshd) gid=988(mcp) groups=988(mcp),62(systemd-journal),992(sshd)
```

Both `sshd` and `mcp` had UID 995 on orion. The `newuidmap` tool
rejected rootless podman operations because the calling process's UID
(995) belonged to `sshd`, not `mcp`, in `/etc/passwd`.
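
Collisions like this are easy to scan for up front. A sketch against a made-up sample file (on a real node, point it at `/etc/passwd`):

```shell
# Flag any UID assigned to more than one account.
cat > passwd.sample <<'EOF'
root:x:0:0::/root:/bin/sh
sshd:x:995:992::/var/empty:/bin/false
mcp:x:995:988::/home/mcp:/bin/sh
EOF
awk -F: '{count[$3]++; names[$3] = names[$3] " " $1}
         END {for (uid in count) if (count[uid] > 1)
                print "duplicate uid " uid ":" names[uid]}' passwd.sample
```

Running this across all nodes before pinning a UID in `mcp.nix` would have caught the orion conflict before provisioning.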

### The UID Change

To resolve the collision, `mcp.nix` was updated to pin UID 850 (in
the 800–899 range, empty on all nodes). Both rift and orion were
rebuilt with `nixos-rebuild switch`.

**Problem 1: NixOS doesn't change UIDs for existing users.** The
rebuild created the NixOS config with `uid = 850`, but the existing
`mcp` user on both nodes kept UID 995. A manual `usermod -u 850 mcp`
was required on each node.

**Problem 2: Rootless podman caches the UID everywhere.**
- Podman's SQLite database (`db.sql`) stores absolute paths like
  `/run/user/995/libpod/tmp` and `/run/user/995/containers`
- The systemd user session (`/run/user/995/`) is tied to the UID
- subuid/subgid mappings reference the user by name, but the kernel
  checks the actual UID
- Container storage overlay directories have file ownership based on
  the old UID namespace mapping (995 → 100000)
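
For reference, the rootless mapping works roughly like this: container UID 0 maps to the account's own UID, and container UID n (n >= 1) maps to subuid base + n - 1. A toy calculation with a hypothetical subuid entry:

```shell
# Hypothetical /etc/subuid entry for the mcp user.
printf 'mcp:100000:65536\n' > subuid.example
base=$(awk -F: '$1 == "mcp" {print $2}' subuid.example)
owner_uid=995                         # the old mcp UID on the host
for cuid in 0 1 1000; do
  if [ "$cuid" -eq 0 ]; then
    host=$owner_uid                   # container root = the account itself
  else
    host=$((base + cuid - 1))         # everything else comes from subuid
  fi
  echo "container uid $cuid -> host uid $host"
done
```

Changing the owner UID shifts the first line of that mapping but leaves files already on disk owned by the old UID, which is why the overlay ownership broke.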

After changing the UID, `podman` operations failed with:

```
newuidmap: write to uid_map failed: Operation not permitted
```

### The Reboot

Rift was rebooted to get a clean systemd user session for UID 850.
The reboot succeeded, but **all containers were gone**:

```
$ podman ps -a
(empty)
```

Podman's database was recreated fresh on boot because the old database
referenced paths under `/run/user/995/`, which no longer existed. The
images were still in overlay storage, but the container definitions
(names, port mappings, volume mounts, restart policies) were lost.

### DNS Collapse

MCNS (the authoritative DNS server for `.svc.mcp.metacircular.net`)
ran as a container on rift. When all containers were lost, DNS
resolution broke:

- `mcq.svc.mcp.metacircular.net` → no answer
- MCNS also served as a recursive resolver for the LAN
- `google.com` → NXDOMAIN on machines using MCNS as their resolver

Tailscale DNS (MagicDNS) was also affected because resolved's global
DNS config pointed to MCNS. Tailscale itself remained functional
(its coordination servers are external), but hostname resolution via
Tailscale DNS names failed.

The operator turned off Tailscale on vade (the workstation) because
Tailscale's MagicDNS was routing ALL DNS queries through the broken
MCNS resolver — external services, including Claude Code and Gitea,
were unreachable. Disabling Tailscale was the only way to restore
external DNS resolution. However, this also broke connectivity to
rift, since the MCP agent binds to the Tailnet IP only
(`100.95.252.120:9444`).
### Recovery

**Step 1**: Turn Tailscale back on (on both rift and vade). Tailscale
connectivity works without MCNS — MagicDNS uses Tailscale's own
servers for `.ts.net` names.

**Step 2**: Start MCNS manually via `podman run`. The image was cached
in overlay storage. MCNS needed explicit port bindings (not `--network
host`) because systemd-resolved holds port 53 on localhost:

```bash
podman run -d --name mcns --restart unless-stopped \
  -p 192.168.88.181:53:53/tcp -p 192.168.88.181:53:53/udp \
  -p 100.95.252.120:53:53/tcp -p 100.95.252.120:53:53/udp \
  -p 127.0.0.1:38443:8443 \
  -v /srv/mcns:/srv/mcns \
  mcr.svc.mcp.metacircular.net:8443/mcns:v1.2.0 \
  server --config /srv/mcns/mcns.toml
```

DNS resolution was restored within seconds.

**Step 3**: Start the remaining services manually via `podman run`. Images
were all cached. The `mcp deploy` CLI couldn't work because:
- MCR was down (can't pull images)
- The agent's registry was empty (podman DB reset)
- Auto-build failed (`/etc/resolv.conf` permission denied in build
  containers)

Each service was started with explicit `podman run` commands matching
the service definitions in `~/.config/mcp/services/*.toml`.

**Step 4**: Fix file ownership for rootless podman. Files in `/srv/*`
were owned by UID 850 (the mcp user on the host). Inside containers,
UID 0 (root) maps to host UID 850 via subuid. But:

- `podman unshare chown -R 0:0 /srv/<service>` translated ownership
  to match the container's user namespace
- SQLite's `PRAGMA journal_mode = WAL` requires creating WAL/SHM files
  in the database directory
- modernc.org/sqlite calls `fchmod()` on the database file, which is
  denied inside rootless podman user namespaces (even for UID 0 in the
  namespace)
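
The WAL point is easy to demonstrate with a scratch database (this uses python3's bundled sqlite3 rather than modernc.org/sqlite, but the file layout is the same):

```shell
# In WAL mode an open writer keeps -wal and -shm sidecars next to the
# database, so the service user needs write access to the whole
# directory, not just the .db file itself.
dir=$(mktemp -d)
python3 - "$dir/app.db" <<'EOF'
import os, sqlite3, sys
con = sqlite3.connect(sys.argv[1])
con.execute("PRAGMA journal_mode = WAL")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()
# While the connection is open, all three files exist on disk.
print(sorted(os.listdir(os.path.dirname(sys.argv[1]))))
EOF
```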

**Step 5**: Delete and recreate SQLite databases. The `fchmod` denial
was fatal for MCR and Metacrypt. The fix:

```bash
# Stop the container
podman stop metacrypt-api
# Delete the database (WAL and SHM too)
rm -f /srv/metacrypt/metacrypt.db*
# Restart — the service recreates the database
podman start metacrypt-api
```

The `fchmod` error still occurs on the newly created database but is
non-fatal — the service logs a warning and continues.

**Data loss**: The MCR and Metacrypt databases were deleted and recreated
empty. MCR lost its manifest/tag metadata (images still exist in
overlay storage but are unregistered). Metacrypt lost its CA state
(encrypted keys, issued-cert tracking). Other services (mcq, mcdoc,
etc.) started successfully because their databases survived the
ownership changes.
## Root Causes

1. **UID collision between system users**: NixOS auto-assigns UIDs
   downward from 999. Pinning UID 995 for mcp collided with sshd on
   orion.

2. **Rootless podman's deep UID dependency**: Changing a user's UID
   after rootless podman has been used requires:
   - Updating podman's internal database paths
   - Recreating the systemd user session
   - Fixing subuid/subgid mappings
   - Fixing overlay storage ownership
   - Fixing service data file ownership
   - None of these happen automatically

3. **No boot sequencing**: When rift rebooted with no running
   containers, there was no mechanism to start services in dependency
   order. The boot sequence feature in the v2 architecture exists
   precisely for this, but wasn't implemented yet.

4. **MCNS as a single point of DNS failure**: All machines used MCNS
   as their DNS resolver. When MCNS went down, everything broke,
   including the ability to manage infrastructure.

5. **modernc.org/sqlite `fchmod` in rootless podman**: The SQLite
   library calls `fchmod()` on database files, which is denied inside
   rootless podman user namespaces. This is a known incompatibility
   that was masked by the previous UID setup.
## Lessons Learned

1. **Never change a rootless podman user's UID.** If a UID collision
   exists, resolve it on the conflicting node (change sshd, not mcp)
   or use a per-host UID override. Changing the UID after podman has
   been used is destructive.

2. **DNS must not be a single point of failure.** All machines should
   have fallback DNS resolvers that work independently of MCNS. The
   NixOS config should list public resolvers (1.1.1.1, 8.8.8.8) as
   fallbacks, not just MCNS.
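
   With systemd-resolved, one way to express that is a multi-server
   list, since resolved rotates to the next server when the current one
   stops answering (sketch only; on NixOS this would be set through the
   resolved module rather than edited by hand, and the MCNS address is
   taken from the recovery step above):

   ```ini
   # /etc/systemd/resolved.conf
   [Resolve]
   # MCNS first; public resolvers take over when it stops responding.
   DNS=100.95.252.120 1.1.1.1 8.8.8.8
   ```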

3. **Boot sequencing is critical.** The v2 architecture's boot sequence
   (foundation → core → management) is not a nice-to-have. Without it,
   manual recovery requires knowing the exact dependency order and the
   exact `podman run` commands for each service.

4. **The MCP agent should be able to recover containers from its
   registry.** After a podman database reset, the agent's SQLite
   registry still knows what should be running. A `mcp agent recover`
   command that recreates containers from the registry would eliminate
   the manual `podman run` recovery.

5. **Service definitions must include all runtime parameters.** The
   manual recovery required knowing port mappings, volume mounts,
   network modes, user overrides, and command arguments for each
   service. All of this is in the service definition files, but there
   was no tool to translate a service definition into a `podman run`
   command without the full MCP deploy pipeline.
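
   A small translator would close that gap. A sketch, assuming a flat
   TOML shape like the one below (the real service definition schema
   likely differs; needs python3 >= 3.11 for `tomllib`):

   ```shell
   # Hypothetical flat service definition.
   cat > mcq.toml <<'EOF'
   name = "mcq"
   image = "mcr.svc.mcp.metacircular.net:8443/mcq:v0.1.1"
   ports = ["127.0.0.1:8090:8090"]
   volumes = ["/srv/mcq:/srv/mcq"]
   args = ["server", "--config", "/srv/mcq/mcq.toml"]
   EOF

   python3 - mcq.toml <<'EOF'
   import sys, tomllib
   with open(sys.argv[1], "rb") as f:
       svc = tomllib.load(f)
   cmd = ["podman", "run", "-d", "--name", svc["name"], "--restart", "unless-stopped"]
   for p in svc.get("ports", []):
       cmd += ["-p", p]
   for v in svc.get("volumes", []):
       cmd += ["-v", v]
   cmd.append(svc["image"])
   cmd += svc.get("args", [])
   print(" ".join(cmd))   # the command to paste during recovery
   EOF
   ```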

6. **Tailscale MagicDNS amplifies DNS failures.** When MCNS is down
   and MagicDNS routes through it, ALL DNS breaks — not just internal
   names. Disabling Tailscale restores external DNS but loses Tailnet
   connectivity. The fix is fallback resolvers that bypass MCNS, not
   disabling Tailscale.
## Action Items

- [x] Write disaster recovery runbook → `docs/disaster-recovery.md`
- [x] Add fallback DNS resolvers to NixOS config → all nodes now have
      1.1.1.1 and 8.8.8.8 as fallbacks after MCNS
- [x] Implement `mcp agent recover` command → MCP v0.10.5. Recreates
      containers from the agent registry when the podman DB is lost.
- [x] Implement boot sequencing in the agent → MCP v0.10.6.
      `[[boot.sequence]]` config with per-stage health checks.
- [x] Fix modernc.org/sqlite `fchmod` → was our own `os.Chmod` in
      `mcdsl/db/db.go`, not sqlite. Made best-effort in mcdsl v1.8.0.
- [x] Add multi-address support to node config → MCP v0.10.4.
      Fallback addresses tried in order when the primary fails.
- [x] Stabilize mcp UID → pinned at 850 with a NEVER CHANGE comment