Initial import.

commit 168ceb2c07
2026-03-25 22:25:44 -07:00
6 changed files with 57369 additions and 0 deletions

.gitignore

@@ -0,0 +1,12 @@
# infrastructure / secrets
/ca
# project directories: these are separate git repos
/mcat
/mcias
/mc-proxy
/mcr
/metacrypt
/mcdsl
/mcns

CLAUDE.md

@@ -0,0 +1,76 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
Metacircular is a multi-service personal infrastructure platform. This root repository is a workspace container — each subdirectory is a separate Git repo (gitignored here). The authoritative platform-wide standards live in `engineering-standards.md`.
## Project Map
| Directory | Purpose | Language |
|-----------|---------|----------|
| `mcias/` | Identity and Access Service — central SSO/IAM, all other services delegate auth here | Go |
| `metacrypt/` | Cryptographic service engine — encrypted secrets, PKI/CA, SSH CA, transit encryption | Go |
| `mc-proxy/` | TLS proxy and router — L4 passthrough or L7 terminating, PROXY protocol, firewall | Go |
| `mcr/` | OCI container registry — integrated with MCIAS for auth and policy-based push/pull | Go |
| `mcat/` | MCIAS login policy tester — lightweight web app to test and audit login policies | Go |
| `mcdsl/` | Standard library — shared packages for auth, db, config, TLS servers, CSRF, snapshots | Go |
| `ca/` | PKI infrastructure and secrets for dev/test (not source code, gitignored) | — |
Each subproject has its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`. When working in a subproject, read its own CLAUDE.md first.
## Service Dependencies
MCIAS is the root dependency — every other service authenticates through it. No service maintains its own user database. The dependency graph:
```
mcias (standalone — no MCIAS dependency)
├── metacrypt (uses MCIAS for auth)
├── mc-proxy (uses MCIAS for admin auth)
├── mcr (uses MCIAS for auth + policy)
└── mcat (tests MCIAS login policies)
```
## Standard Build Commands (all subprojects)
```bash
make all # vet → lint → test → build (the CI pipeline)
make build # go build ./...
make test # go test ./...
make vet # go vet ./...
make lint # golangci-lint run ./...
make proto # regenerate gRPC code from .proto files
make proto-lint # buf lint + buf breaking
make devserver # build and run locally against srv/ config
make docker # build container image
make clean # remove binaries
```
Run a single test: `go test ./internal/auth/ -run TestTokenValidation`
## Critical Rules
1. **REST/gRPC sync**: Every REST endpoint must have a corresponding gRPC RPC, updated in the same change.
2. **gRPC interceptor maps**: New RPCs must be added to `authRequiredMethods`, `adminRequiredMethods`, and/or `sealRequiredMethods`. Forgetting this is a security defect.
3. **No CGo in production**: All builds use `CGO_ENABLED=0`. Use `modernc.org/sqlite`, not `mattn/go-sqlite3`.
4. **No test frameworks**: Use stdlib `testing` only. Real SQLite in `t.TempDir()`, no mocks for databases.
5. **Default deny**: Unauthenticated and unauthorized requests are always rejected. Admin detection comes solely from the MCIAS `admin` role.
6. **Proto versioning**: Start at v1. Only create v2 for breaking changes. Non-breaking additions go in-place.
## Architecture Patterns
- **Seal/Unseal**: Metacrypt starts sealed and requires a password to unlock (Vault-like pattern). Key hierarchy: Password → Argon2id → KWK → MEK → per-engine DEKs.
- **Web UI separation**: Web UIs run as separate binaries communicating with the API server via gRPC. No direct DB access from the web tier.
- **Config**: TOML with env var overrides (`SERVICENAME_*`). All runtime data in `/srv/<service>/`.
- **Policy engines**: Priority-based ACL rules, default deny, admin bypass. See metacrypt's implementation as reference.
- **Auth flow**: Client → service `/v1/auth/login` → MCIAS client library → MCIAS validates → bearer token returned. Token validation cached 30s keyed by SHA-256 of token.
## Tech Stack
- Go 1.25+, chi router, cobra CLI, go-toml/v2
- SQLite via modernc.org/sqlite (pure Go), WAL mode, foreign keys on
- gRPC + protobuf, buf for linting
- htmx + Go html/template for web UIs
- golangci-lint v2 with errcheck, gosec, staticcheck, revive
- TLS 1.3 minimum, AES-256-GCM, Argon2id, Ed25519

README.md

@@ -0,0 +1,104 @@
# Metacircular Dynamics
Metacircular Dynamics is a self-hosted personal infrastructure platform. The
name comes from the tradition of metacircular evaluators in Lisp — a system
defined in terms of itself — by way of SICP and Common Lisp projects that
preceded this work. The infrastructure is metacircular in the same sense: the
platform manages, secures, and hosts its own services.
Every component is self-hosted, every dependency is controlled, and the entire
stack is operable by one person. No cloud providers, no third-party auth, no
external databases. The platform is designed for a small number of machines — a
personal homelab or a handful of VPSes — not for hyperscale.
All services are written in Go and follow shared
[engineering standards](engineering-standards.md). Full platform documentation
lives in [docs/metacircular.md](docs/metacircular.md).
## Components
| Component | Purpose | Status |
|-----------|---------|--------|
| **MCIAS** | Identity and access — the root of trust. SSO, token issuance, role management, login policy. Every other service delegates auth here. | Implemented |
| **Metacrypt** | Cryptographic services — PKI/CA, transit encryption, encrypted secret storage behind a seal/unseal barrier. Issues TLS certificates for the platform. | Implemented |
| **MCR** | Container registry — OCI-compliant image storage with MCIAS auth and policy-controlled push/pull. | Implemented |
| **MC-Proxy** | Node ingress — TLS proxy and router. L4 passthrough or L7 terminating (per-route), PROXY protocol, firewall with rate limiting and GeoIP. | Implemented |
| **MCNS** | Networking — DNS and address management for the platform. | Planned |
| **MCP** | Control plane — operator-driven deployment, service registry, data transfer, master/agent container lifecycle. | Planned |
Shared library: **MCDSL** — standard library for all services (auth, db,
config, TLS server, CSRF, snapshots).
Supporting tool: **MCAT** — lightweight web app for testing MCIAS login
policies.
## Architecture
```
MCIAS (standalone — the root of trust)
├── Metacrypt (auth via MCIAS; provides certs to all services)
├── MCR (auth via MCIAS; stores images pulled by MCP)
├── MCNS (auth via MCIAS; provides DNS for the platform)
├── MCP (auth via MCIAS; orchestrates everything; owns service registry)
└── MC-Proxy (pre-auth; routes traffic to services behind it)
```
Each machine is an **MC Node**. On every node, **MC-Proxy** accepts outside
connections and routes by TLS SNI — either relaying raw TCP (L4) or
terminating TLS and reverse proxying HTTP/2 (L7), per-route. **MCP Agent** on
each node receives commands from **MCP Master** (which runs on the operator's
workstation) and manages containers via the local runtime. Core infrastructure
(MCIAS, Metacrypt, MCR) runs on nodes like any other workload.
```
┌──────────────────┐ ┌──────────────┐
│ Core Infra │ │ MCP Master │
│ (e.g. MCIAS) │ │ │
└────────┬─────────┘ └──────┬───────┘
│ │ C2
Outside ┌─────────────▼─────────────────────▼──────────┐
Client ────▶│ MC Node │
│ ┌───────────┐ │
│ │ MC-Proxy │──┬──────┬──────┐ │
│ └───────────┘ │ │ │ │
│ ┌───▼┐ ┌──▼─┐ ┌─▼──┐ ┌─────┐ │
│ │ α │ │ β │ │ γ │ │ MCP │ │
│          └────┘ └────┘ └────┘  │Agent│      │
│ └──┬──┘ │
│                              ┌────▼───┐      │
│                              │Runtime │      │
│                              │(Docker)│      │
│                              └────────┘      │
└──────────────────────────────────────────────┘
```
## Design Principles
- **Sovereignty** — self-hosted end to end; no SaaS dependencies
- **Simplicity** — SQLite over Postgres, stdlib testing, pure Go, htmx, single binaries
- **Consistency** — every service follows identical patterns (layout, config, auth, deployment)
- **Security as structure** — default deny, TLS 1.3 minimum, interceptor-map auth, encrypted-at-rest secrets
- **Design before code** — ARCHITECTURE.md is the spec, written before implementation
## Tech Stack
Go 1.25+, SQLite (modernc.org/sqlite), chi router, gRPC + protobuf, htmx +
Go html/template, golangci-lint v2, Ed25519/Argon2id/AES-256-GCM, TLS 1.3,
container-first deployment (Docker + systemd).
## Repository Structure
This root repository is a workspace container. Each subdirectory is a separate
Git repo with its own `CLAUDE.md`, `ARCHITECTURE.md`, `Makefile`, and `go.mod`:
```
metacircular/
├── mcias/ Identity and Access Service
├── metacrypt/ Cryptographic service engine
├── mcr/ Container registry
├── mc-proxy/ TLS proxy and router
├── mcat/ Login policy tester
├── mcdsl/ Standard library (shared packages)
├── ca/ PKI infrastructure (dev/test, not source code)
└── docs/ Platform-wide documentation
```

docs/metacircular.md

@@ -0,0 +1,927 @@
# Metacircular Infrastructure
## Background
Metacircular Dynamics is a personal infrastructure platform. The name comes
from the tradition of metacircular evaluators in Lisp — a system defined in
terms of itself — by way of SICP and Common Lisp projects that preceded this
work. The infrastructure is metacircular in the same sense: the platform
manages, secures, and hosts its own services.
The goal is sovereign infrastructure. Every component is self-hosted, every
dependency is controlled, and the entire stack is operable by one person. There
are no cloud provider dependencies, no third-party auth providers, no external
databases. When a Metacircular node boots, it connects to Metacircular services
for identity, certificates, container images, and workload scheduling.
All services are written in Go and follow a shared set of engineering standards
(see `engineering-standards.md`). The platform is designed for a small number of
machines — a personal homelab or a handful of VPSes — not for hyperscale.
## Philosophy
**Sovereignty.** You own the whole stack. Identity, certificates, secrets,
container images, DNS, networking — all self-hosted. No SaaS dependency means
no vendor lock-in, no surprise deprecations, and no trust delegation to third
parties.
**Simplicity over sophistication.** SQLite over Postgres. Stdlib `testing` over
test frameworks. Pure Go over CGo. htmx over React. Single-binary deployments
over microservice orchestrators. The right tool is the simplest one that solves
the problem without creating a new one.
**Consistency as leverage.** Every service follows identical patterns: the same
directory layout, the same Makefile targets, the same config format, the same
auth integration, the same deployment model. Knowledge of one service transfers
instantly to all others. A new service can be stood up by copying the skeleton.
**Security as structure.** Security is not a feature bolted on after the fact.
Default deny is the starting posture. TLS 1.3 is the minimum, not a goal.
Interceptor maps make "forgot to add auth" a visible, reviewable omission
rather than a silent runtime failure. Secrets are encrypted at rest behind a
seal/unseal barrier. Every service delegates identity to a single root of
trust.
**Design before code.** The architecture document is written before
implementation begins. It is the spec, not the afterthought. When the code and
the spec disagree, one of them has a bug.
## High-Level Overview
Metacircular infrastructure is built from six core components, plus a shared
standard library (**MCDSL**) that provides the common patterns all services
depend on (auth integration, database setup, config loading, TLS server
bootstrapping, CSRF, snapshots):
- **MCIAS** — Identity and access. The root of trust for all other services.
Handles authentication, token issuance, role management, and login policy
enforcement. Every other component delegates auth here.
- **Metacrypt** — Cryptographic services. PKI/CA, SSH CA, transit encryption,
and encrypted secret storage behind a Vault-inspired seal/unseal barrier.
Issues the TLS certificates that every other service depends on.
- **MCR** — Container registry. OCI-compliant image storage. MCP directs nodes
to pull images from MCR. Policy-controlled push/pull integrated with MCIAS.
- **MCNS** — Networking. DNS and address management for the platform.
- **MCP** — Control plane. The orchestrator. A master/agent architecture that
manages workload scheduling, container lifecycle, service registry, data
transfer, and node state across the platform.
- **MC-Proxy** — Node ingress. A TLS proxy and router that sits on every node,
accepts outside connections, and routes them to the correct service — either
as raw TCP passthrough or via TLS-terminating HTTP/2 reverse proxy.
These components form a dependency graph rooted at MCIAS:
```
MCIAS (standalone — the root of trust)
├── Metacrypt (uses MCIAS for auth; provides certs to all services)
├── MCR (uses MCIAS for auth; stores images pulled by MCP)
├── MCNS (uses MCIAS for auth; provides DNS for the platform)
├── MCP (uses MCIAS for auth; orchestrates everything; owns service registry)
└── MC-Proxy (pre-auth; routes traffic to services behind it)
```
### The Node Model
The unit of deployment is the **MC Node** — a machine (physical or virtual)
that participates in the Metacircular platform.
```
┌──────────────────┐ ┌──────────────┐
│ System / Core │ │ MCP │
│ Infrastructure │ │ Master │
│ (e.g. MCIAS) │ │ │
└────────┬─────────┘ └──────┬───────┘
│ │ C2
│ │
Outside ┌─────────────▼─────────────────────▼──────────┐
Client ────▶│ MC Node │
│ │
│ ┌───────────┐ │
│ │ MC-Proxy │──┬──────┬──────┐ │
│ └───────────┘ │ │ │ │
│ ┌───▼┐ ┌──▼─┐ ┌─▼──┐ ┌─────┐ │
│ │ α │ │ β │ │ γ │ │ MCP │ │
│          └────┘ └────┘ └────┘  │Agent│      │
│ └──┬──┘ │
│ ┌────▼───┐│
│ │Docker/ ││
│ │etc. ││
│ └────────┘│
└──────────────────────────────────────────────┘
```
Outside clients connect to **MC-Proxy**, which inspects the TLS SNI hostname
and routes to the correct service (α, β, γ) — either as a raw TCP relay or
via TLS-terminating HTTP/2 reverse proxy, per-route. The **MCP Agent** on each
node receives C2 commands from the **MCP Master** (running on the operator's
workstation) and manages local container lifecycle via the container runtime.
Core infrastructure services (MCIAS, Metacrypt, MCR) run on nodes like any
other workload.
### The Network Model
Metacircular nodes are connected via an **encrypted overlay network** — a
self-managed WireGuard mesh, Tailscale, or similar. No component has a hard
dependency on a specific overlay implementation; the platform requires only
that nodes can reach each other over encrypted links.
```
Public Internet
┌─────────▼──────────┐
│ Edge MC-Proxy │ VPS (public IP)
│ :443 │
└─────────┬──────────┘
│ PROXY protocol v2
┌─────────▼──────────────────────────────────┐
│ Encrypted Overlay (e.g. WireGuard) │
│ │
┌───────────┴──┐ ┌──────────┐ ┌──────────┐ ┌──────┴─────┐
│ Origin │ │ Node B │ │ Node C │ │ Operator │
│ MC-Proxy │ │ (MCP │ │ │ │ Workstation│
│ + services │ │ agent) │ │ (MCP │ │ (MCP │
│ (MCP agent) │ │ │ │ agent) │ │ Master) │
└──────────────┘ └──────────┘ └──────────┘ └────────────┘
```
**External traffic** flows from the internet through an edge MC-Proxy (on a
public VPS), which forwards via PROXY protocol over the overlay to an origin
MC-Proxy on the private network. The overlay preserves the real client IP
across the hop.
**Internal traffic** (MCP C2, inter-service communication, MCNS DNS) flows
directly over the overlay. MCP's C2 channel is gRPC over whatever link exists
between master and agent — the overlay provides the transport.
The overlay network itself is a candidate for future Metacircular management
(a self-hosted WireGuard mesh manager), consistent with the sovereignty
principle of minimizing third-party dependencies.
---
## System Catalog
### MCIAS — Metacircular Identity and Access Service
MCIAS is the root of trust for the entire platform. Every other service
delegates authentication to it; no service maintains its own user database.
**What it provides:**
- **Authentication.** Username/password with optional TOTP and FIDO2/WebAuthn.
Credentials are verified by MCIAS and a signed JWT bearer token is returned.
Services validate tokens by calling back to MCIAS (cached 30s by SHA-256 of
the token).
- **Role-based access.** Three roles — `admin` (full access, policy bypass),
`user` (policy-governed), `guest` (service-dependent restrictions). Admin
detection comes solely from the MCIAS `admin` role; services never promote
users locally.
- **Account types.** Human accounts (interactive users) and system accounts
(service-to-service). Both authenticate the same way; system accounts enable
automated workflows.
- **Login policy.** Priority-based ACL rules that control who can log into
which services. Rules can target roles, account types, service names, and
tags. This allows operators to restrict access per-service (e.g., deny
`guest` from services tagged `env:restricted`) without changing the
services themselves.
- **Token lifecycle.** Issuance, validation, renewal, and revocation.
Ed25519-signed JWTs. Short expiry with renewal support.
**How other services integrate:** Every service includes an `[mcias]` config
section with the MCIAS server URL, a `service_name`, and optional `tags`. At
login time, the service forwards credentials to MCIAS along with this context.
MCIAS evaluates login policy against the service context, verifies credentials,
and returns a bearer token. The MCIAS Go client library
(`git.wntrmute.dev/kyle/mcias/clients/go`) handles this flow.
**Status:** Implemented. v1.0.0 complete.
---
### Metacrypt — Cryptographic Service Engine
Metacrypt provides cryptographic resources to the platform through a modular
engine architecture, backed by an encrypted storage barrier inspired by
HashiCorp Vault.
**What it provides:**
- **PKI / Certificate Authority.** X.509 certificate issuance. Root and
intermediate CAs, certificate signing, CRL management, ACME protocol
support. This is how every service in the platform gets its TLS
certificates.
- **SSH CA.** (Planned.) SSH certificate signing for host and user
certificates, replacing static SSH key management.
- **Transit encryption.** (Planned.) Encrypt and decrypt data without exposing
keys to the caller. Envelope encryption for services that need to protect
data at rest without managing their own key material.
- **User-to-user encryption.** (Planned.) End-to-end encryption between users,
with key management handled by Metacrypt.
**Seal/unseal model:** Metacrypt starts sealed. An operator provides a password
which derives (via Argon2id) a key-wrapping key, which decrypts the master
encryption key (MEK), which in turn unwraps per-engine data encryption keys
(DEKs). Each engine mount gets its own DEK, limiting blast radius — compromise
of one engine's key does not expose another's data.
```
Password → Argon2id → KWK → [decrypt] → MEK → [unwrap] → per-engine DEKs
```
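The unwrap chain can be sketched with stdlib AES-256-GCM. One deliberate substitution for the sketch: the real KWK derivation is Argon2id (`golang.org/x/crypto/argon2`); a plain SHA-256 stand-in is used here so the example stays dependency-free. Do not use a bare hash as a password KDF in real code.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// deriveKWK is a stand-in for argon2.IDKey — illustration only.
func deriveKWK(password, salt []byte) []byte {
	h := sha256.Sum256(append(salt, password...))
	return h[:]
}

// seal wraps plaintext under key with AES-256-GCM, nonce prepended.
func seal(key, plaintext []byte) []byte {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	return gcm.Seal(nonce, nonce, plaintext, nil)
}

// unseal reverses seal; a wrong key fails GCM authentication.
func unseal(key, sealed []byte) ([]byte, error) {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	salt := []byte("per-install-salt")
	mek := make([]byte, 32) // master encryption key
	dek := make([]byte, 32) // one engine's data encryption key
	rand.Read(mek)
	rand.Read(dek)

	// At rest: MEK wrapped by the KWK, each DEK wrapped by the MEK.
	kwk := deriveKWK([]byte("operator password"), salt)
	wrappedMEK := seal(kwk, mek)
	wrappedDEK := seal(mek, dek)

	// Unseal: password → KWK → MEK → per-engine DEK.
	mek2, _ := unseal(deriveKWK([]byte("operator password"), salt), wrappedMEK)
	dek2, _ := unseal(mek2, wrappedDEK)
	fmt.Println("unsealed DEK matches:", string(dek2) == string(dek))
}
```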
**Engine architecture:** Engines are pluggable providers that register with a
central registry. Each engine mount has a type, a name, its own DEK, and its
own configuration. The engine interface handles initialization, seal/unseal
lifecycle, and request routing. New engine types plug in without modifying the
core.
**Policy:** Fine-grained ACL rules control which users can perform which
operations on which engine mounts. Priority-based evaluation, default deny,
admin bypass. See Metacrypt's `POLICY.md` for the full model.
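The evaluation model — priority ordering, default deny, admin bypass — can be sketched as below. The rule fields are assumptions for illustration, not Metacrypt's actual schema (see its `POLICY.md` for that).

```go
package main

import "fmt"

// Rule is a hypothetical ACL rule: lower Priority wins among matches.
type Rule struct {
	Priority int
	User     string // "*" matches any user
	Mount    string
	Op       string
	Allow    bool
}

type Request struct {
	User  string
	Admin bool
	Mount string
	Op    string
}

// Evaluate applies admin bypass, then the highest-priority (lowest number)
// matching rule, and denies by default when nothing matches.
func Evaluate(rules []Rule, req Request) bool {
	if req.Admin {
		return true // admin bypass
	}
	var best *Rule
	for i := range rules {
		r := &rules[i]
		if (r.User == "*" || r.User == req.User) && r.Mount == req.Mount && r.Op == req.Op {
			if best == nil || r.Priority < best.Priority {
				best = r
			}
		}
	}
	if best == nil {
		return false // default deny
	}
	return best.Allow
}

func main() {
	rules := []Rule{
		{Priority: 10, User: "*", Mount: "pki", Op: "sign", Allow: false},
		{Priority: 5, User: "kyle", Mount: "pki", Op: "sign", Allow: true},
	}
	fmt.Println(Evaluate(rules, Request{User: "kyle", Mount: "pki", Op: "sign"}))  // true
	fmt.Println(Evaluate(rules, Request{User: "guest", Mount: "pki", Op: "sign"})) // false
}
```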
**Status:** Implemented. CA engine complete with ACME support. SSH CA, transit,
and user-to-user engines planned.
---
### MCR — Metacircular Container Registry
MCR is an OCI Distribution Spec-compliant container registry. It stores and
serves the container images that MCP deploys across the platform.
**What it provides:**
- **OCI-compliant image storage.** Pull, push, tag, and delete container
images. Content-addressed by SHA-256 digest. Manifests and tags in SQLite,
blobs on the filesystem.
- **Authenticated access.** No anonymous access. MCR uses the OCI token
authentication flow: clients hit `/v2/`, receive a 401 with a token
endpoint, authenticate via MCIAS, and use the returned JWT for subsequent
requests.
- **Policy-controlled push/pull.** Fine-grained ACL rules govern who can push
to or pull from which repositories. Integrated with MCIAS roles.
- **Garbage collection.** Unreferenced blobs are cleaned up via the admin CLI
(`mcrctl`).
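The first step of that token flow — parsing the `WWW-Authenticate: Bearer` challenge returned by `/v2/` — can be sketched as follows. The parameter names (`realm`, `service`, `scope`) come from the OCI/Docker token auth convention; the URL in the example is a hypothetical deployment, and this naive parser assumes no commas inside quoted values.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBearerChallenge extracts key="value" parameters from a Bearer
// challenge header. Returns an empty map for non-Bearer schemes.
func parseBearerChallenge(header string) map[string]string {
	params := map[string]string{}
	rest, ok := strings.CutPrefix(header, "Bearer ")
	if !ok {
		return params
	}
	for _, kv := range strings.Split(rest, ",") {
		k, v, found := strings.Cut(strings.TrimSpace(kv), "=")
		if found {
			params[k] = strings.Trim(v, `"`)
		}
	}
	return params
}

func main() {
	h := `Bearer realm="https://mcr.example.net/token",service="mcr",scope="repository:tools/app:pull"`
	p := parseBearerChallenge(h)
	fmt.Println(p["realm"], p["service"])
	// Next steps in the flow: GET realm?service=...&scope=... with
	// credentials (validated via MCIAS), then retry the original request
	// with "Authorization: Bearer <jwt>".
}
```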
**How it fits in:** MCP directs nodes to pull images from MCR. When a workload
is scheduled, MCP tells the node's agent which image to pull and where to get
it. MCR sits behind an MC-Proxy instance for TLS routing.
**Status:** Implemented. Phase 12 (web UI) complete.
---
### MC-Proxy — TLS Proxy and Router
MC-Proxy is the ingress layer for every MC Node. It accepts TLS connections,
extracts the SNI hostname, and routes to the correct backend. Each route is
independently configured as either **L4 passthrough** (raw TCP relay, no TLS
termination) or **L7 terminating** (terminates TLS, reverse proxies HTTP/2 and
HTTP/1.1 including gRPC). Both modes coexist on the same listener.
**What it provides:**
- **SNI-based routing.** A route table maps hostnames to backend addresses.
Exact match, case-insensitive. Multiple listeners can bind different ports,
each with its own route table, all sharing the same global firewall.
- **Dual-mode proxying.** L4 routes relay raw TCP — backends see the original
TLS handshake, MC-Proxy adds nothing. L7 routes terminate TLS at the proxy
and reverse proxy HTTP/2 to backends (plaintext h2c or re-encrypted TLS),
with header injection (`X-Forwarded-For`, `X-Real-IP`), gRPC streaming
support, and trailer forwarding.
- **Global firewall.** Every connection is evaluated before routing: per-IP
rate limiting, IP/CIDR blocks, and GeoIP country blocks (MaxMind GeoLite2).
Blocked connections get a TCP RST — no error messages, no TLS alerts.
- **PROXY protocol.** Listeners can accept v1/v2 headers from upstream proxies
to learn the real client IP. Routes can send v2 headers to downstream
backends. This enables multi-hop deployments — a public edge MC-Proxy on a
VPS forwarding over the encrypted overlay to a private origin MC-Proxy —
while preserving the real client IP for firewall evaluation and logging.
- **Runtime management.** Routes and firewall rules can be updated at runtime
via a gRPC admin API on a Unix domain socket (filesystem permissions for
access control, no network exposure). State is persisted to SQLite with
write-through semantics.
**How it fits in:** MC-Proxy is pre-auth infrastructure. It sits in front of
everything on a node. Outside clients connect to MC-Proxy on well-known ports
(443, 8443, etc.) and MC-Proxy routes to the correct backend based on the
hostname the client is trying to reach. A typical production deployment uses
two instances — an edge proxy on a public VPS and an origin proxy on the
private network, connected over the overlay with PROXY protocol preserving
client IPs across the hop.
**Status:** Implemented.
---
### MCNS — Metacircular Networking Service
MCNS provides DNS for the platform. It manages the two internal zones and
serves as the name resolution layer for the Metacircular network; the
external zone is managed outside MCNS. Service discovery (which services run
where) is owned by MCP; MCNS translates those assignments into DNS records.
**What it will provide:**
- **Internal DNS.** MCNS is authoritative for the internal zones of the
Metacircular network. Three zones serve different purposes:
| Zone | Example | Purpose |
|------|---------|---------|
| `*.metacircular.net` | `metacrypt.metacircular.net` | External, public-facing. Managed outside MCNS (existing DNS). Points to edge MC-Proxy. |
| `*.mcp.metacircular.net` | `vade.mcp.metacircular.net` | Node addresses. Maps node names to their network addresses (e.g. Tailscale IPs). |
| `*.svc.mcp.metacircular.net` | `metacrypt.svc.mcp.metacircular.net` | Internal service addresses. Maps service names to the node and port where they currently run. |
The `*.mcp.metacircular.net` and `*.svc.mcp.metacircular.net` zones are
managed by MCNS. The external `*.metacircular.net` zone is managed separately
(existing DNS infrastructure) and is mostly static.
- **MCP integration.** MCP pushes DNS record updates to MCNS after deploy and
migrate operations. When MCP starts service α on node X, it calls the MCNS
API to set `α.svc.mcp.metacircular.net` to X's address. Services and clients
using internal DNS names automatically resolve to the right place without
config changes.
- **Record management API.** Authenticated via MCIAS. MCP is the primary
consumer for dynamic updates. Operators can also manage records directly
for static entries (node addresses, aliases).
**How it fits in:** MCNS answers "what is the address of X?" MCP answers "where
is service α running?" and pushes the answer to MCNS. This separation means
services can use stable DNS names in their configs (e.g.,
`mcias.svc.mcp.metacircular.net` in `[mcias] server_url`) that survive
migration without config changes.
**Status:** Not yet implemented.
---
### MCP — Metacircular Control Plane
MCP is the orchestrator. It manages what runs where across the platform. The
deployment model is operator-driven: the user says "deploy service α" and MCP
handles the rest. MCP Master runs on the operator's workstation; agents run on
each managed node.
**What it will provide:**
- **Service registry.** MCP is the source of truth for what is running where.
It tracks every service, which node it's on, and its current state. Other
components that need to find a service (including MC-Proxy for route table
updates) query MCP's registry.
- **Deploy.** The operator says "deploy α". MCP checks if α is already running
somewhere. If it is, MCP pulls the new container image on that node and
restarts the service in place. If it isn't running, MCP selects a node
(the operator can pin to a specific node but shouldn't have to), transfers
the initial config, pulls the image from MCR, starts the container, and
pushes a DNS update to MCNS (`α.svc.mcp.metacircular.net` → node address).
- **Migrate.** Move a service from one node to another. MCP snapshots the
service's `/srv/<service>/` directory on the source node (as a tar.zst
image), transfers it to the destination, extracts it, starts the service,
stops it on the source, and updates MCNS so DNS points to the new location.
The `/srv/<service>/` convention makes this uniform across all services.
- **Data transfer.** The C2 channel supports file-level operations between
master and agents: copy or fetch individual files (push a config, pull a
log), and transfer tar.zst archives for bulk snapshot/restore of service
data directories. This is the foundation for both migration and backup.
- **Service snapshots.** To snapshot `/srv/<service>/`, the agent runs
`VACUUM INTO` to create a consistent database copy, then builds a tar.zst
that includes the full directory but **excludes** live database files
(`*.db`, `*.db-wal`, `*.db-shm`) and the `backups/` directory. The
temporary VACUUM INTO copy is injected into the archive as `<service>.db`.
The result is a clean, minimal archive that extracts directly into a
working service directory on the destination.
- **Container lifecycle.** Start, stop, restart, and update containers on
nodes. MCP Master issues commands; agents on each node execute them against
the local container runtime (Docker, etc.).
- **Master/agent architecture.** MCP Master runs on the operator's machine.
Agents run on every managed node, receiving C2 (command and control) from
Master, reporting node status, and managing local workloads. The C2 channel
is authenticated via MCIAS. The master does not need to be always-on —
agents keep running their workloads independently; the master is needed only
to issue new commands.
- **Node management.** Track which nodes are in the platform, their health,
available resources, and running workloads.
- **Scheduling.** When placing a new service, MCP selects a node based on
available resources and any operator-specified constraints. The operator can
override with an explicit node, but the default is MCP's choice.
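The snapshot exclusion rule described above can be sketched as a filter over paths relative to `/srv/<service>/`. Only the filter logic is shown; the real agent applies it while building the tar.zst, and the `VACUUM INTO` call in the comment assumes the modernc.org/sqlite driver.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// excludeFromSnapshot reports whether a path (relative to /srv/<service>/)
// is omitted from the archive: live SQLite files and the backups/ tree.
// The VACUUM INTO copy is added back separately as <service>.db.
func excludeFromSnapshot(rel string) bool {
	if rel == "backups" || strings.HasPrefix(rel, "backups"+string(filepath.Separator)) {
		return true
	}
	switch filepath.Ext(rel) {
	case ".db", ".db-wal", ".db-shm":
		return true
	}
	return false
}

func main() {
	// Consistent copy first (driver-side, e.g. modernc.org/sqlite):
	//   db.Exec(`VACUUM INTO '/tmp/snap/mcx.db'`)
	// Then walk the service directory, skipping excluded entries.
	for _, p := range []string{
		"config.toml", "mcx.db", "mcx.db-wal", "backups/old.tar.zst", "certs/tls.pem",
	} {
		fmt.Printf("%-20s excluded=%v\n", p, excludeFromSnapshot(p))
	}
}
```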
**How it fits in:** MCP is the piece that ties everything together. MCIAS
provides identity, Metacrypt provides certificates, MCR provides images, MCNS
provides DNS, MC-Proxy provides ingress — MCP orchestrates all of it, owns the
map of what is running where, and pushes updates to MCNS so DNS stays current. It is the system that makes the
infrastructure metacircular: the control plane deploys and manages the very
services it depends on.
**Container-first design:** All Metacircular services are built as containers
(multi-stage Docker builds, Alpine runtime, non-root) specifically so that MCP
can deploy them. The systemd unit files exist as a fallback and for bootstrap —
the long-term deployment model is MCP-managed containers.
**Status:** Not yet implemented.
---
### MCAT — MCIAS Login Policy Tester
MCAT is a lightweight diagnostic tool, not a core infrastructure component. It
presents a web login form, forwards credentials to MCIAS with a configurable
`service_name` and `tags`, and shows whether the login was accepted or denied
by policy. This lets operators verify that login policy rules behave as
expected without touching the target service.
**Status:** Implemented.
---
## Bootstrap Sequence
Bringing up a Metacircular platform from scratch requires careful ordering
because of the circular dependencies — the infrastructure manages itself, but
must exist before it can do so. The key challenge is that nearly every service
needs TLS certificates (from Metacrypt) and authentication (from MCIAS), but
those services themselves need to be running first.
During bootstrap, all services run as **systemd units** on a single bootstrap
node. MCP takes over lifecycle management as the final step.
### Prerequisites
Before any service starts, the operator needs:
- **The bootstrap node** — a machine (VPS, homelab server, etc.) with the
overlay network configured and reachable.
- **Seed PKI** — MCIAS and Metacrypt need TLS certs to start, but Metacrypt
isn't running yet to issue them. The root CA is generated manually using
`github.com/kisom/cert` and stored in the `ca/` directory in the workspace.
Initial service certificates are issued from this root. The root CA is then
imported into Metacrypt once it's running, so Metacrypt becomes the
authoritative CA for the platform going forward.
- **TOML config files** — each service needs its config in `/srv/<service>/`.
During bootstrap these are written manually. Later, MCP handles config
distribution.
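The seed-PKI step can be illustrated with the stdlib. The workspace actually uses the `github.com/kisom/cert` tool; this sketch shows equivalent `crypto/x509` calls for generating a self-signed root like the one kept in `ca/` (subject name and lifetime are placeholders).

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newRootCA generates a self-signed Ed25519 root CA certificate.
func newRootCA() (*x509.Certificate, ed25519.PrivateKey, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "Metacircular Seed Root CA"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	// Self-signed: template and parent are the same certificate.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, pub, priv)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, priv, err
}

func main() {
	ca, _, err := newRootCA()
	if err != nil {
		panic(err)
	}
	fmt.Println("root CA:", ca.Subject.CommonName, "IsCA:", ca.IsCA)
}
```

Initial service certificates are then issued from this root, and the root is later imported into Metacrypt's CA engine as described in the next section.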
### Startup Order
```
Phase 0: Seed PKI
Operator creates or obtains initial TLS certificates for MCIAS
and Metacrypt. Places them in /srv/mcias/certs/ and
/srv/metacrypt/certs/.
Phase 1: Identity
┌──────────────────────────────────────────────────────┐
│ MCIAS starts (systemd) │
│ - No dependencies on other Metacircular services │
│ - Uses seed TLS certificates │
│ - Operator creates initial admin account │
│ - Operator creates system accounts for other services│
└──────────────────────────────────────────────────────┘
Phase 2: Cryptographic Services
┌──────────────────────────────────────────────────────┐
│ Metacrypt starts (systemd) │
│ - Authenticates against MCIAS │
│ - Uses seed TLS certificates initially │
│ - Operator initializes and unseals │
│ - Operator creates CA engine, imports root CA from │
│ ca/, creates issuers │
│ - Can now issue certificates for all other services │
│ - Reissue MCIAS and Metacrypt certs from own CA │
│ (replace seed certs with Metacrypt-issued certs) │
└──────────────────────────────────────────────────────┘
Phase 3: Ingress
┌──────────────────────────────────────────────────────┐
│ MC-Proxy starts (systemd) │
│ - Static route table from TOML config │
│ - Routes external traffic to MCIAS, Metacrypt │
│ - No MCIAS auth (pre-auth infrastructure) │
│ - TLS certs for L7 routes from Metacrypt │
└──────────────────────────────────────────────────────┘
Phase 4: Container Registry
┌──────────────────────────────────────────────────────┐
│ MCR starts (systemd) │
│ - Authenticates against MCIAS │
│ - TLS certificates from Metacrypt │
│ - Operator pushes container images for all services │
│ (including MCIAS, Metacrypt, MC-Proxy themselves) │
└──────────────────────────────────────────────────────┘
Phase 5: DNS
┌──────────────────────────────────────────────────────┐
│ MCNS starts (systemd) │
│ - Authenticates against MCIAS │
│ - Operator configures initial DNS records │
│ (node addresses, service names) │
└──────────────────────────────────────────────────────┘
Phase 6: Control Plane
┌──────────────────────────────────────────────────────┐
│ MCP Agent starts on bootstrap node (systemd) │
│ MCP Master starts on operator workstation │
│ - Authenticates against MCIAS │
│ - Master registers the bootstrap node │
│ - Master imports running services into its registry │
│ - From here, MCP owns the service map │
│ - Services can be redeployed as MCP-managed │
│ containers (replacing the systemd units) │
└──────────────────────────────────────────────────────┘
```
### The Seed Certificate Problem
The circular dependency between MCIAS, Metacrypt, and TLS is resolved by
bootstrapping with a **manually generated root CA**:
1. The operator generates a root CA using `github.com/kisom/cert`. This root
and initial service certificates live in the `ca/` directory.
2. MCIAS and Metacrypt start with certificates issued from this external root.
3. Metacrypt comes up. The operator imports the root CA into Metacrypt's CA
engine, making Metacrypt the authoritative issuer under the same root.
4. Metacrypt can now issue and renew certificates for all services. The `ca/`
directory remains as the offline backup of the root material.
This is a one-time process. The root CA is generated once, imported once, and
from that point forward Metacrypt is the sole CA. MCP handles certificate
provisioning for all services.
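The seed step can be sketched with plain `openssl` (the platform actually uses the `github.com/kisom/cert` tool, and the real material lives in the workspace `ca/` directory — paths below are illustrative):

```shell
# Illustrative recreation of the seed PKI with openssl.
WORK=/tmp/seed-pki
mkdir -p "$WORK/ca" && cd "$WORK"

# 1. Root CA: self-signed P-256 certificate, long-lived, kept offline.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 -nodes \
  -keyout ca/root.key -out ca/root.pem -days 3650 \
  -subj "/CN=Metacircular Root CA"

# 2. Seed leaf certificate for MCIAS, signed directly by the root.
openssl req -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 -nodes \
  -keyout mcias.key -out mcias.csr -subj "/CN=mcias.metacircular.net"
openssl x509 -req -in mcias.csr -CA ca/root.pem -CAkey ca/root.key \
  -CAcreateserial -days 90 -out mcias.pem

# 3. Sanity-check the chain before deploying to /srv/mcias/certs/.
openssl verify -CAfile ca/root.pem mcias.pem
```

Once Metacrypt has imported `ca/root.key` and `ca/root.pem`, these seed leaves are replaced by Metacrypt-issued certificates under the same root.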
### Adding a New Node
Once the platform is bootstrapped, adding a node is straightforward:
1. Provision the machine and connect it to the overlay network.
2. Install the MCP agent binary.
3. Configure the agent with the MCP Master address and MCIAS credentials
(system account for the node).
4. Start the agent. It authenticates with MCIAS, connects to Master, and
reports as available.
5. The operator deploys workloads to it via MCP. MCP handles image pulls,
config transfer, certificate provisioning, and DNS updates.
### Disaster Recovery
If the bootstrap node is lost, recovery follows the same sequence as initial
bootstrap — but with data restored from backups:
1. Start MCIAS on a new node, restore its database from the most recent
`VACUUM INTO` snapshot.
2. Start Metacrypt, restore its database. Unseal with the original password.
The entire key hierarchy and all issued certificates are recovered.
3. Bring up the remaining services in order, restoring their databases.
4. Start MCP, which rebuilds its registry from the running services.
5. Update DNS (MCNS or external) to point to the new node.
Every service's `snapshot` CLI command and daily backup timer exist specifically
to make this recovery possible. The `/srv/<service>/` convention means each
service's entire state is a single directory to back up and restore.
---
## Certificate Lifecycle
Every service in the platform requires TLS certificates, and Metacrypt is the
CA that issues them. This section describes how certificates flow from
Metacrypt to services, how they are renewed, and how the pieces fit together.
### PKI Structure
Metacrypt implements a **two-tier PKI**:
```
Root CA (self-signed, generated at engine initialization)
├── Issuer "infra" (intermediate CA for infrastructure services)
├── Issuer "services" (intermediate CA for application services)
└── Issuer "clients" (intermediate CA for client certificates)
```
The root CA signs intermediate CAs ("issuers"), which in turn sign leaf
certificates. Each issuer is scoped to a purpose. The root CA certificate is
the trust anchor — services and clients need it (or the relevant issuer chain)
to verify certificates presented by other services.
### ACME Protocol
Metacrypt implements an **ACME server** (RFC 8555) with External Account
Binding (EAB). This is the same protocol used by Let's Encrypt, meaning any
standard ACME client can obtain certificates from Metacrypt.
The ACME flow:
1. Client authenticates with MCIAS and requests EAB credentials from Metacrypt.
2. Client registers an ACME account using the EAB credentials.
3. Client places a certificate order (one or more domain names).
4. Metacrypt creates authorization challenges (HTTP-01 and DNS-01 supported).
5. Client fulfills the challenge (places a file for HTTP-01, or a DNS TXT
record for DNS-01).
6. Metacrypt validates the challenge and issues the certificate.
7. Client downloads the certificate chain and private key.
A **Go client library** (`metacrypt/clients/go`) wraps this entire flow:
MCIAS login, EAB fetch, account registration, challenge fulfillment, and
certificate download. Services that integrate this library can obtain and
renew certificates programmatically.
### How Services Get Certificates Today
Currently, certificates are provisioned through Metacrypt's **REST API or web
UI** and placed into each service's `/srv/<service>/certs/` directory. This is
a manual process — the operator issues a certificate, downloads it, and
deploys the files. The ACME client library exists but is not yet integrated
into any service.
### How It Will Work With MCP
MCP is the natural place to automate certificate provisioning:
- **Initial deploy.** When MCP deploys a new service, it can provision a
certificate from Metacrypt (via the ACME client library or the REST API),
transfer the cert and key to the node as part of the config push to
`/srv/<service>/certs/`, and start the service with valid TLS material.
- **Renewal.** MCP knows what services are running and when their certificates
expire. It can renew certificates before expiry by re-running the ACME flow
(or calling Metacrypt's `renew` operation) and pushing updated files to the
node. The service restarts with the new certificate.
- **Migration.** When MCP migrates a service, the certificate in
`/srv/<service>/certs/` moves with the tar.zst snapshot. If the service's
hostname changes (new node, new DNS name), MCP provisions a new certificate
for the new name.
- **MC-Proxy L7 routes.** MC-Proxy's L7 mode requires certificate/key pairs
for TLS termination. MCP (or the operator) can provision these from
Metacrypt and push them to MC-Proxy's cert directory. MC-Proxy's
architecture doc lists ACME integration and Metacrypt key storage as future
work.
### Trust Distribution
Every service and client that validates TLS certificates needs the root CA
certificate (or the relevant issuer chain). Metacrypt serves these publicly
without authentication:
- `GET /v1/pki/{mount}/ca` — root CA certificate (PEM)
- `GET /v1/pki/{mount}/ca/chain` — full chain: issuer + root (PEM)
- `GET /v1/pki/{mount}/issuer/{name}` — specific issuer certificate (PEM)
During bootstrap, the root CA cert is distributed manually (or via the `ca/`
directory in the workspace). Once MCP is running, it can distribute the CA
cert as part of service deployment. Services reference the CA cert path in
their `[mcias]` config section (`ca_cert`) to verify connections to MCIAS and
other services.
---
## End-to-End Deploy Workflow
This traces a deployment from code change to running service, showing how every
component participates. The example deploys a new version of service α that is
already running on Node B.
### 1. Build and Push
The operator builds a new container image and pushes it to MCR:
```
Operator workstation (vade)
$ docker build -t mcr.metacircular.net/α:v1.2.0 .
$ docker push mcr.metacircular.net/α:v1.2.0
MC-Proxy (edge) ──overlay──→ MC-Proxy (origin) ──→ MCR
Authenticates
via MCIAS
Policy check:
can this user
push to α?
Image stored
(blobs + manifest)
```
The `docker push` goes through MC-Proxy (SNI routing to MCR), authenticates
via the OCI token flow (which delegates to MCIAS), and is checked against
MCR's push policy. The image is stored content-addressed in MCR.
### 2. Deploy
The operator tells MCP to deploy:
```
Operator workstation (vade)
$ mcp deploy α # or: mcp deploy α --image v1.2.0
MCP Master
├── Registry lookup: α is running on Node B
├── C2 (gRPC over overlay) to Node B agent:
│ "pull mcr.metacircular.net/α:v1.2.0 and restart"
MCP Agent (Node B)
├── Pull image from MCR
│ (authenticates via MCIAS, same OCI flow)
├── Stop running container
├── Start new container from updated image
│ - Mounts /srv/α/ (config, database, certs all persist)
│ - Service starts, authenticates to MCIAS, resumes operation
└── Report status back to Master
```
Since α is already running on Node B, this is an in-place update. The
`/srv/α/` directory is untouched — config, database, and certificates persist
across the container restart.
### 3. First-Time Deploy
If α has never been deployed, MCP does more work:
```
Operator workstation (vade)
$ mcp deploy α --config α.toml
MCP Master
├── Registry lookup: α is not running anywhere
├── Scheduling: select Node C (best fit)
├── Provision TLS certificate from Metacrypt
│ (ACME flow or REST API)
├── C2 to Node C agent:
│ 1. Create /srv/α/ directory structure
│ 2. Transfer config file (α.toml → /srv/α/α.toml)
│ 3. Transfer TLS cert+key → /srv/α/certs/
│ 4. Transfer root CA cert → /srv/α/certs/ca.pem
│ 5. Pull image from MCR
│ 6. Start container
├── Update service registry: α → Node C
├── Push DNS update to MCNS:
α.svc.mcp.metacircular.net → Node C address
└── (Optionally) update MC-Proxy route table
if α needs external ingress
```
### 4. Migration
Moving α from Node B to Node C:
```
Operator workstation (vade)
$ mcp migrate α --to node-c # or let MCP choose the destination
MCP Master
├── C2 to Node B agent:
│ 1. Stop α container
│ 2. Snapshot /srv/α/ → tar.zst archive
│ 3. Transfer tar.zst to Master (or directly to Node C)
├── C2 to Node C agent:
│ 1. Receive tar.zst archive
│ 2. Extract to /srv/α/
│ 3. Pull container image from MCR (if not cached)
│ 4. Start container
│ 5. Report status
├── Update service registry: α → Node C
├── Push DNS update to MCNS:
α.svc.mcp.metacircular.net → Node C address
└── (If α had external ingress) update MC-Proxy route
or rely on DNS change
```
### What Each Component Does
| Step | MCIAS | Metacrypt | MCR | MC-Proxy | MCP | MCNS |
|------|-------|-----------|-----|----------|-----|------|
| Build/push image | Authenticates push | — | Stores image, enforces push policy | Routes traffic to MCR | — | — |
| Deploy (update) | Authenticates pull, authenticates service on start | — | Serves image to agent | Routes traffic to service | Coordinates: registry lookup, C2 to agent | — |
| Deploy (new) | Authenticates pull, authenticates service on start | Issues TLS certificate | Serves image to agent | Routes traffic to service (if external) | Coordinates: scheduling, cert provisioning, config transfer, DNS update | Updates DNS records |
| Migrate | Authenticates service on new node | Issues new cert (if hostname changes) | Serves image (if not cached) | Routes traffic to new location | Coordinates: snapshot, transfer, DNS update | Updates DNS records |
| Steady state | Validates tokens for every authenticated request | Serves CA certs publicly, renews certs | Serves image pulls | Routes all external traffic | Tracks service health, holds registry | Serves DNS queries |
---
## Future Ideas
Components and capabilities that may be worth building but have no immediate
timeline. Listed here to capture the thinking; none are committed.
### Observability — Log Collection and Health Monitoring
Every service already produces structured logs (`log/slog`) and exposes health
checks (gRPC `Health.Check` or REST status endpoints). What's missing is
aggregation — today, debugging a cross-service issue means SSH'ing into each
node and reading local logs.
A collector could:
- Gather structured logs from services on each node and forward them to a
central store.
- Periodically health-check local services and report status.
- Feed health data into MCP so it can make informed decisions (restart
unhealthy services, avoid scheduling on degraded nodes, alert the operator).
This might be a standalone service or an MCP agent capability, depending on
weight. If it's just "tail logs and hit health endpoints," it fits in the
agent. If it grows to include indexing, querying, retention policies, and
alerting rules, it's its own service.
### Object Store
The platform has structured storage (SQLite), blob storage scoped to container
images (MCR), and encrypted key-value storage (Metacrypt's barrier). It does
not have general-purpose object/blob storage.
Potential uses:
- **Centralized backups.** Service snapshots currently live on each node in
`/srv/<service>/backups/`. A central object store gives MCP somewhere to push
tar.zst snapshots for offsite retention.
- **Artifact storage.** Build outputs, large files, anything that doesn't fit
in a database row.
- **Data sharing between services.** Files that need to move between services
outside the MCP C2 channel.
Prior art: [Nebula](https://metacircular.net/pages/nebula.html), a
content-addressable data store with capability-based security (SHA-256
addressed blobs, UUID entries for versioning, proxy references for revocable
access). Prototyped in multiple languages. The capability model is interesting
but may be more sophistication than the platform needs — a simpler
authenticated blob store with MCIAS integration might suffice.
### Overlay Network Management
The platform currently relies on an external overlay network (WireGuard,
Tailscale, or similar) for node-to-node connectivity. A self-hosted WireGuard
mesh manager would bring the overlay under Metacircular's control:
- Automate key exchange and peer configuration when MCP adds a node.
- Manage IP allocation within the mesh (potentially absorbing part of MCNS's
scope).
- Remove the dependency on Tailscale's coordination servers.
This is a natural extension of the sovereignty principle but is low priority
while the mesh is small enough to manage by hand.
### Hypervisor / Isolation
A deeper exploration of environment isolation, message-passing between
services, and access mediation at a level below containers. Prior art:
[hypervisor concept](https://metacircular.net/pages/hypervisor.html). The
current platform achieves these goals through containers + MCIAS + policy
engines. A hypervisor layer would push isolation down to the OS level —
interesting for security but significant in scope. More relevant if the
platform ever moves beyond containers to VM-based workloads.
### Prior Art: SYSGOV
[SYSGOV](https://metacircular.net/pages/lisp-dcos.html) was an earlier
exploration of system management in Lisp, with SYSPLAN (desired state
enforcement) and SYSMON (service management). Many of its research questions —
C2 communication, service discovery, secure config distribution, failure
handling — are directly addressed by MCP's design. MCP is the spiritual
successor, reimplemented in Go with the benefit of the Metacircular platform
underneath it.
# Metacircular Dynamics — Engineering Standards
Source: https://metacircular.net/roam/20260314210051-metacircular_dynamics.html
This document describes the standard repository layout, tooling, and software
development lifecycle (SDLC) for services built at Metacircular Dynamics. It
incorporates the platform-wide project guidelines and codifies the conventions
established in Metacrypt as the baseline for all services.
## Platform Rules
These four rules apply to every Metacircular service:
1. **Data Storage**: All service data goes in `/srv/<service>/` to enable
straightforward migration across systems.
2. **Deployment Architecture**: Services require systemd unit files but
prioritize container-first design to support deployment via the
Metacircular Control Plane (MCP).
3. **Identity Management**: Services must integrate with MCIAS (Metacircular
Identity and Access Service) for user management and access control. Three
role levels: `admin` (full administrative access), `user` (full
non-administrative access), `guest` (service-dependent restrictions).
4. **API Design**: Services expose both gRPC and REST interfaces, kept in
sync. Web UIs are built with htmx.
## Table of Contents
0. [Platform Rules](#platform-rules)
1. [Repository Layout](#repository-layout)
2. [Language & Toolchain](#language--toolchain)
3. [Build System](#build-system)
4. [API Design](#api-design)
5. [Authentication & Authorization](#authentication--authorization)
6. [Database Conventions](#database-conventions)
7. [Configuration](#configuration)
8. [Web UI](#web-ui)
9. [Testing](#testing)
10. [Linting & Static Analysis](#linting--static-analysis)
11. [Deployment](#deployment)
12. [Documentation](#documentation)
13. [Security](#security)
14. [Development Workflow](#development-workflow)
---
## Repository Layout
Every service follows a consistent directory structure. Adjust the
service-specific directories (e.g. `engines/` in Metacrypt) as appropriate,
but the top-level skeleton is fixed.
```
.
├── cmd/
│ ├── <service>/ CLI entry point (server, subcommands)
│ └── <service>-web/ Web UI entry point (if separate binary)
├── internal/
│ ├── auth/ MCIAS integration (token validation, caching)
│ ├── config/ TOML configuration loading & validation
│ ├── db/ Database setup, schema migrations
│ ├── server/ REST API server, routes, middleware
│ ├── grpcserver/ gRPC server, interceptors, service handlers
│ ├── webserver/ Web UI server, template routes, HTMX handlers
│ └── <domain>/ Service-specific packages
├── proto/<service>/
│ └── v<N>/ Current proto definitions (start at v1;
│ increment only on breaking changes)
├── gen/<service>/
│ └── v<N>/ Generated Go gRPC/protobuf code
├── web/
│ ├── embed.go //go:embed directive for templates and static
│ ├── templates/ Go HTML templates
│ └── static/ CSS, JS (htmx)
├── deploy/
│ ├── docker/ Docker Compose configuration
│ ├── examples/ Example config files
│ ├── scripts/ Install, backup, migration scripts
│ └── systemd/ systemd unit files and timers
├── docs/ Internal engineering documentation
├── Dockerfile.api API server container (if split binary)
├── Dockerfile.web Web UI container (if split binary)
├── Makefile
├── buf.yaml Protobuf linting & breaking-change config
├── .golangci.yaml Linter configuration
├── .gitignore
├── CLAUDE.md AI-assisted development instructions
├── ARCHITECTURE.md Full system specification
└── <service>.toml.example Example configuration
```
### Key Principles
- **`cmd/`** contains only CLI wiring (cobra commands, flag parsing). No
business logic.
- **`internal/`** contains all service logic. Nothing in `internal/` is
importable by other modules — this is enforced by Go's module system.
- **`proto/`** is the source of truth for gRPC definitions. Generated code
lives in `gen/`, never edited by hand. Versions start at `v1`; a new
version directory is only created when a breaking change is required — not
as a naming convention or initial setup step.
- **`deploy/`** contains everything needed to run the service in production.
A new engineer should be able to deploy from this directory alone.
- **`web/`** is embedded into the binary via `//go:embed`. No external file
dependencies at runtime.
### What Does Not Belong in the Repository
- Runtime data (databases, certificates, logs) — these live in `/srv/<service>`
- Real configuration files with secrets — only examples are committed
- IDE configuration (`.idea/`, `.vscode/`) — per-developer, not shared
- Vendored dependencies — Go module proxy handles this
---
## Language & Toolchain
| Tool | Version | Purpose |
|------|---------|---------|
| Go | 1.25+ | Primary language |
| protoc + protoc-gen-go | Latest | Protobuf/gRPC code generation |
| buf | Latest | Proto linting and breaking-change detection |
| golangci-lint | v2 | Static analysis and linting |
| Docker | Latest | Container builds |
### Go Conventions
- **Pure-Go dependencies** where possible. Avoid CGo — it complicates
cross-compilation and container builds. Use `modernc.org/sqlite` instead
of `mattn/go-sqlite3`.
- **`CGO_ENABLED=0`** for all production builds. Statically linked binaries
deploy cleanly to Alpine containers.
- **Stripped binaries**: Build with `-trimpath -ldflags="-s -w"` to remove
debug symbols and reduce image size.
- **Version injection**: Pass `git describe --tags --always --dirty` via
`-X main.version=...` at build time. Every binary must report its version.
### Module Path
Services hosted on `git.wntrmute.dev` use:
```
git.wntrmute.dev/kyle/<service>
```
---
## Build System
Every repository has a Makefile with these standard targets:
```makefile
.PHONY: build test vet lint proto proto-lint clean docker all
LDFLAGS := -trimpath -ldflags="-s -w -X main.version=$(shell git describe --tags --always --dirty)"
<service>:
go build $(LDFLAGS) -o <service> ./cmd/<service>
build:
go build ./...
test:
go test ./...
vet:
go vet ./...
lint:
golangci-lint run ./...
proto:
protoc --go_out=. --go_opt=module=<module> \
--go-grpc_out=. --go-grpc_opt=module=<module> \
proto/<service>/v1/*.proto
proto-lint:
buf lint
buf breaking --against '.git#branch=master,subdir=proto'
clean:
rm -f <service>
docker:
docker build -t <service> -f Dockerfile.api .
all: vet lint test <service>
```
### Target Semantics
| Target | When to Run | CI Gate? |
|--------|-------------|----------|
| `vet` | Every change | Yes |
| `lint` | Every change | Yes |
| `test` | Every change | Yes |
| `proto-lint` | Any proto change | Yes |
| `proto` | After editing `.proto` files | No (manual) |
| `all` | Pre-push verification | Yes |
The `all` target is the CI pipeline: `vet → lint → test → build`. If any
step fails, the pipeline stops.
---
## API Design
Services expose two synchronized API surfaces:
### gRPC (Primary)
- Proto definitions live in `proto/<service>/v<N>/`, where N starts at 1.
- **Versioning policy**: proto packages are versioned to protect existing
clients from breaking changes. A new version directory (`v2/`, `v3/`, …)
is only introduced when a breaking change is unavoidable. Non-breaking
additions (new fields, new RPCs) are made in-place to the current version.
- Use strongly-typed, per-operation RPCs. Avoid generic "execute" patterns.
- Use `google.protobuf.Timestamp` for all time fields (not RFC 3339 strings).
- Run `buf lint` and `buf breaking` against master before merging proto
changes.
### REST (Secondary)
- JSON over HTTPS. Routes live in `internal/server/routes.go`.
- Use `chi` for routing (lightweight, stdlib-compatible).
- Standard error format: `{"error": "description"}`.
- Standard HTTP status codes: `401` (unauthenticated), `403` (unauthorized),
`412` (precondition failed), `503` (service unavailable).
### API Sync Rule
**Every REST endpoint must have a corresponding gRPC RPC, and vice versa.**
When adding, removing, or changing an endpoint in either surface, the other
must be updated in the same change. This is enforced in code review.
### gRPC Interceptors
Access control is enforced via interceptor maps, not per-handler checks:
| Map | Effect |
|-----|--------|
| `sealRequiredMethods` | Returns `UNAVAILABLE` if the service is sealed/locked |
| `authRequiredMethods` | Validates MCIAS bearer token, populates caller info |
| `adminRequiredMethods` | Requires admin role on the caller |
Adding a new RPC means adding it to the correct interceptor maps. Forgetting
this is a security defect.
---
## Authentication & Authorization
### Authentication
All services delegate authentication to **MCIAS** (Metacircular Identity and
Access Service). No service maintains its own user database.
- Client sends credentials to the service's `/v1/auth/login` endpoint.
- The service forwards them to MCIAS via the client library
(`git.wntrmute.dev/kyle/mcias/clients/go`).
- On success, MCIAS returns a bearer token. The service returns it to the
client and optionally sets it as a cookie for the web UI.
- Subsequent requests include the token via `Authorization: Bearer <token>`
header or cookie.
- Token validation calls MCIAS `ValidateToken()`. Results should be cached
(keyed by SHA-256 of the token) with a short TTL (30 seconds or less).
### Authorization
Three role levels:
| Role | Meaning |
|------|---------|
| `admin` | Full access to everything. Policy bypass. |
| `user` | Access governed by policy rules. Default deny. |
| `guest` | Service-dependent restrictions. Default deny. |
Admin detection is based solely on the MCIAS `admin` role. The service never
promotes users locally.
Services that need fine-grained access control should implement a policy
engine (priority-based ACL rules stored in encrypted storage, default deny,
admin bypass). See Metacrypt's implementation as the reference.
---
## Database Conventions
### SQLite
SQLite is the default database for Metacircular services. It is simple to
operate, requires no external processes, and backs up cleanly with
`VACUUM INTO`.
Connection settings (applied at open time):
```sql
PRAGMA journal_mode = WAL;
PRAGMA foreign_keys = ON;
PRAGMA busy_timeout = 5000;
```
File permissions: `0600`. Created by the service on first run.
### Migrations
- Migrations are Go functions registered in `internal/db/` and run
sequentially at startup.
- Each migration is idempotent — `CREATE TABLE IF NOT EXISTS`,
`ALTER TABLE ... ADD COLUMN IF NOT EXISTS`.
- Applied migrations are tracked in a `schema_migrations` table.
- Never modify a migration that has been deployed. Add a new one.
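The runner shape can be sketched against a minimal interface so no driver is needed (names are hypothetical; each service's real runner lives in `internal/db/` and reads the applied set from the `schema_migrations` table):

```go
package main

import "fmt"

// Execer is the slice of *sql.DB the runner needs.
type Execer interface {
	Exec(query string) error
}

type Migration struct {
	ID  string
	SQL string
}

// Migrate applies each migration exactly once, recording IDs in
// schema_migrations. Statements are themselves idempotent
// (CREATE TABLE IF NOT EXISTS ...) as a second line of defense.
func Migrate(db Execer, applied map[string]bool, migrations []Migration) error {
	if err := db.Exec(`CREATE TABLE IF NOT EXISTS schema_migrations (id TEXT PRIMARY KEY)`); err != nil {
		return err
	}
	for _, m := range migrations {
		if applied[m.ID] {
			continue // never re-run a deployed migration; new changes get a new one
		}
		if err := db.Exec(m.SQL); err != nil {
			return fmt.Errorf("migration %s: %w", m.ID, err)
		}
		applied[m.ID] = true
	}
	return nil
}

// fakeDB records executed statements, standing in for a real SQLite handle.
type fakeDB struct{ log []string }

func (f *fakeDB) Exec(q string) error { f.log = append(f.log, q); return nil }

func main() {
	db := &fakeDB{}
	applied := map[string]bool{}
	ms := []Migration{{ID: "0001_users", SQL: "CREATE TABLE IF NOT EXISTS users (id TEXT)"}}
	_ = Migrate(db, applied, ms)
	fmt.Println(len(db.log), applied["0001_users"])
}
```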
### Backup
Every service must provide a `snapshot` CLI command that creates a consistent
backup using `VACUUM INTO`. Automated backups run via a systemd timer
(daily, with retention pruning).
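What the `snapshot` command does under the hood, shown with the `sqlite3` CLI (paths here are illustrative; a real service snapshots `/srv/<service>/<service>.db` into `/srv/<service>/backups/`):

```shell
DB=/tmp/demo-service.db
BACKUP=/tmp/demo-backup.db
rm -f "$DB" "$BACKUP"

# Seed a database in WAL mode, as services configure it.
sqlite3 "$DB" "PRAGMA journal_mode=WAL;
CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT);
INSERT OR REPLACE INTO kv VALUES ('greeting', 'hello');"

# VACUUM INTO writes a consistent, defragmented copy without blocking
# writers and without the torn-file risk of copying the db mid-write.
sqlite3 "$DB" "VACUUM INTO '$BACKUP'"

# The snapshot is a complete, standalone database.
sqlite3 "$BACKUP" "SELECT v FROM kv WHERE k = 'greeting';"
```

Note that `VACUUM INTO` refuses to overwrite an existing file, so snapshot tooling must write to a fresh timestamped path (or remove the target first, as above).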
---
## Configuration
### Format
TOML. Parsed with `go-toml/v2`. Environment variable overrides via
`SERVICENAME_*` (e.g. `METACRYPT_SERVER_LISTEN_ADDR`).
### Standard Sections
```toml
[server]
listen_addr = ":8443" # HTTPS API
grpc_addr = ":9443" # gRPC (optional; disabled if unset)
tls_cert = "/srv/<service>/certs/cert.pem"
tls_key = "/srv/<service>/certs/key.pem"
[web]
listen_addr = "127.0.0.1:8080" # Web UI (optional; disabled if unset)
vault_grpc = "127.0.0.1:9443" # gRPC address of the API server
vault_ca_cert = "" # CA cert for verifying API server TLS
[database]
path = "/srv/<service>/<service>.db"
[mcias]
server_url = "https://mcias.metacircular.net:8443"
ca_cert = "" # Custom CA for MCIAS TLS
service_name = "<service>" # This service's identity, as registered in MCIAS
tags = [] # Tags sent with every login request (e.g. ["env:restricted"])
# MCIAS evaluates auth:login policy against these tags,
# enabling per-service login restrictions via policy rules.
[log]
level = "info" # debug, info, warn, error
```
#### Service context and login policy
`service_name` and `tags` in `[mcias]` are sent with every `POST /v1/auth/login`
request. MCIAS evaluates the `auth:login` action with the resource set to
`{service_name, tags}`. This allows operators to write deny rules that restrict
which roles or account types can log into specific services.
Example: deny `guest` and `viewer` human accounts from any service tagged
`env:restricted`:
```json
{
"effect": "deny",
"roles": ["guest", "viewer"],
"account_types": ["human"],
"actions": ["auth:login"],
"required_tags": ["env:restricted"]
}
```
A service can also be targeted by name instead of (or in addition to) tags:
```json
{
"effect": "deny",
"roles": ["guest"],
"actions": ["auth:login"],
"service_names": ["meta-money-printer"]
}
```
MCIAS enforces the policy after credentials are verified; a policy-denied
login returns HTTP 403 (not 401) so the client can distinguish a bad password
from a service access restriction.
### Validation
Required fields are validated at startup. The service refuses to start if
any are missing. Do not silently default required values.
### Data Directory
All runtime data lives in `/srv/<service>/`:
```
/srv/<service>/
├── <service>.toml Configuration
├── <service>.db SQLite database
├── certs/ TLS certificates
└── backups/ Database snapshots
```
This convention enables straightforward service migration between hosts:
copy `/srv/<service>/` and the binary.
---
## Web UI
### Technology
- **Go `html/template`** for server-side rendering. No JavaScript frameworks.
- **htmx** for dynamic interactions (form submission, partial page updates)
without full page reloads.
- Templates and static files are embedded in the binary via `//go:embed`.
### Structure
- `web/templates/layout.html` — shared HTML skeleton, navigation, CSS/JS
includes. All page templates extend this.
- Page templates: one `.html` file per page/feature.
- `web/static/` — CSS, htmx. Keep this minimal.
### Architecture
The web UI runs as a separate binary (`<service>-web`) that communicates
with the API server via its gRPC interface. This separation means:
- The web UI has no direct database access.
- The API server enforces all authorization.
- The web UI can be deployed independently or omitted entirely.
### Security
- CSRF protection via signed double-submit cookies on all mutating requests
(POST/PUT/PATCH/DELETE).
- Session cookie: `HttpOnly`, `Secure`, `SameSite=Strict`.
- All user input is escaped by `html/template` (the default).
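The signed double-submit scheme reduces to HMAC sign-and-compare — a sketch with invented helper names, not the mcdsl implementation:

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// issueToken returns "value.signature". The same token goes in both the
// cookie and the hidden form field; the HMAC proves the server minted it,
// so an attacker cannot forge a matching pair.
func issueToken(key []byte) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	val := hex.EncodeToString(buf)
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(val))
	return val + "." + hex.EncodeToString(mac.Sum(nil)), nil
}

// verifyToken checks that cookie and form values match and the signature
// verifies; it runs on every mutating request (POST/PUT/PATCH/DELETE).
func verifyToken(key []byte, cookie, form string) bool {
	if cookie == "" || !hmac.Equal([]byte(cookie), []byte(form)) {
		return false
	}
	val, sig, ok := strings.Cut(cookie, ".")
	if !ok {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(val))
	want, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	key := []byte("server-side-csrf-key")
	tok, _ := issueToken(key)
	fmt.Println(verifyToken(key, tok, tok))
}
```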
---
## Testing
### Philosophy
Tests are written using the Go standard library `testing` package. No test
frameworks (testify, gomega, etc.) — the standard library is sufficient and
keeps dependencies minimal.
### Patterns
```go
func TestFeatureName(t *testing.T) {
// Setup: use t.TempDir() for isolated file system state.
dir := t.TempDir()
database, err := db.Open(filepath.Join(dir, "test.db"))
if err != nil {
t.Fatalf("open db: %v", err)
}
defer func() { _ = database.Close() }()
	if err := db.Migrate(database); err != nil {
		t.Fatalf("migrate: %v", err)
	}
// Exercise the code under test.
// ...
// Assert with t.Fatal (not t.Error) for precondition failures.
if !bytes.Equal(got, want) {
t.Fatalf("got %q, want %q", got, want)
}
}
```
### Guidelines
- **Use `t.TempDir()`** for all file-system state. Never write to fixed
paths. Cleanup is automatic.
- **Use `errors.Is`** for error assertions, not string comparison.
- **No mocks for databases.** Tests use real SQLite databases created in
temp directories. This catches migration bugs that mocks would hide.
- **Test files** live alongside the code they test: `barrier.go` and
`barrier_test.go` in the same package.
- **Test helpers** call `t.Helper()` so failures report the caller's line.
### What to Test
| Layer | Test Strategy |
|-------|---------------|
| Crypto primitives | Roundtrip encryption/decryption, wrong-key rejection, edge cases |
| Storage (barrier, DB) | CRUD operations, sealed-state rejection, concurrent access |
| API handlers | Request/response correctness, auth enforcement, error codes |
| Policy engine | Rule matching, priority ordering, default deny, admin bypass |
| CLI commands | Flag parsing, output format (lightweight) |
---
## Linting & Static Analysis
### Configuration
Every repository includes a `.golangci.yaml` with this philosophy:
**fail loudly for security and correctness; everything else is a warning.**
### Required Linters
| Linter | Category | Purpose |
|--------|----------|---------|
| `errcheck` | Correctness | Unhandled errors are silent failures |
| `govet` | Correctness | Printf mismatches, unreachable code, suspicious constructs |
| `ineffassign` | Correctness | Dead writes hide logic bugs |
| `unused` | Correctness | Unused variables and functions |
| `errorlint` | Error handling | Proper `errors.Is`/`errors.As` usage |
| `gosec` | Security | Hardcoded secrets, weak RNG, insecure crypto, SQL injection |
| `staticcheck` | Security | Deprecated APIs, mutex misuse, deep analysis |
| `revive` | Style | Go naming conventions, error return ordering |
| `gofmt` | Formatting | Standard Go formatting |
| `goimports` | Formatting | Import grouping and ordering |
### Settings
- `errcheck`: `check-type-assertions: true` (catch `x.(*T)` without ok check).
- `govet`: all analyzers enabled except `shadow` (too noisy for idiomatic Go).
- `gosec`: severity and confidence set to `medium`. Exclude `G104` (overlaps
with errcheck).
- `max-issues-per-linter: 0` — report everything. No caps.
- Test files: allow `G101` (hardcoded credentials) for test fixtures.
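The settings above sketch out to a `.golangci.yaml` like the following (linter names are real; treat the exact keys as illustrative, since golangci-lint's schema varies between versions):

```yaml
linters:
  enable:
    - errcheck
    - govet
    - ineffassign
    - unused
    - errorlint
    - gosec
    - staticcheck
    - revive
    - gofmt
    - goimports

linters-settings:
  errcheck:
    check-type-assertions: true   # catch x.(*T) without the ok check
  govet:
    enable-all: true
    disable:
      - shadow                    # too noisy for idiomatic Go
  gosec:
    severity: medium
    confidence: medium
    excludes:
      - G104                      # unhandled errors — errcheck owns this

issues:
  max-issues-per-linter: 0        # report everything, no caps
  exclude-rules:
    - path: _test\.go
      linters: [gosec]
      text: G101                  # allow hardcoded credentials in test fixtures
```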
---
## Deployment
### Container-First
Services are designed for container deployment but must also run as native
systemd services. Both paths are first-class.
### Docker
Multi-stage builds:
1. **Builder**: `golang:1.23-alpine`. Compile with `CGO_ENABLED=0`, strip
symbols.
2. **Runtime**: `alpine:3.21`. Non-root user (`<service>`), minimal attack
surface.
If the service has separate API and web binaries, use separate Dockerfiles
(`Dockerfile.api`, `Dockerfile.web`) and a `docker-compose.yml` that wires
them together with a shared data volume.
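A multi-stage Dockerfile following this pattern (binary name and paths are placeholders):

```dockerfile
# --- Builder ---
FROM golang:1.23-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Static binary, stripped symbols.
RUN CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/service ./cmd/service

# --- Runtime ---
FROM alpine:3.21
RUN adduser -D -H -s /sbin/nologin service
COPY --from=builder /out/service /usr/local/bin/service
USER service
ENTRYPOINT ["/usr/local/bin/service", "server"]
```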
### systemd
Every service ships with:
| File | Purpose |
|------|---------|
| `<service>.service` | Main service unit (API server) |
| `<service>-web.service` | Web UI unit (if applicable) |
| `<service>-backup.service` | Oneshot backup unit |
| `<service>-backup.timer` | Daily backup timer (02:00 UTC, 5-minute jitter) |
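The backup timer encodes the 02:00 UTC schedule and five-minute jitter directly (a sketch; unit names follow the table above):

```ini
# <service>-backup.timer
[Unit]
Description=Daily <service> backup

[Timer]
OnCalendar=*-*-* 02:00:00 UTC
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target
```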
#### Security Hardening
All service units must include these security directives:
```ini
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
ReadWritePaths=/srv/<service>
```
The web UI unit should use `ReadOnlyPaths=/srv/<service>` instead of
`ReadWritePaths` — it has no reason to write to the data directory.
### Install Script
`deploy/scripts/install.sh` handles:
1. Create system user/group (idempotent).
2. Install binary to `/usr/local/bin/`.
3. Create `/srv/<service>/` directory structure.
4. Install example config if none exists.
5. Install systemd units and reload the daemon.
### TLS
- **Minimum TLS version: 1.3.** No exceptions, no fallback cipher suites.
Go's TLS 1.3 implementation manages cipher selection automatically.
- **Timeouts**: read 30s, write 30s, idle 120s.
- Certificate and key paths are required configuration — the service refuses
to start without them.
### Graceful Shutdown
Services handle `SIGINT` and `SIGTERM`, shutting down cleanly:
1. Stop accepting new connections.
2. Drain in-flight requests (with a timeout).
3. Clean up resources (close databases, zeroize secrets if applicable).
4. Exit.
---
## Documentation
### Required Files
| File | Purpose | Audience |
|------|---------|----------|
| `README.md` | Project overview, quick-start, and contributor guide | Everyone |
| `CLAUDE.md` | AI-assisted development context | Claude Code |
| `ARCHITECTURE.md` | Full system specification | Engineers |
| `RUNBOOK.md` | Operational procedures and incident response | Operators |
| `deploy/examples/<service>.toml` | Example configuration | Operators |
### Suggested Files
These are not required for every project but should be created where applicable:
| File | When to Include | Purpose |
|------|-----------------|---------|
| `AUDIT.md` | Services handling cryptography, secrets, PII, or auth | Security audit findings with issue tracking and resolution status |
| `POLICY.md` | Services with fine-grained access control | Policy engine documentation: rule structure, evaluation algorithm, resource paths, action classification, common patterns |
### README.md
The README is the front door. A new engineer or user should be able to
understand what the service does and get it running from this file alone.
It should contain:
- Project name and one-paragraph description.
- Quick-start instructions (build, configure, run).
- Link to `ARCHITECTURE.md` for full technical details.
- Link to `RUNBOOK.md` for operational procedures.
- License and contribution notes (if applicable).
Keep it concise. The README is not the spec — that's `ARCHITECTURE.md`.
### CLAUDE.md
This file provides context for AI-assisted development. It should contain:
- Project overview (one paragraph).
- Build, test, and lint commands.
- High-level architecture summary.
- Project structure with directory descriptions.
- Ignored directories (runtime data, generated code).
- Critical rules (e.g. API sync requirements).
Keep it concise. AI tools read this on every interaction.
### ARCHITECTURE.md
This is the canonical specification for the service. It should cover:
1. System overview with a layered architecture diagram.
2. Cryptographic design (if applicable): algorithms, key hierarchy.
3. State machines and lifecycle (if applicable).
4. Storage design.
5. Authentication and authorization model.
6. API surface (REST and gRPC, with tables of every endpoint).
7. Web interface routes.
8. Database schema (every table, every column).
9. Configuration reference.
10. Deployment guide.
11. Security model: threat mitigations table and security invariants.
12. Future work.
This document is the source of truth. When the code and the spec disagree,
one of them has a bug.
### RUNBOOK.md
The runbook is written for operators, not developers. It covers what to do
when things go wrong and how to perform routine maintenance. It should
contain:
1. **Service overview** — what the service does, in one paragraph.
2. **Health checks** — how to verify the service is healthy (endpoints,
CLI commands, expected responses).
3. **Common operations** — start, stop, restart, seal/unseal, backup,
restore, log inspection.
4. **Alerting** — what alerts exist, what they mean, and how to respond.
5. **Incident procedures** — step-by-step playbooks for known failure
modes (database corruption, certificate expiry, MCIAS outage, disk
full, etc.).
6. **Escalation** — when and how to escalate beyond the runbook.
Write runbook entries as numbered steps, not prose. An operator at 3 AM
should be able to follow them without thinking.
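For example, a disk-full playbook entry might look like this (illustrative content):

```markdown
#### Disk full on /srv/<service>

1. Confirm: `df -h /srv/<service>` shows ≥ 95% used.
2. Check for old snapshots: `ls -lh /srv/<service>/snapshots/`.
3. Delete snapshots older than 14 days (they also exist off-host).
4. If still full, rotate journal logs: `journalctl --vacuum-size=200M`.
5. Verify recovery using the health checks in section 2.
6. Still failing? Escalate per section 6.
```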
### AUDIT.md (Suggested)
For services that handle cryptography, secrets, PII, or authentication,
maintain a security audit log. Each finding gets a numbered entry with:
- Description of the issue.
- Severity (critical, high, medium, low).
- Resolution status: open, resolved (with summary), or accepted (with
rationale for accepting the risk).
The priority summary table at the bottom provides a scannable overview.
Resolved and accepted items are struck through but retained for history.
See Metacrypt's `AUDIT.md` for the reference format.
### POLICY.md (Suggested)
For services with a policy engine or fine-grained access control, document
the policy model separately from the architecture spec. It should cover:
- Rule structure (fields, types, semantics).
- Evaluation algorithm (match logic, priority, default effect).
- Resource path conventions and glob patterns.
- Action classification.
- API endpoints for policy CRUD.
- Common policy patterns with examples.
- Role summary (what each MCIAS role gets by default).
This document is aimed at administrators who need to write policy rules,
not engineers who need to understand the implementation.
### Engine/Feature Design Documents
For services with a modular architecture, each module gets its own design
document (e.g. `engines/sshca.md`). These are detailed implementation plans
that include:
- Overview and core concepts.
- Data model and storage layout.
- Lifecycle (initialization, teardown).
- Operations table with auth requirements.
- API definitions (gRPC and REST).
- Implementation steps (file-by-file).
- Security considerations.
- References to existing code patterns to follow.
Write these before writing code. They are the blueprint, not the afterthought.
---
## Security
### General Principles
- **Default deny.** Unauthenticated requests are rejected. Unauthorized
requests are rejected. If in doubt, deny.
- **Fail closed.** If the service cannot verify authorization, it denies the
request. If the database is unavailable, the service is unavailable.
- **Least privilege.** Service processes run as non-root. systemd units
restrict filesystem access, syscalls, and capabilities.
- **No local user databases.** Authentication is always delegated to MCIAS.
### Cryptographic Standards
| Purpose | Algorithm | Notes |
|---------|-----------|-------|
| Symmetric encryption | AES-256-GCM | 12-byte random nonce per operation |
| Symmetric alternative | XChaCha20-Poly1305 | For contexts needing nonce misuse resistance |
| Key derivation | Argon2id | Memory-hard; tune params to hardware |
| Asymmetric signing | Ed25519, ECDSA (P-256, P-384) | Prefer Ed25519 |
| CSPRNG | `crypto/rand` | All keys, nonces, salts, tokens |
| Constant-time comparison | `crypto/subtle` | All secret comparisons |
- **Never use RSA for new designs.** Ed25519 and ECDSA are faster, produce
smaller keys, and have simpler security models.
- **Zeroize secrets** from memory when they are no longer needed. Overwrite
byte slices with zeros, nil out pointers.
- **Never log secrets.** Keys, passwords, tokens, and plaintext must never
appear in log output.
### Web Security
- CSRF tokens on all mutating requests.
- `SameSite=Strict` on all cookies.
- `html/template` for automatic escaping.
- Validate all input at system boundaries.
---
## Development Workflow
### Local Development
```bash
# Build and run both servers locally:
make devserver
# Or build everything and run the full pipeline:
make all
```
The `devserver` target builds both binaries and runs them against a local
config in `srv/`. The `srv/` directory is gitignored — it holds your local
database, certificates, and configuration.
### Pre-Push Checklist
Before pushing a branch:
```bash
make all # vet → lint → test → build
make proto-lint # if proto files changed
```
### Proto Changes
1. Edit `.proto` files in `proto/<service>/v2/`.
2. Run `make proto` to regenerate Go code.
3. Run `make proto-lint` to check for linting violations and breaking changes.
4. Update REST routes to match the new/changed RPCs.
5. Update gRPC interceptor maps for any new RPCs.
6. Update `ARCHITECTURE.md` API tables.
### Adding a New Feature
1. **Design first.** Write or update the relevant design document. For a new
engine or major subsystem, create a new doc in `docs/` or `engines/`.
2. **Implement.** Follow existing patterns — the design doc should reference
specific files and line numbers.
3. **Test.** Write tests alongside the implementation.
4. **Update docs.** Update `ARCHITECTURE.md`, `CLAUDE.md`, and route tables.
5. **Verify.** Run `make all`.
### CLI Commands
Every service uses cobra for CLI commands. Standard subcommands:
| Command | Purpose |
|---------|---------|
| `server` | Start the service |
| `init` | First-time setup (if applicable) |
| `status` | Query a running instance's health |
| `snapshot` | Create a database backup |
Add service-specific subcommands as needed (e.g. `migrate-aad`, `unseal`).
Each command lives in its own file in `cmd/<service>/`.