Secret Rotation
The canonical list of credentials Catalyst-Zero handles, where each one lives, and how to rotate it.
Per PRINCIPLES.md #10 (credential
hygiene): passwords, tokens, API keys, client secrets, kubeconfig
contents, TLS private keys, and .env values are all credentials and
treated identically. No credential is committed to git, ever. The
catalyst-api Pod's runtime env is the single source of truth for every
secret it consumes; persisted deployment records redact every one of them
via internal/store.Redact.
This document is the operator runbook for rotating each of those credentials on the schedule below — and the rollback path if a rotation breaks something live.
Rotation Schedule
| Credential | Where it lives | Rotation cadence | Rollback window |
|---|---|---|---|
| GHCR pull token (catalyst-ghcr-pull-token) | K8s Secret in catalyst ns, key token | Yearly | 24h via 1Password version history |
| Hetzner Cloud API token (per Sovereign) | Wizard input → catalyst-api memory only | Per Sovereign apply | n/a (single-use, never persisted) |
| Dynadot API key + secret (dynadot-api-credentials) | K8s Secret in openova-system ns, keys api-key + api-secret | Yearly (or on personnel change) | 24h via 1Password version history |
| Sovereign Admin SSO client secret (Keycloak catalyst-admin realm) | Per-Sovereign K8s Secret in keycloak ns | Yearly | 1h (Keycloak supports two active client secrets during rollover) |
| SOPS / SealedSecrets cluster key (per Sovereign) | K8s Secret in kube-system ns | Per Sovereign, never rotated post-bootstrap | n/a (re-key requires migrating every existing SealedSecret) |
The rest of this document is the per-credential procedure.
GHCR pull token (catalyst-ghcr-pull-token)
What it is. A long-lived GitHub Personal Access Token (PAT) or
fine-grained token with the packages:read scope on the openova-io
organisation. The token authenticates the GHCR pulls Flux performs on
every freshly-provisioned Sovereign — every HelmRepository CR in
clusters/<sovereign-fqdn>/bootstrap-kit/ references the
flux-system/ghcr-pull Secret, and that Secret's content comes from this
token.
Why this token has its own runbook. The bootstrap-kit pulls the bp-*
OCI artifacts from ghcr.io/openova-io/, which is a private registry
path. Without the token, the source-controller logs:
failed to get authentication secret 'flux-system/ghcr-pull':
secrets "ghcr-pull" not found
…and Phase 1 stalls at bp-cilium. The fix that landed this runbook
(fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly) makes the cloud-init template write
the Secret BEFORE kubectl apply -f flux-bootstrap.yaml, but the token
itself is never in the template — OpenTofu interpolates it at apply time
from var.ghcr_pull_token, sourced from the catalyst-api Pod's env var
CATALYST_GHCR_PULL_TOKEN.
Where the token must NEVER be: git (any branch, any repo), the
bootstrap-kit YAMLs, the catalyst-api Pod logs, the Hetzner project
metadata, Slack/email/issue bodies. The provisioner stamps it onto the
Request struct in memory, writes tofu.auto.tfvars.json (mode 0600), and
that file is wiped when the per-deployment workdir is cleared. The
json:"-" tag on Request.GHCRPullToken keeps it out of the persisted
deployment records (see internal/store.Redact).
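The file handling described above can be illustrated with a minimal shell sketch. The provisioner's own code is not shown in this runbook, so `write_private` and `wipe_workdir` are hypothetical names, and the JSON key is assumed to mirror `var.ghcr_pull_token`:

```shell
# Sketch only: illustrates "create at mode 0600, wipe with the workdir".
# Function names are hypothetical, not taken from the provisioner.

# write_private FILE: copy stdin into FILE, creating it with mode 0600.
# The umask is changed inside a subshell so the caller's umask is untouched,
# and any stale file is removed first so an older, looser mode cannot survive.
write_private() {
  ( umask 077 && rm -f -- "$1" && cat > "$1" )
}

# wipe_workdir DIR: remove the per-deployment directory and everything in it.
wipe_workdir() {
  rm -rf -- "$1"
}
```

Setting the umask before the file is created avoids the short window in which a create-then-chmod sequence leaves the file readable by other users.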
Generation
Generate a fine-grained PAT (preferred over classic PATs):
- https://github.com/settings/personal-access-tokens/new
- Resource owner: openova-io
- Repository access: Public Repositories (read-only) — this is sufficient because GHCR packages inherit the openova-io org's GHCR visibility settings; the token does not need repo-level access.
- Permissions:
- Account → Packages → Read (the only scope this token uses)
- Expiration: 365 days (next rotation date — write it on the 1Password item).
- Generate. Copy the token to 1Password immediately (the page shows it once); never paste it into a terminal or a chat window.
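To fill in the dates for the 1Password item, the 365-day expiration can be computed rather than counted by hand. This is a small sketch that assumes GNU `date` (the `-d` flag); `expiry_date` and `days_until` are hypothetical helper names:

```shell
# Sketch only: assumes GNU date. Helper names are hypothetical.

# expiry_date GENERATED: print the date 365 days after GENERATED (YYYY-MM-DD).
expiry_date() {
  date -u -d "$1 +365 days" +%F
}

# days_until DATE [TODAY]: whole days from TODAY (default: now) until DATE.
# A negative result means DATE is already in the past.
days_until() {
  local target now
  target=$(date -u -d "$1" +%s)
  now=$(date -u -d "${2:-now}" +%s)
  echo $(( (target - now) / 86400 ))
}
```

For example, `expiry_date 2026-05-20` gives the value to record as the expiration date, and `days_until` on that value shows how long remains before the next rotation is due.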
Storage
1Password vault: OpenOva — Production
Item title: Catalyst — GHCR pull token (catalyst-ghcr-pull-token)
Tags: catalyst, ghcr, rotation:yearly
Notes field on the 1Password item must record:
- Generation date.
- Expiration date.
- Username paired with this token at the registry: openova-bot (the literal string the cloud-init template uses; GitHub validates the token, not the username, but this string lands in audit-trail JSON).
- Operator who generated it.
Apply (the one-liner)
Replace <GHCR_PULL_TOKEN> with the token retrieved from 1Password —
never paste a real token into git, an issue, a commit message, or a
terminal session that will be transcribed.
kubectl create secret generic catalyst-ghcr-pull-token \
--namespace=catalyst \
--from-literal=token='<GHCR_PULL_TOKEN>' \
--dry-run=client -o yaml | \
kubectl apply -f -
The --dry-run=client … | kubectl apply -f - form is idempotent: a fresh
install creates the Secret; a rotation overwrites the existing one
in-place. The catalyst-api Deployment must be rolled to pick up the new
value:
kubectl -n catalyst rollout restart deployment/catalyst-api
kubectl -n catalyst rollout status deployment/catalyst-api
(secretKeyRef-mounted env vars are NOT auto-refreshed by the Pod —
only volume mounts are. The catalyst-api chart mounts the token as
env.valueFrom.secretKeyRef, so a rollout is required.)
Verify
# The Secret exists with the expected key.
kubectl -n catalyst get secret catalyst-ghcr-pull-token \
-o jsonpath='{.data.token}' | base64 -d | wc -c
# (Output: a non-zero byte count. NEVER append `; echo` — that prints
# the token to your terminal.)
# The catalyst-api Pod read it cleanly at startup.
kubectl -n catalyst logs deploy/catalyst-api | grep -i 'ghcr' || \
echo "no ghcr-related warning — provisioner picked up the token"
# A fresh /api/v1/deployments POST validates without the
# 'CATALYST_GHCR_PULL_TOKEN missing' error (expected for managed-pool
# domain mode).
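A byte count alone does not reveal a stray newline or space picked up while copying the token. The sketch below is one way to report that as metadata without ever printing the value; `token_shape` is a hypothetical helper, not part of the catalyst tooling:

```shell
# Sketch only: reports metadata about a secret read from stdin.
# Prints "<length> clean" or "<length> has-whitespace"; never the value.
token_shape() {
  local value
  # Command substitution strips trailing newlines, so append a sentinel
  # character and remove it again to keep the input exactly as received.
  value=$(cat; printf x)
  value=${value%x}
  case "$value" in
    *[[:space:]]*) printf '%s has-whitespace\n' "${#value}" ;;
    *)             printf '%s clean\n' "${#value}" ;;
  esac
}
```

It can be placed at the end of the same pipeline used above, in place of `wc -c`: `kubectl -n catalyst get secret catalyst-ghcr-pull-token -o jsonpath='{.data.token}' | base64 -d | token_shape`.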
Rollback
If the new token does not authenticate (typo, wrong scope, expired):
- Open 1Password's item version history; copy the previous token.
- Re-run the kubectl create secret … --dry-run=client | kubectl apply one-liner with the previous token.
- kubectl -n catalyst rollout restart deployment/catalyst-api.
- File a follow-up issue to investigate why the new token failed.
The previous token remains valid until its own expiration date (recorded on the 1Password item); GitHub does not invalidate replaced fine-grained tokens automatically. Revoke the broken token in the GitHub UI as a hygiene step once rollback succeeds.
Hetzner Cloud API token (per Sovereign)
Captured by the wizard's StepProvider, lives in catalyst-api memory only
for the duration of one deployment. NEVER persisted (the
Request.HetznerToken field is json:"-"; internal/store.Redact
overwrites it with <redacted> for any record that ends up on disk).
Rotation: per-Sovereign apply. Each tofu apply accepts a fresh token;
once tofu apply returns, catalyst-api drops the value out of memory
(the Pod restart on next image roll loses the in-memory copy regardless).
If a Hetzner token is suspected of leaking: revoke at https://console.hetzner.cloud/projects → Security → API tokens. The next wizard run will accept a fresh one.
Dynadot API key + secret (dynadot-api-credentials)
K8s Secret in openova-system namespace, keys: api-key, api-secret,
domain (legacy single-domain), domains (comma-separated list,
preferred).
Yearly rotation via the Dynadot account UI:
- https://www.dynadot.com → My Account → API Settings → Regenerate.
- Copy both halves to the 1Password item Dynadot — OpenOva pool domains API credentials.
- Apply:
kubectl create secret generic dynadot-api-credentials \
--namespace=openova-system \
--from-literal=api-key='<DYNADOT_API_KEY>' \
--from-literal=api-secret='<DYNADOT_API_SECRET>' \
--from-literal=domains='omani.works' \
--dry-run=client -o yaml | \
kubectl apply -f -
kubectl -n catalyst rollout restart deployment/catalyst-api
kubectl -n openova-system rollout restart deployment/pool-domain-manager
The domains value is the comma-separated allowlist of pool domains
this account manages. Adding a third pool domain (e.g. acme.io) is a
secret update, not a code change — see
PRINCIPLES.md #4.
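How a comma-separated allowlist of this kind can be evaluated is sketched below. This is an illustration only: the real pool-domain-manager logic is not shown in this runbook, and `domain_allowed` is a hypothetical name.

```shell
# Sketch only: not the pool-domain-manager implementation.
# domain_allowed ALLOWLIST FQDN
# Succeeds if FQDN equals an allowlisted pool domain or is a subdomain of one.
domain_allowed() {
  local allowlist="$1" fqdn="$2" entry
  local IFS=','
  for entry in $allowlist; do
    # Tolerate a single space after each comma.
    entry=${entry# }
    entry=${entry% }
    [ -z "$entry" ] && continue
    case "$fqdn" in
      # The leading dot prevents "evilomani.works" from matching "omani.works".
      "$entry"|*."$entry") return 0 ;;
    esac
  done
  return 1
}
```

With an allowlist of `omani.works,acme.io`, a name such as `t38.omani.works` is accepted, while a look-alike name that merely ends in the same characters is rejected.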
Cross-cutting rules
- NEVER print a credential to a terminal. All retrievals pipe to a file (> /path && chmod 600) or directly into kubectl create secret --from-literal. Session transcripts are durable.
- NEVER commit a credential. Use this runbook's kubectl create secret … | kubectl apply one-liner; the value never touches a file the working tree tracks.
- NEVER skip the rollout restart. secretKeyRef env vars are read at Pod start. A Secret update with no rollout is a silent half-rotation: existing Pods serve the old value, and new Pods (created after the next eviction) serve the new one. The catalyst-api is single-replica with strategy Recreate, so this is one step.
- Log only metadata, never the value. kubectl describe secret shows only the byte count of each key, never its content; that is intentional. Reading the value via -o jsonpath and piping to a file is the sanctioned confirmation path; piping to cat/echo is not.
If you accidentally expose a credential (printed it to a terminal that will be transcribed, committed it to a branch, or posted it to an issue), rotate it immediately following this runbook. Do not try to "quietly fix it" by editing history; assume the leaked value is captured.
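In the spirit of the "log only metadata" rule, one way to confirm that the value in the cluster matches the value in 1Password is to compare short digests instead of the values themselves. This is not part of the runbook's sanctioned procedure; it is a sketch that assumes `sha256sum` from GNU coreutils is available and that a truncated digest of a high-entropy token is acceptable in a transcript. `fingerprint` is a hypothetical helper name.

```shell
# Sketch only: prints the first 12 hex characters of the SHA-256 digest
# of whatever arrives on stdin. The input itself is never echoed.
fingerprint() {
  sha256sum | cut -c1-12
}
```

Piping the decoded Secret value and the stored value through `fingerprint` separately should produce the same 12 characters when the rotation landed correctly. A mismatch usually points to a stale Secret or a stray trailing newline in one of the two copies.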