feat(docs): lean documentation strategy — consolidate 16 docs into 7 canonical + 3 subdirs (#2094)

* docs(arch): consolidate ARCHITECTURE + PLATFORM-TECH-STACK + NAMING + EPICS-1-6 + BOOTSTRAP-KIT-EXPANSION → docs/ARCHITECTURE.md (lean doc strategy)

Single canonical "how OpenOva works" doc per founder's lean-doc strategy.
2926 source lines → 1110 consolidated lines, no semantic loss.

Sections:
 §1  High-level model (Catalyst/Sovereign/Org/Env/Application/Blueprint)
 §2  Repo layout
 §3  Tech stack by layer (CNI/GitOps/IaC/event-spine/data/secrets/identity/...)
 §4  Naming conventions (dimensions, patterns, labels, DOMAINS-CANON)
 §5  Catalyst control plane (rules, CRDs, controllers, cutover, identity, surfaces)
 §6  Per-host-cluster infrastructure
 §7  Application Blueprints
 §8  Multi-region topology (1 cpx52/region, WireGuard-over-public-IPs, ClusterMesh)
 §9  Bootstrap-kit slot ordering (full 48-slot canonical list)
 §10 EPIC-level design overview (EPIC-0 through EPIC-6)
 §11 Per-chart DESIGN.md inventory
 §12 OAM influence
 §13 Read further

Stale literal fixes:
 - omantel.openova.io → omantel.biz / <sovereign>.<tld> / t38.omani.works (7 instances)
 - SPIRE marked DEFERRED / opt-in only (PR #665, TBD-V29 #2055)
 - failover-controller marked REPLACED by bp-continuum

New PR refs wired into §3:
 - PR #665   SPIRE deferral
 - PR #2071  bp-cnpg-pair synchronous remote_apply (zero-tx-loss multi-region)
 - PR #2087  bp-cnpg-pair pre-merge guard
 - PR #2093  bp-cnpg-pair pre-merge guard

New stack components added to §3:
 - bp-cnpg-pair  (synchronous remote_apply ReplicaCluster across ClusterMesh)
 - bp-continuum  (lease-based failover orchestrator)
 - bp-self-sovereign-cutover (8-tether pivot, ADR-0002, Principle #11)

Source docs (to be deleted by orchestrator in final PR):
 - docs/PLATFORM-TECH-STACK.md
 - docs/NAMING-CONVENTION.md
 - docs/EPICS-1-6-unified-design.md
 - docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md

* docs(principles): consolidate INVIOLABLE-PRINCIPLES + ANTI-PATTERN-CATALOG → docs/PRINCIPLES.md (lean doc strategy)

* docs(dod): consolidate 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS → docs/DOD.md (lean doc strategy)

* docs(runbooks+status+glossary): consolidate 5 runbooks → RUNBOOKS.md + refresh STATUS.md + fold banned-terms into GLOSSARY.md (lean doc strategy)

Part 1 — Runbook consolidation:
- NEW docs/RUNBOOKS.md with 7 numbered sections (provisioning, day-2 ops,
  Blueprint authoring, chart conventions, demo walk, failover, troubleshooting)
- Folds BLUEPRINT-AUTHORING / CHART-AUTHORING / DEMO-RUNBOOK /
  RUNBOOK-OPERATIONS / RUNBOOK-PROVISIONING into one canonical surface
- Documents dual-annotation requirement for charts with enabled.default: false
  (GUARD 1 #2087 no-upstream + GUARD 2 #2093 smoke-render) with bp-network-policies:1.0.1
  dead-reserve incident as the live evidence
- All admin.<fqdn> legacy URL refs → console.<fqdn>/bss (BSS lives in operator console)
- All openova.io / omantel.omani.works test commands → canonical t<NN>.omani.works
- Cites PRs #2076 (docs migration), #2082 (no-auto-close-keyword), #2087, #2093

Part 2 — STATUS.md refresh (renamed from IMPLEMENTATION-STATUS.md):
- Header dated 2026-05-20 (was 2026-04-29; 22 days stale per audit)
- Adds 🟦 CODE-COMPLETE state for "controllers + CRDs + tests landed,
  awaiting fresh-prov walk" (per 5-pillar DoD)
- Pillar 3 marked CODE-COMPLETE (PRs #2071/#2072/#2073/#2074/#2075/#2053)
- Adds 3 new CRDs verified in products/catalyst/chart/crds/:
  CNPGPair, PDM, Sandbox
- Sandbox controller chain CODE-COMPLETE
  (PRs #1615/#1618/#1621/#1622/#1626/#1631/#1632)
- SPIRE marked DEFERRED — opt-in only (PRs #665, #2056, #2061)
- New §6 CI / supply-chain guards table: hollow-chart (#2087),
  smoke-render (#2093), no-auto-close-keyword (#2082), observability-toggle,
  subchart 4-step, Flux version-pin replay
- New §9 Pillar-status table — Pillars 1/2/3/4 CODE-COMPLETE, Pillar 5 🚧
- Pillar 1 (PRs #2038 V18, #2043 V18-D), Pillar 2 (PR #2029 V20),
  Pillar 3 (per above), Pillar 4 (Sandbox chain)

Part 3 — GLOSSARY.md folded as single source of truth for banned terms:
- Header dated 2026-05-20, notes "single source of truth for banned terms"
  and "no separate BANNED-TERMS.md"
- Existing 11 banned-terms rows rewritten with italicized qualifiers
- NEW Forbidden test domains subsection:
  openova.io (mothership-only), omantel.openova.io (hallucinated),
  Nova Cloud (predecessor brand), eventforge.io (hallucinated),
  admin.<fqdn> (dead BSS URL)
- SPIFFE/SPIRE identity row + acronym row marked deferred per PR #665
  with TBD-V29 (#2055) re-introduction roadmap
- Cross-links updated: IMPLEMENTATION-STATUS → STATUS,
  SOVEREIGN-PROVISIONING + BLUEPRINT-AUTHORING → RUNBOOKS.md

CLAUDE.md NOT touched. Source files NOT deleted (orchestrator owns deletion).
No push, no PR. Manifest at /tmp/merge-D-runbooks-status-glossary-manifest.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: assemble lean doc strategy — delete legacy sources, move ledger/sessions/archive, ADR-0004, rewrite cross-refs

Per founder direction 2026-05-20 + user-global ~/.claude/CLAUDE.md §11.

This is the orchestrator commit on top of the four cherry-picked consolidation
commits (ARCHITECTURE, PRINCIPLES, DOD, RUNBOOKS+STATUS+GLOSSARY). It:

1. Deletes 15 legacy source docs (now folded into the 7 canonical):
   PLATFORM-TECH-STACK, NAMING-CONVENTION, EPICS-1-6-unified-design,
   BOOTSTRAP-KIT-EXPANSION-PLAN, INVIOLABLE-PRINCIPLES, ANTI-PATTERN-CATALOG,
   5-PILLAR-DOD, DOMAINS-CANON, SOVEREIGN-MULTI-REGION-DOD,
   PERSONAS-AND-JOURNEYS, BLUEPRINT-AUTHORING, CHART-AUTHORING,
   DEMO-RUNBOOK, RUNBOOK-OPERATIONS, RUNBOOK-PROVISIONING.

2. Moves transient + historical docs into proper subdirs:
   - docs/ledger/{TRUST,TRACKER}.md (cron-refreshed live state)
   - docs/sessions/{2026-05-17-convergence,2026-05-19-20-trust-recovery,
     2026-05-20-trust-audit,2026-05-20-walk-runbook}.md
   - docs/archive/{validation-log,orchestrator-state,omantel-handover-wbs}.md

3. Adds docs/adr/0004-cnpg-sync-replication.md (Pillar 3 zero-tx-loss decision)
   + docs/adr/README.md index.

4. Updates CLAUDE.md reading-order + repo-structure block to match the
   lean strategy and current core/ tree (controllers/, marketplace/, etc.).

5. Sweeps all .md files + .github/workflows + scripts to repoint old doc
   paths to the new canonical homes. ADR cross-references kept intact
   (ADRs are immutable historical artifacts).

Operator-side cron scripts that still write to the old paths
(/home/openova/bin/refresh-dod-dashboard.sh, refresh-wbs.sh and
openova-private/bin/trust-audit.sh) need a one-line path update —
flagged in the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(bootstrap-kit): update repo-root sentinel to docs/PRINCIPLES.md

The bootstrap-kit Go test used `docs/INVIOLABLE-PRINCIPLES.md` as its
repo-root sentinel; the file no longer exists after the lean-doc
consolidation (it's now `docs/PRINCIPLES.md`). Update the walker to
match the new canonical filename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20 14:40:01 +04:00



Secret Rotation

The canonical list of credentials Catalyst-Zero handles, where each one lives, and how to rotate it.

Per PRINCIPLES.md #10 (credential hygiene): passwords, tokens, API keys, client secrets, kubeconfig contents, TLS private keys, and .env values are all credentials and treated identically. No credential is committed to git, ever. The catalyst-api Pod's runtime env is the single source of truth for every secret it consumes; persisted deployment records redact every one of them via internal/store.Redact.

This document is the operator runbook for rotating each of those credentials on the schedule below — and the rollback path if a rotation breaks something live.

Rotation Schedule

| Credential | Where it lives | Rotation cadence | Rollback window |
| --- | --- | --- | --- |
| GHCR pull token (`catalyst-ghcr-pull-token`) | K8s Secret in `catalyst` ns, key `token` | Yearly | 24h via 1Password version history |
| Hetzner Cloud API token (per Sovereign) | Wizard input → catalyst-api memory only | Per Sovereign apply | n/a (single-use, never persisted) |
| Dynadot API key + secret (`dynadot-api-credentials`) | K8s Secret in `openova-system` ns, keys `api-key` + `api-secret` | Yearly (or on personnel change) | 24h via 1Password version history |
| Sovereign Admin SSO client secret (Keycloak `catalyst-admin` realm) | Per-Sovereign K8s Secret in `keycloak` ns | Yearly | 1h (Keycloak supports two active client secrets during rollover) |
| SOPS / SealedSecrets cluster key (per Sovereign) | K8s Secret in `kube-system` ns | Per Sovereign, never rotated post-bootstrap | n/a (re-key requires migrating every existing SealedSecret) |
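
The yearly cadence in the schedule can be tracked with a small date helper. This is a minimal sketch, assuming GNU `date` and a generation date recorded as YYYY-MM-DD (as on the 1Password item). The function names `rotation_due_date` and `days_until_rotation` are hypothetical and not part of the Catalyst tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the Catalyst tooling. Requires GNU date.

# Print the date on which a credential generated on $1 (YYYY-MM-DD) is due
# for rotation, given a cadence of $2 days (default: 365).
rotation_due_date() {
  local generated="$1" cadence_days="${2:-365}"
  date -u -d "${generated} +${cadence_days} days" +%F
}

# Print the whole number of days from $2 (default: today, UTC) until the
# due date of a credential generated on $1. A negative number means overdue.
days_until_rotation() {
  local generated="$1" today="${2:-$(date -u +%F)}" cadence_days="${3:-365}"
  local due_s today_s
  due_s=$(date -u -d "$(rotation_due_date "$generated" "$cadence_days")" +%s)
  today_s=$(date -u -d "$today" +%s)
  echo $(( (due_s - today_s) / 86400 ))
}
```

For example, `days_until_rotation 2026-05-20` prints how many days remain before a token generated on that date reaches its 365-day mark. Only dates are handled here, never the credential itself.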

The rest of this document is the per-credential procedure.

GHCR pull token (`catalyst-ghcr-pull-token`)

What it is. A long-lived GitHub Personal Access Token (PAT) or fine-grained token with the packages:read scope on the openova-io organisation. The token authenticates the GHCR pulls Flux performs on every freshly-provisioned Sovereign — every HelmRepository CR in clusters/<sovereign-fqdn>/bootstrap-kit/ references the flux-system/ghcr-pull Secret, and that Secret's content comes from this token.

Why this token has its own runbook. The bootstrap-kit pulls the bp-* OCI artifacts from ghcr.io/openova-io/, which is a private registry path. Without the token, the source-controller logs:

failed to get authentication secret 'flux-system/ghcr-pull':
  secrets "ghcr-pull" not found

…and Phase 1 stalls at bp-cilium. The fix that landed this runbook (fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly) makes the cloud-init template write the Secret BEFORE kubectl apply -f flux-bootstrap.yaml, but the token itself is never in the template — OpenTofu interpolates it at apply time from var.ghcr_pull_token, sourced from the catalyst-api Pod's env var CATALYST_GHCR_PULL_TOKEN.

Where the token must NEVER be: git (any branch, any repo), the bootstrap-kit YAMLs, the catalyst-api Pod logs, the Hetzner project metadata, Slack/email/issue bodies. The provisioner stamps it onto the Request struct in memory, writes tofu.auto.tfvars.json (mode 0600), and that file is wiped when the per-deployment workdir is cleared. The json:"-" tag on Request.GHCRPullToken keeps it out of the persisted deployment records (see internal/store.Redact).
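
The "mode 0600, wiped afterwards" handling of `tofu.auto.tfvars.json` can be illustrated in shell. This is only a sketch of the pattern: the real provisioner does this inside catalyst-api, and the function names `write_private_file` and `wipe_workdir` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pattern; not the catalyst-api implementation.

# Read a value from stdin and write it to $1 with owner-only permissions.
# The umask is tightened inside a subshell, so the file is created as 0600
# from the start and the caller's umask is left unchanged.
write_private_file() {
  local path="$1"
  ( umask 077 && cat > "$path" )
}

# Remove a per-deployment workdir and everything in it.
wipe_workdir() {
  local dir="$1"
  [ -n "$dir" ] && [ -d "$dir" ] && rm -rf -- "$dir"
}
```

Setting the umask before the file is created avoids the short window that a create-then-`chmod 600` sequence leaves open, during which the file exists with default permissions.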

Generation

Generate a fine-grained PAT (preferred over classic PATs):

1. Open https://github.com/settings/personal-access-tokens/new
2. Resource owner: openova-io
3. Repository access: Public Repositories (read-only). This is sufficient because GHCR packages inherit the openova-io org's GHCR visibility settings; the token does not need repo-level access.
4. Permissions:
   - Account → Packages → Read (the only scope this token uses)
5. Expiration: 365 days (the next rotation date; write it on the 1Password item).
6. Generate. Copy the token to 1Password immediately (the page shows it once); never paste it into a terminal or a chat window.

Storage

1Password vault: OpenOva — Production
Item title: Catalyst — GHCR pull token (catalyst-ghcr-pull-token)
Tags: catalyst, ghcr, rotation:yearly

Notes field on the 1Password item must record:

- Generation date.
- Expiration date.
- Username paired with this token at the registry: openova-bot (the literal string the cloud-init template uses; GitHub validates the token, not the username, but this string lands in audit-trail JSON).
- Operator who generated it.

Apply (the one-liner)

Replace <GHCR_PULL_TOKEN> with the token retrieved from 1Password — never paste a real token into git, an issue, a commit message, or a terminal session that will be transcribed.

kubectl create secret generic catalyst-ghcr-pull-token \
  --namespace=catalyst \
  --from-literal=token='<GHCR_PULL_TOKEN>' \
  --dry-run=client -o yaml | \
  kubectl apply -f -

The --dry-run=client … | kubectl apply -f - form is idempotent: a fresh install creates the Secret; a rotation overwrites the existing one in-place. The catalyst-api Deployment must be rolled to pick up the new value:

kubectl -n catalyst rollout restart deployment/catalyst-api
kubectl -n catalyst rollout status  deployment/catalyst-api

(secretKeyRef-mounted env vars are NOT auto-refreshed by the Pod — only volume mounts are. The catalyst-api chart mounts the token as env.valueFrom.secretKeyRef, so a rollout is required.)

Verify

# The Secret exists with the expected key.
kubectl -n catalyst get secret catalyst-ghcr-pull-token \
  -o jsonpath='{.data.token}' | base64 -d | wc -c
# (Output: a non-zero byte count. NEVER append `; echo` — that prints
# the token to your terminal.)

# The catalyst-api Pod read it cleanly at startup.
kubectl -n catalyst logs deploy/catalyst-api | grep -i 'ghcr' || \
  echo "no ghcr-related warning — provisioner picked up the token"

# A fresh /api/v1/deployments POST validates without the
# 'CATALYST_GHCR_PULL_TOKEN missing' error (expected for managed-pool
# domain mode).
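
The byte-count check can be wrapped so that it also fails on an empty value while still never printing the decoded token. This is a minimal sketch; the function name `secret_byte_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads base64-encoded Secret data on stdin, prints
# only the decoded byte count, and returns non-zero when the count is zero.
# The decoded value itself is never written to the terminal.
secret_byte_count() {
  local n
  n=$(base64 -d | wc -c | tr -d '[:space:]')
  echo "$n"
  [ "$n" -gt 0 ]
}
```

It would be fed from the same jsonpath query used above, for example `kubectl -n catalyst get secret catalyst-ghcr-pull-token -o jsonpath='{.data.token}' | secret_byte_count`.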

Rollback

If the new token does not authenticate (typo, wrong scope, expired):

1. Open 1Password's item version history; copy the previous token.
2. Re-run the kubectl create secret … --dry-run=client | kubectl apply one-liner with the previous token.
3. Run kubectl -n catalyst rollout restart deployment/catalyst-api.
4. File a follow-up issue to investigate why the new token failed.

The previous token remains valid until its own expiration date (recorded on the 1Password item), because GitHub does not invalidate replaced fine-grained tokens automatically. Once rollback succeeds, revoke the broken token in the GitHub UI as a hygiene step.

Hetzner Cloud API token (per Sovereign)

The token is captured by the wizard's StepProvider and lives in catalyst-api memory only for the duration of one deployment. It is NEVER persisted (the Request.HetznerToken field is json:"-"; internal/store.Redact overwrites it with <redacted> for any record that ends up on disk).

Rotation: per-Sovereign apply. Each tofu apply accepts a fresh token; once tofu apply returns, catalyst-api drops the value out of memory (the Pod restart on next image roll loses the in-memory copy regardless).

If a Hetzner token is suspected of leaking: revoke at https://console.hetzner.cloud/projects → Security → API tokens. The next wizard run will accept a fresh one.

Dynadot API key + secret (`dynadot-api-credentials`)

K8s Secret in openova-system namespace, keys: api-key, api-secret, domain (legacy single-domain), domains (comma-separated list, preferred).

Yearly rotation via the Dynadot account UI:

1. https://www.dynadot.com → My Account → API Settings → Regenerate.
2. Copy both halves to the 1Password item Dynadot — OpenOva pool domains API credentials.
3. Apply:

kubectl create secret generic dynadot-api-credentials \
  --namespace=openova-system \
  --from-literal=api-key='<DYNADOT_API_KEY>' \
  --from-literal=api-secret='<DYNADOT_API_SECRET>' \
  --from-literal=domains='omani.works' \
  --dry-run=client -o yaml | \
  kubectl apply -f -

kubectl -n catalyst         rollout restart deployment/catalyst-api
kubectl -n openova-system   rollout restart deployment/pool-domain-manager

The domains value is the comma-separated allowlist of pool domains this account manages. Adding a third pool domain (e.g. acme.io) is a secret update, not a code change — see PRINCIPLES.md #4.
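
How a comma-separated allowlist of this shape can be checked is sketched below. This is an illustration only: the function name `domain_in_allowlist` is hypothetical, and the whitespace trimming and case-insensitive matching are assumptions, not confirmed behaviour of catalyst-api or pool-domain-manager.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Returns 0 if pool domain $1 appears in the
# comma-separated allowlist $2. Whitespace around entries is ignored and
# matching is case-insensitive (both are assumptions for this sketch).
domain_in_allowlist() {
  local candidate allowlist entry
  local -a entries
  candidate=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  allowlist=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  IFS=',' read -r -a entries <<< "$allowlist"
  for entry in "${entries[@]}"; do
    entry="${entry#"${entry%%[![:space:]]*}"}"   # trim leading whitespace
    entry="${entry%"${entry##*[![:space:]]}"}"   # trim trailing whitespace
    [ "$entry" = "$candidate" ] && return 0
  done
  return 1
}
```

With `domains='omani.works,acme.io'`, a call such as `domain_in_allowlist acme.io 'omani.works,acme.io'` succeeds, while a domain absent from the list is rejected. Widening the list is a Secret update only.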

Cross-cutting rules

- NEVER print a credential to a terminal. All retrievals pipe to a file (> /path && chmod 600) or directly into kubectl create secret --from-literal. Session transcripts are durable.
- NEVER commit a credential. Use this runbook's kubectl create secret … | kubectl apply one-liner; the value never touches a file the working tree tracks.
- NEVER skip the rollout restart. secretKeyRef env vars are read at Pod start. A Secret update with no rollout is a silent half-rotation: existing Pods keep serving the old value, and only Pods created after the next eviction serve the new one. The catalyst-api is single-replica with strategy Recreate, so this is one step.
- Log only metadata, never the value. kubectl describe secret shows data: token: <not shown>, and that is intentional. Reading the value via -o jsonpath and piping to a file is the sanctioned confirmation path; piping to cat/echo is not.
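
As a backstop for the "never commit" rule, text can be scanned for token-shaped strings before it leaves the workstation. This is a minimal sketch, assuming the publicly documented GitHub prefixes `ghp_` (classic PAT) and `github_pat_` (fine-grained PAT). The 20-character minimum length and the function name `find_token_lines` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads text on stdin and prints the line numbers that
# contain something shaped like a GitHub token. Only line numbers are
# printed, never the matched text. The length threshold is an assumption.
find_token_lines() {
  grep -nE '(ghp_[A-Za-z0-9]{20,}|github_pat_[A-Za-z0-9_]{20,})' | cut -d: -f1
}
```

One possible use is `git diff --cached | find_token_lines` before a commit: any output means a staged change deserves a second look. A match is a hint only, and an empty result does not prove that no credential is present.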

If you accidentally expose a credential (printed it to a terminal that will be transcribed, committed it to a branch, or posted it to an issue), rotate it immediately following this runbook. Do not try to "quietly fix it" by editing history; assume the leaked value has been captured.
