feat(docs): lean documentation strategy — consolidate 16 docs into 7 canonical + 3 subdirs (#2094)

* docs(arch): consolidate ARCHITECTURE + PLATFORM-TECH-STACK + NAMING + EPICS-1-6 + BOOTSTRAP-KIT-EXPANSION → docs/ARCHITECTURE.md (lean doc strategy)

Single canonical "how OpenOva works" doc per founder's lean-doc strategy.
2926 source lines → 1110 consolidated lines, no semantic loss.

Sections:
 §1  High-level model (Catalyst/Sovereign/Org/Env/Application/Blueprint)
 §2  Repo layout
 §3  Tech stack by layer (CNI/GitOps/IaC/event-spine/data/secrets/identity/...)
 §4  Naming conventions (dimensions, patterns, labels, DOMAINS-CANON)
 §5  Catalyst control plane (rules, CRDs, controllers, cutover, identity, surfaces)
 §6  Per-host-cluster infrastructure
 §7  Application Blueprints
 §8  Multi-region topology (1 cpx52/region, WireGuard-over-public-IPs, ClusterMesh)
 §9  Bootstrap-kit slot ordering (full 48-slot canonical list)
 §10 EPIC-level design overview (EPIC-0 through EPIC-6)
 §11 Per-chart DESIGN.md inventory
 §12 OAM influence
 §13 Read further

Stale literal fixes:
 - omantel.openova.io → omantel.biz / <sovereign>.<tld> / t38.omani.works (7 instances)
 - SPIRE marked DEFERRED / opt-in only (PR #665, TBD-V29 #2055)
 - failover-controller marked REPLACED by bp-continuum

New PR refs wired into §3:
 - PR #665   SPIRE deferral
 - PR #2071  bp-cnpg-pair synchronous remote_apply (zero-tx-loss multi-region)
 - PR #2087  bp-cnpg-pair pre-merge guard (GUARD 1: no-upstream / hollow-chart)
 - PR #2093  bp-cnpg-pair pre-merge guard (GUARD 2: smoke-render)

New stack components added to §3:
 - bp-cnpg-pair  (synchronous remote_apply ReplicaCluster across ClusterMesh)
 - bp-continuum  (lease-based failover orchestrator)
 - bp-self-sovereign-cutover (8-tether pivot, ADR-0002, Principle #11)

Source docs (to be deleted by orchestrator in final PR):
 - docs/PLATFORM-TECH-STACK.md
 - docs/NAMING-CONVENTION.md
 - docs/EPICS-1-6-unified-design.md
 - docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md

* docs(principles): consolidate INVIOLABLE-PRINCIPLES + ANTI-PATTERN-CATALOG → docs/PRINCIPLES.md (lean doc strategy)

* docs(dod): consolidate 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS → docs/DOD.md (lean doc strategy)

* docs(runbooks+status+glossary): consolidate 5 runbooks → RUNBOOKS.md + refresh STATUS.md + fold banned-terms into GLOSSARY.md (lean doc strategy)

Part 1 — Runbook consolidation:
- NEW docs/RUNBOOKS.md with 7 numbered sections (provisioning, day-2 ops,
  Blueprint authoring, chart conventions, demo walk, failover, troubleshooting)
- Folds BLUEPRINT-AUTHORING / CHART-AUTHORING / DEMO-RUNBOOK /
  RUNBOOK-OPERATIONS / RUNBOOK-PROVISIONING into one canonical surface
- Documents dual-annotation requirement for charts with enabled.default: false
  (GUARD 1 #2087 no-upstream + GUARD 2 #2093 smoke-render) with bp-network-policies:1.0.1
  dead-reserve incident as the live evidence
- All admin.<fqdn> legacy URL refs → console.<fqdn>/bss (BSS lives in operator console)
- All openova.io / omantel.omani.works test commands → canonical t<NN>.omani.works
- Cites PRs #2076 (docs migration), #2082 (no-auto-close-keyword), #2087, #2093

Part 2 — STATUS.md refresh (renamed from IMPLEMENTATION-STATUS.md):
- Header dated 2026-05-20 (was 2026-04-29; 22 days stale per audit)
- Adds 🟦 CODE-COMPLETE state for "controllers + CRDs + tests landed,
  awaiting fresh-prov walk" (per 5-pillar DoD)
- Pillar 3 marked CODE-COMPLETE (PRs #2071/#2072/#2073/#2074/#2075/#2053)
- Adds 3 new CRDs verified in products/catalyst/chart/crds/:
  CNPGPair, PDM, Sandbox
- Sandbox controller chain CODE-COMPLETE
  (PRs #1615/#1618/#1621/#1622/#1626/#1631/#1632)
- SPIRE marked DEFERRED — opt-in only (PRs #665, #2056, #2061)
- New §6 CI / supply-chain guards table: hollow-chart (#2087),
  smoke-render (#2093), no-auto-close-keyword (#2082), observability-toggle,
  subchart 4-step, Flux version-pin replay
- New §9 Pillar-status table — Pillars 1/2/3/4 CODE-COMPLETE, Pillar 5 🚧
- Pillar 1 (PRs #2038 V18, #2043 V18-D), Pillar 2 (PR #2029 V20),
  Pillar 3 (per above), Pillar 4 (Sandbox chain)

Part 3 — GLOSSARY.md folded as single source of truth for banned terms:
- Header dated 2026-05-20, notes "single source of truth for banned terms"
  and "no separate BANNED-TERMS.md"
- Existing 11 banned-terms rows rewritten with italicized qualifiers
- NEW Forbidden test domains subsection:
  openova.io (mothership-only), omantel.openova.io (hallucinated),
  Nova Cloud (predecessor brand), eventforge.io (hallucinated),
  admin.<fqdn> (dead BSS URL)
- SPIFFE/SPIRE identity row + acronym row marked deferred per PR #665
  with TBD-V29 (#2055) re-introduction roadmap
- Cross-links updated: IMPLEMENTATION-STATUS → STATUS,
  SOVEREIGN-PROVISIONING + BLUEPRINT-AUTHORING → RUNBOOKS.md

CLAUDE.md NOT touched. Source files NOT deleted (orchestrator owns deletion).
No push, no PR. Manifest at /tmp/merge-D-runbooks-status-glossary-manifest.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: assemble lean doc strategy — delete legacy sources, move ledger/sessions/archive, ADR-0004, rewrite cross-refs

Per founder direction 2026-05-20 + user-global ~/.claude/CLAUDE.md §11.

This is the orchestrator commit on top of the four cherry-picked consolidation
commits (ARCHITECTURE, PRINCIPLES, DOD, RUNBOOKS+STATUS+GLOSSARY). It:

1. Deletes 15 legacy source docs (now folded into the 7 canonical):
   PLATFORM-TECH-STACK, NAMING-CONVENTION, EPICS-1-6-unified-design,
   BOOTSTRAP-KIT-EXPANSION-PLAN, INVIOLABLE-PRINCIPLES, ANTI-PATTERN-CATALOG,
   5-PILLAR-DOD, DOMAINS-CANON, SOVEREIGN-MULTI-REGION-DOD,
   PERSONAS-AND-JOURNEYS, BLUEPRINT-AUTHORING, CHART-AUTHORING,
   DEMO-RUNBOOK, RUNBOOK-OPERATIONS, RUNBOOK-PROVISIONING.

2. Moves transient + historical docs into proper subdirs:
   - docs/ledger/{TRUST,TRACKER}.md (cron-refreshed live state)
   - docs/sessions/{2026-05-17-convergence,2026-05-19-20-trust-recovery,
     2026-05-20-trust-audit,2026-05-20-walk-runbook}.md
   - docs/archive/{validation-log,orchestrator-state,omantel-handover-wbs}.md

3. Adds docs/adr/0004-cnpg-sync-replication.md (Pillar 3 zero-tx-loss decision)
   + docs/adr/README.md index.

4. Updates CLAUDE.md reading-order + repo-structure block to match the
   lean strategy and current core/ tree (controllers/, marketplace/, etc.).

5. Sweeps all .md files + .github/workflows + scripts to repoint old doc
   paths to the new canonical homes. ADR cross-references kept intact
   (ADRs are immutable historical artifacts).

Operator-side cron scripts that still write to the old paths
(/home/openova/bin/refresh-dod-dashboard.sh, refresh-wbs.sh and
openova-private/bin/trust-audit.sh) need a one-line path update —
flagged in the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(bootstrap-kit): update repo-root sentinel to docs/PRINCIPLES.md

The bootstrap-kit Go test used `docs/INVIOLABLE-PRINCIPLES.md` as its
repo-root sentinel; the file no longer exists after the lean-doc
consolidation (it's now `docs/PRINCIPLES.md`). Update the walker to
match the new canonical filename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20 14:40:01 +04:00

Definition of Done

What this is: the canonical end-user Definition of Done for OpenOva — the deterministic 2-phase test, the 5 inseparable pillars, all DoD gates (D0–D35), the domains canon (which domains, when, why), and the operator / customer / tenant persona journeys.

Authority: 📐 PERMANENT canon. Reviewed PRs only. The generic cross-project DoD principle (operator-walks-a-fresh-prov, no theater) lives in user-global ~/.claude/CLAUDE.md §2. This file is the OpenOva-specific elaboration.

Pointers: see PRINCIPLES.md for engineering rules, ARCHITECTURE.md for system shape, RUNBOOKS.md for operator how-tos.

Status: Authoritative. Updated: 2026-05-20. Supersedes the legacy split 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS — consolidated here per the lean-doc strategy (PR #2076 / PR #2084). SPIRE-issued SVIDs for Sandbox MCP auth are deferred per PR #2056; the Phase 2 mechanism therefore currently relies on the interim sandbox-pty-server stdio attachment — see §1 Pillar 4 below.

§1 — The 5 inseparable pillars

Every dispatch in this repo must answer:

Which of the 5 pillars does this work move forward, and which deterministic step (Phase 0 / 1 / 2) does it advance?

If the answer is "none," the work is wrong — pick differently.

The 5 pillars are inseparable: none alone is a viable platform. Pillar work is strictly primary. Operator-console polish, cosmetic-guard re-enables, treemap drill-down quality, the jobs region filter, and admin sidebar nav are tertiary operator-debugger surfaces and must never displace pillar work.

#	Pillar	What "shipped" looks like
1	Marketplace + voucher onboarding	Anonymous visitor reaches the operator-branded marketplace → picks the canonical Postgres-backed bundle → completes signup (email + 6-digit PIN magic-link) → Organization CR created.
2	Multi-region BCP topology choice at signup	Wizard exposes region / topology choice during signup; customer picks N regions; system provisions across all N in one pass. Not a Day-2 upgrade.
3	Two independent CNPG clusters with ReplicaCluster sync + region-kill failover	One CNPG cluster per region; synchronous `ReplicaCluster` replication over Cilium ClusterMesh on the DMZ WireGuard-over-public-IPs data plane; region-kill test passes with zero transactions lost.
4	Sandbox + auto-mounted MCP plugin with full org knowledge	Sandbox launches the chosen agent CLI; `openova-sandbox-mcp` auto-mounts at session start with every org resource (apps, vClusters, conn-strings via OpenBao, Gitea repos, IAM, region health). User pastes zero credentials. Agent answers prompts with full org context and mutates resources via MCP tool calls.
5	Sovereign independence post-cutover	After `bp-self-sovereign-cutover` runs, zero egress to `harbor.openova.io`, `ghcr.io/openova-io`, or `github.com/openova-io` — proved by a 10-minute deny-egress NetworkPolicy hold (Principle #11).

Pillar 4 — `openova-sandbox-mcp` auto-mount mechanism

When a Sandbox session attaches, the sandbox-pty-server (per products/sandbox/pty-server/) writes the chosen agent's mcp.json config to every canonical agent-config path (claude-code, qwen-code, opencode, aider, cline) and starts the MCP server as a stdio subprocess of the agent process. The server exposes 49 handlers grouped under namespaces such as sandbox.db.*, sandbox.auth.*, marketplace.app.*, sandbox.git.*, sandbox.iam.*, etc. (full registry in products/sandbox/mcp-server/internal/tools/registry.go).

Authentication is currently the stdio child-process trust boundary (the agent process is the tenant's session and the MCP server inherits its identity). SPIFFE / SPIRE-issued SVIDs as the long-term auth substrate are deferred per PR #2056; when SVIDs land, the agent's caller-identity will become the tenant Organization's workload identity, never a long-lived API key, and the agent will never see credentials.

Per Principle #1 (the waterfall is the contract) and Principle #2 (never compromise from quality), Pillar 4 is not "ship a stub MCP server now and wire real tools later." A Sandbox session that boots without the full 49 tools is Pillar 4 unshipped, regardless of how good the chrome looks.
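
The "full 49 tools or unshipped" rule can be expressed as a simple check. This is a minimal sketch, not the project's verifier: the function name and the idea of feeding it the tool names reported by a session are assumptions made for illustration; only the count of 49 comes from the text above.

```python
EXPECTED_TOOL_COUNT = 49  # handler count stated in this section


def pillar4_tools_ok(tool_names):
    """Return (ok, reason) for the MCP tool names a Sandbox session reports.

    Hypothetical helper: how the names are collected is out of scope here.
    """
    unique = set(tool_names)
    if len(unique) < EXPECTED_TOOL_COUNT:
        return False, f"only {len(unique)} of {EXPECTED_TOOL_COUNT} tools mounted"
    return True, "full tool set mounted"
```

A session that reports a partial tool list fails the check regardless of how the UI looks, which mirrors the rule stated above.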

Pillar 5 — `bp-self-sovereign-cutover` and the 8-tether pivot

A franchised Sovereign emerging from Phase 1 is operationally tethered to the OpenOva mothership in eight places (full map in adr/0002-post-handover-sovereignty-cutover.md §2.1 and in ARCHITECTURE.md §11.1):

#	Tether	Phase
1	Flux `GitRepository.url = github.com/openova-io/openova`	P0
2	containerd `registries.yaml` rewrites every upstream registry → `https://harbor.openova.io`	P0
3	OCI `HelmRepository` urls = `oci://ghcr.io/openova-io`	P0
4	`catalyst-api` env fallback to `https://github.com/openova-io/openova`	P0
5	`flux-system/ghcr-pull` Secret seeded for private GHCR pulls	P0
6	Crossplane provider packages from `xpkg.upbound.io`	P1
7	Catalyst-authored images = `ghcr.io/openova-io/openova/*`	P0
8	OS package mirrors during cloud-init (`apt`, `get.k3s.io`)	P2 (cold-start only)

bp-self-sovereign-cutover installs dormant at bootstrap-kit slot 06a during Phase 1 and is triggered post-handover by the operator's "Achieve True Sovereignty" CTA. Eight sequential Jobs pivot the tethers in dependency order; the final step is a 10-minute deny-egress NetworkPolicy hold against github.com, ghcr.io, and harbor.openova.io. cutoverComplete=true is set only if the cluster reconciles green during this hold. No cutover claim without the egress-block proof. Full choreography in §7 below.
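
The hold condition can be sketched as a pure function over health samples. This is an illustrative sketch under stated assumptions: the sample format (seconds since the hold started, plus a green/not-green flag) and the function name are hypothetical, and "reconciles green during this hold" is interpreted here as "green at every sample across the full 10 minutes".

```python
HOLD_SECONDS = 600  # the 10-minute deny-egress hold


def cutover_complete(samples, hold_seconds=HOLD_SECONDS):
    """Decide whether cutoverComplete may be set to true.

    samples: iterable of (seconds_since_hold_start, reconciled_green) pairs.
    Returns True only if the samples span the whole hold and all are green.
    """
    in_hold = [(t, ok) for t, ok in samples if 0 <= t <= hold_seconds]
    if not in_hold:
        return False
    times = [t for t, _ in in_hold]
    spans_hold = min(times) == 0 and max(times) >= hold_seconds
    return spans_hold and all(ok for _, ok in in_hold)
```

A hold that is cut short, or that contains a single red sample, does not count as proof under this reading.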

§2 — The deterministic test (Phase 0 / 1 / 2)

The test is deterministic — one fresh prov, one run, all phases pass in order. No retries, no "works if you wait longer."

Phase 0 — Operator issues voucher via BSS

Voucher operations live in the operator console's BSS menu (Business Support System), NOT in any admin.<sovereign-fqdn> subdomain. The legacy admin.* references in older docs / agents are outdated.

Step	Action	URL / Outcome
0a	Operator logs in to the Sovereign Console	`https://console.t<NN>.omani.works`
0b	Navigate to the BSS menu	Sidebar → BSS (NOT `admin.<fqdn>/...`)
0c	Issue voucher	Voucher artifact created + delivered to recipient via Sovereign outbound SMTP

Phase 1 — Customer redeems voucher (Postgres-backed app onboarding)

Step	Action	URL / Outcome
1a	Customer receives voucher email	Canonical URL pattern per `core/services/notification/templates/templates.go`: `https://marketplace.t<NN>.omani.works/redeem/?code=<CODE>` (slash before `?` is mandatory)
1b	Customer redeems → checkout → picks the Postgres-backed bundle	Org provisions across the 2 chosen regions with 2 independent CNPG clusters (ReplicaCluster sync over ClusterMesh on the WG-public-IP DMZ data plane)
1c	Org URL after signup	`https://console.<orgslug>.omani.homes` (default pool TLD; pool also has `omani.rest` and `omani.trade` per `core/services/parent-domain/sovereign_parent_domains.go`)

Phase 2 — Customer launches Sandbox; agent provisions an additional app via MCP

This is the most important test. It exercises Pillar 4 end-to-end and proves that an agent acting on behalf of the tenant can mutate the Organization's resources entirely through the auto-mounted MCP plugin — without the user typing any credential.

Step	Action	Outcome
2a	Tenant logs in at `console.<orgslug>.omani.homes`	Dashboard renders
2b	Opens Sandbox	Sandbox session launches with agent set to `qwen-code` (NOT claude-code — `qwen-code` routes through newapi → Sovereign-hosted Qwen, zero Anthropic cost leak)
2c	`openova-sandbox-mcp` auto-mounts at session start	49 MCP tools available with zero user-typed config (full handler set per `products/sandbox/mcp-server/internal/tools/registry.go`)
2d	Customer prompts qwen-code to provision an additional application in their Organization	Agent uses MCP tools (`sandbox.db.provision`, `sandbox.auth.provisionRealm`, `marketplace.app.install`, etc.) — new app CNPG cluster + namespace + HelmRelease + Gitea repo materialise
2e	New app reachable	At `<newapp>.<orgslug>.omani.homes`

Orthogonal — D31 region-kill BCP failover

Run in parallel with Phase 0 / 1 / 2 to exercise Pillar 3. See §3 (D31) for the gate definition and §6 below for the full counter-test continuity procedure.

Mapping each pillar to the deterministic steps

Pillar	Steps it covers
Pillar 1 — Marketplace + signup	Phase 0 (all), Phase 1 step 1a (voucher email), Phase 1 step 1b (redeem + checkout), Phase 1 step 1c (post-signup landing)
Pillar 2 — Multi-region BCP at signup	Phase 1 step 1b (wizard region-selection step)
Pillar 3 — 2 CNPG clusters + region-kill failover	Phase 1 step 1b (provisioning the 2 clusters), orthogonal D31 (the kill test)
Pillar 4 — Sandbox + auto-mounted MCP	Phase 2 steps 2a–2e
Pillar 5 — Sovereign independence	Implicit in all of the above; verified separately by the `bp-self-sovereign-cutover` 10-minute deny-egress hold (see §7 + Principle #11)

What "shipped" means

A pillar is shipped when an operator (or a read-only Playwright verification agent — never a verification agent that ships fixes) walks a fresh prov through the pillar-relevant steps and produces:

- A screenshot (.playwright-mcp/t<NN>-<surface>-<YYYY-MM-DD>.png)
- A non-empty wire-level capture (log line, curl output, kubectl output, or HAR file)
- A working downstream artifact (the new app reachable, the failover counter intact, the egress-block proof recorded)

One PR landing does not ship a pillar. One walk-with-screenshot does. Every PR against a surface flips that surface back to 🔴 UNVERIFIED in TRUST.md until re-walked.
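
The three evidence items can be checked mechanically before a surface is marked as walked. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the allowed characters for `<surface>` (lowercase letters, digits, hyphens) are a guess, since the text only gives the overall filename pattern.

```python
import re
from datetime import date

# Pattern from the text: .playwright-mcp/t<NN>-<surface>-<YYYY-MM-DD>.png
SCREENSHOT_RE = re.compile(
    r"^\.playwright-mcp/t(\d+)-([a-z0-9-]+)-(\d{4}-\d{2}-\d{2})\.png$"
)


def parse_screenshot(path):
    """Return prov number, surface and date, or None if the name is off-pattern."""
    m = SCREENSHOT_RE.match(path)
    if not m:
        return None
    try:
        taken = date.fromisoformat(m.group(3))  # rejects dates like 2026-13-40
    except ValueError:
        return None
    return {"prov": int(m.group(1)), "surface": m.group(2), "date": taken}


def walk_evidence_complete(screenshot_path, wire_capture, artifact_ok):
    """All three must hold: named screenshot, non-empty capture, working artifact."""
    return (
        parse_screenshot(screenshot_path) is not None
        and bool(wire_capture.strip())
        and bool(artifact_ok)
    )
```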

§3 — DoD gates (D0–D35)

This is the convergence contract. Every wipe → create → test cycle must validate every gate below before claiming a Sovereign is converged. Founder ruling 2026-05-15: silent compromise from these gates is a quality violation.

Architecture invariants (never compromise)

ID	Rule
A1	3 regions minimum. If a provider has capacity / zone constraints, swap regions — never silently drop to 2.
A2	Inter-region link = DMZ WireGuard over PUBLIC IPs. No `hcloud_network` cross-region, no VPC peering, no Huawei VPC — provider-agnostic, always over the DMZ WG endpoint.
A3	Cilium ClusterMesh apiserver Service = LoadBalancer (public IP through DMZ WG). The word `NodePort` must never appear in clustermesh-apiserver Service spec on any Sovereign.
A4	vCluster topology: primary region = MGMT+DMZ; each secondary region = DMZ+RTZ. Cross-vCluster intra-region traffic stays inside host k3s via Cilium.
A5	Zero public exposure of K8s control-plane endpoints. `kubectl get svc -A` on a converged Sovereign returns no NodePort for clustermesh-apiserver, kube-apiserver, or etcd.
A6	Provider-mix is the canonical case. Assume 1 region Hetzner, 1 AWS, 1 Huawei. Code must work for that even when the active test prov is all-Hetzner.
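
Invariants A3 and A5 lend themselves to an automated check over the Service list. The following is a minimal sketch, assuming the input is the parsed JSON of `kubectl get svc -A -o json`; the function name is hypothetical, and matching on the substring `clustermesh` simply mirrors the `grep clustermesh` wording used in gate D12.

```python
def clustermesh_violations(svc_list):
    """Return (namespace, name, type) for clustermesh Services that are not LoadBalancer.

    svc_list: dict parsed from `kubectl get svc -A -o json`.
    """
    bad = []
    for item in svc_list.get("items", []):
        name = item["metadata"]["name"]
        if "clustermesh" not in name:
            continue
        # Kubernetes defaults an unset Service type to ClusterIP.
        svc_type = item.get("spec", {}).get("type", "ClusterIP")
        if svc_type != "LoadBalancer":
            bad.append((item["metadata"]["namespace"], name, svc_type))
    return bad
```

An empty result is the passing case; any entry, and in particular any `NodePort`, is a violation under the rules above.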

Gates D0–D35

Every gate must pass on a SINGLE fresh provision in one continuous run. No partial credit.

D0 sits ABOVE D1. Without successful handover + auto-redirect, the operator never sees that provisioning succeeded — every other gate becomes invisible from their perspective. The zero-touch contract is end-to-end OPERATOR experience, not just backend convergence.

#	Gate	Verifier
D0	Successful handover + auto-redirect to Sovereign Console. Once `deployment.status=ready` AND `deployment.handoverFiredAt != null`, the mothership UI auto-routes operator's browser to `deployment.handoverURL` (`/auth/handover?token=<jwt>` on the Sovereign Console). Synthetic `Apps` / `Handover` per-region stage rows MUST be marked Succeeded (or not-applicable), never stuck Pending after handover fires. No operator action required — they should land on the Sovereign Console without copying / typing the FQDN.	Playwright MCP
D1	`dig console.<fqdn> @1.1.1.1` returns primary LB IP (auto-written by catalyst-api after Phase-0)	dig
D2	`curl https://console.<fqdn>/` → 200, cert publicly trusted (verify=0)	curl
D3	PIN-login: enter email → receive PIN via IMAP → enter PIN → dashboard	Playwright MCP
D4	Keycloak SSO: PIN-login bounces through Keycloak once, lands on `/dashboard` with session cookie	Playwright MCP
D5	`/cloud` view: renders all 3 regions, no stuck spinners	Playwright MCP
D6	`/jobs` view: 0 pending, 0 running — every job in terminal state	Playwright MCP
D7	Mothership flow Jobs ≡ child Sovereign Jobs (same IDs, same statuses)	Playwright MCP diff
D8	`kubectl --context <child> get hr -A` shows all 135 HRs Ready=True across all clusters	kubectl
D9	clustermesh-apiserver Pod Ready in every region, no restarts, no x509 errors	kubectl
D10	`cilium clustermesh status` shows OK for every peer cluster	cilium CLI
D11	Inter-region pod-to-pod packet test passes, hubble-flow shows WireGuard traversal	kubectl + hubble
D12	`kubectl get svc -A \| grep clustermesh` shows only `LoadBalancer` (no NodePort)	kubectl
D13	Canvas flow page: sibling-deps edges render, no orphan bubbles, no phantom pillars	Playwright MCP
D14	Operator re-login after browser refresh works without re-PIN within session TTL	Playwright MCP
D15	`/cloud?view=graph` canvas accurate per kind. `vCluster N/N` non-zero (6/6 on a converged 3-region prov — 1 mgmt + 3 dmz + 2 rtz). `LoadBalancer N/N` non-zero (clustermesh-apiserver LBs + ingress LBs counted). `Cluster N/N` matches actual 3 regions. No kind chip shows `0/0` for a resource that actually exists in the cluster.	Playwright MCP
D16	`/dashboard` Layer-1 / Layer-2 grouping renders multi-region. Selecting Layer-1=Cluster on `/dashboard` MUST emit 3 cluster-grouped bubbles (one per region), not a single Sovereign. Layer-2=Namespace MUST emit namespace bubbles WITHIN each cluster bubble. The hierarchy `Cloud → Region → Cluster → vCluster → Namespace → Application` must collapse correctly per the operator's Layer-1 / Layer-2 selection.	Playwright MCP
D17	Application detail route `/app/<name>` shows the application, not "deployment id malformed". Clicking any application card in `/apps` MUST navigate to a route where the URL segment is the application name (e.g. `bp-cnpg`) and the renderer treats it as an application reference, NOT as a deployment id. The notifications drawer MUST NOT contain "Deployment id in the URL is malformed" entries for valid app-name segments.	Playwright MCP
D18	Sovereign-side catalyst-api can self-monitor Phase-1 install state. The chroot Sovereign catalyst-api MUST be able to fetch its own cluster's kubeconfig (or use in-cluster service account) to observe HelmRelease state. The notifications drawer MUST NOT contain "Per-component install monitoring is unavailable for this deployment — the Catalyst API couldn't fetch the new cluster's kubeconfig" entries. Operator should never need to drop to `kubectl get helmrelease` to know per-app install state.	Playwright MCP
D19	Apps + Cloud counter consistency. Apps page Deployments tab count MUST equal Catalog "INSTALLED" count MUST equal `kubectl get hr -A` Ready count. Cloud canvas kind chips MUST NOT show `N/0` for resources that exist (vCluster, LoadBalancer, Bucket, Volume, PVC). PVC count in graph view MUST equal PVC count in list view. App card hrefs MUST NOT have doubled prefix (`/app/bp-bp-*`).	Playwright MCP
D20	Jobs page surfaces all-region jobs with region filter. Jobs view MUST show per-region prefixes (`nbg1-1:`, `sin-2:`) for every app on a multi-region Sovereign, plus a region filter that lets the operator narrow to a single region. Any unexplained `N/M` counter MUST resolve to an actionable filter or be removed.	Playwright MCP
D21	Operator pre-populated as owner-tier on /users post-handover. Sovereign Console /users MUST list the operator who completed PIN-login as `tier=owner` with their email. /users MUST NOT render empty on a freshly-converged Sovereign.	Playwright MCP
D22	Settings page shows real values. /settings MUST render real values for Region, Capacity, ControlPlaneSize, Created (timestamp), DeploymentID, Pool subdomain — NOT `—` placeholders or "API PENDING" badges. Operator MUST be able to see what their Sovereign actually is.	Playwright MCP
D23	Sovereign-side /wizard route does not collide with post-handover landing. After PIN-login + handover, operator's browser MUST land on `/dashboard` (or the canonical post-handover surface), NEVER on `/wizard` (which is the mothership new-prov flow).	Playwright MCP
D24	Mothership-only views absent from Sovereign Console. The Sovereign Console MUST NOT expose: `/app/dashboard` (mothership fleet view), `/app/settings` (mothership settings), `+ New deployment` button. The Sovereign Console is for ONE Sovereign; the mothership fleet view is a different UI.	Playwright MCP
D25	All operator-facing service hostnames reachable + correctly wired. `keycloak.<fqdn>` / `openbao.<fqdn>` / `openova-flow.<fqdn>` / `prometheus.<fqdn>` / `mimir.<fqdn>` / `loki.<fqdn>` / `tempo.<fqdn>` / `argo.<fqdn>` / `workspaces.<fqdn>` MUST return a non-zero HTTP status code (any HTTP response, rather than a connection failure). `harbor.<fqdn>` / `registry.<fqdn>` / `guacamole.<fqdn>` / `marketplace.<fqdn>` MUST return their app page, not 404. No service config may carry a dev hostname (e.g. `gitea.catalyst.local`) in production HTML.	curl + Playwright MCP
D26	CSP allows fonts or self-hosts woff2. Operator MUST NOT see system-font fallback on Sovereign Console pages. Either `fonts.googleapis.com` is allowed by CSP, or fonts are self-hosted (no external dependency).	Playwright MCP
D27	Marketplace enabled on the Sovereign. `MARKETPLACE_ENABLED=true` flows from provision body → bp-catalyst-platform → Sovereign Console: a `/marketplace` route returns 200 with a non-empty catalog page (apps + voucher admin) — NOT 404, NOT a "marketplace disabled" stub. `kubectl get hr -A` shows `bp-marketplace` (or whichever HR backs the marketplace) Ready=True on the chroot. The mothership provision wizard MUST default `marketplace.enabled=true` (zero-touch — operator never toggles a flag).	Playwright MCP + kubectl
D28	Voucher issuance from owner-tier UI. Owner (the operator who PIN-logged-in per D21) opens Sovereign Console marketplace admin → issues a voucher for tenant onboarding. Voucher artifact MUST persist (CR + DB row), MUST be emailed to the chosen recipient via the Sovereign's outbound SMTP, AND MUST be visible in the admin's voucher list. Issuance must be one-click (no kubectl, no API call). Test recipient: `hatice.yildiz@openova.io` (canonical operator-test address) or any other Sovereign-side mailbox the operator controls.	Playwright MCP
D29	Voucher-based organization (tenant) provisioning is zero-touch. Recipient opens the voucher email → clicks redeem link → PIN-login as the test recipient (e.g. `hatice.yildiz@openova.io`) → lands on an organization-creation wizard → completes the form → a new `Organization` / Tenant CR is created → tenant namespace + RBAC + bootstrap apps converge → recipient is auto-redirected to their tenant home page. NO operator intervention beyond the voucher email.	Playwright MCP
D30	Free-subdomain selection from operator-curated pool. Organization wizard step MUST present a subdomain picker populated from the configured pool: `omani.homes`, `omani.rest`, `omani.trade` (singular — see §4 below), and any others the operator has provisioned. Tenant chooses a free subdomain (e.g. `acme.omani.homes`) → cert provisions → tenant landing page resolves on the chosen FQDN with publicly-trusted TLS. The pool MUST come from a Sovereign-side CR / config (not hardcoded).	Playwright MCP + dig + curl
D31	Tenant application with CNPG active-hot-standby replication. Inside the new tenant, user picks a CNPG-backed app from the marketplace (e.g. Ghost or WordPress) → selects "active hot-standby" → app installs with a CNPG `Cluster` that replicates across the Sovereign's regions (primary + at least one replica). `kubectl get cluster.postgresql.cnpg.io -A` in the tenant context shows `instances` distributed across regions (region label / topology spread). Failover test: cordoning the primary region brings the replica to primary, app remains reachable on its FQDN within the documented RTO (≤ 30 s). Full counter-test continuity procedure in §6.	Playwright MCP + kubectl + curl
D32	Sandbox CRD installable on the Sovereign. `kubectl get crd sandboxes.sandbox.openova.io` returns the CRD; the controller Pod (`sandbox-controller` in `catalyst-system`) is Ready and processes a no-op `Sandbox` CR within 30 s (status transitions `Pending → Reconciling → Ready`). `helm template` of the Sovereign chart with sandbox enabled emits the controller Deployment + RBAC + Service. The Sandbox plane is part of every Sovereign by default — operator does not opt in.	kubectl + helm template
D33	Sandbox agent catalogue picker functional. Sovereign Console `/console/sandbox` lists at minimum the six agents specified in `products/sandbox/docs/architecture.md` — Claude Code, Cursor (cloud), Qwen Code, Aider, OpenCode, plus the Sovereign-native shell. Picking an agent opens a session host page; the BYOS settings page lets the operator paste an Anthropic OAuth client_id (per `products/sandbox/docs/claude-code-byos.md`). A picked session establishes a WebSocket to the pty-server and renders xterm.js with a live PTY prompt.	Playwright MCP
D34	newapi Sovereign-side LLM gateway routes to a backend model. `https://newapi.<fqdn>/v1/chat/completions` accepts an HS256 org-scoped JWT (issued by `core/services/auth`), authenticates the request, and proxies to a configured backend. The reference backend for this gate is Bank Dhofar Qwen. A round-trip `curl` with a valid JWT returns a non-empty `choices[0].message.content` within 30 s. No Anthropic / OpenAI cloud calls leave the Sovereign by default — BYOS is opt-in per-user.	curl + kubectl
D35	NATS broker round-trips `catalyst.tenant.created` + `catalyst.order.placed` end-to-end. SME tenant + billing dispatchers PUBLISH to NATS JetStream (subjects `catalyst.tenant.created`, `catalyst.tenant.updated`, `catalyst.order.placed`, `catalyst.invoice.paid` observed via `nats sub 'catalyst.>'`). Organization controller + Sandbox controller CONSUME (consume legs ship per #1862; round-trip wire test per the contract added in `56e04ac8a`). Round-trip test: issue a voucher → redeem it → measure latency from billing-service publish to Org controller reconcile-start ≤ 2 s. Convergence is NOT declared until both legs are wired — polling-the-API workaround does not satisfy this gate.	NATS CLI + kubectl logs

DoD grows. Every iteration of test-writer / test-executor finds more operator-visible bugs. Append the gate, ship the fix, re-validate. The list is the convergence contract; do not declare convergence until every appended gate passes on a single fresh prov.
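
Gates D8 and D19 both reduce to counting HelmReleases whose `Ready` condition is `True`. The sketch below assumes the input is the parsed JSON of `kubectl get hr -A -o json` and that readiness is reported under `status.conditions`, as Flux does for HelmRelease objects; the function names and the way the UI counts are passed in are hypothetical.

```python
def not_ready_helmreleases(hr_list):
    """Return "namespace/name" for every HelmRelease without Ready=True."""
    pending = []
    for item in hr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            meta = item["metadata"]
            pending.append(f'{meta["namespace"]}/{meta["name"]}')
    return pending


def counters_consistent(apps_tab_count, catalog_installed_count, hr_list):
    """D19: the two UI counters must both equal the Ready HelmRelease count."""
    total = len(hr_list.get("items", []))
    ready = total - len(not_ready_helmreleases(hr_list))
    return apps_tab_count == catalog_installed_count == ready
```

For D8 the passing case is an empty list from `not_ready_helmreleases`; for D19 the UI counts would have to be scraped separately, for example by the Playwright walk.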

Trigger phrases that mean STOP — about to compromise

"for now let's just do 2 regions"
"the matrix expects X — let me synth X into the response"
"Hetzner private net spans zones, let me use that for cross-region"
"ClusterMesh on NodePort is fine for testing"
"ash / sin is different zone, let me just stay in eu-central"
"private NIC link between regions is faster"

If any of these appear in your reasoning → STOP, re-read this file, fix the root cause.

Cycle protocol

Before any tofu apply or POST /api/v1/deployments:

1. Read this file (or the auto-memory mirror at ~/.claude/projects/-home-openova-repos-openova/memory/sovereign_multiregion_dod.md).
2. Log the D0–D35 list to the loop output.
3. Refuse to mark convergence until each D0–D35 has been individually checked.
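
The "refuse until every gate is individually checked" rule can be sketched as a small ledger check. The names below are hypothetical; the only facts taken from this document are the gate IDs D0 through D35 and the rule that an unchecked gate blocks convergence just as a failed one does.

```python
GATES = [f"D{i}" for i in range(36)]  # D0 .. D35


def may_declare_convergence(results):
    """results maps gate id -> True/False for gates that were individually checked.

    Returns (ok, missing, failed). A gate absent from results counts as unchecked.
    """
    missing = [g for g in GATES if g not in results]
    failed = [g for g in GATES if results.get(g) is False]
    return (not missing and not failed), missing, failed
```

Logging `GATES` at the start of a cycle and calling `may_declare_convergence` at the end covers steps 2 and 3 of the protocol in this simplified form.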

§4 — Domains canon

This section is the single source of truth for FQDN patterns used in Catalyst test provs and tenant Organizations. Every test, walk, agent dispatch, and provisioning request must use the patterns below.

Test-Sovereign FQDNs

Layer	Pattern	Notes
Sovereign (test)	`t<NN>.omani.works`	`<NN>` increments with every fresh prov (t39, t40, …).
Sovereign (test fallback)	`t<NN>.omantel.biz`	Use when `omani.works` hits a Let's Encrypt rate limit. Swap weekly.
Sovereign Console	`console.t<NN>.omani.works`	Operator-facing console UI.
Marketplace	`marketplace.t<NN>.omani.works`	Customer-facing storefront for the operator-curated catalog.
Operator services	`keycloak.t<NN>.omani.works`, `openbao.t<NN>.omani.works`, `openova-flow.t<NN>.omani.works`, `prometheus.t<NN>.omani.works`, `mimir.t<NN>.omani.works`, `loki.t<NN>.omani.works`, `tempo.t<NN>.omani.works`, `argo.t<NN>.omani.works`, `workspaces.t<NN>.omani.works`, `harbor.t<NN>.omani.works`, `registry.t<NN>.omani.works`, `guacamole.t<NN>.omani.works`	Per §3 D25.

Voucher operations live in the operator console's BSS menu, NOT in any admin.<sovereign-fqdn> subdomain. The legacy admin.* references in older docs and agents are outdated.

Let's Encrypt rate-limit fallback policy

omani.works is the default test TLD. When Let's Encrypt rate-limits issuance on that TLD (typically after many wipe → create cycles in a week), swap to t<NN>.omantel.biz for the affected week. Both TLDs are operator-owned and both have the Catalyst NS records pre-provisioned. Never improvise a third test TLD — adding one to the canon must go through a PR against this section.
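
A minimal sketch of the test-Sovereign naming rule and its fallback. The function names are hypothetical, and the sketch assumes that service hostnames on the fallback TLD follow the same `<service>.t<NN>.<tld>` shape as on `omani.works`, which the table above does not state explicitly.

```python
PRIMARY_TEST_TLD = "omani.works"
FALLBACK_TEST_TLD = "omantel.biz"  # used only while Let's Encrypt rate-limits the primary


def sovereign_test_fqdn(nn: int, le_rate_limited: bool = False) -> str:
    """Build t<NN>.<tld>, switching TLD only on a Let's Encrypt rate limit."""
    tld = FALLBACK_TEST_TLD if le_rate_limited else PRIMARY_TEST_TLD
    return f"t{nn}.{tld}"


def service_fqdn(service: str, sovereign_fqdn: str) -> str:
    """Build <service>.<sovereign-fqdn>, e.g. console.t39.omani.works."""
    return f"{service}.{sovereign_fqdn}"
```

Keeping the TLD choice in one function means a caller cannot improvise a third test TLD without changing the two constants.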

Tenant-Organization FQDNs

Tenant Organizations receive a free subdomain from an operator-curated pool allocated at signup. The pool population is defined in core/services/parent-domain/sovereign_parent_domains.go (the canonical Go source — pool TLDs are not hardcoded in tests).

Pattern	Example	Notes
`<orgslug>.omani.homes`	`acme.omani.homes`	Default — first NS-ready entry in registration order per `core/services/sme/sme_tenant.go:514-521`.
`<orgslug>.omani.rest`	`acme.omani.rest`	Pool alternate.
`<orgslug>.omani.trade`	`acme.omani.trade`	Pool alternate. Note: singular `trade`, not `trades` — earlier docs that said `omani.trades` are wrong; do not reintroduce the plural.

The tenant console URL pattern is console.<orgslug>.<pool-tld> — e.g. console.acme.omani.homes. Additional tenant-installed apps are reachable at <newapp>.<orgslug>.<pool-tld> — e.g. notes.acme.omani.homes.
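
The two tenant patterns can be expressed as a short sketch. The helper names are hypothetical, and the DNS-label check on the slug is an assumption (standard hostname label rules), not a rule taken from the Catalyst source. The pool TLD is passed in rather than hardcoded, in line with the note that the pool lives in the Go source.

```python
import re

# Assumption: slugs and app names follow ordinary DNS label rules.
_DNS_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def _require_label(value: str) -> str:
    if not _DNS_LABEL.match(value):
        raise ValueError(f"not a valid DNS label: {value!r}")
    return value


def tenant_console_fqdn(orgslug: str, pool_tld: str) -> str:
    """Console host for a tenant Organization: console.<orgslug>.<pool-tld>."""
    return f"console.{_require_label(orgslug)}.{pool_tld}"


def tenant_app_fqdn(app: str, orgslug: str, pool_tld: str) -> str:
    """Host for a tenant-installed app: <newapp>.<orgslug>.<pool-tld>."""
    return f"{_require_label(app)}.{_require_label(orgslug)}.{pool_tld}"
```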

Voucher redeem URL

The canonical voucher-email link pattern (per core/services/notification/templates/templates.go):

https://marketplace.t<NN>.omani.works/redeem/?code=<CODE>

The slash before ? is mandatory — both URL ends are part of the Phase 1 step 1a contract and must be byte-for-byte stable.
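
A minimal sketch of building and checking the redeem link, assuming the voucher code is an opaque string that only needs URL-quoting. The function names and the regular expression are illustrative; the authoritative template is the Go file named above.

```python
import re
from urllib.parse import quote

# Matches https://marketplace.<host>/redeem/?code=<CODE> with the mandatory slash.
_REDEEM_URL = re.compile(r"https://marketplace\.[a-z0-9.-]+/redeem/\?code=[^&#\s]+")


def voucher_redeem_url(sovereign_fqdn: str, code: str) -> str:
    """Build the redeem link; the '/' before '?' is part of the contract."""
    return f"https://marketplace.{sovereign_fqdn}/redeem/?code={quote(code, safe='')}"


def is_canonical_redeem_url(url: str) -> bool:
    """True only when the whole URL has the canonical shape."""
    return _REDEEM_URL.fullmatch(url) is not None
```

The validator rejects `/redeem?code=…` (no slash), which is the drift this contract is meant to catch.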

Forbidden in tests

The following strings must never appear in test code, test data, operator-walk runbooks, fresh-prov provisioning bodies, or any artifact that exercises the 5-Pillar deterministic path:

openova.io — and any subdomain (console.openova.io, marketplace.openova.io, etc.)
omantel.openova.io — legacy operator-sample FQDN, dead
eventforge.io — never an OpenOva domain; never the canonical app name
Nova Cloud — never the operator brand for the test stack

openova.io is reserved for the OpenOva marketing site (the openova-private/website/ repo) and the mothership control plane during Phase 0 + Phase 1 cold-start. After bp-self-sovereign-cutover runs, every reference to openova.io from a franchised Sovereign is a Principle #11 violation.

When to switch + why

Situation	Switch to	Why
Fresh prov on a clean LE quota	`t<NN>.omani.works`	Default; cheapest, most stable.
LE rate-limit on `omani.works`	`t<NN>.omantel.biz`	Same Catalyst NS records, different LE quota.
Tenant Organization signup, default	`<orgslug>.omani.homes`	First NS-ready entry; quietest pool TLD.
`omani.homes` pool exhausted in a Sovereign	`<orgslug>.omani.rest` then `<orgslug>.omani.trade`	Round-robin within the operator's pool config.
Any test or walk artifact	NEVER `openova.io` / `omantel.openova.io` / `eventforge.io` / `Nova Cloud`	Reserved for mothership / marketing surfaces only; appearance in a tenant test = Principle #11 violation.
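
The tenant rows of the table can be read as "first usable pool entry in registration order". The sketch below is one interpretation of that rule, not the logic in `sme_tenant.go`; `pick_pool_tld` and its parameters are hypothetical, and it treats "round-robin" as simply falling through to the next entry that is NS-ready and not exhausted.

```python
from typing import Iterable, Optional, Sequence


def pick_pool_tld(
    pool_in_registration_order: Sequence[str],
    ns_ready: Iterable[str],
    exhausted: Iterable[str] = (),
) -> Optional[str]:
    """Return the first pool TLD that is NS-ready and not exhausted, else None."""
    ready = set(ns_ready)
    full = set(exhausted)
    for tld in pool_in_registration_order:
        if tld in ready and tld not in full:
            return tld
    return None
```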

Domain hygiene checks

docs/trust-audit-*.md and PR review hunt for:

openova.io leaks in test data — any *_test.go / *.spec.ts / *.feature literal containing openova.io is a leak.
Hardcoded operator FQDNs — any code path that pins the operator domain to a literal instead of reading it from a runtime parameter (SOVEREIGN_FQDN, --sovereign-fqdn, etc.). See PRINCIPLES.md §4 (never hardcode).
Tenant-Org URL pattern drift — any path that emits <orgslug>.openova.io or <orgslug>.<sovereign-fqdn> instead of <orgslug>.<pool-tld>. The pool TLD is the source of truth.
admin.<sovereign-fqdn> references — voucher and billing operations live in the BSS menu inside the operator console; an admin.* subdomain means a stale reference.
Plural omani.trades — must be singular omani.trade. Any new occurrence is a regression.
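
Several of the checks above are plain string hunts and could be automated. The following is a rough sketch of such a scanner, assuming it is fed the text of a test file; the rule names and regular expressions are illustrative, and the hardcoded-FQDN and URL-pattern-drift checks are not covered because they need more context than a line match.

```python
import re

# (rule name, pattern) pairs; order determines report order within a line.
RULES = [
    ("openova.io leak", re.compile(r"openova\.io")),
    ("eventforge.io", re.compile(r"eventforge\.io")),
    ("Nova Cloud brand", re.compile(r"Nova Cloud")),
    ("plural omani.trades", re.compile(r"omani\.trades\b")),
    # admin.<label>.<label...>: requires at least two labels after "admin."
    ("admin.* subdomain", re.compile(r"(?<![\w-])admin\.[a-z0-9-]+\.[a-z]")),
]


def scan(text: str) -> list:
    """Return (line_number, rule_name) for every forbidden literal found."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

A non-empty result would fail the review; the singular `omani.trade` passes because the pattern anchors on the trailing `s`.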

When in doubt, defer to GLOSSARY.md and the Go source files named above.

§5 — Persona journeys

How different people use Catalyst. Defer to GLOSSARY.md for terminology. The journeys described below use Catalyst surfaces (console / Git / API) that are partially design-stage — see STATUS.md.

5.1 Personas

#	Persona	Where they live	Tools they use
P1	OpenOva Engineer	github.com/openova-io	Catalyst codebase, Blueprint repos
P2	`sovereign-admin`	Catalyst admin UI + Sovereign Gitea	Browser UI, Git, kubectl (debug)
P3	Support Agent (within a Sovereign's operations team)	Catalyst admin UI in support mode	Browser UI
P4	`org-admin`	Org-scoped Catalyst console	Browser UI, occasional Git
P5	SME End User (e.g. Ahmed, pharmacy owner on Omantel)	Marketplace + the App they installed	Browser only
P6	SME Power User (e.g. Ahmed's tech-savvy nephew)	Console with Developer mode toggled on	Browser, occasionally Git
P7	Corporate DevOps / SRE (e.g. Layla at Bank Dhofar)	Git + console in advanced view	Browser, Git, kubectl-on-own-vcluster, IDE
P8	Corporate App Developer (e.g. Omar at Bank Dhofar)	Console + Git for own service repos	Browser, Git, IDE
P9	Security / Compliance Officer (e.g. Khalid, CISO)	Audit dashboards + EnvironmentPolicy editor	Browser
P10	Billing Admin	Billing console	Browser

5.2 Surfaces

The three first-class surfaces (full list and rationale in ARCHITECTURE.md §5):

UI — Catalyst console. Form / Advanced / IaC editor depths.
Git — direct push or PR to the Application's Gitea repo (one repo per App; branches develop / staging / main map to dev / stg / prod), or to private Blueprint repos (shared-blueprints per-Org or catalog-sovereign Sovereign-wide).
API — REST + GraphQL, for portal integrations.

Plus one debug-only surface:

kubectl — inside one's own vcluster. Read-mostly, never used to mutate Catalyst-managed resources.

There is no fourth surface. Terraform, Pulumi, "catalystctl install" are not part of this model.

5.3 Personas × Journeys matrix

Cells show which surface(s) the persona uses for that journey. Bold = primary. Italic = secondary. Empty = not applicable.

	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10
J1 Build & publish Blueprint to public catalog	Git + CI
J2 Provision a Sovereign		UI+Git
J3 Onboard an Organization		UI	UI
J4 Create an Environment		UI		UI	auto on signup	UI	UI		view audit
J5 Install Application from catalog				UI	UI form	UI	UI + Git	UI + Git	view audit	view cost
J6 Configure Application				UI	UI form	UI	UI + Git	UI + Git	view audit
J7 Author private Blueprint		Git+CI				UI + Git	Git + CI	Git + CI	review + sign
J8 Author Crossplane Composition (advanced)	Git + CI	Git+CI					Git + CI		review
J9 Promote between Environments				UI			UI + Git PR	UI + Git PR	UI approve
J10 Observe runtime / debug		UI dashboards	UI dashboards	UI dashboards	App's own UI	UI	UI + kubectl	UI + kubectl	UI audit
J11 Rotate credentials		UI + auto		UI + auto	auto	UI	UI + auto		UI + policy
J12 Audit / compliance review		UI	UI	UI			UI (own changes)	UI (own changes)	UI export to SIEM
J13 Billing & quotas		UI quotas	UI read	UI invoices	UI plan					UI
J14 Off-board / migrate		UI export		UI	UI cancel		UI export		audit	UI final invoice

The operator's primary mutation surface is the BSS menu in the Sovereign Console sidebar — Business Support System for voucher issuance, billing, plan/quota administration, and tenant lifecycle operations. The BSS menu replaces the dead admin.<sovereign-fqdn> subdomain pattern.

Step	Surface	Action
O0	Sovereign Console at `console.t<NN>.omani.works`	PIN-login as owner-tier (D21 must pass — operator pre-populated).
O1	Sidebar → BSS → Vouchers	Issue voucher to test recipient (e.g. `hatice.yildiz@openova.io`). Voucher CR + DB row materialise (D28).
O2	Sovereign outbound SMTP	Recipient receives voucher email with canonical URL `https://marketplace.t<NN>.omani.works/redeem/?code=<CODE>` (slash mandatory).
O3	Sidebar → BSS → Tenants (post-redeem)	New Organization appears in the operator's tenant list with the chosen pool subdomain (D30).
O4	Sidebar → Settings	Region / Capacity / ControlPlaneSize / Created / DeploymentID / Pool subdomain populated with real values (D22).

What the operator never touches: admin.<fqdn>, kubectl, raw NATS, raw SQL.

5.5 Customer journey — voucher → checkout → Org (SME, Ahmed)

Cast. Ahmed owns 4 small pharmacies in Muscat. No IT staff. He has a laptop and a credit card. (Sovereign for this example is the Omantel-run Sovereign for SMEs.)

Day 1 — 14:00
─────────────
1. Ahmed receives the voucher email from his Omantel sales rep.
   Link points at his Sovereign's marketplace, e.g.
   https://marketplace.<sovereign-fqdn>/redeem/?code=<CODE>
2. Clicks the link → PIN-login (email + 6-digit PIN magic-link).
3. Picks the canonical Postgres-backed bundle from the operator-curated catalog
   (e.g. bp-bundle-pharmacy: ERPNext + WooCommerce + Stalwart-mail + Postgres + Redis).
4. Org wizard: picks subdomain `muscatpharmacy.<pool-tld>` from the picker
   (D30 — pool comes from operator-curated Sovereign config), confirms
   business details + 2-region BCP topology (Pillar 2).
5. Catalyst auto-creates: Organization "muscatpharmacy", Environment
   "muscatpharmacy-prod", vcluster "muscatpharmacy" on the chosen primary region.
   2 independent CNPG clusters provisioned, ReplicaCluster sync over
   ClusterMesh (Pillar 3). Environment-controller spins up the vcluster
   in ~60 seconds.
6. Provisioning service creates 5 Application Gitea repos under
   gitea.<location-code>.<sovereign-fqdn>/muscatpharmacy/ (one repo per App:
   erpnext, woocommerce, pharmacy-mail, shared-postgres, shared-redis), each
   with develop/staging/main branches and initial manifests.
   Webhook → projector → Flux in the muscatpharmacy vcluster picks up the
   5 new GitRepository sources and reconciles.
7. ~3 minutes later: Ahmed sees green checkmarks on his dashboard.
   Each App card has an "Open" button.
   Click ERPNext → SSO via the Sovereign's Keycloak realm for muscatpharmacy → he's in.
─────────────
Day 1 — 14:08 — Ahmed is selling.

What he never saw: Git, kubectl, vcluster, Flux, Blueprint, YAML, JetStream. His mental model: "I have a Sovereign account. I bought a bundle. It works."

This is the deterministic Phase 2 walkthrough from §2 expressed as a journey narrative. Same tenant Organization (Ahmed's muscatpharmacy, or any tenant created in §5.5).

Step	Surface	Action
T0	`console.<orgslug>.omani.homes`	Tenant PIN-logs-in. Dashboard renders (Pillar 1 + Pillar 2 already shipped).
T1	Tenant Console → Sandbox	Sandbox session launches with agent set to `qwen-code` by default (routes via newapi to Sovereign-hosted Qwen — zero Anthropic cost leak).
T2	Sandbox session	`openova-sandbox-mcp` auto-mounts: 49 MCP tools available (`sandbox.db.*`, `sandbox.auth.*`, `marketplace.app.*`, `sandbox.git.*`, `sandbox.iam.*`). User pastes zero credentials.
T3	qwen-code prompt	Tenant prompts: "install a notes app backed by Postgres in my Org, public on `notes.<orgslug>.omani.homes`."
T4	Agent action	qwen-code calls `marketplace.app.install` + `sandbox.db.provision` + `sandbox.auth.provisionRealm` via the MCP plugin. New app's CNPG cluster + namespace + HelmRelease + Gitea repo materialise.
T5	New app	Reachable at `https://notes.<orgslug>.omani.homes` with publicly-trusted TLS.

The tenant never typed a kubeconfig, never opened Git, never copied a DB connection string. Pillar 4 shipped end-to-end.

5.7 Corporate journey — Layla at Bank Dhofar (running its own Sovereign)

Cast. Layla is an SRE on Bank Dhofar's 12-person Cloud Platform team. They run their own Sovereign on Hetzner. Their internal Organizations are core-banking, digital-channels, analytics, corporate-it. Their default tooling is Git + IDE.

09:00  Coffee. Opens VS Code. Branch: bp-bd-payment-rail
─────────────────────────────────────────────────────────────────────────
       She's authoring a private Blueprint for a payment-rail microservice
       with Postgres + Redis dependencies.

09:15  Pushes to gitea.<location-code>.bankdhofar.local/digital-channels/shared-blueprints/
       bp-bd-payment-rail. CI in Bank Dhofar's GitHub Actions runner pool
       (running inside the Sovereign) builds the image, signs the Blueprint
       with cosign, publishes to the local OCI registry. blueprint-controller
       picks it up — visible as a private card in the digital-channels Org.

10:00  Switches to her Application repo:
       gitea.<location-code>.bankdhofar.local/digital-channels/payment-rail
       Checks out branch `develop` (the dev environment branch).
       Edits values.yaml (config tweak).
       Catalyst console (Plan view) shows the diff: what will change,
       dependency impact, drift, cost delta. Like `terraform plan`, but
       served by the API on the Git diff.

10:15  Happy. Commits to develop. Webhook → projector → Flux in the
       digital-channels vcluster (watching the develop branch on this
       Application repo) reconciles in 30s. Audit log captures her as
       committer at the App-repo level.

11:00  Need to debug the staging deployment of the same App.
       Browser: console → digital-channels-stg → payment-rail card
       → Logs tab. Then Topology tab to see across regions.
       Or, drops into kubectl scoped to her vcluster:
         $ kubectl --context=hz-fsn-rtz-prod-digital-channels logs -n payment-rail
       Direct kubectl, scoped strictly to her Org's vcluster (vcluster name
       per NAMING §1.5 is the Org name, not the Sovereign name — Layla's Org
       is `digital-channels`). Bank Dhofar's sovereign-admin grants this via
       a JIT elevation flow.

14:00  Promotion stg → uat. From the payment-rail Application card,
       clicks "Promote staging → uat". Catalyst opens a Gitea PR
       within the same payment-rail repo: source branch `staging`,
       target branch (a feature branch tracking uat config). The
       EnvironmentPolicy CR for digital-channels-uat (in
       system/catalyst-config/policies/) requires team-platform approver
       and an RE score ≥ 80%. Reviewer approves via Gitea web UI (or
       via the Catalyst console's PR view — same backend). Auto-merge.
       Flux in the uat-bound vcluster reconciles.

15:00  New Environment needed for a fraud lab. From the console:
       "New Environment in analytics" → fills name "fraud-lab-dev" →
       picks "small" topology (1 region, single bb=rtz). Environment-controller
       creates the vcluster and bootstraps Flux pointing at the develop
       branch of every Application repo in the analytics Org. No new repos
       are created (Application repos exist already, indexed by branch).
       Ready in 60s. Layla now has a new sandbox.

16:00  Business asks for the bank's existing Backstage portal to show
       Catalyst-managed services. Layla integrates: Backstage queries
       Catalyst REST API at https://api.<location-code>.bankdhofar.local/v1/applications,
       authenticated via workload identity (Backstage runs inside the
       Sovereign). Backstage's service catalog now includes Catalyst
       Applications alongside other systems. No code change in Catalyst —
       the API was already there.

What Layla DOES use: UI (for promotion approvals, observability, EnvironmentPolicy editing), Git (for Blueprint authoring in shared-blueprints and per-Application config in each App's repo with develop / staging / main branches), kubectl (for debugging her own vcluster), and the API (for integrating Backstage). She never writes Crossplane code unless she's contributing a new Composition upstream as a Blueprint — and even then it's via a Gitea PR.

What Layla doesn't use: Terraform, Pulumi, a "catalystctl" CLI, or any other tool that bypasses Git.

5.8 Application card (the user's primary handle)

The card is the user's view of an Application in their Environment. Anatomy below; full UX in the console docs.

┌────────────────────────────────────────────────────────────────┐
│  🌐  marketing-site                                       ⋮   │  ← name + menu
│      bp-wordpress @ 1.3.0                                      │  ← Blueprint + version
├────────────────────────────────────────────────────────────────┤
│  ●  Running         🔗 acme.com  ↗                             │  ← status + endpoint
│                                                                 │
│  📍 eu-central                          5 / 5 pods              │  ← placement + health
│  💾 postgres → shared-postgres (own card)                       │  ← key dependency (linked)
│                                                                 │
│  Last deploy: 2h ago by Layla     ⏵ View history                │  ← provenance
│                                                                 │
│  [ Open app ↗ ]    [ Settings ]    [ Logs ]    [ Topology ]    │  ← primary actions
└────────────────────────────────────────────────────────────────┘

States via the status badge:

State	Meaning
● Running (green)	All replicas healthy, traffic flowing
◐ Installing (blue)	Flux reconciling, progress shown inline
◑ Updating (blue)	Config or version change rolling out
◒ Degraded (amber)	Partial — `3/5 pods, 2 unhealthy`
◓ Failed (red)	Install or update failed, "View error" button
○ Paused (grey)	Manually paused, scale-to-zero
◔ Pending approval (purple)	Promotion PR open, awaiting reviewers

Clicking the card opens the detail page with tabs: Overview, Settings, Topology, Secrets, Observability, History, Manifests.

The Topology tab is where Placement edits happen — single-region → active-active, region picker, failover policy. The Manifests tab is the Monaco IaC editor.

5.9 Catalog vs Applications-in-use view

The Marketplace renders Blueprint cards (something to install) — visually distinct from Application cards (something running). The Blueprint detail page is the "where is this Blueprint running in my Org" view — a query, not a chain object. The Environment view groups Application cards by status, with backing services (Postgres, Redis, etc.) in their own section.

5.10 Default UI mode by Sovereign type

Setting	SME-style default	Corporate default (Bank Dhofar)
Console default depth	Form view	Advanced view + IaC editor toggle on
Developer mode (Blueprint Studio)	Hidden, off	Visible by role
Multi-Environment promotion features	Hidden when only 1 Env	Visible always
EnvironmentPolicy editor	Hidden by default	Visible by role
`kubectl` access for users	Off	On for `org-developer` and above
Git access for users	Off (sovereign-admin can flip per-Org)	On
Marketplace features (search, bundles, ratings)	All on	All on but de-emphasized
Specter / AIOps Blueprint included by default	Optional	Recommended (Cortex + Specter on top)

Each Sovereign sets its defaults at provisioning time; users can override them through per-user preferences, within the limits their role permissions allow.

§6 — Multi-region BCP test (D31 detail)

D31 is the region-kill BCP failover gate — the verifier for Pillar 3. Run in parallel with Phase 0 / 1 / 2 on the same fresh prov.

Preconditions

Tenant Organization exists (created via §2 Phase 1).
Tenant has installed a CNPG-backed app via the marketplace with active hot-standby selected. Two independent CNPG Cluster CRs exist — one in each chosen region — with ReplicaCluster sync over Cilium ClusterMesh on the DMZ WireGuard data plane (A2 + A3).
App reachable on its FQDN with publicly-trusted TLS.

Counter-test continuity procedure

The continuity check is a monotonic counter that increments through the region kill. Any gap, replay, or skipped value = failed gate.

Step	Action	Pass criterion
1	Start the counter writer: a client process that issues `INSERT … RETURNING id` every 100 ms against the app's primary CNPG endpoint, recording each returned `id` and timestamp locally.	Counter increments monotonically — no holes pre-failover.
2	Kill the primary region. Two valid kill modes: (a) instance destroy via the cloud-provider API; (b) `NetworkPolicy` isolation that drops all traffic in/out of the primary region's namespaces.	Primary region becomes unreachable within ≤ 5 s.
3	`failover-controller` (per Continuum CR) flips traffic to the replica region. Replica CNPG `ReplicaCluster` promotes to primary. Cilium ClusterMesh keeps inter-region pod-to-pod alive across the surviving regions.	Failover RTO ≤ 30 s end-to-end (kill → counter writer reconnects to the new primary).
4	Counter writer reconnects via the app's FQDN (DNS / LB now points at the surviving region).	Writer resumes within the 30 s window; no transaction lost — the next written `id` is `last_id + 1`, never less, never with a gap.
5	(Optional, ledger-grade) After failover, walk the durable WAL and confirm every `id` from before the kill is present in the new primary's data plane.	Zero `id` gaps in the replica-promoted-to-primary's data.
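
The pass criteria in the table can be checked mechanically against the counter writer's local log. The sketch below assumes the log has already been reduced to the ordered list of returned `id` values plus two client-side timestamps (kill time and reconnect time); `check_continuity` and its parameters are hypothetical names, not part of any Catalyst tooling.

```python
def check_continuity(
    ids: list,
    kill_ts: float,
    reconnect_ts: float,
    rto_limit_s: float = 30.0,
) -> list:
    """Return a list of violations; an empty list means the gate passes."""
    problems = []
    # Every id must be exactly previous + 1: no gap, no replay, no skip.
    for prev, cur in zip(ids, ids[1:]):
        if cur != prev + 1:
            problems.append(f"id gap or replay: {prev} -> {cur}")
    # RTO is measured from the kill to the writer reconnecting to the new primary.
    rto = reconnect_ts - kill_ts
    if rto > rto_limit_s:
        problems.append(f"RTO {rto:.1f}s exceeds {rto_limit_s:.1f}s")
    return problems
```

For example, `check_continuity([1, 2, 4], 100.0, 112.5)` reports the `2 -> 4` gap even though the reconnect was inside the 30 s window.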

Hard requirements

The kill MUST be a real region kill, not a Pod restart or a Deployment scale-to-zero. A single-region prov cannot satisfy D31 — see PR #1599 shape in PRINCIPLES.md ("multi-region claim on single-region prov").
The failover must be triggered by failover-controller reading the Continuum CR — never by a human flipping DNS by hand.
The counter writer's local log + the post-failover database state are the only acceptable evidence. Operator-walk screenshots alone do not satisfy D31.

§7 — Pillar 5 sovereignty cutover

The Phase 1 → cutover transition is the proof that a franchised Sovereign can operate independently of the OpenOva mothership.

Choreography

bp-self-sovereign-cutover installs dormant at bootstrap-kit slot 06a during Phase 1. It is present but inert — no Jobs running, no tether pivots active.
Post-handover, the operator clicks the "Achieve True Sovereignty" CTA in the Sovereign Console.
The Blueprint runs eight sequential Jobs, each pivoting one of the tethers listed in §1 Pillar 5. Tethers are pivoted in dependency order so the cluster never loses its ability to pull what it needs at each step (e.g. Harbor proxy-cache is warmed before containerd registries.yaml flips).
The final step is a 10-minute deny-egress NetworkPolicy hold against github.com, ghcr.io, and harbor.openova.io. During the hold:
- Flux must continue to reconcile (sources are now local Gitea + Harbor).
- All HelmReleases must remain Ready=True.
- No image-pull errors, no Git fetch errors, no upstream registry hits.
cutoverComplete=true is set only if the cluster reconciles green during the full 10-minute hold. Any hiccup = the cutover failed; roll back to the pre-cutover state, fix the root cause, re-run.
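
The hold rule can be sketched as a predicate over periodic samples. This is an illustration only, assuming some external poller records one sample per interval during the hold; `HoldSample`, its fields, and `cutover_complete` are hypothetical names and do not describe how `bp-self-sovereign-cutover` is implemented.

```python
from dataclasses import dataclass

HOLD_SECONDS = 600  # the 10-minute deny-egress hold


@dataclass
class HoldSample:
    t_seconds: float              # seconds since the hold started
    helmreleases_not_ready: int   # HelmReleases that are not Ready=True
    pull_or_fetch_errors: int     # image-pull or Git fetch errors observed
    allowed_upstream_flows: int   # allowed flows to the blocked upstream hosts


def cutover_complete(samples: list) -> bool:
    """True only if samples span the whole hold and every sample is clean."""
    if not samples:
        return False
    spans_hold = (
        min(s.t_seconds for s in samples) <= 0
        and max(s.t_seconds for s in samples) >= HOLD_SECONDS
    )
    all_clean = all(
        s.helmreleases_not_ready == 0
        and s.pull_or_fetch_errors == 0
        and s.allowed_upstream_flows == 0
        for s in samples
    )
    return spans_hold and all_clean
```

A single dirty sample, or a sample set that stops short of minute 10, yields `False`, matching the "any hiccup" rule.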

Verification (the only acceptable evidence)

Egress-block proof: a Hubble / NetworkPolicy log showing zero allowed flows to github.com, ghcr.io, harbor.openova.io for the duration of the 10-minute hold.
Reconcile-green proof: kubectl get hr -A -o jsonpath=… showing every HelmRelease Ready=True at minute 0 and minute 10 of the hold.
Operator-walk screenshot: the Sovereign Console's "Sovereignty" tab showing cutoverComplete=true with the timestamp of the hold.

No cutover claim without all three. See adr/0002-post-handover-sovereignty-cutover.md for the full architecture, alternatives considered, and per-step contract.

Customer-sync — how each Sovereign keeps the catalog after cutover

Each franchised Sovereign's Gitea mirrors the public catalog from this repo (openova-io/openova):

GitHub (openova-io/openova)              Per-Sovereign Gitea (mirrored)
─────────────────────────────              ───────────────────────────────
platform/cilium/        ────sync────>    gitea.<location-code>.<sovereign-domain>/catalog/bp-cilium/
products/cortex/        ────sync────>    gitea.<location-code>.<sovereign-domain>/catalog/bp-cortex/
...

Sovereigns pull on their own schedule (default daily). Air-gapped Sovereigns mirror via offline media. After bp-self-sovereign-cutover completes, the Sovereign's Flux reconciles exclusively from its local Gitea + Harbor — never back to github.com/openova-io or ghcr.io/openova-io (Principle #11).
