Definition of Done
What this is: the canonical end-user Definition of Done for OpenOva — the deterministic 2-phase test, the 5 inseparable pillars, all DoD gates (D0–D35), the domains canon (which domains, when, why), and the operator / customer / tenant persona journeys.
Authority: 📐 PERMANENT canon. Reviewed PRs only. The generic cross-project DoD principle (operator-walks-a-fresh-prov, no theater) lives in user-global ~/.claude/CLAUDE.md §2. This file is the OpenOva-specific elaboration.
Pointers: see PRINCIPLES.md for engineering rules, ARCHITECTURE.md for system shape, RUNBOOKS.md for operator how-tos.
Status: Authoritative. Updated: 2026-05-20. Supersedes the legacy split 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS, consolidated here per the lean-doc strategy (PR #2076 / PR #2084). SPIRE-issued SVIDs for Sandbox MCP auth are deferred per PR #2056; the Phase 2 mechanism therefore currently relies on the interim sandbox-pty-server stdio attachment (see §1 Pillar 4 below).
§1 — The 5 inseparable pillars
Every dispatch in this repo must answer:
Which of the 5 pillars does this work move forward, and which deterministic step (Phase 0 / 1 / 2) does it advance?
If the answer is "none," the work is wrong — pick differently.
The 5 pillars are inseparable — none alone is a viable platform. Pillar work is strictly primary; operator-console polish, cosmetic-guard re-enables, treemap drill-down quality, jobs region filter, admin sidebar nav are tertiary operator-debugger surfaces and must never displace pillar work.
| # | Pillar | What "shipped" looks like |
|---|---|---|
| 1 | Marketplace + voucher onboarding | Anonymous visitor reaches the operator-branded marketplace → picks the canonical Postgres-backed bundle → completes signup (email + 6-digit PIN magic-link) → Organization CR created. |
| 2 | Multi-region BCP topology choice at signup | Wizard exposes region / topology choice during signup; customer picks N regions; system provisions across all N in one pass. Not a Day-2 upgrade. |
| 3 | Two independent CNPG clusters with ReplicaCluster sync + region-kill failover | One CNPG cluster per region; synchronous ReplicaCluster replication over Cilium ClusterMesh on the DMZ WireGuard-over-public-IPs data plane; region-kill test passes with zero transactions lost. |
| 4 | Sandbox + auto-mounted MCP plugin with full org knowledge | Sandbox launches the chosen agent CLI; openova-sandbox-mcp auto-mounts at session start with every org resource (apps, vClusters, conn-strings via OpenBao, Gitea repos, IAM, region health). User pastes zero credentials. Agent answers prompts with full org context and mutates resources via MCP tool calls. |
| 5 | Sovereign independence post-cutover | After bp-self-sovereign-cutover runs, zero egress to harbor.openova.io, ghcr.io/openova-io, or github.com/openova-io — proved by a 10-minute deny-egress NetworkPolicy hold (Principle #11). |
Pillar 4 — openova-sandbox-mcp auto-mount mechanism
When a Sandbox session attaches, the sandbox-pty-server (per
products/sandbox/pty-server/) writes the chosen agent's mcp.json config to
every canonical agent-config path (claude-code, qwen-code, opencode, aider,
cline) and starts the MCP server as a stdio subprocess of the agent process.
The server exposes 49 handlers grouped under namespaces such as
sandbox.db.*, sandbox.auth.*, marketplace.app.*, sandbox.git.*,
sandbox.iam.*, etc. (full registry in
products/sandbox/mcp-server/internal/tools/registry.go).
Authentication is currently the stdio child-process trust boundary (the agent process is the tenant's session and the MCP server inherits its identity). SPIFFE / SPIRE-issued SVIDs as the long-term auth substrate are deferred per PR #2056; when SVIDs land, the agent's caller-identity will become the tenant Organization's workload identity, never a long-lived API key, and the agent will never see credentials.
Per Principle #1 (the waterfall is the contract) and Principle #2 (never compromise from quality), Pillar 4 is not "ship a stub MCP server now and wire real tools later." A Sandbox session that boots without the full 49 tools is Pillar 4 unshipped, regardless of how good the chrome looks.
Pillar 5 — bp-self-sovereign-cutover and the 8-tether pivot
A franchised Sovereign emerging from Phase 1 is operationally tethered to the
OpenOva mothership in eight places (full map in
adr/0002-post-handover-sovereignty-cutover.md
§2.1 and in ARCHITECTURE.md §11.1):
| # | Tether | Phase |
|---|---|---|
| 1 | Flux GitRepository.url = github.com/openova-io/openova | P0 |
| 2 | containerd registries.yaml rewrites every upstream registry → https://harbor.openova.io | P0 |
| 3 | OCI HelmRepository urls = oci://ghcr.io/openova-io | P0 |
| 4 | catalyst-api env fallback to https://github.com/openova-io/openova | P0 |
| 5 | flux-system/ghcr-pull Secret seeded for private GHCR pulls | P0 |
| 6 | Crossplane provider packages from xpkg.upbound.io | P1 |
| 7 | Catalyst-authored images = ghcr.io/openova-io/openova/* | P0 |
| 8 | OS package mirrors during cloud-init (apt, get.k3s.io) | P2 (cold-start only) |
bp-self-sovereign-cutover installs dormant at bootstrap-kit slot 06a
during Phase 1 and is triggered post-handover by the operator's "Achieve True
Sovereignty" CTA. Eight sequential Jobs pivot the tethers in dependency order;
the final step is a 10-minute deny-egress NetworkPolicy hold against
github.com, ghcr.io, and harbor.openova.io. The only condition under
which cutoverComplete=true is set is that the cluster reconciles green
during this hold. No cutover claim without the egress-block proof. Full
choreography in §7 below.
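The gating rule described above (pivot in order, then prove independence with the egress hold) can be sketched as follows. This is a minimal illustration, not the chart's implementation: `run_cutover`, `pivots`, and `hold_is_green` are hypothetical names, and the sketch assumes the hold runs after all tether pivots have succeeded.

```python
from typing import Callable, Sequence

HOLD_SECONDS = 10 * 60  # the 10-minute deny-egress hold
BLOCKED_HOSTS = ("github.com", "ghcr.io", "harbor.openova.io")


def run_cutover(
    pivots: Sequence[Callable[[], bool]],
    hold_is_green: Callable[[Sequence[str], int], bool],
) -> bool:
    """Return the value that cutoverComplete should take.

    pivots: hypothetical callables, one per tether, in dependency order.
    hold_is_green: hypothetical probe that blocks egress to the given hosts
    for the given duration and reports whether the cluster stayed green.
    """
    for pivot in pivots:
        if not pivot():
            # A failed pivot stops the sequence; later tethers are not touched.
            return False
    # Only the egress-block proof may set cutoverComplete=true.
    return hold_is_green(BLOCKED_HOSTS, HOLD_SECONDS)
```

The point of the shape is that no code path returns `True` without passing through the hold probe.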
§2 — The deterministic test (Phase 0 / 1 / 2)
The test is deterministic — one fresh prov, one run, all phases pass in order. No retries, no "works if you wait longer."
Phase 0 — Operator issues voucher via BSS
Voucher operations live in the operator console's BSS menu (Business
Support System), NOT in any admin.<sovereign-fqdn> subdomain. The legacy
admin.* references in older docs / agents are outdated.
| Step | Action | URL / Outcome |
|---|---|---|
| 0a | Operator logs in to the Sovereign Console | https://console.t<NN>.omani.works |
| 0b | Navigate to the BSS menu | Sidebar → BSS (NOT admin.<fqdn>/...) |
| 0c | Issue voucher | Voucher artifact created + delivered to recipient via Sovereign outbound SMTP |
Phase 1 — Customer redeems voucher (Postgres-backed app onboarding)
| Step | Action | URL / Outcome |
|---|---|---|
| 1a | Customer receives voucher email | Canonical URL pattern per core/services/notification/templates/templates.go: https://marketplace.t<NN>.omani.works/redeem/?code=<CODE> (slash before ? is mandatory) |
| 1b | Customer redeems → checkout → picks the Postgres-backed bundle | Org provisions across the 2 chosen regions with 2 independent CNPG clusters (ReplicaCluster sync over ClusterMesh on the WG-public-IP DMZ data plane) |
| 1c | Org URL after signup | https://console.<orgslug>.omani.homes (default pool TLD; pool also has omani.rest and omani.trade per core/services/parent-domain/sovereign_parent_domains.go) |
Phase 2 — Customer launches Sandbox; agent provisions an additional app via MCP
This is the most important test. It exercises Pillar 4 end-to-end and proves that an agent acting on behalf of the tenant can mutate the Organization's resources entirely through the auto-mounted MCP plugin — without the user typing any credential.
| Step | Action | Outcome |
|---|---|---|
| 2a | Tenant logs in at console.<orgslug>.omani.homes | Dashboard renders |
| 2b | Opens Sandbox | Sandbox session launches with agent set to qwen-code (NOT claude-code — qwen-code routes through newapi → Sovereign-hosted Qwen, zero Anthropic cost leak) |
| 2c | openova-sandbox-mcp auto-mounts at session start | 49 MCP tools available with zero user-typed config (full handler set per products/sandbox/mcp-server/internal/tools/registry.go) |
| 2d | Customer prompts qwen-code to provision an additional application in their Organization | Agent uses MCP tools (sandbox.db.provision, sandbox.auth.provisionRealm, marketplace.app.install, etc.) — new app CNPG cluster + namespace + HelmRelease + Gitea repo materialise |
| 2e | New app reachable | At <newapp>.<orgslug>.omani.homes |
Orthogonal — D31 region-kill BCP failover
Run in parallel with Phase 0 / 1 / 2 to exercise Pillar 3. See §3 (D31) for the gate definition and §6 below for the full counter-test continuity procedure.
Mapping each pillar to the deterministic steps
| Pillar | Steps it covers |
|---|---|
| Pillar 1 — Marketplace + signup | Phase 0 (all), Phase 1 step 1a (voucher email), Phase 1 step 1b (redeem + checkout), Phase 1 step 1c (post-signup landing) |
| Pillar 2 — Multi-region BCP at signup | Phase 1 step 1b (wizard region-selection step) |
| Pillar 3 — 2 CNPG clusters + region-kill failover | Phase 1 step 1b (provisioning the 2 clusters), orthogonal D31 (the kill test) |
| Pillar 4 — Sandbox + auto-mounted MCP | Phase 2 steps 2a–2e |
| Pillar 5 — Sovereign independence | Implicit in all of the above; verified separately by the bp-self-sovereign-cutover 10-minute deny-egress hold (see §7 + Principle #11) |
What "shipped" means
A pillar is shipped when an operator (or a read-only Playwright verification agent — never a verification agent that ships fixes) walks a fresh prov through the pillar-relevant steps and produces:
- A screenshot (.playwright-mcp/t<NN>-<surface>-<YYYY-MM-DD>.png)
- A non-empty wire-level capture (log line, curl output, kubectl output, or HAR file)
- A working downstream artifact (the new app reachable, the failover counter intact, the egress-block proof recorded)
One PR landing does not ship a pillar. One walk-with-screenshot does.
Every PR against a surface flips that surface back to 🔴 UNVERIFIED in
TRUST.md until re-walked.
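The screenshot naming convention in the evidence list above lends itself to a mechanical check. The sketch below is illustrative only: `parse_screenshot_path` is a hypothetical helper, and it assumes `<NN>` is a decimal number and `<surface>` is a lowercase kebab-case label, neither of which this document specifies.

```python
import re
from datetime import date

# Assumed shape: .playwright-mcp/t<NN>-<surface>-<YYYY-MM-DD>.png
_SHOT = re.compile(
    r"^\.playwright-mcp/t(\d+)-([a-z0-9][a-z0-9-]*)-(\d{4}-\d{2}-\d{2})\.png$"
)


def parse_screenshot_path(path):
    """Return (prov number, surface, date) or None if the path is off-pattern."""
    m = _SHOT.match(path)
    if m is None:
        return None
    try:
        taken = date.fromisoformat(m.group(3))  # rejects impossible dates
    except ValueError:
        return None
    return int(m.group(1)), m.group(2), taken
```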
§3 — DoD gates (D0–D35)
This is the convergence contract. Every wipe → create → test cycle must validate every gate below before claiming a Sovereign is converged. Founder ruling 2026-05-15: silent compromise from these gates is a quality violation.
Architecture invariants (never compromise)
| ID | Rule |
|---|---|
| A1 | 3 regions minimum. If a provider has capacity / zone constraints, swap regions — never silently drop to 2. |
| A2 | Inter-region link = DMZ WireGuard over PUBLIC IPs. No hcloud_network cross-region, no VPC peering, no Huawei VPC — provider-agnostic, always over the DMZ WG endpoint. |
| A3 | Cilium ClusterMesh apiserver Service = LoadBalancer (public IP through DMZ WG). The word NodePort must never appear in clustermesh-apiserver Service spec on any Sovereign. |
| A4 | vCluster topology: primary region = MGMT+DMZ; each secondary region = DMZ+RTZ. Cross-vCluster intra-region traffic stays inside host k3s via Cilium. |
| A5 | Zero public exposure of K8s control-plane endpoints. kubectl get svc -A on a converged Sovereign returns no NodePort for clustermesh-apiserver, kube-apiserver, or etcd. |
| A6 | Provider-mix is the canonical case. Assume 1 region Hetzner, 1 AWS, 1 Huawei. Code must work for that even when the active test prov is all-Hetzner. |
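Invariants A3 and A5 can be checked mechanically against the output of `kubectl get svc -A -o json`. The following is a sketch under assumptions: `nodeport_violations` is a hypothetical helper, and matching guarded Services by name substring is a simplification chosen for illustration.

```python
import json

# Service names that must never be exposed as NodePort (per A3 / A5).
GUARDED = ("clustermesh-apiserver", "kube-apiserver", "etcd")


def nodeport_violations(svc_list_json: str) -> list[str]:
    """Return namespace/name for every guarded Service of type NodePort.

    svc_list_json: the JSON text printed by `kubectl get svc -A -o json`.
    """
    doc = json.loads(svc_list_json)
    bad = []
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        namespace = item["metadata"].get("namespace", "")
        is_nodeport = item["spec"].get("type") == "NodePort"
        if is_nodeport and any(g in name for g in GUARDED):
            bad.append(f"{namespace}/{name}")
    return bad
```

An empty list is the passing result; any entry means the Sovereign is not converged.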
Gates D0–D35
Every gate must pass on a SINGLE fresh provision in one continuous run. No partial credit.
D0 sits ABOVE D1. Without successful handover + auto-redirect, the operator never sees that provisioning succeeded — every other gate becomes invisible from their perspective. The zero-touch contract is end-to-end OPERATOR experience, not just backend convergence.
| # | Gate | Verifier |
|---|---|---|
| D0 | Successful handover + auto-redirect to Sovereign Console. Once deployment.status=ready AND deployment.handoverFiredAt != null, the mothership UI auto-routes operator's browser to deployment.handoverURL (/auth/handover?token=<jwt> on the Sovereign Console). Synthetic Apps / Handover per-region stage rows MUST be marked Succeeded (or not-applicable), never stuck Pending after handover fires. No operator action required: they should land on the Sovereign Console without copying / typing the FQDN. | Playwright MCP |
| D1 | dig console.<fqdn> @1.1.1.1 returns primary LB IP (auto-written by catalyst-api after Phase-0) | dig |
| D2 | curl https://console.<fqdn>/ → 200, cert publicly trusted (verify=0) | curl |
| D3 | PIN-login: enter email → receive PIN via IMAP → enter PIN → dashboard | Playwright MCP |
| D4 | Keycloak SSO: PIN-login bounces through Keycloak once, lands on /dashboard with session cookie | Playwright MCP |
| D5 | /cloud view: renders all 3 regions, no stuck spinners | Playwright MCP |
| D6 | /jobs view: 0 pending, 0 running (every job in terminal state) | Playwright MCP |
| D7 | Mothership flow Jobs ≡ child Sovereign Jobs (same IDs, same statuses) | Playwright MCP diff |
| D8 | kubectl --context <child> get hr -A shows all 135 HRs Ready=True across all clusters | kubectl |
| D9 | clustermesh-apiserver Pod Ready in every region, no restarts, no x509 errors | kubectl |
| D10 | cilium clustermesh status shows OK for every peer cluster | cilium CLI |
| D11 | Inter-region pod-to-pod packet test passes, hubble-flow shows WireGuard traversal | kubectl + hubble |
| D12 | kubectl get svc -A \| grep clustermesh shows only LoadBalancer (no NodePort) | kubectl |
| D13 | Canvas flow page: sibling-deps edges render, no orphan bubbles, no phantom pillars | Playwright MCP |
| D14 | Operator re-login after browser refresh works without re-PIN within session TTL | Playwright MCP |
| D15 | /cloud?view=graph canvas accurate per kind. vCluster N/N non-zero (6/6 on a converged 3-region prov: 1 mgmt + 3 dmz + 2 rtz). LoadBalancer N/N non-zero (clustermesh-apiserver LBs + ingress LBs counted). Cluster N/N matches actual 3 regions. No kind chip shows 0/0 for a resource that actually exists in the cluster. | Playwright MCP |
| D16 | /dashboard Layer-1 / Layer-2 grouping renders multi-region. Selecting Layer-1=Cluster on /dashboard MUST emit 3 cluster-grouped bubbles (one per region), not a single Sovereign. Layer-2=Namespace MUST emit namespace bubbles WITHIN each cluster bubble. The hierarchy Cloud → Region → Cluster → vCluster → Namespace → Application must collapse correctly per the operator's Layer-1 / Layer-2 selection. | Playwright MCP |
| D17 | Application detail route /app/<name> shows the application, not "deployment id malformed". Clicking any application card in /apps MUST navigate to a route where the URL segment is the application name (e.g. bp-cnpg) and the renderer treats it as an application reference, NOT as a deployment id. The notifications drawer MUST NOT contain "Deployment id in the URL is malformed" entries for valid app-name segments. | Playwright MCP |
| D18 | Sovereign-side catalyst-api can self-monitor Phase-1 install state. The chroot Sovereign catalyst-api MUST be able to fetch its own cluster's kubeconfig (or use in-cluster service account) to observe HelmRelease state. The notifications drawer MUST NOT contain "Per-component install monitoring is unavailable for this deployment — the Catalyst API couldn't fetch the new cluster's kubeconfig" entries. Operator should never need to drop to kubectl get helmrelease to know per-app install state. | Playwright MCP |
| D19 | Apps + Cloud counter consistency. Apps page Deployments tab count MUST equal Catalog "INSTALLED" count MUST equal kubectl get hr -A Ready count. Cloud canvas kind chips MUST NOT show N/0 for resources that exist (vCluster, LoadBalancer, Bucket, Volume, PVC). PVC count in graph view MUST equal PVC count in list view. App card hrefs MUST NOT have doubled prefix (/app/bp-bp-*). | Playwright MCP |
| D20 | Jobs page surfaces all-region jobs with region filter. Jobs view MUST show per-region prefixes (nbg1-1:, sin-2:) for every app on a multi-region Sovereign, plus an App-filter that lets the operator narrow to a single region. Any unexplained N/M counter MUST resolve to an actionable filter or be removed. | Playwright MCP |
| D21 | Operator pre-populated as owner-tier on /users post-handover. Sovereign Console /users MUST list the operator who completed PIN-login as tier=owner with their email. /users MUST NOT render empty on a freshly-converged Sovereign. | Playwright MCP |
| D22 | Settings page shows real values. /settings MUST render real values for Region, Capacity, ControlPlaneSize, Created (timestamp), DeploymentID, Pool subdomain, NOT em-dash placeholders or "API PENDING" badges. Operator MUST be able to see what their Sovereign actually is. | Playwright MCP |
| D23 | Sovereign-side /wizard route does not collide with post-handover landing. After PIN-login + handover, operator's browser MUST land on /dashboard (or the canonical post-handover surface), NEVER on /wizard (which is the mothership new-prov flow). | Playwright MCP |
| D24 | Mothership-only views absent from Sovereign Console. The Sovereign Console MUST NOT expose: /app/dashboard (mothership fleet view), /app/settings (mothership settings), + New deployment button. The Sovereign Console is for ONE Sovereign; the mothership fleet view is a different UI. | Playwright MCP |
| D25 | All operator-facing service hostnames reachable + correctly wired. keycloak.<fqdn> / openbao.<fqdn> / openova-flow.<fqdn> / prometheus.<fqdn> / mimir.<fqdn> / loki.<fqdn> / tempo.<fqdn> / argo.<fqdn> / workspaces.<fqdn> MUST return non-zero HTTP. harbor.<fqdn> / registry.<fqdn> / guacamole.<fqdn> / marketplace.<fqdn> MUST return their app page, not 404. No service config may carry a dev hostname (e.g. gitea.catalyst.local) in production HTML. | curl + Playwright MCP |
| D26 | CSP allows fonts or self-hosts woff2. Operator MUST NOT see system-font fallback on Sovereign Console pages. Either fonts.googleapis.com is allowed by CSP, or fonts are self-hosted (no external dependency). | Playwright MCP |
| D27 | Marketplace enabled on the Sovereign. MARKETPLACE_ENABLED=true flows from provision body → bp-catalyst-platform → Sovereign Console: a /marketplace route returns 200 with a non-empty catalog page (apps + voucher admin), NOT 404, NOT a "marketplace disabled" stub. kubectl get hr -A shows bp-marketplace (or whichever HR backs the marketplace) Ready=True on the chroot. The mothership provision wizard MUST default marketplace.enabled=true (zero-touch: operator never toggles a flag). | Playwright MCP + kubectl |
| D28 | Voucher issuance from owner-tier UI. Owner (the operator who PIN-logged-in per D21) opens Sovereign Console marketplace admin → issues a voucher for tenant onboarding. Voucher artifact MUST persist (CR + DB row), MUST be emailed to the chosen recipient via the Sovereign's outbound SMTP, AND MUST be visible in the admin's voucher list. Issuance must be one-click (no kubectl, no API call). Test recipient: hatice.yildiz@openova.io (canonical operator-test address) or any other Sovereign-side mailbox the operator controls. | Playwright MCP |
| D29 | Voucher-based organization (tenant) provisioning is zero-touch. Recipient opens the voucher email → clicks redeem link → PIN-login as the test recipient (e.g. hatice.yildiz@openova.io) → lands on an organization-creation wizard → completes the form → a new Organization / Tenant CR is created → tenant namespace + RBAC + bootstrap apps converge → recipient is auto-redirected to their tenant home page. NO operator intervention beyond the voucher email. | Playwright MCP |
| D30 | Free-subdomain selection from operator-curated pool. Organization wizard step MUST present a subdomain picker populated from the configured pool: omani.homes, omani.rest, omani.trade (singular; see §4 below), and any others the operator has provisioned. Tenant chooses a free subdomain (e.g. acme.omani.homes) → cert provisions → tenant landing page resolves on the chosen FQDN with publicly-trusted TLS. The pool MUST come from a Sovereign-side CR / config (not hardcoded). | Playwright MCP + dig + curl |
| D31 | Tenant application with CNPG active-hot-standby replication. Inside the new tenant, user picks a CNPG-backed app from the marketplace (e.g. Ghost or WordPress) → selects "active hot-standby" → app installs with a CNPG Cluster that replicates across the Sovereign's regions (primary + at least one replica). kubectl get cluster.postgresql.cnpg.io -A in the tenant context shows instances distributed across regions (region label / topology spread). Failover test: cordoning the primary region brings the replica to primary, app remains reachable on its FQDN within the documented RTO (≤ 30 s). Full counter-test continuity procedure in §6. | Playwright MCP + kubectl + curl |
| D32 | Sandbox CRD installable on the Sovereign. kubectl get crd sandboxes.sandbox.openova.io returns the CRD; the controller Pod (sandbox-controller in catalyst-system) is Ready and processes a no-op Sandbox CR within 30 s (status transitions Pending → Reconciling → Ready). helm template of the Sovereign chart with sandbox enabled emits the controller Deployment + RBAC + Service. The Sandbox plane is part of every Sovereign by default; operator does not opt in. | kubectl + helm template |
| D33 | Sandbox agent catalogue picker functional. Sovereign Console /console/sandbox lists at minimum the six agents specified in products/sandbox/docs/architecture.md: Claude Code, Cursor (cloud), Qwen Code, Aider, OpenCode, plus the Sovereign-native shell. Picking an agent opens a session host page; the BYOS settings page lets the operator paste an Anthropic OAuth client_id (per products/sandbox/docs/claude-code-byos.md). A picked session establishes a WebSocket to the pty-server and renders xterm.js with a live PTY prompt. | Playwright MCP |
| D34 | newapi Sovereign-side LLM gateway routes to a backend model. https://newapi.<fqdn>/v1/chat/completions accepts an HS256 org-scoped JWT (issued by core/services/auth), authenticates the request, and proxies to a configured backend. The reference backend for this gate is Bank Dhofar Qwen. A round-trip curl with a valid JWT returns a non-empty choices[0].message.content within 30 s. No Anthropic / OpenAI cloud calls leave the Sovereign by default; BYOS is opt-in per-user. | curl + kubectl |
| D35 | NATS broker round-trips catalyst.tenant.created + catalyst.order.placed end-to-end. SME tenant + billing dispatchers PUBLISH to NATS JetStream (subjects catalyst.tenant.created, catalyst.tenant.updated, catalyst.order.placed, catalyst.invoice.paid observed via nats sub 'catalyst.>'). Organization controller + Sandbox controller CONSUME (consume legs ship per #1862; round-trip wire test per the contract added in 56e04ac8a). Round-trip test: issue a voucher → redeem it → measure latency from billing-service publish to Org controller reconcile-start ≤ 2 s. Convergence is NOT declared until both legs are wired; polling-the-API workaround does not satisfy this gate. | NATS CLI + kubectl logs |
DoD grows. Every iteration of test-writer / test-executor finds more operator-visible bugs. Append the gate, ship the fix, re-validate. The list is the convergence contract; do not declare convergence until every appended gate passes on a single fresh prov.
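The 6/6 vCluster figure in D15 follows directly from invariant A4 (primary region = MGMT+DMZ, each secondary = DMZ+RTZ). A short worked sketch, with `expected_vclusters` as a hypothetical helper name, shows how the expected counts generalise to N regions under that reading of A4:

```python
def expected_vclusters(regions: int) -> dict[str, int]:
    """Expected vCluster counts per kind for a Sovereign with `regions` regions.

    Derived from A4: the primary region hosts MGMT + DMZ, and every
    secondary region hosts DMZ + RTZ.
    """
    if regions < 1:
        raise ValueError("at least one region is required")
    return {
        "mgmt": 1,            # primary region only
        "dmz": regions,       # one per region
        "rtz": regions - 1,   # one per secondary region
    }
```

For a 3-region prov this gives 1 + 3 + 2, the same breakdown D15 quotes.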
Trigger phrases that mean STOP — about to compromise
- "for now let's just do 2 regions"
- "the matrix expects X — let me synth X into the response"
- "Hetzner private net spans zones, let me use that for cross-region"
- "ClusterMesh on NodePort is fine for testing"
- "ash / sin is different zone, let me just stay in eu-central"
- "private NIC link between regions is faster"
If any of these appear in your reasoning → STOP, re-read this file, fix the root cause.
Cycle protocol
Before any tofu apply or POST /api/v1/deployments:
- Read this file (or the auto-memory mirror at ~/.claude/projects/-home-openova-repos-openova/memory/sovereign_multiregion_dod.md).
- Log the D0–D35 list to the loop output.
- Refuse to mark convergence until each D0–D35 has been individually checked.
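The last rule of the cycle protocol (every gate individually checked, no partial credit) can be expressed as a small guard. This is a sketch only: `may_declare_convergence` and the `results` mapping are hypothetical, and the gate list is hardcoded to D0 through D35 even though the document notes the list grows over time.

```python
GATES = [f"D{i}" for i in range(36)]  # D0 .. D35


def may_declare_convergence(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (ok, outstanding) for one fresh-prov run.

    results: hypothetical mapping of gate ID to pass/fail for that run.
    A gate that is absent from `results` counts as unchecked, not as passed.
    """
    missing = [g for g in GATES if g not in results]
    failed = [g for g in GATES if results.get(g) is False]
    return (not missing and not failed, missing + failed)
```

Treating a missing entry as a blocker is the design choice that matters here: silence about a gate must never read as success.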
§4 — Domains canon
This section is the single source of truth for FQDN patterns used in Catalyst test provs and tenant Organizations. Every test, walk, agent dispatch, and provisioning request must use the patterns below.
Test-Sovereign FQDNs
| Layer | Pattern | Notes |
|---|---|---|
| Sovereign (test) | t<NN>.omani.works | <NN> increments with every fresh prov (t39, t40, …). |
| Sovereign (test fallback) | t<NN>.omantel.biz | Use when omani.works hits a Let's Encrypt rate limit. Swap weekly. |
| Sovereign Console | console.t<NN>.omani.works | Operator-facing console UI. |
| Marketplace | marketplace.t<NN>.omani.works | Customer-facing storefront for the operator-curated catalog. |
| Operator services | keycloak.t<NN>.omani.works, openbao.t<NN>.omani.works, openova-flow.t<NN>.omani.works, prometheus.t<NN>.omani.works, mimir.t<NN>.omani.works, loki.t<NN>.omani.works, tempo.t<NN>.omani.works, argo.t<NN>.omani.works, workspaces.t<NN>.omani.works, harbor.t<NN>.omani.works, registry.t<NN>.omani.works, guacamole.t<NN>.omani.works | Per §3 D25. |
Voucher operations live in the operator console's BSS menu, NOT in any
admin.<sovereign-fqdn> subdomain. The legacy admin.* references in older
docs and agents are outdated.
Let's Encrypt rate-limit fallback policy
omani.works is the default test TLD. When Let's Encrypt rate-limits issuance
on that TLD (typically after many wipe → create cycles in a week), swap to
t<NN>.omantel.biz for the affected week. Both TLDs are operator-owned and
both have the Catalyst NS records pre-provisioned. Never improvise a third
test TLD — adding one to the canon must go through a PR against this section.
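As a sketch of the fallback policy, the test-Sovereign FQDN can be derived from the prov number and a rate-limit flag. The function names and the boolean flag are assumptions for illustration; the two TLDs and the `t<NN>` prefix come from this section, and the number is rendered without zero-padding, matching `t39` and `t40` above.

```python
DEFAULT_TLD = "omani.works"    # default test TLD
FALLBACK_TLD = "omantel.biz"   # used while Let's Encrypt rate-limits the default


def sovereign_test_fqdn(nn: int, le_rate_limited: bool = False) -> str:
    """Build the test-Sovereign FQDN for prov number nn."""
    tld = FALLBACK_TLD if le_rate_limited else DEFAULT_TLD
    return f"t{nn}.{tld}"


def console_host(sovereign_fqdn: str) -> str:
    """Operator-facing console host for a Sovereign."""
    return f"console.{sovereign_fqdn}"


def marketplace_host(sovereign_fqdn: str) -> str:
    """Customer-facing marketplace host for a Sovereign."""
    return f"marketplace.{sovereign_fqdn}"
```

There is deliberately no third branch: only the two canonical TLDs can be produced.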
Tenant-Organization FQDNs
Tenant Organizations receive a free subdomain from an operator-curated pool
allocated at signup. The pool population is defined in
core/services/parent-domain/sovereign_parent_domains.go (the canonical Go
source — pool TLDs are not hardcoded in tests).
| Pattern | Example | Notes |
|---|---|---|
| `<orgslug>.omani.homes` | `acme.omani.homes` | Default: first NS-ready entry in registration order per core/services/sme/sme_tenant.go:514-521. |
| `<orgslug>.omani.rest` | `acme.omani.rest` | Pool alternate. |
| `<orgslug>.omani.trade` | `acme.omani.trade` | Pool alternate. Note: singular `trade`, not `trades`. Earlier docs that said `omani.trades` are wrong; do not reintroduce the plural. |
The tenant console URL pattern is console.<orgslug>.<pool-tld> — e.g.
console.acme.omani.homes. Additional tenant-installed apps are reachable at
<newapp>.<orgslug>.<pool-tld> — e.g. notes.acme.omani.homes.
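The selection rule and the URL patterns above can be sketched as follows. This is not the logic in the Go sources named above; the function names and the `ns_ready` set are hypothetical, and the pool is passed in as a parameter because pool TLDs must not be hardcoded.

```python
from __future__ import annotations


def pick_pool_tld(pool: list[str], ns_ready: set[str]) -> str:
    """Return the first pool TLD, in registration order, that is NS-ready."""
    for tld in pool:
        if tld in ns_ready:
            return tld
    raise LookupError("no NS-ready pool TLD available")


def tenant_console_host(orgslug: str, pool_tld: str) -> str:
    """console.<orgslug>.<pool-tld>"""
    return f"console.{orgslug}.{pool_tld}"


def tenant_app_host(app: str, orgslug: str, pool_tld: str) -> str:
    """<newapp>.<orgslug>.<pool-tld>"""
    return f"{app}.{orgslug}.{pool_tld}"
```

With the pool ordered as in the table and only `omani.rest` and `omani.trade` NS-ready, the pick falls through to `omani.rest`.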
Voucher redeem URL
The canonical voucher-email link pattern (per
core/services/notification/templates/templates.go):
https://marketplace.t<NN>.omani.works/redeem/?code=<CODE>
The slash before ? is mandatory — both URL ends are part of the Phase 1
step 1a contract and must be byte-for-byte stable.
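A small sketch of building and checking the redeem link, assuming the voucher code is an opaque string that may need percent-encoding. The helper names are hypothetical and this is not the template code referenced above; only the URL shape is taken from this section.

```python
from urllib.parse import parse_qs, quote, urlsplit


def redeem_url(marketplace_host: str, code: str) -> str:
    """Build the voucher link, keeping the slash before the query string."""
    return f"https://{marketplace_host}/redeem/?code={quote(code, safe='')}"


def is_canonical_redeem_url(url: str) -> bool:
    """Check scheme, the exact /redeem/ path, and a single non-empty code parameter."""
    parts = urlsplit(url)
    return (
        parts.scheme == "https"
        and parts.path == "/redeem/"
        and list(parse_qs(parts.query)) == ["code"]
    )
```

The checker rejects `/redeem?code=...` because the path comparison is exact, which is the property the contract cares about.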
Forbidden in tests
The following strings must never appear in test code, test data, operator-walk runbooks, fresh-prov provisioning bodies, or any artifact that exercises the 5-Pillar deterministic path:
- `openova.io`, and any subdomain (`console.openova.io`, `marketplace.openova.io`, etc.)
- `omantel.openova.io`: legacy operator-sample FQDN, dead
- `eventforge.io`: never an OpenOva domain; never the canonical app name
- `Nova Cloud`: never the operator brand for the test stack
openova.io is reserved for the OpenOva marketing site (the
openova-private/website/ repo) and the mothership control plane during
Phase 0 + Phase 1 cold-start. After
bp-self-sovereign-cutover
runs, every reference to openova.io from a franchised Sovereign is a
Principle #11 violation.
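One way to enforce the forbidden list is a blunt, case-insensitive substring scan over test artifacts. The sketch below is an assumption about how such a check could look, not an existing guard in the repo. `omantel.openova.io` needs no separate entry because any string containing it also contains `openova.io`.

```python
from __future__ import annotations

# Terms from the forbidden list above; omantel.openova.io is covered by openova.io.
FORBIDDEN = ["openova.io", "eventforge.io", "Nova Cloud"]


def forbidden_hits(text: str) -> list[tuple[int, str]]:
    """Return (line number, term) for every forbidden term found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for term in FORBIDDEN:
            if term.lower() in lowered:
                hits.append((lineno, term))
    return hits
```

Because it matches substrings, the scan also flags e-mail addresses on a forbidden domain; whether those count as leaks is a policy call the sketch does not make.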
When to switch + why
| Situation | Switch to | Why |
|---|---|---|
| Fresh prov on a clean LE quota | `t<NN>.omani.works` | Default; cheapest, most stable. |
| LE rate-limit on `omani.works` | `t<NN>.omantel.biz` | Same Catalyst NS records, different LE quota. |
| Tenant Organization signup, default | `<orgslug>.omani.homes` | First NS-ready entry; quietest pool TLD. |
| `omani.homes` pool exhausted in a Sovereign | `<orgslug>.omani.rest` then `<orgslug>.omani.trade` | Round-robin within the operator's pool config. |
| Any test or walk artifact | NEVER `openova.io` / `omantel.openova.io` / `eventforge.io` / `Nova Cloud` | Reserved for mothership / marketing surfaces only; appearance in a tenant test = Principle #11 violation. |
Domain hygiene checks
docs/trust-audit-*.md and PR review hunt for:
- `openova.io` leaks in test data: any `*_test.go` / `*.spec.ts` / `*.feature` literal containing `openova.io` is a leak.
- Hardcoded operator FQDNs: any code path that pins the operator domain to a literal instead of reading it from a runtime parameter (`SOVEREIGN_FQDN`, `--sovereign-fqdn`, etc.). See `PRINCIPLES.md` §4 (never hardcode).
- Tenant-Org URL pattern drift: any path that emits `<orgslug>.openova.io` or `<orgslug>.<sovereign-fqdn>` instead of `<orgslug>.<pool-tld>`. The pool TLD is the source of truth.
- `admin.<sovereign-fqdn>` references: voucher and billing operations live in the BSS menu inside the operator console; an `admin.*` subdomain means a stale reference.
- Plural `omani.trades`: must be singular `omani.trade`. Any new occurrence is a regression.
When in doubt, defer to GLOSSARY.md and the Go source files
named above.
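Two of the hygiene checks lend themselves to simple regular expressions. The sketch below is illustrative only: the check names and patterns are assumptions, and they cover just the plural `omani.trades` and `admin.*` subdomain cases, not hardcoded-FQDN detection, which needs code-aware review.

```python
from __future__ import annotations

import re

# Hypothetical patterns for two of the hygiene checks listed above.
CHECKS = {
    "plural omani.trades": re.compile(r"\bomani\.trades\b"),
    "admin.* subdomain": re.compile(r"\badmin\.[a-z0-9-]+(?:\.[a-z0-9-]+)+"),
}


def drift_findings(text: str) -> list[str]:
    """Return the names of the checks that match anywhere in text."""
    lowered = text.lower()
    return [name for name, pattern in CHECKS.items() if pattern.search(lowered)]
```

The word boundary before `admin` keeps hosts such as `sysadmin.example` from matching, and the singular `omani.trade` passes untouched.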
§5 — Persona journeys
How different people use Catalyst. Defer to GLOSSARY.md for
terminology. The journeys described below use Catalyst surfaces (console / Git
/ API) that are partially design-stage — see
STATUS.md.
5.1 Personas
| # | Persona | Where they live | Tools they use |
|---|---|---|---|
| P1 | OpenOva Engineer | github.com/openova-io | Catalyst codebase, Blueprint repos |
| P2 | `sovereign-admin` | Catalyst admin UI + Sovereign Gitea | Browser UI, Git, kubectl (debug) |
| P3 | Support Agent (within a Sovereign's operations team) | Catalyst admin UI in support mode | Browser UI |
| P4 | `org-admin` | Org-scoped Catalyst console | Browser UI, occasional Git |
| P5 | SME End User (e.g. Ahmed, pharmacy owner on Omantel) | Marketplace + the App they installed | Browser only |
| P6 | SME Power User (e.g. Ahmed's tech-savvy nephew) | Console with Developer mode toggled on | Browser, occasionally Git |
| P7 | Corporate DevOps / SRE (e.g. Layla at Bank Dhofar) | Git + console in advanced view | Browser, Git, kubectl-on-own-vcluster, IDE |
| P8 | Corporate App Developer (e.g. Omar at Bank Dhofar) | Console + Git for own service repos | Browser, Git, IDE |
| P9 | Security / Compliance Officer (e.g. Khalid, CISO) | Audit dashboards + EnvironmentPolicy editor | Browser |
| P10 | Billing Admin | Billing console | Browser |
5.2 Surfaces
The three first-class surfaces (full list and rationale in
ARCHITECTURE.md §7):
- UI: Catalyst console. Form / Advanced / IaC editor depths.
- Git: direct push or PR to the Application's Gitea repo (one repo per App; branches `develop` / `staging` / `main` map to dev / stg / prod), or to private Blueprint repos (`shared-blueprints` per-Org or `catalog-sovereign` Sovereign-wide).
- API: REST + GraphQL, for portal integrations.
Plus one debug-only surface:
- kubectl — inside one's own vcluster. Read-mostly, never used to mutate Catalyst-managed resources.
There is no fourth surface. Terraform, Pulumi, "catalystctl install" are not part of this model.
5.3 Personas × Journeys matrix
Cells show which surface(s) the persona uses for that journey. Bold = primary. Italic = secondary. Empty = not applicable.
| | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 |
|---|---|---|---|---|---|---|---|---|---|---|
| J1 Build & publish Blueprint to public catalog | Git + CI | |||||||||
| J2 Provision a Sovereign | UI+Git | |||||||||
| J3 Onboard an Organization | UI | UI | ||||||||
| J4 Create an Environment | UI | UI | auto on signup | UI | UI | view audit | ||||
| J5 Install Application from catalog | UI | UI form | UI | UI + Git | UI + Git | view audit | view cost | |||
| J6 Configure Application | UI | UI form | UI | UI + Git | UI + Git | view audit | ||||
| J7 Author private Blueprint | Git+CI | UI + Git | Git + CI | Git + CI | review + sign | |||||
| J8 Author Crossplane Composition (advanced) | Git + CI | Git+CI | Git + CI | review | ||||||
| J9 Promote between Environments | UI | UI + Git PR | UI + Git PR | UI approve | ||||||
| J10 Observe runtime / debug | UI dashboards | UI dashboards | UI dashboards | App's own UI | UI | UI + kubectl | UI + kubectl | UI audit | ||
| J11 Rotate credentials | UI + auto | UI + auto | auto | UI | UI + auto | UI + policy | ||||
| J12 Audit / compliance review | UI | UI | UI | UI (own changes) | UI (own changes) | UI export to SIEM | ||||
| J13 Billing & quotas | UI quotas | UI read | UI invoices | UI plan | UI | |||||
| J14 Off-board / migrate | UI export | UI | UI cancel | UI export | audit | UI final invoice |
5.4 Operator journey — BSS menu (Phase 0 walkthrough)
The operator's primary mutation surface is the BSS menu in the Sovereign
Console sidebar — Business Support System for voucher issuance, billing,
plan/quota administration, and tenant lifecycle operations. The BSS menu
replaces the dead admin.<sovereign-fqdn> subdomain pattern.
| Step | Surface | Action |
|---|---|---|
| O0 | Sovereign Console at `console.t<NN>.omani.works` | PIN-login as owner-tier (D21 must pass: operator pre-populated). |
| O1 | Sidebar → BSS → Vouchers | Issue voucher to test recipient (e.g. hatice.yildiz@openova.io). Voucher CR + DB row materialise (D28). |
| O2 | Sovereign outbound SMTP | Recipient receives voucher email with canonical URL https://marketplace.t<NN>.omani.works/redeem/?code=<CODE> (slash mandatory). |
| O3 | Sidebar → BSS → Tenants (post-redeem) | New Organization appears in the operator's tenant list with the chosen pool subdomain (D30). |
| O4 | Sidebar → Settings | Region / Capacity / ControlPlaneSize / Created / DeploymentID / Pool subdomain populated with real values (D22). |
What the operator never touches: admin.<fqdn>, kubectl, raw NATS, raw SQL.
5.5 Customer journey — voucher → checkout → Org (SME, Ahmed)
Cast. Ahmed owns 4 small pharmacies in Muscat. No IT staff. He has a laptop and a credit card. (Sovereign for this example is the Omantel-run Sovereign for SMEs.)
Day 1 — 14:00
─────────────
1. Ahmed receives the voucher email from his Omantel sales rep.
Link points at his Sovereign's marketplace, e.g.
https://marketplace.<sovereign-fqdn>/redeem/?code=<CODE>
2. Clicks the link → PIN-login (email + 6-digit PIN magic-link).
3. Picks the canonical Postgres-backed bundle from the operator-curated catalog
(e.g. bp-bundle-pharmacy: ERPNext + WooCommerce + Stalwart-mail + Postgres + Redis).
4. Org wizard: picks subdomain `muscatpharmacy.<pool-tld>` from the picker
(D30 — pool comes from operator-curated Sovereign config), confirms
business details + 2-region BCP topology (Pillar 2).
5. Catalyst auto-creates: Organization "muscatpharmacy", Environment
"muscatpharmacy-prod", vcluster "muscatpharmacy" on the chosen primary region.
2 independent CNPG clusters provisioned, ReplicaCluster sync over
ClusterMesh (Pillar 3). Environment-controller spins up the vcluster
in ~60 seconds.
6. Provisioning service creates 5 Application Gitea repos under
gitea.<location-code>.<sovereign-fqdn>/muscatpharmacy/ (one repo per App:
erpnext, woocommerce, pharmacy-mail, shared-postgres, shared-redis), each
with develop/staging/main branches and initial manifests.
Webhook → projector → Flux in the muscatpharmacy vcluster picks up the
5 new GitRepository sources and reconciles.
7. ~3 minutes later: Ahmed sees green checkmarks on his dashboard.
Each App card has an "Open" button.
Click ERPNext → SSO via the Sovereign's Keycloak realm for muscatpharmacy → he's in.
─────────────
Day 1 — 14:08 — Ahmed is selling.
What he never saw: Git, kubectl, vcluster, Flux, Blueprint, YAML, JetStream. His mental model: "I have a Sovereign account. I bought a bundle. It works."
5.6 Tenant journey — login → Sandbox → qwen-code → new app (Phase 2)
This is the deterministic Phase 2 walkthrough from §2 expressed as a journey narrative. Same tenant Organization (Ahmed's muscatpharmacy, or any tenant created in §5.5).
| Step | Surface | Action |
|---|---|---|
| T0 | `console.<orgslug>.omani.homes` | Tenant PIN-logs-in. Dashboard renders (Pillar 1 + Pillar 2 already shipped). |
| T1 | Tenant Console → Sandbox | Sandbox session launches with agent set to qwen-code by default (routes via newapi to Sovereign-hosted Qwen — zero Anthropic cost leak). |
| T2 | Sandbox session | openova-sandbox-mcp auto-mounts: 49 MCP tools available (sandbox.db.*, sandbox.auth.*, marketplace.app.*, sandbox.git.*, sandbox.iam.*). User pastes zero credentials. |
| T3 | qwen-code prompt | Tenant prompts: "install a notes app backed by Postgres in my Org, public on notes.<orgslug>.omani.homes." |
| T4 | Agent action | qwen-code calls marketplace.app.install + sandbox.db.provision + sandbox.auth.provisionRealm via the MCP plugin. New app's CNPG cluster + namespace + HelmRelease + Gitea repo materialise. |
| T5 | New app | Reachable at https://notes.<orgslug>.omani.homes with publicly-trusted TLS. |
The tenant never typed a kubeconfig, never opened Git, never copied a DB connection string. Pillar 4 shipped end-to-end.
5.7 Corporate journey — Layla at Bank Dhofar (running its own Sovereign)
Cast. Layla is an SRE on Bank Dhofar's 12-person Cloud Platform team.
They run their own Sovereign on Hetzner. Their internal Organizations are
core-banking, digital-channels, analytics, corporate-it. Their default
tooling is Git + IDE.
09:00 Coffee. Opens VS Code. Branch: bp-bd-payment-rail
─────────────────────────────────────────────────────────────────────────
She's authoring a private Blueprint for a payment-rail microservice
with Postgres + Redis dependencies.
09:15 Pushes to gitea.<location-code>.bankdhofar.local/digital-channels/shared-blueprints/
bp-bd-payment-rail. CI in Bank Dhofar's GitHub Actions runner pool
(running inside the Sovereign) builds the image, signs the Blueprint
with cosign, publishes to the local OCI registry. blueprint-controller
picks it up — visible as a private card in the digital-channels Org.
10:00 Switches to her Application repo:
gitea.<location-code>.bankdhofar.local/digital-channels/payment-rail
Checks out branch `develop` (the dev environment branch).
Edits values.yaml (config tweak).
Catalyst console (Plan view) shows the diff: what will change,
dependency impact, drift, cost delta. Like `terraform plan`, but
served by the API on the Git diff.
10:15 Happy. Commits to develop. Webhook → projector → Flux in the
digital-channels vcluster (watching the develop branch on this
Application repo) reconciles in 30s. Audit log captures her as
committer at the App-repo level.
11:00 Need to debug the staging deployment of the same App.
Browser: console → digital-channels-stg → payment-rail card
→ Logs tab. Then Topology tab to see across regions.
Or, drops into kubectl scoped to her vcluster:
$ kubectl --context=hz-fsn-rtz-prod-digital-channels logs -n payment-rail
Direct kubectl, scoped strictly to her Org's vcluster (vcluster name
per NAMING §1.5 is the Org name, not the Sovereign name — Layla's Org
is `digital-channels`). Bank Dhofar's sovereign-admin grants this via
a JIT elevation flow.
14:00 Promotion stg → uat. From the payment-rail Application card,
clicks "Promote staging → uat". Catalyst opens a Gitea PR
within the same payment-rail repo: source branch `staging`,
target branch (a feature branch tracking uat config). The
EnvironmentPolicy CR for digital-channels-uat (in
system/catalyst-config/policies/) requires team-platform approver
and an RE score ≥ 80%. Reviewer approves via Gitea web UI (or
via the Catalyst console's PR view — same backend). Auto-merge.
Flux in the uat-bound vcluster reconciles.
15:00 New Environment needed for a fraud lab. From the console:
"New Environment in analytics" → fills name "fraud-lab-dev" →
picks "small" topology (1 region, single bb=rtz). Environment-controller
creates the vcluster and bootstraps Flux pointing at the develop
branch of every Application repo in the analytics Org. No new repos
are created (Application repos exist already, indexed by branch).
Ready in 60s. Layla now has a new sandbox.
16:00 Business asks for the bank's existing Backstage portal to show
Catalyst-managed services. Layla integrates: Backstage queries
Catalyst REST API at https://api.<location-code>.bankdhofar.local/v1/applications,
authenticated via workload identity (Backstage runs inside the
Sovereign). Backstage's service catalog now includes Catalyst
Applications alongside other systems. No code change in Catalyst —
the API was already there.
What Layla DOES use: UI (for promotion approvals, observability,
EnvironmentPolicy editing), Git (for Blueprint authoring in shared-blueprints
and per-Application config in each App's repo with develop / staging /
main branches), kubectl (for debugging her own vcluster), and the API (for
integrating Backstage). She never writes Crossplane code unless she's
contributing a new Composition upstream as a Blueprint — and even then it's
via a Gitea PR.
What Layla doesn't use: Terraform, Pulumi, a "catalystctl" CLI, or any other tool that bypasses Git.
5.8 Application card (the user's primary handle)
The card is the user's view of an Application in their Environment. Anatomy below; full UX in the console docs.
┌────────────────────────────────────────────────────────────────┐
│ 🌐 marketing-site ⋮ │ ← name + menu
│ bp-wordpress @ 1.3.0 │ ← Blueprint + version
├────────────────────────────────────────────────────────────────┤
│ ● Running 🔗 acme.com ↗ │ ← status + endpoint
│ │
│ 📍 eu-central 5 / 5 pods │ ← placement + health
│ 💾 postgres → shared-postgres (own card) │ ← key dependency (linked)
│ │
│ Last deploy: 2h ago by Layla ⏵ View history │ ← provenance
│ │
│ [ Open app ↗ ] [ Settings ] [ Logs ] [ Topology ] │ ← primary actions
└────────────────────────────────────────────────────────────────┘
States via the status badge:
| State | Meaning |
|---|---|
| ● Running (green) | All replicas healthy, traffic flowing |
| ◐ Installing (blue) | Flux reconciling, progress shown inline |
| ◑ Updating (blue) | Config or version change rolling out |
| ◒ Degraded (amber) | Partial — 3/5 pods, 2 unhealthy |
| ◓ Failed (red) | Install or update failed, "View error" button |
| ○ Paused (grey) | Manually paused, scale-to-zero |
| ◔ Pending approval (purple) | Promotion PR open, awaiting reviewers |
Clicking the card opens the detail page with tabs: Overview, Settings, Topology, Secrets, Observability, History, Manifests.
The Topology tab is where Placement edits happen — single-region → active-active, region picker, failover policy. The Manifests tab is the Monaco IaC editor.
5.9 Catalog vs Applications-in-use view
The Marketplace renders Blueprint cards (something to install) — visually distinct from Application cards (something running). The Blueprint detail page is the "where is this Blueprint running in my Org" view — a query, not a chain object. The Environment view groups Application cards by status, with backing services (Postgres, Redis, etc.) in their own section.
5.10 Default UI mode by Sovereign type
| Setting | SME-style default | Corporate default (Bank Dhofar) |
|---|---|---|
| Console default depth | Form view | Advanced view + IaC editor toggle on |
| Developer mode (Blueprint Studio) | Hidden, off | Visible by role |
| Multi-Environment promotion features | Hidden when only 1 Env | Visible always |
| EnvironmentPolicy editor | Hidden by default | Visible by role |
| `kubectl` access for users | Off | On for `org-developer` and above |
| Git access for users | Off (sovereign-admin can flip per-Org) | On |
| Marketplace features (search, bundles, ratings) | All on | All on but de-emphasized |
| Specter / AIOps Blueprint included by default | Optional | Recommended (Cortex + Specter on top) |
Each Sovereign sets its defaults at provisioning time; users within can override via per-user preferences within the role permissions allowed.
§6 — Multi-region BCP test (D31 detail)
D31 is the region-kill BCP failover gate — the verifier for Pillar 3. Run in parallel with Phase 0 / 1 / 2 on the same fresh prov.
Preconditions
- Tenant Organization exists (created via §2 Phase 1).
- Tenant has installed a CNPG-backed app via the marketplace with active hot-standby selected. Two independent CNPG `Cluster` CRs exist, one in each chosen region, with `ReplicaCluster` sync over Cilium ClusterMesh on the DMZ WireGuard data plane (A2 + A3).
- App reachable on its FQDN with publicly-trusted TLS.
Counter-test continuity procedure
The continuity check is a monotonic counter that increments through the region kill. Any gap, replay, or skipped value = failed gate.
| Step | Action | Pass criterion |
|---|---|---|
| 1 | Start the counter writer: a client process that runs `INSERT … RETURNING id` every 100 ms against the app's primary CNPG endpoint, recording each returned `id` and timestamp locally. | Counter increments monotonically; no holes pre-failover. |
| 2 | Kill the primary region. Two valid kill modes: (a) instance destroy via the cloud-provider API; (b) NetworkPolicy isolation that drops all traffic in/out of the primary region's namespaces. | Primary region becomes unreachable within ≤ 5 s. |
| 3 | `failover-controller` (per Continuum CR) flips traffic to the replica region. Replica CNPG `ReplicaCluster` promotes to primary. Cilium ClusterMesh keeps inter-region pod-to-pod alive across the surviving regions. | Failover RTO ≤ 30 s end-to-end (kill → counter writer reconnects to the new primary). |
| 4 | Counter writer reconnects via the app's FQDN (DNS / LB now points at the surviving region). | Writer resumes within the 30 s window; no transaction lost: the next written `id` is `last_id + 1`, never less, never with a gap. |
| 5 | (Optional, ledger-grade) After failover, walk the durable WAL and confirm every `id` from before the kill is present in the new primary's data plane. | Zero `id` gaps in the replica-promoted-to-primary's data. |
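The pass criteria in steps 1, 3 and 4 can be checked mechanically against the writer's local log. The sketch below assumes the log is a list of `(id, timestamp)` records and that the kill time is known; the `Write` type and `check_continuity` name are hypothetical. It approximates the RTO as the time from the kill to the first write after the longest pause in the log, which is a simplification of "kill → counter writer reconnects".

```python
from __future__ import annotations

from dataclasses import dataclass

RTO_LIMIT_S = 30.0  # failover RTO bound from step 3


@dataclass
class Write:
    id: int    # value returned by INSERT ... RETURNING id
    ts: float  # seconds, as recorded locally by the writer


def check_continuity(log: list[Write], kill_ts: float) -> list[str]:
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    pairs = list(zip(log, log[1:]))
    for prev, cur in pairs:
        # Any gap, replay, or skipped value fails the gate.
        if cur.id != prev.id + 1:
            problems.append(f"id jump {prev.id} -> {cur.id}")
    if not pairs:
        return problems + ["log too short to evaluate"]
    # Treat the longest pause between writes as the failover window.
    _, resumed = max(pairs, key=lambda p: p[1].ts - p[0].ts)
    rto = resumed.ts - kill_ts
    if rto > RTO_LIMIT_S:
        problems.append(f"RTO {rto:.1f}s exceeds {RTO_LIMIT_S:.0f}s")
    return problems
```

This covers only the writer-side evidence; the post-failover database state from step 5 still has to be inspected separately.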
Hard requirements
- The kill MUST be a real region kill, not a Pod restart or a Deployment scale-to-zero. A single-region prov cannot satisfy D31; see PR #1599 shape in `PRINCIPLES.md` ("multi-region claim on single-region prov").
- The failover must be triggered by `failover-controller` reading the Continuum CR, never by a human flipping DNS by hand.
- The counter writer's local log + the post-failover database state are the only acceptable evidence. Operator-walk screenshots alone do not satisfy D31.
§7 — Pillar 5 sovereignty cutover
The Phase 1 → cutover transition is the proof that a franchised Sovereign can operate independently of the OpenOva mothership.
Choreography
- `bp-self-sovereign-cutover` installs dormant at bootstrap-kit slot 06a during Phase 1. It is present but inert: no Jobs running, no tether pivots active.
- Post-handover, the operator clicks the "Achieve True Sovereignty" CTA in the Sovereign Console.
- The Blueprint runs eight sequential Jobs, each pivoting one of the tethers listed in §1 Pillar 5. Tethers are pivoted in dependency order so the cluster never loses its ability to pull what it needs at each step (e.g. Harbor proxy-cache is warmed before containerd registries.yaml flips).
- The final step is a 10-minute deny-egress NetworkPolicy hold against `github.com`, `ghcr.io`, and `harbor.openova.io`. During the hold:
  - Flux must continue to reconcile (sources are now local Gitea + Harbor).
  - All HelmReleases must remain Ready=True.
  - No image-pull errors, no Git fetch errors, no upstream registry hits.
- `cutoverComplete=true` is set only if the cluster reconciles green during the full 10-minute hold. Any hiccup = the cutover failed; roll back to pre-cutover state, fix the root cause, re-run.
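The hold rule can be expressed as a predicate over periodic samples taken during the 10 minutes. The sample structure below is entirely hypothetical (the real Jobs may record this differently); it only illustrates that a single bad sample, or incomplete coverage of the hold, must yield a failed cutover.

```python
from __future__ import annotations

from dataclasses import dataclass

HOLD_SECONDS = 600  # the 10-minute deny-egress hold


@dataclass
class HoldSample:
    elapsed_s: int             # seconds since the hold started
    all_hr_ready: bool         # every HelmRelease reported Ready=True
    pull_or_fetch_errors: int  # image-pull and Git fetch errors since the last sample
    upstream_hits: int         # allowed flows to the blocked hosts since the last sample


def cutover_complete(samples: list[HoldSample]) -> bool:
    """True only if the samples span the full hold and every one is green."""
    if not samples or samples[0].elapsed_s > 0 or samples[-1].elapsed_s < HOLD_SECONDS:
        return False  # the samples do not cover minute 0 through minute 10
    return all(
        s.all_hr_ready and s.pull_or_fetch_errors == 0 and s.upstream_hits == 0
        for s in samples
    )
```

Sampling once a minute is an arbitrary choice in this sketch; a tighter interval narrows the window in which a transient failure could go unobserved.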
Verification (the only acceptable evidence)
- Egress-block proof: a Hubble / NetworkPolicy log showing zero allowed flows to `github.com`, `ghcr.io`, `harbor.openova.io` for the duration of the 10-minute hold.
- Reconcile-green proof: `kubectl get hr -A -o jsonpath=…` showing every HelmRelease Ready=True at minute 0 and minute 10 of the hold.
- Operator-walk screenshot: the Sovereign Console's "Sovereignty" tab showing `cutoverComplete=true` with the timestamp of the hold.
No cutover claim without all three. See
adr/0002-post-handover-sovereignty-cutover.md
for the full architecture, alternatives considered, and per-step contract.
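For the reconcile-green proof, one option is to capture the HelmRelease list as JSON and check the `Ready` condition of each item. The sketch assumes the input is the output of `kubectl get hr -A -o json` (a variation on the jsonpath form above, chosen here because it is easier to parse); the function name is hypothetical.

```python
from __future__ import annotations

import json


def not_ready(hr_list_json: str) -> list[str]:
    """Return 'namespace/name' for every HelmRelease whose Ready condition is not True."""
    failing = []
    for item in json.loads(hr_list_json).get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            meta = item["metadata"]
            failing.append(f'{meta["namespace"]}/{meta["name"]}')
    return failing
```

Running it on the captures from minute 0 and minute 10 and requiring an empty list both times gives the two data points the proof asks for.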
Customer-sync — how each Sovereign keeps the catalog after cutover
Each franchised Sovereign's Gitea mirrors the public catalog from this
repo (openova-io/openova):
GitHub (openova-io/openova) Per-Sovereign Gitea (mirrored)
───────────────────────────── ───────────────────────────────
platform/cilium/ ────sync────> gitea.<location-code>.<sovereign-domain>/catalog/bp-cilium/
products/cortex/ ────sync────> gitea.<location-code>.<sovereign-domain>/catalog/bp-cortex/
...
Sovereigns pull on their own schedule (default daily). Air-gapped Sovereigns
mirror via offline media. After bp-self-sovereign-cutover completes, the
Sovereign's Flux reconciles exclusively from its local Gitea + Harbor —
never back to github.com/openova-io or ghcr.io/openova-io (Principle #11).
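The path mapping in the diagram above can be sketched as a pure function. The rule used here (last path component, prefixed with `bp-`, under `/catalog/`) is inferred from the two rows shown and is an assumption; the location code and domain in the usage note are placeholders.

```python
def mirror_target(source_path: str, location_code: str, sovereign_domain: str) -> str:
    """Map a catalog path in openova-io/openova to its per-Sovereign Gitea location."""
    # Use the last path component as the Blueprint name, e.g. platform/cilium/ -> cilium.
    component = source_path.strip("/").split("/")[-1]
    return f"gitea.{location_code}.{sovereign_domain}/catalog/bp-{component}/"
```

For example, `mirror_target("platform/cilium/", "fsn", "example.test")` yields `gitea.fsn.example.test/catalog/bp-cilium/`.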