Canarie kube — Architecture Overview

1. Kubernetes 101 — just enough to be dangerous

1.1 What Kubernetes actually is

Kubernetes is a container orchestrator — it does not run containers itself. It tells worker nodes what containers to run, where, how many copies, and how they should reach each other. The container runtime (in this build: containerd) is what actually executes them.

The control plane is a small set of long-running processes that maintain the cluster's “desired state”:

| Component | Job |
|---|---|
| etcd | The cluster's database. Every K8s object (Pod, Service, Secret, etc.) is a row in etcd. |
| kube-apiserver | The only thing that talks to etcd. Every other component, including kubectl, talks to the apiserver. |
| kube-scheduler | Decides which node a new Pod runs on. |
| kube-controller-manager | Bundles the dozens of small loops that drive resource state — Deployment controller, ReplicaSet controller, Node controller, etc. Each loop reads “desired” from etcd, observes “actual” via the apiserver, and acts to close the gap. |
| cloud-controller-manager | Cloud-provider-specific glue (in real AKS this is what makes Service type=LoadBalancer provision an Azure LB). In this cluster: not used — see §6. |

Each worker node runs:

| Component | Job |
|---|---|
| kubelet | The node-level agent. Talks to the apiserver, gets told “run these pods”, and tells containerd to start/stop them. |
| containerd | The container runtime. |
| kube-proxy (normally) | Programs iptables/IPVS so Service ClusterIPs route to the right Pod IPs. Replaced by Cilium in this cluster — see §4. |
| CNI plugin | Gives Pods their IPs and connects them to the cluster network. Cilium here. |

The pattern to internalise: everything in K8s is reconciliation. You declare desired state (a YAML manifest), the apiserver writes it to etcd, and a controller loop notices the gap between desired and actual, then takes action. There is no “deploy” verb — there's only “write the desired thing to the apiserver and wait for the world to converge.”
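That reconciliation loop is easiest to see against a concrete manifest. A minimal sketch — the name and image are illustrative, not from this repo:

```yaml
# Illustrative desired state -- not a manifest from this repo.
# "Deploying" is nothing more than writing this object to the apiserver;
# the Deployment controller then converges actual state toward it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-web          # hypothetical name
spec:
  replicas: 2             # desired: two Pods, always
  selector:
    matchLabels: {app: demo-web}
  template:
    metadata:
      labels: {app: demo-web}
    spec:
      containers:
        - name: web
          image: nginx:1.27   # pinned tag, not :latest (see §10)
```

Delete one of the two Pods and the ReplicaSet controller recreates it — nothing "re-runs the deploy"; the loop just closes the gap again.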

1.2 Core primitives

Pod — The smallest deployable unit. One or more co-located containers sharing a network namespace and (optionally) volumes. Almost never created directly; created by controllers.

Deployment — Manages a ReplicaSet which manages Pods. Use for stateless workloads (web servers, API processes). Supports rolling updates, rollback.

StatefulSet — Like a Deployment but each Pod has a stable identity (pod-0, pod-1) and stable storage (each Pod gets its own PVC). Use for databases, Prometheus, anything that needs to know “which replica am I”. This cluster uses it for Prometheus, Alertmanager, Loki.

DaemonSet — Runs exactly one Pod on every (matching) node. Use for node-level agents. This cluster uses it for ingress-nginx, Promtail, Cilium.

Service — A stable virtual IP and DNS name in front of a set of Pods. Three types relevant here:

Ingress — HTTP(S) routing layer in front of Services. Decides “host grafana.example.com + path / → Service grafana port 80”. Needs an Ingress Controller (an actual running pod) to do anything; the Ingress object is just config. This cluster uses ingress-nginx as the controller.

Namespace — A scope/folder for K8s objects. Every object except a few cluster-scoped ones (Node, ClusterRole, PersistentVolume, CRDs themselves) lives in exactly one namespace. RBAC, NetworkPolicy, and ResourceQuota are all per-namespace.

ConfigMap / Secret — Key-value blobs mounted into Pods as files or env vars. Secret is base64-encoded (not encrypted at rest by default — that's a separate apiserver setting).
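A minimal sketch of the base64 point — this Secret is hypothetical, and the encoded value is trivially reversible:

```yaml
# Illustrative Secret -- values are base64-encoded, not encrypted:
#   echo -n 'hunter2' | base64   ->   aHVudGVyMg==
apiVersion: v1
kind: Secret
metadata:
  name: demo-credentials    # hypothetical
type: Opaque
data:
  password: aHVudGVyMg==    # base64("hunter2"); anyone with read access can decode it
```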

PersistentVolumeClaim (PVC) — A Pod's request for storage (“I need 10 GiB”). Bound to a PersistentVolume (PV) which is the actual storage. PVs are usually provisioned dynamically by a CSI driver (Container Storage Interface) — in this cluster, by Longhorn, see §5.

CustomResourceDefinition (CRD) — Lets an operator extend the K8s API with new object kinds. Every component on top of plain K8s (Longhorn, Fleet, cert-manager, Prometheus Operator, Kyverno) ships its own CRDs. When you write a Certificate YAML you're talking to cert-manager's CRD, not core K8s.

Operator — A shorthand, not an API kind: a controller that watches a CRD and reconciles its desired state. “The Trivy Operator” = a pod that watches VulnerabilityReport CRs and acts on them.

1.3 The five-minute concept dump


2. The shape of this project

Single sentence: a production-pattern RKE2 cluster on Azure, OSS-only, built up across seven PRs so each layer of the stack is a discrete, reviewable change.

Why no AKS? AKS hides the parts of running K8s that give the project its shape — CNI install, CSI install, ingress install, etcd, certificates, GitOps wiring. The project is about assembling each of those layers end-to-end; AKS would short-circuit the exercise.

Why does each phase get its own PR? Each PR is a reviewable, individually-justified extension of the previous layer. The list below is also the order they had to ship in, because each later phase depends on something in an earlier one.

| # | Phase | What it adds | Installed by |
|---|---|---|---|
| 1 | Cluster | RKE2 v1.31 + Cilium (kube-proxy replacement) | cloud-init + RKE2 HelmChartConfig |
| 2 | Storage | Longhorn 1.7.2 (default StorageClass) | RKE2 HelmChart CRD |
| 3 | Ingress | ingress-nginx (DaemonSet, hostNetwork) | RKE2 HelmChart CRD |
| 4 | GitOps | Fleet + the GitRepo pointing at this repo | RKE2 HelmChart CRD |
| 5 | Observability | kube-prometheus-stack + Loki + Promtail | Fleet bundle |
| 6 | Certs + DNS | cert-manager (LE HTTP-01) + external-dns (Azure DNS) + 3 ClusterIssuers | Fleet bundles |
| 7 | Security | Trivy Operator + Kyverno + 4 audit-mode ClusterPolicies (Falco deferred) | Fleet bundles |

The Falco deferral is real and worth knowing — its eBPF driver doesn't compile against kernel 6.17 in the chart's bundled DKMS path, and the pre-built modern-bpf driver also failed to load. Documented in the rollout plan's Phase 7 post-impl notes.

flowchart TD
  subgraph P1["Phase 1 — Cluster"]
    RKE2["RKE2 v1.31 server on cp-01<br/>agents on wk-01, wk-02"]
    Cilium["Cilium CNI<br/>kubeProxyReplacement=True"]
    RKE2 --> Cilium
  end
  subgraph P2["Phase 2 — Storage"]
    Longhorn["Longhorn 1.7.2<br/>2 workers × 64 GiB data disk"]
  end
  subgraph P3["Phase 3 — Ingress"]
    Nginx["ingress-nginx DaemonSet<br/>hostNetwork, on workers"]
    AzLB["Azure Standard LB<br/>:80/:443 → workers<br/>:6443 → cp-01"]
    AzLB --> Nginx
  end
  subgraph P4["Phase 4 — GitOps"]
    Fleet["Fleet controller+agent<br/>GitRepo: this repo"]
  end
  subgraph P5["Phase 5 — Observability"]
    KPS["kube-prometheus-stack"]
    Loki["Loki + Promtail"]
  end
  subgraph P6["Phase 6 — Certs + DNS"]
    CM["cert-manager<br/>LE HTTP-01 via ingress-nginx"]
    EDNS["external-dns<br/>Azure DNS zone rke2.ericharrison.ca"]
    AzDNS["Azure DNS zone<br/>delegated from Cloudflare"]
    EDNS --> AzDNS
    CM -. HTTP-01 challenge path .-> Nginx
  end
  subgraph P7["Phase 7 — Security"]
    Trivy["Trivy Operator<br/>Vuln + ConfigAudit + Compliance"]
    Kyverno["Kyverno<br/>4 ClusterPolicies in Audit"]
  end
  P1 --> P2 --> P3 --> P4
  P4 -->|manages| P5
  P4 -->|manages| P6
  P4 -->|manages| P7
Phase map — how the stack assembled itself. Phases 2–3 install via RKE2's native HelmChart CRD; Phase 4+ is owned by Fleet.

2.1 Topology one-liner

3 Azure VMs — cp-01 (control plane, 10.20.0.10) and wk-01/wk-02 (workers, 10.20.0.132/.133), all Standard_D2ads_v5 Ubuntu 24.04, in one VNet (10.20.0.0/24) split into three subnets (control, lb, workers). One Azure Standard Load Balancer fronts both :6443 (kubectl → cp-01) and :80/:443 (HTTP(S) → workers). One Key Vault holds the RKE2 join token, the Longhorn UI password, and the Grafana admin password. One Azure DNS zone (rke2.ericharrison.ca) is delegated from Cloudflare via four NS records.


3. The infrastructure layer (Azure + OpenTofu + GitLab)

This layer is the foundation — none of K8s exists until cloud-init finishes on the VMs.

3.1 The OpenTofu story

3.2 The Azure primitives that exist

3.3 The bootstrap pattern that ties Azure → K8s together

This is the piece that ties Azure IAM to K8s cluster bootstrap without a pre-placed secret:

  1. Terraform generates the RKE2 join token via random_password and writes it to Key Vault.
  2. Each VM has a SystemAssigned Managed Identity with Key Vault Secrets User on the vault.
  3. At boot, cloud-init queries the Azure Instance Metadata Service (IMDS) at 169.254.169.254 for an OAuth token, then curls the Key Vault REST API to fetch the join token. No az CLI needed; no secret travels outside Azure.
  4. cp-01 starts rke2-server, persists the token internally. wk-01 and wk-02 read the same token and join via cp-01:9345.
  5. Workers also format their data disk (/dev/disk/azure/scsi1/lun0, 64 GiB) and mount it at /var/lib/longhorn. The mkfs.ext4 is gated by blkid so VM rebuilds preserve Longhorn data.
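Steps 3–5 can be sketched as a cloud-init fragment. This is a hedged illustration, not the repo's cp.yaml/wk.yaml — the vault name (demo-kv) and the use of jq are assumptions:

```yaml
# Hedged sketch of the IMDS -> Key Vault fetch (not the repo's actual cloud-init).
runcmd:
  - |
    # OAuth token from IMDS via the VM's Managed Identity -- no stored secret:
    TOKEN=$(curl -s -H 'Metadata: true' \
      'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net' \
      | jq -r .access_token)
    # Join token from the Key Vault REST API ("demo-kv" is hypothetical):
    JOIN=$(curl -s -H "Authorization: Bearer $TOKEN" \
      'https://demo-kv.vault.azure.net/secrets/rke2-join-token?api-version=7.4' \
      | jq -r .value)
    # Drop it where rke2-server / rke2-agent reads config:
    printf 'token: %s\n' "$JOIN" > /etc/rancher/rke2/config.yaml.d/10-token.yaml
```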
sequenceDiagram
  participant TF as Terraform/OpenTofu
  participant AZ as Azure
  participant KV as Key Vault
  participant CP as cp-01 (cloud-init)
  participant WK as wk-01/02 (cloud-init)
  participant K8S as RKE2 API
  TF->>AZ: create RG, VNet, LB, VMs, MIs
  TF->>KV: random_password rke2-join-token
  TF->>AZ: assign KV Secrets User to each VM MI
  AZ->>CP: cloud-init with custom_data
  AZ->>WK: cloud-init with custom_data
  CP->>KV: IMDS token → fetch rke2-join-token
  CP->>CP: write config.yaml.d/10-token.yaml
  CP->>CP: systemctl enable --now rke2-server
  CP->>CP: drop HelmChartConfig (Cilium / eBPF)
  WK->>WK: format disk /dev/disk/azure/scsi1/lun0
  WK->>KV: IMDS token → fetch rke2-join-token
  WK->>CP: join via :9345
  CP->>K8S: API becomes ready
  WK->>K8S: Node registered
Bootstrap flow — empty resource group to ready cluster. The join token never leaves Azure (KV → IMDS → VM via Managed Identity).

The custom_data field on azurerm_linux_virtual_machine is force-new — any change to kube/cloud-init/{cp,wk}.yaml destroys and recreates the VMs. Data disks survive (separate resources). RKE2 server state on cp-01's OS disk does not, so a cloud-init edit after Phase 2 effectively means a cluster rebuild.

3.4 The four operator-run bootstrap scripts

Some Secrets can't be generated by Terraform because they need .env values that live only on the operator's laptop. These four scripts run once after the cluster comes up:

| Script | Seeds |
|---|---|
| sync-fleet-git-auth.sh | Fleet's git-clone basic-auth Secret from .env's GITLAB_TOKEN |
| sync-external-dns-azure.sh | external-dns's Azure SP credentials |
| sync-grafana-admin.sh | Grafana admin password (consumed by KPS via existingSecret) |
| sync-longhorn-basic-auth.sh | basic-auth Secret for the Longhorn UI Ingress |

Plus fetch-kubeconfig.sh, which SSHes into cp-01 (passwordless sudo on the Azure Ubuntu image), cats /etc/rancher/rke2/rke2.yaml, and rewrites the server URL to point at the LB public IP rather than 127.0.0.1. That file is ~/.kube/kube-dev.yaml from there on.


4. Phase 1 — the cluster itself (RKE2 + Cilium)

4.1 Why RKE2

You will get asked “why RKE2 and not k3s or kubeadm or AKS?”. The answer:

4.2 What RKE2 gives you for free

The single most important RKE2 fact for this project: RKE2's helm-controller runs as a goroutine inside rke2-server, not as a Pod. Manifests dropped into /var/lib/rancher/rke2/server/manifests/ are reconciled on every server start, before the CNI is up. Two CRDs come from this: HelmChart (declaratively install a Helm chart) and HelmChartConfig (override the values of a chart RKE2 itself installs).

That mechanism is how Cilium gets configured — see next.

4.3 Cilium and kube-proxy replacement

Cilium is the CNI plugin — it gives Pods their IPs and connects them. This cluster sets cni: cilium in RKE2's server config, then drops a HelmChartConfig at /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml that sets kubeProxyReplacement: True.
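A sketch of what that HelmChartConfig plausibly looks like (the repo's exact values block may carry more settings):

```yaml
# Sketch of /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml.
# HelmChartConfig overrides values of a chart RKE2 itself installs (here: rke2-cilium).
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system     # RKE2's helm-controller watches this namespace
spec:
  valuesContent: |-
    kubeProxyReplacement: true
```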

What that means in practice:

Why Cilium specifically, not Calico or Flannel?

eBPF — a safe, sandboxed in-kernel VM that runs verified bytecode in response to kernel events (syscalls, network packets, tracepoints). Originally for packet filtering (“Berkeley Packet Filter”), now general-purpose. Lets userspace programs do work in kernel context without writing a kernel module.

4.4 The token bootstrap (review)

Terraform-generated random_password → rke2-join-token in KV → VMs read it via Managed Identity + IMDS → no chicken-and-egg, no secret in any pipeline log. This is how the cluster bootstraps a multi-node join without an already-running cluster to distribute the token.


5. Phase 2 — Storage (Longhorn)

5.1 What Longhorn is

A distributed block-storage system written in Go that runs as Pods inside your cluster. Each Pod-managed “engine” exposes a virtual block device that's replicated across nodes' local disks. Implements the CSI (Container Storage Interface) so K8s sees it as a normal StorageClass.

5.2 Why Longhorn in this build

The tradeoff: it eats CPU and disk I/O on every worker, and on a 2-worker cluster you can only get 2 replicas (not 3, since you need replicas on distinct nodes).

5.3 Capacity math

erDiagram
  WORKER ||--|| DATA_DISK : "mounts at /var/lib/longhorn"
  DATA_DISK ||--o{ LONGHORN_REPLICA : "hosts"
  LONGHORN_VOLUME ||--|{ LONGHORN_REPLICA : "has 2"
  PVC ||--|| LONGHORN_VOLUME : "bound to"
  PROMETHEUS_STATEFULSET ||--|| PVC : "20 GiB"
  ALERTMANAGER_STATEFULSET ||--|| PVC : "5 GiB"
  GRAFANA_DEPLOYMENT ||--|| PVC : "5 GiB"
  LOKI_STATEFULSET ||--|| PVC : "5 GiB"
Storage layout — 128 GiB raw ÷ 2 replicas = ~64 GiB usable. Committed PVCs consume 40 GiB logical / 80 GiB physical.

5.4 Two non-obvious facts to know

5.5 The data disk separation

Why a separate Azure managed data disk per worker, mounted at /var/lib/longhorn, instead of just using the OS disk? Because OS disks die when a VM is destroyed, but managed data disks survive azurerm_linux_virtual_machine destroy/recreate. Cloud-init's mkfs.ext4 is gated by blkid, so a VM rebuild remounts the existing data disk without reformatting — Longhorn's volume metadata and replicas are preserved.
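The gating pattern itself is small enough to sketch (illustrative, not the repo's actual cloud-init; fstab handling omitted):

```yaml
# Hedged sketch of the blkid-gated format -- format only a virgin disk.
runcmd:
  - |
    DISK=/dev/disk/azure/scsi1/lun0
    # blkid exits non-zero when the disk carries no filesystem signature,
    # so a rebuilt VM remounts existing Longhorn data without reformatting:
    blkid "$DISK" || mkfs.ext4 "$DISK"
    mkdir -p /var/lib/longhorn
    mount "$DISK" /var/lib/longhorn
```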


6. Phase 3 — Ingress (ingress-nginx)

6.1 The Service-type-LoadBalancer story

In a “normal” cloud-managed K8s cluster:

  1. You install ingress-nginx with controller.service.type: LoadBalancer.
  2. K8s asks the cloud-controller-manager to provision an LB.
  3. CCM provisions one. Done.

This cluster has no cloud-controller-manager configured for the Azure provider. The Azure LB already exists (it has to, for the kubectl :6443 rule). So the question becomes: how does external HTTPS traffic reach an ingress-nginx pod?

6.2 The hostNetwork DaemonSet pattern

The chosen answer:

The benefit: no second Azure LB. One LB does everything. Cheaper, simpler, less Azure-specific magic.

6.3 The publish-status-address invariant

ingress-nginx writes the LB IP into each Ingress's status.loadBalancer.ingress[0].ip field. external-dns reads that field to decide what A record to publish. cert-manager's HTTP-01 challenge needs that A record to be reachable from the public internet.

Default behaviour with controller.service.type: ClusterIP is to write the internal ClusterIP into that status field. external-dns then publishes 10.43.0.x as the public A record. LE challenges fail. Disaster.

The fix, encoded in kube/helm/ingress-nginx.values.yaml:

controller:
  publishService:
    enabled: false
  extraArgs:
    publish-status-address: "20.48.237.183"

That tells nginx: “ignore your own Service, publish this IP into Ingress status.” This is Phase 6.5 post-impl fix-up #4 in the rollout plan.

6.4 IngressClass and the default

Set ingressClassName: nginx and mark this IngressClass as default in the chart values. Any Ingress in any namespace that doesn't specify a class will use nginx.
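The rendered object would look roughly like this — a sketch of the IngressClass the chart creates when marked default:

```yaml
# Sketch -- normally rendered by the ingress-nginx chart, not hand-written.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
  annotations:
    # This annotation is what makes class-less Ingresses default to nginx:
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: k8s.io/ingress-nginx
```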


7. Phase 4 — GitOps (Fleet)

7.1 What problem GitOps solves

Without GitOps, every cluster change is a kubectl apply somewhere — usually from a CI runner with cluster credentials. Problems:

GitOps inverts this: an in-cluster agent reads from git on a poll, reconciles desired state into the cluster. Cluster credentials never leave the cluster. Drift is visible (the agent corrects it). Rollback is git revert.

7.2 Fleet vs Argo vs Flux

Three viable choices in 2026. Why Fleet here:

If asked “would you pick Fleet again?”: for a multi-cluster Rancher fleet, yes. For a single cluster outside Rancher's universe, Argo CD has a richer UI and bigger community — not necessarily better, but more familiar.

7.3 Fleet's data model

Three CRD kinds you must know: GitRepo (points at a repo, branch, and list of paths), Bundle (generated by the controller, one per path — the unit of deployment), and BundleDeployment (a Bundle materialized onto a specific cluster by the agent).

Bundle names are deterministic: <GitRepo name>-<path-with-slashes-as-dashes>. Example: kube/manifests/security/kyverno under GitRepo canarie-kube becomes canarie-kube-kube-manifests-security-kyverno. That name is what you use in dependsOn: between bundles. Used in two places in the repo:

7.4 The single GitRepo

kube/manifests/fleet/gitrepo.yaml — its spec.paths: is the source of truth for what Fleet manages. Adding a Phase 4+ component requires appending its sub-directory to that list. Not appending = Fleet ignores it.
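A hedged sketch of the GitRepo shape — the repo URL is hypothetical and the real paths list is longer:

```yaml
# Sketch of kube/manifests/fleet/gitrepo.yaml -- illustrative, abbreviated.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: canarie-kube
  namespace: fleet-local            # standard namespace for the local cluster
spec:
  repo: https://gitlab.com/example/canarie-kube.git   # hypothetical URL
  branch: main
  clientSecretName: canarie-kube-auth   # basic-auth Secret from sync-fleet-git-auth.sh
  paths:
    - kube/manifests/observability
    - kube/manifests/security/kyverno
    # kube/manifests/fleet/ is deliberately absent -- Fleet must not manage itself
```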

The list deliberately excludes kube/manifests/fleet/ — otherwise Fleet would try to reconcile its own install bundle, fight RKE2's helm-controller for it, and produce intermittent agent restarts. (RKE2 installed Fleet, RKE2 keeps reconciling Fleet's install — that's the whole reason this directory uses the HelmChart CRD instead of being a Fleet bundle.)

7.5 The two install mechanisms

| | Mechanism A: RKE2 HelmChart | Mechanism B: Fleet bundle |
|---|---|---|
| Used in | Phases 2–3 (Longhorn, ingress-nginx, Fleet itself) | Phases 4+ (everything else) |
| CRD | helm.cattle.io/v1 HelmChart | fleet.cattle.io/v1alpha1 Bundle (via fleet.yaml) |
| Reconciler | RKE2's helm-controller (goroutine in rke2-server) | Fleet controller + agent Pods |
| Apply path | kubectl apply -R -f kube/manifests/ from operator laptop (sync-manifests.sh) | Fleet polls git every 60 s |
| Namespace creation | Explicit kind: Namespace YAML in the same file | defaultNamespace: in fleet.yaml + Helm --create-namespace |
| Values | Inline valuesContent in the HelmChart YAML, byte-identical to kube/helm/&lt;name&gt;.values.yaml (CI enforces) | valuesFiles: [../../../helm/&lt;name&gt;.values.yaml] — single source of truth |
| Why this one? | Has to exist before Fleet does | Standard once Fleet exists |

The “byte-identical values” rule is enforced by the values-sync-check CI job. It's there because Phase 2/3 manifests need both the YAML representation (for the manifest file) and the canonical file (for helm lint). Phase 4+ doesn't have this duplication because Fleet reads the values file directly.
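Mechanism A's shape can be sketched like this (a hedged illustration; chart name and version come from the phase table, and the values content is an assumption, not the repo's file):

```yaml
# Sketch of the Mechanism A shape (Phase 2/3). The valuesContent block must stay
# byte-identical to kube/helm/longhorn.values.yaml -- values-sync-check enforces it.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: longhorn
  namespace: kube-system            # RKE2's helm-controller watches this namespace
spec:
  repo: https://charts.longhorn.io
  chart: longhorn
  version: 1.7.2
  targetNamespace: longhorn-system
  valuesContent: |-
    defaultSettings:
      defaultReplicaCount: 2        # illustrative value, not necessarily the repo's
```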


8. Phase 5 — Observability (kube-prometheus-stack + Loki + Promtail)

8.1 kube-prometheus-stack (KPS)

Single Helm chart that bundles the whole Prometheus Operator world:

Why a ServiceMonitor instead of editing prometheus.yaml? Because the Operator pattern lets each app declare its own scrape config alongside its other manifests, no central edit needed. Combined with searchNamespace: ALL, any ServiceMonitor in any namespace is auto-discovered.
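What an app-owned scrape config looks like — a sketch with hypothetical names:

```yaml
# Illustrative ServiceMonitor -- the app declares its own scraping, no central edit.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app               # hypothetical
  namespace: demo              # any namespace: auto-discovered (searchNamespace: ALL)
spec:
  selector:
    matchLabels: {app: demo-app}   # matches the Service's labels
  endpoints:
    - port: metrics            # named port on the Service
      interval: 30s
```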

8.2 Loki + Promtail

This pair gets every namespace's Pod logs into Grafana with no per-namespace config. Promtail picks up new namespaces automatically because it's just reading the host filesystem.

8.3 The Fleet canary

There's a tiny bundle at kube/manifests/observability/canary/fleet-canary.yaml that just deploys an empty ConfigMap to the default namespace. Its purpose: a smoke test that Fleet's reconciliation loop is working. If the ConfigMap doesn't exist, Fleet itself is broken.


9. Phase 6 — Certs + DNS

9.1 cert-manager — the pieces

Five CRDs to know:

| CRD | Role |
|---|---|
| Issuer / ClusterIssuer | A source of certificates (Let's Encrypt prod, LE staging, self-signed CA, etc.). Issuer is namespaced; ClusterIssuer is cluster-wide and used by all namespaces. |
| Certificate | “I want a cert for grafana.example.com, signed by issuer X, stored in Secret grafana-tls.” |
| CertificateRequest | A pending request to the Issuer. cert-manager creates these from Certificates. |
| Order | LE-specific. The ACME order for one or more domains. |
| Challenge | LE-specific. One DNS or HTTP challenge per domain in an Order. |

You don't usually create Certificates directly — you put cert-manager annotations on an Ingress, and cert-manager generates the Certificate for you from the Ingress's tls: spec.

9.2 The three ClusterIssuers in this cluster

kube/manifests/certs-dns/cluster-issuers/issuers.yaml:

Both LE issuers use HTTP-01 challenge via ingressClassName: nginx. cert-manager creates a temporary Ingress for /.well-known/acme-challenge/<token>, LE fetches it over HTTP, validates control of the domain, signs the cert.
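A sketch of one such ClusterIssuer — the email and account-key Secret name are assumptions; the solver shape is cert-manager's documented HTTP-01 form:

```yaml
# Hedged sketch -- the repo's issuers.yaml may name things differently.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                  # hypothetical contact
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # hypothetical Secret name
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx         # challenge served through ingress-nginx
```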

This is why publish-status-address (from §6.3) is load-bearing: HTTP-01 needs the public A record to point at the public LB IP, not at a ClusterIP.

9.3 DNS-01 vs HTTP-01

DNS-01 puts the challenge as a TXT record on the domain — works for wildcards (*.example.com) and doesn't need the host to be HTTP-reachable. HTTP-01 is simpler but doesn't do wildcards and requires the host to be publicly HTTP-reachable. This cluster picked HTTP-01 because no wildcards are needed and the Ingress path was already wired up.

9.4 external-dns

A controller that watches Ingress (and Service-type-LoadBalancer) objects and writes DNS records to a configured provider — here, Azure DNS.

Two annotations on the Ingress drive it:

external-dns watches every namespace (namespaceFilter: "") and only writes records inside its domainFilter (rke2.ericharrison.ca). Records outside that suffix are ignored.
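The relevant values shape, sketched loosely (key names differ between the external-dns charts, so treat these as illustrative rather than the repo's file):

```yaml
# Hedged sketch of external-dns configuration -- keys are illustrative.
provider: azure
domainFilters:
  - rke2.ericharrison.ca     # only records under this suffix are managed
txtOwnerId: canarie-kube     # hypothetical; the TXT registry marks records as ours
sources:                     # which K8s objects to read hostnames from
  - ingress
  - service
```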

Auth to Azure DNS is via a service principal whose JSON is in a K8s Secret (seeded by sync-external-dns-azure.sh from .env). Workload Identity was the original plan but deferred as a tactical choice — the SP-in-Secret path works identically and retires cleanly once WI is done as a follow-up.

9.5 Why Cloudflare → Azure DNS delegation

The apex zone ericharrison.ca lives at Cloudflare. The subdomain rke2.ericharrison.ca is delegated to Azure DNS via four NS records at Cloudflare (DNS-only / “grey cloud” — NS records can't be Cloudflare-proxied). From there, Azure DNS is authoritative.

Why? Reduce platform integration surface. The whole stack already commits to Azure; using Cloudflare's API for DNS would add a second cloud-provider auth path (Cloudflare API token in .env, separate external-dns provider config). Letting Cloudflare just be the registrar/apex and delegating to Azure is one less moving part.

The NS delegation is a manual step at the registrar. Terraform outputs the four NS names but cannot itself update Cloudflare. Documented in Phase 6 post-impl.

9.6 The end-to-end “expose a service” flow

Helm chart values for any new public service:

ingress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - host: foo.rke2.ericharrison.ca
      paths: [{path: /, pathType: Prefix}]
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    external-dns.alpha.kubernetes.io/hostname: foo.rke2.ericharrison.ca
  tls:
    - secretName: foo-tls
      hosts: [foo.rke2.ericharrison.ca]

Within ~2 minutes after git push:

  1. Fleet polls, applies the chart.
  2. external-dns sees the Ingress, writes A + TXT records in Azure DNS.
  3. cert-manager creates a Certificate, runs the HTTP-01 challenge, gets a cert from LE, stores it in foo-tls.
  4. ingress-nginx serves HTTPS with the new cert.

10. Phase 7 — Security (Trivy + Kyverno; Falco deferred)

10.1 Trivy Operator

Aqua Security's Trivy is a vulnerability scanner. The Operator wraps it as a controller that:

Why an Operator and not a CronJob? So results are queryable as native K8s objects (kubectl get vulnerabilityreports -A) and consumable by other tools.

10.2 Kyverno

A policy engine that's K8s-native — policies are written in YAML, not a DSL. Two modes: Audit (violations are recorded as PolicyReports; admission proceeds) and Enforce (violating requests are rejected at admission).

This cluster ships all four policies in Audit mode at commit 53eb10b. Flipping to Enforce is a one-line PR per policy, deliberately deferred until a week of clean PolicyReports has accumulated. Why deferred? Because some infra workloads (often charts you don't control) genuinely don't set resource limits or pin image tags, and Enforcing on day one would block them.

10.3 The four ClusterPolicies

Live in kube/manifests/security/kyverno-policies/:

| Policy | What it blocks (in Enforce mode) | Scope |
|---|---|---|
| disallow-privileged | Pods with securityContext.privileged: true | Every namespace |
| disallow-latest-tag | Images using :latest or no tag | Every namespace |
| require-resource-limits | Containers missing resources.limits.cpu and .memory | Every namespace |
| disallow-host-path | Pods using hostPath volumes | Every namespace except a hardcoded allow-list of 10 infra namespaces (kube-system, falco, longhorn-system, monitoring, ingress-nginx, cert-manager, external-dns, trivy-system, cattle-fleet-system, kyverno) |

disallow-host-path has the allow-list because Longhorn, Promtail, etc. legitimately need hostPath. Any new namespace that needs hostPath must be added to the policy's allow-list — there's no label-based mechanism.
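For shape, here is a hedged sketch of what one of these policies plausibly looks like (the repo's actual rule may differ in detail):

```yaml
# Illustrative ClusterPolicy in Audit mode -- not the repo's exact YAML.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit    # flipping this to Enforce is the one-line PR
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"  # Kyverno pattern: reject images ending in :latest
```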

10.4 Falco — what was deferred and why

Falco is the runtime threat detection layer. It tails kernel events (syscalls) via either a kernel module or eBPF program and matches them against rules (“a shell was spawned in a container”, “a pod tried to read /etc/shadow”). Streams findings via Falcosidekick to e.g. Loki.

Why deferred: the 0.40.x chart's bundled DKMS driver doesn't compile against the kernel 6.17 series shipped by Ubuntu 24.04, and the pre-built modern-bpf driver also failed to load on test. Documented in plan Phase 7. The Fleet bundle is scaffolded (kube/manifests/security/falco/fleet.yaml exists) but the GitRepo's paths: deliberately omits it, so Fleet doesn't try to reconcile a broken bundle.

10.5 Pod Security Admission (PSA)

K8s's built-in admission gate, replaced PodSecurityPolicy in v1.25. You label a namespace with one of three profiles (privileged, baseline, restricted) and three modes (enforce, audit, warn):

metadata:
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest

In this cluster: only longhorn-system carries any PSA label (enforce: privileged, because Longhorn needs it). No cluster-wide PSA default is configured. New namespaces get K8s's built-in default behaviour (which is effectively no PSA enforcement beyond privileged).

This is something you might be asked to improve. The honest answer: a follow-up would set a cluster-wide AdmissionConfiguration defaulting to baseline enforce, with the infra namespaces opted into privileged per-namespace.
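That follow-up would be an apiserver-side file, roughly like this (a sketch using the standard PSA AdmissionConfiguration schema; the exemption list is illustrative):

```yaml
# Hedged sketch of a cluster-wide PSA default -- passed to the apiserver
# via --admission-control-config-file, not applied as a cluster object.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: baseline           # every namespace gets baseline unless labeled
        enforce-version: latest
      exemptions:
        namespaces: [kube-system, longhorn-system]   # illustrative opt-outs
```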


11. The two flows through the system

11.1 How a public HTTPS request lands on a Pod

  1. DNS resolution. Browser asks for grafana.rke2.ericharrison.ca.
  2. Recursive resolver asks the apex ericharrison.ca zone (Cloudflare). Cloudflare returns NS records pointing at ns[1-4]-01.azure-dns.* (delegated subdomain).
  3. Resolver queries Azure DNS, gets back the A record 20.48.237.183 (written there by external-dns based on the Ingress status).
  4. Browser opens TCP :443 to 20.48.237.183.
  5. Azure Standard LB receives the connection. Its :443 rule has backend pool bepool-workers (worker NICs). LB picks a worker.
  6. The worker VM's NIC delivers the packet. Because ingress-nginx's DaemonSet uses hostNetwork: true, it's bound to :443 on the worker's interface — packet goes straight to nginx.
  7. nginx terminates TLS using the Secret grafana-tls (provisioned by cert-manager from LE).
  8. nginx looks up the Ingress for Host: grafana.rke2.ericharrison.ca, finds the Service backend (monitoring/kps-grafana:80).
  9. nginx makes an in-cluster connection to the Service ClusterIP. Cilium (not kube-proxy) translates ClusterIP → an actual Pod IP via eBPF.
  10. The Pod responds. Path back is symmetric.
sequenceDiagram
  participant U as User
  participant CF as Cloudflare DNS
  participant AZDNS as Azure DNS
  participant LB as Azure LB 20.48.237.183
  participant WK as Worker VM
  participant NGX as ingress-nginx
  participant POD as Backend Pod
  U->>CF: NS query for rke2.ericharrison.ca
  CF-->>U: delegated to azure-dns
  U->>AZDNS: A query for grafana.rke2...
  AZDNS-->>U: 20.48.237.183
  U->>LB: HTTPS :443
  LB->>WK: :443 (backendPool=workers)
  WK->>NGX: DaemonSet on host :80/:443
  NGX->>POD: ClusterIP → Pod IP (Cilium eBPF)
  POD-->>NGX: response
  NGX-->>WK: response
  WK-->>LB: response
  LB-->>U: response
Inbound HTTPS — DNS delegation, Azure LB, worker hostNetwork, nginx TLS termination, Cilium service routing.

11.2 How a git push reaches a Pod

  1. Developer commits to kube/manifests/<phase>/<bundle>/... and pushes to main.
  2. GitLab pipeline runs (in parallel): helm-lint, kubeconform, values-sync-check. Lint only — no kubectl apply. If lint fails, the commit is on main but reviewers see red.
  3. Fleet controller polls GitLab every 60 s using the canarie-kube-auth Secret (basic-auth, seeded from .env's GITLAB_TOKEN).
  4. Fleet sees a new commit, walks the GitRepo's spec.paths:, generates/updates a Bundle per directory.
  5. The Fleet agent (running locally in the cluster) sees the new Bundle, materializes a BundleDeployment, and runs helm install or helm upgrade (with --create-namespace if needed).
  6. Helm renders the chart with valuesFiles:-pointed values, applies the rendered manifests via the K8s API.
  7. K8s controllers reconcile the manifests into running Pods. Loop closes.
flowchart LR
  Dev["Developer commits<br/>kube/manifests/*"]
  GL["GitLab"]
  CI["Pipeline:<br/>helm-lint + kubeconform"]
  Fleet["Fleet controller<br/>cattle-fleet-system"]
  Bundle["BundleDeployment<br/>per sub-path"]
  Helm["helm install / upgrade"]
  K8s["Running pods"]
  Dev -->|git push main| GL
  GL -->|triggers| CI
  GL -->|HTTPS basic-auth| Fleet
  Fleet -->|polls 60s| GL
  Fleet --> Bundle
  Bundle --> Helm
  Helm --> K8s
GitOps flow — two independent paths. CI lints; Fleet deploys. Either can fail without blocking the other.

Two independent paths. CI failure does not block Fleet (Fleet keeps reconciling whatever's in main). Fleet failure does not block CI (you can still ship infra changes). This separation is intentional and worth pointing out — it means one broken layer doesn't blast-radius into the other.


12. Non-obvious invariants

Each one is a “wait, why?” implementation detail worth internalising.

  1. kube/helm/<chart>.values.yaml and the matching helmchart.yaml's valuesContent block must be byte-for-byte identical. CI job values-sync-check enforces. Only applies to Phase 2/3 (longhorn, ingress-nginx). Phase 4+ uses Fleet's valuesFiles: and avoids the duplication.
  2. controller.publishService.enabled: false + extraArgs.publish-status-address: <LB IP> is load-bearing for cert-manager. Without it, every Ingress reports a ClusterIP as its public address; external-dns publishes that; LE challenges fail. (§6.3, §9.2)
  3. Longhorn's create-default-disk label has to come from cloud-init node-label:, not a post-install patch. That's the only mechanism that survives agent restart and runs at first node registration, which is when Longhorn decides to materialize its default disk.
  4. Managed data disks survive VM destroy/recreate; OS disks don't. Cloud-init mkfs.ext4 is gated by blkid, so existing Longhorn data on /dev/disk/azure/scsi1/lun0 is preserved across rebuilds. This is the entire reason Longhorn state is on a separate data disk.
  5. custom_data is force-new on azurerm_linux_virtual_machine. Any cloud-init edit destroys and recreates the VMs. Data disks survive (separate resources); RKE2 server state on the OS disk does not. After Phase 2, a cloud-init edit means a cluster rebuild.
  6. kube/manifests/fleet/ is excluded from the Fleet GitRepo's spec.paths. Otherwise Fleet would reconcile its own install bundle and fight with RKE2's helm-controller.
  7. Azure DNS NS delegation at Cloudflare is a manual step at the registrar. No Terraform automation can delegate NS records at a registrar the operator owns.
  8. KV soft-delete is enabled with purge_protection_enabled: false and a 90-day retention. Why: if a partial Terraform apply creates a secret but state rolls back, the next apply hits “already exists.” With purge protection on, recovery would be a 90-day wait. This way, az keyvault secret delete + purge is the one-liner fix.
  9. Fleet bundle names are <GitRepo>-<path-with-slashes-as-dashes>. Used in dependsOn: references. Get the name wrong and dependsOn: silently never resolves.
  10. searchNamespace: ALL (KPS) and namespaceFilter: "" (external-dns) mean any new namespace is auto-discovered. No need to update the observability or DNS bundles when adding new app namespaces.

13. Quick-reference glossary

| Term | One-liner |
|---|---|
| CNI | Container Network Interface — plugin spec for “give Pods IPs and connectivity.” Cilium implements it here. |
| CRI | Container Runtime Interface — what kubelet talks to. containerd implements it here. |
| CSI | Container Storage Interface — plugin spec for “provision and mount storage.” Longhorn implements it here. |
| CRD | CustomResourceDefinition — extends the K8s API with new object kinds. |
| eBPF | Sandboxed, verified bytecode that runs in the Linux kernel in response to events. Used by Cilium for fast packet routing without iptables. |
| etcd | Distributed key-value store. K8s's source of truth. RKE2 embeds it in rke2-server. |
| GitOps | Cluster state declared in git, pulled by an in-cluster agent. Inverts CI-driven kubectl apply. |
| Helm chart | Templated bundle of K8s manifests with values. helm install renders + applies. |
| HPA | HorizontalPodAutoscaler — scales replicas of a Deployment based on metrics. Not used in this cluster. |
| IaC | Infrastructure-as-Code. OpenTofu here. |
| Ingress | HTTP(S) routing layer in front of Services. Needs an Ingress Controller (ingress-nginx) to do anything. |
| IMDS | Instance Metadata Service. Azure's 169.254.169.254. VMs use it to get OAuth tokens for their MI. |
| Kyverno | Policy engine. Policies are written in YAML, not a DSL. |
| MI | Managed Identity. Azure-native identity for a VM that can be granted RBAC on other Azure resources without a stored secret. |
| NSG | Network Security Group. Azure's stateful firewall, attached per subnet (or per NIC). |
| OpenTofu | Open-source fork of Terraform after the BSL relicense. |
| PSA | Pod Security Admission. K8s's built-in admission gate; replaced PodSecurityPolicy. |
| RBAC | Role-Based Access Control. K8s has it; this cluster mostly uses chart-default RBAC. |
| RKE2 | Rancher's K8s distro. Full upstream K8s, embedded etcd + containerd, CIS-hardened. |
| ServiceMonitor | Prometheus Operator CRD that tells Prometheus to scrape a Service. |
| Trivy | Vulnerability scanner from Aqua Security. The Operator wraps it as a controller producing CR-shaped reports. |