Canarie kube — Architecture Overview

1. Kubernetes 101 — just enough to be dangerous

1.1 What Kubernetes actually is

Kubernetes is a container orchestrator — it does not run containers itself. It tells worker nodes what containers to run, where, how many copies, and how they should reach each other. The container runtime (in this build: containerd) is what actually executes them.

The control plane is a small set of long-running processes that maintain the cluster's “desired state”:

| Component | Job |
|---|---|
| etcd | The cluster's database. Every K8s object (Pod, Service, Secret, etc.) is a row in etcd. |
| kube-apiserver | The only thing that talks to etcd. Every other component, including kubectl, talks to the apiserver. |
| kube-scheduler | Decides which node a new Pod runs on. |
| kube-controller-manager | Bundles the dozens of small loops that drive resource state — Deployment controller, ReplicaSet controller, Node controller, etc. Each loop reads “desired” from etcd, observes “actual” via the apiserver, and acts to close the gap. |
| cloud-controller-manager | Cloud-provider-specific glue (in real AKS this is what makes Service type=LoadBalancer provision an Azure LB). In this cluster: not used — see §6. |

Each worker node runs:

| Component | Job |
|---|---|
| kubelet | The node-level agent. Talks to the apiserver, gets told “run these pods”, and tells containerd to start/stop them. |
| containerd | The container runtime. |
| kube-proxy (normally) | Programs iptables/IPVS so Service ClusterIPs route to the right Pod IPs. Replaced by Cilium in this cluster — see §4. |
| CNI plugin | Gives Pods their IPs and connects them to the cluster network. Cilium here. |

The pattern to internalise: everything in K8s is reconciliation. You declare desired state (a YAML manifest), the apiserver writes it to etcd, and a controller loop notices the gap between desired and actual, then takes action. There is no “deploy” verb — there's only “write the desired thing to the apiserver and wait for the world to converge.”
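That reconciliation loop is easiest to see against a concrete manifest. A minimal sketch — the name and image are illustrative, not from this repo:

```yaml
# Illustrative desired state -- not a manifest from this repo.
# "Deploying" is nothing more than writing this object to the apiserver;
# the Deployment controller then converges actual state toward it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-web          # hypothetical name
spec:
  replicas: 2             # desired: two Pods, always
  selector:
    matchLabels: {app: demo-web}
  template:
    metadata:
      labels: {app: demo-web}
    spec:
      containers:
        - name: web
          image: nginx:1.27   # pinned tag, not :latest (see §10)
```

Delete one of the two Pods and the ReplicaSet controller recreates it — nothing "re-runs the deploy"; the loop just closes the gap again.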

1.2 Core primitives

Pod — The smallest deployable unit. One or more co-located containers sharing a network namespace and (optionally) volumes. Almost never created directly; created by controllers.

Deployment — Manages a ReplicaSet which manages Pods. Use for stateless workloads (web servers, API processes). Supports rolling updates, rollback.

StatefulSet — Like a Deployment but each Pod has a stable identity (pod-0, pod-1) and stable storage (each Pod gets its own PVC). Use for databases, Prometheus, anything that needs to know “which replica am I”. This cluster uses it for Prometheus, Alertmanager, Loki.

DaemonSet — Runs exactly one Pod on every (matching) node. Use for node-level agents. This cluster uses it for ingress-nginx, Promtail, Cilium.

Service — A stable virtual IP and DNS name in front of a set of Pods. Three types relevant here:

Ingress — HTTP(S) routing layer in front of Services. Decides “host grafana.example.com + path / → Service grafana port 80”. Needs an Ingress Controller (an actual running pod) to do anything; the Ingress object is just config. This cluster uses ingress-nginx as the controller.

Namespace — A scope/folder for K8s objects. Every object except a few cluster-scoped ones (Node, ClusterRole, PersistentVolume, CRDs themselves) lives in exactly one namespace. RBAC, NetworkPolicy, and ResourceQuota are all per-namespace.

ConfigMap / Secret — Key-value blobs mounted into Pods as files or env vars. Secret is base64-encoded (not encrypted at rest by default — that's a separate apiserver setting).
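A minimal sketch of the base64 point — this Secret is hypothetical, and the encoded value is trivially reversible:

```yaml
# Illustrative Secret -- values are base64-encoded, not encrypted:
#   echo -n 'hunter2' | base64   ->   aHVudGVyMg==
apiVersion: v1
kind: Secret
metadata:
  name: demo-credentials    # hypothetical
type: Opaque
data:
  password: aHVudGVyMg==    # base64("hunter2"); anyone with read access can decode it
```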

PersistentVolumeClaim (PVC) — A Pod's request for storage (“I need 10 GiB”). Bound to a PersistentVolume (PV) which is the actual storage. PVs are usually provisioned dynamically by a CSI driver (Container Storage Interface) — in this cluster, by Longhorn, see §5.

CustomResourceDefinition (CRD) — Lets an operator extend the K8s API with new object kinds. Every component on top of plain K8s (Longhorn, Fleet, cert-manager, Prometheus Operator, Kyverno) ships its own CRDs. When you write a Certificate YAML you're talking to cert-manager's CRD, not core K8s.

Operator — A shorthand, not an API kind: a controller that watches a CRD and reconciles its desired state. “The Trivy Operator” = a pod that watches VulnerabilityReport CRs and acts on them.

1.3 The five-minute concept dump


2. The shape of this project

Single sentence: a production-pattern RKE2 cluster on Azure, OSS-only, built up across seven PRs so each layer of the stack is a discrete, reviewable change.

Why no AKS? AKS hides the parts of running K8s that give the project its shape — CNI install, CSI install, ingress install, etcd, certificates, GitOps wiring. The project is about assembling each of those layers end-to-end; AKS would short-circuit the exercise.

Why does each phase get its own PR? Each PR is a reviewable, individually-justified extension of the previous layer. The list below is also the order they had to ship in, because each later phase depends on something in an earlier one.

| # | Phase | What it adds | Installed by |
|---|---|---|---|
| 1 | Cluster | RKE2 v1.31 + Cilium (kube-proxy replacement) | cloud-init + RKE2 HelmChartConfig |
| 2 | Storage | Longhorn 1.7.2 (default StorageClass) | RKE2 HelmChart CRD |
| 3 | Ingress | ingress-nginx (DaemonSet, hostNetwork) | RKE2 HelmChart CRD |
| 4 | GitOps | Fleet + the GitRepo pointing at this repo | RKE2 HelmChart CRD |
| 5 | Observability | kube-prometheus-stack + Loki + Promtail | Fleet bundle |
| 6 | Certs + DNS | cert-manager (LE HTTP-01) + external-dns (Azure DNS) + 3 ClusterIssuers | Fleet bundles |
| 7 | Security | Trivy Operator + Kyverno + 4 audit-mode ClusterPolicies (Falco deferred) | Fleet bundles |

The Falco deferral is real and worth knowing — its eBPF driver doesn't compile against kernel 6.17 in the chart's bundled DKMS path, and the pre-built modern-bpf driver also failed to load. Documented in the rollout plan's Phase 7 post-impl notes.

flowchart TD
  subgraph P1["Phase 1 — Cluster"]
    RKE2["RKE2 v1.31 server on cp-01<br/>agents on wk-01, wk-02"]
    Cilium["Cilium CNI<br/>kubeProxyReplacement=True"]
    RKE2 --> Cilium
  end
  subgraph P2["Phase 2 — Storage"]
    Longhorn["Longhorn 1.7.2<br/>2 workers × 64 GiB data disk"]
  end
  subgraph P3["Phase 3 — Ingress"]
    Nginx["ingress-nginx DaemonSet<br/>hostNetwork, on workers"]
    AzLB["Azure Standard LB<br/>:80/:443 → workers<br/>:6443 → cp-01"]
    AzLB --> Nginx
  end
  subgraph P4["Phase 4 — GitOps"]
    Fleet["Fleet controller+agent<br/>GitRepo: this repo"]
  end
  subgraph P5["Phase 5 — Observability"]
    KPS["kube-prometheus-stack"]
    Loki["Loki + Promtail"]
  end
  subgraph P6["Phase 6 — Certs + DNS"]
    CM["cert-manager<br/>LE HTTP-01 via ingress-nginx"]
    EDNS["external-dns<br/>Azure DNS zone rke2.ericharrison.ca"]
    AzDNS["Azure DNS zone<br/>delegated from Cloudflare"]
    EDNS --> AzDNS
    CM -. HTTP-01 challenge path .-> Nginx
  end
  subgraph P7["Phase 7 — Security"]
    Trivy["Trivy Operator<br/>Vuln + ConfigAudit + Compliance"]
    Kyverno["Kyverno<br/>4 ClusterPolicies in Audit"]
  end
  P1 --> P2 --> P3 --> P4
  P4 -->|manages| P5
  P4 -->|manages| P6
  P4 -->|manages| P7
Phase map — how the stack assembled itself. Phases 2–3 install via RKE2's native HelmChart CRD; Phase 4+ is owned by Fleet.

2.1 Topology one-liner

3 Azure VMs — cp-01 (control plane, 10.20.0.10) and wk-01/wk-02 (workers, 10.20.0.132/.133), all Standard_D2ads_v5 Ubuntu 24.04, in one VNet (10.20.0.0/24) split into three subnets (control, lb, workers). One Azure Standard Load Balancer fronts both :6443 (kubectl → cp-01) and :80/:443 (HTTP(S) → workers). One Key Vault holds the RKE2 join token, the Longhorn UI password, and the Grafana admin password. One Azure DNS zone (rke2.ericharrison.ca) is delegated from Cloudflare via four NS records.


3. The infrastructure layer (Azure + OpenTofu + GitLab)

This layer is the foundation — none of K8s exists until cloud-init finishes on the VMs.

3.1 The OpenTofu story

3.2 The Azure primitives that exist

3.3 The bootstrap pattern that ties Azure → K8s together

This is the piece that ties Azure IAM to K8s cluster bootstrap without a pre-placed secret:

  1. Terraform generates the RKE2 join token via random_password and writes it to Key Vault.
  2. Each VM has a SystemAssigned Managed Identity with Key Vault Secrets User on the vault.
  3. At boot, cloud-init queries the Azure Instance Metadata Service (IMDS) at 169.254.169.254 for an OAuth token, then curls the Key Vault REST API to fetch the join token. No az CLI needed; no secret travels outside Azure.
  4. cp-01 starts rke2-server, persists the token internally. wk-01 and wk-02 read the same token and join via cp-01:9345.
  5. Workers also format their data disk (/dev/disk/azure/scsi1/lun0, 64 GiB) and mount it at /var/lib/longhorn. The mkfs.ext4 is gated by blkid so VM rebuilds preserve Longhorn data.
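Steps 3–5 can be sketched as a cloud-init fragment. This is a hedged illustration, not the repo's cp.yaml/wk.yaml — the vault name (demo-kv) and the use of jq are assumptions:

```yaml
# Hedged sketch of the IMDS -> Key Vault fetch (not the repo's actual cloud-init).
runcmd:
  - |
    # OAuth token from IMDS via the VM's Managed Identity -- no stored secret:
    TOKEN=$(curl -s -H 'Metadata: true' \
      'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net' \
      | jq -r .access_token)
    # Join token from the Key Vault REST API ("demo-kv" is hypothetical):
    JOIN=$(curl -s -H "Authorization: Bearer $TOKEN" \
      'https://demo-kv.vault.azure.net/secrets/rke2-join-token?api-version=7.4' \
      | jq -r .value)
    # Drop it where rke2-server / rke2-agent reads config:
    printf 'token: %s\n' "$JOIN" > /etc/rancher/rke2/config.yaml.d/10-token.yaml
```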
sequenceDiagram
  participant TF as Terraform/OpenTofu
  participant AZ as Azure
  participant KV as Key Vault
  participant CP as cp-01 (cloud-init)
  participant WK as wk-01/02 (cloud-init)
  participant K8S as RKE2 API
  TF->>AZ: create RG, VNet, LB, VMs, MIs
  TF->>KV: random_password rke2-join-token
  TF->>AZ: assign KV Secrets User to each VM MI
  AZ->>CP: cloud-init with custom_data
  AZ->>WK: cloud-init with custom_data
  CP->>KV: IMDS token → fetch rke2-join-token
  CP->>CP: write config.yaml.d/10-token.yaml
  CP->>CP: systemctl enable --now rke2-server
  CP->>CP: drop HelmChartConfig (Cilium / eBPF)
  WK->>WK: format disk /dev/disk/azure/scsi1/lun0
  WK->>KV: IMDS token → fetch rke2-join-token
  WK->>CP: join via :9345
  CP->>K8S: API becomes ready
  WK->>K8S: Node registered
Bootstrap flow — empty resource group to ready cluster. The join token never leaves Azure (KV → IMDS → VM via Managed Identity).

The custom_data field on azurerm_linux_virtual_machine is force-new — any change to kube/cloud-init/{cp,wk}.yaml destroys and recreates the VMs. Data disks survive (separate resources). RKE2 server state on cp-01's OS disk does not, so a cloud-init edit after Phase 2 effectively means a cluster rebuild.

3.4 The four operator-run bootstrap scripts

Some Secrets can't be generated by Terraform because they need .env values that live only on the operator's laptop. These four scripts run once after the cluster comes up:

| Script | Seeds |
|---|---|
| sync-fleet-git-auth.sh | Fleet's git-clone basic-auth Secret from .env's GITLAB_TOKEN |
| sync-external-dns-azure.sh | external-dns's Azure SP credentials |
| sync-grafana-admin.sh | Grafana admin password (consumed by KPS via existingSecret) |
| sync-longhorn-basic-auth.sh | basic-auth Secret for the Longhorn UI Ingress |

Plus fetch-kubeconfig.sh, which SSHes into cp-01 (passwordless sudo on the Azure Ubuntu image), cats /etc/rancher/rke2/rke2.yaml, and rewrites the server URL to point at the LB public IP rather than 127.0.0.1. That file is ~/.kube/kube-dev.yaml from there on.


4. Phase 1 — the cluster itself (RKE2 + Cilium)

4.1 Why RKE2

You will get asked “why RKE2 and not k3s or kubeadm or AKS?”. The answer:

4.2 What RKE2 gives you for free

The single most important RKE2 fact for this project: RKE2's helm-controller runs as a goroutine inside rke2-server, not as a Pod. Manifests dropped into /var/lib/rancher/rke2/server/manifests/ are reconciled on every server start, before the CNI is up. Two CRDs come from this: HelmChart (declaratively install a Helm chart) and HelmChartConfig (override the values of a chart RKE2 itself installs).

That mechanism is how Cilium gets configured — see next.

4.3 Cilium and kube-proxy replacement

Cilium is the CNI plugin — it gives Pods their IPs and connects them. This cluster sets cni: cilium in RKE2's server config, then drops a HelmChartConfig at /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml that sets kubeProxyReplacement: True.
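A sketch of what that HelmChartConfig plausibly looks like (the repo's exact values block may carry more settings):

```yaml
# Sketch of /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml.
# HelmChartConfig overrides values of a chart RKE2 itself installs (here: rke2-cilium).
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system     # RKE2's helm-controller watches this namespace
spec:
  valuesContent: |-
    kubeProxyReplacement: true
```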

What that means in practice:

Why Cilium specifically, not Calico or Flannel?

eBPF — a safe, sandboxed in-kernel VM that runs verified bytecode in response to kernel events (syscalls, network packets, tracepoints). Originally for packet filtering (“Berkeley Packet Filter”), now general-purpose. Lets userspace programs do work in kernel context without writing a kernel module.

4.4 The token bootstrap (review)

Terraform-generated random_password → rke2-join-token in KV → VMs read it via Managed Identity + IMDS → no chicken-and-egg, no secret in any pipeline log. This is how the cluster bootstraps a multi-node join without an already-running cluster to distribute the token.


5. Phase 2 — Storage (Longhorn)

5.1 What Longhorn is

A distributed block-storage system written in Go that runs as Pods inside your cluster. Each Pod-managed “engine” exposes a virtual block device that's replicated across nodes' local disks. Implements the CSI (Container Storage Interface) so K8s sees it as a normal StorageClass.

5.2 Why Longhorn in this build

The tradeoff: it eats CPU and disk I/O on every worker, and on a 2-worker cluster you can only get 2 replicas (not 3, since you need replicas on distinct nodes).

5.3 Capacity math

erDiagram
  WORKER ||--|| DATA_DISK : "mounts at /var/lib/longhorn"
  DATA_DISK ||--o{ LONGHORN_REPLICA : "hosts"
  LONGHORN_VOLUME ||--|{ LONGHORN_REPLICA : "has 2"
  PVC ||--|| LONGHORN_VOLUME : "bound to"
  PROMETHEUS_STATEFULSET ||--|| PVC : "20 GiB"
  ALERTMANAGER_STATEFULSET ||--|| PVC : "5 GiB"
  GRAFANA_DEPLOYMENT ||--|| PVC : "5 GiB"
  LOKI_STATEFULSET ||--|| PVC : "5 GiB"
Storage layout — 128 GiB raw ÷ 2 replicas = ~64 GiB usable. Committed PVCs consume 40 GiB logical / 80 GiB physical.

5.4 Two non-obvious facts to know

5.5 The data disk separation

Why a separate Azure managed data disk per worker, mounted at /var/lib/longhorn, instead of just using the OS disk? Because OS disks die when a VM is destroyed, but managed data disks survive azurerm_linux_virtual_machine destroy/recreate. Cloud-init's mkfs.ext4 is gated by blkid, so a VM rebuild remounts the existing data disk without reformatting — Longhorn's volume metadata and replicas are preserved.
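The gating pattern itself is small enough to sketch (illustrative, not the repo's actual cloud-init; fstab handling omitted):

```yaml
# Hedged sketch of the blkid-gated format -- format only a virgin disk.
runcmd:
  - |
    DISK=/dev/disk/azure/scsi1/lun0
    # blkid exits non-zero when the disk carries no filesystem signature,
    # so a rebuilt VM remounts existing Longhorn data without reformatting:
    blkid "$DISK" || mkfs.ext4 "$DISK"
    mkdir -p /var/lib/longhorn
    mount "$DISK" /var/lib/longhorn
```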


6. Phase 3 — Ingress (ingress-nginx)

6.1 The Service-type-LoadBalancer story

In a “normal” cloud-managed K8s cluster:

  1. You install ingress-nginx with controller.service.type: LoadBalancer.
  2. K8s asks the cloud-controller-manager to provision an LB.
  3. CCM provisions one. Done.

This cluster has no cloud-controller-manager configured for the Azure provider. The Azure LB already exists (it has to, for the kubectl :6443 rule). So the question becomes: how does external HTTPS traffic reach an ingress-nginx pod?

6.2 The hostNetwork DaemonSet pattern

The chosen answer:

The benefit: no second Azure LB. One LB does everything. Cheaper, simpler, less Azure-specific magic.

6.3 The publish-status-address invariant

ingress-nginx writes the LB IP into each Ingress's status.loadBalancer.ingress[0].ip field. external-dns reads that field to decide what A record to publish. cert-manager's HTTP-01 challenge needs that A record to be reachable from the public internet.

Default behaviour with controller.service.type: ClusterIP is to write the internal ClusterIP into that status field. external-dns then publishes 10.43.0.x as the public A record. LE challenges fail. Disaster.

The fix, encoded in kube/helm/ingress-nginx.values.yaml:

controller:
  publishService:
    enabled: false
  extraArgs:
    publish-status-address: "20.48.237.183"

That tells nginx: “ignore your own Service, publish this IP into Ingress status.” This is Phase 6.5 post-impl fix-up #4 in the rollout plan.

6.4 IngressClass and the default

Set ingressClassName: nginx and mark this IngressClass as default in the chart values. Any Ingress in any namespace that doesn't specify a class will use nginx.
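The rendered object would look roughly like this — a sketch of the IngressClass the chart creates when marked default:

```yaml
# Sketch -- normally rendered by the ingress-nginx chart, not hand-written.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
  annotations:
    # This annotation is what makes class-less Ingresses default to nginx:
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: k8s.io/ingress-nginx
```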


7. Phase 4 — GitOps (Fleet)

7.1 What problem GitOps solves

Without GitOps, every cluster change is a kubectl apply somewhere — usually from a CI runner with cluster credentials. Problems:

GitOps inverts this: an in-cluster agent reads from git on a poll, reconciles desired state into the cluster. Cluster credentials never leave the cluster. Drift is visible (the agent corrects it). Rollback is git revert.

7.2 Fleet vs Argo vs Flux

Three viable choices in 2026. Why Fleet here:

If asked “would you pick Fleet again?”: for a multi-cluster Rancher fleet, yes. For a single cluster outside Rancher's universe, Argo CD has a richer UI and bigger community — not necessarily better, but more familiar.

7.3 Fleet's data model

Three CRD kinds you must know: GitRepo (points at a repo, branch, and list of paths), Bundle (generated by the controller, one per path — the unit of deployment), and BundleDeployment (a Bundle materialized onto a specific cluster by the agent).

Bundle names are deterministic: <GitRepo name>-<path-with-slashes-as-dashes>. Example: kube/manifests/security/kyverno under GitRepo canarie-kube becomes canarie-kube-kube-manifests-security-kyverno. That name is what you use in dependsOn: between bundles. Used in two places in the repo:

7.4 The single GitRepo

kube/manifests/fleet/gitrepo.yaml — its spec.paths: is the source of truth for what Fleet manages. Adding a Phase 4+ component requires appending its sub-directory to that list. Not appending = Fleet ignores it.
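A hedged sketch of the GitRepo shape — the repo URL is hypothetical and the real paths list is longer:

```yaml
# Sketch of kube/manifests/fleet/gitrepo.yaml -- illustrative, abbreviated.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: canarie-kube
  namespace: fleet-local            # standard namespace for the local cluster
spec:
  repo: https://gitlab.com/example/canarie-kube.git   # hypothetical URL
  branch: main
  clientSecretName: canarie-kube-auth   # basic-auth Secret from sync-fleet-git-auth.sh
  paths:
    - kube/manifests/observability
    - kube/manifests/security/kyverno
    # kube/manifests/fleet/ is deliberately absent -- Fleet must not manage itself
```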

The list deliberately excludes kube/manifests/fleet/ — otherwise Fleet would try to reconcile its own install bundle, fight RKE2's helm-controller for it, and produce intermittent agent restarts. (RKE2 installed Fleet, RKE2 keeps reconciling Fleet's install — that's the whole reason this directory uses the HelmChart CRD instead of being a Fleet bundle.)

7.5 The two install mechanisms

| | Mechanism A: RKE2 HelmChart | Mechanism B: Fleet bundle |
|---|---|---|
| Used in | Phases 2–3 (Longhorn, ingress-nginx, Fleet itself) | Phases 4+ (everything else) |
| CRD | helm.cattle.io/v1 HelmChart | fleet.cattle.io/v1alpha1 Bundle (via fleet.yaml) |
| Reconciler | RKE2's helm-controller (goroutine in rke2-server) | Fleet controller + agent Pods |
| Apply path | kubectl apply -R -f kube/manifests/ from operator laptop (sync-manifests.sh) | Fleet polls git every 60 s |
| Namespace creation | Explicit kind: Namespace YAML in the same file | defaultNamespace: in fleet.yaml + Helm --create-namespace |
| Values | Inline valuesContent in the HelmChart YAML, byte-identical to kube/helm/&lt;name&gt;.values.yaml (CI enforces) | valuesFiles: [../../../helm/&lt;name&gt;.values.yaml] — single source of truth |
| Why this one? | Has to exist before Fleet does | Standard once Fleet exists |

The “byte-identical values” rule is enforced by the values-sync-check CI job. It's there because Phase 2/3 manifests need both the YAML representation (for the manifest file) and the canonical file (for helm lint). Phase 4+ doesn't have this duplication because Fleet reads the values file directly.
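Mechanism A's shape can be sketched like this (a hedged illustration; chart name and version come from the phase table, and the values content is an assumption, not the repo's file):

```yaml
# Sketch of the Mechanism A shape (Phase 2/3). The valuesContent block must stay
# byte-identical to kube/helm/longhorn.values.yaml -- values-sync-check enforces it.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: longhorn
  namespace: kube-system            # RKE2's helm-controller watches this namespace
spec:
  repo: https://charts.longhorn.io
  chart: longhorn
  version: 1.7.2
  targetNamespace: longhorn-system
  valuesContent: |-
    defaultSettings:
      defaultReplicaCount: 2        # illustrative value, not necessarily the repo's
```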


8. Phase 5 — Observability (kube-prometheus-stack + Loki + Promtail)

8.1 kube-prometheus-stack (KPS)

Single Helm chart that bundles the whole Prometheus Operator world:

Why a ServiceMonitor instead of editing prometheus.yaml? Because the Operator pattern lets each app declare its own scrape config alongside its other manifests, no central edit needed. Combined with searchNamespace: ALL, any ServiceMonitor in any namespace is auto-discovered.
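What an app-owned scrape config looks like — a sketch with hypothetical names:

```yaml
# Illustrative ServiceMonitor -- the app declares its own scraping, no central edit.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app               # hypothetical
  namespace: demo              # any namespace: auto-discovered (searchNamespace: ALL)
spec:
  selector:
    matchLabels: {app: demo-app}   # matches the Service's labels
  endpoints:
    - port: metrics            # named port on the Service
      interval: 30s
```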

8.2 Loki + Promtail

This pair gets every namespace's Pod logs into Grafana with no per-namespace config. Promtail picks up new namespaces automatically because it's just reading the host filesystem.

8.3 The Fleet canary

There's a tiny bundle at kube/manifests/observability/canary/fleet-canary.yaml that just deploys an empty ConfigMap to the default namespace. Its purpose: a smoke test that Fleet's reconciliation loop is working. If the ConfigMap doesn't exist, Fleet itself is broken.


9. Phase 6 — Certs + DNS

9.1 cert-manager — the pieces

Five CRDs to know:

| CRD | Role |
|---|---|
| Issuer / ClusterIssuer | A source of certificates (Let's Encrypt prod, LE staging, self-signed CA, etc.). Issuer is namespaced; ClusterIssuer is cluster-wide and used by all namespaces. |
| Certificate | “I want a cert for grafana.example.com, signed by issuer X, stored in Secret grafana-tls.” |
| CertificateRequest | A pending request to the Issuer. cert-manager creates these from Certificates. |
| Order | LE-specific. The ACME order for one or more domains. |
| Challenge | LE-specific. One DNS or HTTP challenge per domain in an Order. |

You don't usually create Certificates directly — you put cert-manager annotations on an Ingress, and cert-manager generates the Certificate for you from the Ingress's tls: spec.

9.2 The three ClusterIssuers in this cluster

kube/manifests/certs-dns/cluster-issuers/issuers.yaml:

Both LE issuers use HTTP-01 challenge via ingressClassName: nginx. cert-manager creates a temporary Ingress for /.well-known/acme-challenge/<token>, LE fetches it over HTTP, validates control of the domain, signs the cert.
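A sketch of one such ClusterIssuer — the email and account-key Secret name are assumptions; the solver shape is cert-manager's documented HTTP-01 form:

```yaml
# Hedged sketch -- the repo's issuers.yaml may name things differently.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                  # hypothetical contact
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # hypothetical Secret name
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx         # challenge served through ingress-nginx
```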

This is why publish-status-address (from §6.3) is load-bearing: HTTP-01 needs the public A record to point at the public LB IP, not at a ClusterIP.

9.3 DNS-01 vs HTTP-01

DNS-01 puts the challenge as a TXT record on the domain — works for wildcards (*.example.com) and doesn't need the host to be HTTP-reachable. HTTP-01 is simpler but doesn't do wildcards and requires the host to be publicly HTTP-reachable. This cluster picked HTTP-01 because no wildcards are needed and the Ingress path was already wired up.

9.4 external-dns

A controller that watches Ingress (and Service-type-LoadBalancer) objects and writes DNS records to a configured provider — here, Azure DNS.

Two annotations on the Ingress drive it:

external-dns watches every namespace (namespaceFilter: "") and only writes records inside its domainFilter (rke2.ericharrison.ca). Records outside that suffix are ignored.
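The relevant values shape, sketched loosely (key names differ between the external-dns charts, so treat these as illustrative rather than the repo's file):

```yaml
# Hedged sketch of external-dns configuration -- keys are illustrative.
provider: azure
domainFilters:
  - rke2.ericharrison.ca     # only records under this suffix are managed
txtOwnerId: canarie-kube     # hypothetical; the TXT registry marks records as ours
sources:                     # which K8s objects to read hostnames from
  - ingress
  - service
```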

Auth to Azure DNS is via a service principal whose JSON is in a K8s Secret (seeded by sync-external-dns-azure.sh from .env). Workload Identity was the original plan but deferred as a tactical choice — the SP-in-Secret path works identically and retires cleanly once WI is done as a follow-up.

9.5 Why Cloudflare → Azure DNS delegation

The apex zone ericharrison.ca lives at Cloudflare. The subdomain rke2.ericharrison.ca is delegated to Azure DNS via four NS records at Cloudflare (DNS-only / “grey cloud” — NS records can't be Cloudflare-proxied). From there, Azure DNS is authoritative.

Why? Reduce platform integration surface. The whole stack already commits to Azure; using Cloudflare's API for DNS would add a second cloud-provider auth path (Cloudflare API token in .env, separate external-dns provider config). Letting Cloudflare just be the registrar/apex and delegating to Azure is one less moving part.

The NS delegation is a manual step at the registrar. Terraform outputs the four NS names but cannot itself update Cloudflare. Documented in Phase 6 post-impl.

9.6 The end-to-end “expose a service” flow

Helm chart values for any new public service:

ingress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - host: foo.rke2.ericharrison.ca
      paths: [{path: /, pathType: Prefix}]
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    external-dns.alpha.kubernetes.io/hostname: foo.rke2.ericharrison.ca
  tls:
    - secretName: foo-tls
      hosts: [foo.rke2.ericharrison.ca]

Within ~2 minutes after git push:

  1. Fleet polls, applies the chart.
  2. external-dns sees the Ingress, writes A + TXT records in Azure DNS.
  3. cert-manager creates a Certificate, runs the HTTP-01 challenge, gets a cert from LE, stores it in foo-tls.
  4. ingress-nginx serves HTTPS with the new cert.

10. Phase 7 — Security (Trivy + Kyverno; Falco deferred)

10.1 Trivy Operator

Aqua Security's Trivy is a vulnerability scanner. The Operator wraps it as a controller that:

Why an Operator and not a CronJob? So results are queryable as native K8s objects (kubectl get vulnerabilityreports -A) and consumable by other tools.

10.2 Kyverno

A policy engine that's K8s-native — policies are written in YAML, not a DSL. Two modes: Audit (violations are recorded as PolicyReports; admission proceeds) and Enforce (violating requests are rejected at admission).

This cluster ships all four policies in Audit mode at commit 53eb10b. Flipping to Enforce is a one-line PR per policy, deliberately deferred until a week of clean PolicyReports has accumulated. Why deferred? Because some infra workloads (often charts you don't control) genuinely don't set resource limits or pin image tags, and Enforcing on day one would block them.

10.3 The four ClusterPolicies

Live in kube/manifests/security/kyverno-policies/:

| Policy | What it blocks (in Enforce mode) | Scope |
|---|---|---|
| disallow-privileged | Pods with securityContext.privileged: true | Every namespace |
| disallow-latest-tag | Images using :latest or no tag | Every namespace |
| require-resource-limits | Containers missing resources.limits.cpu and .memory | Every namespace |
| disallow-host-path | Pods using hostPath volumes | Every namespace except a hardcoded allow-list of 10 infra namespaces (kube-system, falco, longhorn-system, monitoring, ingress-nginx, cert-manager, external-dns, trivy-system, cattle-fleet-system, kyverno) |

disallow-host-path has the allow-list because Longhorn, Promtail, etc. legitimately need hostPath. Any new namespace that needs hostPath must be added to the policy's allow-list — there's no label-based mechanism.
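For shape, here is a hedged sketch of what one of these policies plausibly looks like (the repo's actual rule may differ in detail):

```yaml
# Illustrative ClusterPolicy in Audit mode -- not the repo's exact YAML.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit    # flipping this to Enforce is the one-line PR
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"  # Kyverno pattern: reject images ending in :latest
```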

10.4 Falco — what was deferred and why

Falco is the runtime threat detection layer. It tails kernel events (syscalls) via either a kernel module or eBPF program and matches them against rules (“a shell was spawned in a container”, “a pod tried to read /etc/shadow”). Streams findings via Falcosidekick to e.g. Loki.

Why deferred: the 0.40.x chart's bundled DKMS driver doesn't compile against the kernel 6.17 series shipped by Ubuntu 24.04, and the pre-built modern-bpf driver also failed to load on test. Documented in plan Phase 7. The Fleet bundle is scaffolded (kube/manifests/security/falco/fleet.yaml exists) but the GitRepo's paths: deliberately omits it, so Fleet doesn't try to reconcile a broken bundle.

10.5 Pod Security Admission (PSA)

K8s's built-in admission gate, replaced PodSecurityPolicy in v1.25. You label a namespace with one of three profiles (privileged, baseline, restricted) and three modes (enforce, audit, warn):

metadata:
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest

In this cluster: only longhorn-system carries any PSA label (enforce: privileged, because Longhorn needs it). No cluster-wide PSA default is configured. New namespaces get K8s's built-in default behaviour (which is effectively no PSA enforcement beyond privileged).

This is something you might be asked to improve. The honest answer: a follow-up would set a cluster-wide AdmissionConfiguration defaulting to baseline enforce, with the infra namespaces opted into privileged per-namespace.
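That follow-up would be an apiserver-side file, roughly like this (a sketch using the standard PSA AdmissionConfiguration schema; the exemption list is illustrative):

```yaml
# Hedged sketch of a cluster-wide PSA default -- passed to the apiserver
# via --admission-control-config-file, not applied as a cluster object.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: baseline           # every namespace gets baseline unless labeled
        enforce-version: latest
      exemptions:
        namespaces: [kube-system, longhorn-system]   # illustrative opt-outs
```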


11. The two flows through the system

11.1 How a public HTTPS request lands on a Pod

  1. DNS resolution. Browser asks for grafana.rke2.ericharrison.ca.
  2. Recursive resolver asks the apex ericharrison.ca zone (Cloudflare). Cloudflare returns NS records pointing at ns[1-4]-01.azure-dns.* (delegated subdomain).
  3. Resolver queries Azure DNS, gets back the A record 20.48.237.183 (written there by external-dns based on the Ingress status).
  4. Browser opens TCP :443 to 20.48.237.183.
  5. Azure Standard LB receives the connection. Its :443 rule has backend pool bepool-workers (worker NICs). LB picks a worker.
  6. The worker VM's NIC delivers the packet. Because ingress-nginx's DaemonSet uses hostNetwork: true, it's bound to :443 on the worker's interface — packet goes straight to nginx.
  7. nginx terminates TLS using the Secret grafana-tls (provisioned by cert-manager from LE).
  8. nginx looks up the Ingress for Host: grafana.rke2.ericharrison.ca, finds the Service backend (monitoring/kps-grafana:80).
  9. nginx makes an in-cluster connection to the Service ClusterIP. Cilium (not kube-proxy) translates ClusterIP → an actual Pod IP via eBPF.
  10. The Pod responds. Path back is symmetric.
sequenceDiagram
  participant U as User
  participant CF as Cloudflare DNS
  participant AZDNS as Azure DNS
  participant LB as Azure LB 20.48.237.183
  participant WK as Worker VM
  participant NGX as ingress-nginx
  participant POD as Backend Pod
  U->>CF: NS query for rke2.ericharrison.ca
  CF-->>U: delegated to azure-dns
  U->>AZDNS: A query for grafana.rke2...
  AZDNS-->>U: 20.48.237.183
  U->>LB: HTTPS :443
  LB->>WK: :443 (backendPool=workers)
  WK->>NGX: DaemonSet on host :80/:443
  NGX->>POD: ClusterIP → Pod IP (Cilium eBPF)
  POD-->>NGX: response
  NGX-->>WK: response
  WK-->>LB: response
  LB-->>U: response
Inbound HTTPS — DNS delegation, Azure LB, worker hostNetwork, nginx TLS termination, Cilium service routing.

11.2 How a git push reaches a Pod

  1. Developer commits to kube/manifests/<phase>/<bundle>/... and pushes to main.
  2. GitLab pipeline runs (in parallel): helm-lint, kubeconform, values-sync-check. Lint only — no kubectl apply. If lint fails, the commit is on main but reviewers see red.
  3. Fleet controller polls GitLab every 60 s using the canarie-kube-auth Secret (basic-auth, seeded from .env's GITLAB_TOKEN).
  4. Fleet sees a new commit, walks the GitRepo's spec.paths:, generates/updates a Bundle per directory.
  5. The Fleet agent (running locally in the cluster) sees the new Bundle, materializes a BundleDeployment, and runs helm install or helm upgrade (with --create-namespace if needed).
  6. Helm renders the chart with valuesFiles:-pointed values, applies the rendered manifests via the K8s API.
  7. K8s controllers reconcile the manifests into running Pods. Loop closes.
flowchart LR
  Dev["Developer commits<br/>kube/manifests/*"]
  GL["GitLab"]
  CI["Pipeline:<br/>helm-lint + kubeconform"]
  Fleet["Fleet controller<br/>cattle-fleet-system"]
  Bundle["BundleDeployment<br/>per sub-path"]
  Helm["helm install / upgrade"]
  K8s["Running pods"]
  Dev -->|git push main| GL
  GL -->|triggers| CI
  GL -->|HTTPS basic-auth| Fleet
  Fleet -->|polls 60s| GL
  Fleet --> Bundle
  Bundle --> Helm
  Helm --> K8s
GitOps flow — two independent paths. CI lints; Fleet deploys. Either can fail without blocking the other.

Two independent paths. CI failure does not block Fleet (Fleet keeps reconciling whatever's in main). Fleet failure does not block CI (you can still ship infra changes). This separation is intentional and worth pointing out — it means one broken layer doesn't blast-radius into the other.


12. Non-obvious invariants

Each one is a “wait, why?” implementation detail worth internalising.

  1. kube/helm/<chart>.values.yaml and the matching helmchart.yaml's valuesContent block must be byte-for-byte identical. CI job values-sync-check enforces. Only applies to Phase 2/3 (longhorn, ingress-nginx). Phase 4+ uses Fleet's valuesFiles: and avoids the duplication.
  2. controller.publishService.enabled: false + extraArgs.publish-status-address: <LB IP> is load-bearing for cert-manager. Without it, every Ingress reports a ClusterIP as its public address; external-dns publishes that; LE challenges fail. (§6.3, §9.2)
  3. Longhorn's create-default-disk label has to come from cloud-init node-label:, not a post-install patch. That's the only mechanism that survives agent restart and runs at first node registration, which is when Longhorn decides to materialize its default disk.
  4. Managed data disks survive VM destroy/recreate; OS disks don't. Cloud-init mkfs.ext4 is gated by blkid, so existing Longhorn data on /dev/disk/azure/scsi1/lun0 is preserved across rebuilds. This is the entire reason Longhorn state is on a separate data disk.
  5. custom_data is force-new on azurerm_linux_virtual_machine. Any cloud-init edit destroys and recreates the VMs. Data disks survive (separate resources); RKE2 server state on the OS disk does not. After Phase 2, a cloud-init edit means a cluster rebuild.
  6. kube/manifests/fleet/ is excluded from the Fleet GitRepo's spec.paths. Otherwise Fleet would reconcile its own install bundle and fight with RKE2's helm-controller.
  7. Azure DNS NS delegation at Cloudflare is a manual step at the registrar. No Terraform automation can delegate NS records at a registrar the operator owns.
  8. KV soft-delete is enabled with purge_protection_enabled: false and a 90-day retention. Why: if a partial Terraform apply creates a secret but state rolls back, the next apply hits “already exists.” With purge protection on, recovery would be a 90-day wait. This way, az keyvault secret delete + purge is the one-liner fix.
  9. Fleet bundle names are <GitRepo>-<path-with-slashes-as-dashes>. Used in dependsOn: references. Get the name wrong and dependsOn: silently never resolves.
  10. searchNamespace: ALL (KPS) and namespaceFilter: "" (external-dns) mean any new namespace is auto-discovered. No need to update the observability or DNS bundles when adding new app namespaces.

13. Quick-reference glossary

| Term | One-liner |
|---|---|
| CNI | Container Network Interface — plugin spec for “give Pods IPs and connectivity.” Cilium implements it here. |
| CRI | Container Runtime Interface — what kubelet talks to. containerd implements it here. |
| CSI | Container Storage Interface — plugin spec for “provision and mount storage.” Longhorn implements it here. |
| CRD | CustomResourceDefinition — extends the K8s API with new object kinds. |
| eBPF | Sandboxed, verified bytecode that runs in the Linux kernel in response to events. Used by Cilium for fast packet routing without iptables. |
| etcd | Distributed key-value store. K8s's source of truth. RKE2 embeds it in rke2-server. |
| GitOps | Cluster state declared in git, pulled by an in-cluster agent. Inverts CI-driven kubectl apply. |
| Helm chart | Templated bundle of K8s manifests with values. helm install renders + applies. |
| HPA | HorizontalPodAutoscaler — scales replicas of a Deployment based on metrics. Not used in this cluster. |
| IaC | Infrastructure-as-Code. OpenTofu here. |
| Ingress | HTTP(S) routing layer in front of Services. Needs an Ingress Controller (ingress-nginx) to do anything. |
| IMDS | Instance Metadata Service. Azure's 169.254.169.254. VMs use it to get OAuth tokens for their MI. |
| Kyverno | Policy engine. Policies are written in YAML, not a DSL. |
| MI | Managed Identity. Azure-native identity for a VM that can be granted RBAC on other Azure resources without a stored secret. |
| NSG | Network Security Group. Azure's stateful firewall, attached per subnet (or per NIC). |
| OpenTofu | Open-source fork of Terraform after the BSL relicense. |
| PSA | Pod Security Admission. K8s's built-in admission gate; replaced PodSecurityPolicy. |
| RBAC | Role-Based Access Control. K8s has it; this cluster mostly uses chart-default RBAC. |
| RKE2 | Rancher's K8s distro. Full upstream K8s, embedded etcd + containerd, CIS-hardened. |
| ServiceMonitor | Prometheus Operator CRD that tells Prometheus to scrape a Service. |
| Trivy | Vulnerability scanner from Aqua Security. The Operator wraps it as a controller producing CR-shaped reports. |