Kubernetes is a container orchestrator — it does not run containers itself. It tells worker nodes what containers to run, where, how many copies, and how they should reach each other. The container runtime (in this build: containerd) is what actually executes them.
The control plane is a small set of long-running processes that maintain the cluster's “desired state”:
| Component | Job |
|---|---|
| etcd | The cluster's database. Every K8s object (Pod, Service, Secret, etc.) is a row in etcd. |
| kube-apiserver | The only thing that talks to etcd. Every other component, including kubectl, talks to the apiserver. |
| kube-scheduler | Decides which node a new Pod runs on. |
| kube-controller-manager | Bundles the dozens of small loops that drive resource state — Deployment controller, ReplicaSet controller, Node controller, etc. Each loop reads “desired” from etcd, observes “actual” via the apiserver, and acts to close the gap. |
| cloud-controller-manager | Cloud-provider-specific glue (in real AKS this is what makes Service type=LoadBalancer provision an Azure LB). In this cluster: not used — see §6. |
Each worker node runs:
| Component | Job |
|---|---|
| kubelet | The node-level agent. Talks to the apiserver, gets told “run these pods”, and tells containerd to start/stop them. |
| containerd | The container runtime. |
| kube-proxy (normally) | Programs iptables/IPVS so Service ClusterIPs route to the right Pod IPs. Replaced by Cilium in this cluster — see §5. |
| CNI plugin | Gives Pods their IPs and connects them to the cluster network. Cilium here. |
The pattern to internalise: everything in K8s is reconciliation. You declare desired state (a YAML manifest), the apiserver writes it to etcd, and a controller loop notices the gap between desired and actual, then takes action. There is no “deploy” verb — there's only “write the desired thing to the apiserver and wait for the world to converge.”
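To make the declare-and-converge loop concrete, here is a minimal Deployment manifest (all names and the image tag are illustrative, not from this repo). Writing this to the apiserver is the entire “deploy”; the Deployment controller then closes the gap:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # illustrative
spec:
  replicas: 3                # desired state: three Pods
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: nginx:1.27  # illustrative, pinned tag (not :latest)
```

kubectl apply -f writes it to etcd via the apiserver; delete one of the Pods and the controller recreates it. There is no imperative step to re-run — reconciliation does the work.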
Pod — The smallest deployable unit. One or more co-located containers sharing a network namespace and (optionally) volumes. Almost never created directly; created by controllers.
Deployment — Manages a ReplicaSet which manages Pods. Use for stateless workloads (web servers, API processes). Supports rolling updates, rollback.
StatefulSet — Like a Deployment but each Pod has a stable identity (pod-0, pod-1) and stable storage (each Pod gets its own PVC). Use for databases, Prometheus, anything that needs to know “which replica am I”. This cluster uses it for Prometheus, Alertmanager, Loki.
DaemonSet — Runs exactly one Pod on every (matching) node. Use for node-level agents. This cluster uses it for ingress-nginx, Promtail, Cilium.
Service — A stable virtual IP and DNS name in front of a set of Pods. Three types relevant here:
- ClusterIP — internal-only virtual IP, the default. Used for in-cluster service discovery.
- NodePort — exposes the service on a static port on every node's IP. Rarely used in production.
- LoadBalancer — asks the cloud-controller-manager to provision an external LB. Not used here — see §6 for why.

Ingress — HTTP(S) routing layer in front of Services. Decides “host grafana.example.com + path / → Service grafana port 80”. Needs an Ingress Controller (an actual running pod) to do anything; the Ingress object is just config. This cluster uses ingress-nginx as the controller.
Namespace — A scope/folder for K8s objects. Every object except a few cluster-scoped ones (Node, ClusterRole, PersistentVolume, CRDs themselves) lives in exactly one namespace. RBAC, NetworkPolicy, and ResourceQuota are all per-namespace.
ConfigMap / Secret — Key-value blobs mounted into Pods as files or env vars. Secret is base64-encoded (not encrypted at rest by default — that's a separate apiserver setting).
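Worth seeing once: base64 is an encoding, not encryption, so anyone with read access to a Secret can reverse it (hunter2 is a stand-in value, not a secret from this cluster):

```shell
# base64-encode a value the way it would appear inside a Secret object...
encoded=$(printf '%s' 'hunter2' | base64)
echo "$encoded"                      # aHVudGVyMg==

# ...and reverse it with no key whatsoever.
printf '%s' "$encoded" | base64 -d   # hunter2
```

This is why encryption-at-rest for Secrets is a separate apiserver setting, and why RBAC on Secret reads matters.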
PersistentVolumeClaim (PVC) — A Pod's request for storage (“I need 10 GiB”). Bound to a PersistentVolume (PV) which is the actual storage. PVs are usually provisioned dynamically by a CSI driver (Container Storage Interface) — in this cluster, by Longhorn, see §6.
CustomResourceDefinition (CRD) — Lets an operator extend the K8s API with new object kinds. Every component on top of plain K8s (Longhorn, Fleet, cert-manager, Prometheus Operator, Kyverno) ships its own CRDs. When you write a Certificate YAML you're talking to cert-manager's CRD, not core K8s.
Operator — Ecosystem shorthand for a controller that watches a CRD and reconciles its desired state. “The Trivy Operator” = a pod that watches VulnerabilityReport CRs and acts on them.
helm install renders the templates with your values and kubectl-applies the result. Helm tracks “releases” (named installs) in K8s Secrets named sh.helm.release.v1.<release>.<rev>.

GitOps — an in-cluster agent pulls desired state from git instead of CI pushing kubectl apply. Key benefit: the cluster self-heals to git, drift is visible, rollback is git revert.

Pod Security Admission — label a namespace with pod-security.kubernetes.io/enforce: <baseline|restricted|privileged> and the apiserver enforces the corresponding security profile on Pods in that namespace. The replacement for the deprecated PodSecurityPolicy.

Single sentence: a production-pattern RKE2 cluster on Azure, OSS-only, built up across seven PRs so each layer of the stack is a discrete, reviewable change.
Why no AKS? AKS hides the parts of running K8s that give the project its shape — CNI install, CSI install, ingress install, etcd, certificates, GitOps wiring. The project is about assembling each of those layers end-to-end; AKS would short-circuit the exercise.
Why does each phase get its own PR? Each PR is a reviewable, individually-justified extension of the previous layer. The list below is also the order they had to ship in, because each later phase depends on something in an earlier one.
| # | Phase | What it adds | Installed by |
|---|---|---|---|
| 1 | Cluster | RKE2 v1.31 + Cilium (kube-proxy replacement) | cloud-init + RKE2 HelmChartConfig |
| 2 | Storage | Longhorn 1.7.2 (default StorageClass) | RKE2 HelmChart CRD |
| 3 | Ingress | ingress-nginx (DaemonSet, hostNetwork) | RKE2 HelmChart CRD |
| 4 | GitOps | Fleet + the GitRepo pointing at this repo | RKE2 HelmChart CRD |
| 5 | Observability | kube-prometheus-stack + Loki + Promtail | Fleet bundle |
| 6 | Certs + DNS | cert-manager (LE HTTP-01) + external-dns (Azure DNS) + 3 ClusterIssuers | Fleet bundles |
| 7 | Security | Trivy Operator + Kyverno + 4 audit-mode ClusterPolicies (Falco deferred) | Fleet bundles |
The Falco deferral is real and worth knowing — its eBPF driver doesn't compile against kernel 6.17 in the chart's bundled DKMS path, and the pre-built modern-bpf driver also failed to load. Documented in the rollout plan's Phase 7 post-impl notes.
Phases 2–3 are installed via the RKE2 HelmChart CRD; Phase 4+ is owned by Fleet.

3 Azure VMs — cp-01 (control plane, 10.20.0.10) and wk-01/wk-02 (workers, 10.20.0.132/.133), all Standard_D2ads_v5 Ubuntu 24.04, in one VNet (10.20.0.0/24) split into three subnets (control, lb, workers). One Azure Standard Load Balancer fronts both :6443 (kubectl → cp-01) and :80/:443 (HTTP(S) → workers). One Key Vault holds the RKE2 join token, the Longhorn UI password, and the Grafana admin password. One Azure DNS zone (rke2.ericharrison.ca) is delegated from Cloudflare via four NS records.
This layer is the foundation — none of K8s exists until cloud-init finishes on the VMs.
- infra/ is the entire IaC tree. Single module, no nested modules. azurerm provider pinned (~> 4.0).
- Four environments: development → integration → staging → production. Dev and int auto-apply on main; staging and prod are manual.
- CI authenticates to Azure via AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID.
- Pipeline stages: lint → plan → apply-{dev,int,stg,prod}. MR pipelines run lint + plan only.
- One resource group: rg-kube-dev-001.
- One VNet: 10.20.0.0/24 with three subnets (control, lb, workers), each with its own NSG.
- One LB with two backend pools (control plane on cp-01, workers on wk-01+wk-02) and three rules (:6443 → control, :80 → workers, :443 → workers). Public IP 20.48.237.183.
- One Key Vault (kv-kube-dev-001), RBAC-auth, holding generated secrets.
- One Azure DNS zone (rke2.ericharrison.ca).

This is the piece that ties Azure IAM to K8s cluster bootstrap without a pre-placed secret:
1. Terraform generates the join token with random_password and writes it to Key Vault.
2. Each VM's Managed Identity is granted Key Vault Secrets User on the vault.
3. cloud-init curls IMDS at 169.254.169.254 for an OAuth token, then curls the Key Vault REST API to fetch the join token. No az CLI needed; no secret travels outside Azure.
4. cp-01 starts rke2-server and persists the token internally. wk-01 and wk-02 read the same token and join via cp-01:9345.
5. Workers format their data disk (/dev/disk/azure/scsi1/lun0, 64 GiB) and mount it at /var/lib/longhorn. The mkfs.ext4 is gated by blkid so VM rebuilds preserve Longhorn data.

The custom_data field on azurerm_linux_virtual_machine is force-new — any change to kube/cloud-init/{cp,wk}.yaml destroys and recreates the VMs. Data disks survive (separate resources). RKE2 server state on cp-01's OS disk does not, so a cloud-init edit after Phase 2 effectively means a cluster rebuild.
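A sketch of the IMDS-to-Key-Vault fetch as a cloud-init fragment. The IMDS and Key Vault REST endpoints are the standard Azure ones; the vault and secret names come from this document, but the output path and the use of jq are assumptions — the real script in kube/cloud-init/ may differ:

```yaml
runcmd:
  # 1. OAuth token for the VM's Managed Identity, straight from IMDS — no credential on disk.
  - 'TOKEN=$(curl -sH "Metadata:true" "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net" | jq -r .access_token)'
  # 2. Join token from the Key Vault REST API (vault kv-kube-dev-001, secret rke2-token per the notes).
  - 'curl -sH "Authorization: Bearer $TOKEN" "https://kv-kube-dev-001.vault.azure.net/secrets/rke2-token?api-version=7.4" | jq -r .value > /tmp/rke2-token'
```

Both runcmd entries run in the same generated script, so the TOKEN variable carries over between them.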
Some Secrets can't be generated by Terraform because they need .env values that live only on the operator's laptop. These four scripts run once after the cluster comes up:
| Script | Seeds |
|---|---|
sync-fleet-git-auth.sh | Fleet's git-clone basic-auth Secret from .env's GITLAB_TOKEN |
sync-external-dns-azure.sh | external-dns's Azure SP credentials |
sync-grafana-admin.sh | Grafana admin password (consumed by KPS via existingSecret) |
sync-longhorn-basic-auth.sh | basic-auth Secret for the Longhorn UI Ingress |
Plus fetch-kubeconfig.sh, which ssh's into cp-01 (passwordless sudo on Azure Ubuntu image), cats /etc/rancher/rke2/rke2.yaml, and rewrites the server URL to point at the LB public IP rather than 127.0.0.1. That file is ~/.kube/kube-dev.yaml from there on.
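The server-URL rewrite itself is a one-liner; here is a sketch of the transform (the real script's exact sed invocation may differ):

```shell
# Rewrite the kubeconfig server URL from the node-local address to the LB public IP.
LB_IP="20.48.237.183"
printf 'server: https://127.0.0.1:6443\n' |
  sed "s|https://127.0.0.1:6443|https://${LB_IP}:6443|"
# → server: https://20.48.237.183:6443
```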
You will get asked “why RKE2 and not k3s or kubeadm or AKS?”. The answer:
The single most important RKE2 fact for this project: RKE2's helm-controller runs as a goroutine inside rke2-server, not as a Pod. Manifests dropped into /var/lib/rancher/rke2/server/manifests/ are reconciled on every server start, before the CNI is up. Two CRDs come from this:
HelmChart — installs a chart from a repo. Used for Longhorn, ingress-nginx, Fleet itself.

HelmChartConfig — overrides values for a chart that ships with RKE2 (Cilium, traefik-if-not-disabled, etc.).

That mechanism is how Cilium gets configured — see next.
Cilium is the CNI plugin — it gives Pods their IPs and connects them. This cluster sets cni: cilium in RKE2's server config, then drops a HelmChartConfig at /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml that sets kubeProxyReplacement: True.
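A minimal sketch of what that HelmChartConfig could look like. The apiVersion, kind, and valuesContent field follow the helm.cattle.io/v1 schema named elsewhere in these notes; the metadata (rke2-cilium in kube-system, matching RKE2's bundled chart) is an assumption, and the real file may carry more values:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium      # must match the chart name RKE2 ships (assumed)
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: true
```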
What that means in practice:
No kube-proxy DaemonSet runs. Cilium handles Service ClusterIP → Pod IP routing itself, in eBPF programs attached to the kernel's TC hook.

Why Cilium specifically, not Calico or Flannel?
eBPF — a safe, sandboxed in-kernel VM that runs verified bytecode in response to kernel events (syscalls, network packets, tracepoints). Originally for packet filtering (“Berkeley Packet Filter”), now general-purpose. Lets userspace programs do work in kernel context without writing a kernel module.
rke2-token in KV → Terraform-generated random_password → VMs read via Managed Identity + IMDS → no chicken-and-egg, no secret in any pipeline log. This is how the cluster bootstraps a multi-node join without an already-running cluster to distribute the token.
A distributed block-storage system written in Go that runs as Pods inside your cluster. Each Pod-managed “engine” exposes a virtual block device that's replicated across nodes' local disks. Implements the CSI (Container Storage Interface) so K8s sees it as a normal StorageClass.
The tradeoff: it eats CPU and disk I/O on every worker, and on a 2-worker cluster you can only get 2 replicas (not 3, since you need replicas on distinct nodes).
- defaultReplicaCount: 2 → every PVC consumes 2× its requested size across the cluster.
- longhorn-system is the only namespace with a Pod Security Admission label. It's enforce: privileged because Longhorn's instance-manager DaemonSet does block-device manipulation.
- The node.longhorn.io/create-default-disk label has to be set via cloud-init's RKE2 node-label: config, not as a post-install kubectl label. RKE2's node-label config survives agent restart and applies at first node registration, which is when Longhorn's controller decides whether to materialize the default disk. Documented in Phase 2 post-impl.

Why a separate Azure managed data disk per worker, mounted at /var/lib/longhorn, instead of just using the OS disk? Because OS disks die when a VM is destroyed, but managed data disks survive azurerm_linux_virtual_machine destroy/recreate. Cloud-init's mkfs.ext4 is gated by blkid, so a VM rebuild remounts the existing data disk without reformatting — Longhorn's volume metadata and replicas are preserved.
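What a consumer of this storage looks like, as a sketch (the PVC name is illustrative, and the StorageClass name longhorn is the chart default, assumed here rather than taken from this repo):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-example           # illustrative
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn   # default class, so this line could be omitted
  resources:
    requests:
      storage: 10Gi            # with defaultReplicaCount: 2, ~20Gi of raw disk is consumed
```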
In a “normal” cloud-managed K8s cluster:
- you'd install ingress-nginx with controller.service.type: LoadBalancer, and the cloud-controller-manager would provision an external LB for it.

This cluster has no cloud-controller-manager configured for the Azure provider. The Azure LB already exists (it has to, for the kubectl :6443 rule). So the question becomes: how does external HTTPS traffic reach an ingress-nginx pod?
The chosen answer:
- ingress-nginx runs as a DaemonSet with hostNetwork: true, so each ingress-nginx Pod binds directly to its node's network interface on :80 and :443.
- the existing Azure LB's backend pool contains the worker NICs; its rules forward :80 and :443 to those backends.

The benefit: no second Azure LB. One LB does everything. Cheaper, simpler, less Azure-specific magic.
The publish-status-address invariant: ingress-nginx writes the LB IP into each Ingress's status.loadBalancer.ingress[0].ip field. external-dns reads that field to decide what A record to publish. cert-manager's HTTP-01 challenge needs that A record to be reachable from the public internet.
Default behaviour with controller.service.type: ClusterIP is to write the internal ClusterIP into that status field. external-dns then publishes 10.43.0.x as the public A record. LE challenges fail. Disaster.
The fix, encoded in kube/helm/ingress-nginx.values.yaml:
controller:
publishService:
enabled: false
extraArgs:
publish-status-address: "20.48.237.183"
That tells nginx: “ignore your own Service, publish this IP into Ingress status.” This is Phase 6.5 post-impl fix-up #4 in the rollout plan.
Set ingressClassName: nginx and mark this IngressClass as default in the chart values. Any Ingress in any namespace that doesn't specify a class will use nginx.
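What the chart values produce, roughly: the is-default-class annotation is the standard Kubernetes mechanism for marking a default IngressClass. This is a sketch of the resulting object, not the chart's literal rendered output:

```yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
  annotations:
    ingressclass.kubernetes.io/is-default-class: "true"  # Ingresses with no class get this one
spec:
  controller: k8s.io/ingress-nginx
```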
Without GitOps, every cluster change is a kubectl apply somewhere — usually from a CI runner with cluster credentials. Problems:
- cluster credentials live in CI, outside the cluster.
- there's no record of drift — someone runs kubectl edit and the diff disappears.

GitOps inverts this: an in-cluster agent reads from git on a poll and reconciles desired state into the cluster. Cluster credentials never leave the cluster. Drift is visible (the agent corrects it). Rollback is git revert.
Three viable choices in 2026. Why Fleet here:
- Minimal setup: one CRD (GitRepo), bundles auto-derived from sub-paths. Less configuration than Argo's Application per app.
- targets: lets one git repo deploy differently to many clusters. Not used here (single cluster) but worth mentioning.

If asked “would you pick Fleet again?”: for a multi-cluster Rancher fleet, yes. For a single cluster outside Rancher's universe, Argo CD has a richer UI and bigger community — not necessarily better, but more familiar.
Three CRD kinds you must know:
GitRepo — points at a git URL + branch + paths list. Fleet polls the repo every pollingInterval (60 s here).

Bundle — Fleet auto-generates one Bundle per directory under paths that contains a fleet.yaml.

BundleDeployment — what's actually applied to a target cluster. Per-cluster materialization of a Bundle.

Bundle names are deterministic: <GitRepo name>-<path-with-slashes-as-dashes>. Example: kube/manifests/security/kyverno under GitRepo canarie-kube becomes canarie-kube-kube-manifests-security-kyverno. That name is what you use in dependsOn: between bundles. Used in two places in the repo:
- kube/manifests/certs-dns/cluster-issuers/fleet.yaml depends on cert-manager being installed first.
- kube/manifests/security/kyverno-policies/fleet.yaml depends on Kyverno being installed first.

kube/manifests/fleet/gitrepo.yaml — its spec.paths: is the source of truth for what Fleet manages. Adding a Phase 4+ component requires appending its sub-directory to that list. Not appending = Fleet ignores it.
The list deliberately excludes kube/manifests/fleet/ — otherwise Fleet would try to reconcile its own install bundle, fight RKE2's helm-controller for it, and produce intermittent agent restarts. (RKE2 installed Fleet, RKE2 keeps reconciling Fleet's install — that's the whole reason this directory uses the HelmChart CRD instead of being a Fleet bundle.)
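A sketch of how the dependsOn wiring looks in a fleet.yaml, using the deterministic bundle-name rule. Only the naming scheme and the GitRepo name canarie-kube come from these notes; the cert-manager bundle path, and therefore the derived name, is an assumption:

```yaml
# kube/manifests/certs-dns/cluster-issuers/fleet.yaml (sketch)
defaultNamespace: cert-manager
dependsOn:
  # <GitRepo>-<path-with-slashes-as-dashes>; the cert-manager sub-path is assumed here
  - name: canarie-kube-kube-manifests-certs-dns-cert-manager
```

Get that derived name wrong and the dependency silently never resolves, which is exactly the failure mode the naming rule exists to avoid.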
| | Mechanism A: RKE2 HelmChart | Mechanism B: Fleet bundle |
|---|---|---|
| Used in | Phases 2–3 (Longhorn, ingress-nginx, Fleet itself) | Phases 4+ (everything else) |
| CRD | helm.cattle.io/v1 HelmChart | fleet.cattle.io/v1alpha1 Bundle (via fleet.yaml) |
| Reconciler | RKE2's helm-controller (goroutine in rke2-server) | Fleet controller + agent Pods |
| Apply path | kubectl apply -R -f kube/manifests/ from operator laptop (sync-manifests.sh) | Fleet polls git every 60 s |
| Namespace creation | Explicit kind: Namespace YAML in the same file | defaultNamespace: in fleet.yaml + Helm --create-namespace |
| Values | Inline valuesContent in the HelmChart YAML, byte-identical to kube/helm/<name>.values.yaml (CI enforces) | valuesFiles: [../../../helm/<name>.values.yaml] — single source of truth |
| Why this one? | Has to exist before Fleet does | Standard once Fleet exists |
The “byte-identical values” rule is enforced by the values-sync-check CI job. It's there because Phase 2/3 manifests need both the YAML representation (for the manifest file) and the canonical file (for helm lint). Phase 4+ doesn't have this duplication because Fleet reads the values file directly.
Single Helm chart that bundles the whole Prometheus Operator world:
The Prometheus Operator watches ServiceMonitor/PodMonitor CRs and adds them to Prometheus's scrape config.

Why a ServiceMonitor instead of editing prometheus.yaml? Because the Operator pattern lets each app declare its own scrape config alongside its other manifests, no central edit needed. Combined with searchNamespace: ALL, any ServiceMonitor in any namespace is auto-discovered.
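A minimal ServiceMonitor sketch (all names are hypothetical). An app's chart would ship this next to its Service, and Prometheus discovers it with no central edit:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app          # hypothetical
  namespace: my-app     # any namespace works with searchNamespace: ALL
spec:
  selector:
    matchLabels:
      app: my-app       # must match the target Service's labels
  endpoints:
    - port: metrics     # the Service port *name* that exposes /metrics
      interval: 30s
```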
Promtail tails /var/log/pods/* on every worker, attaches K8s metadata as labels, and ships to Loki.

This pair gets every namespace's Pod logs into Grafana with no per-namespace config. Promtail picks up new namespaces automatically because it's just reading the host filesystem.
There's a tiny bundle at kube/manifests/observability/canary/fleet-canary.yaml that just deploys an empty ConfigMap to the default namespace. Its purpose: a smoke test that Fleet's reconciliation loop is working. If the ConfigMap doesn't exist, Fleet itself is broken.
Five CRDs to know:
| CRD | Role |
|---|---|
| Issuer / ClusterIssuer | A source of certificates (Let's Encrypt prod, LE staging, self-signed CA, etc.). Issuer is namespaced; ClusterIssuer is cluster-wide and used by all namespaces. |
| Certificate | “I want a cert for grafana.example.com, signed by issuer X, stored in Secret grafana-tls.” |
| CertificateRequest | A pending request to the Issuer. Cert-manager creates these from Certificates. |
| Order | LE-specific. The ACME order for one or more domains. |
| Challenge | LE-specific. One DNS or HTTP challenge per domain in an Order. |
You don't usually create Certificates directly — you put cert-manager annotations on an Ingress, and cert-manager generates the Certificate for you from the Ingress's tls: spec.
kube/manifests/certs-dns/cluster-issuers/issuers.yaml:
- selfsigned — for “I just need something.”
- letsencrypt-staging — LE staging API; rate-limit-friendly; certs not browser-trusted; use during testing.
- letsencrypt-prod — LE production API; real browser-trusted certs; strict rate limits (5 duplicate certs per week, etc.).

Both LE issuers use HTTP-01 challenge via ingressClassName: nginx. cert-manager creates a temporary Ingress for /.well-known/acme-challenge/<token>, LE fetches it over HTTP, validates control of the domain, signs the cert.
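A sketch of the letsencrypt-prod ClusterIssuer using standard cert-manager fields. The LE directory URL is the real production endpoint; the email and account-key secret name are placeholders, not values from this repo:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                 # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key     # placeholder
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx          # the temporary challenge Ingress uses this class
```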
This is why publish-status-address (from §6.3) is load-bearing: HTTP-01 needs the public A record to point at the public LB IP, not at a ClusterIP.
DNS-01 puts the challenge as a TXT record on the domain — works for wildcards (*.example.com) and doesn't need the host to be HTTP-reachable. HTTP-01 is simpler but doesn't do wildcards and requires the host to be publicly HTTP-reachable. This cluster picked HTTP-01 because no wildcards are needed and the Ingress path was already wired up.
A controller that watches Ingress (and Service-type-LoadBalancer) objects and writes DNS records to a configured provider — here, Azure DNS.
Two annotations on the Ingress drive it:
- external-dns.alpha.kubernetes.io/hostname: foo.rke2.ericharrison.ca
- cert-manager.io/cluster-issuer: letsencrypt-prod

external-dns watches every namespace (namespaceFilter: "") and only writes records inside its domainFilter (rke2.ericharrison.ca). Records outside that suffix are ignored.
Auth to Azure DNS is via a service principal whose JSON is in a K8s Secret (seeded by sync-external-dns-azure.sh from .env). Workload Identity was the original plan but deferred as a tactical choice — the SP-in-Secret path works identically and retires cleanly once WI is done as a follow-up.
The apex zone ericharrison.ca lives at Cloudflare. The subdomain rke2.ericharrison.ca is delegated to Azure DNS via four NS records at Cloudflare (DNS-only / “grey cloud” — NS records can't be Cloudflare-proxied). From there, Azure DNS is authoritative.
Why? Reduce platform integration surface. The whole stack already commits to Azure; using Cloudflare's API for DNS would add a second cloud-provider auth path (Cloudflare API token in .env, separate external-dns provider config). Letting Cloudflare just be the registrar/apex and delegating to Azure is one less moving part.
The NS delegation is a manual step at the registrar. Terraform outputs the four NS names but cannot itself update Cloudflare. Documented in Phase 6 post-impl.
Helm chart values for any new public service:
ingress:
enabled: true
ingressClassName: nginx
hosts:
- host: foo.rke2.ericharrison.ca
paths: [{path: /, pathType: Prefix}]
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
external-dns.alpha.kubernetes.io/hostname: foo.rke2.ericharrison.ca
tls:
- secretName: foo-tls
hosts: [foo.rke2.ericharrison.ca]
Within ~2 minutes after git push:
- Fleet applies the Ingress; external-dns publishes the A record.
- cert-manager generates a Certificate, runs the HTTP-01 challenge, gets a cert from LE, stores it in foo-tls.

Aqua Security's Trivy is a vulnerability scanner. The Operator wraps it as a controller that:
- scans workload images and writes a VulnerabilityReport CR in the workload's namespace.
- also produces ConfigAuditReport CRs (workload misconfigurations: missing resource limits, privileged containers, etc.) and ClusterComplianceReport CRs (CIS K8s benchmark, NSA hardening guide).

Why an Operator and not a CronJob? So results are queryable as native K8s objects (kubectl get vulnerabilityreports -A) and consumable by other tools.
A policy engine that's K8s-native — policies are written in YAML, not a DSL. Two modes:
- validationFailureAction: Audit — violations create PolicyReport CRs but don't block kubectl apply.
- validationFailureAction: Enforce — violations are rejected at admission time.

This cluster ships all four policies in Audit mode at commit 53eb10b. Flipping to Enforce is a one-line PR per policy, deliberately deferred until a week of clean PolicyReports has accumulated. Why deferred? Because some infra workloads (often charts you don't control) genuinely don't set resource limits or pin image tags, and Enforcing on day one would block them.
Live in kube/manifests/security/kyverno-policies/:
| Policy | What it blocks (in Enforce mode) | Scope |
|---|---|---|
| disallow-privileged | Pods with securityContext.privileged: true | Every namespace |
| disallow-latest-tag | Images using :latest or no tag | Every namespace |
| require-resource-limits | Containers missing resources.limits.cpu and .memory | Every namespace |
| disallow-host-path | Pods using hostPath volumes | Every namespace except a hardcoded allow-list of 10 infra namespaces (kube-system, falco, longhorn-system, monitoring, ingress-nginx, cert-manager, external-dns, trivy-system, cattle-fleet-system, kyverno) |
disallow-host-path has the allow-list because Longhorn, Promtail, etc. legitimately need hostPath. Any new namespace that needs hostPath must be added to the policy's allow-list — there's no label-based mechanism.
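A sketch of what an audit-mode policy like disallow-latest-tag could look like. The rule name, message, and exact pattern are illustrative, not the repo's YAML; only the policy name and validationFailureAction come from these notes:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit    # flipping this to Enforce is the one-line PR
  rules:
    - name: require-pinned-tag      # illustrative
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"  # negated wildcard: reject any :latest image
```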
Falco is the runtime threat detection layer. It tails kernel events (syscalls) via either a kernel module or eBPF program and matches them against rules (“a shell was spawned in a container”, “a pod tried to read /etc/shadow”). Streams findings via Falcosidekick to e.g. Loki.
Why deferred: the 0.40.x chart's bundled DKMS driver doesn't compile against the kernel 6.17 series shipped by Ubuntu 24.04, and the pre-built modern-bpf driver also failed to load on test. Documented in plan Phase 7. The Fleet bundle is scaffolded (kube/manifests/security/falco/fleet.yaml exists) but the GitRepo's paths: deliberately omits it, so Fleet doesn't try to reconcile a broken bundle.
K8s's built-in admission gate, replaced PodSecurityPolicy in v1.25. You label a namespace with one of three profiles (privileged, baseline, restricted) and three modes (enforce, audit, warn):
metadata:
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: latest
In this cluster: only longhorn-system carries any PSA label (enforce: privileged, because Longhorn needs it). No cluster-wide PSA default is configured. New namespaces get K8s's built-in default behaviour (which is effectively no PSA enforcement beyond privileged).
This is something you might be asked to improve. The honest answer: a follow-up would set a cluster-wide AdmissionConfiguration defaulting to baseline enforce, with the infra namespaces opted into privileged per-namespace.
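That follow-up would be an apiserver-side file, roughly like this, a sketch of the standard PodSecurity AdmissionConfiguration format (the exemption list shown is illustrative; the real one would mirror the cluster's privileged infra namespaces):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: baseline            # cluster-wide default for unlabeled namespaces
        enforce-version: latest
      exemptions:
        namespaces: [kube-system, longhorn-system]   # illustrative
```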
What happens when you open grafana.rke2.ericharrison.ca:

1. The browser resolves grafana.rke2.ericharrison.ca. The resolver walks to the ericharrison.ca zone (Cloudflare). Cloudflare returns NS records pointing at ns[1-4]-01.azure-dns.* (delegated subdomain).
2. Azure DNS answers with the A record 20.48.237.183 (written there by external-dns based on the Ingress status).
3. The browser connects on :443 to 20.48.237.183.
4. The Azure LB's :443 rule has backend pool bepool-workers (worker NICs). LB picks a worker.
5. Because ingress-nginx runs with hostNetwork: true, it's bound to :443 on the worker's interface — packet goes straight to nginx.
6. nginx terminates TLS with the cert in grafana-tls (provisioned by cert-manager from LE).
7. nginx matches Host: grafana.rke2.ericharrison.ca, finds the Service backend (monitoring/kps-grafana:80).

How a git push reaches a Pod:

1. You edit kube/manifests/<phase>/<bundle>/... and push to main.
2. CI runs helm-lint, kubeconform, values-sync-check. Lint only — no kubectl apply. If lint fails, the commit is on main but reviewers see red.
3. Fleet clones the repo using the canarie-kube-auth Secret (basic-auth, seeded from .env's GITLAB_TOKEN).
4. Fleet walks spec.paths:, generates/updates a Bundle per directory.
5. The Fleet agent materializes each Bundle as a BundleDeployment, and runs helm install or helm upgrade (with --create-namespace if needed).
6. Helm renders the chart with the valuesFiles:-pointed values, applies the rendered manifests via the K8s API.

Two independent paths. CI failure does not block Fleet (Fleet keeps reconciling whatever's in main). Fleet failure does not block CI (you can still ship infra changes). This separation is intentional and worth pointing out — it means one broken layer doesn't blast-radius into the other.
Each one is a “wait, why?” implementation detail worth internalising.
- kube/helm/<chart>.values.yaml and the matching helmchart.yaml's valuesContent block must be byte-for-byte identical. CI job values-sync-check enforces this. Only applies to Phase 2/3 (longhorn, ingress-nginx). Phase 4+ uses Fleet's valuesFiles: and avoids the duplication.
- controller.publishService.enabled: false + extraArgs.publish-status-address: <LB IP> is load-bearing for cert-manager. Without it, every Ingress reports a ClusterIP as its public address; external-dns publishes that; LE challenges fail. (§6.3, §9.2)
- Longhorn's create-default-disk label has to come from cloud-init node-label:, not a post-install patch. That's the only mechanism that survives agent restart and runs at first node registration, which is when Longhorn decides to materialize its default disk.
- cloud-init's mkfs.ext4 is gated by blkid, so existing Longhorn data on /dev/disk/azure/scsi1/lun0 is preserved across rebuilds. This is the entire reason Longhorn state is on a separate data disk.
- custom_data is force-new on azurerm_linux_virtual_machine. Any cloud-init edit destroys and recreates the VMs. Data disks survive (separate resources); RKE2 server state on the OS disk does not. After Phase 2, a cloud-init edit means a cluster rebuild.
- kube/manifests/fleet/ is excluded from the Fleet GitRepo's spec.paths. Otherwise Fleet would reconcile its own install bundle and fight with RKE2's helm-controller.
- The Key Vault has purge_protection_enabled: false and a 90-day retention. Why: if a partial Terraform apply creates a secret but state rolls back, the next apply hits “already exists.” With purge protection on, recovery would be a 90-day wait. This way, az keyvault secret delete + purge is the one-liner fix.
- Bundle names are deterministic: <GitRepo>-<path-with-slashes-as-dashes>. Used in dependsOn: references. Get the name wrong and dependsOn: silently never resolves.
- searchNamespace: ALL (KPS) and namespaceFilter: "" (external-dns) mean any new namespace is auto-discovered. No need to update the observability or DNS bundles when adding new app namespaces.

| Term | One-liner |
|---|---|
| CNI | Container Network Interface — plugin spec for “give Pods IPs and connectivity.” Cilium implements it here. |
| CRI | Container Runtime Interface — what kubelet talks to. containerd implements it here. |
| CSI | Container Storage Interface — plugin spec for “provision and mount storage.” Longhorn implements it here. |
| CRD | CustomResourceDefinition — extends K8s API with new object kinds. |
| eBPF | Sandboxed, verified bytecode that runs in the Linux kernel in response to events. Used by Cilium for fast packet routing without iptables. |
| etcd | Distributed key-value store. K8s's source of truth. RKE2 embeds it in rke2-server. |
| GitOps | Cluster state declared in git, pulled by an in-cluster agent. Inverts CI-driven kubectl apply. |
| Helm chart | Templated bundle of K8s manifests with values. helm install renders + applies. |
| HPA | HorizontalPodAutoscaler — scales replicas of a Deployment based on metrics. Not used in this cluster. |
| IaC | Infrastructure-as-Code. OpenTofu here. |
| Ingress | HTTP(S) routing layer in front of Services. Needs an Ingress Controller (ingress-nginx) to do anything. |
| IMDS | Instance Metadata Service. Azure's 169.254.169.254. VMs use it to get OAuth tokens for their MI. |
| Kyverno | Policy engine. Writes K8s policies in YAML, not a DSL. |
| MI | Managed Identity. Azure-native way for a VM to have an identity that can RBAC-grant to other Azure resources without a secret. |
| NSG | Network Security Group. Azure's stateful firewall, attached per subnet (or per NIC). |
| OpenTofu | Open-source fork of Terraform after the BSL relicense. |
| PSA | Pod Security Admission. K8s's built-in admission gate; replaced PodSecurityPolicy. |
| RBAC | Role-Based Access Control. K8s has it; this cluster mostly uses chart-default RBAC. |
| RKE2 | Rancher's K8s distro. Full upstream K8s, embedded etcd + containerd, CIS-hardened. |
| ServiceMonitor | Prometheus Operator CRD that tells Prometheus to scrape a Service. |
| Trivy | Vulnerability scanner from Aqua Security. The Operator wraps it as a controller producing CR-shaped reports. |