k8s-agent-skills 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (51)
  1. package/README.md +102 -0
  2. package/package.json +63 -0
  3. package/skills/atlas/SKILL.md +166 -0
  4. package/skills/cert-manager/SKILL.md +212 -0
  5. package/skills/cilium-gateway/SKILL.md +283 -0
  6. package/skills/cilium-network/SKILL.md +243 -0
  7. package/skills/cnpg/SKILL.md +130 -0
  8. package/skills/dragonfly/SKILL.md +194 -0
  9. package/skills/external-dns/SKILL.md +185 -0
  10. package/skills/flagger/SKILL.md +292 -0
  11. package/skills/flux/SKILL.md +36 -0
  12. package/skills/gitea/SKILL.md +32 -0
  13. package/skills/gitea-api/SKILL.md +104 -0
  14. package/skills/gitea-registry/SKILL.md +71 -0
  15. package/skills/gitea-runner/SKILL.md +126 -0
  16. package/skills/gitea-tea/SKILL.md +206 -0
  17. package/skills/gitea-webhooks/SKILL.md +93 -0
  18. package/skills/harbor/SKILL.md +32 -0
  19. package/skills/harbor-api/SKILL.md +231 -0
  20. package/skills/harbor-helm/SKILL.md +238 -0
  21. package/skills/harbor-terraform/SKILL.md +233 -0
  22. package/skills/higress/SKILL.md +27 -0
  23. package/skills/higress-helm/SKILL.md +328 -0
  24. package/skills/higress-operator/SKILL.md +435 -0
  25. package/skills/kserve/SKILL.md +28 -0
  26. package/skills/kserve-helm/SKILL.md +330 -0
  27. package/skills/kserve-operator/SKILL.md +763 -0
  28. package/skills/kubeflow/SKILL.md +33 -0
  29. package/skills/kubeflow-pipelines/SKILL.md +392 -0
  30. package/skills/kubeflow-trainer/SKILL.md +429 -0
  31. package/skills/kubeflow-training-operator/SKILL.md +176 -0
  32. package/skills/mariadb/SKILL.md +27 -0
  33. package/skills/mariadb-helm/SKILL.md +378 -0
  34. package/skills/mariadb-operator/SKILL.md +1114 -0
  35. package/skills/nvidia-device-plugin/SKILL.md +204 -0
  36. package/skills/rook-ceph/SKILL.md +22 -0
  37. package/skills/rook-ceph-operator/SKILL.md +150 -0
  38. package/skills/rook-ceph-toolbox/SKILL.md +220 -0
  39. package/skills/sealed-secrets/SKILL.md +221 -0
  40. package/skills/stakater-reloader/SKILL.md +259 -0
  41. package/skills/talos/SKILL.md +244 -0
  42. package/skills/tekton/SKILL.md +187 -0
  43. package/skills/vector/SKILL.md +24 -0
  44. package/skills/vector-helm/SKILL.md +186 -0
  45. package/skills/vector-operator/SKILL.md +455 -0
  46. package/skills/victoria-metrics/SKILL.md +35 -0
  47. package/skills/victoriametrics-operator/SKILL.md +248 -0
  48. package/skills/zitadel/SKILL.md +24 -0
  49. package/skills/zitadel-api/SKILL.md +962 -0
  50. package/skills/zitadel-helm/SKILL.md +263 -0
  51. package/skills/zitadel-terraform/SKILL.md +728 -0
@@ -0,0 +1,194 @@
1
+ ---
2
+ name: dragonfly
3
+ description: Use when working with DragonflyDB operator on Kubernetes — creating or troubleshooting Dragonfly resources, configuring replication, snapshots, TLS, authentication, or affinity/scheduling for Dragonfly instances.
4
+ ---
5
+
6
+ # DragonflyDB Operator
7
+
8
+ ## Overview
9
+
10
+ Dragonfly operator manages Dragonfly (Redis-compatible) in-memory data store instances on Kubernetes. API: `dragonflydb.io/v1alpha1`. Single CRD: `Dragonfly` (plural: `dragonflies`). Deployed via Helm chart `oci://ghcr.io/dragonflydb/dragonfly-operator/helm/dragonfly-operator`. Latest operator: **v1.5.0** (Mar 2026). Deployed chart: v1.5.0.
11
+
12
+ ## CRD Fields
13
+
14
+ | Field | Type | Since | Description |
15
+ |-------|------|-------|-------------|
16
+ | `replicas` | int | — | Total instances (1 = standalone primary, 2 = 1 primary + 1 replica) |
17
+ | `image` | string | — | Dragonfly image (default: `docker.dragonflydb.io/dragonflydb/dragonfly:v1.21.2`) |
18
+ | `args` | []string | — | Dragonfly server args (e.g. `--maxmemory=2gb`) |
19
+ | `resources` | ResourceRequirements | — | Container CPU/memory |
20
+ | `affinity` | Affinity | — | Pod affinity (nodeAffinity, podAntiAffinity, etc.) |
21
+ | `nodeSelector` | map | v1.1.1 | Node selector for pod scheduling |
22
+ | `tolerations` | []Toleration | — | Pod tolerations |
23
+ | `topologySpreadConstraints` | []TopologySpreadConstraint | v1.1.1 | Spread pods across topology domains |
24
+ | `annotations` | object | — | Annotations on Dragonfly pods |
25
+ | `labels` | object | — | Labels on Dragonfly pods |
26
+ | `env` | []EnvVar | — | Environment variables |
27
+ | `authentication.passwordFromSecret` | SecretKeySelector | — | Password from Secret key |
28
+ | `authentication.clientCaCertSecret` | SecretReference | — | Client CA certificate Secret |
29
+ | `tlsSecretRef` | SecretReference | — | TLS cert Secret for server TLS |
30
+ | `snapshot.cron` | string | — | Cron schedule for snapshots (e.g. `"*/5 * * * *"`) |
31
+ | `snapshot.persistentVolumeClaimSpec` | PVC Spec | — | PVC for snapshot storage |
32
+ | `aclFromSecret` | SecretKeySelector | v1.1.1 | ACL file from Secret |
33
+ | `serviceAccountName` | string | — | Pod service account |
34
+ | `serviceSpec.type` | string | — | Service type (ClusterIP, LoadBalancer, etc.) |
35
+ | `serviceSpec.name` | string | v1.1.3 | Custom service name |
36
+ | `serviceSpec.annotations` | object | — | Service annotations |
37
+ | `priorityClassName` | string | v1.1.1 | Pod priority class |
38
+ | `skipFSGroup` | bool | v1.1.2 | Skip FSGroup assignment (OpenShift) |
39
+ | `memcachedPort` | int | v1.1.2 | Memcached port (alternative to `--memcached_port` arg) |
40
+ | `additionalContainers` | []Container | — | Sidecar containers |
41
+ | `additionalVolumes` | []Volume | — | Extra volumes |
42
+
43
+ ## Dragonfly Spec Patterns
44
+
45
+ ```yaml
46
+ apiVersion: dragonflydb.io/v1alpha1
47
+ kind: Dragonfly
48
+ metadata:
49
+   name: my-cache
50
+ spec:
51
+   replicas: 1                    # 1 = standalone primary
52
+   args:
53
+     - --maxmemory=2gb
54
+     - --logtostderr
55
+     - --cluster_mode=emulated    # Enable cluster-compatible mode
56
+     - --lock_on_hashtags         # Hashtag-based locking
57
+     - --default_lua_flags=allow-undeclared-keys
58
+
59
+   resources:
60
+     requests:
61
+       cpu: 500m
62
+       memory: 1Gi
63
+     limits:
64
+       cpu: "1"
65
+       memory: 2Gi
66
+
67
+   affinity:
68
+     nodeAffinity:
69
+       requiredDuringSchedulingIgnoredDuringExecution:
70
+         nodeSelectorTerms:
71
+           - matchExpressions:
72
+               - key: kubernetes.io/hostname
73
+                 operator: In
74
+                 values:
75
+                   - worker-proxmox
76
+
77
+   authentication:
78
+     passwordFromSecret:
79
+       name: dragonfly-password
80
+       key: password
81
+
82
+   snapshot:
83
+     cron: "*/5 * * * *"
84
+     persistentVolumeClaimSpec:
85
+       accessModes:
86
+         - ReadWriteOnce
87
+       storageClassName: ceph-block
88
+       resources:
89
+         requests:
90
+           storage: 2Gi
91
+
92
+   serviceSpec:
93
+     type: ClusterIP
94
+     annotations:
95
+       external-dns.alpha.kubernetes.io/hostname: dragonfly.example.com
96
+ ```
97
+
98
+ ## Replication Model
99
+
100
+ - `replicas=1` — standalone primary
101
+ - `replicas=2` — 1 primary + 1 replica
102
+ - `replicas=3` — 1 primary + 2 replicas
103
+ - **Always exactly 1 primary** regardless of replica count
104
+ - Operator manages automatic failover if primary fails
105
+ - Service `<name>.<ns>.svc.cluster.local` always points to current primary
106
+
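The replica arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule "exactly one primary, the rest are replicas"; the function name `dragonfly_roles` is hypothetical and not part of the operator.

```python
def dragonfly_roles(replicas: int) -> dict[str, int]:
    """Split the `replicas` field into primary and replica counts.

    Hypothetical helper: it encodes the rule stated in this section,
    not any code taken from the operator itself.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # There is always exactly one primary; every other instance is a replica.
    return {"primary": 1, "replica": replicas - 1}
```

For example, `dragonfly_roles(3)` yields one primary and two replicas, matching the `replicas=3` row above.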
107
+ ## Authentication
108
+
109
+ | Method | Config | Description |
110
+ |--------|--------|-------------|
111
+ | Password | `authentication.passwordFromSecret` | Basic password auth (maps to `--requirepass`) |
112
+ | Client CA | `authentication.clientCaCertSecret` | TLS client cert verification |
113
+ | ACL file | `aclFromSecret` | ACL rules file from Secret (v1.1.1+) |
114
+
115
+ Password can also be set via `args: ["--requirepass=<pw>"]` or `env: [{name: DFLY_requirepass, value: "<pw>"}]`.
116
+
117
+ ## TLS
118
+
119
+ ```yaml
120
+ spec:
121
+   tlsSecretRef:
122
+     name: dragonfly-tls   # Secret must have tls.crt, tls.key
123
+ ```
124
+
125
+ Secret must exist in same namespace. Optionally combine with `authentication.clientCaCertSecret` for mutual TLS.
126
+
127
+ ## Snapshots
128
+
129
+ Snapshots store Dragonfly data to PVC for persistence across restarts:
130
+
131
+ ```yaml
132
+ spec:
133
+   snapshot:
134
+     cron: "*/5 * * * *"
135
+     persistentVolumeClaimSpec:
136
+       storageClassName: ceph-block
137
+       accessModes: [ReadWriteOnce]
138
+       resources:
139
+         requests:
140
+           storage: 2Gi
141
+ ```
142
+
143
+ - `cron` is optional — omit for on-demand only
144
+ - Snapshots **not auto-pruned** — manage disk or use static `--dbfilename` to overwrite
145
+ - PVC spec follows standard Kubernetes `PersistentVolumeClaimSpec`
146
+
147
+ ## Exposing Dragonfly as Cache
148
+
149
+ Applications connect via the service at `<name>.<ns>.svc.cluster.local:6379`. In apps using Dragonfly as Redis-compatible cache (valkey/Redis clients):
150
+
151
+ ```yaml
152
+ # In the app's ConfigMap or env
153
+ REDIS_URL: redis://:${REDIS_PASSWORD}@my-cache.namespace:6379
154
+ REDIS_PASSWORD: # from Dragonfly auth secret
155
+ ```
156
+
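When the password contains characters such as `@` or `/`, the connection URL has to be percent-encoded or clients will misparse it. The sketch below builds the URL from the resource name and namespace; `redis_url` is a hypothetical helper, and the host pattern is the `<name>.<ns>.svc.cluster.local` form given above.

```python
from urllib.parse import quote


def redis_url(name, namespace, password=None, port=6379):
    """Build a redis:// URL for a Dragonfly service (hypothetical helper)."""
    host = f"{name}.{namespace}.svc.cluster.local"
    if password:
        # Percent-encode the password so '@', '/' and ':' do not break parsing.
        return f"redis://:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"
```

Calling `redis_url("my-cache", "apps", "p@ss/word")` produces a URL whose password segment is safely encoded.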
157
+ With emulated cluster mode (`--cluster_mode=emulated`), Redis Cluster clients (e.g. `ioredis` in cluster mode) can connect as if to a Redis Cluster, but only one actual Dragonfly instance sits behind the service.
158
+
159
+ ## Monitoring
160
+
161
+ Dragonfly exposes metrics on the `admin` port. Use a `PodMonitor` to scrape:
162
+
163
+ ```yaml
164
+ apiVersion: monitoring.coreos.com/v1
165
+ kind: PodMonitor
166
+ metadata:
167
+   name: dragonfly-monitor
167
+ spec:
168
+   selector:
169
+     matchLabels:
170
+       app: my-cache   # Must match Dragonfly resource name
171
+   podMetricsEndpoints:
172
+     - port: admin
174
+ ```
175
+
176
+ The operator creates pods with label `app: <dragonfly-name>` by default.
177
+
178
+ ## Common Mistakes
179
+
180
+ - **`replicas` confusion** — replicas=2 means 1 primary + 1 replica, NOT 2 primaries
181
+ - **No snapshot configuration**: without a snapshot PVC (plus a `cron` schedule for periodic saves), a restart loses all data; always configure both for stateful use
182
+ - **`cluster_mode=emulated` without `lock_on_hashtags`** — emulated cluster needs hashtag locking for multi-key ops
183
+ - **Password collision** — don't set password via both `authentication.passwordFromSecret` AND `--requirepass` arg; use the CRD field
184
+ - **Same storageClass for everything**: `ceph-block` is fine for most workloads (for example Immich); for latency-sensitive instances, pin pods with `nodeSelector` and use local SSD storage
185
+ - **No resource limits** — Dragonfly can OOM under load; always set `resources.limits.memory`
186
+ - **Helm chart OCI URL** — use `oci://ghcr.io/dragonflydb/dragonfly-operator/helm`, chart name `dragonfly-operator`, version `v1.5.0`
187
+
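Several of the mistakes listed above can be caught mechanically before a manifest is applied. The following is a minimal sketch, assuming the `spec` of a `Dragonfly` resource has already been loaded into a plain dict; `lint_dragonfly_spec` and its warning strings are hypothetical and not part of the operator or any existing tool.

```python
def lint_dragonfly_spec(spec: dict) -> list[str]:
    """Return warnings for a Dragonfly `spec` given as a plain dict."""
    warnings = []
    args = spec.get("args", [])
    # Flag names without their values, e.g. "--maxmemory=2gb" -> "--maxmemory".
    flags = {arg.split("=", 1)[0] for arg in args}

    auth = spec.get("authentication", {})
    if "passwordFromSecret" in auth and "--requirepass" in flags:
        warnings.append("password set via both passwordFromSecret and --requirepass")

    if "--cluster_mode=emulated" in args and "--lock_on_hashtags" not in flags:
        warnings.append("cluster_mode=emulated without --lock_on_hashtags")

    if "memory" not in spec.get("resources", {}).get("limits", {}):
        warnings.append("resources.limits.memory is not set")

    if "persistentVolumeClaimSpec" not in spec.get("snapshot", {}):
        warnings.append("no snapshot PVC: data is lost on restart")

    return warnings
```

An empty list means none of the checked mistakes were found; it does not mean the spec is otherwise valid.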
188
+ ## Version History
189
+
190
+ | Operator | Helm Chart | Date | Notes |
191
+ |----------|-----------|------|-------|
192
+ | v1.5.0 | v1.5.0 | Mar 2026 | Latest (deployed) |
193
+ | v1.4.0 | v1.4.0 | Jan 2026 | |
194
+ | v1.3.1 | v1.3.1 | Nov 2025 | |
@@ -0,0 +1,185 @@
1
+ ---
2
+ name: external-dns
3
+ description: Use when working with ExternalDNS — synchronizing Kubernetes resources with DNS providers (Cloudflare). Covers providers, sources, registry, RBAC, Gateway API integration, Helm values. No CRDs.
4
+ ---
5
+
6
+ # ExternalDNS
7
+
8
+ ## Overview
9
+
10
+ ExternalDNS synchronizes exposed Kubernetes Services, Ingresses, and Gateway API routes with DNS providers. It watches resources via the K8s watch API and creates/updates/deletes DNS records to match.
11
+
12
+ **No CRDs.** Controlled via CLI flags, sources, and provider-specific config.
13
+
14
+ **Latest:** chart 1.21.1, app v0.21.0.
15
+
16
+ ## Architecture
17
+
18
+ ```
19
+ K8s Sources (Ingress, HTTPRoute, Service...)
20
+ → ExternalDNS watches for changes
21
+ → Resolves to DNS endpoints
22
+ → Creates/updates/deletes DNS records (Cloudflare, Route53, etc.)
23
+ → Registry (TXT records) tracks ownership
24
+ ```
25
+
26
+ ## Sources
27
+
28
+ ExternalDNS queries one or more source types for DNS endpoints:
29
+
30
+ | Source | Flag | Supported |
31
+ |--------|------|-----------|
32
+ | Ingress | `--source=ingress` | ✅ |
33
+ | Service (LoadBalancer) | `--source=service` | ✅ (no NodePort) |
34
+ | Gateway HTTPRoute | `--source=gateway-httproute` | ✅ |
35
+ | Gateway GRPCRoute | `--source=gateway-grpcroute` | ✅ |
36
+ | Gateway TLSRoute | `--source=gateway-tlsroute` | ✅ (v1alpha2) |
37
+ | Gateway TCPRoute | `--source=gateway-tcproute` | ✅ (experimental) |
38
+ | Gateway UDPRoute | `--source=gateway-udproute` | ✅ (experimental) |
39
+ | Istio Gateway | `--source=istio-gateway` | ✅ |
40
+ | Istio VirtualService | `--source=istio-virtualservice` | ✅ |
41
+ | CRD | `--source=crd` | ✅ (externaldns.k8s.io/v1alpha1) |
42
+ | Node | `--source=node` | ✅ |
43
+ | Pod | `--source=pod` | ✅ |
44
+ | OpenShift Route | `--source=openshift-route` | ✅ |
45
+ | Contour HTTPProxy | `--source=contour-httpproxy` | ✅ |
46
+ | Traefik Proxy | `--source=traefik-proxy` | ✅ |
47
+
48
+ **Deployed sources:** `ingress`, `gateway-httproute`.
49
+
50
+ ## Providers
51
+
52
+ ExternalDNS supports 25+ providers. Deployed: Cloudflare.
53
+
54
+ ### Cloudflare
55
+
56
+ Auth via API token (env var `CF_API_TOKEN`). Example values:
57
+
58
+ ```yaml
59
+ provider:
60
+   name: cloudflare
61
+ env:
62
+   - name: CF_API_TOKEN
63
+     valueFrom:
64
+       secretKeyRef:
65
+         name: cloudflare-credentials
66
+         key: api-token
67
+ extraArgs:
68
+   - --cloudflare-proxied
69
+ ```
70
+
71
+ Cloudflare-specific flags:
72
+
73
+ | Flag | Description |
74
+ |------|-------------|
75
+ | `--cloudflare-proxied` | Enable Cloudflare proxy (orange cloud) — CDN, DDoS protection, SSL |
76
+ | `--cloudflare-dns-records-per-page=N` | Records per page (default 100, max 5000) |
77
+ | `--cloudflare-custom-hostnames` | Enable Cloudflare for SaaS Custom Hostnames |
78
+ | `--cloudflare-regional-services` | Restrict HTTPS decryption to specific regions |
79
+ | `--cloudflare-region-key` | Region key for regional services |
80
+ | `--cloudflare-record-comment` | Add a comment to provisioned records (up to 100 characters on free zones, 500 on paid zones) |
81
+
82
+ ## Registry & Policy
83
+
84
+ Controls how ExternalDNS tracks ownership of records:
85
+
86
+ ```yaml
87
+ registry: txt # Use TXT records to track ownership
88
+ txtOwnerId: my-cluster # Owner identifier in TXT record
89
+ policy: upsert-only # Only create/update, never delete
90
+ ```
91
+
92
+ | Registry | Description |
93
+ |----------|-------------|
94
+ | `txt` | TXT records with owner ID (prevents overwriting records from other sources) |
95
+ | `aws-sd` | AWS Cloud Map (service discovery) based registry (provider-specific) |
96
+ | `noop` | No ownership tracking |
97
+
98
+ | Policy | Description |
99
+ |--------|-------------|
100
+ | `upsert-only` | Create and update only (safe for shared zones) |
101
+ | `sync` | Full sync — create, update, delete (can delete external records) |
102
+ | `create-only` | Only create, never update or delete |
103
+
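The effect of each policy can be pictured as a filter over the changes ExternalDNS would otherwise apply. The sketch below is a simplified model of the table above, not ExternalDNS code; `allowed_changes` is a hypothetical name, and records are represented as plain strings.

```python
def allowed_changes(policy, creates, updates, deletes):
    """Filter planned DNS changes by sync policy (simplified model)."""
    if policy == "sync":
        return creates, updates, deletes   # full sync, including deletions
    if policy == "upsert-only":
        return creates, updates, []        # never delete
    if policy == "create-only":
        return creates, [], []             # never update or delete
    raise ValueError(f"unknown policy: {policy}")
```

With `upsert-only`, a planned deletion is dropped, which is why stale records persist after their source resource is removed.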
104
+ ## Domain & Ownership Filtering
105
+
106
+ ```yaml
107
+ domainFilters:
108
+ - example.com # Only manage records in this zone
109
+ excludeDomains: [] # Exclude specific domains
110
+ zoneIdFilters: [] # Limit to specific zone IDs
111
+ annotationFilter: "" # Filter resources by annotation
112
+ labelFilter: "" # Filter resources by label
113
+ ```
114
+
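Domain filters match on whole DNS labels from the right. A rough Python approximation of that suffix rule is shown below; it is a sketch under the assumption that an empty filter list matches everything, and `matches_domain_filter` is a hypothetical name rather than an ExternalDNS function.

```python
def matches_domain_filter(hostname: str, filters: list[str]) -> bool:
    """Approximate the domainFilters suffix match (illustrative only)."""
    # Normalize: ignore case, trailing dots, and empty filter entries.
    cleaned = [f.strip().rstrip(".").lower() for f in filters if f.strip()]
    if not cleaned:
        return True  # assumption: no filters means every domain is in scope
    host = hostname.rstrip(".").lower()
    for f in cleaned:
        # Match the zone itself or any name ending in ".<zone>".
        if host == f or host.endswith("." + f.lstrip(".")):
            return True
    return False
```

Because the comparison requires a leading dot, `notexample.com` does not match a filter of `example.com`, while `app.example.com` does.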
115
+ ## Gateway API Integration
116
+
117
+ ExternalDNS reads hostnames from Gateway API HTTPRoute/GRPCRoute resources:
118
+
119
+ ```yaml
120
+ sources:
121
+ - gateway-httproute
122
+ - gateway-grpcroute
123
+ ```
124
+
125
+ RBAC for Gateway sources requires additional permissions:
126
+
127
+ ```yaml
128
+ rbac:
129
+   extraRules:
130
+     - apiGroups: ["gateway.networking.k8s.io"]
131
+       resources: ["httproutes", "gateways"]
132
+       verbs: ["get", "watch", "list"]
133
+ ```
134
+
135
+ HTTPRoute annotations for per-route overrides:
136
+
137
+ ```yaml
138
+ metadata:
139
+   annotations:
140
+     external-dns.alpha.kubernetes.io/cloudflare-proxied: "true"
141
+     external-dns.alpha.kubernetes.io/ttl: "300"
142
+ ```
143
+
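The per-resource annotation takes precedence over the global `--cloudflare-proxied` setting. A minimal sketch of that precedence, assuming annotations are available as a dict of strings, is shown below; `is_proxied` is a hypothetical helper, and it deliberately ignores any other conditions the provider may apply (such as record type).

```python
PROXIED_ANNOTATION = "external-dns.alpha.kubernetes.io/cloudflare-proxied"


def is_proxied(global_default: bool, annotations: dict[str, str]) -> bool:
    """Resolve the proxied setting for one resource (simplified model)."""
    value = annotations.get(PROXIED_ANNOTATION)
    if value is None:
        # No annotation: fall back to the global flag.
        return global_default
    return value.strip().lower() == "true"
```

This is why omitting the global flag and annotating individual routes proxies only those routes.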
144
+ ## Deployment (Flux HelmRelease)
145
+
146
+ ```yaml
147
+ apiVersion: helm.toolkit.fluxcd.io/v2
148
+ kind: HelmRelease
149
+ spec:
150
+   chart:
151
+     spec:
152
+       chart: external-dns
153
+       sourceRef:
154
+         kind: HelmRepository
155
+         name: external-dns
156
+       version: 1.21.1
157
+ ```
158
+
159
+ ### Helm Values
160
+
161
+ | Value | Default | Description |
162
+ |-------|---------|-------------|
163
+ | `provider.name` | `aws` | DNS provider (cloudflare, google, aws, azure, etc.) |
164
+ | `sources` | `[service, ingress]` | Resources to watch (ingress, gateway-httproute, service, etc.) |
165
+ | `domainFilters` | `[]` | Limit to specific DNS zones |
166
+ | `policy` | `upsert-only` | Sync policy (upsert-only, sync, create-only) |
167
+ | `registry` | `txt` | Ownership registry (txt, aws-sd, noop) |
168
+ | `txtOwnerId` | — | Owner identifier for TXT registry |
169
+ | `interval` | `1m` | Sync interval |
170
+ | `logLevel` | `info` | Log verbosity |
171
+ | `rbac.create` | `true` | Create ClusterRole |
172
+ | `rbac.extraRules` | `[]` | Additional RBAC rules (e.g. for Gateway API) |
173
+ | `nodeSelector` | `{}` | Node selector |
174
+ | `tolerations` | `[]` | Pod tolerations |
175
+ | `extraArgs` | `[]` | Additional CLI args |
176
+
177
+ ## Common Mistakes
178
+
179
+ - **Missing RBAC for Gateway sources.** Without `rbac.extraRules`, gateway-httproute source returns no endpoints. Both `httproutes` and `gateways` resources must be listed.
180
+ - **`upsert-only` doesn't clean up stale records.** When an Ingress/HTTPRoute is deleted, its DNS record persists. Use `sync` policy or manual cleanup.
181
+ - **Cloudflare API token needs specific permissions.** Requires `Zone:Zone:Read` and `Zone:DNS:Edit` for the target zone. A token with only `Zone:Read` cannot write records, and the failure is easy to miss unless you check the ExternalDNS logs.
182
+ - **`--cloudflare-proxied` is a global flag.** To proxy only specific records, omit the global flag and use the `external-dns.alpha.kubernetes.io/cloudflare-proxied: "true"` annotation per resource.
183
+ - **Domain filter is a suffix match.** `kubexa.tech` matches `app.kubexa.tech` but NOT `kubexa.tech.app.com`. Add trailing dot if needed.
184
+ - **NodePort services not supported** with `source=service`. Only LoadBalancer services are detected.
185
+ - **Multiple TXT owner IDs on same zone.** If two ExternalDNS instances manage the same zone with different `txtOwnerId` values, they can coexist. Each instance only modifies records whose TXT owner ID matches its own; records with a different or unknown owner ID are left untouched.
@@ -0,0 +1,292 @@
1
+ ---
2
+ name: flagger
3
+ description: Use when working with Flagger — progressive delivery, canary deployments, A/B testing, blue/green on Kubernetes. Covers Canary CRD, analysis/meshes/metrics/webhooks, Helm values. CRDs: Canary, MetricTemplate, AlertProvider.
4
+ ---
5
+
6
+ # Flagger
7
+
8
+ ## Overview
9
+
10
+ Flagger automates progressive delivery for Kubernetes workloads. It gradually shifts traffic to a new version while measuring metrics and running conformance tests. Supports canary releases (weighted traffic), A/B testing (header/cookie routing), and blue/green deployments (instant switch or mirroring).
11
+
12
+ **CRDs:** `Canary` (flagger.app/v1beta1), `MetricTemplate`, `AlertProvider`.
13
+
14
+ **Latest:** chart 1.43.0, app v1.43.0 (Apr 2026).
15
+
16
+ ## Architecture
17
+
18
+ ```
19
+ User creates/updates Canary resource
20
+   → Flagger creates:
21
+     - <name>-primary Deployment (stable version)
22
+     - <name> Deployment is kept as the canary (new version, scaled to zero between rollouts)
23
+     - <name> ClusterIP service (routes to primary)
24
+     - <name>-primary ClusterIP service (stable)
25
+     - <name>-canary ClusterIP service (new)
26
+     - Mesh/Ingress routing objects (if mesh provider set)
27
+   → Analysis loop:
28
+     1. Increment traffic to canary (stepWeight)
29
+     2. Run webhooks (pre-rollout, rollout, post-rollout)
30
+     3. Check metrics (success rate, duration, custom)
31
+     4. If all pass → promote canary to primary
32
+     5. If threshold exceeded → rollback
33
+ ```
34
+
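The analysis loop can be simulated to see how `stepWeight`, `maxWeight`, and `threshold` interact. This is a deliberately simplified model, not Flagger's controller logic: each element of `checks` stands for one interval in which all metrics and webhooks either passed or failed, and `simulate_canary` is a hypothetical name.

```python
def simulate_canary(checks, step_weight, max_weight, threshold):
    """Walk a list of per-interval check results and return (outcome, weight)."""
    weight = 0
    failed = 0
    for passed in checks:
        if not passed:
            failed += 1
            if failed >= threshold:
                return "rollback", 0      # too many failed checks
            continue                      # hold the current weight
        weight += step_weight             # shift more traffic to the canary
        if weight >= max_weight:
            return "promote", weight      # canary reached the target weight
    return "in-progress", weight
```

With `step_weight=10` and `max_weight=50`, five passing intervals are needed before promotion; five failures with `threshold=5` trigger a rollback.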
35
+ ## CRD: Canary
36
+
37
+ `apiVersion: flagger.app/v1beta1`, `kind: Canary`
38
+
39
+ ### Minimal Example (Kubernetes CNI — no mesh)
40
+
41
+ ```yaml
42
+ apiVersion: flagger.app/v1beta1
43
+ kind: Canary
44
+ metadata:
45
+   name: myapp
46
+   namespace: prod
47
+ spec:
48
+   provider: kubernetes    # No service mesh (uses ClusterIP + pod labels)
49
+   targetRef:
50
+     apiVersion: apps/v1
51
+     kind: Deployment
52
+     name: myapp
53
+   service:
54
+     port: 9898
55
+     portDiscovery: true
56
+   analysis:
57
+     interval: 1m
58
+     threshold: 5
59
+     iterations: 10        # Used for blue/green with no mesh provider
60
+     metrics:
61
+       - name: request-success-rate
62
+         thresholdRange:
63
+           min: 99
64
+         interval: 1m
65
+     webhooks:
66
+       - name: load-test
67
+         type: rollout
68
+         url: http://flagger-loadtester.test/
69
+         metadata:
70
+           cmd: "hey -z 1m -q 10 http://myapp-canary.prod:9898/"
71
+ ```
72
+
73
+ ### CanarySpec Fields
74
+
75
+ | Field | Required | Description |
76
+ |-------|----------|-------------|
77
+ | `provider` | Yes | Traffic provider: `kubernetes`, `istio`, `linkerd`, `nginx`, `contour`, `gloo`, `traefik`, `gatewayapi:v1`, `apisix`, `kuma`, `knative`, `skipper`, `osm`, `smi:v1alpha2`, `appmesh:v1beta2` |
78
+ | `targetRef` | Yes | Target workload reference (Deployment or DaemonSet) |
79
+ | `autoscalerRef` | No | HPA reference (copied to canary) |
80
+ | `service` | Yes | Service spec (port, portName, targetPort, hosts, gatewayRefs, match, rewrite, timeout, headers, etc.) |
81
+ | `suspend` | No | Suspend all canary runs |
82
+ | `progressDeadlineSeconds` | No | Max time for canary progress before rollback (default 600) |
83
+ | `skipAnalysis` | No | Promote without analysis (default false) |
84
+
85
+ ### AnalysisSpec Fields
86
+
87
+ | Field | Required | Description |
88
+ |-------|----------|-------------|
89
+ | `interval` | Yes | Schedule interval (e.g. `1m`, `30s`) |
90
+ | `threshold` | Yes | Max failed checks before rollback |
91
+ | `maxWeight` | Canary | Max traffic % to canary (0-100). Used with `stepWeight` |
92
+ | `stepWeight` | Canary | Traffic increment per interval (0-100). Used with `maxWeight` |
93
+ | `stepWeights` | Canary | Explicit array of traffic weights. Replaces stepWeight |
94
+ | `stepWeightPromotion` | No | Traffic increment during promotion phase |
95
+ | `iterations` | A/B, Blue/Green | Number of iterations (replaces stepWeight/maxWeight) |
96
+ | `match` | A/B | HTTP header/cookie match conditions for A/B testing |
97
+ | `mirror` | Blue/Green | Mirror traffic to canary (default false) |
98
+ | `mirrorWeight` | No | % of traffic to mirror (0-100) |
99
+ | `primaryReadyThreshold` | No | % of primary pods that must be available before starting (default 100) |
100
+ | `canaryReadyThreshold` | No | % of canary pods that must be available (default 100) |
101
+ | `metrics` | No | List of metric checks |
102
+ | `webhooks` | No | List of webhooks (pre-rollout, rollout, confirm-promotion, etc.) |
103
+ | `alerts` | No | List of alert configs |
104
+ | `sessionAffinity` | No | Session affinity settings for canary |
105
+
106
+ ### Analysis Strategies
107
+
108
+ | Strategy | Fields | Traffic shaping | Use case |
109
+ |----------|--------|-----------------|----------|
110
+ | Canary (weighted) | `stepWeight` + `maxWeight` | Gradual traffic shift | Gradual rollout with metrics |
111
+ | Canary (custom steps) | `stepWeights: [5, 10, 25, 50, 75]` | Custom traffic steps | Non-linear rollout |
112
+ | A/B Testing | `iterations` + `match` | Header/cookie routing | Test specific user segments |
113
+ | Blue/Green | `iterations` | Instant switch | Quick rollback or pre-production validation |
114
+ | Blue/Green Mirror | `iterations` + `mirror: true` | Traffic mirroring | Shadow traffic without impact |
115
+
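For the weighted strategy, the rollout duration follows from the analysis settings: the minimum time to promote is `interval * (maxWeight / stepWeight)`, and the time to roll back on failing checks is `interval * threshold`. The helpers below are a sketch of that arithmetic with hypothetical names, assuming `maxWeight` is a multiple of `stepWeight` and intervals are given in seconds.

```python
def weight_steps(step_weight: int, max_weight: int) -> list[int]:
    """Traffic weights the canary passes through on the way to max_weight."""
    return list(range(step_weight, max_weight + 1, step_weight))


def min_promotion_seconds(interval_s: int, step_weight: int, max_weight: int) -> float:
    """Minimum time to validate and promote: interval * (maxWeight / stepWeight)."""
    return interval_s * (max_weight / step_weight)


def rollback_seconds(interval_s: int, threshold: int) -> int:
    """Time until rollback when every check fails: interval * threshold."""
    return interval_s * threshold
```

For `interval: 1m`, `stepWeight: 10`, `maxWeight: 50`, and `threshold: 5`, both the minimum promotion time and the rollback time come out to five minutes.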
116
+ ### Metrics
117
+
118
+ ```yaml
119
+ metrics:
120
+   - name: request-success-rate
121
+     thresholdRange:
122
+       min: 99
123
+     interval: 1m
124
+   - name: request-duration
125
+     thresholdRange:
126
+       max: 500
127
+     interval: 30s
128
+   - name: custom-metric
129
+     templateRef:
130
+       name: my-metric-template
131
+       namespace: flagger
132
+     thresholdRange:
133
+       min: 2
134
+       max: 100
135
+     interval: 1m
136
+ ```
137
+
138
+ Built-in metric checks (when `templateRef` is not set):
139
+ - `request-success-rate` — Prometheus query `rate(...)` for non-5xx responses
140
+ - `request-duration` — Prometheus query `histogram_quantile(0.99, ...)` for P99 latency
141
+
142
+ Custom metrics use `MetricTemplate` CRD (see below).
143
+
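A `thresholdRange` check passes when the measured value sits inside the configured bounds, where either bound may be omitted. The sketch below models that comparison; `within_threshold` is a hypothetical helper, not a Flagger function.

```python
from typing import Optional


def within_threshold(value: float,
                     min_value: Optional[float] = None,
                     max_value: Optional[float] = None) -> bool:
    """Return True when value satisfies the optional min/max bounds."""
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True
```

A success rate of 99.5 passes `min: 99`, while a request duration of 650 fails `max: 500`.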
144
+ ### Webhooks
145
+
146
+ ```yaml
147
+ webhooks:
148
+   - name: "load test"
148
+     type: rollout              # Run during canary analysis
149
+     url: http://tester/        # Webhook endpoint
150
+     timeout: 5m
151
+     retries: 3
152
+     disableTLS: false
153
+     metadata:
154
+       cmd: "hey -z 1m http://app:9898/"
156
+ ```
157
+
158
+ Webhook types (execution order):
159
+
160
+ | Type | Phase | Purpose |
161
+ |------|-------|---------|
162
+ | `pre-rollout` | Before canary starts | Acceptance tests, DB migrations check |
163
+ | `confirm-rollout` | Before canary starts (gating) | Manual approval gate |
164
+ | `rollout` | During analysis (each step) | Load tests |
165
+ | `confirm-promotion` | Before promotion (gating) | Manual approval for promotion |
166
+ | `post-rollout` | After promotion | Smoke tests, cleanup |
167
+ | `rollback` | After rollback | Cleanup, notifications |
168
+ | `event` | Any time | Informational events |
169
+ | `confirm-traffic-increase` | Before each step increase (gating) | Per-step manual approval |
170
+
171
+ ### Alerts
172
+
173
+ ```yaml
174
+ alerts:
175
+   - name: "Slack"
175
+     severity: error      # info, warn, error
176
+     providerRef:
177
+       name: dev-slack
178
+       namespace: flagger
180
+ ```
181
+
182
+ ## CRD: MetricTemplate
183
+
184
+ `apiVersion: flagger.app/v1beta1`, `kind: MetricTemplate`
185
+
186
+ Defines custom metric queries for canary analysis:
187
+
188
+ ```yaml
189
+ apiVersion: flagger.app/v1beta1
190
+ kind: MetricTemplate
191
+ metadata:
192
+   name: db-connections
192
+   namespace: flagger
193
+ spec:
194
+   provider:
195
+     type: prometheus
196
+     address: http://prometheus.monitoring:9090
197
+   query: |
198
+     avg_over_time(
199
+       pg_stat_activity_count{namespace="{{ namespace }}",app="{{ target }}"}[{{ interval }}]
200
+     )
202
+ ```
203
+
204
+ Template variables: `{{ namespace }}`, `{{ target }}`, `{{ interval }}`.
205
+
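To preview what a query looks like after substitution, the placeholders can be expanded by hand. The sketch below uses a plain regular-expression replacement as a stand-in; Flagger itself renders these templates in Go, so this only approximates the result for simple `{{ name }}` placeholders, and `render_query` is a hypothetical name.

```python
import re


def render_query(template: str, variables: dict[str, str]) -> str:
    """Replace {{ name }} placeholders with values from `variables`."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"no value for template variable '{key}'")
        return variables[key]

    # Matches "{{ name }}" with optional whitespace inside the braces.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)
```

Rendering the `db-connections` query with `namespace=prod`, `target=myapp`, and `interval=1m` gives a string that can be pasted into the Prometheus UI to check that it returns data.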
206
+ ## CRD: AlertProvider
207
+
208
+ `apiVersion: flagger.app/v1beta1`, `kind: AlertProvider`
209
+
210
+ ```yaml
211
+ apiVersion: flagger.app/v1beta1
212
+ kind: AlertProvider
213
+ metadata:
214
+   name: dev-slack
214
+   namespace: flagger
215
+ spec:
216
+   type: slack
217
+   channel: flagger-alerts
218
+   username: flagger
219
+   address: https://hooks.slack.com/services/TOKEN
221
+ ```
222
+
223
+ Supported types: `slack`, `msteams`, `discord`, `rocket`.
224
+
225
+ ## Deployment (Flux HelmRelease)
226
+
227
+ ```yaml
228
+ apiVersion: source.toolkit.fluxcd.io/v1
229
+ kind: HelmRepository
230
+ metadata:
231
+   name: flagger
231
+   namespace: flagger-system
232
+ spec:
233
+   interval: 24h
234
+   url: https://flagger.app
235
+ ---
236
+ apiVersion: helm.toolkit.fluxcd.io/v2
237
+ kind: HelmRelease
238
+ metadata:
239
+   name: flagger
240
+   namespace: flagger-system
241
+ spec:
242
+   chart:
243
+     spec:
244
+       chart: flagger
245
+       sourceRef:
246
+         kind: HelmRepository
247
+         name: flagger
248
+       version: "1.39.0"
249
+   values:
250
+     meshProvider: ""          # Kubernetes CNI mode
251
+     metricsServer: "http://prometheus:9090"
252
+     prometheus:
253
+       install: false          # Use existing Prometheus
255
+ ```
256
+
257
+ ## Helm Values
258
+
259
+ | Value | Default | Description |
260
+ |-------|---------|-------------|
261
+ | `meshProvider` | `""` (kubernetes) | Traffic provider: istio, linkerd, nginx, contour, kubernetes, etc. |
262
+ | `metricsServer` | `http://prometheus.istio-system:9090` | Prometheus URL |
263
+ | `logLevel` | `info` | Log level |
264
+ | `crd.create` | `false` | Create CRDs (Helm v3 handles this separately) |
265
+ | `prometheus.install` | `false` | Install bundled Prometheus |
266
+ | `prometheus.retention` | `2h` | Prometheus data retention |
267
+ | `serviceMonitor.enabled` | `false` | Create ServiceMonitor |
268
+ | `podMonitor.enabled` | `false` | Create PodMonitor |
269
+ | `namespace` | `""` (all) | Watch single namespace (empty = all) |
270
+ | `selectorLabels` | `app,name,app.kubernetes.io/name` | Labels for workload selection |
271
+
272
+ ## Provider-Specific Features
273
+
274
+ | Feature | Istio | Linkerd | Contour | NGINX | Kubernetes | Gateway API |
275
+ |---------|-------|---------|---------|-------|-----------|-------------|
276
+ | Weighted canary | ✅ | ✅ | ✅ | ✅ | ➖ | ✅ |
277
+ | A/B testing | ✅ | ➖ | ✅ | ✅ | ➖ | ✅ |
278
+ | Blue/green (switch) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
279
+ | Blue/green (mirror) | ✅ | ➖ | ➖ | ➖ | ➖ | ➖ |
280
+ | Request success rate | ✅ | ✅ | ✅ | ➖ | ✅ | ✅ |
281
+ | Request duration | ✅ | ✅ | ✅ | ➖ | ✅ | ✅ |
282
+
283
+ ## Common Mistakes
284
+
285
+ - **`meshProvider: ""` (Kubernetes CNI) has no traffic shaping.** Flagger can only do blue/green (iterations-based) with the `kubernetes` provider. Weighted canary (stepWeight) requires a service mesh or ingress controller.
286
+ - **Prometheus must be reachable.** Without `metricsServer`, the analysis loop immediately fails. Verify Prometheus URL and that Flagger can query it.
287
+ - **CRDs not installed.** `crd.create: false` means CRDs must be installed separately. If running Flux, the CRDs from the upstream `crds.yaml` must exist before Canary resources are applied.
288
+ - **Webhook URL must be reachable from Flagger pod.** Load test webhooks are called during canary analysis. If the webhook times out or returns error, the canary fails. Use cluster-internal URLs.
289
+ - **`targetRef` must be a Deployment or DaemonSet.** Flagger does not support other workload types such as StatefulSet.
290
+ - **`progressDeadlineSeconds` too low.** If canary takes longer than this (e.g., image pull delay, slow startup), Flagger rolls back. Default 600s. Increase for large images.
291
+ - **Missing `service.port`.** Required field. Flagger creates ClusterIP services and needs to know the container port.
292
+ - **Metric template `{{ target }}` is the target workload name.** In Prometheus queries, `{{ target }}` resolves to `spec.targetRef.name`, not the Canary or Service name. Ensure the label selectors in your query match the labels your metrics actually carry.