npm - @synapta/skills - Versions diffs - 0.1.1 → 0.2.0 - Mend

@synapta/skills 0.1.1 → 0.2.0


Files changed (354)

package/skills/deploy-k8s/docs/core-concepts/workflow.md ADDED

@@ -0,0 +1,124 @@
+# Workflow
+KubeShark operates through a 7-step workflow defined in `SKILL.md`. The workflow runs top to bottom on every Kubernetes task. This page explains what each step does and why it exists.
+---
+## Step 1: Capture Execution Context
+Before writing any YAML, KubeShark records the environment it is operating in. This prevents the most common LLM failure: generating manifests that assume a generic cluster and ignore the user's actual setup.
+**Context captured:**
+| Dimension | Examples | Why it matters |
+|-----------|----------|----------------|
+| Cluster version | 1.29, 1.30, 1.31 | API availability differs across versions; deprecated APIs cause hard failures |
+| Distribution | EKS, GKE, AKS, k3s, vanilla | Each has distribution-specific defaults, storage classes, and networking behaviors |
+| Namespace | `default`, `production`, `monitoring` | Determines resource quotas, network policies, and RBAC scope |
+| Environment | dev, staging, prod | Controls security strictness, resource sizing, and validation rigor |
+| Workload type | Deployment, StatefulSet, Job, CronJob, DaemonSet | Different workload types have different failure patterns and configuration requirements |
+| Deployment method | Raw YAML, Helm, Kustomize, operator-managed | Determines output format and which tooling references to load |
+| Policy enforcement | Pod Security Admission, Kyverno, OPA/Gatekeeper | Affects what security controls are required versus optional |
+| Cloud provider and CNI | AWS/VPC CNI, GCP/Calico, Azure/Azure CNI | Impacts networking, storage classes, load balancer annotations, and service mesh compatibility |
+When any dimension is unknown, KubeShark states the assumption explicitly rather than guessing silently. These assumptions appear in the output contract (Step 7) so the user can verify them.
+---
+## Step 2: Diagnose Failure Modes
+This is the step that distinguishes KubeShark from a reference manual. Before generating anything, the workflow identifies which of the six failure modes are relevant to the task.
+**The six failure modes:**
+1. **Insecure workload defaults** -- missing security contexts, PSS violations, host access, excessive capabilities
+2. **Resource starvation** -- missing requests/limits, no QoS strategy, absent PodDisruptionBudgets, scheduling chaos
+3. **Network exposure** -- flat networking, missing NetworkPolicies, wrong Service types, DNS misconfigurations
+4. **Privilege sprawl** -- overly permissive RBAC, leaked secrets, unscoped ServiceAccount tokens
+5. **Fragile rollouts** -- misconfigured probes, mutable image tags, unsafe update strategies, missing graceful shutdown
+6. **API drift** -- wrong apiVersion, deprecated APIs, schema violations, tool-specific structural errors
+Most tasks trigger multiple failure modes. A "create a Deployment with an Ingress" request involves at least insecure workload defaults, network exposure, and fragile rollouts. The diagnosis step ensures none of these are overlooked.
+See [Failure Modes](failure-modes.md) for a detailed breakdown of each.
+---
+## Step 3: Load Targeted References
+KubeShark includes 20 reference files, but only 1-2 are loaded per query. This is a deliberate token efficiency decision: loading all references would burn thousands of tokens on irrelevant guidance.
+**Reference selection logic:**
+- A probe configuration question loads `fragile-rollouts.md` -- it never touches `privilege-sprawl.md` or `network-exposure.md`.
+- A Helm chart task loads `helm-patterns.md` and the failure-mode reference for the workload being charted.
+- A security review loads `insecure-workload-defaults.md` and `security-hardening.md`.
+**Reference categories:**
+| Category | Files | Loaded when |
+|----------|-------|-------------|
+| Primary failure modes | 6 files (one per failure mode) | The corresponding failure mode is diagnosed in Step 2 |
+| Workload patterns | Deployment, StatefulSet, Job, DaemonSet patterns | Generating a specific workload type |
+| Cross-cutting concerns | Security hardening, observability, multi-tenancy, storage | The task spans multiple domains |
+| Tooling | Helm patterns, Kustomize patterns, validation and policy | Using a specific deployment tool |
+| Pattern banks | Good examples, bad examples, do/don't checklist | Reviewing code or learning patterns |
+Each reference file is self-contained. No file depends on another being loaded simultaneously.
+---
+## Step 4: Propose Fix Path
+For every recommendation, KubeShark provides three things:
+1. **Why this addresses the failure mode** -- the causal link between the fix and the diagnosed risk.
+2. **What could still go wrong** -- runtime behavior, edge cases, and deployment-time risks that remain even after the fix.
+3. **Guardrails** -- validation commands, policy checks, and rollback paths that protect against the remaining risks.
+This structure prevents a common LLM pattern: recommending a fix without acknowledging its limitations. A liveness probe fix that does not mention the risk of checking external dependencies is incomplete. A NetworkPolicy recommendation that does not mention egress is incomplete.
+---
+## Step 5: Generate Artifacts
+When the task calls for implementation, KubeShark produces the appropriate artifacts:
+- **Kubernetes manifests** -- YAML with security contexts, resource limits, proper labels, and annotations
+- **Helm values and templates** -- chart structure following Helm best practices
+- **Kustomize overlays** -- base/overlay structure with proper patch formats
+- **NetworkPolicies** -- default-deny with explicit allow rules
+- **RBAC resources** -- least-privilege Roles and RoleBindings with dedicated ServiceAccounts
+- **PodDisruptionBudgets** -- tuned to workload replica count and availability requirements
+- **Policy rules** -- Kyverno ClusterPolicies or OPA/Gatekeeper ConstraintTemplates
+All generated manifests default to the Pod Security Standards restricted profile plus a read-only root filesystem (an extra hardening step the profile itself does not mandate): `runAsNonRoot: true`, `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem: true`, `drop: ["ALL"]` capabilities, and the `RuntimeDefault` seccomp profile.
+---
+## Step 6: Validate
+KubeShark never recommends applying directly to production without validation. Every response includes validation steps matched to the deployment method and risk level:
+- **`kubectl apply --dry-run=server`** or **`kubectl diff`** -- catches API-level errors without making changes
+- **`kubeconform`** -- schema validation against the target cluster version to catch API drift
+- **Cross-resource consistency checks** -- verifies that labels, selectors, ports, and names align across Deployments, Services, Ingress, PDBs, HPAs, and NetworkPolicies
+- **Policy scan** -- PSS profile compliance check, Kyverno audit, or OPA/Gatekeeper dry-run
+Cross-resource consistency is especially important because Kubernetes silently accepts mismatched selectors. A Service with a selector that matches no pods deploys without error -- the failure only surfaces when traffic arrives.
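+The selector alignment described above can be sketched as follows. This is a minimal illustration, not KubeShark output: the name `web`, the label value, the image, and the port numbers are all hypothetical, and security context and resource settings are omitted for brevity.
+```yaml
+# Hypothetical Deployment and Service; names, image, and ports are placeholders.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: web
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: web      # must match the pod template labels below
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: web
+    spec:
+      containers:
+        - name: web
+          image: registry.example.com/web:v1.0.0
+          ports:
+            - name: http
+              containerPort: 8080
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: web
+spec:
+  selector:
+    app.kubernetes.io/name: web        # if this differs from the pod labels, the Service has zero endpoints
+  ports:
+    - name: http
+      port: 80
+      targetPort: http                 # refers to the named container port above
+```
+After applying, `kubectl get endpoints web` is one way to confirm the Service actually selected pods; an empty endpoint list points to a label or port mismatch.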
+---
+## Step 7: Output Contract
+Every KubeShark response ends with a structured output contract containing five sections:
+| Section | Purpose |
+|---------|---------|
+| **Assumptions and cluster version floor** | States what was assumed about the cluster, distribution, and environment so the user can verify |
+| **Selected failure modes** | Lists which of the 6 failure modes were diagnosed as relevant |
+| **Chosen remediation and tradeoffs** | Explains what was recommended and what was explicitly traded off |
+| **Validation/test plan** | Provides the specific commands and checks to verify the output |
+| **Rollback/recovery notes** | Describes how to undo the changes if something goes wrong -- `kubectl rollout undo`, revision history, data safety considerations |
+The output contract makes every response auditable. A reviewer can check whether the assumptions match reality, whether the right failure modes were identified, and whether the rollback path is viable -- all before applying anything to the cluster.

package/skills/deploy-k8s/docs/examples/bad-patterns.md ADDED

@@ -0,0 +1,47 @@
+# Bad Patterns -- Common LLM Anti-Patterns
+These eight anti-patterns represent manifests that LLMs frequently generate. Each one is valid YAML and looks reasonable, but has serious issues in production. The danger is that Kubernetes accepts most of them without error, so the failure surfaces only at runtime or under load.
+For full annotated YAML with detailed explanations of what is wrong in each case, see [references/examples-bad.md](https://github.com/LukasNiessen/kubernetes-skill/blob/main/references/examples-bad.md).
+## 1. Deployment Running as Root with No Security Context
+No `securityContext` at pod or container level -- container runs as root by default. Missing `runAsNonRoot`, `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem`, capabilities drop, seccomp profile. Also lacks resource requests, probes, and standard labels. Uses `:latest` tag.
+**Failure modes:** Insecure workload defaults, resource starvation, fragile rollouts.
+## 2. Service with Selector That Matches No Pods
+Service selector includes `version: v1` but pods have `version: v2`. Kubernetes does not warn about selector mismatches -- the Service silently has zero endpoints. A frequent LLM mistake when updating version labels on the Deployment without updating the Service.
+## 3. ClusterRoleBinding with cluster-admin for a Single-Namespace App
+Binds a single-namespace application ServiceAccount to `cluster-admin`, granting unrestricted access to the entire cluster. If the service account token is compromised, the attacker owns every namespace, every resource, every verb.
+**Failure mode:** Privilege sprawl.
+## 4. Liveness Probe Checking External Database
+Liveness probe depends on `pg_isready` against an external database. If the database is briefly unavailable, the probe fails on every API pod at once and the kubelet restarts them all, turning a database blip into a cascading failure: thundering-herd reconnects and further overload.
+## 5. Deployment with :latest Tag and No imagePullPolicy
+Uses the mutable `:latest` tag. Different nodes may pull different versions, causing inconsistent behavior across replicas. Rollbacks are impossible because every revision points to the same tag.
+## 6. Ingress Using Removed API Version
+Uses `extensions/v1beta1` (removed in Kubernetes 1.22) with the deprecated `kubernetes.io/ingress.class` annotation, old backend syntax (`serviceName`/`servicePort`), and missing `pathType`. LLMs frequently generate this because training data contains many examples of the old API.
+**Failure mode:** API drift.
+## 7. Secret Data in a ConfigMap
+Stores database passwords, API keys, and AWS credentials in a ConfigMap instead of a Secret. ConfigMaps are stored unencrypted in etcd and appear in plain text in `kubectl describe`, logs, and version control.
+## 8. PVC with ReadWriteMany on an Unsupported Provider
+Requests `ReadWriteMany` access mode with a `gp3` (AWS EBS) storage class. EBS volumes only support `ReadWriteOnce`. The PVC will be stuck in `Pending` state with no clear error. LLMs frequently pair RWX with block storage classes because they do not track provider-specific storage capabilities.
+---
+Each anti-pattern maps to one or more of KubeShark's six failure modes. The reference file includes the exact broken YAML so you can study the specific mistakes and understand why Kubernetes does not catch them at admission time.

package/skills/deploy-k8s/docs/examples/do-dont-checklist.md ADDED

@@ -0,0 +1,37 @@
+# Do/Don't Quick Reference Checklist
+A terse, actionable checklist of Kubernetes best practices organized by category. Each line is a standalone rule. The default security posture is the PSS restricted profile.
+For the full checklist with every rule, see [references/do-dont-patterns.md](https://github.com/LukasNiessen/kubernetes-skill/blob/main/references/do-dont-patterns.md).
+## Categories Covered
+The checklist spans nine categories that map directly to KubeShark's failure modes:
+| Category | Key concern |
+|---|---|
+| **Security Contexts** | runAsNonRoot, capabilities, seccomp, read-only filesystem |
+| **RBAC** | Namespace-scoped roles, least-privilege verbs, no wildcards |
+| **Resource Management** | Requests/limits, ResourceQuota, LimitRange, QoS class |
+| **Networking** | Default-deny NetworkPolicy, DNS egress, ingressClassName |
+| **Probes and Rollouts** | Readiness/liveness separation, revision history, zero-downtime |
+| **Image Management** | Immutable tags, imagePullPolicy, private registry secrets |
+| **Storage** | Access mode vs storage class, volumeClaimTemplates, no hostPath |
+| **Configuration** | Secrets not ConfigMaps, ExternalSecrets, hash-based naming |
+| **Namespaces and Isolation** | PSA labels, ResourceQuota per namespace, trust boundaries |
+## How to Use
+Use this checklist as a final review pass before applying any manifest to a cluster. Each DO/DON'T rule is self-contained -- you can check them individually without reading the surrounding context. The checklist is designed for both human review and LLM self-verification during manifest generation.
+## Relationship to Failure Modes
+The categories map directly to KubeShark's six named failure modes:
+- **Security Contexts, RBAC** -- insecure workload defaults, privilege sprawl
+- **Resource Management** -- resource starvation
+- **Networking** -- network exposure
+- **Probes and Rollouts, Image Management** -- fragile rollouts
+- **Storage, Configuration, Namespaces** -- cross-cutting concerns that affect multiple failure modes
+Every rule in the checklist exists because it prevents a specific, observed failure pattern. No generic advice is included unless it maps to a real failure mode.

package/skills/deploy-k8s/docs/examples/good-patterns.md ADDED

@@ -0,0 +1,49 @@
+# Good Patterns -- Production-Ready Examples
+These eight patterns demonstrate production-ready Kubernetes manifests that follow the PSS restricted profile, include proper labels, and set explicit resource constraints. Each pattern is annotated with key points explaining why specific choices were made.
+For the full annotated YAML of every pattern below, see [references/examples-good.md](https://github.com/LukasNiessen/kubernetes-skill/blob/main/references/examples-good.md).
+## 1. Minimal Production Deployment
+A complete Deployment with full security context (pod-level and container-level), resource bounds, liveness and readiness probes, topology spread constraints, and standard `app.kubernetes.io/*` labels. Demonstrates the `readOnlyRootFilesystem` pattern with an emptyDir `/tmp` mount.
+**Key takeaway:** Both pod-level and container-level `securityContext` are required. Topology spread prevents all replicas landing on one node.
+## 2. Default-Deny NetworkPolicy
+A two-resource pattern: a blanket deny-all policy (empty `podSelector`) followed by a targeted allow policy. Demonstrates allowing specific ingress from an ingress controller, scoped egress to a database, and mandatory DNS egress to kube-dns.
+**Key takeaway:** Always allow DNS egress (UDP/TCP 53 to kube-dns) or name resolution breaks silently.
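+A minimal sketch of the deny-all plus DNS-allow pair, assuming a namespace called `production` and a cluster DNS deployment labeled `k8s-app: kube-dns` in `kube-system` (common, but worth verifying on your distribution). The ingress-controller and database allow rules are left out here.
+```yaml
+# Blanket deny: the empty podSelector matches every pod in the namespace.
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: default-deny-all
+  namespace: production              # hypothetical namespace
+spec:
+  podSelector: {}
+  policyTypes:
+    - Ingress
+    - Egress
+---
+# Targeted allow: DNS egress only, so name resolution keeps working.
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-dns-egress
+  namespace: production
+spec:
+  podSelector: {}
+  policyTypes:
+    - Egress
+  egress:
+    - to:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: kube-system
+          podSelector:
+            matchLabels:
+              k8s-app: kube-dns      # assumed label; check your DNS pods
+      ports:
+        - protocol: UDP
+          port: 53
+        - protocol: TCP
+          port: 53
+```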
+## 3. Scoped RBAC for CI Deployer
+Namespace-scoped Role and RoleBinding for a CI pipeline ServiceAccount. Only grants the specific verbs and resources needed for deployment -- no `delete`, no `cluster-admin`, no ClusterRoleBinding.
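+As a rough sketch of this pattern, assuming a hypothetical `ci-deployer` ServiceAccount in a `production` namespace that only needs to roll out Deployments (the exact resources and verbs depend on what your pipeline does):
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: ci-deployer                  # hypothetical name
+  namespace: production
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: ci-deployer
+  namespace: production
+rules:
+  - apiGroups: ["apps"]
+    resources: ["deployments"]
+    verbs: ["get", "list", "watch", "create", "update", "patch"]   # no delete, no wildcards
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding                    # namespace-scoped, not a ClusterRoleBinding
+metadata:
+  name: ci-deployer
+  namespace: production
+subjects:
+  - kind: ServiceAccount
+    name: ci-deployer
+    namespace: production
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: ci-deployer
+```
+One way to check the scope is `kubectl auth can-i delete deployments -n production --as=system:serviceaccount:production:ci-deployer`, which should answer `no` with the rules above.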
+## 4. CronJob with Lifecycle Controls
+A CronJob with `concurrencyPolicy: Forbid`, `startingDeadlineSeconds`, `activeDeadlineSeconds`, `ttlSecondsAfterFinished`, history limits, and proper security context. Demonstrates safe scheduled job configuration that prevents overlapping runs and auto-cleans completed pods.
+**Key takeaway:** `activeDeadlineSeconds` terminates Jobs that hang; `ttlSecondsAfterFinished` deletes finished Jobs together with their pods.
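+A sketch of where these fields sit, since they are split between the CronJob spec and the Job template spec. The name, schedule, image, and all numeric values are illustrative assumptions.
+```yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: nightly-report               # hypothetical name
+spec:
+  schedule: "0 2 * * *"
+  concurrencyPolicy: Forbid          # skip a run if the previous one is still active
+  startingDeadlineSeconds: 300       # give up on a run that cannot start within 5 minutes
+  successfulJobsHistoryLimit: 3
+  failedJobsHistoryLimit: 1
+  jobTemplate:
+    spec:
+      activeDeadlineSeconds: 1800    # terminate a Job that runs longer than 30 minutes
+      ttlSecondsAfterFinished: 3600  # delete the finished Job and its pods after an hour
+      backoffLimit: 2
+      template:
+        spec:
+          restartPolicy: Never
+          securityContext:
+            runAsNonRoot: true
+            seccompProfile:
+              type: RuntimeDefault
+          containers:
+            - name: report
+              image: registry.example.com/report:v1.0.0   # placeholder image
+              securityContext:
+                allowPrivilegeEscalation: false
+                readOnlyRootFilesystem: true
+                capabilities:
+                  drop: ["ALL"]
+```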
+## 5. Ingress with TLS and Path-Based Routing
+Uses the current `networking.k8s.io/v1` API with `ingressClassName` (not the deprecated annotation), TLS configuration, and path-based routing with explicit `pathType`. More specific paths listed first.
+## 6. HPA with Scale-Down Stabilization
+An HPA using `autoscaling/v2` with separate scale-up and scale-down behaviors. Scale-down is conservative (300s stabilization window, 25% per minute limit) while scale-up is aggressive. Targets both CPU and memory utilization.
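+A sketch of such an HPA. The 300s window and 25% per minute scale-down limit follow the description above; the target name, replica bounds, utilization targets, and scale-up policy are assumptions chosen for illustration.
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: web                          # hypothetical target
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: web
+  minReplicas: 3
+  maxReplicas: 12
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+    - type: Resource
+      resource:
+        name: memory
+        target:
+          type: Utilization
+          averageUtilization: 80
+  behavior:
+    scaleUp:
+      stabilizationWindowSeconds: 0  # react immediately to load
+      policies:
+        - type: Percent
+          value: 100                 # may double the replica count every 15 seconds
+          periodSeconds: 15
+    scaleDown:
+      stabilizationWindowSeconds: 300
+      policies:
+        - type: Percent
+          value: 25                  # remove at most 25% of replicas per minute
+          periodSeconds: 60
+```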
+## 7. Namespace with Quota, LimitRange, and PSA Labels
+A complete namespace setup: PSA labels enforcing the restricted profile, a ResourceQuota capping total resource consumption, and a LimitRange providing defaults and bounds for containers that omit resource specs.
+## 8. ExternalSecret for Vault Integration
+Namespace-scoped SecretStore with Vault backend using Kubernetes auth, and an ExternalSecret that syncs credentials with a refresh interval. Demonstrates `deletionPolicy: Retain` to prevent accidental secret loss.
+**Key takeaway:** Use namespace-scoped SecretStore (not ClusterSecretStore) unless multiple namespaces genuinely share the same Vault path.
+---
+Each of these patterns addresses one or more of KubeShark's six named failure modes. Use them as starting points and adapt to your cluster's specific requirements.

package/skills/deploy-k8s/docs/failure-modes/api-drift.md ADDED

@@ -0,0 +1,104 @@
+# FM6: API Drift
+Kubernetes follows a strict API deprecation lifecycle: beta APIs are introduced, stable APIs replace them, and beta APIs are eventually removed. LLMs hallucinate removed API versions more than any other type of Kubernetes error because their training data contains years of blog posts, tutorials, and Stack Overflow answers written for older cluster versions.
+## The Deprecation Lifecycle
+Every API migration follows the same pattern:
+1. **Beta API introduced** -- a new resource or feature enters as `v1beta1` under an API group.
+2. **Stable API introduced** -- the resource graduates to `v1`. The beta version is deprecated in the same release or shortly after.
+3. **Beta API removed** -- typically 2-3 minor versions after deprecation, per the Kubernetes deprecation policy. From this point, the API server rejects manifests using the old version with a hard error.
+"Deprecated" means the API still works but prints a warning. "Removed" means it fails. LLMs do not distinguish between these states.
+## Major Migrations LLMs Get Wrong
+### Ingress: `extensions/v1beta1` to `networking.k8s.io/v1`
+Removed in Kubernetes 1.22. This is the most frequently hallucinated API version because Ingress existed as a beta for years (1.1 through 1.21) and generated enormous amounts of training data.
+The structural changes in v1 are not just a version swap:
+- `spec.backend` renamed to `spec.defaultBackend`.
+- `serviceName` and `servicePort` (flat fields) replaced by `service.name` and `service.port.number` (nested).
+- `pathType` is required on every path -- it was optional in beta.
+- `ingressClassName` replaces the `kubernetes.io/ingress.class` annotation.
+An LLM that generates `extensions/v1beta1` will also use the old field structure, compounding the error.
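+For comparison, a minimal `networking.k8s.io/v1` Ingress that exercises each of the changed fields. The host, Service names, ports, and the `nginx` class are placeholders, not values from the KubeShark references.
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: web
+spec:
+  ingressClassName: nginx            # replaces the kubernetes.io/ingress.class annotation
+  defaultBackend:                    # was spec.backend in the beta API
+    service:
+      name: web
+      port:
+        number: 80
+  rules:
+    - host: app.example.com
+      http:
+        paths:
+          - path: /api
+            pathType: Prefix         # required on every path in v1
+            backend:
+              service:               # was serviceName / servicePort
+                name: api
+                port:
+                  number: 8080
+```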
+### PodDisruptionBudget: `policy/v1beta1` to `policy/v1`
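+Removed in Kubernetes 1.25. The notable behavioral change in v1: an empty `spec.selector` (`{}`) selects all pods in the namespace, whereas in `policy/v1beta1` it selected none. Later releases of `policy/v1` also add `spec.unhealthyPodEvictionPolicy`. LLMs frequently generate `policy/v1beta1` because PDB examples in training data predate 1.25.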
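+A minimal `policy/v1` PodDisruptionBudget, as a sketch. The name, label, and `minAvailable` value are assumptions; the selector is written out explicitly so the budget applies only to the intended pods.
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: web                          # hypothetical name
+spec:
+  minAvailable: 2                    # tune to the replica count and availability needs
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: web    # must match the workload's pod labels
+```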
+### HorizontalPodAutoscaler: `autoscaling/v2beta1` and `v2beta2` to `autoscaling/v2`
+`v2beta1` removed in 1.25, `v2beta2` removed in 1.26. The key structural change: `targetAverageUtilization` (a flat field under `resource` in `v2beta1`) became `target.averageUtilization` (nested under `target`) in `v2beta2` and `v2`. LLMs mix beta and stable field structures unpredictably.
+### Other Removed APIs
+| Resource | Old API | Stable API | Removed in |
+|---|---|---|---|
+| CronJob | `batch/v1beta1` | `batch/v1` | 1.25 |
+| EndpointSlice | `discovery.k8s.io/v1beta1` | `discovery.k8s.io/v1` | 1.25 |
+| CSIDriver | `storage.k8s.io/v1beta1` | `storage.k8s.io/v1` | 1.22 |
+| FlowSchema | `flowcontrol.apiserver.k8s.io/v1beta1` | `flowcontrol.apiserver.k8s.io/v1` | 1.26 |
+## Schema Validation
+There are two levels of manifest validity, and LLM-generated manifests can fail at either:
+- **Structural validity:** Does the YAML conform to the schema for this API version? Caught by `kubeconform` or `--dry-run=server`. Wrong field names, wrong nesting, unknown fields.
+- **Semantic validity:** Does the manifest make sense in context? Does the referenced Service exist? Is the port correct? Caught only at apply time or with policy tools.
+`kubeconform` validates manifests against the OpenAPI schema for a specific Kubernetes version. Always pin the version to match your target cluster:
+```bash
+kubeconform -kubernetes-version 1.30.0 -strict manifests/
+```
+The `-strict` flag rejects unknown fields, which catches the common case where an LLM generates fields from one API version in a manifest tagged with a different version.
+## Helm-Specific Drift
+Helm templates can produce syntactically valid YAML that uses the wrong API version. The template renders without error, but the install or `kubectl apply` fails on the cluster. Use `.Capabilities.APIVersions` to branch on which API versions the cluster actually serves:
+{% raw %}
+```yaml
+{{- if .Capabilities.APIVersions.Has "networking.k8s.io/v1" }}
+apiVersion: networking.k8s.io/v1
+{{- else }}
+apiVersion: networking.k8s.io/v1beta1
+{{- end }}
+```
+{% endraw %}
+Another common Helm drift error: broken Go template expressions that fail silently. {% raw %}`{{ .Values.replicas }}`{% endraw %} evaluates to empty (not an error) if `replicas` is not defined in `values.yaml`. Always use defaults: {% raw %}`{{ .Values.replicas | default 3 }}`{% endraw %}.
+## Kustomize-Specific Drift
+Kustomize `patches` entries can select resources through a `target` with `group`, `version`, `kind`, and `name`. If the target's API group or version does not match the resource, the selector matches nothing and the patch is skipped, typically without an error or warning, leaving unpatched output.
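+A sketch of a `kustomization.yaml` patch whose `target` matches an `apps/v1` Deployment. The file name `deployment.yaml`, the resource name `web`, and the patched value are hypothetical.
+```yaml
+# kustomization.yaml
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+  - deployment.yaml                  # assumed to contain an apps/v1 Deployment named web
+patches:
+  - target:
+      group: apps                    # must match the resource's API group
+      version: v1                    # and its version
+      kind: Deployment
+      name: web
+    patch: |-
+      - op: replace
+        path: /spec/replicas
+        value: 3
+```
+Diffing the output of `kubectl kustomize .` against the unpatched base is a simple way to confirm the patch actually landed.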
+## What LLMs Get Wrong
+1. **`extensions/v1beta1` for Ingress.** Removed since 1.22, but still the most common LLM-generated Ingress API version.
+2. **Beta HPA API versions.** Mixing `autoscaling/v2beta1` field structures with `autoscaling/v2` API version, or vice versa.
+3. **Flat Ingress backend fields.** Using `serviceName`/`servicePort` instead of the nested `service.name`/`service.port.number` structure.
+4. **Missing `pathType` on Ingress paths.** Required in `networking.k8s.io/v1` but optional in beta. LLMs trained on beta examples omit it.
+5. **`batch/v1beta1` for CronJob.** Removed since 1.25, but CronJob tutorials from the beta era are abundant in training data.
+6. **No schema validation in the workflow.** LLMs generate manifests without suggesting validation, so errors are discovered only at deploy time.
+## Prevention
+The most effective defense against API drift is automated validation in the CI pipeline:
+1. **`kubeconform`** with `-strict` and `-kubernetes-version` matching the target cluster.
+2. **`pluto`** scans manifests, Helm charts, and running clusters for deprecated and removed APIs.
+3. **`--dry-run=server`** validates against the live API server schema, catching CRD and admission webhook issues that offline tools miss.
+Run all three: kubeconform in CI, pluto as a pre-commit check, and dry-run=server in the deployment pipeline.
+## Further Reading
+- [Kubernetes Deprecation Policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/)
+- [API Migration Guide](https://kubernetes.io/docs/reference/using-api/deprecation-guide/)
+- [KubeShark Validation and Policy Guide](../guides/validation-and-policy.md)

package/skills/deploy-k8s/docs/failure-modes/fragile-rollouts.md ADDED

@@ -0,0 +1,99 @@
+# FM5: Fragile Rollouts
+A bad rollout is worse than no rollout. Misconfigured probes, mutable image tags, and missing graceful shutdown logic turn routine deployments into outages. Fragile rollouts are the failure mode most likely to cause user-facing downtime because they activate during the exact moment the system is changing.
+## The Three Probe Types
+Kubernetes provides three probes, each with a distinct purpose. Confusing them is the leading cause of cascading failures:
+- **Liveness probe:** "Is the process alive?" If it fails, the kubelet kills and restarts the container. This probe must check only the process itself -- never external dependencies.
+- **Readiness probe:** "Can this pod serve traffic?" If it fails, the pod is removed from Service endpoints. This is where dependency checks belong -- if the database is down, the pod should stop receiving requests but should not be killed.
+- **Startup probe:** "Has initialization finished?" Disables liveness and readiness checks until it succeeds. Required for applications with slow startup (JVM warmup, ML model loading, large cache priming).
+## Cascading Failure From Liveness Probes
+The single most dangerous rollout misconfiguration is a liveness probe that checks an external dependency. When the database goes down:
+1. The liveness probe fails on all pods simultaneously.
+2. The kubelet restarts all pods.
+3. Pods restart, database is still down, liveness fails again.
+4. The entire service enters `CrashLoopBackOff` with exponential backoff.
+5. When the database recovers, the service takes minutes to recover because of the backoff timer.
+If the liveness probe only checked "is the main thread responsive?", the pods would have stayed up and resumed serving immediately when the database returned. The readiness probe would have removed them from traffic in the meantime.
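+A sketch of the probe split this implies, assuming the application exposes a process-only `/healthz` endpoint and a dependency-aware `/ready` endpoint on a port named `http`. The image, port, and timing values are illustrative.
+```yaml
+containers:
+  - name: api
+    image: registry.example.com/api:v2.4.1   # placeholder image
+    ports:
+      - name: http
+        containerPort: 8080
+    startupProbe:                    # holds off liveness and readiness until startup completes
+      httpGet:
+        path: /healthz
+        port: http
+      periodSeconds: 5
+      failureThreshold: 24           # 24 x 5s allows up to 120s of startup
+    livenessProbe:                   # process health only, no external dependencies
+      httpGet:
+        path: /healthz
+        port: http
+      periodSeconds: 10
+      failureThreshold: 3
+    readinessProbe:                  # dependency checks belong here
+      httpGet:
+        path: /ready
+        port: http
+      periodSeconds: 5
+      failureThreshold: 3
+```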
+## The `:latest` Tag Trap
+Using `:latest` as an image tag introduces three problems:
+1. **Nondeterminism:** Nodes that pulled `:latest` at different times may hold different images under the same tag. After a rollout, some pods run version A and others run version B, depending on what each node has cached.
+2. **Impossible rollbacks:** `kubectl rollout undo` re-deploys the same `:latest` tag, which may now point to a newer (broken) image.
+3. **Silent drift:** No change is detected by the Deployment controller because the tag has not changed, even though the image content has.
+With `imagePullPolicy: IfNotPresent` (the default for non-`:latest` tags), nodes use cached images. With `:latest`, the default policy is `Always`, but some environments override this, creating inconsistent behavior.
+The fix: always use immutable tags -- semantic versions (`v2.4.1`), git SHAs, or digests (`@sha256:...`).
+## Rolling Update Strategy
+Two `strategy.rollingUpdate` fields, together with `spec.minReadySeconds`, control how pods are replaced during an update:
+- **`maxSurge`**: How many extra pods above the desired count during the update. Higher values speed up rollouts but consume more resources.
+- **`maxUnavailable`**: How many pods can be unavailable during the update. Set to `0` for zero-downtime deployments (requires `maxSurge >= 1`).
+- **`minReadySeconds`** (set directly on the Deployment `spec`, not under `rollingUpdate`): How long a new pod must be Ready before it counts as Available. Catches pods that start successfully but crash shortly after (e.g., failing to connect to a dependency after initialization).
+For critical services, use `maxSurge: 1, maxUnavailable: 0`. This ensures capacity never drops below the desired count during a rollout.
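+As a fragment of a Deployment `spec`, that recommendation might look like the following. The replica count, `minReadySeconds`, and `revisionHistoryLimit` values are assumptions, and `selector` and `template` are omitted.
+```yaml
+spec:
+  replicas: 3
+  minReadySeconds: 10                # sits on the Deployment spec, not under rollingUpdate
+  revisionHistoryLimit: 5            # keeps old ReplicaSets around for rollout undo
+  strategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxSurge: 1                    # one extra pod at a time
+      maxUnavailable: 0              # never drop below the desired count
+```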
+## Graceful Shutdown
+When Kubernetes terminates a pod, two things happen in parallel:
+1. The pod is removed from Service endpoints (asynchronous).
+2. The container receives `SIGTERM`.
+Because endpoint removal is asynchronous, the pod may still receive traffic for several seconds after `SIGTERM`. Without a `preStop` hook, the application begins shutting down while requests are still arriving, causing dropped connections and 502 errors.
+The fix is a `preStop` sleep of 3-5 seconds to allow endpoint propagation before the application begins its shutdown sequence:
+```yaml
+lifecycle:
+  preStop:
+    exec:
+      command: ["sh", "-c", "sleep 5"]
+```
+Set `terminationGracePeriodSeconds` to a value that exceeds the preStop sleep plus the application's drain time. The default of 30 seconds is often insufficient for applications with long-lived connections.
+## Init Containers for Dependency Waiting
+Dependencies should be waited on in init containers, not liveness probes. An init container blocks pod startup until the dependency is available, then exits. This keeps the probe system focused on runtime health, not startup prerequisites:
+```yaml
+initContainers:
+  - name: wait-for-db
+    image: busybox:1.36
+    command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]
+```
+## What LLMs Get Wrong
+1. **Liveness probe checking database connectivity.** The number one cause of cascading outages. Liveness should check only process health.
+2. **Same endpoint for liveness and readiness.** These probes have different purposes and should hit different endpoints (`/healthz` for liveness, `/ready` for readiness).
+3. **No startup probe for slow applications.** JVM apps, Python ML services, and applications loading large datasets need 60-120+ seconds to start. Without a startup probe, the liveness probe kills them during initialization.
+4. **`failureThreshold: 1` on liveness.** A single blip (GC pause, network hiccup) kills the pod. Use at least 3.
+5. **`:latest` tag with no registry prefix.** `image: myapp:latest` with no registry means the kubelet looks in the default registry, which varies by runtime configuration.
+6. **Missing `preStop` hook.** Traffic arrives after SIGTERM, causing dropped connections.
+7. **`maxUnavailable` too high.** With 3 replicas and `maxUnavailable: 2`, only 1 pod serves traffic during rollout -- a single failure causes a complete outage.
+## Real-World Impact
+- **Cloudflare outage (2019):** A misconfigured health check caused a cascading restart of edge proxies across multiple data centers, resulting in a global 30-minute outage.
+- **GitLab incident (2021):** A canary deployment with no readiness probe sent traffic to pods still loading their configuration, causing elevated error rates for 45 minutes.
+- **Shopify Black Friday (2020):** Aggressive liveness probes combined with database latency caused pod restarts during peak traffic, requiring manual intervention to stabilize.
+Rollout fragility is entirely preventable. Every field -- probes, strategy, shutdown hooks, image tags -- has a correct configuration that eliminates the corresponding failure mode.
+## Further Reading
+- [Configure Liveness, Readiness, and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
+- [KubeShark Good Patterns](../examples/good-patterns.md)
+- [KubeShark Bad Patterns](../examples/bad-patterns.md)

package/skills/deploy-k8s/docs/failure-modes/insecure-workload-defaults.md ADDED

@@ -0,0 +1,80 @@
+# FM1: Insecure Workload Defaults
+Kubernetes does not ship with secure defaults. Pods created without explicit security contexts run as whatever user the image specifies (root for most images), keep the container runtime's default Linux capabilities, and have writable root filesystems. This is the single most impactful failure mode for LLM-generated manifests because training data overwhelmingly consists of insecure examples.
+## Why This Matters
+OWASP Kubernetes Top Ten ranks "Insecure Workload Configurations" as **K01** -- the number one risk. A compromised container running as root with full capabilities can escape to the host node, access the cloud metadata service, pivot to other workloads, and exfiltrate secrets. Every missing security control compounds the blast radius.
+## Security Context: Pod-Level vs Container-Level
+Kubernetes splits security settings across two scopes, and both must be configured:
+- **Pod-level** (`spec.securityContext`): applies to all containers including init containers. This is where `runAsNonRoot`, `runAsUser`, `runAsGroup`, `fsGroup`, and `seccompProfile` belong.
+- **Container-level** (`spec.containers[].securityContext`): per-container overrides. This is where `allowPrivilegeEscalation`, `readOnlyRootFilesystem`, and `capabilities` belong.
+Omitting either level leaves gaps. A pod-level `runAsNonRoot: true` without container-level `capabilities.drop: [ALL]` still retains dangerous capabilities like `CAP_NET_RAW` (used for ARP spoofing and network sniffing within the cluster).
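+The split between the two scopes can be sketched as follows. This is a minimal illustration, not a complete manifest: the pod name, image, and UID are placeholders chosen for the example.
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: scoped-example                        # hypothetical name
+spec:
+  securityContext:                            # pod-level: applies to every container
+    runAsNonRoot: true
+    runAsUser: 10001                          # arbitrary non-zero UID, for illustration
+    runAsGroup: 10001
+    fsGroup: 10001
+    seccompProfile:
+      type: RuntimeDefault
+  containers:
+    - name: app
+      image: registry.example.com/app:1.4.2   # placeholder image
+      securityContext:                        # container-level: repeated per container
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop:
+            - ALL
+```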
+## Pod Security Standards
+Pod Security Admission (PSA) is the built-in admission controller that evaluates pods against the three Pod Security Standards profiles:
+| Profile | Purpose | Typical use |
+|---|---|---|
+| **Restricted** | Strongest built-in hardening: non-root, drop all capabilities, no privilege escalation, seccomp required | All application workloads (the KubeShark default) |
+| **Baseline** | Prevents known privilege escalations but allows running as root | Legacy apps that cannot run as non-root |
+| **Privileged** | No restrictions at all | CNI plugins, CSI drivers, node-level agents only |
+PSA is configured per namespace through labels. A namespace without these labels falls back to the cluster-wide default, which is `privileged` unless an administrator has changed it -- pods then run with whatever the manifest specifies, including fully privileged.
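+As a sketch, a namespace that enforces the restricted profile could be labeled like this. The namespace name is hypothetical; the `pod-security.kubernetes.io/*` labels are the standard PSA labels.
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: payments                                          # hypothetical namespace
+  labels:
+    pod-security.kubernetes.io/enforce: restricted        # reject violating pods
+    pod-security.kubernetes.io/enforce-version: latest
+    pod-security.kubernetes.io/warn: restricted           # warn the client on violations
+    pod-security.kubernetes.io/audit: restricted          # record violations in the audit log
+```
+Starting with only `warn` and `audit` on an existing namespace is a common way to find violations before switching `enforce` on.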
+## Capabilities and Privilege Escalation
+Linux capabilities grant fine-grained privileges. The default Docker/containerd capability set includes `CAP_NET_RAW`, `CAP_SETUID`, `CAP_SETGID`, and others that attackers exploit for container escapes. The hardened baseline is:
+```yaml
+securityContext:
+  capabilities:
+    drop:
+      - ALL
+```
+If a workload genuinely needs a specific capability (e.g., `NET_BIND_SERVICE` to bind port 443), add only that one capability back. Never leave the default set in place.
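+Adding a single capability back might look like the sketch below, shown for the `NET_BIND_SERVICE` case mentioned above.
+```yaml
+# Container-level securityContext (illustrative)
+securityContext:
+  capabilities:
+    drop:
+      - ALL
+    add:
+      - NET_BIND_SERVICE   # the only capability restored
+```
+When the container runs as a non-root user, the added capability may not take effect unless the binary also carries the matching file capability. Listening on a high port and mapping it to 443 in the Service avoids the need for it altogether.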
+The `allowPrivilegeEscalation: false` field is equally critical. Without it, a process inside the container can gain more privileges than its parent process, for example through setuid binaries. The field exists only in the container-level `securityContext`; the pod-level `securityContext` has no such field, so placing it there protects no container.
+## Host Namespace Access
+Setting `hostNetwork`, `hostPID`, or `hostIPC` to `true` removes the corresponding isolation boundary between the container and the node. `hostNetwork` places the pod in the node's network namespace, where NetworkPolicy generally cannot be applied to it. `hostPID` lets the container see and, with sufficient privileges, signal every process on the node. `hostIPC` shares the node's IPC namespace, including shared memory. These fields must be `false` (the default) for all application workloads.
+## AppArmor and Seccomp
+Seccomp restricts which system calls a container can make. The `RuntimeDefault` profile applies the container runtime's default filter, which blocks rarely needed, high-risk syscalls such as `mount` and `kexec_load` (the exact list depends on the runtime) while allowing normal application behavior. Under PSS restricted, `seccompProfile.type` must be `RuntimeDefault` or `Localhost`; setting it once at the pod level covers every container.
+AppArmor provides mandatory access control that complements seccomp. Kubernetes 1.30 introduced a first-class field (`securityContext.appArmorProfile`), which reached general availability in 1.31 and replaces the older annotation-based approach (`container.apparmor.security.beta.kubernetes.io/<name>`). For clusters running 1.30+, use the native field. For older clusters, use the annotation. LLMs frequently mix these two approaches in the same manifest.
+Custom seccomp profiles (`type: Localhost`) can further restrict syscall access beyond `RuntimeDefault`, but require the profile to be available on every node. Use `RuntimeDefault` as the starting point unless specific workload requirements demand a custom profile.
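+For a 1.30+ cluster, the two mechanisms could be combined as in the sketch below. The container name and image are placeholders.
+```yaml
+# Pod spec fragment for Kubernetes 1.30+ (names are hypothetical)
+spec:
+  securityContext:
+    seccompProfile:
+      type: RuntimeDefault          # pod-level: covers all containers
+  containers:
+    - name: app
+      image: registry.example.com/app:1.4.2
+      securityContext:
+        appArmorProfile:
+          type: RuntimeDefault      # native field, replaces the annotation
+```
+On older clusters the equivalent is the pod annotation `container.apparmor.security.beta.kubernetes.io/app: runtime/default`, where `app` is the container name. Use one form or the other, not both.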
+## What LLMs Get Wrong
+LLMs reproduce patterns from their training data, which is dominated by quickstart guides and blog posts without security hardening. The most frequent errors:
+1. **Omitting security context entirely.** The most common mistake. The generated manifest has no `securityContext` at either level.
+2. **Setting `runAsNonRoot` but not `runAsUser`.** The kubelet verifies the effective user when it starts the container -- if the image runs as root (UID 0) or declares a non-numeric `USER`, the container fails to start with a `CreateContainerConfigError` that is easy to misread.
+3. **Dropping capabilities partially.** Dropping `SYS_ADMIN` but not all capabilities still leaves `NET_RAW`, `SETUID`, and others.
+4. **Forgetting init containers.** Security context on main containers but not init containers leaves a privilege escalation window during pod startup.
+5. **Confusing pod-level and container-level fields.** Putting `allowPrivilegeEscalation` at the pod level (where no such field exists) instead of the container level.
+6. **Missing `readOnlyRootFilesystem`.** Without it, an attacker can write binaries into the container filesystem. Combine with `emptyDir` mounts for `/tmp` and any other write paths.
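+Items 4 and 6 are the ones most often missed together. The fragment below is one way to address both, assuming a hypothetical `migrate` init container and an application that only needs to write to `/tmp`. All names, images, and the UID are placeholders.
+```yaml
+# Pod spec fragment; names, images, and the UID are hypothetical
+spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 10001
+    seccompProfile:
+      type: RuntimeDefault
+  initContainers:
+    - name: migrate
+      image: registry.example.com/migrate:1.4.2
+      securityContext:              # init containers get the same hardening
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop: [ALL]
+  containers:
+    - name: app
+      image: registry.example.com/app:1.4.2
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop: [ALL]
+      volumeMounts:
+        - name: tmp
+          mountPath: /tmp           # writable scratch space on a read-only root
+  volumes:
+    - name: tmp
+      emptyDir: {}
+```
+Each additional write path the application needs gets its own `emptyDir` mount rather than a writable root filesystem.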
+## Real-World Impact
+- **Tesla cryptojacking (2018):** Kubernetes dashboard exposed without authentication, pods deployed with no security context, cryptominers ran as root on GPU nodes.
+- **Shopify bug bounty (2020):** A container escape via `CAP_SYS_ADMIN` in a pod that did not drop capabilities, granting access to the underlying node.
+- **Capital One breach (2019):** While not Kubernetes-specific, the pattern is identical -- overly permissive workload identity plus missing runtime restrictions enabled lateral movement from a single SSRF to full S3 access.
+The common thread: every breach was amplified by workloads running with more privileges than they needed. Secure defaults are not optional -- they are the primary defense against turning a single vulnerability into a cluster-wide compromise.
+## Further Reading
+- [OWASP Kubernetes Top Ten - K01](https://owasp.org/www-project-kubernetes-top-ten/)
+- [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+- [KubeShark Security Hardening Guide](../guides/security-hardening.md)