npm - @synapta/skills - Versions diffs - 0.1.0 → 0.1.2 - Mend

@synapta/skills 0.1.0 → 0.1.2

Files changed (353)

package/skills/deploy-k8s/docs/architecture/storage-and-state.md ADDED

@@ -0,0 +1,102 @@
+# Storage and State
+Misconfigured storage is one of the few Kubernetes failure modes that can cause irreversible data loss. Unlike compute issues (which usually resolve by restarting pods) or network issues (which resolve by fixing policies), deleting a PersistentVolumeClaim whose volume has `reclaimPolicy: Delete` destroys the underlying disk permanently. Every storage decision must account for data durability.
+## The PV/PVC Model
+Kubernetes abstracts storage through three resources:
+- **PersistentVolume (PV):** Represents a piece of provisioned storage -- a cloud disk, an NFS share, or a local SSD. PVs are cluster-scoped, not namespaced.
+- **PersistentVolumeClaim (PVC):** A namespaced request for storage. Specifies size, access mode, and StorageClass. The control plane binds the PVC to a PV that satisfies its requirements.
+- **StorageClass:** Defines how PVs are dynamically provisioned. Specifies the CSI driver, parameters (disk type, encryption), reclaim policy, and binding mode.
+Dynamic provisioning is the default workflow: a PVC references a StorageClass, the CSI driver provisions a volume, and the control plane creates a PV and binds it to the PVC automatically.
+## StorageClass: Critical Fields
+Two StorageClass defaults are dangerous for production data:
+**`reclaimPolicy: Delete`** (the default) destroys the underlying volume when the PVC is deleted. A single `kubectl delete pvc` command permanently deletes the data. Production StorageClasses must use `Retain`, which preserves the volume for manual recovery.
+**`volumeBindingMode: Immediate`** (the default) provisions the volume before a pod is scheduled. This can place the volume in a different availability zone than the pod, causing the pod to stay Pending indefinitely. `WaitForFirstConsumer` provisions the volume in the same zone as the pod.
+Always set `allowVolumeExpansion: true` so PVCs can be resized without recreation. PVCs can be expanded but never shrunk.
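+The three fields above can be combined in one StorageClass. The following is a minimal sketch, assuming the AWS EBS CSI driver; the name `retained-ssd` and the `type: gp3` parameter are illustrative assumptions, not values prescribed by this guide:
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: retained-ssd                       # hypothetical name
+provisioner: ebs.csi.aws.com               # assumption: EKS; substitute your CSI driver
+parameters:
+  type: gp3                                # assumption: driver-specific disk type
+reclaimPolicy: Retain                      # keep the disk when the PVC is deleted
+volumeBindingMode: WaitForFirstConsumer    # provision in the zone where the pod lands
+allowVolumeExpansion: true                 # allow PVCs to grow in place
+```
+With `Retain`, a released volume is not cleaned up automatically, so someone has to reclaim or delete it by hand once the data is confirmed safe.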
+## Access Modes
+| Mode | Abbreviation | Meaning | Supported by |
+|---|---|---|---|
+| `ReadWriteOnce` | RWO | One node mounts read-write | All block storage (EBS, PD, Azure Disk) |
+| `ReadOnlyMany` | ROX | Many nodes mount read-only | NFS, CephFS, cloud file storage |
+| `ReadWriteMany` | RWX | Many nodes mount read-write | NFS, CephFS, EFS, Azure Files |
+| `ReadWriteOncePod` | RWOP | Exactly one pod mounts read-write | CSI drivers supporting RWOP (1.29+ GA) |
+The most common mistake: requesting `ReadWriteMany` with a block storage provisioner. Block storage is physically attached to a single node and cannot support RWX. The PVC stays in `Pending` state with no clear error message. Use a file storage solution (EFS, Filestore, Azure Files) for shared access.
+For databases, prefer `ReadWriteOncePod` over `ReadWriteOnce`. RWO allows multiple pods on the same node to mount the volume, which can cause data corruption. RWOP restricts access to exactly one pod.
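+As an illustration, a database claim using RWOP might look like the sketch below. The claim name, StorageClass name, and size are assumptions chosen for the example, and the CSI driver must support `ReadWriteOncePod`:
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: postgres-data               # hypothetical name
+spec:
+  accessModes:
+    - ReadWriteOncePod              # exactly one pod may mount the volume
+  storageClassName: retained-ssd    # hypothetical StorageClass
+  resources:
+    requests:
+      storage: 20Gi                 # placeholder size
+```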
+## Dynamic Provisioning and CSI Drivers
+Each cloud provider and storage platform has a CSI driver:
+| Environment | Block storage CSI | File storage CSI |
+|---|---|---|
+| AWS EKS | `ebs.csi.aws.com` | `efs.csi.aws.com` |
+| GKE | `pd.csi.storage.gke.io` | `filestore.csi.storage.gke.io` |
+| Azure AKS | `disk.csi.azure.com` | `file.csi.azure.com` |
+| Bare metal | Longhorn, Rook-Ceph, OpenEBS | Rook-CephFS, NFS provisioner |
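+Most major block-storage CSI drivers support snapshots, volume expansion, and encryption; check the driver documentation for file storage, where support varies. Always enable encryption for production StorageClasses. The parameter name is driver-specific: the AWS EBS CSI driver, for example, uses `parameters.encrypted: "true"`.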
+## VolumeSnapshot for Backup and Restore
+VolumeSnapshots provide point-in-time copies of PVCs. They are the primary mechanism for data protection before destructive operations:
+```yaml
+apiVersion: snapshot.storage.k8s.io/v1
+kind: VolumeSnapshot
+metadata:
+  name: db-snapshot-2025-04-12
+spec:
+  volumeSnapshotClassName: csi-snapclass
+  source:
+    persistentVolumeClaimName: data-postgres-0
+```
+To restore, create a new PVC with `dataSource` referencing the snapshot. The CSI driver provisions a new volume from the snapshot data.
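+A restore claim for the snapshot above could be sketched as follows. The claim name, StorageClass name, and size are assumptions; the requested size must be at least the size of the snapshotted volume:
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: data-postgres-restored      # hypothetical name
+spec:
+  storageClassName: retained-ssd    # hypothetical; must use the same CSI driver as the snapshot
+  dataSource:
+    apiGroup: snapshot.storage.k8s.io
+    kind: VolumeSnapshot
+    name: db-snapshot-2025-04-12    # the snapshot created above
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 20Gi                 # placeholder; not smaller than the source volume
+```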
+Critical rules for snapshots:
+- Always snapshot before PVC deletion, StorageClass migration, or major upgrades.
+- Snapshots may be crash-consistent, not application-consistent. For databases, run logical backups (pg_dump, mysqldump) alongside snapshots.
+- Test restore procedures regularly. A backup never restored is not a backup.
+## Ephemeral Storage: emptyDir
+`emptyDir` volumes are tied to the pod lifecycle -- deleted when the pod is removed. Use them for scratch space, caches, and temporary files required by `readOnlyRootFilesystem: true`.
+Always set `sizeLimit` on emptyDir volumes. Without it, a runaway process can fill the node's disk and put the node under disk pressure, which triggers eviction of other pods on that node. For in-memory emptyDirs (`medium: Memory`), the size counts against the container's memory limit.
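+A minimal sketch of both emptyDir variants with size limits follows. The pod name, image, mount paths, and sizes are placeholders, and the security context is omitted for brevity:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: scratch-demo                          # hypothetical name
+spec:
+  containers:
+    - name: app
+      image: registry.example.com/app:1.0.0   # placeholder image
+      volumeMounts:
+        - name: tmp
+          mountPath: /tmp
+        - name: cache
+          mountPath: /var/cache/app
+  volumes:
+    - name: tmp
+      emptyDir:
+        sizeLimit: 256Mi                      # disk-backed scratch space, capped
+    - name: cache
+      emptyDir:
+        medium: Memory                        # tmpfs; usage counts against memory limits
+        sizeLimit: 64Mi
+```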
+## StatefulSet volumeClaimTemplates
+StatefulSets create one PVC per replica automatically. PVCs created by `volumeClaimTemplates` are intentionally not deleted when the StatefulSet is deleted or scaled down -- this protects data. To reclaim storage, delete the PVCs manually after verifying the data is no longer needed.
+The `persistentVolumeClaimRetentionPolicy` field (1.27+) can configure automatic PVC deletion on scale-down or StatefulSet deletion, but use it with extreme caution in production.
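+For reference, the field sits directly under the StatefulSet `spec`. The fragment below is a sketch that states the conservative behavior explicitly; the StatefulSet name is an assumption and the remaining required fields are omitted:
+```yaml
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: postgres                    # hypothetical name
+spec:
+  persistentVolumeClaimRetentionPolicy:
+    whenDeleted: Retain             # keep PVCs when the StatefulSet is deleted
+    whenScaled: Retain              # keep PVCs when replicas are scaled down
+  # serviceName, selector, template, and volumeClaimTemplates omitted
+```
+Setting either value to `Delete` is what enables automatic PVC removal, so that change deserves the same review as any other destructive storage operation.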
+## fsGroup and Permissions
+When running containers as non-root with `readOnlyRootFilesystem: true`, mounted PVCs may not be writable because the volume's filesystem ownership does not match the container's user. Set `fsGroup` in the pod security context to ensure the mounted volume is writable by the pod's group:
+```yaml
+securityContext:
+  runAsUser: 10000
+  runAsGroup: 10000
+  fsGroup: 10000
+```
+Without `fsGroup`, the pod mounts the volume but cannot write to it, causing application errors that appear to be permission issues inside the container.
+## Further Reading
+- [Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
+- [Storage Classes](https://kubernetes.io/docs/concepts/storage/storage-classes/)
+- [KubeShark Resource Starvation](../failure-modes/resource-starvation.md)

package/skills/deploy-k8s/docs/architecture/workload-patterns.md ADDED

@@ -0,0 +1,87 @@
+# Workload Patterns
+Kubernetes provides five workload resource types, each designed for a specific execution model. Choosing the wrong type forces workarounds that break update semantics, storage management, and scaling behavior. This guide provides a decision framework for selecting the right workload type.
+## Decision Matrix
+| Workload Type | Execution Model | Pod Identity | Storage | Scaling |
+|---|---|---|---|---|
+| **Deployment** | Long-running, stateless | Interchangeable (random suffix) | Shared or none | HPA, manual replicas |
+| **StatefulSet** | Long-running, stateful | Stable ordinal (0, 1, 2...) | Per-pod PVC via volumeClaimTemplates | Manual or custom |
+| **DaemonSet** | One pod per node | Per-node | hostPath or emptyDir | Automatic (node count) |
+| **Job** | Run-to-completion | Disposable | Temporary | completions + parallelism |
+| **CronJob** | Scheduled run-to-completion | Disposable | Temporary | schedule-driven |
+## Deployment
+**Use when:** Pods are interchangeable and need no stable identity or persistent local storage. Web servers, REST/gRPC APIs, microservices, frontend proxies, stateless queue workers.
+**Key considerations:**
+- Always set `replicas >= 2` for production with a PodDisruptionBudget.
+- Use `topologySpreadConstraints` to distribute across zones and nodes.
+- Pair with HPA for elastic scaling. Set `scaleDown.stabilizationWindowSeconds` to prevent flapping.
+- Never put `app.kubernetes.io/version` in `selector.matchLabels` -- selectors are immutable and this breaks upgrades.
+**Common mistake:** Using a Deployment with a RWO PersistentVolumeClaim and `replicas > 1`. Only one node can mount a RWO volume at a time, so replicas scheduled onto other nodes cannot attach it and hang in `ContainerCreating` with a Multi-Attach error. Use a StatefulSet with per-pod volumes or switch to RWX storage.
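+The PodDisruptionBudget and HPA considerations above can be sketched as two companion resources for a Deployment. The name `web`, the label, and every number here are assumptions for illustration only:
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: web                         # hypothetical name
+spec:
+  minAvailable: 1                   # keep at least one pod up during voluntary disruptions
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: web
+---
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: web
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: web
+  minReplicas: 2
+  maxReplicas: 10                   # placeholder ceiling
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70    # placeholder target
+  behavior:
+    scaleDown:
+      stabilizationWindowSeconds: 300   # wait before scaling down to avoid flapping
+```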
+## StatefulSet
+**Use when:** Pods need stable network identity (predictable DNS per pod), stable per-pod storage (PVC follows the pod across reschedules), or ordered deployment. Databases (PostgreSQL, MySQL), message brokers (Kafka, RabbitMQ), consensus systems (etcd, ZooKeeper).
+**Key considerations:**
+- Requires a headless Service (`clusterIP: None`) for per-pod DNS: `<pod>.<service>.<ns>.svc.cluster.local`.
+- `volumeClaimTemplates` create one PVC per pod. By default, PVCs are not auto-deleted on scale-down, which protects data.
+- `podManagementPolicy: OrderedReady` (default) creates pods sequentially. Use `Parallel` when pods initialize independently.
+- Set `terminationGracePeriodSeconds` to 60-120 seconds for databases. The default 30 seconds is often insufficient for a clean shutdown.
+**Common mistake:** Using a StatefulSet when a Deployment with a single PVC or an external database would suffice. If you only need storage (not per-pod identity), a Deployment is simpler. StatefulSets add operational complexity for ordered rollouts, scale-down behavior, and PVC lifecycle management.
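+The wiring between the headless Service, the StatefulSet, and its per-pod claims can be sketched as below. This only illustrates the structure: the names, image tag, and sizes are assumptions, and security context, resources, and database replication settings are omitted:
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: postgres
+spec:
+  clusterIP: None                   # headless: gives each pod its own DNS record
+  selector:
+    app.kubernetes.io/name: postgres
+  ports:
+    - name: sql
+      port: 5432
+---
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: postgres
+spec:
+  serviceName: postgres             # must match the headless Service
+  replicas: 3
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: postgres
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: postgres
+    spec:
+      terminationGracePeriodSeconds: 120
+      containers:
+        - name: postgres
+          image: postgres:16        # assumption: pin an exact version or digest in practice
+          ports:
+            - name: sql
+              containerPort: 5432
+          volumeMounts:
+            - name: data
+              mountPath: /var/lib/postgresql/data
+  volumeClaimTemplates:
+    - metadata:
+        name: data                  # yields PVCs data-postgres-0, data-postgres-1, ...
+      spec:
+        accessModes: ["ReadWriteOnce"]
+        resources:
+          requests:
+            storage: 20Gi           # placeholder size
+```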
+## DaemonSet
+**Use when:** Exactly one pod must run on every qualifying node. Log collectors (Fluent Bit, Vector), monitoring agents (node-exporter, Datadog), network plugins (Cilium), CSI node drivers, security agents (Falco).
+**Key considerations:**
+- DaemonSets have no `replicas` field. The DaemonSet controller creates one pod for every qualifying node automatically.
+- Resources are multiplied across every node. 100m CPU x 200 nodes = 20 CPU cores cluster-wide. Be conservative with requests.
+- Use `nodeSelector` or `nodeAffinity` to target specific node pools. Add tolerations for tainted nodes (control-plane, GPU).
+- Use a custom PriorityClass (not `system-node-critical`) for application-level agents.
+**Common mistake:** Specifying a `replicas` field. DaemonSets do not support it -- the API rejects the manifest.
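+A sketch of a node agent that follows the considerations above is shown here. The agent name, image, PriorityClass, and resource numbers are assumptions, and the referenced PriorityClass has to exist in the cluster:
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: log-agent                            # hypothetical name
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: log-agent
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: log-agent
+    spec:
+      priorityClassName: agent-priority      # hypothetical custom PriorityClass
+      nodeSelector:
+        kubernetes.io/os: linux              # restrict to the intended node pool
+      tolerations:
+        - key: node-role.kubernetes.io/control-plane
+          operator: Exists
+          effect: NoSchedule                 # also run on tainted control-plane nodes
+      containers:
+        - name: agent
+          image: registry.example.com/log-agent:1.0.0   # placeholder image
+          resources:
+            requests:
+              cpu: 50m                       # multiplied by every matching node
+              memory: 64Mi
+            limits:
+              memory: 128Mi
+```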
+## Job
+**Use when:** Work runs to completion and then stops. Database migrations, data exports, ETL pipelines, one-time scripts, ML training runs.
+**Key considerations:**
+- `restartPolicy` must be `Never` or `OnFailure`. The default `Always` is rejected by the API for Jobs.
+- Always set `activeDeadlineSeconds` to prevent runaway jobs.
+- Always set `ttlSecondsAfterFinished` to auto-clean completed Jobs and their pods.
+- Jobs may retry on failure. Every Job must be idempotent -- assume at-least-once execution.
+- Use `podFailurePolicy` (1.26+) to distinguish retryable from fatal errors.
+**Common mistake:** Using `restartPolicy: Always`, which is the default for pods but invalid for Jobs. LLMs frequently omit `restartPolicy` in Job specs, relying on the default that the API rejects.
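+A minimal Job that sets the fields discussed above might look like this. The name, image, and durations are placeholders rather than recommended values:
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: db-migrate                  # hypothetical name
+spec:
+  backoffLimit: 3                   # retries before the Job is marked failed
+  activeDeadlineSeconds: 600        # hard stop for runaway runs
+  ttlSecondsAfterFinished: 3600     # auto-clean the Job and its pods
+  template:
+    spec:
+      restartPolicy: Never          # explicit; the pod default (Always) is rejected for Jobs
+      containers:
+        - name: migrate
+          image: registry.example.com/migrate:1.0.0   # placeholder image
+```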
+## CronJob
+**Use when:** Work runs on a recurring schedule. Report generation, cache warming, log rotation, periodic health checks, certificate renewal.
+**Key considerations:**
+- Set `concurrencyPolicy: Forbid` by default. Overlapping runs cause resource exhaustion and data corruption.
+- Set `startingDeadlineSeconds` to skip runs that are too late (prevents burst of overdue jobs after controller downtime).
+- Set `timeZone` explicitly. Without it, the schedule uses the controller's clock (typically UTC).
+- CronJobs have three label levels (CronJob, jobTemplate, pod template). All three need consistent labels.
+**Common mistake:** Leaving `concurrencyPolicy` at the default `Allow`, which permits overlapping runs. A CronJob that takes 10 minutes, scheduled every 5 minutes, keeps two runs in flight in steady state; if contention slows each run further, concurrent instances pile up until the cluster runs out of resources.
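+The settings above can be sketched in one CronJob with labels repeated at all three levels. The name, schedule, image, and deadlines are assumptions for illustration:
+```yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: nightly-report              # hypothetical name
+  labels:
+    app.kubernetes.io/name: nightly-report
+spec:
+  schedule: "0 2 * * *"
+  timeZone: "Etc/UTC"               # explicit instead of relying on the controller's clock
+  concurrencyPolicy: Forbid         # skip a run while the previous one is still active
+  startingDeadlineSeconds: 300      # drop runs that are more than 5 minutes late
+  jobTemplate:
+    metadata:
+      labels:
+        app.kubernetes.io/name: nightly-report
+    spec:
+      activeDeadlineSeconds: 1800
+      ttlSecondsAfterFinished: 86400
+      template:
+        metadata:
+          labels:
+            app.kubernetes.io/name: nightly-report
+        spec:
+          restartPolicy: OnFailure
+          containers:
+            - name: report
+              image: registry.example.com/report:1.0.0   # placeholder image
+```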
+## Anti-Patterns
+- **StatefulSet for stateless workloads.** Adds unnecessary complexity. Use a Deployment.
+- **Deployment for one-shot tasks.** The pod restarts forever after completion. Use a Job.
+- **DaemonSet when only some nodes need the workload.** Use `nodeSelector` to target the correct subset, not a blanket DaemonSet with no selector.
+- **CronJob for long-running daemons.** If the workload should run continuously, use a Deployment with HPA.
+## Further Reading
+- [Workloads](https://kubernetes.io/docs/concepts/workloads/)
+- [KubeShark Good Patterns](../examples/good-patterns.md)
+- [KubeShark Bad Patterns](../examples/bad-patterns.md)

package/skills/deploy-k8s/docs/book.json ADDED

@@ -0,0 +1,16 @@
+{
+  "title": "Kubernetes Skill for Claude Code — KubeShark Documentation",
+  "plugins": ["-sharing", "search-pro", "-lunr", "-search"],
+  "pluginsConfig": {
+    "search-pro": {}
+  },
+  "structure": {
+    "readme": "README.md",
+    "summary": "SUMMARY.md"
+  },
+  "links": {
+    "sidebar": {
+      "GitHub": "https://github.com/LukasNiessen/kubernetes-skill"
+    }
+  }
+}

package/skills/deploy-k8s/docs/community/changelog.md ADDED

@@ -0,0 +1,34 @@
+# Changelog
+All notable changes to the Kubernetes Skill (KubeShark) are documented here. This project uses [Semantic Versioning](https://semver.org/).
+For the repository-level changelog, see [CHANGELOG.md](https://github.com/LukasNiessen/kubernetes-skill/blob/main/CHANGELOG.md).
+---
+## v1.0.0
+Initial release of KubeShark.
+### Failure Modes
+- 6 primary failure modes: insecure workload defaults, resource starvation, network exposure, privilege sprawl, fragile rollouts, API drift
+- 7-step failure-mode-first diagnostic workflow (diagnose before generate)
+### Reference Files
+- 20 granular reference files covering failure modes, workload patterns, cross-cutting concerns, tooling, and examples
+- LLM mistake checklists in every reference file that covers a risk domain
+### Pattern Banks
+- 8 production-ready good examples with annotated YAML
+- 8 common anti-pattern bad examples with explanations
+- Do/Don't checklist spanning 9 categories
+### Tooling
+- Helm chart pattern guidance with template conventions
+- Kustomize overlay and patch patterns
+- Validation and policy enforcement (kubeconform, Kyverno, OPA/Gatekeeper, Polaris)
+### Infrastructure
+- HonKit documentation site
+- GitHub Actions CI validation and docs deployment
+- Conventional commits and semantic versioning

package/skills/deploy-k8s/docs/community/contributing.md ADDED

@@ -0,0 +1,67 @@
+# Contributing
+Thanks for contributing to Kubernetes Skill (KubeShark). This is a condensed guide. For the full version, see [CONTRIBUTING.md](https://github.com/LukasNiessen/kubernetes-skill/blob/main/CONTRIBUTING.md).
+## Core Principle
+Every change must map to a failure mode. Before submitting, answer three questions:
+1. Which failure mode does this prevent?
+2. What measurable quality gain does it provide?
+3. Is the token cost justified?
+## Development Flow
+1. **Branch** -- create a feature or fix branch from `main`
+2. **Change** -- make focused changes; keep PRs small and single-purpose
+3. **Check** -- run local checks (see below)
+4. **PR** -- open a pull request using the PR template
+## Local Checks
+```bash
+# Verify no placeholder text remains
+rg -n "FIXME|placeholder-text" README.md SKILL.md references/*.md
+# Verify required files exist
+python - <<'PY'
+from pathlib import Path
+assert Path('SKILL.md').exists()
+assert Path('README.md').exists()
+for p in [
+  'references/insecure-workload-defaults.md',
+  'references/resource-starvation.md',
+  'references/network-exposure.md',
+  'references/privilege-sprawl.md',
+  'references/fragile-rollouts.md',
+  'references/api-drift.md',
+]:
+  assert Path(p).exists(), f'missing {p}'
+print('basic structure OK')
+PY
+```
+## Content Rules
+- Keep examples original and clearly distinct
+- Prefer failure-mode framing over generic best-practice text
+- Avoid cloud-provider-specific deep dives unless they directly reduce a known LLM failure mode
+- Keep claims precise; avoid vague "always" language when tradeoffs exist
+- Default to the PSS restricted profile in all examples
+## Required for PR Approval
+- Clear mapping to one or more failure modes
+- No contradictory guidance across references
+- Updated links and indexes if files were moved or renamed
+- Validation workflow passing (`.github/workflows/validate.yml`)
+## Security
+- Never commit credentials, tokens, or secret values
+- Do not paste real cluster state or kubeconfig data
+- Do not include real IP addresses, hostnames, or cloud account identifiers
+## Reporting Issues
+Open an issue with: the observed hallucination or failure pattern, a minimal reproducible prompt/context, and the expected behavior.

package/skills/deploy-k8s/docs/core-concepts/failure-modes.md ADDED

@@ -0,0 +1,153 @@
+# Failure Modes
+KubeShark organizes Kubernetes risks into six named failure modes. Every piece of guidance in the skill maps to at least one of these. Content that does not reduce the probability of any failure mode is excluded.
+These are not arbitrary categories. They represent the six most common ways LLM-generated Kubernetes manifests cause real damage in production.
+---
+## 1. Insecure Workload Defaults
+Containers running with overly permissive security settings because no explicit security context was provided.
+**Symptoms:**
+- Containers running as root (UID 0)
+- Pods admitted without any `securityContext`
+- Linux capabilities not dropped (runtime defaults such as `CAP_NET_RAW` and `CAP_SETUID` still present)
+- `hostPath` volumes mounted into workload pods
+- Privileged containers that can escape to the node
+- PodSecurity admission rejecting manifests at deploy time
+**Common causes:**
+- Upstream example manifests and Helm chart defaults rarely include security contexts
+- LLMs train on those permissive examples and reproduce them verbatim
+- `securityContext` has both pod-level and container-level fields; omitting either leaves gaps
+- Confusion between PSS levels (privileged, baseline, restricted)
+**Risk pattern:** A Deployment without a security context deploys successfully, runs as root, and becomes a container escape vector when a CVE is exploited. The cluster accepts it without complaint.
+---
+## 2. Resource Starvation
+Workloads deployed without proper resource requests and limits, leading to scheduling failures, evictions, and cascading outages.
+**Symptoms:**
+- OOMKilled containers exceeding memory limits
+- Pods stuck in Pending because the scheduler cannot find a node
+- Node pressure evictions killing BestEffort pods
+- CPU throttling causing invisible latency spikes
+- Noisy neighbors starving co-located pods
+- HPA flapping between replica counts
+**Common causes:**
+- Missing requests and limits entirely (BestEffort QoS, first to be evicted)
+- Arbitrary round numbers (`cpu: 1`, `memory: 1Gi`) without profiling
+- No PodDisruptionBudget -- voluntary disruptions take down all replicas
+- CPU limits set below what the workload needs during bursts, causing constant CFS throttling
+- No LimitRange to catch misconfigured pods at admission
+**Risk pattern:** A pod without resource requests gets scheduled on an overcommitted node. Under load, the kubelet evicts it. The replacement pod lands on another overcommitted node. The cycle continues until the workload is effectively unavailable.
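+One admission-time guard against this pattern is a namespace LimitRange that fills in defaults for containers that declare nothing. This is a sketch; the namespace and all values are placeholders that should come from profiling:
+```yaml
+apiVersion: v1
+kind: LimitRange
+metadata:
+  name: container-defaults          # hypothetical name
+  namespace: team-a                 # hypothetical namespace
+spec:
+  limits:
+    - type: Container
+      defaultRequest:               # applied when a container sets no requests
+        cpu: 100m
+        memory: 128Mi
+      default:                      # applied when a container sets no limits
+        memory: 256Mi
+```
+With these defaults in place, a pod submitted without a `resources` block no longer lands in the BestEffort QoS class.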
+---
+## 3. Network Exposure
+Cluster networking left in the default open state, exposing all pods to all other pods and potentially to the internet.
+**Symptoms:**
+- All pods can reach all pods (Kubernetes default)
+- Unexpected external exposure via `NodePort` or `LoadBalancer` Services
+- DNS resolution failures from wrong Service names or missing namespace qualifiers
+- Silent routing to nothing when Service selectors do not match pod labels
+- Lateral movement after compromise because no NetworkPolicy exists
+- Ingress 404s or 502s from path/backend mismatches
+**Common causes:**
+- Kubernetes has no network segmentation by default -- every pod can reach every other pod
+- LLMs generate `NodePort` and `LoadBalancer` Services when `ClusterIP` is sufficient
+- Service selectors silently fail when labels do not match (zero errors, zero traffic)
+- No policy means allow-all, not deny-all
+- Egress policies are forgotten -- ingress-only policies still allow unrestricted outbound
+**Risk pattern:** A compromised pod in one namespace freely connects to the database in another namespace. No NetworkPolicy exists, so every service in the cluster is reachable. The blast radius of a single vulnerability is the entire cluster.
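+A common starting point for closing this gap is a per-namespace default-deny policy, sketched below. The namespace is an assumption, and the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicy:
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: default-deny-all
+  namespace: team-a                 # hypothetical namespace
+spec:
+  podSelector: {}                   # selects every pod in the namespace
+  policyTypes:
+    - Ingress
+    - Egress                        # also blocks outbound traffic, including DNS
+```
+Because egress is denied too, workloads need additional allow rules (for DNS and for their real dependencies) before they can function.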
+---
+## 4. Privilege Sprawl
+RBAC permissions, ServiceAccount tokens, and secret access granted far beyond what workloads actually require.
+**Symptoms:**
+- ClusterRoleBinding with `cluster-admin` attached to a workload ServiceAccount
+- Rules containing `verbs: ["*"]` or `resources: ["*"]`
+- Pods running with the `default` ServiceAccount (shared identity across the namespace)
+- `automountServiceAccountToken: true` on pods that never call the Kubernetes API
+- Secrets injected as environment variables (readable by every process in the container and easily leaked through crash dumps and debug logs)
+**Common causes:**
+- Copy-pasting `cluster-admin` bindings from quickstart guides
+- Using wildcards to "get it working" and never scoping down
+- Not creating dedicated ServiceAccounts per workload
+- Misunderstanding that Kubernetes Secrets are base64-encoded, not encrypted
+- Injecting secrets via `env` instead of volume mounts or external operators
+**Risk pattern:** A web application pod runs with the default ServiceAccount, which has a ClusterRoleBinding to `cluster-admin` left over from initial setup. An SSRF vulnerability in the application allows an attacker to read the mounted token and take full control of the cluster.
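+A least-privilege alternative can be sketched as follows: one ServiceAccount for a workload that never calls the API, and one with a narrowly scoped, namespaced Role for a workload that does. All names and the namespace are assumptions:
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: web                          # hypothetical; workload never calls the API
+  namespace: team-a
+automountServiceAccountToken: false  # no token is mounted into the pod
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: config-watcher               # hypothetical; workload reads ConfigMaps
+  namespace: team-a
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: config-reader
+  namespace: team-a
+rules:
+  - apiGroups: [""]
+    resources: ["configmaps"]
+    verbs: ["get", "list", "watch"]  # explicit verbs, no wildcards
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: config-watcher-reads-config
+  namespace: team-a
+subjects:
+  - kind: ServiceAccount
+    name: config-watcher
+    namespace: team-a
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: config-reader
+```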
+---
+## 5. Fragile Rollouts
+Deployments that break during updates due to misconfigured probes, mutable image tags, or missing graceful shutdown handling.
+**Symptoms:**
+- Cascading restarts across all pods (liveness probe checks an external dependency)
+- Dropped connections and 502s during deploys (readiness probe passes too early)
+- All replicas unavailable simultaneously (`maxUnavailable` too high)
+- Version drift across pods (`:latest` resolving to different images on different nodes or at different pull times)
+- Pods killed before finishing in-flight requests (no preStop hook)
+- Slow-starting apps killed in restart loops (no startup probe)
+**Common causes:**
+- Misunderstanding the difference between liveness and readiness probes
+- Checking external dependencies (databases, APIs) in liveness probes
+- Using `:latest` tags, which are mutable and nondeterministic
+- Not setting `terminationGracePeriodSeconds` or preStop hooks
+- `maxUnavailable` and `maxSurge` left at defaults without considering replica count
+**Risk pattern:** A Deployment with a liveness probe that checks database connectivity deploys successfully. The database has a brief network blip. Every pod fails its liveness check simultaneously. Kubernetes restarts all pods at once, causing a full outage that outlasts the original database blip.
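+A probe layout that avoids this pattern keeps liveness local to the process and leaves dependency checks to readiness. The fragment below is a sketch of a Deployment's pod template spec; the image, paths, port, and timings are assumptions, and the `preStop` command requires a `sleep` binary in the image:
+```yaml
+# Fragment of spec.template.spec in a Deployment
+terminationGracePeriodSeconds: 45
+containers:
+  - name: web
+    image: registry.example.com/web:1.4.2   # placeholder; pinned tag, not :latest
+    ports:
+      - name: http
+        containerPort: 8080
+    startupProbe:                           # protects slow starters from early restarts
+      httpGet:
+        path: /healthz                      # hypothetical endpoint
+        port: http
+      periodSeconds: 5
+      failureThreshold: 30
+    livenessProbe:                          # process health only, no external dependencies
+      httpGet:
+        path: /healthz
+        port: http
+      periodSeconds: 10
+    readinessProbe:                         # may include dependency checks
+      httpGet:
+        path: /ready                        # hypothetical endpoint
+        port: http
+      periodSeconds: 5
+    lifecycle:
+      preStop:
+        exec:
+          command: ["sleep", "10"]          # let endpoints drain before SIGTERM
+```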
+---
+## 6. API Drift
+Manifests using wrong, deprecated, or removed API versions that fail silently or break on cluster upgrades.
+**Symptoms:**
+- `no matches for kind "Ingress" in version "extensions/v1beta1"` (removed API)
+- `Warning: policy/v1beta1 PodDisruptionBudget is deprecated` (deprecated: still served on 1.21-1.24, removed in 1.25)
+- Fields silently ignored after upgrade (existed in beta, removed in stable)
+- Helm templates render valid YAML but `kubectl apply` fails
+- `kubeconform` reports schema violations
+**Common causes:**
+- LLM training data contains outdated manifests from blog posts and Stack Overflow
+- Copy-paste from tutorials written for the Kubernetes 1.18-1.21 era
+- Helm charts pinned to old API versions without `Capabilities` checks
+- Not running schema validation against the target cluster version
+- Confusing "deprecated" (still works, prints warning) with "removed" (hard failure)
+**Risk pattern:** An LLM generates a manifest with `apiVersion: extensions/v1beta1` for an Ingress resource. This was removed in Kubernetes 1.22. The manifest looks correct, passes YAML linting, but fails on any modern cluster. The correct version is `networking.k8s.io/v1`.
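+For comparison, the same resource under `networking.k8s.io/v1` could be sketched as below. The host, Service name, port, and ingress class are assumptions; the structural points are the required `pathType` and the nested `service` backend:
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: web                         # hypothetical name
+spec:
+  ingressClassName: nginx           # assumption: depends on the installed controller
+  rules:
+    - host: app.example.com         # placeholder host
+      http:
+        paths:
+          - path: /
+            pathType: Prefix        # required in v1
+            backend:
+              service:              # replaces serviceName/servicePort from v1beta1
+                name: web
+                port:
+                  number: 80
+```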
+---
+## How Failure Modes Are Used
+Failure modes drive the entire KubeShark workflow:
+1. **Step 2 (Diagnose)** selects the relevant failure modes based on the task.
+2. **Step 3 (Load references)** pulls the reference files that correspond to the diagnosed failure modes.
+3. **Step 4 (Propose)** structures recommendations around preventing the specific risks identified.
+4. **Step 7 (Output contract)** lists which failure modes were addressed, making the response auditable.
+Most tasks involve multiple failure modes. A Deployment creation task typically triggers insecure workload defaults, resource starvation, and fragile rollouts at minimum. The workflow ensures none are overlooked.

package/skills/deploy-k8s/docs/core-concepts/philosophy.md ADDED

@@ -0,0 +1,83 @@
+# Philosophy
+This page describes the design rationale behind KubeShark. For the full treatment, see [PHILOSOPHY.md](https://github.com/LukasNiessen/kubernetes-skill/blob/main/PHILOSOPHY.md) in the repository root.
+---
+## Failure-Mode-First vs. Reference Manuals
+The core insight: telling an LLM *what good Kubernetes looks like* is less effective than telling it *how to think about Kubernetes problems*.
+A static reference manual gives the model information but no diagnostic process. There is no risk assessment step, no structured output, and no way to verify that the right concerns were addressed. The model reads the reference and generates whatever it thinks fits.
+KubeShark takes the opposite approach. The core `SKILL.md` is an operational workflow, not a knowledge dump. It forces a diagnostic sequence: capture context, identify failure modes, load only relevant references, propose fixes with risk controls, validate, and deliver a structured output contract. The model diagnoses before it generates.
+---
+## Why Kubernetes Needs This More Than Terraform
+Terraform fails explicitly. A misconfiguration surfaces at `terraform plan` or `terraform apply` with a clear error message. Kubernetes is different in three critical ways:
+**Silent failures are common.** A Service with the wrong selector deploys successfully but routes to nothing. A NetworkPolicy with a mistyped label silently does nothing. A probe pointing to the wrong port passes creation but fails at runtime. The cluster accepts the manifest without complaint -- failures surface only when traffic arrives.
+**Runtime is continuous.** Terraform is plan-and-apply. Kubernetes is a continuous reconciliation loop. A misconfigured liveness probe does not just fail once -- the kubelet keeps restarting the container, with back-off, for as long as the probe fails. A missing PodDisruptionBudget does not just affect one maintenance window -- it allows every future node drain to evict all replicas simultaneously.
+**The blast radius is multi-dimensional.** Terraform operates at infrastructure provisioning time. Kubernetes operates across provisioning, deployment, runtime, networking, scheduling, and security simultaneously. An LLM must reason about all these dimensions for every resource it generates.
+These properties make a diagnostic workflow essential. Without one, the LLM produces syntactically valid but operationally dangerous manifests -- and the cluster silently accepts them.
+---
+## Token Efficiency as Design Constraint
+Context window space is a finite resource. Every token spent on skill content is a token unavailable for the user's actual manifests, conversation history, and tool results.
+KubeShark is designed for minimal activation cost:
+- **SKILL.md is ~85 lines (~650 tokens).** It contains no YAML examples, no inline manifests, and no tutorial material. It is purely procedural.
+- **20 granular reference files.** The model loads only the 1-2 files relevant to the diagnosed failure mode per query.
+- **No duplication.** A query about probe configuration never loads the RBAC guidance. A query about Helm chart structure never loads the NetworkPolicy patterns.
+A single large reference file would force the model to process thousands of irrelevant tokens. Twenty small files let it load precisely what it needs.
+---
+## LLM-Aware Guardrails
+Every reference file that covers a risk domain includes an **LLM mistake checklist** -- a list of specific errors that language models make when generating Kubernetes configurations:
+- Omitting `securityContext` entirely, producing manifests that run as root
+- Setting liveness probes that check external dependencies, causing cascading restarts
+- Using `apiVersion: extensions/v1beta1` for Ingress (removed in 1.22)
+- Generating RBAC with wildcard verbs and resources on ClusterRoleBindings
+- Omitting resource requests and limits, or using arbitrary round numbers
+- Using `:latest` image tags instead of a pinned version or digest
+- Creating Services with selectors that do not match any pod labels
+These checklists exist because the model needs to know *what it gets wrong*, not just *what is correct*. A reference that only shows the right pattern still allows the model to hallucinate the wrong one. A reference that explicitly names the hallucination pattern reduces it.
+---
+## Output Contracts for Auditability
+Every KubeShark response ends with a structured output contract: assumptions, failure modes addressed, remediation choices and tradeoffs, validation plan, and rollback notes.
+This is a deliberate design choice. Kubernetes manifests applied to a cluster have real operational consequences. The output contract makes every response auditable -- a reviewer can check whether the model's assumptions matched reality, whether the right risks were identified, and whether the rollback path is viable, all before applying anything.
+Without an output contract, the user receives a manifest and must independently assess whether it is safe. The contract shifts that burden: the model states what it assumed and what it did not account for.
+---
+## Default Security Posture
+KubeShark defaults to the Pod Security Standards **restricted** profile. Every generated workload includes:
+- `runAsNonRoot: true`
+- `allowPrivilegeEscalation: false`
+- `readOnlyRootFilesystem: true`
+- `capabilities: { drop: ["ALL"] }`
+- `seccompProfile: { type: RuntimeDefault }`
+The restricted profile blocks many of the most common container escape paths. Deviations are allowed only when the user explicitly requests them, and each deviation is documented in the output contract with justification.
+This is a secure-by-default posture. It is easier to relax security with documented justification than to retroactively harden manifests that were generated permissively.
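+Placed in a manifest, the five settings listed above split across the pod-level and container-level security contexts. The fragment below is a sketch of a pod template spec; the container name and image are placeholders:
+```yaml
+# Fragment of a workload's spec.template.spec
+securityContext:                    # pod level
+  runAsNonRoot: true
+  seccompProfile:
+    type: RuntimeDefault
+containers:
+  - name: app
+    image: registry.example.com/app:1.0.0   # placeholder image
+    securityContext:                # container level
+      allowPrivilegeEscalation: false
+      readOnlyRootFilesystem: true
+      capabilities:
+        drop: ["ALL"]
+```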