npm - @raishin/vanguard-frontier-agentic - Versions diffs - 1.2.0 → 1.3.0 - Mend

@raishin/vanguard-frontier-agentic 1.2.0 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (442)

package/skills/oci/oci-live-network-security-rule-guard/references/preflight-commands.md ADDED

@@ -0,0 +1,69 @@
+# Preflight Commands: OCI Live Network Security Rule Guard
+Run all of these before adding, modifying, or removing any Security List or NSG rule.
+## 1. Confirm active OCI profile and tenancy
+```bash
+oci iam region list --output table   # confirms CLI auth works
+oci iam tenancy get --tenancy-id $(oci iam user get --user-id $(oci iam user list --query 'data[0].id' --raw-output) --query 'data."compartment-id"' --raw-output) 2>/dev/null || echo "Use: oci iam user list --all"
+```
+Simpler identity check:
+```bash
+oci iam user list --all --query 'data[0].{name:name, description:description}' --output table
+```
+## 2. Capture current Security List rules (CRITICAL — save as rollback baseline)
+```bash
+# Get current ingress and egress rules — save this output BEFORE any mutation
+oci network security-list get \
+  --security-list-id <SECURITY_LIST_OCID> \
+  --query 'data.{"display-name":"display-name", "ingress-security-rules":"ingress-security-rules", "egress-security-rules":"egress-security-rules"}'
+```
+## 3. Capture current NSG rules (CRITICAL — save as rollback baseline)
+```bash
+oci network nsg rules list \
+  --nsg-id <NSG_OCID> \
+  --all \
+  --query 'data[].{id:id, direction:direction, protocol:protocol, source:source, destination:destination, "source-type":"source-type", "tcp-options":"tcp-options", "udp-options":"udp-options", stateless:stateless}'
+```
+## 4. List Security Lists in a VCN to identify the target
+```bash
+oci network security-list list \
+  --compartment-id <COMPARTMENT_OCID> \
+  --vcn-id <VCN_OCID> \
+  --query 'data[].{"display-name":"display-name", id:id, "lifecycle-state":"lifecycle-state"}'
+```
+## 5. Identify subnets attached to the Security List (blast radius)
+```bash
+oci network subnet list \
+  --compartment-id <COMPARTMENT_OCID> \
+  --vcn-id <VCN_OCID> \
+  --query 'data[].{"display-name":"display-name", "cidr-block":"cidr-block", "security-list-ids":"security-list-ids", "prohibit-public-ip-on-vnic":"prohibit-public-ip-on-vnic"}'
+```
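+`prohibit-public-ip-on-vnic: true` means the subnet is private. An ingress rule from 0.0.0.0/0 on a private subnet still admits every source that can route to it (other subnets in the VCN, peered VCNs, on-premises networks), so confirm the VCN CIDR scope.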
+## 6. Check if DB System or Autonomous DB is in the affected subnet
+```bash
+# List DB systems in compartment
+oci db system list \
+  --compartment-id <COMPARTMENT_OCID> \
+  --query 'data[].{"display-name":"display-name", "subnet-id":"subnet-id", "lifecycle-state":"lifecycle-state"}'
+# List Autonomous DBs
+oci db autonomous-database list \
+  --compartment-id <COMPARTMENT_OCID> \
+  --query 'data[].{"db-name":"db-name", "subnet-id":"subnet-id", "lifecycle-state":"lifecycle-state"}'
+```
+If the affected subnet hosts a DB workload, classify the change as **critical** and require explicit DBA approval.
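The blast-radius and DB checks in steps 5 and 6 can be combined in a small script. The following is a minimal Python sketch, assuming the subnet and DB listings were saved as parsed JSON with the `id`, `security-list-ids`, and `subnet-id` fields present (the `--query` in step 5 would need `id:id` added). The function name and the `standard` label are hypothetical.

```python
def blast_radius(subnets, db_workloads, security_list_id):
    """Return (affected subnet OCIDs, severity) for a change to one Security List."""
    # Subnets that attach the target Security List are the blast radius.
    affected = [
        s["id"] for s in subnets
        if security_list_id in (s.get("security-list-ids") or [])
    ]
    # Subnets that host a DB System or Autonomous DB.
    db_subnets = {d.get("subnet-id") for d in db_workloads}
    # "standard" is a placeholder label for this sketch; "critical" follows the rule above.
    severity = "critical" if db_subnets & set(affected) else "standard"
    return affected, severity
```

Passing the combined DB System and Autonomous DB listings as `db_workloads` keeps the DBA-approval decision in one place.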

package/skills/oci/oci-live-network-security-rule-guard/references/rollback-playbook.md ADDED

@@ -0,0 +1,79 @@
+# Rollback Playbook: OCI Live Network Security Rule Guard
+OCI Security List and NSG rule changes take effect immediately with no native undo operation. The only rollback path is restoring the previous rule set from a captured baseline. **Capture current rules before every mutation — no exceptions.**
+## Pre-mutation capture (mandatory)
+```bash
+# Security List — save to file before any change
+oci network security-list get \
+  --security-list-id <SECURITY_LIST_OCID> \
+  --query 'data.{"ingress-security-rules":"ingress-security-rules","egress-security-rules":"egress-security-rules"}' \
+  > securitylist-backup-$(date +%Y%m%d-%H%M%S).json
+# NSG — save to file before any change
+oci network nsg rules list \
+  --nsg-id <NSG_OCID> --all \
+  > nsg-backup-$(date +%Y%m%d-%H%M%S).json
+```
+## Restore Security List rules from backup
+Security List update is a **full replace** — the update command overwrites the entire rule set. Pass the exact previous rules from the backup file.
+```bash
+# Restore ingress rules
+INGRESS=$(cat securitylist-backup-<TIMESTAMP>.json | python3 -c "import json,sys; d=json.load(sys.stdin); print(json.dumps(d['ingress-security-rules']))")
+oci network security-list update \
+  --security-list-id <SECURITY_LIST_OCID> \
+  --ingress-security-rules "$INGRESS" \
+  --force
+# Restore egress rules (same file, egress key)
+EGRESS=$(cat securitylist-backup-<TIMESTAMP>.json | python3 -c "import json,sys; d=json.load(sys.stdin); print(json.dumps(d['egress-security-rules']))")
+oci network security-list update \
+  --security-list-id <SECURITY_LIST_OCID> \
+  --egress-security-rules "$EGRESS" \
+  --force
+```
+## Restore NSG rules from backup
+NSG rule updates require rule IDs. To restore, remove new rules and re-add the old ones.
+```bash
+# List current rule IDs to identify added rules
+oci network nsg rules list --nsg-id <NSG_OCID> --all --query 'data[].id'
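+# Remove a specific rule that was incorrectly added
+oci network nsg rules remove \
+  --nsg-id <NSG_OCID> \
+  --security-rule-ids '["<RULE_ID_TO_REMOVE>"]'
+# Re-add a rule that was removed. <RULES_TO_RESTORE>.json is a placeholder: a JSON array of
+# rule objects copied from the NSG backup with read-only fields (id, is-valid, time-created) removed.
+oci network nsg rules add \
+  --nsg-id <NSG_OCID> \
+  --security-rules file://<RULES_TO_RESTORE>.json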
+```
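One way to work out which rules to remove and which to re-add is to diff the NSG backup against the live rule list. This is a minimal Python sketch, assuming both inputs are the `data` arrays from `oci network nsg rules list` and that `id`, `is-valid`, and `time-created` are the server-populated fields. `plan_nsg_restore` is a hypothetical helper, not part of the OCI CLI.

```python
import json

# Assumed server-populated fields that must not be sent back when re-adding a rule.
READ_ONLY_FIELDS = {"id", "is-valid", "time-created"}

def _definition(rule):
    """Canonical string for a rule's content, ignoring server-populated fields."""
    body = {k: v for k, v in rule.items() if k not in READ_ONLY_FIELDS}
    return json.dumps(body, sort_keys=True)

def plan_nsg_restore(backup_rules, current_rules):
    """Return (rule IDs to remove, rule definitions to re-add)."""
    backup_defs = {_definition(r) for r in backup_rules}
    current_defs = {_definition(r) for r in current_rules}
    # Anything live whose content is not in the backup was added or modified.
    to_remove = [r["id"] for r in current_rules if _definition(r) not in backup_defs]
    # Anything in the backup whose content is no longer live must be re-added.
    to_add = [json.loads(d) for d in sorted(backup_defs - current_defs)]
    return to_remove, to_add
```

Comparing by content rather than by ID matters because a rule that was removed and re-added gets a new ID while its definition stays the same.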
+## Verify restoration
+```bash
+# Confirm rules match the backup
+oci network security-list get \
+  --security-list-id <SECURITY_LIST_OCID> \
+  --query 'data.{"ingress-security-rules":"ingress-security-rules","egress-security-rules":"egress-security-rules"}'
+```
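Comparing the restored rules with the backup by eye is error-prone once a list has more than a few entries. This is a minimal Python sketch of an order-insensitive comparison, assuming both inputs are parsed JSON objects in the shape written by the pre-mutation capture. `rules_match` is a hypothetical helper name.

```python
import json

def rules_match(backup, current):
    """True when ingress and egress rule sets are equal, ignoring rule order."""
    def canon(rules):
        # Serialize each rule with sorted keys so key order does not matter,
        # then sort the list so rule order does not matter either.
        return sorted(json.dumps(r, sort_keys=True) for r in (rules or []))
    return all(
        canon(backup.get(key)) == canon(current.get(key))
        for key in ("ingress-security-rules", "egress-security-rules")
    )
```

Load the backup file and the fresh `security-list get` output with `json.load` and treat a `False` result as an incomplete rollback.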
+## Connectivity verification after rollback
+```bash
+# Check if affected instance can still reach expected endpoints
+# (Run from inside the VCN or use OCI Network Path Analyzer)
+# Network Path Analyzer is exposed under the vn-monitoring CLI service; --protocol is the
+# IP protocol number (6 = TCP). This saves a test definition to run an analysis against.
+oci vn-monitoring path-analyzer-test create \
+  --compartment-id <COMPARTMENT_OCID> \
+  --protocol 6 \
+  --protocol-parameters '{"type":"TCP","destinationPort":<PORT>}' \
+  --source-endpoint '{"type":"COMPUTE_INSTANCE","instanceId":"<INSTANCE_OCID>"}' \
+  --destination-endpoint '{"type":"IP_ADDRESS","address":"<DEST_IP>"}'
+```
+## What cannot be rolled back
+- Traffic that flowed through an incorrectly open rule during the window cannot be recalled.
+- Data exfiltrated or connections established during the exposure window must be investigated separately via VCN Flow Logs.
+- Enable Flow Logs on affected subnets before any security rule change and keep them enabled afterwards, so the whole change window has forensic coverage.

package/skills/opentelemetry/README.md ADDED

@@ -0,0 +1,31 @@
+# 🔭 OpenTelemetry Skills
+<p align="center">
+  <!-- 🖼️ Add an OpenTelemetry logo to assets/logos/cnative/opentelemetry/ and update this path -->
+  <span style="font-size:3.5em">🔭</span>
+</p>
+This folder contains OpenTelemetry-focused skills curated for this marketplace.
+## Local marketplace portfolio
+This folder contains **1** local OpenTelemetry skill:
+- `opentelemetry-collector-config-review`
+## Portfolio posture
+OpenTelemetry skills for evidence-backed observability pipeline review covering the four `OpenTelemetryCollector` deployment modes (`deployment`, `statefulset`, `daemonset`, `sidecar`), the `Instrumentation` CR for auto-instrumentation across Java/Node/Python/.NET/Go, the Target Allocator for distributed Prometheus scraping, and exporter/processor/receiver pipeline correctness.
+These skills are intentionally conservative:
+- prefer `kubectl get opentelemetrycollectors,instrumentations -A -o yaml` for live collector state grounding before any review
+- treat **collector pipeline with no exporter** as a critical finding — telemetry is silently dropped at collector boundary
+- treat **removal of `memory_limiter` processor** as a critical finding — collector OOMs and loses spans/metrics
+- challenge tail sampling rule changes — past spans are not re-evaluated, sampling drift is permanent for already-collected windows
+- challenge `Instrumentation` CR removal from a running namespace — auto-instrumented pods stop emitting telemetry on next restart
+- challenge collector `service.pipelines` lacking the `k8sattributes` processor — telemetry loses Kubernetes context (namespace, pod, deployment)
+- challenge TLS `insecure: true` on production exporters — telemetry data flows in plaintext, often containing PII
+- use official OpenTelemetry documentation (opentelemetry.io, opentelemetry-operator) for Collector/Instrumentation CRD syntax, processor pipelines, and Target Allocator semantics
+Run `npm run validate` after changing cataloged OpenTelemetry skills.

package/skills/opentelemetry/opentelemetry-collector-config-review/SKILL.md ADDED

@@ -0,0 +1,44 @@
+---
+name: opentelemetry-collector-config-review
+description: Use this skill for OpenTelemetry Operator review covering OpenTelemetryCollector deployment modes (Deployment, StatefulSet, DaemonSet, Sidecar), Instrumentation CR auto-instrumentation across Java/Node/Python/.NET/Go, Target Allocator for distributed Prometheus scraping, and pipeline correctness across receivers, processors, and exporters. Trigger when the user asks whether a collector configuration will lose telemetry, whether the right deployment mode is used, whether memory_limiter and batch are present, whether tail_sampling is safe to change, or whether auto-instrumentation will cover a workload after restart.
+metadata:
+  author: "github: Raishin"
+  version: "0.1.0"
+---
+# OpenTelemetry Collector Config Review
+## Purpose
+Review OpenTelemetry Operator-managed `OpenTelemetryCollector` and `Instrumentation` resources against pipeline correctness, deployment-mode appropriateness, memory safety, sampling integrity, exporter security, and Kubernetes-attribute enrichment. Telemetry pipelines fail silently — a misconfigured exporter drops every span; a missing `memory_limiter` OOMs the collector; a deleted `Instrumentation` resource stops auto-instrumentation on next pod restart.
+## Lean operating rules
+- Prefer live cluster evidence (`kubectl get opentelemetrycollectors,instrumentations -A -o yaml` plus collector logs and metrics) when the active client exposes it; otherwise fall back to official OpenTelemetry documentation (opentelemetry.io, opentelemetry-operator) and sanitized YAML.
+- Separate confirmed facts from inference. If collector pipeline state, exporter health, or `Instrumentation` propagation was not queried, say so.
+- Treat **a pipeline with no exporter** (or with only `debug` exporter in production) as a critical finding — telemetry is dropped at the collector.
+- Treat **removal of the `memory_limiter` processor** as a critical finding — collector OOMs and loses spans/metrics on burst traffic.
+- Treat **removal of the `k8sattributes` processor** as a high finding — telemetry loses `k8s.namespace.name`, `k8s.pod.name`, `k8s.deployment.name`, and SLO dashboards lose context.
+- Challenge tail sampling rule changes — past spans are not re-evaluated; sampling drift is permanent for already-collected windows.
+- Challenge `Instrumentation` CR removal in a running namespace — auto-instrumented pods stop emitting telemetry after their next restart.
+- Challenge collector exporters with `tls.insecure: true` in production — telemetry data flows in plaintext, often containing PII/PHI.
+- Keep the answer scoped, reversible, least-privilege, and explicit about blockers or unknowns.
+## References
+Load these only when needed:
+- [Evidence path and tooling](references/mcp-and-evidence.md) — use when choosing live evidence, confirming Operator version and Collector pipeline state, or switching to documentation mode.
+- [Workflow and output contract](references/workflow-and-output.md) — use when executing the full review, applying stress checks per deployment mode, or formatting the final answer.
+- [Official sources](references/official-sources.md) — use when you need the detailed OpenTelemetry documentation list, processor pipeline references, and grounded insights.
+## Response minimum
+Return, at minimum:
+- the scoped target (`OpenTelemetryCollector` of which mode, `Instrumentation` CR, or pipeline element) and evidence level,
+- the deployment-mode appropriateness (Deployment / StatefulSet / DaemonSet / Sidecar) for the use case,
+- the pipeline correctness (receivers, processors, exporters all present and ordered safely),
+- the failure mode if exporter is unreachable or downstream is full (queue, drop, retry semantics),
+- the safest next actions and rollback plan,
+- the assumptions or blockers that prevent stronger conclusions.

package/skills/opentelemetry/opentelemetry-collector-config-review/metadata.json ADDED

@@ -0,0 +1,30 @@
+{
+  "id": "opentelemetry-collector-config-review",
+  "name": "OpenTelemetry Collector Config Review",
+  "type": "skill",
+  "provider": "opentelemetry",
+  "harnesses": [
+    "codex",
+    "claude-code",
+    "cursor",
+    "gemini",
+    "kiro",
+    "other"
+  ],
+  "summary": "Review OpenTelemetry Operator OpenTelemetryCollector and Instrumentation resources for deployment-mode appropriateness, pipeline correctness, memory_limiter and k8sattributes presence, exporter security, and sampling integrity.",
+  "source_type": "original",
+  "official_docs": [
+    "https://opentelemetry.io/docs/",
+    "https://opentelemetry.io/docs/collector/",
+    "https://opentelemetry.io/docs/collector/configuration/",
+    "https://opentelemetry.io/docs/kubernetes/operator/",
+    "https://opentelemetry.io/docs/kubernetes/operator/automatic/",
+    "https://opentelemetry.io/docs/kubernetes/operator/target-allocator/",
+    "https://github.com/open-telemetry/opentelemetry-operator"
+  ],
+  "security_notes": "Pipeline with no exporter silently drops telemetry. Missing memory_limiter causes collector OOM under burst. Missing k8sattributes drops Kubernetes context. Tail sampling changes are not retroactive. Removing Instrumentation CR stops auto-instrumentation on next pod restart.",
+  "last_verified": "2026-05-01",
+  "path": "skills/opentelemetry/opentelemetry-collector-config-review",
+  "author": "github: Raishin",
+  "version": "0.1.0"
+}

package/skills/opentelemetry/opentelemetry-collector-config-review/references/mcp-and-evidence.md ADDED

@@ -0,0 +1,49 @@
+# Evidence Path and Tooling
+## Evidence path
+1. Prefer live cluster evidence when a Kubernetes MCP server, `kubectl`, and access to the OpenTelemetry Operator namespace are available.
+2. Fall back to official OpenTelemetry documentation (opentelemetry.io, opentelemetry-operator GitHub) when live inspection is unavailable.
+3. Ask only for sanitized `OpenTelemetryCollector` / `Instrumentation` YAML, collector logs, and target backend reachability evidence when current-state proof matters.
+4. Label conclusions as `live evidence`, `documentation-based`, `sanitized user evidence`, or `inference`.
+## Useful live-evidence commands
+```shell
+# All Collectors and Instrumentation CRs across the cluster
+kubectl get opentelemetrycollectors,instrumentations -A -o yaml
+# Detailed Collector status — replicas, mode, generated config map
+kubectl -n <ns> get opentelemetrycollector <name> -o yaml
+kubectl -n <ns> get configmap <collector-name>-collector -o yaml
+# Operator state
+kubectl -n opentelemetry-operator-system get deploy,svc,validatingwebhookconfiguration
+# Collector pod logs — confirm pipeline is processing data
+kubectl -n <ns> logs deploy/<collector-name>-collector --tail=200 -f
+# Collector internal metrics (Prometheus on :8888 by default)
+# port-forward blocks, so run it in the background or in a second terminal
+kubectl -n <ns> port-forward deploy/<collector-name>-collector 8888:8888 &
+curl -s http://localhost:8888/metrics | grep otelcol_
+# Auto-instrumentation propagation: which pods carry the Java inject annotation?
+kubectl get pods -A -o jsonpath='{range .items[?(@.metadata.annotations.instrumentation\.opentelemetry\.io/inject-java=="true")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
+# Verify exporter reachability from within the collector pod
+# (collector images may not ship nc or a shell; if exec fails, use an ephemeral debug container via kubectl debug)
+kubectl -n <ns> exec -it deploy/<collector-name>-collector -- nc -zv <exporter-host> <exporter-port>
+```
+## Operator and Collector state to confirm before review
+- Operator version (`kubectl -n opentelemetry-operator-system get deploy opentelemetry-operator-controller-manager -o jsonpath='{.spec.template.spec.containers[*].image}'`): the `OpenTelemetryCollector` API has evolved; `v1beta1` is the current API version and supersedes `v1alpha1`.
+- Collector image and version — different versions support different receivers/processors/exporters. The contrib distribution has a much wider set than the core distribution.
+- Whether Target Allocator is deployed — required for `mode: statefulset` Prometheus scraping at scale.
+- Whether `Instrumentation` CRs exist and which language images are pinned (Java, Node, Python, .NET, Go) — version drift between auto-instrumentation images and application runtimes is a common silent failure mode.
+- Backend reachability — the actual telemetry destination (vendor SaaS, Tempo, Jaeger, Prometheus remote write, Loki) must accept the collector's data; check from inside the pod.
+## Sanitization rules
+- Never request kubeconfig contents, vendor API keys, OTLP bearer tokens, or backend authentication secrets.
+- Replace identifiable backend hostnames, vendor URLs, and tenant IDs with placeholders unless the user provides them.
+- Do not print the collector's `Authorization` header values; reference them by configuration key only.

package/skills/opentelemetry/opentelemetry-collector-config-review/references/official-sources.md ADDED

@@ -0,0 +1,31 @@
+# Official Sources
+Load these only when needed:
+- [OpenTelemetry documentation home](https://opentelemetry.io/docs/) — use as the entry point for any OTEL question.
+- [Collector overview](https://opentelemetry.io/docs/collector/) — use for collector architecture, distributions (core vs contrib), and component model.
+- [Collector configuration](https://opentelemetry.io/docs/collector/configuration/) — use for receivers, processors, exporters, extensions, and `service.pipelines` syntax.
+- [Operator overview](https://opentelemetry.io/docs/kubernetes/operator/) — use for `OpenTelemetryCollector` CRD, deployment modes, and Operator behavior.
+- [Operator automatic instrumentation](https://opentelemetry.io/docs/kubernetes/operator/automatic/) — use for `Instrumentation` CR, language-specific init containers, annotation-based pod injection.
+- [Target Allocator](https://opentelemetry.io/docs/kubernetes/operator/target-allocator/) — use for sharding Prometheus scrape jobs across collector replicas.
+- [opentelemetry-operator GitHub](https://github.com/open-telemetry/opentelemetry-operator) — use for CRD source, examples, and recent feature notes.
+- [opentelemetry-collector-contrib processors](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor) — use for `k8sattributes`, `resourcedetection`, `tail_sampling`, `transform`, `filter`, `routing` processor configs.
+- [opentelemetry-collector-contrib receivers](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver) — use for `kubeletstats`, `k8s_cluster`, `prometheus`, `filelog` receiver configs.
+- [opentelemetry-collector-contrib exporters](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter) — use for vendor exporters and queue/retry semantics.
+- [Sampling guide](https://opentelemetry.io/docs/concepts/sampling/) — use when designing tail sampling vs probabilistic sampling vs head sampling.
+- [Semantic conventions for Kubernetes](https://opentelemetry.io/docs/specs/semconv/resource/k8s/) — use for the canonical `k8s.*` attribute names that `k8sattributes` populates.
+- [Collector internal observability](https://opentelemetry.io/docs/collector/internal-telemetry/) — use for `otelcol_*` self-metrics that diagnose collector health.
+## Grounded insights worth carrying into the skill
+- The OpenTelemetry Operator manages `OpenTelemetryCollector` and `Instrumentation` CRs and supports four deployment modes: `deployment`, `statefulset`, `daemonset`, and `sidecar`. Each is appropriate for a different use case and the wrong mode silently produces incomplete or duplicate data.
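+- A pipeline with **no exporter** is valid YAML, but the Collector's config validation requires at least one receiver and one exporter per pipeline, so the collector fails to start and collects nothing. The silent variant is a pipeline whose only exporter is `debug`: it starts cleanly and behaves as if data is being processed while nothing reaches a backend.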
+- `memory_limiter` is the only protection against OOM under burst load. Without it, the collector consumes memory until the kernel kills the pod and loses everything in flight. It is recommended as the **first processor** in every pipeline.
+- `batch` is recommended **last before exporters** because it groups individual signals into larger export calls. Without it, data is forwarded in whatever small payloads the receivers produced, which hurts throughput at any meaningful volume.
+- `k8sattributes` enriches signals with Kubernetes object names. Without it, traces and metrics cannot be grouped by namespace/pod/deployment, breaking SLO dashboards and alerting. It requires RBAC: `pods/get,list,watch`, `namespaces/get,list,watch`, `replicasets/get,list,watch`.
+- `tail_sampling` is the most common production sampling mode because it samples on complete trace properties (root span attributes, total duration). The critical caveat: **changes are not retroactive** — already-collected windows do not re-sample, so a sampling change creates a discontinuity in observed trace counts.
+- `Instrumentation` CR removal is invisible to running pods; the next pod restart silently starts without auto-instrumentation. Many silent SLO regressions trace back to an `Instrumentation` CR being removed during a "cleanup".
+- The Target Allocator is required for any `mode: statefulset` Prometheus collector serving more than a handful of scrape targets. Without it, every replica scrapes every target and the data is duplicated.
+- Auto-instrumentation images are pinned per language (Java, Node.js, Python, .NET, Go). When the application's runtime version moves ahead of the instrumentation image, instrumentation can fail to load silently. Treat the auto-instrumentation image versions as a cataloged dependency.
+- The collector exposes its own metrics on `:8888/metrics`. The most useful Prometheus series for diagnosing pipeline health: `otelcol_exporter_send_failed_spans`, `otelcol_processor_dropped_spans`, `otelcol_receiver_refused_spans`, `otelcol_processor_batch_batch_send_size`. Exact names vary by Collector version (newer releases may append `_total` to counters), so confirm against the live `/metrics` output. Any non-zero value on the failure counters is a finding.
+- The `debug` exporter (formerly `logging` exporter) prints to the collector's stdout and is meant for development. It is a frequent silent failure mode in production when someone replaced a real exporter with `debug` for debugging and forgot to restore it.

package/skills/opentelemetry/opentelemetry-collector-config-review/references/workflow-and-output.md ADDED

@@ -0,0 +1,155 @@
+# Workflow and Output Contract
+## Workflow
+### Step 1 — Identify the deployment mode
+`OpenTelemetryCollector` supports four deployment modes, each appropriate for different use cases:
+1. **`mode: deployment`** — collector runs as a stateless `Deployment`, multiple replicas. Use for OTLP gateway / aggregation; NOT for hostmetrics.
+2. **`mode: statefulset`** — ordered, stable identity. Required for Target Allocator (sharding Prometheus scrape jobs across collectors).
+3. **`mode: daemonset`** — one collector per node. Use for hostmetrics, filelog (node-local logs), and per-node OTLP receiver.
+4. **`mode: sidecar`** — injected into application pods via annotation `sidecar.opentelemetry.io/inject: <name>`. Use for short-lived workloads or when application cannot reach a cluster-wide collector.
+Common mismatches that are findings:
+- `mode: deployment` with `hostmetrics` receiver — only one replica gets host data; data is incomplete.
+- `mode: daemonset` with HTTP receiver bound to `0.0.0.0:4318` — every node opens a port; verify network policy.
+- `mode: statefulset` without Target Allocator — wastes the ordered identity.
+- `mode: sidecar` for high-volume workloads — every pod runs a collector; CPU/memory cost multiplies.
+Reference: [Operator Modes](https://opentelemetry.io/docs/kubernetes/operator/) and the operator README in [open-telemetry/opentelemetry-operator](https://github.com/open-telemetry/opentelemetry-operator).
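The mode mismatches listed in Step 1 lend themselves to a mechanical first pass. This is a minimal Python sketch, assuming the caller has already extracted `spec.mode`, the receiver names, and the Target Allocator flag from the `OpenTelemetryCollector` resource. `mode_findings` and its message strings are hypothetical.

```python
def mode_findings(mode, receivers, target_allocator_enabled=False):
    """Flag deployment-mode and receiver combinations that Step 1 treats as findings."""
    # Receiver names may carry a suffix such as "otlp/internal"; compare on the type only.
    types = {name.split("/")[0] for name in receivers}
    findings = []
    if mode == "deployment" and "hostmetrics" in types:
        findings.append("hostmetrics receiver in deployment mode: host data is incomplete")
    if mode == "statefulset" and "prometheus" in types and not target_allocator_enabled:
        findings.append("statefulset prometheus scraping without Target Allocator")
    if mode == "daemonset" and "otlp" in types:
        findings.append("daemonset OTLP receiver: verify network policy on every node")
    return findings
```

An empty list only means none of these specific patterns matched; it does not confirm the mode suits the workload.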
+### Step 2 — Audit the receivers
+Receivers ingest telemetry. Common patterns:
+- **`otlp`** — gRPC (`:4317`) and HTTP (`:4318`). Standard. Verify both protocols are needed; otherwise narrow.
+- **`prometheus`** — scrapes Prometheus endpoints. Pair with Target Allocator at scale.
+- **`hostmetrics`** — node CPU, memory, disk, network. Requires `hostNetwork` or volume mounts (`/hostfs`).
+- **`filelog`** — reads pod/container logs. Requires `/var/log/pods` mount.
+- **`k8s_cluster`** — cluster-level metrics (deployment status, node conditions). Requires RBAC.
+- **`kubeletstats`** — kubelet per-node stats. Requires kubelet TLS configuration.
+Findings to flag:
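+- `otlp` receiver with no `tls` block (plaintext listener) and inbound traffic from untrusted networks: telemetry can be intercepted or tampered with. `insecure` is a client-side exporter setting; a receiver is secured by configuring a certificate and key.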
+- `prometheus` receiver scraping endpoints with secrets in the response (rare; some vendor exporters do this) — sensitive data flows into the pipeline.
+- `filelog` without a `multiline` config for stack traces — multi-line logs split into single-line entries.
+### Step 3 — Audit the processors (the safety net)
+Processors transform data between receiver and exporter. **Two are essentially mandatory in production**:
+1. **`memory_limiter`** — drops data when collector memory exceeds a threshold. Without it, collector OOMs under load and loses everything in flight. Recommended position: **first** in the pipeline.
+2. **`batch`**: batches data before export. Without it, data is exported in the same small payloads it arrived in; backend rate limits or per-request overhead hurt throughput. Recommended position: **last** before export.
+Other commonly required processors:
+- **`k8sattributes`** — enriches data with `k8s.namespace.name`, `k8s.pod.name`, `k8s.deployment.name`, `k8s.node.name`. Without it, dashboards and SLOs cannot group by Kubernetes object.
+- **`resource`** — sets static resource attributes (e.g., `cluster.name`, `deployment.environment`).
+- **`resourcedetection`** — auto-detects from environment, system, docker, kubernetes, GCP, AWS, Azure metadata services.
+- **`tail_sampling`** — keeps a sample of complete traces. **Critical caveat: changes are not retroactive — already-collected windows do not get re-sampled.**
+- **`filter`** — drops spans/metrics by attribute. Risk: a typo can drop everything.
+- **`transform`** — modifies attribute values via OTTL. Risk: a bad OTTL expression can corrupt every signal.
+- **`probabilistic_sampler`** — randomly samples a percentage. Simpler than tail sampling but loses correlated traces.
+Stress-tests:
+- Pipeline with no `memory_limiter` and high-volume traces — collector OOMs on burst, loses everything.
+- Pipeline with `memory_limiter` placed **after** other processors — those processors run on data that should have been dropped, wasting CPU.
+- Pipeline with `batch` placed **before** `tail_sampling`: the collector spends effort batching spans that sampling then discards, and `tail_sampling` regroups spans by trace anyway; keep `batch` after sampling.
+- Pipeline with `k8sattributes` `auth_type: serviceAccount` but no RBAC granting `pods/get,list,watch` — enrichment fails silently.
+Reference: [Collector configuration](https://opentelemetry.io/docs/collector/configuration/) and [Collector processors](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor).
+### Step 4 — Audit the exporters
+Exporters send data to backends. Findings:
+- **No exporter on a pipeline**: config validation rejects it and the collector does not start, so nothing is collected. Confirm at least one non-`debug` exporter per pipeline.
+- **Only `debug` exporter** in production — data prints to collector logs and is not sent anywhere. Useful for testing only.
+- **`tls.insecure: true`** on a production exporter — telemetry flows in plaintext. PII/PHI leak path.
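+- **Undersized or unreviewed `sending_queue`**: exporters built on the standard exporter helper enable the queue by default, but it is bounded and in-memory; when the backend is slow it fills and new data is rejected. Confirm `queue_size` covers the outage you need to ride out.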
+- **`sending_queue.enabled: false`** explicitly — telemetry is lost on any backend hiccup.
+- **`retry_on_failure.enabled: false`** — temporary network failures lose data.
+- **`prometheusremotewrite` exporter without `external_labels`** — multiple collectors write to the same Prometheus, time series collide.
+Reference: [Exporter configuration patterns](https://opentelemetry.io/docs/collector/configuration/#exporters).
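Several of the exporter findings in Step 4 can be checked directly against the parsed collector config. This is a minimal Python sketch, assuming `exporters` is the top-level `exporters` mapping and `pipeline_exporters` is one pipeline's exporter list, both already loaded from YAML into Python objects. `exporter_findings` and its messages are hypothetical.

```python
def exporter_findings(exporters, pipeline_exporters):
    """Flag exporter issues from the Step 4 list for one production pipeline."""
    findings = []
    # Exporter names may carry a suffix such as "otlp/vendor"; "debug" variants do not count.
    real = [n for n in pipeline_exporters if n.split("/")[0] != "debug"]
    if not real:
        findings.append("pipeline has no non-debug exporter")
    for name in real:
        cfg = exporters.get(name) or {}
        if (cfg.get("tls") or {}).get("insecure") is True:
            findings.append(f"{name}: tls.insecure is true")
        if (cfg.get("sending_queue") or {}).get("enabled") is False:
            findings.append(f"{name}: sending_queue disabled")
        if (cfg.get("retry_on_failure") or {}).get("enabled") is False:
            findings.append(f"{name}: retry_on_failure disabled")
    return findings
```

The sketch only reports explicit `false` values, since an absent block falls back to the exporter's own defaults.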
+### Step 5 — Audit the `service.pipelines` ordering
+Three signal pipelines (`traces`, `metrics`, `logs`) compose receivers → processors → exporters. Order in the `processors` list **matters** — it is the execution order.
+Recommended order for a traces pipeline:
+```yaml
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      processors:
+        - memory_limiter        # 1. drop early under pressure
+        - resourcedetection     # 2. detect environment
+        - k8sattributes         # 3. enrich with K8s context
+        - resource              # 4. add static attributes
+        - tail_sampling         # 5. sample after enrichment
+        - batch                 # 6. batch last
+      exporters: [otlp]         # add debug only outside production
+```
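+Common findings: `batch` not last, `memory_limiter` not first, `k8sattributes` after `tail_sampling` (sampling policies cannot match on `k8s.*` attributes, and spans regrouped by the sampler lose the connection context that `k8sattributes` relies on for pod association).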
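The ordering rules in Step 5 can be expressed as a short check over a pipeline's `processors` list. This is a minimal Python sketch, assuming the list has already been read from `service.pipelines.<name>.processors`. `processor_order_findings` and its messages are hypothetical.

```python
def processor_order_findings(processors):
    """Flag processor-ordering problems for one pipeline."""
    # Strip instance suffixes such as "batch/traces" so only the processor type is compared.
    types = [p.split("/")[0] for p in processors]
    findings = []
    if "memory_limiter" not in types:
        findings.append("memory_limiter missing")
    elif types[0] != "memory_limiter":
        findings.append("memory_limiter not first")
    if "batch" not in types:
        findings.append("batch missing")
    elif types[-1] != "batch":
        findings.append("batch not last")
    if ("k8sattributes" in types and "tail_sampling" in types
            and types.index("k8sattributes") > types.index("tail_sampling")):
        findings.append("k8sattributes after tail_sampling")
    return findings
```

Run against the recommended traces order, the function returns an empty list.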
+### Step 6 — Audit the `Instrumentation` CR
+The `Instrumentation` CR (`opentelemetry.io/v1alpha1`) drives auto-instrumentation. Pods are instrumented when they have one of the annotations: `instrumentation.opentelemetry.io/inject-java`, `inject-nodejs`, `inject-python`, `inject-dotnet`, `inject-go`, or `inject-sdk`.
+Critical concerns:
+- **Removing an `Instrumentation` CR while pods reference it** — running pods continue working, but on next restart the init container injection fails, and the pod starts without instrumentation. Telemetry stops silently.
+- **Image tag drift** — auto-instrumentation images are pinned per language. If the application moves to a newer runtime (e.g., Java 21) but the auto-instrumentation image hasn't been updated, instrumentation may not load.
+- **`exporter.endpoint` pointing to a collector that no longer exists** — telemetry calls fail; application logs may show OTLP export errors.
+- **`sampler.type: parentbased_traceidratio` with `argument: "0.0"`** — samples nothing.
+- **Missing `propagators`** — distributed traces don't link across services.
+- **`resource.resourceAttributes.deployment.environment` not set** — every environment looks the same in dashboards.
+Reference: [Operator auto-instrumentation](https://opentelemetry.io/docs/kubernetes/operator/automatic/).
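Some of the `Instrumentation` concerns in Step 6 are visible in the CR spec alone. This is a minimal Python sketch, assuming `spec` is the `spec` section of an `Instrumentation` resource parsed into a dict. `instrumentation_findings` and its messages are hypothetical, and the sketch does not check whether the exporter endpoint is reachable.

```python
def instrumentation_findings(spec):
    """Flag Instrumentation spec issues that can be detected without cluster access."""
    findings = []
    sampler = spec.get("sampler") or {}
    sampler_type = str(sampler.get("type") or "")
    argument = sampler.get("argument")
    # Ratio-based samplers with a ratio of zero record no traces at all.
    if sampler_type.endswith("traceidratio") and argument is not None:
        try:
            if float(argument) == 0.0:
                findings.append("sampler ratio is 0: samples nothing")
        except ValueError:
            findings.append("sampler argument is not a number")
    if not spec.get("propagators"):
        findings.append("propagators missing: traces will not link across services")
    attrs = (spec.get("resource") or {}).get("resourceAttributes") or {}
    if "deployment.environment" not in attrs:
        findings.append("deployment.environment not set")
    if not (spec.get("exporter") or {}).get("endpoint"):
        findings.append("exporter.endpoint not set")
    return findings
```

Checks that need live state, such as image tag drift against the application runtime, still require the evidence commands rather than a spec scan.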
+### Step 7 — Audit the Target Allocator (StatefulSet mode)
+When `targetAllocator.enabled: true`, Prometheus scrape jobs are sharded across the StatefulSet replicas. Findings:
+- `targetAllocator.allocationStrategy`: `least-weighted` gives even distribution; `consistent-hashing` is better for re-shard stability. The default has differed between Operator releases, so set it explicitly rather than relying on it.
+- `targetAllocator.prometheusCR.enabled: true` requires `ServiceMonitor`/`PodMonitor` selectors. An empty selector matches everything; a too-narrow selector matches nothing.
+- Missing RBAC for the Target Allocator — it cannot list ServiceMonitors and silently scrapes nothing.
+Reference: [Target Allocator](https://opentelemetry.io/docs/kubernetes/operator/target-allocator/).
+### Step 8 — Stress-test operational hygiene
+- Prefer `v1beta1` `OpenTelemetryCollector` over `v1alpha1`: it is the current API version.
+- Prefer named pipelines that match the source data shape (`traces/api`, `metrics/host`, `logs/app`) when one collector handles multiple streams.
+- Use the `debug` exporter only in non-production environments.
+- Prefer `OTEL_RESOURCE_ATTRIBUTES` env propagation in `Instrumentation` over hardcoded values — makes the CR portable across environments.
+- Test pipeline changes by sending synthetic OTLP data and watching the collector's `otelcol_` self-metrics; `otelcol_exporter_send_failed_spans` should stay at zero.
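The self-metric check in the last item can be automated against a scrape of the collector's metrics endpoint. The sketch below is illustrative only: `failed_span_counts` is a hypothetical helper, and it assumes the text is in Prometheus exposition format without trailing timestamps. It matches on the metric name prefix, so a `_total` suffix on the same counter is also picked up.

```python
METRIC_PREFIX = "otelcol_exporter_send_failed_spans"


def failed_span_counts(metrics_text: str) -> dict[str, float]:
    """Map each failed-span series (name plus labels) to its current value."""
    counts = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and unrelated metrics.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(METRIC_PREFIX):
            continue
        # Assumes "<series> <value>" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        counts[series] = float(value)
    return counts


def has_export_failures(metrics_text: str) -> bool:
    """True if any failed-span series is above zero."""
    return any(v > 0 for v in failed_span_counts(metrics_text).values())
```

Because the metric is a counter, comparing two scrapes taken before and after the synthetic traffic is more informative than a single reading on a long-running collector.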
+## Output
+Return:
+- **target**: which `OpenTelemetryCollector` (and mode) or `Instrumentation` CR,
+- **evidence level**: `live evidence` / `documentation-based` / `sanitized user evidence` / `inference`,
+- **deployment-mode appropriateness** for the use case,
+- **pipeline correctness**: receivers, processors (with explicit `memory_limiter` and `batch` audit), exporters,
+- **failure mode**: what happens when backend is unreachable or backed up,
+- **risk findings** (with severity: high / medium / low),
+- **safest next actions** with sample manifest changes and self-metric expectations,
+- **rollback plan**: how to revert without losing the in-flight buffer,
+- **assumptions and missing facts**.
+## Security notes
+- Never recommend removing `memory_limiter` from a production pipeline.
+- Never recommend `tls.insecure: true` on a production exporter shipping data outside the cluster.
+- Never recommend deleting an `Instrumentation` CR without first confirming no running deployments reference it via annotation.
+- Do not print collector authentication tokens or vendor API keys; reference them by configuration key only.

package/skills/prometheus/prometheus-alerting-cardinality-review/SKILL.md ADDED

@@ -0,0 +1,38 @@
+---
+name: prometheus-alerting-cardinality-review
+description: Use this skill when reviewing Prometheus or AlertManager configuration for cardinality, alerting correctness, scrape security, remote_write safety, or retention adequacy. Trigger when a user provides prometheus.yml, alertmanager.yml, recording rules YAML, alerting rules YAML, or asks whether their Prometheus setup is production-ready.
+metadata:
+  author: "github: Raishin"
+  version: "0.1.0"
+---
+# Prometheus Alerting and Cardinality Review
+## Purpose
+This skill reviews Prometheus and AlertManager configuration for cardinality explosion risks, recording rule adequacy, alert expression correctness, routing tree safety, scrape configuration security, and retention posture. Cardinality explosion is the leading cause of Prometheus OOM crashes in production, and flapping alerts from missing `for:` durations erode on-call trust faster than any other alerting defect.
+## Lean operating rules
+- Flag any label dimension that is unbounded at the application level (e.g., `user_id`, `request_id`, `session_id`, `url_path`, `pod_hash`) — these cause cardinality explosion and must be moved off the label set or aggregated away.
+- Treat `prometheus_tsdb_head_series` exceeding 5 million as a cardinality warning threshold; note it if the user reports series counts or if the config makes it likely.
+- Treat any alert rule with `for: 0m`, `for: 0s`, or no `for:` field as HIGH — bare threshold alerts flap on every scrape jitter.
+- Treat `honor_labels: true` on any scrape target that is not a trusted federation endpoint as HIGH — it allows the scraped workload to override `job` and `instance` labels.
+- Treat any scrape config that targets a host outside the cluster over plain HTTP (for example `http://external-host`) as a potential SSRF candidate and flag it.
+- Recording rules are required for any PromQL expression used in dashboards or SLO burn-rate calculations; flag their absence as MEDIUM.
+- Multi-window multi-burn-rate (MWMB) alerting is the correct pattern for SLO breach detection; flag single-window SLO alerts as MEDIUM.
+- Flag `remote_write` configs whose `write_relabel_configs` drop labels other than internal `__`-prefixed ones; the resulting data loss is silent.
+- Flag retention under 30 days with no `remote_write` or Thanos/Cortex integration as MEDIUM compliance risk.
+- Do not recommend disabling any existing alert or recording rule without stating the specific reason and risk trade-off.
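Two of the rules above, the missing `for:` duration and `honor_labels: true`, lend themselves to a mechanical first pass. The sketch below is not part of the skill: the function names are hypothetical, and it assumes the rule groups and scrape configs have already been parsed from YAML into Python lists of dicts. Whether a job flagged for `honor_labels` is a trusted federation endpoint still needs human judgment.

```python
import re

# Matches zero durations such as "0", "0s", "0m".
_ZERO_DURATION = re.compile(r"^0+[a-z]*$")


def alerts_without_for(rule_groups: list[dict]) -> list[str]:
    """Return names of alert rules with no `for:` or a zero `for:` duration."""
    flagged = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no `for:` field
            raw = rule.get("for")
            duration = "" if raw is None else str(raw).strip()
            if duration == "" or _ZERO_DURATION.match(duration):
                flagged.append(rule["alert"])
    return flagged


def honor_labels_jobs(scrape_configs: list[dict]) -> list[str]:
    """Return job names that set honor_labels: true, for manual trust review."""
    return [
        str(sc.get("job_name"))
        for sc in scrape_configs
        if sc.get("honor_labels") is True
    ]
```

Both functions only list candidates; severity and the final finding text still follow the rules above.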
+## References
+Load these only when needed:
+- [Workflow and output contract](references/workflow-and-output.md) — use when executing the full review or formatting the final answer.
+## Response minimum
+Return, at minimum:
+- Cardinality risk assessment (label audit findings)
+- Alert expression correctness findings (`for:` duration, `absent()` misuse, MWMB posture)
+- AlertManager routing and inhibition findings
+- Scrape config security findings
+- Retention and remote_write findings
+- Severity-labelled finding list (critical / high / medium / low)
+- Safe next actions

package/skills/prometheus/prometheus-alerting-cardinality-review/metadata.json ADDED

@@ -0,0 +1,22 @@
+{
+  "id": "prometheus-alerting-cardinality-review",
+  "name": "Prometheus Alerting and Cardinality Review",
+  "type": "skill",
+  "provider": "prometheus",
+  "harnesses": ["codex", "claude-code", "cursor", "gemini", "kiro", "other"],
+  "summary": "Review Prometheus and AlertManager configuration for cardinality explosion, recording rules, alert expression correctness, routing, scrape security, and retention.",
+  "source_type": "original",
+  "official_docs": [
+    "https://prometheus.io/docs/prometheus/latest/querying/basics/",
+    "https://prometheus.io/docs/practices/naming/",
+    "https://prometheus.io/docs/practices/alerting/",
+    "https://prometheus.io/docs/alerting/latest/alertmanager/",
+    "https://prometheus.io/docs/prometheus/latest/storage/",
+    "https://prometheus.io/docs/practices/remote_write/"
+  ],
+  "security_notes": "honor_labels: true on untrusted scrape targets allows the scraped workload to override job/instance labels, enabling metric spoofing. Scrape configs pointing to external HTTP endpoints are SSRF candidates.",
+  "last_verified": "2026-05-02",
+  "path": "skills/prometheus/prometheus-alerting-cardinality-review",
+  "author": "github: Raishin",
+  "version": "0.1.0"
+}