npm - @raishin/vanguard-frontier-agentic - Versions diffs - 2.0.1 → 2.2.0 - Mend

@raishin/vanguard-frontier-agentic 2.0.1 → 2.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (467)

package/skills/qa/kubernetes-manifest-quality-review/SKILL.md ADDED

@@ -0,0 +1,92 @@
+---
+name: kubernetes-manifest-quality-review
+description: Use this skill when the user provides raw Kubernetes YAML manifests or asks to review K8s manifests for quality, security, or policy compliance — covering Deployment, StatefulSet, DaemonSet, Service, Ingress, NetworkPolicy, RBAC, and CRD resources.
+allowed-tools: Read Grep Glob
+metadata:
+  author: "github: Raishin"
+  version: "0.1.0"
+  updated: "2026-05-17"
+  category: delivery
+  lifecycle: experimental
+---
+# Kubernetes Manifest Quality Review
+## Purpose
+This skill reviews raw Kubernetes YAML manifests for quality, security, and policy-compliance defects. It covers Deployment, StatefulSet, DaemonSet, Service, Ingress, NetworkPolicy, RBAC, and CRD resources. The review is entirely static — it reads YAML files and never applies manifests to a cluster, never contacts the Kubernetes API, and never requests kubeconfig, service account tokens, or cloud credentials.
+## Lean operating rules
+### Schema and structure
+- `apiVersion` or `kind` missing — CRITICAL: the manifest cannot be applied; flag and stop review of that resource.
+- Deprecated API versions (e.g., `extensions/v1beta1`, `networking.k8s.io/v1beta1`, `policy/v1beta1` PodSecurityPolicy) — HIGH: these will be rejected by newer clusters.
+- Missing required labels (`app`, `app.kubernetes.io/name`, `app.kubernetes.io/version`) on Pods and workload controllers — MEDIUM: impairs observability, selector targeting, and policy enforcement.
+- No `namespace` specified (reliance on default namespace) — MEDIUM: encourages lateral movement and policy bypass; everything should be explicitly namespaced.
+### Pod security (Pod Security Standards)
+- `securityContext.runAsUser: 0` on a container, or no `runAsNonRoot: true` at pod or container level — HIGH: processes run as UID 0 inside the container.
+- `privileged: true` on a container security context — CRITICAL: the container has near-host-root access.
+- `allowPrivilegeEscalation: true` or field absent (escalation is allowed when the field is unset, and it is always effectively `true` for privileged containers or containers with `CAP_SYS_ADMIN`) — HIGH: child processes can gain more privileges than the parent.
+- `hostNetwork: true`, `hostPID: true`, `hostIPC: true` on the pod spec — CRITICAL: the pod shares the host network stack, process table, or IPC namespace, enabling broad host compromise.
+- `capabilities.add` containing `SYS_ADMIN`, `NET_ADMIN`, `ALL`, `SYS_PTRACE`, or `DAC_OVERRIDE` — CRITICAL: these capabilities provide near-root privilege; drop all capabilities and add only what is specifically required.
+- `readOnlyRootFilesystem: false` or field absent on a container — MEDIUM: a writable root filesystem makes container compromise easier; set to `true` and use `emptyDir` or volume mounts for mutable paths.
+- `seccompProfile` absent at pod or container level — MEDIUM: no syscall filtering, increasing the kernel attack surface; use `RuntimeDefault` or a custom profile.
+### Image hygiene
+- Image tag is `:latest` or absent — HIGH: non-reproducible deployments; a rollout can silently pull a different image than what was tested.
+- No image digest pinning for production manifests — MEDIUM: tag mutability allows supply-chain substitution; prefer `image@sha256:<digest>`.
+- Image pulled from an unverified public registry (e.g., Docker Hub) without a digest — MEDIUM: arbitrary public images without integrity verification; `imagePullPolicy` only controls when the kubelet pulls the image, so it is not a substitute for a digest.
+### Resource governance
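+- `resources.requests` and `resources.limits` both absent on a container — HIGH: the scheduler cannot account for the container's usage, the pod is a first candidate for eviction under node pressure (BestEffort QoS when no container in the pod sets requests or limits), and it can starve co-located workloads.
+- Memory limit set without a CPU limit — MEDIUM: without a CPU limit the container is never throttled and can consume all idle CPU on the node, which makes the latency of co-located workloads unpredictable with no visible error.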
+- Ephemeral storage limit absent on containers known to produce logs or temp files — LOW: unbounded ephemeral storage can exhaust node disk and evict other pods.
+### Health probes
+- `livenessProbe` missing — HIGH: the kubelet restarts a container only when its process exits, so deadlocks and hangs in a still-running process go undetected.
+- `readinessProbe` missing — HIGH: the pod is marked Ready as soon as its containers start, so Services route traffic to it before the application can serve, causing errors during startup and rolling updates.
+- Probe using `exec` command with no `timeoutSeconds` specified — MEDIUM: exec probes default to a 1-second timeout; a slow command silently causes probe failures and restarts.
+### Networking and exposure
+- Service type `LoadBalancer` or `NodePort` without a comment or annotation documenting the business justification — MEDIUM: these expose services externally or on every node port; ClusterIP is sufficient for internal services.
+- Ingress resource with no TLS block configured — HIGH: traffic between the client and the ingress controller is unencrypted.
+- No `NetworkPolicy` resource restricts pod ingress or egress in the namespace — MEDIUM: the default Kubernetes network model is allow-all; without a NetworkPolicy every pod can reach every other pod.
+- Ingress annotation `nginx.ingress.kubernetes.io/use-proxy-protocol` or similar annotation that forwards arbitrary upstream headers into backend requests from untrusted input — CRITICAL: enables SSRF and header injection.
+### RBAC and service accounts
+- `ClusterRole` with verb `*` on resource `*` or on `secrets` — CRITICAL: any principal bound to this role has full cluster read/write access.
+- `RoleBinding` or `ClusterRoleBinding` whose subject is `system:anonymous` or `system:unauthenticated` — CRITICAL: unauthenticated callers inherit these permissions.
+- `automountServiceAccountToken: true` (or field absent, which defaults to `true`) on pods that do not contact the Kubernetes API — HIGH: the token is mounted at a known path and exploitable if the container is compromised.
+- RBAC role granting `get` or `list` on `secrets` beyond what the workload demonstrably needs — HIGH: broadens blast radius of a credential compromise.
+### Secrets and config
+- Plaintext credentials (passwords, tokens, connection strings) in `env.value` on a container or in `ConfigMap.data` — CRITICAL: credentials visible in manifests committed to source control or stored in etcd in plaintext.
+- `Secret` with `type: Opaque` and a base64-encoded value that decodes to an empty string — MEDIUM: placeholder secret that will cause application startup failures and suggests secrets management is not wired up.
+## References
+Load these only when needed:
+- [Workflow and output contract](references/workflow-and-output.md) — use when executing the full review or formatting the final answer.
+## Response minimum
+Return, at minimum:
+- Schema and API version findings
+- Pod security findings (PSS Restricted/Baseline comparison)
+- Image hygiene findings
+- Resource governance findings
+- Health probe findings
+- Networking and exposure findings
+- RBAC and service account findings
+- Secrets and config findings
+- Severity-labelled finding list (CRITICAL / HIGH / MEDIUM / LOW)
+- Safe next actions
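
The pod security rules in this SKILL.md can be read as a small set of checks over a parsed pod spec. The following is a minimal sketch, not part of the package: it assumes the manifest has already been parsed into a Python dict, and the function name `pod_security_findings` and the message strings are hypothetical.

```python
# Illustrative sketch only; names and messages are assumptions, not package code.
DANGEROUS_CAPS = {"SYS_ADMIN", "NET_ADMIN", "ALL", "SYS_PTRACE", "DAC_OVERRIDE"}


def pod_security_findings(pod_spec):
    """Return (severity, message) tuples for one parsed pod spec (a dict)."""
    findings = []
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(field) is True:
            findings.append(("CRITICAL", f"{field}: true"))

    pod_sc = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        if sc.get("privileged") is True:
            findings.append(("CRITICAL", f"{name}: privileged container"))

        added = set((sc.get("capabilities") or {}).get("add") or [])
        if added & DANGEROUS_CAPS:
            caps = ", ".join(sorted(added & DANGEROUS_CAPS))
            findings.append(("CRITICAL", f"{name}: dangerous capabilities ({caps})"))

        # A container-level value overrides the pod-level value when both exist.
        run_as_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        run_as_user = sc.get("runAsUser", pod_sc.get("runAsUser"))
        if run_as_user == 0 or run_as_non_root is not True:
            findings.append(("HIGH", f"{name}: may run as root"))

        # These two fields exist only at container level; absent counts as unsafe.
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(("HIGH", f"{name}: allowPrivilegeEscalation not false"))
        if sc.get("readOnlyRootFilesystem") is not True:
            findings.append(("MEDIUM", f"{name}: writable root filesystem"))

        if "seccompProfile" not in sc and "seccompProfile" not in pod_sc:
            findings.append(("MEDIUM", f"{name}: no seccompProfile"))
    return findings
```

The sketch looks only at `containers`; a fuller version would presumably apply the same checks to `initContainers` and `ephemeralContainers`.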

package/skills/qa/kubernetes-manifest-quality-review/metadata.json ADDED

@@ -0,0 +1,23 @@
+{
+  "id": "kubernetes-manifest-quality-review",
+  "name": "Kubernetes Manifest Quality Review",
+  "type": "skill",
+  "provider": "generic",
+  "harnesses": ["codex", "claude-code", "cursor", "gemini", "kiro", "other"],
+  "summary": "Review raw Kubernetes YAML manifests for security, quality, and policy defects — deprecated APIs, missing securityContext, absent resource limits, missing health probes, RBAC over-permission, plaintext secrets, and network exposure — statically, without applying manifests or contacting a cluster.",
+  "source_type": "original",
+  "official_docs": [
+    "https://kubernetes.io/docs/concepts/security/pod-security-standards/",
+    "https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/",
+    "https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/",
+    "https://kubernetes.io/docs/reference/access-authn-authz/rbac/",
+    "https://kubernetes.io/docs/concepts/services-networking/network-policies/",
+    "https://github.com/yannh/kubeconform",
+    "https://github.com/zegl/kube-score"
+  ],
+  "security_notes": "Static review only — reads manifest YAML files, never applies manifests to a cluster, never connects to the Kubernetes API, and never requests kubeconfig, service account tokens, or cloud credentials. Do not accept manifests containing real secret values or connection strings decoded from base64; ask for sanitized versions with placeholder values.",
+  "last_verified": "2026-05-17",
+  "path": "skills/qa/kubernetes-manifest-quality-review",
+  "author": "github: Raishin",
+  "version": "0.1.0"
+}
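
The `security_notes` field asks for sanitized manifests with placeholder values, and the skill treats a Secret value that decodes to an empty string as a MEDIUM finding. A minimal sketch of that check follows, assuming the Secret has already been parsed into a dict; the helper name `empty_secret_keys` is hypothetical and not part of the package.

```python
import base64


def empty_secret_keys(secret):
    """Return the keys of a parsed Secret whose value is empty after decoding."""
    empty = []
    # `data` values are base64-encoded strings.
    for key, value in (secret.get("data") or {}).items():
        try:
            decoded = base64.b64decode(value or "", validate=True)
        except ValueError:
            continue  # not valid base64; a different finding, out of scope here
        if decoded.strip() == b"":
            empty.append(key)
    # `stringData` values are stored unencoded.
    for key, value in (secret.get("stringData") or {}).items():
        if not (value or "").strip():
            empty.append(key)
    return sorted(empty)
```

Treating whitespace-only values as empty is a design choice of this sketch; a stricter reading of the rule would flag only the zero-length case.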

package/skills/qa/kubernetes-manifest-quality-review/references/workflow-and-output.md ADDED

@@ -0,0 +1,246 @@
+# Workflow and Output Contract
+## Workflow
+### Step 1 — Collect inputs
+Ask the user to provide one or more of the following as sanitized files (no real secret values, no kubeconfig, no service account tokens, no cloud credentials — replace sensitive values with placeholders):
+- Workload manifests: Deployment, StatefulSet, DaemonSet YAML
+- Service and Ingress YAML
+- NetworkPolicy YAML
+- RBAC resources: Role, ClusterRole, RoleBinding, ClusterRoleBinding YAML
+- CRD definitions if relevant
+- Any Kustomize base and overlay files
+If NetworkPolicy resources are not provided, the egress/ingress audit findings are stated as `inference` — say so and ask for them.
+### Step 2 — Schema and API version audit
+Validate that every manifest has `apiVersion` and `kind` present. Check for deprecated or removed API versions:
+```yaml
+# HIGH — extensions/v1beta1 Ingress removed in Kubernetes 1.22
+apiVersion: extensions/v1beta1
+kind: Ingress
+---
+# HIGH — networking.k8s.io/v1beta1 Ingress removed in 1.22
+apiVersion: networking.k8s.io/v1beta1
+kind: Ingress
+---
+# HIGH — policy/v1beta1 PodSecurityPolicy removed in 1.25
+apiVersion: policy/v1beta1
+kind: PodSecurityPolicy
+```
+Check that required labels are present on Pod templates and workload controllers: `app`, `app.kubernetes.io/name`, `app.kubernetes.io/version`. Flag missing `namespace` on all resources.
+### Step 3 — Pod security audit (PSS Restricted/Baseline comparison)
+Evaluate each Pod spec against the Pod Security Standards Restricted profile:
+```yaml
+# CRITICAL — privileged container
+securityContext:
+  privileged: true
+# CRITICAL — host namespaces
+hostNetwork: true
+hostPID: true
+hostIPC: true
+# HIGH — runAsUser: 0 or missing runAsNonRoot
+securityContext:
+  runAsUser: 0
+  # or: runAsNonRoot absent
+# HIGH — allowPrivilegeEscalation unset or true
+securityContext:
+  allowPrivilegeEscalation: true
+# CRITICAL — dangerous capabilities
+securityContext:
+  capabilities:
+    add: ["SYS_ADMIN"]
+# MEDIUM — writable root filesystem
+securityContext:
+  readOnlyRootFilesystem: false
+  # or: field absent
+# MEDIUM — no seccomp profile
+securityContext:
+  # seccompProfile absent
+```
+For each container in the pod, note whether the field is set at the pod level, the container level, or both. Container-level settings override pod-level settings.
+### Step 4 — Image hygiene audit
+Check every container and init container image reference:
+```yaml
+# HIGH — mutable tag, non-reproducible
+image: nginx:latest
+image: myapp   # tag absent
+# MEDIUM — no digest pinning
+image: nginx:1.25.3   # tag present but no @sha256 digest
+# MEDIUM — unverified public registry, no digest
+image: docker.io/library/nginx:1.25.3
+```
+For production-grade manifests, recommend digest-pinned images:
+```yaml
+image: nginx:1.25.3@sha256:<digest>
+```
+### Step 5 — Resource governance audit
+Check every container for `resources.requests` and `resources.limits`:
+```yaml
+# HIGH — no requests or limits
+containers:
+  - name: app
+    image: myapp:1.0.0
+    # resources absent
+# MEDIUM — memory limit set without CPU limit
+resources:
+  limits:
+    memory: 512Mi
+  requests:
+    cpu: 100m
+    memory: 256Mi
+  # limits.cpu absent
+```
+Check for ephemeral storage limits on containers known to produce log output or temporary files.
+### Step 6 — Health probe audit
+Check every container for `livenessProbe` and `readinessProbe`:
+```yaml
+# HIGH — missing livenessProbe
+containers:
+  - name: app
+    # livenessProbe absent
+# HIGH — missing readinessProbe
+containers:
+  - name: app
+    # readinessProbe absent
+# MEDIUM — exec probe with no timeoutSeconds
+livenessProbe:
+  exec:
+    command: ["/bin/check"]
+  # timeoutSeconds absent, defaults to 1 second
+```
+### Step 7 — Networking and exposure audit
+Review Service types, Ingress TLS, NetworkPolicy coverage, and Ingress annotations:
+```yaml
+# MEDIUM — external exposure without documented justification
+kind: Service
+spec:
+  type: LoadBalancer   # or NodePort
+# HIGH — Ingress without TLS
+kind: Ingress
+spec:
+  # tls block absent
+# MEDIUM — no NetworkPolicy found in namespace (default allow-all)
+# CRITICAL — SSRF-enabling Ingress annotation
+metadata:
+  annotations:
+    nginx.ingress.kubernetes.io/use-proxy-protocol: "true"
+```
+If no NetworkPolicy resources are provided for the namespace, state that the default-allow posture is inferred and ask for NetworkPolicy files.
+### Step 8 — RBAC and secrets audit
+Review ClusterRole, Role, RoleBinding, ClusterRoleBinding, and Secret resources:
+```yaml
+# CRITICAL — wildcard verbs on wildcard resources
+rules:
+  - apiGroups: ["*"]
+    resources: ["*"]
+    verbs: ["*"]
+# CRITICAL — unauthenticated subject
+subjects:
+  - kind: Group
+    name: system:unauthenticated
+# HIGH — automount enabled on pods that do not need API access
+automountServiceAccountToken: true   # or field absent
+# HIGH — broad secret access
+rules:
+  - resources: ["secrets"]
+    verbs: ["get", "list"]
+# CRITICAL — plaintext credentials in env
+env:
+  - name: DB_PASSWORD
+    value: "mysecretpassword"
+# MEDIUM — empty-string secret value
+data:
+  password: ""   # decodes to empty
+```
+---
+## Output
+Return findings in this structure:
+```
+## Verdict
+<one sentence: manifests pass baseline / manifests have blocking security defects / manifests need remediation before production>
+## Evidence level
+<manifest files provided | partial manifests only | inference for missing resources>
+## Findings
+### CRITICAL
+- [C1] <resource name> — <finding>: <description> — <remediation>
+### HIGH
+- [H1] <resource name> — <finding>: <description> — <remediation>
+### MEDIUM
+- [M1] <resource name> — <finding>: <description> — <remediation>
+### LOW
+- [L1] <resource name> — <finding>: <description> — <remediation>
+## Safe next actions
+1. <action>
+2. <action>
+## Open questions
+- <question requiring user clarification>
+```
+---
+## Security notes
+- Never request or accept kubeconfig, service account tokens, cloud credentials, or actual secret values. Ask for sanitized manifests with placeholder values in Secret resources.
+- This is a static review: do not apply manifests, run `kubectl`, or contact any cluster.
+- A `privileged: true` container, `hostNetwork/hostPID/hostIPC: true`, or a ClusterRole with `*` verbs on `*` resources is the highest-impact finding class. Lead with it.
+- `RoleBinding` to `system:unauthenticated` or `system:anonymous` is a critical exposure; tell the user to remove it immediately.
+- Plaintext credentials in `env.value` or `ConfigMap.data` should be replaced with `secretKeyRef` references; never recommend committing real credentials even in base64.
+- Do not recommend disabling probes or relaxing securityContext fields to pass short-term validation — recommend the correct secure configuration and explain the rationale.
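
Step 4 of this workflow classifies image references by tag and digest. The following is a minimal sketch of that classification, assuming plain string parsing is sufficient for the references under review; `image_findings` and its messages are hypothetical names, not part of the package.

```python
def image_findings(image):
    """Classify one container image reference per the Step 4 rules."""
    findings = []
    name, _, digest = image.partition("@")
    # A tag is whatever follows ':' in the last path segment; an earlier ':'
    # belongs to a registry port, as in registry.local:5000/app.
    last_segment = name.rsplit("/", 1)[-1]
    _, _, tag = last_segment.partition(":")
    if not digest:
        if tag in ("", "latest"):
            findings.append(("HIGH", "mutable or missing tag"))
        findings.append(("MEDIUM", "no digest pin"))
    return findings
```

A digest-pinned reference is treated here as reproducible even if its tag is `latest` or absent, because the digest fixes the content. Whether that matches the intended policy is a judgment call for the reviewer.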

package/skills/qa/llm-ai-pipeline-test-review/SKILL.md ADDED

@@ -0,0 +1,52 @@
+---
+name: llm-ai-pipeline-test-review
+description: Use this skill when reviewing how an LLM or AI pipeline is evaluated — metric selection, golden datasets, threshold governance, adversarial coverage, and regression gating — to determine whether low-quality or unsafe model outputs can ship undetected. Trigger when a user provides evaluation configuration files, DeepEval or RAGAS test scripts, eval CI steps, or asks whether their AI pipeline actually prevents a bad model from reaching production. This skill reviews evaluation setup statically; it does not call LLM APIs, run evaluations, or contact inference endpoints.
+allowed-tools: Read Grep Glob
+metadata:
+  author: "github: Raishin"
+  version: "0.1.0"
+  updated: "2026-05-17"
+  category: ai
+  lifecycle: experimental
+---
+# LLM AI Pipeline Test Review
+## Purpose
+This skill reviews how an LLM or AI pipeline is evaluated — not the model itself, but the evaluation setup that decides whether a model change is safe to ship. An evaluation suite only protects users if it measures the right things, gates on meaningful thresholds, covers adversarial inputs, and detects drift across model versions. The review catches missing hallucination and factuality metrics, absent answer-relevancy and faithfulness checks for RAG pipelines, unguarded bias and toxicity, no adversarial or red-team coverage, agent evals that ignore tool correctness and task completion, thresholds that are undefined or set to zero, single-shot evals on non-deterministic outputs, and no regression baseline to detect metric drift.
+## Lean operating rules
+- Treat a RAG or summarisation pipeline with no `HallucinationMetric` or no GEval with factuality criteria against source documents as HIGH — the pipeline can fabricate facts and ship them.
+- Treat a pipeline with no golden dataset (fixed reference set for regression) as HIGH — metric drift across model versions is undetectable.
+- Treat the absence of `AnswerRelevancyMetric` as MEDIUM — responses may be fluent but off-topic, and no eval catches it.
+- Treat a RAG pipeline with no `FaithfulnessMetric` as HIGH — the model can ignore retrieved context and hallucinate; faithfulness is the primary RAG correctness signal.
+- Treat missing `ContextualPrecisionMetric` or `ContextualRecallMetric` in a RAG pipeline as MEDIUM — retrieval quality is unmeasured; noisy or incomplete retrieval is invisible to the eval.
+- Treat the absence of `BiasMetric` or `ToxicityMetric` as HIGH if the system is user-facing — unsafe outputs can reach users without detection; treat as CRITICAL if the audience is vulnerable (children, medical patients, crisis users).
+- Treat no adversarial test cases and no red-team dataset as CRITICAL for agentic systems; HIGH for all other user-facing LLM products — prompt-injection and jailbreak paths are untested.
+- Treat agent evals with no `ToolCorrectnessMetric` as HIGH — the agent can call wrong tools silently and the eval still passes.
+- Treat multi-step agent evals with no `TaskCompletionMetric` as HIGH — end-to-end success is unmeasured even if individual steps look fine.
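+- Treat metric thresholds that are undefined, set to a value every output passes (0 for higher-is-better metrics such as faithfulness, 1 for lower-is-better metrics such as hallucination, bias, and toxicity), or not reviewed by a domain expert as HIGH — a vacuous threshold gates nothing; an unreviewed threshold is a guess.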
+- Treat evals that run only once per input on non-deterministic outputs (no pass@k or mean-score aggregation across multiple runs) as MEDIUM — a single lucky sample masks systematic failure.
+- Treat the absence of a golden dataset or scoring baseline that would detect metric regression across model versions as HIGH — a model update can silently degrade quality.
+- Treat static golden datasets that have never been rotated or supplemented with synthetic adversarial data as MEDIUM — a suite that tests the same inputs repeatedly stops finding new defects (the pesticide paradox).
+- Apply thresholds contextually: a faithfulness score of 0.7 may be acceptable for a joke generator and unacceptable for a medical chatbot — flag any threshold that appears copied from a tutorial without domain justification.
+- Define eval metrics early in the model selection process, not after a model is chosen — catching defects before model selection is always cheaper than retrofitting evals.
+- Label every finding with evidence basis: eval config provided, test script provided, documentation-based, or inference.
+- Static review only — read eval configs and test source; never call LLM APIs, never run evaluations, never request model API keys or inference endpoints.
+## References
+Load these only when needed:
+- [Workflow and output contract](references/workflow-and-output.md) — use when executing the full review or formatting the final answer.
+## Response minimum
+Return, at minimum:
+- Hallucination and factual correctness findings
+- Answer relevancy and faithfulness findings (especially for RAG pipelines)
+- Safety metric findings (bias, toxicity)
+- Adversarial and red-team coverage findings
+- Agent-specific metric findings (tool correctness, task completion)
+- Threshold governance and non-determinism findings
+- Regression gating findings (golden dataset, baseline)
+- Severity-labelled finding list (CRITICAL / HIGH / MEDIUM / LOW)
+- Safe next actions
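
The metric-presence rules in this SKILL.md amount to a lookup from pipeline traits to required metrics. The following is a minimal sketch, assuming the reviewer has already extracted the metric class names from the eval scripts; `coverage_findings` and its keyword flags are hypothetical, not part of the package.

```python
def coverage_findings(metrics, *, is_rag=False, is_agent=False,
                      user_facing=False, vulnerable_audience=False):
    """Map missing metric names to the severities listed in the operating rules."""
    present = set(metrics)
    findings = []

    def missing(name, severity):
        if name not in present:
            findings.append((severity, f"missing {name}"))

    missing("AnswerRelevancyMetric", "MEDIUM")
    if is_rag:
        missing("FaithfulnessMetric", "HIGH")
        missing("ContextualPrecisionMetric", "MEDIUM")
        missing("ContextualRecallMetric", "MEDIUM")
    if user_facing:
        # Escalate safety gaps when the audience is vulnerable.
        severity = "CRITICAL" if vulnerable_audience else "HIGH"
        missing("BiasMetric", severity)
        missing("ToxicityMetric", severity)
    if is_agent:
        missing("ToolCorrectnessMetric", "HIGH")
        missing("TaskCompletionMetric", "HIGH")
    return findings
```

The sketch leaves out the hallucination rule, since a GEval with factuality criteria can stand in for `HallucinationMetric` and detecting that requires reading the criteria text. It also treats every agent as multi-step for the `TaskCompletionMetric` check.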

package/skills/qa/llm-ai-pipeline-test-review/metadata.json ADDED

@@ -0,0 +1,23 @@
+{
+  "id": "llm-ai-pipeline-test-review",
+  "name": "LLM AI Pipeline Test Review",
+  "type": "skill",
+  "provider": "generic",
+  "harnesses": ["codex", "claude-code", "cursor", "gemini", "kiro", "other"],
+  "summary": "Review an LLM or AI pipeline's evaluation setup for test-quality defects — missing hallucination, relevancy, faithfulness, bias, toxicity, and tool-correctness metrics; absent golden datasets; unthresholded or single-shot evals; and no regression gate across model versions. Static review only.",
+  "source_type": "original",
+  "official_docs": [
+    "https://docs.confident-ai.com/",
+    "https://docs.confident-ai.com/docs/metrics-hallucination",
+    "https://docs.confident-ai.com/docs/metrics-answer-relevancy",
+    "https://docs.confident-ai.com/docs/metrics-faithfulness",
+    "https://docs.confident-ai.com/docs/metrics-bias",
+    "https://docs.confident-ai.com/docs/metrics-tool-correctness",
+    "https://www.istqb.org/certifications/certified-tester-foundation-level"
+  ],
+  "security_notes": "Static review only — reads eval configuration and test source; never calls LLM APIs, never runs evaluations, never requests model API keys or inference endpoints. Do not accept eval fixtures containing real user PII, private prompt chains, or model weights; ask for sanitized configurations.",
+  "last_verified": "2026-05-17",
+  "path": "skills/qa/llm-ai-pipeline-test-review",
+  "author": "github: Raishin",
+  "version": "0.1.0"
+}

package/skills/qa/llm-ai-pipeline-test-review/references/workflow-and-output.md ADDED

@@ -0,0 +1,221 @@
+# Workflow and Output Contract
+## Workflow
+### Step 1 — Collect inputs
+Ask the user to provide one or more of the following as sanitized files (no API keys, no model weights, no real user PII — replace with placeholders):
+- Evaluation configuration files (DeepEval `test_*.py`, RAGAS config, custom eval scripts)
+- Golden dataset samples or references to a golden dataset (path, size, last-updated date)
+- CI step that runs evals (workflow YAML, script, or description of the gate)
+- The metric list and threshold values in use (even if embedded in test files)
+- For RAG pipelines: retrieval configuration (vector store, top-k, similarity threshold)
+- Optional: recent eval run report or score history showing metric trends
+If CI gating configuration is not provided, regression-gate findings are stated as `inference` — say so and ask for it.
+If threshold values are not provided, threshold-governance findings are stated as `inference`.
+### Step 2 — Hallucination and factual correctness audit
+Confirm the eval measures whether the model's claims are factually grounded.
+```python
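+from deepeval import assert_test
+from deepeval.metrics import HallucinationMetric
+from deepeval.test_case import LLMTestCase
+# HIGH — no hallucination check; fabrications pass the suite undetected
+test_cases = [LLMTestCase(input=q, actual_output=answer)]
+# no HallucinationMetric or GEval with factuality criteria
+# Correct — hallucination measured against source
+# (HallucinationMetric is lower-is-better: threshold is the maximum passing score)
+hallucination_metric = HallucinationMetric(threshold=0.2)
+test_cases = [
+    LLMTestCase(input=q, actual_output=answer, context=[source_doc])
+]
+# assert_test takes a single test case, so check each case in turn
+for test_case in test_cases:
+    assert_test(test_case, [hallucination_metric])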
+```
+Check for:
+- Presence of `HallucinationMetric` or a GEval with `"factual consistency"` / `"faithfulness to source"` criteria
+- Whether `context` (source documents) is provided to the metric — without it, the metric cannot detect contradiction
+- Whether a golden dataset with expected answers exists for regression comparisons
+### Step 3 — Answer relevancy and faithfulness audit (RAG focus)
+For all pipelines, confirm responses address the input. For RAG pipelines, confirm outputs are grounded in retrieved context.
+```python
+# MEDIUM — relevancy not measured; off-topic responses pass
+# missing AnswerRelevancyMetric
+# HIGH — RAG pipeline without faithfulness check; model can ignore retrieved docs
+# missing FaithfulnessMetric with retrieval_context
+# Correct — both relevancy and faithfulness measured
+relevancy = AnswerRelevancyMetric(threshold=0.7)
+faithfulness = FaithfulnessMetric(threshold=0.7)
+test_case = LLMTestCase(
+    input=query,
+    actual_output=answer,
+    retrieval_context=retrieved_docs
+)
+```
+Check for:
+- `AnswerRelevancyMetric` present for any conversational or Q&A pipeline
+- `FaithfulnessMetric` present for any RAG pipeline — this is the primary RAG correctness signal
+- `ContextualPrecisionMetric` and `ContextualRecallMetric` for RAG pipelines measuring retrieval quality
+- Whether `retrieval_context` is populated in test cases — an empty context silently disables the metric
+### Step 4 — Safety metrics audit (bias, toxicity)
+Confirm the eval catches unsafe outputs before they reach users.
+```python
+# HIGH (CRITICAL for vulnerable audiences) — no safety guardrails in eval
+# missing BiasMetric and ToxicityMetric
+# Correct — safety metrics applied
+bias_metric = BiasMetric(threshold=0.5)
+toxicity_metric = ToxicityMetric(threshold=0.5)
+```
+Check for:
+- `BiasMetric` present for any user-facing system
+- `ToxicityMetric` present for any user-facing system
+- Threshold values reviewed for the deployment context — a threshold appropriate for an adult content filter may be too permissive for a children's education tool
+- Whether bias and toxicity metrics are in the gating suite or are only advisory/non-blocking
+### Step 5 — Adversarial and red-team coverage audit
+Confirm the eval includes adversarial inputs, not only happy-path test cases.
+```python
+# CRITICAL for agentic / HIGH for others — no adversarial cases
+test_cases = [LLMTestCase(input=normal_query, actual_output=answer)]
+# only benign inputs; no prompt-injection attempts, no jailbreaks
+# Correct — red-team dataset included
+adversarial_cases = load_dataset("adversarial_prompts.json")  # load_dataset: project-specific helper, not a library call
+```
+Check for:
+- Presence of adversarial test cases or a red-team dataset (prompt-injection attempts, jailbreak patterns, boundary inputs)
+- For agentic systems: test cases that verify the agent refuses or handles malicious tool-calling instructions
+- Whether adversarial cases are rotated periodically — a static adversarial set becomes predictable (pesticide paradox)
+- Whether adversarial inputs cluster around the topic or domain boundaries of the deployment (defect clustering)
+### Step 6 — Agent-specific metrics audit (tool correctness, task completion)
+For pipelines that include LLM agents, confirm the eval measures agent behavior, not only text quality.
+```python
+# HIGH — agent evals check only output text; wrong tool calls pass undetected
+# missing ToolCorrectnessMetric
+# HIGH — multi-step agent eval has no end-to-end success signal
+# missing TaskCompletionMetric
+# Correct — both agent metrics present
+tool_correctness = ToolCorrectnessMetric()
+task_completion = TaskCompletionMetric(threshold=0.8)
+agent_test_case = LLMTestCase(
+    input=user_request,
+    actual_output=final_answer,
+    tools_called=agent_tool_log,  # list of deepeval.test_case.ToolCall objects
+    expected_tools=[ToolCall(name="search"), ToolCall(name="summarize")]
+)
+```
+Check for:
+- `ToolCorrectnessMetric` present when an agent selects or calls tools
+- `TaskCompletionMetric` present for multi-step agentic workflows
+- Whether `tools_called` is logged and passed to tool metrics — without the log the metric cannot evaluate tool use
+- Whether task completion is defined and measurable for the specific agent goal
+### Step 7 — Threshold governance and non-determinism audit
+Confirm thresholds are meaningful and results are statistically reliable.
+```python
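+# HIGH — a minimum threshold of 0 means every output passes; the metric is decorative
+FaithfulnessMetric(threshold=0)
+# Likewise HIGH: a maximum threshold of 1 on a lower-is-better metric passes everything
+HallucinationMetric(threshold=1)
+# MEDIUM — single run on a non-deterministic model; one lucky sample masks failures
+result = evaluate(test_cases=test_cases, metrics=[hallucination_metric])
+# Correct — output regenerated and scored several times; threshold domain-reviewed
+scores = []
+for _ in range(5):
+    # generate_answer is a placeholder for the pipeline call under test
+    test_case = LLMTestCase(input=q, actual_output=generate_answer(q), context=[source_doc])
+    hallucination_metric.measure(test_case)
+    scores.append(hallucination_metric.score)
+mean_score = sum(scores) / len(scores)
+# threshold=0.2 reviewed by a domain expert for this medical-chatbot use case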
+```
+Check for:
+- Any threshold that every output passes (0 on a higher-is-better metric, 1 on a lower-is-better metric) or that is left at default without documented review — flag as HIGH
+- Whether thresholds are documented with a rationale (use case, acceptable failure rate, domain expert sign-off)
+- Whether multi-run aggregation (pass@k, mean score over N runs) is used for non-deterministic outputs
+- Whether thresholds differ appropriately across deployment contexts (production vs. staging, medical vs. entertainment)
+### Step 8 — Regression gate audit
+Confirm the eval detects when a model update silently degrades quality.
+```python
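+# HIGH — no baseline; a new model can score worse than the old one and ship
+evaluate(test_cases=test_cases, metrics=[hallucination_metric])
+# no comparison to previous run scores
+# Correct — baseline scores recorded and compared
+# (load_baseline and mean_score_of are project-specific helpers, not DeepEval calls)
+baseline_score = load_baseline("eval_baseline_v1.json")
+current_score = mean_score_of(evaluate(test_cases=test_cases, metrics=[hallucination_metric]))
+# hallucination is lower-is-better, so a regression is an increase in score
+assert current_score <= baseline_score + ALLOWED_REGRESSION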
+```
+Check for:
+- A golden dataset that is versioned and stable enough to detect regression
+- Baseline scores stored from prior runs and compared against current runs
+- CI or eval step that fails when scores drop below the baseline by more than an allowed delta
+- Whether the golden dataset is ever refreshed — a dataset that never changes stops finding new defect categories (pesticide paradox); rotate or supplement it with synthetic data periodically
+---
+## Output
+Return findings in this structure:
+```
+## Verdict
+<one sentence: eval suite gates unsafe outputs / eval runs but gates nothing / partial coverage with gaps>
+## Evidence level
+<eval config + test scripts provided | eval config only | documentation-based | inference>
+## Findings
+### CRITICAL
+- [C1] <finding>: <description> — <remediation>
+### HIGH
+- [H1] <finding>: <description> — <remediation>
+### MEDIUM
+- [M1] <finding>: <description> — <remediation>
+### LOW
+- [L1] <finding>: <description> — <remediation>
+## Safe next actions
+1. <action>
+2. <action>
+## Open questions
+- <question requiring user clarification>
+```
+---
+## Security notes
+- Never request or accept model API keys, inference endpoint URLs, or model weights. Ask for sanitized eval configuration with placeholders.
+- Never call LLM APIs, run evaluations, or contact inference endpoints — this is a static review only.
+- Do not accept eval fixtures containing real user PII or private prompt chains; ask the user to anonymize them first.
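+- A higher-is-better metric with threshold=0 (or a lower-is-better metric such as hallucination, bias, or toxicity with threshold=1) is functionally disabled — it is the eval equivalent of `continue-on-error: true` on a test step. Lead with it when present.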
+- Bias and toxicity without thresholds reviewed for the actual audience are a false signal of safety; flag the gap and ask what the audience is.
+- Adversarial coverage is the most commonly absent category; absence is not evidence that the model is robust — it is evidence the question was never asked.
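
Steps 7 and 8 of this workflow combine two ideas: averaging scores over several runs, and comparing that average against a stored baseline. The following is a minimal sketch of such a gate using only the standard library. It assumes hallucination, bias, and toxicity scores are lower-is-better and all other scores are higher-is-better; `regressed`, `LOWER_IS_BETTER`, and the default `allowed_delta` are hypothetical choices, not part of the package.

```python
from statistics import mean

# Assumption: for these metrics a lower score is the better one.
LOWER_IS_BETTER = {"hallucination", "bias", "toxicity"}


def regressed(metric, baseline_score, run_scores, allowed_delta=0.02):
    """True when the mean of several runs is worse than baseline by more than allowed_delta."""
    current = mean(run_scores)
    if metric in LOWER_IS_BETTER:
        return current > baseline_score + allowed_delta
    return current < baseline_score - allowed_delta
```

A CI step built on this would fail the build when `regressed(...)` returns `True` for any gated metric. Handling the score direction explicitly matters, because a single `>=` comparison applied to every metric would let a rise in hallucination pass as an improvement.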