npm - @raishin/vanguard-frontier-agentic - Versions diffs - 2.0.0 → 2.1.0 - Mend

@raishin/vanguard-frontier-agentic 2.0.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (342)

package/skills/qa/kubernetes-manifest-quality-review/references/workflow-and-output.md ADDED

@@ -0,0 +1,246 @@
+# Workflow and Output Contract
+## Workflow
+### Step 1 — Collect inputs
+Ask the user to provide one or more of the following as sanitized files (no real secret values, no kubeconfig, no service account tokens, no cloud credentials — replace sensitive values with placeholders):
+- Workload manifests: Deployment, StatefulSet, DaemonSet YAML
+- Service and Ingress YAML
+- NetworkPolicy YAML
+- RBAC resources: Role, ClusterRole, RoleBinding, ClusterRoleBinding YAML
+- CRD definitions if relevant
+- Any Kustomize base and overlay files
+If NetworkPolicy resources are not provided, the egress/ingress audit findings are stated as `inference` — say so and ask for them.
+### Step 2 — Schema and API version audit
+Validate that every manifest has `apiVersion` and `kind` present. Check for deprecated or removed API versions:
+```yaml
+# HIGH — removed in Kubernetes 1.22
+apiVersion: extensions/v1beta1
+kind: Ingress
+# HIGH — networking.k8s.io/v1beta1 Ingress removed in 1.22
+apiVersion: networking.k8s.io/v1beta1
+kind: Ingress
+# HIGH — policy/v1beta1 PodSecurityPolicy removed in 1.25
+apiVersion: policy/v1beta1
+kind: PodSecurityPolicy
+```
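+Check that required labels are present on Pod templates and workload controllers: `app`, `app.kubernetes.io/name`, `app.kubernetes.io/version`. Flag a missing `metadata.namespace` on every namespaced resource; cluster-scoped kinds such as ClusterRole, ClusterRoleBinding, and CustomResourceDefinition have no namespace and should not be flagged.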
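A minimal sketch of the Step 2 check, assuming each manifest has already been parsed into a plain dictionary (for example by a YAML loader). The helper name `audit_api_version` and the finding strings are illustrative and not part of the skill; the lookup table holds only the three removed API pairs listed above.

```python
# Hypothetical helper: flags missing fields and the removed API versions named in Step 2.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
}


def audit_api_version(manifest: dict) -> list[str]:
    findings = []
    api_version = manifest.get("apiVersion")
    kind = manifest.get("kind")
    if not api_version:
        findings.append("apiVersion missing")
    if not kind:
        findings.append("kind missing")
    # The pair lookup returns None for anything not in the removed-API table.
    removed_in = REMOVED_APIS.get((api_version, kind))
    if removed_in:
        findings.append(
            f"HIGH: {api_version} {kind} removed in Kubernetes {removed_in}"
        )
    return findings
```

Keying the table on the (apiVersion, kind) pair matters because `policy/v1beta1` was not removed for every kind at the same release.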
+### Step 3 — Pod security audit (PSS Restricted/Baseline comparison)
+Evaluate each Pod spec against the Pod Security Standards Restricted profile:
+```yaml
+# CRITICAL — privileged container
+securityContext:
+  privileged: true
+# CRITICAL — host namespaces
+hostNetwork: true
+hostPID: true
+hostIPC: true
+# HIGH — runAsRoot or missing runAsNonRoot
+securityContext:
+  runAsUser: 0
+  # or: runAsNonRoot absent
+# HIGH — allowPrivilegeEscalation unset or true
+securityContext:
+  allowPrivilegeEscalation: true
+# CRITICAL — dangerous capabilities
+securityContext:
+  capabilities:
+    add: ["SYS_ADMIN"]
+# MEDIUM — writable root filesystem
+securityContext:
+  readOnlyRootFilesystem: false
+  # or: field absent
+# MEDIUM — no seccomp profile
+securityContext:
+  # seccompProfile absent
+```
+For each container in the pod, note whether the field is set at the pod level, the container level, or both. Container-level settings override pod-level settings.
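The override rule above can be sketched as a dictionary merge. This is a simplification under stated assumptions: `pod_spec` and `container` are parsed dictionaries, both helper names are hypothetical, and the merge treats every key alike even though some fields are valid at only one level (for example `privileged` is container-only).

```python
# Hypothetical helpers illustrating "container-level overrides pod-level".
def effective_security_context(pod_spec: dict, container: dict) -> dict:
    pod_level = pod_spec.get("securityContext") or {}
    container_level = container.get("securityContext") or {}
    # Later keys overwrite earlier ones, so the container value takes precedence.
    return {**pod_level, **container_level}


def field_origin(pod_spec: dict, container: dict, field: str) -> str:
    """Report where a field is set: 'pod', 'container', 'both', or 'absent'."""
    in_pod = field in (pod_spec.get("securityContext") or {})
    in_container = field in (container.get("securityContext") or {})
    if in_pod and in_container:
        return "both"
    if in_container:
        return "container"
    if in_pod:
        return "pod"
    return "absent"
```

Recording the origin alongside the effective value lets a finding say, for instance, that a pod-level `runAsNonRoot: true` is contradicted by a container-level `runAsUser: 0`.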
+### Step 4 — Image hygiene audit
+Check every container and init container image reference:
+```yaml
+# HIGH — mutable tag, non-reproducible
+image: nginx:latest
+image: myapp   # tag absent
+# MEDIUM — no digest pinning
+image: nginx:1.25.3   # tag present but no @sha256 digest
+# MEDIUM — unverified public registry, no digest
+image: docker.io/library/nginx:1.25.3
+```
+For production-grade manifests, recommend digest-pinned images:
+```yaml
+image: nginx:1.25.3@sha256:<digest>
+```
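As a rough illustration of the image checks in this step, the sketch below classifies a single image reference. The function name `classify_image` and its return strings are assumptions made for the example; the severities mirror the comments in the YAML above.

```python
# Hypothetical classifier for one container image reference.
def classify_image(image: str) -> str:
    name, _, digest = image.partition("@")
    if digest.startswith("sha256:"):
        return "OK: digest pinned"
    # Look for a tag only in the last path segment, so a registry port
    # such as localhost:5000 is not mistaken for a tag.
    last_segment = name.rsplit("/", 1)[-1]
    _, separator, tag = last_segment.partition(":")
    if not separator or tag == "latest":
        return "HIGH: mutable or absent tag"
    return "MEDIUM: tag without digest"
```

The sketch treats only `latest` as a mutable tag; any tag can technically be re-pushed, which is why the digest-pinned form is the only one reported as OK.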
+### Step 5 — Resource governance audit
+Check every container for `resources.requests` and `resources.limits`:
+```yaml
+# HIGH — no requests or limits
+containers:
+  - name: app
+    image: myapp:1.0.0
+    # resources absent
+# MEDIUM — memory limit set without CPU limit
+resources:
+  limits:
+    memory: 512Mi
+  requests:
+    cpu: 100m
+    memory: 256Mi
+  # limits.cpu absent
+```
+Check for ephemeral storage limits on containers known to produce log output or temporary files.
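A small sketch of the two resource findings shown above, assuming the container is a parsed dictionary. `audit_resources` is a hypothetical name, and the ephemeral-storage check is left out because it depends on knowing what the container writes.

```python
# Hypothetical check covering the two resource-governance findings in Step 5.
def audit_resources(container: dict) -> list[str]:
    resources = container.get("resources") or {}
    requests = resources.get("requests") or {}
    limits = resources.get("limits") or {}
    if not requests and not limits:
        return ["HIGH: no requests or limits"]
    findings = []
    if "memory" in limits and "cpu" not in limits:
        findings.append("MEDIUM: memory limit set without CPU limit")
    return findings
```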
+### Step 6 — Health probe audit
+Check every container for `livenessProbe` and `readinessProbe`:
+```yaml
+# HIGH — missing livenessProbe
+containers:
+  - name: app
+    # livenessProbe absent
+# HIGH — missing readinessProbe
+containers:
+  - name: app
+    # readinessProbe absent
+# MEDIUM — exec probe with no timeoutSeconds
+livenessProbe:
+  exec:
+    command: ["/bin/check"]
+  # timeoutSeconds absent, defaults to 1 second
+```
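The probe checks above could be sketched as follows. The helper name is hypothetical, and applying the missing-`timeoutSeconds` rule to `readinessProbe` as well as `livenessProbe` is an assumption; the source only shows it for the liveness case.

```python
# Hypothetical check for the probe findings in Step 6.
def audit_probes(container: dict) -> list[str]:
    findings = []
    for probe_name in ("livenessProbe", "readinessProbe"):
        probe = container.get(probe_name)
        if probe is None:
            findings.append(f"HIGH: missing {probe_name}")
        elif "exec" in probe and "timeoutSeconds" not in probe:
            # Without an explicit value the probe uses the 1-second default.
            findings.append(f"MEDIUM: exec {probe_name} with no timeoutSeconds")
    return findings
```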
+### Step 7 — Networking and exposure audit
+Review Service types, Ingress TLS, NetworkPolicy coverage, and Ingress annotations:
+```yaml
+# MEDIUM — external exposure without documented justification
+kind: Service
+spec:
+  type: LoadBalancer   # or NodePort
+# HIGH — Ingress without TLS
+kind: Ingress
+spec:
+  # tls block absent
+# MEDIUM — no NetworkPolicy found in namespace (default allow-all)
+# CRITICAL — SSRF-enabling Ingress annotation
+metadata:
+  annotations:
+    nginx.ingress.kubernetes.io/use-proxy-protocol: "true"
+```
+If no NetworkPolicy resources are provided for the namespace, state that the default-allow posture is inferred and ask for NetworkPolicy files.
+### Step 8 — RBAC and secrets audit
+Review ClusterRole, Role, RoleBinding, ClusterRoleBinding, and Secret resources:
+```yaml
+# CRITICAL — wildcard verbs on wildcard resources
+rules:
+  - apiGroups: ["*"]
+    resources: ["*"]
+    verbs: ["*"]
+# CRITICAL — unauthenticated subject
+subjects:
+  - kind: Group
+    name: system:unauthenticated
+# HIGH — automount enabled on pods that do not need API access
+automountServiceAccountToken: true   # or field absent
+# HIGH — broad secret access
+rules:
+  - resources: ["secrets"]
+    verbs: ["get", "list"]
+# CRITICAL — plaintext credentials in env
+env:
+  - name: DB_PASSWORD
+    value: "mysecretpassword"
+# MEDIUM — empty-string secret value
+data:
+  password: ""   # decodes to empty
+```
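Two of the Step 8 findings lend themselves to a short sketch. Both function names are hypothetical, the inputs are assumed to be parsed dictionaries, and the credential detection is a name-based heuristic chosen for the example rather than something the skill prescribes.

```python
# Hypothetical checks for RBAC rules and plaintext credentials in env.
def audit_rbac_rule(rule: dict) -> list[str]:
    findings = []
    verbs = rule.get("verbs", [])
    resources = rule.get("resources", [])
    if "*" in verbs and "*" in resources:
        findings.append("CRITICAL: wildcard verbs on wildcard resources")
    if "secrets" in resources and {"get", "list"} & set(verbs):
        findings.append("HIGH: broad secret access")
    return findings


def audit_env(container: dict) -> list[str]:
    # Heuristic (an assumption): a literal value on a secret-looking variable name.
    hints = ("PASSWORD", "SECRET", "TOKEN", "KEY")
    return [
        f"CRITICAL: plaintext credential in env {var['name']}"
        for var in container.get("env", [])
        if var.get("value") and any(h in var.get("name", "").upper() for h in hints)
    ]
```

A variable that uses `valueFrom.secretKeyRef` carries no `value` key, so it passes the second check.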
+---
+## Output
+Return findings in this structure:
+```
+## Verdict
+<one sentence: manifests pass baseline / manifests have blocking security defects / manifests need remediation before production>
+## Evidence level
+<manifest files provided | partial manifests only | inference for missing resources>
+## Findings
+### CRITICAL
+- [C1] <resource name> — <finding>: <description> — <remediation>
+### HIGH
+- [H1] <resource name> — <finding>: <description> — <remediation>
+### MEDIUM
+- [M1] <resource name> — <finding>: <description> — <remediation>
+### LOW
+- [L1] <resource name> — <finding>: <description> — <remediation>
+## Safe next actions
+1. <action>
+2. <action>
+## Open questions
+- <question requiring user clarification>
+```
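One way to emit the Findings part of this structure is sketched below. The dictionary keys (`severity`, `resource`, `finding`, `description`, `remediation`) and the function name are assumptions for illustration; the line format follows the template above, with the dash separator written as a Unicode escape.

```python
# Hypothetical renderer for the "## Findings" part of the output contract.
SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")
SEP = " \u2014 "  # the dash separator used in the template lines


def render_findings(findings: list[dict]) -> str:
    lines = ["## Findings"]
    for severity in SEVERITIES:
        lines.append(f"### {severity}")
        matching = [f for f in findings if f["severity"] == severity]
        for index, f in enumerate(matching, start=1):
            # IDs are the severity initial plus a per-severity counter: C1, H1, H2, ...
            lines.append(
                f"- [{severity[0]}{index}] {f['resource']}{SEP}"
                f"{f['finding']}: {f['description']}{SEP}{f['remediation']}"
            )
    return "\n".join(lines)
```

Every severity heading is emitted even when it has no entries, so a reader can tell an empty category from an omitted one.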
+---
+## Security notes
+- Never request or accept kubeconfig, service account tokens, cloud credentials, or actual secret values. Ask for sanitized manifests with placeholder values in Secret resources.
+- This is a static review: do not apply manifests, run `kubectl`, or contact any cluster.
+- A `privileged: true` container, `hostNetwork/hostPID/hostIPC: true`, or a ClusterRole with `*` verbs on `*` resources is the highest-impact finding class. Lead with it.
+- `RoleBinding` to `system:unauthenticated` or `system:anonymous` is a critical exposure; tell the user to remove it immediately.
+- Plaintext credentials in `env.value` or `ConfigMap.data` should be replaced with `secretKeyRef` references; never recommend committing real credentials even in base64.
+- Do not recommend disabling probes or relaxing securityContext fields to pass short-term validation — recommend the correct secure configuration and explain the rationale.

package/skills/qa/llm-ai-pipeline-test-review/SKILL.md ADDED

@@ -0,0 +1,52 @@
+---
+name: llm-ai-pipeline-test-review
+description: Use this skill when reviewing how an LLM or AI pipeline is evaluated — metric selection, golden datasets, threshold governance, adversarial coverage, and regression gating — to determine whether low-quality or unsafe model outputs can ship undetected. Trigger when a user provides evaluation configuration files, DeepEval or RAGAS test scripts, eval CI steps, or asks whether their AI pipeline actually prevents a bad model from reaching production. This skill reviews evaluation setup statically; it does not call LLM APIs, run evaluations, or contact inference endpoints.
+allowed-tools: Read Grep Glob
+metadata:
+  author: "github: Raishin"
+  version: "0.1.0"
+  updated: "2026-05-17"
+  category: ai
+  lifecycle: experimental
+---
+# LLM AI Pipeline Test Review
+## Purpose
+This skill reviews how an LLM or AI pipeline is evaluated — not the model itself, but the evaluation setup that decides whether a model change is safe to ship. An evaluation suite only protects users if it measures the right things, gates on meaningful thresholds, covers adversarial inputs, and detects drift across model versions. The review catches missing hallucination and factuality metrics, absent answer-relevancy and faithfulness checks for RAG pipelines, unguarded bias and toxicity, no adversarial or red-team coverage, agent evals that ignore tool correctness and task completion, thresholds that are undefined or set to zero, single-shot evals on non-deterministic outputs, and no regression baseline to detect metric drift.
+## Lean operating rules
+- Treat a RAG or summarisation pipeline with no `HallucinationMetric` or no GEval with factuality criteria against source documents as HIGH — the pipeline can fabricate facts and ship them.
+- Treat a pipeline with no golden dataset (fixed reference set for regression) as HIGH — metric drift across model versions is undetectable.
+- Treat the absence of `AnswerRelevancyMetric` as MEDIUM — responses may be fluent but off-topic, and no eval catches it.
+- Treat a RAG pipeline with no `FaithfulnessMetric` as HIGH — the model can ignore retrieved context and hallucinate; faithfulness is the primary RAG correctness signal.
+- Treat missing `ContextualPrecisionMetric` or `ContextualRecallMetric` in a RAG pipeline as MEDIUM — retrieval quality is unmeasured; noisy or incomplete retrieval is invisible to the eval.
+- Treat the absence of `BiasMetric` or `ToxicityMetric` as HIGH if the system is user-facing — unsafe outputs can reach users without detection; treat as CRITICAL if the audience is vulnerable (children, medical patients, crisis users).
+- Treat no adversarial test cases and no red-team dataset as CRITICAL for agentic systems; HIGH for all other user-facing LLM products — prompt-injection and jailbreak paths are untested.
+- Treat agent evals with no `ToolCorrectnessMetric` as HIGH — the agent can call wrong tools silently and the eval still passes.
+- Treat multi-step agent evals with no `TaskCompletionMetric` as HIGH — end-to-end success is unmeasured even if individual steps look fine.
+- Treat metric thresholds that are undefined, set to 0, or not reviewed by a domain expert as HIGH — a threshold of 0 means every output passes; an unreviewed threshold is a guess.
+- Treat evals that run only once per input on non-deterministic outputs (no pass@k or mean-score aggregation across multiple runs) as MEDIUM — a single lucky sample masks systematic failure.
+- Treat the absence of a golden dataset or scoring baseline that would detect metric regression across model versions as HIGH — a model update can silently degrade quality.
+- Treat static golden datasets that have never been rotated or supplemented with synthetic adversarial data as MEDIUM — a suite that tests the same inputs repeatedly stops finding new defects (the pesticide paradox).
+- Apply thresholds contextually: a faithfulness score of 0.7 may be acceptable for a joke generator and unacceptable for a medical chatbot — flag any threshold that appears copied from a tutorial without domain justification.
+- Define eval metrics early in the model selection process, not after a model is chosen — catching defects before model selection is always cheaper than retrofitting evals.
+- Label every finding with evidence basis: eval config provided, test script provided, documentation-based, or inference.
+- Static review only — read eval configs and test source; never call LLM APIs, never run evaluations, never request model API keys or inference endpoints.
+## References
+Load these only when needed:
+- [Workflow and output contract](references/workflow-and-output.md) — use when executing the full review or formatting the final answer.
+## Response minimum
+Return, at minimum:
+- Hallucination and factual correctness findings
+- Answer relevancy and faithfulness findings (especially for RAG pipelines)
+- Safety metric findings (bias, toxicity)
+- Adversarial and red-team coverage findings
+- Agent-specific metric findings (tool correctness, task completion)
+- Threshold governance and non-determinism findings
+- Regression gating findings (golden dataset, baseline)
+- Severity-labelled finding list (critical / high / medium / low)
+- Safe next actions

package/skills/qa/llm-ai-pipeline-test-review/metadata.json ADDED

@@ -0,0 +1,23 @@
+{
+  "id": "llm-ai-pipeline-test-review",
+  "name": "LLM AI Pipeline Test Review",
+  "type": "skill",
+  "provider": "generic",
+  "harnesses": ["codex", "claude-code", "cursor", "gemini", "kiro", "other"],
+  "summary": "Review an LLM or AI pipeline's evaluation setup for test-quality defects — missing hallucination, relevancy, faithfulness, bias, toxicity, and tool-correctness metrics; absent golden datasets; unthresholded or single-shot evals; and no regression gate across model versions. Static review only.",
+  "source_type": "original",
+  "official_docs": [
+    "https://docs.confident-ai.com/",
+    "https://docs.confident-ai.com/docs/metrics-hallucination",
+    "https://docs.confident-ai.com/docs/metrics-answer-relevancy",
+    "https://docs.confident-ai.com/docs/metrics-faithfulness",
+    "https://docs.confident-ai.com/docs/metrics-bias",
+    "https://docs.confident-ai.com/docs/metrics-tool-correctness",
+    "https://www.istqb.org/certifications/certified-tester-foundation-level"
+  ],
+  "security_notes": "Static review only — reads eval configuration and test source; never calls LLM APIs, never runs evaluations, never requests model API keys or inference endpoints. Do not accept eval fixtures containing real user PII, private prompt chains, or model weights; ask for sanitized configurations.",
+  "last_verified": "2026-05-17",
+  "path": "skills/qa/llm-ai-pipeline-test-review",
+  "author": "github: Raishin",
+  "version": "0.1.0"
+}

package/skills/qa/llm-ai-pipeline-test-review/references/workflow-and-output.md ADDED

@@ -0,0 +1,221 @@
+# Workflow and Output Contract
+## Workflow
+### Step 1 — Collect inputs
+Ask the user to provide one or more of the following as sanitized files (no API keys, no model weights, no real user PII — replace with placeholders):
+- Evaluation configuration files (DeepEval `test_*.py`, RAGAS config, custom eval scripts)
+- Golden dataset samples or references to a golden dataset (path, size, last-updated date)
+- CI step that runs evals (workflow YAML, script, or description of the gate)
+- The metric list and threshold values in use (even if embedded in test files)
+- For RAG pipelines: retrieval configuration (vector store, top-k, similarity threshold)
+- Optional: recent eval run report or score history showing metric trends
+If CI gating configuration is not provided, regression-gate findings are stated as `inference` — say so and ask for it.
+If threshold values are not provided, threshold-governance findings are stated as `inference`.
+### Step 2 — Hallucination and factual correctness audit
+Confirm the eval measures whether the model's claims are factually grounded.
+```python
+# HIGH — no hallucination check; fabrications pass the suite undetected
+test_cases = [LLMTestCase(input=q, actual_output=answer)]
+# no HallucinationMetric or GEval with factuality criteria
+# Correct — hallucination measured against source
+hallucination_metric = HallucinationMetric(threshold=0.2)
+dataset = EvaluationDataset(test_cases=[
+    LLMTestCase(input=q, actual_output=answer, context=[source_doc])
+])
+# assert_test() takes a single test case; evaluate() runs a whole dataset
+evaluate(dataset, metrics=[hallucination_metric])
+```
+Check for:
+- Presence of `HallucinationMetric` or a GEval with `"factual consistency"` / `"faithfulness to source"` criteria
+- Whether `context` (source documents) is provided to the metric — without it, the metric cannot detect contradiction
+- Whether a golden dataset with expected answers exists for regression comparisons
+### Step 3 — Answer relevancy and faithfulness audit (RAG focus)
+For all pipelines, confirm responses address the input. For RAG pipelines, confirm outputs are grounded in retrieved context.
+```python
+# MEDIUM — relevancy not measured; off-topic responses pass
+# missing AnswerRelevancyMetric
+# HIGH — RAG pipeline without faithfulness check; model can ignore retrieved docs
+# missing FaithfulnessMetric with retrieval_context
+# Correct — both relevancy and faithfulness measured
+relevancy = AnswerRelevancyMetric(threshold=0.7)
+faithfulness = FaithfulnessMetric(threshold=0.7)
+test_case = LLMTestCase(
+    input=query,
+    actual_output=answer,
+    retrieval_context=retrieved_docs
+)
+```
+Check for:
+- `AnswerRelevancyMetric` present for any conversational or Q&A pipeline
+- `FaithfulnessMetric` present for any RAG pipeline — this is the primary RAG correctness signal
+- `ContextualPrecisionMetric` and `ContextualRecallMetric` for RAG pipelines measuring retrieval quality
+- Whether `retrieval_context` is populated in test cases — an empty context silently disables the metric
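The last check in the list above can be sketched without any evaluation library, assuming test cases have been extracted into plain dictionaries that carry a `retrieval_context` key. The function name and the dictionary shape are assumptions for the example.

```python
# Hypothetical static check: which test cases carry no retrieved context?
def empty_context_cases(test_cases: list[dict]) -> list[int]:
    """Return the indexes of test cases whose retrieval_context is missing or empty."""
    return [
        index
        for index, case in enumerate(test_cases)
        if not case.get("retrieval_context")
    ]
```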
+### Step 4 — Safety metrics audit (bias, toxicity)
+Confirm the eval catches unsafe outputs before they reach users.
+```python
+# HIGH (CRITICAL for vulnerable audiences) — no safety guardrails in eval
+# missing BiasMetric and ToxicityMetric
+# Correct — safety metrics applied
+bias_metric = BiasMetric(threshold=0.5)
+toxicity_metric = ToxicityMetric(threshold=0.5)
+```
+Check for:
+- `BiasMetric` present for any user-facing system
+- `ToxicityMetric` present for any user-facing system
+- Threshold values reviewed for the deployment context — a threshold appropriate for an adult content filter may be too permissive for a children's education tool
+- Whether bias and toxicity metrics are in the gating suite or are only advisory/non-blocking
+### Step 5 — Adversarial and red-team coverage audit
+Confirm the eval includes adversarial inputs, not only happy-path test cases.
+```python
+# CRITICAL for agentic / HIGH for others — no adversarial cases
+test_cases = [LLMTestCase(input=normal_query, actual_output=answer)]
+# only benign inputs; no prompt-injection attempts, no jailbreaks
+# Correct — red-team dataset included
+adversarial_cases = load_dataset("adversarial_prompts.json")
+```
+Check for:
+- Presence of adversarial test cases or a red-team dataset (prompt-injection attempts, jailbreak patterns, boundary inputs)
+- For agentic systems: test cases that verify the agent refuses or handles malicious tool-calling instructions
+- Whether adversarial cases are rotated periodically — a static adversarial set becomes predictable (pesticide paradox)
+- Whether adversarial inputs cluster around the topic or domain boundaries of the deployment (defect clustering)
+### Step 6 — Agent-specific metrics audit (tool correctness, task completion)
+For pipelines that include LLM agents, confirm the eval measures agent behavior, not only text quality.
+```python
+# HIGH — agent evals check only output text; wrong tool calls pass undetected
+# missing ToolCorrectnessMetric
+# HIGH — multi-step agent eval has no end-to-end success signal
+# missing TaskCompletionMetric
+# Correct — both agent metrics present
+tool_correctness = ToolCorrectnessMetric()
+task_completion = TaskCompletionMetric(threshold=0.8)
+agent_test_case = LLMTestCase(
+    input=user_request,
+    actual_output=final_answer,
+    tools_called=agent_tool_log,
+    expected_tools=["search", "summarize"]
+)
+```
+Check for:
+- `ToolCorrectnessMetric` present when an agent selects or calls tools
+- `TaskCompletionMetric` present for multi-step agentic workflows
+- Whether `tools_called` is logged and passed to tool metrics — without the log the metric cannot evaluate tool use
+- Whether task completion is defined and measurable for the specific agent goal
+### Step 7 — Threshold governance and non-determinism audit
+Confirm thresholds are meaningful and results are statistically reliable.
+```python
+# HIGH — a no-op threshold lets every output pass; the metric is decorative
+# (0 for higher-is-better metrics; 1 for lower-is-better ones such as HallucinationMetric)
+AnswerRelevancyMetric(threshold=0)
+# MEDIUM — single run on a non-deterministic model; one lucky sample masks failures
+result = evaluate(dataset, metrics=[hallucination_metric])
+# Correct — multiple runs aggregated; threshold domain-reviewed
+scores = [evaluate(dataset, metrics=[hallucination_metric]).scores for _ in range(5)]
+mean_score = sum(scores) / len(scores)
+# threshold=0.2 reviewed by a domain expert for this medical-chatbot use case
+```
+Check for:
+- Any threshold set to 0 or left at default without documented review — flag as HIGH
+- Whether thresholds are documented with a rationale (use case, acceptable failure rate, domain expert sign-off)
+- Whether multi-run aggregation (pass@k, mean score over N runs) is used for non-deterministic outputs
+- Whether thresholds differ appropriately across deployment contexts (production vs. staging, medical vs. entertainment)
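The multi-run aggregation mentioned above can be sketched in plain Python, independent of any evaluation library. All names here are hypothetical; `pass_at_k` uses the common combinatorial estimator (the chance that at least one of `k` samples drawn from `n` recorded runs passed), and `higher_is_better` is an assumed flag for metric direction.

```python
from math import comb
from statistics import mean


def passes(score: float, threshold: float, higher_is_better: bool = True) -> bool:
    # Direction matters: a relevancy score must reach the threshold,
    # while a lower-is-better score must stay at or below it.
    return score >= threshold if higher_is_better else score <= threshold


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n runs of which c passed."""
    if n - c < k:
        return 1.0  # too few failing runs to fill k samples with failures
    return 1.0 - comb(n - c, k) / comb(n, k)


def aggregate(scores: list[float], threshold: float, higher_is_better: bool = True) -> dict:
    passed = sum(1 for s in scores if passes(s, threshold, higher_is_better))
    return {
        "mean": mean(scores),
        "passed_runs": passed,
        "pass_at_1": pass_at_k(len(scores), passed, 1),
    }
```

Reporting the mean next to pass@k shows whether a passing average hides a large share of failing runs.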
+### Step 8 — Regression gate audit
+Confirm the eval detects when a model update silently degrades quality.
+```python
+# HIGH — no baseline; a new model can score worse than the old one and ship
+evaluate(dataset, metrics=[hallucination_metric])
+# no comparison to previous run scores
+# Correct — baseline scores recorded and compared
+baseline = load_baseline("eval_baseline_v1.json")
+current = evaluate(dataset, metrics=[hallucination_metric])
+# HallucinationMetric is lower-is-better, so a regression is a score that rises
+assert current.score <= baseline.score + ALLOWED_REGRESSION
+```
+Check for:
+- A golden dataset that is versioned and stable enough to detect regression
+- Baseline scores stored from prior runs and compared against current runs
+- CI or eval step that fails when scores drop below the baseline by more than an allowed delta
+- Whether the golden dataset is ever refreshed — a dataset that never changes stops finding new defect categories (pesticide paradox); rotate or supplement it with synthetic data periodically
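A direction-aware version of the baseline comparison might look like the sketch below. It assumes baseline and current scores are available as metric-name-to-score dictionaries; the function name, the `lower_is_better` parameter, and the message format are all hypothetical.

```python
# Hypothetical regression gate over stored baseline scores.
def regression_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    allowed_delta: float,
    lower_is_better: frozenset = frozenset(),
) -> list[str]:
    failures = []
    for metric, base_score in baseline.items():
        if metric not in current:
            failures.append(f"{metric}: missing from current run")
            continue
        change = current[metric] - base_score
        # For lower-is-better metrics a rise is a regression; otherwise a drop is.
        worsening = change if metric in lower_is_better else -change
        if worsening > allowed_delta:
            failures.append(f"{metric}: worsened by {worsening:.2f}")
    return failures
```

A CI step could fail the build whenever the returned list is non-empty, which also catches a metric that was silently dropped from the current run.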
+---
+## Output
+Return findings in this structure:
+```
+## Verdict
+<one sentence: eval suite gates unsafe outputs / eval runs but gates nothing / partial coverage with gaps>
+## Evidence level
+<eval config + test scripts provided | eval config only | documentation-based | inference>
+## Findings
+### CRITICAL
+- [C1] <finding>: <description> — <remediation>
+### HIGH
+- [H1] <finding>: <description> — <remediation>
+### MEDIUM
+- [M1] <finding>: <description> — <remediation>
+### LOW
+- [L1] <finding>: <description> — <remediation>
+## Safe next actions
+1. <action>
+2. <action>
+## Open questions
+- <question requiring user clarification>
+```
+---
+## Security notes
+- Never request or accept model API keys, inference endpoint URLs, or model weights. Ask for sanitized eval configuration with placeholders.
+- Never call LLM APIs, run evaluations, or contact inference endpoints — this is a static review only.
+- Do not accept eval fixtures containing real user PII or private prompt chains; ask the user to anonymize them first.
+- A higher-is-better metric with threshold=0 is functionally disabled (as is a lower-is-better metric such as `HallucinationMetric` with threshold=1); it is the eval equivalent of `continue-on-error: true` on a test step. Lead with it when present.
+- Bias and toxicity without thresholds reviewed for the actual audience are a false signal of safety; flag the gap and ask what the audience is.
+- Adversarial coverage is the most commonly absent category; absence is not evidence that the model is robust — it is evidence the question was never asked.

package/skills/qa/playwright-e2e-execution-run/SKILL.md ADDED

@@ -0,0 +1,54 @@
+---
+name: playwright-e2e-execution-run
+description: Use this skill when an operator wants to actually execute an existing Playwright end-to-end suite against a confirmed non-production target and receive a structured, attested run report — pass/fail counts, flaky tests, durations, and trace artifacts. Trigger when the user asks to "run the e2e suite", "execute the Playwright tests against staging", or hands the agent a Playwright project plus a target base URL. This is the live-execution counterpart to the static-review skill `playwright-e2e-suite-review`. Default mode is static and runs nothing; runtime execution is a per-session opt-in that requires explicit target confirmation.
+allowed-tools: Read Grep Glob Bash(npx playwright test*) Bash(npx playwright install*) Bash(npx playwright show-report*)
+metadata:
+  author: "github: Raishin"
+  version: "0.1.0"
+  updated: "2026-05-17"
+  category: delivery
+  lifecycle: experimental
+  execution_tier: read-only-runtime
+  required_egress:
+    - operator-confirmed-target-host
+    - cdn.playwright.dev
+    - playwright.download.prss.microsoft.com
+  requires_credentials: []
+  output_attestation:
+    schema: schemas/attestation.schema.json
+    signed_with: none
+---
+# Playwright E2E Execution Run
+## Purpose
+This skill executes an existing Playwright end-to-end suite against an operator-confirmed non-production target and emits a structured run attestation: total/passed/failed/flaky counts, slowest tests, retry-only passes, and the location of trace and screenshot artifacts. It is the live-execution counterpart to `playwright-e2e-suite-review` (which is static-review only and never runs anything). The skill runs the suite as authored — it does not write the tests, deploy the application, or mutate infrastructure — and it refuses to run against a production target.
+## Execution modes
+- **Static (default).** The skill runs nothing. It inspects `playwright.config`, enumerates the project and target, states exactly which command it would execute, and asks the operator for explicit runtime opt-in plus target confirmation.
+- **Runtime (per-session opt-in).** Only after the operator explicitly opts in and confirms a non-production base URL does the skill invoke `npx playwright test`. Runtime mode is never assumed from the request alone.
+## Lean operating rules
+- Never execute the suite without an explicit, in-session runtime opt-in AND an operator-confirmed base URL — absent either, stay in static mode and ask.
+- Refuse to run if the target base URL resolves to, or is named like, a production environment (`prod`, `www.`, a customer-facing apex domain). Require a staging, preview, or ephemeral target; state the refusal reason.
+- Never accept credentials, bearer tokens, or a `storageState` file inline. Test credentials must come from the environment or a config the operator already controls; the skill never collects, echoes, or logs their values.
+- Run only the allowlisted commands: `npx playwright test` (with operator-supplied flags), `npx playwright install` (browser binaries), `npx playwright show-report`. Never run deploy, migration, seed, or registry commands.
+- Treat the suite's own side effects as the operator's responsibility — state plainly that E2E tests may create or modify data in the target, which is why a non-production target is mandatory.
+- Do not retry a failed run with raised timeouts or added retries to manufacture a green result — report the failure as observed.
+- Emit the run attestation as JSON conforming to `schemas/attestation.schema.json`; the verdict is one of `pass`, `fail`, or `manual-review`.
+- If browser binaries are missing, run `npx playwright install` only with operator awareness; if egress to the browser CDN is blocked, degrade to `manual-review` rather than reporting a false `fail`.
+- Label the run: command executed, target host (host only, never the full credentialed URL), Playwright version, and wall-clock duration.
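The production-target refusal and the host-only labeling rules above could be approximated as follows. This is a naive heuristic under stated assumptions: the function names are hypothetical, the substring match on `prod` deliberately errs toward refusal (it also rejects names such as `nonprod`), and the two-label apex test does not understand multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlsplit


def host_only(base_url: str) -> str:
    """Return just the hostname, dropping scheme, credentials, port, and path."""
    return (urlsplit(base_url).hostname or "").lower()


def production_refusal_reason(base_url: str):
    """Return a refusal reason if the target looks like production, else None."""
    host = host_only(base_url)
    if not host:
        return "no host in base URL"
    labels = host.split(".")
    if any("prod" in label for label in labels):
        return f"host {host} is named like production"
    if labels[0] == "www":
        return f"host {host} starts with www."
    if len(labels) == 2:
        return f"host {host} looks like an apex domain"
    return None
```

Returning the reason rather than a boolean supports the rule that the skill must state why it refused.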
+## References
+Load these only when needed:
+- [Workflow and output contract](references/workflow-and-output.md) — use when executing the run or formatting the attestation.
+## Response minimum
+Return, at minimum:
+- The execution mode used (static or runtime) and why
+- The exact command executed (runtime) or that would be executed (static)
+- The confirmed target host and Playwright version
+- Run results: total / passed / failed / flaky (retry-only pass) counts
+- Trace and screenshot artifact locations for any failure
+- A `pass` / `fail` / `manual-review` verdict with reasons
+- Safe next actions
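A sketch of how the response items above might be assembled into a JSON attestation. The field names are assumptions, since the contents of `schemas/attestation.schema.json` are not shown here, and mapping a flaky-only run to `manual-review` is likewise an assumed policy rather than one the skill states.

```python
import json


def build_attestation(
    command: str,
    target_host: str,
    playwright_version: str,
    duration_seconds: float,
    counts: dict,
    browsers_available: bool = True,
) -> str:
    reasons = []
    if not browsers_available:
        # Blocked browser download degrades to manual-review, not a false fail.
        verdict = "manual-review"
        reasons.append("browser binaries unavailable; the run could not start")
    elif counts["failed"] > 0:
        verdict = "fail"
        reasons.append(f"{counts['failed']} test(s) failed")
    elif counts["flaky"] > 0:
        verdict = "manual-review"  # assumed policy for retry-only passes
        reasons.append(f"{counts['flaky']} test(s) passed only on retry")
    else:
        verdict = "pass"
        reasons.append("all tests passed on first attempt")
    return json.dumps(
        {
            "mode": "runtime",
            "command": command,
            "target_host": target_host,  # host only, never a credentialed URL
            "playwright_version": playwright_version,
            "duration_seconds": duration_seconds,
            "counts": counts,
            "verdict": verdict,
            "reasons": reasons,
        },
        indent=2,
    )
```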

package/skills/qa/playwright-e2e-execution-run/metadata.json ADDED

@@ -0,0 +1,24 @@
+{
+  "id": "playwright-e2e-execution-run",
+  "name": "Playwright E2E Execution Run",
+  "type": "skill",
+  "provider": "generic",
+  "harnesses": ["claude-code", "cursor"],
+  "summary": "Execute an existing Playwright E2E suite against an operator-confirmed non-production target and emit a structured run attestation — pass/fail/flaky counts, slowest tests, and trace artifact locations. Live-execution counterpart to playwright-e2e-suite-review.",
+  "source_type": "original",
+  "official_docs": [
+    "https://playwright.dev/docs/test-cli",
+    "https://playwright.dev/docs/running-tests",
+    "https://playwright.dev/docs/test-reporters",
+    "https://playwright.dev/docs/trace-viewer",
+    "https://playwright.dev/docs/ci"
+  ],
+  "security_notes": "Live-execution skill, read-only-runtime tier. Default mode is static and runs nothing; runtime execution is a per-session opt-in requiring explicit operator confirmation of a non-production target. The Bash allowlist locks invocations to `npx playwright test`, `npx playwright install`, and `npx playwright show-report` — no deploy, migration, seed, or registry commands. Refuses production targets. Never accepts or echoes credentials, tokens, or storageState; test credentials come from the operator-controlled environment. Egress limited to the operator-confirmed target host and the Playwright browser CDN; blocked CDN egress degrades to manual-review rather than a false fail.",
+  "last_verified": "2026-05-17",
+  "path": "skills/qa/playwright-e2e-execution-run",
+  "category": "delivery",
+  "lifecycle": "experimental",
+  "execution_tier": "read-only-runtime",
+  "author": "github: Raishin",
+  "version": "0.1.0"
+}