@brunosps00/dev-workflow 0.11.0 → 0.15.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +54 -5
- package/lib/constants.js +20 -20
- package/lib/init.js +24 -1
- package/lib/migrate-skills.js +129 -0
- package/lib/removed-bundled-skills.js +16 -0
- package/lib/uninstall.js +6 -2
- package/lib/utils.js +43 -1
- package/package.json +1 -1
- package/scaffold/en/agent-instructions.md +68 -0
- package/scaffold/en/commands/dw-autopilot.md +1 -1
- package/scaffold/en/commands/dw-brainstorm.md +1 -1
- package/scaffold/en/commands/dw-bugfix.md +4 -3
- package/scaffold/en/commands/dw-code-review.md +1 -0
- package/scaffold/en/commands/dw-create-tasks.md +6 -0
- package/scaffold/en/commands/dw-create-techspec.md +1 -1
- package/scaffold/en/commands/dw-deps-audit.md +1 -1
- package/scaffold/en/commands/dw-fix-qa.md +1 -1
- package/scaffold/en/commands/dw-functional-doc.md +2 -2
- package/scaffold/en/commands/dw-help.md +2 -2
- package/scaffold/en/commands/dw-redesign-ui.md +7 -7
- package/scaffold/en/commands/dw-run-qa.md +5 -4
- package/scaffold/en/commands/dw-run-task.md +2 -2
- package/scaffold/en/templates/constitution-template.md +1 -1
- package/scaffold/pt-br/agent-instructions.md +68 -0
- package/scaffold/pt-br/commands/dw-autopilot.md +1 -1
- package/scaffold/pt-br/commands/dw-brainstorm.md +1 -1
- package/scaffold/pt-br/commands/dw-bugfix.md +4 -3
- package/scaffold/pt-br/commands/dw-code-review.md +1 -0
- package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
- package/scaffold/pt-br/commands/dw-create-techspec.md +1 -1
- package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
- package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
- package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
- package/scaffold/pt-br/commands/dw-help.md +2 -2
- package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
- package/scaffold/pt-br/commands/dw-run-qa.md +5 -4
- package/scaffold/pt-br/commands/dw-run-task.md +2 -2
- package/scaffold/pt-br/templates/constitution-template.md +1 -1
- package/scaffold/skills/dw-council/SKILL.md +1 -1
- package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
- package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
- package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
- package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
- package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
- package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
- package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
- package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
- package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
- package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
- package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
- package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
- package/scaffold/skills/dw-testing-discipline/SKILL.md +171 -0
- package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
- package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +336 -0
- package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
- package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +163 -0
- package/scaffold/skills/dw-testing-discipline/references/patterns.md +241 -0
- package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +282 -0
- package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/security-boundary.md +1 -1
- package/scaffold/skills/dw-ui-discipline/SKILL.md +150 -0
- package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +225 -0
- package/scaffold/skills/dw-ui-discipline/references/curated-defaults.md +195 -0
- package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +162 -0
- package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +101 -0
- package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
- package/scaffold/skills/ui-ux-pro-max/LICENSE +0 -21
- package/scaffold/skills/ui-ux-pro-max/SKILL.md +0 -659
- package/scaffold/skills/ui-ux-pro-max/data/_sync_all.py +0 -414
- package/scaffold/skills/ui-ux-pro-max/data/app-interface.csv +0 -31
- package/scaffold/skills/ui-ux-pro-max/data/charts.csv +0 -26
- package/scaffold/skills/ui-ux-pro-max/data/colors.csv +0 -162
- package/scaffold/skills/ui-ux-pro-max/data/design.csv +0 -1776
- package/scaffold/skills/ui-ux-pro-max/data/draft.csv +0 -1779
- package/scaffold/skills/ui-ux-pro-max/data/google-fonts.csv +0 -1924
- package/scaffold/skills/ui-ux-pro-max/data/icons.csv +0 -106
- package/scaffold/skills/ui-ux-pro-max/data/landing.csv +0 -35
- package/scaffold/skills/ui-ux-pro-max/data/products.csv +0 -162
- package/scaffold/skills/ui-ux-pro-max/data/react-performance.csv +0 -45
- package/scaffold/skills/ui-ux-pro-max/data/stacks/angular.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/astro.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/flutter.csv +0 -53
- package/scaffold/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +0 -56
- package/scaffold/skills/ui-ux-pro-max/data/stacks/jetpack-compose.csv +0 -53
- package/scaffold/skills/ui-ux-pro-max/data/stacks/laravel.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/nextjs.csv +0 -53
- package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +0 -59
- package/scaffold/skills/ui-ux-pro-max/data/stacks/react-native.csv +0 -52
- package/scaffold/skills/ui-ux-pro-max/data/stacks/react.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/shadcn.csv +0 -61
- package/scaffold/skills/ui-ux-pro-max/data/stacks/svelte.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/swiftui.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/threejs.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/vue.csv +0 -50
- package/scaffold/skills/ui-ux-pro-max/data/styles.csv +0 -85
- package/scaffold/skills/ui-ux-pro-max/data/typography.csv +0 -74
- package/scaffold/skills/ui-ux-pro-max/data/ui-reasoning.csv +0 -162
- package/scaffold/skills/ui-ux-pro-max/data/ux-guidelines.csv +0 -100
- package/scaffold/skills/ui-ux-pro-max/scripts/core.py +0 -262
- package/scaffold/skills/ui-ux-pro-max/scripts/design_system.py +0 -1148
- package/scaffold/skills/ui-ux-pro-max/scripts/search.py +0 -114
- package/scaffold/skills/ui-ux-pro-max/skills/brand/SKILL.md +0 -97
- package/scaffold/skills/ui-ux-pro-max/skills/design/SKILL.md +0 -302
- package/scaffold/skills/ui-ux-pro-max/skills/design-system/SKILL.md +0 -244
- package/scaffold/skills/ui-ux-pro-max/templates/base/quick-reference.md +0 -297
- package/scaffold/skills/ui-ux-pro-max/templates/base/skill-content.md +0 -358
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/agent.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/augment.json +0 -18
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/claude.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/codebuddy.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/codex.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/continue.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/copilot.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/cursor.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/droid.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/gemini.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/kilocode.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/kiro.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/opencode.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/qoder.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/roocode.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/trae.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/warp.json +0 -18
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/windsurf.json +0 -21
- package/scaffold/skills/webapp-testing/SKILL.md +0 -138
- package/scaffold/skills/webapp-testing/assets/test-helper.js +0 -56
- package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/three-workflow-patterns.md +0 -0

package/scaffold/skills/dw-llm-eval/references/rag-metrics.md
@@ -0,0 +1,186 @@
# RAG evaluation — three orthogonal metrics

Retrieval-augmented generation (RAG) has three failure modes, each requiring its own metric. Measure all three. Measuring only one creates blind spots.

## The three metrics

### 1. Retrieval precision@k

**What it measures:** of the top-K chunks retrieved, how many were RELEVANT to the user's query?

**How to compute:**

```python
def precision_at_k(retrieved_chunk_ids, relevant_chunk_ids, k=5):
    top_k = retrieved_chunk_ids[:k]
    relevant_in_top_k = sum(1 for cid in top_k if cid in relevant_chunk_ids)
    return relevant_in_top_k / k
```

**Reference data needed:** for each test case, the human-labeled set of "chunks that should have been retrieved" — the ground truth.

**Target:** depends on K. For k=5, target precision >0.6 (3 of 5 chunks relevant). For k=10, target >0.5.

**What it catches:** retrieval is bringing back junk: the chunk embeddings are wrong, the index is stale, or the query rewriting is broken.

**What it misses:** the LLM may still produce a great answer even from imperfect retrieval — or a hallucinated answer despite perfect retrieval. Pair with metrics #2 and #3.

### 2. Answer faithfulness

**What it measures:** does the answer make claims that are SUPPORTED by the retrieved context? Or does it fabricate?

**How to compute (rung-4 LLM-as-judge with rubric):**

The judge sees: user question + retrieved context + generated answer. Scores 1-5 per the faithfulness rubric (see `judge-calibration.md` for an example).
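A minimal sketch of what that judge call can look like, assuming a thin `call_judge_model(prompt)` wrapper around whatever chat-completion client the project already uses (the wrapper and the chunk `.text` attribute are assumptions, not part of this package); the rubric wording should come from the calibrated rubric in `judge-calibration.md`:

```python
FAITHFULNESS_PROMPT = """You are grading a RAG answer for faithfulness.

Question:
{query}

Retrieved context:
{context}

Answer:
{answer}

Score 1-5 per the rubric: 5 = every claim is supported by the context,
1 = the answer contradicts or fabricates beyond the context.
Reply with a single integer."""


def llm_judge_faithfulness(query, context, answer):
    """Return a 1-5 faithfulness score for one case (calibrate the judge before trusting it)."""
    context_text = "\n---\n".join(c.text for c in context)  # assumes chunks expose their text
    prompt = FAITHFULNESS_PROMPT.format(query=query, context=context_text, answer=answer)
    return int(call_judge_model(prompt).strip())  # call_judge_model: your chat-API wrapper
```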
**Reference data needed:** the retrieved context (saved from the run) and the answer. No ground-truth answer required — the judge checks claim-by-claim against the context.

**Target:** 80% of cases score ≥4 on the 1-5 scale.

**What it catches:** hallucination — the answer says things the context didn't support. This is the #1 failure mode in production RAG.

**What it misses:** the answer might be faithful to the retrieved context but the retrieved context might be WRONG. Pair with metric #1.

### 3. Context utilization

**What it measures:** did the answer USE the retrieved context, or ignore it and produce a generic / parametric-memory response?

**How to compute (heuristic + LLM-as-judge hybrid):**

Heuristic part — n-gram overlap or semantic similarity:

```python
def ngrams(text, n):
    # simple whitespace-token n-grams; swap in a real tokenizer if you have one
    tokens = text.lower().split()
    return zip(*(tokens[i:] for i in range(n)))


def context_overlap(answer, context, n=3):
    answer_ngrams = set(ngrams(answer, n))
    context_ngrams = set(ngrams(context, n))
    if not answer_ngrams:
        return 0.0
    return len(answer_ngrams & context_ngrams) / len(answer_ngrams)
```

Judge part — ask if the answer would change materially without the context:

> "If the retrieved context were removed, would the answer be substantially different? 1 = same as without context (didn't use it), 5 = fully context-grounded."

**Target:** 70%+ overlap on substantive answers; judge score ≥4 on 80% of cases.

**What it catches:** the answer is faithful to the context (metric #2 passes) but ignores it — the model used its parametric memory instead. This means retrieval is doing nothing.

**What it misses:** the answer might use the context but cite it incorrectly. Pair with metric #2.

## Why all three are needed

| Metric | Detects | Misses |
|--------|---------|--------|
| Retrieval precision@k | Junk in retrieval | Faithfulness; utilization |
| Answer faithfulness | Hallucination | Retrieval quality; whether context was used |
| Context utilization | Ignoring retrieval | Hallucination beyond context; retrieval quality |

A RAG system can fail in all three independent ways. Measuring only one creates blind spots in the other two.
## Combined metric example

```python
def evaluate_rag(case):
    retrieved = retrieve(case.query)
    answer = generate(case.query, retrieved)

    return {
        'precision_at_5': precision_at_k(
            [c.id for c in retrieved],
            case.relevant_chunk_ids,
            k=5
        ),
        'faithfulness': llm_judge_faithfulness(
            query=case.query,
            context=retrieved,
            answer=answer
        ),
        # n-gram overlap needs the chunk text, not the chunk objects
        'context_utilization_overlap': context_overlap(
            answer, " ".join(c.text for c in retrieved)
        ),
        'context_utilization_judge': llm_judge_utilization(
            query=case.query,
            context=retrieved,
            answer=answer
        ),
    }
```

Aggregate per-case scores into the per-run summary (an aggregation sketch follows the example):

```
Run 2026-05-12:
  precision@5: 0.68 (target >0.6) ✓
  faithfulness ≥4: 83% (target >80%) ✓
  context utilization: 72% (target >70%) ✓
  Overall: PASS
```
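A hedged sketch of that aggregation, assuming the per-case dicts returned by `evaluate_rag()` above and the targets stated in this document; adjust the field names to whatever your runner actually records:

```python
def summarize_run(cases):
    results = [evaluate_rag(case) for case in cases]
    n = len(results)

    precision = sum(r['precision_at_5'] for r in results) / n
    faithful_pct = sum(1 for r in results if r['faithfulness'] >= 4) / n
    utilization = sum(r['context_utilization_overlap'] for r in results) / n

    passed = precision > 0.6 and faithful_pct > 0.8 and utilization > 0.7
    print(f"  precision@5: {precision:.2f} (target >0.6)")
    print(f"  faithfulness ≥4: {faithful_pct:.0%} (target >80%)")
    print(f"  context utilization: {utilization:.0%} (target >70%)")
    print("  Overall:", "PASS" if passed else "FAIL")
    return passed
```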
## Common RAG failure modes

| Symptom | Likely metric that catches it |
|---------|------------------------------|
| User says "the bot is making stuff up" | Faithfulness |
| User says "the bot didn't see my documents" | Context utilization (or retrieval precision) |
| User says "the bot is bad at finding things" | Retrieval precision@k |
| User says "the answer is correct but ignores recent updates" | Retrieval recall (precision's partner — different metric) |
| User says "the bot gives the same generic answer no matter what I ask" | Context utilization |
| User says "the bot says the doc says X but it doesn't" | Faithfulness |

The metric points at the layer to fix. Without it, debugging is guesswork.

## Retrieval recall (the fourth metric, conditional)

Precision asks "of what we retrieved, how much was good?" Recall asks "of what was good, how much did we retrieve?"

In production RAG with many candidate chunks, recall is often the limiting factor — the right chunk exists in the index but doesn't surface.

Compute:

```python
def recall_at_k(retrieved_chunk_ids, relevant_chunk_ids, k=5):
    top_k = set(retrieved_chunk_ids[:k])
    return len(top_k & set(relevant_chunk_ids)) / len(relevant_chunk_ids)
```

Track recall when:
- The corpus is large (>1000 chunks per query domain).
- Users report "the bot can't find things that exist in our docs."
- You're tuning the retrieval pipeline (chunking strategy, embedding model, search algorithm).

Skip recall when:
- The corpus is small (top-K already covers ~10% of the corpus, so recall is high by default).
- Precision is the dominant problem.

## Dataset structure for RAG

```json
{
  "id": "rag-case-001",
  "query": "What's our PTO policy for sabbatical years?",
  "expected": {
    "relevant_chunk_ids": ["chunk-policy-pto-2024", "chunk-policy-sabbatical"],
    "expected_answer_themes": ["accrual rate", "carryover limits", "sabbatical exception"],
    "should_cite": true
  },
  "metadata": {
    "source": "production-2026-04-12-support-thread-S-892",
    "difficulty": "medium",
    "tags": ["pto-policy", "sabbatical", "rare-query"]
  }
}
```

The `relevant_chunk_ids` field requires human labeling — a domain expert reviews the corpus and identifies which chunks SHOULD surface for that query.

## Anti-patterns

- **Measuring only one metric** (usually faithfulness via LLM-as-judge) → blind to retrieval and utilization failures.
- **No human-labeled relevance** → can't compute precision/recall.
- **Treating retrieval and generation as one black box** → can't tell which layer regressed.
- **Eval set drawn only from "easy" queries** → metrics are good in test, terrible in production.
- **Ignoring recent-information bias** (RAG must use retrieval; parametric memory is stale) → the context utilization metric catches this.

## Tooling

- **ragas** (open source) implements precision, recall, faithfulness, and other RAG metrics with LLM judges. Use as a reference implementation.
- **Custom implementation** is straightforward — the metrics above are <100 lines of Python each.
- **LangSmith / Weights & Biases** wrap eval runs with tracking but don't replace the core metrics.

The discipline isn't tool choice; it's measuring all three orthogonal dimensions every run.
package/scaffold/skills/dw-llm-eval/references/reference-dataset.md
@@ -0,0 +1,190 @@
# Reference dataset — 20 from failures beats 200 perfect

The dataset is the bedrock. Without one, every "improvement" is anecdote and every regression goes unnoticed until users complain. With one, you can measure change.

## The 20-from-failures principle

> 20 unambiguous cases drawn from real production failures beat 200 synthetic perfect cases.

Why:
- Synthetic cases reflect what the team IMAGINED would happen — they cover the cases the team already knows about.
- Production failures reflect what ACTUALLY happens — they cover blind spots and edge cases.
- 20 well-curated cases at the right level of difficulty discriminate models better than 200 average-difficulty cases.

A 20-case dataset is enough to:
- Detect regressions of >10% in accuracy.
- Calibrate the LLM-as-judge.
- Run cheaply enough to evaluate on every PR.

Scale up only when you've validated that 20 is producing useful signal.

## Where cases come from

In rough order of priority:

1. **Real production failures.** A user reported a bug, support escalated a case, error logs show an unexpected output. Each becomes a case. Sanitize PII before saving.
2. **Edge cases discovered during development.** "Oh, what if the user asks X?" — add it.
3. **Adversarial examples.** Inputs designed to trip the system (prompt injection, ambiguity, contradiction). Especially important for chat/RAG.
4. **Boundary inputs.** Empty, very long, special characters, mixed languages, unusual encodings.
5. **Synthetic — last resort.** Only when no real input exists yet (e.g., pre-launch). Mark them as synthetic; replace them as production data arrives.

Target distribution: **80% real production-sourced**, 20% adversarial/boundary. Pure-synthetic datasets give pure-synthetic confidence.

## Case structure

Each case is one line in `cases.jsonl`:

```json
{
  "id": "case-001",
  "input": {
    "user_message": "I want to cancel my subscription",
    "user_context": { "tier": "premium", "tenure_months": 18 }
  },
  "expected": {
    "intent": "cancellation_request",
    "should_offer_retention": true,
    "tone_targets": ["empathetic", "non-manipulative"]
  },
  "rubric_criteria": ["faithfulness", "tone", "completeness"],
  "metadata": {
    "source": "production-2026-04-12-ticket-T-1234",
    "added_at": "2026-04-15",
    "added_by": "@bruno",
    "difficulty": "medium",
    "tags": ["cancellation", "retention", "premium-user"]
  }
}
```

Fields (a loading sketch follows the list):

- **`id`** — stable identifier (you'll reference it in regression reports).
- **`input`** — what the system receives. Match the production input shape exactly.
- **`expected`** — for rungs 1-3, the deterministic expected output or state change. For rung 4, the rubric target (what a 5/5 answer would do).
- **`rubric_criteria`** — which rubric dimensions this case exercises.
- **`metadata.source`** — provenance. Production ticket? Synthetic? Adversarial?
- **`metadata.difficulty`** — easy/medium/hard. Track score by difficulty bucket.
- **`metadata.tags`** — for filtering ("show me cases that exercise retention logic").
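A hedged sketch of consuming this file, using the field names above and the dataset path from the layout below; `load_cases` and the `my-feature` directory name are illustrative, not part of this package:

```python
import json
from pathlib import Path


def load_cases(path=".dw/eval/datasets/my-feature/cases.jsonl", tags=None):
    """Load cases.jsonl, optionally keeping only cases that carry one of the given tags."""
    cases = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        case = json.loads(line)
        if tags and not set(tags) & set(case["metadata"]["tags"]):
            continue
        cases.append(case)
    return cases


# e.g. only the cases that exercise retention logic
retention_cases = load_cases(tags=["retention"])
```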
## Dataset layout

```
.dw/eval/datasets/<feature-name>/
├── README.md              # provenance, sample size, last review, change log
├── cases.jsonl            # the cases themselves
├── rubric.md              # the rubric used for rung-4 scoring
├── runs/
│   ├── 2026-05-01.jsonl   # one line per case with scores from that run
│   ├── 2026-05-08.jsonl
│   └── ...
├── calibration/
│   ├── 2026-05-12-human-scores.jsonl
│   └── spearman-2026-05-12.txt
└── changelog.md           # when cases were added/removed and why
```

Everything is committed. Datasets evolve with the feature; the git history shows when and why.

## README.md template

```markdown
# Reference dataset — <feature name>

**Purpose:** evaluate <feature> for <quality dimensions>.

**Current size:** N cases (X production-sourced, Y adversarial, Z synthetic).

**Difficulty distribution:** easy: A, medium: B, hard: C.

**Last reviewed:** YYYY-MM-DD.

**Maintainers:** @name1, @name2.

## When to expand

Add a case when:
- A new production failure is observed (always — that's the primary signal).
- A new edge case is identified during development.
- A new adversarial pattern is discovered (security review, red-team session).

Do NOT add cases just to inflate the count. The 20-from-failures principle: quality over quantity.

## When to retire a case

Retire (don't delete) when:
- The behavior the case checked is no longer relevant (feature removed).
- The case became trivially passing across all model versions (it's no longer discriminating).

Move retired cases to `cases-retired.jsonl` with a `retired_reason`.
```

## Adding cases from production

Process:

1. **Capture the failure** — paste the actual input that failed into `cases-pending.jsonl`.
2. **Sanitize PII** — replace names, emails, IDs, account numbers with realistic-but-fake equivalents. NEVER commit real user data.
3. **Define expected behavior** — what SHOULD have happened. Get sign-off from a domain expert if subjective.
4. **Categorize** — difficulty, tags, rubric criteria.
5. **Promote** to `cases.jsonl` after review.

## Sampling for regression runs

You don't need to re-run the entire dataset every time. Smart sampling (a selection sketch follows the list):

- **PR-time:** random sample of 30% + all "high difficulty" cases + any case added in the last 30 days. Fast feedback.
- **Pre-merge to main:** full dataset.
- **Nightly:** full dataset + judge re-calibration check.
- **Pre-deploy:** full dataset + manual eyeball on 10 random outputs.
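A hedged sketch of the PR-time selection, assuming the `metadata.difficulty` and `metadata.added_at` fields from the case structure above and mapping "high difficulty" to the `hard` bucket; `pr_time_sample` is illustrative, not part of this package:

```python
import random
from datetime import date, timedelta


def pr_time_sample(cases, fraction=0.3, recent_days=30):
    """30% random sample, plus every hard case and every recently added case."""
    cutoff = (date.today() - timedelta(days=recent_days)).isoformat()
    must_run = [
        c for c in cases
        if c["metadata"]["difficulty"] == "hard" or c["metadata"]["added_at"] >= cutoff
    ]
    rest = [c for c in cases if c not in must_run]
    return must_run + random.sample(rest, int(len(rest) * fraction))
```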
## Detecting drift

After each run, compare against the prior run on the SAME cases (a comparison sketch follows the example):

```
Run 2026-05-08 vs 2026-05-01:
  faithfulness: 4.2 → 3.9 (-0.3) ⚠ regression
  completeness: 4.0 → 4.1 (+0.1)
  tone: 4.5 → 4.4 (-0.1)
  outcome accuracy: 95% → 92% ⚠ regression
```
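A hedged sketch of that comparison over two `runs/<date>.jsonl` files; the `case_id` and `scores` field names are assumptions about what the run records store, so adapt them to your runner:

```python
import json


def load_run(path):
    """Map case_id -> {criterion: score} for one runs/<date>.jsonl file."""
    run = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                run[record["case_id"]] = record["scores"]
    return run


def compare_runs(prev_path, curr_path, threshold=0.2):
    prev, curr = load_run(prev_path), load_run(curr_path)
    shared = sorted(prev.keys() & curr.keys())  # compare the SAME cases only
    criteria = sorted({crit for cid in shared for crit in prev[cid]})
    for crit in criteria:
        before = sum(prev[cid][crit] for cid in shared) / len(shared)
        after = sum(curr[cid][crit] for cid in shared) / len(shared)
        flag = " ⚠ regression" if before - after > threshold else ""
        print(f"  {crit}: {before:.1f} → {after:.1f} ({after - before:+.1f}){flag}")
```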
Two ways drift happens:

1. **Code change degraded quality** — your model swap or prompt tweak hurt something. Bisect.
2. **Judge drift** — the LLM-as-judge itself changed (vendor updated the model). Re-calibrate; the "regression" may be the judge, not the system.

## Dataset versioning

When the dataset materially changes (cases added/removed in batch, rubric updated), bump a version in the README:

```
Dataset version: 2.3
- v2.3 (2026-05-12): added 8 cases from production tickets in the last 30 days
- v2.2 (2026-04-15): retired 3 cases that became trivially passing
- v2.1 (2026-03-20): rubric updated to add "completeness" criterion
```

Each run logs which dataset version it ran against. You can't compare a v1 score to a v3 score directly.

## Cost discipline

- Keep the dataset SMALL on purpose. 20-50 cases for most features; 100+ only if the feature has many categorically different inputs.
- Cheap evaluations (rungs 1-3) run on every case every time.
- Expensive evaluations (rung 4) run on samples — 20-50 random cases — except for full pre-deploy runs.

## Anti-patterns

- **Synthetic-only dataset.** No connection to real production. The confidence isn't real.
- **Dataset grew to 500 cases nobody re-reads.** Half are duplicates; half are no longer discriminating. Audit and prune.
- **Cases without expected behavior.** "Just look at the output." No measurement possible.
- **Dataset not committed.** Lives in a notebook; ephemeral; lost when the person leaves.
- **No metadata tracking source.** Can't tell synthetic from real; can't audit dataset quality.
- **Dataset reused across features.** Each feature has its own dataset; one-size-fits-all is one-size-fits-none.

## Cross-reference

- `oracle-ladder.md` — what to assert per case.
- `judge-calibration.md` — how to make rung-4 judgments meaningful.
- `rag-metrics.md` — RAG-specific extras for the dataset structure.
- `agent-eval.md` — agent-specific extras (trajectory matching).
package/scaffold/skills/dw-testing-discipline/SKILL.md
@@ -0,0 +1,171 @@
---
name: dw-testing-discipline
description: Use when authoring, reviewing, or debugging tests — enforces six core rules (assert behavior, push to lowest layer, fix prod first on red, real systems gate merge, mutation > coverage, no test backdoors), a catalog of anti-patterns, agent-authoring guardrails, and flaky-test discipline so tests reveal bugs instead of decorating CI.
---

# Testing Discipline

## Founding principle

> Tests exist to expose defects, not to keep CI green.
> A test that fails has done its job.
> A test that passes for the wrong reason is worse than no test.

Everything else in this skill follows from that.

## The six core rules

```
1. Test the behavior, never the mock.
2. Push each test to the lowest layer that can detect the defect.
3. When a test fails, read production first — change the test only with documented justification.
4. Real systems gate the merge. Mocks isolate; they do not validate.
5. Coverage is a flashlight; mutation score is a quality probe. Neither is a target.
6. No test-only methods, branches, or flags leak into production code.
```

Each rule has nuance; read `references/core-rules.md` for the long version with examples.

## When to use

- Authoring any test (unit, integration, contract, E2E).
- Reviewing a PR diff under test paths.
- Debugging a flaky test (or considering retry-as-fix — read `references/flaky-discipline.md` first).
- Generating tests via an AI agent → invokes `references/agent-guardrails.md` automatically.
- Browser-based E2E with Playwright → recipes in `references/playwright-recipes.md`.
- Verifying browser-side trust boundaries (auth, CSRF, headers) → `references/security-boundary.md`.
- Picking which test workflow applies (UI / network / perf) → `references/three-workflow-patterns.md`.

## Reference router

| Doing what | Read |
|------------|------|
| Placing a new test (which layer?) | `references/core-rules.md` (Rule 2 deep-dive) |
| Writing new tests | `references/patterns.md` |
| Reviewing tests / spotting smells | `references/anti-patterns.md` |
| Agent-generated tests | `references/agent-guardrails.md` + `references/anti-patterns.md` |
| Flaky tests | `references/flaky-discipline.md` |
| Playwright E2E | `references/playwright-recipes.md` |
| Browser trust boundary | `references/security-boundary.md` |
| Picking the right workflow | `references/three-workflow-patterns.md` |

## Patterns that produce reliable tests (one-liners; full in `references/patterns.md`)

1. Query by behavior and accessible role; never CSS selectors or DOM indices.
2. Selector ladder: role → label → text → test-id → structural. Stop at the highest rung that disambiguates.
3. Wait on observable conditions; never wall-clock sleeps.
4. Each test independent and order-free; lean on `beforeEach`, not `beforeAll`.
5. One behavior per test; as many assertions as that behavior requires.
6. Names read as specifications: `should <outcome> when <condition> given <state>`.
7. Table-driven / parameterized when inputs vary (see the sketch after this list).
8. Build test data via factories; literal blobs only for fields under test.
9. Mock at boundaries you don't control; real wiring for the systems you own.
10. Real systems gate the final merge; contract tests bridge unit and E2E.
11. Mutation score, not coverage percentage, measures suite strength.
12. Page Object Model is a tool; collapse it for small suites where it adds noise.
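A minimal pytest illustration of patterns 6 and 7 (spec-style names, table-driven inputs) plus a negative companion; `apply_discount` and its behavior are hypothetical stand-ins, not part of this package, and the maintained examples live in `references/patterns.md`:

```python
import pytest


def apply_discount(tier, amount):
    # stand-in implementation so the example is self-contained
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return amount * 0.9 if tier == "premium" else amount


@pytest.mark.parametrize(
    "tier, amount, expected",
    [
        ("premium", 100.0, 90.0),    # premium tier gets 10% off
        ("standard", 100.0, 100.0),  # standard tier pays full price
    ],
)
def test_should_apply_discount_only_when_tier_is_premium(tier, amount, expected):
    assert apply_discount(tier, amount) == expected


def test_should_reject_negative_amounts():
    # negative companion: invalid input gets its own test
    with pytest.raises(ValueError):
        apply_discount("premium", -5.0)
```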
## Anti-pattern catalog (four families, full in `references/anti-patterns.md`)

The four kinds of smell that produce most of the test debt:

**A. Fragile to refactor** — tests bound to internals, not behavior:
- Implementation-detail selectors.
- Asserting internal structure instead of observable outcome.
- Testing private methods directly.
- Snapshots replacing real assertions.
- Vague existence assertions (`toBeTruthy`, `should('exist')`).
- Actions with no assertion ("clicking save works").

**B. Non-deterministic outcomes** — tests that flip verdict on the same code:
- Static sleeps / fixed-timeout waits.
- Test order dependency / hidden shared state.
- Non-deterministic inputs (clock, RNG, locale).

**C. Mock-driven false confidence** — tests testing the test setup:
- Asserting the mock exists.
- Mock drift (mocked response no longer matches the real API).
- Over-mocking child components.
- Incomplete mocks (missing fields the code reads).
- Mocking the wrong level (mocking methods of the SUT itself).
- Asserting on a value the test body fed into a mock.

**D. Suite hygiene problems** — team and suite-level pathologies:
- Coverage as vanity metric.
- Happy-path-only coverage.
- Eternal `beforeAll` hiding dependencies.
- Cleanup in `afterEach` (move to `beforeEach`).
- Magic strings and logic in tests.
- Testing against third-party sites.
- Quarantine-as-cemetery (skip without owner or deadline).
- Retry-as-fix (auto-retry hiding real bugs).
- Duplicate tests across pyramid layers.
- Weakening assertions to make tests pass.

Total: 25 specific patterns across the four families.

## Agent-authoring guardrails (mandatory when an LLM writes tests)

Six guardrails block the most common failure modes when an LLM produces test code. Each is a pre-condition before the diff goes to review. Full prompts and verification in `references/agent-guardrails.md`:

1. **State the invariant first** — the agent prints `INVARIANT`, `OWNING_LAYER`, `EXISTING_SUITE` before writing code.
2. **Extend, don't sprawl** — the agent extends an existing suite; new files require a named invariant.
3. **Real execution somewhere** — at least one test path runs against real systems before merge.
4. **Red? Read production** — on failure, the agent reads production code first and prints `ANALYSIS:` before changing tests.
5. **Classify before snapshot** — snapshots only with an explicit `PRODUCT_CONTRACT` classification; `IMPLEMENTATION_DETAIL` forbids them.
6. **Negative companion** — every positive assertion ships with a negative test for invalid input or a failure mode.

## Placement doctrine (tripwires)

Before writing test code:

- Name the invariant in **one sentence**. Fuzzy language signals unclear requirements — stop and clarify.
- Place the test at the **lowest layer** capable of detecting the defect when the invariant breaks.
- Reject tests where (`likelihood-of-bug` × `blast-radius`) falls below a ten-minute-maintenance threshold (the test is more expensive to maintain than the bug would be to fix).

## Flaky discipline (tripwires)

- Quarantine flaky tests within ONE HOUR of detection. Assign a named owner within 24 hours with a fix-by date.
- Track `flaky_rate` as a first-class metric: SLO under 1–2%; alert at >5%.
- Real systems at the final gate: mock at unit; contract-test boundaries; real DB/queue/route at integration; near-zero mocks at E2E.

Full taxonomy in `references/flaky-discipline.md`.

## Cross-cutting red flags

Any of these in a PR is enough for a REJECT verdict:

- Mock setup larger than the test logic.
- Test breaks when an internal method is renamed (not the public contract).
- Removing the assertion body leaves the test green.
- Test fails when run with `.only` in isolation.
- `sleep`, `Thread.sleep`, or `cy.wait(<number>)` appears.
- Selector contains a CSS class, index, or `xpath`.
- Test asserts a third-party site is reachable.
- Snapshot diffs accepted without reading.
- Coverage percentage is the only metric quoted.
- Failing tests auto-retried until green; no investigation.
- Skipped/quarantined tests without a named owner and fix-by date.
- Test depends on `new Date()`, `Math.random()`, or system locale.
- `afterEach` resets database state.
- Agent-written test has 6+ assertions and zero edge cases.
- The diff contains the phrase "I'll mock this to be safe."

## When NOT to use this skill

- General code review unrelated to tests.
- Library-specific debugging where the test is just a reproduction.
- Non-testing CI pipeline design (deploys, artifacts, secrets).
- Production observability and alerting.
- Single-line typo fixes in existing tests.

## Integration with dev-workflow commands

- `/dw-create-tasks` applies the placement doctrine — each test-adding task names the invariant.
- `/dw-run-task` runs the 6 agent guardrails when generating tests during implementation.
- `/dw-code-review` runs the anti-pattern checks on diff hunks under test paths.
- `/dw-fix-qa` applies the flaky-discipline taxonomy in retest cycles.
- `/dw-run-qa` (UI mode) references `playwright-recipes.md` for concrete recipes.

## Bottom line

> A test that cannot fail is decorative. A test that fails for the wrong reason is misleading. Build tests that fail for exactly one reason — the reason the invariant was violated — and trust them when they do. Mocks isolate. Real systems validate. Coverage shines a light. Mutation score grades the suite. Agents will reach for the mock and the snapshot; the guardrails make them put both down. Tests reveal bugs, not just pass.