@brunosps00/dev-workflow 0.13.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/README.md +9 -3
  2. package/package.json +1 -1
  3. package/scaffold/en/commands/dw-bugfix.md +2 -1
  4. package/scaffold/en/commands/dw-code-review.md +1 -0
  5. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  6. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  7. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  8. package/scaffold/en/commands/dw-functional-doc.md +1 -1
  9. package/scaffold/en/commands/dw-help.md +1 -1
  10. package/scaffold/en/commands/dw-redesign-ui.md +1 -1
  11. package/scaffold/en/commands/dw-run-qa.md +2 -1
  12. package/scaffold/en/commands/dw-run-task.md +1 -1
  13. package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
  14. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  15. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  16. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
  19. package/scaffold/pt-br/commands/dw-help.md +1 -1
  20. package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
  21. package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
  22. package/scaffold/pt-br/commands/dw-run-task.md +1 -1
  23. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  24. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  25. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  26. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  27. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  28. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  29. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  30. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  31. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  32. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  33. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  34. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  35. package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
  36. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  37. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
  38. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  39. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
  40. package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
  41. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
  42. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  43. package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
  44. package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
  45. package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
  46. /package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0
@@ -0,0 +1,171 @@
1
+ # Oracle ladder — climb deliberately
2
+
3
+ Five rungs ordered by cost (cheap → expensive) and rigor (strict → subjective). Start at the bottom. Every rung up costs an order of magnitude more in latency, money, or calibration effort. Don't reach for an upper rung when a lower one can prove the case.
4
+
5
+ ## Rung 1 — Exact match
6
+
7
+ **What it checks:** the output equals the expected output, byte-for-byte (or after a normalization step like JSON canonicalization).
8
+
9
+ **Use when:**
10
+ - Output is a structured function call: `expect(toolCalls[0]).toEqual({ name: 'search', args: { q: 'invoices' } })`.
11
+ - Output is a classification from a fixed label set: `expect(label).toBe('refund-request')`.
12
+ - Output is a parsed value from a JSON contract: `expect(result.user_id).toBe('u-42')`.
13
+
14
+ **Example:**
15
+
16
+ ```javascript
17
+ test('classifier labels refund requests correctly', async () => {
18
+ const cases = await loadDataset('.dw/eval/datasets/classifier/cases.jsonl');
19
+ for (const c of cases.filter(c => c.expected === 'refund-request')) {
20
+ expect(await classify(c.input)).toBe('refund-request');
21
+ }
22
+ });
23
+ ```
24
+
25
+ **Cost:** ~free.
26
+ **Limitation:** can't handle creative outputs (paragraphs, summaries). Don't try to force-fit.
27
+
28
+ ## Rung 2 — Schema validation
29
+
30
+ **What it checks:** the output matches a structural contract — types, required fields, value ranges. The SHAPE is fixed; specific values can vary.
31
+
32
+ **Use when:**
33
+ - LLM returns structured data with stable schema (JSON, function call args) but variable content.
34
+ - You need to detect "agent returned garbage" without asserting on the exact garbage.
35
+
36
+ **Example:**
37
+
38
+ ```typescript
39
+ import { z } from 'zod';
40
+
41
+ const ResponseSchema = z.object({
42
+ summary: z.string().min(20).max(500),
43
+ citations: z.array(z.object({
44
+ url: z.string().url(),
45
+ page: z.number().int().optional(),
46
+ })).min(1),
47
+ confidence: z.number().min(0).max(1),
48
+ });
49
+
50
+ test('summarizer returns valid shape', async () => {
51
+ const result = await summarize(input);
52
+ expect(() => ResponseSchema.parse(result)).not.toThrow();
53
+ });
54
+ ```
55
+
56
+ **Cost:** ~free (schema check is cheap).
57
+ **Limitation:** doesn't tell you if the CONTENT is correct, only that it's the right shape. Pair with another rung.
58
+
59
+ ## Rung 3 — Outcome state
60
+
61
+ **What it checks:** a side effect occurred — DB row was created, file was written, tool was called with valid arguments, ticket was opened. The state of the world matches expectations.
62
+
63
+ **Use when:**
64
+ - Agent has tool access and the GOAL is to change state, not produce prose.
65
+ - RAG answer is supposed to lead to an action (e.g., "user clicked the suggested invoice and reconciled it").
66
+ - The system has observable side effects you can query post-hoc.
67
+
68
+ **Example:**
69
+
70
+ ```javascript
71
+ test('agent files refund request when user asks', async () => {
72
+ await agent.run('I want a refund for order #123');
73
+
74
+ const tickets = await db.tickets.findMany({ where: { order_id: '123' } });
75
+ expect(tickets).toHaveLength(1);
76
+ expect(tickets[0].type).toBe('refund');
77
+ expect(tickets[0].status).toBe('pending');
78
+ });
79
+ ```
80
+
81
+ **Cost:** cheap (1 DB query / API call per assertion).
82
+ **Limitation:** doesn't validate the PROSE the agent produced along the way. If the goal was "answer the user politely AND file the refund," rung 3 catches the action but not the politeness — climb to rung 4 for that.
83
+
84
+ **Key benefit:** catches "ghost actions" — agent claims to have done X but didn't actually do it. Rungs 1-2 trust the agent's word; rung 3 verifies the world.
85
+
86
+ ## Rung 4 — LLM-as-judge
87
+
88
+ **What it checks:** a different model grades the output against a rubric. Used for genuinely subjective quality — helpfulness, tone, faithfulness, completeness.
89
+
90
+ **Mandatory before using:**
91
+ - Calibrated against ≥20 human-graded cases (Spearman ≥0.80) — see `judge-calibration.md`.
92
+ - Different model than the system under test.
93
+ - Structured rubric, not free-form "rate 1-10."
94
+
95
+ **Example:**
96
+
97
+ ```javascript
98
+ test('chat response is faithful to retrieved context', async () => {
99
+ const cases = await loadDataset('.dw/eval/datasets/rag-chat/cases.jsonl');
100
+ const scores = [];
101
+
102
+ for (const c of cases) {
103
+ const answer = await chat(c.input, c.context);
104
+ const judgment = await llmJudge({
105
+ model: 'claude-opus-4-7', // different from system under test (GPT-4)
106
+ rubric: faithfulnessRubric,
107
+ input: c.input,
108
+ context: c.context,
109
+ output: answer,
110
+ });
111
+ scores.push(judgment.score);
112
+ }
113
+
114
+ // 80% of cases must score ≥4 on the 1-5 faithfulness rubric
115
+ const passing = scores.filter(s => s >= 4).length / scores.length;
116
+   expect(passing).toBeGreaterThanOrEqual(0.8);
117
+ });
118
+ ```
119
+
120
+ **Cost:** medium-to-high (one judge call per case; pay per case at API rates).
121
+ **Limitation:** the judge has bias and drift; without calibration, you're measuring the judge's mood. Re-calibrate every quarter, after every model swap, and after rubric changes.
122
+
123
+ ## Rung 5 — Human review
124
+
125
+ **What it checks:** a domain expert scores the output. This is the gold standard that rung-4 rubrics are calibrated against.
126
+
127
+ **Use when:**
128
+ - Calibrating LLM-as-judge (rung 4 setup).
129
+ - High-stakes outputs where automation isn't trusted (medical, legal, financial).
130
+ - Edge cases that automated rungs flag as borderline.
131
+
132
+ **Cost:** expensive. Don't scale; sample.
133
+
134
+ **Pattern** (a routing sketch follows this list):
135
+ - Spot-check 5-10% of LLM-as-judge results randomly each week.
136
+ - Whenever LLM-as-judge score is "borderline" (e.g., 2.5-3.5 on 1-5 scale), kick to human.
137
+ - Full human review only for the calibration dataset and high-stakes edge cases.
138
+
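+ A minimal routing sketch for that pattern (Python for brevity; the `route_for_human_review` name, the `judge_score` field, and the thresholds are illustrative, not part of this package):
+
+ ```python
+ import random
+
+ BORDERLINE = (2.5, 3.5)   # judge scores in this band always go to a human
+ SPOT_CHECK_RATE = 0.07    # ~5-10% random weekly spot-check
+
+ def route_for_human_review(judged_cases):
+     # judged_cases: list of dicts with at least 'id' and 'judge_score' (1-5 scale)
+     borderline = [c for c in judged_cases
+                   if BORDERLINE[0] <= c['judge_score'] <= BORDERLINE[1]]
+     rest = [c for c in judged_cases if c not in borderline]
+     k = min(len(rest), max(1, round(len(rest) * SPOT_CHECK_RATE)))
+     return borderline + random.sample(rest, k)
+ ```
+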
139
+ ## The climbing decision tree
140
+
141
+ ```
142
+ Is the output a fixed-structure value (function call, classification, JSON with stable shape)?
143
+ ├── YES → Rung 1 (exact match) or Rung 2 (schema)
144
+ └── NO → does the output cause an observable side effect (DB write, tool call, ticket opened)?
145
+ ├── YES → Rung 3 (outcome state)
146
+ └── NO → output is subjective (prose, summary, recommendation). Rung 4 required.
147
+ └── Did you calibrate the judge against humans (≥20 cases, Spearman ≥0.80)?
148
+ ├── YES → Rung 4 is valid signal
149
+ └── NO → DO NOT USE Rung 4 yet. Calibrate first via Rung 5.
150
+ ```
151
+
152
+ ## Anti-patterns
153
+
154
+ - **Reaching for Rung 4 first** because "everything else seems hard." Climb the ladder; lower rungs catch loud failures cheaply.
155
+ - **Pretending Rung 4 is calibrated** by running it without checking against humans. Score numbers without calibration are decorative.
156
+ - **Skipping Rung 3 because "we have unit tests"** — unit tests with mocked tools prove the agent CALLED the tool. Rung 3 proves the tool's effect happened.
157
+ - **Mixing rungs in one assertion**: `expect(answer).toBe('Yes, your refund is being processed' /* exact */)` — when the exact text doesn't matter, rung 1 is the wrong tool.
158
+
159
+ ## Combining rungs
160
+
161
+ For a serious AI feature, expect to use 2-3 rungs together:
162
+
163
+ | Feature | Typical rung mix |
164
+ |---------|------------------|
165
+ | Classifier | Rung 1 (label correctness) + Rung 4 (rationale quality, if exposed to user) |
166
+ | RAG chat | Rung 2 (response shape) + Rung 3 (citations are valid URLs/IDs) + Rung 4 (faithfulness) |
167
+ | Agent (filing tickets) | Rung 3 (ticket created with correct fields) + Rung 4 (user-facing message tone) |
168
+ | Summarization | Rung 2 (length, structure) + Rung 4 (faithfulness, completeness) |
169
+ | Tool-use trajectory | Rung 1 (specific tool calls expected) + Rung 4 (intermediate reasoning quality, optional) |
170
+
171
+ The rule: cheap rungs catch the failures that scream; expensive rungs catch the failures that whisper. You need both.
@@ -0,0 +1,186 @@
1
+ # RAG evaluation — three orthogonal metrics
2
+
3
+ Retrieval-augmented generation (RAG) has three failure modes, each requiring its own metric. Measure all three. Measuring only one creates blind spots.
4
+
5
+ ## The three metrics
6
+
7
+ ### 1. Retrieval precision@k
8
+
9
+ **What it measures:** of the top-K chunks retrieved, how many were RELEVANT to the user's query?
10
+
11
+ **How to compute:**
12
+
13
+ ```python
14
+ def precision_at_k(retrieved_chunk_ids, relevant_chunk_ids, k=5):
15
+ top_k = retrieved_chunk_ids[:k]
16
+ relevant_in_top_k = sum(1 for cid in top_k if cid in relevant_chunk_ids)
17
+ return relevant_in_top_k / k
18
+ ```
19
+
20
+ **Reference data needed:** for each test case, the human-labeled set of "chunks that should have been retrieved" — the ground truth.
21
+
22
+ **Target:** depends on K. For k=5, target precision ≥0.6 (at least 3 of 5 chunks relevant). For k=10, target >0.5.
23
+
24
+ **What it catches:** retrieval is bringing back junk. Chunk embeddings are wrong, the index is stale, the query rewriting is broken.
25
+
26
+ **What it misses:** the LLM may still produce a great answer even from imperfect retrieval — or a hallucinated answer despite perfect retrieval. Pair with metrics #2 and #3.
27
+
28
+ ### 2. Answer faithfulness
29
+
30
+ **What it measures:** does the answer make claims that are SUPPORTED by the retrieved context? Or does it fabricate?
31
+
32
+ **How to compute (rung-4 LLM-as-judge with rubric):**
33
+
34
+ The judge sees: user question + retrieved context + generated answer. Scores 1-5 per the faithfulness rubric (see `judge-calibration.md` for an example).
35
+
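+ A minimal sketch of such a judge call (`call_judge_model` is a placeholder for whatever judge-model client you use, and the rubric path is an assumption following the layout in `reference-dataset.md`):
+
+ ```python
+ import json
+
+ FAITHFULNESS_RUBRIC = open('.dw/eval/datasets/rag-chat/rubric.md').read()
+
+ def llm_judge_faithfulness(query, context, answer):
+     # The judge model must differ from the system under test (oracle ladder, rung 4).
+     prompt = (
+         f"Rubric:\n{FAITHFULNESS_RUBRIC}\n\n"
+         f"Question:\n{query}\n\n"
+         f"Retrieved context:\n{context}\n\n"
+         f"Answer to grade:\n{answer}\n\n"
+         'Reply with JSON only: {"score": <1-5>, "rationale": "<one sentence>"}'
+     )
+     raw = call_judge_model(prompt)  # placeholder: your judge-model client
+     return json.loads(raw)['score']
+ ```
+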
36
+ **Reference data needed:** the retrieved context (saved from the run) and the answer. No ground-truth answer required — the judge checks claim-by-claim against the context.
37
+
38
+ **Target:** 80% of cases score ≥4 on the 1-5 scale.
39
+
40
+ **What it catches:** hallucination — the answer says things the context didn't support. This is the #1 failure mode in production RAG.
41
+
42
+ **What it misses:** the answer might be faithful to the retrieved context but the retrieved context might be WRONG. Pair with metric #1.
43
+
44
+ ### 3. Context utilization
45
+
46
+ **What it measures:** did the answer USE the retrieved context, or ignore it and produce a generic / parametric-memory response?
47
+
48
+ **How to compute (heuristic + LLM-as-judge hybrid):**
49
+
50
+ Heuristic part — n-gram overlap or semantic similarity:
51
+ ```python
52
+ def ngrams(text, n):
+     # simple whitespace tokenization; swap in a real tokenizer if needed
+     tokens = text.lower().split()
+     return zip(*(tokens[i:] for i in range(n)))
+
+ def context_overlap(answer, context, n=3):
+     answer_ngrams = set(ngrams(answer, n))
+     context_ngrams = set(ngrams(context, n))
+     if not answer_ngrams:
+         return 0.0
+     return len(answer_ngrams & context_ngrams) / len(answer_ngrams)
58
+ ```
59
+
60
+ Judge part — ask if the answer would change materially without the context:
61
+ > "If the retrieved context were removed, would the answer be substantially different? 1 = same as without context (didn't use it), 5 = fully context-grounded."
62
+
63
+ **Target:** 70%+ overlap on substantive answers; judge score ≥4 on 80% of cases.
64
+
65
+ **What it catches:** the answer is faithful to the context (metric #2 passes) but ignores it — the model used its parametric memory instead. This means retrieval is doing nothing.
66
+
67
+ **What it misses:** the answer might use the context but cite it incorrectly. Pair with metric #2.
68
+
69
+ ## Why all three are needed
70
+
71
+ | Metric | Detects | Misses |
72
+ |--------|---------|--------|
73
+ | Retrieval precision@k | Junk in retrieval | Faithfulness; utilization |
74
+ | Answer faithfulness | Hallucination | Retrieval quality; whether context was used |
75
+ | Context utilization | Ignoring retrieval | Hallucination beyond context; retrieval quality |
76
+
77
+ A RAG system can fail in all three independent ways. Measuring only one creates blind spots in the other two.
78
+
79
+ ## Combined metric example
80
+
81
+ ```python
82
+ def evaluate_rag(case):
83
+ retrieved = retrieve(case.query)
84
+ answer = generate(case.query, retrieved)
85
+
86
+ return {
87
+ 'precision_at_5': precision_at_k(
88
+ [c.id for c in retrieved],
89
+ case.relevant_chunk_ids,
90
+ k=5
91
+ ),
92
+ 'faithfulness': llm_judge_faithfulness(
93
+ query=case.query,
94
+ context=retrieved,
95
+ answer=answer
96
+ ),
97
+ 'context_utilization_overlap': context_overlap(answer, retrieved),
98
+ 'context_utilization_judge': llm_judge_utilization(
99
+ query=case.query,
100
+ context=retrieved,
101
+ answer=answer
102
+ ),
103
+ }
104
+ ```
105
+
106
+ Aggregate per-case scores into the per-run summary:
107
+
108
+ ```
109
+ Run 2026-05-12:
110
+ precision@5: 0.68 (target >0.6) ✓
111
+ faithfulness ≥4: 83% (target >80%) ✓
112
+ context utilization: 72% (target >70%) ✓
113
+ Overall: PASS
114
+ ```
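+
+ A sketch of how the per-case dicts from `evaluate_rag` can be rolled up into that summary (the targets from this page are hard-coded here for illustration):
+
+ ```python
+ def summarize_run(per_case_results):
+     # per_case_results: one dict per case, as returned by evaluate_rag()
+     n = len(per_case_results)
+     precision = sum(r['precision_at_5'] for r in per_case_results) / n
+     faithful_share = sum(1 for r in per_case_results if r['faithfulness'] >= 4) / n
+     utilization = sum(r['context_utilization_overlap'] for r in per_case_results) / n
+     checks = {
+         'precision@5 > 0.6': precision > 0.6,
+         'faithfulness >=4 on >80% of cases': faithful_share > 0.8,
+         'context utilization > 0.7': utilization > 0.7,
+     }
+     return {'precision@5': precision, 'faithfulness_pass_rate': faithful_share,
+             'context_utilization': utilization, 'overall_pass': all(checks.values())}
+ ```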
115
+
116
+ ## Common RAG failure modes
117
+
118
+ | Symptom | Likely metric that catches it |
119
+ |---------|------------------------------|
120
+ | User says "the bot is making stuff up" | Faithfulness |
121
+ | User says "the bot didn't see my documents" | Context utilization (or retrieval precision) |
122
+ | User says "the bot is bad at finding things" | Retrieval precision@k |
123
+ | User says "the answer is correct but ignores recent updates" | Retrieval recall (precision's partner — different metric) |
124
+ | User says "the bot gives the same generic answer no matter what I ask" | Context utilization |
125
+ | User says "the bot says the doc says X but it doesn't" | Faithfulness |
126
+
127
+ The metric points at the layer to fix. Without it, debugging is guesswork.
128
+
129
+ ## Retrieval recall (the fourth metric, conditional)
130
+
131
+ Precision asks "of what we retrieved, how much was good?" Recall asks "of what was good, how much did we retrieve?"
132
+
133
+ In production RAG with many candidate chunks, recall is often the limiting factor — the right chunk exists in the index but doesn't surface.
134
+
135
+ Compute:
136
+ ```python
137
+ def recall_at_k(retrieved_chunk_ids, relevant_chunk_ids, k=5):
138
+ top_k = set(retrieved_chunk_ids[:k])
139
+ return len(top_k & set(relevant_chunk_ids)) / len(relevant_chunk_ids)
140
+ ```
141
+
142
+ Track recall when:
143
+ - The corpus is large (>1000 chunks per query domain).
144
+ - Users report "the bot can't find things that exist in our docs."
145
+ - You're tuning the retrieval pipeline (chunking strategy, embedding model, search algorithm).
146
+
147
+ Skip recall when:
148
+ - The corpus is small (top-K = ~10% of the corpus; recall is high by default).
149
+ - Precision is the dominant problem.
150
+
151
+ ## Dataset structure for RAG
152
+
153
+ ```json
154
+ {
155
+ "id": "rag-case-001",
156
+ "query": "What's our PTO policy for sabbatical years?",
157
+ "expected": {
158
+ "relevant_chunk_ids": ["chunk-policy-pto-2024", "chunk-policy-sabbatical"],
159
+ "expected_answer_themes": ["accrual rate", "carryover limits", "sabbatical exception"],
160
+ "should_cite": true
161
+ },
162
+ "metadata": {
163
+ "source": "production-2026-04-12-support-thread-S-892",
164
+ "difficulty": "medium",
165
+ "tags": ["pto-policy", "sabbatical", "rare-query"]
166
+ }
167
+ }
168
+ ```
169
+
170
+ The `relevant_chunk_ids` field requires human labeling — domain expert reviews the corpus, identifies which chunks SHOULD surface for that query.
171
+
172
+ ## Anti-patterns
173
+
174
+ - **Measuring only one metric** (usually faithfulness via LLM-as-judge) → blind to retrieval and utilization failures.
175
+ - **No human-labeled relevance** → can't compute precision/recall.
176
+ - **Treating retrieval and generation as one black box** → can't tell which layer regressed.
177
+ - **Eval set drawn only from "easy" queries** → metrics are good in test, terrible in production.
178
+ - **Ignoring recent-information bias** (RAG must use retrieval; parametric memory is stale) → context utilization metric catches this.
179
+
180
+ ## Tooling
181
+
182
+ - **ragas** (open source) implements precision, recall, faithfulness, and other RAG metrics with LLM judges. Use as reference implementation.
183
+ - **Custom implementation** is straightforward — the metrics above are <100 lines of Python each.
184
+ - **LangSmith / Weights & Biases** wrap eval runs with tracking but don't replace the core metrics.
185
+
186
+ The discipline isn't tool choice; it's measuring all three orthogonal dimensions every run.
@@ -0,0 +1,190 @@
1
+ # Reference dataset — 20 from failures beats 200 perfect
2
+
3
+ The dataset is the bedrock. Without one, every "improvement" is anecdote and every regression goes unnoticed until users complain. With one, you can measure change.
4
+
5
+ ## The 20-from-failures principle
6
+
7
+ > 20 unambiguous cases drawn from real production failures beat 200 synthetic perfect cases.
8
+
9
+ Why:
10
+ - Synthetic cases reflect what the team IMAGINED would happen — they cover the cases the team already knows about.
11
+ - Production failures reflect what ACTUALLY happens — they cover blind spots and edge cases.
12
+ - 20 well-curated cases at the right level of difficulty discriminate models better than 200 average-difficulty cases.
13
+
14
+ A 20-case dataset is enough to:
15
+ - Detect regressions of >10% accuracy.
16
+ - Calibrate LLM-as-judge.
17
+ - Run cheaply enough to evaluate on every PR.
18
+
19
+ Scale up only when you've validated that 20 is producing useful signal.
20
+
21
+ ## Where cases come from
22
+
23
+ In rough order of priority:
24
+
25
+ 1. **Real production failures.** User reported a bug, support escalated a case, error logs show an unexpected output. Each becomes a case. Sanitize PII before saving.
26
+ 2. **Edge cases discovered during development.** "Oh, what if the user asks X?" — add it.
27
+ 3. **Adversarial examples.** Inputs designed to trip the system (prompt injection, ambiguity, contradiction). Especially important for chat/RAG.
28
+ 4. **Boundary inputs.** Empty, very long, special characters, mixed languages, unusual encodings.
29
+ 5. **Synthetic — last resort.** Only when no real input exists yet (e.g., pre-launch). Mark them as synthetic; replace as production data arrives.
30
+
31
+ Target distribution: **80% real production-sourced**, 20% adversarial/boundary. Pure-synthetic datasets give pure-synthetic confidence.
32
+
33
+ ## Case structure
34
+
35
+ Each case is one line in `cases.jsonl`:
36
+
37
+ ```json
38
+ {
39
+ "id": "case-001",
40
+ "input": {
41
+ "user_message": "I want to cancel my subscription",
42
+ "user_context": { "tier": "premium", "tenure_months": 18 }
43
+ },
44
+ "expected": {
45
+ "intent": "cancellation_request",
46
+ "should_offer_retention": true,
47
+ "tone_targets": ["empathetic", "non-manipulative"]
48
+ },
49
+ "rubric_criteria": ["faithfulness", "tone", "completeness"],
50
+ "metadata": {
51
+ "source": "production-2026-04-12-ticket-T-1234",
52
+ "added_at": "2026-04-15",
53
+ "added_by": "@bruno",
54
+ "difficulty": "medium",
55
+ "tags": ["cancellation", "retention", "premium-user"]
56
+ }
57
+ }
58
+ ```
59
+
60
+ Fields:
61
+
62
+ - **`id`** — stable identifier (you'll reference it in regression reports).
63
+ - **`input`** — what the system receives. Match the production input shape exactly.
64
+ - **`expected`** — for rungs 1-3, the deterministic expected output or state change. For rung 4, the rubric-target (what a 5/5 answer would do).
65
+ - **`rubric_criteria`** — which rubric dimensions this case exercises.
66
+ - **`metadata.source`** — provenance. Production ticket? Synthetic? Adversarial?
67
+ - **`metadata.difficulty`** — easy/medium/hard. Track score by difficulty bucket.
68
+ - **`metadata.tags`** — for filtering ("show me cases that exercise retention logic").
69
+
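+ Because each case is one JSON object per line, loading and filtering stays trivial. A sketch (`load_cases` is illustrative, not an existing helper in this package):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ def load_cases(path):
+     # one JSON object per line; skip blank lines so hand edits don't break runs
+     return [json.loads(line)
+             for line in Path(path).read_text().splitlines()
+             if line.strip()]
+
+ cases = load_cases('.dw/eval/datasets/rag-chat/cases.jsonl')
+ retention_cases = [c for c in cases if 'retention' in c['metadata']['tags']]
+ ```
+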
70
+ ## Dataset layout
71
+
72
+ ```
73
+ .dw/eval/datasets/<feature-name>/
74
+ ├── README.md # provenance, sample size, last review, change log
75
+ ├── cases.jsonl # the cases themselves
76
+ ├── rubric.md # the rubric used for rung-4 scoring
77
+ ├── runs/
78
+ │ ├── 2026-05-01.jsonl # one line per case with scores from that run
79
+ │ ├── 2026-05-08.jsonl
80
+ │ └── ...
81
+ ├── calibration/
82
+ │ ├── 2026-05-12-human-scores.jsonl
83
+ │ └── spearman-2026-05-12.txt
84
+ └── changelog.md # when cases were added/removed and why
85
+ ```
86
+
87
+ Everything is committed. Datasets evolve with the feature; the git history shows when and why.
88
+
89
+ ## README.md template
90
+
91
+ ```markdown
92
+ # Reference dataset — <feature name>
93
+
94
+ **Purpose:** evaluate <feature> for <quality dimensions>.
95
+
96
+ **Current size:** N cases (X production-sourced, Y adversarial, Z synthetic).
97
+
98
+ **Difficulty distribution:** easy: A, medium: B, hard: C.
99
+
100
+ **Last reviewed:** YYYY-MM-DD.
101
+
102
+ **Maintainers:** @name1, @name2.
103
+
104
+ ## When to expand
105
+
106
+ Add a case when:
107
+ - A new production failure is observed (always — that's the primary signal).
108
+ - A new edge case is identified during development.
109
+ - A new adversarial pattern is discovered (security review, red-team session).
110
+
111
+ Do NOT add cases just to inflate the count. The 20-from-failures principle: quality over quantity.
112
+
113
+ ## When to retire a case
114
+
115
+ Retire (don't delete) when:
116
+ - The behavior the case checked is no longer relevant (feature removed).
117
+ - The case became trivially passing across all model versions (it's no longer discriminating).
118
+
119
+ Move retired cases to `cases-retired.jsonl` with a `retired_reason`.
120
+ ```
121
+
122
+ ## Adding cases from production
123
+
124
+ Process:
125
+
126
+ 1. **Capture the failure** — paste the actual input that failed in `cases-pending.jsonl`.
127
+ 2. **Sanitize PII** — replace names, emails, IDs, account numbers with realistic-but-fake equivalents (a partial-automation sketch follows this list). NEVER commit real user data.
128
+ 3. **Define expected behavior** — what SHOULD have happened. Get sign-off from a domain expert if subjective.
129
+ 4. **Categorize** — difficulty, tags, rubric criteria.
130
+ 5. **Promote** to `cases.jsonl` after review.
131
+
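+ Step 2 can be partially automated. A minimal sketch (regex-only; it complements, not replaces, the manual read):
+
+ ```python
+ import re
+
+ def sanitize_pii(text):
+     # emails and long digit runs (account/phone numbers); names still need a human pass
+     text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.]+', 'user@example.com', text)
+     text = re.sub(r'\b\d{6,}\b', '000000', text)
+     return text
+ ```
+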
132
+ ## Sampling for regression runs
133
+
134
+ You don't need to re-run the entire dataset every time. Smart sampling:
135
+
136
+ - **PR-time:** random sample of 30% + all "high difficulty" cases + any case added in the last 30 days (selection sketched after this list). Fast feedback.
137
+ - **Pre-merge to main:** full dataset.
138
+ - **Nightly:** full dataset + judge re-calibration check.
139
+ - **Pre-deploy:** full dataset + manual eyeball on 10 random outputs.
140
+
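+ A sketch of the PR-time selection, using the case metadata fields defined above (the function name is illustrative):
+
+ ```python
+ import random
+ from datetime import date, timedelta
+
+ def pr_time_sample(cases, rate=0.30, recent_days=30):
+     cutoff = (date.today() - timedelta(days=recent_days)).isoformat()
+     hard = [c for c in cases if c['metadata']['difficulty'] == 'hard']
+     recent = [c for c in cases if c['metadata']['added_at'] >= cutoff]
+     pool = [c for c in cases if c not in hard and c not in recent]
+     sampled = random.sample(pool, k=round(len(pool) * rate))
+     seen, selected = set(), []
+     for c in hard + recent + sampled:   # dedupe by id, keep order
+         if c['id'] not in seen:
+             seen.add(c['id'])
+             selected.append(c)
+     return selected
+ ```
+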
141
+ ## Detecting drift
142
+
143
+ After each run, compare against the prior run on the SAME cases:
144
+
145
+ ```
146
+ Run 2026-05-08 vs 2026-05-01:
147
+ faithfulness: 4.2 → 3.9 (-0.3) ⚠ regression
148
+ completeness: 4.0 → 4.1 (+0.1)
149
+ tone: 4.5 → 4.4 (-0.1)
150
+ outcome accuracy: 95% → 92% ⚠ regression
151
+ ```
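+
+ A sketch of that comparison, assuming each run has already been reduced to a dict of per-metric means over the shared cases (the 0.2 threshold is illustrative; tune it per metric):
+
+ ```python
+ def detect_drift(current, previous, threshold=0.2):
+     report = {}
+     for metric, now in current.items():
+         delta = now - previous[metric]
+         report[metric] = {'previous': previous[metric], 'current': now,
+                           'delta': round(delta, 2), 'regression': delta <= -threshold}
+     return report
+ ```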
152
+
153
+ Two ways drift happens:
154
+ 1. **Code change degraded quality** — your model swap or prompt tweak hurt something. Bisect.
155
+ 2. **Judge drift** — the LLM-as-judge itself changed (vendor updated the model). Re-calibrate; the "regression" may be the judge, not the system.
156
+
157
+ ## Dataset versioning
158
+
159
+ When the dataset materially changes (cases added/removed in batch, rubric updated), bump a version in the README:
160
+
161
+ ```
162
+ Dataset version: 2.3
163
+ - v2.3 (2026-05-12): added 8 cases from production tickets in last 30 days
164
+ - v2.2 (2026-04-15): retired 3 cases that became trivially passing
165
+ - v2.1 (2026-03-20): rubric updated to add "completeness" criterion
166
+ ```
167
+
168
+ Each run logs which dataset version it ran against. You can't compare a v1 score to a v3 score directly.
169
+
170
+ ## Cost discipline
171
+
172
+ - Keep the dataset SMALL on purpose. 20-50 cases for most features. 100+ only if the feature has many categorically different inputs.
173
+ - Cheap evaluations (rungs 1-3) run on every case every time.
174
+ - Expensive evaluations (rung 4) run on samples — 20-50 random cases — except for full pre-deploy runs.
175
+
176
+ ## Anti-patterns
177
+
178
+ - **Synthetic-only dataset.** No connection to real production. Confidence isn't real.
179
+ - **Dataset grew to 500 cases nobody re-reads.** Half are duplicates; half are no longer discriminating. Audit and prune.
180
+ - **Cases without expected behavior.** "Just look at the output." No measurement possible.
181
+ - **Dataset not committed.** Lives in a notebook; ephemeral; lost when person leaves.
182
+ - **No metadata tracking source.** Can't tell synthetic from real; can't audit dataset quality.
183
+ - **Dataset reused across features.** Each feature has its own dataset; one-size-fits-all is one-size-fits-none.
184
+
185
+ ## Cross-reference
186
+
187
+ - `oracle-ladder.md` — what to assert per case.
188
+ - `judge-calibration.md` — how to make rung-4 judgments meaningful.
189
+ - `rag-metrics.md` — RAG-specific extras for the dataset structure.
190
+ - `agent-eval.md` — agent-specific extras (trajectory matching).