@brunosps00/dev-workflow 0.13.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/README.md +9 -3
  2. package/package.json +1 -1
  3. package/scaffold/en/commands/dw-bugfix.md +2 -1
  4. package/scaffold/en/commands/dw-code-review.md +1 -0
  5. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  6. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  7. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  8. package/scaffold/en/commands/dw-functional-doc.md +1 -1
  9. package/scaffold/en/commands/dw-help.md +1 -1
  10. package/scaffold/en/commands/dw-redesign-ui.md +1 -1
  11. package/scaffold/en/commands/dw-run-qa.md +2 -1
  12. package/scaffold/en/commands/dw-run-task.md +1 -1
  13. package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
  14. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  15. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  16. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
  19. package/scaffold/pt-br/commands/dw-help.md +1 -1
  20. package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
  21. package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
  22. package/scaffold/pt-br/commands/dw-run-task.md +1 -1
  23. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  24. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  25. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  26. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  27. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  28. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  29. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  30. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  31. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  32. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  33. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  34. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  35. package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
  36. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  37. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
  38. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  39. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
  40. package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
  41. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
  42. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  43. package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
  44. package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
  45. package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
  46. /package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0
@@ -0,0 +1,148 @@
1
+ ---
2
+ name: dw-llm-eval
3
+ description: Use when authoring or reviewing AI/LLM features (chat, RAG, summarization, classifiers, agents) — enforces an oracle ladder (climb from exact match up to LLM-as-judge), reference-dataset discipline, judge calibration (Spearman ≥0.80), and trajectory-vs-outcome agent eval so AI features ship with measurable behavior instead of "looks good to me" QA.
4
+ ---
5
+
6
+ # LLM Evaluation
7
+
8
+ > Adapted patterns from [`langchain-ai/agentevals`](https://github.com/langchain-ai/agentevals) (MIT) for trajectory-match modes, plus general LLM-eval discipline from OpenAI evals cookbook, Anthropic's evals guidance, and the broader open evaluations literature. Material rewritten in our voice.
9
+
10
+ ## When this skill applies
11
+
12
+ - Any feature that uses an LLM in production: chat, summarization, classification, RAG (retrieval-augmented generation), agents, tool-use, structured extraction, code generation.
13
+ - `/dw-create-tasks` when the PRD mentions an AI feature — eval planning becomes a mandatory subtask.
14
+ - `/dw-code-review` when the diff touches AI feature code paths.
15
+ - `/dw-run-qa --ai` when validating an AI feature against its reference dataset.
16
+
17
+ If the feature is fully deterministic (no LLM in the loop), use `dw-testing-discipline` instead, whose core rules and 25 anti-patterns cover that ground. This skill is specifically for entropy-tolerant systems.
18
+
19
+ ## First principle
20
+
21
+ > Tests for deterministic code assert exact outputs.
22
+ > Tests for LLM features assert behaviors within tolerance.
23
+ > The discipline is choosing the right tolerance — and proving it's not "anything passes."
24
+
25
+ ## The oracle ladder
26
+
27
+ Five rungs; climb from CHEAPEST/STRICTEST to MOST EXPENSIVE/SUBJECTIVE. Always start at the bottom; climb only when the lower rung can't cover the case.
28
+
29
+ | Rung | What it checks | Cost | When to use |
30
+ |------|----------------|------|-------------|
31
+ | 1. **Exact match** | `output === expected` | ~free | Structured outputs (function calls, JSON with stable shape, classifications) |
32
+ | 2. **Schema validation** | Output matches JSON schema / type contract | ~free | Output shape matters; specific values vary |
33
+ | 3. **Outcome state** | Side effect produced the expected change (DB row, file written, tool called) | cheap | Agents, tool-use, RAG with concrete answers |
34
+ | 4. **LLM-as-judge** | A different model grades the output against a rubric | medium ($$$) | Subjective quality (helpfulness, tone, faithfulness) where no rule can decide |
35
+ | 5. **Human review** | Domain expert scores | expensive | Calibration of rung 4; high-stakes outputs; edge cases |
36
+
37
+ **Rule:** never reach for rung 4 before checking if rungs 1-3 can cover the case. Every rung up costs an order of magnitude more (latency, money, calibration effort) — and adds entropy.
38
+
39
+ See `references/oracle-ladder.md` for examples per rung and the climbing decision tree.
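+
+ For a concrete feel before opening the reference, here is a minimal sketch of rungs 1-3. `classifyTicket`, `agent`, and the Prisma-style `db` client are hypothetical stand-ins; the schema check assumes a validator such as `ajv`:
+
+ ```javascript
+ const Ajv = require('ajv');
+
+ const ajv = new Ajv();
+ const validateClassification = ajv.compile({
+   type: 'object',
+   properties: {
+     category: { enum: ['billing', 'bug', 'feature-request'] },
+     confidence: { type: 'number', minimum: 0, maximum: 1 },
+   },
+   required: ['category', 'confidence'],
+ });
+
+ test('rung 1: exact match on structured classification', async () => {
+   const result = await classifyTicket('I was charged twice this month');
+   expect(result.category).toBe('billing');
+ });
+
+ test('rung 2: output shape holds even when values vary', async () => {
+   const result = await classifyTicket('The export button does nothing');
+   expect(validateClassification(result)).toBe(true);
+ });
+
+ test('rung 3: the side effect landed, regardless of wording', async () => {
+   await agent.run('Escalate ticket #42 to the on-call engineer');
+   const ticket = await db.tickets.findUnique({ where: { id: '42' } });
+   expect(ticket.status).toBe('escalated');
+ });
+ ```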
40
+
41
+ ## LLM-as-judge discipline (when rung 4 is needed)
42
+
43
+ Without calibration, LLM-as-judge produces noise dressed as signal. Three non-negotiables:
44
+
45
+ 1. **Calibrate against humans** — grade ≥20 cases by hand, then compute the Spearman correlation between the human scores and the judge's scores. Target ≥0.80. Below that, reject the judge configuration.
46
+ 2. **Use a different model than the system under test** — same model judging itself produces false positives. Pair: GPT-4 generates → Claude judges. Or vice versa.
47
+ 3. **Rubric, not free-form** — provide the judge a structured rubric (criteria + scale + examples) instead of "rate quality 1-10."
48
+
49
+ See `references/judge-calibration.md` for the full calibration recipe, rubric templates, and the "judge drift" monitoring pattern.
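+
+ As a rough sketch of what "rubric, not free-form" looks like in practice: `callJudgeModel` stands in for whatever client the project uses (not a specific SDK), `rag-answers` is a placeholder feature name, and the directory layout follows `references/judge-calibration.md`:
+
+ ```javascript
+ const fs = require('fs');
+
+ // The rubric lives in version control alongside the calibration evidence.
+ const rubric = fs.readFileSync('.dw/eval/judges/rag-answers/rubric.md', 'utf8');
+
+ async function judgeFaithfulness(question, retrievedContext, answer) {
+   const prompt = [
+     'You are grading an answer for FAITHFULNESS against the retrieved context.',
+     'Apply the rubric below. Respond with JSON: {"faithfulness": <1-5>, "reason": "<one sentence>"}.',
+     '',
+     '## Rubric', rubric,
+     '## Question', question,
+     '## Retrieved context', retrievedContext,
+     '## Answer under evaluation', answer,
+   ].join('\n');
+
+   // Different model than the system under test (non-negotiable #2).
+   const raw = await callJudgeModel(prompt);
+   return JSON.parse(raw);
+ }
+ ```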
50
+
51
+ ## Reference dataset principle
52
+
53
+ > 20 unambiguous cases drawn from real production failures beat 200 synthetic perfect cases.
54
+
55
+ The dataset is the bedrock. Without a reference set, every "improvement" is anecdote.
56
+
57
+ Structure:
58
+ ```
59
+ .dw/eval/datasets/<feature-name>/
60
+ ├── cases.jsonl # input + expected (or rubric reference) per line
61
+ ├── README.md # provenance, sample size, when last reviewed
62
+ └── runs/<YYYY-MM-DD>.jsonl # results of each eval run
63
+ ```
64
+
65
+ See `references/reference-dataset.md` for case-design principles, sampling from production, and when to expand the set.
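+
+ For illustration only, a `cases.jsonl` entry might look like the lines below. The exact field names here are an assumption; `references/reference-dataset.md` defines the real schema:
+
+ ```jsonl
+ {"id": "billing-001", "source": "prod-ticket-8841", "input": "I was charged twice, fix it NOW", "expected": {"category": "billing"}, "oracle": "exact"}
+ {"id": "rag-014", "source": "prod-chat-2026-03-02", "input": "What was Q3 revenue?", "rubric": "faithfulness", "oracle": "llm-judge"}
+ ```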
66
+
67
+ ## RAG evaluation
68
+
69
+ Three orthogonal metrics — measure all three, not just one:
70
+
71
+ | Metric | What it measures | Tool |
72
+ |--------|-----------------|------|
73
+ | **Retrieval precision@k** | Of the top-K retrieved chunks, how many were relevant | Exact match against labeled ground-truth |
74
+ | **Answer faithfulness** | Does the answer cite only what the retrieved context supports? | LLM-as-judge with rubric |
75
+ | **Context utilization** | Did the answer USE the retrieved context, or hallucinate around it? | Heuristic + LLM-as-judge |
76
+
77
+ Precision alone misses hallucination. Faithfulness alone misses retrieval failure. Context utilization alone misses both. See `references/rag-metrics.md` for the full implementation.
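+
+ Retrieval precision@k is the mechanical one of the three; a minimal sketch, assuming the relevant chunk IDs come from the hand-labeled reference set for each query:
+
+ ```javascript
+ // Of the top-K retrieved chunks, what fraction appear in the labeled relevant set?
+ function precisionAtK(retrievedChunkIds, relevantChunkIds, k) {
+   const topK = retrievedChunkIds.slice(0, k);
+   if (topK.length === 0) return 0;
+   const hits = topK.filter(id => relevantChunkIds.has(id)).length;
+   return hits / topK.length;
+ }
+
+ // Usage: average over the reference dataset and report alongside faithfulness.
+ precisionAtK(['c-12', 'c-07', 'c-99'], new Set(['c-12', 'c-31']), 3); // ≈ 0.33
+ ```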
78
+
79
+ ## Agent / tool-use evaluation
80
+
81
+ Two questions distinguish good agent eval from bad:
82
+
83
+ ### Question 1: outcome or trajectory?
84
+
85
+ | Approach | What it checks | Failure mode |
86
+ |----------|---------------|--------------|
87
+ | **Outcome-only** | Did the agent achieve the goal? Was the final state correct? | Misses "ghost actions" — agent did the right thing for the wrong reasons |
88
+ | **Trajectory** | Did the agent take the expected sequence of steps / tool calls? | Punishes legitimate creativity — agent solved it via a different valid path |
89
+
90
+ **Recommendation:** outcome-only with side-effect assertion as default. Trajectory match for cases where the path matters (e.g., "must call `get-user` before `update-user`").
91
+
92
+ ### Question 2: which trajectory match mode?
93
+
94
+ When trajectory matching IS the right call, four modes are available:
95
+
96
+ - **Strict** — same tool calls, same order, same arguments. Use when both sequence and parameters are part of the contract.
97
+ - **Unordered** — same tool calls, any order. Use when concurrent calls are valid.
98
+ - **Subset** — actual trajectory contains a subset of reference calls. Use to enforce "don't exceed expected tool use" (frugality / cost).
99
+ - **Superset** — actual contains all reference calls plus possibly more. Use when specific tools are mandatory but extras are acceptable.
100
+
101
+ See `references/agent-eval.md` for examples and the decision tree.
102
+
103
+ ## Required reading by context
104
+
105
+ | Doing what | Read |
106
+ |------------|------|
107
+ | Designing an eval suite for an AI feature | `references/oracle-ladder.md` (climb the ladder) |
108
+ | Using LLM-as-judge | `references/judge-calibration.md` (mandatory before relying on it) |
109
+ | Building / curating a reference dataset | `references/reference-dataset.md` |
110
+ | RAG-specific feature | `references/rag-metrics.md` |
111
+ | Agent / tool-use feature | `references/agent-eval.md` |
112
+
113
+ ## Anti-patterns (will block in `/dw-code-review`)
114
+
115
+ - **LLM-as-judge without calibration evidence.** PR adds LLM-as-judge but the calibration Spearman score is missing or < 0.80. REJECTED.
116
+ - **Same-model judge.** Judge model is the same as the system under test. REJECTED unless explicitly documented (and even then, results are suspect).
117
+ - **Single-rung eval.** Feature ships with only LLM-as-judge; no rung 1-3 grounding. REJECTED — the cheap rungs catch the loud failures.
118
+ - **Synthetic-only dataset.** No traceable production-failure source for any case. REJECTED — confirm at least 20% of cases come from real user inputs.
119
+ - **"Looks good to me" QA.** No reference dataset, no metric, no rubric — just sampling output and calling it good. REJECTED.
120
+ - **Coverage as metric.** Quoting "we tested 50 prompts" without saying what was measured. The number is meaningless without the metric.
121
+
122
+ ## Integration with dev-workflow commands
123
+
124
+ - `/dw-create-tasks`: when the PRD has an AI feature requirement, an eval-plan subtask is mandatory. The task references this skill's oracle ladder.
125
+ - `/dw-code-review`: AI feature PRs require a reference dataset + ≥2 oracle rungs (lower rungs FIRST). The constitution gate also applies — if the project has principles about AI feature reliability, they're enforced here.
126
+ - `/dw-run-qa --ai`: new mode (when this skill is bundled) — runs the reference dataset against the current implementation, logs to `QA/logs/ai/<feature>-<date>.jsonl`, computes precision@k / faithfulness / outcome accuracy per the feature type.
127
+ - `/dw-bugfix` when the bug is an AI failure mode (hallucination, tool misuse, classification error): adds the failing case to the reference dataset BEFORE fixing — the case is now a regression test forever.
128
+
129
+ ## When the discipline bends
130
+
131
+ - **Prototype / spike phase**: skip calibration; document as "spike — eval added before merge to main."
132
+ - **Internal-only AI feature with low blast radius** (e.g., classifier for internal CRM tags): rung 1-3 only is fine; LLM-as-judge may be overkill.
133
+ - **Real-time features where eval can't run synchronously**: shadow-eval pattern — run the eval async on a sample of production traffic; alert on regression.
134
+
135
+ In all bend cases, document the deviation in the techspec / PR. "Skipped judge calibration because internal-only feature affecting <100 users" is fine; just say it.
136
+
137
+ ## Why this approach
138
+
139
+ Two failure modes drive most AI feature regressions:
140
+
141
+ 1. **No measurement** — team ships, suspects it's worse, can't prove it, debate.
142
+ 2. **Wrong measurement** — team measures LLM-as-judge only, judge drifts with the model, scores rise while real quality falls.
143
+
144
+ The oracle ladder fixes both: forces measurement, forces ANCHORED measurement (lower rungs are deterministic; upper rungs are calibrated against them).
145
+
146
+ ## Bottom line
147
+
148
+ > An AI feature without an eval suite is a feature you can't ship safely. An eval suite without calibration is a number you can't trust. Build the dataset from real failures, climb the ladder from cheap to expensive, calibrate the judge against humans, and re-run before every model swap. The discipline is small; the absence of it is one of the largest sources of "we shipped and don't know if it's worse" experiences in the industry.
@@ -0,0 +1,252 @@
1
+ # Agent evaluation — outcome vs trajectory
2
+
3
+ Agent eval has a foundational question: do you grade **what the agent did along the way** (trajectory) or **what state the world is in at the end** (outcome)?
4
+
5
+ The answer determines what you measure and what failure modes you catch.
6
+
7
+ ## Outcome-only evaluation (recommended default)
8
+
9
+ **What it checks:** at the end of the agent's run, does the world look the way it should? Was the right tool called? Was the right ticket filed? Was the user's question answered correctly?
10
+
11
+ **Pattern:**
12
+
13
+ ```javascript
14
+ test('agent files refund ticket when user requests refund', async () => {
15
+ await agent.run('I want a refund for order #123');
16
+
17
+ // Outcome assertions
18
+ const tickets = await db.tickets.findMany({ where: { order_id: '123' } });
19
+ expect(tickets).toHaveLength(1);
20
+ expect(tickets[0].type).toBe('refund');
21
+
22
+ const userMessage = agent.lastMessage();
23
+ expect(userMessage).toMatch(/refund.*processed|filed|submitted/i);
24
+ });
25
+ ```
26
+
27
+ **Strengths:**
28
+ - Permits creative paths — agent solved it via tool A → B → C OR via tool A → C → B; both pass if the outcome is right.
29
+ - Robust to internal refactor — restructuring the agent's prompt or tool descriptions doesn't break the test as long as the outcome holds.
30
+ - Aligned with what users care about: did the system do the right thing?
31
+
32
+ **Weaknesses:**
33
+ - Misses "ghost actions" — agent claims to have done X but the outcome state shows it didn't.
34
+   - Defense: combine with rung-3 outcome-state assertions (DB writes, API calls). Don't trust the agent's word.
35
+ - Misses inefficiency — agent took 17 tool calls to do what should be 3. Outcome OK but cost is bad.
36
+   - Defense: track tool-call count as a separate metric; alert if it exceeds budget.
37
+
38
+ ## Trajectory evaluation (when path matters)
39
+
40
+ **What it checks:** did the agent take the expected sequence (or set) of tool calls? Match against a reference trajectory.
41
+
42
+ **Use when:**
43
+ - Compliance / audit requires specific actions in specific order (e.g., "ALWAYS verify identity before disclosing balance").
44
+ - Safety-critical: a specific tool MUST be called (e.g., "if user mentions self-harm, must invoke `escalate-to-human` BEFORE any other action").
45
+ - The path itself is the contract (e.g., a workflow agent that must traverse a specific decision tree).
46
+
47
+ ## Trajectory match modes
48
+
49
+ Four modes, from strictest to most permissive:
50
+
51
+ ### Strict
52
+
53
+ **Rule:** actual trajectory contains identical tool calls in identical order with identical arguments.
54
+
55
+ **Use when:** path AND parameters are both part of the contract. Compliance, deterministic workflow agents.
56
+
57
+ ```javascript
58
+ expect(actualToolCalls).toEqual([
59
+ { name: 'verify_identity', args: { user_id: 'u-42' } },
60
+ { name: 'get_balance', args: { account_id: 'a-99' } },
61
+ { name: 'respond_to_user', args: { template: 'balance-inquiry' } },
62
+ ]);
63
+ ```
64
+
65
+ ### Unordered
66
+
67
+ **Rule:** actual contains the same set of tool calls, any order; arguments match.
68
+
69
+ **Use when:** the agent legitimately may parallelize or reorder calls without affecting correctness.
70
+
71
+ ```javascript
72
+ expect(new Set(actualToolNames)).toEqual(new Set(['fetch_user', 'fetch_orders', 'fetch_addresses']));
73
+ ```
74
+
75
+ ### Subset
76
+
77
+ **Rule:** actual trajectory is a SUBSET of reference — agent didn't exceed expected tool calls.
78
+
79
+ **Use when:** frugality / cost discipline — "agent should NOT call expensive tools unnecessarily."
80
+
81
+ ```javascript
82
+ // Reference is the maximum allowed set
83
+ const referenceToolCalls = ['fetch_user', 'classify_intent', 'respond'];
84
+ const allActualInReference = actualToolNames.every(t => referenceToolCalls.includes(t));
85
+ expect(allActualInReference).toBe(true);
86
+ ```
87
+
88
+ ### Superset
89
+
90
+ **Rule:** actual contains ALL reference tool calls, possibly plus extras.
91
+
92
+ **Use when:** specific tools are mandatory but extras are acceptable. Often safety-critical ("MUST log audit event") while permitting agent autonomy in other tool choices.
93
+
94
+ ```javascript
95
+ // Reference is the minimum required set
96
+ const requiredToolCalls = ['log_audit_event', 'verify_user'];
97
+ const allRequiredCalled = requiredToolCalls.every(t => actualToolNames.includes(t));
98
+ expect(allRequiredCalled).toBe(true);
99
+ ```
100
+
101
+ ## Argument matching strategies
102
+
103
+ When trajectory matching, how should tool ARGUMENTS be compared?
104
+
105
+ | Strategy | Behavior | Use |
106
+ |----------|---------|-----|
107
+ | **Exact** | Arguments must match byte-for-byte | Deterministic args (IDs, fixed strings) |
108
+ | **Ignore** | Any call to the right tool counts | When the call ITSELF is what matters, not args |
109
+ | **Subset** | Actual args contain at least the reference args | Required fields enforced; extras OK |
110
+ | **Superset** | Actual args are within reference args set | Frugality — agent didn't add unexpected fields |
111
+ | **Custom comparator** | Per-tool comparison function | Domain-specific equivalence (case-insensitive, semantic match) |
112
+
113
+ Example with custom comparator:
114
+
115
+ ```javascript
116
+ const matchers = {
117
+ 'search_cities': (actualArgs, refArgs) => {
118
+ // City name comparison: case-insensitive, trimmed
119
+ return actualArgs.name.toLowerCase().trim() === refArgs.name.toLowerCase().trim();
120
+ },
121
+ 'fetch_user': 'exact',
122
+ };
123
+ ```
124
+
125
+ ## Decision tree: outcome vs trajectory
126
+
127
+ ```
128
+ Is correctness defined by the FINAL STATE or by the PATH?
129
+
130
+ ├── Final state only — agent can solve it however it likes
131
+ │ → Outcome-only eval. Combine with rung-3 state assertions.
132
+
133
+ ├── Specific tool calls MUST happen for compliance/safety
134
+ │ → Superset trajectory mode. Outcome too, as a separate check.
135
+
136
+ ├── Specific tool calls MUST NOT happen (cost / privacy)
137
+ │ → Subset trajectory mode.
138
+
139
+ ├── The full workflow path is the contract (legal, audit)
140
+ │ → Strict trajectory mode.
141
+
142
+ └── Path is partly fixed, partly free
143
+ → Trajectory with custom comparator OR split into multiple smaller tests.
144
+ ```
145
+
146
+ ## Cost vs accuracy tracking
147
+
148
+ Two metrics that trajectory eval naturally enables:
149
+
150
+ ### Tool-call efficiency
151
+
152
+ ```python
153
+ def efficiency_score(actual_trajectory, reference_trajectory):
154
+ actual_calls = len(actual_trajectory)
155
+ reference_calls = len(reference_trajectory)
156
+ if reference_calls == 0:
157
+ return 1.0
158
+ return min(1.0, reference_calls / actual_calls)
159
+ ```
160
+
161
+ Score 1.0 = matched or beat reference. Score 0.5 = took 2× the expected calls.
162
+
163
+ Track this over time; agent regressions sometimes show up as efficiency loss before outcome loss.
164
+
165
+ ### Step-count percentile
166
+
167
+ ```
168
+ Run 2026-05-12:
169
+ p50 tool calls: 4 (reference 3)
170
+ p95 tool calls: 9 (reference 6)
171
+ p99 tool calls: 18 (reference 12)
172
+ ```
173
+
174
+ p99 spikes catch cases where the agent enters a loop or backtracks excessively — outcome may still be correct but cost runaway is real.
175
+
176
+ ## Dataset structure for agents
177
+
178
+ ```json
179
+ {
180
+ "id": "agent-case-001",
181
+ "input": {
182
+ "user_message": "Cancel my order #123 and request a refund"
183
+ },
184
+ "expected_outcome": {
185
+ "tickets_created": [
186
+ { "type": "refund", "order_id": "123" }
187
+ ],
188
+ "order_status": "cancelled"
189
+ },
190
+ "expected_trajectory": {
191
+ "mode": "superset",
192
+ "required_calls": [
193
+ { "name": "verify_user_owns_order", "args_match": "exact" },
194
+ { "name": "update_order_status", "args_match": "subset", "args": { "status": "cancelled" } },
195
+ { "name": "create_refund_ticket", "args_match": "subset" }
196
+ ],
197
+ "forbidden_calls": ["delete_user_data"]
198
+ },
199
+ "tool_budget": { "p95_max_calls": 8 }
200
+ }
201
+ ```
202
+
203
+ The `forbidden_calls` field is powerful — explicitly enumerate tools that MUST NOT fire for this input class. Catches "agent escalated to a dangerous tool that wasn't necessary."
204
+
205
+ ## Combining outcome + trajectory
206
+
207
+ For serious agent eval, combine both:
208
+
209
+ ```javascript
210
+ test('agent handles refund request', async () => {
211
+ const result = await agent.run(case.input);
212
+
213
+ // Outcome
214
+ expectOutcomeMatch(result.outcome, case.expected_outcome);
215
+
216
+ // Trajectory — superset mode (required tools called)
217
+ expectTrajectoryMatch(result.trajectory, case.expected_trajectory, 'superset');
218
+
219
+ // Forbidden — none of these tools fired
220
+ for (const forbidden of case.expected_trajectory.forbidden_calls) {
221
+ expect(result.trajectory.some(t => t.name === forbidden)).toBe(false);
222
+ }
223
+
224
+ // Budget — didn't exceed expected tool calls
225
+ expect(result.trajectory.length).toBeLessThanOrEqual(case.tool_budget.p95_max_calls);
226
+ });
227
+ ```
228
+
229
+ ## LLM-as-judge for agent quality
230
+
231
+ Beyond mechanical trajectory matching, judge for:
232
+ - Was the agent's intermediate reasoning sound? (rubric: logical, evidence-based, non-hallucinated)
233
+ - Was the final user message appropriate? (rubric: tone, completeness, accuracy)
234
+ - Did the agent handle ambiguity well? (rubric: did it ask for clarification when needed?)
235
+
236
+ These are rung-4 evaluations on top of rung-1/2/3 outcome and trajectory checks.
237
+
238
+ ## Anti-patterns
239
+
240
+ - **Trajectory-only eval** → punishes creative paths; brittle to refactor; ignores real outcome.
241
+ - **Outcome-only eval without state assertion** → trusts the agent's word; misses ghost actions.
242
+ - **Strict trajectory mode when subset/superset would do** → false negatives every time the agent legitimately reorders.
243
+ - **No tool-budget tracking** → agent regresses to expensive paths; you don't notice until the bill spikes.
244
+ - **No `forbidden_calls` enumeration** → agent silently learns to call dangerous tools.
245
+
246
+ ## Tools
247
+
248
+ - `langchain-ai/agentevals` (MIT) — Python library implementing all four trajectory match modes + LLM-as-judge for trajectories. Source of the taxonomy above.
249
+ - `langsmith` — observability + eval orchestration; tracks experiments over time.
250
+ - Custom implementation — the modes above are ~50 lines each in any language.
251
+
252
+ The discipline isn't the library choice; it's choosing outcome-vs-trajectory deliberately, picking the right match mode, and tracking efficiency alongside accuracy.
@@ -0,0 +1,169 @@
1
+ # LLM-as-judge calibration — how to make rung 4 mean something
2
+
3
+ LLM-as-judge sounds simple: a model grades the output. In practice, without calibration it produces NUMBERS WITHOUT SIGNAL — judge scores drift with the model, with rubric phrasing, with prompt minutiae. You read "judge says 4.2 average" and have no idea if that means the system is good.
4
+
5
+ Calibration anchors the judge to human assessment. After calibration, a judge score has meaning. Before, it doesn't.
6
+
7
+ ## The three non-negotiables
8
+
9
+ ### 1. Calibrate against ≥20 human-graded cases
10
+
11
+ Process:
12
+
13
+ 1. Sample ≥20 cases from the reference dataset (or representative production traffic).
14
+ 2. Have ≥1 domain expert grade each case using the same rubric the judge will use. Multiple humans per case are better (inter-rater agreement is a useful signal).
15
+ 3. Run the judge against the same cases.
16
+ 4. Compute Spearman rank correlation between human scores and judge scores.
17
+
18
+ **Target:** Spearman ≥0.80.
19
+ **Acceptable:** 0.70-0.80 with documented rationale (e.g., "subjective tone judgments inherently noisy").
20
+ **Reject:** <0.70. The judge is not measuring what you think it's measuring.
21
+
22
+ ### 2. Use a different model than the system under test
23
+
24
+ A model judging its own output produces false positives. The judge agrees with itself even when wrong because it shares the same biases and blind spots.
25
+
26
+ Pairing examples:
27
+ - System: GPT-4 → Judge: Claude Opus.
28
+ - System: Claude Sonnet → Judge: GPT-4o.
29
+ - System: Gemini → Judge: Claude.
30
+
31
+ If both system and judge MUST be from the same provider, at minimum use different model sizes (Sonnet judges Opus output, not vice versa).
32
+
33
+ ### 3. Structured rubric, not free-form scoring
34
+
35
+ "Rate this answer 1-10" → noise. Different runs give different scores; different humans disagree wildly; the score has no anchor.
36
+
37
+ Structured rubric: ≥3 criteria, each with a defined scale and an example per score point.
38
+
39
+ Example rubric for FAITHFULNESS (RAG):
40
+
41
+ ```markdown
42
+ # Faithfulness rubric (1-5 scale)
43
+
44
+ Score each answer against the retrieved context. A faithful answer makes claims supported by the context; an unfaithful one fabricates or extrapolates.
45
+
46
+ ## 1 — Severely unfaithful
47
+ The answer contains claims that contradict the context, or fabricates facts not present in any chunk. Example: context says "Q3 revenue was $1.2M"; answer says "Q3 revenue exceeded $5M."
48
+
49
+ ## 2 — Mostly unfaithful
50
+ The answer mixes context-supported and fabricated claims, where the fabrication is meaningful. Example: cites a study that wasn't in the context.
51
+
52
+ ## 3 — Mixed
53
+ Half the answer is grounded; half is reasonable inference or generalization beyond the context. Example: context describes the API; answer adds advice not derivable from context.
54
+
55
+ ## 4 — Mostly faithful
56
+ All claims are supported by context; minor paraphrasing or summarization without distortion. Example: rewords a passage accurately.
57
+
58
+ ## 5 — Strictly faithful
59
+ Every claim is directly traceable to a specific chunk; no information added beyond what context contains. Example: quotes-with-attribution style.
60
+ ```
61
+
62
+ Provide this rubric INSIDE the judge prompt. Free-form is forbidden.
63
+
64
+ ## The calibration loop
65
+
66
+ ```
67
+ 1. Sample 20-30 cases for calibration set.
68
+ 2. Human-grade them blind (without seeing other graders or judge).
69
+ 3. Run judge with rubric.
70
+ 4. Compute Spearman vs human scores.
71
+ 5. If <0.70:
72
+    - Examine disagreements: where does the judge consistently miss?
73
+    - Refine the rubric: more specific scale, more examples, narrower scope.
74
+    - OR switch judge model: try a different vendor/size.
75
+    - Re-run steps 3-4.
76
+ 6. If 0.70-0.80: document the noise floor; accept with caveats.
77
+ 7. If ≥0.80: judge is calibrated. Save the rubric + judge config in version control.
78
+ ```
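+
+ Step 4 is the only part that needs code. A self-contained sketch, taking the human and judge scores as parallel arrays (any stats library's Spearman works just as well):
+
+ ```javascript
+ // Average ranks, with ties sharing the mean of their positions (1-based).
+ function ranks(values) {
+   const order = values.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
+   const r = new Array(values.length);
+   for (let i = 0; i < order.length; ) {
+     let j = i;
+     while (j + 1 < order.length && order[j + 1][0] === order[i][0]) j++;
+     const avgRank = (i + j) / 2 + 1;
+     for (let k = i; k <= j; k++) r[order[k][1]] = avgRank;
+     i = j + 1;
+   }
+   return r;
+ }
+
+ // Spearman = Pearson correlation computed on the ranks.
+ function spearman(humanScores, judgeScores) {
+   const rh = ranks(humanScores);
+   const rj = ranks(judgeScores);
+   const n = rh.length;
+   const meanH = rh.reduce((a, b) => a + b, 0) / n;
+   const meanJ = rj.reduce((a, b) => a + b, 0) / n;
+   let num = 0, dh = 0, dj = 0;
+   for (let i = 0; i < n; i++) {
+     num += (rh[i] - meanH) * (rj[i] - meanJ);
+     dh += (rh[i] - meanH) ** 2;
+     dj += (rj[i] - meanJ) ** 2;
+   }
+   return num / Math.sqrt(dh * dj); // reject the judge config if this lands below 0.70
+ }
+ ```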
79
+
80
+ Calibration is one-time-per-config but RECURRING-PER-MODEL-CHANGE. Every model swap (you upgrade GPT-4 to GPT-5; vendor deprecates Opus 4.7) invalidates the calibration. Re-calibrate.
81
+
82
+ ## Judge drift monitoring
83
+
84
+ After deployment:
85
+ - Re-run calibration set monthly.
86
+ - Plot Spearman over time.
87
+ - Alert if Spearman drops below 0.75 between calibration runs — the judge has drifted (model update, rubric got stale, traffic distribution shifted).
88
+
89
+ ```
90
+ .dw/eval/judges/<feature>/
91
+ ├── rubric.md # the rubric, version-controlled
92
+ ├── calibration-2026-05-12.jsonl # 20+ cases with human + judge scores
93
+ ├── spearman-2026-05-12.txt # 0.84
94
+ ├── calibration-2026-08-12.jsonl   # scheduled re-calibration
95
+ └── spearman-2026-08-12.txt # 0.81
96
+ ```
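+
+ The alert itself can be trivial; a sketch that reads the stored scores above and fails loudly (paths follow the layout shown, `rag-answers` stands in for the feature directory, and the 0.75 threshold is the rule above):
+
+ ```javascript
+ const fs = require('fs');
+ const path = require('path');
+
+ function latestSpearman(judgeDir) {
+   const files = fs.readdirSync(judgeDir)
+     .filter(f => f.startsWith('spearman-'))
+     .sort(); // date-stamped names sort chronologically
+   return parseFloat(fs.readFileSync(path.join(judgeDir, files[files.length - 1]), 'utf8'));
+ }
+
+ if (latestSpearman('.dw/eval/judges/rag-answers') < 0.75) {
+   throw new Error('Judge drift detected: re-calibrate before trusting new scores');
+ }
+ ```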
97
+
98
+ ## Rubric design patterns
99
+
100
+ ### DO
101
+
102
+ - **3-5 criteria** per rubric (one for each dimension you care about: faithfulness, completeness, tone, format, ...).
103
+ - **1-5 scale** with anchored descriptions per point (not 1-10 — too granular for reliable agreement).
104
+ - **Example per score point** showing the kind of output that earns that score.
105
+ - **Explicit "what to ignore"** — e.g., "ignore minor grammar; score on substance."
106
+
107
+ ### DON'T
108
+
109
+ - Single-criterion "quality" score — too vague to calibrate.
110
+ - 1-100 scales — humans can't reliably distinguish 73 from 76.
111
+ - Rubrics longer than 500 words — the judge skips and lazy-scores.
112
+ - "Holistic" scoring without breakdown — opaque to debug.
113
+
114
+ ## Multi-criterion rubrics
115
+
116
+ For complex outputs (RAG, agents), one number rarely captures quality. Use per-criterion scores:
117
+
118
+ ```json
119
+ {
120
+ "faithfulness": 4,
121
+ "completeness": 3,
122
+ "tone": 5,
123
+ "format": 5,
124
+ "overall": null
125
+ }
126
+ ```
127
+
128
+ Aggregate as needed downstream (weighted average, minimum, "all must be ≥3"). Don't have the judge compute the aggregate — bias compounds.
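+
+ A sketch of that downstream aggregation (the hard-gate rule and the weights are illustrative, not prescribed):
+
+ ```javascript
+ // Hard gate: every graded criterion must clear 3; "overall" stays null from the judge.
+ function passesGate(scores) {
+   return Object.entries(scores)
+     .filter(([, value]) => value !== null)
+     .every(([, value]) => value >= 3);
+ }
+
+ // Separate reporting number, computed by us, never by the judge.
+ function weightedOverall({ faithfulness, completeness, tone, format }) {
+   return 0.4 * faithfulness + 0.3 * completeness + 0.2 * tone + 0.1 * format;
+ }
+ ```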
129
+
130
+ ## Anti-patterns
131
+
132
+ - **Judge with no rubric.** "Rate this 1-10." Numbers, no signal.
133
+ - **Judge is the system being tested.** False positives baked in.
134
+ - **No calibration evidence in PR.** "We added LLM-as-judge" — okay, what's the Spearman?
135
+ - **Rubric stuffed with all criteria in one prompt** → judge lazy-scores. Split into criterion-per-call if needed.
136
+ - **Calibration done once, never revisited.** Model upgrades silently break it. Re-calibrate monthly or per model swap.
137
+ - **Judge scoring its own scoring.** Recursive trust collapse.
138
+
139
+ ## Bias to watch
140
+
141
+ LLM judges have characteristic biases:
142
+
143
+ - **Length bias** — longer outputs score higher even when shorter is better. Normalize length in the rubric.
144
+ - **Self-similarity bias** — judges rate outputs that resemble their own writing higher. Cross-model pairing helps.
145
+ - **Position bias** (in comparative judging) — first item often wins. Randomize order, run both A/B and B/A.
146
+ - **Recency bias** — last item in context is overweighted. Vary order.
147
+ - **Sycophancy** — judges agree with strongly-stated input even when wrong. Frame the judge prompt neutrally.
148
+
149
+ Document which biases you tested for in the calibration write-up.
150
+
151
+ ## Cost discipline
152
+
153
+ LLM-as-judge can dominate eval costs. At $0.01-$0.10 per judgment, 100 cases × 4 rubric criteria × monthly = real money.
154
+
155
+ Optimizations (in order of impact):
156
+ 1. Run judge against SAMPLES, not the whole dataset every time. 50 random cases weekly catches regression.
157
+ 2. Use the cheapest model that maintains Spearman ≥0.80. GPT-4o mini may calibrate as well as GPT-4o for your rubric.
158
+ 3. Batch judge calls when the API supports it.
159
+ 4. Cache judge results per (input, output, rubric-version) tuple — same eval run shouldn't pay twice.
160
+ 5. Skip judge for cases where rungs 1-3 already failed — they're broken; no point asking subjective quality.
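+
+ A sketch of the caching in item 4, with `cache` standing in for any get/set store (in-memory map, Redis, a JSON file):
+
+ ```javascript
+ const crypto = require('crypto');
+
+ function judgeCacheKey(input, output, rubricVersion) {
+   return crypto.createHash('sha256')
+     .update(JSON.stringify({ input, output, rubricVersion }))
+     .digest('hex');
+ }
+
+ async function judgeWithCache(cache, runJudge, input, output, rubricVersion) {
+   const key = judgeCacheKey(input, output, rubricVersion);
+   const hit = await cache.get(key);
+   if (hit) return hit; // the same (input, output, rubric-version) never pays twice
+   const verdict = await runJudge(input, output);
+   await cache.set(key, verdict);
+   return verdict;
+ }
+ ```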
161
+
162
+ ## When NOT to use LLM-as-judge
163
+
164
+ - The output has a deterministic correct answer. Use rung 1 or 2.
165
+ - The output has a measurable side effect. Use rung 3.
166
+ - The team won't budget for calibration. The judge will produce noise.
167
+ - The rubric can't be written in <500 words. The criterion is too vague.
168
+
169
+ A poorly-calibrated judge is worse than no judge: it gives false confidence. Better to ship with "tested manually by domain expert on 20 cases" than with "judge score 4.1" that means nothing.