@brunosps00/dev-workflow 0.11.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (127)
  1. package/README.md +54 -5
  2. package/lib/constants.js +20 -20
  3. package/lib/init.js +24 -1
  4. package/lib/migrate-skills.js +129 -0
  5. package/lib/removed-bundled-skills.js +16 -0
  6. package/lib/uninstall.js +6 -2
  7. package/lib/utils.js +43 -1
  8. package/package.json +1 -1
  9. package/scaffold/en/agent-instructions.md +68 -0
  10. package/scaffold/en/commands/dw-autopilot.md +1 -1
  11. package/scaffold/en/commands/dw-brainstorm.md +1 -1
  12. package/scaffold/en/commands/dw-bugfix.md +4 -3
  13. package/scaffold/en/commands/dw-code-review.md +1 -0
  14. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  15. package/scaffold/en/commands/dw-create-techspec.md +1 -1
  16. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/en/commands/dw-functional-doc.md +2 -2
  19. package/scaffold/en/commands/dw-help.md +2 -2
  20. package/scaffold/en/commands/dw-redesign-ui.md +7 -7
  21. package/scaffold/en/commands/dw-run-qa.md +5 -4
  22. package/scaffold/en/commands/dw-run-task.md +2 -2
  23. package/scaffold/en/templates/constitution-template.md +1 -1
  24. package/scaffold/pt-br/agent-instructions.md +68 -0
  25. package/scaffold/pt-br/commands/dw-autopilot.md +1 -1
  26. package/scaffold/pt-br/commands/dw-brainstorm.md +1 -1
  27. package/scaffold/pt-br/commands/dw-bugfix.md +4 -3
  28. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  29. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  30. package/scaffold/pt-br/commands/dw-create-techspec.md +1 -1
  31. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  32. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  33. package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
  34. package/scaffold/pt-br/commands/dw-help.md +2 -2
  35. package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
  36. package/scaffold/pt-br/commands/dw-run-qa.md +5 -4
  37. package/scaffold/pt-br/commands/dw-run-task.md +2 -2
  38. package/scaffold/pt-br/templates/constitution-template.md +1 -1
  39. package/scaffold/skills/dw-council/SKILL.md +1 -1
  40. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  41. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  42. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  43. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  44. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  45. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  46. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  47. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  48. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  49. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  50. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  51. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  52. package/scaffold/skills/dw-testing-discipline/SKILL.md +171 -0
  53. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  54. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +336 -0
  55. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  56. package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +163 -0
  57. package/scaffold/skills/dw-testing-discipline/references/patterns.md +241 -0
  58. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +282 -0
  59. package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/security-boundary.md +1 -1
  60. package/scaffold/skills/dw-ui-discipline/SKILL.md +150 -0
  61. package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +225 -0
  62. package/scaffold/skills/dw-ui-discipline/references/curated-defaults.md +195 -0
  63. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +162 -0
  64. package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +101 -0
  65. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  66. package/scaffold/skills/ui-ux-pro-max/LICENSE +0 -21
  67. package/scaffold/skills/ui-ux-pro-max/SKILL.md +0 -659
  68. package/scaffold/skills/ui-ux-pro-max/data/_sync_all.py +0 -414
  69. package/scaffold/skills/ui-ux-pro-max/data/app-interface.csv +0 -31
  70. package/scaffold/skills/ui-ux-pro-max/data/charts.csv +0 -26
  71. package/scaffold/skills/ui-ux-pro-max/data/colors.csv +0 -162
  72. package/scaffold/skills/ui-ux-pro-max/data/design.csv +0 -1776
  73. package/scaffold/skills/ui-ux-pro-max/data/draft.csv +0 -1779
  74. package/scaffold/skills/ui-ux-pro-max/data/google-fonts.csv +0 -1924
  75. package/scaffold/skills/ui-ux-pro-max/data/icons.csv +0 -106
  76. package/scaffold/skills/ui-ux-pro-max/data/landing.csv +0 -35
  77. package/scaffold/skills/ui-ux-pro-max/data/products.csv +0 -162
  78. package/scaffold/skills/ui-ux-pro-max/data/react-performance.csv +0 -45
  79. package/scaffold/skills/ui-ux-pro-max/data/stacks/angular.csv +0 -51
  80. package/scaffold/skills/ui-ux-pro-max/data/stacks/astro.csv +0 -54
  81. package/scaffold/skills/ui-ux-pro-max/data/stacks/flutter.csv +0 -53
  82. package/scaffold/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +0 -56
  83. package/scaffold/skills/ui-ux-pro-max/data/stacks/jetpack-compose.csv +0 -53
  84. package/scaffold/skills/ui-ux-pro-max/data/stacks/laravel.csv +0 -51
  85. package/scaffold/skills/ui-ux-pro-max/data/stacks/nextjs.csv +0 -53
  86. package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +0 -51
  87. package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +0 -59
  88. package/scaffold/skills/ui-ux-pro-max/data/stacks/react-native.csv +0 -52
  89. package/scaffold/skills/ui-ux-pro-max/data/stacks/react.csv +0 -54
  90. package/scaffold/skills/ui-ux-pro-max/data/stacks/shadcn.csv +0 -61
  91. package/scaffold/skills/ui-ux-pro-max/data/stacks/svelte.csv +0 -54
  92. package/scaffold/skills/ui-ux-pro-max/data/stacks/swiftui.csv +0 -51
  93. package/scaffold/skills/ui-ux-pro-max/data/stacks/threejs.csv +0 -54
  94. package/scaffold/skills/ui-ux-pro-max/data/stacks/vue.csv +0 -50
  95. package/scaffold/skills/ui-ux-pro-max/data/styles.csv +0 -85
  96. package/scaffold/skills/ui-ux-pro-max/data/typography.csv +0 -74
  97. package/scaffold/skills/ui-ux-pro-max/data/ui-reasoning.csv +0 -162
  98. package/scaffold/skills/ui-ux-pro-max/data/ux-guidelines.csv +0 -100
  99. package/scaffold/skills/ui-ux-pro-max/scripts/core.py +0 -262
  100. package/scaffold/skills/ui-ux-pro-max/scripts/design_system.py +0 -1148
  101. package/scaffold/skills/ui-ux-pro-max/scripts/search.py +0 -114
  102. package/scaffold/skills/ui-ux-pro-max/skills/brand/SKILL.md +0 -97
  103. package/scaffold/skills/ui-ux-pro-max/skills/design/SKILL.md +0 -302
  104. package/scaffold/skills/ui-ux-pro-max/skills/design-system/SKILL.md +0 -244
  105. package/scaffold/skills/ui-ux-pro-max/templates/base/quick-reference.md +0 -297
  106. package/scaffold/skills/ui-ux-pro-max/templates/base/skill-content.md +0 -358
  107. package/scaffold/skills/ui-ux-pro-max/templates/platforms/agent.json +0 -21
  108. package/scaffold/skills/ui-ux-pro-max/templates/platforms/augment.json +0 -18
  109. package/scaffold/skills/ui-ux-pro-max/templates/platforms/claude.json +0 -21
  110. package/scaffold/skills/ui-ux-pro-max/templates/platforms/codebuddy.json +0 -21
  111. package/scaffold/skills/ui-ux-pro-max/templates/platforms/codex.json +0 -21
  112. package/scaffold/skills/ui-ux-pro-max/templates/platforms/continue.json +0 -21
  113. package/scaffold/skills/ui-ux-pro-max/templates/platforms/copilot.json +0 -21
  114. package/scaffold/skills/ui-ux-pro-max/templates/platforms/cursor.json +0 -21
  115. package/scaffold/skills/ui-ux-pro-max/templates/platforms/droid.json +0 -21
  116. package/scaffold/skills/ui-ux-pro-max/templates/platforms/gemini.json +0 -21
  117. package/scaffold/skills/ui-ux-pro-max/templates/platforms/kilocode.json +0 -21
  118. package/scaffold/skills/ui-ux-pro-max/templates/platforms/kiro.json +0 -21
  119. package/scaffold/skills/ui-ux-pro-max/templates/platforms/opencode.json +0 -21
  120. package/scaffold/skills/ui-ux-pro-max/templates/platforms/qoder.json +0 -21
  121. package/scaffold/skills/ui-ux-pro-max/templates/platforms/roocode.json +0 -21
  122. package/scaffold/skills/ui-ux-pro-max/templates/platforms/trae.json +0 -21
  123. package/scaffold/skills/ui-ux-pro-max/templates/platforms/warp.json +0 -18
  124. package/scaffold/skills/ui-ux-pro-max/templates/platforms/windsurf.json +0 -21
  125. package/scaffold/skills/webapp-testing/SKILL.md +0 -138
  126. package/scaffold/skills/webapp-testing/assets/test-helper.js +0 -56
  127. package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/three-workflow-patterns.md +0 -0
package/scaffold/skills/dw-llm-eval/references/agent-eval.md
@@ -0,0 +1,252 @@
# Agent evaluation — outcome vs trajectory

Agent eval has a foundational question: do you grade **what the agent did along the way** (trajectory) or **what state the world is in at the end** (outcome)?

The answer determines what you measure and what failure modes you catch.

## Outcome-only evaluation (recommended default)

**What it checks:** at the end of the agent's run, does the world look the way it should? Was the right tool called? Was the right ticket filed? Was the user's question answered correctly?

**Pattern:**

```javascript
test('agent files refund ticket when user requests refund', async () => {
  await agent.run('I want a refund for order #123');

  // Outcome assertions
  const tickets = await db.tickets.findMany({ where: { order_id: '123' } });
  expect(tickets).toHaveLength(1);
  expect(tickets[0].type).toBe('refund');

  const userMessage = agent.lastMessage();
  expect(userMessage).toMatch(/refund.*(processed|filed|submitted)/i);
});
```

**Strengths:**
- Permits creative paths — agent solved it via tool A → B → C OR via tool A → C → B; both pass if the outcome is right.
- Robust to internal refactor — restructuring the agent's prompt or tool descriptions doesn't break the test as long as the outcome holds.
- Aligned with what users care about: did the system do the right thing?

**Weaknesses:**
- Misses "ghost actions" — agent claims to have done X but the outcome state shows it didn't.
  - Defense: combine with rung-3 outcome-state assertions (DB writes, API calls). Don't trust the agent's word.
- Misses inefficiency — agent took 17 tool calls to do what should be 3. Outcome OK but cost is bad.
  - Defense: track tool-call count as a separate metric; alert if it exceeds budget.

## Trajectory evaluation (when path matters)

**What it checks:** did the agent take the expected sequence (or set) of tool calls? Match against a reference trajectory.

**Use when:**
- Compliance / audit requires specific actions in specific order (e.g., "ALWAYS verify identity before disclosing balance").
- Safety-critical: a specific tool MUST be called (e.g., "if user mentions self-harm, must invoke `escalate-to-human` BEFORE any other action").
- The path itself is the contract (e.g., a workflow agent that must traverse a specific decision tree).

## Trajectory match modes

Four modes, from strictest to most permissive:

### Strict

**Rule:** actual trajectory contains identical tool calls in identical order with identical arguments.

**Use when:** path AND parameters are both part of the contract. Compliance, deterministic workflow agents.

```javascript
expect(actualToolCalls).toEqual([
  { name: 'verify_identity', args: { user_id: 'u-42' } },
  { name: 'get_balance', args: { account_id: 'a-99' } },
  { name: 'respond_to_user', args: { template: 'balance-inquiry' } },
]);
```

### Unordered

**Rule:** actual contains the same set of tool calls, any order; arguments match.

**Use when:** the agent legitimately may parallelize or reorder calls without affecting correctness.

```javascript
expect(new Set(actualToolNames)).toEqual(new Set(['fetch_user', 'fetch_orders', 'fetch_addresses']));
```

### Subset

**Rule:** actual trajectory is a SUBSET of reference — agent didn't exceed expected tool calls.

**Use when:** frugality / cost discipline — "agent should NOT call expensive tools unnecessarily."

```javascript
// Reference is the maximum allowed set
const referenceToolCalls = ['fetch_user', 'classify_intent', 'respond'];
const allActualInReference = actualToolNames.every(t => referenceToolCalls.includes(t));
expect(allActualInReference).toBe(true);
```

### Superset

**Rule:** actual contains ALL reference tool calls, possibly plus extras.

**Use when:** specific tools are mandatory but extras are acceptable. Often safety-critical ("MUST log audit event") while permitting agent autonomy in other tool choices.

```javascript
// Reference is the minimum required set
const requiredToolCalls = ['log_audit_event', 'verify_user'];
const allRequiredCalled = requiredToolCalls.every(t => actualToolNames.includes(t));
expect(allRequiredCalled).toBe(true);
```

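All four modes reduce to small set or sequence comparisons. A minimal Python sketch, comparing tool names only (the function name is ours; a real matcher would also compare arguments):

```python
def match_trajectory(actual, reference, mode="strict"):
    """Compare tool-call trajectories by name. actual/reference: lists of tool names."""
    if mode == "strict":
        # identical calls in identical order
        return actual == reference
    if mode == "unordered":
        # same multiset of calls, any order
        return sorted(actual) == sorted(reference)
    if mode == "subset":
        # actual must not exceed the reference set
        return set(actual) <= set(reference)
    if mode == "superset":
        # every required call must appear; extras are fine
        return set(actual) >= set(reference)
    raise ValueError(f"unknown mode: {mode}")
```

In a full matcher each mode would delegate to the argument-matching strategy configured per tool.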

## Argument matching strategies

When trajectory matching, how should tool ARGUMENTS be compared?

| Strategy | Behavior | Use |
|----------|----------|-----|
| **Exact** | Arguments must match byte-for-byte | Deterministic args (IDs, fixed strings) |
| **Ignore** | Any call to the right tool counts | When the call ITSELF is what matters, not args |
| **Subset** | Actual args contain at least the reference args | Required fields enforced; extras OK |
| **Superset** | Actual args are within reference args set | Frugality — agent didn't add unexpected fields |
| **Custom comparator** | Per-tool comparison function | Domain-specific equivalence (case-insensitive, semantic match) |

Example with custom comparator:

```javascript
const matchers = {
  'search_cities': (actualArgs, refArgs) => {
    // City name comparison: case-insensitive, trimmed
    return actualArgs.name.toLowerCase().trim() === refArgs.name.toLowerCase().trim();
  },
  'fetch_user': 'exact',
};
```

## Decision tree: outcome vs trajectory

```
Is correctness defined by the FINAL STATE or by the PATH?

├── Final state only — agent can solve it however it likes
│     → Outcome-only eval. Combine with rung-3 state assertions.

├── Specific tool calls MUST happen for compliance/safety
│     → Superset trajectory mode. Outcome too, as a separate check.

├── Specific tool calls MUST NOT happen (cost / privacy)
│     → Subset trajectory mode.

├── The full workflow path is the contract (legal, audit)
│     → Strict trajectory mode.

└── Path is partly fixed, partly free
      → Trajectory with custom comparator OR split into multiple smaller tests.
```

## Cost vs accuracy tracking

Two metrics that trajectory eval naturally enables:

### Tool-call efficiency

```python
def efficiency_score(actual_trajectory, reference_trajectory):
    actual_calls = len(actual_trajectory)
    reference_calls = len(reference_trajectory)
    if reference_calls == 0:
        return 1.0
    if actual_calls == 0:
        return 0.0  # reference expected calls but the agent made none
    return min(1.0, reference_calls / actual_calls)
```

Score 1.0 = matched or beat reference. Score 0.5 = took 2× the expected calls.

Track this over time; agent regressions sometimes show up as efficiency loss before outcome loss.

### Step-count percentile

```
Run 2026-05-12:
  p50 tool calls: 4   (reference 3)
  p95 tool calls: 9   (reference 6)
  p99 tool calls: 18  (reference 12)
```

p99 spikes catch cases where the agent enters a loop or backtracks excessively — outcome may still be correct but cost runaway is real.

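Those percentiles are cheap to compute from per-case tool-call counts. A nearest-rank sketch (the function name is illustrative, not part of the package):

```python
def step_count_percentiles(call_counts, percentiles=(50, 95, 99)):
    """Nearest-rank percentiles over a list of per-case tool-call counts."""
    ordered = sorted(call_counts)
    out = {}
    for p in percentiles:
        # nearest-rank method: ceil(p/100 * n), 1-indexed
        rank = max(1, -(-p * len(ordered) // 100))
        out[f"p{p}"] = ordered[rank - 1]
    return out
```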
## Dataset structure for agents

```json
{
  "id": "agent-case-001",
  "input": {
    "user_message": "Cancel my order #123 and request a refund"
  },
  "expected_outcome": {
    "tickets_created": [
      { "type": "refund", "order_id": "123" }
    ],
    "order_status": "cancelled"
  },
  "expected_trajectory": {
    "mode": "superset",
    "required_calls": [
      { "name": "verify_user_owns_order", "args_match": "exact" },
      { "name": "update_order_status", "args_match": "subset", "args": { "status": "cancelled" } },
      { "name": "create_refund_ticket", "args_match": "subset" }
    ],
    "forbidden_calls": ["delete_user_data"]
  },
  "tool_budget": { "p95_max_calls": 8 }
}
```

The `forbidden_calls` field is powerful — explicitly enumerate tools that MUST NOT fire for this input class. Catches "agent escalated to a dangerous tool that wasn't necessary."

## Combining outcome + trajectory

For serious agent eval, combine both:

```javascript
test('agent handles refund request', async () => {
  const result = await agent.run(testCase.input);

  // Outcome
  expectOutcomeMatch(result.outcome, testCase.expected_outcome);

  // Trajectory — superset mode (required tools called)
  expectTrajectoryMatch(result.trajectory, testCase.expected_trajectory, 'superset');

  // Forbidden — none of these tools fired
  for (const forbidden of testCase.expected_trajectory.forbidden_calls) {
    expect(result.trajectory.some(t => t.name === forbidden)).toBe(false);
  }

  // Budget — didn't exceed expected tool calls
  expect(result.trajectory.length).toBeLessThanOrEqual(testCase.tool_budget.p95_max_calls);
});
```

## LLM-as-judge for agent quality

Beyond mechanical trajectory matching, judge for:
- Was the agent's intermediate reasoning sound? (rubric: logical, evidence-based, non-hallucinated)
- Was the final user message appropriate? (rubric: tone, completeness, accuracy)
- Did the agent handle ambiguity well? (rubric: did it ask for clarification when needed?)

These are rung-4 evaluations on top of rung-1/2/3 outcome and trajectory checks.

## Anti-patterns

- **Trajectory-only eval** → punishes creative paths; brittle to refactor; ignores real outcome.
- **Outcome-only eval without state assertion** → trusts the agent's word; misses ghost actions.
- **Strict trajectory mode when subset/superset would do** → false negatives every time the agent legitimately reorders.
- **No tool-budget tracking** → agent regresses to expensive paths; you don't notice until the bill spikes.
- **No `forbidden_calls` enumeration** → agent silently learns to call dangerous tools.

## Tools

- `langchain-ai/agentevals` (MIT) — Python library implementing all four trajectory match modes + LLM-as-judge for trajectories. Source of the taxonomy above.
- `langsmith` — observability + eval orchestration; tracks experiments over time.
- Custom implementation — the modes above are ~50 lines each in any language.

The discipline isn't the library choice; it's choosing outcome-vs-trajectory deliberately, picking the right match mode, and tracking efficiency alongside accuracy.
package/scaffold/skills/dw-llm-eval/references/judge-calibration.md
@@ -0,0 +1,169 @@
# LLM-as-judge calibration — how to make rung 4 mean something

LLM-as-judge sounds simple: a model grades the output. In practice, without calibration it produces NUMBERS WITHOUT SIGNAL — judge scores drift with the model, with rubric phrasing, with prompt minutiae. You read "judge says 4.2 average" and have no idea if that means the system is good.

Calibration anchors the judge to human assessment. After calibration, a judge score has meaning. Before, it doesn't.

## The three non-negotiables

### 1. Calibrate against ≥20 human-graded cases

Process:

1. Sample ≥20 cases from the reference dataset (or representative production traffic).
2. Have ≥1 domain expert grade each case using the same rubric the judge will use. Multiple humans per case is better (inter-rater agreement is useful signal).
3. Run the judge against the same cases.
4. Compute Spearman rank correlation between human scores and judge scores.

**Target:** Spearman ≥0.80.
**Acceptable:** 0.70-0.80 with documented rationale (e.g., "subjective tone judgments inherently noisy").
**Reject:** <0.70. The judge is not measuring what you think it's measuring.

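Step 4 is small enough to inline. A sketch of Spearman with average ranks for ties (in practice `scipy.stats.spearmanr` computes the same thing and is the safer choice):

```python
def _ranks(xs):
    """Average ranks (1-based), so tied scores don't distort the correlation."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human, judge):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(human), _ranks(judge)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Spearman (not Pearson) is the right choice here because only the ordering of scores matters, not their scale.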
### 2. Use a different model than the system under test

A model judging its own output produces false positives. The judge agrees with itself even when wrong because it shares the same biases and blind spots.

Pairing examples:
- System: GPT-4 → Judge: Claude Opus.
- System: Claude Sonnet → Judge: GPT-4o.
- System: Gemini → Judge: Claude.

If both system and judge MUST be from the same provider, at minimum use different model sizes (Sonnet judges Opus output, not vice versa).

### 3. Structured rubric, not free-form scoring

"Rate this answer 1-10" → noise. Different runs give different scores; different humans disagree wildly; the score has no anchor.

Structured rubric: ≥3 criteria, each with a defined scale and an example per score point.

Example rubric for FAITHFULNESS (RAG):

```markdown
# Faithfulness rubric (1-5 scale)

Score each answer against the retrieved context. A faithful answer makes claims supported by the context; an unfaithful one fabricates or extrapolates.

## 1 — Severely unfaithful
The answer contains claims that contradict the context, or fabricates facts not present in any chunk. Example: context says "Q3 revenue was $1.2M"; answer says "Q3 revenue exceeded $5M."

## 2 — Mostly unfaithful
The answer mixes context-supported and fabricated claims, where the fabrication is meaningful. Example: cites a study that wasn't in the context.

## 3 — Mixed
Half the answer is grounded; half is reasonable inference or generalization beyond the context. Example: context describes the API; answer adds advice not derivable from context.

## 4 — Mostly faithful
All claims are supported by context; minor paraphrasing or summarization without distortion. Example: rewords a passage accurately.

## 5 — Strictly faithful
Every claim is directly traceable to a specific chunk; no information added beyond what context contains. Example: quotes-with-attribution style.
```

Provide this rubric INSIDE the judge prompt. Free-form is forbidden.

## The calibration loop

```
1. Sample 20-30 cases for calibration set.
2. Human-grade them blind (without seeing other graders or judge).
3. Run judge with rubric.
4. Compute Spearman vs human scores.
5. If <0.70:
   - Examine disagreements: where does judge consistently miss?
   - Refine rubric: more specific scale, more examples, narrower scope.
   - OR switch judge model: try a different vendor/size.
   - Re-run steps 3-4.
6. If 0.70-0.80: document the noise floor; accept with caveats.
7. If ≥0.80: judge is calibrated. Save the rubric + judge config in version control.
```

Calibration is one-time-per-config but RECURRING-PER-MODEL-CHANGE. Every model swap (you upgrade GPT-4 to GPT-5; vendor deprecates Opus 4.7) invalidates the calibration. Re-calibrate.

## Judge drift monitoring

After deployment:
- Re-run calibration set monthly.
- Plot Spearman over time.
- Alert if Spearman drops below 0.75 between calibration runs — the judge has drifted (model update, rubric got stale, traffic distribution shifted).

```
.dw/eval/judges/<feature>/
├── rubric.md                      # the rubric, version-controlled
├── calibration-2026-05-12.jsonl   # 20+ cases with human + judge scores
├── spearman-2026-05-12.txt        # 0.84
├── calibration-2026-08-12.jsonl   # quarterly re-calibration
└── spearman-2026-08-12.txt        # 0.81
```

## Rubric design patterns

### DO

- **3-5 criteria** per rubric (one for each dimension you care about: faithfulness, completeness, tone, format, ...).
- **1-5 scale** with anchored descriptions per point (not 1-10 — too granular for reliable agreement).
- **Example per score point** showing the kind of output that earns that score.
- **Explicit "what to ignore"** — e.g., "ignore minor grammar; score on substance."

### DON'T

- Single-criterion "quality" score — too vague to calibrate.
- 1-100 scales — humans can't reliably distinguish 73 from 76.
- Rubrics longer than 500 words — the judge skips and lazy-scores.
- "Holistic" scoring without breakdown — opaque to debug.

## Multi-criterion rubrics

For complex outputs (RAG, agents), one number rarely captures quality. Use per-criterion scores:

```json
{
  "faithfulness": 4,
  "completeness": 3,
  "tone": 5,
  "format": 5,
  "overall": null
}
```

Aggregate as needed downstream (weighted average, minimum, "all must be ≥3"). Don't have the judge compute the aggregate — bias compounds.

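The downstream aggregation stays trivial once the judge returns per-criterion scores. A sketch of the three policies just mentioned (the equal weights and the floor of 3 are illustrative defaults, not prescriptions):

```python
def aggregate(scores, weights=None, floor=3):
    """Aggregate per-criterion judge scores downstream of the judge itself."""
    criteria = {k: v for k, v in scores.items() if v is not None}  # drop "overall": null
    weights = weights or {k: 1.0 for k in criteria}
    weighted_avg = (
        sum(criteria[k] * weights[k] for k in criteria)
        / sum(weights[k] for k in criteria)
    )
    return {
        "weighted_avg": weighted_avg,
        "minimum": min(criteria.values()),
        "passes_floor": all(v >= floor for v in criteria.values()),  # "all must be ≥3"
    }
```

Keeping the policy in code (not in the judge prompt) also means you can change the aggregation without re-calibrating the judge.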
## Anti-patterns

- **Judge with no rubric.** "Rate this 1-10." Numbers, no signal.
- **Judge is the system being tested.** False positives baked in.
- **No calibration evidence in PR.** "We added LLM-as-judge" — okay, what's the Spearman?
- **Rubric stuffed with all criteria in one prompt** → judge lazy-scores. Split into criterion-per-call if needed.
- **Calibration done once, never revisited.** Model upgrades silently break it. Re-calibrate monthly or per model swap.
- **Judge scoring its own scoring.** Recursive trust collapse.

## Bias to watch

LLM judges have characteristic biases:

- **Length bias** — longer outputs score higher even when shorter is better. Normalize length in the rubric.
- **Self-similarity bias** — judges rate outputs that resemble their own writing higher. Cross-model pairing helps.
- **Position bias** (in comparative judging) — first item often wins. Randomize order, run both A/B and B/A.
- **Recency bias** — last item in context is overweighted. Vary order.
- **Sycophancy** — judges agree with strongly-stated input even when wrong. Frame the judge prompt neutrally.

Document which biases you tested for in the calibration write-up.

## Cost discipline

LLM-as-judge can dominate eval costs. At $0.01-$0.10 per judgment, 100 cases × 4 rubric criteria × monthly = real money.

Optimizations (in order of impact):
1. Run judge against SAMPLES, not the whole dataset every time. 50 random cases weekly catches regression.
2. Use the cheapest model that maintains Spearman ≥0.80. GPT-4 mini may calibrate as well as GPT-4 for your rubric.
3. Batch judge calls when the API supports it.
4. Cache judge results per (input, output, rubric-version) tuple — same eval run shouldn't pay twice.
5. Skip judge for cases where rungs 1-3 already failed — they're broken; no point asking subjective quality.

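Optimization 4 needs only a content-addressed key. A sketch (the `judge` callable and its signature are hypothetical; in a real harness the cache would persist to disk):

```python
import hashlib
import json

_judge_cache = {}

def cached_judge(judge, input_text, output_text, rubric_version):
    """Memoize judge calls on (input, output, rubric-version) so a re-run never pays twice."""
    key = hashlib.sha256(
        json.dumps([input_text, output_text, rubric_version]).encode()
    ).hexdigest()
    if key not in _judge_cache:
        _judge_cache[key] = judge(input_text, output_text)
    return _judge_cache[key]
```

Including the rubric version in the key is the important part: editing the rubric must invalidate every cached judgment.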
## When NOT to use LLM-as-judge

- The output has a deterministic correct answer. Use rung 1 or 2.
- The output has a measurable side effect. Use rung 3.
- The team won't budget for calibration. The judge will produce noise.
- The rubric can't be written in <500 words. The criterion is too vague.

A poorly-calibrated judge is worse than no judge: it gives false confidence. Better to ship with "tested manually by domain expert on 20 cases" than with "judge score 4.1" that means nothing.
package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md
@@ -0,0 +1,171 @@
# Oracle ladder — climb deliberately

Five rungs ordered by cost (cheap → expensive) and rigor (strict → subjective). Start at the bottom. Every rung up costs an order of magnitude more in latency, money, or calibration effort. Don't reach for an upper rung when a lower one can prove the case.

## Rung 1 — Exact match

**What it checks:** the output equals the expected output, byte-for-byte (or after a normalization step like JSON canonicalization).

**Use when:**
- Output is a structured function call: `expect(toolCalls[0]).toEqual({ name: 'search', args: { q: 'invoices' } })`.
- Output is a classification from a fixed label set: `expect(label).toBe('refund-request')`.
- Output is a parsed value from a JSON contract: `expect(result.user_id).toBe('u-42')`.

**Example:**

```javascript
test('classifier labels refund requests correctly', async () => {
  const cases = await loadDataset('.dw/eval/datasets/classifier/cases.jsonl');
  for (const c of cases.filter(c => c.expected === 'refund-request')) {
    expect(await classify(c.input)).toBe('refund-request');
  }
});
```

**Cost:** ~free.
**Limitation:** can't handle creative outputs (paragraphs, summaries). Don't try to force-fit.

## Rung 2 — Schema validation

**What it checks:** the output matches a structural contract — types, required fields, value ranges. The SHAPE is fixed; specific values can vary.

**Use when:**
- LLM returns structured data with stable schema (JSON, function call args) but variable content.
- You need to detect "agent returned garbage" without asserting on the exact garbage.

**Example:**

```typescript
import { z } from 'zod';

const ResponseSchema = z.object({
  summary: z.string().min(20).max(500),
  citations: z.array(z.object({
    url: z.string().url(),
    page: z.number().int().optional(),
  })).min(1),
  confidence: z.number().min(0).max(1),
});

test('summarizer returns valid shape', async () => {
  const result = await summarize(input);
  expect(() => ResponseSchema.parse(result)).not.toThrow();
});
```

**Cost:** ~free (schema check is cheap).
**Limitation:** doesn't tell you if the CONTENT is correct, only that it's the right shape. Pair with another rung.

59
+ ## Rung 3 — Outcome state
60
+
61
+ **What it checks:** a side effect occurred — DB row was created, file was written, tool was called with valid arguments, ticket was opened. The state of the world matches expectations.
62
+
63
+ **Use when:**
64
+ - Agent has tool access and the GOAL is to change state, not produce prose.
65
+ - RAG answer is supposed to lead to an action (e.g., "user clicked the suggested invoice and reconciled it").
66
+ - The system has observable side effects you can query post-hoc.
67
+
68
+ **Example:**
69
+
70
```javascript
test('agent files refund request when user asks', async () => {
  await agent.run('I want a refund for order #123');

  const tickets = await db.tickets.findMany({ where: { order_id: '123' } });
  expect(tickets).toHaveLength(1);
  expect(tickets[0].type).toBe('refund');
  expect(tickets[0].status).toBe('pending');
});
```

**Cost:** cheap (1 DB query / API call per assertion).
**Limitation:** doesn't validate the PROSE the agent produced along the way. If the goal was "answer the user politely AND file the refund," rung 3 catches the action but not the politeness — climb to rung 4 for that.

**Key benefit:** catches "ghost actions" — the agent claims to have done X but didn't actually do it. Rungs 1-2 trust the agent's word; rung 3 verifies the world.

## Rung 4 — LLM-as-judge

**What it checks:** a different model grades the output against a rubric. Used for genuinely subjective quality — helpfulness, tone, faithfulness, completeness.

**Mandatory before using:**
- Calibrated against ≥20 human-graded cases (Spearman ≥0.80) — see `judge-calibration.md`.
- Different model than the system under test.
- Structured rubric, not free-form "rate 1-10."

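What "structured rubric" means in practice: named criteria with anchored score descriptions. The exact shape a judge helper expects will vary; this `faithfulnessRubric` object is an illustrative assumption, not a library API:

```javascript
// A hypothetical structured rubric: one named criterion with anchored
// score descriptions, instead of a free-form "rate 1-10".
const faithfulnessRubric = {
  criterion: 'faithfulness',
  scale: { min: 1, max: 5 },
  anchors: {
    1: 'Contradicts the retrieved context or invents facts.',
    3: 'Mostly grounded, but at least one claim is unsupported by the context.',
    5: 'Every claim is directly supported by the retrieved context.',
  },
};

console.log(faithfulnessRubric.scale.max); // 5
```

Anchors matter because they pin the judge to observable criteria; without them, two runs of the same judge can grade the same answer differently.
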
**Example:**

```javascript
test('chat response is faithful to retrieved context', async () => {
  const cases = await loadDataset('.dw/eval/datasets/rag-chat/cases.jsonl');
  const scores = [];

  for (const c of cases) {
    const answer = await chat(c.input, c.context);
    const judgment = await llmJudge({
      model: 'claude-opus-4-7', // different from system under test (GPT-4)
      rubric: faithfulnessRubric,
      input: c.input,
      context: c.context,
      output: answer,
    });
    scores.push(judgment.score);
  }

  // 80% of cases must score ≥4 on the 1-5 faithfulness rubric
  const passing = scores.filter(s => s >= 4).length / scores.length;
  expect(passing).toBeGreaterThanOrEqual(0.8);
});
```

**Cost:** medium-to-high (one judge call per case, billed at API rates).
**Limitation:** the judge has bias and drift; without calibration, you're measuring the judge's mood. Re-calibrate every quarter, after every model swap, and after rubric changes.

## Rung 5 — Human review

**What it checks:** a domain expert scores the output. This is the gold standard that rung 4's judge is calibrated against.

**Use when:**
- Calibrating LLM-as-judge (rung 4 setup).
- High-stakes outputs where automation isn't trusted (medical, legal, financial).
- Edge cases that automated rungs flag as borderline.

**Cost:** expensive. Don't scale; sample.

**Pattern:**
- Spot-check 5-10% of LLM-as-judge results randomly each week.
- Whenever the LLM-as-judge score is borderline (e.g., 2.5-3.5 on a 1-5 scale), kick it to a human.
- Full human review only for the calibration dataset and high-stakes edge cases.
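
The routing rule above can be sketched as a small helper. Names and structure are illustrative; the 2.5-3.5 band and the 5% rate come straight from the bullets:

```javascript
// Route each LLM-as-judge result: auto-pass, auto-fail, or human review.
const SPOT_CHECK_RATE = 0.05; // random 5% weekly spot-check

function routeForReview(judgeScore, random = Math.random) {
  if (judgeScore >= 2.5 && judgeScore <= 3.5) return 'human'; // borderline band
  if (random() < SPOT_CHECK_RATE) return 'human';             // random spot-check
  return judgeScore > 3.5 ? 'auto-pass' : 'auto-fail';
}

console.log(routeForReview(3.0, () => 1)); // 'human' (borderline)
console.log(routeForReview(4.5, () => 1)); // 'auto-pass'
console.log(routeForReview(2.0, () => 1)); // 'auto-fail'
```

Passing `() => 1` as the random source disables the spot-check for deterministic examples; in production you would leave the default `Math.random`.
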

## The climbing decision tree

```
Is the output a fixed-structure value (function call, classification, JSON with stable shape)?
├── YES → Rung 1 (exact match) or Rung 2 (schema)
└── NO → does the output cause an observable side effect (DB write, tool call, ticket opened)?
    ├── YES → Rung 3 (outcome state)
    └── NO → output is subjective (prose, summary, recommendation). Rung 4 required.
        └── Did you calibrate the judge against humans (≥20 cases, Spearman ≥0.80)?
            ├── YES → Rung 4 is valid signal
            └── NO → DO NOT USE Rung 4 yet. Calibrate first via Rung 5.
```
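
The tree can also be encoded literally, if only to make the "calibrate first" branch unskippable in code review. A sketch; the flag names are made up:

```javascript
// A literal encoding of the decision tree above; flag names are illustrative.
function pickRung({ fixedStructure, sideEffects, judgeCalibrated }) {
  if (fixedStructure) return 'rung 1 or 2';
  if (sideEffects) return 'rung 3';
  // Subjective output: rung 4 — but only with a calibrated judge.
  return judgeCalibrated ? 'rung 4' : 'rung 5: calibrate the judge first';
}

console.log(pickRung({ fixedStructure: true }));   // 'rung 1 or 2'
console.log(pickRung({ sideEffects: true }));      // 'rung 3'
console.log(pickRung({ judgeCalibrated: false })); // 'rung 5: calibrate the judge first'
```
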

## Anti-patterns

- **Reaching for Rung 4 first** because "everything else seems hard." Climb the ladder; lower rungs catch loud failures cheaply.
- **Pretending Rung 4 is calibrated** by running it without checking against humans. Score numbers without calibration are decorative.
- **Skipping Rung 3 because "we have unit tests"** — unit tests with mocked tools prove the agent CALLED the tool. Rung 3 proves the tool's effect happened.
- **Mixing rungs in one assertion**: `expect(answer).toBe('Yes, your refund is being processed' /* exact */)` — when the exact text doesn't matter, rung 1 is the wrong tool.

## Combining rungs

For a serious AI feature, expect to use 2-3 rungs together:

| Feature | Typical rung mix |
|---------|------------------|
| Classifier | Rung 1 (label correctness) + Rung 4 (rationale quality, if exposed to user) |
| RAG chat | Rung 2 (response shape) + Rung 3 (citations are valid URLs/IDs) + Rung 4 (faithfulness) |
| Agent (filing tickets) | Rung 3 (ticket created with correct fields) + Rung 4 (user-facing message tone) |
| Summarization | Rung 2 (length, structure) + Rung 4 (faithfulness, completeness) |
| Tool-use trajectory | Rung 1 (specific tool calls expected) + Rung 4 (intermediate reasoning quality, optional) |
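
The agent row, for instance, combines a state assertion with a check on the reply. A self-contained sketch with in-memory stand-ins — `fakeDb` and `runRefundAgent` are hypothetical, and a real agent would be async and call tools:

```javascript
const fakeDb = { tickets: [] };

// Stub agent: files the ticket and reports back.
function runRefundAgent(message) {
  const orderId = (message.match(/#(\d+)/) || [])[1];
  fakeDb.tickets.push({ order_id: orderId, type: 'refund', status: 'pending' });
  return `Your refund request for order #${orderId} has been filed.`;
}

const reply = runRefundAgent('I want a refund for order #123');

// Rung 3: verify the world, not the agent's word.
const tickets = fakeDb.tickets.filter(t => t.order_id === '123');
console.log(tickets.length === 1 && tickets[0].status === 'pending'); // true

// Cheap check on the reply itself; tone and politeness would need rung 4.
console.log(reply.includes('#123')); // true
```
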

The rule: cheap rungs catch the failures that scream; expensive rungs catch the failures that whisper. You need both.