@brunosps00/dev-workflow 0.13.0 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +106 -122
- package/lib/constants.js +16 -36
- package/lib/migrate-skills.js +11 -4
- package/lib/removed-commands.js +30 -0
- package/package.json +1 -1
- package/scaffold/en/agent-instructions.md +27 -16
- package/scaffold/en/commands/dw-adr.md +2 -2
- package/scaffold/en/commands/dw-analyze-project.md +7 -7
- package/scaffold/en/commands/dw-autopilot.md +20 -20
- package/scaffold/en/commands/dw-brainstorm.md +160 -9
- package/scaffold/en/commands/dw-bugfix.md +7 -6
- package/scaffold/en/commands/dw-commit.md +1 -1
- package/scaffold/en/commands/dw-dockerize.md +9 -9
- package/scaffold/en/commands/dw-find-skills.md +4 -4
- package/scaffold/en/commands/dw-functional-doc.md +2 -2
- package/scaffold/en/commands/dw-generate-pr.md +4 -4
- package/scaffold/en/commands/dw-help.md +95 -351
- package/scaffold/en/commands/dw-intel.md +76 -12
- package/scaffold/en/commands/dw-new-project.md +9 -9
- package/scaffold/en/commands/dw-plan.md +175 -0
- package/scaffold/en/commands/dw-qa.md +166 -0
- package/scaffold/en/commands/dw-redesign-ui.md +7 -7
- package/scaffold/en/commands/dw-review.md +198 -0
- package/scaffold/en/commands/dw-run.md +176 -0
- package/scaffold/en/commands/dw-secure-audit.md +222 -0
- package/scaffold/en/commands/dw-update.md +1 -1
- package/scaffold/en/references/playwright-patterns.md +1 -1
- package/scaffold/en/references/refactoring-catalog.md +1 -1
- package/scaffold/en/templates/brainstorm-matrix.md +1 -1
- package/scaffold/en/templates/idea-onepager.md +3 -3
- package/scaffold/en/templates/project-onepager.md +5 -5
- package/scaffold/pt-br/agent-instructions.md +27 -16
- package/scaffold/pt-br/commands/dw-adr.md +2 -2
- package/scaffold/pt-br/commands/dw-analyze-project.md +7 -7
- package/scaffold/pt-br/commands/dw-autopilot.md +20 -20
- package/scaffold/pt-br/commands/dw-brainstorm.md +160 -9
- package/scaffold/pt-br/commands/dw-bugfix.md +10 -9
- package/scaffold/pt-br/commands/dw-commit.md +1 -1
- package/scaffold/pt-br/commands/dw-dockerize.md +9 -9
- package/scaffold/pt-br/commands/dw-find-skills.md +4 -4
- package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
- package/scaffold/pt-br/commands/dw-generate-pr.md +4 -4
- package/scaffold/pt-br/commands/dw-help.md +97 -300
- package/scaffold/pt-br/commands/dw-intel.md +77 -13
- package/scaffold/pt-br/commands/dw-new-project.md +9 -9
- package/scaffold/pt-br/commands/dw-plan.md +175 -0
- package/scaffold/pt-br/commands/dw-qa.md +166 -0
- package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
- package/scaffold/pt-br/commands/dw-review.md +198 -0
- package/scaffold/pt-br/commands/dw-run.md +176 -0
- package/scaffold/pt-br/commands/dw-secure-audit.md +222 -0
- package/scaffold/pt-br/commands/dw-update.md +1 -1
- package/scaffold/pt-br/references/playwright-patterns.md +1 -1
- package/scaffold/pt-br/references/refactoring-catalog.md +1 -1
- package/scaffold/pt-br/templates/brainstorm-matrix.md +1 -1
- package/scaffold/pt-br/templates/idea-onepager.md +3 -3
- package/scaffold/pt-br/templates/project-onepager.md +5 -5
- package/scaffold/pt-br/templates/tasks-template.md +1 -1
- package/scaffold/skills/api-testing-recipes/SKILL.md +6 -6
- package/scaffold/skills/api-testing-recipes/references/auth-patterns.md +1 -1
- package/scaffold/skills/api-testing-recipes/references/matrix-conventions.md +1 -1
- package/scaffold/skills/api-testing-recipes/references/openapi-driven.md +3 -3
- package/scaffold/skills/docker-compose-recipes/SKILL.md +1 -1
- package/scaffold/skills/dw-codebase-intel/SKILL.md +9 -9
- package/scaffold/skills/dw-codebase-intel/agents/intel-updater.md +4 -4
- package/scaffold/skills/dw-codebase-intel/references/api-design-discipline.md +1 -1
- package/scaffold/skills/dw-codebase-intel/references/incremental-update.md +5 -5
- package/scaffold/skills/dw-codebase-intel/references/intel-format.md +1 -1
- package/scaffold/skills/dw-codebase-intel/references/query-patterns.md +3 -3
- package/scaffold/skills/dw-council/SKILL.md +2 -2
- package/scaffold/skills/dw-debug-protocol/SKILL.md +5 -3
- package/scaffold/skills/dw-execute-phase/SKILL.md +16 -16
- package/scaffold/skills/dw-execute-phase/agents/executor.md +5 -5
- package/scaffold/skills/dw-execute-phase/agents/plan-checker.md +4 -4
- package/scaffold/skills/dw-execute-phase/references/atomic-commits.md +1 -1
- package/scaffold/skills/dw-execute-phase/references/plan-verification.md +2 -2
- package/scaffold/skills/dw-execute-phase/references/wave-coordination.md +1 -1
- package/scaffold/skills/dw-git-discipline/SKILL.md +5 -2
- package/scaffold/skills/dw-incident-response/SKILL.md +168 -0
- package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
- package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
- package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
- package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
- package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
- package/scaffold/skills/dw-llm-eval/SKILL.md +150 -0
- package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
- package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
- package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
- package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
- package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
- package/scaffold/skills/dw-memory/SKILL.md +2 -2
- package/scaffold/skills/dw-review-rigor/SKILL.md +5 -5
- package/scaffold/skills/dw-simplification/SKILL.md +4 -4
- package/scaffold/skills/dw-source-grounding/SKILL.md +1 -1
- package/scaffold/skills/dw-testing-discipline/SKILL.md +103 -78
- package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
- package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +7 -7
- package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
- package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +3 -3
- package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +1 -1
- package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +3 -3
- package/scaffold/skills/dw-ui-discipline/SKILL.md +103 -79
- package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +2 -2
- package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
- package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +1 -1
- package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
- package/scaffold/skills/dw-verify/SKILL.md +4 -4
- package/scaffold/skills/humanizer/SKILL.md +1 -7
- package/scaffold/skills/remotion-best-practices/SKILL.md +3 -1
- package/scaffold/skills/security-review/SKILL.md +1 -1
- package/scaffold/skills/security-review/languages/csharp.md +1 -1
- package/scaffold/skills/security-review/languages/rust.md +1 -1
- package/scaffold/skills/security-review/languages/typescript.md +1 -1
- package/scaffold/skills/vercel-react-best-practices/SKILL.md +3 -1
- package/scaffold/templates-overrides-readme.md +3 -3
- package/scaffold/en/commands/dw-code-review.md +0 -385
- package/scaffold/en/commands/dw-create-prd.md +0 -148
- package/scaffold/en/commands/dw-create-tasks.md +0 -195
- package/scaffold/en/commands/dw-create-techspec.md +0 -210
- package/scaffold/en/commands/dw-deep-research.md +0 -418
- package/scaffold/en/commands/dw-deps-audit.md +0 -327
- package/scaffold/en/commands/dw-fix-qa.md +0 -152
- package/scaffold/en/commands/dw-map-codebase.md +0 -125
- package/scaffold/en/commands/dw-refactoring-analysis.md +0 -340
- package/scaffold/en/commands/dw-revert-task.md +0 -114
- package/scaffold/en/commands/dw-review-implementation.md +0 -349
- package/scaffold/en/commands/dw-run-plan.md +0 -300
- package/scaffold/en/commands/dw-run-qa.md +0 -496
- package/scaffold/en/commands/dw-run-task.md +0 -209
- package/scaffold/en/commands/dw-security-check.md +0 -271
- package/scaffold/pt-br/commands/dw-code-review.md +0 -365
- package/scaffold/pt-br/commands/dw-create-prd.md +0 -148
- package/scaffold/pt-br/commands/dw-create-tasks.md +0 -195
- package/scaffold/pt-br/commands/dw-create-techspec.md +0 -208
- package/scaffold/pt-br/commands/dw-deep-research.md +0 -172
- package/scaffold/pt-br/commands/dw-deps-audit.md +0 -327
- package/scaffold/pt-br/commands/dw-fix-qa.md +0 -152
- package/scaffold/pt-br/commands/dw-map-codebase.md +0 -125
- package/scaffold/pt-br/commands/dw-refactoring-analysis.md +0 -340
- package/scaffold/pt-br/commands/dw-revert-task.md +0 -114
- package/scaffold/pt-br/commands/dw-review-implementation.md +0 -337
- package/scaffold/pt-br/commands/dw-run-plan.md +0 -296
- package/scaffold/pt-br/commands/dw-run-qa.md +0 -494
- package/scaffold/pt-br/commands/dw-run-task.md +0 -208
- package/scaffold/pt-br/commands/dw-security-check.md +0 -271
- package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
- package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
- package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162

package/scaffold/skills/dw-llm-eval/references/agent-eval.md
@@ -0,0 +1,252 @@

# Agent evaluation — outcome vs trajectory

Agent eval has a foundational question: do you grade **what the agent did along the way** (trajectory) or **what state the world is in at the end** (outcome)?

The answer determines what you measure and what failure modes you catch.

## Outcome-only evaluation (recommended default)

**What it checks:** at the end of the agent's run, does the world look the way it should? Was the right tool called? Was the right ticket filed? Was the user's question answered correctly?

**Pattern:**

```javascript
test('agent files refund ticket when user requests refund', async () => {
  await agent.run('I want a refund for order #123');

  // Outcome assertions
  const tickets = await db.tickets.findMany({ where: { order_id: '123' } });
  expect(tickets).toHaveLength(1);
  expect(tickets[0].type).toBe('refund');

  const userMessage = agent.lastMessage();
  expect(userMessage).toMatch(/refund.*(processed|filed|submitted)/i);
});
```
**Strengths:**
- Permits creative paths — agent solved it via tool A → B → C OR via tool A → C → B; both pass if the outcome is right.
- Robust to internal refactor — restructuring the agent's prompt or tool descriptions doesn't break the test as long as the outcome holds.
- Aligned with what users care about: did the system do the right thing?

**Weaknesses:**
- Misses "ghost actions" — agent claims to have done X but the outcome state shows it didn't.
  - Defense: combine with rung-3 outcome-state assertions (DB writes, API calls). Don't trust the agent's word.
- Misses inefficiency — agent took 17 tool calls to do what should be 3. Outcome OK but cost is bad.
  - Defense: track tool-call count as a separate metric; alert if it exceeds budget.

## Trajectory evaluation (when path matters)

**What it checks:** did the agent take the expected sequence (or set) of tool calls? Match against a reference trajectory.

**Use when:**
- Compliance / audit requires specific actions in specific order (e.g., "ALWAYS verify identity before disclosing balance").
- Safety-critical: a specific tool MUST be called (e.g., "if user mentions self-harm, must invoke `escalate-to-human` BEFORE any other action").
- The path itself is the contract (e.g., a workflow agent that must traverse a specific decision tree).

## Trajectory match modes

Four modes, from strictest to most permissive:

### Strict

**Rule:** actual trajectory contains identical tool calls in identical order with identical arguments.

**Use when:** path AND parameters are both part of the contract. Compliance, deterministic workflow agents.

```javascript
expect(actualToolCalls).toEqual([
  { name: 'verify_identity', args: { user_id: 'u-42' } },
  { name: 'get_balance', args: { account_id: 'a-99' } },
  { name: 'respond_to_user', args: { template: 'balance-inquiry' } },
]);
```

### Unordered

**Rule:** actual contains the same set of tool calls, any order; arguments match.

**Use when:** the agent legitimately may parallelize or reorder calls without affecting correctness.

```javascript
expect(new Set(actualToolNames)).toEqual(new Set(['fetch_user', 'fetch_orders', 'fetch_addresses']));
```

### Subset

**Rule:** actual trajectory is a SUBSET of reference — agent didn't exceed expected tool calls.

**Use when:** frugality / cost discipline — "agent should NOT call expensive tools unnecessarily."

```javascript
// Reference is the maximum allowed set
const referenceToolCalls = ['fetch_user', 'classify_intent', 'respond'];
const allActualInReference = actualToolNames.every(t => referenceToolCalls.includes(t));
expect(allActualInReference).toBe(true);
```

### Superset

**Rule:** actual contains ALL reference tool calls, possibly plus extras.

**Use when:** specific tools are mandatory but extras are acceptable. Often safety-critical ("MUST log audit event") while permitting agent autonomy in other tool choices.

```javascript
// Reference is the minimum required set
const requiredToolCalls = ['log_audit_event', 'verify_user'];
const allRequiredCalled = requiredToolCalls.every(t => actualToolNames.includes(t));
expect(allRequiredCalled).toBe(true);
```

## Argument matching strategies

When trajectory matching, how should tool ARGUMENTS be compared?

| Strategy | Behavior | Use |
|----------|----------|-----|
| **Exact** | Arguments must match byte-for-byte | Deterministic args (IDs, fixed strings) |
| **Ignore** | Any call to the right tool counts | When the call ITSELF is what matters, not args |
| **Subset** | Actual args contain at least the reference args | Required fields enforced; extras OK |
| **Superset** | Actual args are within reference args set | Frugality — agent didn't add unexpected fields |
| **Custom comparator** | Per-tool comparison function | Domain-specific equivalence (case-insensitive, semantic match) |

Example with custom comparator:

```javascript
const matchers = {
  'search_cities': (actualArgs, refArgs) => {
    // City name comparison: case-insensitive, trimmed
    return actualArgs.name.toLowerCase().trim() === refArgs.name.toLowerCase().trim();
  },
  'fetch_user': 'exact',
};
```

## Decision tree: outcome vs trajectory

```
Is correctness defined by the FINAL STATE or by the PATH?
│
├── Final state only — agent can solve it however it likes
│     → Outcome-only eval. Combine with rung-3 state assertions.
│
├── Specific tool calls MUST happen for compliance/safety
│     → Superset trajectory mode. Outcome too, as a separate check.
│
├── Specific tool calls MUST NOT happen (cost / privacy)
│     → Subset trajectory mode.
│
├── The full workflow path is the contract (legal, audit)
│     → Strict trajectory mode.
│
└── Path is partly fixed, partly free
      → Trajectory with custom comparator OR split into multiple smaller tests.
```
## Cost vs accuracy tracking

Two metrics that trajectory eval naturally enables:

### Tool-call efficiency

```python
def efficiency_score(actual_trajectory, reference_trajectory):
    actual_calls = len(actual_trajectory)
    reference_calls = len(reference_trajectory)
    if reference_calls == 0:
        return 1.0
    return min(1.0, reference_calls / actual_calls)
```

Score 1.0 = matched or beat reference. Score 0.5 = took 2× the expected calls.

Track this over time; agent regressions sometimes show up as efficiency loss before outcome loss.

### Step-count percentile

```
Run 2026-05-12:
  p50 tool calls: 4  (reference 3)
  p95 tool calls: 9  (reference 6)
  p99 tool calls: 18 (reference 12)
```

p99 spikes catch cases where the agent enters a loop or backtracks excessively — outcome may still be correct but cost runaway is real.
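A minimal sketch of how a report like the one above could be produced from recorded trajectory lengths — the nearest-rank `percentile` helper and the report shape are illustrative, not part of the package:

```javascript
// Nearest-rank percentile over per-case tool-call counts.
function percentile(sortedValues, p) {
  const idx = Math.min(sortedValues.length - 1, Math.ceil((p / 100) * sortedValues.length) - 1);
  return sortedValues[Math.max(0, idx)];
}

function stepCountReport(trajectories) {
  // Each trajectory is the list of tool calls for one eval case.
  const counts = trajectories.map(t => t.length).sort((a, b) => a - b);
  return {
    p50: percentile(counts, 50),
    p95: percentile(counts, 95),
    p99: percentile(counts, 99),
  };
}
```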
## Dataset structure for agents

```json
{
  "id": "agent-case-001",
  "input": {
    "user_message": "Cancel my order #123 and request a refund"
  },
  "expected_outcome": {
    "tickets_created": [
      { "type": "refund", "order_id": "123" }
    ],
    "order_status": "cancelled"
  },
  "expected_trajectory": {
    "mode": "superset",
    "required_calls": [
      { "name": "verify_user_owns_order", "args_match": "exact" },
      { "name": "update_order_status", "args_match": "subset", "args": { "status": "cancelled" } },
      { "name": "create_refund_ticket", "args_match": "subset" }
    ],
    "forbidden_calls": ["delete_user_data"]
  },
  "tool_budget": { "p95_max_calls": 8 }
}
```
The `forbidden_calls` field is powerful — explicitly enumerate tools that MUST NOT fire for this input class. Catches "agent escalated to a dangerous tool that wasn't necessary."

## Combining outcome + trajectory

For serious agent eval, combine both:

```javascript
test('agent handles refund request', async () => {
  const result = await agent.run(testCase.input);

  // Outcome
  expectOutcomeMatch(result.outcome, testCase.expected_outcome);

  // Trajectory — superset mode (required tools called)
  expectTrajectoryMatch(result.trajectory, testCase.expected_trajectory, 'superset');

  // Forbidden — none of these tools fired
  for (const forbidden of testCase.expected_trajectory.forbidden_calls) {
    expect(result.trajectory.some(t => t.name === forbidden)).toBe(false);
  }

  // Budget — didn't exceed expected tool calls
  expect(result.trajectory.length).toBeLessThanOrEqual(testCase.tool_budget.p95_max_calls);
});
```
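`expectOutcomeMatch` and `expectTrajectoryMatch` are left undefined above; a minimal sketch of what the superset-mode trajectory check might look like, assuming the dataset shape from the previous section (helper names and shapes are assumptions, not part of the package):

```javascript
// Sketch: superset-mode check — every required call must appear in the actual
// trajectory, honoring the per-call args_match strategy from the dataset.
function expectTrajectoryMatch(trajectory, expected, mode = 'superset') {
  if (mode !== 'superset') throw new Error(`mode ${mode} not sketched here`);
  for (const required of expected.required_calls) {
    const hit = trajectory.find(call =>
      call.name === required.name && argsMatch(call.args, required.args, required.args_match)
    );
    expect(hit).toBeDefined();
  }
}

function argsMatch(actual, reference, strategy = 'ignore') {
  if (strategy === 'ignore' || !reference) return true;
  if (strategy === 'exact') return JSON.stringify(actual) === JSON.stringify(reference);
  // 'subset': every reference key must be present in actual with the same value
  return Object.entries(reference).every(([k, v]) => JSON.stringify(actual?.[k]) === JSON.stringify(v));
}
```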
## LLM-as-judge for agent quality

Beyond mechanical trajectory matching, judge for:
- Was the agent's intermediate reasoning sound? (rubric: logical, evidence-based, non-hallucinated)
- Was the final user message appropriate? (rubric: tone, completeness, accuracy)
- Did the agent handle ambiguity well? (rubric: did it ask for clarification when needed?)

These are rung-4 evaluations on top of rung-1/2/3 outcome and trajectory checks.

## Anti-patterns

- **Trajectory-only eval** → punishes creative paths; brittle to refactor; ignores real outcome.
- **Outcome-only eval without state assertion** → trusts the agent's word; misses ghost actions.
- **Strict trajectory mode when subset/superset would do** → false negatives every time the agent legitimately reorders.
- **No tool-budget tracking** → agent regresses to expensive paths; you don't notice until the bill spikes.
- **No `forbidden_calls` enumeration** → agent silently learns to call dangerous tools.

## Tools

- `langchain-ai/agentevals` (MIT) — Python library implementing all four trajectory match modes + LLM-as-judge for trajectories. Source of the taxonomy above.
- `langsmith` — observability + eval orchestration; tracks experiments over time.
- Custom implementation — the modes above are ~50 lines each in any language.

The discipline isn't the library choice; it's choosing outcome-vs-trajectory deliberately, picking the right match mode, and tracking efficiency alongside accuracy.

package/scaffold/skills/dw-llm-eval/references/judge-calibration.md
@@ -0,0 +1,169 @@

# LLM-as-judge calibration — how to make rung 4 mean something

LLM-as-judge sounds simple: a model grades the output. In practice, without calibration it produces NUMBERS WITHOUT SIGNAL — judge scores drift with the model, with rubric phrasing, with prompt minutiae. You read "judge says 4.2 average" and have no idea if that means the system is good.

Calibration anchors the judge to human assessment. After calibration, a judge score has meaning. Before, it doesn't.

## The three non-negotiables

### 1. Calibrate against ≥20 human-graded cases

Process:

1. Sample ≥20 cases from the reference dataset (or representative production traffic).
2. Have ≥1 domain expert grade each case using the same rubric the judge will use. Multiple humans per case is better (inter-rater agreement is useful signal).
3. Run the judge against the same cases.
4. Compute Spearman rank correlation between human scores and judge scores.

**Target:** Spearman ≥0.80.
**Acceptable:** 0.70-0.80 with documented rationale (e.g., "subjective tone judgments inherently noisy").
**Reject:** <0.70. The judge is not measuring what you think it's measuring.
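A minimal sketch of the Spearman computation over paired human/judge scores, assuming two equal-length arrays (ties handled via average ranks; function names are illustrative):

```javascript
// Spearman rank correlation = Pearson correlation computed on ranks.
// Ties get the average of the ranks they span.
function ranks(values) {
  const indexed = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const out = new Array(values.length);
  let pos = 0;
  while (pos < indexed.length) {
    let end = pos;
    while (end + 1 < indexed.length && indexed[end + 1].v === indexed[pos].v) end++;
    const avgRank = (pos + end) / 2 + 1; // ranks are 1-based
    for (let k = pos; k <= end; k++) out[indexed[k].i] = avgRank;
    pos = end + 1;
  }
  return out;
}

function pearson(xs, ys) {
  const n = xs.length;
  const mean = a => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

const spearman = (humanScores, judgeScores) =>
  pearson(ranks(humanScores), ranks(judgeScores));

// spearman([5, 4, 4, 2, 1], [5, 5, 3, 2, 1]) ≈ 0.92 — judge tracks the human ranking closely.
```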
### 2. Use a different model than the system under test

A model judging its own output produces false positives. The judge agrees with itself even when wrong because it shares the same biases and blind spots.

Pairing examples:
- System: GPT-4 → Judge: Claude Opus.
- System: Claude Sonnet → Judge: GPT-4o.
- System: Gemini → Judge: Claude.

If both system and judge MUST be from the same provider, at minimum use different model sizes (Sonnet judges Opus output, not vice versa).

### 3. Structured rubric, not free-form scoring

"Rate this answer 1-10" → noise. Different runs give different scores; different humans disagree wildly; the score has no anchor.

Structured rubric: ≥3 criteria, each with a defined scale and an example per score point.

Example rubric for FAITHFULNESS (RAG):

```markdown
# Faithfulness rubric (1-5 scale)

Score each answer against the retrieved context. A faithful answer makes claims supported by the context; an unfaithful one fabricates or extrapolates.

## 1 — Severely unfaithful
The answer contains claims that contradict the context, or fabricates facts not present in any chunk. Example: context says "Q3 revenue was $1.2M"; answer says "Q3 revenue exceeded $5M."

## 2 — Mostly unfaithful
The answer mixes context-supported and fabricated claims, where the fabrication is meaningful. Example: cites a study that wasn't in the context.

## 3 — Mixed
Half the answer is grounded; half is reasonable inference or generalization beyond the context. Example: context describes the API; answer adds advice not derivable from context.

## 4 — Mostly faithful
All claims are supported by context; minor paraphrasing or summarization without distortion. Example: rewords a passage accurately.

## 5 — Strictly faithful
Every claim is directly traceable to a specific chunk; no information added beyond what context contains. Example: quotes-with-attribution style.
```

Provide this rubric INSIDE the judge prompt. Free-form is forbidden.

## The calibration loop

```
1. Sample 20-30 cases for calibration set.
2. Human-grade them blind (without seeing other graders or judge).
3. Run judge with rubric.
4. Compute Spearman vs human scores.
5. If <0.70:
   - Examine disagreements: where does judge consistently miss?
   - Refine rubric: more specific scale, more examples, narrower scope.
   - OR switch judge model: try a different vendor/size.
   - Re-run steps 3-4.
6. If 0.70-0.80: document the noise floor; accept with caveats.
7. If ≥0.80: judge is calibrated. Save the rubric + judge config in version control.
```

Calibration is one-time-per-config but RECURRING-PER-MODEL-CHANGE. Every model swap (you upgrade GPT-4 to GPT-5; vendor deprecates Opus 4.7) invalidates the calibration. Re-calibrate.

## Judge drift monitoring

After deployment:
- Re-run calibration set monthly.
- Plot Spearman over time.
- Alert if Spearman drops below 0.75 between calibration runs — the judge has drifted (model update, rubric got stale, traffic distribution shifted).

```
.dw/eval/judges/<feature>/
├── rubric.md                      # the rubric, version-controlled
├── calibration-2026-05-12.jsonl   # 20+ cases with human + judge scores
├── spearman-2026-05-12.txt        # 0.84
├── calibration-2026-08-12.jsonl   # quarterly re-calibration
└── spearman-2026-08-12.txt        # 0.81
```

## Rubric design patterns

### DO

- **3-5 criteria** per rubric (one for each dimension you care about: faithfulness, completeness, tone, format, ...).
- **1-5 scale** with anchored descriptions per point (not 1-10 — too granular for reliable agreement).
- **Example per score point** showing the kind of output that earns that score.
- **Explicit "what to ignore"** — e.g., "ignore minor grammar; score on substance."

### DON'T

- Single-criterion "quality" score — too vague to calibrate.
- 1-100 scales — humans can't reliably distinguish 73 from 76.
- Rubrics longer than 500 words — the judge skips and lazy-scores.
- "Holistic" scoring without breakdown — opaque to debug.

## Multi-criterion rubrics

For complex outputs (RAG, agents), one number rarely captures quality. Use per-criterion scores:

```json
{
  "faithfulness": 4,
  "completeness": 3,
  "tone": 5,
  "format": 5,
  "overall": null
}
```

Aggregate as needed downstream (weighted average, minimum, "all must be ≥3"). Don't have the judge compute the aggregate — bias compounds.
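A minimal sketch of downstream aggregation over such a per-criterion object — a hard gate plus a weighted average; the weights and function name are illustrative:

```javascript
// Aggregate per-criterion judge scores outside the judge itself.
function aggregate(scores, weights = { faithfulness: 0.4, completeness: 0.3, tone: 0.2, format: 0.1 }) {
  const criteria = Object.keys(weights);
  // Hard gate: every criterion must clear the floor.
  const passesFloor = criteria.every(c => scores[c] >= 3);
  // Weighted average for ranking/trending, independent of the gate.
  const weighted = criteria.reduce((sum, c) => sum + scores[c] * weights[c], 0);
  return { passesFloor, weighted };
}

// aggregate({ faithfulness: 4, completeness: 3, tone: 5, format: 5 })
// → { passesFloor: true, weighted: 4.0 }
```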
## Anti-patterns

- **Judge with no rubric.** "Rate this 1-10." Numbers, no signal.
- **Judge is the system being tested.** False positives baked in.
- **No calibration evidence in PR.** "We added LLM-as-judge" — okay, what's the Spearman?
- **Rubric stuffed with all criteria in one prompt** → judge lazy-scores. Split into criterion-per-call if needed.
- **Calibration done once, never revisited.** Model upgrades silently break it. Re-calibrate monthly or per model swap.
- **Judge scoring its own scoring.** Recursive trust collapse.

## Bias to watch

LLM judges have characteristic biases:

- **Length bias** — longer outputs score higher even when shorter is better. Normalize length in the rubric.
- **Self-similarity bias** — judges rate outputs that resemble their own writing higher. Cross-model pairing helps.
- **Position bias** (in comparative judging) — first item often wins. Randomize order, run both A/B and B/A (see the sketch after this list).
- **Recency bias** — last item in context is overweighted. Vary order.
- **Sycophancy** — judges agree with strongly-stated input even when wrong. Frame the judge prompt neutrally.
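A minimal sketch of counterbalancing position bias in pairwise judging, assuming a `judgePair(a, b)` call that returns which item won (the name and return shape are assumptions):

```javascript
// Run the comparison in both orders; only trust a verdict both orders agree on.
async function judgePairCounterbalanced(a, b) {
  const first = await judgePair(a, b);   // returns 'A' | 'B' | 'tie'
  const second = await judgePair(b, a);  // same items, swapped positions
  const flipped = second === 'A' ? 'B' : second === 'B' ? 'A' : 'tie';
  if (first === flipped) return first;   // consistent across orderings
  return 'inconsistent';                 // position-sensitive — flag for human review
}
```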
Document which biases you tested for in the calibration write-up.

## Cost discipline

LLM-as-judge can dominate eval costs. At $0.01-$0.10 per judgment, 100 cases × 4 rubric criteria × monthly = real money.

Optimizations (in order of impact):
1. Run judge against SAMPLES, not the whole dataset every time. 50 random cases weekly catches regression.
2. Use the cheapest model that maintains Spearman ≥0.80. GPT-4o mini may calibrate as well as GPT-4 for your rubric.
3. Batch judge calls when the API supports it.
4. Cache judge results per (input, output, rubric-version) tuple — same eval run shouldn't pay twice (see the sketch after this list).
5. Skip judge for cases where rungs 1-3 already failed — they're broken; no point asking subjective quality.
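A minimal sketch of the per-tuple cache from item 4, keyed on a hash of input, output, and rubric version — the in-memory store and the `judge` callback are illustrative:

```javascript
import { createHash } from 'node:crypto';

// Key judge results on (input, output, rubric-version) so a re-run of the
// same eval never pays for the same judgment twice.
const judgeCache = new Map();

async function cachedJudge({ input, output, rubricVersion, judge }) {
  const key = createHash('sha256')
    .update(JSON.stringify({ input, output, rubricVersion }))
    .digest('hex');
  if (!judgeCache.has(key)) {
    judgeCache.set(key, await judge({ input, output }));
  }
  return judgeCache.get(key);
}
```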
## When NOT to use LLM-as-judge

- The output has a deterministic correct answer. Use rung 1 or 2.
- The output has a measurable side effect. Use rung 3.
- The team won't budget for calibration. The judge will produce noise.
- The rubric can't be written in <500 words. The criterion is too vague.

A poorly-calibrated judge is worse than no judge: it gives false confidence. Better to ship with "tested manually by domain expert on 20 cases" than with "judge score 4.1" that means nothing.

package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md
@@ -0,0 +1,171 @@

# Oracle ladder — climb deliberately

Five rungs ordered by cost (cheap → expensive) and rigor (strict → subjective). Start at the bottom. Every rung up costs an order of magnitude more in latency, money, or calibration effort. Don't reach for an upper rung when a lower one can prove the case.

## Rung 1 — Exact match

**What it checks:** the output equals the expected output, byte-for-byte (or after a normalization step like JSON canonicalization).

**Use when:**
- Output is a structured function call: `expect(toolCalls[0]).toEqual({ name: 'search', args: { q: 'invoices' } })`.
- Output is a classification from a fixed label set: `expect(label).toBe('refund-request')`.
- Output is a parsed value from a JSON contract: `expect(result.user_id).toBe('u-42')`.

**Example:**

```javascript
test('classifier labels refund requests correctly', async () => {
  const cases = await loadDataset('.dw/eval/datasets/classifier/cases.jsonl');
  for (const c of cases.filter(c => c.expected === 'refund-request')) {
    expect(await classify(c.input)).toBe('refund-request');
  }
});
```

**Cost:** ~free.
**Limitation:** can't handle creative outputs (paragraphs, summaries). Don't try to force-fit.

## Rung 2 — Schema validation

**What it checks:** the output matches a structural contract — types, required fields, value ranges. The SHAPE is fixed; specific values can vary.

**Use when:**
- LLM returns structured data with stable schema (JSON, function call args) but variable content.
- You need to detect "agent returned garbage" without asserting on the exact garbage.

**Example:**

```typescript
import { z } from 'zod';

const ResponseSchema = z.object({
  summary: z.string().min(20).max(500),
  citations: z.array(z.object({
    url: z.string().url(),
    page: z.number().int().optional(),
  })).min(1),
  confidence: z.number().min(0).max(1),
});

test('summarizer returns valid shape', async () => {
  const result = await summarize(input);
  expect(() => ResponseSchema.parse(result)).not.toThrow();
});
```

**Cost:** ~free (schema check is cheap).
**Limitation:** doesn't tell you if the CONTENT is correct, only that it's the right shape. Pair with another rung.

## Rung 3 — Outcome state

**What it checks:** a side effect occurred — DB row was created, file was written, tool was called with valid arguments, ticket was opened. The state of the world matches expectations.

**Use when:**
- Agent has tool access and the GOAL is to change state, not produce prose.
- RAG answer is supposed to lead to an action (e.g., "user clicked the suggested invoice and reconciled it").
- The system has observable side effects you can query post-hoc.

**Example:**

```javascript
test('agent files refund request when user asks', async () => {
  await agent.run('I want a refund for order #123');

  const tickets = await db.tickets.findMany({ where: { order_id: '123' } });
  expect(tickets).toHaveLength(1);
  expect(tickets[0].type).toBe('refund');
  expect(tickets[0].status).toBe('pending');
});
```

**Cost:** cheap (1 DB query / API call per assertion).
**Limitation:** doesn't validate the PROSE the agent produced along the way. If the goal was "answer the user politely AND file the refund," rung 3 catches the action but not the politeness — climb to rung 4 for that.

**Key benefit:** catches "ghost actions" — agent claims to have done X but didn't actually do it. Rungs 1-2 trust the agent's word; rung 3 verifies the world.

## Rung 4 — LLM-as-judge

**What it checks:** a different model grades the output against a rubric. Used for genuinely subjective quality — helpfulness, tone, faithfulness, completeness.

**Mandatory before using:**
- Calibrated against ≥20 human-graded cases (Spearman ≥0.80) — see `judge-calibration.md`.
- Different model than the system under test.
- Structured rubric, not free-form "rate 1-10."

**Example:**

```javascript
test('chat response is faithful to retrieved context', async () => {
  const cases = await loadDataset('.dw/eval/datasets/rag-chat/cases.jsonl');
  const scores = [];

  for (const c of cases) {
    const answer = await chat(c.input, c.context);
    const judgment = await llmJudge({
      model: 'claude-opus-4-7', // different from system under test (GPT-4)
      rubric: faithfulnessRubric,
      input: c.input,
      context: c.context,
      output: answer,
    });
    scores.push(judgment.score);
  }

  // 80% of cases must score ≥4 on the 1-5 faithfulness rubric
  const passing = scores.filter(s => s >= 4).length / scores.length;
  expect(passing).toBeGreaterThan(0.8);
});
```

**Cost:** medium-to-high (one judge call per case; pay per case at API rates).
**Limitation:** the judge has bias and drift; without calibration, you're measuring the judge's mood. Re-calibrate every quarter, every model swap, and after rubric changes.

## Rung 5 — Human review

**What it checks:** a domain expert scores. The gold standard for the rubrics rung 4 calibrates against.

**Use when:**
- Calibrating LLM-as-judge (rung 4 setup).
- High-stakes outputs where automation isn't trusted (medical, legal, financial).
- Edge cases that automated rungs flag as borderline.

**Cost:** expensive. Don't scale; sample.

**Pattern:**
- Spot-check 5-10% of LLM-as-judge results randomly each week.
- Whenever LLM-as-judge score is "borderline" (e.g., 2.5-3.5 on 1-5 scale), kick to human.
- Full human review only for the calibration dataset and high-stakes edge cases.
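A minimal sketch of that routing pattern, assuming judged results carry a 1-5 score and some human-review queue exists (the queue API, threshold values, and names are illustrative):

```javascript
// Route each judged case: borderline scores go to a human, plus a random
// spot-check of the rest so judge drift doesn't go unnoticed.
function needsHumanReview(judgeScore, spotCheckRate = 0.05) {
  const borderline = judgeScore >= 2.5 && judgeScore <= 3.5; // on the 1-5 scale
  const spotCheck = Math.random() < spotCheckRate;           // 5-10% random sample
  return borderline || spotCheck;
}

for (const result of judgedResults) {
  if (needsHumanReview(result.score)) {
    await humanReviewQueue.push(result); // hypothetical queue
  }
}
```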
## The climbing decision tree

```
Is the output a fixed-structure value (function call, classification, JSON with stable shape)?
├── YES → Rung 1 (exact match) or Rung 2 (schema)
└── NO → does the output cause an observable side effect (DB write, tool call, ticket opened)?
    ├── YES → Rung 3 (outcome state)
    └── NO → output is subjective (prose, summary, recommendation). Rung 4 required.
        └── Did you calibrate the judge against humans (≥20 cases, Spearman ≥0.80)?
            ├── YES → Rung 4 is valid signal
            └── NO → DO NOT USE Rung 4 yet. Calibrate first via Rung 5.
```

## Anti-patterns

- **Reaching for Rung 4 first** because "everything else seems hard." Climb the ladder; lower rungs catch loud failures cheaply.
- **Pretending Rung 4 is calibrated** by running it without checking against humans. Score numbers without calibration are decorative.
- **Skipping Rung 3 because "we have unit tests"** — unit tests with mocked tools prove the agent CALLED the tool. Rung 3 proves the tool's effect happened.
- **Mixing rungs in one assertion**: `expect(answer).toBe('Yes, your refund is being processed' /* exact */)` — when the exact text doesn't matter, rung 1 is the wrong tool.

## Combining rungs

For a serious AI feature, expect to use 2-3 rungs together:

| Feature | Typical rung mix |
|---------|------------------|
| Classifier | Rung 1 (label correctness) + Rung 4 (rationale quality, if exposed to user) |
| RAG chat | Rung 2 (response shape) + Rung 3 (citations are valid URLs/IDs) + Rung 4 (faithfulness) |
| Agent (filing tickets) | Rung 3 (ticket created with correct fields) + Rung 4 (user-facing message tone) |
| Summarization | Rung 2 (length, structure) + Rung 4 (faithfulness, completeness) |
| Tool-use trajectory | Rung 1 (specific tool calls expected) + Rung 4 (intermediate reasoning quality, optional) |

The rule: cheap rungs catch the failures that scream; expensive rungs catch the failures that whisper. You need both.