@bgicli/bgicli 2.2.8 → 2.2.9
- package/data/skills/anthropic-algorithmic-art/SKILL.md +405 -0
- package/data/skills/anthropic-canvas-design/SKILL.md +130 -0
- package/data/skills/anthropic-claude-api/SKILL.md +243 -0
- package/data/skills/anthropic-doc-coauthoring/SKILL.md +375 -0
- package/data/skills/anthropic-docx/SKILL.md +590 -0
- package/data/skills/anthropic-frontend-design/SKILL.md +42 -0
- package/data/skills/anthropic-internal-comms/SKILL.md +32 -0
- package/data/skills/anthropic-mcp-builder/SKILL.md +236 -0
- package/data/skills/anthropic-pdf/SKILL.md +314 -0
- package/data/skills/anthropic-pptx/SKILL.md +232 -0
- package/data/skills/anthropic-skill-creator/SKILL.md +485 -0
- package/data/skills/anthropic-webapp-testing/SKILL.md +96 -0
- package/data/skills/anthropic-xlsx/SKILL.md +292 -0
- package/data/skills/arxiv-database/SKILL.md +362 -0
- package/data/skills/astropy/SKILL.md +329 -0
- package/data/skills/ctx-advanced-evaluation/SKILL.md +402 -0
- package/data/skills/ctx-bdi-mental-states/SKILL.md +311 -0
- package/data/skills/ctx-context-compression/SKILL.md +272 -0
- package/data/skills/ctx-context-degradation/SKILL.md +206 -0
- package/data/skills/ctx-context-fundamentals/SKILL.md +201 -0
- package/data/skills/ctx-context-optimization/SKILL.md +195 -0
- package/data/skills/ctx-evaluation/SKILL.md +251 -0
- package/data/skills/ctx-filesystem-context/SKILL.md +287 -0
- package/data/skills/ctx-hosted-agents/SKILL.md +260 -0
- package/data/skills/ctx-memory-systems/SKILL.md +225 -0
- package/data/skills/ctx-multi-agent-patterns/SKILL.md +257 -0
- package/data/skills/ctx-project-development/SKILL.md +291 -0
- package/data/skills/ctx-tool-design/SKILL.md +271 -0
- package/data/skills/dhdna-profiler/SKILL.md +162 -0
- package/data/skills/generate-image/SKILL.md +183 -0
- package/data/skills/geomaster/SKILL.md +365 -0
- package/data/skills/get-available-resources/SKILL.md +275 -0
- package/data/skills/hamelsmu-build-review-interface/SKILL.md +96 -0
- package/data/skills/hamelsmu-error-analysis/SKILL.md +164 -0
- package/data/skills/hamelsmu-eval-audit/SKILL.md +183 -0
- package/data/skills/hamelsmu-evaluate-rag/SKILL.md +177 -0
- package/data/skills/hamelsmu-generate-synthetic-data/SKILL.md +131 -0
- package/data/skills/hamelsmu-validate-evaluator/SKILL.md +212 -0
- package/data/skills/hamelsmu-write-judge-prompt/SKILL.md +144 -0
- package/data/skills/hf-cli/SKILL.md +174 -0
- package/data/skills/hf-mcp/SKILL.md +178 -0
- package/data/skills/hugging-face-dataset-viewer/SKILL.md +121 -0
- package/data/skills/hugging-face-datasets/SKILL.md +542 -0
- package/data/skills/hugging-face-evaluation/SKILL.md +651 -0
- package/data/skills/hugging-face-jobs/SKILL.md +1042 -0
- package/data/skills/hugging-face-model-trainer/SKILL.md +717 -0
- package/data/skills/hugging-face-paper-pages/SKILL.md +239 -0
- package/data/skills/hugging-face-paper-publisher/SKILL.md +624 -0
- package/data/skills/hugging-face-tool-builder/SKILL.md +110 -0
- package/data/skills/hugging-face-trackio/SKILL.md +115 -0
- package/data/skills/hugging-face-vision-trainer/SKILL.md +593 -0
- package/data/skills/huggingface-gradio/SKILL.md +245 -0
- package/data/skills/matlab/SKILL.md +376 -0
- package/data/skills/modal/SKILL.md +381 -0
- package/data/skills/openai-cloudflare-deploy/SKILL.md +224 -0
- package/data/skills/openai-develop-web-game/SKILL.md +149 -0
- package/data/skills/openai-doc/SKILL.md +80 -0
- package/data/skills/openai-figma/SKILL.md +42 -0
- package/data/skills/openai-figma-implement-design/SKILL.md +264 -0
- package/data/skills/openai-gh-address-comments/SKILL.md +25 -0
- package/data/skills/openai-gh-fix-ci/SKILL.md +69 -0
- package/data/skills/openai-imagegen/SKILL.md +174 -0
- package/data/skills/openai-jupyter-notebook/SKILL.md +107 -0
- package/data/skills/openai-linear/SKILL.md +87 -0
- package/data/skills/openai-netlify-deploy/SKILL.md +247 -0
- package/data/skills/openai-notion-knowledge-capture/SKILL.md +56 -0
- package/data/skills/openai-notion-meeting-intelligence/SKILL.md +60 -0
- package/data/skills/openai-notion-research-documentation/SKILL.md +59 -0
- package/data/skills/openai-notion-spec-to-implementation/SKILL.md +58 -0
- package/data/skills/openai-openai-docs/SKILL.md +69 -0
- package/data/skills/openai-pdf/SKILL.md +67 -0
- package/data/skills/openai-playwright/SKILL.md +147 -0
- package/data/skills/openai-render-deploy/SKILL.md +479 -0
- package/data/skills/openai-screenshot/SKILL.md +267 -0
- package/data/skills/openai-security-best-practices/SKILL.md +86 -0
- package/data/skills/openai-security-ownership-map/SKILL.md +206 -0
- package/data/skills/openai-security-threat-model/SKILL.md +81 -0
- package/data/skills/openai-sentry/SKILL.md +123 -0
- package/data/skills/openai-sora/SKILL.md +178 -0
- package/data/skills/openai-speech/SKILL.md +144 -0
- package/data/skills/openai-spreadsheet/SKILL.md +145 -0
- package/data/skills/openai-transcribe/SKILL.md +81 -0
- package/data/skills/openai-vercel-deploy/SKILL.md +77 -0
- package/data/skills/openai-yeet/SKILL.md +28 -0
- package/data/skills/pennylane/SKILL.md +224 -0
- package/data/skills/polars-bio/SKILL.md +374 -0
- package/data/skills/primekg/SKILL.md +97 -0
- package/data/skills/pymatgen/SKILL.md +689 -0
- package/data/skills/qiskit/SKILL.md +273 -0
- package/data/skills/qutip/SKILL.md +316 -0
- package/data/skills/recursive-decomposition/SKILL.md +185 -0
- package/data/skills/rowan/SKILL.md +427 -0
- package/data/skills/scholar-evaluation/SKILL.md +298 -0
- package/data/skills/sentry-create-alert/SKILL.md +210 -0
- package/data/skills/sentry-fix-issues/SKILL.md +126 -0
- package/data/skills/sentry-pr-code-review/SKILL.md +105 -0
- package/data/skills/sentry-python-sdk/SKILL.md +317 -0
- package/data/skills/sentry-setup-ai-monitoring/SKILL.md +217 -0
- package/data/skills/stable-baselines3/SKILL.md +297 -0
- package/data/skills/sympy/SKILL.md +498 -0
- package/data/skills/trailofbits-ask-questions-if-underspecified/SKILL.md +85 -0
- package/data/skills/trailofbits-audit-context-building/SKILL.md +302 -0
- package/data/skills/trailofbits-differential-review/SKILL.md +220 -0
- package/data/skills/trailofbits-insecure-defaults/SKILL.md +117 -0
- package/data/skills/trailofbits-modern-python/SKILL.md +333 -0
- package/data/skills/trailofbits-property-based-testing/SKILL.md +123 -0
- package/data/skills/trailofbits-semgrep-rule-creator/SKILL.md +172 -0
- package/data/skills/trailofbits-sharp-edges/SKILL.md +292 -0
- package/data/skills/trailofbits-variant-analysis/SKILL.md +142 -0
- package/data/skills/transformers.js/SKILL.md +637 -0
- package/data/skills/writing/SKILL.md +419 -0
- package/dist/bgi.js +2 -2
- package/package.json +1 -1
|
@@ -0,0 +1,96 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: build-review-interface
|
|
3
|
+
description: >
|
|
4
|
+
Build a custom browser-based annotation interface tailored to your data for
|
|
5
|
+
reviewing LLM traces and collecting structured feedback. Use when you need to
|
|
6
|
+
build an annotation tool, review traces, or collect human labels.
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Build a Custom Annotation Interface
|
|
10
|
+
|
|
11
|
+
## Overview
|
|
12
|
+
|
|
13
|
+
Build an HTML page that loads traces from a data source (JSON/CSV file), displays one trace at a time with Pass/Fail buttons, a free-text notes field, and Next/Previous navigation. Save labels to a local file (CSV/SQLite/JSON). Then customize to the domain using the guidelines below.
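A minimal sketch of the loading step, assuming the JSON file is a top-level array of trace objects and the CSV file has a header row. The function name `load_traces` is illustrative and not part of this skill.

```python
import csv
import json
from pathlib import Path


def load_traces(path):
    """Return a list of trace dicts from a .json or .csv file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Assumed layout: a top-level JSON array of trace objects.
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of traces")
        return data
    if suffix == ".csv":
        # Each CSV row becomes one trace dict keyed by the header row.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported trace file type: {path.suffix}")
```

A small backend can serve the returned list to the HTML page, or the page can read the same files directly in the browser.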
|
|
14
|
+
|
|
15
|
+
## Data Display
|
|
16
|
+
|
|
17
|
+
Format all data in the most human-readable representation for the domain. Emails should look like emails. Code should have syntax highlighting. Markdown should be rendered. Tables should be tables. JSON should be pretty-printed and collapsible.
|
|
18
|
+
|
|
19
|
+
- **Collapse repetitive elements.** If every trace shares the same system prompt, put it in a `<details>` toggle.
|
|
20
|
+
- **Extract and surface key metadata.** If traces contain a property name, client type, or session ID buried in the data, extract it and display it prominently as a header or badge.
|
|
21
|
+
- **Color-code by role or status.** Use left-border colors to distinguish user messages, assistant messages, tool calls, and system prompts at a glance.
|
|
22
|
+
- **Group related elements visually.** Tool calls and their responses should be visually linked (indentation, shared border).
|
|
23
|
+
- **Collapse what doesn't help judgment.** Verbose tool response JSON, intermediate reasoning steps, and debugging context go behind toggles.
|
|
24
|
+
- **Highlight what matters most.** Make the primary content reviewers judge visually dominant. Bold key entities (prices, dates, names). Use font size and spacing to create hierarchy.
|
|
25
|
+
- **Show the full trace.** Include all intermediate steps (tool calls, retrieved context, reasoning), not just the final output. Collapse them by default but keep them accessible.
|
|
26
|
+
- **Sanitize rendered content.** Strip raw HTML from LLM outputs before rendering. Disable images in rendered markdown if they could be tracking pixels.
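The sanitizing guideline above could be sketched as follows, assuming the text is cleaned server-side before it reaches the markdown renderer. The regexes are deliberately crude and the function name is hypothetical; a vetted HTML sanitizer is the safer choice in a real deployment.

```python
import html
import re

# Matches markdown image syntax such as ![alt](http://example.com/pixel.png).
MARKDOWN_IMAGE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
# Matches anything that looks like an HTML tag; cosmetic only.
HTML_TAG = re.compile(r"<[^>]+>")


def sanitize_llm_output(text):
    """Drop markdown images and HTML tags, then escape what remains."""
    # Replace images with a visible placeholder so reviewers know one existed.
    text = MARKDOWN_IMAGE.sub(lambda m: f"[image removed: {m.group(1)}]", text)
    text = HTML_TAG.sub("", text)
    # Escaping is the actual safety step: any leftover "<" cannot open a tag.
    return html.escape(text)
```

One trade-off: escaping also converts `>` to an entity, so markdown blockquotes in the output will no longer render as blockquotes.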
|
|
27
|
+
|
|
28
|
+
## Feedback Collection
|
|
29
|
+
|
|
30
|
+
Annotate at the trace level. The reviewer judges the whole trace, not individual spans.
|
|
31
|
+
|
|
32
|
+
- Binary Pass/Fail buttons as the primary action.
|
|
33
|
+
- Free-text notes field for the reviewer to describe what went wrong (or right).
|
|
34
|
+
- Defer button for uncertain cases.
|
|
35
|
+
- Auto-save on every action.
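One way to get auto-save on every action is to commit each label as soon as it is recorded. The sketch below assumes SQLite as the local store; the table layout and function names are illustrative choices, not prescribed by this skill.

```python
import sqlite3

VALID_VERDICTS = {"pass", "fail", "defer"}


def open_label_store(db_path):
    """Open (or create) the label database and return a connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels ("
        "trace_id TEXT PRIMARY KEY, verdict TEXT NOT NULL, notes TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_label(conn, trace_id, verdict, notes=""):
    """Upsert one trace-level label and commit immediately (auto-save)."""
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    # One row per trace: re-labeling a trace overwrites the earlier verdict.
    conn.execute(
        "INSERT OR REPLACE INTO labels (trace_id, verdict, notes) VALUES (?, ?, ?)",
        (trace_id, verdict, notes),
    )
    conn.commit()
```

Keying on `trace_id` keeps annotation at the trace level: there is no place to store span-level judgments.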
|
|
36
|
+
|
|
37
|
+
Once you have established failure categories from error analysis, you can later add predefined failure mode tags as clickable checkboxes, dropdowns or picklists so reviewers can select from known categories in addition to writing notes. But don't add these in the initial build.
|
|
38
|
+
|
|
39
|
+
## Navigation and Status
|
|
40
|
+
|
|
41
|
+
- Next/Previous buttons and keyboard arrow keys.
|
|
42
|
+
- Trace counter showing position and progress ("Trace 12 of 87", "75 remaining").
|
|
43
|
+
- Jump to specific trace by ID.
|
|
44
|
+
- Counts of labeled vs unlabeled traces.
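The counter and the labeled/unlabeled counts can come from one helper. This is a sketch that assumes `labels` maps trace IDs to verdicts and that a deferred trace counts as labeled; adjust if deferred traces should stay in the remaining count.

```python
def progress_summary(trace_ids, labels, current_index):
    """Describe the reviewer's position and how many traces are still unlabeled."""
    total = len(trace_ids)
    # A trace counts as labeled once any verdict has been recorded for it.
    labeled = sum(1 for trace_id in trace_ids if trace_id in labels)
    return f"Trace {current_index + 1} of {total}, {labeled} labeled, {total - labeled} remaining"
```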
|
|
45
|
+
|
|
46
|
+
## Keyboard Shortcuts
|
|
47
|
+
|
|
48
|
+
```
|
|
49
|
+
Arrow keys = Navigate traces
|
|
50
|
+
1 = Pass 2 = Fail
|
|
51
|
+
D = Defer U = Undo last action
|
|
52
|
+
Cmd+S = Save Cmd+Enter = Save and next
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
## Selecting Traces to Load
|
|
56
|
+
|
|
57
|
+
Build the app to accept traces from any source (JSON/CSV file). Keep sampling logic outside the app in a separate script. Start with random sampling.
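A separate sampling script might look like the sketch below. It assumes traces are already loaded as a list of dicts; the fixed seed and the function names are illustrative.

```python
import json
import random


def sample_traces(traces, k, seed=0):
    """Return up to k traces chosen uniformly at random, reproducibly."""
    # A dedicated, seeded generator makes the sample repeatable across runs.
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def write_sample(traces, out_path):
    """Write the sample as a JSON array for the review app to load."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(traces, f, indent=2)
```

Because the app only sees the output file, the sampling strategy can change later without touching the interface.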
|
|
58
|
+
|
|
59
|
+
## Additional Features
|
|
60
|
+
|
|
61
|
+
**Reference panel:** Toggle-able panel showing ground truth, expected answers, or rubric definitions alongside the trace.
|
|
62
|
+
|
|
63
|
+
**Filtering:** Filter traces by metadata dimensions relevant to the product (channel, user type, pipeline version).
|
|
64
|
+
|
|
65
|
+
**Clustering:** Group traces by metadata or semantic similarity. Show representative traces per cluster with drill-down.
|
|
66
|
+
|
|
67
|
+
## Design Checklist
|
|
68
|
+
|
|
69
|
+
- [ ] Same layout, controls, and terminology on every trace
|
|
70
|
+
- [ ] Pass and Fail buttons are visually distinct (color, size)
|
|
71
|
+
- [ ] Keyboard shortcuts work for all primary actions
|
|
72
|
+
- [ ] Full trace accessible even when sections are collapsed
|
|
73
|
+
- [ ] Labels persist automatically without explicit save
|
|
74
|
+
- [ ] Trace-level annotation (not span-level) as the default
|
|
75
|
+
- [ ] All data rendered in its native format (markdown as HTML, code with highlighting, JSON pretty-printed, tables as HTML tables, URLs as clickable links)
|
|
76
|
+
|
|
77
|
+
## Testing
|
|
78
|
+
|
|
79
|
+
After building the interface, verify it with Playwright.
|
|
80
|
+
|
|
81
|
+
**Visual review:** Take screenshots of the interface with representative trace data loaded. Review each screenshot for:
|
|
82
|
+
- Layout and spacing: is the visual hierarchy clear? Can you immediately see what matters?
|
|
83
|
+
- Readability: is all data rendered in its native format? Are there any raw JSON blobs, unrendered markdown, or unstyled content?
|
|
84
|
+
- Aesthetics: does the interface look professional and clean? Would a domain expert use this?
|
|
85
|
+
- Responsiveness: does the layout hold at different window sizes?
|
|
86
|
+
|
|
87
|
+
**Functional test:** Write a Playwright script that performs a full annotation workflow:
|
|
88
|
+
1. Load the app and verify traces are displayed
|
|
89
|
+
2. Click Pass on a trace, verify the label is saved
|
|
90
|
+
3. Click Fail on a trace, add a note, verify both are saved
|
|
91
|
+
4. Click Defer, verify it is recorded
|
|
92
|
+
5. Navigate forward and backward with buttons and keyboard shortcuts
|
|
93
|
+
6. Verify the trace counter updates correctly
|
|
94
|
+
7. Verify auto-save by reloading the page and checking labels persist
|
|
95
|
+
8. Expand collapsed sections (system prompts, tool calls) and verify content is accessible
|
|
96
|
+
9. Test that all keyboard shortcuts trigger the correct actions
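A partial sketch of the functional test using Playwright's Python sync API. Every selector (`#trace`, `#trace-counter`, `#notes`, `#fail-button`) is hypothetical and must be replaced with the ones the built interface actually uses. The script also assumes the app reopens on the same trace after a reload. It covers steps 1, 3, 5, 6, and 7 only.

```python
def run_annotation_smoke_test(app_url):
    """Drive one pass through the review UI; raises AssertionError on failure."""
    # Imported here so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(app_url)

        # Step 1: a trace is displayed.
        page.wait_for_selector("#trace")
        first_counter = page.inner_text("#trace-counter")

        # Step 3: mark Fail with a note.
        page.fill("#notes", "ignored pet-friendly filter")
        page.click("#fail-button")

        # Steps 5-6: keyboard navigation changes the counter.
        page.keyboard.press("ArrowRight")
        assert page.inner_text("#trace-counter") != first_counter
        page.keyboard.press("ArrowLeft")

        # Step 7: the note survives a reload (auto-save).
        page.reload()
        assert page.input_value("#notes") == "ignored pet-friendly filter"

        # Visual review: capture a screenshot for manual inspection.
        page.screenshot(path="review-interface.png", full_page=True)
        browser.close()
```

The remaining steps (Pass, Defer, expanding collapsed sections, the other shortcuts) follow the same pattern of acting and then asserting on visible state.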
|
|
@@ -0,0 +1,164 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: error-analysis
|
|
3
|
+
description: >
|
|
4
|
+
Help the user systematically identify and categorize failure modes in an LLM
|
|
5
|
+
pipeline by reading traces. Use when starting a new eval project, after
|
|
6
|
+
significant pipeline changes (new features, model switches, prompt rewrites),
|
|
7
|
+
when production metrics drop, or after incidents.
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
# Error Analysis
|
|
11
|
+
|
|
12
|
+
Guide the user through reading LLM pipeline traces and building a catalog of how the system fails.
|
|
13
|
+
|
|
14
|
+
## Overview
|
|
15
|
+
|
|
16
|
+
1. Collect ~100 representative traces
|
|
17
|
+
2. Read each trace, judge pass/fail, and note what went wrong
|
|
18
|
+
3. Group similar failures into categories
|
|
19
|
+
4. Label every trace against those categories
|
|
20
|
+
5. Compute failure rates to prioritize what to fix
|
|
21
|
+
|
|
22
|
+
## Core Process
|
|
23
|
+
|
|
24
|
+
### Step 1: Collect Traces
|
|
25
|
+
|
|
26
|
+
Capture the full trace: input, all intermediate LLM calls, tool uses, retrieved documents, reasoning steps, and final output.
|
|
27
|
+
|
|
28
|
+
**Target: ~100 traces.** This is roughly where new traces stop revealing new kinds of failures. The number depends on system complexity.
|
|
29
|
+
|
|
30
|
+
**From real user data (preferred):**
|
|
31
|
+
- Small volume: random sample
|
|
32
|
+
- Large volume: sample across key dimensions (query type, user segment, feature area)
|
|
33
|
+
- Use embedding clustering (K-means) to ensure diversity
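Sampling across a key dimension can be sketched with the standard library alone. This illustrates the "sample across key dimensions" bullet, not the K-means approach, and assumes each trace is a dict carrying the dimension (for example a hypothetical `query_type` field).

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, per_group, seed=0):
    """Sample up to per_group traces from each distinct value of trace[key]."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for trace in traces:
        groups[trace[key]].append(trace)
    sample = []
    # Iterate groups in sorted order so the result is reproducible.
    for value in sorted(groups):
        members = groups[value]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample
```

Small groups contribute everything they have, so rare query types are not drowned out by common ones.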
|
|
34
|
+
|
|
35
|
+
**From synthetic data (when real data is sparse):**
|
|
36
|
+
- Use the generate-synthetic-data skill
|
|
37
|
+
- Run synthetic queries through the full pipeline and capture complete traces
|
|
38
|
+
|
|
39
|
+
### Step 2: Read Traces and Take Notes
|
|
40
|
+
|
|
41
|
+
Present each trace to the user. For each one, ask: **did the system produce a good result?** Pass or Fail.
|
|
42
|
+
|
|
43
|
+
For failures, note what went wrong. Focus on the **first thing that went wrong** in the trace — errors cascade, so downstream symptoms disappear when the root cause is fixed. Don't chase every issue in a single trace.
|
|
44
|
+
|
|
45
|
+
Write observations, not explanations. "SQL missed the budget constraint" not "The model probably didn't understand the budget."
|
|
46
|
+
|
|
47
|
+
**Template:**
|
|
48
|
+
|
|
49
|
+
```
|
|
50
|
+
| Trace ID | Trace | What went wrong | Pass/Fail |
|
|
51
|
+
|----------|-------|-----------------|-----------|
|
|
52
|
+
| 001 | [full trace] | Missing filter: pet-friendly requirement ignored in SQL | Fail |
|
|
53
|
+
| 002 | [full trace] | Proposed unavailable times despite calendar conflicts | Fail |
|
|
54
|
+
| 003 | [full trace] | Used casual tone for luxury client; wrong property type | Fail |
|
|
55
|
+
| 004 | [full trace] | - | Pass |
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
**Heuristics:**
|
|
59
|
+
- Do NOT start with a pre-defined failure list. Let categories emerge from what the user actually sees.
|
|
60
|
+
- If the user is stuck articulating what feels wrong, prompt with common failure types: made-up facts, malformed output, ignored user requirements, wrong tone, tool misuse.
|
|
61
|
+
|
|
62
|
+
### Step 3: Group Failures into Categories
|
|
63
|
+
|
|
64
|
+
After reviewing 30-50 traces, start grouping similar notes into categories. Don't wait until all 100 are done — grouping early helps sharpen what to look for in the remaining traces. The categories will evolve. The goal is names that are specific and actionable, not perfect.
|
|
65
|
+
|
|
66
|
+
1. Read through all the failure notes
|
|
67
|
+
2. Group similar ones together
|
|
68
|
+
3. Split notes that look alike but have different root causes
|
|
69
|
+
4. Give each category a clear name and one-sentence definition
|
|
70
|
+
|
|
71
|
+
**When to split vs. group:**
|
|
72
|
+
|
|
73
|
+
Split these (different root causes):
|
|
74
|
+
- "Made up property features (solar panels)" vs. "Made up client activity (scheduled a tour never requested)" — one fabricates external facts, the other fabricates user intent.
|
|
75
|
+
|
|
76
|
+
Group these (same root cause):
|
|
77
|
+
- "Missing bedroom count filter" + "Missing pet-friendly filter" + "Missing price range filter" → **Missing Query Constraints**
|
|
78
|
+
|
|
79
|
+
**LLM-assisted clustering** (use only after the user has reviewed 30-50 traces):
|
|
80
|
+
|
|
81
|
+
```
|
|
82
|
+
Here are failure annotations from reviewing LLM pipeline traces.
|
|
83
|
+
Group similar failures into 5-10 distinct categories.
|
|
84
|
+
For each category, provide:
|
|
85
|
+
- A clear name
|
|
86
|
+
- A one-sentence definition
|
|
87
|
+
- Which annotations belong to it
|
|
88
|
+
|
|
89
|
+
Annotations:
|
|
90
|
+
[paste annotations]
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
Always review LLM-suggested groupings with the user. LLMs cluster by surface similarity (e.g., grouping "app crashes on login" and "login is slow" because both mention login, even though the root causes differ).
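Filling in the clustering prompt can be automated. The sketch below reuses the prompt text from this step; the helper name and the `[trace ID] note` line format are assumptions made for illustration.

```python
CLUSTERING_PROMPT = """Here are failure annotations from reviewing LLM pipeline traces.
Group similar failures into 5-10 distinct categories.
For each category, provide:
- A clear name
- A one-sentence definition
- Which annotations belong to it

Annotations:
{annotations}
"""


def build_clustering_prompt(notes):
    """notes: mapping of trace ID -> free-text failure note (Fail traces only)."""
    # Keep the trace ID on each line so suggested groups can be traced back.
    lines = [f"- [{trace_id}] {note}" for trace_id, note in notes.items()]
    return CLUSTERING_PROMPT.format(annotations="\n".join(lines))
```

Keeping trace IDs in the prompt makes it easier to check each suggested grouping against the original traces with the user.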
|
|
94
|
+
|
|
95
|
+
**Aim for 5-10 categories** that are:
|
|
96
|
+
- Distinct (each failure belongs to one category)
|
|
97
|
+
- Clear enough that someone else could apply them consistently
|
|
98
|
+
- Actionable (each points toward a specific fix)
|
|
99
|
+
|
|
100
|
+
### Step 4: Label Every Trace
|
|
101
|
+
|
|
102
|
+
Go back through all traces and apply a binary label for each failure category: 1 if the failure is present, 0 if not. Each trace gets a column per category. Use whatever tool the user prefers: spreadsheet, annotation app (see build-review-interface), or a simple script.
|
|
103
|
+
|
|
104
|
+
### Step 5: Compute Failure Rates
|
|
105
|
+
|
|
106
|
+
```python
|
|
107
|
+
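# labeled_df: pandas DataFrame, one row per trace; failure_columns: 0/1 columns (1 = failure present).
failure_rates = labeled_df[failure_columns].sum() / len(labeled_df)
# Show the highest-rate categories first.
print(failure_rates.sort_values(ascending=False))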
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
The most frequent failure category is where to focus first.
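For a dependency-free illustration of the same calculation, the sketch below uses four made-up traces and two hypothetical category names, with 1 meaning the failure category applies.

```python
def failure_rates(labeled_rows, failure_columns):
    """Return (category, rate) pairs, most frequent failure first."""
    total = len(labeled_rows)
    rates = {
        column: sum(row[column] for row in labeled_rows) / total
        for column in failure_columns
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Four hypothetical traces; 1 means the failure category applies.
rows = [
    {"missing_query_constraints": 1, "wrong_tone": 0},
    {"missing_query_constraints": 1, "wrong_tone": 1},
    {"missing_query_constraints": 0, "wrong_tone": 0},
    {"missing_query_constraints": 0, "wrong_tone": 0},
]
print(failure_rates(rows, ["missing_query_constraints", "wrong_tone"]))
# prints [('missing_query_constraints', 0.5), ('wrong_tone', 0.25)]
```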
|
|
112
|
+
|
|
113
|
+
### Step 6: Decide What to Do About Each Failure
|
|
114
|
+
|
|
115
|
+
Work through each category with the user in this order:
|
|
116
|
+
|
|
117
|
+
**Can we just fix it?** Many failures have obvious fixes that don't need an evaluator at all:
|
|
118
|
+
- The prompt never mentioned the requirement. Example: the LLM never includes photo links in emails because the prompt never asked for them. Add the instruction.
|
|
119
|
+
- A tool is missing or misconfigured. Example: the user wants to reschedule but there's no rescheduling tool exposed to the LLM. Add the tool.
|
|
120
|
+
- An engineering bug in retrieval, parsing, or integration. Fix the code.
|
|
121
|
+
|
|
122
|
+
If a clear fix resolves the failure, do that first. Only consider an evaluator for failures that persist after fixing.
|
|
123
|
+
|
|
124
|
+
**Is an evaluator worth the effort?** Not every remaining failure needs one. Building and maintaining evaluators has real cost. Ask the user:
|
|
125
|
+
- Does this failure happen frequently enough to matter?
|
|
126
|
+
- What's the business impact when it does happen? A rare failure that causes revenue loss may outrank a frequent failure that's merely annoying.
|
|
127
|
+
- Will this evaluator actually get used to iterate on the system, or is it checkbox work?
|
|
128
|
+
|
|
129
|
+
Reserve evaluators for failures the user will iterate on repeatedly. Start with the highest-frequency, highest-impact category.
|
|
130
|
+
|
|
131
|
+
**For failures that warrant an evaluator:** prefer code-based checks (regex, parsing, schema validation) for anything objective. Use write-judge-prompt only for failures that require judgment. Critical requirements (safety, compliance) may warrant an evaluator even after fixing the prompt, as a guardrail.
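Two code-based checks, sketched under assumptions: the first treats any URL in a drafted email as a photo link (a stand-in for the photo-link example above), and the second checks that a structured output parses as a JSON object with the required keys. Function names and the URL pattern are illustrative.

```python
import json
import re

URL_PATTERN = re.compile(r"https?://\S+")


def check_email_has_photo_link(email_body):
    """Pass if the drafted email contains at least one URL."""
    return bool(URL_PATTERN.search(email_body))


def check_json_has_keys(raw_output, required_keys):
    """Pass if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)
```

Both return a plain boolean, which keeps them binary pass/fail and cheap to run on every trace.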
|
|
132
|
+
|
|
133
|
+
### Step 7: Iterate
|
|
134
|
+
|
|
135
|
+
Expect 2-3 rounds of reviewing and refining categories. After each round:
|
|
136
|
+
- Merge categories that overlap
|
|
137
|
+
- Split categories that are too broad
|
|
138
|
+
- Clarify definitions where the user would hesitate
|
|
139
|
+
- Re-label traces with the refined categories
|
|
140
|
+
|
|
141
|
+
## Stopping Criteria
|
|
142
|
+
|
|
143
|
+
Stop reviewing when new traces stop revealing new kinds of failures. As a rule of thumb: about 100 traces reviewed, with no new failure types appearing in the last 20. The exact number depends on system complexity.
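The stopping rule can be checked mechanically. This sketch assumes the review history is a list with one set of failure categories per trace, in review order (an empty set for a passing trace); the defaults mirror the rough numbers above and should be tuned to the system.

```python
def reached_saturation(categories_in_review_order, min_traces=100, window=20):
    """Return True if the last `window` traces introduced no new failure category."""
    if len(categories_in_review_order) < min_traces:
        return False
    earlier = categories_in_review_order[:-window]
    recent = categories_in_review_order[-window:]
    # Categories seen before the window vs. categories seen inside it.
    seen_before = set().union(*earlier) if earlier else set()
    seen_recently = set().union(*recent)
    return seen_recently <= seen_before
```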
|
|
144
|
+
|
|
145
|
+
## Trace Sampling Strategies
|
|
146
|
+
|
|
147
|
+
When production volume is high, use a mix:
|
|
148
|
+
|
|
149
|
+
| Strategy | When to Use | Method |
|
|
150
|
+
|----------|------------|--------|
|
|
151
|
+
| **Random** | Default starting point | Sample uniformly from recent traces |
|
|
152
|
+
| **Outlier** | Surface unusual behavior | Sort by response length, latency, tool call count; review extremes |
|
|
153
|
+
| **Failure-driven** | After guardrail violations or user complaints | Prioritize flagged traces |
|
|
154
|
+
| **Uncertainty** | When automated judges exist | Focus on traces where judges disagree or have low confidence |
|
|
155
|
+
| **Stratified** | Ensure coverage across user segments | Sample within each dimension |
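The outlier strategy in the table can be sketched as a sort followed by taking both ends. The field name passed as `metric` (for example a hypothetical `latency_ms`) is an assumption about how traces are stored.

```python
def outlier_traces(traces, metric, n_each=5):
    """Return the n_each lowest and n_each highest traces by a numeric field."""
    ordered = sorted(traces, key=lambda trace: trace[metric])
    # With few traces the two ends would overlap, so return everything once.
    if len(ordered) <= 2 * n_each:
        return ordered
    return ordered[:n_each] + ordered[-n_each:]
```

Run it once per metric (response length, latency, tool call count) and review the union of the results.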
|
|
156
|
+
|
|
157
|
+
## Anti-Patterns
|
|
158
|
+
|
|
159
|
+
- **Brainstorming failure categories before reading traces.** Read first, categorize what you find.
|
|
160
|
+
- **Starting with pre-defined categories.** A fixed list causes confirmation bias. Let categories emerge.
|
|
161
|
+
- **Skipping the user for initial review.** The user must review the first 30-50 traces to ground categories in domain knowledge.
|
|
162
|
+
- **Using generic scores as categories.** "Hallucination score," "helpfulness score," "coherence score" are not grounded in the application's actual failure modes.
|
|
163
|
+
- **Building evaluators before fixing obvious problems.** Fix prompt gaps, missing tools, and engineering bugs first.
|
|
164
|
+
- **Treating this as a one-time activity.** Re-run after every significant change: new features, prompt rewrites, model switches, production incidents.
|
|
@@ -0,0 +1,183 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-audit
|
|
3
|
+
description: >
|
|
4
|
+
Audit an LLM eval pipeline and surface problems: missing error analysis,
|
|
5
|
+
unvalidated judges, vanity metrics, etc. Use when
|
|
6
|
+
inheriting an eval system, when unsure whether evals are trustworthy, or as a
|
|
7
|
+
starting point when no eval infrastructure exists. Do NOT use when the goal
|
|
8
|
+
is to build a new evaluator from scratch (use error-analysis,
|
|
9
|
+
write-judge-prompt, or validate-evaluator instead).
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
# Eval Audit
|
|
13
|
+
|
|
14
|
+
Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.
|
|
15
|
+
|
|
16
|
+
## Overview
|
|
17
|
+
|
|
18
|
+
1. Gather eval artifacts: traces, evaluator configs, judge prompts, labeled data, metrics dashboards
|
|
19
|
+
2. Run diagnostic checks across six areas
|
|
20
|
+
3. Produce a findings report ordered by impact, with each finding linking to a fix
|
|
21
|
+
|
|
22
|
+
## Prerequisites
|
|
23
|
+
|
|
24
|
+
Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."
|
|
25
|
+
|
|
26
|
+
## Connecting to Eval Infrastructure
|
|
27
|
+
|
|
28
|
+
Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight, or similar). If available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.
|
|
29
|
+
|
|
30
|
+
## Diagnostic Checks
|
|
31
|
+
|
|
32
|
+
Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.
|
|
33
|
+
|
|
34
|
+
Prioritize findings by impact on the user's product. Present the most impactful findings first.
|
|
35
|
+
|
|
36
|
+
### 1. Error Analysis
|
|
37
|
+
|
|
38
|
+
**Check:** Has the user done systematic error analysis on real or synthetic traces?
|
|
39
|
+
|
|
40
|
+
Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but no documented failure categories, error analysis was likely skipped.
|
|
41
|
+
|
|
42
|
+
**Finding if missing:** Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with `error-analysis`, or `generate-synthetic-data` first if no traces exist.
|
|
43
|
+
|
|
44
|
+
See: [Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/index.html), [LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/)
|
|
45
|
+
|
|
46
|
+
**Check:** Were failure categories brainstormed or observed?
|
|
47
|
+
|
|
48
|
+
Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.
|
|
49
|
+
|
|
50
|
+
**Finding if brainstormed:** Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with `error-analysis`, starting from traces.
|
|
51
|
+
|
|
52
|
+
See: [Who Validates the Validators?](https://arxiv.org/abs/2404.12272)
|
|
53
|
+
|
|
54
|
+
### 2. Evaluator Design
|
|
55
|
+
|
|
56
|
+
**Check:** Are evaluators binary pass/fail?
|
|
57
|
+
|
|
58
|
+
Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.
|
|
59
|
+
|
|
60
|
+
**Finding if not binary:** Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using `write-judge-prompt`.
|
|
61
|
+
|
|
62
|
+
See: [Creating an LLM Judge That Drives Business Results](https://hamel.dev/blog/posts/llm-judge/)
|
|
63
|
+
|
|
64
|
+
**Check:** Do LLM judge prompts target specific failure modes?
|
|
65
|
+
|
|
66
|
+
Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").
|
|
67
|
+
|
|
68
|
+
**Finding if vague:** Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use `write-judge-prompt`.
|
|
69
|
+
|
|
70
|
+
**Check:** Are code-based checks used where possible?
|
|
71
|
+
|
|
72
|
+
Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.
|
|
73
|
+
|
|
74
|
+
**Finding if over-relying on judges:** Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
|
|
75
|
+
|
|
76
|
+
**Check:** Are similarity metrics used as primary evaluation?
|
|
77
|
+
|
|
78
|
+
Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.
|
|
79
|
+
|
|
80
|
+
**Finding if present:** These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
|
|
81
|
+
|
|
82
|
+
See: [LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/)
|
|
83
|
+
|
|
84
|
+
### 3. Judge Validation
|
|
85
|
+
|
|
86
|
+
**Check:** Are LLM judges validated against human labels?
|
|
87
|
+
|
|
88
|
+
Look for: confusion matrices, TPR/TNR measurements, alignment scores. A judge running in production with no validation data is a critical finding.
|
|
89
|
+
|
|
90
|
+
**Finding if unvalidated:** An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use `validate-evaluator`.
|
|
91
|
+
|
|
92
|
+
See: [Creating an LLM Judge That Drives Business Results](https://hamel.dev/blog/posts/llm-judge/)
|
|
93
|
+
|
|
94
|
+
**Check:** Is alignment measured with TPR/TNR or with raw accuracy?
|
|
95
|
+
|
|
96
|
+
Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.
|
|
97
|
+
|
|
98
|
+
**Finding if using accuracy:** With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use `validate-evaluator`.
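The always-"Pass" example can be worked through in a few lines. This sketch assumes "pass" is the positive class, so TPR is the share of human-labeled passes the judge also passes and TNR is the share of human-labeled fails the judge also fails.

```python
def judge_alignment(human_labels, judge_labels):
    """Return accuracy, TPR, and TNR, treating "pass" as the positive class."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for human, judge in pairs if human == "pass" and judge == "pass")
    tn = sum(1 for human, judge in pairs if human == "fail" and judge == "fail")
    n_pass = sum(1 for human, _ in pairs if human == "pass")
    n_fail = len(pairs) - n_pass
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / n_pass if n_pass else None,
        "tnr": tn / n_fail if n_fail else None,
    }


# 90 passing and 10 failing traces, judged by a judge that always says "pass".
human = ["pass"] * 90 + ["fail"] * 10
always_pass = ["pass"] * 100
print(judge_alignment(human, always_pass))
# prints {'accuracy': 0.9, 'tpr': 1.0, 'tnr': 0.0}
```

The 0.0 TNR exposes what the 90% accuracy hides: the judge catches none of the failures.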
|
|
99
|
+
|
|
100
|
+
**Check:** Is there a proper train/dev/test split?
|
|
101
|
+
|
|
102
|
+
Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.
|
|
103
|
+
|
|
104
|
+
**Finding if leaking:** Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use `validate-evaluator`.
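A sketch of the split and a leakage check, assuming labeled traces are identified by ID. The split fractions are placeholder assumptions, not values taken from this skill.

```python
import random


def split_labeled_traces(trace_ids, seed=0, train_frac=0.15, dev_frac=0.40):
    """Shuffle once and cut into disjoint train/dev/test lists of trace IDs."""
    ids = list(trace_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    return {
        "train": ids[:n_train],               # only source of few-shot examples
        "dev": ids[n_train:n_train + n_dev],  # used while iterating on the prompt
        "test": ids[n_train + n_dev:],        # held out for the final measurement
    }


def leaked_examples(few_shot_ids, split):
    """Return few-shot example IDs that also appear in dev or test."""
    held_out = set(split["dev"]) | set(split["test"])
    return sorted(set(few_shot_ids) & held_out)
```

During an audit, pass the IDs of the few-shot examples found in the judge prompt to `leaked_examples`; any non-empty result is a leak.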

### 4. Human Review Process

**Check:** Who is reviewing traces?

Determine whether domain experts or outsourced annotators are labeling data.

**Finding if outsourced without domain expertise:** General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.

See: [A Field Guide to Improving AI Products](https://hamel.dev/blog/posts/field-guide/)

**Check:** Are reviewers seeing full traces or just final outputs?

**Finding if output-only:** Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.

**Check:** How is data displayed to reviewers?

Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.

**Finding if raw format:** Reviewers spend effort parsing data instead of judging quality. Format in natural representation: render markdown, syntax-highlight code, display tables as tables. Use `build-review-interface`.

See: [LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/)

### 5. Labeled Data

**Check:** Is there enough labeled data?

For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR. If labeled data is sparse, collect more by sampling traces more effectively:

- **Random:** Always include a random sample alongside other strategies to discover unknown issues.
- **Clustering:** Group traces by semantic similarity and review representatives from each cluster.
- **Data analysis:** Analyze statistics on latency, turns, tool calls, and tokens for outliers.
- **Classification:** Use existing evals, a predictive model, or an LLM to surface problematic traces. Use with caution.
- **Feedback:** Use explicit customer feedback (complaints, thumbs-down signals) to filter traces.

**Finding if insufficient:** Small datasets produce unreliable failure rates and wide confidence intervals. Use the sampling strategies above to collect more labeled data, or supplement with `generate-synthetic-data`.
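
To see how quickly the interval narrows with more labels, the sketch below computes a Wilson score interval for an observed failure rate. The Wilson interval is one common choice, not something this skill prescribes, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate (z=1.96)."""
    if n == 0:
        raise ValueError("need at least one labeled trace")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Same observed 10% failure rate at three dataset sizes.
for n in (20, 100, 500):
    low, high = wilson_interval(failures=n // 10, n=n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

With the same observed rate, the interval at 20 labels is several times wider than at 500, which is why a small labeled set gives an unreliable failure rate.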

### 6. Pipeline Hygiene

**Check:** Is error analysis re-run after significant changes?

Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.

**Finding if stale:** Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.

**Check:** Are evaluators maintained?

Look for periodic re-validation of judges or refreshed evaluation datasets.

**Finding if set-and-forget:** Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.

## No Eval Infrastructure

If the user has no eval artifacts (no traces, no evaluators, no labeled data):

1. Start with `error-analysis` on a sample of real traces.
2. If no production data exists, use `generate-synthetic-data` to create test inputs, run them through the pipeline, then apply `error-analysis` to the resulting traces.
3. Do not recommend building evaluators, judges, or dashboards before completing error analysis.

## Report Format

Present findings ordered by impact. For each:

```
### [Problem Title]
**Status:** [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
**Fix:** [Concrete action, referencing a skill or article]
```

Group under the six diagnostic areas. Omit areas where no problems were found.

## Anti-Patterns

- Running the audit as a checklist without inspecting actual artifacts.
- Reporting generic advice disconnected from what was found in the user's pipeline.
- Recommending evaluators before error analysis is complete.
- Suggesting LLM judges for failures that code-based checks can handle.
- Treating this audit as a one-time event. Re-audit after significant pipeline changes.
@@ -0,0 +1,177 @@
---
name: evaluate-rag
description: >
  Guides evaluation of RAG pipeline retrieval and generation quality.
  Use when evaluating a retrieval-augmented generation system, measuring retrieval quality,
  assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval
  testing, or optimizing chunking strategies.
---

# Evaluate RAG

## Overview

1. Do error analysis on end-to-end traces first. Determine whether failures come from retrieval, generation, or both.
2. Build a retrieval evaluation dataset: queries paired with relevant document chunks.
3. Measure retrieval quality with Recall@k (most important for first-pass retrieval).
4. Evaluate generation separately: faithfulness (grounded in context?) and relevance (answers the query?).
5. If retrieval is the bottleneck, optimize chunking via grid search before tuning generation.

## Prerequisites

Complete error analysis on RAG pipeline traces before selecting metrics. Inspect what was retrieved vs. what the model needed. Determine whether the problem is retrieval, generation, or both. Fix retrieval first.

## Core Instructions

### Evaluate Retrieval and Generation Separately

Measure each component independently. Use the appropriate metric for each retrieval stage:

- **First-pass retrieval:** Optimize for Recall@k. Include all relevant documents, even at the cost of noise.
- **Reranking:** Optimize for Precision@k, MRR, or NDCG@k. Rank the most relevant documents first.

### Building a Retrieval Evaluation Dataset

You need queries paired with ground-truth relevant document chunks.

**Manual curation (highest quality):** Write realistic questions and map each to the exact chunk(s) containing the answer.

**Synthetic QA generation (scalable):** For each document chunk, prompt an LLM to extract a fact and generate a question answerable only from that fact.

Synthetic QA prompt template:

```
Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.

Return output in JSON format:
{ "fact": "...", "question": "..." }

Chunk: "{text_chunk}"
```
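
A minimal sketch of how the template above might be turned into evaluation records follows. The names `build_qa_pair`, `call_llm`, and `fake_llm` are hypothetical: `call_llm` stands for whatever function sends a prompt to your model and returns its text, and `fake_llm` is a stub so the example runs without a model. The record layout is likewise an assumption.

```python
import json

PROMPT_TEMPLATE = """Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.

Return output in JSON format:
{ "fact": "...", "question": "..." }

Chunk: "{text_chunk}"
"""


def build_qa_pair(chunk_id, text_chunk, call_llm):
    """Return one (question, ground-truth chunk) record, or None if the reply is unusable."""
    # str.replace is used instead of str.format because the template
    # contains literal JSON braces.
    prompt = PROMPT_TEMPLATE.replace("{text_chunk}", text_chunk)
    raw = call_llm(prompt)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # drop malformed replies rather than guessing
    if not isinstance(parsed, dict) or not parsed.get("fact") or not parsed.get("question"):
        return None
    return {
        "question": parsed["question"],
        "fact": parsed["fact"],
        "relevant_chunk_ids": [chunk_id],  # the source chunk is the ground truth
    }


def fake_llm(prompt):
    """Stand-in for a real model call (hypothetical)."""
    return json.dumps({
        "fact": "Quarterly revenue dropped 17% in April 2020.",
        "question": "By how much did quarterly revenue drop in April 2020?",
    })


pair = build_qa_pair(
    "chunk-A",
    "In April 2020, the company reported a 17% drop in quarterly revenue.",
    fake_llm,
)
print(pair)
```

Storing the source chunk ID with each question is what makes the record usable for Recall@k later.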

**Adversarial question generation:** Create harder queries that resemble content in multiple chunks but are only answered by one.

Process:
1. Select target chunk A containing a clear fact.
2. Find similar chunks B, C using embedding search (chunks that share terminology but lack the answer).
3. Prompt the LLM to write a question using terminology from B and C that only chunk A answers.

Example:
- Chunk A: "In April 2020, the company reported a 17% drop in quarterly revenue, its largest decline since 2008."
- Chunk B: "The company experienced significant losses in 2008 during the financial crisis."
- Generated question: "When did the company experience its largest revenue decline since the 2008 financial crisis?"

Only chunk A contains the answer. Chunk B is a plausible distractor.

**Filtering synthetic questions:** Rate synthetic queries for realism using few-shot LLM scoring. Keep only those rated realistic (4-5 on a 1-5 scale). Likert scoring is appropriate here, since the goal is fuzzy ranking for dataset curation, not measuring failure rates.

### Retrieval Metrics

**Recall@k:** Fraction of relevant documents found in the top k results.

```
Recall@k = (relevant docs in top k) / (total relevant docs for query)
```

Prioritize recall for first-pass retrieval. LLMs can ignore irrelevant content but cannot generate from missing content.
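
The Recall@k formula above translates directly into code. This is a minimal sketch; the function name `recall_at_k` and the chunk IDs are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the query's relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no ground-truth relevant documents")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked retriever output
relevant = {"c2", "c4", "c8"}               # ground truth for this query

print(recall_at_k(retrieved, relevant, k=3))  # 1 of 3 relevant chunks found
print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 relevant chunks found
```

The denominator is the number of relevant chunks for the query, not k, so a chunk the retriever never returns ("c8" here) caps recall no matter how large k gets.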

**Precision@k:** Fraction of top k results that are relevant.

```
Precision@k = (relevant docs in top k) / k
```

Use for reranking evaluation.
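
A minimal sketch of the Precision@k formula above, assuming retrieved IDs are unique. The function name `precision_at_k` and the chunk IDs are illustrative.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant (assumes unique IDs)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    # Divide by k, as in the formula, even if fewer than k results came back.
    return hits / k


reranked = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(precision_at_k(reranked, relevant, k=2))  # 0.5
print(precision_at_k(reranked, relevant, k=4))  # 0.5
```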

**Mean Reciprocal Rank (MRR):** How early the first relevant document appears.

```
MRR = (1/N) * sum(1/rank_of_first_relevant_doc)
```

Best for single-fact lookups where only one key chunk is needed.
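
A minimal sketch of the MRR formula above. Scoring a query as 0 when no relevant chunk is retrieved is a common convention and an assumption here; the function names are illustrative.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none is retrieved (assumed convention)."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


runs = [
    (["c1", "c2", "c3"], {"c1"}),  # first relevant at rank 1 -> 1.0
    (["c4", "c5", "c6"], {"c5"}),  # first relevant at rank 2 -> 0.5
    (["c7", "c8", "c9"], {"c0"}),  # nothing relevant retrieved -> 0.0
]
print(mean_reciprocal_rank(runs))  # 0.5
```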

**NDCG@k (Normalized Discounted Cumulative Gain):** For graded relevance where documents have varying utility. Rewards placing more relevant items higher.

```
DCG@k = sum over i=1..k of: rel_i / log2(i+1)
IDCG@k = DCG@k with documents sorted by decreasing relevance
NDCG@k = DCG@k / IDCG@k
```

Caveat: Optimal ranking of weakly relevant documents can outscore a highly relevant document ranked lower. Supplement with Recall@k.
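
The three formulas above, and the caveat, can be sketched as follows. By default this sketch computes IDCG from the same retrieved list, which is an assumption; pass `ideal_relevances` (a hypothetical parameter) to normalize against all judged documents for the query instead.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k = sum over i=1..k of rel_i / log2(i+1), with ranks starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k, ideal_relevances=None):
    """NDCG@k for graded relevance scores listed in ranked order."""
    pool = ranked_relevances if ideal_relevances is None else ideal_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


weak_but_ordered = [1, 1, 1]    # only weakly relevant chunks, ideally ordered
strong_but_second = [1, 3, 1]   # a highly relevant chunk sitting at rank 2

print(ndcg_at_k(weak_but_ordered, k=3))
print(ndcg_at_k(strong_but_second, k=3))
```

The first list scores a perfect 1.0 while the second scores lower, even though only the second contains a highly relevant chunk. That is the caveat in practice, and the reason to report Recall@k alongside NDCG@k.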

**Choosing k:** k varies by query type. A factual lookup uses k=1-2. A synthesis query ("summarize market trends") uses k=5-10.

#### Metric Selection

| Query Type | Primary Metric |
|---|---|
| Single-fact lookups | MRR |
| Broad coverage needed | Recall@k |
| Ranked quality matters | NDCG@k or Precision@k |
| Multi-hop reasoning | Two-hop Recall@k |

### Evaluating and Optimizing Chunking

Treat chunking as a tunable hyperparameter. Even with the same retriever, metrics vary based on chunking alone.

**Grid search for fixed-size chunking:** Test combinations of chunk size and overlap. Re-index the corpus for each configuration. Measure retrieval metrics on your evaluation dataset.

Example search grid:

| Chunk size | Overlap | Recall@5 | NDCG@5 |
|-----------|---------|----------|--------|
| 128 tokens | 0 | 0.82 | 0.69 |
| 128 tokens | 64 | 0.88 | 0.75 |
| 256 tokens | 0 | 0.86 | 0.74 |
| 256 tokens | 128 | 0.89 | 0.77 |
| 512 tokens | 0 | 0.80 | 0.72 |
| 512 tokens | 256 | 0.83 | 0.74 |
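
A grid like the one above can be driven by a small loop. The sketch below is a skeleton under stated assumptions: `chunk_tokens` and `grid_search` are hypothetical names, and `evaluate` is a callable you supply that re-indexes the chunks and returns a metrics dict. The placeholder evaluator only counts chunks so the example runs; it is not a quality metric.

```python
from itertools import product


def chunk_tokens(tokens, size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    # Stop once the remaining tokens are already covered by the previous chunk.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


def grid_search(tokens, sizes, overlap_fracs, evaluate, metric):
    """Chunk the corpus per configuration and rank configurations by `metric`."""
    results = []
    for size, frac in product(sizes, overlap_fracs):
        overlap = int(size * frac)
        chunks = chunk_tokens(tokens, size, overlap)
        # `evaluate` should re-index `chunks` and score the evaluation dataset.
        results.append({"size": size, "overlap": overlap, **evaluate(chunks)})
    return sorted(results, key=lambda r: r[metric], reverse=True)


def placeholder_evaluate(chunks):
    """Stand-in evaluator (hypothetical): replace with real Recall@k / NDCG@k scoring."""
    return {"n_chunks": len(chunks)}


corpus_tokens = list(range(1000))  # stand-in for a tokenized corpus
results = grid_search(corpus_tokens, (128, 256, 512), (0.0, 0.5), placeholder_evaluate, "n_chunks")
for row in results:
    print(row)
```

The sizes and overlap fractions mirror the six configurations in the table. With a real evaluator, sort on the retrieval metric that matters for your queries.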

**Content-aware chunking:** When fixed-size chunks split related information:
- Use natural document boundaries (sections, paragraphs, steps).
- Augment chunks with context: prepend document title and section headings to each chunk before embedding.
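
The augmentation step might look like the sketch below. The function name `augment_chunk`, the `" > "` separator, and the sample title and headings are all illustrative assumptions.

```python
def augment_chunk(chunk_text, doc_title, section_headings):
    """Prepend the document title and heading path to a chunk before embedding."""
    breadcrumb = " > ".join([doc_title, *section_headings])
    return f"{breadcrumb}\n\n{chunk_text}"


augmented = augment_chunk(
    "Restart the service after changing the port.",
    "Operations Handbook",
    ["Networking", "Ports"],
)
print(augmented)
```

Embed the augmented text, but keep the original chunk text available if the generator should see it without the prefix.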

### Evaluating Generation Quality

After confirming retrieval works, evaluate what the LLM does with the retrieved context along two dimensions:

**Answer faithfulness:** Does the output accurately reflect the retrieved context? Check for:
- **Hallucinations:** Information absent from source documents. In RAG, even correct facts from the LLM's own knowledge count as hallucinations.
- **Omissions:** Relevant information from the context ignored in the output.
- **Misinterpretations:** Context information represented inaccurately.

**Answer relevance:** Does the output address the original query? An answer can be faithful to the context but fail to answer what the user asked.

Use error analysis to discover specific manifestations in your pipeline. Identify what kind of information gets hallucinated and which constraints get omitted.

#### Diagnosing Failures by Metric Pattern

| Context Relevance | Faithfulness | Answer Relevance | Diagnosis |
|---|---|---|---|
| High | High | Low | Generator attended to wrong section of a correct document |
| High | Low | -- | Hallucination or misinterpretation of retrieved content |
| Low | -- | -- | Retrieval problem. Fix chunking, embeddings, or query preprocessing |
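
The table above reads top-down as a short decision procedure. The sketch below encodes it with boolean inputs (True for "High"). The function name `diagnose` is hypothetical, and the final branch for the all-High case is an addition, since the table does not list it.

```python
def diagnose(context_relevant, faithful, answer_relevant):
    """Map the three High/Low signals to the diagnosis in the table."""
    if not context_relevant:
        # Faithfulness and answer relevance are not informative here ("--").
        return "Retrieval problem. Fix chunking, embeddings, or query preprocessing"
    if not faithful:
        return "Hallucination or misinterpretation of retrieved content"
    if not answer_relevant:
        return "Generator attended to wrong section of a correct document"
    return "No failure indicated by these three signals"  # not in the table (assumption)


print(diagnose(context_relevant=True, faithful=True, answer_relevant=False))
print(diagnose(context_relevant=False, faithful=True, answer_relevant=True))
```

Retrieval is checked first, consistent with the advice to fix retrieval before tuning generation.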

### Multi-Hop Retrieval Evaluation

For queries requiring information from multiple chunks:

**Two-hop Recall@k:** Fraction of 2-hop queries where both ground-truth chunks appear in the top k results.

```
TwoHopRecall@k = (1/N) * sum(1 if {Chunk1, Chunk2} ⊆ top_k_results)
```

Diagnose failures by classifying: hop 1 miss, hop 2 miss, or rank-out-of-top-k.
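
A minimal sketch of the metric and the failure classification follows. The reading of the three failure classes is one interpretation and an assumption: a hop counts as a "miss" when its chunk is absent from the full retrieved list, and "rank-out-of-top-k" means both chunks were retrieved but at least one ranked below k. Function names and chunk IDs are illustrative.

```python
def classify_two_hop(retrieved, hop1, hop2, k):
    """Label one 2-hop query as a hit or as one of three failure types."""
    top_k = set(retrieved[:k])
    if hop1 in top_k and hop2 in top_k:
        return "hit"
    if hop1 not in retrieved:
        return "hop 1 miss"
    if hop2 not in retrieved:
        return "hop 2 miss"
    return "rank-out-of-top-k"


def two_hop_recall_at_k(queries, k):
    """Fraction of queries whose two ground-truth chunks are both in the top k."""
    if not queries:
        raise ValueError("need at least one query")
    hits = sum(1 for retrieved, hop1, hop2 in queries
               if classify_two_hop(retrieved, hop1, hop2, k) == "hit")
    return hits / len(queries)


queries = [
    (["a", "b", "c", "d"], "a", "b"),  # both in top 2
    (["a", "c", "d", "b"], "a", "b"),  # hop 2 retrieved, but at rank 4
    (["c", "d", "e", "b"], "a", "b"),  # hop 1 never retrieved
]
for retrieved, hop1, hop2 in queries:
    print(classify_two_hop(retrieved, hop1, hop2, k=2))
print(two_hop_recall_at_k(queries, k=2))  # 1 of 3 queries is a hit
```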

## Anti-Patterns

- Using a single end-to-end correctness metric without separating retrieval and generation measurement.
- Jumping directly to metrics without reading traces first.
- Overfitting to synthetic evaluation data. Validate against real user queries regularly.
- Using similarity metrics (ROUGE, BERTScore, cosine similarity) as primary generation evaluation. Use binary evaluators driven by error analysis.
- Evaluating generation without checking context grounding.