@massiangelone/rag-eval 0.1.0-alpha.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +207 -0
- package/dist/index.cjs +1067 -0
- package/dist/index.d.cts +2 -0
- package/dist/index.d.ts +2 -0
- package/dist/index.js +1043 -0
- package/package.json +80 -0
package/LICENSE
ADDED
| @@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Massimiliano Angelone

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
| @@ -0,0 +1,207 @@
# @massiangelone/rag-eval

[](https://www.npmjs.com/package/@massiangelone/rag-eval)
[](https://github.com/maxange-developer/rag-eval-cli/actions/workflows/ci.yml)
[](LICENSE)

> Evaluate RAG pipelines: retrieval precision, faithfulness, answer correctness. Zero-config CLI.

Building a RAG system is easy. Knowing if it's any good is hard. `rag-eval` runs your RAG endpoint against a labeled eval-set and reports three numbers:

- **Retrieval precision** — did your retriever find the right documents?
- **Faithfulness** — is the generated answer supported by retrieved context?
- **Correctness** — does the answer match the expected ground truth?

Extracted from production work on a multi-tenant RAG SaaS. Built provider-agnostic: judge with Claude or OpenAI.

> **Status: alpha (v0.1.0).** Config shape may change before v1.0. CLI command surface is stable.

## Install

```bash
npx @massiangelone/rag-eval --help
```

Or globally:

```bash
npm install -g @massiangelone/rag-eval
rag-eval --help
```

## Quickstart

### 1. Create your eval-set (JSONL)

Each line is one question with an expected source ID and an optional expected answer.

```jsonl
{"id":"q1","question":"How do I reset SSO?","expected_source":"docs-sso-reset","expected_answer":"Settings → SSO → Reset"}
{"id":"q2","question":"Which webhook fires on downgrade?","expected_source":"docs-webhooks","expected_answer":"subscription.plan_changed with action=downgrade"}
```

See [`examples/eval-set.example.jsonl`](examples/eval-set.example.jsonl) for a full B2B SaaS example.
### 2. Create config (JSON)

```json
{
  "endpoint": {
    "url": "http://localhost:3000/api/rag",
    "method": "POST",
    "headers": { "Content-Type": "application/json" },
    "body": { "query": "{{question}}" },
    "responsePaths": {
      "answer": "answer",
      "sources": "sources[].id",
      "sourceContents": "sources[].content"
    }
  },
  "judge": {
    "provider": "claude",
    "model": "claude-sonnet-4-6"
  },
  "scoring": {
    "retrievalK": 5,
    "weights": {
      "retrieval": 0.4,
      "faithfulness": 0.3,
      "correctness": 0.3
    }
  }
}
```

See [`examples/rag-eval.config.json`](examples/rag-eval.config.json) for the full reference config.

### 3. Set your judge API key

```bash
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
```

### 4. Run

```bash
rag-eval run -c rag-eval.config.json -q eval-set.jsonl --threshold 0.7
```

Output: colored console table, CSV file, JSON file, exit code 0/1 based on threshold.

## Configuration reference

### `endpoint`

How `rag-eval` calls your RAG service.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | — | Your RAG endpoint URL |
| `method` | `GET` \| `POST` | `POST` | HTTP method |
| `headers` | object | — | Optional headers (auth, content-type) |
| `body` | object | — | Request body template. Use `{{question}}` and `{{id}}` as placeholders |
| `responsePaths.answer` | string | — | JSON path to the generated answer |
| `responsePaths.sources` | string | — | JSON path to retrieved source IDs (e.g. `sources[].id`) |
| `responsePaths.sourceContents` | string | optional | JSON path to retrieved source **text**. Required for accurate faithfulness scoring. |
| `timeoutMs` | number | `30000` | Request timeout in ms |
### Path syntax

- `a.b.c` — nested object access
- `a[].c` — array map: returns `c` from each element of `a`
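The two path forms above can be resolved with a small reducer. This is a sketch of the semantics, not `rag-eval`'s actual resolver:

```javascript
// Sketch: resolve "a.b.c" and "a[].c" paths against a response object.
// Illustrative only; the package's real resolver may handle more cases.
function getPath(obj, path) {
  return path.split(".").reduce((cur, key) => {
    if (cur == null) return undefined;
    if (key.endsWith("[]")) {
      const arr = cur[key.slice(0, -2)];
      return Array.isArray(arr) ? arr : undefined;
    }
    // After an array segment, map the remaining key over each element.
    return Array.isArray(cur) ? cur.map((el) => el?.[key]) : cur[key];
  }, obj);
}

const response = {
  answer: "Use Settings → SSO → Reset.",
  sources: [
    { id: "docs-sso-reset", content: "Reset SSO from Settings." },
    { id: "docs-webhooks", content: "Webhook reference." },
  ],
};

console.log(getPath(response, "answer"));       // "Use Settings → SSO → Reset."
console.log(getPath(response, "sources[].id")); // ["docs-sso-reset", "docs-webhooks"]
```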
### `judge`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `claude` \| `openai` | `claude` | LLM judge provider |
| `model` | string | provider-aware | Specific model. Defaults: `claude-sonnet-4-6` or `gpt-4o-mini` |

Required env vars: `ANTHROPIC_API_KEY` (claude) or `OPENAI_API_KEY` (openai).

### `scoring`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `retrievalK` | number | `5` | Top-K sources to consider for retrieval precision |
| `weights.retrieval` | number | `0.4` | Weight in overall score |
| `weights.faithfulness` | number | `0.3` | Weight in overall score |
| `weights.correctness` | number | `0.3` | Weight in overall score |

Weights must sum to 1.0.
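The overall score is a weighted sum of the three metrics. A sketch of that arithmetic with the default weights — illustrative only; the exact scoring code lives in the package:

```javascript
// Sketch: combine per-metric scores with the default weights.
// Illustrative arithmetic, not rag-eval's internal implementation.
const weights = { retrieval: 0.4, faithfulness: 0.3, correctness: 0.3 };

function overallScore(scores, w = weights) {
  return (
    w.retrieval * scores.retrieval +
    w.faithfulness * scores.faithfulness +
    w.correctness * scores.correctness
  );
}

// A question where retrieval hit and the answer is mostly faithful and correct:
const s = overallScore({ retrieval: 1.0, faithfulness: 0.8, correctness: 0.9 });
console.log(s.toFixed(2)); // "0.91"
```

Because the weights sum to 1.0, each metric reads directly as its share of the overall score.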
## CLI flags

| Flag | Description |
|------|-------------|
| `-c, --config <path>` | Config file (default: `rag-eval.config.json`) |
| `-q, --questions <path>` | Eval-set JSONL (default: `eval-set.jsonl`) |
| `-j, --judge <provider>` | Override judge provider (`claude` \| `openai`) |
| `--no-judge` | Skip judge LLM — retrieval scoring only, no API costs |
| `-o, --output <dir>` | Output directory (default: `./rag-eval-output`) |
| `--threshold <number>` | Min overall score to exit 0 (default: `0.7`) |

## Output

Three artifacts per run:

1. **Console** — colored table with per-question scores + summary
2. **CSV** — `rag-eval-output/eval-{timestamp}.csv` — for spreadsheet analysis
3. **JSON** — `rag-eval-output/eval-{timestamp}.json` — for programmatic use

Exit codes:

| Code | Meaning |
|------|---------|
| `0` | Passed — avg overall score ≥ threshold |
| `1` | Failed — below threshold |
| `2` | Config / eval-set validation error |
| `3` | Unexpected error |

## CI integration

```yaml
# .github/workflows/rag-eval.yml
- name: Run RAG evaluation
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    npx @massiangelone/rag-eval run \
      -c rag-eval.config.json \
      -q eval-set.jsonl \
      --threshold 0.75
```

Fails the build if RAG quality regresses below threshold.

## Faithfulness scoring requires source text

For meaningful faithfulness scoring, your RAG endpoint must return the **text** of retrieved chunks, not just IDs. Configure `responsePaths.sourceContents` to point at the chunk text in your response.

Without `sourceContents`, the judge detects that context items are opaque IDs and returns `null` for faithfulness. The overall score weight re-normalizes across retrieval and correctness only — no artificial penalty.
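That re-normalization can be sketched as follows — illustrative arithmetic assuming the default weights; the package's exact implementation may differ:

```javascript
// Sketch: drop null metrics and re-normalize the remaining weights so they
// again sum to 1. Illustrative; assumes the default weights from above.
function overallWithNulls(scores, weights) {
  const present = Object.entries(weights).filter(([k]) => scores[k] != null);
  const totalWeight = present.reduce((sum, [, w]) => sum + w, 0);
  return present.reduce((sum, [k, w]) => sum + (w / totalWeight) * scores[k], 0);
}

const defaults = { retrieval: 0.4, faithfulness: 0.3, correctness: 0.3 };

// Faithfulness is null (no sourceContents): only retrieval and correctness
// count, with their weights scaled back up (0.4/0.7 and 0.3/0.7).
const score = overallWithNulls(
  { retrieval: 1.0, faithfulness: null, correctness: 0.7 },
  defaults
);
console.log(score.toFixed(2)); // "0.87"
```

Note that the result is higher than the naive sum 0.4 × 1.0 + 0.3 × 0.7 = 0.61 would be: the missing metric's weight is redistributed rather than counted as a zero.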
## Test the CLI locally

A mock RAG server and example eval-set are included:

```bash
node examples/mock-server.mjs &
rag-eval run \
  -c examples/rag-eval.config.json \
  -q examples/eval-set.example.jsonl \
  --judge openai \
  --threshold 0.7
```

## Status

- **OpenAI judge**: tested end-to-end against the real API
- **Claude judge**: implemented and structurally validated; full end-to-end testing pending Anthropic API credit
- **Retrieval scoring**: tested with mock and real endpoints
- **CSV/JSON output**: tested

## License

MIT © Massimiliano Angelone