@massiangelone/rag-eval 0.1.0-alpha.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Massimiliano Angelone
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,207 @@
+ # @massiangelone/rag-eval
+
+ [![npm](https://img.shields.io/npm/v/@massiangelone/rag-eval.svg)](https://www.npmjs.com/package/@massiangelone/rag-eval)
+ [![CI](https://github.com/maxange-developer/rag-eval-cli/actions/workflows/ci.yml/badge.svg)](https://github.com/maxange-developer/rag-eval-cli/actions/workflows/ci.yml)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+
+ > Evaluate RAG pipelines: retrieval precision, faithfulness, answer correctness. Config-driven CLI.
+
+ Building a RAG system is easy. Knowing if it's any good is hard. `rag-eval` runs your RAG endpoint against a labeled eval-set and reports three numbers:
+
+ - **Retrieval precision** — did your retriever find the right documents?
+ - **Faithfulness** — is the generated answer supported by the retrieved context?
+ - **Correctness** — does the answer match the expected ground truth?
+
+ Extracted from production work on a multi-tenant RAG SaaS. Provider-agnostic by design: judge with Claude or OpenAI.
+
+ > **Status: alpha (v0.1.0).** The config shape may change before v1.0; the CLI command surface is stable.
+
+ ## Install
+
+ ```bash
+ npx @massiangelone/rag-eval --help
+ ```
+
+ Or globally:
+
+ ```bash
+ npm install -g @massiangelone/rag-eval
+ rag-eval --help
+ ```
+
+ ## Quickstart
+
+ ### 1. Create your eval-set (JSONL)
+
+ Each line is one question with an expected source ID and an optional expected answer.
+
+ ```jsonl
+ {"id":"q1","question":"How do I reset SSO?","expected_source":"docs-sso-reset","expected_answer":"Settings → SSO → Reset"}
+ {"id":"q2","question":"Which webhook fires on downgrade?","expected_source":"docs-webhooks","expected_answer":"subscription.plan_changed with action=downgrade"}
+ ```
+
+ See [`examples/eval-set.example.jsonl`](examples/eval-set.example.jsonl) for a full B2B SaaS example.
+
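+ If you want to sanity-check an eval-set before spending judge tokens, here is a minimal validation sketch (field names as shown above; the CLI's own validation may differ):
+
+ ```ts
+ import { readFileSync } from "node:fs";
+
+ interface EvalCase {
+   id: string;
+   question: string;
+   expected_source: string;
+   expected_answer?: string; // optional, per the format above
+ }
+
+ // Parse the JSONL file and check the required fields on every line.
+ function loadEvalSet(path: string): EvalCase[] {
+   return readFileSync(path, "utf8")
+     .split("\n")
+     .filter((line) => line.trim() !== "")
+     .map((line, i) => {
+       const row = JSON.parse(line) as EvalCase;
+       for (const field of ["id", "question", "expected_source"] as const) {
+         if (typeof row[field] !== "string") {
+           throw new Error(`line ${i + 1}: missing required field "${field}"`);
+         }
+       }
+       return row;
+     });
+ }
+ ```
+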
+ ### 2. Create config (JSON)
+
+ ```json
+ {
+   "endpoint": {
+     "url": "http://localhost:3000/api/rag",
+     "method": "POST",
+     "headers": { "Content-Type": "application/json" },
+     "body": { "query": "{{question}}" },
+     "responsePaths": {
+       "answer": "answer",
+       "sources": "sources[].id",
+       "sourceContents": "sources[].content"
+     }
+   },
+   "judge": {
+     "provider": "claude",
+     "model": "claude-sonnet-4-6"
+   },
+   "scoring": {
+     "retrievalK": 5,
+     "weights": {
+       "retrieval": 0.4,
+       "faithfulness": 0.3,
+       "correctness": 0.3
+     }
+   }
+ }
+ ```
+
+ See [`examples/rag-eval.config.json`](examples/rag-eval.config.json) for the full reference config.
+
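+ At request time the CLI substitutes each eval case into the body template. A sketch of what the `{{question}}` / `{{id}}` substitution amounts to (illustrative, not the package's actual implementation):
+
+ ```ts
+ // Fill {{question}} / {{id}} placeholders anywhere inside the body template.
+ function renderBody(template: unknown, vars: { question: string; id: string }): unknown {
+   if (typeof template === "string") {
+     return template.replace(/\{\{(question|id)\}\}/g, (_, key) => vars[key as keyof typeof vars]);
+   }
+   if (Array.isArray(template)) return template.map((v) => renderBody(v, vars));
+   if (template && typeof template === "object") {
+     return Object.fromEntries(
+       Object.entries(template).map(([k, v]) => [k, renderBody(v, vars)])
+     );
+   }
+   return template;
+ }
+
+ // renderBody({ query: "{{question}}" }, { question: "How do I reset SSO?", id: "q1" })
+ // -> { query: "How do I reset SSO?" }
+ ```
+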
+ ### 3. Set your judge API key
+
+ ```bash
+ export ANTHROPIC_API_KEY=sk-ant-...
+ # or
+ export OPENAI_API_KEY=sk-...
+ ```
+
+ ### 4. Run
+
+ ```bash
+ rag-eval run -c rag-eval.config.json -q eval-set.jsonl --threshold 0.7
+ ```
+
+ Output: a colored console table, a CSV file, a JSON file, and an exit code of 0 or 1 based on the threshold (see the exit-code table below).
+
+ ## Configuration reference
+
+ ### `endpoint`
+
+ How `rag-eval` calls your RAG service.
+
+ | Field | Type | Default | Description |
+ |-------|------|---------|-------------|
+ | `url` | string | — | Your RAG endpoint URL |
+ | `method` | `GET` \| `POST` | `POST` | HTTP method |
+ | `headers` | object | — | Optional headers (auth, content-type) |
+ | `body` | object | — | Request body template. Use `{{question}}` and `{{id}}` as placeholders |
+ | `responsePaths.answer` | string | — | JSON path to the generated answer |
+ | `responsePaths.sources` | string | — | JSON path to retrieved source IDs (e.g. `sources[].id`) |
+ | `responsePaths.sourceContents` | string | — | Optional. JSON path to retrieved source **text**. Required for accurate faithfulness scoring. |
+ | `timeoutMs` | number | `30000` | Request timeout in ms |
+
+ ### Path syntax
+
+ - `a.b.c` — nested object access
+ - `a[].c` — array map: returns `c` from each element of `a`
+
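+ A sketch of how these two path forms can be resolved against a parsed JSON response (illustrative; the package's resolver may differ):
+
+ ```ts
+ // Resolve "a.b.c" (nested access) and "a[].c" (array map) paths.
+ function resolvePath(obj: any, path: string): unknown {
+   let current: any[] = [obj];
+   let mapped = false; // did we fan out over an array?
+   for (const part of path.split(".")) {
+     if (part.endsWith("[]")) {
+       mapped = true;
+       const key = part.slice(0, -2);
+       current = current.flatMap((o) => (Array.isArray(o?.[key]) ? o[key] : []));
+     } else {
+       current = current.map((o) => o?.[part]);
+     }
+   }
+   return mapped ? current : current[0];
+ }
+
+ // resolvePath(res, "answer")       -> res.answer
+ // resolvePath(res, "sources[].id") -> one id per element of res.sources
+ ```
+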
+ ### `judge`
+
+ | Field | Type | Default | Description |
+ |-------|------|---------|-------------|
+ | `provider` | `claude` \| `openai` | `claude` | LLM judge provider |
+ | `model` | string | provider-dependent | Specific model. Defaults: `claude-sonnet-4-6` (claude) or `gpt-4o-mini` (openai) |
+
+ Required env vars: `ANTHROPIC_API_KEY` (claude) or `OPENAI_API_KEY` (openai).
+
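+ For intuition, an LLM judge call is just a structured grading prompt. A heavily simplified sketch using the `openai` SDK (the package's real prompts and response parsing are internal and more robust):
+
+ ```ts
+ import OpenAI from "openai";
+
+ const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
+
+ // Grade faithfulness 0-1: how well is the answer supported by the context?
+ async function judgeFaithfulness(answer: string, context: string[]): Promise<number> {
+   const res = await client.chat.completions.create({
+     model: "gpt-4o-mini",
+     messages: [
+       {
+         role: "user",
+         content:
+           `Context:\n${context.join("\n---\n")}\n\nAnswer:\n${answer}\n\n` +
+           "How fully is the answer supported by the context? Reply with a single number from 0 to 1.",
+       },
+     ],
+   });
+   return Number(res.choices[0].message.content?.trim() ?? "0");
+ }
+ ```
+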
+ ### `scoring`
+
+ | Field | Type | Default | Description |
+ |-------|------|---------|-------------|
+ | `retrievalK` | number | `5` | Top-K sources to consider for retrieval precision |
+ | `weights.retrieval` | number | `0.4` | Weight in overall score |
+ | `weights.faithfulness` | number | `0.3` | Weight in overall score |
+ | `weights.correctness` | number | `0.3` | Weight in overall score |
+
+ Weights must sum to 1.0.
+
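+ The overall score is the weighted sum of the three metrics. A quick sketch, assuming each metric is already normalized to 0-1:
+
+ ```ts
+ type Scores = { retrieval: number; faithfulness: number; correctness: number };
+
+ // Overall score = weighted sum; weights are expected to sum to 1.0.
+ function overallScore(
+   s: Scores,
+   w: Scores = { retrieval: 0.4, faithfulness: 0.3, correctness: 0.3 }
+ ): number {
+   return w.retrieval * s.retrieval + w.faithfulness * s.faithfulness + w.correctness * s.correctness;
+ }
+
+ // overallScore({ retrieval: 1, faithfulness: 0.8, correctness: 0.5 })
+ // -> 0.4 * 1 + 0.3 * 0.8 + 0.3 * 0.5 = 0.79
+ ```
+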
+ ## CLI flags
+
+ | Flag | Description |
+ |------|-------------|
+ | `-c, --config <path>` | Config file (default: `rag-eval.config.json`) |
+ | `-q, --questions <path>` | Eval-set JSONL (default: `eval-set.jsonl`) |
+ | `-j, --judge <provider>` | Override judge provider (`claude` \| `openai`) |
+ | `--no-judge` | Skip the judge LLM — retrieval scoring only, no API costs |
+ | `-o, --output <dir>` | Output directory (default: `./rag-eval-output`) |
+ | `--threshold <number>` | Min overall score to exit 0 (default: `0.7`) |
+
+ ## Output
+
+ Three artifacts per run:
+
+ 1. **Console** — colored table with per-question scores + summary
+ 2. **CSV** — `rag-eval-output/eval-{timestamp}.csv` — for spreadsheet analysis
+ 3. **JSON** — `rag-eval-output/eval-{timestamp}.json` — for programmatic use
+
+ Exit codes:
+
+ | Code | Meaning |
+ |------|---------|
+ | `0` | Passed — avg overall score ≥ threshold |
+ | `1` | Failed — below threshold |
+ | `2` | Config / eval-set validation error |
+ | `3` | Unexpected error |
+
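+ Because pass/fail is signaled through the exit code, any wrapper script can branch on it. A sketch using Node's `child_process` (a hypothetical wrapper, not shipped with the package):
+
+ ```ts
+ import { spawnSync } from "node:child_process";
+
+ // Run the CLI and map its exit code to the meanings in the table above.
+ const result = spawnSync("npx", ["@massiangelone/rag-eval", "run", "--threshold", "0.7"], {
+   stdio: "inherit",
+ });
+
+ switch (result.status) {
+   case 0:  console.log("RAG eval passed"); break;
+   case 1:  console.error("RAG eval below threshold"); process.exit(1);
+   case 2:  console.error("Invalid config or eval-set"); process.exit(2);
+   default: console.error("Unexpected rag-eval error"); process.exit(result.status ?? 3);
+ }
+ ```
+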
+ ## CI integration
+
+ ```yaml
+ # .github/workflows/rag-eval.yml
+ - name: Run RAG evaluation
+   env:
+     OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+   run: |
+     npx @massiangelone/rag-eval run \
+       -c rag-eval.config.json \
+       -q eval-set.jsonl \
+       --threshold 0.75
+ ```
+
+ Fails the build if RAG quality regresses below the threshold.
+
+ ## Faithfulness scoring requires source text
+
+ For meaningful faithfulness scoring, your RAG endpoint must return the **text** of retrieved chunks, not just their IDs. Configure `responsePaths.sourceContents` to point at the chunk text in your response.
+
+ Without `sourceContents`, the judge detects that the context items are opaque IDs and returns `null` for faithfulness. The overall score then re-normalizes its weights across retrieval and correctness only — no artificial penalty.
+
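+ A sketch of what that re-normalization amounts to (illustrative; mirrors the description above):
+
+ ```ts
+ type Weights = { retrieval: number; faithfulness: number; correctness: number };
+
+ // With faithfulness null, its weight is redistributed over the other metrics.
+ function overallWithRenorm(
+   s: { retrieval: number; faithfulness: number | null; correctness: number },
+   w: Weights = { retrieval: 0.4, faithfulness: 0.3, correctness: 0.3 }
+ ): number {
+   if (s.faithfulness === null) {
+     const remaining = w.retrieval + w.correctness; // 0.7 with default weights
+     return (w.retrieval / remaining) * s.retrieval + (w.correctness / remaining) * s.correctness;
+   }
+   return w.retrieval * s.retrieval + w.faithfulness * s.faithfulness + w.correctness * s.correctness;
+ }
+ ```
+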
+ ## Test the CLI locally
+
+ A mock RAG server and an example eval-set are included:
+
+ ```bash
+ node examples/mock-server.mjs &
+ rag-eval run \
+   -c examples/rag-eval.config.json \
+   -q examples/eval-set.example.jsonl \
+   --judge openai \
+   --threshold 0.7
+ ```
+
+ ## Status
+
+ - **OpenAI judge**: tested end-to-end against the real API
+ - **Claude judge**: implemented and structurally validated; full end-to-end testing pending Anthropic API credits
+ - **Retrieval scoring**: tested with mock and real endpoints
+ - **CSV/JSON output**: tested
+
+ ## License
+
+ MIT © Massimiliano Angelone