evalseed 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,31 @@
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
dist/
*.egg-info/
*.egg
.eggs/
.venv/
venv/
env/
.env
.env.local
.pytest_cache/
.mypy_cache/
.ruff_cache/
.coverage
htmlcov/
.tox/
.idea/
.vscode/
*.swp
.DS_Store
node_modules/
.ipynb_checkpoints/

# local scratch
try_it.py
*.jsonl
evalseed-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Purnendu Das

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,480 @@
Metadata-Version: 2.4
Name: evalseed
Version: 0.1.0
Summary: Generate quality-filtered synthetic Q&A datasets for RAG evaluation.
Project-URL: Homepage, https://github.com/purnendu-das/evalseed
Project-URL: Repository, https://github.com/purnendu-das/evalseed
Project-URL: Issues, https://github.com/purnendu-das/evalseed/issues
Author: Purnendu Das
License: MIT License

Copyright (c) 2026 Purnendu Das

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
License-File: LICENSE
Keywords: dataset-generation,evaluation,golden-dataset,llm,question-generation,rag,rag-evaluation,synthetic-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: openai>=1.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: tiktoken>=0.5
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# evalseed

**Generate RAG evaluation datasets you'd actually trust.**

evalseed produces synthetic question-answer pairs from your documents and **filters out the bad ones** — the unfaithful, ambiguous, or trivial pairs that quietly poison most RAG benchmarks.

```python
from evalseed import Pipeline, OpenAIJudge

pipeline = Pipeline(
    judge=OpenAIJudge(model="gpt-4o-mini"),
    n_pairs=100,
    types=["single_hop", "multi_hop", "distractor"],
)
dataset = pipeline.generate_from_corpus("./docs/")
dataset.save("eval.jsonl")
```

---

## Table of contents

- [What problem does this solve?](#what-problem-does-this-solve)
- [Core concepts (read this first)](#core-concepts-read-this-first)
- [Install](#install)
- [Your first run, step by step](#your-first-run-step-by-step)
- [Question types explained](#question-types-explained)
- [How filtering works](#how-filtering-works)
- [The CLI](#the-cli)
- [The Python library](#the-python-library)
- [Inspecting and auditing results](#inspecting-and-auditing-results)
- [Bring your own pieces](#bring-your-own-pieces)
- [Project layout](#project-layout)
- [Status](#status)
- [Contributing](#contributing)
- [License](#license)

---

## What problem does this solve?

If you're building a **RAG app** (retrieval-augmented generation — a chatbot that answers questions from your documents), you eventually need to answer: *is it actually working?*

To measure that, you need an **evaluation dataset**: a list of questions, the correct answers, and the document chunks the answers came from. Then you run your RAG app against the questions and check whether it returned the right answer.

You have three ways to build that dataset, and each one is bad in its own way:

1. **Hand-write 200 Q&A pairs.** High quality, but takes days. Nobody actually does this.
2. **Auto-generate them with an LLM.** Fast, but ~30–50% of generated pairs are garbage:
   - The "answer" isn't actually supported by the cited context (**unfaithful**).
   - The question is ambiguous or refers to "the above text" (**not self-contained**).
   - The question is just a copy of a sentence from the source (**trivial**).

   Your benchmark looks great until you realize you're measuring noise.
3. **Auto-generate, then filter.** Fast *and* trustworthy. ← this is evalseed.

evalseed runs each generated pair through cheap pattern-based pre-filters, then four LLM-judge filters, and rejects anything that fails. Every rejection comes with a structured `reason` so you can audit it.

---

## Core concepts (read this first)

If you're new to RAG evaluation, four terms will appear constantly in the docs and the code. Learn them once here.

### Corpus
The pile of documents you want to evaluate against. Today evalseed reads `.txt` and `.md` files from a folder. PDFs / HTML / DOCX are deliberately out of scope for v0.1.

### Chunk
A corpus document is split into smaller, paragraph-aware **chunks** (default ~1500 characters with 150-char overlap) before generation. The LLM generates Q&A pairs from one chunk at a time, so each pair has a clear, narrow source of truth — which is what makes "is this answer supported by the context?" a question you can actually check.
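To make the overlap concrete, here is a minimal sketch of paragraph-aware chunking with a character overlap. It illustrates the idea, not evalseed's actual implementation (see `src/evalseed/corpus.py` for that); `chunk_text` is a hypothetical helper:

```python
def chunk_text(text: str, chunk_chars: int = 1500, overlap: int = 150) -> list[str]:
    """Greedy paragraph-aware chunking: pack whole paragraphs up to
    ~chunk_chars, then start the next chunk with the tail of the
    previous one so facts near a boundary appear in both chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > chunk_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry the overlap into the next chunk
        current = (current + "\n\n" + para).strip() if current else para
    if current:
        chunks.append(current)
    return chunks
```

The payoff of the overlap: a fact sitting right at a chunk boundary also appears at the start of the next chunk, so a question about it still has one clean supporting chunk.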

### QA pair
The unit of an evaluation dataset. A `QAPair` is a `question`, an `answer`, the `context` (the chunk it was generated from), a `qa_type` (single-hop / multi-hop / distractor), an optional `difficulty`, and a list of `filter_results` (one per filter that ran). See [src/evalseed/schemas.py:32](src/evalseed/schemas.py#L32).

### Judge
An LLM that is asked structured yes/no questions about a pair: "is this answer supported by this context?", "is this question self-contained?", etc. evalseed's filters call the judge; you call the filters. Today the only built-in judge is `OpenAIJudge`. The `Judge` interface is one method (`judge(system, user) -> dict`), so swapping in another provider later is intentionally easy.

---

## Install

```bash
git clone https://github.com/purnendu-das/evalseed
cd evalseed
pip install -e .
```

Then set your OpenAI key:

```bash
# macOS / Linux
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY = "sk-..."
```

PyPI release is gated on the Phase 0 validation spike — see [Status](#status).

---

## Your first run, step by step

Total time: ~3 minutes. Cost on `gpt-4o-mini`: a few cents.

**1. Make a tiny corpus.** Create a folder with one or two `.txt` or `.md` files in it. A few paragraphs each is plenty for a smoke test.

```
sample_corpus/
├── about.md
└── faq.txt
```

**2. Run the CLI.** Start small so you don't burn tokens before you know the wiring is right.

```bash
evalseed ./sample_corpus/ -o eval.jsonl -n 10 --types single_hop,multi_hop
```

You should see live progress: how many docs and chunks were loaded, how many pairs were generated, and a summary table of how many were rejected by each filter.

**3. Look at what came out.**

- `eval.jsonl` — one line per **passed** Q&A pair. Each line is JSON.
- If you also pass `--all`, rejected pairs are saved with the failing filter's `reason`. Read these to judge whether the filters are doing the right thing on *your* documents.

**4. Now scale up.** Once a small run looks sane, bump `-n` to 100 or 200 for a real eval set.

---

## Question types explained

The `--types` flag (or `types=...` in Python) controls **what kind of reasoning each generated question requires**. A "hop" is one retrieval / reasoning step. This is the single most important knob.

### `single_hop`
Answerable from **one chunk, one fact**. No combining, no chaining.

> *Context:* "France is a country in Western Europe. Its capital is Paris."
> *Q:* "What is the capital of France?"
> *A:* "Paris."

This tests whether your retriever finds the right chunk and your model reads it correctly. It's the easy baseline — most RAG systems pass single-hop and fail the rest.

### `multi_hop`
Requires **combining facts from two or more chunks** to answer. The retriever has to find all of them; the model has to join them.

> *Chunk A:* "Marie Curie was born in Warsaw in 1867."
> *Chunk B:* "Warsaw is the capital of Poland."
> *Q:* "In which country was Marie Curie born?"
> *A:* "Poland."

This tests retrieval recall (do you fetch *both* chunks?) and reasoning (does the model actually join A+B instead of guessing?). **This is where most RAG systems quietly fail** — and where having a good eval set actually matters.

### `distractor`
Bundles **relevant chunks together with irrelevant-but-similar-looking ones**. The model has to ignore the distractors.

> Retrieve a chunk about *Paris, France* AND a chunk about *Paris, Texas*.
> *Q:* "What is the population of the capital of France?"
> The model must not be fooled by the Texas chunk.

This tests robustness — whether a noisy retriever (the realistic case) breaks the answer. Off by default because generating good distractors is more expensive; opt in with `--types single_hop,multi_hop,distractor`.

Defined in code at [src/evalseed/schemas.py:10-13](src/evalseed/schemas.py#L10-L13).

---

## How filtering works

Every generated pair runs through this gauntlet **in order**. The first failing filter short-circuits the rest for that pair (because the pair is already dead, and judge calls cost money).

| # | Stage | What it catches | Why it matters |
|---|---|---|---|
| 1 | `LengthPreFilter` | Pairs with implausibly short or long Q or A | Cheap regex/len check before any LLM cost |
| 2 | `RegexPreFilter` | Meta-questions ("what does the above text say…"), refusals, yes/no fragments | Same — cheap, kills obvious junk |
| 3 | `FaithfulnessFilter` | Answer **not entailed** by the cited context | Catches hallucinated answers — the #1 garbage source |
| 4 | `AnswerabilityFilter` | Ambiguous, externally-dependent, or non-self-contained questions | A question like "what year was it released?" is unusable without context |
| 5 | `TrivialityFilter` | Verbatim restatements of a source sentence (n-gram check + judge) | If Q is just "the source sentence with a question mark", you're testing string match, not RAG |
| 6 | `DifficultyFilter` | Labeled difficulty disagrees with the judge's assessment | Keeps the easy/medium/hard split honest |

Stages 1–2 are **PreFilters** (regex/length, no LLM call). Stages 3–6 are **Filters** (each makes one LLM call to the judge).

Every filter result lands on the pair as a `FilterResult` with a structured `reason`, so you can audit a rejection rather than trust the count blindly.

---

## The CLI

```
evalseed CORPUS [-o OUT] [-n N_PAIRS] [--types T1,T2,...] [--model MODEL] [--seed SEED] [--all]
```

| Flag | Default | Purpose |
|---|---|---|
| `CORPUS` (positional) | required | File or directory of `.txt`/`.md` to generate from |
| `-o, --out` | `eval.jsonl` | Where to write the dataset |
| `-n, --n-pairs` | `50` | Target number of pairs to generate |
| `--types` | `single_hop,multi_hop` | Comma-separated QA types: `single_hop`, `multi_hop`, `distractor` |
| `--model` | `gpt-4o-mini` | OpenAI model id used by both generator and judge |
| `--seed` | none | Set for deterministic generation (useful for tests / reproducible runs) |
| `--all` | off | Save rejected pairs too, with their `reason` field — use this to audit |

Example — generate 200 pairs of all three types from a docs folder, with a fixed seed, and keep rejected pairs for inspection:

```bash
evalseed ./docs/ \
  -o eval.jsonl \
  -n 200 \
  --types single_hop,multi_hop,distractor \
  --seed 42 \
  --all
```

CLI source: [src/evalseed/cli.py](src/evalseed/cli.py).

---

## The Python library

The CLI is a thin wrapper. Driving evalseed from Python gives you more control: stats, custom filter sets, and the ability to filter pre-generated pairs.

### Minimal example

```python
from evalseed import Pipeline, OpenAIJudge

pipeline = Pipeline(
    judge=OpenAIJudge(model="gpt-4o-mini"),
    n_pairs=50,
    types=["single_hop", "multi_hop"],
    seed=42,
)
dataset = pipeline.generate_from_corpus("./docs/")

print(dataset.stats())
# {'total': 60, 'passed': 41, 'rejected': 19, 'pass_rate': 0.68,
#  'rejections_by_filter': {'faithfulness': 8, 'answerability': 6, ...}}

dataset.save("eval.jsonl")                                  # passed pairs only
dataset.rejected.save("rejected.jsonl", only_passed=False)  # for inspection
```

### What `Pipeline` accepts

The constructor (full signature at [src/evalseed/pipeline.py:37](src/evalseed/pipeline.py#L37)):

| Arg | Default | What it does |
|---|---|---|
| `judge` | required | The `Judge` instance the filters will call |
| `n_pairs` | `50` | Target pair count |
| `types` | `(SINGLE_HOP, MULTI_HOP)` | Which QA types to generate |
| `pairs_per_chunk` | `2` | How many pairs the generator tries to make per chunk |
| `chunk_chars` | `1500` | Target chunk size when splitting documents |
| `chunk_overlap` | `150` | Char overlap between adjacent chunks (preserves context across boundaries) |
| `prefilters` | `[LengthPreFilter, RegexPreFilter]` | Override the cheap pre-filter list |
| `filters` | the four LLM filters | Override the LLM filter list (see [Bring your own pieces](#bring-your-own-pieces)) |
| `generator` | a default `QAGenerator` | Plug in a custom generator |
| `seed` | `None` | Deterministic generation |
| `verbose` | `True` | Pretty-print progress and a stats table at the end |

### Three ways to feed it data

```python
# 1. From a folder/file of .txt/.md
dataset = pipeline.generate_from_corpus("./docs/")

# 2. From pre-chunked text (you already have a chunker you like)
from evalseed.corpus import Chunk
chunks = [Chunk(text="...", source="my_doc.md"), ...]
dataset = pipeline.generate_from_chunks(chunks)

# 3. Skip generation entirely — just filter pairs you already have
from evalseed.generator import parse_pairs_jsonl
pairs = parse_pairs_jsonl("ragas_output.jsonl")
dataset = pipeline.filter_pairs(pairs)
```

Option 3 is useful if you already generated pairs with another tool (RAGAS, DeepEval, hand-written) and just want evalseed's filters as a quality gate.

---

## Inspecting and auditing results

The whole point of evalseed is that you can *trust* what comes out. That requires being able to look at it.

### Stats

```python
dataset.stats()
# {
#     'total': 60,
#     'passed': 41,
#     'rejected': 19,
#     'pass_rate': 0.68,
#     'rejections_by_filter': {
#         'faithfulness': 8,
#         'answerability': 6,
#         'triviality': 3,
#         'difficulty': 2,
#     }
# }
```

If `faithfulness` is rejecting half your pairs, your generator is hallucinating — try a stronger model or smaller chunks. If `triviality` is high, your corpus may be very fact-dense (encyclopedia-style) — that's a real signal.

### Read the rejections

```python
for pair in dataset.rejected:
    print(pair.question)
    for r in pair.filter_results:
        if not r.passed:
            print(f"  rejected by {r.filter_name}: {r.reason}")
```

Or just save them and grep:

```python
dataset.rejected.save("rejected.jsonl", only_passed=False)
```

Spend 10 minutes reading 20 rejected pairs from your first real run. You'll quickly see whether the filters are too strict, too lenient, or correct on your domain.

### One QA pair, on disk

```jsonc
{
  "id": "9f2c…",
  "question": "What does Section 12 of the IT Act 2000 require providers to do?",
  "answer": "Acknowledge electronic records they receive…",
  "context": "Section 12. Acknowledgement of receipt. — (1) Where the …",
  "qa_type": "single_hop",
  "difficulty": "medium",
  "source": "it_act_2000.md",
  "filter_results": [
    {"filter_name": "length_prefilter", "passed": true, "score": null, "reason": null},
    {"filter_name": "regex_prefilter", "passed": true, "score": null, "reason": null},
    {"filter_name": "faithfulness", "passed": true, "score": 0.95, "reason": null}
    // ...
  ]
}
```
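Because each line is independent JSON, you can consume `eval.jsonl` in your own eval harness without importing evalseed at all. A minimal reader, assuming only the fields shown above (`load_eval_jsonl` is a hypothetical helper, not part of the package):

```python
import json

def load_eval_jsonl(lines) -> list[dict]:
    """Parse an eval.jsonl stream into dicts, skipping blank lines.
    `lines` is any iterable of strings; an open file object works."""
    pairs = []
    for line in lines:
        line = line.strip()
        if line:
            pairs.append(json.loads(line))
    return pairs

# e.g.:  with open("eval.jsonl") as f: pairs = load_eval_jsonl(f)
```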

---

## Bring your own pieces

evalseed is a thin orchestration layer. You can swap any of these.

### Run only some filters

```python
from evalseed import Pipeline, OpenAIJudge
from evalseed.filters import FaithfulnessFilter, AnswerabilityFilter

judge = OpenAIJudge(model="gpt-4o-mini")
pipeline = Pipeline(
    judge=judge,
    n_pairs=50,
    filters=[  # only run two LLM stages
        FaithfulnessFilter(judge, threshold=0.8),
        AnswerabilityFilter(judge),
    ],
)
```

### Tighten or loosen a threshold

Most filters take a `threshold` argument. `FaithfulnessFilter(judge, threshold=0.9)` will reject more aggressively; `0.6` will be more permissive.

### Use evalseed only as a filter pass

If another tool (RAGAS, DeepEval, hand-written) generated the pairs, you can skip generation entirely — see option 3 in [Three ways to feed it data](#three-ways-to-feed-it-data).

### Plug in a different LLM provider

Implement the one-method `Judge` protocol ([src/evalseed/judges.py](src/evalseed/judges.py)) and pass your instance to `Pipeline(judge=...)`. A first-party multi-provider judge (Anthropic / Gemini / Bedrock / local) is deferred to v0.2 — see [CONTRIBUTING.md](CONTRIBUTING.md).
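As a shape reference, here is a minimal stand-in judge. It is an offline stub for illustration only: the method signature follows the `judge(system, user) -> dict` description above, but the response keys (`passed`, `score`, `reason`) are assumptions; check the filter code in `src/evalseed/filters/` for the real contract.

```python
class StubJudge:
    """Toy Judge: returns a fixed verdict for every prompt. A real
    implementation would send `system` and `user` to your provider's
    chat API and parse the structured response it returns."""

    def __init__(self, verdict: bool = True, score: float = 1.0):
        self.verdict = verdict
        self.score = score

    def judge(self, system: str, user: str) -> dict:
        # NOTE: illustrative response shape, not evalseed's documented schema.
        return {"passed": self.verdict, "score": self.score, "reason": None}
```

A fixed-verdict stub like this is also handy in tests: it lets you exercise the pipeline wiring without spending tokens.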

---

## Project layout

```
src/evalseed/
├── __init__.py        # public API: Pipeline, OpenAIJudge, Dataset, QAPair
├── pipeline.py        # orchestrates load → chunk → generate → filter
├── generator.py       # QAGenerator + JSONL loader for pre-generated pairs
├── corpus.py          # file loading + paragraph-aware chunking
├── dataset.py         # Dataset (iterable, sliceable, .stats(), .save())
├── judges.py          # Judge protocol + OpenAIJudge implementation
├── schemas.py         # pydantic models: QAPair, FilterResult, QAType, Difficulty
├── exceptions.py
├── cli.py             # `evalseed` console script
└── filters/
    ├── prefilters.py  # LengthPreFilter, RegexPreFilter (no LLM call)
    ├── faithfulness.py
    ├── answerability.py
    ├── triviality.py
    └── difficulty.py
```

---

## Status

Pre-release alpha. The Phase 0 validation spike (see [docs/phase0/](docs/phase0/)) has not been run yet — no `v0.1.0` PyPI release until the thesis is validated against real data.

<!-- TODO: fill in real numbers from Phase 0 spike before publishing -->
<!-- In benchmarks on <corpus type> documents, evalseed rejected X% of generated pairs that human reviewers also flagged as unusable. -->

### Comparison

<!-- TODO: fill in once Phase 0 benchmark is run -->

| | evalseed | RAGAS | DeepEval Synthesizer |
|--------------------------|----------|-------|----------------------|
| Multi-stage filtering | | | |
| Type labeling | | | |
| Pluggable judge | | | |

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). The scope of v0.1 is deliberately tight — please open an issue before working on multi-provider judges, async, PDF loaders, or framework integrations.

## License

MIT — see [LICENSE](LICENSE).