fitz-gov 1.1.0__tar.gz

Files changed (36)
  1. fitz_gov-1.1.0/.gitignore +42 -0
  2. fitz_gov-1.1.0/LICENSE +21 -0
  3. fitz_gov-1.1.0/PKG-INFO +366 -0
  4. fitz_gov-1.1.0/README.md +335 -0
  5. fitz_gov-1.1.0/data/abstention/abstention.json +542 -0
  6. fitz_gov-1.1.0/data/confidence/confidence.json +383 -0
  7. fitz_gov-1.1.0/data/corpus/documents.jsonl +288 -0
  8. fitz_gov-1.1.0/data/corpus/manifest.json +48 -0
  9. fitz_gov-1.1.0/data/dispute/dispute.json +569 -0
  10. fitz_gov-1.1.0/data/grounding/grounding.json +358 -0
  11. fitz_gov-1.1.0/data/qualification/qualification.json +575 -0
  12. fitz_gov-1.1.0/data/queries/query_mappings.json +1108 -0
  13. fitz_gov-1.1.0/data/relevance/relevance.json +358 -0
  14. fitz_gov-1.1.0/data/tier0_sanity/abstention.json +179 -0
  15. fitz_gov-1.1.0/data/tier0_sanity/confidence.json +141 -0
  16. fitz_gov-1.1.0/data/tier0_sanity/dispute.json +179 -0
  17. fitz_gov-1.1.0/data/tier0_sanity/grounding.json +223 -0
  18. fitz_gov-1.1.0/data/tier0_sanity/qualification.json +156 -0
  19. fitz_gov-1.1.0/data/tier0_sanity/relevance.json +220 -0
  20. fitz_gov-1.1.0/data/tier1_core/abstention.json +443 -0
  21. fitz_gov-1.1.0/data/tier1_core/confidence.json +401 -0
  22. fitz_gov-1.1.0/data/tier1_core/dispute.json +473 -0
  23. fitz_gov-1.1.0/data/tier1_core/grounding.json +480 -0
  24. fitz_gov-1.1.0/data/tier1_core/qualification.json +472 -0
  25. fitz_gov-1.1.0/data/tier1_core/relevance.json +455 -0
  26. fitz_gov-1.1.0/fitz_gov/__init__.py +96 -0
  27. fitz_gov-1.1.0/fitz_gov/bootstrap.py +342 -0
  28. fitz_gov-1.1.0/fitz_gov/cli.py +507 -0
  29. fitz_gov-1.1.0/fitz_gov/evaluator.py +589 -0
  30. fitz_gov-1.1.0/fitz_gov/generator.py +667 -0
  31. fitz_gov-1.1.0/fitz_gov/llm_validator.py +440 -0
  32. fitz_gov-1.1.0/fitz_gov/loader.py +417 -0
  33. fitz_gov-1.1.0/fitz_gov/models.py +517 -0
  34. fitz_gov-1.1.0/fitz_gov/schema.py +138 -0
  35. fitz_gov-1.1.0/fitz_gov/validate.py +281 -0
  36. fitz_gov-1.1.0/pyproject.toml +75 -0
fitz_gov-1.1.0/.gitignore ADDED
@@ -0,0 +1,42 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual environments
24
+ .venv/
25
+ venv/
26
+ ENV/
27
+
28
+ # IDE
29
+ .idea/
30
+ .vscode/
31
+ *.swp
32
+ *.swo
33
+
34
+ # Generated data (users generate their own)
35
+ generated_data/
36
+
37
+ # Release artifacts
38
+ *.zip
39
+
40
+ # OS
41
+ .DS_Store
42
+ Thumbs.db
fitz_gov-1.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Fitz AI
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
fitz_gov-1.1.0/PKG-INFO ADDED
@@ -0,0 +1,366 @@
1
+ Metadata-Version: 2.4
2
+ Name: fitz-gov
3
+ Version: 1.1.0
4
+ Summary: fitz-gov: Comprehensive RAG Governance Benchmark
5
+ Project-URL: Homepage, https://github.com/yafitzdev/fitz-gov
6
+ Project-URL: Documentation, https://github.com/yafitzdev/fitz-gov#readme
7
+ Project-URL: Repository, https://github.com/yafitzdev/fitz-gov
8
+ Project-URL: Issues, https://github.com/yafitzdev/fitz-gov/issues
9
+ Author-email: Fitz AI <dev@fitz.ai>
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ Keywords: benchmark,evaluation,governance,llm,rag
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Intended Audience :: Science/Research
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Requires-Python: >=3.10
23
+ Requires-Dist: httpx>=0.24.0
24
+ Provides-Extra: dev
25
+ Requires-Dist: black>=23.0.0; extra == 'dev'
26
+ Requires-Dist: isort>=5.12.0; extra == 'dev'
27
+ Requires-Dist: mypy>=1.0.0; extra == 'dev'
28
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
29
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
30
+ Description-Content-Type: text/markdown
31
+
32
+ # fitz-gov: Comprehensive RAG Governance Benchmark
33
+
34
+ fitz-gov is a benchmark for evaluating RAG system governance - the ability to know when to abstain, dispute, qualify, or confidently answer questions.
35
+
36
+ ## Why fitz-gov?
37
+
38
+ Most RAG benchmarks focus on retrieval quality (BEIR) or answer correctness (RAGAS). But real-world RAG systems need **epistemic honesty** - knowing what they don't know.
39
+
40
+ fitz-gov measures:
41
+
42
+ | Category | What it Tests | Maps to |
43
+ |----------|--------------|---------|
44
+ | **Abstention** | Refuses when context is insufficient | `ABSTAIN` mode |
45
+ | **Dispute** | Flags conflicting sources | `DISPUTED` mode |
46
+ | **Qualification** | Hedges uncertain claims | `QUALIFIED` mode |
47
+ | **Confidence** | Answers confidently when evidence is clear | `CONFIDENT` mode |
48
+ | **Grounding** | Answers are grounded in context (no hallucination) | Answer quality |
49
+ | **Relevance** | Answers address the actual question | Answer quality |
50
+
51
+ ## Installation
52
+
53
+ ```bash
54
+ pip install fitz-gov
55
+ ```
56
+
57
+ Or install from local path during development:
58
+
59
+ ```bash
60
+ pip install -e path/to/fitz-gov
61
+ ```
62
+
63
+ ## Quick Start
64
+
65
+ ### Tiered Evaluation (Recommended)
66
+
67
+ fitz-gov uses a two-tier evaluation system:
68
+ - **Tier 0 (Sanity)**: 60 easy cases with 95% pass threshold - gates Tier 1
69
+ - **Tier 1 (Core)**: 160 discriminative cases with gradient scoring
70
+
71
+ ```python
72
+ from fitz_gov import FitzGovEvaluator, load_tier, Tier, AnswerMode
73
+
74
+ # Load tiered cases
75
+ tier0_cases = load_tier(Tier.SANITY) # 60 cases
76
+ tier1_cases = load_tier(Tier.CORE) # 160 cases
77
+
78
+ # Your RAG system generates responses and modes for each tier
79
+ tier0_responses, tier0_modes = your_rag_system.evaluate(tier0_cases)
80
+ tier1_responses, tier1_modes = your_rag_system.evaluate(tier1_cases)
81
+
82
+ # Run tiered evaluation
83
+ evaluator = FitzGovEvaluator()
84
+ result = evaluator.evaluate_tiered(
85
+     tier0_cases, tier0_responses, tier0_modes,
86
+     tier1_cases, tier1_responses, tier1_modes,
87
+ )
88
+
89
+ print(result)
90
+ # fitz-gov Tiered Evaluation
91
+ # ==========================
92
+ #
93
+ # TIER 0 (Sanity Check): PASSED
94
+ # Threshold: 95% | Achieved: 98.3% (59/60)
95
+ #
96
+ # TIER 1 (Core Benchmark): 78.1%
97
+ # By Category:
98
+ # abstention: 26/30 (86.7%)
99
+ # dispute: 22/30 (73.3%)
100
+ # ...
101
+ #
102
+ # Summary: Tier 0 PASSED, Tier 1 Score: 78.1%
103
+ ```
104
+
105
+ ### With Fitz RAG Engine
106
+
107
+ ```python
108
+ from fitz_ai.evaluation.benchmarks import FitzGovBenchmark
109
+
110
+ # Create benchmark and evaluate your engine
111
+ benchmark = FitzGovBenchmark()
112
+ results = benchmark.evaluate(engine)
113
+
114
+ print(results)
115
+ ```
116
+
117
+ ### Standalone Usage (Any RAG System)
118
+
119
+ The `fitz-gov` package contains all evaluation logic, so any RAG system can be evaluated:
120
+
121
+ ```python
122
+ from fitz_gov import FitzGovEvaluator, load_cases, FitzGovCategory, AnswerMode
123
+
124
+ # Load test cases
125
+ cases = load_cases()
126
+
127
+ # Create evaluator
128
+ evaluator = FitzGovEvaluator()
129
+
130
+ # Evaluate your RAG system's responses
131
+ responses = []
132
+ modes = []
133
+
134
+ for case in cases:
135
+     # Your RAG system generates response
136
+     response = your_rag_system.query(case.query, case.contexts)
137
+     mode = your_rag_system.classify_mode(response)  # Your mode classification
138
+
139
+     responses.append(response)
140
+     modes.append(mode)
141
+
142
+ # Get comprehensive results
143
+ results = evaluator.evaluate_all(cases, responses, modes)
144
+ print(f"Overall accuracy: {results.overall_accuracy:.1%}")
145
+ ```
146
+
147
+ ### Evaluating Individual Cases
148
+
149
+ ```python
150
+ from fitz_gov import AnswerMode, FitzGovEvaluator, load_case_by_id
151
+
152
+ evaluator = FitzGovEvaluator()
153
+
154
+ # Load specific test case
155
+ case = load_case_by_id("abstain_001")
156
+
157
+ # Your system's response
158
+ response = "Based on the context provided, I cannot find information about..."
159
+ mode = AnswerMode.ABSTAIN
160
+
161
+ # Evaluate
162
+ result = evaluator.evaluate_case(case, response, mode)
163
+ print(f"Passed: {result.passed}")
164
+ print(f"Expected: {case.expected_mode.value}, Got: {mode.value}")
165
+ ```
166
+
167
+ ## Two-Pass Validation (Answer Quality Categories)
168
+
169
+ For grounding and relevance categories, fitz-gov uses **two-pass validation** to reduce false positives:
170
+
171
+ 1. **Regex pass**: Fast pattern matching catches obvious violations
172
+ 2. **LLM pass**: Semantic validation for flagged cases
173
+
174
+ ### Enable LLM Validation
175
+
176
+ ```python
177
+ from fitz_gov import FitzGovEvaluator
178
+
179
+ # Enable LLM validation with local Ollama
180
+ evaluator = FitzGovEvaluator(
181
+     llm_validation=True,
182
+     llm_model="qwen2.5:14b",  # or any Ollama model
183
+     llm_base_url="http://localhost:11434"
184
+ )
185
+
186
+ # Responses flagged by regex are sent to LLM for semantic check
187
+ results = evaluator.evaluate_all(cases, responses, modes)
188
+ ```
189
+
190
+ ### Validation Flow
191
+
192
+ ```
193
+ Response contains forbidden_claim pattern?
194
+     │
195
+     ├─ No → PASS (no hallucination detected)
196
+     │
197
+     └─ Yes → LLM validates: "Is this an actual hallucination?"
198
+             │
199
+             ├─ LLM says no (e.g., "no revenue mentioned") → PASS
200
+             │
201
+             └─ LLM says yes (fabricated specific value) → FAIL
202
+ ```
203
+
204
+ ### Caching
205
+
206
+ LLM validation results are cached for 7 days to speed up repeated evaluations:
207
+ - Cache location: `~/.cache/fitz_gov/`
208
+ - Automatic cache cleanup on startup
209
+
210
+ ## API Reference
211
+
212
+ ### Core Classes
213
+
214
+ ```python
215
+ from fitz_gov import (
216
+     # Evaluator
217
+     FitzGovEvaluator,
218
+
219
+     # Data loading
220
+     load_cases,
221
+     load_tier,
222
+     load_case_by_id,
223
+     get_category_info,
224
+     get_tier_info,
225
+     get_data_dir,
226
+     get_tier_dir,
227
+     Tier,
228
+
229
+     # Models
230
+     FitzGovCategory,
231
+     AnswerMode,
232
+     FitzGovCase,
233
+     FitzGovCaseResult,
234
+     FitzGovCategoryResult,
235
+     FitzGovConfusionMatrix,
236
+     FitzGovResult,
237
+
238
+     # Tiered Results
239
+     TieredResult,
240
+     Tier0Result,
241
+     Tier1Result,
242
+
243
+     # LLM Validation
244
+     OllamaValidator,
245
+     ValidatorConfig,
246
+     ValidationResult,
247
+ )
248
+ ```
249
+
250
+ ### FitzGovEvaluator
251
+
252
+ ```python
253
+ evaluator = FitzGovEvaluator(
254
+     llm_validation=False,     # Enable two-pass validation
255
+     llm_model="qwen2.5:14b",  # Ollama model for validation
256
+     llm_base_url="http://localhost:11434"
257
+ )
258
+
259
+ # Tiered evaluation (recommended)
260
+ result = evaluator.evaluate_tiered(
261
+     tier0_cases, tier0_responses, tier0_modes,
262
+     tier1_cases, tier1_responses, tier1_modes,
263
+     tier0_threshold=0.95,  # Default: 95%
264
+     gating_enabled=True,   # Skip Tier 1 if Tier 0 fails
265
+ )
266
+
267
+ # Flat evaluation (all cases together)
268
+ results = evaluator.evaluate_all(cases, responses, modes)
269
+
270
+ # Evaluate single case
271
+ result = evaluator.evaluate_case(case, response, mode)
272
+ ```
273
+
274
+ ### Loading Test Cases
275
+
276
+ ```python
277
+ # Load by tier (recommended)
278
+ tier0_cases = load_tier(Tier.SANITY) # 60 sanity cases
279
+ tier1_cases = load_tier(Tier.CORE) # 160 core cases
280
+
281
+ # Load all cases (220 total)
282
+ all_cases = load_cases()
283
+
284
+ # Load specific categories from a tier
285
+ abstention_tier0 = load_tier(Tier.SANITY, [FitzGovCategory.ABSTENTION])
286
+
287
+ # Load specific categories across all tiers
288
+ governance_cases = load_cases([
289
+     FitzGovCategory.ABSTENTION,
290
+     FitzGovCategory.DISPUTE,
291
+ ])
292
+
293
+ # Load single case by ID (IDs prefixed with t0_ or t1_)
294
+ case = load_case_by_id("t1_dispute_medium_005")
295
+ ```
296
+
297
+ ## Data Format
298
+
299
+ Test cases are organized in a tiered structure:
300
+
301
+ ```
302
+ data/
303
+ ├── tier0_sanity/ # 60 cases - baseline verification (95% threshold)
304
+ │ ├── abstention.json # 12 cases
305
+ │ ├── dispute.json # 12 cases
306
+ │ ├── qualification.json # 10 cases
307
+ │ ├── confidence.json # 10 cases
308
+ │ ├── grounding.json # 8 cases
309
+ │ └── relevance.json # 8 cases
310
+ ├── tier1_core/ # 160 cases - discriminative benchmark
311
+ │ ├── abstention.json # 30 cases
312
+ │ ├── dispute.json # 30 cases
313
+ │ ├── qualification.json # 30 cases
314
+ │ ├── confidence.json # 30 cases
315
+ │ ├── grounding.json # 20 cases
316
+ │ └── relevance.json # 20 cases
317
+ └── corpus/
318
+     └── documents.jsonl # 288 reference documents
319
+ ```
320
+
321
+ Each case has:
322
+
323
+ ```json
324
+ {
324
+   "id": "abstain_001",
325
+   "query": "What is the company's revenue for 2024?",
326
+   "contexts": ["The company was founded in 2010..."],
327
+   "expected_mode": "abstain",
328
+   "subcategory": "different_domain",
329
+   "difficulty": "medium",
330
+   "mode_rationale": "Context contains no financial data",
331
+   "evaluation_config": {
332
+     "forbidden_claims": ["\\$\\d"],
333
+     "allowed_phrases": ["not specified", "cannot find"]
334
+   }
335
+ }
337
+ ```
338
+
339
+ ## Version
340
+
341
+ Current version: **1.1.0**
342
+
343
+ See [CHANGELOG.md](CHANGELOG.md) for release history and [docs/roadmap](docs/roadmap/) for implementation details.
344
+
345
+ ## Architecture Note
346
+
347
+ fitz-gov is designed as a standalone package so that:
348
+
349
+ 1. **Any RAG system** can benchmark against the same test cases
350
+ 2. **Evaluation logic is consistent** - all systems get identical evaluation
351
+ 3. **Test data is versioned** - reproducible benchmarks across releases
352
+
353
+ For Fitz RAG engine integration, see `fitz_ai.evaluation.benchmarks.FitzGovBenchmark` which wraps this package.
354
+
355
+ ## Contributing
356
+
357
+ We welcome contributions! To add new test cases:
358
+
359
+ 1. Fork this repo
360
+ 2. Add cases to the appropriate `data/<category>/` directory
361
+ 3. Run validation: `python scripts/validate.py`
362
+ 4. Submit a PR
363
+
364
+ ## License
365
+
366
+ MIT License - see [LICENSE](LICENSE) for details.
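
The two-pass validation flow described in the README above (regex pass first, LLM pass only for flagged responses) can be illustrated with a minimal sketch. This is not the package's implementation: `regex_pass`, `two_pass_check`, and `stub_judge` are hypothetical names, the LLM is replaced by a plain function, and using `allowed_phrases` inside the stub is an assumption made only so the example runs offline. The sample case is the one shown in the README's "Data Format" section.

```python
import re
from typing import Callable

# Sample case copied from the README's "Data Format" section.
# The JSON string "\\$\\d" decodes to the regex \$\d (a dollar sign followed by a digit).
case = {
    "id": "abstain_001",
    "query": "What is the company's revenue for 2024?",
    "contexts": ["The company was founded in 2010..."],
    "expected_mode": "abstain",
    "evaluation_config": {
        "forbidden_claims": [r"\$\d"],
        "allowed_phrases": ["not specified", "cannot find"],
    },
}


def regex_pass(response: str, forbidden_claims: list[str]) -> list[str]:
    """Pass 1: return every forbidden pattern that matches the response."""
    return [pattern for pattern in forbidden_claims if re.search(pattern, response)]


def stub_judge(response: str, flagged: list[str]) -> bool:
    """Stand-in for the LLM pass (assumption, not the real validator).

    Returns True when the flagged text is treated as a hallucination.
    Here a flagged response is excused if it contains an allowed phrase.
    """
    allowed = case["evaluation_config"]["allowed_phrases"]
    return not any(phrase in response.lower() for phrase in allowed)


def two_pass_check(
    response: str,
    config: dict,
    llm_judge: Callable[[str, list[str]], bool],
) -> bool:
    """Hypothetical helper: True means the response passes validation."""
    flagged = regex_pass(response, config["forbidden_claims"])
    if not flagged:
        return True  # no pattern matched, so the second pass is skipped
    # Pass 2: only flagged responses reach the semantic judge.
    return not llm_judge(response, flagged)


config = case["evaluation_config"]
print(two_pass_check("I cannot find revenue figures in the context.", config, stub_judge))
print(two_pass_check("Revenue was $5 million in 2024.", config, stub_judge))
```

The design point the sketch tries to capture is that the cheap regex check decides whether the expensive semantic check runs at all, so unflagged responses never incur an LLM call.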