advanced-rag-framework 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (39)
  1. advanced_rag_framework-0.2.0/LICENSE +21 -0
  2. advanced_rag_framework-0.2.0/PKG-INFO +650 -0
  3. advanced_rag_framework-0.2.0/README.md +613 -0
  4. advanced_rag_framework-0.2.0/advanced_rag_framework.egg-info/PKG-INFO +650 -0
  5. advanced_rag_framework-0.2.0/advanced_rag_framework.egg-info/SOURCES.txt +37 -0
  6. advanced_rag_framework-0.2.0/advanced_rag_framework.egg-info/dependency_links.txt +1 -0
  7. advanced_rag_framework-0.2.0/advanced_rag_framework.egg-info/requires.txt +21 -0
  8. advanced_rag_framework-0.2.0/advanced_rag_framework.egg-info/top_level.txt +2 -0
  9. advanced_rag_framework-0.2.0/arf/__init__.py +56 -0
  10. advanced_rag_framework-0.2.0/arf/document.py +181 -0
  11. advanced_rag_framework-0.2.0/arf/features.py +393 -0
  12. advanced_rag_framework-0.2.0/arf/ingest.py +185 -0
  13. advanced_rag_framework-0.2.0/arf/pipeline.py +349 -0
  14. advanced_rag_framework-0.2.0/arf/query_graph.py +132 -0
  15. advanced_rag_framework-0.2.0/arf/score_parser.py +143 -0
  16. advanced_rag_framework-0.2.0/arf/trainer.py +206 -0
  17. advanced_rag_framework-0.2.0/arf/triage.py +237 -0
  18. advanced_rag_framework-0.2.0/pyproject.toml +71 -0
  19. advanced_rag_framework-0.2.0/rag_dependencies/__init__.py +0 -0
  20. advanced_rag_framework-0.2.0/rag_dependencies/ai_service.py +1166 -0
  21. advanced_rag_framework-0.2.0/rag_dependencies/alias_manager.py +424 -0
  22. advanced_rag_framework-0.2.0/rag_dependencies/feature_extractor.py +482 -0
  23. advanced_rag_framework-0.2.0/rag_dependencies/keyword_matcher.py +434 -0
  24. advanced_rag_framework-0.2.0/rag_dependencies/llm_verifier.py +237 -0
  25. advanced_rag_framework-0.2.0/rag_dependencies/mlp_reranker.py +332 -0
  26. advanced_rag_framework-0.2.0/rag_dependencies/mongo_manager.py +1580 -0
  27. advanced_rag_framework-0.2.0/rag_dependencies/openai_service.py +1174 -0
  28. advanced_rag_framework-0.2.0/rag_dependencies/query_manager.py +626 -0
  29. advanced_rag_framework-0.2.0/rag_dependencies/query_processor.py +2476 -0
  30. advanced_rag_framework-0.2.0/rag_dependencies/vector_search.py +672 -0
  31. advanced_rag_framework-0.2.0/setup.cfg +4 -0
  32. advanced_rag_framework-0.2.0/tests/test_config_schema.py +117 -0
  33. advanced_rag_framework-0.2.0/tests/test_cost_tracker.py +55 -0
  34. advanced_rag_framework-0.2.0/tests/test_data.py +234 -0
  35. advanced_rag_framework-0.2.0/tests/test_feature_extractor.py +585 -0
  36. advanced_rag_framework-0.2.0/tests/test_integration.py +186 -0
  37. advanced_rag_framework-0.2.0/tests/test_keyword_matcher.py +282 -0
  38. advanced_rag_framework-0.2.0/tests/test_metrics.py +86 -0
  39. advanced_rag_framework-0.2.0/tests/test_query_processor.py +544 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 MicelyTech | Yuto Mori

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,650 @@
Metadata-Version: 2.4
Name: advanced-rag-framework
Version: 0.2.0
Summary: Advanced Retrieval Framework — dependency-free retrieval pipeline toolkit
License-Expression: MIT
Project-URL: Homepage, https://github.com/jager47X/ARF
Project-URL: Repository, https://github.com/jager47X/ARF
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: ml
Requires-Dist: numpy>=1.24; extra == "ml"
Requires-Dist: scikit-learn>=1.3; extra == "ml"
Requires-Dist: joblib>=1.3; extra == "ml"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: numpy>=1.24; extra == "dev"
Provides-Extra: full
Requires-Dist: numpy>=1.24; extra == "full"
Requires-Dist: openai>=1.0; extra == "full"
Requires-Dist: pymongo>=4.0; extra == "full"
Requires-Dist: python-dotenv>=1.0; extra == "full"
Requires-Dist: tiktoken>=0.5; extra == "full"
Requires-Dist: nltk>=3.8; extra == "full"
Requires-Dist: pydantic>=2.0; extra == "full"
Requires-Dist: voyageai>=0.3; extra == "full"
Dynamic: license-file

# ARF - Advanced Retrieval Framework

[![CI](https://github.com/jager47X/ARF/actions/workflows/ci.yml/badge.svg)](https://github.com/jager47X/ARF/actions/workflows/ci.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![MongoDB Atlas](https://img.shields.io/badge/MongoDB-Atlas-47A248?logo=mongodb&logoColor=white)](https://www.mongodb.com/atlas)
[![Voyage AI](https://img.shields.io/badge/Embeddings-Voyage--3--large-purple)](https://www.voyageai.com/)
[![OpenAI](https://img.shields.io/badge/LLM-OpenAI-412991?logo=openai&logoColor=white)](https://openai.com/)
[![Code style: Ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

**ARF** (Advanced Retrieval Framework) is a production-ready RAG system built on the R-Flow pipeline to minimize cost and hallucination. It is optimized for legal document search and analysis across multiple domains.

## Table of Contents

- [Summary](#summary)
- [Live Demo](#-live-demo)
- [Overview](#overview)
- [Architecture](#architecture)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [MLP Reranker](#mlp-reranker)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Components](#components)
- [Development](#development)
- [Contributing](#contributing)

### Summary

**What makes ARF different from other RAG systems:**

Most RAG pipelines rely on expensive LLM calls to rerank and verify retrieval results. ARF proves this is unnecessary. A lightweight **MLP reranker** (128-64-32 neurons, <5ms, $0.00/query) trained on domain-specific features **outperforms LLM-based reranking** (GPT-4o, ~500ms, $0.004/query) by a wide margin.

**Key innovations:**
- **Learned retrieval > LLM reranking** — A small MLP trained on 3,600 labeled pairs achieves +40% MRR over LLM verification, at zero cost. The MLP sees the entire candidate distribution and learns cross-feature relevance patterns that per-document LLM scoring cannot capture.
- **R-Flow pipeline** — Each stage filters candidates so the next stage does less work:
  1. **Keyword matching** — Structured pattern detection (e.g., "Article I Section 8", "14th Amendment") maps directly to known documents, bypassing semantic search entirely for exact references.
  2. **Threshold gates (ABC Gates)** — Score-based routing: `≥ 0.85` → accept immediately, `< 0.70` → reject immediately, `0.70–0.85` → pass to next stage. Eliminates ~60% of candidates without any LLM call.
  3. **MLP Reranker** — A 128-64-32 MLP trained on 3,600 labeled pairs scores borderline candidates in <5ms at $0.00. Confidently accepts (p ≥ 0.6) or rejects (p ≤ 0.4) ~80% of remaining candidates.
  4. **LLM Fallback** — Only the ~20% of candidates where the MLP is uncertain (0.4 < p < 0.6) go to the LLM verifier. This is the only stage that costs money, and it handles the smallest batch.
- **Domain-specific thresholds** — Each legal domain (US Constitution, CFR, US Code, USCIS Policy) has independently tuned thresholds and bias maps, avoiding one-size-fits-all degradation.
- **Aggressive caching** — In-memory + MongoDB caching makes repeated/similar queries cost $0.00 with **335ms latency** (faster than raw MongoDB Atlas at 410ms). Cost stays flat as query volume grows.
- **Automated retraining** — Monthly pipeline exports new LLM judgments from production, retrains the MLP, and only deploys if performance improves.

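The ABC gate routing above can be sketched in a few lines. This is an illustration only: the 0.85/0.70 band is the example band from the text, while the production values are tuned per domain.

```python
def abc_gate(score: float, accept: float = 0.85, reject: float = 0.70) -> str:
    """Route a candidate by its retrieval score.

    Scores at or above `accept` are taken immediately, scores below
    `reject` are dropped, and everything in between is handed to the
    next stage (MLP reranker, then LLM fallback if still uncertain).
    """
    if score >= accept:
        return "accept"
    if score < reject:
        return "reject"
    return "next_stage"

# Only the middle band does any further work; the rest is resolved for free.
routes = [abc_gate(s) for s in (0.91, 0.82, 0.74, 0.55)]
```

Because acceptance and rejection are pure comparisons, this stage adds essentially no latency and no API cost.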
## 🚀 Live Demo

**Experience ARF in action:** [KnowYourRights.ai](https://knowyourrights-ai.com)

![KnowYourRights.ai Demo](media/demo_en.png)

*KnowYourRights.ai - AI-powered legal rights search and case intake platform powered by ARF*

## Overview

ARF is a production-ready RAG framework built for legal document retrieval across 6 domains: US Constitution, US Code, Code of Federal Regulations, USCIS Policy Manual, Supreme Court Cases, and Client Cases.

### Core Capabilities

- **Multi-Strategy Retrieval** — Semantic vector search (Voyage-3-large, 1024d) + keyword matching + alias search + exact patterns, combined per domain
- **MLP Reranker** — Learned second-stage reranker that outperforms LLM verification (MRR 0.933 vs 0.665) at zero cost
- **R-Flow Pipeline** — Multi-stage filtering eliminates unnecessary computation: only ~20% of candidates reach the LLM
- **Domain-Specific Tuning** — Each domain has independent thresholds, bias maps, and field mappings
- **Aggressive Caching** — Embedding, result, and summary caching; repeated queries cost $0.00
- **Bilingual Support** — English/Spanish query processing and response generation
- **Automated Retraining** — Monthly MLP retraining from production LLM judgments

### Supported Domains

| Domain | Collection | Features |
|--------|-----------|----------|
| US Constitution | `us_constitution` | Alias search, keyword matching, structured articles/sections |
| US Code | `us_code` | Large-scale (54 titles), clause-level search |
| Code of Federal Regulations | `code_of_federal_regulations` | Hierarchical part/chapter/section, section-level search |
| USCIS Policy Manual | `uscis_policy` | Automatic weekly updates, CFR reference tracking |
| Supreme Court Cases | `supreme_court_cases` | Case-to-constitutional provision mapping |
| Client Cases | `client_cases` | SQL-based private case search |

## Architecture

### System Components

```
ARF/
├── RAG_interface.py            # Main orchestrator class
├── config.py                   # Configuration and collection definitions
├── rag_dependencies/           # Core RAG components
│   ├── mongo_manager.py        # MongoDB connection and query management
│   ├── vector_search.py        # MongoDB Atlas Vector Search implementation
│   ├── query_manager.py        # Query processing and normalization
│   ├── query_processor.py      # End-to-end query pipeline
│   ├── alias_manager.py        # Alias/keyword search for US Constitution
│   ├── keyword_matcher.py      # Structured keyword matching
│   ├── llm_verifier.py         # LLM-based result reranking
│   ├── mlp_reranker.py         # MLP-based learned reranker (cost optimizer)
│   ├── feature_extractor.py    # Feature engineering for MLP reranker
│   ├── openai_service.py       # OpenAI API integration
│   └── ai_service.py           # AI service abstraction
├── models/                     # Trained ML models
│   └── mlp_reranker.joblib     # Trained MLP reranker model
├── benchmarks/                 # Evaluation and benchmarking
│   ├── run_eval.py             # Full evaluation runner
│   ├── run_baseline.py         # Baseline measurement (before MLP)
│   ├── run_ablation_full.py    # Full benchmark (7 strategies)
│   ├── run_benchmark.py        # Basic strategy comparison
│   ├── train_reranker.py       # MLP training pipeline
│   ├── retrain_monthly.py      # Automated monthly retraining
│   ├── cost_comparison.py      # Cost savings analysis
│   ├── metrics.py              # Retrieval metrics (P@k, R@k, MRR, NDCG)
│   ├── cost_tracker.py         # Query cost tracking
│   ├── hallucination_eval.py   # Faithfulness evaluation
│   ├── benchmark_queries.json  # Benchmark query dataset
│   └── eval_dataset.json       # Labeled evaluation dataset (200+ queries)
└── preprocess/                 # Data ingestion scripts
    ├── us_constitution/        # US Constitution ingestion
    ├── us_code/                # US Code ingestion
    ├── cfr/                    # CFR ingestion
    ├── uscis_policy_manual/    # USCIS Policy Manual ingestion
    ├── supreme_court_cases/    # Supreme Court cases ingestion
    └── [other sources]/        # Additional data sources
```

### Query Processing Flow

![Query Processing Flow](media/mermaid.png)

#### Pipeline Stages

1. **Query Input** — Normalize, detect language (en/es), generate embedding
2. **Cache Check** — Return cached results if available (zero API calls)
3. **Multi-Strategy Search** — Semantic vector search + alias/keyword matching
4. **MLP Reranking** — Feature extraction (15 features) + MLP scoring + blended reranking
5. **LLM Fallback** — Only for MLP-uncertain candidates (~20%)
6. **Summary & Cache** — Generate bilingual summary, cache for reuse

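Stage 2 is why repeat queries cost $0.00. A minimal sketch of the idea, assuming a dict-level in-memory layer; the real pipeline also persists results to MongoDB, and the class and key scheme here are illustrative only:

```python
class QueryCache:
    """Illustrative in-memory cache: normalized (language, query) -> results."""

    def __init__(self) -> None:
        self._store: dict[str, list] = {}

    def _key(self, query: str, language: str) -> str:
        # Normalize so trivially different phrasings of the same query hit.
        return f"{language}:{query.strip().lower()}"

    def get(self, query: str, language: str = "en"):
        return self._store.get(self._key(query, language))

    def put(self, query: str, results: list, language: str = "en") -> None:
        self._store[self._key(query, language)] = results
```

A cache hit short-circuits embedding, search, reranking, and verification, which is how the cached path reaches 335 ms with zero API calls.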
## Installation

### Prerequisites

- Python 3.10+
- MongoDB Atlas account with vector search enabled
- OpenAI API key
- Voyage AI API key

### Setup

1. **Clone the repository**:

   ```bash
   git clone <repository-url>
   cd arf
   ```

2. **Install dependencies**:

   ```bash
   pip install -e ".[dev]"
   ```

3. **Configure environment variables**:
   Create a `.env` file (or `.env.local`, `.env.dev`, `.env.production`) with:

   ```env
   OPENAI_API_KEY=your_openai_api_key
   VOYAGE_API_KEY=your_voyage_api_key
   MONGO_URI=your_mongodb_atlas_connection_string
   ```

4. **Set up MongoDB Atlas**:
   - Create vector search indexes on your collections
   - Index name: `vector_index` (default)
   - Vector field: `embedding`
   - Dimensions: 1024

## Configuration

### Collection Configuration

Collections are defined in `config.py` with domain-specific settings:

```python
COLLECTION = {
    "US_CONSTITUTION_SET": {
        "db_name": "public",
        "main_collection_name": "us_constitution",
        "document_type": "US Constitution",
        "use_alias_search": True,
        "use_keyword_matcher": True,
        "thresholds": DOMAIN_THRESHOLDS["us_constitution"],
        # ... additional settings
    },
    # ... other collections
}
```

### Domain-Specific Thresholds

Each domain has optimized thresholds for:
- `query_search`: Initial semantic search threshold
- `alias_search`: Alias matching threshold
- `RAG_SEARCH_min`: Minimum score to continue processing
- `LLM_VERIFication`: Threshold for LLM reranking
- `RAG_SEARCH`: High-confidence result threshold
- `confident`: Threshold for saving summaries
- `FILTER_GAP`: Maximum score gap between results
- `LLM_SCORE`: LLM reranking score adjustment

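For illustration, a `DOMAIN_THRESHOLDS` entry might look like the following. The keys mirror the list above, but every numeric value here is a placeholder, not the tuned production setting:

```python
# Hypothetical threshold map for one domain; all values are placeholders.
DOMAIN_THRESHOLDS = {
    "us_constitution": {
        "query_search": 0.70,       # initial semantic search floor
        "alias_search": 0.75,       # alias match floor
        "RAG_SEARCH_min": 0.60,     # below this, stop processing
        "LLM_VERIFication": 0.70,   # band where LLM reranking may trigger
        "RAG_SEARCH": 0.85,         # high-confidence acceptance
        "confident": 0.90,          # save summary above this
        "FILTER_GAP": 0.15,         # max allowed score gap between results
        "LLM_SCORE": 0.05,          # LLM reranking score adjustment
    },
}
```

Keeping one such map per domain is what lets, say, US Code and USCIS Policy gate candidates at different scores instead of sharing one global threshold.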
### Environment Selection

The framework supports multiple environments:
- `--production`: Uses `.env.production`
- `--dev`: Uses `.env.dev`
- `--local`: Uses `.env.local`
- Auto-detection: Based on Docker environment and file existence

## Usage

### Basic Usage

```python
from RAG_interface import RAG
from config import COLLECTION

# Initialize RAG for a specific collection
rag = RAG(COLLECTION["US_CONSTITUTION_SET"], debug_mode=False)

# Process a query
results, query = rag.process_query(
    query="What does the 14th Amendment say about equal protection?",
    language="en"
)

# Get summary for a specific result
summary = rag.process_summary(
    query=query,
    result_list=results,
    index=0,
    language="en"
)
```

### Advanced Usage

```python
# With jurisdiction filtering
results, query = rag.process_query(
    query="immigration policy",
    jurisdiction="federal",
    language="en"
)

# Bilingual summary
insight_en, insight_es = rag.process_summary_bilingual(
    query=query,
    result_list=results,
    index=0,
    language="es"  # Returns both English and Spanish
)

# SQL-based client case search
rag_sql = RAG(COLLECTION["CLIENT_CASES"], debug_mode=False)
results = rag_sql.process_query(
    query="asylum case",
    filtered_cases=["case_id_1", "case_id_2"]
)
```

### Query Processing Options

- `skip_pre_checks`: Skip initial query validation
- `skip_cases_search`: Skip Supreme Court case search
- `filtered_cases`: Filter results to specific case IDs (SQL path)
- `jurisdiction`: Filter by jurisdiction
- `language`: Query language ("en" or "es")

## Components

### RAG Interface (`RAG_interface.py`)

Main orchestrator class that wires all subsystems together:
- Collection configuration management
- Domain-specific threshold selection
- Component initialization
- Public API for query processing

### Query Processor (`query_processor.py`)

End-to-end query processing pipeline:
- Query normalization and expansion
- Multi-stage search execution
- Result filtering and ranking
- Summary generation and caching
- Case-to-document mapping

### Vector Search (`vector_search.py`)

MongoDB Atlas Vector Search implementation:
- Native `$vectorSearch` aggregation
- Score bias adjustments
- Efficient similarity search
- Error handling and retries

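The `$vectorSearch` stage follows MongoDB Atlas's documented aggregation syntax. A minimal sketch using the index settings from the setup section (`vector_index`, `embedding`, 1024 dimensions); the helper function and the score-projection stage are illustrative, not ARF's actual implementation:

```python
def build_vector_search_pipeline(query_vector, limit=10, num_candidates=100):
    """Build an Atlas $vectorSearch aggregation pipeline (illustrative helper)."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",      # index name from the setup section
                "path": "embedding",          # vector field, 1024 dims
                "queryVector": query_vector,  # embedded query (Voyage-3-large)
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Surface Atlas's similarity score on each returned document.
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
    ]
```

With `pymongo`, this list would be passed to `collection.aggregate(...)`; `numCandidates` trades recall against latency and is typically set to several times `limit`.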
### Query Manager (`query_manager.py`)

Query processing utilities:
- Text normalization
- Pattern matching
- Query rephrasing
- Domain detection

### Alias Manager (`alias_manager.py`)

Alias-based search for US Constitution:
- Keyword/alias embeddings
- Fast alias matching
- Score boosting for exact matches

### Keyword Matcher (`keyword_matcher.py`)

Structured keyword matching:
- Article/section pattern matching
- Hierarchical document navigation
- Exact match detection

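A hedged sketch of the pattern-matching idea, using two illustrative regexes for the citation forms mentioned earlier ("Article I Section 8", "14th Amendment"); the real matcher covers many more patterns and handles edge cases these do not:

```python
import re

# Illustrative patterns only: structured citations map straight to documents,
# so queries that contain one can bypass semantic search entirely.
ARTICLE_RE = re.compile(r"\barticle\s+([IVX]+)(?:\s+section\s+(\d+))?", re.IGNORECASE)
AMENDMENT_RE = re.compile(r"\b(\d+)(?:st|nd|rd|th)\s+amendment\b", re.IGNORECASE)

def match_reference(query: str):
    """Return a structured reference dict if the query names one, else None."""
    m = ARTICLE_RE.search(query)
    if m:
        return {"type": "article", "article": m.group(1).upper(), "section": m.group(2)}
    m = AMENDMENT_RE.search(query)
    if m:
        return {"type": "amendment", "number": int(m.group(1))}
    return None
```

When a match fires, the pipeline can look the document up by its structured key instead of ranking embedding neighbors.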
### LLM Verifier (`llm_verifier.py`)

LLM-based result verification (fallback for MLP-uncertain candidates):
- Only invoked for ~20% of borderline candidates (MLP handles the rest)
- Relevance scoring (0-9) with multiplier-based score adjustment
- Sequential or parallel verification modes

### MLP Reranker (`mlp_reranker.py`)

Learned reranker that reduces LLM verification costs:
- scikit-learn MLPClassifier (128-64-32 hidden layers)
- Isotonic calibration for well-calibrated output probabilities
- Configurable uncertainty threshold for LLM fallback
- <5ms inference time per batch

### Feature Extractor (`feature_extractor.py`)

Extracts 15-dimensional feature vectors for query-document pairs:

| Feature | Description |
|---------|-------------|
| `semantic_score` | Raw cosine similarity from vector search |
| `bm25_score` | Term-frequency based relevance approximation |
| `alias_match` | Whether query matches a document alias |
| `keyword_match` | Whether query matches via keyword pattern |
| `domain_type` | Encoded domain (0-3) |
| `document_length` | Log-scaled character count |
| `query_length` | Query character count |
| `section_depth` | Depth in legal hierarchy |
| `embedding_cosine_similarity` | Direct embedding cosine similarity |
| `match_type` | 0=none, 1=partial, 2=exact |
| `score_gap_from_top` | Gap from highest-scored document |
| `query_term_coverage` | Fraction of query terms in document |
| `title_similarity` | Jaccard similarity between query and title |
| `has_nested_content` | Whether document has clauses/sections |
| `bias_adjustment` | Domain-specific bias applied |

### Mongo Manager (`mongo_manager.py`)

MongoDB connection and query management:
- Database connections
- Collection access
- Query caching
- User query history

## Development

### Running Tests

```bash
# Unit + integration tests (no API keys needed)
pytest tests/ -v

# Live integration tests (requires API keys + MongoDB)
ARF_LIVE_TESTS=1 pytest tests/test_integration.py -v

# Lint check
ruff check config.py config_schema.py rag_dependencies/ tests/

# Validate config schemas
python config_schema.py
```

### Docker

Run tests, lint, or the full framework without installing anything locally:

```bash
# Build the image (installs all dependencies + runs lint check)
docker build -t arf .

# Run tests
docker compose up tests

# Run lint only
docker compose run lint

# Validate config
docker compose run validate-config

# Run benchmarks (requires .env with API keys)
docker compose --profile benchmark up benchmark

# Interactive shell
docker compose run arf bash
```

> **Note:** Copy `.env.example` to `.env` and fill in your API keys before running services that require MongoDB or OpenAI access.

### Adding New Data Sources

1. Create a new directory in `preprocess/`
2. Implement fetch and ingest scripts
3. Add collection configuration to `config.py`
4. Define domain-specific thresholds
5. Create vector search indexes in MongoDB Atlas

### Debugging

Enable debug mode for detailed logging:

```python
rag = RAG(COLLECTION["US_CONSTITUTION_SET"], debug_mode=True)
```

## Evaluation & Benchmarks

### Benchmark

Measured on 15 US Constitution benchmark queries. Each strategy runs in its **own isolated RAG instance** — strategies build incrementally to show the marginal gain of each layer. Latency measured with in-memory query cache enabled.

| Strategy | MRR | P@1 | P@5 | R@5 | NDCG@5 | LLM% | Latency |
|----------|-----|-----|-----|-----|--------|------|---------|
| Semantic Only | 0.665 | 0.600 | 0.147 | 0.613 | 0.603 | 0% | 410 ms |
| + Keyword | 0.665 | 0.600 | 0.147 | 0.613 | 0.603 | 0% | 437 ms |
| + Threshold (ABC Gates) | 0.665 | 0.600 | 0.147 | 0.613 | 0.603 | 100% | 453 ms |
| **+ MLP Reranker** | **0.933** | **0.933** | **0.267** | **0.900** | **0.908** | **0%** | **714 ms** |
| + MLP + LLM Fallback | 0.933 | 0.933 | 0.253 | 0.867 | 0.882 | 20% | 768 ms |
| Full ARF Pipeline (cold) | 0.489 | 0.400 | 0.133 | 0.580 | 0.503 | — | 807 ms |
| **Full ARF Pipeline (cached)** | **0.679** | **0.571** | **0.171** | **0.743** | **0.682** | **0%** | **335 ms** |

> **Key findings:**
> - **Semantic search** provides a solid baseline (MRR 0.665) — but the right answer is often not at rank 1.
> - **Keyword matching** adds no measurable gain on this query set (US Constitution queries are predominantly semantic, not keyword-based).
> - **Threshold filtering** adds quality gates but no ranking improvement — and incurs 100% LLM verification calls in the borderline band.
> - **MLP Reranker is the breakthrough**: MRR jumps from 0.665 to **0.933** (+40%), P@1 from 0.600 to **0.933**, and R@5 from 0.613 to **0.900** — with **zero LLM calls**. The MLP learns which features predict relevance and reranks candidates by blending semantic score with learned probability.
> - **In-memory cache** brings cached query latency to **335 ms** — faster than raw MongoDB Atlas (410 ms) — with zero API calls and $0.00 cost.
> - **MLP + LLM Fallback** matches MLP-only quality while using LLM verification on only **20%** of candidates (those the MLP is uncertain about). This is the production configuration.
>
> **MLP model**: 128-64-32 MLP with isotonic calibration. Trained on 3,600 labeled query-document pairs across 4 legal domains. F1=0.940, AUC-ROC=0.983.
>
> Run `python benchmarks/run_ablation_full.py --production` to reproduce.

### Cost, Latency & Quality Over Volume

![Cost, Latency & MRR Comparison](media/cost_latency_comparison.png)

*As query volume grows and cache warms, ARF latency drops from ~800ms (cold) to **335ms** (cached) — faster than raw MongoDB Atlas (410ms). Cached queries cost **$0.00** (zero API calls). MRR improves from 0.489 (cold) to 0.679 (cached) as verified results are reused.*

### Cost Analysis

> **Key findings from real measurement:**
> - **Cached queries cost $0.00** — zero API calls, **335ms avg latency** with in-memory cache (280ms min). All 20 "cold" queries hit cache from prior runs.
> - **Similar queries mostly miss cache** (29% hit rate) — rephrased queries go through the full pipeline including Voyage batch embedding (~47 texts/call for alias search). OpenAI chat/moderation calls were **zero** on this run because threshold gates resolved all queries without LLM reranking.
> - **Voyage embedding is the only cost** — $0.000926 total for 200 queries. The batch embedding (~47 texts/call) is the real cost driver, not LLM calls.

#### Cost at Scale (Measured + Extrapolated)

```
Query Volume          Cache Hit Rate   Total API Cost   Cost/Query
──────────────────────────────────────────────────────────────────
20   (cold, cached)   100%             $0.000000        $0.000000
200  (20+180 sim)     36%              $0.000926        $0.000005
1000 (100+900 sim)    ~36%             ~$0.005          ~$0.000005
```

> **Cost thesis:** ARF's API cost is dominated by Voyage batch embedding ($0.06/1M tokens). Cached queries cost **$0.00** — zero external calls. At scale, cost grows only with the number of *genuinely new* queries that miss cache. For 1,000 queries where ~36% hit cache, total cost is ~$0.005 (half a cent). The MLP reranker runs locally at $0.00, and LLM reranking is reserved as a fallback for uncertain candidates (~20% of borderline cases).

## MLP Reranker

A lightweight learned reranker (128-64-32 MLP, <5ms, $0.00/query) that replaces expensive LLM verification calls. Trained on 3,600 labeled query-document pairs across 4 legal domains. The LLM is reserved as a fallback for only ~20% of uncertain candidates.

### Architecture

```
        ┌──────────────────────┐
        │    Vector Search     │
        │   (MongoDB Atlas)    │
        └──────────┬───────────┘
                   │ candidates with scores
        ┌──────────▼───────────┐
        │  Feature Extractor   │
        │    (15 features)     │
        └──────────┬───────────┘
                   │ feature vectors
        ┌──────────▼───────────┐
        │     MLP Reranker     │
        │   (128→64→32 MLP)    │
        │  + isotonic calib.   │
        │  F1=0.940 AUC=0.983  │
        └──────────┬───────────┘
            ┌──────┼──────┐
       p≥0.6│  0.4<p<0.6  │p≤0.4
            │      │      │
        Accept  ┌──▼──┐  Reject
        (free)  │ LLM │  (free)
                │Verif│
                │(20%)│
                └──┬──┘
             Accept/Reject
```

### Why the MLP Wins

The MLP blends **15 features** into a single relevance probability that captures signals the LLM cannot efficiently process:

- **Semantic score** + **BM25 score** — combines dense and sparse retrieval signals
- **Match type** (exact/partial/none) — structural pattern the LLM ignores
- **Score gap from top** — relative positioning in the candidate list
- **Section depth** — legal hierarchy structure (Title > Chapter > Section)
- **Domain type** — domain-specific relevance patterns

The LLM verifier sees one document at a time and rates it 0-9. The MLP sees the **entire candidate distribution** and learns which features predict relevance across domains.

### How It Works

For each candidate document, the pipeline:
1. **Extracts 15 features** — semantic score, BM25, keyword/alias match, document structure, query-document similarity
2. **MLP predicts** — Calibrated probability of relevance (0-1)
3. **Blends scores** — `0.4 * semantic_score + 0.6 * mlp_probability` for reranking
4. **Routes by confidence**:
   - **p >= 0.6**: Accept without LLM call (free, instant)
   - **p <= 0.4**: Reject without LLM call (free, instant)
   - **0.4 < p < 0.6**: Uncertain — escalate to LLM verifier (~20% of candidates)

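The four steps above can be sketched as one routing function. The candidate dict shape and field names are illustrative; the 0.4/0.6 blend weights and the confidence band come from the description above:

```python
def rerank_and_route(candidates, accept=0.6, reject=0.4):
    """Blend semantic score with MLP probability, then route by confidence.

    `candidates` is a list of dicts with `semantic_score` and `mlp_prob`
    (illustrative field names). Returns candidates sorted by blended score,
    each annotated with a routing decision.
    """
    routed = []
    for c in candidates:
        # Step 3: blend dense-retrieval score with the learned probability.
        blended = 0.4 * c["semantic_score"] + 0.6 * c["mlp_prob"]
        # Step 4: only the uncertain middle band ever reaches the LLM.
        if c["mlp_prob"] >= accept:
            decision = "accept"
        elif c["mlp_prob"] <= reject:
            decision = "reject"
        else:
            decision = "llm_fallback"
        routed.append({**c, "blended_score": blended, "decision": decision})
    return sorted(routed, key=lambda c: c["blended_score"], reverse=True)
```

Everything routed "accept" or "reject" is resolved locally at $0.00; only "llm_fallback" items incur an API call.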
### Training

```bash
# Train from evaluation dataset (requires MongoDB)
python benchmarks/train_reranker.py --dataset benchmarks/eval_dataset.json \
  --features-cache benchmarks/features_cache.json --production

# Retrain from cached features (no MongoDB needed)
python benchmarks/train_reranker.py --retrain --features-cache benchmarks/features_cache.json
```

The pipeline compares 3 architectures and picks the best:

| Model | F1 | AUC-ROC | Precision | Recall |
|-------|-----|---------|-----------|--------|
| Logistic Regression | 0.931 | 0.981 | 0.962 | 0.902 |
| MLP (64, 32) | 0.935 | 0.983 | 0.972 | 0.902 |
| **MLP (128, 64, 32)** | **0.948** | **0.988** | **0.977** | **0.921** |

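A minimal sketch of the winning recipe: a scikit-learn `MLPClassifier` with 128-64-32 hidden layers wrapped in isotonic calibration via `CalibratedClassifierCV`. The synthetic features and labels here exist only to make the sketch runnable; the real training data is the 3,600 labeled query-document pairs:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neural_network import MLPClassifier

# Stand-in data: 200 pairs x 15 features, with a label that depends on
# the first two features. Replace with real extracted features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 128-64-32 MLP, then isotonic calibration so the output probabilities
# are trustworthy enough to drive the 0.4/0.6 routing thresholds.
base = MLPClassifier(hidden_layer_sizes=(128, 64, 32), max_iter=300, random_state=0)
model = CalibratedClassifierCV(base, method="isotonic", cv=3)
model.fit(X, y)

proba = model.predict_proba(X[:5])[:, 1]  # calibrated relevance probabilities
```

Calibration matters here because the routing bands are absolute probability cutoffs: an over-confident raw MLP would send too few candidates to the LLM fallback.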
### Automated Monthly Retraining

```bash
python benchmarks/retrain_monthly.py --production --dry-run  # Check what would change
python benchmarks/retrain_monthly.py --production            # Retrain and deploy
```

The retraining pipeline:
1. Exports recent LLM verifier judgments from MongoDB (last 30 days)
2. Generates features for new query-document pairs
3. Merges with existing training data (deduplicates)
4. Retrains MLP on expanded dataset
5. Only deploys new model if F1 >= old model

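Steps 3 and 5 can be sketched as follows; the record fields and the dedup key are illustrative, not the pipeline's actual schema:

```python
def merge_training_pairs(existing, new):
    """Step 3: merge new LLM judgments into the training set,
    deduplicating on an illustrative (query, doc_id) key."""
    seen = {(p["query"], p["doc_id"]) for p in existing}
    merged = list(existing)
    for pair in new:
        key = (pair["query"], pair["doc_id"])
        if key not in seen:
            merged.append(pair)
            seen.add(key)
    return merged

def should_deploy(new_f1: float, old_f1: float) -> bool:
    """Step 5: ship the retrained model only if quality does not regress."""
    return new_f1 >= old_f1
```

The deployment gate keeps a bad month of production judgments from silently degrading the reranker.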
### Running Benchmarks

```bash
# Full benchmark (all strategies compared incrementally)
python benchmarks/run_ablation_full.py --production

# Baseline measurement (current pipeline without MLP)
python benchmarks/run_baseline.py --production

# With hallucination evaluation
python benchmarks/run_baseline.py --production --eval-faithfulness
```

Metrics reported: P@k, R@k, MRR, NDCG@k, latency (p50/p95/p99), LLM call frequency, cost-per-query.

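For reference, the two simplest of these metrics can be computed as below (a generic sketch, not the `benchmarks/metrics.py` implementation):

```python
def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the first relevant result, or 0 if absent.
    Averaging this over queries gives Mean Reciprocal Rank."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k
```

Note that P@k divides by k even when fewer than k relevant documents exist, which is why P@5 looks low (0.147-0.267 in the table) on queries with a single correct answer.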
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

### Code Style

- Follow PEP 8 Python style guide
- Use type hints where appropriate
- Add docstrings to public functions
- Include logging for important operations

## License

This project is licensed under the MIT License — see [LICENSE](LICENSE) for details.

## Acknowledgments

- MongoDB Atlas for vector search capabilities
- Voyage AI for embedding models
- OpenAI for LLM services

---

For detailed information on data ingestion, see [preprocess/README.md](preprocess/README.md).