ragmint 0.1.1__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {ragmint-0.1.1/src/ragmint.egg-info → ragmint-0.2.0}/PKG-INFO +91 -30
- {ragmint-0.1.1 → ragmint-0.2.0}/README.md +88 -29
- {ragmint-0.1.1 → ragmint-0.2.0}/pyproject.toml +4 -2
- ragmint-0.2.0/src/ragmint/autotuner.py +33 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/evaluation.py +11 -0
- ragmint-0.2.0/src/ragmint/explainer.py +61 -0
- ragmint-0.2.0/src/ragmint/leaderboard.py +45 -0
- ragmint-0.2.0/src/ragmint/tests/conftest.py +16 -0
- ragmint-0.2.0/src/ragmint/tests/test_autotuner.py +42 -0
- ragmint-0.2.0/src/ragmint/tests/test_explainer.py +20 -0
- ragmint-0.2.0/src/ragmint/tests/test_explainer_integration.py +18 -0
- ragmint-0.2.0/src/ragmint/tests/test_integration_autotuner_ragmint.py +60 -0
- ragmint-0.2.0/src/ragmint/tests/test_leaderboard.py +39 -0
- {ragmint-0.1.1 → ragmint-0.2.0/src/ragmint.egg-info}/PKG-INFO +91 -30
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint.egg-info/SOURCES.txt +9 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint.egg-info/requires.txt +2 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/LICENSE +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/setup.cfg +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/__init__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/__main__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/__init__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/chunking.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/embeddings.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/pipeline.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/reranker.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/retriever.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/experiments/__init__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/optimization/__init__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/optimization/search.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/tests/__init__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/tests/test_pipeline.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/tests/test_retriever.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/tests/test_search.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/tests/test_tuner.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/tuner.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/utils/__init__.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/utils/caching.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/utils/data_loader.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/utils/logger.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/utils/metrics.py +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint.egg-info/dependency_links.txt +0 -0
- {ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint.egg-info/top_level.txt +0 -0

{ragmint-0.1.1/src/ragmint.egg-info → ragmint-0.2.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ragmint
-Version: 0.1.1
+Version: 0.2.0
 Summary: A modular framework for evaluating and optimizing RAG pipelines.
 Author-email: Andre Oliveira <oandreoliveira@outlook.com>
 License: Apache License 2.0
@@ -22,6 +22,8 @@ Requires-Dist: faiss-cpu; sys_platform != "darwin"
 Requires-Dist: optuna>=3.0
 Requires-Dist: pytest
 Requires-Dist: colorama
+Requires-Dist: google-generativeai>=0.8.0
+Requires-Dist: supabase>=2.4.0
 Dynamic: license-file

 # Ragmint
@@ -36,17 +38,19 @@ Dynamic: license-file

 **Ragmint** (Retrieval-Augmented Generation Model Inspection & Tuning) is a modular, developer-friendly Python library for **evaluating, optimizing, and tuning RAG (Retrieval-Augmented Generation) pipelines**.

-It provides a complete toolkit for **retriever selection**, **embedding model tuning**, and **automated RAG evaluation** with support for **Optuna-based Bayesian optimization
+It provides a complete toolkit for **retriever selection**, **embedding model tuning**, and **automated RAG evaluation** with support for **Optuna-based Bayesian optimization**, **Auto-RAG tuning**, and **explainability** through Gemini or Claude.

 ---

 ## ✨ Features

 - ✅ **Automated hyperparameter optimization** (Grid, Random, Bayesian via Optuna)
+- 🤖 **Auto-RAG Tuner** — dynamically recommends retriever–embedding pairs based on corpus size
+- 🧠 **Explainability Layer** — interprets RAG performance via Gemini or Claude APIs
+- 🏆 **Leaderboard Tracking** — stores and ranks experiment runs via JSON or external DB
 - 🔍 **Built-in RAG evaluation metrics** — faithfulness, recall, BLEU, ROUGE, latency
 - ⚙️ **Retrievers** — FAISS, Chroma, ElasticSearch
 - 🧩 **Embeddings** — OpenAI, HuggingFace
-- 🧠 **Rerankers** — MMR, CrossEncoder (extensible via plugin interface)
 - 💾 **Caching, experiment tracking, and reproducibility** out of the box
 - 🧰 **Clean modular structure** for easy integration in research and production setups

@@ -102,6 +106,7 @@ print(result)
 ```

 ---
+
 ## 🧪 Dataset Options

 Ragmint can automatically load evaluation datasets for your RAG pipeline:
@@ -136,47 +141,99 @@ ragmint.optimize(validation_set="data/custom_qa.json")

 ---

+## 🧠 Auto-RAG Tuner
+
+The **AutoRAGTuner** automatically recommends retriever–embedding combinations
+based on corpus size and average document length.
+
+```python
+from ragmint.autotuner import AutoRAGTuner
+
+corpus_stats = {"size": 5000, "avg_len": 250}
+tuner = AutoRAGTuner(corpus_stats)
+recommendation = tuner.recommend()
+print(recommendation)
+# Example output: {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+```
+
+---
+
+## 🏆 Leaderboard Tracking
+
+Track and visualize your best experiments across runs.
+
+```python
+from ragmint.leaderboard import Leaderboard
+
+lb = Leaderboard("experiments/leaderboard.json")
+lb.add_entry({"trial": 1, "faithfulness": 0.87, "latency": 0.12})
+lb.show_top(3)
+```
+
+---
+
+## 🧠 Explainability with Gemini / Claude
+
+Compare two RAG configurations and receive natural language insights
+on **why** one performs better.
+
+```python
+from ragmint.explainer import explain_results
+
+config_a = {"retriever": "FAISS", "embedding_model": "OpenAI"}
+config_b = {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+
+explanation = explain_results(config_a, config_b, model="gemini")
+print(explanation)
+```
+
+> Set your API keys in a `.env` file or via environment variables:
+> ```
+> export GOOGLE_API_KEY="your_gemini_key"
+> export ANTHROPIC_API_KEY="your_claude_key"
+> ```
+
+---
+
 ## 🧩 Folder Structure

 ```
 ragmint/
 ├── core/
-│   ├── pipeline.py
-│   ├── retriever.py
-│   ├── reranker.py
-│
-
-├──
-├──
-├──
-├──
-├──
-
+│   ├── pipeline.py
+│   ├── retriever.py
+│   ├── reranker.py
+│   ├── embedding.py
+│   └── evaluation.py
+├── autotuner.py
+├── explainer.py
+├── leaderboard.py
+├── tuner.py
+├── utils/
+├── configs/
+├── experiments/
+├── tests/
+└── main.py
 ```

 ---

 ## 🧪 Running Tests

-To verify your setup:
-
 ```bash
 pytest -v
 ```

-
-
+To include integration tests with Gemini or Claude APIs:
 ```bash
-pytest
+pytest -m integration
 ```

-All tests are designed for **Pytest** and run with lightweight mock data.
-
 ---

 ## ⚙️ Configuration via `pyproject.toml`

-Your `pyproject.toml`
+Your `pyproject.toml` includes all required dependencies:

 ```toml
 [project]
@@ -191,6 +248,8 @@ dependencies = [
 "pytest",
 "openai",
 "tqdm",
+"google-generativeai",
+"google-genai",
 ]
 ```

@@ -198,10 +257,10 @@ dependencies = [

 ## 📊 Example Experiment Workflow

-1. Define your retriever and reranker
-2. Launch
-3.
-4.
+1. Define your retriever, embedding, and reranker setup
+2. Launch optimization (Grid, Random, Bayesian) or AutoTune
+3. Compare performance with explainability
+4. Persist results to leaderboard for later inspection

 ---

@@ -214,7 +273,7 @@ flowchart TD
 C --> D[Reranker]
 D --> E[Generator]
 E --> F[Evaluation]
-F --> G[Optuna
+F --> G[Optuna / AutoRAGTuner]
 G -->|Best Params| B
 ```

@@ -224,8 +283,9 @@ flowchart TD

 ```
 [INFO] Starting Bayesian optimization with Optuna
-[INFO] Trial 7 finished:
+[INFO] Trial 7 finished: faithfulness=0.83, latency=0.42s
 [INFO] Best parameters: {'lambda_param': 0.6, 'retriever': 'faiss'}
+[INFO] AutoRAGTuner: Suggested retriever=Chroma for medium corpus
 ```

 ---
@@ -233,8 +293,9 @@ flowchart TD
 ## 🧠 Why Ragmint?

 - Built for **RAG researchers**, **AI engineers**, and **LLM ops**
-- Works with **LangChain**, **LlamaIndex**, or standalone
-- Designed for **extensibility** — plug in your own
+- Works with **LangChain**, **LlamaIndex**, or standalone setups
+- Designed for **extensibility** — plug in your own retrievers, models, or metrics
+- Integrated **explainability and leaderboard** modules for research and production

 ---

{ragmint-0.1.1 → ragmint-0.2.0}/README.md

@@ -10,17 +10,19 @@

 **Ragmint** (Retrieval-Augmented Generation Model Inspection & Tuning) is a modular, developer-friendly Python library for **evaluating, optimizing, and tuning RAG (Retrieval-Augmented Generation) pipelines**.

-It provides a complete toolkit for **retriever selection**, **embedding model tuning**, and **automated RAG evaluation** with support for **Optuna-based Bayesian optimization
+It provides a complete toolkit for **retriever selection**, **embedding model tuning**, and **automated RAG evaluation** with support for **Optuna-based Bayesian optimization**, **Auto-RAG tuning**, and **explainability** through Gemini or Claude.

 ---

 ## ✨ Features

 - ✅ **Automated hyperparameter optimization** (Grid, Random, Bayesian via Optuna)
+- 🤖 **Auto-RAG Tuner** — dynamically recommends retriever–embedding pairs based on corpus size
+- 🧠 **Explainability Layer** — interprets RAG performance via Gemini or Claude APIs
+- 🏆 **Leaderboard Tracking** — stores and ranks experiment runs via JSON or external DB
 - 🔍 **Built-in RAG evaluation metrics** — faithfulness, recall, BLEU, ROUGE, latency
 - ⚙️ **Retrievers** — FAISS, Chroma, ElasticSearch
 - 🧩 **Embeddings** — OpenAI, HuggingFace
-- 🧠 **Rerankers** — MMR, CrossEncoder (extensible via plugin interface)
 - 💾 **Caching, experiment tracking, and reproducibility** out of the box
 - 🧰 **Clean modular structure** for easy integration in research and production setups

@@ -76,6 +78,7 @@ print(result)
 ```

 ---
+
 ## 🧪 Dataset Options

 Ragmint can automatically load evaluation datasets for your RAG pipeline:
@@ -110,47 +113,99 @@ ragmint.optimize(validation_set="data/custom_qa.json")

 ---

+## 🧠 Auto-RAG Tuner
+
+The **AutoRAGTuner** automatically recommends retriever–embedding combinations
+based on corpus size and average document length.
+
+```python
+from ragmint.autotuner import AutoRAGTuner
+
+corpus_stats = {"size": 5000, "avg_len": 250}
+tuner = AutoRAGTuner(corpus_stats)
+recommendation = tuner.recommend()
+print(recommendation)
+# Example output: {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+```
+
+---
+
+## 🏆 Leaderboard Tracking
+
+Track and visualize your best experiments across runs.
+
+```python
+from ragmint.leaderboard import Leaderboard
+
+lb = Leaderboard("experiments/leaderboard.json")
+lb.add_entry({"trial": 1, "faithfulness": 0.87, "latency": 0.12})
+lb.show_top(3)
+```
+
+---
+
+## 🧠 Explainability with Gemini / Claude
+
+Compare two RAG configurations and receive natural language insights
+on **why** one performs better.
+
+```python
+from ragmint.explainer import explain_results
+
+config_a = {"retriever": "FAISS", "embedding_model": "OpenAI"}
+config_b = {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+
+explanation = explain_results(config_a, config_b, model="gemini")
+print(explanation)
+```
+
+> Set your API keys in a `.env` file or via environment variables:
+> ```
+> export GOOGLE_API_KEY="your_gemini_key"
+> export ANTHROPIC_API_KEY="your_claude_key"
+> ```
+
+---
+
 ## 🧩 Folder Structure

 ```
 ragmint/
 ├── core/
-│   ├── pipeline.py
-│   ├── retriever.py
-│   ├── reranker.py
-│
-
-├──
-├──
-├──
-├──
-├──
-
+│   ├── pipeline.py
+│   ├── retriever.py
+│   ├── reranker.py
+│   ├── embedding.py
+│   └── evaluation.py
+├── autotuner.py
+├── explainer.py
+├── leaderboard.py
+├── tuner.py
+├── utils/
+├── configs/
+├── experiments/
+├── tests/
+└── main.py
 ```

 ---

 ## 🧪 Running Tests

-To verify your setup:
-
 ```bash
 pytest -v
 ```

-
-
+To include integration tests with Gemini or Claude APIs:
 ```bash
-pytest
+pytest -m integration
 ```

-All tests are designed for **Pytest** and run with lightweight mock data.
-
 ---

 ## ⚙️ Configuration via `pyproject.toml`

-Your `pyproject.toml`
+Your `pyproject.toml` includes all required dependencies:

 ```toml
 [project]
@@ -165,6 +220,8 @@ dependencies = [
 "pytest",
 "openai",
 "tqdm",
+"google-generativeai",
+"google-genai",
 ]
 ```

@@ -172,10 +229,10 @@ dependencies = [

 ## 📊 Example Experiment Workflow

-1. Define your retriever and reranker
-2. Launch
-3.
-4.
+1. Define your retriever, embedding, and reranker setup
+2. Launch optimization (Grid, Random, Bayesian) or AutoTune
+3. Compare performance with explainability
+4. Persist results to leaderboard for later inspection

 ---

@@ -188,7 +245,7 @@ flowchart TD
 C --> D[Reranker]
 D --> E[Generator]
 E --> F[Evaluation]
-F --> G[Optuna
+F --> G[Optuna / AutoRAGTuner]
 G -->|Best Params| B
 ```

@@ -198,8 +255,9 @@ flowchart TD

 ```
 [INFO] Starting Bayesian optimization with Optuna
-[INFO] Trial 7 finished:
+[INFO] Trial 7 finished: faithfulness=0.83, latency=0.42s
 [INFO] Best parameters: {'lambda_param': 0.6, 'retriever': 'faiss'}
+[INFO] AutoRAGTuner: Suggested retriever=Chroma for medium corpus
 ```

 ---
@@ -207,8 +265,9 @@ flowchart TD
 ## 🧠 Why Ragmint?

 - Built for **RAG researchers**, **AI engineers**, and **LLM ops**
-- Works with **LangChain**, **LlamaIndex**, or standalone
-- Designed for **extensibility** — plug in your own
+- Works with **LangChain**, **LlamaIndex**, or standalone setups
+- Designed for **extensibility** — plug in your own retrievers, models, or metrics
+- Integrated **explainability and leaderboard** modules for research and production

 ---

{ragmint-0.1.1 → ragmint-0.2.0}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "ragmint"
-version = "0.1.1"
+version = "0.2.0"
 description = "A modular framework for evaluating and optimizing RAG pipelines."
 readme = "README.md"
 license = { text = "Apache License 2.0" }
@@ -24,7 +24,9 @@ dependencies = [
 "faiss-cpu; sys_platform != 'darwin'",
 "optuna>=3.0",
 "pytest",
-"colorama"
+"colorama",
+"google-generativeai>=0.8.0",
+"supabase>=2.4.0"
 ]

 [project.urls]

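The version bump and the two new runtime dependencies (`google-generativeai`, `supabase`) are the only changes here. A minimal sketch for confirming which release is installed after upgrading — not part of the package itself, and it assumes ragmint was installed from PyPI:

```python
# Hypothetical post-upgrade check; assumes `pip install --upgrade ragmint` has already run.
from importlib.metadata import version

print(version("ragmint"))  # expected to print "0.2.0" for this release
```
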
ragmint-0.2.0/src/ragmint/autotuner.py (new file)

@@ -0,0 +1,33 @@
+"""
+Auto-RAG Tuner
+--------------
+Recommends retriever–embedding pairs dynamically based on corpus size
+and dataset characteristics. Integrates seamlessly with RAGMint evaluator.
+"""
+
+from .core.evaluation import evaluate_config
+
+
+class AutoRAGTuner:
+    def __init__(self, corpus_stats: dict):
+        """
+        corpus_stats: dict
+        Example: {'size': 12000, 'avg_len': 240}
+        """
+        self.corpus_stats = corpus_stats
+
+    def recommend(self):
+        size = self.corpus_stats.get("size", 0)
+        avg_len = self.corpus_stats.get("avg_len", 0)
+
+        if size < 1000:
+            return {"retriever": "BM25", "embedding_model": "OpenAI"}
+        elif size < 10000:
+            return {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+        else:
+            return {"retriever": "FAISS", "embedding_model": "InstructorXL"}
+
+    def auto_tune(self, validation_data):
+        config = self.recommend()
+        results = evaluate_config(config, validation_data)
+        return {"recommended": config, "results": results}

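Reading the thresholds above: corpora with fewer than 1,000 documents map to BM25 + OpenAI, fewer than 10,000 to Chroma + SentenceTransformers, and anything larger to FAISS + InstructorXL. A small usage sketch under those assumptions — the corpus numbers and validation sample are illustrative, and `auto_tune` feeds the recommendation into `evaluate_config` from `core/evaluation.py` (shown next):

```python
from ragmint.autotuner import AutoRAGTuner

# Illustrative corpus statistics: 12,000 docs averaging ~240 tokens each.
tuner = AutoRAGTuner({"size": 12000, "avg_len": 240})
print(tuner.recommend())
# {'retriever': 'FAISS', 'embedding_model': 'InstructorXL'}

# auto_tune() scores the recommended config on a validation set;
# evaluate_config() reads the keys "query", "answer" and "context" from each sample.
validation = [{"query": "What is RAG?", "answer": "Retrieval-Augmented Generation", "context": "..."}]
report = tuner.auto_tune(validation)
print(report["recommended"], report["results"])
```
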
{ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint/core/evaluation.py

@@ -25,3 +25,14 @@ class Evaluator:

     def _similarity(self, a: str, b: str) -> float:
         return SequenceMatcher(None, a, b).ratio()
+
+def evaluate_config(config, validation_data):
+    evaluator = Evaluator()
+    results = []
+    for sample in validation_data:
+        query = sample.get("query", "")
+        answer = sample.get("answer", "")
+        context = sample.get("context", "")
+        results.append(evaluator.evaluate(query, answer, context))
+    return results
+

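The new module-level `evaluate_config` helper runs the existing `Evaluator` over each validation sample and returns one metrics dict per sample. Note that, as released, the `config` argument is accepted but never used inside the helper, and samples are read via the keys `query`, `answer`, and `context` (missing keys fall back to empty strings). A hedged call-shape sketch with made-up sample values:

```python
from ragmint.core.evaluation import evaluate_config

# Illustrative sample; any missing key silently becomes "".
samples = [{
    "query": "What is AI?",
    "answer": "Artificial Intelligence",
    "context": "AI stands for Artificial Intelligence.",
}]

per_sample = evaluate_config({"retriever": "FAISS"}, samples)
print(per_sample)  # a list with one Evaluator.evaluate(...) result per sample
```
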
ragmint-0.2.0/src/ragmint/explainer.py (new file)

@@ -0,0 +1,61 @@
+"""
+Interpretability Layer
+----------------------
+Uses Gemini or Anthropic Claude to explain why one RAG configuration
+outperforms another. Falls back gracefully if no API key is provided.
+"""
+
+import os
+import json
+
+
+def explain_results(results_a: dict, results_b: dict, model: str = "gemini-1.5-pro") -> str:
+    """
+    Generate a natural-language explanation comparing two RAG experiment results.
+    Priority:
+    1. Anthropic Claude (if ANTHROPIC_API_KEY is set)
+    2. Google Gemini (if GOOGLE_API_KEY is set)
+    3. Fallback text message
+    """
+    prompt = f"""
+    You are an AI evaluation expert.
+    Compare these two RAG experiment results and explain why one performs better.
+    Metrics A: {json.dumps(results_a, indent=2)}
+    Metrics B: {json.dumps(results_b, indent=2)}
+    Provide a concise, human-friendly explanation and practical improvement tips.
+    """
+
+    anthropic_key = os.getenv("ANTHROPIC_API_KEY")
+    google_key = os.getenv("GEMINI_API_KEY")
+
+
+    # 1️⃣ Try Anthropic Claude first
+    if anthropic_key:
+        try:
+            from anthropic import Anthropic
+            client = Anthropic(api_key=anthropic_key)
+            response = client.messages.create(
+                model="claude-3-opus-20240229",
+                max_tokens=300,
+                messages=[{"role": "user", "content": prompt}],
+            )
+            return response.content[0].text
+        except Exception as e:
+            return f"[Claude unavailable] {e}"
+
+    # 2️⃣ Fallback to Google Gemini
+    elif google_key:
+        try:
+            import google.generativeai as genai
+            genai.configure(api_key=google_key)
+            response = genai.GenerativeModel(model).generate_content(prompt)
+            return response.text
+        except Exception as e:
+            return f"[Gemini unavailable] {e}"
+
+    # 3️⃣ Fallback if neither key is available
+    else:
+        return (
+            "[No LLM available] Please set ANTHROPIC_API_KEY or GOOGLE_API_KEY "
+            "to enable interpretability via Claude or Gemini."
+        )

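Two details worth noting from the code above: Claude is tried first whenever `ANTHROPIC_API_KEY` is set, and the Gemini branch reads `GEMINI_API_KEY`, whereas the README and docstring refer to `GOOGLE_API_KEY`. A minimal sketch of the no-key fallback path, which is also what the unit tests can hit when no keys are configured:

```python
import os
from ragmint.explainer import explain_results

# With neither key set, the function returns the fallback string instead of calling an LLM.
os.environ.pop("ANTHROPIC_API_KEY", None)
os.environ.pop("GEMINI_API_KEY", None)

msg = explain_results({"faithfulness": 0.83}, {"faithfulness": 0.79})
print(msg)  # "[No LLM available] Please set ANTHROPIC_API_KEY or GOOGLE_API_KEY ..."
```
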
ragmint-0.2.0/src/ragmint/leaderboard.py (new file)

@@ -0,0 +1,45 @@
+import os
+import json
+from datetime import datetime
+from typing import Dict, Any, Optional
+from supabase import create_client
+
+class Leaderboard:
+    def __init__(self, storage_path: Optional[str] = None):
+        self.storage_path = storage_path
+        url = os.getenv("SUPABASE_URL")
+        key = os.getenv("SUPABASE_KEY")
+        self.client = None
+        if url and key:
+            self.client = create_client(url, key)
+        elif not storage_path:
+            raise EnvironmentError("Set SUPABASE_URL/SUPABASE_KEY or pass storage_path")
+
+    def upload(self, run_id: str, config: Dict[str, Any], score: float):
+        data = {
+            "run_id": run_id,
+            "config": config,
+            "score": score,
+            "timestamp": datetime.utcnow().isoformat(),
+        }
+        if self.client:
+            return self.client.table("experiments").insert(data).execute()
+        else:
+            os.makedirs(os.path.dirname(self.storage_path), exist_ok=True)
+            with open(self.storage_path, "a", encoding="utf-8") as f:
+                f.write(json.dumps(data) + "\n")
+            return data
+
+    def top_results(self, limit: int = 10):
+        if self.client:
+            return (
+                self.client.table("experiments")
+                .select("*")
+                .order("score", desc=True)
+                .limit(limit)
+                .execute()
+            )
+        else:
+            with open(self.storage_path, "r", encoding="utf-8") as f:
+                lines = [json.loads(line) for line in f]
+            return sorted(lines, key=lambda x: x["score"], reverse=True)[:limit]

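The class writes runs to a Supabase `experiments` table when `SUPABASE_URL`/`SUPABASE_KEY` are set, and otherwise appends JSON lines to a local file. Note that the public API here is `upload()` / `top_results()`; the README snippet above shows `add_entry()` / `show_top()`, which do not appear in this module. A local-file usage sketch (paths and scores are illustrative):

```python
from ragmint.leaderboard import Leaderboard

# Local JSONL mode: used when the Supabase env vars are not set.
lb = Leaderboard(storage_path="experiments/leaderboard.jsonl")
lb.upload("run1", {"retriever": "FAISS", "embedding_model": "OpenAI"}, 0.91)
lb.upload("run2", {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}, 0.85)

for entry in lb.top_results(limit=5):
    print(entry["run_id"], entry["score"])
```
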
ragmint-0.2.0/src/ragmint/tests/conftest.py (new file)

@@ -0,0 +1,16 @@
+# src/ragmint/tests/conftest.py
+import os
+from dotenv import load_dotenv
+import pytest
+
+# Load .env from project root
+load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "../../../.env"))
+
+def pytest_configure(config):
+    """Print which keys are loaded (debug)."""
+    google = os.getenv("GEMINI_API_KEY")
+    anthropic = os.getenv("ANTHROPIC_API_KEY")
+    if google:
+        print("✅ GOOGLE_API_KEY loaded")
+    if anthropic:
+        print("✅ ANTHROPIC_API_KEY loaded")

ragmint-0.2.0/src/ragmint/tests/test_autotuner.py (new file)

@@ -0,0 +1,42 @@
+import pytest
+from ragmint.autotuner import AutoRAGTuner
+
+
+def test_autorag_recommend_small():
+    """Small corpus should trigger BM25 + OpenAI."""
+    tuner = AutoRAGTuner({"size": 500, "avg_len": 150})
+    rec = tuner.recommend()
+    assert rec["retriever"] == "BM25"
+    assert rec["embedding_model"] == "OpenAI"
+
+
+def test_autorag_recommend_medium():
+    """Medium corpus should trigger Chroma + SentenceTransformers."""
+    tuner = AutoRAGTuner({"size": 5000, "avg_len": 200})
+    rec = tuner.recommend()
+    assert rec["retriever"] == "Chroma"
+    assert rec["embedding_model"] == "SentenceTransformers"
+
+
+def test_autorag_recommend_large():
+    """Large corpus should trigger FAISS + InstructorXL."""
+    tuner = AutoRAGTuner({"size": 50000, "avg_len": 300})
+    rec = tuner.recommend()
+    assert rec["retriever"] == "FAISS"
+    assert rec["embedding_model"] == "InstructorXL"
+
+
+def test_autorag_auto_tune(monkeypatch):
+    """Test auto_tune with a mock validation dataset."""
+    tuner = AutoRAGTuner({"size": 12000, "avg_len": 250})
+
+    # Monkeypatch evaluate_config inside autotuner
+    import ragmint.autotuner as autotuner
+    def mock_eval(config, data):
+        return {"faithfulness": 0.9, "latency": 0.01}
+    monkeypatch.setattr(autotuner, "evaluate_config", mock_eval)
+
+    result = tuner.auto_tune([{"question": "What is AI?", "answer": "Artificial Intelligence"}])
+    assert "recommended" in result
+    assert "results" in result
+    assert isinstance(result["results"], dict)

ragmint-0.2.0/src/ragmint/tests/test_explainer.py (new file)

@@ -0,0 +1,20 @@
+import pytest
+from ragmint.explainer import explain_results
+
+
+def test_explain_results_gemini():
+    """Gemini explanation should contain model-specific phrasing."""
+    config_a = {"retriever": "FAISS", "embedding_model": "OpenAI"}
+    config_b = {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+    result = explain_results(config_a, config_b, model="gemini")
+    assert isinstance(result, str)
+    assert "Gemini" in result or "gemini" in result
+
+
+def test_explain_results_claude():
+    """Claude explanation should contain model-specific phrasing."""
+    config_a = {"retriever": "FAISS"}
+    config_b = {"retriever": "Chroma"}
+    result = explain_results(config_a, config_b, model="claude")
+    assert isinstance(result, str)
+    assert "Claude" in result or "claude" in result

ragmint-0.2.0/src/ragmint/tests/test_explainer_integration.py (new file)

@@ -0,0 +1,18 @@
+import os
+import pytest
+from ragmint.explainer import explain_results
+
+
+@pytest.mark.integration
+def test_real_gemini_explanation():
+    """Run real Gemini call if GOOGLE_API_KEY is set."""
+    if not os.getenv("GEMINI_API_KEY"):
+        pytest.skip("GOOGLE_API_KEY not set")
+
+    config_a = {"retriever": "FAISS", "embedding_model": "OpenAI"}
+    config_b = {"retriever": "Chroma", "embedding_model": "SentenceTransformers"}
+
+    result = explain_results(config_a, config_b, model="gemini-1.5-pro")
+    assert isinstance(result, str)
+    assert len(result) > 0
+    print("\n[Gemini explanation]:", result[:200], "...")

ragmint-0.2.0/src/ragmint/tests/test_integration_autotuner_ragmint.py (new file)

@@ -0,0 +1,60 @@
+import pytest
+from ragmint.tuner import RAGMint
+from ragmint.autotuner import AutoRAGTuner
+
+
+def test_integration_ragmint_autotune(monkeypatch, tmp_path):
+    """
+    Smoke test for integration between AutoRAGTuner and RAGMint.
+    Ensures end-to-end flow runs without real retrievers or embeddings.
+    """
+
+    # --- Mock corpus and validation data ---
+    corpus = tmp_path / "docs"
+    corpus.mkdir()
+    (corpus / "doc1.txt").write_text("This is an AI document.")
+    validation_data = [{"question": "What is AI?", "answer": "Artificial Intelligence"}]
+
+    # --- Mock RAGMint.optimize() to avoid real model work ---
+    def mock_optimize(self, validation_set=None, metric="faithfulness", trials=2):
+        return (
+            {"retriever": "FAISS", "embedding_model": "OpenAI", "score": 0.88},
+            [{"trial": 1, "score": 0.88}],
+        )
+
+    monkeypatch.setattr(RAGMint, "optimize", mock_optimize)
+
+    # --- Mock evaluation used by AutoRAGTuner ---
+    def mock_evaluate_config(config, data):
+        return {"faithfulness": 0.9, "latency": 0.01}
+
+    import ragmint.autotuner as autotuner
+    monkeypatch.setattr(autotuner, "evaluate_config", mock_evaluate_config)
+
+    # --- Create AutoRAGTuner and RAGMint instances ---
+    ragmint = RAGMint(
+        docs_path=str(corpus),
+        retrievers=["faiss", "chroma"],
+        embeddings=["text-embedding-3-small"],
+        rerankers=["mmr"],
+    )
+
+    tuner = AutoRAGTuner({"size": 2000, "avg_len": 150})
+
+    # --- Run Auto-Tune and RAG Optimization ---
+    recommendation = tuner.recommend()
+    assert "retriever" in recommendation
+    assert "embedding_model" in recommendation
+
+    tuning_results = tuner.auto_tune(validation_data)
+    assert "results" in tuning_results
+    assert isinstance(tuning_results["results"], dict)
+
+    # --- Run RAGMint optimization flow (mocked) ---
+    best_config, results = ragmint.optimize(validation_set=validation_data, trials=2)
+    assert isinstance(best_config, dict)
+    assert "score" in best_config
+    assert isinstance(results, list)
+
+    # --- Integration Success ---
+    print(f"Integration OK: AutoRAG recommended {recommendation}, RAGMint best {best_config}")

ragmint-0.2.0/src/ragmint/tests/test_leaderboard.py (new file)

@@ -0,0 +1,39 @@
+import json
+import tempfile
+from pathlib import Path
+from ragmint.leaderboard import Leaderboard
+
+
+def test_leaderboard_add_and_top(tmp_path):
+    """Ensure local leaderboard persistence works without Supabase."""
+    file_path = tmp_path / "leaderboard.jsonl"
+    lb = Leaderboard(storage_path=str(file_path))
+
+    # Add two runs
+    lb.upload("run1", {"retriever": "FAISS"}, 0.91)
+    lb.upload("run2", {"retriever": "Chroma"}, 0.85)
+
+    # Verify file content
+    assert file_path.exists()
+    with open(file_path, "r", encoding="utf-8") as f:
+        lines = [json.loads(line) for line in f]
+    assert len(lines) == 2
+
+    # Get top results
+    top = lb.top_results(limit=1)
+    assert isinstance(top, list)
+    assert len(top) == 1
+    assert "score" in top[0]
+
+
+def test_leaderboard_append_existing(tmp_path):
+    """Ensure multiple uploads append properly."""
+    file_path = tmp_path / "leaderboard.jsonl"
+    lb = Leaderboard(storage_path=str(file_path))
+
+    for i in range(3):
+        lb.upload(f"run{i}", {"retriever": "BM25"}, 0.8 + i * 0.05)
+
+    top = lb.top_results(limit=2)
+    assert len(top) == 2
+    assert top[0]["score"] >= top[1]["score"]

{ragmint-0.1.1 → ragmint-0.2.0/src/ragmint.egg-info}/PKG-INFO

Identical to the PKG-INFO diff shown at the top of this page; the egg-info copy carries the same metadata and README changes.

{ragmint-0.1.1 → ragmint-0.2.0}/src/ragmint.egg-info/SOURCES.txt

@@ -3,6 +3,9 @@ README.md
 pyproject.toml
 src/ragmint/__init__.py
 src/ragmint/__main__.py
+src/ragmint/autotuner.py
+src/ragmint/explainer.py
+src/ragmint/leaderboard.py
 src/ragmint/tuner.py
 src/ragmint.egg-info/PKG-INFO
 src/ragmint.egg-info/SOURCES.txt
@@ -20,6 +23,12 @@ src/ragmint/experiments/__init__.py
 src/ragmint/optimization/__init__.py
 src/ragmint/optimization/search.py
 src/ragmint/tests/__init__.py
+src/ragmint/tests/conftest.py
+src/ragmint/tests/test_autotuner.py
+src/ragmint/tests/test_explainer.py
+src/ragmint/tests/test_explainer_integration.py
+src/ragmint/tests/test_integration_autotuner_ragmint.py
+src/ragmint/tests/test_leaderboard.py
 src/ragmint/tests/test_pipeline.py
 src/ragmint/tests/test_retriever.py
 src/ragmint/tests/test_search.py