PyPI - ragmint - Versions diffs - 0.3.0__tar.gz → 0.3.1__tar.gz - Mend

ragmint 0.3.0tar.gz → 0.3.1tar.gz

Files changed (56)

{ragmint-0.3.0 → ragmint-0.3.1}/PKG-INFO RENAMED

@@ -1,31 +1,45 @@
 Metadata-Version: 2.4
 Name: ragmint
-Version: 0.3.0
+Version: 0.3.1
 Summary: A modular framework for evaluating and optimizing RAG pipelines.
 Author-email: Andre Oliveira <oandreoliveira@outlook.com>
 License: Apache License 2.0
 Project-URL: Homepage, https://github.com/andyolivers/ragmint
 Project-URL: Documentation, https://andyolivers.com
 Project-URL: Issues, https://github.com/andyolivers/ragmint/issues
-Keywords: RAG,LLM,retrieval,optimization,AI,evaluation
+Keywords: RAG,LLM,retrieval,optimization,AI,evaluation,chunking,autotuning
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: numpy<2.0.0
 Requires-Dist: pandas>=2.0
 Requires-Dist: scikit-learn>=1.3
-Requires-Dist: openai>=1.0
-Requires-Dist: tqdm
-Requires-Dist: pyyaml
-Requires-Dist: chromadb>=0.4
+Requires-Dist: sentence-transformers>=2.2.2
+Requires-Dist: chromadb>=0.3.1
 Requires-Dist: faiss-cpu; sys_platform != "darwin"
+Requires-Dist: faiss-cpu==1.7.4; sys_platform == "darwin"
+Requires-Dist: rank-bm25>=0.2.2
 Requires-Dist: optuna>=3.0
-Requires-Dist: pytest
+Requires-Dist: tqdm
 Requires-Dist: colorama
+Requires-Dist: pyyaml
+Requires-Dist: python-dotenv
+Requires-Dist: openai>=1.0.0
 Requires-Dist: google-generativeai>=0.8.0
+Requires-Dist: anthropic>=0.25.0
 Requires-Dist: supabase>=2.4.0
-Requires-Dist: python-dotenv
-Requires-Dist: sentence-transformers
+Requires-Dist: pytest
+Requires-Dist: langchain>=0.2.5
+Requires-Dist: langchain-community>=0.2.5
+Requires-Dist: langchain-text-splitters>=0.2.1
+Provides-Extra: dev
+Requires-Dist: black; extra == "dev"
+Requires-Dist: flake8; extra == "dev"
+Requires-Dist: isort; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Provides-Extra: docs
+Requires-Dist: mkdocs; extra == "docs"
+Requires-Dist: mkdocs-material; extra == "docs"
 Dynamic: license-file
 # Ragmint
@@ -40,7 +54,7 @@ Dynamic: license-file
 **Ragmint** (Retrieval-Augmented Generation Model Inspection & Tuning) is a modular, developer-friendly Python library for **evaluating, optimizing, and tuning RAG (Retrieval-Augmented Generation) pipelines**.
-It provides a complete toolkit for **retriever selection**, **embedding model tuning**, and **automated RAG evaluation** with support for **Optuna-based Bayesian optimization**, **Auto-RAG tuning**, and **explainability** through Gemini or Claude.
+It provides a complete toolkit for **retriever selection**, **embedding model tuning**, **automated RAG evaluation**, and **config-driven prebuilding** of pipelines with support for **Optuna-based Bayesian optimization**, **Auto-RAG tuning**, **chunking**, and **explainability** through Gemini or Claude.
 ---
@@ -55,6 +69,9 @@ It provides a complete toolkit for **retriever selection**, **embedding model tu
 - 🧩 **Embeddings** — Hugging Face
 - 💾 **Caching, experiment tracking, and reproducibility** out of the box
 - 🧰 **Clean modular structure** for easy integration in research and production setups
+- 📦 **Chunking system** — automatic or configurable chunk_size and overlap for documents
+- 🏗️ **Langchain Prebuilder** — prepares pipelines, applies chunking, embeddings, and vector store creation automatically
+- ⚙️ **Config Adapter (LangchainConfigAdapter)** — normalizes configuration, fills defaults, validates retrievers
 ---
@@ -83,6 +100,8 @@ Example `configs/default.yaml`:
 ```yaml
 retriever: faiss
 embedding_model: text-embedding-3-small
+chunk_size: 500
+overlap: 100
 reranker:
   mode: mmr
   lambda_param: 0.5
@@ -96,33 +115,58 @@ optimization:
 ### 3️⃣ Manual Pipeline Usage
 ```python
+from ragmint.prebuilder import PreBuilder
 from ragmint.tuner import RAGMint
-# Initialize RAGMint with available components
-rag = RAGMint(
+# Prebuild pipeline (chunking, embeddings, vector store)
+prebuilder = PreBuilder(
     docs_path="data/docs/",
-    retrievers=["faiss", "chroma", "sklearn"],
-    embeddings=["all-MiniLM-L6-v2", "sentence-transformers/all-MiniLM-L12-v2"],
-    rerankers=["mmr"]
+    config_path="configs/default.yaml"
 )
+pipeline = prebuilder.build_pipeline()
-# Run optimization over 3 trials using the default validation set
-best, results = rag.optimize(
-    validation_set=None,
-    metric="faithfulness",
-    trials=3
-)
+# Initialize RAGMint with prebuilt components
+rag = RAGMint(pipeline=pipeline)
+# Run optimization
+best, results = rag.optimize(validation_set=None, metric="faithfulness", trials=3)
 print("Best configuration:", best)
 ```
 ---
 # 🧩 Embeddings and Retrievers
 **Ragmint** supports a flexible set of embeddings and retrievers, allowing you to adapt easily to various **RAG architectures**.
+---
+## 🧩 Chunking System
+* **Automatically splits documents** into chunks with `chunk_size` and `overlap` parameters.
+* **Supports default values** if not provided in configuration.
+* **Optimized** for downstream **retrieval and embeddings**.
+* **Enables adaptive chunking strategies** in future releases.
+---
+## 🧩 Langchain Config Adapter
+* **Ensures consistent configuration** across pipeline components.
+* **Normalizes retriever and embedding names** (e.g., `faiss`, `sentence-transformers/...`).
+* **Adds default chunk parameters** when missing.
+* **Validates retriever backends** and **raises clear errors** for unsupported options.
+---
+## 🧩 Langchain Prebuilder
+**Automates pipeline preparation:**
+1. Reads documents
+2. Applies chunking
+3. Creates embeddings
+4. Initializes retriever / vector store
+5. Returns ready-to-use pipeline** for RAGMint or custom usage.
 ---
-## 🔤 Available Embeddings (Hugging Face / OpenAI)
+## 🔤 Available Embeddings (Hugging Face)
 You can select from the following models:
@@ -258,8 +302,12 @@ ragmint/
 │   ├── pipeline.py
 │   ├── retriever.py
 │   ├── reranker.py
-│   ├── embedding.py
+│   ├── embeddings.py
+│   ├── chunking.py
 │   └── evaluation.py
+├── integration/
+│   ├── config_adapter.py
+│   └── langchain_prebuilder.py
 ├── autotuner.py
 ├── explainer.py
 ├── leaderboard.py
@@ -295,21 +343,42 @@ Your `pyproject.toml` includes all required dependencies:
 name = "ragmint"
 version = "0.1.0"
 dependencies = [
+  # Core ML + Embeddings
   "numpy<2.0.0",
   "pandas>=2.0",
   "scikit-learn>=1.3",
-  "openai>=1.0",
-  "tqdm",
-  "pyyaml",
+  "sentence-transformers>=2.2.2",
+  # Retrieval backends
   "chromadb>=0.4",
-  "faiss-cpu; sys_platform != 'darwin'",
+  "faiss-cpu; sys_platform != 'darwin'",       # For Linux/Windows
+  "faiss-cpu==1.7.4; sys_platform == 'darwin'", # Optional fix for macOS MPS
+  "rank-bm25>=0.2.2",                          # For BM25 retriever
+  # Optimization & evaluation
   "optuna>=3.0",
-  "pytest",
+  "tqdm",
   "colorama",
+  # RAG evaluation and data utils
+  "pyyaml",
+  "python-dotenv",
+  # Explainability and LLM APIs
+  "openai>=1.0.0",
   "google-generativeai>=0.8.0",
+  "anthropic>=0.25.0",
+  # Integration / storage
   "supabase>=2.4.0",
-  "python-dotenv",
-  "sentence-transformers"
+  # Testing
+  "pytest",
+  # LangChain integration layer
+  "langchain>=0.2.5",
+  "langchain-community>=0.2.5",
+  "langchain-text-splitters>=0.2.1"
 ]
 ```
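The README changes in this hunk introduce `chunk_size` and `overlap` settings (500 and 100 in the example config), but the diff shown here does not include the contents of `core/chunking.py`. The following is therefore only a minimal sketch of fixed-size character chunking with overlap, assuming that both values are measured in characters. The function name `chunk_text` is hypothetical and is not ragmint's API.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters. Hypothetical helper,
    not taken from the ragmint source.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the example values, each window advances by 400 characters, so the last 100 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.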

ragmint-0.3.0/src/ragmint.egg-info/PKG-INFO → ragmint-0.3.1/README.md RENAMED

@@ -1,33 +1,3 @@
-Metadata-Version: 2.4
-Name: ragmint
-Version: 0.3.0
-Summary: A modular framework for evaluating and optimizing RAG pipelines.
-Author-email: Andre Oliveira <oandreoliveira@outlook.com>
-License: Apache License 2.0
-Project-URL: Homepage, https://github.com/andyolivers/ragmint
-Project-URL: Documentation, https://andyolivers.com
-Project-URL: Issues, https://github.com/andyolivers/ragmint/issues
-Keywords: RAG,LLM,retrieval,optimization,AI,evaluation
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: numpy<2.0.0
-Requires-Dist: pandas>=2.0
-Requires-Dist: scikit-learn>=1.3
-Requires-Dist: openai>=1.0
-Requires-Dist: tqdm
-Requires-Dist: pyyaml
-Requires-Dist: chromadb>=0.4
-Requires-Dist: faiss-cpu; sys_platform != "darwin"
-Requires-Dist: optuna>=3.0
-Requires-Dist: pytest
-Requires-Dist: colorama
-Requires-Dist: google-generativeai>=0.8.0
-Requires-Dist: supabase>=2.4.0
-Requires-Dist: python-dotenv
-Requires-Dist: sentence-transformers
-Dynamic: license-file
 # Ragmint
 ![Python](https://img.shields.io/badge/python-3.9%2B-blue)
@@ -40,7 +10,7 @@ Dynamic: license-file
 **Ragmint** (Retrieval-Augmented Generation Model Inspection & Tuning) is a modular, developer-friendly Python library for **evaluating, optimizing, and tuning RAG (Retrieval-Augmented Generation) pipelines**.
-It provides a complete toolkit for **retriever selection**, **embedding model tuning**, and **automated RAG evaluation** with support for **Optuna-based Bayesian optimization**, **Auto-RAG tuning**, and **explainability** through Gemini or Claude.
+It provides a complete toolkit for **retriever selection**, **embedding model tuning**, **automated RAG evaluation**, and **config-driven prebuilding** of pipelines with support for **Optuna-based Bayesian optimization**, **Auto-RAG tuning**, **chunking**, and **explainability** through Gemini or Claude.
 ---
@@ -55,6 +25,9 @@ It provides a complete toolkit for **retriever selection**, **embedding model tu
 - 🧩 **Embeddings** — Hugging Face
 - 💾 **Caching, experiment tracking, and reproducibility** out of the box
 - 🧰 **Clean modular structure** for easy integration in research and production setups
+- 📦 **Chunking system** — automatic or configurable chunk_size and overlap for documents
+- 🏗️ **Langchain Prebuilder** — prepares pipelines, applies chunking, embeddings, and vector store creation automatically
+- ⚙️ **Config Adapter (LangchainConfigAdapter)** — normalizes configuration, fills defaults, validates retrievers
 ---
@@ -83,6 +56,8 @@ Example `configs/default.yaml`:
 ```yaml
 retriever: faiss
 embedding_model: text-embedding-3-small
+chunk_size: 500
+overlap: 100
 reranker:
   mode: mmr
   lambda_param: 0.5
@@ -96,24 +71,23 @@ optimization:
 ### 3️⃣ Manual Pipeline Usage
 ```python
+from ragmint.prebuilder import PreBuilder
 from ragmint.tuner import RAGMint
-# Initialize RAGMint with available components
-rag = RAGMint(
+# Prebuild pipeline (chunking, embeddings, vector store)
+prebuilder = PreBuilder(
     docs_path="data/docs/",
-    retrievers=["faiss", "chroma", "sklearn"],
-    embeddings=["all-MiniLM-L6-v2", "sentence-transformers/all-MiniLM-L12-v2"],
-    rerankers=["mmr"]
+    config_path="configs/default.yaml"
 )
+pipeline = prebuilder.build_pipeline()
-# Run optimization over 3 trials using the default validation set
-best, results = rag.optimize(
-    validation_set=None,
-    metric="faithfulness",
-    trials=3
-)
+# Initialize RAGMint with prebuilt components
+rag = RAGMint(pipeline=pipeline)
+# Run optimization
+best, results = rag.optimize(validation_set=None, metric="faithfulness", trials=3)
 print("Best configuration:", best)
 ```
 ---
 # 🧩 Embeddings and Retrievers
@@ -121,8 +95,34 @@ print("Best configuration:", best)
 **Ragmint** supports a flexible set of embeddings and retrievers, allowing you to adapt easily to various **RAG architectures**.
 ---
+## 🧩 Chunking System
+* **Automatically splits documents** into chunks with `chunk_size` and `overlap` parameters.
+* **Supports default values** if not provided in configuration.
+* **Optimized** for downstream **retrieval and embeddings**.
+* **Enables adaptive chunking strategies** in future releases.
+---
+## 🧩 Langchain Config Adapter
+* **Ensures consistent configuration** across pipeline components.
+* **Normalizes retriever and embedding names** (e.g., `faiss`, `sentence-transformers/...`).
+* **Adds default chunk parameters** when missing.
+* **Validates retriever backends** and **raises clear errors** for unsupported options.
+---
+## 🧩 Langchain Prebuilder
-## 🔤 Available Embeddings (Hugging Face / OpenAI)
+**Automates pipeline preparation:**
+1. Reads documents
+2. Applies chunking
+3. Creates embeddings
+4. Initializes retriever / vector store
+5. Returns ready-to-use pipeline** for RAGMint or custom usage.
+---
+## 🔤 Available Embeddings (Hugging Face)
 You can select from the following models:
@@ -258,8 +258,12 @@ ragmint/
 │   ├── pipeline.py
 │   ├── retriever.py
 │   ├── reranker.py
-│   ├── embedding.py
+│   ├── embeddings.py
+│   ├── chunking.py
 │   └── evaluation.py
+├── integration/
+│   ├── config_adapter.py
+│   └── langchain_prebuilder.py
 ├── autotuner.py
 ├── explainer.py
 ├── leaderboard.py
@@ -295,21 +299,42 @@ Your `pyproject.toml` includes all required dependencies:
 name = "ragmint"
 version = "0.1.0"
 dependencies = [
+  # Core ML + Embeddings
   "numpy<2.0.0",
   "pandas>=2.0",
   "scikit-learn>=1.3",
-  "openai>=1.0",
-  "tqdm",
-  "pyyaml",
+  "sentence-transformers>=2.2.2",
+  # Retrieval backends
   "chromadb>=0.4",
-  "faiss-cpu; sys_platform != 'darwin'",
+  "faiss-cpu; sys_platform != 'darwin'",       # For Linux/Windows
+  "faiss-cpu==1.7.4; sys_platform == 'darwin'", # Optional fix for macOS MPS
+  "rank-bm25>=0.2.2",                          # For BM25 retriever
+  # Optimization & evaluation
   "optuna>=3.0",
-  "pytest",
+  "tqdm",
   "colorama",
+  # RAG evaluation and data utils
+  "pyyaml",
+  "python-dotenv",
+  # Explainability and LLM APIs
+  "openai>=1.0.0",
   "google-generativeai>=0.8.0",
+  "anthropic>=0.25.0",
+  # Integration / storage
   "supabase>=2.4.0",
-  "python-dotenv",
-  "sentence-transformers"
+  # Testing
+  "pytest",
+  # LangChain integration layer
+  "langchain>=0.2.5",
+  "langchain-community>=0.2.5",
+  "langchain-text-splitters>=0.2.1"
 ]
 ```
@@ -369,4 +394,4 @@ Licensed under the **Apache License 2.0** — free for personal, research, and c
 **André Oliveira**
 [andyolivers.com](https://andyolivers.com)
-Data Scientist | AI Engineer
+Data Scientist | AI Engineer
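The README text added in this diff describes a config adapter that normalizes names, fills in default chunk parameters, and rejects unsupported retrievers. The adapter's source (`integration/config_adapter.py`) is not part of the hunks shown, so the sketch below only illustrates that described behavior. The function name `normalize_config`, the set of supported backends, and the default values (borrowed from the example `configs/default.yaml`) are all assumptions.

```python
from typing import Any, Dict

# Assumed for illustration only; the real adapter's backends and defaults
# are not visible in this diff.
SUPPORTED_RETRIEVERS = {"faiss", "chroma", "bm25"}
DEFAULT_CHUNKING = {"chunk_size": 500, "overlap": 100}


def normalize_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of `raw` (hypothetical helper)."""
    cfg = dict(raw)  # never mutate the caller's dictionary

    # Normalize the retriever name so "FAISS", " faiss " and "faiss" agree.
    retriever = str(cfg.get("retriever", "faiss")).strip().lower()
    if retriever not in SUPPORTED_RETRIEVERS:
        raise ValueError(
            f"Unsupported retriever {retriever!r}; "
            f"expected one of {sorted(SUPPORTED_RETRIEVERS)}"
        )
    cfg["retriever"] = retriever

    # Fill in chunking defaults only where the user gave no value.
    for key, value in DEFAULT_CHUNKING.items():
        cfg.setdefault(key, value)

    if cfg["overlap"] >= cfg["chunk_size"]:
        raise ValueError("overlap must be smaller than chunk_size")
    return cfg
```

Failing early with a message that lists the accepted backends is one way to get the "clear errors" the README mentions; the actual adapter may report problems differently.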

{ragmint-0.3.0 → ragmint-0.3.1}/pyproject.toml RENAMED

@@ -4,33 +4,58 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "ragmint"
-version = "0.3.0"
+version = "0.3.1"
 description = "A modular framework for evaluating and optimizing RAG pipelines."
 readme = "README.md"
 license = { text = "Apache License 2.0" }
 authors = [
   { name = "Andre Oliveira", email = "oandreoliveira@outlook.com" }
 ]
-keywords = ["RAG", "LLM", "retrieval", "optimization", "AI", "evaluation"]
+keywords = ["RAG", "LLM", "retrieval", "optimization", "AI", "evaluation", "chunking", "autotuning"]
 requires-python = ">=3.9"
 dependencies = [
+  # Core ML + Embeddings
   "numpy<2.0.0",
   "pandas>=2.0",
   "scikit-learn>=1.3",
-  "openai>=1.0",
-  "tqdm",
-  "pyyaml",
-  "chromadb>=0.4",
-  "faiss-cpu; sys_platform != 'darwin'",
+  "sentence-transformers>=2.2.2",
+  # Retrieval backends
+  "chromadb>=0.3.1",
+  "faiss-cpu; sys_platform != 'darwin'",       # For Linux/Windows
+  "faiss-cpu==1.7.4; sys_platform == 'darwin'", # Optional fix for macOS MPS
+  "rank-bm25>=0.2.2",                          # For BM25 retriever
+  # Optimization & evaluation
   "optuna>=3.0",
-  "pytest",
+  "tqdm",
   "colorama",
+  # RAG evaluation and data utils
+  "pyyaml",
+  "python-dotenv",
+  # Explainability and LLM APIs
+  "openai>=1.0.0",
   "google-generativeai>=0.8.0",
+  "anthropic>=0.25.0",
+  # Integration / storage
   "supabase>=2.4.0",
-  "python-dotenv",
-  "sentence-transformers"
+  # Testing
+  "pytest",
+  # LangChain integration layer
+  "langchain>=0.2.5",
+  "langchain-community>=0.2.5",
+  "langchain-text-splitters>=0.2.1"
 ]
+[project.optional-dependencies]
+dev = ["black", "flake8", "isort", "pytest-cov"]
+docs = ["mkdocs", "mkdocs-material"]
 [project.urls]
 Homepage = "https://github.com/andyolivers/ragmint"
 Documentation = "https://andyolivers.com"
@@ -51,5 +76,4 @@ ragmint = ["experiments/*.json"]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
-addopts = "-v"
+addopts = "-v --tb=short"

ragmint-0.3.1/src/ragmint/autotuner.py ADDED

@@ -0,0 +1,138 @@
+"""
+Auto-RAG Tuner
+--------------
+Automatically recommends and optimizes RAG configurations based on corpus statistics.
+Integrates with RAGMint to perform full end-to-end tuning.
+"""
+import os
+import logging
+from statistics import mean
+from typing import Dict, Any, Tuple, List
+from .tuner import RAGMint
+from .core.evaluation import evaluate_config
+logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")
+class AutoRAGTuner:
+    def __init__(self, docs_path: str):
+        """
+        AutoRAGTuner automatically analyzes a corpus and runs an optimized RAG tuning pipeline.
+        Args:
+            docs_path (str): Path to the directory containing documents (.txt, .md, .rst)
+        """
+        self.docs_path = docs_path
+        self.corpus_stats = self._analyze_corpus()
+    # -----------------------------
+    # Corpus Analysis
+    # -----------------------------
+    def _analyze_corpus(self) -> Dict[str, Any]:
+        """Compute corpus size, average length, and number of documents."""
+        docs = []
+        total_chars = 0
+        num_docs = 0
+        if not os.path.exists(self.docs_path):
+            logging.warning(f"⚠️ Corpus path not found: {self.docs_path}")
+            return {"size": 0, "avg_len": 0, "num_docs": 0}
+        for file in os.listdir(self.docs_path):
+            if file.endswith((".txt", ".md", ".rst")):
+                with open(os.path.join(self.docs_path, file), "r", encoding="utf-8") as f:
+                    content = f.read()
+                    docs.append(content)
+                    total_chars += len(content)
+                    num_docs += 1
+        avg_len = int(mean([len(d) for d in docs])) if docs else 0
+        stats = {"size": total_chars, "avg_len": avg_len, "num_docs": num_docs}
+        logging.info(f"📊 Corpus stats: {stats}")
+        return stats
+    # -----------------------------
+    # Recommendation Logic
+    # -----------------------------
+    def recommend(self) -> Dict[str, Any]:
+        """Recommend retriever, embedding, and chunking based on corpus stats."""
+        size = self.corpus_stats.get("size", 0)
+        avg_len = self.corpus_stats.get("avg_len", 0)
+        num_docs = self.corpus_stats.get("num_docs", 0)
+        # Heuristic-based tuning
+        # Determine chunking heuristics first
+        if avg_len < 200:
+            chunk_size, overlap = 300, 50
+        elif avg_len < 500:
+            chunk_size, overlap = 500, 100
+        else:
+            chunk_size, overlap = 800, 150
+        # Determine retriever–embedding based on corpus size
+        if size <= 2000:
+            retriever = "BM25"
+            embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
+        elif size <= 10000:
+            retriever = "Chroma"
+            embedding_model = "sentence-transformers/paraphrase-MiniLM-L6-v2"
+        else:
+            retriever = "FAISS"
+            embedding_model = "sentence-transformers/all-mpnet-base-v2"
+        strategy = "fixed" if avg_len < 400 else "sentence"
+        recommendation = {
+            "retriever": retriever,
+            "embedding_model": embedding_model,
+            "chunk_size": chunk_size,
+            "overlap": overlap,
+            "strategy": strategy,
+        }
+        logging.info(f"🔮 AutoRAG Recommendation: {recommendation}")
+        return recommendation
+    # -----------------------------
+    # Full Auto-Tuning
+    # -----------------------------
+    def auto_tune(
+        self,
+        validation_set: str = None,
+        metric: str = "faithfulness",
+        trials: int = 5,
+        search_type: str = "random",
+    ) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
+        """
+        Run a full automatic optimization using RAGMint.
+        Automatically:
+        - Recommends initial config (retriever, embedding, chunking)
+        - Launches RAGMint optimization trials
+        - Returns best configuration and results
+        """
+        rec = self.recommend()
+        logging.info("🚀 Launching full AutoRAG optimization with RAGMint")
+        tuner = RAGMint(
+            docs_path=self.docs_path,
+            retrievers=[rec["retriever"]],
+            embeddings=[rec["embedding_model"]],
+            rerankers=["mmr"],
+            chunk_sizes=[rec["chunk_size"]],
+            overlaps=[rec["overlap"]],
+            strategies=[rec["strategy"]],
+        )
+        best, results = tuner.optimize(
+            validation_set=validation_set,
+            metric=metric,
+            trials=trials,
+            search_type=search_type,
+        )
+        logging.info(f"🏁 AutoRAG tuning complete. Best: {best}")
+        return best, results
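The heuristics in `AutoRAGTuner.recommend()` can be easier to review when separated from file I/O and logging. The sketch below restates the thresholds from the added file as a pure function. The name `recommend_config` is hypothetical, and `size` is assumed to be the total character count of the corpus, as computed by `_analyze_corpus()` in the diff.

```python
from typing import Any, Dict


def recommend_config(size: int, avg_len: int) -> Dict[str, Any]:
    """Standalone restatement of the heuristics in the diff (hypothetical name).

    size:    total characters across all documents
    avg_len: average characters per document
    """
    # Chunking depends on the average document length.
    if avg_len < 200:
        chunk_size, overlap = 300, 50
    elif avg_len < 500:
        chunk_size, overlap = 500, 100
    else:
        chunk_size, overlap = 800, 150

    # Retriever and embedding model depend on total corpus size.
    if size <= 2000:
        retriever = "BM25"
        embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
    elif size <= 10000:
        retriever = "Chroma"
        embedding_model = "sentence-transformers/paraphrase-MiniLM-L6-v2"
    else:
        retriever = "FAISS"
        embedding_model = "sentence-transformers/all-mpnet-base-v2"

    # Short documents get fixed-size chunks, longer ones sentence-based chunks.
    strategy = "fixed" if avg_len < 400 else "sentence"

    return {
        "retriever": retriever,
        "embedding_model": embedding_model,
        "chunk_size": chunk_size,
        "overlap": overlap,
        "strategy": strategy,
    }
```

Note that the chunking cutoffs (200 and 500) and the strategy cutoff (400) are independent, so a corpus with an average length between 400 and 499 characters gets 500-character chunks together with the `sentence` strategy.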
