haiku.rag 0.7.7__tar.gz → 0.8.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (78)
  1. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/PKG-INFO +1 -1
  2. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/benchmarks.md +2 -2
  3. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/cli.md +70 -53
  4. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/configuration.md +32 -0
  5. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/index.md +0 -1
  6. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/python.md +18 -0
  7. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/pyproject.toml +2 -1
  8. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/app.py +9 -0
  9. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/cli.py +12 -0
  10. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/client.py +10 -0
  11. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/config.py +4 -0
  12. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/logging.py +3 -0
  13. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/migration.py +3 -3
  14. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/engine.py +33 -6
  15. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/chunk.py +24 -1
  16. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/document.py +48 -28
  17. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/settings.py +8 -3
  18. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/utils.py +54 -0
  19. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/generate_benchmark_db.py +6 -1
  20. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_client.py +91 -95
  21. haiku_rag-0.8.1/tests/test_preprocessor.py +71 -0
  22. haiku_rag-0.8.1/tests/test_versioning.py +94 -0
  23. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/uv.lock +175 -1
  24. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/.github/FUNDING.yml +0 -0
  25. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/.github/workflows/build-docs.yml +0 -0
  26. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/.github/workflows/build-publish.yml +0 -0
  27. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/.gitignore +0 -0
  28. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/.pre-commit-config.yaml +0 -0
  29. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/.python-version +0 -0
  30. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/LICENSE +0 -0
  31. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/README.md +0 -0
  32. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/installation.md +0 -0
  33. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/mcp.md +0 -0
  34. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/server.md +0 -0
  35. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/mkdocs.yml +0 -0
  36. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/__init__.py +0 -0
  37. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/chunker.py +0 -0
  38. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/embeddings/__init__.py +0 -0
  39. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/embeddings/base.py +0 -0
  40. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/embeddings/ollama.py +0 -0
  41. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/embeddings/openai.py +0 -0
  42. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/embeddings/vllm.py +0 -0
  43. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/embeddings/voyageai.py +0 -0
  44. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/mcp.py +0 -0
  45. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/monitor.py +0 -0
  46. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/qa/__init__.py +0 -0
  47. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/qa/agent.py +0 -0
  48. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/qa/prompts.py +0 -0
  49. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/reader.py +0 -0
  50. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/reranking/__init__.py +0 -0
  51. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/reranking/base.py +0 -0
  52. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/reranking/cohere.py +0 -0
  53. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/reranking/mxbai.py +0 -0
  54. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/reranking/vllm.py +0 -0
  55. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/__init__.py +0 -0
  56. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/models/__init__.py +0 -0
  57. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/models/chunk.py +0 -0
  58. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/models/document.py +0 -0
  59. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/__init__.py +0 -0
  60. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/upgrades/__init__.py +0 -0
  61. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/__init__.py +0 -0
  62. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/conftest.py +0 -0
  63. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/llm_judge.py +0 -0
  64. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_app.py +0 -0
  65. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_chunk.py +0 -0
  66. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_chunker.py +0 -0
  67. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_cli.py +0 -0
  68. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_document.py +0 -0
  69. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_embedder.py +0 -0
  70. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_lancedb_connection.py +0 -0
  71. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_monitor.py +0 -0
  72. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_qa.py +0 -0
  73. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_reader.py +0 -0
  74. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_rebuild.py +0 -0
  75. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_reranker.py +0 -0
  76. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_search.py +0 -0
  77. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_settings.py +0 -0
  78. {haiku_rag-0.7.7 → haiku_rag-0.8.1}/tests/test_utils.py +0 -0

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: haiku.rag
-Version: 0.7.7
+Version: 0.8.1
 Summary: Retrieval Augmented Generation (RAG) with LanceDB
 Author-email: Yiorgis Gozadinos <ggozadinos@gmail.com>
 License: MIT

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/benchmarks.md

@@ -16,8 +16,8 @@ The recall obtained is ~0.79 for matching in the top result, raising to ~0.91 fo
 |---------------------------------------|-------------------|-------------------|------------------------|
 | Ollama / `mxbai-embed-large` | 0.79 | 0.91 | None |
 | Ollama / `mxbai-embed-large` | 0.90 | 0.95 | `mxbai-rerank-base-v2` |
-<!-- | Ollama / `nomic-embed-text` | 0.74 | 0.88 | None |
-| OpenAI / `text-embeddings-3-small` | 0.75 | 0.88 | None |
+| Ollama / `nomic-embed-text-v1.5` | 0.74 | 0.90 | None |
+<!-- | OpenAI / `text-embeddings-3-small` | 0.75 | 0.88 | None |
 | OpenAI / `text-embeddings-3-small` | 0.75 | 0.88 | None |
 | OpenAI / `text-embeddings-3-small` | 0.83 | 0.90 | Cohere / `rerank-v3.5` | -->
 

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/cli.md

@@ -2,22 +2,17 @@
 
 The `haiku-rag` CLI provides complete document management functionality.
 
-## Shell Autocompletion
-
-Enable shell autocompletion for faster, error‑free usage.
+!!! note
+    All commands support:
 
-- Temporary (current shell only):
-```bash
-eval "$(haiku-rag --show-completion)"
-```
-- Permanent installation:
-```bash
-haiku-rag --install-completion
-```
+    - `--db` - Specify custom database path
+    - `-h` - Show help for specific command
 
-What’s completed:
-- `get` and `delete`/`rm`: Document IDs from the selected database (respects `--db`).
-- `add-src`: Local filesystem paths (URLs can still be typed manually).
+Example:
+```bash
+haiku-rag list --db /path/to/custom.db
+haiku-rag add -h
+```
 
 ## Document Management
 
@@ -40,6 +35,12 @@ haiku-rag add-src /path/to/document.pdf
 haiku-rag add-src https://example.com/article.html
 ```
 
+!!! note
+    As you add documents to `haiku.rag` the database keeps growing. By default, LanceDB supports versioning
+    of your data. Create/update operations are atomic‑feeling: if anything fails during chunking or embedding,
+    the database rolls back to the pre‑operation snapshot using LanceDB table versioning. You can optimize and
+    compact the database by running the [vacuum](#vacuum-optimize-and-cleanup) command.
+
 ### Get Document
 
 ```bash
@@ -55,33 +56,8 @@ haiku-rag delete <TAB>
 haiku-rag rm <TAB> # alias
 ```
 
-### Rebuild Database
-
-Rebuild the database by deleting all chunks & embeddings and re-indexing all documents:
-
-```bash
-haiku-rag rebuild
-```
-
 Use this when you want to change things like the embedding model or chunk size for example.
 
-## Migration
-
-### Migrate from SQLite to LanceDB
-
-Migrate an existing SQLite database to LanceDB:
-
-```bash
-haiku-rag migrate /path/to/old_database.sqlite
-```
-
-This will:
-- Read all documents, chunks, embeddings, and settings from the SQLite database
-- Create a new LanceDB database with the same data in the same directory
-- Optimize the new database for best performance
-
-The original SQLite database remains unchanged, so you can safely migrate without risk of data loss.
-
 ## Search
 
 Basic search:
@@ -108,13 +84,6 @@ haiku-rag ask "Who is the author of haiku.rag?" --cite
 
 The QA agent will search your documents for relevant information and provide a comprehensive answer. With `--cite`, responses include citations showing which documents were used.
 
-## Configuration
-
-View current configuration settings:
-```bash
-haiku-rag settings
-```
-
 ## Server
 
 Start the MCP server:
@@ -129,14 +98,62 @@ haiku-rag serve --stdio
 haiku-rag serve --sse
 ```
 
-## Options
+## Settings
 
-All commands support:
-- `--db` - Specify custom database path
-- `-h` - Show help for specific command
+View current configuration settings:
+```bash
+haiku-rag settings
+```
+
+## Maintenance
+
+### Vacuum (Optimize and Cleanup)
+
+Reduce disk usage by optimizing and pruning old table versions across all tables:
+
+```bash
+haiku-rag vacuum
+```
+
+### Rebuild Database
+
+Rebuild the database by deleting all chunks & embeddings and re-indexing all documents. This is useful
+when want to switch embeddings provider or model:
 
-Example:
 ```bash
-haiku-rag list --db /path/to/custom.db
-haiku-rag add -h
+haiku-rag rebuild
 ```
+
+## Migration
+
+### Migrate from SQLite to LanceDB
+
+Migrate an existing SQLite database to LanceDB:
+
+```bash
+haiku-rag migrate /path/to/old_database.sqlite
+```
+
+This will:
+- Read all documents, chunks, embeddings, and settings from the SQLite database
+- Create a new LanceDB database with the same data in the same directory
+- Optimize the new database for best performance
+
+The original SQLite database remains unchanged, so you can safely migrate without risk of data loss.
+
+## Shell Autocompletion
+
+Enable shell autocompletion for faster, error‑free usage.
+
+- Temporary (current shell only):
+```bash
+eval "$(haiku-rag --show-completion)"
+```
+- Permanent installation:
+```bash
+haiku-rag --install-completion
+```
+
+What’s completed:
+- `get` and `delete`/`rm`: Document IDs from the selected database (respects `--db`).
+- `add-src`: Local filesystem paths (URLs can still be typed manually).

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/configuration.md

@@ -223,3 +223,35 @@ CHUNK_SIZE=256
 # into single chunks with continuous content to eliminate duplication
 CONTEXT_CHUNK_RADIUS=0
 ```
+
+#### Markdown Preprocessor
+
+Optionally preprocess Markdown before chunking by pointing to a callable that receives and returns Markdown text. This is useful for normalizing content, stripping boilerplate, or applying custom transformations before chunk boundaries are computed.
+
+```bash
+# A callable path in one of these formats:
+# - package.module:func
+# - package.module.func
+# - /abs/or/relative/path/to/file.py:func
+MARKDOWN_PREPROCESSOR="my_pkg.preprocess:clean_md"
+```
+
+!!! note
+    - The function signature should be `def clean_md(text: str) -> str` or `async def clean_md(text: str) -> str`.
+    - If the function raises or returns a non-string, haiku.rag logs a warning and proceeds without preprocessing.
+    - The preprocessor affects only the chunking pipeline. The stored document content remains unchanged.
+
+Example implementation:
+
+```python
+# my_pkg/preprocess.py
+def clean_md(text: str) -> str:
+    # strip HTML comments and collapse multiple blank lines
+    lines = [line for line in text.splitlines() if not line.strip().startswith("<!--")]
+    out = []
+    for line in lines:
+        if line.strip() == "" and (out and out[-1] == ""):
+            continue
+        out.append(line)
+    return "\n".join(out)
+```
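
The note in this new configuration section also allows an `async def` preprocessor. As a hedged illustration (not part of the release, and `my_pkg/preprocess.py` is the same hypothetical module the docs use), an async variant could look like:

```python
# my_pkg/preprocess.py (hypothetical async variant)
import re


async def clean_md(text: str) -> str:
    # Drop HTML comments, then collapse runs of blank lines. haiku.rag awaits
    # the result when the configured callable returns an awaitable.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"
```

It would be wired up the same way, e.g. `MARKDOWN_PREPROCESSOR="my_pkg.preprocess:clean_md"`.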

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/index.md

@@ -52,7 +52,6 @@ haiku-rag migrate old_database.sqlite # Migrate from SQLite
 - [Installation](installation.md) - Install haiku.rag with different providers
 - [Configuration](configuration.md) - Environment variables and settings
 - [CLI](cli.md) - Command line interface usage
-- [Question Answering](qa.md) - QA agents and natural language queries
 - [Server](server.md) - File monitoring and server mode
 - [MCP](mcp.md) - Model Context Protocol integration
 - [Python](python.md) - Python API reference

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/docs/python.md

@@ -99,6 +99,24 @@ async for doc_id in client.rebuild_database():
     print(f"Processed document {doc_id}")
 ```
 
+## Maintenance
+
+Run maintenance to optimize storage and prune old table versions:
+
+```python
+await client.vacuum()
+```
+
+This compacts tables and removes historical versions to keep disk usage in check. It’s safe to run anytime, for example after bulk imports or periodically in long‑running apps.
+
+### Atomic Writes and Rollback
+
+Document create and update operations take a snapshot of table versions before any write and automatically roll back to that snapshot if something fails (for example, during chunking or embedding). This restores both the `documents` and `chunks` tables to their pre‑operation state using LanceDB's table versioning.
+
+- Applies to: `create_document(...)`, `create_document_from_source(...)`, `update_document(...)`, and internal rebuild/update flows.
+- Scope: Both document rows and all associated chunks are rolled back together.
+- Vacuum: Running `vacuum()` later prunes old versions for disk efficiency; rollbacks occur immediately during the failing operation and are not impacted.
+
 
 ## Searching Documents
 The search method performs native hybrid search (vector + full-text) using LanceDB with optional reranking for improved relevance:
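
Read together, the documented maintenance and rollback behavior could be exercised roughly as follows. This is a sketch, not part of the release: the `from haiku.rag.client import HaikuRAG` import path is assumed from the package layout, `create_document_from_source` and `vacuum` are the methods documented above, and `reports/q3.pdf` is a made-up source.

```python
import asyncio

from haiku.rag.client import HaikuRAG  # import path assumed


async def main() -> None:
    async with HaikuRAG(db_path="./haiku.rag.lancedb") as client:
        try:
            # If chunking or embedding fails, the documents and chunks tables
            # are rolled back to their pre-operation versions automatically.
            await client.create_document_from_source("reports/q3.pdf")
        except Exception as exc:
            print(f"Ingestion failed, store left at its previous snapshot: {exc}")

        # Prune the table versions that accumulate as documents are added.
        await client.vacuum()


asyncio.run(main())
```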

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/pyproject.toml

@@ -1,6 +1,6 @@
 [project]
 name = "haiku.rag"
-version = "0.7.7"
+version = "0.8.1"
 description = "Retrieval Augmented Generation (RAG) with LanceDB"
 authors = [{ name = "Yiorgis Gozadinos", email = "ggozadinos@gmail.com" }]
 license = { text = "MIT" }
@@ -53,6 +53,7 @@ packages = ["src/haiku"]
 [dependency-groups]
 dev = [
     "datasets>=3.6.0",
+    "logfire>=4.6.0",
     "mkdocs>=1.6.1",
     "mkdocs-material>=9.6.14",
     "pre-commit>=4.2.0",

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/app.py

@@ -102,6 +102,15 @@ class HaikuRAGApp:
         except Exception as e:
             self.console.print(f"[red]Error rebuilding database: {e}[/red]")
 
+    async def vacuum(self):
+        """Run database maintenance: optimize and cleanup table history."""
+        try:
+            async with HaikuRAG(db_path=self.db_path, skip_validation=True) as client:
+                await client.vacuum()
+            self.console.print("[b]Vacuum completed successfully.[/b]")
+        except Exception as e:
+            self.console.print(f"[red]Error during vacuum: {e}[/red]")
+
     def show_settings(self):
         """Display current configuration settings."""
         self.console.print("[bold]haiku.rag configuration[/bold]")

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/cli.py

@@ -256,6 +256,18 @@ def rebuild(
     asyncio.run(app.rebuild())
 
 
+@cli.command("vacuum", help="Optimize and clean up all tables to reduce disk usage")
+def vacuum(
+    db: Path = typer.Option(
+        Config.DEFAULT_DATA_DIR / "haiku.rag.lancedb",
+        "--db",
+        help="Path to the LanceDB database file",
+    ),
+):
+    app = HaikuRAGApp(db_path=db)
+    asyncio.run(app.vacuum())
+
+
 @cli.command(
     "serve", help="Start the haiku.rag MCP server (by default in streamable HTTP mode)"
 )

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/client.py

@@ -550,6 +550,16 @@ class HaikuRAG:
             )
             yield doc.id
 
+        # Final maintenance: centralized vacuum to curb disk usage
+        try:
+            self.store.vacuum()
+        except Exception:
+            pass
+
+    async def vacuum(self) -> None:
+        """Optimize and clean up old versions across all tables."""
+        self.store.vacuum()
+
     def close(self):
         """Close the underlying store connection."""
         self.store.close()

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/config.py

@@ -32,6 +32,10 @@ class AppConfig(BaseModel):
     CHUNK_SIZE: int = 256
     CONTEXT_CHUNK_RADIUS: int = 0
 
+    # Optional dotted path or file path to a callable that preprocesses
+    # markdown content before chunking. Examples:
+    MARKDOWN_PREPROCESSOR: str = ""
+
     OLLAMA_BASE_URL: str = "http://localhost:11434"
     VLLM_EMBEDDINGS_BASE_URL: str = ""
     VLLM_RERANK_BASE_URL: str = ""

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/logging.py

@@ -1,4 +1,5 @@
 import logging
+import warnings
 
 from rich.console import Console
 from rich.logging import RichHandler
@@ -50,4 +51,6 @@ def configure_cli_logging(level: int = logging.INFO) -> logging.Logger:
     logger = get_logger()
     logger.setLevel(level)
     logger.propagate = False
+
+    warnings.filterwarnings("ignore")
     return logger

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/migration.py

@@ -47,7 +47,7 @@ class SQLiteToLanceDBMigrator:
 
         # Load the sqlite-vec extension
         try:
-            import sqlite_vec
+            import sqlite_vec  # type: ignore
 
             sqlite_conn.enable_load_extension(True)
             sqlite_vec.load(sqlite_conn)
@@ -91,10 +91,10 @@
 
         sqlite_conn.close()
 
-        # Optimize the chunks table after migration
+        # Optimize and cleanup using centralized vacuum
         self.console.print("[blue]Optimizing LanceDB...[/blue]")
         try:
-            lance_store.chunks_table.optimize()
+            lance_store.vacuum()
             self.console.print("[green]✅ Optimization completed[/green]")
         except Exception as e:
             self.console.print(

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/engine.py

@@ -1,5 +1,6 @@
 import json
 import logging
+from datetime import timedelta
 from importlib import metadata
 from pathlib import Path
 from uuid import uuid4
@@ -62,6 +63,15 @@ class Store:
         if not skip_validation:
             self._validate_configuration()
 
+    def vacuum(self) -> None:
+        """Optimize and clean up old versions across all tables to reduce disk usage."""
+        if self._has_cloud_config() and str(Config.LANCEDB_URI).startswith("db://"):
+            return
+
+        # Perform maintenance per table using optimize() with cleanup_older_than 0
+        for table in [self.documents_table, self.chunks_table, self.settings_table]:
+            table.optimize(cleanup_older_than=timedelta(0))
+
     def _connect_to_lancedb(self, db_path: Path):
         """Establish connection to LanceDB (local, cloud, or object storage)."""
         # Check if we have cloud configuration
@@ -159,16 +169,18 @@ class Store:
             self.settings_table.search().limit(1).to_pydantic(SettingsRecord)
         )
         if settings_records:
-            settings = (
+            # Only write if version actually changes to avoid creating new table versions
+            current = (
                 json.loads(settings_records[0].settings)
                 if settings_records[0].settings
                 else {}
             )
-            settings["version"] = version
-            # Update the record
-            self.settings_table.update(
-                where="id = 'settings'", values={"settings": json.dumps(settings)}
-            )
+            if current.get("version") != version:
+                current["version"] = version
+                self.settings_table.update(
+                    where="id = 'settings'",
+                    values={"settings": json.dumps(current)},
+                )
         else:
             # Create new settings record
             settings_data = Config.model_dump(mode="json")
@@ -197,6 +209,21 @@ class Store:
         # LanceDB connections are automatically managed
         pass
 
+    def current_table_versions(self) -> dict[str, int]:
+        """Capture current versions of key tables for rollback using LanceDB's API."""
+        return {
+            "documents": int(self.documents_table.version),
+            "chunks": int(self.chunks_table.version),
+            "settings": int(self.settings_table.version),
+        }
+
+    def restore_table_versions(self, versions: dict[str, int]) -> bool:
+        """Restore tables to the provided versions using LanceDB's API."""
+        self.documents_table.restore(int(versions["documents"]))
+        self.chunks_table.restore(int(versions["chunks"]))
+        self.settings_table.restore(int(versions["settings"]))
+        return True
+
     @property
     def _connection(self):
         """Compatibility property for repositories expecting _connection."""

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/chunk.py

@@ -1,4 +1,5 @@
 import asyncio
+import inspect
 import json
 import logging
 from uuid import uuid4
@@ -11,6 +12,7 @@ from haiku.rag.config import Config
 from haiku.rag.embeddings import get_embedder
 from haiku.rag.store.engine import DocumentRecord, Store
 from haiku.rag.store.models.chunk import Chunk
+from haiku.rag.utils import load_callable, text_to_docling_document
 
 logger = logging.getLogger(__name__)
 
@@ -152,7 +154,28 @@ class ChunkRepository:
         self, document_id: str, document: DoclingDocument
     ) -> list[Chunk]:
         """Create chunks and embeddings for a document from DoclingDocument."""
-        chunk_texts = await chunker.chunk(document)
+        # Optionally preprocess markdown before chunking
+        processed_document = document
+        preprocessor_path = Config.MARKDOWN_PREPROCESSOR
+        if preprocessor_path:
+            try:
+                pre_fn = load_callable(preprocessor_path)
+                markdown = document.export_to_markdown()
+                result = pre_fn(markdown)
+                if inspect.isawaitable(result):
+                    result = await result  # type: ignore[assignment]
+                processed_markdown = result
+                if not isinstance(processed_markdown, str):
+                    raise ValueError("Preprocessor must return a markdown string")
+                processed_document = text_to_docling_document(
+                    processed_markdown, name="content.md"
+                )
+            except Exception as e:
+                logger.warning(
+                    f"Failed to apply MARKDOWN_PREPROCESSOR '{preprocessor_path}': {e}. Proceeding without preprocessing."
+                )
+
+        chunk_texts = await chunker.chunk(processed_document)
 
         embeddings = await self.embedder.embed(chunk_texts)
 

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/document.py

@@ -171,44 +171,64 @@ class DocumentRepository:
         chunks: list["Chunk"] | None = None,
     ) -> Document:
         """Create a document with its chunks and embeddings."""
+        # Snapshot table versions for versioned rollback (if supported)
+        versions = self.store.current_table_versions()
+
         # Create the document
         created_doc = await self.create(entity)
 
-        # Create chunks if not provided
-        if chunks is None:
-            assert created_doc.id is not None, (
-                "Document ID should not be None after creation"
-            )
-            await self.chunk_repository.create_chunks_for_document(
-                created_doc.id, docling_document
-            )
-        else:
-            # Use provided chunks, set order from list position
-            assert created_doc.id is not None, (
-                "Document ID should not be None after creation"
-            )
-            for order, chunk in enumerate(chunks):
-                chunk.document_id = created_doc.id
-                chunk.metadata["order"] = order
-                await self.chunk_repository.create(chunk)
-
-        return created_doc
+        # Attempt to create chunks; on failure, prefer version rollback
+        try:
+            # Create chunks if not provided
+            if chunks is None:
+                assert created_doc.id is not None, (
+                    "Document ID should not be None after creation"
+                )
+                await self.chunk_repository.create_chunks_for_document(
+                    created_doc.id, docling_document
+                )
+            else:
+                # Use provided chunks, set order from list position
+                assert created_doc.id is not None, (
+                    "Document ID should not be None after creation"
+                )
+                for order, chunk in enumerate(chunks):
+                    chunk.document_id = created_doc.id
+                    chunk.metadata["order"] = order
+                    await self.chunk_repository.create(chunk)
+
+            return created_doc
+        except Exception:
+            # Roll back to the captured versions and re-raise
+            self.store.restore_table_versions(versions)
+            raise
 
     async def _update_with_docling(
         self, entity: Document, docling_document: DoclingDocument
     ) -> Document:
         """Update a document and regenerate its chunks."""
-        # Delete existing chunks
         assert entity.id is not None, "Document ID is required for update"
+
+        # Snapshot table versions for versioned rollback
+        versions = self.store.current_table_versions()
+
+        # Delete existing chunks before writing new ones
         await self.chunk_repository.delete_by_document_id(entity.id)
 
-        # Update the document
-        updated_doc = await self.update(entity)
+        try:
+            # Update the document
+            updated_doc = await self.update(entity)
 
-        # Create new chunks
-        assert updated_doc.id is not None, "Document ID should not be None after update"
-        await self.chunk_repository.create_chunks_for_document(
-            updated_doc.id, docling_document
-        )
+            # Create new chunks
+            assert updated_doc.id is not None, (
+                "Document ID should not be None after update"
+            )
+            await self.chunk_repository.create_chunks_for_document(
+                updated_doc.id, docling_document
+            )
 
-        return updated_doc
+            return updated_doc
+        except Exception:
+            # Roll back to the captured versions and re-raise
+            self.store.restore_table_versions(versions)
+            raise

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/store/repositories/settings.py

@@ -84,10 +84,15 @@ class SettingsRepository:
         )
 
         if existing:
-            # Update existing settings
-            self.store.settings_table.update(
-                where="id = 'settings'", values={"settings": json.dumps(current_config)}
+            # Only update when configuration actually changed to avoid needless new versions
+            existing_payload = (
+                json.loads(existing[0].settings) if existing[0].settings else {}
             )
+            if existing_payload != current_config:
+                self.store.settings_table.update(
+                    where="id = 'settings'",
+                    values={"settings": json.dumps(current_config)},
+                )
         else:
             # Create new settings
             settings_record = SettingsRecord(

{haiku_rag-0.7.7 → haiku_rag-0.8.1}/src/haiku/rag/utils.py

@@ -1,10 +1,13 @@
 import asyncio
+import importlib
+import importlib.util
 import sys
 from collections.abc import Callable
 from functools import wraps
 from importlib import metadata
 from io import BytesIO
 from pathlib import Path
+from types import ModuleType
 
 import httpx
 from docling.document_converter import DocumentConverter
@@ -106,3 +109,54 @@ def text_to_docling_document(text: str, name: str = "content.md") -> DoclingDocu
     converter = DocumentConverter()
     result = converter.convert(doc_stream)
     return result.document
+
+
+def load_callable(path: str):
+    """Load a callable from a dotted path or file path.
+
+    Supported formats:
+    - "package.module:func" or "package.module.func"
+    - "path/to/file.py:func"
+
+    Returns the loaded callable. Raises ValueError on failure.
+    """
+    if not path:
+        raise ValueError("Empty callable path provided")
+
+    module_part = None
+    func_name = None
+
+    if ":" in path:
+        module_part, func_name = path.split(":", 1)
+    else:
+        # split by last dot for module.attr
+        if "." in path:
+            module_part, func_name = path.rsplit(".", 1)
+        else:
+            raise ValueError(
+                "Invalid callable path format. Use 'module:func' or 'module.func' or 'file.py:func'."
+            )
+
+    # Try file path first
+    mod: ModuleType | None = None
+    module_path = Path(module_part)
+    if module_path.suffix == ".py" and module_path.exists():
+        spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
+        if spec and spec.loader:
+            mod = importlib.util.module_from_spec(spec)
+            spec.loader.exec_module(mod)
+    else:
+        # Import as a module path
+        try:
+            mod = importlib.import_module(module_part)
+        except Exception as e:
+            raise ValueError(f"Failed to import module '{module_part}': {e}")
+
+    if not hasattr(mod, func_name):
+        raise ValueError(f"Callable '{func_name}' not found in module '{module_part}'")
+    func = getattr(mod, func_name)
+    if not callable(func):
+        raise ValueError(
+            f"Attribute '{func_name}' in module '{module_part}' is not callable"
+        )
+    return func
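
For reference, the three path formats that `load_callable` accepts resolve as in the sketch below (`my_pkg` and the script path are hypothetical):

```python
from haiku.rag.utils import load_callable

# "module:attr" and "module.attr" forms resolve through importlib.import_module
fn = load_callable("my_pkg.preprocess:clean_md")
fn = load_callable("my_pkg.preprocess.clean_md")

# a path ending in .py is loaded via importlib.util.spec_from_file_location
fn = load_callable("./scripts/preprocess.py:clean_md")

print(fn("<!-- draft -->\n# Title"))
```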