PyPI - beaver-db - Versions diffs - 0.9.1__tar.gz → 0.10.0__tar.gz - Mend

beaver-db 0.9.1tar.gz → 0.10.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of beaver-db might be problematic.

Files changed (17)

{beaver_db-0.9.1/beaver_db.egg-info → beaver_db-0.10.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: beaver-db
-Version: 0.9.1
+Version: 0.10.0
 Summary: Fast, embedded, and multi-modal DB based on SQLite for AI-powered applications.
 Requires-Python: >=3.13
 Description-Content-Type: text/markdown
@@ -19,20 +19,21 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.
-  - **Minimalistic & Zero-Dependency**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
-  - **Synchronous & Thread-Safe**: Designed for simplicity and safety in multi-threaded environments.
+  - **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+  - **Schemaless**: Flexible data storage without rigid schemas across all modalities.
+  - **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
   - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
   - **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. The vector search is accelerated with an in-memory k-d tree.
   - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.
 ## Core Features
-  - **High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture.
+  - **Sync/Async High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture. Sync by default, but with an `as_async` wrapper for async applications.
   - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
   - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
   - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
   - **Efficient Vector Storage & Search**: Store vector embeddings and perform fast approximate nearest neighbor searches using an in-memory k-d tree.
-  - **Full-Text Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine.
+  - **Full-Text Search and Fuzzy**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search.
   - **Graph Traversal**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
   - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.
@@ -190,17 +191,18 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
   - [`examples/fts.py`](examples/fts.py): A detailed look at full-text search, including targeted searches on specific metadata fields.
   - [`examples/graph.py`](examples/graph.py): Shows how to create relationships between documents and perform multi-hop graph traversals.
   - [`examples/pubsub.py`](examples/pubsub.py): A demonstration of the synchronous, thread-safe publish/subscribe system in a single process.
+  - [`examples/async_pubsub.py`](examples/async_pubsub.py): A demonstration of the asynchronous wrapper for the publish/subscribe system.
   - [`examples/publisher.py`](examples/publisher.py) and [`examples/subscriber.py`](examples/subscriber.py): A pair of examples demonstrating inter-process message passing with the publish/subscribe system.
   - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
   - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
+  - [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
 ## Roadmap
 These are some of the features and improvements planned for future releases:
-  - **Fuzzy search**: Implement fuzzy matching capabilities for text search.
   - **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
-  - **Async API**: Comprehensive async support with on-demand wrappers for all collections.
+  - **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.
 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.

{beaver_db-0.9.1 → beaver_db-0.10.0}/README.md RENAMED

@@ -8,20 +8,21 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.
-  - **Minimalistic & Zero-Dependency**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
-  - **Synchronous & Thread-Safe**: Designed for simplicity and safety in multi-threaded environments.
+  - **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+  - **Schemaless**: Flexible data storage without rigid schemas across all modalities.
+  - **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
   - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
   - **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. The vector search is accelerated with an in-memory k-d tree.
   - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.
 ## Core Features
-  - **High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture.
+  - **Sync/Async High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture. Sync by default, but with an `as_async` wrapper for async applications.
   - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
   - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
   - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
   - **Efficient Vector Storage & Search**: Store vector embeddings and perform fast approximate nearest neighbor searches using an in-memory k-d tree.
-  - **Full-Text Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine.
+  - **Full-Text Search and Fuzzy**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search.
   - **Graph Traversal**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
   - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.
@@ -179,17 +180,18 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
   - [`examples/fts.py`](examples/fts.py): A detailed look at full-text search, including targeted searches on specific metadata fields.
   - [`examples/graph.py`](examples/graph.py): Shows how to create relationships between documents and perform multi-hop graph traversals.
   - [`examples/pubsub.py`](examples/pubsub.py): A demonstration of the synchronous, thread-safe publish/subscribe system in a single process.
+  - [`examples/async_pubsub.py`](examples/async_pubsub.py): A demonstration of the asynchronous wrapper for the publish/subscribe system.
   - [`examples/publisher.py`](examples/publisher.py) and [`examples/subscriber.py`](examples/subscriber.py): A pair of examples demonstrating inter-process message passing with the publish/subscribe system.
   - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
   - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
+  - [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
 ## Roadmap
 These are some of the features and improvements planned for future releases:
-  - **Fuzzy search**: Implement fuzzy matching capabilities for text search.
   - **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
-  - **Async API**: Comprehensive async support with on-demand wrappers for all collections.
+  - **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.
 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.

{beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/channels.py RENAMED

@@ -1,14 +1,44 @@
+import asyncio
 import json
 import sqlite3
 import threading
 import time
 from queue import Empty, Queue
-from typing import Any, Iterator, Set
+from typing import Any, AsyncIterator, Iterator, Set
 # A special message object used to signal the listener to gracefully shut down.
 _SHUTDOWN_SENTINEL = object()
+class AsyncSubscriber:
+    """A thread-safe async message receiver for a specific channel subscription."""
+    def __init__(self, subscriber: "Subscriber"):
+        self._subscriber = subscriber
+    async def __aenter__(self) -> "AsyncSubscriber":
+        """Registers the listener's queue with the channel to start receiving messages."""
+        await asyncio.to_thread(self._subscriber.__enter__)
+        return self
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        """Unregisters the listener's queue from the channel to stop receiving messages."""
+        await asyncio.to_thread(self._subscriber.__exit__, exc_type, exc_val, exc_tb)
+    async def listen(self, timeout: float | None = None) -> AsyncIterator[Any]:
+        """
+        Returns a blocking async iterator that yields messages as they arrive.
+        """
+        while True:
+            try:
+                msg = await asyncio.to_thread(self._subscriber._queue.get, timeout=timeout)
+                if msg is _SHUTDOWN_SENTINEL:
+                    break
+                yield msg
+            except Empty:
+                raise TimeoutError(f"Timeout {timeout}s expired.")
 class Subscriber:
     """
     A thread-safe message receiver for a specific channel subscription.
@@ -54,6 +84,27 @@ class Subscriber:
             except Empty:
                 raise TimeoutError(f"Timeout {timeout}s expired.")
+    def as_async(self) -> "AsyncSubscriber":
+        """Returns an async version of the subscriber."""
+        return AsyncSubscriber(self)
+class AsyncChannelManager:
+    """The central async hub for a named pub/sub channel."""
+    def __init__(self, channel: "ChannelManager"):
+        self._channel = channel
+    async def publish(self, payload: Any):
+        """
+        Publishes a JSON-serializable message to the channel asynchronously.
+        """
+        await asyncio.to_thread(self._channel.publish, payload)
+    def subscribe(self) -> "AsyncSubscriber":
+        """Creates a new async subscription, returning an AsyncSubscriber context manager."""
+        return self._channel.subscribe().as_async()
 class ChannelManager:
     """
@@ -183,3 +234,7 @@ class ChannelManager:
                 "INSERT INTO beaver_pubsub_log (timestamp, channel_name, message_payload) VALUES (?, ?, ?)",
                 (time.time(), self._name, json_payload),
             )
+    def as_async(self) -> "AsyncChannelManager":
+        """Returns an async version of the channel manager."""
+        return AsyncChannelManager(self)

{beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/collections.py RENAMED

@@ -5,7 +5,63 @@ from enum import Enum
 from typing import Any, List, Literal, Set
 import numpy as np
-from scipy.spatial import cKDTree
+from scipy.spatial import KDTree
+# --- Fuzzy Search Helper Functions ---
+def _levenshtein_distance(s1: str, s2: str) -> int:
+    """Calculates the Levenshtein distance between two strings."""
+    if len(s1) < len(s2):
+        return _levenshtein_distance(s2, s1)
+    if len(s2) == 0:
+        return len(s1)
+    previous_row = range(len(s2) + 1)
+    for i, c1 in enumerate(s1):
+        current_row = [i + 1]
+        for j, c2 in enumerate(s2):
+            insertions = previous_row[j + 1] + 1
+            deletions = current_row[j] + 1
+            substitutions = previous_row[j] + (c1 != c2)
+            current_row.append(min(insertions, deletions, substitutions))
+        previous_row = current_row
+    return previous_row[-1]
+def _get_trigrams(text: str) -> set[str]:
+    """Generates a set of 3-character trigrams from a string."""
+    if not text or len(text) < 3:
+        return set()
+    return {text[i:i+3] for i in range(len(text) - 2)}
+def _sliding_window_levenshtein(query: str, content: str, fuzziness: int) -> int:
+    """
+    Finds the best Levenshtein match for a query within a larger text
+    by comparing it against relevant substrings.
+    """
+    query_tokens = query.lower().split()
+    content_tokens = content.lower().split()
+    query_len = len(query_tokens)
+    if query_len == 0:
+        return 0
+    min_dist = float('inf')
+    query_norm = " ".join(query_tokens)
+    # The window size can be slightly smaller or larger than the query length
+    # to account for missing or extra words in a fuzzy match.
+    for window_size in range(max(1, query_len - fuzziness), query_len + fuzziness + 1):
+        if window_size > len(content_tokens):
+            continue
+        for i in range(len(content_tokens) - window_size + 1):
+            window_text = " ".join(content_tokens[i:i+window_size])
+            dist = _levenshtein_distance(query_norm, window_text)
+            if dist < min_dist:
+                min_dist = dist
+    return int(min_dist)
 class WalkDirection(Enum):
@@ -54,18 +110,18 @@ class CollectionManager:
     def __init__(self, name: str, conn: sqlite3.Connection):
         self._name = name
         self._conn = conn
-        self._kdtree: cKDTree | None = None
+        self._kdtree: KDTree | None = None
         self._doc_ids: List[str] = []
         self._local_index_version = -1  # Version of the in-memory index
-    def _flatten_metadata(self, metadata: dict, prefix: str = "") -> dict[str, str]:
-        """Flattens a nested dictionary and filters for string values."""
+    def _flatten_metadata(self, metadata: dict, prefix: str = "") -> dict[str, Any]:
+        """Flattens a nested dictionary for indexing."""
         flat_dict = {}
         for key, value in metadata.items():
-            new_key = f"{prefix}__{key}" if prefix else key
+            new_key = f"{prefix}.{key}" if prefix else key
             if isinstance(value, dict):
                 flat_dict.update(self._flatten_metadata(value, new_key))
-            elif isinstance(value, str):
+            else:
                 flat_dict[new_key] = value
         return flat_dict
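The change to `_flatten_metadata` has two visible effects: nested keys are now joined with `.` instead of `__`, and non-string values are kept in the flattened result. A standalone sketch of the new behavior follows; `flatten_metadata` is a free-function copy of the method for illustration, and the sample document is made up.

```python
from typing import Any


def flatten_metadata(metadata: dict, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dot-separated paths, keeping all value types."""
    flat: dict[str, Any] = {}
    for key, value in metadata.items():
        new_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metadata(value, new_key))
        else:
            flat[new_key] = value
    return flat


doc = {"title": "Moby Dick", "author": {"name": "Herman Melville", "born": 1819}}
flat = flatten_metadata(doc)
print(flat)
# {'title': 'Moby Dick', 'author.name': 'Herman Melville', 'author.born': 1819}

# The string-only filter now happens at the call site, as in index().
string_fields = {k: v for k, v in flat.items() if isinstance(v, str)}
print(sorted(string_fields))  # ['author.name', 'title']
```

One consequence, judging only from this diff: field paths written by 0.9.1 used the `__` separator, so a filter such as `author__name` would presumably need to become `author.name` for documents indexed with 0.10.0.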
@@ -85,39 +141,63 @@ class CollectionManager:
             return True
         return self._local_index_version < self._get_db_version()
-    def index(self, document: Document, *, fts: bool = True):
-        """Indexes a Document, performing an upsert and updating the FTS index."""
+    def index(
+        self,
+        document: Document,
+        *,
+        fts: bool | list[str] = True,
+        fuzzy: bool = False
+    ):
+        """
+        Indexes a Document, including vector, FTS, and fuzzy search data.
+        The entire operation is performed in a single atomic transaction.
+        """
         with self._conn:
-            if fts:
-                self._conn.execute(
-                    "DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?",
-                    (self._name, document.id),
-                )
-                string_fields = self._flatten_metadata(document.to_dict())
-                if string_fields:
-                    fts_data = [
-                        (self._name, document.id, path, content)
-                        for path, content in string_fields.items()
-                    ]
-                    self._conn.executemany(
-                        "INSERT INTO beaver_fts_index (collection, item_id, field_path, field_content) VALUES (?, ?, ?, ?)",
-                        fts_data,
-                    )
+            # Step 1: Core Document and Vector Storage (Unaffected by FTS/Fuzzy)
             self._conn.execute(
                 "INSERT OR REPLACE INTO beaver_collections (collection, item_id, item_vector, metadata) VALUES (?, ?, ?, ?)",
                 (
                     self._name,
                     document.id,
-                    (
-                        document.embedding.tobytes()
-                        if document.embedding is not None
-                        else None
-                    ),
+                    document.embedding.tobytes() if document.embedding is not None else None,
                     json.dumps(document.to_dict()),
                 ),
             )
-            # Atomically increment the collection's version number
+            # Step 2: FTS and Fuzzy Indexing
+            # First, clean up old index data for this document
+            self._conn.execute("DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            self._conn.execute("DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            # Determine which string fields to index
+            flat_metadata = self._flatten_metadata(document.to_dict())
+            fields_to_index: dict[str, str] = {}
+            if isinstance(fts, list):
+                fields_to_index = {k: v for k, v in flat_metadata.items() if k in fts and isinstance(v, str)}
+            elif fts:
+                fields_to_index = {k: v for k, v in flat_metadata.items() if isinstance(v, str)}
+            if fields_to_index:
+                # FTS indexing
+                fts_data = [(self._name, document.id, path, content) for path, content in fields_to_index.items()]
+                self._conn.executemany(
+                    "INSERT INTO beaver_fts_index (collection, item_id, field_path, field_content) VALUES (?, ?, ?, ?)",
+                    fts_data,
+                )
+                # Fuzzy indexing (if enabled)
+                if fuzzy:
+                    trigram_data = []
+                    for path, content in fields_to_index.items():
+                        for trigram in _get_trigrams(content.lower()):
+                            trigram_data.append((self._name, document.id, path, trigram))
+                    if trigram_data:
+                        self._conn.executemany(
+                            "INSERT INTO beaver_trigrams (collection, item_id, field_path, trigram) VALUES (?, ?, ?, ?)",
+                            trigram_data,
+                        )
+            # Step 3: Update Collection Version
             self._conn.execute(
                 """
                 INSERT INTO beaver_collection_versions (collection_name, version) VALUES (?, 1)
@@ -139,6 +219,10 @@ class CollectionManager:
                 "DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?",
                 (self._name, document.id),
             )
+            self._conn.execute(
+                "DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?",
+                (self._name, document.id),
+            )
             self._conn.execute(
                 "DELETE FROM beaver_edges WHERE collection = ? AND (source_item_id = ? OR target_item_id = ?)",
                 (self._name, document.id, document.id),
@@ -181,7 +265,7 @@ class CollectionManager:
             self._doc_ids.append(row["item_id"])
             vectors.append(np.frombuffer(row["item_vector"], dtype=np.float32))
-        self._kdtree = cKDTree(vectors) if vectors else None
+        self._kdtree = KDTree(vectors) if vectors else None
         self._local_index_version = self._get_db_version()
     def search(
@@ -222,9 +306,36 @@ class CollectionManager:
         return results
     def match(
-        self, query: str, on_field: str | None = None, top_k: int = 10
+        self,
+        query: str,
+        *,
+        on: str | list[str] | None = None,
+        top_k: int = 10,
+        fuzziness: int = 0
     ) -> list[tuple[Document, float]]:
-        """Performs a full-text search on indexed string fields."""
+        """
+        Performs a full-text or fuzzy search on indexed string fields.
+        Args:
+            query: The search query string.
+            on: An optional field name or list of field names to restrict the search to.
+            top_k: The maximum number of results to return.
+            fuzziness: The Levenshtein distance for fuzzy matching.
+                       If 0, performs an exact FTS search.
+                       If > 0, performs a fuzzy search.
+        """
+        if isinstance(on, str):
+            on = [on]
+        if fuzziness == 0:
+            return self._perform_fts_search(query, on, top_k)
+        else:
+            return self._perform_fuzzy_search(query, on, top_k, fuzziness)
+    def _perform_fts_search(
+        self, query: str, on: list[str] | None, top_k: int
+    ) -> list[tuple[Document, float]]:
+        """Performs a standard FTS search."""
         cursor = self._conn.cursor()
         sql_query = """
             SELECT t1.item_id, t1.item_vector, t1.metadata, fts.rank
@@ -234,30 +345,127 @@ class CollectionManager:
             ) AS fts ON t1.item_id = fts.item_id
             WHERE t1.collection = ? ORDER BY fts.rank
         """
-        params, field_filter_sql = [], ""
-        if on_field:
-            field_filter_sql = "AND field_path = ?"
-            params.extend([query, on_field])
-        else:
-            params.append(query)
-        params.extend([top_k, self._name])
+        params: list[Any] = [query]
+        field_filter_sql = ""
+        if on:
+            placeholders = ",".join("?" for _ in on)
+            field_filter_sql = f"AND field_path IN ({placeholders})"
+            params.extend(on)
-        rows = cursor.execute(
-            sql_query.format(field_filter_sql), tuple(params)
-        ).fetchall()
+        params.extend([top_k, self._name])
+        rows = cursor.execute(sql_query.format(field_filter_sql), tuple(params)).fetchall()
         results = []
         for row in rows:
             embedding = (
                 np.frombuffer(row["item_vector"], dtype=np.float32).tolist()
-                if row["item_vector"]
-                else None
-            )
-            doc = Document(
-                id=row["item_id"], embedding=embedding, **json.loads(row["metadata"])
+                if row["item_vector"] else None
             )
+            doc = Document(id=row["item_id"], embedding=embedding, **json.loads(row["metadata"]))
             results.append((doc, row["rank"]))
         return results
+    def _get_trigram_candidates(self, query: str, on: list[str] | None) -> set[str]:
+        """
+        Gets document IDs that meet a trigram similarity threshold with the query.
+        """
+        query_trigrams = _get_trigrams(query.lower())
+        if not query_trigrams:
+            return set()
+        # Optimization: Only consider documents that share a significant number of trigrams.
+        # This threshold dramatically reduces the number of candidates for the expensive
+        # Levenshtein check. A 30% threshold is a reasonable starting point.
+        similarity_threshold = int(len(query_trigrams) * 0.3)
+        if similarity_threshold == 0:
+            return set()
+        cursor = self._conn.cursor()
+        sql = """
+            SELECT item_id FROM beaver_trigrams
+            WHERE collection = ? AND trigram IN ({}) {}
+            GROUP BY item_id
+            HAVING COUNT(DISTINCT trigram) >= ?
+        """
+        params: list[Any] = [self._name]
+        trigram_placeholders = ",".join("?" for _ in query_trigrams)
+        params.extend(query_trigrams)
+        field_filter_sql = ""
+        if on:
+            field_placeholders = ",".join("?" for _ in on)
+            field_filter_sql = f"AND field_path IN ({field_placeholders})"
+            params.extend(on)
+        params.append(similarity_threshold)
+        cursor.execute(sql.format(trigram_placeholders, field_filter_sql), tuple(params))
+        return {row['item_id'] for row in cursor.fetchall()}
+    def _perform_fuzzy_search(
+        self, query: str, on: list[str] | None, top_k: int, fuzziness: int
+    ) -> list[tuple[Document, float]]:
+        """Performs a 3-stage fuzzy search: gather, score, and sort."""
+        # Stage 1: Gather Candidates
+        fts_results = self._perform_fts_search(query, on, top_k)
+        fts_candidate_ids = {doc.id for doc, _ in fts_results}
+        trigram_candidate_ids = self._get_trigram_candidates(query, on)
+        candidate_ids = fts_candidate_ids.union(trigram_candidate_ids)
+        if not candidate_ids:
+            return []
+        # Stage 2: Score Candidates
+        cursor = self._conn.cursor()
+        id_placeholders = ",".join("?" for _ in candidate_ids)
+        sql_text = f"SELECT item_id, field_path, field_content FROM beaver_fts_index WHERE collection = ? AND item_id IN ({id_placeholders})"
+        params_text: list[Any] = [self._name]
+        params_text.extend(candidate_ids)
+        if on:
+            sql_text += f" AND field_path IN ({','.join('?' for _ in on)})"
+            params_text.extend(on)
+        cursor.execute(sql_text, tuple(params_text))
+        candidate_texts: dict[str, dict[str, str]] = {}
+        for row in cursor.fetchall():
+            item_id = row['item_id']
+            if item_id not in candidate_texts:
+                candidate_texts[item_id] = {}
+            candidate_texts[item_id][row['field_path']] = row['field_content']
+        scored_candidates = []
+        fts_rank_map = {doc.id: rank for doc, rank in fts_results}
+        for item_id in candidate_ids:
+            if item_id not in candidate_texts:
+                continue
+            min_dist = float('inf')
+            for content in candidate_texts[item_id].values():
+                dist = _sliding_window_levenshtein(query, content, fuzziness)
+                if dist < min_dist:
+                    min_dist = dist
+            if min_dist <= fuzziness:
+                scored_candidates.append({
+                    "id": item_id,
+                    "distance": min_dist,
+                    "fts_rank": fts_rank_map.get(item_id, 0) # Use 0 for non-matches (less relevant)
+                })
+        # Stage 3: Sort and Fetch Results
+        scored_candidates.sort(key=lambda x: (x["distance"], x["fts_rank"]))
+        top_ids = [c["id"] for c in scored_candidates[:top_k]]
+        if not top_ids:
+            return []
+        id_placeholders = ",".join("?" for _ in top_ids)
+        sql_docs = f"SELECT item_id, item_vector, metadata FROM beaver_collections WHERE collection = ? AND item_id IN ({id_placeholders})"
+        cursor.execute(sql_docs, (self._name, *top_ids))
+        doc_map = {row["item_id"]: Document(id=row["item_id"], embedding=(np.frombuffer(row["item_vector"], dtype=np.float32).tolist() if row["item_vector"] else None), **json.loads(row["metadata"])) for row in cursor.fetchall()}
+        final_results = []
+        distance_map = {c["id"]: c["distance"] for c in scored_candidates}
+        for doc_id in top_ids:
+            if doc_id in doc_map:
+                final_results.append((doc_map[doc_id], float(distance_map[doc_id])))
+        return final_results
     def connect(
         self, source: Document, target: Document, label: str, metadata: dict = None
     ):
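Stages 2 and 3 of `_perform_fuzzy_search` reduce to a filter and a two-key sort. The sketch below isolates that step under stated assumptions: `rank_candidates` is a hypothetical helper, the candidate dictionaries are invented, and lower `fts_rank` values are assumed to sort first, consistent with the `ORDER BY fts.rank` clause in the FTS query.

```python
def rank_candidates(candidates: list[dict], fuzziness: int, top_k: int) -> list[str]:
    """Keep candidates within the edit-distance budget, then order them."""
    kept = [c for c in candidates if c["distance"] <= fuzziness]
    # Primary key: edit distance. Tie-breaker: FTS rank (0 for non-FTS matches).
    kept.sort(key=lambda c: (c["distance"], c["fts_rank"]))
    return [c["id"] for c in kept[:top_k]]


candidates = [
    {"id": "doc-a", "distance": 2, "fts_rank": 0},
    {"id": "doc-b", "distance": 1, "fts_rank": 0},
    {"id": "doc-c", "distance": 1, "fts_rank": -1.5},
    {"id": "doc-d", "distance": 4, "fts_rank": -3.0},
]

print(rank_candidates(candidates, fuzziness=2, top_k=3))
# ['doc-c', 'doc-b', 'doc-a']
```

The second element of each tuple returned by a fuzzy `match` is therefore an edit distance, whereas the exact path returns the FTS rank, so scores from the two modes are not directly comparable.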

{beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/core.py RENAMED

@@ -36,6 +36,7 @@ class BeaverDB:
         self._create_list_table()
         self._create_collections_table()
         self._create_fts_table()
+        self._create_trigrams_table()
         self._create_edges_table()
         self._create_versions_table()
         self._create_dict_table()
@@ -139,6 +140,27 @@ class BeaverDB:
             """
             )
+    def _create_trigrams_table(self):
+        """Creates the table for the fuzzy search trigram index."""
+        with self._conn:
+            self._conn.execute(
+                """
+                CREATE TABLE IF NOT EXISTS beaver_trigrams (
+                    collection TEXT NOT NULL,
+                    item_id TEXT NOT NULL,
+                    field_path TEXT NOT NULL,
+                    trigram TEXT NOT NULL,
+                    PRIMARY KEY (collection, field_path, trigram, item_id)
+                )
+                """
+            )
+            self._conn.execute(
+                """
+                CREATE INDEX IF NOT EXISTS idx_trigram_lookup
+                ON beaver_trigrams (collection, trigram, field_path)
+                """
+            )
     def _create_edges_table(self):
         """Creates the table for storing relationships between documents."""
         with self._conn:
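The new `beaver_trigrams` table backs the candidate filter in `_get_trigram_candidates`. The following sketch recreates the table in an in-memory SQLite database and runs the same `GROUP BY ... HAVING COUNT(DISTINCT trigram)` query. The table definition and the 30% threshold come from the diff; the `add` and `candidates` helpers, the collection name `docs`, and the sample rows are assumptions for illustration.

```python
import sqlite3


def get_trigrams(text: str) -> set[str]:
    if not text or len(text) < 3:
        return set()
    return {text[i:i + 3] for i in range(len(text) - 2)}


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS beaver_trigrams (
        collection TEXT NOT NULL,
        item_id TEXT NOT NULL,
        field_path TEXT NOT NULL,
        trigram TEXT NOT NULL,
        PRIMARY KEY (collection, field_path, trigram, item_id)
    )
    """
)


def add(item_id: str, field_path: str, content: str) -> None:
    """Store one row per distinct trigram of the lowercased content."""
    rows = [("docs", item_id, field_path, t) for t in get_trigrams(content.lower())]
    conn.executemany(
        "INSERT INTO beaver_trigrams (collection, item_id, field_path, trigram) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )


def candidates(query: str) -> set[str]:
    """Return ids sharing at least 30% of the query's trigrams."""
    trigrams = sorted(get_trigrams(query.lower()))
    threshold = int(len(trigrams) * 0.3)
    if threshold == 0:
        return set()
    placeholders = ",".join("?" for _ in trigrams)
    sql = f"""
        SELECT item_id FROM beaver_trigrams
        WHERE collection = ? AND trigram IN ({placeholders})
        GROUP BY item_id
        HAVING COUNT(DISTINCT trigram) >= ?
    """
    rows = conn.execute(sql, ("docs", *trigrams, threshold)).fetchall()
    return {row["item_id"] for row in rows}


add("doc-1", "title", "Beaver database")
add("doc-2", "title", "Weather report")

print(candidates("beavr databse"))  # {'doc-1'}
print(candidates("cat"))            # set()
```

The second call shows a side effect of `int(len(query_trigrams) * 0.3)`: a query with fewer than four distinct trigrams gets a threshold of 0 and yields no trigram candidates, so short fuzzy queries rely on the FTS stage alone.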

{beaver_db-0.9.1 → beaver_db-0.10.0/beaver_db.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: beaver-db
-Version: 0.9.1
+Version: 0.10.0
 Summary: Fast, embedded, and multi-modal DB based on SQLite for AI-powered applications.
 Requires-Python: >=3.13
 Description-Content-Type: text/markdown
@@ -19,20 +19,21 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.
-  - **Minimalistic & Zero-Dependency**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
-  - **Synchronous & Thread-Safe**: Designed for simplicity and safety in multi-threaded environments.
+  - **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+  - **Schemaless**: Flexible data storage without rigid schemas across all modalities.
+  - **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
   - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
   - **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. The vector search is accelerated with an in-memory k-d tree.
   - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.
 ## Core Features
-  - **High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture.
+  - **Sync/Async High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture. Sync by default, but with an `as_async` wrapper for async applications.
   - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
   - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
   - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
   - **Efficient Vector Storage & Search**: Store vector embeddings and perform fast approximate nearest neighbor searches using an in-memory k-d tree.
-  - **Full-Text Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine.
+  - **Full-Text Search and Fuzzy**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search.
   - **Graph Traversal**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
   - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.
@@ -190,17 +191,18 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
   - [`examples/fts.py`](examples/fts.py): A detailed look at full-text search, including targeted searches on specific metadata fields.
   - [`examples/graph.py`](examples/graph.py): Shows how to create relationships between documents and perform multi-hop graph traversals.
   - [`examples/pubsub.py`](examples/pubsub.py): A demonstration of the synchronous, thread-safe publish/subscribe system in a single process.
+  - [`examples/async_pubsub.py`](examples/async_pubsub.py): A demonstration of the asynchronous wrapper for the publish/subscribe system.
   - [`examples/publisher.py`](examples/publisher.py) and [`examples/subscriber.py`](examples/subscriber.py): A pair of examples demonstrating inter-process message passing with the publish/subscribe system.
   - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
   - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
+  - [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
 ## Roadmap
 These are some of the features and improvements planned for future releases:
-  - **Fuzzy search**: Implement fuzzy matching capabilities for text search.
   - **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
-  - **Async API**: Comprehensive async support with on-demand wrappers for all collections.
+  - **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.
 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.

{beaver_db-0.9.1 → beaver_db-0.10.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "beaver-db"
-version = "0.9.1"
+version = "0.10.0"
 description = "Fast, embedded, and multi-modal DB based on SQLite for AI-powered applications."
 readme = "README.md"
 requires-python = ">=3.13"