beaver-db 0.10.0__tar.gz → 0.11.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release. This version of beaver-db might be problematic.
- {beaver_db-0.10.0 → beaver_db-0.11.1}/PKG-INFO +29 -9
- {beaver_db-0.10.0 → beaver_db-0.11.1}/README.md +27 -7
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/collections.py +123 -141
- beaver_db-0.11.1/beaver/core.py +301 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/dicts.py +18 -3
- beaver_db-0.11.1/beaver/vectors.py +370 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver_db.egg-info/PKG-INFO +29 -9
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver_db.egg-info/SOURCES.txt +1 -0
- beaver_db-0.11.1/beaver_db.egg-info/requires.txt +2 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/pyproject.toml +2 -2
- beaver_db-0.10.0/beaver/core.py +0 -242
- beaver_db-0.10.0/beaver_db.egg-info/requires.txt +0 -2
- {beaver_db-0.10.0 → beaver_db-0.11.1}/LICENSE +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/__init__.py +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/channels.py +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/lists.py +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/queues.py +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver_db.egg-info/dependency_links.txt +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/beaver_db.egg-info/top_level.txt +0 -0
- {beaver_db-0.10.0 → beaver_db-0.11.1}/setup.cfg +0 -0
{beaver_db-0.10.0 → beaver_db-0.11.1}/PKG-INFO

````diff
@@ -1,14 +1,16 @@
 Metadata-Version: 2.4
 Name: beaver-db
-Version: 0.10.0
+Version: 0.11.1
 Summary: Fast, embedded, and multi-modal DB based on SQLite for AI-powered applications.
 Requires-Python: >=3.13
 Description-Content-Type: text/markdown
 License-File: LICENSE
+Requires-Dist: faiss-cpu>=1.12.0
 Requires-Dist: numpy>=2.3.3
-Requires-Dist: scipy>=1.16.2
 Dynamic: license-file

+Of course, here is a rewritten README to explain the vector store uses a high performance FAISS-based implementation with in-memory and persistent indices, with an added small section on how is this implemented to explain the basic ideas behind the implementation of beaver.
+
 # beaver 🦫

 A fast, single-file, multi-modal database for Python, built with the standard `sqlite3` library.
@@ -19,11 +21,11 @@ A fast, single-file, multi-modal database for Python, built with the standard `s

 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.

-- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`faiss-cpu`.
 - **Schemaless**: Flexible data storage without rigid schemas across all modalities.
 - **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
 - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
-- **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications.
+- **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. Vector search is accelerated with a high-performance, persistent `faiss` index.
 - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.

 ## Core Features
@@ -32,11 +34,28 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
 - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
 - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
-- **
-- **Full-Text
-- **Graph
+- **High-Performance Vector Storage & Search**: Store vector embeddings and perform fast, crash-safe approximate nearest neighbor searches using a `faiss`-based hybrid index.
+- **Full-Text and Fuzzy Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search for typo-tolerant matching.
+- **Knowledge Graph**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
 - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.

+## How Beaver is Implemented
+
+BeaverDB is architected as a set of targeted wrappers around a standard SQLite database. The core `BeaverDB` class manages a single connection to the SQLite file and initializes all the necessary tables for the various features.
+
+When you call a method like `db.dict("my_dict")` or `db.collection("my_docs")`, you get back a specialized manager object (`DictManager`, `CollectionManager`, etc.) that provides a clean, Pythonic API for that specific data modality. These managers translate the simple method calls (e.g., `my_dict["key"] = "value"`) into the appropriate SQL queries, handling all the complexity of data serialization, indexing, and transaction management behind the scenes. This design provides a minimal and intuitive API surface while leveraging the power and reliability of SQLite.
+
+The vector store in BeaverDB is designed for high performance and reliability, using a hybrid faiss-based index that is both fast and persistent. Here's a look at the core ideas behind its implementation:
+
+- **Hybrid Index System**: The vector store uses a two-tiered system to balance fast writes with efficient long-term storage:
+  - **Base Index**: A large, optimized faiss index that contains the majority of the vectors. This index is serialized and stored as a BLOB inside a dedicated SQLite table, ensuring it remains part of the single database file.
+  - **Delta Index**: A small, in-memory faiss index that holds all newly added vectors. This allows for near-instant write performance without having to rebuild the entire index for every new addition.
+- **Crash-Safe Logging**: To ensure durability, all new vector additions and deletions are first recorded in a dedicated log table in the SQLite database. This means that even if the application crashes, no data is lost.
+- **Automatic Compaction**: When the number of changes in the log reaches a certain threshold, a background process is automatically triggered to "compact" the index. This process rebuilds the base index, incorporating all the recent changes from the delta index, and then clears the log. This ensures that the index remains optimized for fast search performance over time.
+
+This hybrid approach allows BeaverDB to provide a vector search experience that is both fast and durable, without sacrificing the single-file, embedded philosophy of the library.
+
+
 ## Installation

 ```bash
@@ -136,7 +155,7 @@ for message in chat_history:

 ### 4. Build a RAG (Retrieval-Augmented Generation) System

-Combine **vector search** and **full-text search** to build a powerful RAG pipeline for your local documents.
+Combine **vector search** and **full-text search** to build a powerful RAG pipeline for your local documents. The vector search uses a high-performance, persistent `faiss` index that supports incremental additions without downtime.

 ```python
 # Get context for a user query like "fast python web frameworks"
@@ -196,12 +215,13 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
 - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
 - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
 - [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
+- [`examples/stress_vectors.py`](examples/stress_vectors.py): A stress test for the vector search functionality.
+- [`examples/general_test.py`](examples/general_test.py): A general-purpose test to run all operations randomly which allows testing long-running processes and synchronicity issues.

 ## Roadmap

 These are some of the features and improvements planned for future releases:

-- **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
 - **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.

 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.
````
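The "How Beaver is Implemented" section above describes a two-tier base/delta design. As a rough illustration of that idea only (this is not beaver's actual `beaver/vectors.py`; the flat index type, names, and omission of deletion handling are all assumptions), such a pair can be sketched with `faiss` like this:

```python
# Illustrative sketch of a base/delta ("hybrid") FAISS index pair.
# NOT beaver's vectors.py: index types and names are assumptions,
# and deletion logging/compaction thresholds are omitted for brevity.
import faiss
import numpy as np

DIM = 8
base = faiss.IndexIDMap(faiss.IndexFlatL2(DIM))   # large tier, persisted as a BLOB
delta = faiss.IndexIDMap(faiss.IndexFlatL2(DIM))  # small in-memory tier for new writes

def add(doc_id: int, vec: np.ndarray) -> None:
    # Fast write path: only the in-memory delta index is touched.
    delta.add_with_ids(vec.astype(np.float32).reshape(1, -1),
                       np.array([doc_id], dtype=np.int64))

def search(query: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    # Query both tiers and merge the hits by distance.
    hits: list[tuple[int, float]] = []
    for index in (base, delta):
        if index.ntotal:
            dists, ids = index.search(query.astype(np.float32).reshape(1, -1),
                                      min(k, index.ntotal))
            hits += [(int(i), float(d)) for d, i in zip(dists[0], ids[0]) if i != -1]
    return sorted(hits, key=lambda t: t[1])[:k]

def compact() -> bytes:
    # Fold the delta tier into the base tier, then serialize the base
    # so it can be stored as a BLOB in a SQLite table.
    global delta
    if delta.ntotal:
        ids = faiss.vector_to_array(delta.id_map)                # int64 ids
        vecs = delta.index.reconstruct_n(0, delta.index.ntotal)  # (n, d) float32
        base.add_with_ids(vecs, ids)
        delta = faiss.IndexIDMap(faiss.IndexFlatL2(DIM))         # fresh empty delta
    return faiss.serialize_index(base).tobytes()

rng = np.random.default_rng(0)
for i in range(100):
    add(i, rng.random(DIM, dtype=np.float32))
print(search(rng.random(DIM, dtype=np.float32))[:3])
blob = compact()  # in beaver this payload would be written back to SQLite
```

The split is what makes writes cheap: adding to a small flat index is near-instant, while the expensive rebuild of the large tier is deferred to compaction.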
{beaver_db-0.10.0 → beaver_db-0.11.1}/README.md

````diff
@@ -1,3 +1,5 @@
+Of course, here is a rewritten README to explain the vector store uses a high performance FAISS-based implementation with in-memory and persistent indices, with an added small section on how is this implemented to explain the basic ideas behind the implementation of beaver.
+
 # beaver 🦫

 A fast, single-file, multi-modal database for Python, built with the standard `sqlite3` library.
@@ -8,11 +10,11 @@ A fast, single-file, multi-modal database for Python, built with the standard `s

 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.

-- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`faiss-cpu`.
 - **Schemaless**: Flexible data storage without rigid schemas across all modalities.
 - **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
 - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
-- **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications.
+- **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. Vector search is accelerated with a high-performance, persistent `faiss` index.
 - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.

 ## Core Features
@@ -21,11 +23,28 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
 - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
 - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
-- **
-- **Full-Text
-- **Graph
+- **High-Performance Vector Storage & Search**: Store vector embeddings and perform fast, crash-safe approximate nearest neighbor searches using a `faiss`-based hybrid index.
+- **Full-Text and Fuzzy Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search for typo-tolerant matching.
+- **Knowledge Graph**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
 - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.

+## How Beaver is Implemented
+
+BeaverDB is architected as a set of targeted wrappers around a standard SQLite database. The core `BeaverDB` class manages a single connection to the SQLite file and initializes all the necessary tables for the various features.
+
+When you call a method like `db.dict("my_dict")` or `db.collection("my_docs")`, you get back a specialized manager object (`DictManager`, `CollectionManager`, etc.) that provides a clean, Pythonic API for that specific data modality. These managers translate the simple method calls (e.g., `my_dict["key"] = "value"`) into the appropriate SQL queries, handling all the complexity of data serialization, indexing, and transaction management behind the scenes. This design provides a minimal and intuitive API surface while leveraging the power and reliability of SQLite.
+
+The vector store in BeaverDB is designed for high performance and reliability, using a hybrid faiss-based index that is both fast and persistent. Here's a look at the core ideas behind its implementation:
+
+- **Hybrid Index System**: The vector store uses a two-tiered system to balance fast writes with efficient long-term storage:
+  - **Base Index**: A large, optimized faiss index that contains the majority of the vectors. This index is serialized and stored as a BLOB inside a dedicated SQLite table, ensuring it remains part of the single database file.
+  - **Delta Index**: A small, in-memory faiss index that holds all newly added vectors. This allows for near-instant write performance without having to rebuild the entire index for every new addition.
+- **Crash-Safe Logging**: To ensure durability, all new vector additions and deletions are first recorded in a dedicated log table in the SQLite database. This means that even if the application crashes, no data is lost.
+- **Automatic Compaction**: When the number of changes in the log reaches a certain threshold, a background process is automatically triggered to "compact" the index. This process rebuilds the base index, incorporating all the recent changes from the delta index, and then clears the log. This ensures that the index remains optimized for fast search performance over time.
+
+This hybrid approach allows BeaverDB to provide a vector search experience that is both fast and durable, without sacrificing the single-file, embedded philosophy of the library.
+
+
 ## Installation

 ```bash
@@ -125,7 +144,7 @@ for message in chat_history:

 ### 4. Build a RAG (Retrieval-Augmented Generation) System

-Combine **vector search** and **full-text search** to build a powerful RAG pipeline for your local documents.
+Combine **vector search** and **full-text search** to build a powerful RAG pipeline for your local documents. The vector search uses a high-performance, persistent `faiss` index that supports incremental additions without downtime.

 ```python
 # Get context for a user query like "fast python web frameworks"
@@ -185,12 +204,13 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
 - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
 - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
 - [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
+- [`examples/stress_vectors.py`](examples/stress_vectors.py): A stress test for the vector search functionality.
+- [`examples/general_test.py`](examples/general_test.py): A general-purpose test to run all operations randomly which allows testing long-running processes and synchronicity issues.

 ## Roadmap

 These are some of the features and improvements planned for future releases:

-- **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
 - **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.

 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.
````
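The RAG section above combines `search`, `match`, and `rerank`, whose signatures appear later in this diff. A hypothetical end-to-end retrieval sketch (the `BeaverDB` import path and constructor, and the `embed()` stub, are assumptions, not shown in this diff):

```python
# Hypothetical RAG retrieval flow based on the signatures visible in this diff
# (CollectionManager.search/match and rerank). Import path, constructor, and
# embed() are assumptions.
from beaver import BeaverDB               # assumed public import
from beaver.collections import rerank

def embed(text: str) -> list[float]:
    return [0.0] * 384                    # stand-in for a real embedding model

db = BeaverDB("docs.db")                  # assumed constructor signature
docs = db.collection("my_docs")           # shown in the README text above

query = "fast python web frameworks"

vector_hits = [doc for doc, _dist in docs.search(embed(query), top_k=10)]
text_hits = [doc for doc, _score in docs.match(query, top_k=10)]

# Reverse Rank Fusion, weighting vector results slightly above text matches.
context = rerank([vector_hits, text_hits], weights=[1.5, 1.0])
```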
{beaver_db-0.10.0 → beaver_db-0.11.1}/beaver/collections.py

```diff
@@ -1,11 +1,13 @@
 import json
 import sqlite3
+import threading
 import uuid
 from enum import Enum
-from typing import Any, List, Literal,
+from typing import Any, List, Literal, Tuple

 import numpy as np
-from scipy.spatial import KDTree
+
+from .vectors import VectorIndex


 # --- Fuzzy Search Helper Functions ---
@@ -103,16 +105,18 @@ class Document:

 class CollectionManager:
     """
-    A wrapper for multi-modal collection operations
-    FTS,
+    A wrapper for multi-modal collection operations, including document storage,
+    FTS, fuzzy search, graph traversal, and persistent vector search.
     """

     def __init__(self, name: str, conn: sqlite3.Connection):
         self._name = name
         self._conn = conn
-
-        self.
-
+        # All vector-related operations are now delegated to the VectorIndex class.
+        self._vector_index = VectorIndex(name, conn)
+        # A lock to ensure only one compaction thread runs at a time for this collection.
+        self._compaction_lock = threading.Lock()
+        self._compaction_thread: threading.Thread | None = None

     def _flatten_metadata(self, metadata: dict, prefix: str = "") -> dict[str, Any]:
         """Flattens a nested dictionary for indexing."""
@@ -125,21 +129,58 @@ class CollectionManager:
             flat_dict[new_key] = value
         return flat_dict

-    def
-    """
+    def _needs_compaction(self, threshold: int = 1000) -> bool:
+        """Checks if the total number of pending vector operations exceeds the threshold."""
         cursor = self._conn.cursor()
         cursor.execute(
-            "SELECT
-            (self._name,)
+            "SELECT COUNT(*) FROM _beaver_ann_pending_log WHERE collection_name = ?",
+            (self._name,)
+        )
+        pending_count = cursor.fetchone()[0]
+        cursor.execute(
+            "SELECT COUNT(*) FROM _beaver_ann_deletions_log WHERE collection_name = ?",
+            (self._name,)
         )
-
-        return
+        deletion_count = cursor.fetchone()[0]
+        return (pending_count + deletion_count) >= threshold
+
+    def _run_compaction_and_release_lock(self):
+        """
+        A target function for the background thread that runs the compaction
+        and ensures the lock is always released, even if errors occur.
+        """
+        try:
+            self._vector_index.compact()
+        finally:
+            self._compaction_lock.release()
+
+    def compact(self, block: bool = False):
+        """
+        Triggers a non-blocking background compaction of the vector index.
+
+        If a compaction is already running for this collection, this method returns
+        immediately without starting a new one.

-
-
-
-
-
+        Args:
+            block: If True, this method will wait for the compaction to complete
+                   before returning. Defaults to False (non-blocking).
+        """
+        # Use a non-blocking lock acquire to check if a compaction is already running.
+        if self._compaction_lock.acquire(blocking=False):
+            try:
+                # If we get the lock, start a new background thread.
+                self._compaction_thread = threading.Thread(
+                    target=self._run_compaction_and_release_lock,
+                    daemon=True  # Daemon threads don't block program exit.
+                )
+                self._compaction_thread.start()
+                if block:
+                    self._compaction_thread.join()
+            except Exception:
+                # If something goes wrong during thread creation, release the lock.
+                self._compaction_lock.release()
+                raise
+        # If acquire fails, it means another thread holds the lock, so we do nothing.

     def index(
         self,
```
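Note the non-blocking `acquire(blocking=False)` guard above: concurrent `compact()` calls become no-ops instead of queuing duplicate rebuilds. A hypothetical call site (collection setup assumed):

```python
# Hypothetical usage of the compaction API added above (setup assumed).
docs = db.collection("my_docs")

docs.compact()             # fire-and-forget: rebuild runs in a daemon thread
docs.compact(block=True)   # e.g. at shutdown: wait for the rebuild to finish
                           # (returns immediately if a compaction is already running)
```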
```diff
@@ -152,9 +193,14 @@ class CollectionManager:
         Indexes a Document, including vector, FTS, and fuzzy search data.
         The entire operation is performed in a single atomic transaction.
         """
+        if not isinstance(document, Document):
+            raise TypeError("Item to index must be a Document object.")
+
         with self._conn:
-
-
+            cursor = self._conn.cursor()
+
+            # Step 1: Core Document and Vector Storage
+            cursor.execute(
                 "INSERT OR REPLACE INTO beaver_collections (collection, item_id, item_vector, metadata) VALUES (?, ?, ?, ?)",
                 (
                     self._name,
@@ -164,12 +210,14 @@ class CollectionManager:
                 ),
             )

-            # Step 2:
-
-
-
+            # Step 2: Delegate to the VectorIndex if an embedding exists.
+            if document.embedding is not None:
+                self._vector_index.index(document.id, document.embedding, cursor)
+
+            # Step 3: FTS and Fuzzy Indexing
+            cursor.execute("DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            cursor.execute("DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?", (self._name, document.id))

-            # Determine which string fields to index
             flat_metadata = self._flatten_metadata(document.to_dict())
             fields_to_index: dict[str, str] = {}
             if isinstance(fts, list):
@@ -178,63 +226,46 @@ class CollectionManager:
                 fields_to_index = {k: v for k, v in flat_metadata.items() if isinstance(v, str)}

             if fields_to_index:
-                # FTS indexing
                 fts_data = [(self._name, document.id, path, content) for path, content in fields_to_index.items()]
-
-                    "INSERT INTO beaver_fts_index (collection, item_id, field_path, field_content) VALUES (?, ?, ?, ?)",
-                    fts_data,
-                )
-
-                # Fuzzy indexing (if enabled)
+                cursor.executemany("INSERT INTO beaver_fts_index (collection, item_id, field_path, field_content) VALUES (?, ?, ?, ?)", fts_data)
                 if fuzzy:
                     trigram_data = []
                     for path, content in fields_to_index.items():
                         for trigram in _get_trigrams(content.lower()):
                             trigram_data.append((self._name, document.id, path, trigram))
                     if trigram_data:
-
-                            "INSERT INTO beaver_trigrams (collection, item_id, field_path, trigram) VALUES (?, ?, ?, ?)",
-                            trigram_data,
-                        )
+                        cursor.executemany("INSERT INTO beaver_trigrams (collection, item_id, field_path, trigram) VALUES (?, ?, ?, ?)", trigram_data)

-            # Step
-
-            ""
-                INSERT INTO beaver_collection_versions (collection_name, version) VALUES (?, 1)
-                ON CONFLICT(collection_name) DO UPDATE SET version = version + 1
-                """,
+            # Step 4: Update Collection Version to signal a change.
+            cursor.execute(
+                "INSERT INTO beaver_collection_versions (collection_name, version) VALUES (?, 1) ON CONFLICT(collection_name) DO UPDATE SET version = version + 1",
                 (self._name,),
             )

+        # After the transaction commits, check if auto-compaction is needed.
+        if self._needs_compaction():
+            self.compact()
+
     def drop(self, document: Document):
         """Removes a document and all its associated data from the collection."""
         if not isinstance(document, Document):
             raise TypeError("Item to drop must be a Document object.")
         with self._conn:
-            self._conn.
-
-
-            )
-            self.
-
-
-
-            self._conn.execute(
-                "DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?",
-                (self._name, document.id),
-            )
-            self._conn.execute(
-                "DELETE FROM beaver_edges WHERE collection = ? AND (source_item_id = ? OR target_item_id = ?)",
-                (self._name, document.id, document.id),
-            )
-            self._conn.execute(
-                """
-                INSERT INTO beaver_collection_versions (collection_name, version) VALUES (?, 1)
-                ON CONFLICT(collection_name) DO UPDATE SET version = version + 1
-                """,
+            cursor = self._conn.cursor()
+            cursor.execute("DELETE FROM beaver_collections WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            cursor.execute("DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            cursor.execute("DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            cursor.execute("DELETE FROM beaver_edges WHERE collection = ? AND (source_item_id = ? OR target_item_id = ?)", (self._name, document.id, document.id))
+            self._vector_index.drop(document.id, cursor)
+            cursor.execute(
+                "INSERT INTO beaver_collection_versions (collection_name, version) VALUES (?, 1) ON CONFLICT(collection_name) DO UPDATE SET version = version + 1",
                 (self._name,),
             )

+        # Check for auto-compaction after a drop as well.
+        if self._needs_compaction():
+            self.compact()
+
     def __iter__(self):
         """Returns an iterator over all documents in the collection."""
         cursor = self._conn.cursor()
```
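The `with self._conn:` blocks in `index` and `drop` above rely on `sqlite3`'s connection context manager, which opens a transaction and commits on success or rolls back on an exception; that is what makes each operation atomic. A minimal, self-contained demonstration of that standard-library behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("INSERT INTO t VALUES (1)")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The insert was rolled back, so the table is still empty.
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # -> 0
```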
```diff
@@ -253,57 +284,45 @@ class CollectionManager:
         )
         cursor.close()

-    def refresh(self):
-        """Forces a rebuild of the in-memory ANN index from data in SQLite."""
-        cursor = self._conn.cursor()
-        cursor.execute(
-            "SELECT item_id, item_vector FROM beaver_collections WHERE collection = ? AND item_vector IS NOT NULL",
-            (self._name,),
-        )
-        vectors, self._doc_ids = [], []
-        for row in cursor.fetchall():
-            self._doc_ids.append(row["item_id"])
-            vectors.append(np.frombuffer(row["item_vector"], dtype=np.float32))
-
-        self._kdtree = KDTree(vectors) if vectors else None
-        self._local_index_version = self._get_db_version()
-
     def search(
         self, vector: list[float], top_k: int = 10
-    ) ->
-        """Performs a fast approximate nearest neighbor search."""
-        if
-
-        if not self._kdtree:
-            return []
+    ) -> List[Tuple[Document, float]]:
+        """Performs a fast, persistent approximate nearest neighbor search."""
+        if not isinstance(vector, list):
+            raise TypeError("Search vector must be a list of floats.")

-
-
-
-        distances, indices = self._kdtree.query(
-            np.array(vector, dtype=np.float32), k=top_k
+        search_results = self._vector_index.search(
+            np.array(vector, dtype=np.float32), top_k=top_k
         )
-        if
-
+        if not search_results:
+            return []
+
+        result_ids = [item[0] for item in search_results]
+        distance_map = {item[0]: item[1] for item in search_results}

-        result_ids = [self._doc_ids[i] for i in indices]
         placeholders = ",".join("?" for _ in result_ids)
         sql = f"SELECT item_id, item_vector, metadata FROM beaver_collections WHERE collection = ? AND item_id IN ({placeholders})"

         cursor = self._conn.cursor()
         rows = cursor.execute(sql, (self._name, *result_ids)).fetchall()
-        row_map = {row["item_id"]: row for row in rows}

-
-
-
-
-
-
-
-
-
+        doc_map = {
+            row["item_id"]: Document(
+                id=row["item_id"],
+                embedding=(np.frombuffer(row["item_vector"], dtype=np.float32).tolist() if row["item_vector"] else None),
+                **json.loads(row["metadata"]),
+            )
+            for row in rows
+        }
+
+        final_results = []
+        for doc_id in result_ids:
+            if doc_id in doc_map:
+                doc = doc_map[doc_id]
+                distance = distance_map[doc_id]
+                final_results.append((doc, distance))
+
+        return final_results

     def match(
         self,
```
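After this rewrite, `search` returns `(Document, distance)` pairs in the order the underlying `VectorIndex` ranked them. A hypothetical call site (database and collection setup assumed; the 8-dimensional query vector is illustrative):

```python
# Hypothetical call site for the rewritten search() (setup assumed).
for doc, distance in docs.search([0.1] * 8, top_k=3):
    print(f"{doc.id}: distance={distance:.4f}")
```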
```diff
@@ -315,14 +334,6 @@ class CollectionManager:
     ) -> list[tuple[Document, float]]:
         """
         Performs a full-text or fuzzy search on indexed string fields.
-
-        Args:
-            query: The search query string.
-            on: An optional list of fields to restrict the search to.
-            top_k: The maximum number of results to return.
-            fuzziness: The Levenshtein distance for fuzzy matching.
-                       If 0, performs an exact FTS search.
-                       If > 0, performs a fuzzy search.
         """
         if isinstance(on, str):
             on = [on]
@@ -372,9 +383,6 @@ class CollectionManager:
         if not query_trigrams:
             return set()

-        # Optimization: Only consider documents that share a significant number of trigrams.
-        # This threshold dramatically reduces the number of candidates for the expensive
-        # Levenshtein check. A 30% threshold is a reasonable starting point.
         similarity_threshold = int(len(query_trigrams) * 0.3)
         if similarity_threshold == 0:
             return set()
@@ -404,7 +412,6 @@ class CollectionManager:
         self, query: str, on: list[str] | None, top_k: int, fuzziness: int
     ) -> list[tuple[Document, float]]:
         """Performs a 3-stage fuzzy search: gather, score, and sort."""
-        # Stage 1: Gather Candidates
         fts_results = self._perform_fts_search(query, on, top_k)
         fts_candidate_ids = {doc.id for doc, _ in fts_results}
         trigram_candidate_ids = self._get_trigram_candidates(query, on)
@@ -412,7 +419,6 @@ class CollectionManager:
         if not candidate_ids:
             return []

-        # Stage 2: Score Candidates
         cursor = self._conn.cursor()
         id_placeholders = ",".join("?" for _ in candidate_ids)
         sql_text = f"SELECT item_id, field_path, field_content FROM beaver_fts_index WHERE collection = ? AND item_id IN ({id_placeholders})"
@@ -445,10 +451,9 @@ class CollectionManager:
                 scored_candidates.append({
                     "id": item_id,
                     "distance": min_dist,
-                    "fts_rank": fts_rank_map.get(item_id, 0)
+                    "fts_rank": fts_rank_map.get(item_id, 0)
                 })

-        # Stage 3: Sort and Fetch Results
         scored_candidates.sort(key=lambda x: (x["distance"], x["fts_rank"]))
         top_ids = [c["id"] for c in scored_candidates[:top_k]]
         if not top_ids:
@@ -563,49 +568,26 @@ def rerank(
 ) -> list[Document]:
     """
     Reranks documents from multiple search result lists using Reverse Rank Fusion (RRF).
-    This function is specifically designed to work with beaver.collections.Document objects.
-
-    Args:
-        results (sequence of list[Document]): A sequence of search result lists, where each
-            inner list contains Document objects.
-        weights (list[float], optional): A list of weights corresponding to each
-            result list. If None, all lists are weighted equally. Defaults to None.
-        k (int, optional): A constant used in the RRF formula. Defaults to 60.
-
-    Returns:
-        list[Document]: A single, reranked list of unique Document objects, sorted
-            by their fused rank score in descending order.
     """
     if not results:
         return []

-    # Assign a default weight of 1.0 if none are provided
     if weights is None:
         weights = [1.0] * len(results)

     if len(results) != len(weights):
         raise ValueError("The number of result lists must match the number of weights.")

-    # Use dictionaries to store scores and unique documents by their ID
     rrf_scores: dict[str, float] = {}
     doc_store: dict[str, Document] = {}

-    # Iterate through each list of Document objects and its weight
     for result_list, weight in zip(results, weights):
         for rank, doc in enumerate(result_list):
-            # Use the .id attribute from the Document object
             doc_id = doc.id
             if doc_id not in doc_store:
                 doc_store[doc_id] = doc
-
-            # Calculate the reciprocal rank score, scaled by the weight
             score = weight * (1 / (k + rank))
-
-            # Add the score to the document's running total
             rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + score

-    # Sort the document IDs by their final aggregated scores
     sorted_doc_ids = sorted(rrf_scores.keys(), key=rrf_scores.get, reverse=True)
-
-    # Return the final list of Document objects in the new, reranked order
     return [doc_store[doc_id] for doc_id in sorted_doc_ids]
```
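For intuition on the RRF scoring that `rerank` performs (unchanged by this comment cleanup): each list contributes `weight * 1 / (k + rank)` per document, summed across lists. A quick worked example with the default `k = 60`:

```python
# Worked example of rerank()'s scoring with k = 60 and equal weights of 1.0:
# a document appears at rank 0 in list one and rank 2 in list two.
score = 1.0 * (1 / (60 + 0)) + 1.0 * (1 / (60 + 2))
print(round(score, 4))  # 0.0328 — higher than a doc found in only one list
```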