beaver-db 0.9.1__tar.gz → 0.10.0__tar.gz
Potentially problematic release: this version of beaver-db might be problematic.
- {beaver_db-0.9.1/beaver_db.egg-info → beaver_db-0.10.0}/PKG-INFO +9 -7
- {beaver_db-0.9.1 → beaver_db-0.10.0}/README.md +8 -6
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/channels.py +56 -1
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/collections.py +256 -48
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/core.py +22 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0/beaver_db.egg-info}/PKG-INFO +9 -7
- {beaver_db-0.9.1 → beaver_db-0.10.0}/pyproject.toml +1 -1
- {beaver_db-0.9.1 → beaver_db-0.10.0}/LICENSE +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/__init__.py +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/dicts.py +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/lists.py +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/queues.py +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver_db.egg-info/SOURCES.txt +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver_db.egg-info/dependency_links.txt +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver_db.egg-info/requires.txt +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/beaver_db.egg-info/top_level.txt +0 -0
- {beaver_db-0.9.1 → beaver_db-0.10.0}/setup.cfg +0 -0
{beaver_db-0.9.1/beaver_db.egg-info → beaver_db-0.10.0}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: beaver-db
-Version: 0.
+Version: 0.10.0
 Summary: Fast, embedded, and multi-modal DB based on SQLite for AI-powered applications.
 Requires-Python: >=3.13
 Description-Content-Type: text/markdown
@@ -19,20 +19,21 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 
 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.
 
-- **Minimalistic
-- **
+- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+- **Schemaless**: Flexible data storage without rigid schemas across all modalities.
+- **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
 - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
 - **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. The vector search is accelerated with an in-memory k-d tree.
 - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.
 
 ## Core Features
 
-- **High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture.
+- **Sync/Async High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture. Sync by default, but with an `as_async` wrapper for async applications.
 - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
 - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
 - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
 - **Efficient Vector Storage & Search**: Store vector embeddings and perform fast approximate nearest neighbor searches using an in-memory k-d tree.
-- **Full-Text Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine.
+- **Full-Text Search and Fuzzy**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search.
 - **Graph Traversal**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
 - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.
 
@@ -190,17 +191,18 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
 - [`examples/fts.py`](examples/fts.py): A detailed look at full-text search, including targeted searches on specific metadata fields.
 - [`examples/graph.py`](examples/graph.py): Shows how to create relationships between documents and perform multi-hop graph traversals.
 - [`examples/pubsub.py`](examples/pubsub.py): A demonstration of the synchronous, thread-safe publish/subscribe system in a single process.
+- [`examples/async_pubsub.py`](examples/async_pubsub.py): A demonstration of the asynchronous wrapper for the publish/subscribe system.
 - [`examples/publisher.py`](examples/publisher.py) and [`examples/subscriber.py`](examples/subscriber.py): A pair of examples demonstrating inter-process message passing with the publish/subscribe system.
 - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
 - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
+- [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
 
 ## Roadmap
 
 These are some of the features and improvements planned for future releases:
 
-- **Fuzzy search**: Implement fuzzy matching capabilities for text search.
 - **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
-- **Async API**: Comprehensive async support with on-demand wrappers for all collections.
+- **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.
 
 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.
 
{beaver_db-0.9.1 → beaver_db-0.10.0}/README.md
@@ -8,20 +8,21 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
 
 `beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.
 
-- **Minimalistic
-- **
+- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
+- **Schemaless**: Flexible data storage without rigid schemas across all modalities.
+- **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
 - **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
 - **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. The vector search is accelerated with an in-memory k-d tree.
 - **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.
 
 ## Core Features
 
-- **High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture.
+- **Sync/Async High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture. Sync by default, but with an `as_async` wrapper for async applications.
 - **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
 - **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
 - **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
 - **Efficient Vector Storage & Search**: Store vector embeddings and perform fast approximate nearest neighbor searches using an in-memory k-d tree.
-- **Full-Text Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine.
+- **Full-Text Search and Fuzzy**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search.
 - **Graph Traversal**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
 - **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.
 
@@ -179,17 +180,18 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
 - [`examples/fts.py`](examples/fts.py): A detailed look at full-text search, including targeted searches on specific metadata fields.
 - [`examples/graph.py`](examples/graph.py): Shows how to create relationships between documents and perform multi-hop graph traversals.
 - [`examples/pubsub.py`](examples/pubsub.py): A demonstration of the synchronous, thread-safe publish/subscribe system in a single process.
+- [`examples/async_pubsub.py`](examples/async_pubsub.py): A demonstration of the asynchronous wrapper for the publish/subscribe system.
 - [`examples/publisher.py`](examples/publisher.py) and [`examples/subscriber.py`](examples/subscriber.py): A pair of examples demonstrating inter-process message passing with the publish/subscribe system.
 - [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
 - [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
+- [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
 
 ## Roadmap
 
 These are some of the features and improvements planned for future releases:
 
-- **Fuzzy search**: Implement fuzzy matching capabilities for text search.
 - **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
-- **Async API**: Comprehensive async support with on-demand wrappers for all collections.
+- **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.
 
 Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.
 
{beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/channels.py
@@ -1,14 +1,44 @@
+import asyncio
 import json
 import sqlite3
 import threading
 import time
 from queue import Empty, Queue
-from typing import Any, Iterator, Set
+from typing import Any, AsyncIterator, Iterator, Set
 
 # A special message object used to signal the listener to gracefully shut down.
 _SHUTDOWN_SENTINEL = object()
 
 
+class AsyncSubscriber:
+    """A thread-safe async message receiver for a specific channel subscription."""
+
+    def __init__(self, subscriber: "Subscriber"):
+        self._subscriber = subscriber
+
+    async def __aenter__(self) -> "AsyncSubscriber":
+        """Registers the listener's queue with the channel to start receiving messages."""
+        await asyncio.to_thread(self._subscriber.__enter__)
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        """Unregisters the listener's queue from the channel to stop receiving messages."""
+        await asyncio.to_thread(self._subscriber.__exit__, exc_type, exc_val, exc_tb)
+
+    async def listen(self, timeout: float | None = None) -> AsyncIterator[Any]:
+        """
+        Returns a blocking async iterator that yields messages as they arrive.
+        """
+        while True:
+            try:
+                msg = await asyncio.to_thread(self._subscriber._queue.get, timeout=timeout)
+                if msg is _SHUTDOWN_SENTINEL:
+                    break
+                yield msg
+            except Empty:
+                raise TimeoutError(f"Timeout {timeout}s expired.")
+
+
 class Subscriber:
     """
     A thread-safe message receiver for a specific channel subscription.
@@ -54,6 +84,27 @@ class Subscriber:
             except Empty:
                 raise TimeoutError(f"Timeout {timeout}s expired.")
 
+    def as_async(self) -> "AsyncSubscriber":
+        """Returns an async version of the subscriber."""
+        return AsyncSubscriber(self)
+
+
+class AsyncChannelManager:
+    """The central async hub for a named pub/sub channel."""
+
+    def __init__(self, channel: "ChannelManager"):
+        self._channel = channel
+
+    async def publish(self, payload: Any):
+        """
+        Publishes a JSON-serializable message to the channel asynchronously.
+        """
+        await asyncio.to_thread(self._channel.publish, payload)
+
+    def subscribe(self) -> "AsyncSubscriber":
+        """Creates a new async subscription, returning an AsyncSubscriber context manager."""
+        return self._channel.subscribe().as_async()
+
 
 class ChannelManager:
     """
@@ -183,3 +234,7 @@ class ChannelManager:
                 "INSERT INTO beaver_pubsub_log (timestamp, channel_name, message_payload) VALUES (?, ?, ?)",
                 (time.time(), self._name, json_payload),
             )
+
+    def as_async(self) -> "AsyncChannelManager":
+        """Returns an async version of the channel manager."""
+        return AsyncChannelManager(self)
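The `as_async` methods in this file return thin wrappers that keep a reference to the synchronous object and forward each call through `asyncio.to_thread`. A rough sketch of that wrapper shape, using hypothetical `SyncChannel` and `AsyncChannel` stand-ins rather than the real `ChannelManager`:

```python
import asyncio
import threading
from typing import Any


class SyncChannel:
    """Hypothetical stand-in for a blocking, thread-safe channel."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.log: list[Any] = []

    def publish(self, payload: Any) -> None:
        # The lock matters because the async wrapper calls this from worker threads.
        with self._lock:
            self.log.append(payload)

    def as_async(self) -> "AsyncChannel":
        return AsyncChannel(self)


class AsyncChannel:
    """Async facade that delegates every call to the wrapped sync object."""

    def __init__(self, channel: SyncChannel) -> None:
        self._channel = channel

    async def publish(self, payload: Any) -> None:
        await asyncio.to_thread(self._channel.publish, payload)


async def main() -> SyncChannel:
    channel = SyncChannel()
    async_channel = channel.as_async()
    # Three publishes run concurrently on worker threads.
    await asyncio.gather(*(async_channel.publish(i) for i in range(3)))
    return channel


channel = asyncio.run(main())
print(sorted(channel.log))
```

Because the wrapper holds no state of its own, sync and async callers can share one underlying object, provided that object is itself thread-safe.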
{beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/collections.py
@@ -5,7 +5,63 @@ from enum import Enum
 from typing import Any, List, Literal, Set
 
 import numpy as np
-from scipy.spatial import
+from scipy.spatial import KDTree
+
+
+# --- Fuzzy Search Helper Functions ---
+
+def _levenshtein_distance(s1: str, s2: str) -> int:
+    """Calculates the Levenshtein distance between two strings."""
+    if len(s1) < len(s2):
+        return _levenshtein_distance(s2, s1)
+    if len(s2) == 0:
+        return len(s1)
+
+    previous_row = range(len(s2) + 1)
+    for i, c1 in enumerate(s1):
+        current_row = [i + 1]
+        for j, c2 in enumerate(s2):
+            insertions = previous_row[j + 1] + 1
+            deletions = current_row[j] + 1
+            substitutions = previous_row[j] + (c1 != c2)
+            current_row.append(min(insertions, deletions, substitutions))
+        previous_row = current_row
+    return previous_row[-1]
+
+
+def _get_trigrams(text: str) -> set[str]:
+    """Generates a set of 3-character trigrams from a string."""
+    if not text or len(text) < 3:
+        return set()
+    return {text[i:i+3] for i in range(len(text) - 2)}
+
+
+def _sliding_window_levenshtein(query: str, content: str, fuzziness: int) -> int:
+    """
+    Finds the best Levenshtein match for a query within a larger text
+    by comparing it against relevant substrings.
+    """
+    query_tokens = query.lower().split()
+    content_tokens = content.lower().split()
+    query_len = len(query_tokens)
+    if query_len == 0:
+        return 0
+
+    min_dist = float('inf')
+    query_norm = " ".join(query_tokens)
+
+    # The window size can be slightly smaller or larger than the query length
+    # to account for missing or extra words in a fuzzy match.
+    for window_size in range(max(1, query_len - fuzziness), query_len + fuzziness + 1):
+        if window_size > len(content_tokens):
+            continue
+        for i in range(len(content_tokens) - window_size + 1):
+            window_text = " ".join(content_tokens[i:i+window_size])
+            dist = _levenshtein_distance(query_norm, window_text)
+            if dist < min_dist:
+                min_dist = dist
+
+    return int(min_dist)
 
 
 class WalkDirection(Enum):
@@ -54,18 +110,18 @@ class CollectionManager:
     def __init__(self, name: str, conn: sqlite3.Connection):
         self._name = name
         self._conn = conn
-        self._kdtree:
+        self._kdtree: KDTree | None = None
         self._doc_ids: List[str] = []
         self._local_index_version = -1  # Version of the in-memory index
 
-    def _flatten_metadata(self, metadata: dict, prefix: str = "") -> dict[str,
-        """Flattens a nested dictionary
+    def _flatten_metadata(self, metadata: dict, prefix: str = "") -> dict[str, Any]:
+        """Flattens a nested dictionary for indexing."""
         flat_dict = {}
         for key, value in metadata.items():
-            new_key = f"{prefix}
+            new_key = f"{prefix}.{key}" if prefix else key
             if isinstance(value, dict):
                 flat_dict.update(self._flatten_metadata(value, new_key))
-
+            else:
                 flat_dict[new_key] = value
         return flat_dict
 
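The new `_flatten_metadata` body joins nested keys with dots and keeps non-dict values as they are. A standalone sketch of the same recursion, with a hypothetical free function `flatten` and made-up sample data:

```python
from typing import Any


def flatten(metadata: dict, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat: dict[str, Any] = {}
    for key, value in metadata.items():
        new_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the path built so far.
            flat.update(flatten(value, new_key))
        else:
            flat[new_key] = value
    return flat


doc = {
    "title": "Beaver",
    "author": {"name": "Ada", "contact": {"email": "ada@example.com"}},
    "year": 2025,
}
flat = flatten(doc)
print(flat)
```

The dotted paths are what later appear as `field_path` values, which is why a search can be restricted to a nested field such as `author.name`.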
@@ -85,39 +141,63 @@ class CollectionManager:
             return True
         return self._local_index_version < self._get_db_version()
 
-    def index(
-
+    def index(
+        self,
+        document: Document,
+        *,
+        fts: bool | list[str] = True,
+        fuzzy: bool = False
+    ):
+        """
+        Indexes a Document, including vector, FTS, and fuzzy search data.
+        The entire operation is performed in a single atomic transaction.
+        """
         with self._conn:
-
-            self._conn.execute(
-                "DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?",
-                (self._name, document.id),
-            )
-            string_fields = self._flatten_metadata(document.to_dict())
-            if string_fields:
-                fts_data = [
-                    (self._name, document.id, path, content)
-                    for path, content in string_fields.items()
-                ]
-                self._conn.executemany(
-                    "INSERT INTO beaver_fts_index (collection, item_id, field_path, field_content) VALUES (?, ?, ?, ?)",
-                    fts_data,
-                )
-
+            # Step 1: Core Document and Vector Storage (Unaffected by FTS/Fuzzy)
             self._conn.execute(
                 "INSERT OR REPLACE INTO beaver_collections (collection, item_id, item_vector, metadata) VALUES (?, ?, ?, ?)",
                 (
                     self._name,
                     document.id,
-                    (
-                        document.embedding.tobytes()
-                        if document.embedding is not None
-                        else None
-                    ),
+                    document.embedding.tobytes() if document.embedding is not None else None,
                     json.dumps(document.to_dict()),
                 ),
             )
-
+
+            # Step 2: FTS and Fuzzy Indexing
+            # First, clean up old index data for this document
+            self._conn.execute("DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?", (self._name, document.id))
+            self._conn.execute("DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?", (self._name, document.id))
+
+            # Determine which string fields to index
+            flat_metadata = self._flatten_metadata(document.to_dict())
+            fields_to_index: dict[str, str] = {}
+            if isinstance(fts, list):
+                fields_to_index = {k: v for k, v in flat_metadata.items() if k in fts and isinstance(v, str)}
+            elif fts:
+                fields_to_index = {k: v for k, v in flat_metadata.items() if isinstance(v, str)}
+
+            if fields_to_index:
+                # FTS indexing
+                fts_data = [(self._name, document.id, path, content) for path, content in fields_to_index.items()]
+                self._conn.executemany(
+                    "INSERT INTO beaver_fts_index (collection, item_id, field_path, field_content) VALUES (?, ?, ?, ?)",
+                    fts_data,
+                )
+
+                # Fuzzy indexing (if enabled)
+                if fuzzy:
+                    trigram_data = []
+                    for path, content in fields_to_index.items():
+                        for trigram in _get_trigrams(content.lower()):
+                            trigram_data.append((self._name, document.id, path, trigram))
+                    if trigram_data:
+                        self._conn.executemany(
+                            "INSERT INTO beaver_trigrams (collection, item_id, field_path, trigram) VALUES (?, ?, ?, ?)",
+                            trigram_data,
+                        )
+
+            # Step 3: Update Collection Version
             self._conn.execute(
                 """
                 INSERT INTO beaver_collection_versions (collection_name, version) VALUES (?, 1)
@@ -139,6 +219,10 @@ class CollectionManager:
                 "DELETE FROM beaver_fts_index WHERE collection = ? AND item_id = ?",
                 (self._name, document.id),
             )
+            self._conn.execute(
+                "DELETE FROM beaver_trigrams WHERE collection = ? AND item_id = ?",
+                (self._name, document.id),
+            )
             self._conn.execute(
                 "DELETE FROM beaver_edges WHERE collection = ? AND (source_item_id = ? OR target_item_id = ?)",
                 (self._name, document.id, document.id),
@@ -181,7 +265,7 @@ class CollectionManager:
             self._doc_ids.append(row["item_id"])
             vectors.append(np.frombuffer(row["item_vector"], dtype=np.float32))
 
-        self._kdtree =
+        self._kdtree = KDTree(vectors) if vectors else None
         self._local_index_version = self._get_db_version()
 
     def search(
@@ -222,9 +306,36 @@ class CollectionManager:
         return results
 
     def match(
-        self,
+        self,
+        query: str,
+        *,
+        on: str | list[str] | None = None,
+        top_k: int = 10,
+        fuzziness: int = 0
     ) -> list[tuple[Document, float]]:
-        """
+        """
+        Performs a full-text or fuzzy search on indexed string fields.
+
+        Args:
+            query: The search query string.
+            on: An optional list of fields to restrict the search to.
+            top_k: The maximum number of results to return.
+            fuzziness: The Levenshtein distance for fuzzy matching.
+                If 0, performs an exact FTS search.
+                If > 0, performs a fuzzy search.
+        """
+        if isinstance(on, str):
+            on = [on]
+
+        if fuzziness == 0:
+            return self._perform_fts_search(query, on, top_k)
+        else:
+            return self._perform_fuzzy_search(query, on, top_k, fuzziness)
+
+    def _perform_fts_search(
+        self, query: str, on: list[str] | None, top_k: int
+    ) -> list[tuple[Document, float]]:
+        """Performs a standard FTS search."""
         cursor = self._conn.cursor()
         sql_query = """
             SELECT t1.item_id, t1.item_vector, t1.metadata, fts.rank
@@ -234,30 +345,127 @@ class CollectionManager:
             ) AS fts ON t1.item_id = fts.item_id
             WHERE t1.collection = ? ORDER BY fts.rank
         """
-        params
-
-
-
-
-            params.
-        params.extend([top_k, self._name])
+        params: list[Any] = [query]
+        field_filter_sql = ""
+        if on:
+            placeholders = ",".join("?" for _ in on)
+            field_filter_sql = f"AND field_path IN ({placeholders})"
+            params.extend(on)
 
-
-
-        ).fetchall()
+        params.extend([top_k, self._name])
+        rows = cursor.execute(sql_query.format(field_filter_sql), tuple(params)).fetchall()
         results = []
         for row in rows:
             embedding = (
                 np.frombuffer(row["item_vector"], dtype=np.float32).tolist()
-                if row["item_vector"]
-                else None
-            )
-            doc = Document(
-                id=row["item_id"], embedding=embedding, **json.loads(row["metadata"])
+                if row["item_vector"] else None
             )
+            doc = Document(id=row["item_id"], embedding=embedding, **json.loads(row["metadata"]))
             results.append((doc, row["rank"]))
         return results
 
+    def _get_trigram_candidates(self, query: str, on: list[str] | None) -> set[str]:
+        """
+        Gets document IDs that meet a trigram similarity threshold with the query.
+        """
+        query_trigrams = _get_trigrams(query.lower())
+        if not query_trigrams:
+            return set()
+
+        # Optimization: Only consider documents that share a significant number of trigrams.
+        # This threshold dramatically reduces the number of candidates for the expensive
+        # Levenshtein check. A 30% threshold is a reasonable starting point.
+        similarity_threshold = int(len(query_trigrams) * 0.3)
+        if similarity_threshold == 0:
+            return set()
+
+        cursor = self._conn.cursor()
+        sql = """
+            SELECT item_id FROM beaver_trigrams
+            WHERE collection = ? AND trigram IN ({}) {}
+            GROUP BY item_id
+            HAVING COUNT(DISTINCT trigram) >= ?
+        """
+        params: list[Any] = [self._name]
+        trigram_placeholders = ",".join("?" for _ in query_trigrams)
+        params.extend(query_trigrams)
+
+        field_filter_sql = ""
+        if on:
+            field_placeholders = ",".join("?" for _ in on)
+            field_filter_sql = f"AND field_path IN ({field_placeholders})"
+            params.extend(on)
+
+        params.append(similarity_threshold)
+        cursor.execute(sql.format(trigram_placeholders, field_filter_sql), tuple(params))
+        return {row['item_id'] for row in cursor.fetchall()}
+
+    def _perform_fuzzy_search(
+        self, query: str, on: list[str] | None, top_k: int, fuzziness: int
+    ) -> list[tuple[Document, float]]:
+        """Performs a 3-stage fuzzy search: gather, score, and sort."""
+        # Stage 1: Gather Candidates
+        fts_results = self._perform_fts_search(query, on, top_k)
+        fts_candidate_ids = {doc.id for doc, _ in fts_results}
+        trigram_candidate_ids = self._get_trigram_candidates(query, on)
+        candidate_ids = fts_candidate_ids.union(trigram_candidate_ids)
+        if not candidate_ids:
+            return []
+
+        # Stage 2: Score Candidates
+        cursor = self._conn.cursor()
+        id_placeholders = ",".join("?" for _ in candidate_ids)
+        sql_text = f"SELECT item_id, field_path, field_content FROM beaver_fts_index WHERE collection = ? AND item_id IN ({id_placeholders})"
+        params_text: list[Any] = [self._name]
+        params_text.extend(candidate_ids)
+        if on:
+            sql_text += f" AND field_path IN ({','.join('?' for _ in on)})"
+            params_text.extend(on)
+
+        cursor.execute(sql_text, tuple(params_text))
+        candidate_texts: dict[str, dict[str, str]] = {}
+        for row in cursor.fetchall():
+            item_id = row['item_id']
+            if item_id not in candidate_texts:
+                candidate_texts[item_id] = {}
+            candidate_texts[item_id][row['field_path']] = row['field_content']
+
+        scored_candidates = []
+        fts_rank_map = {doc.id: rank for doc, rank in fts_results}
+
+        for item_id in candidate_ids:
+            if item_id not in candidate_texts:
+                continue
+            min_dist = float('inf')
+            for content in candidate_texts[item_id].values():
+                dist = _sliding_window_levenshtein(query, content, fuzziness)
+                if dist < min_dist:
+                    min_dist = dist
+            if min_dist <= fuzziness:
+                scored_candidates.append({
+                    "id": item_id,
+                    "distance": min_dist,
+                    "fts_rank": fts_rank_map.get(item_id, 0)  # Use 0 for non-matches (less relevant)
+                })
+
+        # Stage 3: Sort and Fetch Results
+        scored_candidates.sort(key=lambda x: (x["distance"], x["fts_rank"]))
+        top_ids = [c["id"] for c in scored_candidates[:top_k]]
+        if not top_ids:
+            return []
+
+        id_placeholders = ",".join("?" for _ in top_ids)
+        sql_docs = f"SELECT item_id, item_vector, metadata FROM beaver_collections WHERE collection = ? AND item_id IN ({id_placeholders})"
+        cursor.execute(sql_docs, (self._name, *top_ids))
+        doc_map = {row["item_id"]: Document(id=row["item_id"], embedding=(np.frombuffer(row["item_vector"], dtype=np.float32).tolist() if row["item_vector"] else None), **json.loads(row["metadata"])) for row in cursor.fetchall()}
+
+        final_results = []
+        distance_map = {c["id"]: c["distance"] for c in scored_candidates}
+        for doc_id in top_ids:
+            if doc_id in doc_map:
+                final_results.append((doc_map[doc_id], float(distance_map[doc_id])))
+        return final_results
+
     def connect(
         self, source: Document, target: Document, label: str, metadata: dict = None
     ):
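Two numeric details of the fuzzy path above are easy to miss: the trigram threshold is `int(0.3 * number_of_query_trigrams)`, and the final ordering sorts by edit distance first and FTS rank second. The following sketch reproduces just that arithmetic with hypothetical helpers (`trigram_threshold`, `rank_candidates`) and invented distances and ranks.

```python
def trigram_threshold(query: str, ratio: float = 0.3) -> int:
    """Minimum number of shared distinct trigrams, truncated toward zero."""
    text = query.lower()
    distinct = {text[i:i + 3] for i in range(len(text) - 2)}
    return int(len(distinct) * ratio)


def rank_candidates(
    distances: dict[str, int],
    fts_rank: dict[str, float],
    fuzziness: int,
    top_k: int,
) -> list[tuple[str, float]]:
    """Keep candidates within `fuzziness`, then sort by (distance, fts_rank)."""
    kept = [
        (dist, fts_rank.get(doc_id, 0), doc_id)  # 0 for documents FTS did not match
        for doc_id, dist in distances.items()
        if dist <= fuzziness
    ]
    kept.sort(key=lambda c: (c[0], c[1]))
    return [(doc_id, float(dist)) for dist, _, doc_id in kept[:top_k]]


# "python" has 4 distinct trigrams, "cat" has 1.
print(trigram_threshold("python"), trigram_threshold("cat"))

ranked = rank_candidates(
    distances={"a": 2, "b": 0, "c": 5, "d": 2},
    fts_rank={"a": -1.5, "d": -3.0},
    fuzziness=2,
    top_k=3,
)
print(ranked)
```

A threshold of 0, as for "cat", makes `_get_trigram_candidates` in the diff return an empty set, so very short queries rely on the FTS candidates alone. The tie-break assumes that lower FTS rank values mean better matches, which is consistent with the "less relevant" comment on the default of 0.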
{beaver_db-0.9.1 → beaver_db-0.10.0}/beaver/core.py
@@ -36,6 +36,7 @@ class BeaverDB:
         self._create_list_table()
         self._create_collections_table()
         self._create_fts_table()
+        self._create_trigrams_table()
         self._create_edges_table()
         self._create_versions_table()
         self._create_dict_table()
@@ -139,6 +140,27 @@ class BeaverDB:
|
|
|
139
140
|
"""
|
|
140
141
|
)
|
|
141
142
|
|
|
143
|
+
def _create_trigrams_table(self):
|
|
144
|
+
"""Creates the table for the fuzzy search trigram index."""
|
|
145
|
+
with self._conn:
|
|
146
|
+
self._conn.execute(
|
|
147
|
+
"""
|
|
148
|
+
CREATE TABLE IF NOT EXISTS beaver_trigrams (
|
|
149
|
+
collection TEXT NOT NULL,
|
|
150
|
+
item_id TEXT NOT NULL,
|
|
151
|
+
field_path TEXT NOT NULL,
|
|
152
|
+
trigram TEXT NOT NULL,
|
|
153
|
+
PRIMARY KEY (collection, field_path, trigram, item_id)
|
|
154
|
+
)
|
|
155
|
+
"""
|
|
156
|
+
)
|
|
157
|
+
self._conn.execute(
|
|
158
|
+
"""
|
|
159
|
+
CREATE INDEX IF NOT EXISTS idx_trigram_lookup
|
|
160
|
+
ON beaver_trigrams (collection, trigram, field_path)
|
|
161
|
+
"""
|
|
162
|
+
)
|
|
163
|
+
|
|
142
164
|
def _create_edges_table(self):
|
|
143
165
|
"""Creates the table for storing relationships between documents."""
|
|
144
166
|
with self._conn:
|
|
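To make the purpose of `beaver_trigrams` concrete, here is a rough sketch of how a table with this shape could support fuzzy lookup. Only the `CREATE TABLE` and `CREATE INDEX` statements come from the hunk above. The helpers `trigrams`, `index_field`, and `fuzzy_candidates`, and the scoring by shared-trigram count, are assumptions for illustration; the diff section shown here does not reveal how the library populates or queries the table.

```python
import sqlite3


def trigrams(text):
    """Return the set of lowercase 3-character substrings of text."""
    normalized = text.lower()
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def index_field(conn, collection, item_id, field_path, text):
    """Hypothetical helper: store one row per trigram of a field value."""
    rows = [(collection, item_id, field_path, t) for t in trigrams(text)]
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO beaver_trigrams "
            "(collection, item_id, field_path, trigram) VALUES (?, ?, ?, ?)",
            rows,
        )


def fuzzy_candidates(conn, collection, query):
    """Hypothetical helper: rank items by how many query trigrams they share."""
    grams = sorted(trigrams(query))
    if not grams:
        return []
    placeholders = ",".join("?" for _ in grams)
    sql = (
        "SELECT item_id, COUNT(DISTINCT trigram) AS shared "
        "FROM beaver_trigrams "
        f"WHERE collection = ? AND trigram IN ({placeholders}) "
        "GROUP BY item_id ORDER BY shared DESC, item_id"
    )
    return conn.execute(sql, (collection, *grams)).fetchall()


conn = sqlite3.connect(":memory:")
# Schema and index as they appear in the diff above.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS beaver_trigrams (
        collection TEXT NOT NULL,
        item_id TEXT NOT NULL,
        field_path TEXT NOT NULL,
        trigram TEXT NOT NULL,
        PRIMARY KEY (collection, field_path, trigram, item_id)
    )
    """
)
conn.execute(
    """
    CREATE INDEX IF NOT EXISTS idx_trigram_lookup
    ON beaver_trigrams (collection, trigram, field_path)
    """
)

index_field(conn, "docs", "a", "title", "database")
index_field(conn, "docs", "b", "title", "datum")

# A misspelled query still shares most of its trigrams with "database".
hits = fuzzy_candidates(conn, "docs", "databse")
```

In this sketch the shared-trigram count is only a candidate filter. A real fuzzy search would likely refine the candidates with an edit-distance or similarity threshold, which is outside what this section of the diff shows.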
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: beaver-db
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.10.0
|
|
4
4
|
Summary: Fast, embedded, and multi-modal DB based on SQLite for AI-powered applications.
|
|
5
5
|
Requires-Python: >=3.13
|
|
6
6
|
Description-Content-Type: text/markdown
|
|
@@ -19,20 +19,21 @@ A fast, single-file, multi-modal database for Python, built with the standard `s
|
|
|
19
19
|
|
|
20
20
|
`beaver` is built with a minimalistic philosophy for small, local use cases where a full-blown database server would be overkill.
|
|
21
21
|
|
|
22
|
-
- **Minimalistic
|
|
23
|
-
- **
|
|
22
|
+
- **Minimalistic**: Uses only Python's standard libraries (`sqlite3`) and `numpy`/`scipy`.
|
|
23
|
+
- **Schemaless**: Flexible data storage without rigid schemas across all modalities.
|
|
24
|
+
- **Synchronous, Multi-Process, and Thread-Safe**: Designed for simplicity and safety in multi-threaded and multi-process environments.
|
|
24
25
|
- **Built for Local Applications**: Perfect for local AI tools, RAG prototypes, chatbots, and desktop utilities that need persistent, structured data without network overhead.
|
|
25
26
|
- **Fast by Default**: It's built on SQLite, which is famously fast and reliable for local applications. The vector search is accelerated with an in-memory k-d tree.
|
|
26
27
|
- **Standard Relational Interface**: While `beaver` provides high-level features, you can always use the same SQLite file for normal relational tasks with standard SQL.
|
|
27
28
|
|
|
28
29
|
## Core Features
|
|
29
30
|
|
|
30
|
-
- **High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture.
|
|
31
|
+
- **Sync/Async High-Efficiency Pub/Sub**: A powerful, thread and process-safe publish-subscribe system for real-time messaging with a fan-out architecture. Sync by default, but with an `as_async` wrapper for async applications.
|
|
31
32
|
- **Namespaced Key-Value Dictionaries**: A Pythonic, dictionary-like interface for storing any JSON-serializable object within separate namespaces with optional TTL for cache implementations.
|
|
32
33
|
- **Pythonic List Management**: A fluent, Redis-like interface for managing persistent, ordered lists.
|
|
33
34
|
- **Persistent Priority Queue**: A high-performance, persistent queue that always returns the item with the highest priority, perfect for task management.
|
|
34
35
|
- **Efficient Vector Storage & Search**: Store vector embeddings and perform fast approximate nearest neighbor searches using an in-memory k-d tree.
|
|
35
|
-
- **Full-Text Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine.
|
|
36
|
+
- **Full-Text and Fuzzy Search**: Automatically index and search through document metadata using SQLite's powerful FTS5 engine, enhanced with optional fuzzy search.
|
|
36
37
|
- **Graph Traversal**: Create relationships between documents and traverse the graph to find neighbors or perform multi-hop walks.
|
|
37
38
|
- **Single-File & Portable**: All data is stored in a single SQLite file, making it incredibly easy to move, back up, or embed in your application.
|
|
38
39
|
|
|
@@ -190,17 +191,18 @@ For more in-depth examples, check out the scripts in the `examples/` directory:
|
|
|
190
191
|
- [`examples/fts.py`](examples/fts.py): A detailed look at full-text search, including targeted searches on specific metadata fields.
|
|
191
192
|
- [`examples/graph.py`](examples/graph.py): Shows how to create relationships between documents and perform multi-hop graph traversals.
|
|
192
193
|
- [`examples/pubsub.py`](examples/pubsub.py): A demonstration of the synchronous, thread-safe publish/subscribe system in a single process.
|
|
194
|
+
- [`examples/async_pubsub.py`](examples/async_pubsub.py): A demonstration of the asynchronous wrapper for the publish/subscribe system.
|
|
193
195
|
- [`examples/publisher.py`](examples/publisher.py) and [`examples/subscriber.py`](examples/subscriber.py): A pair of examples demonstrating inter-process message passing with the publish/subscribe system.
|
|
194
196
|
- [`examples/cache.py`](examples/cache.py): A practical example of using a dictionary with TTL as a cache for API calls.
|
|
195
197
|
- [`examples/rerank.py`](examples/rerank.py): Shows how to combine results from vector and text search for more refined results.
|
|
198
|
+
- [`examples/fuzzy.py`](examples/fuzzy.py): Demonstrates fuzzy search capabilities for text search.
|
|
196
199
|
|
|
197
200
|
## Roadmap
|
|
198
201
|
|
|
199
202
|
These are some of the features and improvements planned for future releases:
|
|
200
203
|
|
|
201
|
-
- **Fuzzy search**: Implement fuzzy matching capabilities for text search.
|
|
202
204
|
- **Faster ANN**: Explore integrating more advanced ANN libraries like `faiss` for improved vector search performance.
|
|
203
|
-
- **Async API**: Comprehensive async support with on-demand wrappers for all collections.
|
|
205
|
+
- **Full Async API**: Comprehensive async support with on-demand wrappers for all collections.
|
|
204
206
|
|
|
205
207
|
Check out the [roadmap](roadmap.md) for a detailed list of upcoming features and design ideas.
|
|
206
208
|
|
|
File without changes
|
|