pysnapdb 0.11.0__tar.gz
- pysnapdb-0.11.0/LICENSE +21 -0
- pysnapdb-0.11.0/PKG-INFO +436 -0
- pysnapdb-0.11.0/README.md +404 -0
- pysnapdb-0.11.0/pyproject.toml +61 -0
- pysnapdb-0.11.0/pysnapdb.egg-info/PKG-INFO +436 -0
- pysnapdb-0.11.0/pysnapdb.egg-info/SOURCES.txt +27 -0
- pysnapdb-0.11.0/pysnapdb.egg-info/dependency_links.txt +1 -0
- pysnapdb-0.11.0/pysnapdb.egg-info/requires.txt +7 -0
- pysnapdb-0.11.0/pysnapdb.egg-info/top_level.txt +1 -0
- pysnapdb-0.11.0/setup.cfg +4 -0
- pysnapdb-0.11.0/snapdb/__init__.py +35 -0
- pysnapdb-0.11.0/snapdb/columnar.py +1373 -0
- pysnapdb-0.11.0/snapdb/core.py +1096 -0
- pysnapdb-0.11.0/snapdb/document_store.py +271 -0
- pysnapdb-0.11.0/snapdb/index.py +128 -0
- pysnapdb-0.11.0/snapdb/metrics.py +133 -0
- pysnapdb-0.11.0/snapdb/query.py +114 -0
- pysnapdb-0.11.0/snapdb/wal.py +122 -0
- pysnapdb-0.11.0/tests/test_delta_encoding.py +141 -0
- pysnapdb-0.11.0/tests/test_dict_encoding.py +131 -0
- pysnapdb-0.11.0/tests/test_document_store.py +209 -0
- pysnapdb-0.11.0/tests/test_features.py +240 -0
- pysnapdb-0.11.0/tests/test_for_encoding.py +263 -0
- pysnapdb-0.11.0/tests/test_numpy_accel.py +210 -0
- pysnapdb-0.11.0/tests/test_optimizations.py +255 -0
- pysnapdb-0.11.0/tests/test_persistence.py +146 -0
- pysnapdb-0.11.0/tests/test_schema_fast.py +200 -0
- pysnapdb-0.11.0/tests/test_snapdb.py +339 -0
- pysnapdb-0.11.0/tests/test_v02.py +124 -0
pysnapdb-0.11.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Hussain Alsaibai
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
pysnapdb-0.11.0/PKG-INFO
ADDED
|
@@ -0,0 +1,436 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: pysnapdb
|
|
3
|
+
Version: 0.11.0
|
|
4
|
+
Summary: Extremely Lightweight Lightning-Fast In-Memory Database for Python
|
|
5
|
+
Author-email: "H. A. Alsaibai" <hussain.alsaibai@gmail.com>
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/hussain-alsaibai/snapdb
|
|
8
|
+
Project-URL: Repository, https://github.com/hussain-alsaibai/snapdb
|
|
9
|
+
Project-URL: Issues, https://github.com/hussain-alsaibai/snapdb/issues
|
|
10
|
+
Keywords: database,in-memory,embedded,columnar,mmap,zero-copy,pure-python
|
|
11
|
+
Classifier: Development Status :: 4 - Beta
|
|
12
|
+
Classifier: Intended Audience :: Developers
|
|
13
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
14
|
+
Classifier: Operating System :: OS Independent
|
|
15
|
+
Classifier: Programming Language :: Python :: 3
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
21
|
+
Classifier: Topic :: Database :: Database Engines/Servers
|
|
22
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
23
|
+
Requires-Python: >=3.9
|
|
24
|
+
Description-Content-Type: text/markdown
|
|
25
|
+
License-File: LICENSE
|
|
26
|
+
Provides-Extra: dev
|
|
27
|
+
Requires-Dist: pytest>=7; extra == "dev"
|
|
28
|
+
Requires-Dist: ruff>=0.4; extra == "dev"
|
|
29
|
+
Provides-Extra: numpy
|
|
30
|
+
Requires-Dist: numpy>=1.21; extra == "numpy"
|
|
31
|
+
Dynamic: license-file
|
|
32
|
+
|
|
33
|
+
# SnapDB
|
|
34
|
+
|
|
35
|
+
**Extremely Lightweight, Lightning-Fast In-Memory Database for Python**
|
|
36
|
+
|
|
37
|
+
[](https://github.com/hussain-alsaibai/snapdb/actions/workflows/ci.yml)
|
|
38
|
+
[](https://pypi.org/project/snapdb/)
|
|
39
|
+
[](https://www.python.org/)
|
|
40
|
+
[](LICENSE)
|
|
41
|
+
|
|
42
|
+
A **zero-dependency, pure-Python** embedded database with a columnar analytics
|
|
43
|
+
engine and a row store, memory-mapped files, lightweight column compression, and
|
|
44
|
+
precompiled struct codecs — built for **maximum speed at minimum memory**.
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
pip install snapdb
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
## Contents
|
|
51
|
+
|
|
52
|
+
- [Key Innovations](#key-innovations)
|
|
53
|
+
- [Installation](#installation)
|
|
54
|
+
- [Quick Start](#quick-start)
|
|
55
|
+
- [Storage Modes](#storage-modes)
|
|
56
|
+
- [Dictionary Encoding](#dictionary-encoding-v040) · [Delta Encoding](#delta-encoding-v050)
|
|
57
|
+
- [Vectorized Filtering](#vectorized-filtering-v060) · [Auto-Indexing](#auto-indexing-v060) · [NumPy / Zero-Copy Export](#numpy--zero-copy-export-v060)
|
|
58
|
+
- [Benchmarks](#benchmarks)
|
|
59
|
+
- [Architecture](#architecture) · [Supported Types](#supported-types)
|
|
60
|
+
- [Development](#development)
|
|
61
|
+
- [Roadmap & Known Limitations](#roadmap--known-limitations)
|
|
62
|
+
- [License](#license)
|
|
63
|
+
|
|
64
|
+
## Key Innovations
|
|
65
|
+
|
|
66
|
+
- **Columnar engine** — column-oriented per-column `array.array` storage; full-scan aggregation **~27× faster than SQLite** at a fraction of the memory
|
|
67
|
+
- **NumPy-accelerated aggregates** *(optional, v0.8.0)* — when NumPy is installed, `aggregate()` runs over the zero-copy column buffer (**~530M rows/s**, on par with pandas); pure-Python remains the zero-dependency default
|
|
68
|
+
- **NumPy-accelerated filters** *(optional, v0.9.0)* — `select_where()` builds masks vectorially; `count_where()` (filtered count, no row materialization) hits **~314M rows/s** on numeric predicates (~166× the pure-Python path)
|
|
69
|
+
- **Lowest memory footprint of the field** — ~2.2 MB / 100K rows vs SQLite 2.9 MB, pandas 11 MB, plain `dict` 22 MB ([benchmarks](#benchmarks))
|
|
70
|
+
- **Vectorized multi-condition filters** *(v0.6.0)* — `select_where()` combines per-column bitmasks with C-speed big-integer `AND`/`OR` (**~2× faster** selective `WHERE`)
|
|
71
|
+
- **O(1) delta-encoded reads** *(v0.6.0)* — lazy reconstruction cache turns delta scans from O(n²) into O(n) (orders of magnitude faster)
|
|
72
|
+
- **Auto-indexing** *(v0.6.0)* — `auto_index=True` builds a hash index for a column once it's queried often enough
|
|
73
|
+
- **Zero-copy NumPy export** *(v0.6.0)* — `to_numpy()` / `buffer()` (PEP 688) share raw column memory with NumPy without copying
|
|
74
|
+
- **Dictionary encoding** — transparent per-column dictionary for low-cardinality strings: **~3× memory reduction** (v0.4.0)
|
|
75
|
+
- **Delta encoding** — base + deltas for monotonic columns (timestamps, IDs) (v0.5.0)
|
|
76
|
+
- **Bit-packed booleans** — Python `int` bitmask: ~8× smaller than `array('b')`
|
|
77
|
+
- **Hash index** — `create_index()` / `lookup()` / `find()`, **kept in sync** on every insert / update / delete
|
|
78
|
+
- **Durable writes** — write-ahead log with real transaction rollback; CDC stream; Prometheus-style metrics
|
|
79
|
+
- **Zero dependencies** — stdlib only (NumPy is optional, used for zero-copy export and for the accelerated aggregates and filters)
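The bit-packed boolean idea from the list above can be sketched with a plain Python `int` used as a bitmask. This is a minimal illustration under assumptions, not SnapDB's implementation: the `BitColumn` class and its method names are hypothetical.

```python
class BitColumn:
    """Hypothetical bit-packed boolean column: one bit per row in a Python int."""

    def __init__(self):
        self.bits = 0    # bit i holds the value of row i
        self.length = 0  # number of rows stored

    def append(self, value: bool) -> None:
        if value:
            self.bits |= 1 << self.length
        self.length += 1

    def get(self, row: int) -> bool:
        if not 0 <= row < self.length:
            raise IndexError(row)
        return bool((self.bits >> row) & 1)

    def count_true(self) -> int:
        # Count the set bits of the whole column (works on Python 3.9+)
        return bin(self.bits).count("1")


col = BitColumn()
for i in range(10):
    col.append(i % 2 == 0)  # True on even rows
```

Each row costs one bit instead of one byte, which is where a figure like "~8× smaller than `array('b')`" would come from.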
|
|
80
|
+
|
|
81
|
+
## Installation
|
|
82
|
+
|
|
83
|
+
```bash
|
|
84
|
+
pip install snapdb
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
Or from source:
|
|
88
|
+
```bash
|
|
89
|
+
git clone https://github.com/hussain-alsaibai/snapdb.git
|
|
90
|
+
cd snapdb
|
|
91
|
+
pip install -e .
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
## Quick Start
|
|
95
|
+
|
|
96
|
+
```python
|
|
97
|
+
from snapdb import SnapDB, Schema, ColumnDef
|
|
98
|
+
|
|
99
|
+
# Define schema
|
|
100
|
+
schema = Schema([
|
|
101
|
+
ColumnDef("id", "i32"),
|
|
102
|
+
ColumnDef("email", "bytes:32"),
|
|
103
|
+
ColumnDef("score", "f32"),
|
|
104
|
+
ColumnDef("active", "bool"),
|
|
105
|
+
])
|
|
106
|
+
|
|
107
|
+
# Create database (columnar mode for analytics)
|
|
108
|
+
db = SnapDB("data.snap", schema, storage_type="columnar")
|
|
109
|
+
|
|
110
|
+
# Insert
|
|
111
|
+
db.insert({"id": 1, "email": "alice@test.com", "score": 100.0, "active": True})
|
|
112
|
+
|
|
113
|
+
# Fast columnar aggregate (~59M rows/sec full scan)
|
|
114
|
+
total = db.aggregate("score", "sum")
|
|
115
|
+
|
|
116
|
+
# Vectorized multi-condition filter (v0.6.0)
|
|
117
|
+
hot = db.select_where([("score", ">", 90.0), ("active", "==", True)])
|
|
118
|
+
|
|
119
|
+
# Create index for O(1) lookups
|
|
120
|
+
db.create_index("id")
|
|
121
|
+
result = db.lookup("id", 1)
|
|
122
|
+
|
|
123
|
+
# Batch insert for speed
|
|
124
|
+
db.batch_insert([
|
|
125
|
+
{"id": i, "email": f"user_{i}@test.com", "score": i * 10.0, "active": i % 2 == 0}
|
|
126
|
+
for i in range(1000)
|
|
127
|
+
])
|
|
128
|
+
|
|
129
|
+
# Metrics collection (pass a Metrics instance)
|
|
130
|
+
from snapdb import Metrics
|
|
131
|
+
db = SnapDB("data.snap", schema, metrics=Metrics())
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
## Storage Modes
|
|
135
|
+
|
|
136
|
+
| Mode | Best For | Strengths |
|
|
137
|
+
|------|---------|-----------|
|
|
138
|
+
| `storage_type="columnar"` | OLAP / analytics | Fast full-scan aggregation (~59M rows/s), vectorized filters, column compression, lowest memory |
|
|
139
|
+
| `storage_type="row"` | OLTP / full-row point access | Zero-copy `get_raw()`, WAL transactions, hash indexes, CDC |
|
|
140
|
+
|
|
141
|
+
See [Benchmarks](#benchmarks) for measured throughput and memory.
|
|
142
|
+
|
|
143
|
+
## Dictionary Encoding (v0.4.0)
|
|
144
|
+
|
|
145
|
+
For columns with few unique string values (status, category, type, country), dictionary encoding reduces memory by **3×**:
|
|
146
|
+
|
|
147
|
+
```python
|
|
148
|
+
from snapdb import ColumnarTable
|
|
149
|
+
|
|
150
|
+
schema = [
|
|
151
|
+
("id", "i32"),
|
|
152
|
+
("status", "bytes:20"), # "active", "inactive", "pending" — 3 unique
|
|
153
|
+
("category", "bytes:20"), # e.g. "electronics", "books", "clothing" (5 unique values)
|
|
154
|
+
("score", "f32"),
|
|
155
|
+
]
|
|
156
|
+
|
|
157
|
+
# Enable dict encoding on low-cardinality columns
|
|
158
|
+
db = ColumnarTable("products", schema, dict_columns=["status", "category"])
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
| Metric | Raw | Dict-Encoded | Improvement |
|
|
162
|
+
|--------|-----|--------------|-------------|
|
|
163
|
+
| Memory (100K rows) | 4.0 MB | **1.34 MB** | **3.0× reduction** |
|
|
164
|
+
| Insert | 0.137s | 0.159s | ~15% overhead (acceptable) |
|
|
165
|
+
| Data integrity | — | ✅ 100% | Verified |
|
|
166
|
+
|
|
167
|
+
- **Transparent**: insert/query work with raw strings
|
|
168
|
+
- **Auto-fallback**: switches to raw when unique count > threshold (default 256)
|
|
169
|
+
- **Per-column**: specify which columns to encode via `dict_columns=[]`
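The technique behind these bullets can be sketched in a few lines: store each distinct string once and keep a one-byte code per row. This is an illustrative sketch only. `DictColumn`, `MAX_UNIQUE`, and the choice to raise on overflow are assumptions made for the example; the README says the real engine falls back to raw storage instead.

```python
from array import array

MAX_UNIQUE = 256  # assumed threshold; one unsigned byte can address 256 codes


class DictColumn:
    """Hypothetical dictionary-encoded column for low-cardinality values."""

    def __init__(self):
        self.values = []          # code -> original value
        self.code_of = {}         # original value -> code
        self.codes = array("B")   # one unsigned byte per row

    def append(self, value: bytes) -> None:
        code = self.code_of.get(value)
        if code is None:
            if len(self.values) >= MAX_UNIQUE:
                # A real engine would switch to raw storage here
                raise OverflowError("too many distinct values for 1-byte codes")
            code = len(self.values)
            self.values.append(value)
            self.code_of[value] = code
        self.codes.append(code)

    def get(self, row: int) -> bytes:
        return self.values[self.codes[row]]


status = DictColumn()
for s in [b"active", b"pending", b"active", b"inactive", b"active"]:
    status.append(s)
```

Callers still pass and receive the raw values, which is what makes the encoding transparent; only the per-row storage shrinks to one byte.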
|
|
170
|
+
|
|
171
|
+
## Delta Encoding (v0.5.0)
|
|
172
|
+
|
|
173
|
+
For monotonic columns (timestamps, auto-increment IDs, sequences), delta encoding reduces memory by storing differences instead of full values:
|
|
174
|
+
|
|
175
|
+
```python
|
|
176
|
+
from snapdb import ColumnarTable
|
|
177
|
+
|
|
178
|
+
schema = [
|
|
179
|
+
("id", "i32"),
|
|
180
|
+
("timestamp", "i64"), # Monotonic timestamps → delta-encoded
|
|
181
|
+
("seq", "u32"), # Auto-increment IDs → delta-encoded
|
|
182
|
+
("value", "f32"),
|
|
183
|
+
]
|
|
184
|
+
|
|
185
|
+
# Enable delta encoding on monotonic columns
|
|
186
|
+
db = ColumnarTable("events", schema, delta_columns=["timestamp", "seq"])
|
|
187
|
+
```
|
|
188
|
+
|
|
189
|
+
| Metric | Raw | Delta-Encoded | Improvement |
|
|
190
|
+
|--------|-----|---------------|-------------|
|
|
191
|
+
| Memory (100K rows) | 2.29 MB | **1.91 MB** | **1.2× reduction** |
|
|
192
|
+
| Insert | 0.128s | 0.148s | ~16% overhead |
|
|
193
|
+
| Data integrity | — | ✅ 100% | Verified |
|
|
194
|
+
|
|
195
|
+
- **Auto-detects**: samples first 50 rows for monotonicity
|
|
196
|
+
- **Auto-fallback**: switches to raw if non-monotonic data detected
|
|
197
|
+
- **Per-column**: specify which columns via `delta_columns=[]`
|
|
198
|
+
- **Auto-upgrade**: dynamically upgrades delta typecode if deltas overflow
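As a rough sketch of the base-plus-deltas idea, the functions below encode a monotonic integer column and rebuild it with a single running sum. The names `delta_encode` and `delta_decode` are hypothetical, and the fixed `"i"` typecode is a simplification (the README describes upgrading the typecode when deltas overflow).

```python
from array import array
from itertools import accumulate


def delta_encode(values):
    """Return (base, deltas) for a non-empty sequence of integers."""
    base = values[0]
    # Differences between neighbours are small for monotonic data,
    # so they fit a narrower typecode than the full values would need.
    deltas = array("i", (b - a for a, b in zip(values, values[1:])))
    return base, deltas


def delta_decode(base, deltas):
    """Rebuild all values in one O(n) pass with a running sum."""
    return list(accumulate(deltas, initial=base))


timestamps = [1_700_000_000_000 + 250 * i for i in range(5)]
base, deltas = delta_encode(timestamps)
restored = delta_decode(base, deltas)
```

Decoding the whole column once and caching the result, rather than re-summing from the base on every read, is the general shape of the O(n²) to O(n) improvement mentioned under Key Innovations.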
|
|
199
|
+
|
|
200
|
+
## Frame-of-Reference Encoding (v0.7.0)
|
|
201
|
+
|
|
202
|
+
For numeric columns with bounded ranges (ages 0-120, scores 0-100, ratings 1-5), Frame-of-Reference (FOR) stores the minimum value once, then bit-packs each value's offset from that minimum into the fewest bits that cover the range, giving roughly **4–8× memory reduction**:
|
|
203
|
+
|
|
204
|
+
```python
|
|
205
|
+
from snapdb import ColumnarTable
|
|
206
|
+
|
|
207
|
+
schema = [
|
|
208
|
+
("user_id", "i32"),
|
|
209
|
+
("age", "i32"), # Ages 18-65 → 6 bits per value
|
|
210
|
+
("rating", "i32"), # Ratings 1-5 → 3 bits per value
|
|
211
|
+
("score", "i32"), # Scores 0-100 → 7 bits per value
|
|
212
|
+
]
|
|
213
|
+
|
|
214
|
+
# Enable FOR encoding on bounded numeric columns
|
|
215
|
+
db = ColumnarTable("survey", schema, for_columns=["age", "rating", "score"])
|
|
216
|
+
```
|
|
217
|
+
|
|
218
|
+
| Metric | Raw | FOR-Encoded | Improvement |
|
|
219
|
+
|--------|-----|-------------|-------------|
|
|
220
|
+
| Memory (100K rows, range 0-100) | 400 KB | **~88 KB** | **4.5× reduction** |
|
|
221
|
+
| Memory (100K rows, range 0-120) | 400 KB | **~103 KB** | **3.9× reduction** |
|
|
222
|
+
| Insert overhead | — | ~10% | Sampling cost |
|
|
223
|
+
| Data integrity | — | ✅ 100% | Verified |
|
|
224
|
+
|
|
225
|
+
- **Auto-detects**: samples first N rows (default 50) to measure range
|
|
226
|
+
- **Auto-fallback**: falls back to raw storage when the value range needs more than 16 bits (packing would save less than 50%)
|
|
227
|
+
- **Per-column**: specify which columns via `for_columns=[]`
|
|
228
|
+
- **Bit-packed**: Python `int` bitmask (same technique as v0.3.2 booleans)
|
|
229
|
+
- **Transparent**: reads return full values, no API changes
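A minimal sketch of Frame-of-Reference packing, assuming a single Python `int` as the bit container. `for_encode` and `for_get` are hypothetical helper names, and the tuple layout is chosen for the example rather than taken from SnapDB.

```python
def for_encode(values):
    """Pack integers as (minimum, bit width, row count, packed int)."""
    lo = min(values)
    # Bits needed to represent the largest offset from the minimum
    width = max(1, (max(values) - lo).bit_length())
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - lo) << (i * width)
    return lo, width, len(values), packed


def for_get(encoded, row):
    """Read one value back by shifting and masking its bit slot."""
    lo, width, count, packed = encoded
    if not 0 <= row < count:
        raise IndexError(row)
    mask = (1 << width) - 1
    return lo + ((packed >> (row * width)) & mask)


ages = [18, 65, 42, 30]
encoded_ages = for_encode(ages)      # offsets 0..47 fit in 6 bits
ratings = [1, 5, 3, 4]
encoded_ratings = for_encode(ratings)  # offsets 0..4 fit in 3 bits
```

The computed widths agree with the bit counts in the schema comments above (6 bits for ages 18-65, 3 bits for ratings 1-5).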
|
|
230
|
+
|
|
231
|
+
## Vectorized Filtering (v0.6.0, NumPy-accelerated in v0.9.0)
|
|
232
|
+
|
|
233
|
+
`select_where()` evaluates each condition column-at-a-time into a mask and
|
|
234
|
+
combines them with `AND`/`OR`. With NumPy installed the masks are built
|
|
235
|
+
vectorially over the column buffers (pure-Python big-integer masks otherwise).
|
|
236
|
+
For filtered counts, `count_where()` skips row materialization entirely and runs
|
|
237
|
+
at **~314M rows/s** on numeric predicates (~166× the pure-Python path).
|
|
238
|
+
|
|
239
|
+
```python
|
|
240
|
+
db = SnapDB("events.snap", schema, storage_type="columnar")
|
|
241
|
+
|
|
242
|
+
# (column, op, value) triples — op ∈ eq/ne/gt/gte/lt/lte/in/between
|
|
243
|
+
rows = db.select_where(
|
|
244
|
+
[("age", ">", 30), ("status", "==", b"active")],
|
|
245
|
+
columns=["id", "age"], limit=100,
|
|
246
|
+
)
|
|
247
|
+
|
|
248
|
+
# OR semantics, ranges and membership
|
|
249
|
+
db.select_where([("age", "<", 18), ("age", ">", 65)], combine="or")
|
|
250
|
+
db.select_where([("age", "between", (30, 40)), ("country", "in", [b"US", b"CA"])])
|
|
251
|
+
|
|
252
|
+
# dict shorthand
|
|
253
|
+
db.select_where({"status": b"active", "age": {"gte": 21}})
|
|
254
|
+
|
|
255
|
+
# fast filtered count — no rows materialized (NumPy-accelerated)
|
|
256
|
+
db.count_where([("age", ">", 30), ("temp", "<", 35.0)])
|
|
257
|
+
```
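The pure-Python path described above, one mask per condition combined with `AND`/`OR`, can be sketched with big integers as follows. This is an assumption-laden illustration: `build_mask` and `rows_from_mask` are hypothetical helpers, and the predicates are plain lambdas rather than SnapDB's `(column, op, value)` triples.

```python
def build_mask(column, predicate):
    """Return an int whose bit i is set when row i satisfies the predicate."""
    mask = 0
    for row, value in enumerate(column):
        if predicate(value):
            mask |= 1 << row
    return mask


def rows_from_mask(mask):
    """List the row numbers of the set bits, lowest first."""
    rows = []
    row = 0
    while mask:
        if mask & 1:
            rows.append(row)
        mask >>= 1
        row += 1
    return rows


age = [25, 41, 17, 70, 33]
active = [True, True, False, True, False]

age_mask = build_mask(age, lambda v: v > 30)    # rows 1, 3, 4
active_mask = build_mask(active, lambda v: v)   # rows 0, 1, 3

both = rows_from_mask(age_mask & active_mask)    # AND of the two conditions
either = rows_from_mask(age_mask | active_mask)  # OR of the two conditions
```

Combining masks is a single big-integer operation regardless of row count, so the per-row Python work is limited to building each mask once.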
|
|
258
|
+
|
|
259
|
+
## Auto-Indexing (v0.6.0)
|
|
260
|
+
|
|
261
|
+
Let SnapDB index the columns you actually query, so you never forget a
|
|
262
|
+
`create_index()` for a hot path:
|
|
263
|
+
|
|
264
|
+
```python
|
|
265
|
+
db = SnapDB("users.snap", schema, auto_index=True, auto_index_threshold=8)
|
|
266
|
+
# after the 8th equality query on a column, a hash index is built automatically
|
|
267
|
+
for uid in stream:
|
|
268
|
+
db.find(email=uid) # transparently O(1) once the index materializes
|
|
269
|
+
```
|
|
270
|
+
|
|
271
|
+
`find()` also works **without** any index (scan fallback), so correctness never
|
|
272
|
+
depends on remembering to index.
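One way such threshold-based indexing could work is sketched below: count equality lookups, scan until the count reaches the threshold, then build a hash index and use it from then on. `AutoIndexedColumn` is a hypothetical class, and whether SnapDB builds the index on or after the threshold query is not something this sketch claims to reproduce.

```python
from collections import defaultdict


class AutoIndexedColumn:
    """Hypothetical column that builds a hash index after enough lookups."""

    def __init__(self, values, threshold=8):
        self.values = list(values)
        self.threshold = threshold
        self.queries = 0
        self.index = None  # stays None until the threshold is reached

    def _build_index(self):
        index = defaultdict(list)
        for row, value in enumerate(self.values):
            index[value].append(row)
        self.index = dict(index)

    def find(self, key):
        """Return the row numbers whose value equals key."""
        self.queries += 1
        if self.index is None and self.queries >= self.threshold:
            self._build_index()
        if self.index is not None:
            return list(self.index.get(key, []))  # hash lookup
        # Scan fallback: correct without any index, just slower
        return [row for row, value in enumerate(self.values) if value == key]


emails = AutoIndexedColumn([b"a@test.com", b"b@test.com", b"a@test.com"], threshold=3)
scan_result = emails.find(b"a@test.com")     # query 1: full scan
emails.find(b"b@test.com")                   # query 2: full scan
indexed_result = emails.find(b"a@test.com")  # query 3: index built, then used
```

Both paths return the same rows, which mirrors the point that correctness does not depend on an index existing.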
|
|
273
|
+
|
|
274
|
+
## NumPy / Zero-Copy Export (v0.6.0)
|
|
275
|
+
|
|
276
|
+
Hand raw column memory to NumPy without copying (PEP 688 buffer protocol). NumPy
|
|
277
|
+
is an **optional** dependency — only needed if you call these methods.
|
|
278
|
+
|
|
279
|
+
```python
|
|
280
|
+
col = db.to_numpy("temperature") # safe copy (works for any column)
|
|
281
|
+
view = db.to_numpy("temperature", zero_copy=True) # shares memory, no copy
|
|
282
|
+
mv = db.column_buffer("temperature") # raw memoryview for advanced use
|
|
283
|
+
```
|
|
284
|
+
|
|
285
|
+
Plain numeric columns export a true zero-copy view; encoded columns
|
|
286
|
+
(dictionary/delta) transparently fall back to a materialized copy.
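The difference between a zero-copy view and a materialized copy can be shown with the standard library alone. The sketch below uses `array.array` and `memoryview` as stand-ins; it does not call SnapDB or NumPy, and the variable names are illustrative.

```python
from array import array

temperature = array("f", [20.5, 21.0, 19.5])

view = memoryview(temperature)  # shares the array's buffer, no copy
copy = array("f", temperature)  # independent, materialized copy

# Writing through the original is visible in the view but not in the copy
temperature[0] = 99.0
seen_by_view = view[0]
seen_by_copy = copy[0]

view.release()  # release the buffer export once the view is no longer needed
```

With NumPy installed, `numpy.frombuffer` can wrap a buffer like this without copying. Note that while a view is exported, the underlying `array` cannot be resized, which is one reason a safe copy is a sensible default.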
|
|
287
|
+
|
|
288
|
+
## Benchmarks
|
|
289
|
+
|
|
290
|
+
SnapDB's headline strength is memory efficiency — the columnar store is the
|
|
291
|
+
lightest engine in this comparison while staying fully analytical:
|
|
292
|
+
|
|
293
|
+
<p align="center">
|
|
294
|
+
<img src="docs/memory-efficiency.svg" alt="Memory footprint for 100,000 rows: SnapDB columnar 2.2 MB, sqlite3 in-memory 2.9 MB, pandas 11.0 MB, dict baseline 22.5 MB — lower is better" width="720">
|
|
295
|
+
</p>
|
|
296
|
+
|
|
297
|
+
<p align="center"><em>~5× lighter than pandas and ~10× lighter than a plain <code>dict</code> — with zero dependencies.</em></p>
|
|
298
|
+
|
|
299
|
+
Reproduce locally (numbers below are from the environment noted in the table):
|
|
300
|
+
|
|
301
|
+
```bash
|
|
302
|
+
python benchmarks/bench_suite.py --rows 100000 --markdown bench.md
|
|
303
|
+
```
|
|
304
|
+
|
|
305
|
+
<!-- BENCH:START -->
|
|
306
|
+
_100,000 rows · 50,000 point reads · best of 5 · Python 3.13 · win32 (NumPy installed → accelerated aggregate). Higher is better except Memory (lower is better)._
|
|
307
|
+
|
|
308
|
+
| Workload | Unit | SnapDB (columnar) | SnapDB (row) | sqlite3 (:memory:) | pandas | dict (baseline) |
|
|
309
|
+
|---|---|---|---|---|---|---|
|
|
310
|
+
| Bulk insert | rows/s | 467,309 | 287,230 | 770,788 | 794,461 | 11,139,083 |
|
|
311
|
+
| Point read (PK) | ops/s | 86,243 | 87,836 | 370,698 | 32,296 | 5,494,807 |
|
|
312
|
+
| Full scan + SUM | rows/s | 529,660,985 | 483,067 | 19,910,403 | 513,874,544 | 19,488,619 |
|
|
313
|
+
| 3-cond filter | rows/s | 2,259,928 | 470,223 | 11,842,168 | 19,827,894 | 13,811,773 |
|
|
314
|
+
| Memory footprint | MB | 2.2 | n/a | 2.9 | 11.0 | 22.5 |
|
|
315
|
+
<!-- BENCH:END -->
|
|
316
|
+
|
|
317
|
+
**Where SnapDB wins (honestly):**
|
|
318
|
+
|
|
319
|
+
- **Memory** — the columnar store is the **lightest** here: ~5× smaller than pandas and ~10× smaller than a plain `dict`, with zero dependencies.
|
|
320
|
+
- **Full-scan aggregation** — **on par with pandas (~530M rows/s)** and ~27× faster than in-memory SQLite. With NumPy installed, `aggregate()` runs over the zero-copy column buffer (issue #14); without NumPy the pure-Python path still does ~58M rows/s (~3× SQLite).
|
|
321
|
+
- **Embeddable** — a single mmap-backed file, no server, no C extensions.
|
|
322
|
+
|
|
323
|
+
**Where it doesn't (also honestly):** pandas still wins multi-condition
|
|
324
|
+
filtering (vectorized `WHERE` acceleration is the next item, [#14](https://github.com/hussain-alsaibai/snapdb/issues/14)), and SQLite's
|
|
325
|
+
B-tree wins indexed point reads. SnapDB targets the lightweight-embedded-
|
|
326
|
+
analytics niche. Encoding memory wins for low-cardinality / monotonic columns
|
|
327
|
+
are shown above.
|
|
328
|
+
|
|
329
|
+
> CI runs this suite on every push and publishes a fresh table to the workflow
|
|
330
|
+
> run summary (Actions → CI → Benchmark).
|
|
331
|
+
|
|
332
|
+
### Encoding memory (100K rows)
|
|
333
|
+
|
|
334
|
+
| Encoding | Raw | Encoded | Reduction |
|
|
335
|
+
|----------|-----|---------|-----------|
|
|
336
|
+
| Frame-of-Reference (bounded numeric) | 400 KB | **~88 KB** | **~4.5×** |
|
|
337
|
+
| Dictionary (low-cardinality strings) | 4.0 MB | **1.34 MB** | **~3.0×** |
|
|
338
|
+
| Delta (monotonic integers) | 2.29 MB | **1.91 MB** | **~1.2×** |
|
|
339
|
+
|
|
340
|
+
## Architecture
|
|
341
|
+
|
|
342
|
+
```
|
|
343
|
+
SnapDB
|
|
344
|
+
├── core.py — Slab storage, Schema, CRUD, WAL
|
|
345
|
+
├── columnar.py — column-oriented analytical engine
|
|
346
|
+
├── metrics.py — Prometheus-style metrics collector
|
|
347
|
+
├── index.py — Hash + multi-column indexes
|
|
348
|
+
├── query.py — SQL-like query builder
|
|
349
|
+
├── wal.py — Write-ahead log for transactions
|
|
350
|
+
└── document_store.py — MongoDB-style DocumentStore API
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
## Supported Types
|
|
354
|
+
|
|
355
|
+
| Type | Bytes | Use Case |
|
|
356
|
+
|------|-------|----------|
|
|
357
|
+
| `i8` / `u8` | 1 | Flags, small counters |
|
|
358
|
+
| `i16` / `u16` | 2 | IDs, ports |
|
|
359
|
+
| `i32` / `u32` | 4 | Integers, IDs |
|
|
360
|
+
| `i64` / `u64` | 8 | Timestamps, large IDs |
|
|
361
|
+
| `f32` | 4 | ML scores, prices |
|
|
362
|
+
| `f64` | 8 | Scientific, financial |
|
|
363
|
+
| `bool` | ~0.125 | Bit-packed bitmask |
|
|
364
|
+
| `bytes:N` | N | Strings, hashes, fixed data |
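The fixed-width types in the table map naturally onto the standard `struct` module. The sketch below shows how a row layout could be compiled once and reused for every row. The type-to-format mapping and the `compile_row_codec` helper are assumptions for illustration, not SnapDB's actual codec.

```python
import struct

# Assumed mapping from the type names above to struct format characters
FORMATS = {
    "i8": "b", "u8": "B", "i16": "h", "u16": "H",
    "i32": "i", "u32": "I", "i64": "q", "u64": "Q",
    "f32": "f", "f64": "d",
    "bool": "?",  # one byte per value here; bit-packing is a columnar technique
}


def compile_row_codec(columns):
    """Build one precompiled little-endian Struct for a list of (name, type)."""
    parts = []
    for _name, type_name in columns:
        if type_name.startswith("bytes:"):
            parts.append(type_name.split(":", 1)[1] + "s")  # e.g. "32s"
        else:
            parts.append(FORMATS[type_name])
    return struct.Struct("<" + "".join(parts))


codec = compile_row_codec([
    ("id", "i32"), ("email", "bytes:32"), ("score", "f32"), ("active", "bool"),
])
raw = codec.pack(1, b"alice@test.com", 100.0, True)
row = codec.unpack(raw)
email = row[1].rstrip(b"\x00")  # fixed-width bytes come back NUL-padded
```

Compiling the `Struct` once avoids re-parsing the format string on every insert or read, and the fixed row size makes offset arithmetic in a memory-mapped file straightforward.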
|
|
365
|
+
|
|
366
|
+
## Development
|
|
367
|
+
|
|
368
|
+
```bash
|
|
369
|
+
# Install with dev + optional extras
|
|
370
|
+
pip install -e ".[dev,numpy]"
|
|
371
|
+
|
|
372
|
+
# Lint (same config CI uses)
|
|
373
|
+
ruff check .
|
|
374
|
+
|
|
375
|
+
# Unit tests
|
|
376
|
+
pytest tests/ -q
|
|
377
|
+
|
|
378
|
+
# Legacy script-style suites (encoding/codec checks)
|
|
379
|
+
python tests/test_delta_encoding.py
|
|
380
|
+
python tests/test_dict_encoding.py
|
|
381
|
+
|
|
382
|
+
# Benchmark suite (writes a Markdown table you can drop into the README)
|
|
383
|
+
python benchmarks/bench_suite.py --rows 100000 --json bench.json --markdown bench.md
|
|
384
|
+
```
|
|
385
|
+
|
|
386
|
+
Continuous integration (`.github/workflows/ci.yml`) runs ruff, the test matrix
|
|
387
|
+
on Linux (3.9–3.13) and Windows, and the benchmark on every push and PR.
|
|
388
|
+
|
|
389
|
+
## Version History
|
|
390
|
+
|
|
391
|
+
- **v0.11.0** — NumPy-accelerated string filtering:
|
|
392
|
+
- `select_where()`/`count_where()` on **dict-encoded** string columns compare integer dict codes via NumPy for `eq`/`ne`/`in` instead of per-row string comparison — **~300×+** faster (dict `==` count ~969M rows/s); a mixed numeric+string filtered count now runs ~143× faster. Exact parity verified; ordering ops and non-dict bytes columns keep the Python path
|
|
393
|
+
- **v0.10.0** — Fast row-store bulk insert ([#13](https://github.com/hussain-alsaibai/snapdb/issues/13)):
|
|
394
|
+
- `batch_insert()` now grows the backing file in a **single** truncate + remap for the whole batch instead of one per slab — **~26× faster** (100K rows: ~5.8s → ~0.29s, now in the same ballpark as SQLite/pandas). On-disk format and durability guarantees unchanged
|
|
395
|
+
- **v0.9.0** — NumPy-accelerated filters ([#14](https://github.com/hussain-alsaibai/snapdb/issues/14)):
|
|
396
|
+
- `select_where()` builds condition masks vectorially over the column buffers when NumPy is installed (~2× faster); `use_numpy=False` forces the pure-Python path
|
|
397
|
+
- New `count_where()` — filtered row count with no materialization, **~314M rows/s** on numeric predicates (~166×). Exact parity with the pure-Python path verified
|
|
398
|
+
- Bytes/encoded conditions fall back to the Python mask; mixed queries still accelerate their numeric conditions
|
|
399
|
+
- **v0.8.0** — Optional NumPy-accelerated aggregates ([#14](https://github.com/hussain-alsaibai/snapdb/issues/14)):
|
|
400
|
+
- `aggregate()` runs `sum`/`min`/`max`/`avg` over the zero-copy column buffer with NumPy when it's installed — **~13–27× faster** (full-scan SUM ~530M rows/s, on par with pandas)
|
|
401
|
+
- Auto-enabled when NumPy is present; `use_numpy=False` forces the pure-Python path; exact parity verified (integers exact, floats within tolerance)
|
|
402
|
+
- Zero-dependency default unchanged; encoded (delta/FOR) and 64-bit-int-sum cases fall through to the exact Python path
|
|
403
|
+
- **v0.7.0** — Frame-of-Reference encoding:
|
|
404
|
+
- **New:** Frame-of-Reference (FOR) + bit packing for bounded numeric columns (ages, scores, ratings): **4–8× memory reduction**
|
|
405
|
+
- Auto-detects after sampling threshold (default 50 rows), auto-fallback when range exceeds 16 bits
|
|
406
|
+
- Per-column via `for_columns=[]`, transparent API, update fallback to raw
|
|
407
|
+
- 6 new tests, zero regressions
|
|
408
|
+
- **v0.6.0** — Performance, correctness & features:
|
|
409
|
+
- **New:** vectorized multi-condition `select_where()` (bitmask `AND`/`OR`), auto-indexing (`auto_index=True`), zero-copy NumPy export (`to_numpy()`/`buffer()`, PEP 688)
|
|
410
|
+
- Delta-encoded column reads are now **O(1)/O(n)** (lazy reconstruction cache) instead of **O(n)/O(n²)** — orders of magnitude faster delta scans/aggregates
|
|
411
|
+
- Hash indexes are genuinely **kept in sync** on insert / `batch_insert` / update / delete (previously went stale after the first build); single unified `create_index()` for row **and** columnar storage; `find()` gained a scan fallback
|
|
412
|
+
- Fixed data corruption: deleting/nulling a delta-encoded row no longer shifts other rows' values
|
|
413
|
+
- Transaction rollback now actually undoes writes (and restores indexes)
|
|
414
|
+
- **Durability fix:** multi-slab row databases now survive `close()`/reopen — the on-disk bitmap geometry and slab high-water marks are persisted correctly (previously reopening a >1-slab database lost data)
|
|
415
|
+
- Vectorized aggregates (array-level `sum`/`min`/`max`) for null-free numeric columns
|
|
416
|
+
- `__slots__` on hot classes; `close()` reliably releases the mmap (Windows file locks)
|
|
417
|
+
- Tooling: reproducible benchmark suite, GitHub Actions CI (ruff + test matrix + benchmark), `ruff`-clean codebase
|
|
418
|
+
- **v0.5.0** — Delta encoding (1.2× memory reduction for monotonic numeric columns)
|
|
419
|
+
- **v0.4.0** — Dictionary encoding (3× memory reduction for low-cardinality strings)
|
|
420
|
+
- **v0.3.2** — Precompiled struct format, hash index, bit-packed booleans
|
|
421
|
+
- **v0.3.1** — Batch insert, optimized columnar, comprehensive benchmarks
|
|
422
|
+
- **v0.3.0** — Columnar engine, metrics, CDC
|
|
423
|
+
- **v0.2.0** — Query engine, hash indexes, WAL transactions, DocumentStore
|
|
424
|
+
- **v0.1.0** — Initial release
|
|
425
|
+
|
|
426
|
+
## Roadmap & Known Limitations
|
|
427
|
+
|
|
428
|
+
Tracked as GitHub issues:
|
|
429
|
+
|
|
430
|
+
- [#11](https://github.com/hussain-alsaibai/snapdb/issues/11) — Frame-of-Reference (FOR) encoding for bounded numeric ranges
|
|
431
|
+
- [#12](https://github.com/hussain-alsaibai/snapdb/issues/12) — Low-overhead query profiler via `sys.monitoring` (PEP 669)
|
|
432
|
+
- [#14](https://github.com/hussain-alsaibai/snapdb/issues/14) — Optional NumPy-accelerated filters & aggregates (keeping the zero-dependency default)
|
|
433
|
+
|
|
434
|
+
## License
|
|
435
|
+
|
|
436
|
+
MIT — see [LICENSE](LICENSE)
|