vecdbc 0.1.0__tar.gz

vecdbc-0.1.0/LICENSE ADDED
@@ -0,0 +1,19 @@
1
+ # MIT License
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19
+ SOFTWARE.
vecdbc-0.1.0/MANIFEST.in ADDED
@@ -0,0 +1,3 @@
1
+ include vecdb.c turboquant.c hybrid.c
2
+ include vecdb.h turboquant.h hybrid.h
3
+ include README.md LICENSE
vecdbc-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,603 @@
1
+ Metadata-Version: 2.4
2
+ Name: vecdbc
3
+ Version: 0.1.0
4
+ Summary: A from-scratch vector database in C (HNSW + TurboQuant + hybrid) with Python bindings
5
+ Author-email: Seyed Navid Mirnour algeri <navid72m@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/navid72m/vectorDatabase
8
+ Project-URL: Repository, https://github.com/navid72m/vectorDatabase
9
+ Keywords: vector-database,ann,nearest-neighbor,hnsw,quantization,embeddings
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Topic :: Scientific/Engineering
13
+ Classifier: Topic :: Database :: Database Engines/Servers
14
+ Requires-Python: >=3.8
15
+ Description-Content-Type: text/markdown
16
+ License-File: LICENSE
17
+ Requires-Dist: numpy
18
+ Dynamic: license-file
19
+
20
+ # vecdb — a small vector database in C
21
+
22
+ ![CI](https://github.com/navid72m/vectorDatabase/actions/workflows/ci.yml/badge.svg)
23
+
24
+ `vecdb` is a dependency-light vector index implemented in C11 with Python bindings through `ctypes`. It stores fixed-size `float32` vectors and supports exact and approximate nearest-neighbor search using squared L2 distance.
25
+
26
+ ## What it implements
27
+
28
+ - **Flat exact search** — brute-force L2 scan.
29
+ - **Blocked exact batch search** — processes up to 8 queries per pass so one vector load feeds multiple query accumulators.
30
+ - **HNSW approximate search** — graph index using greedy descent, beam search (`ef`), and heuristic neighbor selection.
31
+ - **Binary persistence** — save/load an index with vectors, IDs, levels, and neighbor links.
32
+ - **Delete API** — O(1) tombstone deletes by user ID, with stable recall under churn and a `compact()` rebuild to reclaim space.
33
+ - **Filtered search** — restrict any search to an allow-list or away from a deny-list of IDs; HNSW traverses through filtered nodes so selective filters cannot disconnect the search.
34
+ - **Concurrent reads** — searches are thread-safe (per-thread visited buffers, no shared mutable state); any number of threads may search one index simultaneously. Writes require external exclusion.
35
+ - **OpenMP batch search** — `make OMP=1` parallelizes batched HNSW, exact, and TurboQuant searches over queries. Single-threaded behavior is unchanged; results are identical either way.
36
+ - **Parallel insert (experimental)** — `add_bulk(ids, vecs, threads=N)` builds the HNSW graph concurrently using a published-node model: a node becomes a traversal target (atomic release/acquire flag) only after its links are fully wired, and per-node locks guard every link-list read and mutation. Produces a graph of equivalent recall to the serial build (verified to within +-0.02, ASan-clean). Measured 1.73x at 8 threads on an M2 Pro for a 1M-vector build (see scaling table below); the serial path is the default and is byte-identical to repeated single inserts.
37
+ - **NEON kernels** — on AArch64 (Apple Silicon, Graviton, ...) the distance kernel, blocked exact scan, and both TurboQuant scans use NEON; the 4-bit codebook decodes via a single `tbl` table register and both quantized scans run on `sdot` (signed x signed, so no zero-point correction). Verified against scalar references under QEMU in CI.
38
+ - **Hybrid index (HNSW + TurboQuant + rerank)** — an HNSW graph that traverses on TurboQuant code distances and reranks the top candidates with exact fp32; matches pure-fp32 recall while the resident footprint (codes + graph) is smaller than the fp32 vectors alone.
39
+ - **TurboQuant compressed index** — optional 4-bit or 8-bit compressed brute-force index with randomized Hadamard rotation, norm separation, Lloyd-Max Gaussian quantization, and optional QJL residual estimation.
40
+
41
+ The C API is declared in `vecdb.h`. The Python API lives in `pyvecdb.py` and loads `libvecdb.so` from the project directory by default.
42
+
43
+ ## Build from source
44
+
45
+ Requirements:
46
+
47
+ - `clang`
48
+ - `numpy` for the Python bindings
49
+ - A Unix-like shell with `make`
50
+
51
+ ```sh
52
+ make clean
53
+ make
54
+ ```
55
+
56
+ `make` builds `libvecdb.so`. Build the demo program separately:
57
+
58
+ ```sh
59
+ make bench
60
+ ```
61
+
62
+ Useful targets:
63
+
64
+ | Target | Result |
65
+ |--------|--------|
66
+ | `make` / `make all` | Builds `libvecdb.so` |
67
+ | `make bench` | Builds the `bench` demo executable |
68
+ | `make clean` | Removes `.o`, `.so`, and `bench` build artifacts |
69
+
70
+ The Makefile uses `-O3 -march=native`, so generated binaries are machine-specific. If you need to override the compiler or flags:
71
+
72
+ ```sh
73
+ make clean
74
+ make CC=clang CFLAGS="-O3 -Wall -Wextra -fPIC"
75
+ ```
76
+
77
+ For local Python use, run scripts from the repo root or set `PYTHONPATH=.`:
78
+
79
+ ```sh
80
+ PYTHONPATH=. python script.py
81
+ ```
82
+
83
+ `pyvecdb.py` loads `libvecdb.so` from the same directory as the Python file. To load a library from somewhere else, set `VECDB_LIB`:
84
+
85
+ ```sh
86
+ VECDB_LIB=/path/to/libvecdb.so PYTHONPATH=. python script.py
87
+ ```
88
+
89
+ ## Quick start
90
+
91
+ ```sh
92
+ make
93
+ make bench
94
+ ./bench 50000 128
95
+ ```
96
+
97
+ `bench` generates random vectors, inserts them into an HNSW index, runs one approximate query, prints the top result, and writes `index.vecdb`.
98
+
99
+ ## Python API
100
+
101
+ `pyvecdb.py` exposes three index classes:
102
+
103
+ | Class | Purpose |
104
+ |-------|---------|
105
+ | `VecDB` | Full vector index with flat exact search, HNSW approximate search, delete, save, and load |
106
+ | `TQIndex` | TurboQuant compressed brute-force index |
107
+ | `HybridIndex` | HNSW over TurboQuant codes with fp32 rerank (Design A + B) |
108
+
109
+ ### `VecDB`
110
+
111
+ ```python
112
+ from pyvecdb import VecDB
113
+
114
+ db = VecDB(
115
+ dim=128,
116
+ M=16,
117
+ ef_construction=200,
118
+ capacity=1024,
119
+ seed=0x9E3779B97F4A7C15,
120
+ )
121
+ ```
122
+
123
+ Constructor arguments:
124
+
125
+ | Argument | Meaning |
126
+ |----------|---------|
127
+ | `dim` | Vector dimensionality |
128
+ | `M` | HNSW max links per node on upper layers; level 0 uses `2 * M` |
129
+ | `ef_construction` | HNSW beam width during inserts |
130
+ | `capacity` | Initial vector capacity; the index grows automatically |
131
+ | `seed` | Deterministic RNG seed for HNSW level sampling |
132
+
133
+ Core methods:
134
+
135
+ ```python
136
+ db.add(vectors, ids=None) # returns None; ids default to current_count..current_count+n
137
+ db.delete(ids) # O(1) per id; raises RuntimeError if any id is missing
138
+ db.compact() # rebuild without tombstones after many deletes
139
+ ids, distances = db.search(queries, k=10, ef=100, exact=False)
140
+ ids, distances = db.search(queries, k=10, allow_ids=some_ids) # filtered
141
+ ids, distances = db.search(queries, k=10, deny_ids=other_ids)
142
+ db.save(path)
143
+ loaded = VecDB.load(path)
144
+ len(db) # number of stored vectors
145
+ db.dim # vector dimensionality
146
+ ```
147
+
148
+ Input requirements:
149
+
150
+ - `vectors` and `queries` must be convertible to contiguous `float32` arrays.
151
+ - Shape must be `(n, dim)`; a 1D vector is accepted as `(1, dim)`.
152
+ - `ids`, when provided, must be shape `(n,)`.
153
+ - `ef` is clamped up to `k` if `ef < k`.
154
+
155
+ `search()` returns `(ids, distances)`, both shaped `(n_queries, k)`. Missing result slots are filled with `id = 2**64 - 1` and `dist = inf`. Distances are squared L2 values.
156
+
157
+ Example:
158
+
159
+ ```python
160
+ import numpy as np
161
+ from pyvecdb import VecDB
162
+
163
+ vectors = np.random.rand(10_000, 128).astype(np.float32)
164
+ ids = np.arange(vectors.shape[0], dtype=np.uint64)
165
+
166
+ db = VecDB(dim=128, M=16, ef_construction=200)
167
+ db.add(vectors, ids)
168
+
169
+ queries = vectors[:5]
170
+ flat_ids, flat_dists = db.search(queries, k=10, ef=100, exact=True)
171
+ hnsw_ids, hnsw_dists = db.search(queries, k=10, ef=100, exact=False)
172
+
173
+ db.delete(np.array([42], dtype=np.uint64))
174
+
175
+ db.save("index.vecdb")
176
+ loaded = VecDB.load("index.vecdb")
177
+ ```
178
+
179
+ Delete semantics: `VecDB.delete()` is O(1) per ID via an internal id->slot
180
+ hash map. Deleted nodes are tombstoned: they keep carrying HNSW graph
181
+ connectivity (traversal passes through them, so deleting hub nodes cannot
182
+ orphan graph regions) but are excluded from all search results, exact scans,
183
+ and counts. Tombstoned slots are not reused; after heavy churn, call
184
+ `db.compact()` to rebuild the index without them. In a 5-cycle churn test
185
+ (30% turnover per cycle, 10k vectors), recall@10 stayed in 0.95-0.97 with
186
+ no degradation trend. Adding a duplicate live ID is rejected with an error.
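The delete semantics above can be modeled in a few lines. This is only a toy sketch: the class and attribute names are hypothetical and are not vecdb's internals, and it assumes that re-adding a previously deleted ID is allowed and takes a fresh slot.

```python
class TombstoneStore:
    """Toy model of tombstone deletes; names are hypothetical, not vecdb's internals."""

    def __init__(self):
        self.slot_ids = []     # slot -> user id
        self.dead = []         # slot -> True once tombstoned
        self.id_to_slot = {}   # live ids only, so delete is an O(1) dict lookup

    def add(self, user_id):
        if user_id in self.id_to_slot:
            raise ValueError(f"duplicate live id: {user_id}")
        self.id_to_slot[user_id] = len(self.slot_ids)  # always a fresh slot
        self.slot_ids.append(user_id)
        self.dead.append(False)

    def delete(self, user_id):
        if user_id not in self.id_to_slot:
            raise KeyError(f"id not found: {user_id}")
        self.dead[self.id_to_slot.pop(user_id)] = True  # slot is kept, not reused

    def compact(self):
        # Rebuild from live entries only, dropping every tombstoned slot.
        live = [uid for uid, dead in zip(self.slot_ids, self.dead) if not dead]
        self.slot_ids, self.dead, self.id_to_slot = [], [], {}
        for uid in live:
            self.add(uid)


store = TombstoneStore()
for uid in range(5):
    store.add(uid)
store.delete(2)
store.add(2)                        # assumption: a deleted id may be re-added
slots_before = len(store.slot_ids)  # tombstone still occupies a slot
store.compact()
slots_after = len(store.slot_ids)   # tombstone reclaimed
```

The slot count only shrinks in `compact()`, which is why high-churn workloads need to call it periodically.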
187
+
188
+ ### `TQIndex`
189
+
190
+ ```python
191
+ from pyvecdb import TQIndex
192
+
193
+ tq = TQIndex(dim=128, bits=4, qjl=False, seed=123)
194
+ ```
195
+
196
+ Constructor arguments:
197
+
198
+ | Argument | Meaning |
199
+ |----------|---------|
200
+ | `dim` | Vector dimensionality; internally padded to the next power of two |
201
+ | `bits` | `4` or `8` bits per coordinate |
202
+ | `qjl` | Enable TurboQuant-prod residual estimation |
203
+ | `seed` | Deterministic RNG seed for Hadamard signs and QJL projection |
204
+
205
+ Core methods:
206
+
207
+ ```python
208
+ tq.add(vectors, ids=None)
209
+ ids, distances = tq.search(queries, k=10)
210
+ bytes_used = tq.memory_bytes()
211
+ len(tq)
212
+ ```
213
+
214
+ TurboQuant is a compressed brute-force index, not an HNSW graph. It currently exposes `add`, `search`, `memory_bytes`, and `__len__`; it does not expose delete or persistence.
215
+
216
+ Example:
217
+
218
+ ```python
219
+ import numpy as np
220
+ from pyvecdb import TQIndex
221
+
222
+ vectors = np.random.rand(10_000, 128).astype(np.float32)
223
+ ids = np.arange(vectors.shape[0], dtype=np.uint64)
224
+
225
+ tq = TQIndex(dim=128, bits=4, qjl=False)
226
+ tq.add(vectors, ids)
227
+
228
+ queries = vectors[:5]
229
+ tq_ids, tq_dists = tq.search(queries, k=10)
230
+ print(tq.memory_bytes())
231
+ ```
232
+
233
+ ## C API
234
+
235
+ `vecdb.h` declares the public C interface.
236
+
237
+ ### Types
238
+
239
+ ```c
240
+ typedef struct VecDB VecDB;
241
+
242
+ typedef struct {
243
+ uint64_t id; /* user-supplied id */
244
+ float dist; /* squared L2 distance to query */
245
+ } VecResult;
246
+
247
+ typedef struct {
248
+ int dim;
249
+ int M;
250
+ int ef_construction;
251
+ size_t initial_capacity;
252
+ uint64_t seed;
253
+ } VecDBConfig;
254
+ ```
255
+
256
+ Return semantics:
257
+
258
+ - `vecdb_create()` returns `NULL` on invalid config or allocation failure.
259
+ - `vecdb_add()` returns the internal index on success, or `-1` on failure.
260
+ - `vecdb_delete()` returns `0` on success, or `-1` if the ID is not found.
261
+ - `vecdb_compact()` returns `0` on success, `-1` on allocation failure.
262
+ - `vecdb_search_*()` returns the number of results written.
263
+ - `vecdb_save()` returns `0` on success, or `-1` on failure.
264
+ - `vecdb_load()` returns `NULL` on open/read/format failure.
265
+
266
+ ### Example
267
+
268
+ ```c
269
+ #include "vecdb.h"
270
+
271
+ int main(void) {
272
+ VecDBConfig cfg = vecdb_default_config(128);
273
+ cfg.M = 16;
274
+ cfg.ef_construction = 200;
275
+ cfg.initial_capacity = 1024;
276
+ cfg.seed = 0x9E3779B97F4A7C15ULL;
277
+
278
+ VecDB *db = vecdb_create(&cfg);
279
+ if (!db) return 1;
280
+
281
+ float vec[128] = {0}; // fill with your data
282
+ vecdb_add(db, 42, vec);
283
+
284
+ float query[128] = {0}; // fill with your data
285
+ VecResult out[10];
286
+
287
+ int n_flat = vecdb_search_flat(db, query, 10, out);
288
+ int n_hnsw = vecdb_search_hnsw(db, query, 10, 100, out);
289
+
290
+ vecdb_delete(db, 42);
291
+
292
+ if (vecdb_save(db, "index.vecdb") != 0) return 1;
293
+
294
+ VecDB *loaded = vecdb_load("index.vecdb");
295
+
296
+ vecdb_free(db);
297
+ vecdb_free(loaded);
298
+ return 0;
299
+ }
300
+ ```
301
+
302
+ Core C functions:
303
+
304
+ | Function | Purpose |
305
+ |----------|---------|
306
+ | `vecdb_default_config(dim)` | Fill a config with defaults: `M=16`, `ef_construction=200`, `initial_capacity=1024` |
307
+ | `vecdb_create(&cfg)` | Allocate and initialize a database |
308
+ | `vecdb_free(db)` | Free all memory |
309
+ | `vecdb_add(db, id, vector)` | Insert one vector |
310
+ | `vecdb_delete(db, id)` | O(1) tombstone delete by user ID |
311
+ | `vecdb_compact(db)` | Rebuild in place without tombstoned nodes |
312
+ | `vecdb_count(db)` | Number of stored vectors |
313
+ | `vecdb_dim(db)` | Vector dimensionality |
314
+ | `vecdb_search_flat(db, query, k, out)` | Exact single-query search |
315
+ | `vecdb_search_flat_batch(db, queries, nq, k, out)` | Exact batched search over up to 8 queries per pass |
316
+ | `vecdb_search_hnsw(db, query, k, ef, out)` | Approximate HNSW search (thread-safe for concurrent reads) |
317
+ | `vecdb_search_hnsw_filtered(db, query, k, ef, mask, out)` | HNSW search restricted to a slot mask |
318
+ | `vecdb_search_flat_batch_filtered(db, queries, nq, k, mask, out)` | Exact filtered batch scan |
319
+ | `vecdb_make_mask(db, ids, n, mode, mask)` | Build an allow (mode 0) or deny (mode 1) mask from user IDs |
320
+ | `vecdb_slots(db)` | Mask size in bytes (internal slot count incl. tombstones) |
321
+ | `vecdb_save(db, path)` | Persist vectors and graph |
322
+ | `vecdb_load(path)` | Load a persisted index |
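As a rough illustration of what an allow/deny mask might look like, here is a pure-Python sketch. It assumes one byte per internal slot with `1` meaning "eligible for results", and that unknown IDs are silently ignored; neither detail is confirmed by the table above, and `make_mask` and `id_to_slot` are hypothetical names rather than the library's API.

```python
def make_mask(id_to_slot, n_slots, ids, mode):
    """Build a per-slot byte mask: mode 0 = allow-list, mode 1 = deny-list.

    Assumption: 1 marks a slot as eligible, 0 as filtered out.
    """
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (allow) or 1 (deny)")
    # Allow-list starts with everything filtered; deny-list with everything eligible.
    mask = bytearray([mode]) * n_slots
    for user_id in ids:
        slot = id_to_slot.get(user_id)
        if slot is not None:          # assumption: unknown ids are skipped
            mask[slot] = 1 - mode
    return mask


id_to_slot = {10: 0, 11: 1, 12: 2, 13: 3}
allow = make_mask(id_to_slot, 4, [11, 13], mode=0)
deny = make_mask(id_to_slot, 4, [11], mode=1)
```

Because the mask is indexed by slot rather than by user ID, its length has to match the internal slot count (including tombstones), which is what `vecdb_slots(db)` reports.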
323
+
324
+ ## TurboQuant details
325
+
326
+ `TQIndex` encodes each vector as a compact code and searches by scanning those codes.
327
+
328
+ Per-vector encoding pipeline:
329
+
330
+ 1. Store the original norm.
331
+ 2. Normalize the vector direction.
332
+ 3. Pad dimension to the next power of two.
333
+ 4. Apply three rounds of sign flips plus fast Walsh-Hadamard transform.
334
+ 5. Quantize each rotated coordinate with an analytic Lloyd-Max Gaussian codebook.
335
+ 6. For 8-bit mode, snap reconstructions to signed int8 and use an AVX512-VNNI scan path when available.
336
+ 7. Optional QJL mode stores residual sign bits and residual norms for an unbiased inner-product estimator.
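Steps 1 to 4 of the pipeline can be sketched in plain Python. This is a minimal illustration, not the code in `turboquant.c`: the function names are hypothetical, Python's `random` module stands in for the library's seeded RNG, and quantization (steps 5 to 7) is left out.

```python
import math
import random


def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    n = len(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b   # butterfly
        h *= 2
    scale = 1.0 / math.sqrt(n)                  # keeps the transform norm-preserving
    return [x * scale for x in v]


def rotate(vec, seed=123, rounds=3):
    """Return (norm, rotated unit vector padded to a power of two)."""
    norm = math.sqrt(sum(x * x for x in vec))               # step 1: store the norm
    padded = next_pow2(len(vec))
    if norm == 0.0:
        return 0.0, [0.0] * padded
    unit = [x / norm for x in vec]                          # step 2: normalize
    unit += [0.0] * (padded - len(vec))                     # step 3: pad
    rng = random.Random(seed)                               # assumption: stand-in RNG
    for _ in range(rounds):                                 # step 4: sign flips + FWHT
        signs = [rng.choice((-1.0, 1.0)) for _ in range(padded)]
        unit = fwht([s * x for s, x in zip(signs, unit)])
    return norm, unit


norm, rotated = rotate([3.0, 4.0, 0.0, 12.0, 1.0])
```

Sign flips and the scaled Hadamard transform are both orthogonal, so the rotated vector stays unit length; only the stored norm carries the magnitude.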
337
+
338
+ Distances returned by `TQIndex.search()` are estimated squared L2 distances for the compressed representation.
339
+
340
+ ## Implementation notes
341
+
342
+ - Distance metric: squared L2. Cosine search can be done by normalizing vectors before insertion and querying.
343
+ - HNSW uses deterministic seeded `xorshift64*` level sampling.
344
+ - HNSW search uses epoch-tagged visited marks to avoid clearing a visited array per query.
345
+ - Flat batch search uses the identity `||q - x||^2 = ||q||^2 - 2<q,x> + ||x||^2` with cached vector norms.
346
+ - `vecdb.c` has AVX-512 and NEON distance kernels; otherwise it falls back to scalar code that is intended to auto-vectorize under `-O3`.
347
+ - `turboquant.c` scan paths by target: AVX-512 4-bit register-LUT + VNNI 8-bit on x86; NEON `tbl`+`sdot` for both bit-widths on AArch64 (requires the dotprod extension, ARMv8.2+); scalar otherwise. The NEON 4-bit path scans on the int8 grid (like the VNNI 8-bit path), trading a small amount of raw recall for speed; reranking recovers it.
348
+ - OpenMP parallelism applies to batched searches and to the experimental `add_bulk(ids, vecs, threads=N)` parallel insert; the default index build is single-threaded. macOS note: Apple clang does not bundle OpenMP, so use `brew install gcc` and `make OMP=1 CC=gcc-14`, or build serial (default).
349
+ - The Makefile uses `-march=native`, so build artifacts are machine-specific.
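The norm identity used by the flat batch search can be shown with a small pure-Python sketch. It is an illustration under simplifying assumptions: the helper names are hypothetical, there is no 8-query blocking or SIMD, and the clamp to zero is an added guard against rounding rather than something the source describes.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flat_search_cached_norms(queries, vectors, k):
    """Exact top-k by squared L2 using ||q||^2 - 2<q,x> + ||x||^2."""
    norms = [dot(x, x) for x in vectors]   # ||x||^2 computed once, reused per query
    results = []
    for q in queries:
        qq = dot(q, q)
        scored = []
        for i, (x, xx) in enumerate(zip(vectors, norms)):
            d = qq - 2.0 * dot(q, x) + xx
            scored.append((max(d, 0.0), i))  # guard: rounding can dip below zero
        scored.sort()
        results.append(scored[:k])
    return results


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
top2 = flat_search_cached_norms([[1.0, 0.0]], vectors, k=2)[0]
```

With the norms cached, each query-vector pair costs one dot product instead of a full subtract-square-accumulate pass.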
350
+
351
+ ## Known limits
352
+
353
+ - Single writer: add/delete/compact must be externally excluded from each other and from searches. Concurrent searches are safe.
354
+ - Very selective allow-lists (under ~1% of vectors) can return fewer than k HNSW results; use `exact=True` or raise `ef` for those.
355
+ - Tombstoned slots are not reused by inserts; memory is reclaimed only by `compact()`. Long-running high-churn workloads should compact periodically.
356
+ - Parallel insert (`threads>1`) is experimental: correctness is validated by recall-equivalence and AddressSanitizer, but ThreadSanitizer cannot fully verify it because libgomp's `omp_lock_t` synchronization is invisible to TSan (residual reports are lock-protected accesses TSan can't see). The default serial path is the verified one.
357
+ - `compact()` rebuilds the whole graph (O(N log N) inserts), so it is a maintenance operation, not a per-delete cost.
358
+ - TurboQuant is a brute-force compressed index; it does not build an HNSW graph.
359
+ - TurboQuant has no delete or save/load API yet.
360
+ - Persistence format is versioned (v2 adds tombstones; v1 files still load) but not a long-term stable external format.
361
+
362
+ ## Project layout
363
+
364
+ ```text
365
+ vecdb.h # C API
366
+ vecdb.c # VecDB storage, flat search, HNSW, persistence
367
+ turboquant.c # TurboQuant compressed index (+ codec API for the hybrid)
368
+ hybrid.c # HNSW-over-codes + fp32 rerank index
369
+ pyvecdb.py # ctypes Python bindings
370
+ bench.c # small random-data smoke/demo program
371
+ Makefile # build rules for libvecdb.so and bench
372
+ pyproject.toml # Python package metadata
373
+ tests.py # correctness, churn, filtering, concurrency suite (python tests.py)
374
+ .github/ # CI: build + full test suite on every push
375
+ benchmarks/ # FAISS/Chroma/Qdrant comparison, quantization shoot-out,
376
+ # parallel scaling, SIFT1M (standard ANN dataset),
377
+ # and a recall-QPS Pareto plot vs FAISS
378
+ ```
379
+
380
+ ## Measured results
381
+
382
+ All single-threaded, 20k x 128 Gaussian-mixture vectors, recall verified
383
+ against exact ground truth (see `benchmarks/` for the scripts):
384
+
385
+ | index | QPS | recall@10 | note |
386
+ |----------------------------|--------|-----------|-------------------------------|
387
+ | flat exact (8-query block) | 3,928 | 1.000 | 1.9x faster than faiss flat |
388
+ | HNSW ef=50 | 16,578 | 1.000 | (synthetic; see SIFT1M below) |
389
+ | HNSW ef=200 | 8,574 | 1.000 | (synthetic; see SIFT1M below) |
390
+ | TurboQuant 4-bit | 2,235 | 0.622 | beats faiss SQ4; 1.000 w/ rerank |
391
+ | TurboQuant 8-bit (VNNI) | 4,993 | 0.937 | 2.5x faiss SQ8; 1.000 w/ rerank |
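For reference, recall@k in tables like this is commonly computed as the fraction of true top-k neighbors that the index returned. The sketch below uses that common definition; it is an assumption that the scripts in `benchmarks/` compute it the same way, and `recall_at_k` is a hypothetical helper. It skips the `2**64 - 1` sentinel that `search()` uses for missing result slots.

```python
MISSING_ID = 2**64 - 1   # sentinel id for empty result slots


def recall_at_k(result_ids, truth_ids, k):
    """Mean fraction of the true top-k ids found in the returned top-k."""
    hits = 0
    for found, truth in zip(result_ids, truth_ids):
        returned = {i for i in found[:k] if i != MISSING_ID}
        hits += len(returned & set(truth[:k]))
    return hits / (k * len(truth_ids))


found = [[1, 2, 3], [4, 5, MISSING_ID]]
truth = [[1, 3, 9], [4, 5, 6]]
recall = recall_at_k(found, truth, k=3)   # 2 hits + 2 hits out of 6
```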
392
+
393
+ Churn stability: 5 cycles of 30% delete+reinsert on 10k vectors held
394
+ recall@10 at 0.95-0.97; `compact()` rebuilt 10k vectors in ~1.6s.
395
+
396
+ ### 1M vectors on Apple Silicon (M2 Pro, NEON build, single thread)
397
+
398
+ 1,000,000 x 128 Gaussian-mixture vectors, 200 held-out queries, recall
399
+ against exact ground truth (`benchmarks/bench_large.py --n 1000000`):
400
+
401
+ | ef | QPS | recall@10 |
402
+ |-----|--------|-----------|
403
+ | 10 | 30,666 | 0.618 |
404
+ | 50 | 10,691 | 0.926 |
405
+ | 100 | 7,471 | 0.978 |
406
+ | 200 | 5,427 | 0.989 |
407
+ | 400 | 3,836 | 0.999 |
408
+
409
+ Build: 178s (5,622 vec/s) with capacity preallocated. (Build times are not
410
+ comparable across machines — different CPUs, clocks, and memory; in the
411
+ parallel table below, compare thread counts within the single M2 Pro run.)
412
+
413
+ ### Parallel insert scaling (M2 Pro, 6 performance + 4 efficiency cores)
414
+
415
+ 1M x 128 clustered vectors, `add_bulk(ids, vecs, threads=N)`, concurrent
416
+ HNSW graph construction (gcc-15, `make OMP=1`):
417
+
418
+ | threads | build | vec/s | speedup |
419
+ |---------|-------|-------|---------|
420
+ | 1 | 322.8s | 3,098 | 1.00x |
421
+ | 2 | 247.3s | 4,044 | 1.31x |
422
+ | 4 | 203.4s | 4,916 | 1.59x |
423
+ | 6 | 193.4s | 5,170 | 1.67x |
424
+ | 8 | 186.9s | 5,351 | 1.73x |
425
+
426
+ The curve flattens after 4 threads — the expected signature of HNSW insert:
427
+ a small set of well-connected "hub" nodes are selected as neighbors by most
428
+ inserts, so threads serialize on those nodes' locks (plus a global
429
+ entry/max_level lock taken twice per insert). hnswlib shows the same
430
+ sublinear scaling for the same reason. The graph built concurrently has
431
+ recall equivalent to the serial build (verified to within +-0.02). Reducing
432
+ the plateau (striped locks, a read-mostly max_level check) is future work.
433
+
434
+ ### Parallel search scaling (M2 Pro, same machine)
435
+
436
+ Search is read-only over a frozen graph (thread-safe via per-thread visited
437
+ buffers), so unlike insert it holds no locks and contends on nothing but the
438
+ memory system. 500k x 128 vectors, 10k queries, `set_threads(N)`
439
+ (`benchmarks/bench_search_mt.py`):
440
+
441
+ | threads | HNSW (ef=100) | exact scan | TurboQuant 8-bit |
442
+ |---------|---------------|------------|------------------|
443
+ | 1 | 1.00x | 1.00x | 1.00x |
444
+ | 2 | 1.82x | 1.94x | 1.77x |
445
+ | 4 | 3.06x | 3.55x | 3.28x |
446
+ | 6 | 3.58x | 4.36x | 3.96x |
447
+ | 8 | 3.97x | 5.17x | 4.38x |
448
+
449
+ The contrast with the insert table is the point: writes contend on hub-node
450
+ locks and plateau at ~1.7x; reads share no mutable state and scale near the
451
+ performance-core count. Note the exact scan scales *best* (5.17x) despite
452
+ being the simplest algorithm — its tight, uniform, fully-independent loop is
453
+ the most parallel-friendly, while HNSW's irregular pointer-chasing through
454
+ shared graph memory hits the memory system sooner. Even lock-free
455
+ parallelism has a ceiling, and here it's bandwidth, not coordination.
456
+
457
+ ### Hybrid index: recall of fp32, memory of codes
458
+
459
+ The hybrid runs the HNSW graph on TurboQuant codes (Design A) and reranks
460
+ the top candidates with exact fp32 (Design B). On 30k x 128 clustered
461
+ vectors it reaches recall@10 = 1.000 at ef=200 — matching pure-fp32 HNSW —
462
+ while the resident structures (codes + graph; fp32 is memory-mappable) take
463
+ ~8.8 MB vs 15.4 MB for the fp32 vectors alone. The quantized graph alone
464
+ (no rerank) tracks pure-TurboQuant brute force; the rerank is what lifts it
465
+ to fp32 parity, and the HNSW neighbor-diversity heuristic on the quantized
466
+ graph is what makes the rerank candidates good enough to matter.
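The rerank step can be illustrated without the graph. In this sketch a brute-force scan over coarse vectors stands in for the HNSW traversal on TurboQuant codes, and it assumes `rerank_mult` means "fetch `k * rerank_mult` candidates before exact rescoring"; both the helper names and that reading of the parameter are assumptions.

```python
import heapq


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_rerank(query, coarse_vectors, fp32_vectors, k, rerank_mult=4):
    """Pick candidates on coarse distances, then rescore them with exact fp32."""
    n_cand = min(len(coarse_vectors), k * rerank_mult)
    # Stage 1: approximate candidates (stand-in for graph search on codes).
    candidates = heapq.nsmallest(
        n_cand, ((sq_l2(query, v), i) for i, v in enumerate(coarse_vectors))
    )
    # Stage 2: exact squared L2 on the original vectors, keep the best k.
    exact = ((sq_l2(query, fp32_vectors[i]), i) for _, i in candidates)
    return heapq.nsmallest(k, exact)


fp32 = [[0.1], [0.4], [0.6], [2.0]]
coarse = [[0.0], [0.0], [1.0], [2.0]]     # crude rounding as a fake quantizer
query = [0.45]
coarse_only = search_with_rerank(query, coarse, coarse, k=1, rerank_mult=1)
reranked = search_with_rerank(query, coarse, fp32, k=1, rerank_mult=2)
```

On coarse distances alone, ids 0 and 1 tie and id 0 wins; with two candidates and an exact rescore, id 1 (the true nearest neighbor) comes out on top.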
467
+
468
+ ```python
469
+ from pyvecdb import HybridIndex
470
+ h = HybridIndex(dim=128, bits=8, ef_construction=200, rerank_mult=4)
471
+ h.add(vectors) # stores codes + fp32 (for rerank)
472
+ ids, dists = h.search(queries, k=10, ef=200)
473
+ h.memory_bytes(include_fp32=False) # resident footprint w/ mmap'd fp32
474
+ ```
475
+ On *uniform* random 128-d data the same index scores ~0.42 recall@10 at
476
+ ef=200. Uniform high-dimensional data breaks HNSW's navigability
477
+ assumptions at scale. That is a property of the data, not of the
478
+ implementation, which is why `bench_large.py` offers both
479
+ geometries.
480
+
481
+ Caveats: x86 numbers from one AVX-512 + VNNI machine, ARM numbers from one
482
+ M2 Pro; synthetic clustered data, single thread; the Qdrant comparison in
483
+ `benchmarks/benchmark.py` uses qdrant-client's in-process mode, not a
484
+ server.
485
+
486
+ ## SIFT1M (standard benchmark dataset)
487
+
488
+ `benchmarks/bench_sift.py` runs against SIFT1M (Inria TEXMEX): 1M base + 10k
489
+ query vectors at dim 128, with a provided ground-truth file of the true 100
490
+ nearest neighbors per query. Recall is measured against that canonical answer
491
+ key — not our own exact search — so results are directly comparable to
492
+ published FAISS / hnswlib / Qdrant numbers.
493
+
494
+ ```sh
495
+ wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
496
+ tar xzf sift.tar.gz
497
+ PYTHONPATH=. python3 benchmarks/bench_sift.py --dir sift --faiss --build-threads 8
498
+ ```
499
+
500
+ For the standard recall-vs-QPS Pareto figure (and a multi-threaded
501
+ head-to-head — both sides at matched thread counts, since FAISS multi-threads
502
+ search too):
503
+
504
+ ```sh
505
+ PYTHONPATH=. python3 benchmarks/bench_sift_pareto.py --dir sift --faiss \
506
+ --search-threads 1 8 --plot sift_pareto.png
507
+ ```
508
+
509
+ This plots recall@10 vs QPS (log scale) for vecdb and FAISS on the same axes
510
+ — the convention ann-benchmarks uses — so the comparison is legible at a
511
+ glance rather than read off a table.
512
+
513
+ ### Measured result: vecdb vs FAISS on SIFT1M (M2 Pro)
514
+
515
+ ![SIFT1M recall vs QPS, vecdb vs FAISS](docs/sift_pareto.png)
516
+
517
+ Recall@10 (x) vs throughput (y, log) on SIFT1M, both at 1 and 8 search
518
+ threads. Higher-and-to-the-right is better. vecdb (solid) leads FAISS
519
+ (dashed) single-threaded across the whole curve, and multi-threaded across
520
+ the operating range (recall ≥ 0.95); FAISS scales better only at the
521
+ cheap/low-recall left edge. Regenerate with `benchmarks/bench_sift_pareto.py`.
522
+
523
+
524
+ Setup: 1M base vectors + 10k queries at dim 128, with matched HNSW
525
+ parameters (M=16, efConstruction=200). Recall is measured against the
526
+ provided ground truth.
527
+
528
+ Recall matches FAISS to ~3 decimals at every ef, so the table shows a single
529
+ recall column. QPS is measured at both 1 and 8 search threads; the `1t` and
530
+ `8t` columns give the vecdb/FAISS QPS ratio:
531
+
532
+ | ef | recall | vecdb 1t | faiss 1t | 1t | vecdb 8t | faiss 8t | 8t |
533
+ |-----|--------|----------|----------|-----|----------|----------|------|
534
+ | 10 | 0.740 | 27,400 | 22,598 |1.21x| 92,647 |107,640 |0.86x |
535
+ | 50 | 0.954 | 9,297 | 6,690 |1.39x| 35,507 | 32,074 |1.11x |
536
+ | 100 | 0.985 | 5,343 | 3,636 |1.47x| 20,389 | 17,414 |1.17x |
537
+ | 200 | 0.996 | 2,983 | 1,943 |1.54x| 11,896 | 8,916 |1.33x |
538
+ | 400 | 0.999 | 1,626 | 1,006 |1.62x| 6,658 | 4,441 |1.50x |
539
+ | 800 | 0.999 | 897 | 492 |1.82x| 3,677 | 2,077 |1.77x |
540
+
541
+ Reading it honestly:
542
+ - **Single thread:** vecdb is faster at every point (1.2–1.8x).
543
+ - **8 threads:** FAISS scales better at low ef (it wins ef=10, where each
544
+ query is tiny and thread-dispatch overhead dominates — FAISS's threading
545
+ has less per-query overhead than vecdb's `schedule(dynamic)` dispatch).
546
+ But across the entire useful range (recall ≥ 0.95) vecdb still wins, and
547
+ the lead *widens* with ef: as each query does more distance work, the
548
+ faster per-query kernel (NEON, prefetch, flat link arenas) dominates and
549
+ dispatch overhead becomes negligible. FAISS wins the dispatch race; vecdb
550
+ wins the compute race.
551
+ - vecdb's parallel **build** was 2.1x faster (233s vs 493s).
552
+
553
+ Scope: Apple Silicon (NEON kernels), SIFT1M. On x86 with AVX-512 the per-query
554
+ kernel gap likely narrows. At ef=10, vecdb scales ~3.4x at 8 threads vs FAISS's ~4.8x:
555
+ FAISS threads more efficiently; vecdb's per-query work is faster. The honest
556
+ summary is: faster single-thread across the board, and faster at all useful
557
+ operating points multi-threaded, but out-scaled at the cheap/low-recall end.
558
+
559
+ The `.fvecs`/`.ivecs` loaders are validated by a format round-trip; the full
560
+ pipeline (load → build → recall-vs-ground-truth) is exercised on a small
561
+ synthetic file in the same binary layout.
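A format round-trip of the kind described can be sketched with the standard library. In the TEXMEX `.fvecs` layout each vector is stored as a little-endian int32 dimension followed by that many float32 values; the function names below are hypothetical and this is not the loader in `benchmarks/bench_sift.py`.

```python
import io
import struct


def write_fvecs(fh, vectors):
    for vec in vectors:
        fh.write(struct.pack("<i", len(vec)))           # per-vector dimension header
        fh.write(struct.pack(f"<{len(vec)}f", *vec))    # float32 payload


def read_fvecs(fh):
    vectors = []
    while True:
        header = fh.read(4)
        if not header:
            break                                       # clean end of file
        if len(header) != 4:
            raise ValueError("truncated dimension header")
        (dim,) = struct.unpack("<i", header)
        if dim <= 0:
            raise ValueError(f"invalid dimension: {dim}")
        payload = fh.read(4 * dim)
        if len(payload) != 4 * dim:
            raise ValueError("truncated vector payload")
        vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


original = [[1.0, 2.5, -3.0], [0.0, 0.5, 4.0]]  # values exactly representable in float32
buf = io.BytesIO()
write_fvecs(buf, original)
buf.seek(0)
restored = read_fvecs(buf)
```

`.ivecs` uses the same framing with int32 payloads, so swapping the `f` format code for `i` covers the ground-truth files.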
562
+
563
+ ### Second dataset: GIST1M (960-d) — a different regime
564
+
565
+ ![GIST1M recall vs QPS, vecdb vs FAISS](docs/gist_pareto.png)
566
+
567
+ Recall@10 vs QPS (log) on GIST1M, 1 and 8 threads. Read at matched recall:
568
+ the curves cross — FAISS is higher-right at low recall, vecdb at high recall
569
+ (and reaches recall FAISS doesn't in this sweep). Regenerate with
570
+ `benchmarks/bench_sift_pareto.py --dir gist`.
571
+
572
+
573
+ GIST1M (1M x 960, same TEXMEX format) is a harder, higher-dimensional test.
574
+ At 960-d, QPS must be read against recall, not ef: vecdb returns higher
575
+ recall than FAISS at every ef (better graph quality where navigation is
576
+ hard), so a row-by-row QPS comparison is misleading. Single-thread, matched
577
+ HNSW params, recall vs ground truth:
578
+
579
+ | ef | vecdb recall / QPS | faiss recall / QPS |
580
+ |-----|--------------------|--------------------|
581
+ | 50 | 0.752 / 1,534 | 0.724 / 3,067 |
582
+ | 100 | 0.854 / 1,091 | 0.827 / 1,654 |
583
+ | 200 | 0.931 / 609 | 0.903 / 913 |
584
+ | 400 | 0.971 / 350 | 0.948 / 493 |
585
+ | 800 | 0.989 / 177 | 0.970 / 255 |
586
+
587
+ Read at matched *recall*: FAISS is faster in the low-recall regime, but
588
+ vecdb wins at high recall and reaches accuracy FAISS doesn't hit in this
589
+ sweep — e.g. ~0.97 recall costs vecdb ef=400 (350 qps) vs FAISS ef=800
590
+ (255 qps), ~1.37x; and vecdb reaches 0.989 where FAISS tops out at 0.970.
591
+
592
+ The contrast with SIFT (128-d) is the interesting part. At 128-d vecdb's
593
+ hand-written kernel makes it faster at matched ef (compute-bound, the kernel
594
+ dominates). At 960-d each distance streams ~3.8 KB, so the inner loop is
595
+ memory-bandwidth-bound and the kernel advantage shrinks — FAISS catches up
596
+ on raw speed. What shows through instead is graph quality: vecdb's higher
597
+ recall per ef is an algorithmic edge independent of the kernel. Two
598
+ dimensionalities, two regimes: kernel-bound at 128-d, bandwidth-plus-graph
599
+ at 960-d. Build was 1.8x faster (765s vs 1396s).
600
+
601
+ ## License
602
+
603
+ MIT.