PyPI - vecdbc - Versions diffs - 0.1.0__tar.gz - Mend

vecdbc 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19) hide show

vecdbc-0.1.0/LICENSE +19 -0
vecdbc-0.1.0/MANIFEST.in +3 -0
vecdbc-0.1.0/PKG-INFO +603 -0
vecdbc-0.1.0/README.md +584 -0
vecdbc-0.1.0/hybrid.c +310 -0
vecdbc-0.1.0/hybrid.h +42 -0
vecdbc-0.1.0/pyproject.toml +29 -0
vecdbc-0.1.0/pyvecdb.py +402 -0
vecdbc-0.1.0/setup.cfg +4 -0
vecdbc-0.1.0/setup.py +66 -0
vecdbc-0.1.0/turboquant.c +672 -0
vecdbc-0.1.0/turboquant.h +53 -0
vecdbc-0.1.0/vecdb.c +1223 -0
vecdbc-0.1.0/vecdb.h +107 -0
vecdbc-0.1.0/vecdbc.egg-info/PKG-INFO +603 -0
vecdbc-0.1.0/vecdbc.egg-info/SOURCES.txt +17 -0
vecdbc-0.1.0/vecdbc.egg-info/dependency_links.txt +1 -0
vecdbc-0.1.0/vecdbc.egg-info/requires.txt +1 -0
vecdbc-0.1.0/vecdbc.egg-info/top_level.txt +1 -0

vecdbc-0.1.0/LICENSE ADDED Viewed

@@ -0,0 +1,19 @@
+# MIT License
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

vecdbc-0.1.0/MANIFEST.in ADDED Viewed

@@ -0,0 +1,3 @@
+include vecdb.c turboquant.c hybrid.c
+include vecdb.h turboquant.h hybrid.h
+include README.md LICENSE

vecdbc-0.1.0/PKG-INFO ADDED Viewed

@@ -0,0 +1,603 @@
+Metadata-Version: 2.4
+Name: vecdbc
+Version: 0.1.0
+Summary: A from-scratch vector database in C (HNSW + TurboQuant + hybrid) with Python bindings
+Author-email: Seyed Navid Mirnour algeri <navid72m@gmail.com>
+License: MIT
+Project-URL: Homepage, https://github.com/navid72m/vectorDatabase
+Project-URL: Repository, https://github.com/navid72m/vectorDatabase
+Keywords: vector-database,ann,nearest-neighbor,hnsw,quantization,embeddings
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Database :: Database Engines/Servers
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: numpy
+Dynamic: license-file
+# vecdb — a small vector database in C
+![CI](https://github.com/navid72m/vectorDatabase/actions/workflows/ci.yml/badge.svg)
+`vecdb` is a dependency-light vector index implemented in C11 with Python bindings through `ctypes`. It stores fixed-size `float32` vectors and supports exact and approximate nearest-neighbor search using squared L2 distance.
+## What it implements
+- **Flat exact search** — brute-force L2 scan.
+- **Blocked exact batch search** — processes up to 8 queries per pass so one vector load feeds multiple query accumulators.
+- **HNSW approximate search** — graph index using greedy descent, beam search (`ef`), and heuristic neighbor selection.
+- **Binary persistence** — save/load an index with vectors, IDs, levels, and neighbor links.
+- **Delete API** — O(1) tombstone deletes by user ID, with stable recall under churn and a `compact()` rebuild to reclaim space.
+- **Filtered search** — restrict any search to an allow-list or away from a deny-list of IDs; HNSW traverses through filtered nodes so selective filters cannot disconnect the search.
+- **Concurrent reads** — searches are thread-safe (per-thread visited buffers, no shared mutable state); any number of threads may search one index simultaneously. Writes require external exclusion.
+- **OpenMP batch search** — `make OMP=1` parallelizes batched HNSW, exact, and TurboQuant searches over queries. Single-threaded behavior is unchanged; results are identical either way.
+- **Parallel insert (experimental)** — `add_bulk(ids, vecs, threads=N)` builds the HNSW graph concurrently using a published-node model: a node becomes a traversal target (atomic release/acquire flag) only after its links are fully wired, and per-node locks guard every link-list read and mutation. Produces a graph of equivalent recall to the serial build (verified to within +-0.02, ASan-clean). Measured 1.73x at 8 threads on an M2 Pro for a 1M-vector build (see scaling table below); the serial path is the default and is byte-identical to repeated single inserts.
+- **NEON kernels** — on AArch64 (Apple Silicon, Graviton, ...) the distance kernel, blocked exact scan, and both TurboQuant scans use NEON; the 4-bit codebook decodes via a single `tbl` table register and both quantized scans run on `sdot` (signed x signed, so no zero-point correction). Verified against scalar references under QEMU in CI.
+- **Hybrid index (HNSW + TurboQuant + rerank)** — an HNSW graph that traverses on TurboQuant code distances and reranks the top candidates with exact fp32; matches pure-fp32 recall while the resident footprint (codes + graph) is smaller than the fp32 vectors alone.
+- **TurboQuant compressed index** — optional 4-bit or 8-bit compressed brute-force index with randomized Hadamard rotation, norm separation, Lloyd-Max Gaussian quantization, and optional QJL residual estimation.
+The C API is declared in `vecdb.h`. The Python API lives in `pyvecdb.py` and loads `libvecdb.so` from the project directory by default.
+## Build from source
+Requirements:
+- `clang`
+- `numpy` for the Python bindings
+- A Unix-like shell with `make`
+```sh
+make clean
+make
+```
+`make` builds `libvecdb.so`. Build the demo program separately:
+```sh
+make bench
+```
+Useful targets:
+| Target | Result |
+|--------|--------|
+| `make` / `make all` | Builds `libvecdb.so` |
+| `make bench` | Builds the `bench` demo executable |
+| `make clean` | Removes `.o`, `.so`, and `bench` build artifacts |
+The Makefile uses `-O3 -march=native`, so generated binaries are machine-specific. If you need to override the compiler or flags:
+```sh
+make clean
+make CC=clang CFLAGS="-O3 -Wall -Wextra -fPIC"
+```
+For local Python use, run scripts from the repo root or set `PYTHONPATH=.`:
+```sh
+PYTHONPATH=. python script.py
+```
+`pyvecdb.py` loads `libvecdb.so` from the same directory as the Python file. To load a library from somewhere else, set `VECDB_LIB`:
+```sh
+VECDB_LIB=/path/to/libvecdb.so PYTHONPATH=. python script.py
+```
+## Quick start
+```sh
+make
+make bench
+./bench 50000 128
+```
+`bench` generates random vectors, inserts them into an HNSW index, runs one approximate query, prints the top result, and writes `index.vecdb`.
+## Python API
+`pyvecdb.py` exposes two index classes:
+| Class | Purpose |
+|-------|---------|
+| `VecDB` | Full vector index with flat exact search, HNSW approximate search, delete, save, and load |
+| `TQIndex` | TurboQuant compressed brute-force index |
+| `HybridIndex` | HNSW over TurboQuant codes with fp32 rerank (Design A + B) |
+### `VecDB`
+```python
+from pyvecdb import VecDB
+db = VecDB(
+    dim=128,
+    M=16,
+    ef_construction=200,
+    capacity=1024,
+    seed=0x9E3779B97F4A7C15,
+)
+```
+Constructor arguments:
+| Argument | Meaning |
+|----------|---------|
+| `dim` | Vector dimensionality |
+| `M` | HNSW max links per node on upper layers; level 0 uses `2 * M` |
+| `ef_construction` | HNSW beam width during inserts |
+| `capacity` | Initial vector capacity; the index grows automatically |
+| `seed` | Deterministic RNG seed for HNSW level sampling |
+Core methods:
+```python
+db.add(vectors, ids=None)          # returns None; ids default to current_count..current_count+n
+db.delete(ids)                     # O(1) per id; raises RuntimeError if any id is missing
+db.compact()                       # rebuild without tombstones after many deletes
+ids, distances = db.search(queries, k=10, ef=100, exact=False)
+ids, distances = db.search(queries, k=10, allow_ids=some_ids)   # filtered
+ids, distances = db.search(queries, k=10, deny_ids=other_ids)
+db.save(path)
+loaded = VecDB.load(path)
+len(db)                            # number of stored vectors
+db.dim                             # vector dimensionality
+```
+Input requirements:
+- `vectors` and `queries` must be convertible to contiguous `float32` arrays.
+- Shape must be `(n, dim)`; a 1D vector is accepted as `(1, dim)`.
+- `ids`, when provided, must be shape `(n,)`.
+- `ef` is clamped up to `k` if `ef < k`.
+`search()` returns `(ids, distances)`, both shaped `(n_queries, k)`. Missing result slots are filled with `id = 2**64 - 1` and `dist = inf`. Distances are squared L2 values.
+Example:
+```python
+import numpy as np
+from pyvecdb import VecDB
+vectors = np.random.rand(10_000, 128).astype(np.float32)
+ids = np.arange(vectors.shape[0], dtype=np.uint64)
+db = VecDB(dim=128, M=16, ef_construction=200)
+db.add(vectors, ids)
+queries = vectors[:5]
+flat_ids, flat_dists = db.search(queries, k=10, ef=100, exact=True)
+hnsw_ids, hnsw_dists = db.search(queries, k=10, ef=100, exact=False)
+db.delete(np.array([42], dtype=np.uint64))
+db.save("index.vecdb")
+loaded = VecDB.load("index.vecdb")
+```
+Delete semantics: `VecDB.delete()` is O(1) per ID via an internal id->slot
+hash map. Deleted nodes are tombstoned: they keep carrying HNSW graph
+connectivity (traversal passes through them, so deleting hub nodes cannot
+orphan graph regions) but are excluded from all search results, exact scans,
+and counts. Tombstoned slots are not reused; after heavy churn, call
+`db.compact()` to rebuild the index without them. In a 5-cycle churn test
+(30% turnover per cycle, 10k vectors), recall@10 stayed in 0.95-0.97 with
+no degradation trend. Adding a duplicate live ID is rejected with an error.
+### `TQIndex`
+```python
+from pyvecdb import TQIndex
+tq = TQIndex(dim=128, bits=4, qjl=False, seed=123)
+```
+Constructor arguments:
+| Argument | Meaning |
+|----------|---------|
+| `dim` | Vector dimensionality; internally padded to the next power of two |
+| `bits` | `4` or `8` bits per coordinate |
+| `qjl` | Enable TurboQuant-prod residual estimation |
+| `seed` | Deterministic RNG seed for Hadamard signs and QJL projection |
+Core methods:
+```python
+tq.add(vectors, ids=None)
+ids, distances = tq.search(queries, k=10)
+bytes_used = tq.memory_bytes()
+len(tq)
+```
+TurboQuant is a compressed brute-force index, not an HNSW graph. It currently exposes `add`, `search`, `memory_bytes`, and `__len__`; it does not expose delete or persistence.
+Example:
+```python
+import numpy as np
+from pyvecdb import TQIndex
+vectors = np.random.rand(10_000, 128).astype(np.float32)
+ids = np.arange(vectors.shape[0], dtype=np.uint64)
+tq = TQIndex(dim=128, bits=4, qjl=False)
+tq.add(vectors, ids)
+queries = vectors[:5]
+tq_ids, tq_dists = tq.search(queries, k=10)
+print(tq.memory_bytes())
+```
+## C API
+`vecdb.h` declares the public C interface.
+### Types
+```c
+typedef struct VecDB VecDB;
+typedef struct {
+    uint64_t id;     /* user-supplied id */
+    float    dist;   /* squared L2 distance to query */
+} VecResult;
+typedef struct {
+    int    dim;
+    int    M;
+    int    ef_construction;
+    size_t initial_capacity;
+    uint64_t seed;
+} VecDBConfig;
+```
+Return semantics:
+- `vecdb_create()` returns `NULL` on invalid config or allocation failure.
+- `vecdb_add()` returns the internal index on success, or `-1` on failure.
+- `vecdb_delete()` returns `0` on success, or `-1` if the ID is not found.
+- `vecdb_compact()` returns `0` on success, `-1` on allocation failure.
+- `vecdb_search_*()` returns the number of results written.
+- `vecdb_save()` returns `0` on success, or `-1` on failure.
+- `vecdb_load()` returns `NULL` on open/read/format failure.
+### Example
+```c
+#include "vecdb.h"
+int main(void) {
+    VecDBConfig cfg = vecdb_default_config(128);
+    cfg.M = 16;
+    cfg.ef_construction = 200;
+    cfg.initial_capacity = 1024;
+    cfg.seed = 0x9E3779B97F4A7C15ULL;
+    VecDB *db = vecdb_create(&cfg);
+    if (!db) return 1;
+    float vec[128]; // fill with your data
+    vecdb_add(db, 42, vec);
+    float query[128]; // fill with your data
+    VecResult out[10];
+    int n_flat = vecdb_search_flat(db, query, 10, out);
+    int n_hnsw = vecdb_search_hnsw(db, query, 10, 100, out);
+    vecdb_delete(db, 42);
+    if (vecdb_save(db, "index.vecdb") != 0) return 1;
+    VecDB *loaded = vecdb_load("index.vecdb");
+    vecdb_free(db);
+    vecdb_free(loaded);
+    return 0;
+}
+```
+Core C functions:
+| Function | Purpose |
+|----------|---------|
+| `vecdb_default_config(dim)` | Fill a config with defaults: `M=16`, `ef_construction=200`, `initial_capacity=1024` |
+| `vecdb_create(&cfg)` | Allocate and initialize a database |
+| `vecdb_free(db)` | Free all memory |
+| `vecdb_add(db, id, vector)` | Insert one vector |
+| `vecdb_delete(db, id)` | O(1) tombstone delete by user ID |
+| `vecdb_compact(db)` | Rebuild in place without tombstoned nodes |
+| `vecdb_count(db)` | Number of stored vectors |
+| `vecdb_dim(db)` | Vector dimensionality |
+| `vecdb_search_flat(db, query, k, out)` | Exact single-query search |
+| `vecdb_search_flat_batch(db, queries, nq, k, out)` | Exact batched search over up to 8 queries per pass |
+| `vecdb_search_hnsw(db, query, k, ef, out)` | Approximate HNSW search (thread-safe for concurrent reads) |
+| `vecdb_search_hnsw_filtered(db, query, k, ef, mask, out)` | HNSW search restricted to a slot mask |
+| `vecdb_search_flat_batch_filtered(db, queries, nq, k, mask, out)` | Exact filtered batch scan |
+| `vecdb_make_mask(db, ids, n, mode, mask)` | Build an allow (mode 0) or deny (mode 1) mask from user IDs |
+| `vecdb_slots(db)` | Mask size in bytes (internal slot count incl. tombstones) |
+| `vecdb_save(db, path)` | Persist vectors and graph |
+| `vecdb_load(path)` | Load a persisted index |
+## TurboQuant details
+`TQIndex` encodes each vector as a compact code and searches by scanning those codes.
+Per-vector encoding pipeline:
+1. Store the original norm.
+2. Normalize the vector direction.
+3. Pad dimension to the next power of two.
+4. Apply three rounds of sign flips plus fast Walsh-Hadamard transform.
+5. Quantize each rotated coordinate with an analytic Lloyd-Max Gaussian codebook.
+6. For 8-bit mode, snap reconstructions to signed int8 and use an AVX512-VNNI scan path when available.
+7. Optional QJL mode stores residual sign bits and residual norms for an unbiased inner-product estimator.
+Distances returned by `TQIndex.search()` are estimated squared L2 distances for the compressed representation.
+## Implementation notes
+- Distance metric: squared L2. Cosine search can be done by normalizing vectors before insertion and querying.
+- HNSW uses deterministic seeded `xorshift64*` level sampling.
+- HNSW search uses epoch-tagged visited marks to avoid clearing a visited array per query.
+- Flat batch search uses the identity `||q - x||^2 = ||q||^2 - 2<q,x> + ||x||^2` with cached vector norms.
+- `vecdb.c` has AVX-512 and NEON distance kernels; otherwise it falls back to scalar code that is intended to auto-vectorize under `-O3`.
+- `turboquant.c` scan paths by target: AVX-512 4-bit register-LUT + VNNI 8-bit on x86; NEON `tbl`+`sdot` for both bit-widths on AArch64 (requires the dotprod extension, ARMv8.2+); scalar otherwise. The NEON 4-bit path scans on the int8 grid (like the VNNI 8-bit path), trading a small amount of raw recall for speed; reranking recovers it.
+- OpenMP parallelism applies to batched searches only; index build remains single-threaded. macOS note: Apple clang has no bundled OpenMP — use `brew install gcc` and `make OMP=1 CC=gcc-14`, or build serial (default).
+- The Makefile uses `-march=native`, so build artifacts are machine-specific.
+## Known limits
+- Single writer: add/delete/compact must be externally excluded from each other and from searches. Concurrent searches are safe.
+- Very selective allow-lists (under ~1% of vectors) can return fewer than k HNSW results; use `exact=True` or raise `ef` for those.
+- Tombstoned slots are not reused by inserts; memory is reclaimed only by `compact()`. Long-running high-churn workloads should compact periodically.
+- Parallel insert (`threads>1`) is experimental: correctness is validated by recall-equivalence and AddressSanitizer, but ThreadSanitizer cannot fully verify it because libgomp's `omp_lock_t` synchronization is invisible to TSan (residual reports are lock-protected accesses TSan can't see). The default serial path is the verified one.
+- `compact()` rebuilds the whole graph (O(N log N) inserts), so it is a maintenance operation, not a per-delete cost.
+- TurboQuant is a brute-force compressed index; it does not build an HNSW graph.
+- TurboQuant has no delete or save/load API yet.
+- Persistence format is versioned (v2 adds tombstones; v1 files still load) but not a long-term stable external format.
+## Project layout
+```text
+vecdb.h        # C API
+vecdb.c        # VecDB storage, flat search, HNSW, persistence
+turboquant.c   # TurboQuant compressed index (+ codec API for the hybrid)
+hybrid.c       # HNSW-over-codes + fp32 rerank index
+pyvecdb.py     # ctypes Python bindings
+bench.c        # small random-data smoke/demo program
+Makefile       # build rules for libvecdb.so and bench
+pyproject.toml # Python package metadata
+tests.py       # correctness, churn, filtering, concurrency suite (python tests.py)
+.github/       # CI: build + full test suite on every push
+benchmarks/    # FAISS/Chroma/Qdrant comparison, quantization shoot-out,
+               #   parallel scaling, SIFT1M (standard ANN dataset),
+               #   and a recall-QPS Pareto plot vs FAISS
+```
+## Measured results
+All single-threaded, 20k x 128 Gaussian-mixture vectors, recall verified
+against exact ground truth (see `benchmarks/` for the scripts):
+| index                      | QPS    | recall@10 | note                          |
+|----------------------------|--------|-----------|-------------------------------|
+| flat exact (8-query block) |  3,928 | 1.000     | 1.9x faster than faiss flat   |
+| HNSW ef=50                 | 16,578 | 1.000     | (synthetic; see SIFT1M below) |
+| HNSW ef=200                |  8,574 | 1.000     | (synthetic; see SIFT1M below) |
+| TurboQuant 4-bit           |  2,235 | 0.622     | beats faiss SQ4; 1.000 w/ rerank |
+| TurboQuant 8-bit (VNNI)    |  4,993 | 0.937     | 2.5x faiss SQ8; 1.000 w/ rerank |
+Churn stability: 5 cycles of 30% delete+reinsert on 10k vectors held
+recall@10 at 0.95-0.97; `compact()` rebuilt 10k vectors in ~1.6s.
+### 1M vectors on Apple Silicon (M2 Pro, NEON build, single thread)
+1,000,000 x 128 Gaussian-mixture vectors, 200 held-out queries, recall
+against exact ground truth (`benchmarks/bench_large.py --n 1000000`):
+| ef  | QPS    | recall@10 |
+|-----|--------|-----------|
+| 10  | 30,666 | 0.618     |
+| 50  | 10,691 | 0.926     |
+| 100 |  7,471 | 0.978     |
+| 200 |  5,427 | 0.989     |
+| 400 |  3,836 | 0.999     |
+Build: 178s (5,622 vec/s) with capacity preallocated. (Build times are not
+comparable across machines — different CPUs, clocks, and memory; in the
+parallel table below, compare thread counts within the single M2 Pro run.)
+### Parallel insert scaling (M2 Pro, 6 performance + 4 efficiency cores)
+1M x 128 clustered vectors, `add_bulk(ids, vecs, threads=N)`, concurrent
+HNSW graph construction (gcc-15, `make OMP=1`):
+| threads | build | vec/s | speedup |
+|---------|-------|-------|---------|
+| 1       | 322.8s | 3,098 | 1.00x  |
+| 2       | 247.3s | 4,044 | 1.31x  |
+| 4       | 203.4s | 4,916 | 1.59x  |
+| 6       | 193.4s | 5,170 | 1.67x  |
+| 8       | 186.9s | 5,351 | 1.73x  |
+The curve flattens after 4 threads — the expected signature of HNSW insert:
+a small set of well-connected "hub" nodes are selected as neighbors by most
+inserts, so threads serialize on those nodes' locks (plus a global
+entry/max_level lock taken twice per insert). hnswlib shows the same
+sublinear scaling for the same reason. The graph built concurrently has
+recall equivalent to the serial build (verified to within +-0.02). Reducing
+the plateau (striped locks, a read-mostly max_level check) is future work.
+### Parallel search scaling (M2 Pro, same machine)
+Search is read-only over a frozen graph (thread-safe via per-thread visited
+buffers), so unlike insert it holds no locks and contends on nothing but the
+memory system. 500k x 128 vectors, 10k queries, `set_threads(N)`
+(`benchmarks/bench_search_mt.py`):
+| threads | HNSW (ef=100) | exact scan | TurboQuant 8-bit |
+|---------|---------------|------------|------------------|
+| 1       | 1.00x         | 1.00x      | 1.00x            |
+| 2       | 1.82x         | 1.94x      | 1.77x            |
+| 4       | 3.06x         | 3.55x      | 3.28x            |
+| 6       | 3.58x         | 4.36x      | 3.96x            |
+| 8       | 3.97x         | 5.17x      | 4.38x            |
+The contrast with the insert table is the point: writes contend on hub-node
+locks and plateau at ~1.7x; reads share no mutable state and scale near the
+performance-core count. Note the exact scan scales *best* (5.17x) despite
+being the simplest algorithm — its tight, uniform, fully-independent loop is
+the most parallel-friendly, while HNSW's irregular pointer-chasing through
+shared graph memory hits the memory system sooner. Even lock-free
+parallelism has a ceiling, and here it's bandwidth, not coordination.
+### Hybrid index: recall of fp32, memory of codes
+The hybrid runs the HNSW graph on TurboQuant codes (Design A) and reranks
+the top candidates with exact fp32 (Design B). On 30k x 128 clustered
+vectors it reaches recall@10 = 1.000 at ef=200 — matching pure-fp32 HNSW —
+while the resident structures (codes + graph; fp32 is memory-mappable) take
+~8.8 MB vs 15.4 MB for the fp32 vectors alone. The quantized graph alone
+(no rerank) tracks pure-TurboQuant brute force; the rerank is what lifts it
+to fp32 parity, and the HNSW neighbor-diversity heuristic on the quantized
+graph is what makes the rerank candidates good enough to matter.
+```python
+from pyvecdb import HybridIndex
+h = HybridIndex(dim=128, bits=8, ef_construction=200, rerank_mult=4)
+h.add(vectors)                       # stores codes + fp32 (for rerank)
+ids, dists = h.search(queries, k=10, ef=200)
+h.memory_bytes(include_fp32=False)   # resident footprint w/ mmap'd fp32
+```
+ On *uniform* random
+128-d data the same index scores ~0.42 recall@10 at ef=200 — uniform
+high-dimensional data breaks HNSW's navigability assumptions at scale;
+that is a property of the data, not the implementation, and is why
+`bench_large.py` offers both geometries.
+Caveats: x86 numbers from one AVX-512 + VNNI machine, ARM numbers from one
+M2 Pro; synthetic clustered data, single thread; the Qdrant comparison in
+`benchmarks/benchmark.py` uses qdrant-client's in-process mode, not a
+server.
+## SIFT1M (standard benchmark dataset)
+`benchmarks/bench_sift.py` runs against SIFT1M (Inria TEXMEX): 1M base + 10k
+query vectors at dim 128, with a provided ground-truth file of the true 100
+nearest neighbors per query. Recall is measured against that canonical answer
+key — not our own exact search — so results are directly comparable to
+published FAISS / hnswlib / Qdrant numbers.
+```sh
+wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
+tar xzf sift.tar.gz
+PYTHONPATH=. python3 benchmarks/bench_sift.py --dir sift --faiss --build-threads 8
+```
+For the standard recall-vs-QPS Pareto figure (and a multi-threaded
+head-to-head — both sides at matched thread counts, since FAISS multi-threads
+search too):
+```sh
+PYTHONPATH=. python3 benchmarks/bench_sift_pareto.py --dir sift --faiss \
+    --search-threads 1 8 --plot sift_pareto.png
+```
+This plots recall@10 vs QPS (log scale) for vecdb and FAISS on the same axes
+— the convention ann-benchmarks uses — so the comparison is legible at a
+glance rather than read off a table.
+### Measured result: vecdb vs FAISS on SIFT1M (M2 Pro)
+![SIFT1M recall vs QPS, vecdb vs FAISS](docs/sift_pareto.png)
+Recall@10 (x) vs throughput (y, log) on SIFT1M, both at 1 and 8 search
+threads. Higher-and-to-the-right is better. vecdb (solid) leads FAISS
+(dashed) single-threaded across the whole curve, and multi-threaded across
+the operating range (recall ≥ 0.95); FAISS scales better only at the
+cheap/low-recall left edge. Regenerate with `benchmarks/bench_sift_pareto.py`.
+1M base + 10k queries, dim 128, single-thread search, matched HNSW
+parameters (M=16, efConstruction=200), recall against the provided ground
+truth:
+Matched HNSW params (M=16, efC=200), recall vs the provided ground truth.
+Recall is identical to FAISS to ~3 decimals at every ef — same answers,
+measured at both 1 and 8 search threads:
+| ef  | recall | vecdb 1t | faiss 1t | 1t  | vecdb 8t | faiss 8t | 8t   |
+|-----|--------|----------|----------|-----|----------|----------|------|
+| 10  | 0.740  | 27,400   | 22,598   |1.21x| 92,647   |107,640   |0.86x |
+| 50  | 0.954  |  9,297   |  6,690   |1.39x| 35,507   | 32,074   |1.11x |
+| 100 | 0.985  |  5,343   |  3,636   |1.47x| 20,389   | 17,414   |1.17x |
+| 200 | 0.996  |  2,983   |  1,943   |1.54x| 11,896   |  8,916   |1.33x |
+| 400 | 0.999  |  1,626   |  1,006   |1.62x|  6,658   |  4,441   |1.50x |
+| 800 | 0.999  |    897   |    492   |1.82x|  3,677   |  2,077   |1.77x |
+Reading it honestly:
+- **Single thread:** vecdb is faster at every point (1.2–1.8x).
+- **8 threads:** FAISS scales better at low ef (it wins ef=10, where each
+  query is tiny and thread-dispatch overhead dominates — FAISS's threading
+  has less per-query overhead than vecdb's `schedule(dynamic)` dispatch).
+  But across the entire useful range (recall ≥ 0.95) vecdb still wins, and
+  the lead *widens* with ef: as each query does more distance work, the
+  faster per-query kernel (NEON, prefetch, flat link arenas) dominates and
+  dispatch overhead becomes negligible. FAISS wins the dispatch race; vecdb
+  wins the compute race.
+- vecdb's parallel **build** was 2.1x faster (233s vs 493s).
+Scope: Apple Silicon (NEON kernels), SIFT1M. On x86 with AVX-512 the per-query
+kernel gap likely narrows. vecdb scales ~3.4x at 8 threads vs FAISS's ~4.8x —
+FAISS threads more efficiently; vecdb's per-query work is faster. The honest
+summary is: faster single-thread across the board, and faster at all useful
+operating points multi-threaded, but out-scaled at the cheap/low-recall end.
+The `.fvecs`/`.ivecs` loaders are validated by a format round-trip; the full
+pipeline (load → build → recall-vs-ground-truth) is exercised on a small
+synthetic file in the same binary layout.
+### Second dataset: GIST1M (960-d) — a different regime
+![GIST1M recall vs QPS, vecdb vs FAISS](docs/gist_pareto.png)
+Recall@10 vs QPS (log) on GIST1M, 1 and 8 threads. Read at matched recall:
+the curves cross — FAISS is higher-right at low recall, vecdb at high recall
+(and reaches recall FAISS doesn't in this sweep). Regenerate with
+`benchmarks/bench_sift_pareto.py --dir gist`.
+GIST1M (1M x 960, same TEXMEX format) is a harder, higher-dimensional test.
+At 960-d, recall must be read against recall, not ef: vecdb returns higher
+recall than FAISS at every ef (better graph quality where navigation is
+hard), so a row-by-row QPS comparison is misleading. Single-thread, matched
+HNSW params, recall vs ground truth:
+| ef  | vecdb recall / QPS | faiss recall / QPS |
+|-----|--------------------|--------------------|
+| 50  | 0.752 / 1,534      | 0.724 / 3,067      |
+| 100 | 0.854 / 1,091      | 0.827 / 1,654      |
+| 200 | 0.931 / 609        | 0.903 / 913        |
+| 400 | 0.971 / 350        | 0.948 / 493        |
+| 800 | 0.989 / 177        | 0.970 / 255        |
+Read at matched *recall*: FAISS is faster in the low-recall regime, but
+vecdb wins at high recall and reaches accuracy FAISS doesn't hit in this
+sweep — e.g. ~0.97 recall costs vecdb ef=400 (350 qps) vs FAISS ef=800
+(255 qps), ~1.37x; and vecdb reaches 0.989 where FAISS tops out at 0.970.
+The contrast with SIFT (128-d) is the interesting part. At 128-d vecdb's
+hand-written kernel makes it faster at matched ef (compute-bound, the kernel
+dominates). At 960-d each distance streams ~3.8 KB, so the inner loop is
+memory-bandwidth-bound and the kernel advantage shrinks — FAISS catches up
+on raw speed. What shows through instead is graph quality: vecdb's higher
+recall per ef is an algorithmic edge independent of the kernel. Two
+dimensionalities, two regimes: kernel-bound at 128-d, bandwidth-plus-graph
+at 960-d. Build was 1.8x faster (765s vs 1396s).
+## License
+MIT.