
faceflash 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15)

faceflash-0.1.0/PKG-INFO +575 -0
faceflash-0.1.0/README.md +529 -0
faceflash-0.1.0/faceflash/__init__.py +29 -0
faceflash-0.1.0/faceflash/align.py +202 -0
faceflash-0.1.0/faceflash/cluster.py +116 -0
faceflash-0.1.0/faceflash/detect.py +75 -0
faceflash-0.1.0/faceflash/embed.py +93 -0
faceflash-0.1.0/faceflash/engine.py +207 -0
faceflash-0.1.0/faceflash/index.py +432 -0
faceflash-0.1.0/faceflash/pca_quantize.py +199 -0
faceflash-0.1.0/faceflash/py.typed +0 -0
faceflash-0.1.0/pyproject.toml +61 -0
faceflash-0.1.0/rust/Cargo.lock +321 -0
faceflash-0.1.0/rust/Cargo.toml +22 -0
faceflash-0.1.0/rust/src/lib.rs +463 -0

faceflash-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,575 @@
+Metadata-Version: 2.4
+Name: faceflash
+Version: 0.1.0
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Rust
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Operating System :: MacOS
+Classifier: Operating System :: Microsoft :: Windows
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Scientific/Engineering :: Image Recognition
+Requires-Dist: numpy>=1.24
+Requires-Dist: opencv-python-headless>=4.8
+Requires-Dist: pillow>=10.0
+Requires-Dist: tqdm>=4.60
+Requires-Dist: faiss-cpu>=1.7 ; extra == 'benchmark'
+Requires-Dist: usearch>=2.0 ; extra == 'benchmark'
+Requires-Dist: hnswlib>=0.7 ; extra == 'benchmark'
+Requires-Dist: onnxruntime>=1.16 ; extra == 'cpu'
+Requires-Dist: pytest ; extra == 'dev'
+Requires-Dist: matplotlib ; extra == 'dev'
+Requires-Dist: pandas ; extra == 'dev'
+Requires-Dist: onnxruntime>=1.16 ; extra == 'dev'
+Requires-Dist: onnxruntime-gpu>=1.16 ; extra == 'gpu'
+Provides-Extra: benchmark
+Provides-Extra: cpu
+Provides-Extra: dev
+Provides-Extra: gpu
+License-File: LICENSE
+Summary: Fast face retrieval via PCA+ITQ binary quantization — up to 96x less memory than HNSW at equal recall
+Keywords: face-recognition,vector-search,binary-quantization,arcface,memory-efficient
+Author-email: Raghavender Grudhanti <raghavenderreddy1212@gmail.com>
+License: MIT
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: Homepage, https://github.com/raghavenderreddygrudhanti/faceflash
+Project-URL: Issues, https://github.com/raghavenderreddygrudhanti/faceflash/issues
+Project-URL: Repository, https://github.com/raghavenderreddygrudhanti/faceflash
+# ⚡ FaceFlash
+**FaceFlash searches 1M faces in 61 MB of RAM. HNSW needs 2.9 GB, USearch needs 2.5 GB, FAISS needs 1.9 GB — for the same 100% recall.**
+FaceFlash is a Rust face search engine with Python bindings, built on **PCA+ITQ binary quantization** — a learned hash that preserves identity information with zero recall loss and no separate training phase.
+- **100% Recall@1, 48–96× less memory.** On the MS1MV2 benchmark (44,291 identities), FaceFlash held 100% Recall@1 at every scale from 100K to 1M while using **48× less index memory than HNSWLIB** at the default 512-bit config (~42× vs USearch, ~32× vs FAISS-Flat) — and **96× less** at the 256-bit compact config (100% recall, ~2× single-query latency).
+- **Faster than HNSW at 100K.** On the same benchmark, FaceFlash single-query latency was 0.30ms vs HNSWLIB's 0.60ms (2× faster). Batched throughput: 27,661 qps vs 5,813 qps (4.8× faster). Both at 100% recall.
+- **AVX-512 VPOPCNTDQ + NEON.** Hand-written SIMD kernels process one 512-bit face code per instruction (~3× faster than scalar). Multi-core batched search measured at 10–17× throughput vs single-query serial on the same hardware.
+- **Zero-config indexing.** Add faces, they're indexed — PCA fits automatically after 1,024 samples, no hyperparameter tuning, no rebuilds as the gallery grows.
+- **Pure local.** No managed service, no data leaving your machine. Pair with any ArcFace model for a fully air-gapped face search stack.
+<div align="center">
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://python.org)
+[![Rust AVX-512](https://img.shields.io/badge/backend-Rust%20AVX--512%20VPOPCNTDQ-orange.svg)](rust/)
+[![CI](https://github.com/raghavenderreddygrudhanti/faceflash/actions/workflows/ci.yml/badge.svg)](https://github.com/raghavenderreddygrudhanti/faceflash/actions)
+</div>
+```python
+from faceflash import FaceFlash
+ff = FaceFlash()
+ff.register_folder("employees/")   # bulk enroll
+ff.save("my_index/")
+result = ff.search("visitor.jpg")
+# {"matches": [{"name": "Alice", "confidence": 0.92}], "search_time_ms": 0.4}
+```
+---
+## The Problem
+Most face search libraries force a trade-off:
+- **HNSW** is fast and accurate — but consumes **2.9 GB of RAM at 1M faces**
+- **ScaNN / USearch** are blazing fast — but drop to **94–99% recall**
+- **FAISS-Flat** is exact — but is **10× slower** at scale
+FaceFlash breaks this trade-off. It compresses each face into a **64-byte binary fingerprint** using PCA + ITQ quantization, scans the codes by Hamming distance (one AVX-512 VPOPCNTDQ popcount per 512-bit code), then re-ranks only the top candidates with exact cosine similarity. The result: Recall@1 matching exact search on the benchmarks below, at a fraction of the memory.
+---
+## At a Glance
+| | FaceFlash | HNSWLIB | USearch | ScaNN | FAISS-Flat |
+|---|---|---|---|---|---|
+| **Recall@1** | **100%** | 100% | 99.5% | 98.3% | 100% |
+| **Memory @ 100K** | **3.05 MB** | 293 MB | 254 MB | 12 MB | 195 MB |
+| **Memory @ 500K** | **15.3 MB** | 1,465 MB | 1,270 MB | 61 MB | 977 MB |
+| **Memory @ 1M** | **30.5 MB** | 2,930 MB | 2,539 MB | 122 MB | 1,953 MB |
+| **Latency @ 100K** | **0.30ms** | 0.60ms | 0.17ms | 0.10ms | 4.90ms |
+| **Batched QPS @ 100K** | 27,661 | 5,813 | **137,264** | — | — |
+| **Index build** | Auto (PCA fit) | Build graph | Build graph | Partition | None |
+> Tested on MS1MV2 (44,291 identities, 645,019 embeddings). Hardware: AMD EPYC 9355, 128 threads, AVX-512 active.
+>
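+> **Memory = binary index only**, shown for the 256-bit compact config (32 bytes/face). The latency and QPS rows use the default 512-bit config, which doubles these figures (6.1 MB at 100K, 61 MB at 1M). Float vectors for cosine reranking are mmap'd from disk after `save()`/`load()`, so only ~100 candidate rows are paged per query. See [Limitations](#limitations).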
+<div align="center">
+<img src="docs/figures/chart_memory_scale.png" width="800" alt="Memory comparison: FaceFlash vs all competitors at 500K faces"/>
+<sub><b>Index Memory at 500K Faces (256-bit compact config):</b> FaceFlash (15 MB) vs HNSW (1,465 MB) — 96× less RAM at 100% recall</sub>
+</div>
+<div align="center">
+<img src="docs/figures/demo.gif" width="720" alt="FaceFlash live demo"/>
+</div>
+---
+## Install
+```bash
+pip install "faceflash[cpu] @ git+https://github.com/raghavenderreddygrudhanti/faceflash.git"
+# With benchmark dependencies
+pip install "faceflash[cpu,benchmark] @ git+https://github.com/raghavenderreddygrudhanti/faceflash.git"
+```
+> The install compiles the AVX-512/NEON backend automatically, which needs a [Rust toolchain](https://rustup.rs). Without the compiled backend, FaceFlash falls back to NumPy.
+---
+## Quick Start
+```python
+from faceflash import FaceFlash
+ff = FaceFlash()  # downloads ArcFace model (~166 MB) on first run
+# Register individual faces
+ff.register("Alice", "alice.jpg")
+ff.register("Bob", "bob.jpg")
+# Identify a face
+result = ff.search("query.jpg")
+# {"matches": [{"name": "Alice", "confidence": 0.92}], "search_time_ms": 0.4}
+# Verify two faces are the same person
+ff.verify("photo1.jpg", "photo2.jpg")
+# {"match": True, "confidence": 0.87}
+# Bulk enroll from folder (expects folder/person_name/photo.jpg)
+ff.register_folder("employees/")
+ff.save("my_index/")
+ff.load("my_index/")
+# Manage the index
+len(ff)                  # how many faces are registered
+"Alice" in ff            # is this person registered?
+ff.names()               # list registered people
+ff.remove("Alice")       # unregister a person (returns count removed)
+```
+> For best accuracy, use pre-aligned 112x112 face crops. 5-point alignment (SCRFD/RetinaFace) adds +1.28 accuracy points over a basic center-crop.
+---
+## Gallery Management
+```python
+from faceflash import FaceFlash
+ff = FaceFlash()
+ff.register("Alice", "alice.jpg")
+ff.register("Bob", "bob.jpg")
+ff.register("Alice", "alice2.jpg")  # multiple photos per person
+# Check gallery state
+len(ff)              # 3 (total face entries)
+ff.names()           # ["Alice", "Bob"]
+"Alice" in ff        # True
+"Charlie" in ff      # False
+# Remove a person (GDPR / right-to-be-forgotten)
+ff.remove("Bob")     # removes all entries for Bob
+len(ff)              # 2
+ff.names()           # ["Alice"]
+# Monitor index stats
+ff.stats()
+# {'count': 2, 'pca_fitted': True, 'rust_backend': True,
+#  'binary_memory_mb': 0.0, 'resident_memory_mb': 0.0, ...}
+```
+## Batch Identification
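+Process many query faces at once. In the 100K benchmark below, batched search reached 27,661 qps versus 3,310 qps for one-by-one queries (about 8× the throughput):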
+```python
+import numpy as np
+from faceflash import FaceFlash
+ff = FaceFlash()
+ff.register_folder("gallery/")
+ff.save("my_index/")
+# Low-level batch search (for bulk dedup, watchlists, video frames)
+embeddings = np.load("query_embeddings.npy")  # (N, 512) float32
+results = ff.index.search_batch(embeddings, k=1)
+# results[i] = [(name, similarity, index), ...]
+# Example: find all matches above threshold
+for i, matches in enumerate(results):
+    if matches and matches[0][1] > 0.5:
+        print(f"Query {i}: {matches[0][0]} (confidence {matches[0][1]:.2f})")
+```
+---
+## Is FaceFlash Right for You?
+| Scenario | Why FaceFlash wins |
+|---|---|
+| **Edge / mobile / IoT** | 3–30 MB vs 293–2,930 MB for HNSW — fits in device RAM |
+| **Multi-tenant servers** | 100 galleries x 30 MB = 3 GB at 500K faces each. HNSW: 100 x 1.5 GB = 150 GB |
+| **Batch dedup / watchlists** | 4.8x faster than HNSW batched at 100K; 1.9x at 500K |
+| **100% recall is non-negotiable** | FaceFlash hits 100% at every scale; USearch drops to 94-99% |
+| **Budget / offline / air-gapped** | Runs on Raspberry Pi, cheap VPS, phones — no GPU, no network |
+| **10K-500K face databases** | The sweet spot: faster AND less memory than HNSW |
+**When HNSW is the better choice:**
+- You need sub-millisecond single-query latency at >500K faces and have gigabytes of RAM to spare
+- Your database exceeds 2M faces (HNSW's O(log N) pulls clearly ahead)
+- You need 100K+ batched QPS regardless of memory (USearch wins there)
+---
+## How It Works
+![FaceFlash Architecture Pipeline](docs/figures/architecture_pipeline.png)
+Each face is compressed into a **64-byte binary fingerprint**:
+1. **ArcFace** extracts a 512-dimensional float embedding
+2. **PCA** aligns the quantization with the axes where identity varies most
+3. **ITQ** rotates bits to maximize information per bit (balanced marginals)
+4. **AVX-512 VPOPCNTDQ** scans the binary codes by Hamming distance, one popcount instruction per 512-bit code
+5. **Cosine rerank** runs exact similarity on only the top ~100 candidates
+This is why 512 bits is the fastest setting — the entire code fits in one AVX-512 register.
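Steps 2 and 3 can be sketched in NumPy, following the ITQ formulation of Gong & Lazebnik (cited under Credits). This is not the code in `pca_quantize.py`: the function names `fit_pca_itq` and `encode`, the iteration count, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def fit_pca_itq(X, n_bits, n_iter=50, seed=0):
    """Learn a PCA projection W and an ITQ rotation R from embeddings X (n, d)."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA: rows of Vt are principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_bits].T                        # (d, n_bits)
    V = Xc @ W                               # projected data, (n, n_bits)
    # ITQ: start from a random orthogonal rotation, then alternate two steps.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)  # fix R, take the nearest binary codes
        U, _, Vh = np.linalg.svd(V.T @ B)    # fix B, solve orthogonal Procrustes
        R = U @ Vh
    return mean, W, R

def encode(X, mean, W, R):
    """Project, rotate, take signs, and pack 8 bits per byte."""
    bits = ((X - mean) @ W @ R) >= 0
    return np.packbits(bits, axis=1)

# Toy data (assumed): 500 vectors, 32 dims, 16-bit codes.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 32)).astype(np.float32)
mean, W, R = fit_pca_itq(X, n_bits=16)
codes = encode(X, mean, W, R)
print(codes.shape, codes.dtype)   # (500, 2) uint8
```

With 512-dimensional embeddings and `n_bits=512`, the same `encode` would produce the 64-byte codes described above.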
+---
+## Benchmark Methodology
+FaceFlash's recall, latency, throughput, and memory are measured. Competitor recall and latency are also measured; **competitor memory is estimated** from the standard vectors + index-structure overhead formula (e.g. HNSW ≈ 1.5× raw vectors).
+| | Details |
+|---|---|
+| **Dataset** | MS1MV2 — 645,019 ArcFace embeddings, 44,291 distinct identities |
+| **Embedding** | ArcFace ONNX (w600k_r50), 512 dimensions, L2-normalized |
+| **Hardware** | AMD EPYC 9355 (32 cores / 128 threads), AVX-512 VPOPCNTDQ enabled |
+| **Competitors** | HNSWLIB 0.8+, FAISS 1.7+, USearch 2.x, ScaNN (latest) |
+| **Ground truth** | Exact brute-force cosine argmax (FAISS-Flat) |
+| **Timing** | `time.perf_counter()` per query, 10 warmup excluded |
+| **Recall metric** | Recall@1 — fraction of queries where the true nearest neighbor is rank-1 |
+| **Memory metric** | FaceFlash: measured binary index size in RAM (floats mmap'd after save/load). Competitors: estimated (vectors + index-structure overhead) |
+| **Batched timing** | Wall-clock for the full query batch / number of queries |
+| **Reproducibility** | `bash scripts/runpod_ms1m.sh` reproduces all results end-to-end |
+All single-query rows are single-threaded. Batched rows use all available cores. Every benchmark script validates correctness before reporting speed.
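The recall and timing rows of the methodology table could be computed along these lines. This is a hedged sketch, not the project's benchmark script: `recall_at_1` and `mean_latency_ms` are hypothetical helper names, and treating the first 10 queries as the excluded warmup is an assumption.

```python
import time

def recall_at_1(predicted_ids, true_nearest_ids):
    """Fraction of queries whose rank-1 result equals the exact nearest neighbor."""
    hits = sum(p == t for p, t in zip(predicted_ids, true_nearest_ids))
    return hits / len(true_nearest_ids)

def mean_latency_ms(search_fn, queries, n_warmup=10):
    """Time each query with perf_counter, excluding the warmup queries."""
    for q in queries[:n_warmup]:
        search_fn(q)                       # warmup, not timed
    timings = []
    for q in queries[n_warmup:]:
        start = time.perf_counter()
        search_fn(q)
        timings.append((time.perf_counter() - start) * 1e3)
    return sum(timings) / len(timings)

# Toy usage with a stand-in search function.
print(recall_at_1([3, 7, 9, 4], [3, 7, 1, 4]))        # 0.75
latency = mean_latency_ms(lambda q: q * 2, list(range(50)))
```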
+---
+## Performance
+### Scale Summary (100K-1M)
+<div align="center">
+<img src="docs/figures/chart_throughput_scale.png" width="760" alt="Batched throughput: FaceFlash vs all competitors 100K to 1M"/>
+<sub><b>Batched Throughput (QPS):</b> FaceFlash 100K→1M — 4.8× faster than HNSW at 100K</sub>
+</div>
+<div align="center">
+<img src="docs/figures/chart_latency_scale.png" width="760" alt="Single-query latency: all methods 100K to 1M"/>
+<sub><b>Single-Query Latency:</b> FaceFlash 0.30ms vs HNSW 0.60ms at 100K — both at 100% recall</sub>
+</div>
+<div align="center">
+<img src="docs/figures/chart_recall_memory_scale.png" width="760" alt="Recall vs Memory at 500K"/>
+<sub><b>Recall vs Memory Pareto:</b> FaceFlash sits at the frontier — 100% recall, 30 MB</sub>
+</div>
+| Scale | Recall | Single-query | Batched QPS | Memory | vs HNSW memory |
+|---|---|---|---|---|---|
+| 100K | 100% | **0.30ms** (2× faster than HNSW) | 27,661 | **6.1 MB** | **48× less** |
+| 200K | 100% | 0.57ms (tied with HNSW) | 19,930 | **12.2 MB** | **48× less** |
+| 300K | 100% | 0.84ms | 15,147 | **18.3 MB** | **48× less** |
+| 500K | 100% | 1.45ms | 10,337 | **30.5 MB** | **48× less** |
+| 1M | 100% | 2.95ms | 5,403 | **61 MB** | **48× less** |
+*Numbers are the default **512-bit** config (fastest, 100% recall). The **256-bit compact** config holds 100% recall at half the memory — **96× less than HNSW** — trading ~2× single-query latency.*
+FaceFlash beats HNSW on every axis up to 200K. From 300K, HNSW edges ahead on single-query latency (O(log N) vs O(N): 0.66ms vs 0.84ms at 300K), while FaceFlash keeps the batched-throughput lead through 500K and roughly ties at 1M (5,403 vs 5,621 qps). It uses 48–96× less memory at every scale.
+### 1:N Identification - 44,290 Distinct People
+The hardest test: one photo per person in the gallery, identify them from a different photo.
+<div align="center">
+<img src="docs/figures/chart_rank1_tie.png" width="720" alt="Rank-1 identification ties exact search on 44,290 people"/>
+<sub><b>1:N Identification on 44,290 Identities:</b> FaceFlash matches FAISS-Flat accuracy at 32× less memory</sub>
+</div>
+| Method | Rank-1 Accuracy | Memory |
+|---|---|---|
+| FAISS-Flat (exact ceiling) | 95.8% | 86.5 MB |
+| **FaceFlash (512b / 100c)** | **95.8%** | **2.70 MB** |
+| FaceFlash (512b / 300c) | 95.8% | 2.70 MB |
+| FaceFlash (256b / 100c) | 95.6% | 1.35 MB |
+FaceFlash ties exact search using **32× less memory** (512 float32 values = 2,048 bytes → 512 bits = 64 bytes, exactly 32× compression). At 512 bits, binary quantization plus cosine rerank loses no rank-1 accuracy on this benchmark.
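The memory figures in this README follow from simple arithmetic, reproduced below. The helper names are hypothetical. The "MB" values in the tables match mebibytes (1,048,576 bytes), and the HNSW figure uses the 1.5× estimate stated under Benchmark Methodology.

```python
MIB = 1024 * 1024

def float_index_mib(n_faces, dim=512, bytes_per_float=4):
    """Raw float32 vectors, as stored by an exact flat index."""
    return n_faces * dim * bytes_per_float / MIB

def binary_index_mib(n_faces, n_bits=512):
    """Packed binary codes: n_bits / 8 bytes per face."""
    return n_faces * (n_bits // 8) / MIB

n = 44_290
print(round(float_index_mib(n), 1))                   # 86.5
print(round(binary_index_mib(n), 2))                  # 2.7
print(float_index_mib(n) / binary_index_mib(n))       # 32.0

# 1M faces: estimated HNSW footprint vs the 512-bit binary index.
hnsw_estimate = 1.5 * float_index_mib(1_000_000)
print(round(hnsw_estimate), round(binary_index_mib(1_000_000)))   # 2930 61
print(round(hnsw_estimate / binary_index_mib(1_000_000)))         # 48
```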
+<details>
+<summary>Detailed per-scale benchmark tables</summary>
+### 100K Faces (6,939 identities)
+| Method | Recall@1 | Latency | QPS | Memory |
+|---|---|---|---|---|
+| FaceFlash (batched) | 100% | 0.036ms | 27,661 | 6.1 MB |
+| FaceFlash (512b/200c) | 100% | 0.30ms | 3,310 | 6.1 MB |
+| FaceFlash (512b/100c) | 100% | 0.43ms | 2,344 | 6.1 MB |
+| HNSWLIB (ef=128) | 100% | 0.60ms | 1,671 | 293 MB |
+| HNSWLIB batched | 100% | 0.172ms | 5,813 | 293 MB |
+| USearch batched | 99.5% | 0.007ms | 137,264 | 254 MB |
+| USearch | 99.5% | 0.17ms | -- | 254 MB |
+| ScaNN | 98.3% | 0.10ms | -- | 12 MB |
+| FAISS-Flat (exact) | 100% | 4.90ms | 204 | 195 MB |
+### 200K Faces (13,749 identities)
+| Method | Recall@1 | Latency | QPS | Memory |
+|---|---|---|---|---|
+| FaceFlash (batched) | 100% | 0.050ms | 19,930 | 12.2 MB |
+| FaceFlash (512b/200c) | 100% | 0.57ms | 1,751 | 12.2 MB |
+| HNSWLIB (ef=128) | 99.9% | 0.65ms | 1,531 | 586 MB |
+| HNSWLIB batched | 99.9% | 0.378ms | 2,646 | 586 MB |
+| USearch batched | 99.1% | 0.008ms | 121,660 | 508 MB |
+| ScaNN | 97.2% | 0.19ms | -- | 24 MB |
+### 300K Faces (20,615 identities)
+| Method | Recall@1 | Latency | QPS | Memory |
+|---|---|---|---|---|
+| FaceFlash (batched) | 100% | 0.066ms | 15,147 | 18.3 MB |
+| FaceFlash (512b/200c) | 100% | 0.84ms | 1,187 | 18.3 MB |
+| HNSWLIB (ef=128) | 99.9% | 0.66ms | 1,510 | 879 MB |
+| HNSWLIB batched | 99.7% | 0.269ms | 3,715 | 879 MB |
+| USearch batched | 98.7% | 0.014ms | 73,383 | 762 MB |
+| ScaNN | 97.8% | 0.28ms | -- | 37 MB |
+### 500K Faces (34,328 identities)
+| Method | Recall@1 | Latency | QPS | Memory |
+|---|---|---|---|---|
+| FaceFlash (batched) | 100% | 0.097ms | 10,337 | 30.5 MB |
+| FaceFlash (512b/200c) | 100% | 1.45ms | 692 | 30.5 MB |
+| HNSWLIB (ef=128) | 100% | 0.71ms | 1,416 | 1,465 MB |
+| HNSWLIB batched | 99.9% | 0.179ms | 5,577 | 1,465 MB |
+| USearch batched | 98.4% | 0.013ms | 76,150 | 1,270 MB |
+| ScaNN | 97.6% | 0.45ms | -- | 61 MB |
+### 1M Faces (44,291 identities)
+| Method | Recall@1 | Latency | QPS | Memory |
+|---|---|---|---|---|
+| FaceFlash (batched) | 100% | 0.185ms | 5,403 | 61 MB |
+| FaceFlash (512b/100c) | 100% | 2.92ms | 342 | 61 MB |
+| HNSWLIB (ef=128) | 100% | 0.66ms | 1,523 | 2,930 MB |
+| HNSWLIB batched | 100% | 0.178ms | 5,621 | 2,930 MB |
+| USearch batched | 94.1% | 0.013ms | 77,266 | 2,539 MB |
+| ScaNN | 98.2% | 0.86ms | -- | 122 MB |
+</details>
+---
+## Tuning
+Pick a config that matches your deployment:
+| Deployment | Config | Recall@1 | Memory/face | Notes |
+|---|---|---|---|---|
+| Ultra-compact (mobile/IoT) | n_bits=128, n_candidates=500 | 99.4% | 16 bytes | Minimum RAM |
+| Balanced | n_bits=256, n_candidates=100 | 100% | 32 bytes | Good default |
+| **Default (fastest)** | n_bits=512, n_candidates=100 | **100%** | 64 bytes | One AVX-512 instruction |
+| Edge (minimize disk reads) | n_bits=512, n_candidates=50 | 99.5% | 64 bytes | -- |
+```python
+ff = FaceFlash(n_bits=128, n_candidates=500)   # mobile/IoT
+ff = FaceFlash(n_bits=256, n_candidates=100)   # balanced
+ff = FaceFlash(n_bits=512, n_candidates=100)   # default: fastest
+ff.search("query.jpg", n_candidates=200)       # per-query override
+```
+### Speed up large indexes with clustering
+<div align="center">
+<img src="docs/figures/chart_clustering_tradeoff.png" width="720" alt="Clustering recall/speed tradeoff"/>
+<sub><b>IVF Clustering Speedup:</b> 5–8× faster at 500K+ with configurable recall trade-off</sub>
+</div>
+```python
+ff.index.build_clusters(n_probe=16)
+```
+| Scale | n_probe | Recall@1 | Latency | Speedup |
+|---|---|---|---|---|
+| 100K | 16 | 96.1% | 0.12ms | 2.6x |
+| 100K | 32 | 98.5% | 0.17ms | 1.8x |
+| 500K | 16 | 87.9% | 0.31ms | 5.0x |
+| 500K | 32 | 92.8% | 0.56ms | 2.8x |
+Clustering is mainly useful at 500K+, where the table above shows a 2.8–5.0x speedup at ~88–93% recall.
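The recall/speed trade-off behind `n_probe` can be illustrated with a generic IVF sketch. How `build_clusters` works internally is not documented here, so everything below is an assumption: spherical k-means on the float vectors, one inverted list per cluster, and a query that scans only the `n_probe` nearest lists. `build_ivf` and `probe` are hypothetical names.

```python
import numpy as np

def build_ivf(vectors, n_clusters, n_iter=10, seed=0):
    """Spherical k-means on unit vectors, then one inverted list per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmax(vectors @ centroids.T, axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                m = members.mean(axis=0)
                centroids[c] = m / np.linalg.norm(m)
    assign = np.argmax(vectors @ centroids.T, axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists

def probe(query, centroids, lists, n_probe):
    """Return gallery indices in the n_probe clusters closest to the query."""
    nearest = np.argsort(-(centroids @ query))[:n_probe]
    return np.concatenate([lists[c] for c in nearest])

rng = np.random.default_rng(0)
vectors = rng.standard_normal((2000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
centroids, lists = build_ivf(vectors, n_clusters=16)
few = probe(vectors[0], centroids, lists, n_probe=2)
everything = probe(vectors[0], centroids, lists, n_probe=16)
print(len(few), len(everything))   # a subset vs all 2000 rows
```

A smaller `n_probe` scans fewer rows, which is where the speedup comes from, but the true nearest neighbor may sit in an unprobed list, which is where recall is lost.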
+---
+## Architecture
+```
+faceflash/                          # Python package
+├── engine.py         ◀─ High-level API (register, search, verify)
+├── detect.py         ◀─ Face detection (SCRFD + Haar fallback)
+├── align.py          ◀─ 5-point alignment to ArcFace template
+├── embed.py          ◀─ ArcFace ONNX embedding (512-dim, auto-download)
+├── index.py          ◀─ Binary index + batched search
+└── pca_quantize.py   ◀─ PCA+ITQ quantizer (the core algorithm)
+rust/                               # Rust backend (PyO3 + Rayon)
+├── src/lib.rs        ◀─ AVX-512 VPOPCNTDQ / NEON / scalar POPCNT
+└── Cargo.toml
+```
+**Why PCA+ITQ?** ArcFace embeddings concentrate identity information along principal axes. PCA aligns quantization with those axes; ITQ rotates bits for balanced marginals. On the benchmarks above, 512-bit codes plus cosine rerank match exact search at Recall@1.
+**Why not HNSW internally?** HNSW stores a graph on top of full float vectors — about 1.5x raw memory. FaceFlash stores 32–64 bytes per face. Float vectors are memory-mapped from disk and paged only for the top ~100 candidates per query. Trade-off: higher single-query latency at 500K+, but 48–96× less memory.
+**Why Rust + AVX-512?** AVX-512 VPOPCNTDQ processes an entire 512-bit code in one instruction (~3× faster than scalar POPCNT). Combined with Rayon multi-core parallelism (and cache-blocked batching to keep the scan cache-friendly), batched search reaches 10-17× throughput versus single-query serial. Runtime-detected — no user configuration needed.
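The two-stage search described in this section (Hamming scan, then exact cosine rerank of the top candidates) can be sketched in NumPy. This is an illustration, not the Rust kernel: the byte-wise popcount table stands in for VPOPCNTDQ, the random-projection hash stands in for PCA+ITQ, and the names `hamming_scan`, `search`, and `to_code` are hypothetical.

```python
import numpy as np

# Popcount of every byte value, used to count differing bits after XOR.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_scan(codes, query_code):
    """Hamming distance from one packed query code to every packed gallery code."""
    return POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1, dtype=np.int32)

def search(query_vec, query_code, codes, vectors, n_candidates=100, k=1):
    """Stage 1: binary scan for candidates. Stage 2: exact cosine rerank."""
    dist = hamming_scan(codes, query_code)
    n_candidates = min(n_candidates, len(dist))
    cand = np.argpartition(dist, n_candidates - 1)[:n_candidates]
    sims = vectors[cand] @ query_vec           # vectors are L2-normalized
    order = np.argsort(-sims)[:k]
    return [(int(cand[i]), float(sims[i])) for i in order]

# Toy gallery: 1000 unit vectors, 64 dims, 64-bit codes.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
P = rng.standard_normal((64, 64)).astype(np.float32)   # stand-in hash, not PCA+ITQ

def to_code(v):
    return np.packbits(v @ P >= 0, axis=-1)

codes = to_code(vectors)
query = vectors[123] + 0.02 * rng.standard_normal(64).astype(np.float32)
query /= np.linalg.norm(query)
top = search(query, to_code(query), codes, vectors)
print(top[0][0])   # 123, the gallery row the query was perturbed from
```

Only the `n_candidates` rows selected by the binary scan are touched in the float array, which is why the float vectors can stay memory-mapped on disk.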
+---
+## Limitations
+- **Single-query at 1M+** — O(N) linear scan; HNSW is 4.4x faster per single query at 1M. Batched path ties.
+- **Memory during build** — holds all float vectors in RAM. The 48–96× savings apply after `save()` / `load()`.
+- **AVX-512 VPOPCNTDQ** — the ~3× kernel speedup requires Ice Lake / Zen 4+ / EPYC 9004+. Older CPUs fall back to scalar POPCNT automatically.
+- **Rerank I/O** — pages ~100 float rows from disk per query. Invisible on NVMe; adds latency on slow storage.
+---
+## Reproduce the Benchmarks
+```bash
+# Local (LFW + VGGFace2 100K)
+python scripts/extract_lfw_embeddings.py
+python benchmarks/bench_ann_comparison.py --scales 100K --queries 500
+# RunPod (full suite)
+export GITHUB_TOKEN=<token>
+bash scripts/runpod_ms1m.sh   # FORCE_EXTRACT=1 for full 85K extraction
+```
+---
+## Roadmap
+**v0.1.0 (current)**
+- [x] PCA+ITQ binary quantization + Rust search backend
+- [x] High-level API: register, search, verify
+- [x] Benchmarked against FAISS, HNSWLIB, USearch, ScaNN at 100K-1M
+- [x] 1:N identification on 44,290 distinct identities (MS1MV2)
+- [x] 5-point alignment via SCRFD/RetinaFace — 99.85% LFW accuracy
+**v0.2.0 (done)**
+- [x] Prebuilt wheels (`pip install faceflash`)
+- [x] Full 85K-identity benchmark (76,872 identities extracted, 44,291 with sufficient data)
+- [x] On-device memory measurement (3.05 MB binary index @100K with 256b)
+**v0.3.0 (done)**
+- [x] IVF coarse clustering (2.7-4.9x speedup at scale)
+- [x] AVX-512 VPOPCNTDQ — native 512-bit popcount (~3x faster than scalar)
+- [x] Batched search — 17x throughput at 500K-1M (multi-core + VPOPCNTDQ; cache-blocked)
+- [x] NEON kernels — ARM-optimized (vcntq_u8)
+**v1.0.0 — next**
+- [ ] Stable public API (no breaking changes)
+- [ ] DiskANN comparison
+- [ ] Mobile deployment (ONNX + CoreML)
+- [ ] Streaming insertion (add faces without refitting PCA)
+---
+## Contributing
+```bash
+# Dev setup (one command)
+git clone https://github.com/raghavenderreddygrudhanti/faceflash.git
+cd faceflash && python -m venv .venv && source .venv/bin/activate
+pip install -e ".[cpu,benchmark]" && maturin develop --release
+python -m pytest tests/  # 33 tests, ~17s
+```
+Open areas for contribution:
+| Area | Difficulty | Impact |
+|------|-----------|--------|
+| **DiskANN comparison** | Medium | High — the one competitor missing |
+| **Mobile deployment** (ONNX + CoreML) | Medium | High — iOS/Android face search |
+| **Streaming insertion** (no PCA refit) | Hard | High — online learning |
+| **GPU batched search** (CUDA) | Hard | Medium — 10M+ galleries |
+| **Raspberry Pi / Jetson benchmarks** | Easy | Medium — edge credibility |
+| **WebAssembly build** | Medium | Medium — browser face search |
+See [CONTRIBUTING.md](CONTRIBUTING.md) for coding guidelines.
+---
+## Credits & References
+FaceFlash does **not** introduce a new embedding model or hashing algorithm. It combines proven, published techniques into a CPU-efficient retrieval system — with hand-written Rust AVX-512/NEON kernels, cache-blocked batching, and rigorous, reproducible benchmarks. The contribution is the **system** (exact-recall face search in a megabyte-scale footprint) and the honest measurement of it, not the underlying math.
+It builds directly on:
+| Component | Reference |
+|-----------|-----------|
+| **Binary quantization** (PCA + ITQ — the core method) | Gong & Lazebnik, *"Iterative Quantization: A Procrustean Approach to Learning Binary Codes,"* CVPR 2011 |
+| **Face embeddings** | Deng, Guo, Xue & Zafeiriou, *"ArcFace: Additive Angular Margin Loss for Deep Face Recognition,"* CVPR 2019 — weights from [InsightFace](https://github.com/deepinsight/insightface) |
+| **Detection & 5-point alignment** | *RetinaFace* (CVPR 2020) / *SCRFD* (ICLR 2022), InsightFace |
+| **Dataset** | Guo et al., *"MS-Celeb-1M,"* ECCV 2016 (MS1MV2 = ArcFace's refined/cleaned version) |
+| **Verification benchmark** | Huang et al., *"Labeled Faces in the Wild (LFW),"* UMass TR 2007 |
+| **Baselines compared** | FAISS (Johnson et al.), HNSW (Malkov & Yashunin, TPAMI 2018), ScaNN (Guo et al., ICML 2020), [USearch](https://github.com/unum-cloud/usearch) |
+PCA dates to Pearson (1901) / Hotelling (1933); the Hamming distance to Hamming (1950); `POPCNT` / `VPOPCNTDQ` are Intel/AMD hardware instructions. FaceFlash's value is in how these are combined and implemented — not in inventing them.
+---
+## License
+MIT — see [LICENSE](LICENSE).
+If FaceFlash is useful to you, a star helps others find it.