vector-engine 1.0.0 (tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vector_engine-1.0.0/LICENSE +21 -0
- vector_engine-1.0.0/PKG-INFO +342 -0
- vector_engine-1.0.0/README.md +324 -0
- vector_engine-1.0.0/pyproject.toml +32 -0
- vector_engine-1.0.0/setup.cfg +4 -0
- vector_engine-1.0.0/tests/test_api_stability.py +42 -0
- vector_engine-1.0.0/tests/test_artifact_contracts.py +83 -0
- vector_engine-1.0.0/tests/test_core.py +39 -0
- vector_engine-1.0.0/tests/test_credibility_audit.py +80 -0
- vector_engine-1.0.0/tests/test_faiss_optional.py +27 -0
- vector_engine-1.0.0/tests/test_hardening.py +36 -0
- vector_engine-1.0.0/tests/test_ml_eval.py +37 -0
- vector_engine-1.0.0/tests/test_persistence_compat.py +37 -0
- vector_engine-1.0.0/tests/test_rag_reliability.py +31 -0
- vector_engine-1.0.0/tests/test_real_corpus_eval.py +78 -0
- vector_engine-1.0.0/tests/test_release_bundle.py +8 -0
- vector_engine-1.0.0/tests/test_v02_features.py +117 -0
- vector_engine-1.0.0/vector_engine/__init__.py +16 -0
- vector_engine-1.0.0/vector_engine/array.py +153 -0
- vector_engine-1.0.0/vector_engine/backends/__init__.py +13 -0
- vector_engine-1.0.0/vector_engine/backends/base.py +28 -0
- vector_engine-1.0.0/vector_engine/backends/bruteforce.py +107 -0
- vector_engine-1.0.0/vector_engine/backends/faiss_backend.py +123 -0
- vector_engine-1.0.0/vector_engine/backends/registry.py +15 -0
- vector_engine-1.0.0/vector_engine/eval/__init__.py +17 -0
- vector_engine-1.0.0/vector_engine/eval/retrieval.py +173 -0
- vector_engine-1.0.0/vector_engine/index.py +190 -0
- vector_engine-1.0.0/vector_engine/io/__init__.py +3 -0
- vector_engine-1.0.0/vector_engine/io/manifest.py +50 -0
- vector_engine-1.0.0/vector_engine/metric.py +58 -0
- vector_engine-1.0.0/vector_engine/ml/__init__.py +4 -0
- vector_engine-1.0.0/vector_engine/ml/clustering.py +56 -0
- vector_engine-1.0.0/vector_engine/ml/knn.py +71 -0
- vector_engine-1.0.0/vector_engine/results.py +15 -0
- vector_engine-1.0.0/vector_engine/training/__init__.py +3 -0
- vector_engine-1.0.0/vector_engine/training/hard_negative.py +140 -0
- vector_engine-1.0.0/vector_engine.egg-info/PKG-INFO +342 -0
- vector_engine-1.0.0/vector_engine.egg-info/SOURCES.txt +39 -0
- vector_engine-1.0.0/vector_engine.egg-info/dependency_links.txt +1 -0
- vector_engine-1.0.0/vector_engine.egg-info/requires.txt +12 -0
- vector_engine-1.0.0/vector_engine.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Vector Engine Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,342 @@
Metadata-Version: 2.4
Name: vector-engine
Version: 1.0.0
Summary: ML-first vector computation and retrieval engine.
Author: Vector Engine Contributors
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.7.4; (platform_system != "Darwin" or platform_machine != "arm64") and extra == "faiss"
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.3; extra == "ml"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Dynamic: license-file

# Vector Engine

ML-first vector computation and retrieval for Python.

Vector Engine provides one clean API for exact search, ANN backends, metadata-aware retrieval, and ML utilities such as kNN and retrieval metrics.

## Why this exists

- ANN libraries are powerful but low-level and backend-specific.
- Vector DBs solve infra and ops, but many ML workflows need fast local experimentation.
- Existing ML APIs do not offer a unified, backend-pluggable vector layer for embedding-heavy work.

## Install

```bash
pip install -e .
```

Optional extras:

```bash
pip install -e ".[dev,ml]"
pip install -e ".[faiss]"
```

## API contracts (v1.0 target)

- `VectorArray` accepts only 2D arrays with shape `(n, d)` where `n > 0` and `d > 0`.
- `VectorArray` IDs must be unique and must be `int` or `str`.
- Metadata length must always match the number of vectors.
- `VectorIndex.search(..., k=...)` requires `k` to be an integer `> 0`.
- Score direction is explicit in `Metric.higher_is_better`:
  - cosine/ip: higher is better
  - l2: lower is better
- `kmeans(..., random_state=...)` requires an integer seed and finite vectors.
- `mine_hard_negatives` supports `top1`, `topk_sample`, and `distance_band` strategies with explicit exclusion controls.
- Retrieval eval validates malformed ground-truth shapes/types with stable `eval_error` messages.
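The score-direction contract has a useful consequence: for row-normalized vectors, inner product equals cosine similarity, and the l2 ordering is the mirror image of the cosine ordering. A plain NumPy sketch of that identity (illustrative only, not package code):

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((100, 16)).astype("float32")
q = rng.standard_normal(16).astype("float32")

# Row-normalize, mirroring what `normalize=True` is described to do above.
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
q /= np.linalg.norm(q)

cos = xb @ q                          # cosine/ip scores: higher is better
l2 = np.linalg.norm(xb - q, axis=1)   # l2 distances: lower is better

# For unit vectors, ||x - q||^2 = 2 - 2 * cos(x, q), so the best cosine
# candidate is also the nearest l2 candidate.
```

Because the two orderings agree on unit vectors, normalizing once up front lets the same index answer either metric consistently.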
## 60-second quickstart

```python
import numpy as np
from vector_engine import VectorArray, VectorIndex

xb = VectorArray.from_numpy(
    np.random.randn(1000, 384).astype("float32"),
    ids=[f"doc-{i}" for i in range(1000)],
    normalize=True,
)
xq = VectorArray.from_numpy(np.random.randn(2, 384).astype("float32"), normalize=True)

index = VectorIndex.create(xb, metric="cosine", backend="bruteforce")
res = index.search(xq, k=5)
print(res.ids[0], res.scores[0])
```
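For intuition, the exact `bruteforce` + `cosine` path in the quickstart reduces to one matrix multiply plus a top-k sort over normalized vectors. A self-contained NumPy sketch of that idea (a reference sketch, not the package's actual backend):

```python
import numpy as np

def bruteforce_cosine_topk(xb, xq, k):
    """Exact top-k cosine search; returns (indices, scores) with scores descending."""
    xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    xq = xq / np.linalg.norm(xq, axis=1, keepdims=True)
    scores = xq @ xb.T                          # (nq, n) cosine similarities
    top = np.argsort(-scores, axis=1)[:, :k]    # indices of the k best per query
    return top, np.take_along_axis(scores, top, axis=1)

rng = np.random.default_rng(7)
xb = rng.standard_normal((1000, 32)).astype("float32")
xq = rng.standard_normal((2, 32)).astype("float32")
ids, scores = bruteforce_cosine_topk(xb, xq, k=5)
```

This is the reference any ANN backend is compared against in `overlap_vs_bruteforce`.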
## Core API

- `VectorArray`: canonical vector storage with IDs and metadata.
- `Metric`: built-in and custom metric definitions.
- `VectorIndex`: backend-agnostic build/add/search/save/load.
- `vector_engine.ml`: `knn_classify`, `knn_regress`, `kmeans`.
- `vector_engine.training`: `mine_hard_negatives` with configurable sampling strategies.
- `vector_engine.eval`: `precision_at_k`, `recall_at_k`, `ndcg_at_k`, `retrieval_report`, `retrieval_report_detailed`, `batch_metrics_summary`.
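For orientation, the standard binary-relevance definitions behind `precision_at_k`, `recall_at_k`, and `ndcg_at_k` fit in a few lines; this is a reference sketch under those assumptions, not the package's implementation:

```python
import math

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found in the top-k.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved, relevant, k):
    # Binary-relevance DCG with a log2 rank discount, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

retrieved = ["doc-3", "doc-1", "doc-9"]
relevant = {"doc-1", "doc-9"}
```

With this example, two of the three retrieved documents are relevant and both relevant documents are found, so precision@3 is 2/3 and recall@3 is 1.0.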
## Backend support matrix

| Backend | Search | Add | Save/Load | Custom Metric |
| --- | ---: | ---: | ---: | ---: |
| `bruteforce` | yes | yes | yes | yes |
| `faiss` | yes | yes | yes | no |

## Examples and notebooks

- `notebooks/01_semantic_search.ipynb`
- `notebooks/02_knn_baseline.ipynb`
- `notebooks/03_recommender_similarity.ipynb`
## Benchmarks

Run:

```bash
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact
```

With exact-overlap gate and artifact output:

```bash
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact --min-flat-overlap 0.99 --output artifacts/benchmark_exact.json
```

Benchmark matrix (publishable aggregate):

```bash
python scripts/benchmark_matrix.py --mode exact --warmup 2 --loops 8 --seed 7 --min-flat-overlap 0.99 --output-dir artifacts/benchmark_matrix
```

Compose publishable summary bundle:

```bash
python scripts/publishable_results.py --matrix-summary artifacts/benchmark_matrix/matrix_summary.json --stability-summary artifacts/testing_runs/stability_summary_bruteforce_200.json --output artifacts/benchmark_matrix/publishable_results.v1.json
```

ANN mode (optional):

```bash
python benchmarks/compare_bruteforce_vs_faiss.py --mode ann
```

The benchmark reports:

- `qps`: queries per second (higher is better)
- `latency_p50_ms` and `latency_p95_ms`: median and tail latency (lower is better)
- `overlap_vs_bruteforce`: top-k neighbor overlap against exact brute-force (closer to `1.0` is better)
- `memory_mb_estimate`: coarse in-process memory estimate for vector/query buffers

Recommended protocol for publishable results:

- Use a fixed seed and record hardware notes.
- Warm up before timed runs.
- Run at least 3 repeated trials and report median numbers.
- Keep dataset size (`n`, `d`, `nq`, `k`) fixed across backend comparisons.
- Include machine-readable matrix summary artifacts in release evidence.
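The reported quantities are simple to derive from raw per-query timings and result sets. A hedged sketch of the arithmetic (the names below are illustrative, not the benchmark script's internals):

```python
import numpy as np

def summarize_latencies(latencies_ms):
    """qps plus p50/p95 from per-query wall-clock latencies in milliseconds."""
    lat = np.asarray(latencies_ms, dtype=float)
    return {
        "qps": len(lat) / (lat.sum() / 1000.0),       # queries per second
        "latency_p50_ms": float(np.percentile(lat, 50)),
        "latency_p95_ms": float(np.percentile(lat, 95)),
    }

def overlap_vs_bruteforce(ann_ids, exact_ids):
    """Mean top-k set overlap between an ANN run and the exact reference."""
    overlaps = [len(set(a) & set(e)) / len(e) for a, e in zip(ann_ids, exact_ids)]
    return sum(overlaps) / len(overlaps)

stats = summarize_latencies([0.8, 1.0, 1.2, 5.0])
ov = overlap_vs_bruteforce([[1, 2, 3], [4, 5, 6]], [[1, 2, 9], [4, 5, 6]])
```

Note how a single slow query pulls p95 far above p50, which is exactly why the protocol above recommends repeated trials and medians.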
## Validation snapshot

Artifacts produced in this repo:

- Real-corpus style 3-run reports:
  - `artifacts/real_corpus_runs/run_1.json`
  - `artifacts/real_corpus_runs/run_2.json`
  - `artifacts/real_corpus_runs/run_3.json`
- Faiss Flat exact-equivalence checks (3 runs):
  - `artifacts/faiss_equivalence/run_1.json`
  - `artifacts/faiss_equivalence/run_2.json`
  - `artifacts/faiss_equivalence/run_3.json`
- 200-run stability study:
  - `artifacts/testing_runs/stability_runs_bruteforce_200.jsonl`
  - `artifacts/testing_runs/stability_summary_bruteforce_200.json`
  - `artifacts/testing_runs/stability_plot_p95_qps.png`
- Matrix benchmark summary:
  - `artifacts/benchmark_matrix/matrix_summary.json`
  - `artifacts/benchmark_matrix/publishable_results.v1.json`

Observed outcomes for the current mock/public-safe corpus:

- 3-run quality is identical across runs (`recall@1/3/6 = 1.0`, `ndcg@1/3/6 = 1.0`).
- 3-run latency envelope: p95 ranges from `0.0376 ms` to `0.0717 ms`.
- Faiss Flat exact mode achieves `overlap_vs_bruteforce = 1.0` in all 3 runs with `--min-flat-overlap 0.99`.
- In exact benchmark runs (`n=10000`, `d=128`, `nq=200`, `k=10`), Faiss Flat p95 latency is `4.17-15.03 ms` vs `29.99-37.63 ms` for bruteforce.
- 200-run bruteforce study: p95 mean `0.0255 ms` (95% interval `0.0203-0.0547 ms`), QPS mean `188,097` (95% interval `117,499-214,111`).

Run the 200-run study:

```bash
python3 scripts/stability_runs.py \
  --embeddings artifacts/real_corpus_inputs/embeddings.npy \
  --query-embeddings artifacts/real_corpus_inputs/query_embeddings.npy \
  --ids artifacts/real_corpus_inputs/ids.json \
  --ground-truth artifacts/real_corpus_inputs/ground_truth.json \
  --metadata artifacts/real_corpus_inputs/metadata.json \
  --backend bruteforce \
  --k 6 \
  --ks 1,3,6 \
  --loops 5 \
  --run-count 200 \
  --output-dir artifacts/testing_runs \
  --threshold-recall 0.75 \
  --threshold-ndcg 0.70 \
  --threshold-p95-ms 120
```

Example result table format:

| Backend | QPS | p50 ms | p95 ms | overlap@k vs brute-force |
| --- | ---: | ---: | ---: | ---: |
| bruteforce | ... | ... | ... | 1.000 |
| faiss_flat | ... | ... | ... | ... |
| faiss_ivf (optional) | ... | ... | ... | ... |
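One common way to produce a mean plus a 95% interval from 200 per-run summaries is to take empirical percentiles over the run-level values. A sketch under that assumption (the stability script's actual aggregation method is not shown in this README; the synthetic data below is illustrative only):

```python
import numpy as np

# Stand-in for 200 per-run p95 latencies (ms); real values come from the
# stability study's JSONL output, not from this distribution.
rng = np.random.default_rng(42)
p95_runs_ms = rng.lognormal(mean=-3.6, sigma=0.2, size=200)

mean_ms = float(p95_runs_ms.mean())
lo_ms, hi_ms = np.percentile(p95_runs_ms, [2.5, 97.5])
```

Percentile intervals are robust to the long right tail typical of latency data, which a mean-plus-standard-deviation interval would misrepresent.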
## Integration quickstarts

### Local RAG app path

```bash
pip install -e ".[dev,ml]"
python examples/minimal_rag_integration.py
```

### Batch evaluation path

```bash
python scripts/rag_real_corpus_eval.py \
  --embeddings artifacts/real_corpus_inputs/embeddings.npy \
  --query-embeddings artifacts/real_corpus_inputs/query_embeddings.npy \
  --ids artifacts/real_corpus_inputs/ids.json \
  --ground-truth artifacts/real_corpus_inputs/ground_truth.json \
  --metadata artifacts/real_corpus_inputs/metadata.json \
  --output artifacts/real_corpus_runs/run_1.json \
  --backend bruteforce \
  --k 6 \
  --ks 1,3,6 \
  --loops 5 \
  --threshold-recall 0.75 \
  --threshold-ndcg 0.70 \
  --threshold-p95-ms 120
```

### Benchmark interpretation path

```bash
python benchmarks/compare_bruteforce_vs_faiss.py \
  --mode exact \
  --min-flat-overlap 0.99 \
  --output artifacts/faiss_equivalence/run_1.json
```

- If `overlap_vs_bruteforce` is near `1.0`, approximation risk is low for that configuration.
- Use `latency_p95_ms` for user-facing SLO decisions.
- Use repeated runs and median values before publishing backend comparisons.

### Minimal production path (copy-paste)

```bash
pip install -e ".[dev,ml]"
python scripts/rag_baseline.py --output-dir artifacts --k 3
python scripts/rag_real_corpus_eval.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --output artifacts/real_corpus_runs/run_1.json --backend bruteforce --k 10 --ks 1,5,10 --loops 5
python scripts/stability_runs.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --backend bruteforce --run-count 200 --output-dir artifacts/testing_runs
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact --min-flat-overlap 0.99 --output artifacts/faiss_equivalence/run_1.json
python scripts/benchmark_matrix.py --mode exact --warmup 2 --loops 8 --seed 7 --min-flat-overlap 0.99 --output-dir artifacts/benchmark_matrix
python scripts/publishable_results.py --matrix-summary artifacts/benchmark_matrix/matrix_summary.json --stability-summary artifacts/testing_runs/stability_summary_bruteforce_200.json --output artifacts/benchmark_matrix/publishable_results.v1.json
```

Expected artifacts:

- `artifacts/rag_baseline_metrics.v1.json`
- `artifacts/real_corpus_runs/run_*.json`
- `artifacts/testing_runs/stability_summary_*.json`
- `artifacts/faiss_equivalence/run_*.json`
- `artifacts/benchmark_matrix/matrix_summary.json`
- `artifacts/benchmark_matrix/publishable_results.v1.json`

Further reading:

- `docs/integration_guides.md`
- `docs/reproducibility.md`
- `docs/kpi_charter.md`
- `docs/research_claims.md`
- `docs/credibility_audit.md`
- `docs/limitations.md`
- `docs/releases/v1.0.0.md`
- `docs/paper/reproducibility_appendix.md`
## Publication and release bundle

Generate a release-bundle manifest that checks required docs/governance/evidence files:

```bash
python scripts/build_release_bundle.py --output-dir artifacts/release_bundle
```

## Artifact policy (publish vs private)

- Safe to publish:
  - benchmark result summaries
  - stability aggregate summaries
  - synthetic/mock input examples
- Keep private:
  - real corpus raw embeddings
  - query embeddings derived from private data
  - sensitive metadata and ID mappings
- Recommended:
  - commit docs + summary metrics in repo
  - keep private input blobs in external storage

## Project adoption checklist

- Install: `pip install -e ".[dev,ml]"` and optional `.[faiss]`.
- Validation: run `pytest -q`.
- Quality baseline: run `python scripts/rag_baseline.py`.
- Real corpus eval: run `python scripts/rag_real_corpus_eval.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --threshold-recall 0.75 --threshold-ndcg 0.70 --threshold-p95-ms 120`.
- Persistence: verify `VectorIndex.save/load` on your own embeddings snapshot.
- Performance: run the benchmark script with your target `n`, `d`, `nq`, `k`.
- Integration: run `python examples/minimal_rag_integration.py`.
## Feature snapshot

- `kmeans` returns rich outputs (`labels`, `centers`, `inertia`, `n_iter`) with deterministic validation.
- Hard-negative mining supports `top1`, `topk_sample`, and `distance_band`, plus `exclude_ids` / `exclude_mask`.
- Retrieval evaluation includes `retrieval_report_detailed(include_per_query=...)` and `batch_metrics_summary(include_std=True)`.
- Public demo bootstrap is available under `demo_repo_template/`.
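The `top1` mining strategy listed above amounts to: for each query, take the highest-scoring candidate that is neither a known positive nor explicitly excluded. A plain NumPy sketch of that idea (a hypothetical helper, not `mine_hard_negatives` itself):

```python
import numpy as np

def mine_top1_hard_negatives(scores, positives, exclude_mask=None):
    """scores: (nq, n) similarity matrix; positives: per-query sets of column ids.

    Returns, per query, the index of the best-scoring candidate that is
    neither a positive nor excluded -- the classic `top1` hard negative.
    """
    scores = scores.copy()
    if exclude_mask is not None:
        scores[:, exclude_mask] = -np.inf       # explicit exclusion controls
    for q, pos in enumerate(positives):
        scores[q, list(pos)] = -np.inf          # never mine a positive as a negative
    return np.argmax(scores, axis=1)

scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.95, 0.7]])
negs = mine_top1_hard_negatives(scores, positives=[{0}, {1}])
```

Here query 0's positive (column 0) is masked out, so its hardest negative is column 1; query 1's is column 2.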
## v1.0 readiness gates

- Benchmark matrix artifacts produced with fixed protocol and environment metadata.
- Stability harness demonstrates repeatability for latency/QPS/quality summaries.
- API stability contract documented in `docs/api.md` and enforced in `tests/test_api_stability.py`.
- Release packaging includes reproducible command blocks and artifact policy.

## Governance and trust

- `LICENSE`
- `CITATION.cff`

## Error cases

Stable error prefixes are used for fast debugging:

- `vector_array_error`: malformed array, IDs, metadata, subset lookup
- `metric_error`: unsupported or invalid metric definitions
- `index_error`: index lifecycle/search/add/persistence consistency issues
- `manifest_error`: missing/unsupported manifest fields or version
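Stable prefixes make failures easy to route programmatically. A hedged sketch of the pattern (assuming the prefixes appear at the start of the exception message, which this README does not explicitly guarantee):

```python
def error_category(message):
    """Map a prefixed error message to its category, or None if unprefixed."""
    prefixes = ("vector_array_error", "metric_error", "index_error",
                "manifest_error", "eval_error")
    for prefix in prefixes:
        if message.startswith(prefix):
            return prefix
    return None

category = error_category("index_error: k must be a positive integer")
```

Routing on a short, stable prefix rather than the full message keeps such handling robust as wording evolves between releases.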
## Troubleshooting

- **Faiss not available**
  - Install with `pip install -e ".[faiss]"`.
- **Dimension mismatch at search/add**
  - Ensure both base vectors and query vectors use the same embedding dimension.
- **Metric confusion**
  - For cosine similarity, pass normalized vectors or set `normalize=True`.
- **Persistence load failure**
  - Check manifest version compatibility and whether artifacts were modified after save.
@@ -0,0 +1,324 @@
|
|
|
1
|
+
# Vector Engine
|
|
2
|
+
|
|
3
|
+
ML-first vector computation and retrieval for Python.
|
|
4
|
+
|
|
5
|
+
Vector Engine provides one clean API for exact search, ANN backends, metadata-aware retrieval, and ML utilities such as kNN and retrieval metrics.
|
|
6
|
+
|
|
7
|
+
## Why this exists
|
|
8
|
+
|
|
9
|
+
- ANN libraries are powerful but low-level and backend-specific.
|
|
10
|
+
- Vector DBs solve infra and ops, but many ML workflows need fast local experimentation.
|
|
11
|
+
- Existing ML APIs do not offer a unified, backend-pluggable vector layer for embedding-heavy work.
|
|
12
|
+
|
|
13
|
+
## Install
|
|
14
|
+
|
|
15
|
+
```bash
|
|
16
|
+
pip install -e .
|
|
17
|
+
```
|
|
18
|
+
|
|
19
|
+
Optional extras:
|
|
20
|
+
|
|
21
|
+
```bash
|
|
22
|
+
pip install -e ".[dev,ml]"
|
|
23
|
+
pip install -e ".[faiss]"
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
## API contracts (v1.0 target)
|
|
27
|
+
|
|
28
|
+
- `VectorArray` accepts only 2D arrays with shape `(n, d)` where `n > 0` and `d > 0`.
|
|
29
|
+
- `VectorArray` IDs must be unique and must be `int` or `str`.
|
|
30
|
+
- Metadata length must always align with number of vectors.
|
|
31
|
+
- `VectorIndex.search(..., k=...)` requires `k` to be an integer `> 0`.
|
|
32
|
+
- Score direction is explicit in `Metric.higher_is_better`:
|
|
33
|
+
- cosine/ip: higher is better
|
|
34
|
+
- l2: lower is better
|
|
35
|
+
- `kmeans(..., random_state=...)` requires an integer seed and finite vectors.
|
|
36
|
+
- `mine_hard_negatives` supports `top1`, `topk_sample`, and `distance_band` strategies with explicit exclusion controls.
|
|
37
|
+
- Retrieval eval validates malformed ground-truth shapes/types with stable `eval_error` messages.
|
|
38
|
+
|
|
39
|
+
## 60-second quickstart
|
|
40
|
+
|
|
41
|
+
```python
|
|
42
|
+
import numpy as np
|
|
43
|
+
from vector_engine import VectorArray, VectorIndex
|
|
44
|
+
|
|
45
|
+
xb = VectorArray.from_numpy(
|
|
46
|
+
np.random.randn(1000, 384).astype("float32"),
|
|
47
|
+
ids=[f"doc-{i}" for i in range(1000)],
|
|
48
|
+
normalize=True,
|
|
49
|
+
)
|
|
50
|
+
xq = VectorArray.from_numpy(np.random.randn(2, 384).astype("float32"), normalize=True)
|
|
51
|
+
|
|
52
|
+
index = VectorIndex.create(xb, metric="cosine", backend="bruteforce")
|
|
53
|
+
res = index.search(xq, k=5)
|
|
54
|
+
print(res.ids[0], res.scores[0])
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
## Core API
|
|
58
|
+
|
|
59
|
+
- `VectorArray`: canonical vector storage with IDs and metadata.
|
|
60
|
+
- `Metric`: built-in and custom metric definitions.
|
|
61
|
+
- `VectorIndex`: backend-agnostic build/add/search/save/load.
|
|
62
|
+
- `vector_engine.ml`: `knn_classify`, `knn_regress`, `kmeans`.
|
|
63
|
+
- `vector_engine.training`: `mine_hard_negatives` with configurable sampling strategies.
|
|
64
|
+
- `vector_engine.eval`: `precision_at_k`, `recall_at_k`, `ndcg_at_k`, `retrieval_report`, `retrieval_report_detailed`, `batch_metrics_summary`.
|
|
65
|
+
|
|
66
|
+
## Backend support matrix
|
|
67
|
+
|
|
68
|
+
| Backend | Search | Add | Save/Load | Custom Metric |
|
|
69
|
+
| --- | ---: | ---: | ---: | ---: |
|
|
70
|
+
| `bruteforce` | yes | yes | yes | yes |
|
|
71
|
+
| `faiss` | yes | yes | yes | no |
|
|
72
|
+
|
|
73
|
+
## Examples and notebooks
|
|
74
|
+
|
|
75
|
+
- `notebooks/01_semantic_search.ipynb`
|
|
76
|
+
- `notebooks/02_knn_baseline.ipynb`
|
|
77
|
+
- `notebooks/03_recommender_similarity.ipynb`
|
|
78
|
+
|
|
79
|
+
## Benchmarks
|
|
80
|
+
|
|
81
|
+
Run:
|
|
82
|
+
|
|
83
|
+
```bash
|
|
84
|
+
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
With exact-overlap gate and artifact output:
|
|
88
|
+
|
|
89
|
+
```bash
|
|
90
|
+
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact --min-flat-overlap 0.99 --output artifacts/benchmark_exact.json
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
Benchmark matrix (publishable aggregate):
|
|
94
|
+
|
|
95
|
+
```bash
|
|
96
|
+
python scripts/benchmark_matrix.py --mode exact --warmup 2 --loops 8 --seed 7 --min-flat-overlap 0.99 --output-dir artifacts/benchmark_matrix
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
Compose publishable summary bundle:
|
|
100
|
+
|
|
101
|
+
```bash
|
|
102
|
+
python scripts/publishable_results.py --matrix-summary artifacts/benchmark_matrix/matrix_summary.json --stability-summary artifacts/testing_runs/stability_summary_bruteforce_200.json --output artifacts/benchmark_matrix/publishable_results.v1.json
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
ANN mode (optional):
|
|
106
|
+
|
|
107
|
+
```bash
|
|
108
|
+
python benchmarks/compare_bruteforce_vs_faiss.py --mode ann
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
The benchmark reports:
|
|
112
|
+
|
|
113
|
+
- `qps`: queries per second (higher is better)
|
|
114
|
+
- `latency_p50_ms` and `latency_p95_ms`: median and tail latency (lower is better)
|
|
115
|
+
- `overlap_vs_bruteforce`: top-k neighbor overlap against exact brute-force (closer to `1.0` is better)
|
|
116
|
+
- `memory_mb_estimate`: coarse in-process memory estimate for vector/query buffers
|
|
117
|
+
|
|
118
|
+
Recommended protocol for publishable results:
|
|
119
|
+
|
|
120
|
+
- Use fixed seed and fixed hardware notes.
|
|
121
|
+
- Warm up before timed runs.
|
|
122
|
+
- Run at least 3 repeated trials and report median numbers.
|
|
123
|
+
- Keep dataset size (`n`, `d`, `nq`, `k`) fixed across backend comparisons.
|
|
124
|
+
- Include machine-readable matrix summary artifacts in release evidence.
|
|
125
|
+
|
|
126
|
+
## Validation snapshot
|
|
127
|
+
|
|
128
|
+
Artifacts produced in this repo:
|
|
129
|
+
|
|
130
|
+
- Real-corpus style 3-run reports:
|
|
131
|
+
- `artifacts/real_corpus_runs/run_1.json`
|
|
132
|
+
- `artifacts/real_corpus_runs/run_2.json`
|
|
133
|
+
- `artifacts/real_corpus_runs/run_3.json`
|
|
134
|
+
- Faiss Flat exact-equivalence checks (3 runs):
|
|
135
|
+
- `artifacts/faiss_equivalence/run_1.json`
|
|
136
|
+
- `artifacts/faiss_equivalence/run_2.json`
|
|
137
|
+
- `artifacts/faiss_equivalence/run_3.json`
|
|
138
|
+
- 200-run stability study:
|
|
139
|
+
- `artifacts/testing_runs/stability_runs_bruteforce_200.jsonl`
|
|
140
|
+
- `artifacts/testing_runs/stability_summary_bruteforce_200.json`
|
|
141
|
+
- `artifacts/testing_runs/stability_plot_p95_qps.png`
|
|
142
|
+
- Matrix benchmark summary:
|
|
143
|
+
- `artifacts/benchmark_matrix/matrix_summary.json`
|
|
144
|
+
- `artifacts/benchmark_matrix/publishable_results.v1.json`
|
|
145
|
+
|
|
146
|
+
Observed outcomes for current mock/public-safe corpus:
|
|
147
|
+
|
|
148
|
+
- 3-run quality is identical across runs (`recall@1/3/6 = 1.0`, `ndcg@1/3/6 = 1.0`).
|
|
149
|
+
- 3-run latency envelope: p95 ranges from `0.0376 ms` to `0.0717 ms`.
|
|
150
|
+
- Faiss Flat exact mode achieves `overlap_vs_bruteforce = 1.0` for all 3 runs with `--min-flat-overlap 0.99`.
|
|
151
|
+
- In exact benchmark runs (`n=10000`, `d=128`, `nq=200`, `k=10`), Faiss Flat p95 latency is `4.17-15.03 ms` vs bruteforce `29.99-37.63 ms`.
|
|
152
|
+
- 200-run bruteforce study: p95 mean `0.0255 ms` (95% interval `0.0203-0.0547 ms`), QPS mean `188,097` (95% interval `117,499-214,111`).
|
|
153
|
+
|
|
154
|
+
Run the 200-run study:
|
|
155
|
+
|
|
156
|
+
```bash
|
|
157
|
+
python3 scripts/stability_runs.py \
|
|
158
|
+
--embeddings artifacts/real_corpus_inputs/embeddings.npy \
|
|
159
|
+
--query-embeddings artifacts/real_corpus_inputs/query_embeddings.npy \
|
|
160
|
+
--ids artifacts/real_corpus_inputs/ids.json \
|
|
161
|
+
--ground-truth artifacts/real_corpus_inputs/ground_truth.json \
|
|
162
|
+
--metadata artifacts/real_corpus_inputs/metadata.json \
|
|
163
|
+
--backend bruteforce \
|
|
164
|
+
--k 6 \
|
|
165
|
+
--ks 1,3,6 \
|
|
166
|
+
--loops 5 \
|
|
167
|
+
--run-count 200 \
|
|
168
|
+
--output-dir artifacts/testing_runs \
|
|
169
|
+
--threshold-recall 0.75 \
|
|
170
|
+
--threshold-ndcg 0.70 \
|
|
171
|
+
--threshold-p95-ms 120
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
Example result table format:
|
|
175
|
+
|
|
176
|
+
| Backend | QPS | p50 ms | p95 ms | overlap@k vs brute-force |
|
|
177
|
+
| --- | ---: | ---: | ---: | ---: |
|
|
178
|
+
| bruteforce | ... | ... | ... | 1.000 |
|
|
179
|
+
| faiss_flat | ... | ... | ... | ... |
|
|
180
|
+
| faiss_ivf (optional) | ... | ... | ... | ... |
|
|
181
|
+
|
|
182
|
+
## Integration quickstarts
|
|
183
|
+
|
|
184
|
+
### Local RAG app path
|
|
185
|
+
|
|
186
|
+
```bash
|
|
187
|
+
pip install -e ".[dev,ml]"
|
|
188
|
+
python examples/minimal_rag_integration.py
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
### Batch evaluation path
|
|
192
|
+
|
|
193
|
+
```bash
|
|
194
|
+
python scripts/rag_real_corpus_eval.py \
|
|
195
|
+
--embeddings artifacts/real_corpus_inputs/embeddings.npy \
|
|
196
|
+
--query-embeddings artifacts/real_corpus_inputs/query_embeddings.npy \
|
|
197
|
+
--ids artifacts/real_corpus_inputs/ids.json \
|
|
198
|
+
--ground-truth artifacts/real_corpus_inputs/ground_truth.json \
|
|
199
|
+
--metadata artifacts/real_corpus_inputs/metadata.json \
|
|
200
|
+
--output artifacts/real_corpus_runs/run_1.json \
|
|
201
|
+
--backend bruteforce \
|
|
202
|
+
--k 6 \
|
|
203
|
+
--ks 1,3,6 \
|
|
204
|
+
--loops 5 \
|
|
205
|
+
--threshold-recall 0.75 \
|
|
206
|
+
--threshold-ndcg 0.70 \
|
|
207
|
+
--threshold-p95-ms 120
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
### Benchmark interpretation path
|
|
211
|
+
|
|
212
|
+
```bash
|
|
213
|
+
python benchmarks/compare_bruteforce_vs_faiss.py \
|
|
214
|
+
--mode exact \
|
|
215
|
+
--min-flat-overlap 0.99 \
|
|
216
|
+
--output artifacts/faiss_equivalence/run_1.json
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
- If `overlap_vs_bruteforce` is near `1.0`, approximation risk is low for that configuration.
|
|
220
|
+
- Use `latency_p95_ms` for user-facing SLO decisions.
|
|
221
|
+
- Use repeated runs + median values before publishing backend comparisons.
|
|
222
|
+
|
|
223
|
+
### Minimal production path (copy-paste)
|
|
224
|
+
|
|
225
|
+
```bash
|
|
226
|
+
pip install -e ".[dev,ml]"
|
|
227
|
+
python scripts/rag_baseline.py --output-dir artifacts --k 3
|
|
228
|
+
python scripts/rag_real_corpus_eval.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --output artifacts/real_corpus_runs/run_1.json --backend bruteforce --k 10 --ks 1,5,10 --loops 5
|
|
229
|
+
python scripts/stability_runs.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --backend bruteforce --run-count 200 --output-dir artifacts/testing_runs
|
|
230
|
+
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact --min-flat-overlap 0.99 --output artifacts/faiss_equivalence/run_1.json
|
|
231
|
+
python scripts/benchmark_matrix.py --mode exact --warmup 2 --loops 8 --seed 7 --min-flat-overlap 0.99 --output-dir artifacts/benchmark_matrix
|
|
232
|
+
python scripts/publishable_results.py --matrix-summary artifacts/benchmark_matrix/matrix_summary.json --stability-summary artifacts/testing_runs/stability_summary_bruteforce_200.json --output artifacts/benchmark_matrix/publishable_results.v1.json
|
|
233
|
+
```
|
|
234
|
+
|
|
235
|
+
Expected artifacts:
|
|
236
|
+
|
|
237
|
+
- `artifacts/rag_baseline_metrics.v1.json`
|
|
238
|
+
- `artifacts/real_corpus_runs/run_*.json`
|
|
239
|
+
- `artifacts/testing_runs/stability_summary_*.json`
|
|
240
|
+
- `artifacts/faiss_equivalence/run_*.json`
|
|
241
|
+
- `artifacts/benchmark_matrix/matrix_summary.json`
|
|
242
|
+
- `artifacts/benchmark_matrix/publishable_results.v1.json`

Further reading:

- `docs/integration_guides.md`
- `docs/reproducibility.md`
- `docs/kpi_charter.md`
- `docs/research_claims.md`
- `docs/credibility_audit.md`
- `docs/limitations.md`
- `docs/releases/v1.0.0.md`
- `docs/paper/reproducibility_appendix.md`

## Publication and release bundle

Generate a release-bundle manifest that checks required docs/governance/evidence files:

```bash
python scripts/build_release_bundle.py --output-dir artifacts/release_bundle
```

## Artifact policy (publish vs private)

- Safe to publish:
  - benchmark result summaries
  - stability aggregate summaries
  - synthetic/mock input examples
- Keep private:
  - real corpus raw embeddings
  - query embeddings derived from private data
  - sensitive metadata and ID mappings
- Recommended:
  - commit docs + summary metrics in the repo
  - keep private input blobs in external storage

## Project adoption checklist

- Install: `pip install -e ".[dev,ml]"` and optional `.[faiss]`.
- Validation: run `pytest -q`.
- Quality baseline: run `python scripts/rag_baseline.py`.
- Real corpus eval: run `python scripts/rag_real_corpus_eval.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --threshold-recall 0.75 --threshold-ndcg 0.70 --threshold-p95-ms 120`.
- Persistence: verify `VectorIndex.save/load` on your own embeddings snapshot.
- Performance: run the benchmark script with your target `n`, `d`, `nq`, `k`.
- Integration: run `python examples/minimal_rag_integration.py`.

## Feature snapshot

- `kmeans` returns rich outputs (`labels`, `centers`, `inertia`, `n_iter`) with deterministic validation.
- Hard-negative mining supports `top1`, `topk_sample`, and `distance_band`, plus `exclude_ids` / `exclude_mask`.
- Retrieval evaluation includes `retrieval_report_detailed(include_per_query=...)` and `batch_metrics_summary(include_std=True)`.
- Public demo bootstrap is available under `demo_repo_template/`.
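
To make the `top1` strategy concrete, here is a minimal dependency-free sketch of the idea; the actual `vector_engine` signatures and distance handling may differ:

```python
def mine_top1_hard_negative(query_vec, corpus, positive_ids, exclude_ids=()):
    """Return the nearest corpus item that is neither a positive nor excluded.

    `corpus` maps id -> vector; squared Euclidean distance keeps the sketch simple.
    """
    banned = set(positive_ids) | set(exclude_ids)
    candidates = (
        (sum((q - c) ** 2 for q, c in zip(query_vec, vec)), cid)
        for cid, vec in corpus.items()
        if cid not in banned
    )
    return min(candidates)[1]

corpus = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.9, 0.1]}
# "b" is the positive, so the closest remaining item is the hard negative.
print(mine_top1_hard_negative([1.0, 0.0], corpus, positive_ids=["b"]))  # c
```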

## v1.0 readiness gates

- Benchmark matrix artifacts produced with a fixed protocol and environment metadata.
- Stability harness demonstrates repeatability for latency/QPS/quality summaries.
- API stability contract documented in `docs/api.md` and enforced in `tests/test_api_stability.py`.
- Release packaging includes reproducible command blocks and an artifact policy.

## Governance and trust

- `LICENSE`
- `CITATION.cff`

## Error cases

Stable error prefixes are used for fast debugging:

- `vector_array_error`: malformed arrays, IDs, metadata, or subset lookups
- `metric_error`: unsupported or invalid metric definitions
- `index_error`: index lifecycle, search, add, or persistence consistency issues
- `manifest_error`: missing or unsupported manifest fields or version
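
Because the prefixes are stable, a caller can branch on them without depending on concrete exception classes; the `ValueError` used here is only an example, not necessarily the type the library raises:

```python
PREFIXES = ("vector_array_error", "metric_error", "index_error", "manifest_error")

def classify_error(exc):
    """Map a raised error to a coarse category via its stable message prefix."""
    message = str(exc)
    for prefix in PREFIXES:
        if message.startswith(prefix):
            return prefix
    return "unknown"

print(classify_error(ValueError("index_error: dimension mismatch at add")))  # index_error
```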

## Troubleshooting

- **Faiss not available**
  - Install with `pip install -e ".[faiss]"`.
- **Dimension mismatch at search/add**
  - Ensure both base vectors and query vectors use the same embedding dimension.
- **Metric confusion**
  - For cosine similarity, pass normalized vectors or set `normalize=True`.
- **Persistence load failure**
  - Check manifest version compatibility and whether artifacts were modified after save.
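
The cosine fix works because L2-normalizing both vectors makes their inner product equal their cosine similarity. A minimal pure-Python sketch, independent of the library's `normalize=True` flag:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization, inner product and cosine similarity coincide.
print(round(dot(l2_normalize(a), l2_normalize(b)), 4))  # 0.96
```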
|