vector-engine 1.0.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,342 @@
Metadata-Version: 2.4
Name: vector-engine
Version: 1.0.0
Summary: ML-first vector computation and retrieval engine.
Author: Vector Engine Contributors
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.7.4; (platform_system != "Darwin" or platform_machine != "arm64") and extra == "faiss"
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.3; extra == "ml"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Dynamic: license-file

# Vector Engine

ML-first vector computation and retrieval for Python.

Vector Engine provides one clean API for exact search, ANN backends, metadata-aware retrieval, and ML utilities such as kNN and retrieval metrics.

## Why this exists

- ANN libraries are powerful but low-level and backend-specific.
- Vector DBs solve infra and ops, but many ML workflows need fast local experimentation.
- Existing ML APIs do not offer a unified, backend-pluggable vector layer for embedding-heavy work.

## Install

```bash
pip install -e .
```

Optional extras:

```bash
pip install -e ".[dev,ml]"
pip install -e ".[faiss]"
```

## API contracts (v1.0 target)

- `VectorArray` accepts only 2D arrays with shape `(n, d)` where `n > 0` and `d > 0`.
- `VectorArray` IDs must be unique and must be `int` or `str`.
- Metadata length must always match the number of vectors.
- `VectorIndex.search(..., k=...)` requires `k` to be an integer `> 0`.
- Score direction is explicit in `Metric.higher_is_better`:
  - cosine/ip: higher is better
  - l2: lower is better
- `kmeans(..., random_state=...)` requires an integer seed and finite vectors.
- `mine_hard_negatives` supports `top1`, `topk_sample`, and `distance_band` strategies with explicit exclusion controls.
- Retrieval eval validates malformed ground-truth shapes/types with stable `eval_error` messages.

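The score-direction contract can be checked with plain numpy. This sketch is independent of vector_engine's classes; it only illustrates why cosine/inner-product scores rise with similarity while L2 distances fall:

```python
import numpy as np

def cosine_score(a, b):
    # Cosine similarity: higher means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a, b):
    # Euclidean (L2) distance: lower means more similar.
    return float(np.linalg.norm(a - b))

q = np.array([1.0, 0.0])
near = np.array([0.9, 0.1])
far = np.array([-1.0, 0.0])

print(cosine_score(q, near), cosine_score(q, far))  # nearer vector scores higher
print(l2_distance(q, near), l2_distance(q, far))    # nearer vector scores lower
```

This is what `Metric.higher_is_better` encodes: consumers should compare scores only after checking the direction flag, never assume one convention.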
## 60-second quickstart

```python
import numpy as np
from vector_engine import VectorArray, VectorIndex

xb = VectorArray.from_numpy(
    np.random.randn(1000, 384).astype("float32"),
    ids=[f"doc-{i}" for i in range(1000)],
    normalize=True,
)
xq = VectorArray.from_numpy(np.random.randn(2, 384).astype("float32"), normalize=True)

index = VectorIndex.create(xb, metric="cosine", backend="bruteforce")
res = index.search(xq, k=5)
print(res.ids[0], res.scores[0])
```

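Conceptually, `normalize=True` corresponds to dividing each row by its L2 norm, so cosine similarity reduces to a plain inner product. A numpy sketch of that preprocessing (an illustration of the concept, not vector_engine's actual implementation):

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Divide each row by its L2 norm so every vector has unit length.
    # eps guards against division by zero for all-zero rows.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

xb = np.random.randn(5, 8).astype("float32")
unit = l2_normalize(xb)
print(np.linalg.norm(unit, axis=1))  # every row norm is ~1.0
```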
## Core API

- `VectorArray`: canonical vector storage with IDs and metadata.
- `Metric`: built-in and custom metric definitions.
- `VectorIndex`: backend-agnostic build/add/search/save/load.
- `vector_engine.ml`: `knn_classify`, `knn_regress`, `kmeans`.
- `vector_engine.training`: `mine_hard_negatives` with configurable sampling strategies.
- `vector_engine.eval`: `precision_at_k`, `recall_at_k`, `ndcg_at_k`, `retrieval_report`, `retrieval_report_detailed`, `batch_metrics_summary`.

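The eval helpers implement standard top-k retrieval metrics. Their exact signatures live in `vector_engine.eval`; the definitions themselves are the textbook ones and can be sketched in plain Python:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved IDs that are relevant.
    relevant_set = set(relevant)
    top = retrieved[:k]
    return sum(1 for r in top if r in relevant_set) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant IDs that appear in the top-k.
    top = set(retrieved[:k])
    return sum(1 for r in relevant if r in top) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9"]
relevant = ["d3", "d7"]
print(precision_at_k(retrieved, relevant, 2))  # 0.5: only d7 of the top-2 is relevant
print(recall_at_k(retrieved, relevant, 2))     # 0.5: d7 found, d3 not yet
```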
## Backend support matrix

| Backend | Search | Add | Save/Load | Custom Metric |
| --- | ---: | ---: | ---: | ---: |
| `bruteforce` | yes | yes | yes | yes |
| `faiss` | yes | yes | yes | no |

## Examples and notebooks

- `notebooks/01_semantic_search.ipynb`
- `notebooks/02_knn_baseline.ipynb`
- `notebooks/03_recommender_similarity.ipynb`

## Benchmarks

Run:

```bash
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact
```

With exact-overlap gate and artifact output:

```bash
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact --min-flat-overlap 0.99 --output artifacts/benchmark_exact.json
```

Benchmark matrix (publishable aggregate):

```bash
python scripts/benchmark_matrix.py --mode exact --warmup 2 --loops 8 --seed 7 --min-flat-overlap 0.99 --output-dir artifacts/benchmark_matrix
```

Compose publishable summary bundle:

```bash
python scripts/publishable_results.py --matrix-summary artifacts/benchmark_matrix/matrix_summary.json --stability-summary artifacts/testing_runs/stability_summary_bruteforce_200.json --output artifacts/benchmark_matrix/publishable_results.v1.json
```

ANN mode (optional):

```bash
python benchmarks/compare_bruteforce_vs_faiss.py --mode ann
```

The benchmark reports:

- `qps`: queries per second (higher is better)
- `latency_p50_ms` and `latency_p95_ms`: median and tail latency (lower is better)
- `overlap_vs_bruteforce`: top-k neighbor overlap against exact brute-force (closer to `1.0` is better)
- `memory_mb_estimate`: coarse in-process memory estimate for vector/query buffers

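The latency and throughput fields fall out of a list of per-query timings. A numpy sketch of how such a summary is conventionally derived (the field names match the report above; the exact computation inside the benchmark script is an assumption):

```python
import numpy as np

def summarize_latencies(latencies_s):
    # latencies_s: per-query wall-clock times in seconds.
    lat = np.asarray(latencies_s, dtype=np.float64)
    return {
        # Throughput: total queries over total time.
        "qps": len(lat) / float(lat.sum()),
        # Median and tail latency, converted to milliseconds.
        "latency_p50_ms": float(np.percentile(lat, 50)) * 1e3,
        "latency_p95_ms": float(np.percentile(lat, 95)) * 1e3,
    }

report = summarize_latencies([0.001, 0.002, 0.001, 0.004])
print(report)
```

Reporting p95 alongside p50 is what makes tail regressions visible; a backend can improve median latency while its p95 gets worse.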
Recommended protocol for publishable results:

- Use a fixed seed and record hardware notes.
- Warm up before timed runs.
- Run at least 3 repeated trials and report median numbers.
- Keep dataset size (`n`, `d`, `nq`, `k`) fixed across backend comparisons.
- Include machine-readable matrix summary artifacts in release evidence.

## Validation snapshot

Artifacts produced in this repo:

- Real-corpus style 3-run reports:
  - `artifacts/real_corpus_runs/run_1.json`
  - `artifacts/real_corpus_runs/run_2.json`
  - `artifacts/real_corpus_runs/run_3.json`
- Faiss Flat exact-equivalence checks (3 runs):
  - `artifacts/faiss_equivalence/run_1.json`
  - `artifacts/faiss_equivalence/run_2.json`
  - `artifacts/faiss_equivalence/run_3.json`
- 200-run stability study:
  - `artifacts/testing_runs/stability_runs_bruteforce_200.jsonl`
  - `artifacts/testing_runs/stability_summary_bruteforce_200.json`
  - `artifacts/testing_runs/stability_plot_p95_qps.png`
- Matrix benchmark summary:
  - `artifacts/benchmark_matrix/matrix_summary.json`
  - `artifacts/benchmark_matrix/publishable_results.v1.json`

Observed outcomes for the current mock/public-safe corpus:

- 3-run quality is identical across runs (`recall@1/3/6 = 1.0`, `ndcg@1/3/6 = 1.0`).
- 3-run latency envelope: p95 ranges from `0.0376 ms` to `0.0717 ms`.
- Faiss Flat exact mode achieves `overlap_vs_bruteforce = 1.0` for all 3 runs with `--min-flat-overlap 0.99`.
- In exact benchmark runs (`n=10000`, `d=128`, `nq=200`, `k=10`), Faiss Flat p95 latency is `4.17-15.03 ms` vs bruteforce `29.99-37.63 ms`.
- 200-run bruteforce study: p95 mean `0.0255 ms` (95% interval `0.0203-0.0547 ms`), QPS mean `188,097` (95% interval `117,499-214,111`).

Run the 200-run study:

```bash
python3 scripts/stability_runs.py \
  --embeddings artifacts/real_corpus_inputs/embeddings.npy \
  --query-embeddings artifacts/real_corpus_inputs/query_embeddings.npy \
  --ids artifacts/real_corpus_inputs/ids.json \
  --ground-truth artifacts/real_corpus_inputs/ground_truth.json \
  --metadata artifacts/real_corpus_inputs/metadata.json \
  --backend bruteforce \
  --k 6 \
  --ks 1,3,6 \
  --loops 5 \
  --run-count 200 \
  --output-dir artifacts/testing_runs \
  --threshold-recall 0.75 \
  --threshold-ndcg 0.70 \
  --threshold-p95-ms 120
```

Example result table format:

| Backend | QPS | p50 ms | p95 ms | overlap@k vs brute-force |
| --- | ---: | ---: | ---: | ---: |
| bruteforce | ... | ... | ... | 1.000 |
| faiss_flat | ... | ... | ... | ... |
| faiss_ivf (optional) | ... | ... | ... | ... |

## Integration quickstarts

### Local RAG app path

```bash
pip install -e ".[dev,ml]"
python examples/minimal_rag_integration.py
```

### Batch evaluation path

```bash
python scripts/rag_real_corpus_eval.py \
  --embeddings artifacts/real_corpus_inputs/embeddings.npy \
  --query-embeddings artifacts/real_corpus_inputs/query_embeddings.npy \
  --ids artifacts/real_corpus_inputs/ids.json \
  --ground-truth artifacts/real_corpus_inputs/ground_truth.json \
  --metadata artifacts/real_corpus_inputs/metadata.json \
  --output artifacts/real_corpus_runs/run_1.json \
  --backend bruteforce \
  --k 6 \
  --ks 1,3,6 \
  --loops 5 \
  --threshold-recall 0.75 \
  --threshold-ndcg 0.70 \
  --threshold-p95-ms 120
```

### Benchmark interpretation path

```bash
python benchmarks/compare_bruteforce_vs_faiss.py \
  --mode exact \
  --min-flat-overlap 0.99 \
  --output artifacts/faiss_equivalence/run_1.json
```

- If `overlap_vs_bruteforce` is near `1.0`, approximation risk is low for that configuration.
- Use `latency_p95_ms` for user-facing SLO decisions.
- Use repeated runs and median values before publishing backend comparisons.

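`overlap_vs_bruteforce` is a top-k set-overlap measure: for each query, the fraction of the approximate top-k IDs that also appear in the exact top-k, averaged over queries. A small sketch of that definition (the benchmark script's internals may differ, e.g. in tie handling):

```python
def topk_overlap(exact_ids, approx_ids):
    # exact_ids / approx_ids: per-query lists of top-k neighbor IDs.
    per_query = [
        len(set(e) & set(a)) / len(e)   # shared IDs over k, for one query
        for e, a in zip(exact_ids, approx_ids)
    ]
    return sum(per_query) / len(per_query)

exact = [["a", "b", "c"], ["d", "e", "f"]]
approx = [["a", "b", "x"], ["d", "e", "f"]]
print(topk_overlap(exact, approx))  # (2/3 + 3/3) / 2 ~= 0.833
```

An overlap of `1.0` means the approximate backend returned exactly the brute-force neighbor sets, which is why the exact-mode gate uses `--min-flat-overlap 0.99`.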
### Minimal production path (copy-paste)

```bash
pip install -e ".[dev,ml]"
python scripts/rag_baseline.py --output-dir artifacts --k 3
python scripts/rag_real_corpus_eval.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --output artifacts/real_corpus_runs/run_1.json --backend bruteforce --k 10 --ks 1,5,10 --loops 5
python scripts/stability_runs.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --backend bruteforce --run-count 200 --output-dir artifacts/testing_runs
python benchmarks/compare_bruteforce_vs_faiss.py --mode exact --min-flat-overlap 0.99 --output artifacts/faiss_equivalence/run_1.json
python scripts/benchmark_matrix.py --mode exact --warmup 2 --loops 8 --seed 7 --min-flat-overlap 0.99 --output-dir artifacts/benchmark_matrix
python scripts/publishable_results.py --matrix-summary artifacts/benchmark_matrix/matrix_summary.json --stability-summary artifacts/testing_runs/stability_summary_bruteforce_200.json --output artifacts/benchmark_matrix/publishable_results.v1.json
```

Expected artifacts:

- `artifacts/rag_baseline_metrics.v1.json`
- `artifacts/real_corpus_runs/run_*.json`
- `artifacts/testing_runs/stability_summary_*.json`
- `artifacts/faiss_equivalence/run_*.json`
- `artifacts/benchmark_matrix/matrix_summary.json`
- `artifacts/benchmark_matrix/publishable_results.v1.json`

Further reading:

- `docs/integration_guides.md`
- `docs/reproducibility.md`
- `docs/kpi_charter.md`
- `docs/research_claims.md`
- `docs/credibility_audit.md`
- `docs/limitations.md`
- `docs/releases/v1.0.0.md`
- `docs/paper/reproducibility_appendix.md`

## Publication and release bundle

Generate a release-bundle manifest that checks required docs/governance/evidence files:

```bash
python scripts/build_release_bundle.py --output-dir artifacts/release_bundle
```

## Artifact policy (publish vs private)

- Safe to publish:
  - benchmark result summaries
  - stability aggregate summaries
  - synthetic/mock input examples
- Keep private:
  - real corpus raw embeddings
  - query embeddings derived from private data
  - sensitive metadata and ID mappings
- Recommended:
  - commit docs + summary metrics in repo
  - keep private input blobs in external storage

## Project adoption checklist

- Install: `pip install -e ".[dev,ml]"` and optional `.[faiss]`.
- Validation: run `pytest -q`.
- Quality baseline: run `python scripts/rag_baseline.py`.
- Real corpus eval: run `python scripts/rag_real_corpus_eval.py --embeddings ... --query-embeddings ... --ids ... --ground-truth ... --threshold-recall 0.75 --threshold-ndcg 0.70 --threshold-p95-ms 120`.
- Persistence: verify `VectorIndex.save/load` on your own embeddings snapshot.
- Performance: run the benchmark script with your target `n`, `d`, `nq`, `k`.
- Integration: run `python examples/minimal_rag_integration.py`.

## Feature snapshot

- `kmeans` returns rich outputs (`labels`, `centers`, `inertia`, `n_iter`) with deterministic validation.
- Hard-negative mining supports `top1`, `topk_sample`, and `distance_band`, plus `exclude_ids` / `exclude_mask`.
- Retrieval evaluation includes `retrieval_report_detailed(include_per_query=...)` and `batch_metrics_summary(include_std=True)`.
- Public demo bootstrap is available under `demo_repo_template/`.

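The `top1` mining strategy picks, for each query, the highest-scoring corpus item that is not a known positive (and not explicitly excluded). A numpy sketch of that idea, not the `mine_hard_negatives` implementation:

```python
import numpy as np

def mine_top1_negatives(query_vecs, corpus_vecs, positive_idx, exclude_idx=()):
    # Score every query against every corpus vector (inner product),
    # mask out positives/exclusions, and take the best remaining item.
    scores = query_vecs @ corpus_vecs.T            # shape (nq, n)
    for qi, pos in enumerate(positive_idx):
        scores[qi, list(pos)] = -np.inf            # never pick a positive
    if exclude_idx:
        scores[:, list(exclude_idx)] = -np.inf     # global exclusions
    return scores.argmax(axis=1)                   # hardest negative per query

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10, 4)).astype("float32")
queries = corpus[[2, 5]] + 0.01                    # queries near items 2 and 5
negs = mine_top1_negatives(queries, corpus, positive_idx=[[2], [5]])
print(negs)  # nearest non-positive item per query
```

The `exclude_idx` mask plays the role of the `exclude_ids` / `exclude_mask` controls: it removes items (e.g. known duplicates) from the candidate pool before the argmax.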
## v1.0 readiness gates

- Benchmark matrix artifacts produced with fixed protocol and environment metadata.
- Stability harness demonstrates repeatability for latency/QPS/quality summaries.
- API stability contract documented in `docs/api.md` and enforced in `tests/test_api_stability.py`.
- Release packaging includes reproducible command blocks and artifact policy.

## Governance and trust

- `LICENSE`
- `CITATION.cff`

## Error cases

Stable error prefixes are used for fast debugging:

- `vector_array_error`: malformed array, IDs, metadata, subset lookup
- `metric_error`: unsupported or invalid metric definitions
- `index_error`: index lifecycle/search/add/persistence consistency issues
- `manifest_error`: missing/unsupported manifest fields or version

## Troubleshooting

- **Faiss not available**
  - Install with `pip install -e ".[faiss]"`.
- **Dimension mismatch at search/add**
  - Ensure both base vectors and query vectors use the same embedding dimension.
- **Metric confusion**
  - For cosine similarity, pass normalized vectors or set `normalize=True`.
- **Persistence load failure**
  - Check manifest version compatibility and whether artifacts were modified after save.
@@ -0,0 +1,24 @@
vector_engine/__init__.py,sha256=VbXKWPz-SPssPhVvSwzMQvpkaiBRS7K6jEigfyLRWaI,285
vector_engine/array.py,sha256=mEulAl6fJFFAnB-k2Lav8E93Rw4oIBHsjYioWCiBwWE,5703
vector_engine/index.py,sha256=wAhkRbzqu9odUbSeF3kmNJZ3o5oOvGrWyqmc3VxO0Sc,7702
vector_engine/metric.py,sha256=FiYhbOrUuX9u84EeCgjrrg4np-eXKjP5jgILeiNm-Bs,1824
vector_engine/results.py,sha256=dQ8B6vXQvsR9moGGcbqMc5fnsguN5aDKayQE3fI9Ypc,310
vector_engine/backends/__init__.py,sha256=-vEe47Gsu-xLR7rv5-Y-DXpcFfT49qHoZR2cdt7YZyM,328
vector_engine/backends/base.py,sha256=UfY7I3gvbEcEraYeMkd224BNOu_Ces0G7H1THsV7XgI,575
vector_engine/backends/bruteforce.py,sha256=bOjoFeu__lbswaps2Tnxa5-u4gHTKMPYZ0GvRlkNKWE,4137
vector_engine/backends/faiss_backend.py,sha256=7x6UT6EEwMZmos_Nh3Q6kiOeS1n3R-WAt2o-OELReZs,4520
vector_engine/backends/registry.py,sha256=Dzgi4mdztuMz0sM2G_uGkKB-FNy5ZVk8VHthLZmTSiw,347
vector_engine/eval/__init__.py,sha256=-EPSOvNxi3kaFocdVzXY10ZtCkaI-KKaBNGeiJ5QeH4,318
vector_engine/eval/retrieval.py,sha256=Ij3DjtE8dhpjgMyXpBQBXOae0GJNr1wzlfMCXo804t8,6586
vector_engine/io/__init__.py,sha256=ygJY89J9bOexA4lU2A9o1cC_PkeyR0IzD58_vrFBx6w,129
vector_engine/io/manifest.py,sha256=cIsF5QtwuUJGmsNGIOLcQqyBSciREEUgswQClD6MGLY,1385
vector_engine/ml/__init__.py,sha256=NmB0JvD7Y8UiaYnM6vXA45GPBosmKNSZAzY80NgFMnc,157
vector_engine/ml/clustering.py,sha256=clz52N7_-A08roL5k5_U4ZqKCQe5R7lThX6_dIGNq5M,1664
vector_engine/ml/knn.py,sha256=2Rk8hBgKvpimLJKjDuCv33n86-00MEh_AF0WZK-pZ_Q,2315
vector_engine/training/__init__.py,sha256=XJ-EQTrFwA7TnWEfPgvkx9KoN_FVnX0pspfG1vCOUhE,112
vector_engine/training/hard_negative.py,sha256=_ZnvwD0XbNcPUvvCvW71VMJkvlSH5wv7pPPCNX-gKbw,4873
vector_engine-1.0.0.dist-info/licenses/LICENSE,sha256=Xy3nx0VJ3Dd9vtPu7o1CNZ0sC1jHf1jdNQlQ_qY8mp0,1083
vector_engine-1.0.0.dist-info/METADATA,sha256=t5HIUlDOpGJaUfWnxoe8iLAM9dHVGgZ_QYJ5Yx_7EW8,12583
vector_engine-1.0.0.dist-info/WHEEL,sha256=YCfwYGOYMi5Jhw2fU4yNgwErybb2IX5PEwBKV4ZbdBo,91
vector_engine-1.0.0.dist-info/top_level.txt,sha256=DWhv4njcslmKqCi_n2IiOUrrYMwEV5IvXiWpkaQAf5A,14
vector_engine-1.0.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (82.0.0)
Root-Is-Purelib: true
Tag: py3-none-any

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Vector Engine Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1 @@
vector_engine