flatnav 0.1.2__cp39-cp39-macosx_10_14_x86_64.whl

flatnav/__init__.py ADDED
@@ -0,0 +1,35 @@
+ import sys
+ from ._core import (
+     MetricType,
+     data_type,
+     __version__,
+     __doc__
+ )
+
+ class _DataTypeModule:
+     from ._core.data_type import DataType
+
+
+ class _IndexModule:
+     from ._core.index import (
+         IndexL2Float,
+         IndexIPFloat,
+         IndexL2Uint8,
+         IndexIPUint8,
+         IndexL2Int8,
+         IndexIPInt8,
+         create,
+     )
+
+
+ index = _IndexModule
+ sys.modules['flatnav.index'] = _IndexModule
+ sys.modules['flatnav.data_type'] = _DataTypeModule
+
+ __all__ = [
+     'MetricType',
+     'data_type',
+     'index',
+     '__version__',
+     '__doc__'
+ ]
Binary file
@@ -0,0 +1,311 @@
+ Metadata-Version: 2.2
+ Name: flatnav
+ Version: 0.1.2
+ Summary: A performant graph-based kNN search library with re-ordering.
+ Home-page: https://flatnav.net
+ Author: Benjamin Coleman, Blaise Munyampirwa, Vihan Lakshman
+ Author-email: benjamin.ray.coleman@gmail.com, blaisemunyampirwa@gmail.com, vihan@mit.edu
+ Maintainer-email: blaisemunyampirwa@gmail.com
+ License: Apache License, Version 2.0
+ Project-URL: Source Code, https://github.com/BlaiseMuhirwa/flatnav
+ Project-URL: Documentation, https://blaisemuhirwa.github.io/flatnav
+ Project-URL: Bug Tracker, https://github.com/BlaiseMuhirwa/flatnav/issues
+ Keywords: similarity search,vector databases,machine learning
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Environment :: Console
+ Classifier: Operating System :: POSIX :: Linux
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Other Audience
+ Classifier: License :: OSI Approved :: Apache Software License
+ Classifier: Programming Language :: C++
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: System
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: Software Development
+ Description-Content-Type: text/markdown
+ Requires-Dist: numpy<2,>=1.21.0
+ Requires-Dist: h5py==3.11.0
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: keywords
+ Dynamic: license
+ Dynamic: maintainer-email
+ Dynamic: project-url
+ Dynamic: requires-dist
+ Dynamic: summary
+
+ ## FlatNav
+
+ FlatNav is a fast and header-only graph-based index for Approximate Nearest Neighbor Search (ANNS). FlatNav is inspired by the influential [Hierarchical Navigable Small World (HNSW) index](https://github.com/nmslib/hnswlib), but with the hierarchical component removed. As detailed in our [research paper](https://arxiv.org/pdf/2412.01940), we found that FlatNav achieved identical performance to HNSW on high-dimensional datasets (dimensionality > 32) with approximately 38% less peak memory consumption and a simplified implementation.
+
+ We hope to maintain this open source library as a resource for the broader community. Please consider opening a GitHub issue for bugs and feature requests, or get in touch with us directly for discussions.
+
+
+ ### Installation
+ FlatNav is implemented in C++ and ships with a complete Python extension; [cereal](https://uscilab.github.io/cereal/) is its only external dependency. Because the library is header-only, there is nothing to build: you can include the necessary headers directly in your existing code.
+
+ FlatNav is supported on x86-64 machines running Linux and macOS (we can extend this to Windows if there is sufficient interest). To get the C++ library working and run the examples under the [tools](https://github.com/BlaiseMuhirwa/flatnav/blob/main/tools) directory, you will need:
+
+ * C++17 compiler with OpenMP support (version >= 2.0)
+ * CMake (version >= 3.14)
+
+ We provide some helpful scripts for installing the above in the [bin](https://github.com/BlaiseMuhirwa/flatnav/tree/main/bin) directory.
+
+ To generate the library with CMake and compile the examples, run:
+
+ ```shell
+ $ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
+ $ cd flatnav
+ $ ./bin/build.sh -e
+ ```
+
+ To list every option that the `build.sh` script supports, pass it the `-h` flag, which prints the following usage message:
+
+ ```shell
+ Usage ./build.sh [OPTIONS]
+
+ Available Options:
+ -t, --tests: Build tests
+ -e, --examples: Build examples
+ -v, --verbose: Make verbose
+ -b, --benchmark: Build benchmarks
+ -bt, --build_type: Build type (Debug, Release, RelWithDebInfo, MinSizeRel)
+ -nmv, --no_simd_vectorization:Disable SIMD instructions
+ -h, --help: Print this help message
+
+ Example Usage:
+ ./build.sh -t -e -v
+ ```
+
+ To build the Python bindings, follow the instructions [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/README.md). There are also examples of how to use the library to build an index and run queries on top of it [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/unit_tests/test_index.py).
+
+ ### Support for SIMD Extensions
+
+ We currently support SIMD extensions for certain platforms as detailed below.
+
+ | Operation | x86_64 | arm64v8 | Apple silicon |
+ |-----------|--------|---------|---------------|
+ | FP32 Inner product | SSE, AVX, AVX512 | No SIMD support | No SIMD support |
+ | FP32 L2 distance | SSE, AVX, AVX512 | No SIMD support | No SIMD support |
+ | UINT8 L2 distance | AVX512 | No SIMD support | No SIMD support |
+ | INT8 L2 distance | SSE | No SIMD support | No SIMD support |
+
+
+ ### Getting Started in Python
+
+ Currently, we provide Python wheels for versions 3.8 through 3.12 on x86_64 machines (Intel and AMD processors, on Linux and macOS). ARM wheels are planned as a future improvement.
+
+ The Python library can be installed from PyPI with:
+ ```shell
+ $ pip install flatnav
+ ```
+
+ Alternatively, `flatnav` can be installed from source via [cibuildwheel](https://cibuildwheel.pypa.io/en/stable/), which builds cross-platform wheels. Follow these steps:
+
+ ```shell
+ $ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
+ $ cd flatnav
+ $ make install-cibuildwheel
+
+ # This will build flatnav for the Python version in your current environment. If you want to build wheels
+ # for all supported Python versions (3.8 to 3.12), remove the --current-version flag.
+ $ ./cibuild.sh --current-version
+
+ $ pip install wheelhouse/flatnav*.whl --force-reinstall
+ ```
+
+ Once you have the Python library installed and you have a dataset you want to index as a numpy array, you can construct the index as shown below. This will allocate memory and create a directed graph with vectors as nodes.
+
+ ```python
+ import numpy as np
+ import flatnav
+ from flatnav.data_type import DataType
+
+ # Get your numpy-formatted dataset.
+ dataset_size = 1_000_000
+ dataset_dimension = 128
+ # Cast to float32 so the array matches the DataType.float32 index created below.
+ dataset_to_index = np.random.randn(dataset_size, dataset_dimension).astype(np.float32)
+
+ # Define index construction parameters.
+ distance_type = "l2"
+ max_edges_per_node = 32
+ ef_construction = 100
+ num_build_threads = 16
+
+ # Create index configuration and pre-allocate memory
+ index = flatnav.index.create(
+     distance_type=distance_type,
+     index_data_type=DataType.float32,
+     dim=dataset_dimension,
+     dataset_size=dataset_size,
+     max_edges_per_node=max_edges_per_node,
+     verbose=True,
+     collect_stats=True,
+ )
+ index.set_num_threads(num_build_threads)
+
+ # Now index the dataset
+ index.add(data=dataset_to_index, ef_construction=ef_construction)
+ ```
+
+ Note that we specified `DataType.float32` to indicate that we want to build an index whose vectors are stored as `float` values. If you want to use a different precision, such as `uint8_t` or `int8_t` (the only other types currently supported), you can use `DataType.uint8` or `DataType.int8`.
+ The distance type can be either `l2` or `angular`. The `collect_stats` flag records the number of distance evaluations.
+
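As a rough guide to what `index_data_type` and `max_edges_per_node` mean for memory, the back-of-the-envelope estimate below multiplies out the raw vector storage and the edge lists for the parameters used above. It assumes 4-byte neighbor IDs and ignores all other bookkeeping, so treat it as an illustrative lower bound rather than FlatNav's actual footprint; `estimate_index_bytes` is a hypothetical helper.

```python
# Bytes per vector component for the three supported data types.
BYTES_PER_ELEMENT = {"float32": 4, "uint8": 1, "int8": 1}


def estimate_index_bytes(dataset_size, dim, max_edges_per_node, data_type,
                         bytes_per_edge=4):
    """Raw vector storage plus edge lists; bytes_per_edge is an assumption."""
    vector_bytes = dataset_size * dim * BYTES_PER_ELEMENT[data_type]
    edge_bytes = dataset_size * max_edges_per_node * bytes_per_edge
    return vector_bytes + edge_bytes


for data_type in ("float32", "uint8"):
    total = estimate_index_bytes(1_000_000, 128, 32, data_type)
    print(f"{data_type}: {total / 1e6:.0f} MB")
```

Under these assumptions the one-million-vector, 128-dimensional example needs about 640 MB with `float32` and about 256 MB with `uint8`.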
+ To query the index we just created, we generate IID query vectors from the standard normal distribution and search for their nearest neighbors as follows:
+
+ ```python
+ # Set query-time parameters
+ k = 100
+ ef_search = 100
+
+ # Run k-NN query with a single thread.
+ index.set_num_threads(1)
+
+ # Cast to float32 to match the data type of the index.
+ queries = np.random.randn(1000, dataset_to_index.shape[1]).astype(np.float32)
+ for query in queries:
+     distances, indices = index.search_single(
+         query=query,
+         ef_search=ef_search,
+         K=k,
+     )
+ ```
+
+ You can parallelize the search by setting the number of threads to the desired value and using the batched `search` API, which returns the same results as `search_single`.
+
+ ```python
+ index.set_num_threads(16)
+ distances, indices = index.search(queries=queries, ef_search=ef_search, K=k)
+ ```
+
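To make the roles of `ef_search` and `K` concrete, here is a minimal pure-Python sketch of best-first (beam) search over a single-layer neighbor graph, the general technique that HNSW-style indexes rely on. It is not FlatNav's implementation: the `graph` adjacency dictionary, the `vectors` list, the fixed `entry` node and the `beam_search` helper are all illustrative assumptions.

```python
import heapq


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def beam_search(graph, vectors, query, entry, ef_search, k):
    """Return the k closest visited nodes as (distance, node) pairs."""
    start = (squared_l2(vectors[entry], query), entry)
    visited = {entry}
    candidates = [start]          # min-heap: closest unexplored node first
    best = [(-start[0], entry)]   # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(best) >= ef_search and dist > -best[0][0]:
            break
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d = squared_l2(vectors[neighbor], query)
            if len(best) < ef_search or d < -best[0][0]:
                heapq.heappush(candidates, (d, neighbor))
                heapq.heappush(best, (-d, neighbor))
                if len(best) > ef_search:
                    heapq.heappop(best)  # drop the current worst result
    return sorted((-neg_d, node) for neg_d, node in best)[:k]


# Toy example: five points on a line, each linked to its immediate neighbors.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
results = beam_search(graph, vectors, query=[2.9], entry=0, ef_search=3, k=2)
print(results)
```

In this sketch `ef_search` bounds how many candidates are kept while exploring and `k` only trims the final list, so `ef_search` should be at least `k`; a larger `ef_search` explores more of the graph at the cost of more distance evaluations.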
+ ### Getting Started in C++
+
+ As mentioned earlier, there is nothing to build since this is header-only. We will translate the above Python code into C++ to illustrate how to use the C++ API.
+
+ ```c++
+ #include <cstdint>
+ #include <iostream>
+ #include <numeric>
+ #include <utility>
+ #include <vector>
+ #include <flatnav/index/Index.h>
+ #include <flatnav/distances/SquaredL2Distance.h>
+ #include <flatnav/distances/DistanceInterface.h>
+
+ // Runs every query against the index and returns the mean recall@K.
+ template <typename dist_t>
+ float run_knn_search(Index<dist_t, int>* index, float* queries, int* gtruth,
+                      int ef_search, int K, int num_queries, int num_gtruth, int dim) {
+
+   float mean_recall = 0;
+   for (int i = 0; i < num_queries; i++) {
+     float* q = queries + dim * i;
+     int* g = gtruth + num_gtruth * i;
+     std::vector<std::pair<float, int>> result =
+         index->search(q, K, ef_search);
+
+     // Count how many returned labels appear among the first K ground-truth labels.
+     float recall = 0;
+     for (int j = 0; j < K; j++) {
+       for (int l = 0; l < K; l++) {
+         if (result[j].second == g[l]) {
+           recall = recall + 1;
+         }
+       }
+     }
+     recall = recall / K;
+     mean_recall = mean_recall + recall;
+   }
+   return mean_recall / num_queries;
+ }
+
+
+ int main(int argc, char** argv) {
+   uint32_t dataset_size = 1000000;
+   uint32_t dataset_dimension = 128;
+
+   // We skip the random data generation, but you can do that with std::mt19937, std::random_device
+   // and std::normal_distribution. The zero-filled buffer below is only a placeholder.
+   std::vector<float> dataset_to_index(static_cast<std::size_t>(dataset_size) * dataset_dimension);
+
+   uint32_t max_edges_per_node = 32;
+   uint32_t ef_construction = 100;
+   uint32_t build_num_threads = 16;
+
+   // Create an index with l2 distance
+   auto distance = SquaredL2Distance<>::create(dataset_dimension);
+   auto* index = new Index<SquaredL2Distance<DataType::float32>, int>(
+       /* dist = */ std::move(distance), /* dataset_size = */ dataset_size,
+       /* max_edges_per_node = */ max_edges_per_node);
+
+   index->setNumThreads(build_num_threads);
+
+   std::vector<int> labels(dataset_size);
+   std::iota(labels.begin(), labels.end(), 0);
+   index->template addBatch<float>(/* data = */ (void *)dataset_to_index.data(),
+                                   /* labels = */ labels,
+                                   /* ef_construction */ ef_construction);
+
+   // Now query the index and compute the recall.
+   // Placeholders: replace these buffers with your own queries and ground-truth labels.
+   uint32_t ef_search = 100;
+   uint32_t k = 100;
+   uint32_t num_queries = 1000;
+   uint32_t num_gtruth = 1000;
+   std::vector<float> queries(num_queries * dataset_dimension);
+   std::vector<int> gtruth(num_queries * num_gtruth);
+
+   // Query the index and compute the recall.
+   float mean_recall = run_knn_search(index, queries.data(), gtruth.data(), ef_search, k,
+                                      num_queries, num_gtruth, dataset_dimension);
+   std::cout << "Mean recall: " << mean_recall << std::endl;
+
+   delete index;
+   return 0;
+ }
+ ```
+
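The recall computed in `run_knn_search` is the fraction of the first `K` ground-truth labels that appear among the returned labels, averaged over all queries. The same calculation can be sketched in a few lines of Python; `mean_recall_at_k` is a hypothetical helper that takes plain lists of labels, one row per query, and it assumes labels within a row are unique.

```python
def mean_recall_at_k(retrieved, ground_truth, k):
    """Average over queries of the overlap between the top-k lists, divided by k."""
    total = 0.0
    for found, truth in zip(retrieved, ground_truth):
        total += len(set(found[:k]) & set(truth[:k])) / k
    return total / len(retrieved)


# Two toy queries: the first recovers 2 of 3 true neighbors, the second all 3.
retrieved = [[3, 1, 7], [4, 9, 2]]
ground_truth = [[1, 3, 5], [2, 4, 9]]
recall = mean_recall_at_k(retrieved, ground_truth, k=3)
print(recall)
```

Using set intersection instead of the nested loop gives the same count when labels are unique and avoids the quadratic comparison per query.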
+ ### Datasets from ANN-Benchmarks
+
+ ANN-Benchmarks provides HDF5 files containing standard nearest-neighbor benchmark datasets, queries and ground-truth results. To index any of these datasets, you can use the `construct_npy.cpp` and `query_npy.cpp` files linked above.
+
+ To generate the [ANNS benchmark datasets](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets), run the following script:
+
+ ```shell
+ $ ./bin/download_anns_datasets.sh <dataset-name> [--normalize]
+ ```
+
+ For datasets that use angular/cosine similarity, you will need to pass the `--normalize` option so that the distances are computed correctly.
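
Normalization matters because, for unit-length vectors, cosine similarity reduces to a plain inner product and squared L2 distance becomes `2 - 2 * cosine`, so either metric ranks neighbors the same way. The sketch below checks this with the standard library only; it assumes that `--normalize` rescales every vector to unit L2 norm, which is not spelled out here, and `normalize` is a hypothetical helper.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [1.0, 0.0]
# Cosine similarity of the original, unnormalized vectors.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# After normalization the inner product equals the cosine similarity.
a_unit, b_unit = normalize(a), normalize(b)
inner_product = dot(a_unit, b_unit)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(cosine, inner_product, squared_l2)
```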
+
+ Available dataset names include:
+
+ ```shell
+ - mnist-784-euclidean
+ - sift-128-euclidean
+ - glove-25-angular
+ - glove-50-angular
+ - glove-100-angular
+ - glove-200-angular
+ - deep-image-96-angular
+ - gist-960-euclidean
+ - nytimes-256-angular
+ ```
+
+ ### Experimental API and Future Extensions
+
+ You can find the work currently under development in the [development-features](https://github.com/BlaiseMuhirwa/flatnav/blob/main/development-features) directory.
+ While some of these features may be usable, they are not guaranteed to be stable. Stable features are expected to become part of the PyPI releases.
+ The most notable ongoing extension is product quantization.
+
+ ## Citation
+ If you find this library useful, please consider citing our associated paper:
+
+ ```
+ @article{munyampirwa2024down,
+   title={Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"},
+   author={Munyampirwa, Blaise and Lakshman, Vihan and Coleman, Benjamin},
+   journal={arXiv preprint arXiv:2412.01940},
+   year={2024}
+ }
+ ```
@@ -0,0 +1,6 @@
+ flatnav/__init__.py,sha256=Rrz04JKfpOf0cP1T8yy_sfT2z5dL9HMrnL9lsIzCCBg,579
+ flatnav/_core.cpython-39-darwin.so,sha256=K3MqALy53wJLldxA3zTALbFw0wX7UbrVPAQheSbacxQ,683104
+ flatnav-0.1.2.dist-info/METADATA,sha256=sE_6EUeEaCrgoF2ADwtqIHbM9HFjMExUN0HoZ-3tg0c,12109
+ flatnav-0.1.2.dist-info/WHEEL,sha256=iTFmO13zeQqKMhexv916yMBT3HF07MzmsCxsrTHtkmg,104
+ flatnav-0.1.2.dist-info/top_level.txt,sha256=FVUKVYK356G2MlNoIaTtjmGUzJNV_2wLRmcHtuSUP3Y,8
+ flatnav-0.1.2.dist-info/RECORD,,
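Each `RECORD` row has the form `path,sha256=<digest>,<size in bytes>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, as the wheel format specifies. A small sketch for producing such a row from a file's bytes follows; `record_entry` is a hypothetical helper, and the payload passed to it is an assumption about the file's content rather than something shown in this diff.

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a RECORD-style row for the given archive path and file bytes."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 without '=' padding, as used in wheel RECORD files.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


entry = record_entry("flatnav-0.1.2.dist-info/top_level.txt", b"flatnav\n")
print(entry)
```

Running the helper over the files extracted from a wheel and comparing the rows with `RECORD` is a quick way to check that the archive has not been modified.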
@@ -0,0 +1,5 @@
+ Wheel-Version: 1.0
+ Generator: skbuild 0.18.1
+ Root-Is-Purelib: false
+ Tag: cp39-cp39-macosx_10_14_x86_64
+
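The `Tag` value is the compatibility tag that also appears in the wheel filename, made up of a Python tag, an ABI tag and a platform tag separated by hyphens. A minimal sketch of splitting it apart follows; `parse_wheel_tag` is a hypothetical helper and handles only a single, uncompressed tag like the one above.

```python
def parse_wheel_tag(tag):
    """Split 'python-abi-platform' into its three components."""
    python_tag, abi_tag, platform_tag = tag.split("-")
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


parsed = parse_wheel_tag("cp39-cp39-macosx_10_14_x86_64")
print(parsed)
```

Here `cp39` denotes CPython 3.9 for both the interpreter and the ABI, and `macosx_10_14_x86_64` denotes macOS 10.14 or newer on x86-64, which is consistent with `Root-Is-Purelib: false` for a compiled extension.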
@@ -0,0 +1 @@
+ flatnav