flatnav 0.1.2__cp39-cp39-macosx_10_14_x86_64.whl
flatnav/__init__.py
ADDED
@@ -0,0 +1,35 @@
import sys
from ._core import (
    MetricType,
    data_type,
    __version__,
    __doc__
)

class _DataTypeModule:
    from ._core.data_type import DataType


class _IndexModule:
    from ._core.index import (
        IndexL2Float,
        IndexIPFloat,
        IndexL2Uint8,
        IndexIPUint8,
        IndexL2Int8,
        IndexIPInt8,
        create,
    )


index = _IndexModule
sys.modules['flatnav.index'] = _IndexModule
sys.modules['flatnav.data_type'] = _DataTypeModule

__all__ = [
    'MetricType',
    'data_type',
    'index',
    '__version__',
    '__doc__'
]
Binary file
@@ -0,0 +1,311 @@
Metadata-Version: 2.2
Name: flatnav
Version: 0.1.2
Summary: A performant graph-based kNN search library with re-ordering.
Home-page: https://flatnav.net
Author: Benjamin Coleman, Blaise Munyampirwa, Vihan Lakshman
Author-email: benjamin.ray.coleman@gmail.com, blaisemunyampirwa@gmail.com, vihan@mit.edu
Maintainer-email: blaisemunyampirwa@gmail.com
License: Apache License, Version 2.0
Project-URL: Source Code, https://github.com/BlaiseMuhirwa/flatnav
Project-URL: Documentation, https://blaisemuhirwa.github.io/flatnav
Project-URL: Bug Tracker, https://github.com/BlaiseMuhirwa/flatnav/issues
Keywords: similarity search,vector databases,machine learning
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: POSIX :: Linux
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Other Audience
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Description-Content-Type: text/markdown
Requires-Dist: numpy<2,>=1.21.0
Requires-Dist: h5py==3.11.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: maintainer-email
Dynamic: project-url
Dynamic: requires-dist
Dynamic: summary

## FlatNav

FlatNav is a fast, header-only, graph-based index for Approximate Nearest Neighbor Search (ANNS). FlatNav is inspired by the influential [Hierarchical Navigable Small World (HNSW) index](https://github.com/nmslib/hnswlib), but with the hierarchical component removed. As detailed in our [research paper](https://arxiv.org/pdf/2412.01940), we found that FlatNav achieved identical performance to HNSW on high-dimensional datasets (dimensionality > 32) with approximately 38% less peak memory consumption and a simpler implementation.
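
To make the idea of a graph index without the hierarchy concrete, the sketch below runs a best-first (beam) search over a single-layer proximity graph in pure Python. It illustrates the general technique only and is not FlatNav's implementation; the function name `beam_search`, the adjacency-list layout, and the toy data are assumptions made for illustration.

```python
import heapq


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def beam_search(vectors, neighbors, query, entry, ef, k):
    """Best-first search on a flat graph; returns up to k (distance, node) pairs."""
    d0 = squared_l2(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    best = [(-d0, entry)]       # max-heap (negated): the ef closest nodes seen
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(best) >= ef and dist > -best[0][0]:
            break  # closest candidate is worse than the worst kept result
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = squared_l2(vectors[nb], query)
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current farthest result
    return sorted((-d, n) for d, n in best)[:k]


# Toy 1-D dataset laid out on a line, each node linked to its adjacent nodes.
vectors = [[float(i)] for i in range(10)]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
result = beam_search(vectors, neighbors, query=[6.2], entry=0, ef=3, k=2)
print(result)
```

Here `ef` plays the role of a beam width: a larger value explores more of the graph before stopping, which generally raises recall at the cost of more distance evaluations.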

We hope to maintain this open source library as a resource for the broader community. Please consider opening a GitHub issue for bugs and feature requests, or get in touch with us directly for discussions.


### Installation

FlatNav is implemented in C++ with a complete Python extension, and [cereal](https://uscilab.github.io/cereal/) is its only external dependency. This is a header-only library, so there is nothing to build. You can just include the necessary headers in your existing code.

FlatNav is supported on x86-64 machines running Linux and macOS (we can extend this to Windows if there is sufficient interest). To get the C++ library working and run the examples under the [tools](https://github.com/BlaiseMuhirwa/flatnav/blob/main/tools) directory, you will need

* C++17 compiler with OpenMP support (version >= 2.0)
* CMake (version >= 3.14)

We provide some helpful scripts for installing the above in the [bin](https://github.com/BlaiseMuhirwa/flatnav/tree/main/bin) directory.

To generate the library with CMake and compile examples, run

```shell
$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
$ cd flatnav
$ ./bin/build.sh -e
```

To list every option the `build.sh` script accepts, pass it the `-h` argument:

```shell
Usage ./build.sh [OPTIONS]

Available Options:
  -t, --tests:                     Build tests
  -e, --examples:                  Build examples
  -v, --verbose:                   Make verbose
  -b, --benchmark:                 Build benchmarks
  -bt, --build_type:               Build type (Debug, Release, RelWithDebInfo, MinSizeRel)
  -nmv, --no_simd_vectorization:   Disable SIMD instructions
  -h, --help:                      Print this help message

Example Usage:
  ./build.sh -t -e -v
```

To build the Python bindings, follow the instructions [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/README.md). There are also examples of how to use the library to build an index and run queries on it [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/unit_tests/test_index.py).

### Support for SIMD Extensions

We currently support SIMD extensions for certain platforms as detailed below.

| Operation | x86_64 | arm64v8 | Apple silicon |
|-----------|--------|---------|---------------|
| FP32 Inner product | SSE, AVX, AVX512 | No SIMD support | No SIMD support |
| FP32 L2 distance | SSE, AVX, AVX512 | No SIMD support | No SIMD support |
| UINT8 L2 distance | AVX512 | No SIMD support | No SIMD support |
| INT8 L2 distance | SSE | No SIMD support | No SIMD support |


### Getting Started in Python

Currently, we provide Python wheels for versions 3.8 through 3.12 on x86_64 machines (Intel and AMD) running Linux or macOS. Support for ARM wheels is a future improvement.

The Python library can be installed from PyPI with

```shell
$ pip install flatnav
```

Alternatively, `flatnav` can be built from source with [cibuildwheel](https://cibuildwheel.pypa.io/en/stable/), which builds cross-platform wheels. Follow these steps:

```shell
$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
$ cd flatnav
$ make install-cibuildwheel

# This builds flatnav for the Python version in your current environment. To build wheels
# for all supported Python versions (3.8 to 3.12), remove the --current-version flag.
$ ./cibuild.sh --current-version

$ pip install wheelhouse/flatnav*.whl --force-reinstall
```

Once you have the Python library installed and a dataset you want to index as a NumPy array, you can construct the index as shown below. This will allocate memory and create a directed graph with vectors as nodes.

```python
import numpy as np
import flatnav
from flatnav.data_type import DataType

# Get your numpy-formatted dataset.
dataset_size = 1_000_000
dataset_dimension = 128
# Cast to float32 so the array matches index_data_type below.
dataset_to_index = np.random.randn(dataset_size, dataset_dimension).astype(np.float32)

# Define index construction parameters.
distance_type = "l2"
max_edges_per_node = 32
ef_construction = 100
num_build_threads = 16

# Create index configuration and pre-allocate memory
index = flatnav.index.create(
    distance_type=distance_type,
    index_data_type=DataType.float32,
    dim=dataset_dimension,
    dataset_size=dataset_size,
    max_edges_per_node=max_edges_per_node,
    verbose=True,
    collect_stats=True,
)
index.set_num_threads(num_build_threads)

# Now index the dataset
index.add(data=dataset_to_index, ef_construction=ef_construction)
```

Note that we specified `DataType.float32` to build an index over vectors stored as `float`. The only other precisions currently supported are `uint8_t` and `int8_t`, which you select with `DataType.uint8` or `DataType.int8`.
The distance type can be either `l2` or `angular`. The `collect_stats` flag records the number of distance evaluations.
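
Because construction pre-allocates memory for `dataset_size` vectors, it can help to estimate the footprint before choosing a data type. The arithmetic below is a rough sketch under stated assumptions: one stored component per dimension at the chosen precision, plus `max_edges_per_node` neighbor ids of 4 bytes each per node. It ignores labels, padding, and allocator overhead, and it is not a description of FlatNav's actual memory layout. The helper name `estimate_index_bytes` is hypothetical.

```python
def estimate_index_bytes(num_vectors, dim, bytes_per_component, max_edges,
                         bytes_per_link=4):
    """Rough lower bound: raw vector storage plus fixed-size adjacency lists."""
    vector_bytes = num_vectors * dim * bytes_per_component
    link_bytes = num_vectors * max_edges * bytes_per_link  # assumed 32-bit ids
    return vector_bytes + link_bytes


# Parameters from the example above: 1M vectors, 128 dimensions, 32 edges per node.
float32_bytes = estimate_index_bytes(1_000_000, 128, 4, 32)
uint8_bytes = estimate_index_bytes(1_000_000, 128, 1, 32)
print(f"float32: {float32_bytes / 1e6:.0f} MB, uint8: {uint8_bytes / 1e6:.0f} MB")
```

Under these assumptions the vectors dominate the float32 estimate, which is why a lower-precision data type can shrink the index substantially when the dataset tolerates it.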

To query the index we just created, generate IID query vectors from the standard normal distribution and search as follows:

```python
# Set query-time parameters
k = 100
ef_search = 100

# Run k-NN query with a single thread.
index.set_num_threads(1)

# Cast to float32 so the queries match the index data type.
queries = np.random.randn(1000, dataset_to_index.shape[1]).astype(np.float32)
for query in queries:
    distances, indices = index.search_single(
        query=query,
        ef_search=ef_search,
        K=k,
    )
```

You can parallelize the search by setting the number of threads and calling `search`, a batch API that returns the same results as `search_single`.

```python
index.set_num_threads(16)
distances, indices = index.search(queries=queries, ef_search=ef_search, K=k)
```
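
Approximate search trades accuracy for speed, so it is common to measure recall against exact neighbors. A minimal sketch in pure Python follows. `recall_at_k` is a hypothetical helper, not part of the `flatnav` API, and it assumes that `indices` holds one row of result ids per query and that ground-truth neighbor ids are available from elsewhere (for example, a brute-force scan).

```python
def recall_at_k(indices, ground_truth, k):
    """Mean fraction of the true top-k neighbors found in the top-k results."""
    total = 0.0
    for found, truth in zip(indices, ground_truth):
        hits = len(set(found[:k]) & set(truth[:k]))
        total += hits / k
    return total / len(indices)


# Toy example: two queries, k = 3.
approx = [[4, 7, 1], [9, 2, 5]]
exact = [[4, 1, 8], [2, 5, 9]]
score = recall_at_k(approx, exact, k=3)
print(f"recall@3 = {score:.3f}")
```

Raising `ef_search` is the usual way to push this number up, at the cost of slower queries.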

### Getting Started in C++

As mentioned earlier, there is nothing to build since this is header-only. We will translate the above Python code into C++ to illustrate how to use the C++ API.

```c++
#include <cstdint>
#include <iostream>
#include <numeric>
#include <utility>
#include <vector>

#include <flatnav/index/Index.h>
#include <flatnav/distances/SquaredL2Distance.h>
#include <flatnav/distances/DistanceInterface.h>

template <typename dist_t>
void run_knn_search(Index<dist_t, int>* index, float* queries, int* gtruth,
                    int ef_search, int K, int num_queries, int num_gtruth, int dim) {

  float mean_recall = 0;
  for (int i = 0; i < num_queries; i++) {
    float* q = queries + dim * i;
    int* g = gtruth + num_gtruth * i;
    std::vector<std::pair<float, int>> result =
        index->search(q, K, ef_search);

    // Count how many of the K returned labels appear in the top-K ground truth.
    float recall = 0;
    for (int j = 0; j < K; j++) {
      for (int l = 0; l < K; l++) {
        if (result[j].second == g[l]) {
          recall = recall + 1;
        }
      }
    }
    recall = recall / K;
    mean_recall = mean_recall + recall;
  }
  std::cout << "Mean recall: " << mean_recall / num_queries << std::endl;
}


int main(int argc, char** argv) {
  uint32_t dataset_size = 1000000;
  uint32_t dataset_dimension = 128;

  // We skip the random data generation, but you can do that with std::mt19937,
  // std::random_device and std::normal_distribution.
  std::vector<float> dataset_to_index(
      static_cast<size_t>(dataset_size) * dataset_dimension);

  uint32_t max_edges_per_node = 32;
  uint32_t ef_construction = 100;
  uint32_t build_num_threads = 16;

  // Create an index with l2 distance
  auto distance = SquaredL2Distance<>::create(dataset_dimension);
  auto* index = new Index<SquaredL2Distance<DataType::float32>, int>(
      /* dist = */ std::move(distance), /* dataset_size = */ dataset_size,
      /* max_edges_per_node = */ max_edges_per_node);

  index->setNumThreads(build_num_threads);

  std::vector<int> labels(dataset_size);
  std::iota(labels.begin(), labels.end(), 0);
  index->template addBatch<float>(/* data = */ (void *)dataset_to_index.data(),
                                  /* labels = */ labels,
                                  /* ef_construction = */ ef_construction);

  // Now query the index and compute the recall.
  // We assume you fill the queries and ground-truth buffers below with your own data.
  uint32_t ef_search = 100;
  uint32_t k = 100;
  uint32_t num_queries = 1000;
  uint32_t num_gtruth = 1000;
  std::vector<float> queries(num_queries * dataset_dimension);
  std::vector<int> gtruth(num_queries * num_gtruth);

  // Query the index and compute the recall.
  run_knn_search(index, queries.data(), gtruth.data(), ef_search, k,
                 num_queries, num_gtruth, dataset_dimension);

  delete index;
  return 0;
}
```

### Datasets from ANN-Benchmarks

ANN-Benchmarks provides HDF5 files for a standard benchmark of near-neighbor datasets, queries and ground-truth results. To index any of these datasets you can use the `construct_npy.cpp` and `query_npy.cpp` files linked above.

To generate the [ANNS benchmark datasets](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets), run the following script

```shell
$ ./bin/download_anns_datasets.sh <dataset-name> [--normalize]
```

For datasets that use angular/cosine similarity, you will need to pass the `--normalize` option so that the distances are computed correctly.
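
The reason normalization matters: once every vector has unit length, the inner product equals the cosine similarity, and the squared L2 distance becomes `2 - 2 * cosine`, so the angular ranking can be recovered from either metric. A small pure-Python check of that identity is sketched below. The helper names are illustrative and say nothing about what the download script does internally.

```python
import math


def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [1.0, 0.0]
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# After normalization, inner product and squared L2 both encode the cosine.
ua, ub = normalize(a), normalize(b)
inner = dot(ua, ub)
sq_l2 = sum((x - y) ** 2 for x, y in zip(ua, ub))

print(cosine, inner, sq_l2)
```

Without normalization neither identity holds, and vectors with large norms would dominate an inner-product ranking regardless of their direction.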

Available dataset names include:

```shell
- mnist-784-euclidean
- sift-128-euclidean
- glove-25-angular
- glove-50-angular
- glove-100-angular
- glove-200-angular
- deep-image-96-angular
- gist-960-euclidean
- nytimes-256-angular
```

### Experimental API and Future Extensions

You can find the work currently under development in the [development-features](https://github.com/BlaiseMuhirwa/flatnav/blob/main/development-features) directory.
While some of these features may be usable, they are not guaranteed to be stable. Stable features are expected to be part of the PyPI releases.
The most notable ongoing extension under development is product quantization.

## Citation
If you find this library useful, please consider citing our associated paper:

```
@article{munyampirwa2024down,
  title={Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"},
  author={Munyampirwa, Blaise and Lakshman, Vihan and Coleman, Benjamin},
  journal={arXiv preprint arXiv:2412.01940},
  year={2024}
}
```
@@ -0,0 +1,6 @@
flatnav/__init__.py,sha256=Rrz04JKfpOf0cP1T8yy_sfT2z5dL9HMrnL9lsIzCCBg,579
flatnav/_core.cpython-39-darwin.so,sha256=K3MqALy53wJLldxA3zTALbFw0wX7UbrVPAQheSbacxQ,683104
flatnav-0.1.2.dist-info/METADATA,sha256=sE_6EUeEaCrgoF2ADwtqIHbM9HFjMExUN0HoZ-3tg0c,12109
flatnav-0.1.2.dist-info/WHEEL,sha256=iTFmO13zeQqKMhexv916yMBT3HF07MzmsCxsrTHtkmg,104
flatnav-0.1.2.dist-info/top_level.txt,sha256=FVUKVYK356G2MlNoIaTtjmGUzJNV_2wLRmcHtuSUP3Y,8
flatnav-0.1.2.dist-info/RECORD,,
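
Each entry in the RECORD listing above has the form `path,sha256=<digest>,<size>`, where the digest is the file's SHA-256 hash encoded as URL-safe base64 with the trailing `=` padding removed; the entry for RECORD itself leaves both fields empty. The sketch below recomputes such a digest so that an installed file can be checked against its entry. The helper names `record_digest` and `parse_record_line` are illustrative.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """SHA-256 of data, URL-safe base64 encoded, padding stripped."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split one RECORD row into (path, algorithm, digest, size)."""
    path, hash_field, size = line.rsplit(",", 2)
    algorithm, _, digest = hash_field.partition("=")
    return path, algorithm, digest, int(size) if size else None


entry = parse_record_line(
    "flatnav-0.1.2.dist-info/top_level.txt,"
    "sha256=FVUKVYK356G2MlNoIaTtjmGUzJNV_2wLRmcHtuSUP3Y,8"
)
print(entry)
print(record_digest(b""))
```

To verify a file, read it in binary mode, pass the bytes to `record_digest`, and compare the result and the byte count with the parsed entry.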
@@ -0,0 +1 @@
flatnav