flatnav 0.1.2__cp38-cp38-macosx_10_14_x86_64.whl
flatnav/__init__.py
ADDED
@@ -0,0 +1,35 @@
import sys
from ._core import (
    MetricType,
    data_type,
    __version__,
    __doc__
)

class _DataTypeModule:
    from ._core.data_type import DataType


class _IndexModule:
    from ._core.index import (
        IndexL2Float,
        IndexIPFloat,
        IndexL2Uint8,
        IndexIPUint8,
        IndexL2Int8,
        IndexIPInt8,
        create,
    )


index = _IndexModule
sys.modules['flatnav.index'] = _IndexModule
sys.modules['flatnav.data_type'] = _DataTypeModule

__all__ = [
    'MetricType',
    'data_type',
    'index',
    '__version__',
    '__doc__'
]
Binary file
|
@@ -0,0 +1,299 @@
Metadata-Version: 2.1
Name: flatnav
Version: 0.1.2
Summary: A performant graph-based kNN search library with re-ordering.
Home-page: https://flatnav.net
Author: Benjamin Coleman, Blaise Munyampirwa, Vihan Lakshman
Author-email: benjamin.ray.coleman@gmail.com, blaisemunyampirwa@gmail.com, vihan@mit.edu
Maintainer-email: blaisemunyampirwa@gmail.com
License: Apache License, Version 2.0
Project-URL: Source Code, https://github.com/BlaiseMuhirwa/flatnav
Project-URL: Documentation, https://blaisemuhirwa.github.io/flatnav
Project-URL: Bug Tracker, https://github.com/BlaiseMuhirwa/flatnav/issues
Keywords: similarity search,vector databases,machine learning
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: POSIX :: Linux
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Other Audience
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Description-Content-Type: text/markdown
Requires-Dist: numpy<2,>=1.21.0
Requires-Dist: h5py==3.11.0

## FlatNav

FlatNav is a fast, header-only graph-based index for Approximate Nearest Neighbor Search (ANNS). FlatNav is inspired by the influential [Hierarchical Navigable Small World (HNSW) index](https://github.com/nmslib/hnswlib), but with the hierarchical component removed. As detailed in our [research paper](https://arxiv.org/pdf/2412.01940), we found that FlatNav achieved identical performance to HNSW on high-dimensional datasets (dimensionality > 32) with approximately 38% less peak memory consumption and a simplified implementation.

We hope to maintain this open source library as a resource for the broader community. Please consider opening a GitHub issue for bugs and feature requests, or get in touch with us directly for discussions.


### Installation
FlatNav is implemented in C++ with a complete Python extension, and [cereal](https://uscilab.github.io/cereal/) is its only external dependency. It is a header-only library, so there is nothing to build: you can include the necessary headers directly in your existing code.

FlatNav is supported on x86-64 machines running Linux and macOS (we can extend this to Windows if there is sufficient interest). To get the C++ library working and run the examples under the [tools](https://github.com/BlaiseMuhirwa/flatnav/blob/main/tools) directory, you will need:

* C++17 compiler with OpenMP support (version >= 2.0)
* CMake (version >= 3.14)

We provide some helpful scripts for installing the above in the [bin](https://github.com/BlaiseMuhirwa/flatnav/tree/main/bin) directory.

To generate the library with CMake and compile the examples, run:

```shell
$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
$ cd flatnav
$ ./bin/build.sh -e
```

You can list all options available for the `build.sh` script by passing it the `-h` argument:

```shell
Usage ./build.sh [OPTIONS]

Available Options:
  -t, --tests:                    Build tests
  -e, --examples:                 Build examples
  -v, --verbose:                  Make verbose
  -b, --benchmark:                Build benchmarks
  -bt, --build_type:              Build type (Debug, Release, RelWithDebInfo, MinSizeRel)
  -nmv, --no_simd_vectorization:  Disable SIMD instructions
  -h, --help:                     Print this help message

Example Usage:
  ./build.sh -t -e -v
```

To build the Python bindings, follow the instructions [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/README.md). There are also examples of how to use the library to build an index and run queries against it [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/unit_tests/test_index.py).

### Support for SIMD Extensions

We currently support SIMD extensions for certain platforms as detailed below.

| Operation | x86_64 | arm64v8 | Apple silicon |
|-----------|--------|---------|---------------|
| FP32 Inner product | SSE, AVX, AVX512 | No SIMD support | No SIMD support |
| FP32 L2 distance | SSE, AVX, AVX512 | No SIMD support | No SIMD support |
| UINT8 L2 distance | AVX512 | No SIMD support | No SIMD support |
| INT8 L2 distance | SSE | No SIMD support | No SIMD support |


### Getting Started in Python

We currently provide Python wheels for versions 3.8 through 3.12 on x86_64 machines (Intel and AMD, on Linux and macOS). Support for ARM wheels is a future improvement.

The Python library can be installed from PyPI with:
```shell
$ pip install flatnav
```

Alternatively, `flatnav` can be built from source with [cibuildwheel](https://cibuildwheel.pypa.io/en/stable/), which
builds cross-platform wheels. Follow these steps:

```shell
$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
$ cd flatnav
$ make install-cibuildwheel

# This will build flatnav for the Python version in your current environment. If you want to build wheels
# for all supported Python versions (3.8 to 3.12), remove the --current-version flag.
$ ./cibuild.sh --current-version

$ pip install wheelhouse/flatnav*.whl --force-reinstall
```

Once the Python library is installed and you have a dataset to index as a NumPy array, you can construct the index as shown below. This allocates memory and creates a directed graph with the vectors as nodes.

```python
import numpy as np
import flatnav
from flatnav.data_type import DataType

# Get your numpy-formatted dataset.
dataset_size = 1_000_000
dataset_dimension = 128
dataset_to_index = np.random.randn(dataset_size, dataset_dimension)

# Define index construction parameters.
distance_type = "l2"
max_edges_per_node = 32
ef_construction = 100
num_build_threads = 16

# Create index configuration and pre-allocate memory
index = flatnav.index.create(
    distance_type=distance_type,
    index_data_type=DataType.float32,
    dim=dataset_dimension,
    dataset_size=dataset_size,
    max_edges_per_node=max_edges_per_node,
    verbose=True,
    collect_stats=True,
)
index.set_num_threads(num_build_threads)

# Now index the dataset
index.add(data=dataset_to_index, ef_construction=ef_construction)
```

Note that we specified `DataType.float32` to indicate that we want to build an index whose vectors are represented with the `float` type. If you want to use a different precision, such as `uint8_t` or `int8_t` (the only other ones currently supported), use `DataType.uint8` or `DataType.int8`.
The distance type can be either `l2` or `angular`. The `collect_stats` flag records the number of distance evaluations.
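
Because `np.random.randn` returns `float64` values, it can help to convert a dataset explicitly to the NumPy dtype that matches the chosen `DataType` before indexing. The sketch below uses only NumPy. The helper name `to_index_dtype` and its string keys are hypothetical and not part of flatnav, and whether the bindings need this conversion (rather than casting on their own) is an assumption.

```python
import numpy as np

# Hypothetical mapping from the DataType names mentioned above to NumPy dtypes.
_NUMPY_DTYPES = {"float32": np.float32, "uint8": np.uint8, "int8": np.int8}


def to_index_dtype(data, type_name):
    """Return `data` as a C-contiguous array in the requested dtype."""
    target = np.dtype(_NUMPY_DTYPES[type_name])
    data = np.asarray(data)
    if target.kind in "iu" and data.size > 0:
        # Refuse values that would overflow the 8-bit integer range.
        limits = np.iinfo(target)
        if data.min() < limits.min or data.max() > limits.max:
            raise ValueError(f"values do not fit into {target.name}")
    return np.ascontiguousarray(data, dtype=target)


dataset = to_index_dtype(np.random.randn(1000, 128), "float32")
```

Note that converting fractional values to an integer dtype truncates them, so real-valued data would need a deliberate scaling step first.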

To query the index we just created, we generate IID query vectors from the standard normal distribution and search for their nearest neighbors as follows:

```python
# Set query-time parameters
k = 100
ef_search = 100

# Run k-NN query with a single thread.
index.set_num_threads(1)

queries = np.random.randn(1000, dataset_to_index.shape[1])
for query in queries:
    distances, indices = index.search_single(
        query=query,
        ef_search=ef_search,
        K=k,
    )
```

You can parallelize the search by setting the number of threads and using the batch `search` API, which returns exactly the same results as `search_single`.

```python
index.set_num_threads(16)
distances, indices = index.search(queries=queries, ef_search=ef_search, K=k)
```
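
To judge search quality, the returned ids can be compared against exact nearest neighbors computed by brute force, which mirrors the recall computation in the C++ example. The sketch below uses only NumPy. `exact_knn` and `recall_at_k` are hypothetical helper names, not part of flatnav, and the sketch assumes that `indices` holds one row of `k` integer ids per query and that ids correspond to row positions in the dataset.

```python
import numpy as np


def exact_knn(dataset, queries, k):
    """Exact k nearest neighbors under squared L2 distance (brute force)."""
    dataset = np.asarray(dataset, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.float64)
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, evaluated for all pairs at once.
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ dataset.T
        + (dataset ** 2).sum(axis=1)
    )
    # Select the k smallest distances per row, then sort those k by distance.
    top = np.argpartition(d2, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(d2, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)


def recall_at_k(approx_indices, exact_indices):
    """Average fraction of exact neighbors that appear in the approximate result."""
    approx_rows = np.asarray(approx_indices).tolist()
    exact_rows = np.asarray(exact_indices).tolist()
    k = len(exact_rows[0])
    hits = [len(set(a) & set(e)) for a, e in zip(approx_rows, exact_rows)]
    return float(np.mean(hits)) / k
```

Under those assumptions, `recall_at_k(indices, exact_knn(dataset_to_index, queries, k))` would give the mean recall. The brute-force step builds a `num_queries x dataset_size` distance matrix, so it is only practical on a subsample of a large dataset.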

### Getting Started in C++

As mentioned earlier, there is nothing to build since the library is header-only. We translate the Python code above into C++ to illustrate how to use the C++ API.

```c++
#include <cstdint>
#include <iostream>
#include <numeric>
#include <utility>
#include <vector>
#include <flatnav/index/Index.h>
#include <flatnav/distances/SquaredL2Distance.h>
#include <flatnav/distances/DistanceInterface.h>

template <typename dist_t>
void run_knn_search(Index<dist_t, int>* index, float* queries, int* gtruth,
                    int ef_search, int K, int num_queries, int num_gtruth, int dim) {

  float mean_recall = 0;
  for (int i = 0; i < num_queries; i++) {
    float* q = queries + dim * i;
    int* g = gtruth + num_gtruth * i;
    std::vector<std::pair<float, int>> result =
        index->search(q, K, ef_search);

    // Count how many returned labels appear among the first K ground-truth labels.
    float recall = 0;
    for (int j = 0; j < K; j++) {
      for (int l = 0; l < K; l++) {
        if (result[j].second == g[l]) {
          recall = recall + 1;
        }
      }
    }
    recall = recall / K;
    mean_recall = mean_recall + recall;
  }
  // Report the recall averaged over all queries.
  std::cout << "Mean recall: " << mean_recall / num_queries << std::endl;
}


int main(int argc, char** argv) {
  uint32_t dataset_size = 1000000;
  uint32_t dataset_dimension = 128;

  // We skip the random data generation, but you can do that with std::mt19937, std::random_device
  // and std::normal_distribution
  // std::vector<float> dataset_to_index;

  uint32_t max_edges_per_node = 32;
  uint32_t ef_construction = 100;
  uint32_t build_num_threads = 16;

  // Create an index with l2 distance
  auto distance = SquaredL2Distance<>::create(dataset_dimension);
  auto* index = new Index<SquaredL2Distance<DataType::float32>, int>(
      /* dist = */ std::move(distance), /* dataset_size = */ dataset_size,
      /* max_edges_per_node = */ max_edges_per_node);

  index->setNumThreads(build_num_threads);

  std::vector<int> labels(dataset_size);
  std::iota(labels.begin(), labels.end(), 0);
  index->template addBatch<float>(/* data = */ (void *)dataset_to_index,
                                  /* labels = */ labels,
                                  /* ef_construction */ ef_construction);

  // Now query the index and compute the recall
  // We assume you have a ground truth (int*) array and a queries (float*) array
  uint32_t ef_search = 100;
  uint32_t k = 100;
  uint32_t num_queries = 1000;
  uint32_t num_gtruth = 1000;

  // Query the index and compute the recall.
  run_knn_search(index, queries, gtruth, ef_search, k, num_queries, num_gtruth, dataset_dimension);
}
```

### Datasets from ANN-Benchmarks

ANN-Benchmarks provides HDF5 files for a standard benchmark of near-neighbor datasets, queries and ground-truth results. To index any of these datasets, you can use the `construct_npy.cpp` and `query_npy.cpp` files linked above.

To generate the [ANNS benchmark datasets](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets), run the following script:

```shell
$ ./bin/download_anns_datasets.sh <dataset-name> [--normalize]
```

For datasets that use angular/cosine similarity, you will need to pass the `--normalize` option so that the distances are computed correctly.
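
Normalization matters because, for unit-length vectors, the squared L2 distance equals `2 - 2 * cos(x, y)`, so ranking by L2 distance or inner product agrees with ranking by cosine similarity. A minimal NumPy sketch of row normalization follows. `normalize_rows` is a hypothetical helper name, and it is an assumption that `--normalize` performs an equivalent step.

```python
import numpy as np


def normalize_rows(vectors, eps=1e-12):
    """Scale every row to unit L2 norm; all-zero rows are left as zeros."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clamp the norm from below so that zero rows do not divide by zero.
    return vectors / np.maximum(norms, eps)


# Check the identity ||x - y||^2 = 2 - 2 * cos(x, y) on two random vectors.
rng = np.random.default_rng(0)
x, y = normalize_rows(rng.standard_normal((2, 25)))
squared_l2 = float(np.sum((x - y) ** 2))
cosine = float(x @ y)
```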

Available dataset names include:

```shell
- mnist-784-euclidean
- sift-128-euclidean
- glove-25-angular
- glove-50-angular
- glove-100-angular
- glove-200-angular
- deep-image-96-angular
- gist-960-euclidean
- nytimes-256-angular
```

### Experimental API and Future Extensions

You can find the current work under development under the [development-features](https://github.com/BlaiseMuhirwa/flatnav/blob/main/development-features) directory.
While some of these features may be usable, they are not guaranteed to be stable. Stable features will be expected to be part of the PyPI releases.
The most notable ongoing extension that's under development is product quantization.

## Citation
If you find this library useful, please consider citing our associated paper:

```
@article{munyampirwa2024down,
  title={Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"},
  author={Munyampirwa, Blaise and Lakshman, Vihan and Coleman, Benjamin},
  journal={arXiv preprint arXiv:2412.01940},
  year={2024}
}
```
@@ -0,0 +1,6 @@
flatnav/__init__.py,sha256=Rrz04JKfpOf0cP1T8yy_sfT2z5dL9HMrnL9lsIzCCBg,579
flatnav/_core.cpython-38-darwin.so,sha256=oxJWP5Pjk_V9RAf67YRg5e5nuk5y3nsV0w_i18CjH6s,682920
flatnav-0.1.2.dist-info/METADATA,sha256=2Zsista1n0hTA5dqRQj6KwxySEAQnOjk78w3FKGen2s,11855
flatnav-0.1.2.dist-info/WHEEL,sha256=1Q0ilh_UmUoDz5aBvEtcYwUkqD8WzqUHpTXCjcN17Ek,104
flatnav-0.1.2.dist-info/top_level.txt,sha256=FVUKVYK356G2MlNoIaTtjmGUzJNV_2wLRmcHtuSUP3Y,8
flatnav-0.1.2.dist-info/RECORD,,
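
Each RECORD line above has the form `path,sha256=<digest>,<size in bytes>`. In the wheel format, the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. The standard-library sketch below recomputes such an entry. `record_entry` is a hypothetical helper name, and the sample path and bytes are illustrative rather than taken from this wheel.

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a RECORD-style line for a file path and its raw bytes."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 without the '=' padding, as used in wheel RECORD files.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


entry = record_entry("example/top_level.txt", b"example\n")
```

Comparing such a recomputed line with the one stored in RECORD is one way to check that an unpacked file has not been altered.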
@@ -0,0 +1 @@
flatnav