flatnav 0.1.2__cp38-cp38-macosx_10_14_x86_64.whl
flatnav/__init__.py
ADDED
@@ -0,0 +1,35 @@
|
|
1
|
+
import sys
|
2
|
+
from ._core import (
|
3
|
+
MetricType,
|
4
|
+
data_type,
|
5
|
+
__version__,
|
6
|
+
__doc__
|
7
|
+
)
|
8
|
+
|
9
|
+
class _DataTypeModule:
|
10
|
+
from ._core.data_type import DataType
|
11
|
+
|
12
|
+
|
13
|
+
class _IndexModule:
|
14
|
+
from ._core.index import (
|
15
|
+
IndexL2Float,
|
16
|
+
IndexIPFloat,
|
17
|
+
IndexL2Uint8,
|
18
|
+
IndexIPUint8,
|
19
|
+
IndexL2Int8,
|
20
|
+
IndexIPInt8,
|
21
|
+
create,
|
22
|
+
)
|
23
|
+
|
24
|
+
|
25
|
+
index = _IndexModule
|
26
|
+
sys.modules['flatnav.index'] = _IndexModule
|
27
|
+
sys.modules['flatnav.data_type'] = _DataTypeModule
|
28
|
+
|
29
|
+
__all__ = [
|
30
|
+
'MetricType',
|
31
|
+
'data_type',
|
32
|
+
'index',
|
33
|
+
'__version__',
|
34
|
+
'__doc__'
|
35
|
+
]
|
Binary file
|
@@ -0,0 +1,299 @@
|
|
1
|
+
Metadata-Version: 2.1
|
2
|
+
Name: flatnav
|
3
|
+
Version: 0.1.2
|
4
|
+
Summary: A performant graph-based kNN search library with re-ordering.
|
5
|
+
Home-page: https://flatnav.net
|
6
|
+
Author: Benjamin Coleman, Blaise Munyampirwa, Vihan Lakshman
|
7
|
+
Author-email: benjamin.ray.coleman@gmail.com, blaisemunyampirwa@gmail.com, vihan@mit.edu
|
8
|
+
Maintainer-email: blaisemunyampirwa@gmail.com
|
9
|
+
License: Apache License, Version 2.0
|
10
|
+
Project-URL: Source Code, https://github.com/BlaiseMuhirwa/flatnav
|
11
|
+
Project-URL: Documentation, https://blaisemuhirwa.github.io/flatnav
|
12
|
+
Project-URL: Bug Tracker, https://github.com/BlaiseMuhirwa/flatnav/issues
|
13
|
+
Keywords: similarity search,vector databases,machine learning
|
14
|
+
Classifier: Development Status :: 4 - Beta
|
15
|
+
Classifier: Environment :: Console
|
16
|
+
Classifier: Operating System :: POSIX :: Linux
|
17
|
+
Classifier: Intended Audience :: Science/Research
|
18
|
+
Classifier: Intended Audience :: Developers
|
19
|
+
Classifier: Intended Audience :: Other Audience
|
20
|
+
Classifier: License :: OSI Approved :: Apache Software License
|
21
|
+
Classifier: Programming Language :: C++
|
22
|
+
Classifier: Programming Language :: Python :: 3
|
23
|
+
Classifier: Programming Language :: Python :: 3.8
|
24
|
+
Classifier: Programming Language :: Python :: 3.9
|
25
|
+
Classifier: Programming Language :: Python :: 3.10
|
26
|
+
Classifier: Programming Language :: Python :: 3.11
|
27
|
+
Classifier: Programming Language :: Python :: 3.12
|
28
|
+
Classifier: Topic :: System
|
29
|
+
Classifier: Topic :: Scientific/Engineering
|
30
|
+
Classifier: Topic :: Software Development
|
31
|
+
Description-Content-Type: text/markdown
|
32
|
+
Requires-Dist: numpy<2,>=1.21.0
|
33
|
+
Requires-Dist: h5py==3.11.0
|
34
|
+
|
35
|
+
## FlatNav
|
36
|
+
|
37
|
+
FlatNav is a fast and header-only graph-based index for Approximate Nearest Neighbor Search (ANNS). FlatNav is inspired by the influential [Hierarchical Navigable Small World (HNSW) index](https://github.com/nmslib/hnswlib), but with the hierarchical component removed. As detailed in our [research paper](https://arxiv.org/pdf/2412.01940), we found that FlatNav achieved identical performance to HNSW on high-dimensional datasets (dimensionality > 32) with approximately 38% less peak memory consumption and a simplified implementation.
|
38
|
+
|
39
|
+
We hope to maintain this open source library as a resource for the broader community. Please consider opening a GitHub issue for bugs and feature requests, or get in touch with us directly for discussions.
|
40
|
+
|
41
|
+
|
42
|
+
### Installation
|
43
|
+
FlatNav is implemented in C++ with a complete Python extension, and [cereal](https://uscilab.github.io/cereal/) is its only external dependency. It is a header-only library, so there is nothing to build: you can include the necessary headers directly in your existing code.
|
44
|
+
|
45
|
+
FlatNav is supported on x86-64 machines running Linux and macOS (we can extend this to Windows if there is sufficient interest). To get the C++ library working and run the examples under the [tools](https://github.com/BlaiseMuhirwa/flatnav/blob/main/tools) directory, you will need:
|
46
|
+
|
47
|
+
* C++17 compiler with OpenMP support (version >= 2.0)
|
48
|
+
* CMake (version >= 3.14)
|
49
|
+
|
50
|
+
We provide some helpful scripts for installing the above in the [bin](https://github.com/BlaiseMuhirwa/flatnav/tree/main/bin) directory.
|
51
|
+
|
52
|
+
To generate the library with CMake and compile examples, run
|
53
|
+
|
54
|
+
```shell
|
55
|
+
$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
|
56
|
+
$ cd flatnav
|
57
|
+
$ ./bin/build.sh -e
|
58
|
+
```
|
59
|
+
|
60
|
+
You can get all options available with the `build.sh` script by passing it the `-h` argument.
|
61
|
+
|
62
|
+
This will display all available build options:
|
63
|
+
|
64
|
+
```shell
|
65
|
+
Usage ./build.sh [OPTIONS]
|
66
|
+
|
67
|
+
Available Options:
|
68
|
+
-t, --tests: Build tests
|
69
|
+
-e, --examples: Build examples
|
70
|
+
-v, --verbose: Make verbose
|
71
|
+
-b, --benchmark: Build benchmarks
|
72
|
+
-bt, --build_type: Build type (Debug, Release, RelWithDebInfo, MinSizeRel)
|
73
|
+
-nmv, --no_simd_vectorization:Disable SIMD instructions
|
74
|
+
-h, --help: Print this help message
|
75
|
+
|
76
|
+
Example Usage:
|
77
|
+
./build.sh -t -e -v
|
78
|
+
```
|
79
|
+
|
80
|
+
To build the Python bindings, follow instructions [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/README.md). There are also examples for how to use the library to build an index and run queries on top of it [here](https://github.com/BlaiseMuhirwa/flatnav/blob/main/flatnav_python/unit_tests/test_index.py).
|
81
|
+
|
82
|
+
### Support for SIMD Extensions
|
83
|
+
|
84
|
+
We currently support SIMD extensions for certain platforms as detailed below.
|
85
|
+
|
86
|
+
| Operation | x86_64 | arm64v8 | Apple silicon |
|
87
|
+
|-----------|--------|---------|-----------------|
|
88
|
+
| FP32 Inner product |SSE, AVX, AVX512 | No SIMD support | No SIMD support |
|
89
|
+
| FP32 L2 distance |SSE, AVX, AVX512| No SIMD support | No SIMD support |
|
90
|
+
| UINT8 L2 distance |AVX512 | No SIMD support | No SIMD support |
|
91
|
+
| INT8 L2 distance | SSE | No SIMD support | No SIMD support |
|
92
|
+
|
93
|
+
|
94
|
+
### Getting Started in Python
|
95
|
+
|
96
|
+
Currently, we support Python wheels for versions 3.8 through 3.12 on x86_64 architectures (Intel, AMD and MacOS). Support for
|
97
|
+
ARM wheels is a future improvement.
|
98
|
+
|
99
|
+
The Python library can be installed from PyPI with:
|
100
|
+
```shell
|
101
|
+
$ pip install flatnav
|
102
|
+
```
|
103
|
+
|
104
|
+
Similarly, `flatnav` can be installed from source via [cibuildwheel](https://cibuildwheel.pypa.io/en/stable/), which
|
105
|
+
builds cross-platform wheels. Follow the steps below:
|
106
|
+
|
107
|
+
```shell
|
108
|
+
$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
|
109
|
+
$ cd flatnav
|
110
|
+
$ make install-cibuildwheel
|
111
|
+
|
112
|
+
# This will build flatnav for the current version in your environment. If you want to build wheels
|
113
|
+
# for all supported python versions (3.8 to 3.12), remove the --current-version flag.
|
114
|
+
$ ./cibuild.sh --current-version
|
115
|
+
|
116
|
+
$ pip install wheelhouse/flatnav*.whl --force-reinstall
|
117
|
+
```
|
118
|
+
|
119
|
+
Once you have the Python library installed and a dataset to index as a NumPy array, you can construct the index as shown below. This allocates memory and creates a directed graph with the vectors as nodes.
|
120
|
+
|
121
|
+
```python
|
122
|
+
import numpy as np
|
123
|
+
import flatnav
|
124
|
+
from flatnav.data_type import DataType
|
125
|
+
|
126
|
+
# Get your numpy-formatted dataset.
|
127
|
+
dataset_size = 1_000_000
|
128
|
+
dataset_dimension = 128
|
129
|
+
dataset_to_index = np.random.randn(dataset_size, dataset_dimension).astype(np.float32)  # match DataType.float32
|
130
|
+
|
131
|
+
# Define index construction parameters.
|
132
|
+
distance_type = "l2"
|
133
|
+
max_edges_per_node = 32
|
134
|
+
ef_construction = 100
|
135
|
+
num_build_threads = 16
|
136
|
+
|
137
|
+
# Create index configuration and pre-allocate memory
|
138
|
+
index = flatnav.index.create(
|
139
|
+
distance_type=distance_type,
|
140
|
+
index_data_type=DataType.float32,
|
141
|
+
dim=dataset_dimension,
|
142
|
+
dataset_size=dataset_size,
|
143
|
+
max_edges_per_node=max_edges_per_node,
|
144
|
+
verbose=True,
|
145
|
+
collect_stats=True,
|
146
|
+
)
|
147
|
+
index.set_num_threads(num_build_threads)
|
148
|
+
|
149
|
+
# Now index the dataset
|
150
|
+
index.add(data=dataset_to_index, ef_construction=ef_construction)
|
151
|
+
```
|
152
|
+
|
153
|
+
Note that we specified `DataType.float32` to indicate that we want to build an index with vectors represented with `float` type. If you want to use a different precision, such as `uint8_t` or `int8_t` (which are the only other ones currently supported), you can use `DataType.uint8` or `DataType.int8`.
|
154
|
+
The distance type can be either `l2` or `angular`. The `collect_stats` flag records the number of distance evaluations.
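To make the choice of data type concrete, here is a rough sketch of the raw vector storage for each of the three types named above. It counts only the vector components (4 bytes for `float32`, 1 byte for `uint8` and `int8`); graph edges and any bookkeeping the index keeps are deliberately ignored, so these are not measurements of FlatNav's actual memory use. The helper name `raw_vector_bytes` is hypothetical.

```python
# Bytes per vector component for the three supported data types.
BYTES_PER_COMPONENT = {"float32": 4, "uint8": 1, "int8": 1}


def raw_vector_bytes(num_vectors, dim, data_type):
    """Storage for the vectors alone; graph edges and overhead are ignored."""
    return num_vectors * dim * BYTES_PER_COMPONENT[data_type]


# Same shape as the example dataset: 1,000,000 vectors of dimension 128.
for name in BYTES_PER_COMPONENT:
    size_mb = raw_vector_bytes(1_000_000, 128, name) / 1e6
    print(f"{name}: {size_mb:.0f} MB")
```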
|
155
|
+
|
156
|
+
To query the index we just created, we generate IID query vectors from the standard normal distribution and search as follows:
|
157
|
+
|
158
|
+
```python
|
159
|
+
|
160
|
+
# Set query-time parameters
|
161
|
+
k = 100
|
162
|
+
ef_search = 100
|
163
|
+
|
164
|
+
# Run k-NN query with a single thread.
|
165
|
+
index.set_num_threads(1)
|
166
|
+
|
167
|
+
queries = np.random.randn(1000, dataset_to_index.shape[1])
|
168
|
+
for query in queries:
|
169
|
+
distances, indices = index.search_single(
|
170
|
+
query=query,
|
171
|
+
ef_search=ef_search,
|
172
|
+
K=k,
|
173
|
+
)
|
174
|
+
|
175
|
+
```
|
176
|
+
|
177
|
+
You can parallelize the search by setting the number of threads to a desired number and using a different API that also returns the exact same results as `search_single`.
|
178
|
+
|
179
|
+
```python
|
180
|
+
index.set_num_threads(16)
|
181
|
+
distances, indices = index.search(queries=queries, ef_search=ef_search, K=k)
|
182
|
+
```
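To judge the quality of the returned neighbors, a common check is recall@K against exact ground truth, which is also what the C++ example later in this README computes. The sketch below is plain Python and does not call FlatNav; the toy label lists stand in for the `indices` returned by a search and for ground-truth labels you would obtain separately, and the function names are hypothetical.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the k true nearest neighbors found in the top-k results."""
    hits = len(set(retrieved[:k]) & set(ground_truth[:k]))
    return hits / k


def mean_recall(all_retrieved, all_ground_truth, k):
    """Average recall@k over a batch of queries."""
    recalls = [
        recall_at_k(retrieved, truth, k)
        for retrieved, truth in zip(all_retrieved, all_ground_truth)
    ]
    return sum(recalls) / len(recalls)


# Toy labels for two queries with k = 4.
retrieved = [[0, 1, 2, 3], [4, 5, 6, 7]]
ground_truth = [[0, 1, 2, 9], [7, 6, 5, 4]]
print(mean_recall(retrieved, ground_truth, k=4))  # 0.875
```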
|
183
|
+
|
184
|
+
### Getting Started in C++
|
185
|
+
|
186
|
+
As mentioned earlier, there is nothing to build since this is header-only. We will translate the above Python code into C++ to illustrate how to use the C++ API.
|
187
|
+
|
188
|
+
```c++
|
189
|
+
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>
|
190
|
+
#include <flatnav/index/Index.h>
|
191
|
+
#include <flatnav/distances/SquaredL2Distance.h>
|
192
|
+
#include <flatnav/distances/DistanceInterface.h>
|
193
|
+
|
194
|
+
template <typename dist_t>
|
195
|
+
void run_knn_search(Index<dist_t, int>* index, float *queries, int* gtruth,
|
196
|
+
int ef_search, int K, int num_queries, int num_gtruth, int dim) {
|
197
|
+
|
198
|
+
float mean_recall = 0;
|
199
|
+
for (int i = 0; i < num_queries; i++) {
|
200
|
+
float *q = queries + dim * i;
|
201
|
+
int *g = gtruth + num_gtruth * i;
|
202
|
+
std::vector<std::pair<float, int>> result =
|
203
|
+
index->search(q, K, ef_search);
|
204
|
+
|
205
|
+
float recall = 0;
|
206
|
+
for (int j = 0; j < K; j++) {
|
207
|
+
for (int l = 0; l < K; l++) {
|
208
|
+
if (result[j].second == g[l]) {
|
209
|
+
recall = recall + 1;
|
210
|
+
}
|
211
|
+
}
|
212
|
+
}
|
213
|
+
recall = recall / K;
|
214
|
+
mean_recall = mean_recall + recall;
|
215
|
+
}
|
216
|
+
}
|
217
|
+
|
218
|
+
|
219
|
+
int main(int argc, char** argv) {
|
220
|
+
uint32_t dataset_size = 1000000;
|
221
|
+
uint32_t dataset_dimension = 128;
|
222
|
+
|
223
|
+
// We skip the random data generation, but you can do that with std::mt19937, std::random_device
|
224
|
+
// and std::normal_distribution
|
225
|
+
// std::vector<float> dataset_to_index;
|
226
|
+
|
227
|
+
uint32_t max_edges_per_node = 32;
|
228
|
+
uint32_t ef_construction = 100;
|
229
|
+
|
230
|
+
// Create an index with l2 distance
|
231
|
+
auto distance = SquaredL2Distance<>::create(dataset_dimension);
|
232
|
+
auto* index = new Index<SquaredL2Distance<DataType::float32>, int>(
|
233
|
+
/* dist = */ std::move(distance), /* dataset_size = */ dataset_size,
|
234
|
+
/* max_edges_per_node = */ max_edges_per_node);
|
235
|
+
|
236
|
+
index->setNumThreads(/* num_threads = */ 16);
|
237
|
+
|
238
|
+
std::vector<int> labels(dataset_size);
|
239
|
+
std::iota(labels.begin(), labels.end(), 0);
|
240
|
+
index->template addBatch<float>(/* data = */ (void *)dataset_to_index.data(),
|
241
|
+
/* labels = */ labels,
|
242
|
+
/* ef_construction */ ef_construction);
|
243
|
+
|
244
|
+
// Now query the index and compute the recall
|
245
|
+
// We assume you have a ground truth (int*) array and a queries (float*) array
|
246
|
+
uint32_t ef_search = 100;
|
247
|
+
uint32_t k = 100;
|
248
|
+
uint32_t num_queries = 1000;
|
249
|
+
uint32_t num_gtruth = 1000;
|
250
|
+
|
251
|
+
// Query the index and compute the recall.
|
252
|
+
run_knn_search(index, queries, gtruth, ef_search, k, num_queries, num_gtruth, dataset_dimension);
|
253
|
+
}
|
254
|
+
|
255
|
+
```
|
256
|
+
|
257
|
+
### Datasets from ANN-Benchmarks
|
258
|
+
|
259
|
+
ANN-Benchmarks provides HDF5 files for a standard benchmark of near-neighbor datasets, queries, and ground-truth results. To index any of these datasets, you can use the `construct_npy.cpp` and `query_npy.cpp` files linked above.
|
260
|
+
|
261
|
+
To generate the [ANNS benchmark datasets](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets), run the following script
|
262
|
+
|
263
|
+
```shell
|
264
|
+
$ ./bin/download_anns_datasets.sh <dataset-name> [--normalize]
|
265
|
+
```
|
266
|
+
|
267
|
+
For datasets that use angular/cosine similarity, you will need to use the `--normalize` option so that the distances are computed correctly.
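The sketch below illustrates why normalization matters for angular datasets, assuming `--normalize` scales each vector to unit length (the script's exact behavior is not documented here). On unit-length vectors the plain inner product equals cosine similarity, and squared L2 distance equals `2 - 2 * cosine`, so both orderings agree with the cosine ordering. All helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [1.0, 2.0]
a_unit, b_unit = normalize(a), normalize(b)

# On unit vectors: inner product == cosine, squared L2 == 2 - 2 * cosine.
print(math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b)))
print(math.isclose(squared_l2(a_unit, b_unit), 2 - 2 * cosine_similarity(a, b)))
```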
|
268
|
+
|
269
|
+
Available dataset names include:
|
270
|
+
|
271
|
+
```shell
|
272
|
+
_ mnist-784-euclidean
|
273
|
+
_ sift-128-euclidean
|
274
|
+
_ glove-25-angular
|
275
|
+
_ glove-50-angular
|
276
|
+
_ glove-100-angular
|
277
|
+
_ glove-200-angular
|
278
|
+
_ deep-image-96-angular
|
279
|
+
_ gist-960-euclidean
|
280
|
+
_ nytimes-256-angular
|
281
|
+
```
|
282
|
+
|
283
|
+
### Experimental API and Future Extensions
|
284
|
+
|
285
|
+
You can find the current work under development under the [development-features](https://github.com/BlaiseMuhirwa/flatnav/blob/main/development-features) directory.
|
286
|
+
While some of these features may be usable, they are not guaranteed to be stable. Stable features will be expected to be part of the PyPI releases.
|
287
|
+
The most notable ongoing extension under development is product quantization.
|
288
|
+
|
289
|
+
## Citation
|
290
|
+
If you find this library useful, please consider citing our associated paper:
|
291
|
+
|
292
|
+
```
|
293
|
+
@article{munyampirwa2024down,
|
294
|
+
title={Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"},
|
295
|
+
author={Munyampirwa, Blaise and Lakshman, Vihan and Coleman, Benjamin},
|
296
|
+
journal={arXiv preprint arXiv:2412.01940},
|
297
|
+
year={2024}
|
298
|
+
}
|
299
|
+
```
|
@@ -0,0 +1,6 @@
|
|
1
|
+
flatnav/__init__.py,sha256=Rrz04JKfpOf0cP1T8yy_sfT2z5dL9HMrnL9lsIzCCBg,579
|
2
|
+
flatnav/_core.cpython-38-darwin.so,sha256=oxJWP5Pjk_V9RAf67YRg5e5nuk5y3nsV0w_i18CjH6s,682920
|
3
|
+
flatnav-0.1.2.dist-info/METADATA,sha256=2Zsista1n0hTA5dqRQj6KwxySEAQnOjk78w3FKGen2s,11855
|
4
|
+
flatnav-0.1.2.dist-info/WHEEL,sha256=1Q0ilh_UmUoDz5aBvEtcYwUkqD8WzqUHpTXCjcN17Ek,104
|
5
|
+
flatnav-0.1.2.dist-info/top_level.txt,sha256=FVUKVYK356G2MlNoIaTtjmGUzJNV_2wLRmcHtuSUP3Y,8
|
6
|
+
flatnav-0.1.2.dist-info/RECORD,,
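Each RECORD row above has the form `path,sha256=<digest>,size`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed; the RECORD file's own row leaves hash and size empty. A minimal sketch of producing and parsing such rows with the standard library follows. The helper names are hypothetical, and the simple split assumes paths without commas.

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return the `sha256=...` field a wheel RECORD would hold for `data`."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def parse_record_line(line: str):
    """Split a RECORD row into (path, hash, size); hash and size may be empty."""
    path, file_hash, size = line.rsplit(",", 2)
    return path, file_hash, int(size) if size else None


row = "flatnav-0.1.2.dist-info/top_level.txt,sha256=FVUKVYK356G2MlNoIaTtjmGUzJNV_2wLRmcHtuSUP3Y,8"
print(parse_record_line(row))
print(parse_record_line("flatnav-0.1.2.dist-info/RECORD,,"))
```

To verify an installed file, compare `record_hash(open(path, "rb").read())` with the hash field of its row.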
|
@@ -0,0 +1 @@
|
|
1
|
+
flatnav
|