usearch 2.23.0__cp314-cp314t-macosx_11_0_arm64.whl

@@ -0,0 +1,602 @@
1
+ Metadata-Version: 2.4
2
+ Name: usearch
3
+ Version: 2.23.0
4
+ Summary: Smaller & Faster Single-File Vector Search Engine from Unum
5
+ Home-page: https://github.com/unum-cloud/usearch
6
+ Author: Ash Vardanian
7
+ Author-email: info@unum.cloud
8
+ License: Apache-2.0
9
+ Classifier: Development Status :: 5 - Production/Stable
10
+ Classifier: Natural Language :: English
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Intended Audience :: Information Technology
13
+ Classifier: License :: OSI Approved :: Apache Software License
14
+ Classifier: Programming Language :: C++
15
+ Classifier: Programming Language :: Python :: 3 :: Only
16
+ Classifier: Programming Language :: Python :: Implementation :: CPython
17
+ Classifier: Programming Language :: Java
18
+ Classifier: Programming Language :: JavaScript
19
+ Classifier: Programming Language :: Objective C
20
+ Classifier: Programming Language :: Rust
21
+ Classifier: Programming Language :: Other
22
+ Classifier: Operating System :: MacOS
23
+ Classifier: Operating System :: Unix
24
+ Classifier: Operating System :: Microsoft :: Windows
25
+ Classifier: Topic :: System :: Clustering
26
+ Classifier: Topic :: Database :: Database Engines/Servers
27
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Requires-Dist: numpy
31
+ Requires-Dist: tqdm
32
+ Requires-Dist: simsimd<7.0.0,>=6.0.5
33
+ Dynamic: author
34
+ Dynamic: author-email
35
+ Dynamic: classifier
36
+ Dynamic: description
37
+ Dynamic: description-content-type
38
+ Dynamic: home-page
39
+ Dynamic: license
40
+ Dynamic: license-file
41
+ Dynamic: requires-dist
42
+ Dynamic: summary
43
+
44
+ <h1 align="center">USearch</h1>
45
+ <h3 align="center">
46
+ Smaller & <a href="https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel">Faster</a> Single-File<br/>
47
+ Similarity Search & Clustering Engine for <a href="https://github.com/ashvardanian/simsimd">Vectors</a> & 🔜 <a href="https://github.com/ashvardanian/stringzilla">Texts</a>
48
+ </h3>
49
+ <br/>
50
+
51
+ <p align="center">
52
+ <a href="https://discord.gg/A6wxt6dS9j"><img height="25" src="https://github.com/unum-cloud/.github/raw/main/assets/discord.svg" alt="Discord"></a>
53
+ &nbsp;&nbsp;&nbsp;
54
+ <a href="https://www.linkedin.com/company/unum-cloud/"><img height="25" src="https://github.com/unum-cloud/.github/raw/main/assets/linkedin.svg" alt="LinkedIn"></a>
55
+ &nbsp;&nbsp;&nbsp;
56
+ <a href="https://twitter.com/unum_cloud"><img height="25" src="https://github.com/unum-cloud/.github/raw/main/assets/twitter.svg" alt="Twitter"></a>
57
+ &nbsp;&nbsp;&nbsp;
58
+ <a href="https://unum.cloud/post"><img height="25" src="https://github.com/unum-cloud/.github/raw/main/assets/blog.svg" alt="Blog"></a>
59
+ &nbsp;&nbsp;&nbsp;
60
+ <a href="https://github.com/unum-cloud/usearch"><img height="25" src="https://github.com/unum-cloud/.github/raw/main/assets/github.svg" alt="GitHub"></a>
61
+ </p>
62
+
63
+ <p align="center">
64
+ Spatial • Binary • Probabilistic • User-Defined Metrics
65
+ <br/>
66
+ <a href="https://unum-cloud.github.io/usearch/cpp">C++11</a> •
67
+ <a href="https://unum-cloud.github.io/usearch/python">Python 3</a> •
68
+ <a href="https://unum-cloud.github.io/usearch/javascript">JavaScript</a> •
69
+ <a href="https://unum-cloud.github.io/usearch/java">Java</a> •
70
+ <a href="https://unum-cloud.github.io/usearch/rust">Rust</a> •
71
+ <a href="https://unum-cloud.github.io/usearch/c">C99</a> •
72
+ <a href="https://unum-cloud.github.io/usearch/objective-c">Objective-C</a> •
73
+ <a href="https://unum-cloud.github.io/usearch/swift">Swift</a> •
74
+ <a href="https://unum-cloud.github.io/usearch/csharp">C#</a> •
75
+ <a href="https://unum-cloud.github.io/usearch/golang">Go</a> •
76
+ <a href="https://unum-cloud.github.io/usearch/wolfram">Wolfram</a>
77
+ <br/>
78
+ Linux • macOS • Windows • iOS • Android • WebAssembly •
79
+ <a href="https://unum-cloud.github.io/usearch/sqlite">SQLite</a>
80
+ </p>
81
+
82
+ <div align="center">
83
+ <a href="https://pepy.tech/project/usearch"> <img alt="PyPI" src="https://static.pepy.tech/personalized-badge/usearch?period=total&units=abbreviation&left_color=black&right_color=blue&left_text=Python%20PyPI%20installs"> </a>
84
+ <a href="https://www.npmjs.com/package/usearch"> <img alt="NPM" src="https://img.shields.io/npm/dy/usearch?label=JavaScript%20NPM%20installs"> </a>
85
+ <a href="https://crates.io/crates/usearch"> <img alt="Crate" src="https://img.shields.io/crates/d/usearch?label=Rust%20Crate%20installs"> </a>
86
+ <a href="https://www.nuget.org/packages/Cloud.Unum.USearch"> <img alt="NuGet" src="https://img.shields.io/nuget/dt/Cloud.Unum.USearch?label=CSharp%20NuGet%20installs"> </a>
87
+ <!-- Maven Central publishing is deprecated for now; fat-JAR download is the supported path. -->
88
+ <img alt="GitHub code size in bytes" src="https://img.shields.io/github/languages/code-size/unum-cloud/usearch?label=Repo%20size">
89
+ </div>
90
+
91
+ ---
92
+
93
+ - ✅ __[10x faster][faster-than-faiss]__ [HNSW][hnsw-algorithm] implementation than [FAISS][faiss].
94
+ - ✅ Simple and extensible [single C++11 header][usearch-header] __library__.
95
+ - ✅ [Trusted](#integrations) by giants like Google and DBs like [ClickHouse][clickhouse-docs] & [DuckDB][duckdb-docs].
96
+ - ✅ [SIMD][simd]-optimized and [user-defined metrics](#user-defined-functions) with JIT compilation.
97
+ - ✅ Hardware-agnostic `f16` & `i8` - [half-precision & quarter-precision support](#memory-efficiency-downcasting-and-quantization).
98
+ - ✅ [View large indexes from disk](#serialization--serving-index-from-disk) without loading into RAM.
99
+ - ✅ Heterogeneous lookups, renaming/relabeling, and on-the-fly deletions.
100
+ - ✅ Binary Tanimoto and Sorensen coefficients for [Genomics and Chemistry applications](#usearch--rdkit--molecular-search).
101
+ - ✅ Space-efficient point-clouds with `uint40_t`, accommodating 4B+ size.
102
+ - ✅ Compatible with OpenMP and custom "executors" for fine-grained parallelism.
103
+ - ✅ [Semantic Search](#usearch--ai--multi-modal-semantic-search) and [Joins](#joins-one-to-one-one-to-many-and-many-to-many-mappings).
104
+ - 🔄 Near-real-time [clustering and sub-clustering](#clustering) for tens or millions of clusters.
105
+
106
+ [faiss]: https://github.com/facebookresearch/faiss
107
+ [usearch-header]: https://github.com/unum-cloud/usearch/blob/main/include/usearch/index.hpp
108
+ [obscure-use-cases]: https://ashvardanian.com/posts/abusing-vector-search
109
+ [hnsw-algorithm]: https://arxiv.org/abs/1603.09320
110
+ [simd]: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
111
+ [faster-than-faiss]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel
112
+ [clickhouse-docs]: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes#usearch
113
+ [duckdb-docs]: https://duckdb.org/2024/05/03/vector-similarity-search-vss.html
114
+
115
+ __Technical Insights__ and related articles:
116
+
117
+ - [Uses Arm SVE and x86 AVX-512's masked loads to eliminate tail `for`-loops](https://ashvardanian.com/posts/simsimd-faster-scipy/#tails-of-the-past-the-significance-of-masked-loads).
118
+ - [Uses Horner's method for polynomial approximations, beating GCC 12 by 119x](https://ashvardanian.com/posts/gcc-12-vs-avx512fp16/).
119
+ - [Implements a custom, separate binding for every language](https://ashvardanian.com/posts/porting-cpp-library-to-ten-languages/).
120
+
121
+
122
+ ## Comparison with FAISS
123
+
124
+ FAISS is a widely recognized standard for high-performance vector search engines.
125
+ USearch and FAISS both employ the same HNSW algorithm, but they differ significantly in their design principles.
126
+ USearch is compact and broadly compatible without sacrificing performance, primarily focusing on user-defined metrics and fewer dependencies.
127
+
128
+ | | FAISS | USearch | Improvement |
129
+ | :------------------------------------------- | ----------------------: | -----------------------: | ----------------------: |
130
+ | Indexing time ⁰ | | | |
131
+ | 100 Million 96d `f32`, `f16`, `i8` vectors | 2.6 · 2.6 · 2.6 h | 0.3 · 0.2 · 0.2 h | __9.6 · 10.4 · 10.7 x__ |
132
+ | 100 Million 1536d `f32`, `f16`, `i8` vectors | 5.0 · 4.1 · 3.8 h | 2.1 · 1.1 · 0.8 h | __2.3 · 3.6 · 4.4 x__ |
133
+ | | | | |
134
+ | Codebase length ¹ | 84 K [SLOC][sloc] | 3 K [SLOC][sloc] | maintainable |
135
+ | Supported metrics ² | 9 fixed metrics | any metric | extendible |
136
+ | Supported languages ³ | C++, Python | 10 languages | portable |
137
+ | Supported ID types ⁴ | 32-bit, 64-bit | 32-bit, 40-bit, 64-bit | efficient |
138
+ | Filtering ⁵ | ban-lists | any predicates | composable |
139
+ | Required dependencies ⁶ | BLAS, OpenMP | - | light-weight |
140
+ | Bindings ⁷ | SWIG | Native | low-latency |
141
+ | Python binding size ⁸ | [~ 10 MB][faiss-weight] | [< 1 MB][usearch-weight] | deployable |
142
+
143
+ [sloc]: https://en.wikipedia.org/wiki/Source_lines_of_code
144
+ [faiss-weight]: https://pypi.org/project/faiss-cpu/#files
145
+ [usearch-weight]: https://pypi.org/project/usearch/#files
146
+
147
+ > ⁰ [Tested][intel-benchmarks] on Intel Sapphire Rapids, with the simplest inner-product distance, equivalent recall, and memory consumption while also providing far superior search speed.
148
+ > ¹ A shorter codebase of `usearch/` over `faiss/` makes the project easier to maintain and audit.
149
+ > ² User-defined metrics allow you to customize your search for various applications, from GIS to creating custom metrics for composite embeddings from multiple AI models or hybrid full-text and semantic search.
150
+ > ³ With USearch, you can reuse the same preconstructed index in various programming languages.
151
+ > ⁴ The 40-bit integer allows you to store 4B+ vectors without allocating 8 bytes for every neighbor reference in the proximity graph.
152
+ > ⁵ With USearch the index can be combined with arbitrary external containers, like Bloom filters or third-party databases, to filter out irrelevant keys during index traversal.
153
+ > ⁶ Lack of obligatory dependencies makes USearch much more portable.
154
+ > ⁷ Native bindings introduce lower call latencies than more straightforward approaches.
155
+ > ⁸ Lighter bindings make downloads and deployments faster.
156
+
157
+ [intel-benchmarks]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel
158
+
159
+ Base functionality is identical to FAISS, and the interface should be familiar if you have ever investigated Approximate Nearest Neighbors search:
160
+
161
+ ```py
162
+ # pip install usearch
163
+
164
+ import numpy as np
165
+ from usearch.index import Index
166
+
167
+ index = Index(ndim=3) # Default settings for 3D vectors
168
+ vector = np.array([0.2, 0.6, 0.4]) # Can be a matrix for batch operations
169
+ index.add(42, vector) # Add one or many vectors in parallel
170
+ matches = index.search(vector, 10) # Find 10 nearest neighbors
171
+
172
+ assert matches[0].key == 42
173
+ assert matches[0].distance <= 0.001
174
+ assert np.allclose(index[42], vector, atol=0.1) # Ensure high tolerance in mixed-precision comparisons
175
+ ```
176
+
177
+ More settings are always available, and the API is designed to be as flexible as possible.
178
+ The default storage/quantization level is hardware-dependent for efficiency, but `bf16` is recommended for most modern CPUs.
179
+
180
+ ```py
181
+ index = Index(
182
+     ndim=3, # Define the number of dimensions in input vectors
183
+     metric='cos', # Choose 'l2sq', 'ip', 'haversine' or other metric, default = 'cos'
184
+     dtype='bf16', # Store as 'f64', 'f32', 'f16', 'i8', 'b1'..., default = None
185
+     connectivity=16, # Optional: Limit number of neighbors per graph node
186
+     expansion_add=128, # Optional: Control the recall of indexing
187
+     expansion_search=64, # Optional: Control the quality of the search
188
+     multi=False, # Optional: Allow multiple vectors per key, default = False
189
+ )
190
+ ```
191
+
192
+ ## Serialization & Serving `Index` from Disk
193
+
194
+ USearch supports multiple forms of serialization:
195
+
196
+ - Into a __file__ defined with a path.
197
+ - Into a __stream__ defined with a callback, serializing or reconstructing incrementally.
198
+ - Into a __buffer__ of fixed length or a memory-mapped file that supports random access.
199
+
200
+ The latter allows you to serve indexes from external memory, enabling you to optimize your server choices for indexing speed and serving costs.
201
+ This can result in __20x cost reduction__ on AWS and other public clouds.
202
+
203
+ ```py
204
+ index.save("index.usearch")
205
+
206
+ index.load("index.usearch")
207
+ view = Index.restore("index.usearch", view=True)
208
+
209
+ other_view = Index(ndim=..., metric=...)
210
+ other_view.view("index.usearch")
211
+ ```
212
+
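The idea of serving data from disk without loading it into RAM can be illustrated with the standard library alone. The sketch below does not use the USearch file format: the flat `f32` layout, the `NDIM` constant, and the helpers `write_vectors` and `read_vector` are assumptions made purely for illustration. It memory-maps a file and decodes a single vector on demand.

```python
import mmap
import os
import struct
import tempfile

NDIM = 4  # hypothetical dimensionality of every stored vector
ROW = struct.Struct(f"<{NDIM}f")  # one little-endian f32 row

def write_vectors(path, vectors):
    """Write vectors back-to-back as raw f32 rows (a toy layout, not the USearch format)."""
    with open(path, "wb") as file:
        for vector in vectors:
            file.write(ROW.pack(*vector))

def read_vector(view, i):
    """Decode only the i-th row; the OS pages in just the bytes that are touched."""
    return ROW.unpack_from(view, i * ROW.size)

path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, [[float(i)] * NDIM for i in range(1000)])

with open(path, "rb") as file:
    view = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
    row_42 = read_vector(view, 42)
    view.close()
```

Because the mapping is read-only and addressed by offset, many processes can share the same file, which is the property that makes viewing cheaper than loading.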
213
+ ## Exact vs. Approximate Search
214
+
215
+ Approximate search methods, such as HNSW, are predominantly used when an exact brute-force search becomes too resource-intensive.
216
+ This typically occurs when you have millions of entries in a collection.
217
+ For smaller collections, we offer a more direct approach with the `search` function.
218
+
219
+ ```py
220
+ from usearch.index import search, MetricKind, Matches, BatchMatches
221
+ import numpy as np
222
+
223
+ # Generate 10'000 random vectors with 1024 dimensions
224
+ vectors = np.random.rand(10_000, 1024).astype(np.float32)
225
+ vector = np.random.rand(1024).astype(np.float32)
226
+
227
+ one_in_many: Matches = search(vectors, vector, 50, MetricKind.L2sq, exact=True)
228
+ many_in_many: BatchMatches = search(vectors, vectors, 50, MetricKind.L2sq, exact=True)
229
+ ```
230
+
231
+ If you pass the `exact=True` argument, the system bypasses indexing altogether and performs a brute-force search through the entire dataset using SIMD-optimized similarity metrics from [SimSIMD](https://github.com/ashvardanian/simsimd).
232
+ When compared to FAISS's `IndexFlatL2` in Google Colab, __[USearch may offer up to a 20x performance improvement](https://github.com/unum-cloud/usearch/issues/176#issuecomment-1666650778)__:
233
+
234
+ - `faiss.IndexFlatL2`: __55.3 ms__.
235
+ - `usearch.index.search`: __2.54 ms__.
236
+
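To make clear what "bypassing the index" means, here is a minimal pure-Python sketch of a brute-force scan. It is not how `usearch.index.search` is implemented (that path relies on SIMD kernels); the helper names `l2sq` and `exact_search` are hypothetical.

```python
import heapq

def l2sq(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(dataset, query, count):
    """Return the `count` closest (distance, key) pairs by scoring every vector."""
    scored = ((l2sq(vector, query), key) for key, vector in enumerate(dataset))
    return heapq.nsmallest(count, scored)

dataset = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]]
top2 = exact_search(dataset, [0.0, 0.1], 2)
```

Every query touches every vector, so the cost grows linearly with the collection size, which is why this approach suits small collections.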
237
+ ## User-Defined Metrics
238
+
239
+ While most vector search packages concentrate on just two metrics, "Inner Product distance" and "Euclidean distance", USearch allows arbitrary user-defined metrics.
240
+ This flexibility allows you to customize your search for various applications, from computing geospatial coordinates with the rare [Haversine][haversine] distance to creating custom metrics for composite embeddings from multiple AI models, like joint image-text embeddings.
241
+ You can use [Numba][numba], [Cppyy][cppyy], or [PeachPy][peachpy] to define your [custom metric even in Python](https://unum-cloud.github.io/usearch/python#user-defined-metrics-and-jit-in-python):
242
+
243
+ ```py
244
+ from numba import cfunc, types, carray
245
+ from usearch.index import Index, MetricKind, MetricSignature, CompiledMetric
246
+
247
+ ndim = 256
248
+
249
+ @cfunc(types.float32(types.CPointer(types.float32), types.CPointer(types.float32)))
250
+ def python_inner_product(a, b):
251
+     a_array = carray(a, ndim)
251
+     b_array = carray(b, ndim)
252
+     c = 0.0
253
+     for i in range(ndim):
254
+         c += a_array[i] * b_array[i]
255
+     return 1 - c
257
+
258
+ metric = CompiledMetric(pointer=python_inner_product.address, kind=MetricKind.IP, signature=MetricSignature.ArrayArray)
259
+ index = Index(ndim=ndim, metric=metric, dtype='f32')
260
+ ```
261
+
262
+ A similar effect is even easier to achieve in the C, C++, and Rust interfaces.
262
+ Moreover, unlike older approaches to indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length.
264
+ They only have to be comparable.
265
+ So you can apply it in [obscure][obscure] applications, like searching for similar sets or fuzzy text matching, using [GZip][gzip-similarity] compression-ratio as a distance function.
266
+
267
+ [haversine]: https://ashvardanian.com/posts/abusing-vector-search#geo-spatial-indexing
268
+ [obscure]: https://ashvardanian.com/posts/abusing-vector-search
269
+ [gzip-similarity]: https://twitter.com/LukeGessler/status/1679211291292889100?s=20
270
+
271
+ [numba]: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#c-callbacks
272
+ [cppyy]: https://cppyy.readthedocs.io/en/latest/
273
+ [peachpy]: https://github.com/Maratyszcza/PeachPy
274
+
275
+ ## Filtering and Predicate Functions
276
+
277
+ Sometimes you may want to cross-reference search-results against some external database or filter them based on some criteria.
278
+ In most engines, you'd have to manually perform paging requests, successively filtering the results.
279
+ In USearch you can simply pass a predicate function to the search method, which will be applied directly during graph traversal.
280
+ In Rust that would look like this:
281
+
282
+ ```rust
283
+ let is_odd = |key: Key| key % 2 == 1;
284
+ let query = vec![0.2, 0.1, 0.2, 0.1, 0.3];
285
+ let results = index.filtered_search(&query, 10, is_odd).unwrap();
286
+ assert!(
287
+     results.keys.iter().all(|&key| key % 2 == 1),
288
+     "All keys must be odd"
289
+ );
290
+ ```
291
+
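The difference between filtering during the search and filtering afterwards can be shown with a library-independent sketch. The brute-force scan below stands in for graph traversal, and `filtered_search` here is a hypothetical helper, not a USearch binding.

```python
import heapq

def filtered_search(dataset, query, count, predicate):
    """Top-`count` nearest keys among those accepted by `predicate`.

    The predicate is checked before a candidate is scored, so rejected keys
    never occupy a slot in the result set and no paging is needed.
    """
    candidates = (
        (sum((x - y) ** 2 for x, y in zip(vector, query)), key)
        for key, vector in dataset.items()
        if predicate(key)
    )
    return [key for _, key in heapq.nsmallest(count, candidates)]

dataset = {key: [float(key)] for key in range(10)}  # toy 1D vectors keyed 0..9
odd_neighbors = filtered_search(dataset, [4.0], 3, lambda key: key % 2 == 1)
```

Post-filtering the three unfiltered nearest keys (4, 3, 5) would leave only two results and force another request, while the predicate-aware version always returns a full set when enough matching keys exist.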
292
+ ## Memory Efficiency, Downcasting, and Quantization
293
+
294
+ Training a quantization model and dimension-reduction is a common approach to accelerate vector search.
295
+ Those, however, are only sometimes reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.
296
+ Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors.
297
+ The same index and its `add` and `search` operations automatically down-cast or up-cast between `f64_t`, `f32_t`, `f16_t`, `i8_t`, and single-bit `b1x8_t` representations.
298
+ You can use the following command to check if hardware acceleration is enabled:
299
+
300
+ ```sh
301
+ $ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
302
+ > sapphire
303
+ $ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
304
+ > ice
305
+ ```
306
+
307
+ In most cases, it's recommended to use half-precision floating-point numbers on modern hardware.
308
+ When quantization is enabled, the "get"-like functions won't be able to recover the original data, so you may want to replicate the original vectors elsewhere.
309
+ When quantizing to `i8_t` integers, note that it's only valid for cosine-like metrics.
310
+ As part of the quantization process, the vectors are normalized to unit length and later scaled to [-127, 127] range to occupy the full 8-bit range.
311
+ When quantizing to `b1x8_t` single-bit representations, note that it's only valid for binary metrics like Jaccard, Hamming, etc.
312
+ As part of the quantization process, the scalar components greater than zero are set to `true`, and the rest to `false`.
313
+
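The two quantization rules described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the rounding mode, the most-significant-bit-first packing order, and the helper names `quantize_i8` and `quantize_b1` are all hypothetical.

```python
import math

def quantize_i8(vector):
    """Normalize to unit length, then scale to the [-127, 127] integer range."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return [0] * len(vector)
    return [round(x / norm * 127) for x in vector]

def quantize_b1(vector):
    """Set a bit for every component greater than zero, packing 8 components per byte."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            packed[i // 8] |= 0x80 >> (i % 8)  # most significant bit first (assumed order)
    return bytes(packed)

as_i8 = quantize_i8([3.0, 4.0])
as_b1 = quantize_b1([0.5, -1.0, 0.0, 2.0, -0.1, 0.0, 0.0, 0.0])
```

Normalization discards the vector's magnitude, which is why `i8_t` is only meaningful for cosine-like metrics, and thresholding at zero keeps nothing but signs, which is why `b1x8_t` only pairs with binary metrics.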
314
+ ![USearch uint40_t support](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-neighbor-types.png?raw=true)
315
+
316
+ Using smaller numeric types will save you RAM needed to store the vectors, but you can also compress the neighbors lists forming our proximity graphs.
317
+ By default, 32-bit `uint32_t` is used to enumerate those, which is not enough if you need to address over 4 Billion entries.
318
+ For such cases, we provide a custom `uint40_t` type, which is still 37.5% more space-efficient than the commonly used 8-byte integers and scales up to 1 trillion entries.
319
+
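The 37.5% and 1 trillion figures follow directly from the byte widths. The sketch below checks the arithmetic and shows a 5-byte round trip; the helpers `pack_uint40` and `unpack_uint40` are hypothetical and only mimic the storage width, not the C++ type.

```python
UINT40_BYTES = 5
UINT64_BYTES = 8

# Fraction of space saved per neighbor reference compared to 8-byte integers
savings = (UINT64_BYTES - UINT40_BYTES) / UINT64_BYTES

# Number of distinct entries a 40-bit reference can address
addressable = 2 ** (8 * UINT40_BYTES)

def pack_uint40(value):
    """Store an integer in exactly 5 little-endian bytes."""
    return value.to_bytes(UINT40_BYTES, "little")

def unpack_uint40(raw):
    """Recover the integer from its 5-byte representation."""
    return int.from_bytes(raw, "little")

slot = 5_000_000_000  # larger than the uint32_t maximum of 4_294_967_295
assert unpack_uint40(pack_uint40(slot)) == slot
```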
320
+
321
+ ## `Indexes` for Multi-Index Lookups
322
+
323
+ For larger workloads targeting billions or even trillions of vectors, parallel multi-index lookups become invaluable.
324
+ Instead of constructing one extensive index, you can build multiple smaller ones and view them together.
325
+
326
+ ```py
327
+ from usearch.index import Indexes
328
+
329
+ multi_index = Indexes(
330
+     indexes=[...],  # Iterable[usearch.index.Index]
331
+     paths=[...],    # Iterable[os.PathLike]
332
+     view=False,     # bool
333
+     threads=0,      # int
334
+ )
335
+ multi_index.search(...)
336
+ ```
337
+
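Conceptually, a multi-index lookup queries every shard and merges the partial results. The sketch below shows that pattern with toy scalar "vectors" and a thread pool; it is not how `Indexes` is implemented, and `search_shard` and `search_shards` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, count):
    """Brute-force top-`count` within one shard of {key: scalar} entries."""
    scored = ((abs(value - query), key) for key, value in shard.items())
    return heapq.nsmallest(count, scored)

def search_shards(shards, query, count, threads=2):
    """Query all shards in parallel and merge their sorted partial results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partial = list(pool.map(lambda shard: search_shard(shard, query, count), shards))
    # Each partial list is already sorted, so a k-way merge yields the global order
    return list(heapq.merge(*partial))[:count]

shards = [{1: 0.1, 2: 0.9}, {3: 0.45, 4: 0.52}, {5: 0.3}]
top3 = search_shards(shards, 0.5, 3)
```

Asking each shard for `count` results is enough to guarantee the merged top `count` is correct, since no shard can contribute more than that many entries to the final answer.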
338
+ ## Clustering
339
+
340
+ Once the index is constructed, USearch can perform K-Nearest Neighbors Clustering much faster than standalone clustering libraries, like SciPy,
341
+ UMAP, and t-SNE.
342
+ The same applies to dimensionality reduction with PCA.
343
+ Essentially, the `Index` itself can be seen as a clustering, allowing iterative deepening.
344
+
345
+ ```py
346
+ clustering = index.cluster(
347
+     min_count=10, # Optional
348
+     max_count=15, # Optional
349
+     threads=..., # Optional
350
+ )
351
+
352
+ # Get the clusters and their sizes
353
+ centroid_keys, sizes = clustering.centroids_popularity
354
+
355
+ # Use Matplotlib to draw a histogram
356
+ clustering.plot_centroids_popularity()
357
+
358
+ # Export a NetworkX graph of the clusters
359
+ g = clustering.network
360
+
361
+ # Get members of a specific cluster
362
+ first_members = clustering.members_of(centroid_keys[0])
363
+
364
+ # Deepen into that cluster, splitting it into more parts, all the same arguments supported
365
+ sub_clustering = clustering.subcluster(min_count=..., max_count=...)
366
+ ```
367
+
368
+ The resulting clustering isn't identical to K-Means or other conventional approaches but serves the same purpose.
369
+ By comparison, using Scikit-Learn on a 1 Million point dataset, one may expect queries to take anywhere from minutes to hours, depending on the number of clusters you want to highlight.
370
+ For 50'000 clusters, the performance difference between USearch and conventional clustering methods may easily reach 100x.
371
+
372
+ ## Joins, One-to-One, One-to-Many, and Many-to-Many Mappings
373
+
374
+ One of the big questions these days is how AI will change the world of databases and data management.
375
+ Most databases are still struggling to implement high-quality fuzzy search, and the only kind of joins they know are deterministic.
376
+ A `join` differs from searching for every entry: it requires a one-to-one mapping and bans collisions among separate search results.
377
+
378
+ | Exact Search | Fuzzy Search | Semantic Search ? |
379
+ | :----------: | :----------: | :---------------: |
380
+ | Exact Join | Fuzzy Join ? | Semantic Join ?? |
381
+
382
+ Using USearch, one can implement sub-quadratic complexity approximate, fuzzy, and semantic joins.
383
+ This can be useful in any fuzzy-matching tasks common to Database Management Software.
384
+
385
+ ```py
386
+ men = Index(...)
387
+ women = Index(...)
388
+ pairs: dict = men.join(women, max_proposals=0, exact=False)
389
+ ```
390
+
391
+ > Read more in the post: [Combinatorial Stable Marriages for Semantic Search 💍](https://ashvardanian.com/posts/searching-stable-marriages)
392
+
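The one-to-one constraint can be illustrated with a small Gale-Shapley style matching over distances. This sketch ranks every candidate exhaustively, so it is quadratic; it only shows the collision-free mapping and is not the sub-quadratic algorithm behind `Index.join`. The name `stable_join` is hypothetical.

```python
def stable_join(left, right):
    """One-to-one mapping from `left` keys to `right` keys via proposals."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Every left entry ranks all right entries from closest to farthest
    preferences = {
        left_key: sorted(right, key=lambda r: distance(left[left_key], right[r]))
        for left_key in left
    }
    next_choice = {left_key: 0 for left_key in left}
    engaged = {}  # right key -> left key currently holding it
    free = list(left)
    while free:
        proposer = free.pop()
        if next_choice[proposer] >= len(preferences[proposer]):
            continue  # exhausted every candidate, stays unmatched
        target = preferences[proposer][next_choice[proposer]]
        next_choice[proposer] += 1
        rival = engaged.get(target)
        if rival is None:
            engaged[target] = proposer
        elif distance(left[proposer], right[target]) < distance(left[rival], right[target]):
            engaged[target] = proposer  # the closer proposer wins the slot
            free.append(rival)
        else:
            free.append(proposer)
    return {left_key: right_key for right_key, left_key in engaged.items()}

men = {0: [0.0], 1: [1.0]}
women = {10: [0.1], 11: [0.9]}
pairs = stable_join(men, women)
```

Unlike running a separate search for each entry, no key on the right side can appear twice in the result.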
393
+
394
+ ## Functionality
395
+
396
+ By now, the core functionality is supported across all bindings.
397
+ Broader functionality is ported per request.
398
+ In some cases, like batch operations, feature parity is meaningless: the host language has full multi-threading capabilities and the USearch index structure is concurrent by design, so users can implement batching, scheduling, and load-balancing in the way that best suits their applications.
399
+
400
+ | | C++ 11 | Python 3 | C 99 | Java | JavaScript | Rust | Go | Swift |
401
+ | :---------------------- | :----: | :------: | :---: | :---: | :--------: | :---: | :---: | :---: |
402
+ | Add, search, remove | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
403
+ | Save, load, view | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
404
+ | User-defined metrics | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
405
+ | Batch operations | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
406
+ | Filter predicates | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ |
407
+ | Joins | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
408
+ | Variable-length vectors | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
409
+ | 4B+ capacities | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
410
+
411
+ ## Application Examples
412
+
413
+ ### USearch + UForm + UCall = Multimodal Semantic Search
414
+
415
+ AI has a growing number of applications, but one of the coolest classic ideas is to use it for Semantic Search.
416
+ One can take an encoder model, like the multi-modal [UForm](https://github.com/unum-cloud/uform), and a web-programming framework, like [UCall](https://github.com/unum-cloud/ucall), and build a text-to-image search platform in just 20 lines of Python.
417
+
418
+ ```python
419
+ from ucall import Server
420
+ from uform import get_model, Modality
421
+ from usearch.index import Index
422
+
423
+ import numpy as np
424
+ import PIL as pil
425
+
426
+ processors, models = get_model('unum-cloud/uform3-image-text-english-small')
427
+ model_text = models[Modality.TEXT_ENCODER]
428
+ model_image = models[Modality.IMAGE_ENCODER]
429
+ processor_text = processors[Modality.TEXT_ENCODER]
430
+ processor_image = processors[Modality.IMAGE_ENCODER]
431
+
432
+ server = Server()
433
+ index = Index(ndim=256)
434
+
435
+ @server
436
+ def add(key: int, photo: pil.Image.Image):
437
+     image = processor_image(photo)
438
+     vector = model_image(image)
439
+     index.add(key, vector.flatten(), copy=True)
440
+
441
+ @server
442
+ def search(query: str) -> np.ndarray:
443
+     tokens = processor_text(query)
444
+     vector = model_text(tokens)
445
+     matches = index.search(vector.flatten(), 3)
446
+     return matches.keys
447
+
448
+ server.run()
449
+ ```
450
+
451
+ Similar experiences can also be implemented in other languages and on the client side, removing the network latency.
452
+ For Swift and iOS, check out the [`ashvardanian/SwiftSemanticSearch`](https://github.com/ashvardanian/SwiftSemanticSearch) repository.
453
+
454
+ <table>
455
+ <tr>
456
+ <td>
457
+ <img src="https://github.com/ashvardanian/ashvardanian/blob/master/demos/SwiftSemanticSearch-Dog.gif?raw=true" alt="SwiftSemanticSearch demo Dog">
458
+ </td>
459
+ <td>
460
+ <img src="https://github.com/ashvardanian/ashvardanian/blob/master/demos/SwiftSemanticSearch-Flowers.gif?raw=true" alt="SwiftSemanticSearch demo with Flowers">
461
+ </td>
462
+ </tr>
463
+ </table>
464
+
465
+ A more complete [demo with Streamlit is available on GitHub](https://github.com/ashvardanian/usearch-images).
466
+ We have pre-processed some commonly used datasets, cleaned the images, produced the vectors, and pre-built the index.
467
+
468
+ | Dataset | Modalities | Images | Download |
469
+ | :---------------------------------- | --------------------: | -----: | ------------------------------------: |
470
+ | [Unsplash][unsplash-25k-origin] | Images & Descriptions | 25 K | [HuggingFace / Unum][unsplash-25k-hf] |
471
+ | [Conceptual Captions][cc-3m-origin] | Images & Descriptions | 3 M | [HuggingFace / Unum][cc-3m-hf] |
472
+ | [Arxiv][arxiv-2m-origin] | Titles & Abstracts | 2 M | [HuggingFace / Unum][arxiv-2m-hf] |
473
+
474
+ [unsplash-25k-origin]: https://github.com/unsplash/datasets
475
+ [cc-3m-origin]: https://huggingface.co/datasets/conceptual_captions
476
+ [arxiv-2m-origin]: https://www.kaggle.com/datasets/Cornell-University/arxiv
477
+
478
+ [unsplash-25k-hf]: https://huggingface.co/datasets/unum-cloud/ann-unsplash-25k
479
+ [cc-3m-hf]: https://huggingface.co/datasets/unum-cloud/ann-cc-3m
480
+ [arxiv-2m-hf]: https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m
481
+
482
+ ### USearch + RDKit = Molecular Search
483
+
484
+ Comparing molecule graphs and searching for similar structures is expensive and slow.
485
+ It can be seen as a special case of the NP-Complete Subgraph Isomorphism problem.
486
+ Luckily, domain-specific approximate methods exist.
487
+ The one commonly used in Chemistry is to generate structures from [SMILES][smiles] and later hash them into binary fingerprints.
488
+ The latter are searchable with binary similarity metrics, like the Tanimoto coefficient.
489
+ Below is an example using the RDKit package.
490
+
491
+ ```python
492
+ from usearch.index import Index, MetricKind
493
+ from rdkit import Chem
494
+ from rdkit.Chem import AllChem
495
+
496
+ import numpy as np
497
+
498
+ molecules = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO')]
499
+ encoder = AllChem.GetRDKitFPGenerator()
500
+
501
+ fingerprints = np.vstack([encoder.GetFingerprint(x) for x in molecules])
502
+ fingerprints = np.packbits(fingerprints, axis=1)
503
+
504
+ index = Index(ndim=2048, metric=MetricKind.Tanimoto)
505
+ keys = np.arange(len(molecules))
506
+
507
+ index.add(keys, fingerprints)
508
+ matches = index.search(fingerprints, 10)
509
+ ```
510
+
511
+ That method was used to build the ["USearch Molecules"](https://github.com/ashvardanian/usearch-molecules), one of the largest Chem-Informatics datasets, containing 7 billion small molecules and 28 billion fingerprints.
512
+
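For reference, the Tanimoto distance over packed binary fingerprints reduces to two population counts. The plain-Python sketch below is for illustration only and is far slower than the SIMD kernels used by the index; `tanimoto_distance` is a hypothetical helper, and treating two empty fingerprints as identical is an assumption.

```python
def tanimoto_distance(a: bytes, b: bytes) -> float:
    """1 - |A and B| / |A or B| over the set bits of two equally long fingerprints."""
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    union = bin(x | y).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical (assumption)
    return 1.0 - bin(x & y).count("1") / union

first = bytes([0b11110000, 0b00001111])
second = bytes([0b11000000, 0b00001111])
distance = tanimoto_distance(first, second)
```

Here the fingerprints share 6 set bits out of 8 in their union, giving a distance of 0.25.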
513
+ [smiles]: https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
514
+ [rdkit-fingerprints]: https://www.rdkit.org/docs/RDKit_Book.html#additional-information-about-the-fingerprints
515
+
516
+ ### USearch + POI Coordinates = GIS Applications
517
+
518
+ Similar to Vector and Molecule search, USearch can be used for Geospatial Information Systems.
519
+ The Haversine distance is available out of the box, but you can also define more complex relationships, like the Vincenty formula, which accounts for the Earth's oblateness.
520
+
521
+ ```py
522
+ from numba import cfunc, types, carray
+ from usearch.index import Index, CompiledMetric, MetricKind, MetricSignature
523
+ import math
524
+
525
+ # Define the dimension as 2 for latitude and longitude
526
+ ndim = 2
527
+
528
+ # Signature for the custom metric
529
+ signature = types.float32(
530
+     types.CPointer(types.float32),
531
+     types.CPointer(types.float32))
532
+
533
+ # WGS-84 ellipsoid parameters
534
+ a = 6378137.0 # major axis in meters
535
+ f = 1 / 298.257223563 # flattening
536
+ b = (1 - f) * a # minor axis
537
+
538
+ @cfunc(signature) # Compile to a C callback, so that `.address` is available
+ def vincenty_distance(a_ptr, b_ptr):
539
+     a_array = carray(a_ptr, ndim)
540
+     b_array = carray(b_ptr, ndim)
541
+     lat1, lon1, lat2, lon2 = a_array[0], a_array[1], b_array[0], b_array[1]
542
+     L, U1, U2 = lon2 - lon1, math.atan((1 - f) * math.tan(lat1)), math.atan((1 - f) * math.tan(lat2))
543
+     sinU1, cosU1, sinU2, cosU2 = math.sin(U1), math.cos(U1), math.sin(U2), math.cos(U2)
544
+     lambda_, iterLimit = L, 100
545
+     while iterLimit > 0:
546
+         iterLimit -= 1
547
+         sinLambda, cosLambda = math.sin(lambda_), math.cos(lambda_)
548
+         sinSigma = math.sqrt((cosU2 * sinLambda) ** 2 + (cosU1 * sinU2 - sinU1 * cosU2 * cosLambda) ** 2)
549
+         if sinSigma == 0: return 0.0 # Co-incident points
550
+         cosSigma = sinU1 * sinU2 + cosU1 * cosU2 * cosLambda
+         sigma = math.atan2(sinSigma, cosSigma) # Needs the freshly computed cosSigma
551
+         sinAlpha, cos2Alpha = cosU1 * cosU2 * sinLambda / sinSigma, 1 - (cosU1 * cosU2 * sinLambda / sinSigma) ** 2
552
+         cos2SigmaM = cosSigma - 2 * sinU1 * sinU2 / cos2Alpha if cos2Alpha != 0 else 0.0 # Equatorial line
553
+         C = f / 16 * cos2Alpha * (4 + f * (4 - 3 * cos2Alpha))
554
+         lambda_, lambdaP = L + (1 - C) * f * (sinAlpha * (sigma + C * sinSigma * (cos2SigmaM + C * cosSigma * (-1 + 2 * cos2SigmaM ** 2)))), lambda_
555
+         if abs(lambda_ - lambdaP) <= 1e-12: break
556
+     if iterLimit == 0: return float('nan') # formula failed to converge
557
+     u2 = cos2Alpha * (a ** 2 - b ** 2) / (b ** 2)
558
+     A = 1 + u2 / 16384 * (4096 + u2 * (-768 + u2 * (320 - 175 * u2)))
559
+     B = u2 / 1024 * (256 + u2 * (-128 + u2 * (74 - 47 * u2)))
560
+     deltaSigma = B * sinSigma * (cos2SigmaM + B / 4 * (cosSigma * (-1 + 2 * cos2SigmaM ** 2) - B / 6 * cos2SigmaM * (-3 + 4 * sinSigma ** 2) * (-3 + 4 * cos2SigmaM ** 2)))
561
+     s = b * A * (sigma - deltaSigma)
562
+     return s / 1000.0 # Distance in kilometers
563
+
564
+ # Example usage:
565
+ index = Index(ndim=ndim, metric=CompiledMetric(
566
+ pointer=vincenty_distance.address,
567
+ kind=MetricKind.Haversine,
568
+ signature=MetricSignature.ArrayArray,
569
+ ))
570
+ ```
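When checking a custom geodesic metric such as the Vincenty example, it can help to compare a few distances against a plain-Python great-circle baseline. The sketch below is not part of USearch: the helper name `haversine_km`, the constant `EARTH_RADIUS_KM` (a mean Earth radius of 6371.0088 km), and the sample coordinates are assumptions made for illustration. Inputs are taken to be in radians, as in the Vincenty example.

```python
import math

# Assumed mean Earth radius in kilometers (illustrative; not taken from USearch).
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to [0, 1] to guard against rounding error before the square root.
    h = min(1.0, max(0.0, h))
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Hypothetical sample points: Paris and London, converted from degrees to radians.
paris = (math.radians(48.8566), math.radians(2.3522))
london = (math.radians(51.5074), math.radians(-0.1278))
print(round(haversine_km(*paris, *london), 1))
```

The spherical and ellipsoidal results usually differ by well under one percent, so a much larger gap for the same pair of points suggests a unit mix-up (degrees versus radians, meters versus kilometers) in the custom metric.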
571
+
572
+ ## Integrations & Users
573
+
574
+ - [x] ClickHouse: [C++](https://github.com/ClickHouse/ClickHouse/pull/53447), [docs](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes#usearch).
575
+ - [x] DuckDB: [post](https://duckdb.org/2024/05/03/vector-similarity-search-vss.html).
576
+ - [x] ScyllaDB: [Rust](https://github.com/scylladb/vector-store), [presentation](https://www.slideshare.net/slideshow/vector-search-with-scylladb-by-szymon-wasik/276571548).
577
+ - [x] TiDB & TiFlash: [C++](https://github.com/pingcap/tiflash), [announcement](https://www.pingcap.com/article/introduce-vector-search-indexes-in-tidb/).
578
+ - [x] YugaByte: [C++](https://github.com/yugabyte/yugabyte-db/blob/366b9f5e3c4df3a1a17d553db41d6dc50146f488/src/yb/vector_index/usearch_wrapper.cc).
579
+ - [x] Google: [UniSim](https://github.com/google/unisim), [RetSim](https://arxiv.org/abs/2311.17264) paper.
580
+ - [x] MemGraph: [C++](https://github.com/memgraph/memgraph/blob/784dd8520f65050d033aea8b29446e84e487d091/src/storage/v2/indices/vector_index.cpp), [announcement](https://memgraph.com/blog/simplify-data-retrieval-memgraph-vector-search).
581
+ - [x] LanternDB: [C++](https://github.com/lanterndata/lantern), [Rust](https://github.com/lanterndata/lantern_extras), [docs](https://lantern.dev/blog/hnsw-index-creation).
582
+ - [x] LangChain: [Python](https://github.com/langchain-ai/langchain/releases/tag/v0.0.257) and [JavaScript](https://github.com/hwchase17/langchainjs/releases/tag/0.0.125).
583
+ - [x] Microsoft Semantic Kernel: [Python](https://github.com/microsoft/semantic-kernel/releases/tag/python-0.3.9.dev) and C#.
584
+ - [x] GPTCache: [Python](https://github.com/zilliztech/GPTCache/releases/tag/0.1.29).
585
+ - [x] Sentence-Transformers: Python [docs](https://www.sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch).
586
+ - [x] Pathway: [Rust](https://github.com/pathwaycom/pathway).
587
+ - [x] Vald: [GoLang](https://github.com/vdaas/vald).
588
+
589
+
590
+ ## Citations
591
+
592
+ ```bibtex
593
+ @software{Vardanian_USearch_2023,
594
+ doi = {10.5281/zenodo.7949416},
595
+ author = {Vardanian, Ash},
596
+ title = {{USearch by Unum Cloud}},
597
+ url = {https://github.com/unum-cloud/usearch},
598
+ version = {2.23.0},
599
+ year = {2023},
600
+ month = oct,
601
+ }
602
+ ```
@@ -0,0 +1,14 @@
1
+ usearch-2.23.0.dist-info/RECORD,,
2
+ usearch-2.23.0.dist-info/WHEEL,sha256=26nyvDx4qlf6NyRSh1NSNrXJDCQeX0hnJ7EH1bB1egM,137
3
+ usearch-2.23.0.dist-info/top_level.txt,sha256=zFbid1SmQjk8RsbEgpqF7tTjgWdFvE2z0e1LQ2hKdPg,8
4
+ usearch-2.23.0.dist-info/METADATA,sha256=DTcl2SDqkS0OuhePkVavPAneFlHx86VjEYvSlyAmwkg,33101
5
+ usearch-2.23.0.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
6
+ usearch/server.py,sha256=ExEj8Sge6yxXbTbKJMbb5APb1waHaInS52ImNoGAy0I,3679
7
+ usearch/index.py,sha256=fcOErSY3t4PumI3gdgbX3zZIfzB8jbh-v2Rn1nAzPOc,61291
8
+ usearch/numba.py,sha256=qxHLChxggVUhPChnKBwUuFXFRKHx0dcXDHoRyFxkap0,3920
9
+ usearch/client.py,sha256=r7Gp4e8V6udeEU1shn7cXVRO42mufD8bsXnHdKACq-o,4137
10
+ usearch/io.py,sha256=1U_O4PjKYwpEbTUVfxEDbJvu8SK7k2iDAwOwwoZe1p0,4230
11
+ usearch/__init__.py,sha256=jTm0I7dYLKOC5k3vQ_uodYm1gR8HxbOY5zWvTnrurcU,5954
12
+ usearch/compiled.cpython-314t-darwin.so,sha256=RytjU38b5FMmAAyzWcO-Vs2UcyysXaIGr1GQCtDObKE,1352432
13
+ usearch/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
14
+ usearch/eval.py,sha256=3IBDUbyubAZs29djZUcixg0pL7IgvwqDssAEpWY-Qj0,16743
@@ -0,0 +1,6 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.9.0)
3
+ Root-Is-Purelib: false
4
+ Tag: cp314-cp314t-macosx_11_0_arm64
5
+ Generator: delocate 0.13.0
6
+