abloom 0.1.0__tar.gz

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
abloom-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 ampribe
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,6 @@
+ include abloom/*.c
+ include abloom/*.h
+ include abloom/*.pyi
+ include abloom/py.typed
+ include LICENSE
+ include README.md
abloom-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,94 @@
+ Metadata-Version: 2.4
+ Name: abloom
+ Version: 0.1.0
+ Summary: High-performance Bloom filter for Python
+ Author-email: Andrew Pribe <andrewpribe@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/ampribe/abloom
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: C
+ Classifier: Topic :: Software Development :: Libraries
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Dynamic: license-file
+
+ # abloom
+ [![Tests](https://github.com/ampribe/abloom/actions/workflows/test.yml/badge.svg)](https://github.com/ampribe/abloom/actions/workflows/test.yml)
+
+ `abloom` is a high-performance Bloom filter implementation for Python, written in C.
+
+ Why use `abloom`?
+ - `abloom` significantly outperforms all other Python Bloom filter libraries: 2.77x faster on `add`, 2.43x faster on `update`, and 1.34x faster on lookups than the next-fastest implementation, `rbloom` (1M ints, 1% FPR). Complete benchmark results are in [BENCHMARK.md](BENCHMARK.md).
+ - `abloom` is rigorously tested on Python >= 3.8 on Ubuntu, Windows, and macOS.
+
+ ## Usage
+ Install with `pip install abloom`.
+
+ ```python
+ from abloom import BloomFilter
+
+ bf = BloomFilter(1_000_000, 0.01)
+ bf.add(1)
+ bf.add(("arbitrary", "object", "that", "implements", "hash"))
+ bf.update([2, 3, 4])
+
+ assert 1 in bf
+ assert ("arbitrary", "object", "that", "implements", "hash") in bf
+ assert 5 not in bf
+ repr(bf)  # '<BloomFilter capacity=1000000 items=5 fp_rate=0.01>'
+ ```
+
+ `abloom` relies on Python's built-in hash function, so stored objects must implement `__hash__`. Python seeds its hash function differently in each process, so a Bloom filter cannot be transferred between processes.
+
+ `abloom` implements a split block Bloom filter with 512-bit blocks and power-of-2 rounding of the block count. This costs roughly 1.5-2x the memory of a standard Bloom filter and can reduce performance at very high capacities or very low false positive rates; the 10M ints, 0.1% FPR benchmark shows this effect, though `abloom` is still significantly faster than the alternative libraries. See [IMPLEMENTATION.md](IMPLEMENTATION.md) for implementation and memory-usage details.
+
+ ## Testing
+
+ ```bash
+ # Install dev dependencies
+ pip install -e ".[test]"
+
+ # Run unit tests
+ pytest tests/ --ignore=tests/test_benchmark.py --ignore=tests/test_fpr.py -v
+
+ # Run all tests including slow FPR validation
+ pytest tests/ --ignore=tests/test_benchmark.py -v
+
+ # Cross-version testing (requires tox and multiple Python versions)
+ pip install tox
+ tox
+ ```
+
+ ## Benchmarking
+
+ See [BENCHMARK.md](BENCHMARK.md) for detailed results and filtering options.
+
+ ```bash
+ # Install benchmark dependencies
+ pip install -e ".[benchmark]"
+
+ # Run all benchmarks
+ pytest tests/test_benchmark.py --benchmark-only
+
+ # Run canonical benchmark (1M ints, 1% FPR)
+ pytest tests/test_benchmark.py -k "int_1000000_0.01" --benchmark-only -v
+
+ # Filter by operation, library, or data type
+ pytest tests/test_benchmark.py -k "add" --benchmark-only     # Add only
+ pytest tests/test_benchmark.py -k "abloom" --benchmark-only  # abloom only
+ pytest tests/test_benchmark.py -k "uuid" --benchmark-only    # UUIDs only
+
+ # Save results for report generation
+ pytest tests/test_benchmark.py --benchmark-only --benchmark-json=results.json
+ python scripts/generate_benchmark_report.py results.json
+ ```
abloom-0.1.0/README.md ADDED
@@ -0,0 +1,70 @@
+ # abloom
+ [![Tests](https://github.com/ampribe/abloom/actions/workflows/test.yml/badge.svg)](https://github.com/ampribe/abloom/actions/workflows/test.yml)
+
+ `abloom` is a high-performance Bloom filter implementation for Python, written in C.
+
+ Why use `abloom`?
+ - `abloom` significantly outperforms all other Python Bloom filter libraries: 2.77x faster on `add`, 2.43x faster on `update`, and 1.34x faster on lookups than the next-fastest implementation, `rbloom` (1M ints, 1% FPR). Complete benchmark results are in [BENCHMARK.md](BENCHMARK.md).
+ - `abloom` is rigorously tested on Python >= 3.8 on Ubuntu, Windows, and macOS.
+
+ ## Usage
+ Install with `pip install abloom`.
+
+ ```python
+ from abloom import BloomFilter
+
+ bf = BloomFilter(1_000_000, 0.01)
+ bf.add(1)
+ bf.add(("arbitrary", "object", "that", "implements", "hash"))
+ bf.update([2, 3, 4])
+
+ assert 1 in bf
+ assert ("arbitrary", "object", "that", "implements", "hash") in bf
+ assert 5 not in bf
+ repr(bf)  # '<BloomFilter capacity=1000000 items=5 fp_rate=0.01>'
+ ```
+
+ `abloom` relies on Python's built-in hash function, so stored objects must implement `__hash__`. Python seeds its hash function differently in each process, so a Bloom filter cannot be transferred between processes.
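The hashability requirement above can be seen without `abloom` at all; this is a small sketch using only Python built-ins (the `BloomFilter` itself is not imported here):

```python
# add()/`in` call Python's hash(), so arguments must implement __hash__.
key = ("arbitrary", "object", "that", "implements", "hash")
assert isinstance(hash(key), int)  # tuples of hashables are hashable

# Mutable built-ins set __hash__ = None, so they are rejected up front.
try:
    hash([2, 3, 4])
    raised = False
except TypeError:
    raised = True
assert raised

# str/bytes hashes are additionally salted per process (PYTHONHASHSEED),
# which is why a filter's bit pattern only makes sense inside the process
# that built it.
assert isinstance(hash("abloom"), int)
```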
+
+ `abloom` implements a split block Bloom filter with 512-bit blocks and power-of-2 rounding of the block count. This costs roughly 1.5-2x the memory of a standard Bloom filter and can reduce performance at very high capacities or very low false positive rates; the 10M ints, 0.1% FPR benchmark shows this effect, though `abloom` is still significantly faster than the alternative libraries. See [IMPLEMENTATION.md](IMPLEMENTATION.md) for implementation and memory-usage details.
+
+ ## Testing
+
+ ```bash
+ # Install dev dependencies
+ pip install -e ".[test]"
+
+ # Run unit tests
+ pytest tests/ --ignore=tests/test_benchmark.py --ignore=tests/test_fpr.py -v
+
+ # Run all tests including slow FPR validation
+ pytest tests/ --ignore=tests/test_benchmark.py -v
+
+ # Cross-version testing (requires tox and multiple Python versions)
+ pip install tox
+ tox
+ ```
+
+ ## Benchmarking
+
+ See [BENCHMARK.md](BENCHMARK.md) for detailed results and filtering options.
+
+ ```bash
+ # Install benchmark dependencies
+ pip install -e ".[benchmark]"
+
+ # Run all benchmarks
+ pytest tests/test_benchmark.py --benchmark-only
+
+ # Run canonical benchmark (1M ints, 1% FPR)
+ pytest tests/test_benchmark.py -k "int_1000000_0.01" --benchmark-only -v
+
+ # Filter by operation, library, or data type
+ pytest tests/test_benchmark.py -k "add" --benchmark-only     # Add only
+ pytest tests/test_benchmark.py -k "abloom" --benchmark-only  # abloom only
+ pytest tests/test_benchmark.py -k "uuid" --benchmark-only    # UUIDs only
+
+ # Save results for report generation
+ pytest tests/test_benchmark.py --benchmark-only --benchmark-json=results.json
+ python scripts/generate_benchmark_report.py results.json
+ ```
@@ -0,0 +1,8 @@
+ """
+ abloom - A high-performance Bloom filter for Python
+ """
+
+ from abloom._abloom import BloomFilter
+
+ __version__ = '0.1.0'
+ __all__ = ['BloomFilter']
@@ -0,0 +1,348 @@
+ #define PY_SSIZE_T_CLEAN
+ #include <Python.h>
+ #include <math.h>
+ #include <string.h>
+
+ // SBBF constants: 512-bit blocks (8 x 64-bit words)
+ #define BLOCK_BITS 512
+ #define BLOCK_BYTES 64
+ #define BLOCK_WORDS 8
+ #define BITS_PER_WORD 64
+
+ // Salt constants from the Parquet split block Bloom filter spec
+ static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
+                                  0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
+                                  0x9efc4947U, 0x5c6bfb31U};
+
+ typedef struct {
+   PyObject_HEAD
+   uint64_t *blocks;
+   uint64_t block_count;
+   uint64_t block_mask;
+   uint64_t item_count;
+   uint64_t capacity;
+   double fp_rate;
+ } BloomFilter;
+
+ // Bits per item for x = -log2(fp_rate), tabulated from 1.0 to 20.0
+ // in steps of 0.5
+ static const float SBBF512_LUT[] = {
+     3.2304f,  3.8302f,  4.3978f,  4.9555f,  5.5148f,  6.0828f,  6.6644f,
+     7.2634f,  7.8830f,  8.5260f,  9.1952f,  9.8929f,  10.6217f, 11.3841f,
+     12.1826f, 13.0199f, 13.8988f, 14.8220f, 15.7926f, 16.8139f, 17.8892f,
+     19.0222f, 20.2168f, 21.4771f, 22.8076f, 24.2130f, 25.6984f, 27.2693f,
+     28.9318f, 30.6921f, 32.5573f, 34.5347f, 36.6325f, 38.8595f, 41.2251f,
+     43.7396f, 46.4143f, 49.2614f, 52.2942f};
+ #define SBBF512_LUT_SIZE 39
+
+ // 64-bit finalizer from MurmurHash3 (fmix64): spreads entropy across all bits
+ static inline uint64_t mix64(uint64_t x) {
+   x ^= x >> 33;
+   x *= 0xff51afd7ed558ccdULL;
+   x ^= x >> 33;
+   x *= 0xc4ceb9fe1a85ec53ULL;
+   x ^= x >> 33;
+   return x;
+ }
+
+ static uint64_t next_power_of_2(uint64_t n) {
+   if (n == 0)
+     return 1;
+   n--;
+   n |= n >> 1;
+   n |= n >> 2;
+   n |= n >> 4;
+   n |= n >> 8;
+   n |= n >> 16;
+   n |= n >> 32;
+   return n + 1;
+ }
+
+ static uint64_t calculate_block_count(uint64_t capacity, double fp_rate) {
+   if (capacity == 0)
+     capacity = 1;
+
+   double x = -log2(fp_rate);
+
+   double bits_per_item;
+   if (x <= 1.0) {
+     bits_per_item = SBBF512_LUT[0];
+   } else if (x >= 20.0) {
+     // Extrapolate past the end of the table
+     double slope = (SBBF512_LUT[38] - SBBF512_LUT[37]) / 0.5;
+     bits_per_item = SBBF512_LUT[38] + slope * (x - 20.0);
+   } else {
+     // Linear interpolation between adjacent table entries
+     int idx = (int)((x - 1.0) / 0.5);
+     if (idx >= SBBF512_LUT_SIZE - 1)
+       idx = SBBF512_LUT_SIZE - 2;
+     double t = (x - 1.0 - idx * 0.5) / 0.5;
+     bits_per_item = SBBF512_LUT[idx] * (1.0 - t) + SBBF512_LUT[idx + 1] * t;
+   }
+
+   if (bits_per_item < 8.0)
+     bits_per_item = 8.0;
+
+   uint64_t total_bits = (uint64_t)ceil(capacity * bits_per_item);
+   uint64_t min_blocks = (total_bits + BLOCK_BITS - 1) / BLOCK_BITS;
+
+   return next_power_of_2(min_blocks);
+ }
+
+ static inline void bloom_insert(BloomFilter *bf, uint64_t hash) {
+   // Upper 32 bits select the block
+   uint64_t block_idx = (hash >> 32) & bf->block_mask;
+   uint32_t h_low = (uint32_t)hash;
+
+   uint64_t *block = &bf->blocks[block_idx * BLOCK_WORDS];
+
+   // Each salt maps the low 32 bits to one bit position (the top 6 bits of
+   // the 32-bit product) in the corresponding word
+   uint32_t p0 = (h_low * SALT[0]) >> 26;
+   uint32_t p1 = (h_low * SALT[1]) >> 26;
+   uint32_t p2 = (h_low * SALT[2]) >> 26;
+   uint32_t p3 = (h_low * SALT[3]) >> 26;
+   uint32_t p4 = (h_low * SALT[4]) >> 26;
+   uint32_t p5 = (h_low * SALT[5]) >> 26;
+   uint32_t p6 = (h_low * SALT[6]) >> 26;
+   uint32_t p7 = (h_low * SALT[7]) >> 26;
+
+   block[0] |= (1ULL << p0);
+   block[1] |= (1ULL << p1);
+   block[2] |= (1ULL << p2);
+   block[3] |= (1ULL << p3);
+   block[4] |= (1ULL << p4);
+   block[5] |= (1ULL << p5);
+   block[6] |= (1ULL << p6);
+   block[7] |= (1ULL << p7);
+ }
+
+ static inline int bloom_check(BloomFilter *bf, uint64_t hash) {
+   uint64_t block_idx = (hash >> 32) & bf->block_mask;
+   uint32_t h_low = (uint32_t)hash;
+   uint64_t *block = &bf->blocks[block_idx * BLOCK_WORDS];
+
+ #define CHECK_WORD(i) \
+   if (!(block[i] & (1ULL << ((h_low * SALT[i]) >> 26)))) \
+     return 0
+
+   CHECK_WORD(0);
+   CHECK_WORD(1);
+   CHECK_WORD(2);
+   CHECK_WORD(3);
+   CHECK_WORD(4);
+   CHECK_WORD(5);
+   CHECK_WORD(6);
+   CHECK_WORD(7);
+
+   return 1;
+ #undef CHECK_WORD
+ }
+
+ static int get_hash(PyObject *item, uint64_t *out_hash) {
+   Py_hash_t py_hash = PyObject_Hash(item);
+   if (py_hash == -1 && PyErr_Occurred())
+     return -1;
+
+   // Mix the Python hash so both 32-bit halves are well distributed
+   *out_hash = mix64((uint64_t)py_hash);
+   return 0;
+ }
+
+ static PyObject *BloomFilter_update(BloomFilter *self, PyObject *iterable) {
+   PyObject *iter = PyObject_GetIter(iterable);
+   if (iter == NULL)
+     return NULL;
+
+   PyObject *item;
+   while ((item = PyIter_Next(iter)) != NULL) {
+     uint64_t hash;
+     if (get_hash(item, &hash) < 0) {
+       Py_DECREF(item);
+       Py_DECREF(iter);
+       return NULL;
+     }
+     bloom_insert(self, hash);
+     self->item_count++;
+     Py_DECREF(item);
+   }
+   Py_DECREF(iter);
+
+   if (PyErr_Occurred())
+     return NULL;
+   Py_RETURN_NONE;
+ }
+
+ static PyObject *BloomFilter_add(BloomFilter *self, PyObject *item) {
+   uint64_t hash;
+   if (get_hash(item, &hash) < 0)
+     return NULL;
+
+   bloom_insert(self, hash);
+   self->item_count++;
+
+   Py_RETURN_NONE;
+ }
+
+ static int BloomFilter_contains(BloomFilter *self, PyObject *item) {
+   uint64_t hash;
+   if (get_hash(item, &hash) < 0)
+     return -1;
+
+   return bloom_check(self, hash);
+ }
+
+ static Py_ssize_t BloomFilter_len(BloomFilter *self) {
+   return (Py_ssize_t)self->item_count;
+ }
+
+ static PyObject *BloomFilter_get_capacity(BloomFilter *self, void *closure) {
+   return PyLong_FromUnsignedLongLong(self->capacity);
+ }
+
+ static PyObject *BloomFilter_get_fp_rate(BloomFilter *self, void *closure) {
+   return PyFloat_FromDouble(self->fp_rate);
+ }
+
+ static PyObject *BloomFilter_get_k(BloomFilter *self, void *closure) {
+   return PyLong_FromLong(BLOCK_WORDS); // Always 8 for SBBF
+ }
+
+ static PyObject *BloomFilter_get_byte_count(BloomFilter *self, void *closure) {
+   uint64_t bytes = self->block_count * BLOCK_BYTES;
+   return PyLong_FromUnsignedLongLong(bytes);
+ }
+
+ static PyObject *BloomFilter_get_bit_count(BloomFilter *self, void *closure) {
+   uint64_t bits = self->block_count * BLOCK_BITS;
+   return PyLong_FromUnsignedLongLong(bits);
+ }
+
+ static void BloomFilter_dealloc(BloomFilter *self) {
+   if (self->blocks) {
+     PyMem_Free(self->blocks);
+   }
+   Py_TYPE(self)->tp_free((PyObject *)self);
+ }
+
+ static int BloomFilter_init(BloomFilter *self, PyObject *args, PyObject *kwds) {
+   static char *kwlist[] = {"capacity", "fp_rate", NULL};
+   unsigned long long capacity;
+   double fp_rate = 0.01;
+
+   if (!PyArg_ParseTupleAndKeywords(args, kwds, "K|d", kwlist, &capacity,
+                                    &fp_rate)) {
+     return -1;
+   }
+
+   if (capacity == 0) {
+     PyErr_SetString(PyExc_ValueError, "Capacity must be greater than 0");
+     return -1;
+   }
+
+   if (fp_rate <= 0.0 || fp_rate >= 1.0) {
+     PyErr_SetString(PyExc_ValueError,
+                     "False positive rate must be between 0.0 and 1.0");
+     return -1;
+   }
+
+   self->capacity = capacity;
+   self->fp_rate = fp_rate;
+   self->block_count = calculate_block_count(capacity, fp_rate);
+   self->block_mask = self->block_count - 1;
+   self->item_count = 0;
+
+   size_t num_bytes = self->block_count * BLOCK_BYTES;
+   self->blocks = PyMem_Calloc(num_bytes, 1);
+   if (self->blocks == NULL) {
+     PyErr_NoMemory();
+     return -1;
+   }
+
+   return 0;
+ }
+
+ static PyObject *BloomFilter_new(PyTypeObject *type, PyObject *args,
+                                  PyObject *kwds) {
+   BloomFilter *self = (BloomFilter *)type->tp_alloc(type, 0);
+   if (self != NULL) {
+     self->blocks = NULL;
+     self->block_count = 0;
+     self->block_mask = 0;
+     self->item_count = 0;
+     self->capacity = 0;
+     self->fp_rate = 0.0;
+   }
+   return (PyObject *)self;
+ }
+
+ static PyMethodDef BloomFilter_methods[] = {
+     {"add", (PyCFunction)BloomFilter_add, METH_O,
+      "Add an item to the bloom filter"},
+     {"update", (PyCFunction)BloomFilter_update, METH_O,
+      "Add items from an iterable to the bloom filter"},
+     {NULL}};
+
+ static PyGetSetDef BloomFilter_getsetters[] = {
+     {"capacity", (getter)BloomFilter_get_capacity, NULL,
+      "Expected number of items", NULL},
+     {"fp_rate", (getter)BloomFilter_get_fp_rate, NULL,
+      "Target false positive rate", NULL},
+     {"k", (getter)BloomFilter_get_k, NULL,
+      "Number of hash functions (always 8 for SBBF)", NULL},
+     {"byte_count", (getter)BloomFilter_get_byte_count, NULL,
+      "Memory usage in bytes", NULL},
+     {"bit_count", (getter)BloomFilter_get_bit_count, NULL,
+      "Total bits in filter", NULL},
+     {NULL}};
+
+ static PySequenceMethods BloomFilter_as_sequence = {
+     .sq_length = (lenfunc)BloomFilter_len,
+     .sq_contains = (objobjproc)BloomFilter_contains,
+ };
+
+ static PyObject *BloomFilter_repr(BloomFilter *self) {
+   PyObject *fp_obj = PyFloat_FromDouble(self->fp_rate);
+   if (!fp_obj)
+     return NULL;
+
+   PyObject *repr =
+       PyUnicode_FromFormat("<BloomFilter capacity=%llu items=%llu fp_rate=%R>",
+                            self->capacity, self->item_count, fp_obj);
+
+   Py_DECREF(fp_obj);
+   return repr;
+ }
+
+ static PyTypeObject BloomFilterType = {
+     PyVarObject_HEAD_INIT(NULL, 0)
+     .tp_name = "_abloom.BloomFilter",
+     .tp_doc = "High-performance Split Block Bloom Filter",
+     .tp_basicsize = sizeof(BloomFilter),
+     .tp_itemsize = 0,
+     .tp_flags = Py_TPFLAGS_DEFAULT,
+     .tp_new = BloomFilter_new,
+     .tp_init = (initproc)BloomFilter_init,
+     .tp_dealloc = (destructor)BloomFilter_dealloc,
+     .tp_repr = (reprfunc)BloomFilter_repr,
+     .tp_methods = BloomFilter_methods,
+     .tp_getset = BloomFilter_getsetters,
+     .tp_as_sequence = &BloomFilter_as_sequence,
+ };
+
+ static PyModuleDef abloommodule = {
+     PyModuleDef_HEAD_INIT,
+     .m_name = "_abloom",
+     .m_doc = "High-performance Split Block Bloom Filter for Python",
+     .m_size = -1,
+ };
+
+ PyMODINIT_FUNC PyInit__abloom(void) {
+   PyObject *m;
+
+   if (PyType_Ready(&BloomFilterType) < 0)
+     return NULL;
+
+   m = PyModule_Create(&abloommodule);
+   if (m == NULL)
+     return NULL;
+
+   Py_INCREF(&BloomFilterType);
+   if (PyModule_AddObject(m, "BloomFilter", (PyObject *)&BloomFilterType) < 0) {
+     Py_DECREF(&BloomFilterType);
+     Py_DECREF(m);
+     return NULL;
+   }
+
+   return m;
+ }
@@ -0,0 +1,14 @@
+ from typing import Iterable
+
+ class BloomFilter:
+     capacity: int
+     fp_rate: float
+     k: int
+     byte_count: int
+     bit_count: int
+
+     def __init__(self, capacity: int, fp_rate: float = 0.01) -> None: ...
+     def add(self, item: object) -> None: ...
+     def update(self, items: Iterable[object]) -> None: ...
+     def __contains__(self, item: object) -> bool: ...
+     def __len__(self) -> int: ...
File without changes
@@ -0,0 +1,94 @@
+ Metadata-Version: 2.4
+ Name: abloom
+ Version: 0.1.0
+ Summary: High-performance Bloom filter for Python
+ Author-email: Andrew Pribe <andrewpribe@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/ampribe/abloom
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: C
+ Classifier: Topic :: Software Development :: Libraries
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Dynamic: license-file
+
+ # abloom
+ [![Tests](https://github.com/ampribe/abloom/actions/workflows/test.yml/badge.svg)](https://github.com/ampribe/abloom/actions/workflows/test.yml)
+
+ `abloom` is a high-performance Bloom filter implementation for Python, written in C.
+
+ Why use `abloom`?
+ - `abloom` significantly outperforms all other Python Bloom filter libraries: 2.77x faster on `add`, 2.43x faster on `update`, and 1.34x faster on lookups than the next-fastest implementation, `rbloom` (1M ints, 1% FPR). Complete benchmark results are in [BENCHMARK.md](BENCHMARK.md).
+ - `abloom` is rigorously tested on Python >= 3.8 on Ubuntu, Windows, and macOS.
+
+ ## Usage
+ Install with `pip install abloom`.
+
+ ```python
+ from abloom import BloomFilter
+
+ bf = BloomFilter(1_000_000, 0.01)
+ bf.add(1)
+ bf.add(("arbitrary", "object", "that", "implements", "hash"))
+ bf.update([2, 3, 4])
+
+ assert 1 in bf
+ assert ("arbitrary", "object", "that", "implements", "hash") in bf
+ assert 5 not in bf
+ repr(bf)  # '<BloomFilter capacity=1000000 items=5 fp_rate=0.01>'
+ ```
+
+ `abloom` relies on Python's built-in hash function, so stored objects must implement `__hash__`. Python seeds its hash function differently in each process, so a Bloom filter cannot be transferred between processes.
+
+ `abloom` implements a split block Bloom filter with 512-bit blocks and power-of-2 rounding of the block count. This costs roughly 1.5-2x the memory of a standard Bloom filter and can reduce performance at very high capacities or very low false positive rates; the 10M ints, 0.1% FPR benchmark shows this effect, though `abloom` is still significantly faster than the alternative libraries. See [IMPLEMENTATION.md](IMPLEMENTATION.md) for implementation and memory-usage details.
+
+ ## Testing
+
+ ```bash
+ # Install dev dependencies
+ pip install -e ".[test]"
+
+ # Run unit tests
+ pytest tests/ --ignore=tests/test_benchmark.py --ignore=tests/test_fpr.py -v
+
+ # Run all tests including slow FPR validation
+ pytest tests/ --ignore=tests/test_benchmark.py -v
+
+ # Cross-version testing (requires tox and multiple Python versions)
+ pip install tox
+ tox
+ ```
+
+ ## Benchmarking
+
+ See [BENCHMARK.md](BENCHMARK.md) for detailed results and filtering options.
+
+ ```bash
+ # Install benchmark dependencies
+ pip install -e ".[benchmark]"
+
+ # Run all benchmarks
+ pytest tests/test_benchmark.py --benchmark-only
+
+ # Run canonical benchmark (1M ints, 1% FPR)
+ pytest tests/test_benchmark.py -k "int_1000000_0.01" --benchmark-only -v
+
+ # Filter by operation, library, or data type
+ pytest tests/test_benchmark.py -k "add" --benchmark-only     # Add only
+ pytest tests/test_benchmark.py -k "abloom" --benchmark-only  # abloom only
+ pytest tests/test_benchmark.py -k "uuid" --benchmark-only    # UUIDs only
+
+ # Save results for report generation
+ pytest tests/test_benchmark.py --benchmark-only --benchmark-json=results.json
+ python scripts/generate_benchmark_report.py results.json
+ ```
@@ -0,0 +1,19 @@
+ LICENSE
+ MANIFEST.in
+ README.md
+ pyproject.toml
+ setup.py
+ abloom/__init__.py
+ abloom/_abloom.c
+ abloom/_abloom.pyi
+ abloom/py.typed
+ abloom.egg-info/PKG-INFO
+ abloom.egg-info/SOURCES.txt
+ abloom.egg-info/dependency_links.txt
+ abloom.egg-info/entry_points.txt
+ abloom.egg-info/top_level.txt
+ tests/test_benchmark.py
+ tests/test_edge_cases.py
+ tests/test_fpr.py
+ tests/test_functionality.py
+ tests/test_properties.py
@@ -0,0 +1,2 @@
+ [console_scripts]
+ benchmark-report = analysis.generate_benchmark_report:main
@@ -0,0 +1 @@
+ abloom