datasketches 0.3.1 → 0.3.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/ext/datasketches/cpc_wrapper.cpp +1 -1
- data/lib/datasketches/version.rb +1 -1
- data/vendor/datasketches-cpp/CMakeLists.txt +22 -20
- data/vendor/datasketches-cpp/NOTICE +1 -1
- data/vendor/datasketches-cpp/common/include/MurmurHash3.h +25 -27
- data/vendor/datasketches-cpp/common/include/common_defs.hpp +8 -6
- data/vendor/datasketches-cpp/common/include/count_zeros.hpp +11 -0
- data/vendor/datasketches-cpp/common/include/memory_operations.hpp +5 -4
- data/vendor/datasketches-cpp/common/test/CMakeLists.txt +1 -1
- data/vendor/datasketches-cpp/common/test/integration_test.cpp +6 -0
- data/vendor/datasketches-cpp/count/CMakeLists.txt +42 -0
- data/vendor/datasketches-cpp/count/include/count_min.hpp +351 -0
- data/vendor/datasketches-cpp/count/include/count_min_impl.hpp +517 -0
- data/vendor/datasketches-cpp/count/test/CMakeLists.txt +43 -0
- data/vendor/datasketches-cpp/count/test/count_min_allocation_test.cpp +155 -0
- data/vendor/datasketches-cpp/count/test/count_min_test.cpp +306 -0
- data/vendor/datasketches-cpp/cpc/include/cpc_confidence.hpp +3 -3
- data/vendor/datasketches-cpp/cpc/include/cpc_sketch_impl.hpp +1 -1
- data/vendor/datasketches-cpp/cpc/include/cpc_util.hpp +16 -8
- data/vendor/datasketches-cpp/density/CMakeLists.txt +42 -0
- data/vendor/datasketches-cpp/density/include/density_sketch.hpp +236 -0
- data/vendor/datasketches-cpp/density/include/density_sketch_impl.hpp +543 -0
- data/vendor/datasketches-cpp/density/test/CMakeLists.txt +35 -0
- data/vendor/datasketches-cpp/density/test/density_sketch_test.cpp +244 -0
- data/vendor/datasketches-cpp/fi/include/reverse_purge_hash_map.hpp +9 -3
- data/vendor/datasketches-cpp/hll/include/Hll4Array-internal.hpp +19 -11
- data/vendor/datasketches-cpp/hll/include/Hll4Array.hpp +2 -5
- data/vendor/datasketches-cpp/hll/include/Hll6Array-internal.hpp +19 -7
- data/vendor/datasketches-cpp/hll/include/Hll6Array.hpp +1 -1
- data/vendor/datasketches-cpp/hll/include/Hll8Array-internal.hpp +98 -42
- data/vendor/datasketches-cpp/hll/include/Hll8Array.hpp +2 -0
- data/vendor/datasketches-cpp/hll/include/HllArray-internal.hpp +92 -59
- data/vendor/datasketches-cpp/hll/include/HllArray.hpp +16 -6
- data/vendor/datasketches-cpp/hll/include/HllSketchImplFactory.hpp +3 -21
- data/vendor/datasketches-cpp/hll/include/HllUnion-internal.hpp +8 -0
- data/vendor/datasketches-cpp/hll/include/HllUtil.hpp +14 -6
- data/vendor/datasketches-cpp/hll/include/coupon_iterator-internal.hpp +1 -1
- data/vendor/datasketches-cpp/hll/include/coupon_iterator.hpp +8 -2
- data/vendor/datasketches-cpp/hll/include/hll.hpp +9 -8
- data/vendor/datasketches-cpp/hll/test/HllUnionTest.cpp +7 -1
- data/vendor/datasketches-cpp/kll/include/kll_helper.hpp +0 -1
- data/vendor/datasketches-cpp/kll/include/kll_sketch.hpp +8 -3
- data/vendor/datasketches-cpp/kll/include/kll_sketch_impl.hpp +2 -2
- data/vendor/datasketches-cpp/kll/test/kll_sketch_test.cpp +2 -2
- data/vendor/datasketches-cpp/python/CMakeLists.txt +6 -0
- data/vendor/datasketches-cpp/python/README.md +5 -5
- data/vendor/datasketches-cpp/python/datasketches/DensityWrapper.py +87 -0
- data/vendor/datasketches-cpp/python/datasketches/KernelFunction.py +35 -0
- data/vendor/datasketches-cpp/python/datasketches/PySerDe.py +15 -9
- data/vendor/datasketches-cpp/python/datasketches/TuplePolicy.py +77 -0
- data/vendor/datasketches-cpp/python/datasketches/TupleWrapper.py +205 -0
- data/vendor/datasketches-cpp/python/datasketches/__init__.py +17 -1
- data/vendor/datasketches-cpp/python/include/kernel_function.hpp +98 -0
- data/vendor/datasketches-cpp/python/include/py_object_lt.hpp +37 -0
- data/vendor/datasketches-cpp/python/include/py_object_ostream.hpp +48 -0
- data/vendor/datasketches-cpp/python/include/quantile_conditional.hpp +104 -0
- data/vendor/datasketches-cpp/python/include/tuple_policy.hpp +136 -0
- data/vendor/datasketches-cpp/python/src/count_wrapper.cpp +101 -0
- data/vendor/datasketches-cpp/python/src/cpc_wrapper.cpp +16 -30
- data/vendor/datasketches-cpp/python/src/datasketches.cpp +6 -0
- data/vendor/datasketches-cpp/python/src/density_wrapper.cpp +95 -0
- data/vendor/datasketches-cpp/python/src/fi_wrapper.cpp +127 -73
- data/vendor/datasketches-cpp/python/src/hll_wrapper.cpp +28 -36
- data/vendor/datasketches-cpp/python/src/kll_wrapper.cpp +108 -160
- data/vendor/datasketches-cpp/python/src/py_serde.cpp +5 -4
- data/vendor/datasketches-cpp/python/src/quantiles_wrapper.cpp +99 -148
- data/vendor/datasketches-cpp/python/src/req_wrapper.cpp +117 -178
- data/vendor/datasketches-cpp/python/src/theta_wrapper.cpp +67 -73
- data/vendor/datasketches-cpp/python/src/tuple_wrapper.cpp +215 -0
- data/vendor/datasketches-cpp/python/src/vo_wrapper.cpp +1 -1
- data/vendor/datasketches-cpp/python/tests/count_min_test.py +86 -0
- data/vendor/datasketches-cpp/python/tests/cpc_test.py +10 -10
- data/vendor/datasketches-cpp/python/tests/density_test.py +93 -0
- data/vendor/datasketches-cpp/python/tests/fi_test.py +41 -2
- data/vendor/datasketches-cpp/python/tests/hll_test.py +19 -20
- data/vendor/datasketches-cpp/python/tests/kll_test.py +40 -6
- data/vendor/datasketches-cpp/python/tests/quantiles_test.py +39 -5
- data/vendor/datasketches-cpp/python/tests/req_test.py +38 -5
- data/vendor/datasketches-cpp/python/tests/theta_test.py +16 -14
- data/vendor/datasketches-cpp/python/tests/tuple_test.py +206 -0
- data/vendor/datasketches-cpp/python/tests/vo_test.py +7 -0
- data/vendor/datasketches-cpp/quantiles/include/quantiles_sketch.hpp +8 -3
- data/vendor/datasketches-cpp/quantiles/include/quantiles_sketch_impl.hpp +4 -4
- data/vendor/datasketches-cpp/quantiles/test/quantiles_sketch_test.cpp +1 -1
- data/vendor/datasketches-cpp/req/include/req_compactor_impl.hpp +0 -2
- data/vendor/datasketches-cpp/req/include/req_sketch.hpp +8 -3
- data/vendor/datasketches-cpp/req/include/req_sketch_impl.hpp +2 -2
- data/vendor/datasketches-cpp/sampling/include/var_opt_sketch.hpp +20 -6
- data/vendor/datasketches-cpp/sampling/include/var_opt_sketch_impl.hpp +30 -16
- data/vendor/datasketches-cpp/sampling/include/var_opt_union.hpp +5 -1
- data/vendor/datasketches-cpp/sampling/include/var_opt_union_impl.hpp +19 -15
- data/vendor/datasketches-cpp/sampling/test/var_opt_sketch_test.cpp +33 -14
- data/vendor/datasketches-cpp/sampling/test/var_opt_union_test.cpp +0 -2
- data/vendor/datasketches-cpp/setup.py +1 -1
- data/vendor/datasketches-cpp/theta/CMakeLists.txt +1 -0
- data/vendor/datasketches-cpp/theta/include/bit_packing.hpp +6279 -0
- data/vendor/datasketches-cpp/theta/include/compact_theta_sketch_parser.hpp +14 -8
- data/vendor/datasketches-cpp/theta/include/compact_theta_sketch_parser_impl.hpp +60 -46
- data/vendor/datasketches-cpp/theta/include/theta_helpers.hpp +4 -2
- data/vendor/datasketches-cpp/theta/include/theta_sketch.hpp +58 -10
- data/vendor/datasketches-cpp/theta/include/theta_sketch_impl.hpp +430 -130
- data/vendor/datasketches-cpp/theta/include/theta_union_base_impl.hpp +9 -9
- data/vendor/datasketches-cpp/theta/include/theta_update_sketch_base.hpp +16 -4
- data/vendor/datasketches-cpp/theta/include/theta_update_sketch_base_impl.hpp +2 -2
- data/vendor/datasketches-cpp/theta/test/CMakeLists.txt +1 -0
- data/vendor/datasketches-cpp/theta/test/bit_packing_test.cpp +80 -0
- data/vendor/datasketches-cpp/theta/test/theta_sketch_test.cpp +42 -3
- data/vendor/datasketches-cpp/theta/test/theta_union_test.cpp +25 -0
- data/vendor/datasketches-cpp/tuple/include/tuple_sketch_impl.hpp +2 -1
- data/vendor/datasketches-cpp/version.cfg.in +1 -1
- metadata +31 -3
@@ -43,20 +43,44 @@ kxq1_(0.0),
 hllByteArr_(allocator),
 curMin_(0),
 numAtCurMin_(1 << lgConfigK),
-oooFlag_(false)
+oooFlag_(false),
+rebuild_kxq_curmin_(false)
+{}
+
+template<typename A>
+HllArray<A>::HllArray(const HllArray& other, target_hll_type tgtHllType) :
+HllSketchImpl<A>(other.getLgConfigK(), tgtHllType, hll_mode::HLL, other.isStartFullSize()),
+// remaining fields are initialized to empty sketch defaults
+// and left to subclass constructor to populate
+hipAccum_(0.0),
+kxq0_(1 << other.getLgConfigK()),
+kxq1_(0.0),
+hllByteArr_(other.getAllocator()),
+curMin_(0),
+numAtCurMin_(1 << other.getLgConfigK()),
+oooFlag_(false),
+rebuild_kxq_curmin_(false)
 {}

 template<typename A>
 HllArray<A>* HllArray<A>::copyAs(target_hll_type tgtHllType) const {
-
+  // we may need to recompute KxQ and curMin data for a union gadget,
+  // so only use a direct copy if we have a valid sketch
+  if (tgtHllType == this->getTgtHllType() && !this->isRebuildKxqCurminFlag()) {
     return static_cast<HllArray*>(copy());
   }
-
-
-
-
-
-
+
+  // the factory methods replay the coupons and will always rebuild
+  // the sketch in a consistent way
+  switch (tgtHllType) {
+    case target_hll_type::HLL_4:
+      return HllSketchImplFactory<A>::convertToHll4(*this);
+    case target_hll_type::HLL_6:
+      return HllSketchImplFactory<A>::convertToHll6(*this);
+    case target_hll_type::HLL_8:
+      return HllSketchImplFactory<A>::convertToHll8(*this);
+    default:
+      throw std::invalid_argument("Invalid target HLL type");
   }
 }

@@ -299,7 +323,7 @@ double HllArray<A>::getEstimate() const {
   if (oooFlag_) {
     return getCompositeEstimate();
   }
-  return
+  return hipAccum_;
 }

 // HLL UPPER AND LOWER BOUNDS
@@ -322,54 +346,20 @@ double HllArray<A>::getLowerBound(uint8_t numStdDev) const {
   HllUtil<A>::checkNumStdDev(numStdDev);
   const uint32_t configK = 1 << this->lgConfigK_;
   const double numNonZeros = ((curMin_ == 0) ? (configK - numAtCurMin_) : configK);
-
-
-  double rseFactor;
-  if (oooFlag_) {
-    estimate = getCompositeEstimate();
-    rseFactor = hll_constants::HLL_NON_HIP_RSE_FACTOR;
-  } else {
-    estimate = hipAccum_;
-    rseFactor = hll_constants::HLL_HIP_RSE_FACTOR;
-  }
-
-  double relErr;
-  if (this->lgConfigK_ > 12) {
-    relErr = (numStdDev * rseFactor) / sqrt(configK);
-  } else {
-    relErr = HllUtil<A>::getRelErr(false, oooFlag_, this->lgConfigK_, numStdDev);
-  }
-  return fmax(estimate / (1.0 + relErr), numNonZeros);
+  const double relErr = HllUtil<A>::getRelErr(false, this->oooFlag_, this->lgConfigK_, numStdDev);
+  return fmax(getEstimate() / (1.0 + relErr), numNonZeros);
 }

 template<typename A>
 double HllArray<A>::getUpperBound(uint8_t numStdDev) const {
   HllUtil<A>::checkNumStdDev(numStdDev);
-  const
-
-  double estimate;
-  double rseFactor;
-  if (oooFlag_) {
-    estimate = getCompositeEstimate();
-    rseFactor = hll_constants::HLL_NON_HIP_RSE_FACTOR;
-  } else {
-    estimate = hipAccum_;
-    rseFactor = hll_constants::HLL_HIP_RSE_FACTOR;
-  }
-
-  double relErr;
-  if (this->lgConfigK_ > 12) {
-    relErr = (-1.0) * (numStdDev * rseFactor) / sqrt(configK);
-  } else {
-    relErr = HllUtil<A>::getRelErr(true, oooFlag_, this->lgConfigK_, numStdDev);
-  }
-  return estimate / (1.0 + relErr);
+  const double relErr = HllUtil<A>::getRelErr(true, this->oooFlag_, this->lgConfigK_, numStdDev);
+  return getEstimate() / (1.0 + relErr);
 }

 /**
  * This is the (non-HIP) estimator.
  * It is called "composite" because multiple estimators are pasted together.
- * @param absHllArr an instance of the AbstractHllArray class.
  * @return the composite estimate
  */
 // Original C: again-two-registers.c hhb_get_composite_estimate L1489
@@ -468,16 +458,6 @@ void HllArray<A>::putNumAtCurMin(uint32_t numAtCurMin) {
   numAtCurMin_ = numAtCurMin;
 }

-template<typename A>
-void HllArray<A>::decNumAtCurMin() {
-  --numAtCurMin_;
-}
-
-template<typename A>
-void HllArray<A>::addToHipAccum(double delta) {
-  hipAccum_ += delta;
-}
-
 template<typename A>
 bool HllArray<A>::isCompact() const {
   return false;
@@ -486,7 +466,7 @@ bool HllArray<A>::isCompact() const {
 template<typename A>
 bool HllArray<A>::isEmpty() const {
   const uint32_t configK = 1 << this->lgConfigK_;
-  return (
+  return (curMin_ == 0) && (numAtCurMin_ == configK);
 }

 template<typename A>
@@ -556,6 +536,11 @@ AuxHashMap<A>* HllArray<A>::getAuxHashMap() const {
   return nullptr;
 }

+template<typename A>
+const vector_u8<A>& HllArray<A>::getHllArray() const {
+  return hllByteArr_;
+}
+
 template<typename A>
 void HllArray<A>::hipAndKxQIncrementalUpdate(uint8_t oldValue, uint8_t newValue) {
   const uint32_t configK = 1 << this->getLgConfigK();
@@ -601,6 +586,52 @@ double HllArray<A>::getHllRawEstimate() const {
   return hyperEst;
 }

+template<typename A>
+void HllArray<A>::setRebuildKxqCurminFlag(bool rebuild) {
+  rebuild_kxq_curmin_ = rebuild;
+}
+
+template<typename A>
+bool HllArray<A>::isRebuildKxqCurminFlag() const {
+  return rebuild_kxq_curmin_;
+}
+
+template<typename A>
+void HllArray<A>::check_rebuild_kxq_cur_min() {
+  if (!rebuild_kxq_curmin_) { return; }
+
+  uint8_t cur_min = 64;
+  uint32_t num_at_cur_min = 0;
+  double kxq0 = 1 << this->lgConfigK_;
+  double kxq1 = 0;
+
+  auto it = this->begin(true); // want all points to adjust cur_min
+  const auto end = this->end();
+  while (it != end) {
+    uint8_t v = HllUtil<A>::getValue(*it);
+    if (v > 0) {
+      if (v < 32) { kxq0 += INVERSE_POWERS_OF_2[v] - 1.0; }
+      else { kxq1 += INVERSE_POWERS_OF_2[v] - 1.0; }
+    }
+    if (v > cur_min) { ++it; continue; }
+    if (v < cur_min) {
+      cur_min = v;
+      num_at_cur_min = 1;
+    } else {
+      ++num_at_cur_min;
+    }
+    ++it;
+  }
+
+  kxq0_ = kxq0;
+  kxq1_ = kxq1;
+  curMin_ = cur_min;
+  numAtCurMin_ = num_at_cur_min;
+  rebuild_kxq_curmin_ = false;
+  // HipAccum is not affected
+
+}
+
 template<typename A>
 typename HllArray<A>::const_iterator HllArray<A>::begin(bool all) const {
   return const_iterator(hllByteArr_.data(), 1 << this->lgConfigK_, 0, this->tgtHllType_, nullptr, 0, all);
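The `check_rebuild_kxq_cur_min()` hunk above recomputes the KxQ sums and the current-minimum bookkeeping in one pass over every register. The following is a minimal standalone sketch of that pass, not the library's code: `KxqState` and `rebuild_kxq_cur_min` are hypothetical names, a plain `std::vector<uint8_t>` stands in for iterating the HLL array with `begin(true)`, and `std::ldexp(1.0, -v)` stands in for the `INVERSE_POWERS_OF_2[v]` lookup.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for the fields the rebuild recomputes.
struct KxqState {
  double kxq0;              // running sum for register values below 32
  double kxq1;              // running sum for register values of 32 and above
  uint8_t cur_min;          // smallest register value seen
  uint32_t num_at_cur_min;  // number of registers holding cur_min
};

// One pass over all registers (including zeros), mirroring the loop in the diff.
KxqState rebuild_kxq_cur_min(const std::vector<uint8_t>& registers) {
  // kxq0 starts at K because every zero register contributes 2^0 = 1.
  KxqState s{static_cast<double>(registers.size()), 0.0, 64, 0};
  for (uint8_t v : registers) {
    assert(v < 64);  // assumed upper limit on a register value
    if (v > 0) {
      // Replace the register's initial contribution of 1 with 2^-v.
      const double delta = std::ldexp(1.0, -v) - 1.0;
      if (v < 32) { s.kxq0 += delta; } else { s.kxq1 += delta; }
    }
    if (v < s.cur_min) {
      s.cur_min = v;
      s.num_at_cur_min = 1;
    } else if (v == s.cur_min) {
      ++s.num_at_cur_min;
    }
  }
  return s;
}
```

For the registers `{0, 0, 1, 2}` this yields `kxq0 = 4 - 0.5 - 0.75 = 2.75`, `cur_min = 0` and `num_at_cur_min = 2`. As in the diff, the HIP accumulator is left untouched by the rebuild.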
@@ -637,12 +668,14 @@ bool HllArray<A>::const_iterator::operator!=(const const_iterator& other) const
 }

 template<typename A>
-
+auto HllArray<A>::const_iterator::operator*() const -> reference {
   return HllUtil<A>::pair(index_, value_);
 }

 template<typename A>
 uint8_t HllArray<A>::const_iterator::get_value(const uint8_t* array, uint32_t index, target_hll_type hll_type, const AuxHashMap<A>* exceptions, uint8_t offset) {
+  // TODO: we should be able to improve efficiency here by reading multiple bytes at a time
+  // for HLL4 and HLL6
   if (hll_type == target_hll_type::HLL_4) {
     uint8_t value = array[index >> 1];
     if ((index & 1) > 0) { // odd
@@ -32,6 +32,7 @@ template<typename A>
 class HllArray : public HllSketchImpl<A> {
   public:
     HllArray(uint8_t lgConfigK, target_hll_type tgtHllType, bool startFullSize, const A& allocator);
+    explicit HllArray(const HllArray& other, target_hll_type tgtHllType);

     static HllArray* newHll(const void* bytes, size_t len, const A& allocator);
     static HllArray* newHll(std::istream& is, const A& allocator);
@@ -52,10 +53,6 @@ class HllArray : public HllSketchImpl<A> {
     virtual double getLowerBound(uint8_t numStdDev) const;
     virtual double getUpperBound(uint8_t numStdDev) const;

-    inline void addToHipAccum(double delta);
-
-    inline void decNumAtCurMin();
-
     inline uint8_t getCurMin() const;
     inline uint32_t getNumAtCurMin() const;
     inline double getHipAccum() const;
@@ -90,12 +87,18 @@ class HllArray : public HllSketchImpl<A> {

     virtual AuxHashMap<A>* getAuxHashMap() const;

+    void setRebuildKxqCurminFlag(bool rebuild);
+    bool isRebuildKxqCurminFlag() const;
+    void check_rebuild_kxq_cur_min();
+
     class const_iterator;
     virtual const_iterator begin(bool all = false) const;
     virtual const_iterator end() const;

     virtual A getAllocator() const;

+    const vector_u8<A>& getHllArray() const;
+
   protected:
     void hipAndKxQIncrementalUpdate(uint8_t oldValue, uint8_t newValue);
     double getHllBitMapEstimate() const;
@@ -108,17 +111,24 @@ class HllArray : public HllSketchImpl<A> {
     uint8_t curMin_; //always zero for Hll6 and Hll8, only tracked by Hll4Array
     uint32_t numAtCurMin_; //interpreted as num zeros when curMin == 0
     bool oooFlag_; //Out-Of-Order Flag
+    bool rebuild_kxq_curmin_; // flag to recompute

     friend class HllSketchImplFactory<A>;
 };

 template<typename A>
-class HllArray<A>::const_iterator
+class HllArray<A>::const_iterator {
 public:
+  using iterator_category = std::input_iterator_tag;
+  using value_type = uint32_t;
+  using difference_type = void;
+  using pointer = uint32_t*;
+  using reference = uint32_t;
+
   const_iterator(const uint8_t* array, uint32_t array_slze, uint32_t index, target_hll_type hll_type, const AuxHashMap<A>* exceptions, uint8_t offset, bool all);
   const_iterator& operator++();
   bool operator!=(const const_iterator& other) const;
-
+  reference operator*() const;
 private:
   const uint8_t* array_;
   uint32_t array_size_;
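The iterator hunks in this diff declare the five standard member types (`iterator_category`, `value_type`, `difference_type`, `pointer`, `reference`) directly in the class. The removed base-class text is truncated here, so whether this replaces inheritance from `std::iterator` (deprecated since C++17) is an assumption. The sketch below shows the same pattern on a hypothetical `nonzero_iterator`, unrelated to the library, and checks that `std::iterator_traits` picks the aliases up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <type_traits>

// Hypothetical input iterator that visits only the non-zero entries of an array.
class nonzero_iterator {
public:
  using iterator_category = std::input_iterator_tag;
  using value_type = uint32_t;
  using difference_type = void;  // same choice as in the diff: no distance support
  using pointer = uint32_t*;
  using reference = uint32_t;    // values are returned by copy, not by reference

  nonzero_iterator(const uint32_t* array, size_t size, size_t index)
    : array_(array), size_(size), index_(index) { skip_empty(); }

  nonzero_iterator& operator++() { ++index_; skip_empty(); return *this; }
  bool operator!=(const nonzero_iterator& other) const { return index_ != other.index_; }
  reference operator*() const {
    assert(index_ < size_);  // dereferencing the end iterator is a bug
    return array_[index_];
  }

private:
  void skip_empty() { while (index_ < size_ && array_[index_] == 0) { ++index_; } }

  const uint32_t* array_;
  size_t size_;
  size_t index_;
};

// The member aliases are what std::iterator_traits reads.
static_assert(std::is_same<std::iterator_traits<nonzero_iterator>::iterator_category,
                           std::input_iterator_tag>::value,
              "iterator_traits should see the member aliases");

// Sums the non-zero entries using the iterator.
uint64_t sum_nonzero(const uint32_t* array, size_t size) {
  uint64_t sum = 0;
  const nonzero_iterator end(array, size, size);
  for (nonzero_iterator it(array, size, 0); it != end; ++it) { sum += *it; }
  return sum;
}
```

Declaring `reference` as a value type matches an `operator*()` that builds its result on the fly, which is what the `-> reference` return types in the diff rely on.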
@@ -136,38 +136,20 @@ HllSketchImpl<A>* HllSketchImplFactory<A>::reset(HllSketchImpl<A>* impl, bool st

 template<typename A>
 Hll4Array<A>* HllSketchImplFactory<A>::convertToHll4(const HllArray<A>& srcHllArr) {
-  const uint8_t lgConfigK = srcHllArr.getLgConfigK();
   using Hll4Alloc = typename std::allocator_traits<A>::template rebind_alloc<Hll4Array<A>>;
-
-  Hll4Array<A>(lgConfigK, srcHllArr.isStartFullSize(), srcHllArr.getAllocator());
-  hll4Array->putOutOfOrderFlag(srcHllArr.isOutOfOrderFlag());
-  hll4Array->mergeHll(srcHllArr);
-  hll4Array->putHipAccum(srcHllArr.getHipAccum());
-  return hll4Array;
+  return new (Hll4Alloc(srcHllArr.getAllocator()).allocate(1)) Hll4Array<A>(srcHllArr);
 }

 template<typename A>
 Hll6Array<A>* HllSketchImplFactory<A>::convertToHll6(const HllArray<A>& srcHllArr) {
-  const uint8_t lgConfigK = srcHllArr.getLgConfigK();
   using Hll6Alloc = typename std::allocator_traits<A>::template rebind_alloc<Hll6Array<A>>;
-
-  Hll6Array<A>(lgConfigK, srcHllArr.isStartFullSize(), srcHllArr.getAllocator());
-  hll6Array->putOutOfOrderFlag(srcHllArr.isOutOfOrderFlag());
-  hll6Array->mergeHll(srcHllArr);
-  hll6Array->putHipAccum(srcHllArr.getHipAccum());
-  return hll6Array;
+  return new (Hll6Alloc(srcHllArr.getAllocator()).allocate(1)) Hll6Array<A>(srcHllArr);
 }

 template<typename A>
 Hll8Array<A>* HllSketchImplFactory<A>::convertToHll8(const HllArray<A>& srcHllArr) {
-  const uint8_t lgConfigK = srcHllArr.getLgConfigK();
   using Hll8Alloc = typename std::allocator_traits<A>::template rebind_alloc<Hll8Array<A>>;
-
-  Hll8Array<A>(lgConfigK, srcHllArr.isStartFullSize(), srcHllArr.getAllocator());
-  hll8Array->putOutOfOrderFlag(srcHllArr.isOutOfOrderFlag());
-  hll8Array->mergeHll(srcHllArr);
-  hll8Array->putHipAccum(srcHllArr.getHipAccum());
-  return hll8Array;
+  return new (Hll8Alloc(srcHllArr.getAllocator()).allocate(1)) Hll8Array<A>(srcHllArr);
 }

 }
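The `convertToHll4/6/8` factories above obtain storage from an allocator rebound to the target type and then construct the object in that storage with placement `new`. A minimal sketch of the same pattern outside the library follows. `make_with_allocator`, `destroy_with_allocator` and `point` are hypothetical names, and the `try`/`catch` cleanup is an addition for illustration that does not appear in the diff.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Allocates room for one T through an allocator rebound from A,
// then constructs the T in place.
template<typename T, typename A, typename... Args>
T* make_with_allocator(const A& allocator, Args&&... args) {
  using Alloc = typename std::allocator_traits<A>::template rebind_alloc<T>;
  Alloc alloc(allocator);
  T* ptr = alloc.allocate(1);
  assert(ptr != nullptr);
  try {
    return new (ptr) T(std::forward<Args>(args)...);
  } catch (...) {
    alloc.deallocate(ptr, 1);  // release the storage if the constructor throws
    throw;
  }
}

// Reverses make_with_allocator: runs the destructor, then returns the storage.
template<typename T, typename A>
void destroy_with_allocator(const A& allocator, T* ptr) {
  using Alloc = typename std::allocator_traits<A>::template rebind_alloc<T>;
  Alloc alloc(allocator);
  ptr->~T();
  alloc.deallocate(ptr, 1);
}

// Small type used only to exercise the helpers.
struct point {
  int x;
  int y;
  point(int x_, int y_) : x(x_), y(y_) {}
};
```

A caller would pair the two helpers, for example `point* p = make_with_allocator<point>(std::allocator<char>(), 3, 4);` followed later by `destroy_with_allocator(std::allocator<char>(), p);`. Objects created this way must not be released with plain `delete`.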
@@ -131,21 +131,29 @@ void hll_union_alloc<A>::coupon_update(uint32_t coupon) {

 template<typename A>
 double hll_union_alloc<A>::get_estimate() const {
+  if (gadget_.sketch_impl->getCurMode() == hll_mode::HLL)
+    static_cast<HllArray<A>*>(gadget_.sketch_impl)->check_rebuild_kxq_cur_min();
   return gadget_.get_estimate();
 }

 template<typename A>
 double hll_union_alloc<A>::get_composite_estimate() const {
+  if (gadget_.sketch_impl->getCurMode() == hll_mode::HLL)
+    static_cast<HllArray<A>*>(gadget_.sketch_impl)->check_rebuild_kxq_cur_min();
   return gadget_.get_composite_estimate();
 }

 template<typename A>
 double hll_union_alloc<A>::get_lower_bound(uint8_t num_std_dev) const {
+  if (gadget_.sketch_impl->getCurMode() == hll_mode::HLL)
+    static_cast<HllArray<A>*>(gadget_.sketch_impl)->check_rebuild_kxq_cur_min();
   return gadget_.get_lower_bound(num_std_dev);
 }

 template<typename A>
 double hll_union_alloc<A>::get_upper_bound(uint8_t num_std_dev) const {
+  if (gadget_.sketch_impl->getCurMode() == hll_mode::HLL)
+    static_cast<HllArray<A>*>(gadget_.sketch_impl)->check_rebuild_kxq_cur_min();
   return gadget_.get_upper_bound(num_std_dev);
 }

@@ -152,12 +152,6 @@ inline void HllUtil<A>::hash(const void* key, size_t keyLen, uint64_t seed, Hash
   MurmurHash3_x64_128(key, keyLen, seed, result);
 }

-template<typename A>
-inline double HllUtil<A>::getRelErr(bool upperBound, bool unioned,
-                                    uint8_t lgConfigK, uint8_t numStdDev) {
-  return RelativeErrorTables<A>::getRelErr(upperBound, unioned, lgConfigK, numStdDev);
-}
-
 template<typename A>
 inline uint8_t HllUtil<A>::checkLgK(uint8_t lgK) {
   if ((lgK >= hll_constants::MIN_LOG_K) && (lgK <= hll_constants::MAX_LOG_K)) {
@@ -167,6 +161,20 @@ inline uint8_t HllUtil<A>::checkLgK(uint8_t lgK) {
   }
 }

+template<typename A>
+inline double HllUtil<A>::getRelErr(bool upperBound, bool unioned,
+                                    uint8_t lgConfigK, uint8_t numStdDev) {
+  checkLgK(lgConfigK);
+  if (lgConfigK > 12) {
+    const double rseFactor = unioned ?
+      hll_constants::HLL_NON_HIP_RSE_FACTOR : hll_constants::HLL_HIP_RSE_FACTOR;
+    const uint32_t configK = 1 << lgConfigK;
+    return (upperBound ? -1 : 1) * (numStdDev * rseFactor) / sqrt(configK);
+  } else {
+    return RelativeErrorTables<A>::getRelErr(upperBound, unioned, lgConfigK, numStdDev);
+  }
+}
+
 template<typename A>
 inline void HllUtil<A>::checkMemSize(uint64_t minBytes, uint64_t capBytes) {
   if (capBytes < minBytes) {
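The new `getRelErr()` above folds the closed-form branch for `lgConfigK > 12` into one place, so `getLowerBound()` and `getUpperBound()` no longer duplicate it. The sketch below reproduces only that branch as a free function. The two constant values are assumptions (sqrt(ln 2) for the HIP estimator and sqrt(3 ln 2 - 1) for the non-HIP estimator) since `hll_constants` is not shown in this diff, and `rel_err_large_k` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed RSE factors; the diff only refers to them by name.
const double kHipRseFactor = std::sqrt(std::log(2.0));               // about 0.8326
const double kNonHipRseFactor = std::sqrt(3.0 * std::log(2.0) - 1.0); // about 1.0390

// Closed-form relative error for lgConfigK > 12.
// The sign is negative for the upper bound so that estimate / (1 + relErr) grows.
double rel_err_large_k(bool upper_bound, bool unioned, uint8_t lg_config_k, uint8_t num_std_dev) {
  assert(lg_config_k > 12 && lg_config_k <= 21);
  const double rse_factor = unioned ? kNonHipRseFactor : kHipRseFactor;
  const uint32_t config_k = 1u << lg_config_k;
  return (upper_bound ? -1 : 1) * (num_std_dev * rse_factor)
         / std::sqrt(static_cast<double>(config_k));
}

// How the bounds in the diff consume the value: estimate / (1 + relErr).
double bound_from_estimate(double estimate, double rel_err) {
  return estimate / (1.0 + rel_err);
}
```

With `lgConfigK = 14` and one standard deviation, the HIP relative error is about 0.8326 / 128, roughly 0.0065, so both bounds sit within about 0.65% of the estimate under these assumed constants.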
@@ -23,12 +23,18 @@
 namespace datasketches {

 template<typename A>
-class coupon_iterator
+class coupon_iterator {
 public:
+  using iterator_category = std::input_iterator_tag;
+  using value_type = uint32_t;
+  using difference_type = void;
+  using pointer = uint32_t*;
+  using reference = uint32_t;
+
   coupon_iterator(const uint32_t* array, size_t array_slze, size_t index, bool all);
   coupon_iterator& operator++();
   bool operator!=(const coupon_iterator& other) const;
-
+  reference operator*() const;
 private:
   const uint32_t* array_;
   size_t array_size_;
@@ -23,8 +23,9 @@
 #include "common_defs.hpp"
 #include "HllUtil.hpp"

-#include <memory>
 #include <iostream>
+#include <memory>
+#include <string>
 #include <vector>

 namespace datasketches {
@@ -144,7 +145,7 @@ class hll_sketch_alloc final {

     /**
      * Reconstructs a sketch from a serialized image in a byte array.
-     * @param
+     * @param bytes An input array with a binary image of a sketch
      * @param len Length of the input array, in bytes
      */
     static hll_sketch_alloc deserialize(const void* bytes, size_t len, const A& allocator = A());
@@ -197,7 +198,7 @@ class hll_sketch_alloc final {
      * Human readable summary with optional detail
      * @param summary if true, output the sketch summary
      * @param detail if true, output the internal data array
-     * @param
+     * @param aux_detail if true, output the internal Aux array, if it exists.
      * @param all if true, outputs all entries including empty ones
      * @return human readable string with optional detail.
      */
@@ -358,7 +359,7 @@ class hll_sketch_alloc final {
      * value can be exceeded in extremely rare cases. If exceeded, it
      * will be larger by only a few percent.
      *
-     * @param
+     * @param lg_k The Log2 of K for the target HLL sketch. This value must be
      * between 4 and 21 inclusively.
      * @param tgt_type the desired Hll type
      * @return the maximum size in bytes that this sketch can grow to.
@@ -495,20 +496,20 @@ class hll_union_alloc {
     /**
      * Returns the result of this union operator with the specified
      * #tgt_hll_type.
-     * @param The tgt_hll_type enum value of the desired result (Default: HLL_4)
+     * @param tgt_type The tgt_hll_type enum value of the desired result (Default: HLL_4)
      * @return The result of this union with the specified tgt_hll_type
      */
     hll_sketch_alloc<A> get_result(target_hll_type tgt_type = HLL_4) const;

     /**
      * Update this union operator with the given sketch.
-     * @param The given sketch.
+     * @param sketch The given sketch.
      */
     void update(const hll_sketch_alloc<A>& sketch);

     /**
      * Update this union operator with the given temporary sketch.
-     * @param The given sketch.
+     * @param sketch The given sketch.
      */
     void update(hll_sketch_alloc<A>&& sketch);

@@ -608,7 +609,7 @@ class hll_union_alloc {
      * perform the union. This may involve swapping, down-sampling, transforming, and / or
      * copying one of the arguments and may completely replace the internals of the union.
      *
-     * @param
+     * @param sketch the given incoming sketch, which may not be modified.
      * @param lg_max_k the maximum value of log2 K for this union.
      */
     inline void union_impl(const hll_sketch_alloc<A>& sketch, uint8_t lg_max_k);
@@ -53,11 +53,16 @@ static void basicUnion(uint64_t n1, uint64_t n2,
   v += n2;

   hll_union u(lgMaxK);
-  u.update(
+  u.update(h1);
   u.update(h2);

   hll_sketch result = u.get_result(resultType);

+  // ensure we check a direct union estimate, without first caling get_result()
+  u.reset();
+  u.update(std::move(h1));
+  u.update(h2);
+
   // force non-HIP estimates to avoid issues with in- vs out-of-order
   double uEst = result.get_composite_estimate();
   double uUb = result.get_upper_bound(2);
@@ -74,6 +79,7 @@ static void basicUnion(uint64_t n1, uint64_t n2,
   REQUIRE((uEst - uLb) >= 0.0);

   REQUIRE(controlEst == uEst);
+  REQUIRE(controlEst == u.get_composite_estimate());
 }

 /**
@@ -586,16 +586,21 @@ class kll_sketch {
 };

 template<typename T, typename C, typename A>
-class kll_sketch<T, C, A>::const_iterator
+class kll_sketch<T, C, A>::const_iterator {
 public:
+  using iterator_category = std::input_iterator_tag;
   using value_type = std::pair<const T&, const uint64_t>;
+  using difference_type = void;
+  using pointer = const return_value_holder<value_type>;
+  using reference = const value_type;
+
   friend class kll_sketch<T, C, A>;
   const_iterator& operator++();
   const_iterator& operator++(int);
   bool operator==(const const_iterator& other) const;
   bool operator!=(const const_iterator& other) const;
-
-
+  reference operator*() const;
+  pointer operator->() const;
 private:
   const T* items;
   const uint32_t* levels;
@@ -1105,12 +1105,12 @@ bool kll_sketch<T, C, A>::const_iterator::operator!=(const const_iterator& other
|
|
|
1105
1105
|
}
|
|
1106
1106
|
|
|
1107
1107
|
template<typename T, typename C, typename A>
|
|
1108
|
-
auto kll_sketch<T, C, A>::const_iterator::operator*() const ->
|
|
1108
|
+
auto kll_sketch<T, C, A>::const_iterator::operator*() const -> reference {
|
|
1109
1109
|
return value_type(items[index], weight);
|
|
1110
1110
|
}
|
|
1111
1111
|
|
|
1112
1112
|
template<typename T, typename C, typename A>
|
|
1113
|
-
auto kll_sketch<T, C, A>::const_iterator::operator->() const ->
|
|
1113
|
+
auto kll_sketch<T, C, A>::const_iterator::operator->() const -> pointer {
|
|
1114
1114
|
return **this;
|
|
1115
1115
|
}
|
|
1116
1116
|
|
|
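The hunks above give `kll_sketch::const_iterator` the five member types that `std::iterator_traits` looks for and spell out the return types of `operator*` and `operator->`. A minimal standalone sketch of the same pattern follows. The names `pair_iterator` and `weighted_sum` are hypothetical and do not come from the library, and `pointer` is simplified to `void` because `return_value_holder` is not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <type_traits>
#include <utility>

// Hypothetical input iterator over an int array that yields (item, weight)
// pairs by value, mirroring the shape of the typedefs added in the diff.
class pair_iterator {
public:
  using iterator_category = std::input_iterator_tag;
  using value_type = std::pair<const int&, const uint64_t>;
  using difference_type = void;
  using pointer = void;  // assumption: the real code uses a holder type here
  using reference = const value_type;

  pair_iterator(const int* items, uint64_t weight): items_(items), weight_(weight) {}
  pair_iterator& operator++() { ++items_; return *this; }
  bool operator==(const pair_iterator& other) const { return items_ == other.items_; }
  bool operator!=(const pair_iterator& other) const { return !(*this == other); }
  // Returns a temporary pair, so `reference` is a value type, not a real reference.
  reference operator*() const { return value_type(*items_, weight_); }

private:
  const int* items_;
  uint64_t weight_;
};

// With the member types present, std::iterator_traits can describe the iterator.
static_assert(
  std::is_same<std::iterator_traits<pair_iterator>::iterator_category,
               std::input_iterator_tag>::value,
  "iterator_traits should pick up the member typedefs");

// Sums item * weight over [begin, end) using the iterator above.
uint64_t weighted_sum(const int* begin, const int* end, uint64_t weight) {
  assert(begin <= end);
  uint64_t total = 0;
  for (pair_iterator it(begin, weight), last(end, weight); it != last; ++it) {
    const auto entry = *it;
    total += static_cast<uint64_t>(entry.first) * entry.second;
  }
  return total;
}
```

Declaring the typedefs directly in the class avoids relying on the `std::iterator` base class, which is deprecated since C++17. Whether that was the motivation for this change is an assumption, since the removed base-class text is truncated in the diff.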
@@ -242,7 +242,7 @@ TEST_CASE("kll sketch", "[kll_sketch]") {
|
|
|
242
242
|
FAIL("checking rank vs CDF for value " + std::to_string(i));
|
|
243
243
|
}
|
|
244
244
|
subtotal_pmf += pmf[i];
|
|
245
|
-
if (abs(ranks[i] - subtotal_pmf) > NUMERIC_NOISE_TOLERANCE) {
|
|
245
|
+
if (std::abs(ranks[i] - subtotal_pmf) > NUMERIC_NOISE_TOLERANCE) {
|
|
246
246
|
FAIL("CDF vs PMF for value " + std::to_string(i));
|
|
247
247
|
}
|
|
248
248
|
}
|
|
@@ -257,7 +257,7 @@ TEST_CASE("kll sketch", "[kll_sketch]") {
|
|
|
257
257
|
FAIL("checking rank vs CDF for value " + std::to_string(i));
|
|
258
258
|
}
|
|
259
259
|
subtotal_pmf += pmf[i];
|
|
260
|
-
if (abs(ranks[i] - subtotal_pmf) > NUMERIC_NOISE_TOLERANCE) {
|
|
260
|
+
if (std::abs(ranks[i] - subtotal_pmf) > NUMERIC_NOISE_TOLERANCE) {
|
|
261
261
|
FAIL("CDF vs PMF for value " + std::to_string(i));
|
|
262
262
|
}
|
|
263
263
|
}
|
|
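The two test hunks above replace unqualified `abs` with `std::abs`. The diff does not state the reason. One plausible explanation, offered here as an assumption, is that an unqualified `abs` can resolve to the integer-only C function on some toolchains, which would truncate a small floating-point difference to zero. The sketch below illustrates that effect with hypothetical helper names and a made-up tolerance value (`NUMERIC_NOISE_TOLERANCE` is not defined in this excerpt).

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Hypothetical tolerance, chosen only for illustration.
constexpr double kTolerance = 1e-6;

// Comparison using the double overload of std::abs from <cmath>.
bool differs_beyond_tolerance(double rank, double subtotal) {
  assert(!std::isnan(rank) && !std::isnan(subtotal));
  return std::abs(rank - subtotal) > kTolerance;
}

// Simulates an integer-only abs being selected: the difference is converted
// to int first, so any difference smaller than 1 in magnitude becomes 0.
bool differs_beyond_tolerance_int_abs(double rank, double subtotal) {
  return std::abs(static_cast<int>(rank - subtotal)) > kTolerance;
}
```

For a rank of 0.75 against a subtotal of 0.25, the first helper reports a mismatch while the second does not, which is how a truncating `abs` could let a failing CDF-vs-PMF check pass silently.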
@@ -42,9 +42,12 @@ target_link_libraries(python
|
|
|
42
42
|
cpc
|
|
43
43
|
fi
|
|
44
44
|
theta
|
|
45
|
+
tuple
|
|
45
46
|
sampling
|
|
46
47
|
req
|
|
47
48
|
quantiles
|
|
49
|
+
count
|
|
50
|
+
density
|
|
48
51
|
pybind11::module
|
|
49
52
|
)
|
|
50
53
|
|
|
@@ -72,10 +75,13 @@ target_sources(python
|
|
|
72
75
|
src/cpc_wrapper.cpp
|
|
73
76
|
src/fi_wrapper.cpp
|
|
74
77
|
src/theta_wrapper.cpp
|
|
78
|
+
src/tuple_wrapper.cpp
|
|
75
79
|
src/vo_wrapper.cpp
|
|
76
80
|
src/req_wrapper.cpp
|
|
77
81
|
src/quantiles_wrapper.cpp
|
|
82
|
+
src/density_wrapper.cpp
|
|
78
83
|
src/ks_wrapper.cpp
|
|
84
|
+
src/count_wrapper.cpp
|
|
79
85
|
src/vector_of_kll.cpp
|
|
80
86
|
src/py_serde.cpp
|
|
81
87
|
)
|
|
@@ -12,15 +12,15 @@ This package provides a variety of sketches as described below. Wherever a speci
|
|
|
12
12
|
|
|
13
13
|
## Building and Installation
|
|
14
14
|
|
|
15
|
-
Once cloned, the library can be installed by running `python3 -m pip install .` in the project root directory -- not the python subdirectory -- which will also install the necessary dependencies, namely
|
|
15
|
+
Once cloned, the library can be installed by running `python3 -m pip install .` in the project root directory -- not the python subdirectory -- which will also install the necessary dependencies, namely NumPy and [pybind11[global]](https://github.com/pybind/pybind11).
|
|
16
16
|
|
|
17
|
-
If you prefer to call the `setup.py` build script directly, which is
|
|
17
|
+
If you prefer to call the `setup.py` build script directly, which is discouraged, you must first install `pybind11[global]`, as well as any other dependencies listed under the build-system section in `pyproject.toml`.
|
|
18
18
|
|
|
19
19
|
The library is also available from PyPI via `python3 -m pip install datasketches`.
|
|
20
20
|
|
|
21
21
|
## Usage
|
|
22
22
|
|
|
23
|
-
Having installed the library, loading the Apache
|
|
23
|
+
Having installed the library, loading the Apache DataSketches Library in Python is simple: `import datasketches`.
|
|
24
24
|
|
|
25
25
|
The unit tests are mostly structured in a tutorial style and can be used as a reference example for how to feed data into and query the different types of sketches.
|
|
26
26
|
|
|
@@ -76,10 +76,10 @@ The only developer-specific instructions relate to running unit tests.
|
|
|
76
76
|
|
|
77
77
|
### Unit tests
|
|
78
78
|
|
|
79
|
-
The Python unit tests are run via `tox`, with no arguments, from the project root directory -- not the python subdirectory. Tox creates a temporary virtual environment in which to build and run the unit tests. In the event you are missing the necessary
|
|
79
|
+
The Python unit tests are run via `tox`, with no arguments, from the project root directory -- not the python subdirectory. Tox creates a temporary virtual environment in which to build and run the unit tests. In the event you are missing the necessary package, tox may be installed with `python3 -m pip install --upgrade tox`.
|
|
80
80
|
|
|
81
81
|
## License
|
|
82
82
|
|
|
83
|
-
The Apache DataSketches Library is
|
|
83
|
+
The Apache DataSketches Library is distributed under the Apache 2.0 License.
|
|
84
84
|
|
|
85
85
|
There may be precompiled binaries provided as a convenience and distributed through PyPI via [https://pypi.org/project/datasketches/] contain compiled code from [pybind11](https://github.com/pybind/pybind11), which is distributed under a BSD license.
|