d_heap 0.6.1 → 0.7.0
- checksums.yaml +4 -4
- data/.clang-format +21 -0
- data/.github/workflows/main.yml +16 -1
- data/.rubocop.yml +1 -0
- data/CHANGELOG.md +17 -0
- data/{N → D} +1 -1
- data/README.md +313 -261
- data/d_heap.gemspec +16 -5
- data/docs/benchmarks-2.txt +79 -61
- data/docs/benchmarks.txt +587 -416
- data/docs/profile.txt +99 -133
- data/ext/d_heap/.rubocop.yml +7 -0
- data/ext/d_heap/d_heap.c +575 -424
- data/ext/d_heap/extconf.rb +34 -3
- data/images/push_n.png +0 -0
- data/images/push_n_pop_n.png +0 -0
- data/images/push_pop.png +0 -0
- data/lib/d_heap.rb +25 -1
- data/lib/d_heap/version.rb +1 -1
- metadata +6 -30
- data/.rspec +0 -3
- data/.travis.yml +0 -6
- data/Gemfile +0 -20
- data/Gemfile.lock +0 -83
- data/Rakefile +0 -20
- data/benchmarks/perf.rb +0 -29
- data/benchmarks/push_n.yml +0 -35
- data/benchmarks/push_n_pop_n.yml +0 -52
- data/benchmarks/push_pop.yml +0 -32
- data/benchmarks/stackprof.rb +0 -31
- data/bin/bench_charts +0 -13
- data/bin/bench_n +0 -7
- data/bin/benchmark-driver +0 -29
- data/bin/benchmarks +0 -10
- data/bin/console +0 -15
- data/bin/profile +0 -10
- data/bin/rake +0 -29
- data/bin/rspec +0 -29
- data/bin/rubocop +0 -29
- data/bin/setup +0 -8
- data/lib/benchmark_driver/runner/ips_zero_fail.rb +0 -158
- data/lib/d_heap/benchmarks.rb +0 -112
- data/lib/d_heap/benchmarks/benchmarker.rb +0 -116
- data/lib/d_heap/benchmarks/implementations.rb +0 -224
- data/lib/d_heap/benchmarks/profiler.rb +0 -71
- data/lib/d_heap/benchmarks/rspec_matchers.rb +0 -352
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 5b51ed52baf74b585a7ab7799f92a446aef5852431ba10e146658b419657ffbe
|
4
|
+
data.tar.gz: cc7c6786eee78ec13214582b8701448d312f59fb723d12676fb673447ab409a7
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 5de98f8c9084b30694fff5f8154a6e42e7e67d76518c25136ab4fb0c0afb047ad3c923f4544dcf613ded4c3b01417729aa796c973100faaa7ee93051fa630c7d
|
7
|
+
data.tar.gz: e5dbcc90da7adfba7ef45cd9a2da5fd1781a2bd489002a5ffc0a764915c035c178db30ae9b8431a8fc810cfa6f03a1b38ec0a50cbf23c2e1ba5dfc36549c0609
|
data/.clang-format
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
---
|
2
|
+
BasedOnStyle: mozilla
|
3
|
+
IndentWidth: 4
|
4
|
+
PointerAlignment: Right
|
5
|
+
AlignAfterOpenBracket: Align
|
6
|
+
AlignConsecutiveAssignments: true
|
7
|
+
AlignConsecutiveDeclarations: true
|
8
|
+
AlignConsecutiveBitFields: true
|
9
|
+
AlignConsecutiveMacros: true
|
10
|
+
AlignEscapedNewlines: Right
|
11
|
+
AlignOperands: true
|
12
|
+
|
13
|
+
AllowAllConstructorInitializersOnNextLine: false
|
14
|
+
AllowShortIfStatementsOnASingleLine: WithoutElse
|
15
|
+
|
16
|
+
IndentCaseLabels: false
|
17
|
+
IndentPPDirectives: AfterHash
|
18
|
+
|
19
|
+
ForEachMacros:
|
20
|
+
- WHILE_PEEK_LT_P
|
21
|
+
...
|
data/.github/workflows/main.yml
CHANGED
@@ -23,4 +23,19 @@ jobs:
|
|
23
23
|
run: |
|
24
24
|
gem install bundler -v 2.2.3
|
25
25
|
bundle install
|
26
|
-
bundle exec rake
|
26
|
+
bundle exec rake ci
|
27
|
+
|
28
|
+
benchmarks:
|
29
|
+
runs-on: ubuntu-latest
|
30
|
+
steps:
|
31
|
+
- uses: actions/checkout@v2
|
32
|
+
- name: Set up Ruby
|
33
|
+
uses: ruby/setup-ruby@v1
|
34
|
+
with:
|
35
|
+
ruby-version: 2.7
|
36
|
+
bundler-cache: true
|
37
|
+
- name: Run the benchmarks
|
38
|
+
run: |
|
39
|
+
gem install bundler -v 2.2.3
|
40
|
+
bundle install
|
41
|
+
bundle exec rake ci:benchmarks
|
data/.rubocop.yml
CHANGED
@@ -135,6 +135,7 @@ Style/ClassAndModuleChildren: { Enabled: false }
|
|
135
135
|
Style/EachWithObject: { Enabled: false }
|
136
136
|
Style/FormatStringToken: { Enabled: false }
|
137
137
|
Style/FloatDivision: { Enabled: false }
|
138
|
+
Style/GuardClause: { Enabled: false } # usually nice to do, but...
|
138
139
|
Style/IfUnlessModifier: { Enabled: false }
|
139
140
|
Style/IfWithSemicolon: { Enabled: false }
|
140
141
|
Style/Lambda: { Enabled: false }
|
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,22 @@
|
|
1
1
|
## Current/Unreleased
|
2
2
|
|
3
|
+
## Release v0.7.0 (2021-01-24)
|
4
|
+
|
5
|
+
* 💥⚡️ **BREAKING**: Uses `double` for _all_ scores.
|
6
|
+
* 💥 Integers larger than a double mantissa (53-bits) will lose some
|
7
|
+
precision.
|
8
|
+
* ⚡️ big speed up
|
9
|
+
* ⚡️ Much better memory usage
|
10
|
+
* ⚡️ Simplifies score conversion between ruby and C
|
11
|
+
* ✨ Added `DHeap::Map` for ensuring values can only be added once, by `#hash`.
|
12
|
+
* Adding again will update the score.
|
13
|
+
* Adds `DHeap::Map#[]` for quick lookup of existing scores
|
14
|
+
* Adds `DHeap::Map#[]=` for adjustments of existing scores
|
15
|
+
* TODO: `DHeap::Map#delete`
|
16
|
+
* 📝📈 SO MANY BENCHMARKS
|
17
|
+
* ⚡️ Set DEFAULT_D to 6, based on benchmarks.
|
18
|
+
* 🐛♻️ convert all `long` indexes to `size_t`
|
19
|
+
|
3
20
|
## Release v0.6.1 (2021-01-24)
|
4
21
|
|
5
22
|
* 📝 Fix link to CHANGELOG.md in gemspec
|
data/{N → D}
RENAMED
data/README.md
CHANGED
@@ -7,6 +7,13 @@
|
|
7
7
|
A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
|
8
8
|
implemented as a C extension.
|
9
9
|
|
10
|
+
A regular queue has "FIFO" behavior: first in, first out. A stack is "LIFO":
|
11
|
+
last in first out. A priority queue pushes each element with a score and pops
|
12
|
+
out in order by score. Priority queues are often used in algorithms for e.g.
|
13
|
+
[scheduling] of timers or bandwidth management, for [Huffman coding], and for
|
14
|
+
various graph search algorithms such as [Dijkstra's algorithm], [A* search], or
|
15
|
+
[Prim's algorithm].
|
16
|
+
|
10
17
|
From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
|
11
18
|
> A heap is a specialized tree-based data structure which is essentially an
|
12
19
|
> almost complete tree that satisfies the heap property: in a min heap, for any
|
@@ -16,26 +23,17 @@ From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
|
|
16
23
|
|
17
24
|

|
18
25
|
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
have better memory cache behavior than binary heaps, allowing them to run more
|
31
|
-
quickly in practice despite slower worst-case time complexity. In the worst
|
32
|
-
case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
|
33
|
-
the tradeoff that pop requires `O(d log n / log d)`.
|
34
|
-
|
35
|
-
Although you should probably just use the default _d_ value of `4` (see the
|
36
|
-
analysis below), it's always advisable to benchmark your specific use-case. In
|
37
|
-
particular, if you push items more than you pop, higher values for _d_ can give
|
38
|
-
a faster total runtime.
|
26
|
+
The _d_-ary heap data structure is a generalization of a [binary heap] in which
|
27
|
+
each node has _d_ children instead of 2. This speeds up "push" or "decrease
|
28
|
+
priority" operations (`O(log n / log d)`) with the tradeoff of slower "pop" or
|
29
|
+
"increase priority" (`O(d log n / log d)`). Additionally, _d_-ary heaps can
|
30
|
+
have better memory cache behavior than binary heaps, letting them run more
|
31
|
+
quickly in practice.
|
32
|
+
|
33
|
+
Although the default _d_ value will usually perform best (see the time
|
34
|
+
complexity analysis below), it's always advisable to benchmark your specific
|
35
|
+
use-case. In particular, if you push items more than you pop, higher values for
|
36
|
+
_d_ can give a faster total runtime.
|
39
37
|
|
40
38
|
[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
|
41
39
|
[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
|
@@ -46,41 +44,39 @@ a faster total runtime.
|
|
46
44
|
[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
|
47
45
|
[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
|
48
46
|
|
47
|
+
## Installation
|
48
|
+
|
49
|
+
Add this line to your application's Gemfile:
|
50
|
+
|
51
|
+
```ruby
|
52
|
+
gem 'd_heap'
|
53
|
+
```
|
54
|
+
|
55
|
+
And then execute:
|
56
|
+
|
57
|
+
$ bundle install
|
58
|
+
|
59
|
+
Or install it yourself as:
|
60
|
+
|
61
|
+
$ gem install d_heap
|
62
|
+
|
49
63
|
## Usage
|
50
64
|
|
51
|
-
The basic API is `#push(object, score)` and `#pop`. Please read the
|
52
|
-
|
65
|
+
The basic API is `#push(object, score)` and `#pop`. Please read the [full
|
66
|
+
documentation] for more details. The score must be convertible to a `Float` via
|
67
|
+
`Float(score)` (i.e. it should properly implement `#to_f`).
|
53
68
|
|
54
|
-
Quick reference for
|
69
|
+
Quick reference for the most common methods:
|
55
70
|
|
56
|
-
* `heap << object` adds a value,
|
71
|
+
* `heap << object` adds a value, using `Float(object)` as its intrinsic score.
|
57
72
|
* `heap.push(object, score)` adds a value with an extrinsic score.
|
58
|
-
* `heap.pop` removes and returns the value with the minimum score.
|
59
|
-
* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
|
60
73
|
* `heap.peek` to view the minimum value without popping it.
|
74
|
+
* `heap.pop` removes and returns the value with the minimum score.
|
75
|
+
* `heap.pop_below(max_score)` pops only if the next score is `<` the argument.
|
61
76
|
* `heap.clear` to remove all items from the heap.
|
62
77
|
* `heap.empty?` returns true if the heap is empty.
|
63
78
|
* `heap.size` returns the number of items in the heap.
|
64
79
|
|
65
|
-
If the score changes while the object is still in the heap, it will not be
|
66
|
-
re-evaluated again.
|
67
|
-
|
68
|
-
The score must either be `Integer` or `Float` or convertable to a `Float` via
|
69
|
-
`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
|
70
|
-
numeric values gives more than 50% speedup under some benchmarks! _n.b._
|
71
|
-
`Integer` _scores must have an absolute value that fits into_ `unsigned long
|
72
|
-
long`. This is compiler and architecture dependant but with gcc on an IA-64
|
73
|
-
system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
|
74
|
-
+18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
|
75
|
-
in nanoseconds.
|
76
|
-
|
77
|
-
_Comparing arbitary objects via_ `a <=> b` _was the original design and may be
|
78
|
-
added back in a future version,_ if (and only if) _it can be done without
|
79
|
-
impacting the speed of numeric comparisons. The speedup from this constraint is
|
80
|
-
huge!_
|
81
|
-
|
82
|
-
[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
|
83
|
-
|
84
80
|
### Examples
|
85
81
|
|
86
82
|
```ruby
|
@@ -128,251 +124,272 @@ heap.size # => 0
|
|
128
124
|
heap.pop # => nil
|
129
125
|
```
|
130
126
|
|
131
|
-
Please see the [
|
127
|
+
Please see the [full documentation] for more methods and more examples.
|
132
128
|
|
133
|
-
|
129
|
+
[full documentation]: https://rubydoc.info/gems/d_heap/DHeap
|
134
130
|
|
135
|
-
|
131
|
+
### DHeap::Map
|
136
132
|
|
137
|
-
|
138
|
-
|
139
|
-
|
133
|
+
`DHeap::Map` augments the heap with an internal `Hash`, mapping objects to their
|
134
|
+
index in the heap. For simple push/pop this is a bit slower than a normal `DHeap`
|
135
|
+
heap, but it can enable huge speed-ups for algorithms that need to adjust scores
|
136
|
+
after they've been added, e.g. [Dijkstra's algorithm]. It adds the following:
|
140
137
|
|
141
|
-
|
142
|
-
|
143
|
-
|
144
|
-
|
145
|
-
|
138
|
+
* a uniqueness constraint, by `#hash` value
|
139
|
+
* `#[obj] # => score` or `#score(obj)` in `O(1)`
|
140
|
+
* `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
|
141
|
+
* TODO:
|
142
|
+
* optionally unique by object identity
|
143
|
+
* `#delete(obj)` in `O(d log n / log d)` (TODO)
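The idea behind a heap augmented with a `Hash` index can be sketched in pure Ruby. This is only an illustration under assumptions: the class name `SketchHeapMap` and all of its internals are hypothetical and are not the gem's C implementation. It shows how tracking each object's position makes score lookup `O(1)` and lets a rescore sift from the object's current position instead of searching for it.

```ruby
# Hypothetical sketch of a d-ary min-heap with a Hash index; not the gem's code.
class SketchHeapMap
  def initialize(d = 4)
    @d = d
    @entries = []  # [score, object] pairs in heap order
    @index_of = {} # object => position in @entries (keyed by #hash and #eql?)
  end

  def size
    @entries.size
  end

  # O(1) lookup of an existing score (nil when the object is absent).
  def score(object)
    index = @index_of[object]
    index && @entries[index][0]
  end
  alias_method :[], :score

  # Adds the object, or updates its score when it is already present.
  def push(object, score)
    score = Float(score)
    index = @index_of[object]
    if index
      old = @entries[index][0]
      @entries[index][0] = score
      score < old ? sift_up(index) : sift_down(index)
    else
      @entries << [score, object]
      @index_of[object] = @entries.size - 1
      sift_up(@entries.size - 1)
    end
    self
  end
  alias_method :rescore, :push

  def []=(object, score)
    push(object, score)
  end

  # Removes and returns the object with the minimum score (nil when empty).
  def pop
    return nil if @entries.empty?
    top = @entries.first
    last = @entries.pop
    @index_of.delete(top[1])
    unless @entries.empty?
      place(0, last)
      sift_down(0)
    end
    top[1]
  end

  private

  # Every move updates the index so lookups stay consistent.
  def place(index, entry)
    @entries[index] = entry
    @index_of[entry[1]] = index
  end

  # One comparison per layer.
  def sift_up(index)
    entry = @entries[index]
    while index > 0
      parent = (index - 1) / @d
      break if @entries[parent][0] <= entry[0]
      place(index, @entries[parent])
      index = parent
    end
    place(index, entry)
  end

  # Up to d comparisons per layer to find the smallest child.
  def sift_down(index)
    entry = @entries[index]
    loop do
      first = index * @d + 1
      break if first >= @entries.size
      last = [first + @d - 1, @entries.size - 1].min
      child = (first..last).min_by { |i| @entries[i][0] }
      break if entry[0] <= @entries[child][0]
      place(index, @entries[child])
      index = child
    end
    place(index, entry)
  end
end
```

In this sketch, pushing an object that is already present behaves like a rescore, which mirrors the "adding again will update the score" behavior described for `DHeap::Map`.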
|
146
144
|
|
147
|
-
|
145
|
+
## Scores
|
148
146
|
|
149
|
-
|
147
|
+
If a score changes while the object is still in the heap, it will not be
|
148
|
+
re-evaluated.
|
150
149
|
|
151
|
-
|
152
|
-
|
153
|
-
|
154
|
-
|
155
|
-
the
|
150
|
+
Constraining scores to `Float` gives enormous performance benefits. n.b.
|
151
|
+
very large `Integer` values will lose precision when converted to `Float`. This
|
152
|
+
is compiler and architecture dependent but with gcc on an IA-64 system, `Float`
|
153
|
+
is 64 bits with a 53-bit mantissa, which gives a range of -9,007,199,254,740,991
|
154
|
+
to +9,007,199,254,740,991, which is _not_ enough to store the precise POSIX
|
155
|
+
time since the epoch in nanoseconds. This can be worked around by adding a
|
156
|
+
bias, but probably it's good enough for most usage.
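The precision limit and the bias workaround can be checked directly in Ruby. This is a small sketch: the helper `float_exact?`, the sample timestamp, and the bias value are illustrative assumptions, not part of the gem.

```ruby
# True when the Integer survives a round trip through Float unchanged.
def float_exact?(integer)
  Float(integer).to_i == integer
end

LIMIT = 2**53 # 9_007_199_254_740_992

puts float_exact?(LIMIT - 1) # => true: still within the 53-bit mantissa
puts float_exact?(LIMIT + 1) # => false: rounds to a neighboring value

# Hypothetical nanosecond timestamp shortly after 2021-01-24 00:00:00 UTC.
NANOS = 1_611_446_400_123_456_789
puts float_exact?(NANOS) # => false: too large to be stored exactly

# Subtracting a fixed bias keeps the scores small enough to stay exact.
BIAS = 1_611_446_400_000_000_000
puts float_exact?(NANOS - BIAS) # => true
```

The bias only needs to be a fixed reference point close to the scores in use, so that the differences fit comfortably within 53 bits.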
|
156
157
|
|
157
|
-
|
158
|
-
|
159
|
-
|
160
|
-
makes the tree shorter but broader, which reduces to `O(log n / log d)` while
|
161
|
-
increasing the comparisons needed by sift-down to `O(d log n/ log d)`.
|
158
|
+
_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
|
159
|
+
added back in a future version,_ if (and only if) _it can be done without
|
160
|
+
impacting the speed of numeric comparisons._
|
162
161
|
|
163
|
-
|
164
|
-
slowly than the naive approach—even for heaps containing ten thousand items.
|
165
|
-
Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
|
166
|
-
has _much_ higher overhead. And a _d_-heap needs `d + 1` times more comparisons
|
167
|
-
for each push + pop than `bsearch` + `insert`.
|
162
|
+
## Thread safety
|
168
163
|
|
169
|
-
|
170
|
-
|
171
|
-
heap instead of the traditional binary heap.
|
164
|
+
`DHeap` is _not_ thread-safe, so concurrent access from multiple threads needs to
|
165
|
+
take precautions such as locking access behind a mutex.
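One way to lock access behind a mutex is a small wrapper object. The sketch below assumes only that the wrapped queue responds to `#push(value, score)`, `#pop`, and `#size`; the class name `SynchronizedQueue` is hypothetical and is not provided by the gem.

```ruby
# Hypothetical wrapper: serializes access to any queue that responds to
# #push(value, score), #pop, and #size.
class SynchronizedQueue
  def initialize(queue)
    @queue = queue
    @mutex = Mutex.new
  end

  def push(value, score)
    @mutex.synchronize { @queue.push(value, score) }
    self
  end

  def pop
    @mutex.synchronize { @queue.pop }
  end

  def size
    @mutex.synchronize { @queue.size }
  end
end
```

With the gem loaded, this would presumably be used as `SynchronizedQueue.new(DHeap.new)`. Compound operations, such as checking `size` and then calling `pop`, would still need their own locking.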
|
172
166
|
|
173
167
|
## Benchmarks
|
174
168
|
|
175
|
-
_See
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
183
|
-
`
|
184
|
-
|
185
|
-
|
186
|
-
|
187
|
-
|
188
|
-
|
189
|
-
|
190
|
-
|
169
|
+
_See full benchmark output in subdirs of `benchmarks`. See also or updated
|
170
|
+
results. These benchmarks were measured with an Intel Core i7-1065G7 8x3.9GHz
|
171
|
+
with d_heap v0.5.0 and ruby 2.7.2 without MJIT enabled._
|
172
|
+
|
173
|
+
### Implementations
|
174
|
+
|
175
|
+
* **findmin** -
|
176
|
+
A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
|
177
|
+
very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
|
178
|
+
`Array#delete_at(min_index)`. Push + pop is still fast for `n < 100`, but
|
179
|
+
unusably slow for `n > 1000`.
|
180
|
+
|
181
|
+
* **bsearch** -
|
182
|
+
A simple implementation with a slow `O(n)` push using `Array#bsearch` +
|
183
|
+
`Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
|
184
|
+
`Array#pop`. It is still relatively fast for `n < 10000`, but its linear
|
185
|
+
time complexity really destroys it after that.
|
186
|
+
|
187
|
+
* **rb_heap** -
|
188
|
+
A pure ruby binary min-heap that has been tuned for performance by making
|
189
|
+
few method calls and allocating and assigning as few variables as possible.
|
190
|
+
It runs in `O(log n)` for both push and pop, although pop is slower than
|
191
|
+
push by a constant factor. Its much higher constant factors make it lose
|
192
|
+
to `bsearch` push + pop for `n < 10000` but it holds steady with very little
|
193
|
+
slowdown even with `n > 10000000`.
|
194
|
+
|
195
|
+
* **c++ stl** -
|
196
|
+
A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
|
197
|
+
priority_queue]. The wrapper is simply to provide compatibility with the
|
198
|
+
other benchmarked implementations, but it should be possible to speed this
|
199
|
+
up a little bit by benchmarking the `priority_queue_cxx` API directly. It
|
200
|
+
has the same time complexity as rb_heap but its much lower constant
|
201
|
+
factors allow it to easily outperform `bsearch`.
|
202
|
+
|
203
|
+
* **c_dheap** -
|
204
|
+
A {DHeap} instance with the default `d` value of `4`. It has the same time
|
205
|
+
complexity as `rb_heap` and `c++ stl`, but is faster than both in every
|
206
|
+
benchmarked scenario.
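The **findmin** baseline from the list above can be sketched in a few lines. This is an illustration that assumes intrinsic scores (the pushed values are compared directly); the class name `FindMinQueue` is hypothetical and this is not the benchmark suite's actual code.

```ruby
# Hypothetical sketch of the "findmin" approach: O(1) push, O(n) pop.
class FindMinQueue
  def initialize
    @items = []
  end

  # Appends to an unsorted Array.
  def push(score)
    @items.push(score)
    self
  end

  # Scans for the minimum, then removes its last occurrence.
  def pop
    min = @items.min
    return nil if min.nil?
    @items.delete_at(@items.rindex(min))
  end

  def size
    @items.size
  end
end
```

Every pop scans the whole Array, which matches the steep slowdown described for larger `n`.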
|
191
207
|
|
192
208
|
[priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
|
193
209
|
[C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
|
194
210
|
|
195
|
-
|
211
|
+
### Scenarios
|
196
212
|
|
197
|
-
|
213
|
+
Each benchmark increases N exponentially, either by √1̅0̅ or approximating
|
214
|
+
(alternating between x3 and x3.333) in order to simplify keeping loop counts
|
215
|
+
evenly divisible by N.
|
198
216
|
|
199
|
-
|
217
|
+
#### push N items
|
200
218
|
|
201
|
-
|
219
|
+
This measures the _average time per insert_ to create a queue of size N
|
220
|
+
(clearing the queue once it reaches that size). Use cases which push (or
|
221
|
+
decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
|
222
|
+
algorithm] when the graph has more edges than vertices, may want to pay more
|
223
|
+
attention to this benchmark.
|
202
224
|
|
203
|
-
|
225
|
+

|
204
226
|
|
205
|
-
|
206
|
-
|
207
|
-
|
227
|
+
== push N (N=100) ==========================================================
|
228
|
+
push N (c_dheap): 10522662.6 i/s
|
229
|
+
push N (findmin): 9980622.3 i/s - 1.05x slower
|
230
|
+
push N (c++ stl): 7991608.3 i/s - 1.32x slower
|
231
|
+
push N (rb_heap): 4607849.4 i/s - 2.28x slower
|
232
|
+
push N (bsearch): 2769106.2 i/s - 3.80x slower
|
233
|
+
== push N (N=10,000) =======================================================
|
234
|
+
push N (c_dheap): 10444588.3 i/s
|
235
|
+
push N (findmin): 10191797.4 i/s - 1.02x slower
|
236
|
+
push N (c++ stl): 8210895.4 i/s - 1.27x slower
|
237
|
+
push N (rb_heap): 4369252.9 i/s - 2.39x slower
|
238
|
+
push N (bsearch): 1213580.4 i/s - 8.61x slower
|
239
|
+
== push N (N=1,000,000) ====================================================
|
240
|
+
push N (c_dheap): 10342183.7 i/s
|
241
|
+
push N (findmin): 9963898.8 i/s - 1.04x slower
|
242
|
+
push N (c++ stl): 7891924.8 i/s - 1.31x slower
|
243
|
+
push N (rb_heap): 4350116.0 i/s - 2.38x slower
|
244
|
+
|
245
|
+
All three heap implementations have little to no perceptible slowdown for `N >
|
246
|
+
100`. But `DHeap` runs faster than `Array#push` to an unsorted array (findmin)!
|
247
|
+
|
248
|
+
#### push then pop N items
|
249
|
+
|
250
|
+
This measures the _average_ for a push **or** a pop, filling up a queue with N
|
251
|
+
items and then draining that queue until empty. It represents the amortized
|
252
|
+
cost of balanced pushes and pops to fill a heap and drain it.
|
208
253
|
|
209
254
|

|
210
255
|
|
211
|
-
|
212
|
-
|
213
|
-
|
214
|
-
|
215
|
-
|
216
|
-
|
256
|
+
== push N then pop N (N=100) ===============================================
|
257
|
+
push N + pop N (c_dheap): 10954469.2 i/s
|
258
|
+
push N + pop N (c++ stl): 9317140.2 i/s - 1.18x slower
|
259
|
+
push N + pop N (bsearch): 4808770.2 i/s - 2.28x slower
|
260
|
+
push N + pop N (findmin): 4321411.9 i/s - 2.53x slower
|
261
|
+
push N + pop N (rb_heap): 2467417.0 i/s - 4.44x slower
|
262
|
+
== push N then pop N (N=10,000) ============================================
|
263
|
+
push N + pop N (c_dheap): 8083962.7 i/s
|
264
|
+
push N + pop N (c++ stl): 7365661.8 i/s - 1.10x slower
|
265
|
+
push N + pop N (bsearch): 2257047.9 i/s - 3.58x slower
|
266
|
+
push N + pop N (rb_heap): 1439204.3 i/s - 5.62x slower
|
267
|
+
== push N then pop N (N=1,000,000) =========================================
|
268
|
+
push N + pop N (c++ stl): 5274657.5 i/s
|
269
|
+
push N + pop N (c_dheap): 4731117.9 i/s - 1.11x slower
|
270
|
+
push N + pop N (rb_heap): 976688.6 i/s - 5.40x slower
|
271
|
+
|
272
|
+
At N=100 findmin still beats a pure-ruby heap. But above that it slows down too
|
273
|
+
much to be useful. At N=10k, bsearch still beats a pure ruby heap, but above
|
274
|
+
30k it slows down too much to be useful. `DHeap` consistently runs 4.5-5.5x
|
275
|
+
faster than the pure ruby heap.
|
276
|
+
|
277
|
+
#### push & pop on N-item heap
|
278
|
+
|
279
|
+
This measures the combined time to push once and pop once, which is done
|
280
|
+
repeatedly while keeping a stable heap size of N. It's an approximation for
|
281
|
+
scenarios which reach a stable size and then plateau with balanced pushes and
|
282
|
+
pops. E.g. timers and timeouts will often reschedule themselves or replace
|
283
|
+
themselves with new timers or timeouts, maintaining a roughly stable total count
|
284
|
+
of timers.
|
217
285
|
|
218
286
|

|
219
287
|
|
289
|
-
queue size = 5000000: 445234.5 i/s - 1.65x slower
|
290
|
-
queue size = 10000000: 423119.0 i/s - 1.74x slower
|
291
|
-
|
292
|
-
== push + pop (bsearch)
|
293
|
-
queue size = 10000: 786334.2 i/s
|
294
|
-
queue size = 25000: 364963.8 i/s - 2.15x slower
|
295
|
-
queue size = 50000: 200520.6 i/s - 3.92x slower
|
296
|
-
queue size = 100000: 88607.0 i/s - 8.87x slower
|
297
|
-
queue size = 250000: 34530.5 i/s - 22.77x slower
|
298
|
-
queue size = 500000: 17965.4 i/s - 43.77x slower
|
299
|
-
queue size = 1000000: 5638.7 i/s - 139.45x slower
|
300
|
-
queue size = 2500000: 1302.0 i/s - 603.93x slower
|
301
|
-
queue size = 5000000: 592.0 i/s - 1328.25x slower
|
302
|
-
queue size = 10000000: 288.8 i/s - 2722.66x slower
|
303
|
-
|
304
|
-
== push + pop (c_dheap)
|
305
|
-
queue size = 10000: 7311366.6 i/s
|
306
|
-
queue size = 50000: 6737824.5 i/s - 1.09x slower
|
307
|
-
queue size = 25000: 6407340.6 i/s - 1.14x slower
|
308
|
-
queue size = 100000: 6254396.3 i/s - 1.17x slower
|
309
|
-
queue size = 250000: 5917684.5 i/s - 1.24x slower
|
310
|
-
queue size = 500000: 5126307.6 i/s - 1.43x slower
|
311
|
-
queue size = 1000000: 4403494.1 i/s - 1.66x slower
|
312
|
-
queue size = 2500000: 3304088.2 i/s - 2.21x slower
|
313
|
-
queue size = 5000000: 2664897.7 i/s - 2.74x slower
|
314
|
-
queue size = 10000000: 2137927.6 i/s - 3.42x slower
|
315
|
-
|
316
|
-
## Analysis
|
317
|
-
|
318
|
-
### Time complexity
|
319
|
-
|
320
|
-
There are two fundamental heap operations: sift-up (used by push) and sift-down
|
321
|
-
(used by pop).
|
322
|
-
|
323
|
-
* A _d_-ary heap will have `log n / log d` layers, so both sift operations can
|
324
|
-
perform as many as `log n / log d` writes, when a member sifts the entire
|
325
|
-
length of the tree.
|
326
|
-
* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
|
327
|
-
* Sift-down makes d comparions per layer, so pop runs in `O(d log n / log d)`.
|
328
|
-
|
329
|
-
So, in the simplest case of running balanced push/pop while maintaining the same
|
330
|
-
heap size, `(1 + d) log n / log d` comparisons are made. In the worst case,
|
331
|
-
when every sift traverses every layer of the tree, `d=4` requires the fewest
|
332
|
-
comparisons for combined insert and delete:
|
333
|
-
|
334
|
-
* (1 + 2) lg n / lg d ≈ 4.328085 lg n
|
335
|
-
* (1 + 3) lg n / lg d ≈ 3.640957 lg n
|
336
|
-
* (1 + 4) lg n / lg d ≈ 3.606738 lg n
|
337
|
-
* (1 + 5) lg n / lg d ≈ 3.728010 lg n
|
338
|
-
* (1 + 6) lg n / lg d ≈ 3.906774 lg n
|
339
|
-
* (1 + 7) lg n / lg d ≈ 4.111187 lg n
|
340
|
-
* (1 + 8) lg n / lg d ≈ 4.328085 lg n
|
341
|
-
* (1 + 9) lg n / lg d ≈ 4.551196 lg n
|
342
|
-
* (1 + 10) lg n / lg d ≈ 4.777239 lg n
|
288
|
+
push + pop (findmin)
|
289
|
+
N 10: 5480288.0 i/s
|
290
|
+
N 100: 2595178.8 i/s - 2.11x slower
|
291
|
+
N 1000: 224813.9 i/s - 24.38x slower
|
292
|
+
N 10000: 12630.7 i/s - 433.89x slower
|
293
|
+
N 100000: 1097.3 i/s - 4994.31x slower
|
294
|
+
N 1000000: 135.9 i/s - 40313.05x slower
|
295
|
+
N 10000000: 12.9 i/s - 425838.01x slower
|
296
|
+
|
297
|
+
push + pop (bsearch)
|
298
|
+
N 10: 3931408.4 i/s
|
299
|
+
N 100: 2904181.8 i/s - 1.35x slower
|
300
|
+
N 1000: 2203157.1 i/s - 1.78x slower
|
301
|
+
N 10000: 1209584.9 i/s - 3.25x slower
|
302
|
+
N 100000: 81121.4 i/s - 48.46x slower
|
303
|
+
N 1000000: 5356.0 i/s - 734.02x slower
|
304
|
+
N 10000000: 281.9 i/s - 13946.33x slower
|
305
|
+
|
306
|
+
push + pop (rb_heap)
|
307
|
+
N 10: 2325816.5 i/s
|
308
|
+
N 100: 1603540.3 i/s - 1.45x slower
|
309
|
+
N 1000: 1262515.2 i/s - 1.84x slower
|
310
|
+
N 10000: 950389.3 i/s - 2.45x slower
|
311
|
+
N 100000: 732548.8 i/s - 3.17x slower
|
312
|
+
N 1000000: 673577.8 i/s - 3.45x slower
|
313
|
+
N 10000000: 467512.3 i/s - 4.97x slower
|
314
|
+
|
315
|
+
push + pop (c++ stl)
|
316
|
+
N 10: 7706818.6 i/s - 1.01x slower
|
317
|
+
N 100: 7393127.3 i/s - 1.05x slower
|
318
|
+
N 1000: 6898781.3 i/s - 1.13x slower
|
319
|
+
N 10000: 5731130.5 i/s - 1.36x slower
|
320
|
+
N 100000: 4842393.2 i/s - 1.60x slower
|
321
|
+
N 1000000: 4170936.4 i/s - 1.86x slower
|
322
|
+
N 10000000: 2737146.6 i/s - 2.84x slower
|
323
|
+
|
324
|
+
push + pop (c_dheap)
|
325
|
+
N 10: 10196454.1 i/s
|
326
|
+
N 100: 9668679.8 i/s - 1.05x slower
|
327
|
+
N 1000: 9339557.0 i/s - 1.09x slower
|
328
|
+
N 10000: 8045103.0 i/s - 1.27x slower
|
329
|
+
N 100000: 7150276.7 i/s - 1.43x slower
|
330
|
+
N 1000000: 6490261.6 i/s - 1.57x slower
|
331
|
+
N 10000000: 3734856.5 i/s - 2.73x slower
|
332
|
+
|
333
|
+
## Time complexity analysis
|
334
|
+
|
335
|
+
There are two fundamental heap operations: sift-up (used by push or decrease
|
336
|
+
score) and sift-down (used by pop or delete or increase score). Each sift
|
337
|
+
bubbles an item to its correct location in the tree.
|
338
|
+
|
339
|
+
* A _d_-ary heap has `log n / log d` layers, so either sift performs as many as
|
340
|
+
`log n / log d` writes, when a member sifts the entire length of the tree.
|
341
|
+
* Sift-up needs one comparison per layer: `O(log n / log d)`.
|
342
|
+
* Sift-down needs d comparisons per layer: `O(d log n / log d)`.
|
343
|
+
|
344
|
+
So, in the case of a balanced push then pop, as many as `(1 + d) log n / log d`
|
345
|
+
comparisons are made. Looking only at this worst case combo, `d=4` requires the
|
346
|
+
fewest comparisons for a combined push and pop:
|
347
|
+
|
348
|
+
* `(1 + 2) log n / log d ≈ 4.328085 log n`
|
349
|
+
* `(1 + 3) log n / log d ≈ 3.640957 log n`
|
350
|
+
* `(1 + 4) log n / log d ≈ 3.606738 log n`
|
351
|
+
* `(1 + 5) log n / log d ≈ 3.728010 log n`
|
352
|
+
* `(1 + 6) log n / log d ≈ 3.906774 log n`
|
353
|
+
* `(1 + 7) log n / log d ≈ 4.111187 log n`
|
354
|
+
* `(1 + 8) log n / log d ≈ 4.328085 log n`
|
355
|
+
* `(1 + 9) log n / log d ≈ 4.551196 log n`
|
356
|
+
* `(1 + 10) log n / log d ≈ 4.777239 log n`
|
343
357
|
* etc...
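The coefficients in this list can be reproduced with a short calculation. The listed values correspond to `(1 + d) / ln(d)`, i.e. they are multiples of the natural logarithm of `n`. The method name below is a hypothetical helper for illustration.

```ruby
# Worst-case comparisons for one push plus one pop, as a multiple of ln(n).
def push_pop_coefficient(d)
  (1 + d) / Math.log(d)
end

(2..10).each do |d|
  puts format("d=%-2d  %.6f ln n", d, push_pop_coefficient(d))
end

# The smallest coefficient in this range belongs to d=4.
best_d = (2..10).min_by { |d| push_pop_coefficient(d) }
puts "fewest comparisons at d=#{best_d}"
```

This only counts comparisons; it says nothing about cache behavior or pipelining, which is why measured results can favor a different `d`.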
|
344
358
|
|
345
359
|
See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
|
346
360
|
|
347
|
-
|
361
|
+
However, what this simple count of comparisons misses is the extent to which
|
362
|
+
modern compilers can optimize code (e.g. by unrolling the comparison loop to
|
363
|
+
execute on registers) and more importantly how good modern processors are at
|
364
|
+
pipelined speculative execution using branch prediction, etc. Benchmarks should
|
365
|
+
be run on the _exact same_ hardware platform that production code will use,
|
366
|
+
as the sift-down operation is especially sensitive to good pipelining.
|
348
367
|
|
349
|
-
|
350
|
-
provide better cache locality. Because the heap is a complete binary tree, the
|
351
|
-
elements can be stored in an array, without the need for tree or list pointers.
|
368
|
+
## Comparison performance
|
352
369
|
|
353
|
-
|
354
|
-
|
355
|
-
|
356
|
-
|
357
|
-
as an array which only stores values.
|
370
|
+
It is often useful to use external scores for otherwise uncomparable values.
|
371
|
+
And casting an item or score (via `to_f`) can also be time consuming. So
|
372
|
+
`DHeap` evaluates and stores scores at the time of insertion, and they will be
|
373
|
+
compared directly without needing any further lookup.
|
358
374
|
|
359
|
-
|
375
|
+
Numeric values can be compared _much_ faster than other ruby objects, even if
|
376
|
+
those objects simply delegate comparison to internal Numeric values.
|
377
|
+
Additionally, native C integers or floats can be compared _much_ faster than
|
378
|
+
ruby `Numeric` objects. So scores are converted to Float and stored as
|
379
|
+
`double`, which is 64 bits on an [LP64 64-bit system].
|
360
380
|
|
361
|
-
|
362
|
-
take precautions such as locking access behind a mutex.
|
381
|
+
[LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
|
363
382
|
|
364
383
|
## Alternative data structures
|
365
384
|
|
366
385
|
As always, you should run benchmarks with your expected scenarios to determine
|
367
386
|
which is best for your application.
|
368
387
|
|
369
|
-
Depending on your use-case,
|
370
|
-
and `#insert` might be just fine!
|
371
|
-
|
372
|
-
`memcpy` is so fast on modern hardware that your dataset might not be large
|
373
|
-
enough for it to matter.
|
388
|
+
Depending on your use-case, a sorted `Array` maintained with `#bsearch_index`
|
389
|
+
and `#insert` might be just fine! It only takes a couple of lines of code and
|
390
|
+
is probably "Fast Enough".
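A minimal sketch of that approach, assuming intrinsic scores and a hypothetical class name `SortedArrayQueue`. The Array is kept in descending order so that the minimum is always the last element and can be removed with `Array#pop`.

```ruby
# Hypothetical sketch: a priority queue backed by a sorted Array.
class SortedArrayQueue
  def initialize
    @items = [] # descending order, so the minimum is last
  end

  # Binary search for the insertion point, then an O(n) insert.
  def push(score)
    index = @items.bsearch_index { |item| item <= score } || @items.size
    @items.insert(index, score)
    self
  end

  # O(1): the minimum is always at the end.
  def pop
    @items.pop
  end

  def size
    @items.size
  end
end
```

The search itself is `O(log n)`, but `#insert` has to shift elements, so push remains `O(n)` overall.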
|
374
391
|
|
375
|
-
More complex heap
|
392
|
+
More complex heap variants, e.g. the [Fibonacci heap], allow heaps to be split and
|
376
393
|
merged which gives some graph algorithms a lower amortized time complexity. But
|
377
394
|
in practice, _d_-ary heaps have much lower overhead and often run faster.
|
378
395
|
|
@@ -385,25 +402,60 @@ of values in it, then you may want to use a self-balancing binary search tree
|
|
385
402
|
[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
|
386
403
|
[skip-list]: https://en.wikipedia.org/wiki/Skip_list
|
387
404
|
|
388
|
-
[Hashed and Heirarchical Timing Wheels][timing
|
389
|
-
family of data structures) can
|
390
|
-
|
405
|
+
[Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in the
|
406
|
+
timing wheel family of data structures) can have effectively `O(1)` running time
|
407
|
+
in most cases. Although the implementation for that data structure is more
|
391
408
|
complex than a heap, it may be necessary for enormous values of N.
|
392
409
|
|
393
|
-
[timing
|
410
|
+
[timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
|
411
|
+
|
412
|
+
## Supported platforms
|
413
|
+
|
414
|
+
See the [CI workflow] for all supported platforms.
|
415
|
+
|
416
|
+
[CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI
|
417
|
+
|
418
|
+
`d_heap` may contain bugs on 32-bit systems. Currently, `d_heap` is only tested
|
419
|
+
on 64-bit x86 CRuby 2.4-3.0 under Linux and Mac OS.
|
420
|
+
|
421
|
+
## Caveats and TODOs (PRs welcome!)
|
422
|
+
|
423
|
+
A `DHeap`'s internal array grows but never shrinks. At the very least, there
|
424
|
+
should be a `#compact` or `#shrink` method, and the array should shrink during `#freeze`. It might make
|
425
|
+
sense to automatically shrink (to no more than 2x the current size) during GC's
|
426
|
+
compact phase.
|
427
|
+
|
428
|
+
Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F. This
|
429
|
+
might lead to a different default `d` value (maybe 16 or 24?).
|
430
|
+
|
431
|
+
Shrink scores to 64-bits: either store a type flag with each entry (this could
|
432
|
+
be used to support non-numeric scores) or require users to choose between
|
433
|
+
`Integer` or `Float` at construction time. Reducing memory usage should also
|
434
|
+
improve speed for very large heaps.
|
435
|
+
|
436
|
+
Patches to support JRuby, rubinius, 32-bit systems, or any other platforms are
|
437
|
+
welcome! JRuby and Truffle Ruby ought to be able to use [Java's PriorityQueue]?
|
438
|
+
Other platforms could fall back on the (slower) pure ruby implementation used by
|
439
|
+
the benchmarks.
|
440
|
+
|
441
|
+
[Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html
|
442
|
+
|
443
|
+
Allow a max-heap (or other configurations of the compare function). This can be
|
444
|
+
very easily implemented by just reversing the scores.
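Until such an option exists, the same effect can be approximated from the caller's side by negating scores on the way in. The adapter below is a hypothetical sketch that assumes only a min-queue responding to `#push(value, score)` and `#pop`; it is not part of the gem.

```ruby
# Hypothetical adapter: makes a min-queue behave like a max-queue by
# negating each score before it is stored.
class MaxQueueAdapter
  def initialize(min_queue)
    @min_queue = min_queue
  end

  def push(value, score)
    @min_queue.push(value, -Float(score))
    self
  end

  # The smallest negated score is the largest original score.
  def pop
    @min_queue.pop
  end
end
```

Any score read back from the underlying queue would need to be negated again before use.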
|
394
445
|
|
395
|
-
|
446
|
+
_Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the basic
|
447
|
+
numeric use case simplicity and speed can be preserved.
|
396
448
|
|
397
|
-
|
398
|
-
|
399
|
-
heap. This enforces a uniqueness constraint on items on the heap, and also
|
400
|
-
allows items to be more efficiently deleted or adjusted. However maintaining
|
401
|
-
the hash does lead to a small drop in normal `#push` and `#pop` performance.
|
449
|
+
Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current time"
|
450
|
+
and move all values below that time onto an Array.
|
402
451
|
|
403
|
-
|
404
|
-
features that are loosely inspired by go's timers.
|
405
|
-
heap after deletion
|
406
|
-
|
452
|
+
Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)` which could contain
|
453
|
+
some features that are loosely inspired by go's timers. Go lazily sifts its
|
454
|
+
heap after deletion or adjustments, to achieve faster amortized runtime.
|
455
|
+
There's no need to actually remove a deleted item from the heap, if you re-add
|
456
|
+
it back before it's next evaluated. A similar trick can be to store "far away"
|
457
|
+
values in an internal `Hash`, assuming many will be deleted before they rise to
|
458
|
+
the top. This could naturally evolve into a [timing wheel] variant.
|
407
459
|
|
408
460
|
## Development
|
409
461
|
|