d_heap 0.6.1 → 0.7.0
- checksums.yaml +4 -4
- data/.clang-format +21 -0
- data/.github/workflows/main.yml +16 -1
- data/.rubocop.yml +1 -0
- data/CHANGELOG.md +17 -0
- data/{N → D} +1 -1
- data/README.md +313 -261
- data/d_heap.gemspec +16 -5
- data/docs/benchmarks-2.txt +79 -61
- data/docs/benchmarks.txt +587 -416
- data/docs/profile.txt +99 -133
- data/ext/d_heap/.rubocop.yml +7 -0
- data/ext/d_heap/d_heap.c +575 -424
- data/ext/d_heap/extconf.rb +34 -3
- data/images/push_n.png +0 -0
- data/images/push_n_pop_n.png +0 -0
- data/images/push_pop.png +0 -0
- data/lib/d_heap.rb +25 -1
- data/lib/d_heap/version.rb +1 -1
- metadata +6 -30
- data/.rspec +0 -3
- data/.travis.yml +0 -6
- data/Gemfile +0 -20
- data/Gemfile.lock +0 -83
- data/Rakefile +0 -20
- data/benchmarks/perf.rb +0 -29
- data/benchmarks/push_n.yml +0 -35
- data/benchmarks/push_n_pop_n.yml +0 -52
- data/benchmarks/push_pop.yml +0 -32
- data/benchmarks/stackprof.rb +0 -31
- data/bin/bench_charts +0 -13
- data/bin/bench_n +0 -7
- data/bin/benchmark-driver +0 -29
- data/bin/benchmarks +0 -10
- data/bin/console +0 -15
- data/bin/profile +0 -10
- data/bin/rake +0 -29
- data/bin/rspec +0 -29
- data/bin/rubocop +0 -29
- data/bin/setup +0 -8
- data/lib/benchmark_driver/runner/ips_zero_fail.rb +0 -158
- data/lib/d_heap/benchmarks.rb +0 -112
- data/lib/d_heap/benchmarks/benchmarker.rb +0 -116
- data/lib/d_heap/benchmarks/implementations.rb +0 -224
- data/lib/d_heap/benchmarks/profiler.rb +0 -71
- data/lib/d_heap/benchmarks/rspec_matchers.rb +0 -352
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5b51ed52baf74b585a7ab7799f92a446aef5852431ba10e146658b419657ffbe
+  data.tar.gz: cc7c6786eee78ec13214582b8701448d312f59fb723d12676fb673447ab409a7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5de98f8c9084b30694fff5f8154a6e42e7e67d76518c25136ab4fb0c0afb047ad3c923f4544dcf613ded4c3b01417729aa796c973100faaa7ee93051fa630c7d
+  data.tar.gz: e5dbcc90da7adfba7ef45cd9a2da5fd1781a2bd489002a5ffc0a764915c035c178db30ae9b8431a8fc810cfa6f03a1b38ec0a50cbf23c2e1ba5dfc36549c0609
data/.clang-format
ADDED
@@ -0,0 +1,21 @@
+---
+BasedOnStyle: mozilla
+IndentWidth: 4
+PointerAlignment: Right
+AlignAfterOpenBracket: Align
+AlignConsecutiveAssignments: true
+AlignConsecutiveDeclarations: true
+AlignConsecutiveBitFields: true
+AlignConsecutiveMacros: true
+AlignEscapedNewlines: Right
+AlignOperands: true
+
+AllowAllConstructorInitializersOnNextLine: false
+AllowShortIfStatementsOnASingleLine: WithoutElse
+
+IndentCaseLabels: false
+IndentPPDirectives: AfterHash
+
+ForEachMacros:
+  - WHILE_PEEK_LT_P
+...
data/.github/workflows/main.yml
CHANGED
@@ -23,4 +23,19 @@ jobs:
         run: |
           gem install bundler -v 2.2.3
           bundle install
-          bundle exec rake
+          bundle exec rake ci
+
+  benchmarks:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Set up Ruby
+        uses: ruby/setup-ruby@v1
+        with:
+          ruby-version: 2.7
+          bundler-cache: true
+      - name: Run the benchmarks
+        run: |
+          gem install bundler -v 2.2.3
+          bundle install
+          bundle exec rake ci:benchmarks
data/.rubocop.yml
CHANGED
@@ -135,6 +135,7 @@ Style/ClassAndModuleChildren: { Enabled: false }
 Style/EachWithObject: { Enabled: false }
 Style/FormatStringToken: { Enabled: false }
 Style/FloatDivision: { Enabled: false }
+Style/GuardClause: { Enabled: false } # usually nice to do, but...
 Style/IfUnlessModifier: { Enabled: false }
 Style/IfWithSemicolon: { Enabled: false }
 Style/Lambda: { Enabled: false }
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,22 @@
 ## Current/Unreleased
 
+## Release v0.7.0 (2021-01-24)
+
+* 💥⚡️ **BREAKING**: Uses `double` for _all_ scores.
+  * 💥 Integers larger than a double mantissa (53-bits) will lose some
+    precision.
+  * ⚡️ Big speed up
+  * ⚡️ Much better memory usage
+  * ⚡️ Simplifies score conversion between ruby and C
+* ✨ Added `DHeap::Map` for ensuring values can only be added once, by `#hash`.
+  * Adding again will update the score.
+  * Adds `DHeap::Map#[]` for quick lookup of existing scores
+  * Adds `DHeap::Map#[]=` for adjustments of existing scores
+  * TODO: `DHeap::Map#delete`
+* 📝📈 SO MANY BENCHMARKS
+* ⚡️ Set DEFAULT_D to 6, based on benchmarks.
+* 🐛♻️ Convert all `long` indexes to `size_t`
+
 ## Release v0.6.1 (2021-01-24)
 
 * 📝 Fix link to CHANGELOG.md in gemspec
data/{N → D}
RENAMED
data/README.md
CHANGED
@@ -7,6 +7,13 @@
 A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
 implemented as a C extension.
 
+A regular queue has "FIFO" behavior: first in, first out.  A stack is "LIFO":
+last in, first out.  A priority queue pushes each element with a score and pops
+them out in order by score.  Priority queues are often used in algorithms for
+e.g. [scheduling] of timers or bandwidth management, for [Huffman coding], and
+for various graph search algorithms such as [Dijkstra's algorithm],
+[A* search], or [Prim's algorithm].
+
 From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
 > A heap is a specialized tree-based data structure which is essentially an
 > almost complete tree that satisfies the heap property: in a min heap, for any
@@ -16,26 +23,17 @@ From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
 
 ![tree representation of a min heap](images/wikipedia-min-heap.png)
 
-have better memory cache behavior than binary heaps, allowing them to run more
-quickly in practice despite slower worst-case time complexity.  In the worst
-case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
-the tradeoff that pop requires `O(d log n / log d)`.
-
-Although you should probably just use the default _d_ value of `4` (see the
-analysis below), it's always advisable to benchmark your specific use-case.  In
-particular, if you push items more than you pop, higher values for _d_ can give
-a faster total runtime.
+The _d_-ary heap data structure is a generalization of a [binary heap] in which
+each node has _d_ children instead of 2.  This speeds up "push" or "decrease
+priority" operations (`O(log n / log d)`) with the tradeoff of slower "pop" or
+"increase priority" (`O(d log n / log d)`).  Additionally, _d_-ary heaps can
+have better memory cache behavior than binary heaps, letting them run more
+quickly in practice.
+
+Although the default _d_ value will usually perform best (see the time
+complexity analysis below), it's always advisable to benchmark your specific
+use-case.  In particular, if you push items more than you pop, higher values
+for _d_ can give a faster total runtime.
 
 [d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
 [priority queue]: https://en.wikipedia.org/wiki/Priority_queue
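The parent/child relationships behind those complexities come from storing the complete tree in a flat array. A generic sketch of the index arithmetic (an illustration in plain Ruby, not the gem's C source):

```ruby
# A d-ary heap stores a complete tree in a flat Array: the node at index i
# has its parent at (i - 1) / d and its children at (d*i + 1)..(d*i + d).
def parent_index(i, d)
  (i - 1) / d # integer division
end

def child_indexes(i, d)
  first = d * i + 1
  (first..(first + d - 1)).to_a
end

# For d = 4: the root's children sit at indexes 1..4, node 1's at 5..8.
```

This flat layout is what gives _d_-ary heaps their cache friendliness: all _d_ children of a node are adjacent in memory.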
@@ -46,41 +44,39 @@ a faster total runtime.
 [A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
 [Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
 
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'd_heap'
+```
+
+And then execute:
+
+    $ bundle install
+
+Or install it yourself as:
+
+    $ gem install d_heap
+
 ## Usage
 
-The basic API is `#push(object, score)` and `#pop`.  Please read the
+The basic API is `#push(object, score)` and `#pop`.  Please read the [full
+documentation] for more details.  The score must be convertible to a `Float`
+via `Float(score)` (i.e. it should properly implement `#to_f`).
 
-Quick reference for
+Quick reference for the most common methods:
 
-* `heap << object` adds a value,
+* `heap << object` adds a value, using `Float(object)` as its intrinsic score.
 * `heap.push(object, score)` adds a value with an extrinsic score.
-* `heap.pop` removes and returns the value with the minimum score.
-* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
 * `heap.peek` to view the minimum value without popping it.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_below(max_score)` pops only if the next score is `<` the argument.
 * `heap.clear` to remove all items from the heap.
 * `heap.empty?` returns true if the heap is empty.
 * `heap.size` returns the number of items in the heap.
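The `Float(score)` requirement can be demonstrated in plain Ruby; `Deadline` here is a made-up class for illustration, not part of the gem:

```ruby
# Scores just need to satisfy Kernel#Float, which accepts Numerics,
# strictly-parsed numeric Strings, and any object implementing #to_f.
class Deadline
  def initialize(at)
    @at = at
  end

  def to_f
    @at.to_f
  end
end

Float(12)                # => 12.0 (converted directly)
Float(Deadline.new(2.5)) # => 2.5  (converted via #to_f)
# Float(nil) raises TypeError, so unusable scores are rejected up front.
```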
 
-If the score changes while the object is still in the heap, it will not be
-re-evaluated again.
-
-The score must either be `Integer` or `Float` or convertable to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`).  Constraining scores to
-numeric values gives more than 50% speedup under some benchmarks!  _n.b._
-`Integer` _scores must have an absolute value that fits into_ `unsigned long
-long`.  This is compiler and architecture dependant but with gcc on an IA-64
-system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
-+18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
-in nanoseconds.
-
-_Comparing arbitary objects via_ `a <=> b` _was the original design and may be
-added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons.  The speedup from this constraint is
-huge!_
-
-[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
-
 ### Examples
 
 ```ruby
@@ -128,251 +124,272 @@ heap.size # => 0
 heap.pop  # => nil
 ```
 
-Please see the [
+Please see the [full documentation] for more methods and more examples.
 
+[full documentation]: https://rubydoc.info/gems/d_heap/DHeap
 
+### DHeap::Map
 
+`DHeap::Map` augments the heap with an internal `Hash`, mapping objects to
+their index in the heap.  For simple push/pop this is a bit slower than a
+normal `DHeap`, but it can enable huge speed-ups for algorithms that need to
+adjust scores after they've been added, e.g. [Dijkstra's algorithm].  It adds
+the following:
 
+* a uniqueness constraint, by `#hash` value
+* `#[obj] # => score` or `#score(obj)` in `O(1)`
+* `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
+* TODO:
+  * optionally unique by object identity
+  * `#delete(obj)` in `O(d log n / log d)` (TODO)
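Those semantics can be sketched in a few lines of pure Ruby. This toy keeps one Hash entry per `#hash`-distinct object and scans for the minimum on pop; the real `DHeap::Map` instead pairs a C heap with an index Hash:

```ruby
class ToyMapQueue
  def initialize
    @scores = {} # object => score; one entry per #hash-distinct object
  end

  # Pushing an object that is already present just updates (rescores) it.
  def push(obj, score)
    @scores[obj] = Float(score)
  end
  alias []= push

  # O(1) score lookup.
  def [](obj)
    @scores[obj]
  end

  # O(n) here; the heap-backed version pops in O(d log n / log d).
  def pop
    return nil if @scores.empty?
    obj, _score = @scores.min_by { |_o, score| score }
    @scores.delete(obj)
    obj
  end
end

q = ToyMapQueue.new
q.push("task", 3.0)
q["task"] = 1.0 # rescore; "task" is still queued only once
q.push("other", 2.0)
q.pop # => "task"
```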
 
+## Scores
 
+If a score changes while the object is still in the heap, it will not be
+re-evaluated again.
 
+Constraining scores to `Float` gives enormous performance benefits.  n.b.
+very large `Integer` values will lose precision when converted to `Float`.
+This is compiler and architecture dependent, but with gcc on an IA-64 system
+`Float` is 64 bits with a 53-bit mantissa, which gives a range of
+-9,007,199,254,740,991 to +9,007,199,254,740,991.  That is _not_ enough to
+store the precise POSIX time since the epoch in nanoseconds.  This can be
+worked around by adding a bias, but probably it's good enough for most usage.
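The 53-bit mantissa limit quoted above is easy to verify in plain Ruby:

```ruby
# Every Integer with magnitude up to 2**53 survives a round-trip through
# Float (an IEEE 754 double); 2**53 + 1 is the first one that does not.
exact = 2**53 - 1 # 9_007_199_254_740_991, the bound quoted above
lossy = 2**53 + 1

exact.to_f.to_i == exact # => true
lossy.to_f.to_i == lossy # => false (rounds back down to 2**53)
```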
 
-makes the tree shorter but broader, which reduces to `O(log n / log d)` while
-increasing the comparisons needed by sift-down to `O(d log n/ log d)`.
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
+added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
 
-slowly than the naive approach—even for heaps containing ten thousand items.
-Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
-has _much_ higher overhead.  And a _d_-heap needs `d + 1` times more comparisons
-for each push + pop than `bsearch` + `insert`.
+## Thread safety
 
-heap instead of the traditional binary heap.
+`DHeap` is _not_ thread-safe, so concurrent access from multiple threads needs
+to take precautions such as locking access behind a mutex.
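A minimal sketch of that precaution, wrapping any queue's operations in a `Mutex` (generic Ruby; `SynchronizedQueue` is not a gem API):

```ruby
# Serializes all access to the wrapped queue behind a single lock.
class SynchronizedQueue
  def initialize(queue)
    @queue = queue
    @lock  = Mutex.new
  end

  def push(obj, score)
    @lock.synchronize { @queue.push(obj, score) }
  end

  def pop
    @lock.synchronize { @queue.pop }
  end
end
```

Coarse locking like this is the simplest safe option; note that per-operation locking still does not make compound sequences (e.g. peek then pop) atomic.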
 
 ## Benchmarks
 
-_See
+_See full benchmark output in subdirs of `benchmarks`.  See also
+`docs/benchmarks.txt` for updated results.  These benchmarks were measured with
+an Intel Core i7-1065G7 8x3.9GHz with d_heap v0.5.0 and ruby 2.7.2 without MJIT
+enabled._
+
+### Implementations
+
+* **findmin** -
+  A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
+  very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
+  `Array#delete_at(min_index)`.  Push + pop is still fast for `n < 100`, but
+  unusably slow for `n > 1000`.
+
+* **bsearch** -
+  A simple implementation with a slow `O(n)` push using `Array#bsearch` +
+  `Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
+  `Array#pop`.  It is still relatively fast for `n < 10000`, but its linear
+  time complexity really destroys it after that.
+
+* **rb_heap** -
+  A pure ruby binary min-heap that has been tuned for performance by making
+  few method calls and allocating and assigning as few variables as possible.
+  It runs in `O(log n)` for both push and pop, although pop is slower than
+  push by a constant factor.  Its much higher constant factors make it lose
+  to `bsearch` push + pop for `n < 10000`, but it holds steady with very
+  little slowdown even with `n > 10000000`.
+
+* **c++ stl** -
+  A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
+  priority_queue].  The wrapper is simply to provide compatibility with the
+  other benchmarked implementations, but it should be possible to speed this
+  up a little bit by benchmarking the `priority_queue_cxx` API directly.  It
+  has the same time complexity as rb_heap but its much lower constant
+  factors allow it to easily outperform `bsearch`.
+
+* **c_dheap** -
+  A {DHeap} instance with the default `d` value of `4`.  It has the same time
+  complexity as `rb_heap` and `c++ stl`, but is faster than both in every
+  benchmarked scenario.
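For reference, the `rb_heap` entry above corresponds to something like this compact pure-Ruby binary min-heap (an illustration of the approach, not the benchmark suite's exact tuned code):

```ruby
# A binary (d = 2) min-heap over [score, value] pairs in a flat Array.
class PureRubyHeap
  def initialize
    @heap = []
  end

  def push(value, score)
    @heap << [score, value]
    i = @heap.size - 1
    while i > 0 # sift-up: one comparison per layer
      parent = (i - 1) / 2
      break if @heap[parent][0] <= @heap[i][0]
      @heap[parent], @heap[i] = @heap[i], @heap[parent]
      i = parent
    end
    self
  end

  def pop
    return nil if @heap.empty?
    min  = @heap[0]
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      i = 0
      loop do # sift-down: up to d comparisons per layer
        left     = 2 * i + 1
        smallest = i
        [left, left + 1].each do |child|
          if child < @heap.size && @heap[child][0] < @heap[smallest][0]
            smallest = child
          end
        end
        break if smallest == i
        @heap[i], @heap[smallest] = @heap[smallest], @heap[i]
        i = smallest
      end
    end
    min[1]
  end
end
```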
 
 [priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
 [C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
 
+### Scenarios
 
+Each benchmark increases N exponentially, either by √10 or by approximating it
+(alternating between x3 and x3.333), in order to simplify keeping loop counts
+evenly divisible by N.
 
+#### push N items
 
+This measures the _average time per insert_ to create a queue of size N
+(clearing the queue once it reaches that size).  Use cases which push (or
+decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
+algorithm] when the graph has more edges than vertices, may want to pay more
+attention to this benchmark.
 
+![bar graph for push_n benchmarks](./images/push_n.png)
 
+== push N (N=100) ==========================================================
+push N (c_dheap): 10522662.6 i/s
+push N (findmin): 9980622.3 i/s - 1.05x slower
+push N (c++ stl): 7991608.3 i/s - 1.32x slower
+push N (rb_heap): 4607849.4 i/s - 2.28x slower
+push N (bsearch): 2769106.2 i/s - 3.80x slower
+== push N (N=10,000) =======================================================
+push N (c_dheap): 10444588.3 i/s
+push N (findmin): 10191797.4 i/s - 1.02x slower
+push N (c++ stl): 8210895.4 i/s - 1.27x slower
+push N (rb_heap): 4369252.9 i/s - 2.39x slower
+push N (bsearch): 1213580.4 i/s - 8.61x slower
+== push N (N=1,000,000) ====================================================
+push N (c_dheap): 10342183.7 i/s
+push N (findmin): 9963898.8 i/s - 1.04x slower
+push N (c++ stl): 7891924.8 i/s - 1.31x slower
+push N (rb_heap): 4350116.0 i/s - 2.38x slower
+
+All three heap implementations have little to no perceptible slowdown for `N >
+100`.  But `DHeap` runs faster than `Array#push` to an unsorted array (findmin)!
 
+#### push then pop N items
 
+This measures the _average_ for a push **or** a pop, filling up a queue with N
+items and then draining that queue until empty.  It represents the amortized
+cost of balanced pushes and pops to fill a heap and drain it.
 
 ![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)
 
+== push N then pop N (N=100) ===============================================
+push N + pop N (c_dheap): 10954469.2 i/s
+push N + pop N (c++ stl): 9317140.2 i/s - 1.18x slower
+push N + pop N (bsearch): 4808770.2 i/s - 2.28x slower
+push N + pop N (findmin): 4321411.9 i/s - 2.53x slower
+push N + pop N (rb_heap): 2467417.0 i/s - 4.44x slower
+== push N then pop N (N=10,000) ============================================
+push N + pop N (c_dheap): 8083962.7 i/s
+push N + pop N (c++ stl): 7365661.8 i/s - 1.10x slower
+push N + pop N (bsearch): 2257047.9 i/s - 3.58x slower
+push N + pop N (rb_heap): 1439204.3 i/s - 5.62x slower
+== push N then pop N (N=1,000,000) =========================================
+push N + pop N (c++ stl): 5274657.5 i/s
+push N + pop N (c_dheap): 4731117.9 i/s - 1.11x slower
+push N + pop N (rb_heap): 976688.6 i/s - 5.40x slower
+
+At N=100 findmin still beats a pure-ruby heap.  But above that it slows down
+too much to be useful.  At N=10k, bsearch still beats a pure ruby heap, but
+above 30k it slows down too much to be useful.  `DHeap` consistently runs
+4.5-5.5x faster than the pure ruby heap.
+
+#### push & pop on N-item heap
+
+This measures the combined time to push once and pop once, which is done
+repeatedly while keeping a stable heap size of N.  It's an approximation for
+scenarios which reach a stable size and then plateau with balanced pushes and
+pops.  E.g. timers and timeouts will often reschedule themselves or replace
+themselves with new timers or timeouts, maintaining a roughly stable total
+count of timers.
 
 ![bar graph for push_pop benchmarks](./images/push_pop.png)
 
-queue size = 5000000: 445234.5 i/s - 1.65x slower
-queue size = 10000000: 423119.0 i/s - 1.74x slower
-
-== push + pop (bsearch)
-queue size = 10000: 786334.2 i/s
-queue size = 25000: 364963.8 i/s - 2.15x slower
-queue size = 50000: 200520.6 i/s - 3.92x slower
-queue size = 100000: 88607.0 i/s - 8.87x slower
-queue size = 250000: 34530.5 i/s - 22.77x slower
-queue size = 500000: 17965.4 i/s - 43.77x slower
-queue size = 1000000: 5638.7 i/s - 139.45x slower
-queue size = 2500000: 1302.0 i/s - 603.93x slower
-queue size = 5000000: 592.0 i/s - 1328.25x slower
-queue size = 10000000: 288.8 i/s - 2722.66x slower
-
-== push + pop (c_dheap)
-queue size = 10000: 7311366.6 i/s
-queue size = 50000: 6737824.5 i/s - 1.09x slower
-queue size = 25000: 6407340.6 i/s - 1.14x slower
-queue size = 100000: 6254396.3 i/s - 1.17x slower
-queue size = 250000: 5917684.5 i/s - 1.24x slower
-queue size = 500000: 5126307.6 i/s - 1.43x slower
-queue size = 1000000: 4403494.1 i/s - 1.66x slower
-queue size = 2500000: 3304088.2 i/s - 2.21x slower
-queue size = 5000000: 2664897.7 i/s - 2.74x slower
-queue size = 10000000: 2137927.6 i/s - 3.42x slower
-
-## Analysis
-
-### Time complexity
-
-There are two fundamental heap operations: sift-up (used by push) and sift-down
-(used by pop).
-
-* A _d_-ary heap will have `log n / log d` layers, so both sift operations can
-  perform as many as `log n / log d` writes, when a member sifts the entire
-  length of the tree.
-* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
-* Sift-down makes d comparions per layer, so pop runs in `O(d log n / log d)`.
-
-So, in the simplest case of running balanced push/pop while maintaining the same
-heap size, `(1 + d) log n / log d` comparisons are made.  In the worst case,
-when every sift traverses every layer of the tree, `d=4` requires the fewest
-comparisons for combined insert and delete:
-
-* (1 + 2) lg n / lg d ≈ 4.328085 lg n
-* (1 + 3) lg n / lg d ≈ 3.640957 lg n
-* (1 + 4) lg n / lg d ≈ 3.606738 lg n
-* (1 + 5) lg n / lg d ≈ 3.728010 lg n
-* (1 + 6) lg n / lg d ≈ 3.906774 lg n
-* (1 + 7) lg n / lg d ≈ 4.111187 lg n
-* (1 + 8) lg n / lg d ≈ 4.328085 lg n
-* (1 + 9) lg n / lg d ≈ 4.551196 lg n
-* (1 + 10) lg n / lg d ≈ 4.777239 lg n
+push + pop (findmin)
+N 10: 5480288.0 i/s
+N 100: 2595178.8 i/s - 2.11x slower
+N 1000: 224813.9 i/s - 24.38x slower
+N 10000: 12630.7 i/s - 433.89x slower
+N 100000: 1097.3 i/s - 4994.31x slower
+N 1000000: 135.9 i/s - 40313.05x slower
+N 10000000: 12.9 i/s - 425838.01x slower
+
+push + pop (bsearch)
+N 10: 3931408.4 i/s
+N 100: 2904181.8 i/s - 1.35x slower
+N 1000: 2203157.1 i/s - 1.78x slower
+N 10000: 1209584.9 i/s - 3.25x slower
+N 100000: 81121.4 i/s - 48.46x slower
+N 1000000: 5356.0 i/s - 734.02x slower
+N 10000000: 281.9 i/s - 13946.33x slower
+
+push + pop (rb_heap)
+N 10: 2325816.5 i/s
+N 100: 1603540.3 i/s - 1.45x slower
+N 1000: 1262515.2 i/s - 1.84x slower
+N 10000: 950389.3 i/s - 2.45x slower
+N 100000: 732548.8 i/s - 3.17x slower
+N 1000000: 673577.8 i/s - 3.45x slower
+N 10000000: 467512.3 i/s - 4.97x slower
+
+push + pop (c++ stl)
+N 10: 7706818.6 i/s - 1.01x slower
+N 100: 7393127.3 i/s - 1.05x slower
+N 1000: 6898781.3 i/s - 1.13x slower
+N 10000: 5731130.5 i/s - 1.36x slower
+N 100000: 4842393.2 i/s - 1.60x slower
+N 1000000: 4170936.4 i/s - 1.86x slower
+N 10000000: 2737146.6 i/s - 2.84x slower
+
+push + pop (c_dheap)
+N 10: 10196454.1 i/s
+N 100: 9668679.8 i/s - 1.05x slower
+N 1000: 9339557.0 i/s - 1.09x slower
+N 10000: 8045103.0 i/s - 1.27x slower
+N 100000: 7150276.7 i/s - 1.43x slower
+N 1000000: 6490261.6 i/s - 1.57x slower
+N 10000000: 3734856.5 i/s - 2.73x slower
 
+## Time complexity analysis
+
+There are two fundamental heap operations: sift-up (used by push or decrease
+score) and sift-down (used by pop or delete or increase score).  Each sift
+bubbles an item to its correct location in the tree.
+
+* A _d_-ary heap has `log n / log d` layers, so either sift performs as many
+  as `log n / log d` writes, when a member sifts the entire length of the
+  tree.
+* Sift-up needs one comparison per layer: `O(log n / log d)`.
+* Sift-down needs d comparisons per layer: `O(d log n / log d)`.
+
+So, in the case of a balanced push then pop, as many as `(1 + d) log n / log d`
+comparisons are made.  Looking only at this worst case combo, `d=4` requires
+the fewest comparisons for a combined push and pop:
+
+* `(1 + 2) log n / log d ≈ 4.328085 log n`
+* `(1 + 3) log n / log d ≈ 3.640957 log n`
+* `(1 + 4) log n / log d ≈ 3.606738 log n`
+* `(1 + 5) log n / log d ≈ 3.728010 log n`
+* `(1 + 6) log n / log d ≈ 3.906774 log n`
+* `(1 + 7) log n / log d ≈ 4.111187 log n`
+* `(1 + 8) log n / log d ≈ 4.328085 log n`
+* `(1 + 9) log n / log d ≈ 4.551196 log n`
+* `(1 + 10) log n / log d ≈ 4.777239 log n`
 * etc...
 
 See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
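The coefficients in that list are just `(1 + d) / ln d`, so they can be recomputed in a couple of lines:

```ruby
# Worst-case comparisons for one push + one pop, as a multiple of log n.
coefficients = (2..10).map { |d| [d, (1 + d) / Math.log(d)] }

coefficients.each { |d, c| printf("d=%2d: %.6f log n\n", d, c) }

best_d, = coefficients.min_by { |_d, c| c }
best_d # => 4, matching the list above
```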
 
+However, what this simple count of comparisons misses is the extent to which
+modern compilers can optimize code (e.g. by unrolling the comparison loop to
+execute on registers) and, more importantly, how good modern processors are at
+pipelined speculative execution using branch prediction, etc.  Benchmarks
+should be run on the _exact same_ hardware platform that production code will
+use, as the sift-down operation is especially sensitive to good pipelining.
 
-provide better cache locality.  Because the heap is a complete binary tree, the
-elements can be stored in an array, without the need for tree or list pointers.
+## Comparison performance
 
-as an array which only stores values.
+It is often useful to use external scores for otherwise uncomparable values.
+And casting an item or score (via `to_f`) can also be time consuming.  So
+`DHeap` evaluates and stores scores at the time of insertion, and they will be
+compared directly without needing any further lookup.
 
+Numeric values can be compared _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values.
+Additionally, native C integers or floats can be compared _much_ faster than
+ruby `Numeric` objects.  So scores are converted to Float and stored as
+`double`, which is 64 bits on an [LP64 64-bit system].
 
-take precautions such as locking access behind a mutex.
+[LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
 
 ## Alternative data structures
 
 As always, you should run benchmarks with your expected scenarios to determine
 which is best for your application.
 
-Depending on your use-case,
-and `#insert` might be just fine!
-`memcpy` is so fast on modern hardware that your dataset might not be large
-enough for it to matter.
+Depending on your use-case, keeping a sorted `Array` with `#bsearch_index`
+and `#insert` might be just fine!  It only takes a couple of lines of code and
+is probably "Fast Enough".
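Those couple of lines might look like this sketch (plain Ruby; `Array#bsearch_index` needs ruby >= 2.3):

```ruby
# A sorted-Array priority queue: O(n) push (binary search + insert),
# O(1) pop.  Kept sorted by descending score so the minimum is at the end.
class SortedArrayQueue
  def initialize
    @entries = [] # [score, value] pairs, descending by score
  end

  def push(value, score)
    score = Float(score)
    i = @entries.bsearch_index { |(s, _)| s <= score } || @entries.size
    @entries.insert(i, [score, value])
    self
  end

  def pop
    entry = @entries.pop # the minimum score lives at the end
    entry && entry[1]
  end
end
```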
 
-More complex heap
+More complex heap variants, e.g. [Fibonacci heap], allow heaps to be split and
 merged which gives some graph algorithms a lower amortized time complexity.  But
 in practice, _d_-ary heaps have much lower overhead and often run faster.
 
@@ -385,25 +402,60 @@ of values in it, then you may want to use a self-balancing binary search tree
 [red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
 [skip-list]: https://en.wikipedia.org/wiki/Skip_list
 
-[Hashed and Heirarchical Timing Wheels][timing
-family of data structures) can
+[Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in the
+timing wheel family of data structures) can have effectively `O(1)` running
+time in most cases.  Although the implementation for that data structure is
+more
 complex than a heap, it may be necessary for enormous values of N.
 
-[timing
+[timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## Supported platforms
+
+See the [CI workflow] for all supported platforms.
+
+[CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI
+
+`d_heap` may contain bugs on 32-bit systems.  Currently, `d_heap` is only
+tested on 64-bit x86 CRuby 2.4-3.0 under Linux and Mac OS.
+
+## Caveats and TODOs (PRs welcome!)
+
+A `DHeap`'s internal array grows but never shrinks.  At the very least, there
+should be a `#compact` or `#shrink` method, which could also run during
+`#freeze`.  It might make sense to automatically shrink (to no more than 2x
+the current size) during GC's compact phase.
+
+Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F.  This
+might lead to a different default `d` value (maybe 16 or 24?).
+
+Shrink scores to 64-bits: either store a type flag with each entry (this could
+be used to support non-numeric scores) or require users to choose between
+`Integer` or `Float` at construction time.  Reducing memory usage should also
+improve speed for very large heaps.
+
+Patches to support JRuby, rubinius, 32-bit systems, or any other platforms are
+welcome!  JRuby and Truffle Ruby ought to be able to use [Java's
+PriorityQueue].  Other platforms could fall back on the (slower) pure ruby
+implementation used by the benchmarks.
+
+[Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html
+
+Allow a max-heap (or other configurations of the compare function).  This can
+be very easily implemented by just reversing the scores.
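That score-reversal trick is a one-liner around any min-queue. A sketch (the wrapped `min_queue` is any object with the push/pop API described above; `MaxQueue` is hypothetical):

```ruby
# A max-queue built from a min-queue by negating every score on the way in.
class MaxQueue
  def initialize(min_queue)
    @min = min_queue
  end

  def push(value, score)
    @min.push(value, -Float(score)) # the highest score becomes the lowest
    self
  end

  def pop
    @min.pop
  end
end
```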
 
+_Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the
+basic numeric use case simplicity and speed can be preserved.
 
-heap.  This enforces a uniqueness constraint on items on the heap, and also
-allows items to be more efficiently deleted or adjusted.  However maintaining
-the hash does lead to a small drop in normal `#push` and `#pop` performance.
+Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current
+time" and move all values below that time onto an Array.
 
-features that are loosely inspired by go's timers.
-heap after deletion
+Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)` which could contain
+some features that are loosely inspired by go's timers.  Go lazily sifts its
+heap after deletion or adjustments, to achieve faster amortized runtime.
+There's no need to actually remove a deleted item from the heap, if you re-add
+it back before it's next evaluated.  A similar trick is to store "far away"
+values in an internal `Hash`, assuming many will be deleted before they rise
+to the top.  This could naturally evolve into a [timing wheel] variant.
 
 ## Development
 