d_heap 0.3.0 → 0.4.0
- checksums.yaml +4 -4
- data/.rubocop.yml +30 -1
- data/CHANGELOG.md +42 -0
- data/Gemfile +1 -0
- data/Gemfile.lock +11 -10
- data/README.md +353 -121
- data/benchmarks/push_n.yml +28 -0
- data/benchmarks/push_n_pop_n.yml +31 -0
- data/benchmarks/push_pop.yml +24 -0
- data/bin/bench_n +7 -0
- data/bin/benchmark-driver +29 -0
- data/bin/benchmarks +10 -0
- data/bin/profile +10 -0
- data/d_heap.gemspec +2 -1
- data/docs/benchmarks-2.txt +52 -0
- data/docs/benchmarks.txt +443 -0
- data/docs/profile.txt +392 -0
- data/ext/d_heap/d_heap.c +428 -150
- data/ext/d_heap/d_heap.h +6 -3
- data/ext/d_heap/extconf.rb +8 -3
- data/lib/benchmark_driver/runner/ips_zero_fail.rb +120 -0
- data/lib/d_heap.rb +5 -3
- data/lib/d_heap/benchmarks.rb +111 -0
- data/lib/d_heap/benchmarks/benchmarker.rb +113 -0
- data/lib/d_heap/benchmarks/implementations.rb +168 -0
- data/lib/d_heap/benchmarks/profiler.rb +71 -0
- data/lib/d_heap/benchmarks/rspec_matchers.rb +374 -0
- data/lib/d_heap/version.rb +1 -1
- metadata +34 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 413c0a93e2c3cbdbb86ee433df47a310034d453e441a150d8317dc055b4a9a90
+  data.tar.gz: 4bf67447021da03b07da7f44bcf97a66f13fa42f6f67bcfe9a49d0866c8b8167
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5e55bf53c1062686e0863fb9c3b09f3c2b8b936b0cf83985092e1e906b0b24f40e02a42eada048dcea732a60ec4e3695bb861943b424a8cd3152b227abad8a4e
+  data.tar.gz: e021616d6dcdcec943fec11783f2147a2d175aa3a0caf668c2339e05795e32475cdf7be90201c38d7108c74087e22140a9da9d3f211d7f0335e3c173ae83893b
data/.rubocop.yml
CHANGED
@@ -6,6 +6,7 @@ AllCops:
   TargetRubyVersion: 2.5
   NewCops: disable
   Exclude:
+    - bin/benchmark-driver
    - bin/rake
    - bin/rspec
    - bin/rubocop
@@ -106,26 +107,49 @@ Naming/RescuedExceptionsVariableName: { Enabled: false }
 ###########################################################################
 # Matrics:
 
+Metrics/CyclomaticComplexity:
+  Max: 10
+
 # Although it may be better to split specs into multiple files...?
 Metrics/BlockLength:
   Exclude:
     - "spec/**/*_spec.rb"
+  CountAsOne:
+    - array
+    - hash
+    - heredoc
+
+Metrics/ClassLength:
+  Max: 200
+  CountAsOne:
+    - array
+    - hash
+    - heredoc
 
 ###########################################################################
 # Style...
 
 Style/AccessorGrouping: { Enabled: false }
 Style/AsciiComments: { Enabled: false } # 👮 can't stop our 🎉🥳🎊🥳!
+Style/ClassAndModuleChildren: { Enabled: false }
 Style/EachWithObject: { Enabled: false }
 Style/FormatStringToken: { Enabled: false }
 Style/FloatDivision: { Enabled: false }
+Style/IfUnlessModifier: { Enabled: false }
+Style/IfWithSemicolon: { Enabled: false }
 Style/Lambda: { Enabled: false }
 Style/LineEndConcatenation: { Enabled: false }
 Style/MixinGrouping: { Enabled: false }
+Style/MultilineBlockChain: { Enabled: false }
 Style/PerlBackrefs: { Enabled: false } # use occasionally/sparingly
 Style/RescueStandardError: { Enabled: false }
+Style/Semicolon: { Enabled: false }
 Style/SingleLineMethods: { Enabled: false }
 Style/StabbyLambdaParentheses: { Enabled: false }
+Style/WhenThen : { Enabled: false }
+
+# I require trailing commas elsewhere, but these are optional
+Style/TrailingCommaInArguments: { Enabled: false }
 
 # If rubocop had an option to only enforce this on constants and literals (e.g.
 # strings, regexp, range), I'd agree.
@@ -149,7 +173,9 @@ Style/BlockDelimiters:
   EnforcedStyle: semantic
   AllowBracesOnProceduralOneLiners: true
   IgnoredMethods:
-    - expect
+    - expect  # rspec
+    - profile # ruby-prof
+    - ips     # benchmark-ips
 
 
 Style/FormatString:
@@ -168,3 +194,6 @@ Style/TrailingCommaInHashLiteral:
 
 Style/TrailingCommaInArrayLiteral:
   EnforcedStyleForMultiline: consistent_comma
+
+Style/YodaCondition:
+  EnforcedStyle: forbid_for_equality_operators_only
data/CHANGELOG.md
ADDED
@@ -0,0 +1,42 @@
+## Current/Unreleased
+
+## Release v0.4.0 (2021-01-12)
+
+* ⚡️ Big performance improvements, by using C `long double *cscores` array
+* ⚡️ Scores must be `Integer` in `-uint64..+uint64`, or convertable to `Float`
+* ⚡️ many many (so many) updates to benchmarks
+* ✨ Added `DHeap#clear`
+* 🐛 Fixed `DHeap#initialize_copy` and `#freeze`
+* ♻️ significant refactoring
+* 📝 Updated docs (mostly adding benchmarks)
+
+## Release v0.3.0 (2020-12-29)
+
+* ⚡️ Big performance improvements, by converting to a `T_DATA` struct.
+* ♻️ Major refactoring/rewriting of dheap.c
+* ✅ Added benchmark specs
+* 🔥 Removed class methods that operated directly on an array.  They weren't
+  compatible with the performance improvements.
+
+## Release v0.2.2 (2020-12-27)
+
+* 🐛 fix `optimized_cmp`, avoiding internal symbols
+* 📝 Update documentation
+* 💚 fix macos CI
+* ➕ Add rubocop 👮🎨
+
+## Release v0.2.1 (2020-12-26)
+
+* ⬆️ Upgraded rake (and bundler) to support ruby 3.0
+
+## Release v0.2.0 (2020-12-24)
+
+* ✨ Add ability to push separate score and value
+* ⚡️ Big performance gain, by storing scores separately and using ruby's
+  internal `OPTIMIZED_CMP` instead of always directly calling `<=>`
+
+## Release v0.1.0 (2020-12-22)
+
+🎉 initial release 🎉
+
+* ✨ Add basic d-ary Heap implementation
data/Gemfile
CHANGED
data/Gemfile.lock
CHANGED
@@ -1,19 +1,22 @@
 PATH
   remote: .
   specs:
-    d_heap (0.3.0)
+    d_heap (0.4.0)
 
 GEM
   remote: https://rubygems.org/
   specs:
     ast (2.4.1)
-
-
-    benchmark-trend (0.4.0)
+    benchmark_driver (0.15.16)
+    coderay (1.1.3)
     diff-lcs (1.4.4)
+    method_source (1.0.0)
     parallel (1.19.2)
     parser (2.7.2.0)
       ast (~> 2.4.1)
+    pry (0.13.1)
+      coderay (~> 1.1)
+      method_source (~> 1.0)
     rainbow (3.0.0)
     rake (13.0.3)
     rake-compiler (1.1.1)
@@ -24,11 +27,6 @@ GEM
       rspec-core (~> 3.10.0)
       rspec-expectations (~> 3.10.0)
       rspec-mocks (~> 3.10.0)
-    rspec-benchmark (0.6.0)
-      benchmark-malloc (~> 0.2)
-      benchmark-perf (~> 0.6)
-      benchmark-trend (~> 0.4)
-      rspec (>= 3.0)
     rspec-core (3.10.0)
       rspec-support (~> 3.10.0)
     rspec-expectations (3.10.0)
@@ -49,6 +47,7 @@ GEM
       unicode-display_width (>= 1.4.0, < 2.0)
     rubocop-ast (1.1.1)
       parser (>= 2.7.1.5)
+    ruby-prof (1.4.2)
     ruby-progressbar (1.10.1)
     unicode-display_width (1.7.0)
 
@@ -56,12 +55,14 @@ PLATFORMS
   ruby
 
 DEPENDENCIES
+  benchmark_driver
   d_heap!
+  pry
   rake (~> 13.0)
   rake-compiler
   rspec (~> 3.10)
-  rspec-benchmark
   rubocop (~> 1.0)
+  ruby-prof
 
 BUNDLED WITH
    2.2.3
data/README.md
CHANGED
@@ -1,28 +1,64 @@
 # DHeap
 
-A fast _d_-ary heap implementation for ruby,
-
+A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
+implemented as a C extension.
+
+With a regular queue, you expect "FIFO" behavior: first in, first out.  With a
+stack you expect "LIFO": last in first out.  With a priority queue, you push
+elements along with a score and the lowest scored element is the first to be
+popped.  Priority queues are often used in algorithms for e.g. [scheduling] of
+timers or bandwidth management, [Huffman coding], and various graph search
+algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+
+The _d_-ary heap data structure is a generalization of the [binary heap], in
+which the nodes have _d_ children instead of 2.  This allows for "decrease
+priority" operations to be performed more quickly with the tradeoff of slower
+delete minimum.  Additionally, _d_-ary heaps can have better memory cache
+behavior than binary heaps, allowing them to run more quickly in practice
+despite slower worst-case time complexity.  In the worst case, a _d_-ary heap
+requires only `O(log n / log d)` to push, with the tradeoff that pop is
+`O(d log n / log d)`.
+
+Although you should probably just use the default _d_ value of `4` (see the
+analysis below), it's always advisable to benchmark your specific use-case.
+
+[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
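A _d_-ary heap is typically stored flat in an array, so the parent/child relationships described above reduce to index arithmetic.  A minimal sketch (the helper names below are illustrative, not part of the `DHeap` API):

```ruby
# Flat-array layout of a d-ary heap: the d children of node i live at
# indexes d*i + 1 .. d*i + d, and the parent of node i is at (i - 1) / d.
# (Helper names are illustrative, not part of the DHeap API.)
def parent_index(index, d)
  (index - 1) / d
end

def child_index(index, k, d) # k-th child, 0 <= k < d
  d * index + k + 1
end

# For the default d=4, the root's children sit at indexes 1..4:
(0...4).map { |k| child_index(0, k, 4) } # => [1, 2, 3, 4]
```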
 
-
-the nodes have _d_ children instead of 2. This allows for "decrease priority"
-operations to be performed more quickly with the tradeoff of slower delete
-minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
-binary heaps, allowing them to run more quickly in practice despite slower
-worst-case time complexity. In the worst case, a _d_-ary heap requires only
-`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+## Usage
 
-
-
+The basic API is:
+* `heap << object` adds a value as its own score.
+* `heap.push(score, value)` adds a value with an extrinsic score.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
+* `heap.peek` to view the minimum value without popping it.
 
-
+The score must be an `Integer` or a `Float`, or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`).  Constraining scores to
+numeric values gives a 40+% speedup under some benchmarks!
 
-
-
+_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+long long`. _This is architecture dependent, but on a 64-bit system it gives a
+range of -18,446,744,073,709,551,615 to +18,446,744,073,709,551,615._
+
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
 
 ```ruby
 require "d_heap"
 
-
+Task = Struct.new(:id) # for demonstration
+
+heap = DHeap.new # defaults to a 4-heap
 
 # storing [score, value] tuples
 heap.push Time.now + 5*60, Task.new(1)
@@ -31,14 +67,61 @@ heap.push Time.now + 60, Task.new(3)
 heap.push Time.now + 5, Task.new(4)
 
 # peeking and popping (using last to get the task and ignore the time)
-heap.pop
-heap.pop
-heap.peek
-heap.pop
-heap.pop
+heap.pop    # => Task[4]
+heap.pop    # => Task[2]
+heap.peek   # => Task[3], but don't pop it from the heap
+heap.pop    # => Task[3]
+heap.pop    # => Task[1]
+heap.empty? # => true
+heap.pop    # => nil
 ```
 
-
+If your values behave as their own score, by being convertible via
+`Float(value)`, then you can use `#<<` for implicit scoring.  The score should
+not change for as long as the value remains in the heap, since it will not be
+re-evaluated after being pushed.
+
+```ruby
+heap.clear
+
+# The score can be derived from the value by using to_f.
+# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+class Event
+  include Comparable
+  attr_reader :time, :payload
+  alias_method :to_time, :time
+
+  def initialize(time, payload)
+    @time = time.to_time
+    @payload = payload
+    freeze
+  end
+
+  def to_f
+    time.to_f
+  end
+
+  def <=>(other)
+    to_f <=> other.to_f
+  end
+end
+
+heap << comparable_max # sorts last, using <=>
+heap << comparable_min # sorts first, using <=>
+heap << comparable_mid # sorts in the middle, using <=>
+heap.pop    # => comparable_min
+heap.pop    # => comparable_mid
+heap.pop    # => comparable_max
+heap.empty? # => true
+heap.pop    # => nil
+```
+
+You can also pass a value into `#pop(max)` which will only pop if the minimum
+score is less than or equal to `max`.
+
+Read the [API documentation] for more detailed documentation and examples.
+
+[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
 
 ## Installation
 
@@ -58,109 +141,226 @@ Or install it yourself as:
 
 ## Motivation
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-a few hundred items at once, the overhead of those extra calls to `<=>` is far
-more than occasionally calling `memcpy`.
-
-It's likely that MJIT will eventually make the C-extension completely
-unnecessary. This is definitely hotspot code, and the basic ruby implementation
-would work fine, if not for that `<=>` overhead. Until then... this gem gets
-the job done.
-
-## TODOs...
-
-_TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
-~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
-directly handled as a priority queue. These extension methods are meant to be
-used similarly to how `#bsearch` and `#bsearch_index` might be used.
-
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
-and adjustments, to achieve faster average runtime for *add* and *cancel*
-operations.
-
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
-features that are loosely inspired by go's timers. e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.
-
-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap. In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers. And timers are usually canceled before they have a chance to
-run. While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children. That diminishes the extra comparison overhead
-during sift-down.
-
-## Benchmarks
-
-_TODO: put benchmarks here._
+One naive approach to a priority queue is to maintain an array in sorted order.
+This can be very simply implemented using `Array#bsearch_index` +
+`Array#insert`.  This can be very fast—`Array#pop` is `O(1)`—but the
+worst-case for insert is `O(n)` because it may need to `memcpy` a significant
+portion of the array.
+
+The standard way to implement a priority queue is with a binary heap.
+Although this increases the time for `pop`, it converts the amortized time per
+push + pop from `O(n)` to `O(d log n / log d)`.
+
+However, I was surprised to find that—at least for some benchmarks—my pure
+ruby heap implementation was much slower than inserting into and popping from
+a fully sorted array.  The reason for this surprising result: although it is
+`O(n)`, `memcpy` has a _very_ small constant factor, and calling `<=>` from
+ruby code has relatively _much_ larger constant factors.  If your queue
+contains only a few thousand items, the overhead of those extra calls to `<=>`
+is _far_ more than occasionally calling `memcpy`.  In the worst case, a
+_d_-heap will require `d + 1` times more comparisons for each push + pop than
+a `bsearch` + `insert` sorted array.
+
+Moving the sift-up and sift-down code into C helps some.  But much more
+helpful is optimizing the comparison of numeric scores, so `a <=> b` never
+needs to be called.  I'm hopeful that MJIT will eventually obsolete this
+C-extension.  JRuby or TruffleRuby may already run the pure ruby version at
+high speed.  This can be hotspot code, and the basic ruby implementation
+should perform well if not for the high overhead of `<=>`.
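The bsearch + insert approach described above can be sketched in a few lines of pure ruby.  This is only an illustration (the class name is made up; the gem's actual benchmark implementation lives in `lib/d_heap/benchmarks/implementations.rb`).  It keeps the array sorted by score descending, so that `Array#pop` removes the minimum in `O(1)`:

```ruby
# Illustrative sorted-array priority queue, kept sorted by score descending
# so the minimum is always the last entry (popped in O(1)).
class SortedArrayPQ
  def initialize
    @entries = [] # [score, value] pairs, sorted by score descending
  end

  # O(log n) to find the index, but O(n) worst-case for insert (memcpy)
  def push(score, value)
    index = @entries.bsearch_index { |(s, _)| s <= score } || @entries.size
    @entries.insert(index, [score, value])
    self
  end

  # O(1): the minimum score is always the last entry
  def pop
    entry = @entries.pop
    entry && entry.last
  end
end
```

Each `push` costs one `bsearch_index` plus one potentially large `memcpy` inside `insert`, which is exactly the tradeoff discussed above.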
 
 ## Analysis
 
 ### Time complexity
 
-
-
-
+There are two fundamental heap operations: sift-up (used by push) and
+sift-down (used by pop).
+
+* Both sift operations can perform as many as `log n / log d` swaps, as the
+  element may sift from the bottom of the tree to the top, or vice versa.
+* Sift-up performs a single comparison per swap: `O(1)`.
+  So pushing a new element is `O(log n / log d)`.
+* Sift-down performs as many as `d` comparisons per swap: `O(d)`.
+  So popping the min element is `O(d log n / log d)`.
 
-
-
+Assuming every inserted element is eventually deleted from the root, d=4
+requires the fewest comparisons for combined insert and delete:
 
-
-
-
-
-
-
-* (1 + 6) lg 6 = 3.906774
-* etc...
+* (1 + 2) lg 2 = 4.328085
+* (1 + 3) lg 3 = 3.640957
+* (1 + 4) lg 4 = 3.606738
+* (1 + 5) lg 5 = 3.728010
+* (1 + 6) lg 6 = 3.906774
+* etc...
 
 Leaf nodes require no comparisons to shift down, and higher values for d have
 a higher percentage of leaf nodes:
-
-
-
-
+
+* d=2 has ~50% leaves,
+* d=3 has ~67% leaves,
+* d=4 has ~75% leaves,
+* and so on...
 
 See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
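The numbers above can be reproduced in a couple of lines: worst-case comparisons per combined push + pop are proportional to `(1 + d) * log(n) / log(d)`, and dividing out the common `log(n)` factor leaves `(1 + d) / ln(d)`; the leaf fraction of a large complete _d_-ary tree approaches `(d - 1) / d`:

```ruby
# Reproduces the "(1 + d) lg d" comparison factors and leaf percentages above.
(2..6).each do |d|
  comparisons = (1 + d) / Math.log(d)   # relative push + pop comparison cost
  leaves      = 100.0 * (d - 1) / d     # ~% of nodes that are leaves
  printf "d=%d: %.6f comparisons, ~%d%% leaves\n", d, comparisons, leaves.round
end
```

The factor is minimized at d=4, which is why it is the default.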
 
 ### Space complexity
 
-
-
+Space usage is linear, regardless of d.  However, higher d values may provide
+better cache locality.  Because the heap is a complete tree, the elements can
+be stored in an array, without the need for tree or list pointers.
+
+Ruby can compare Numeric values _much_ faster than other ruby objects, even
+if those objects simply delegate comparison to internal Numeric values.  And
+it is often useful to use external scores for otherwise uncomparable values.
+So `DHeap` uses twice as many entries (one for score and one for value) as an
+array which only stores values.
 
-
-
-
-
-
-
|
217
|
+
|
218
|
+
_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
|
219
|
+
`docs/profile.txt` for more details or updated results. These benchmarks were
|
220
|
+
measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
|
221
|
+
|
222
|
+
These benchmarks use very simple implementations for a pure-ruby heap and an
|
223
|
+
array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
|
224
|
+
comparison, an alternate implementation `Array#min` and `Array#delete_at` is
|
225
|
+
also shown.
|
226
|
+
|
227
|
+
Three different scenarios are measured:
|
228
|
+
* push N values but never pop (clearing between each set of pushes).
|
229
|
+
* push N values and then pop N values.
|
230
|
+
Although this could be used for heap sort, we're unlikely to choose heap sort
|
231
|
+
over Ruby's quick sort implementation. I'm using this scenario to represent
|
232
|
+
the amortized cost of creating a heap and (eventually) draining it.
|
233
|
+
* For a heap of size N, repeatedly push and pop while keeping a stable size.
|
234
|
+
This is a _very simple_ approximation for how most scheduler/timer heaps
|
235
|
+
would be used. Usually when a timer fires it will be quickly replaced by a
|
236
|
+
new timer, and the overall count of timers will remain roughly stable.
|
237
|
+
|
238
|
+
In these benchmarks, `DHeap` runs faster than all other implementations for
|
239
|
+
every scenario and every value of N, although the difference is much more
|
240
|
+
noticable at higher values of N. The pure ruby heap implementation is
|
241
|
+
competitive for `push` alone at every value of N, but is significantly slower
|
242
|
+
than bsearch + insert for push + pop until N is _very_ large (somewhere between
|
243
|
+
10k and 100k)!
|
244
|
+
|
245
|
+
For very small N values the benchmark implementations, `DHeap` runs faster than
|
246
|
+
the other implementations for each scenario, although the difference is still
|
247
|
+
relatively small. The pure ruby binary heap is 2x or more slower than bsearch +
|
248
|
+
insert for common common push/pop scenario.
|
249
|
+
|
250
|
+
== push N (N=5) ==========================================================
|
251
|
+
push N (c_dheap): 1701338.1 i/s
|
252
|
+
push N (rb_heap): 971614.1 i/s - 1.75x slower
|
253
|
+
push N (bsearch): 946363.7 i/s - 1.80x slower
|
254
|
+
|
255
|
+
== push N then pop N (N=5) ===============================================
|
256
|
+
push N + pop N (c_dheap): 1087944.8 i/s
|
257
|
+
push N + pop N (findmin): 841708.1 i/s - 1.29x slower
|
258
|
+
push N + pop N (bsearch): 773252.7 i/s - 1.41x slower
|
259
|
+
push N + pop N (rb_heap): 471852.9 i/s - 2.31x slower
|
260
|
+
|
261
|
+
== Push/pop with pre-filled queue of size=N (N=5) ========================
|
262
|
+
push + pop (c_dheap): 5525418.8 i/s
|
263
|
+
push + pop (findmin): 5003904.8 i/s - 1.10x slower
|
264
|
+
push + pop (bsearch): 4320581.8 i/s - 1.28x slower
|
265
|
+
push + pop (rb_heap): 2207042.0 i/s - 2.50x slower
|
266
|
+
|
267
|
+
By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
|
268
|
+
scenarios, but the pure ruby heap is still slower than every other
|
269
|
+
implementation—even resorting the array after every `#push`—in any scenario that
|
270
|
+
uses `#pop`.
|
271
|
+
|
272
|
+
== push N (N=21) =========================================================
|
273
|
+
push N (c_dheap): 408307.0 i/s
|
274
|
+
push N (rb_heap): 212275.2 i/s - 1.92x slower
|
275
|
+
push N (bsearch): 169583.2 i/s - 2.41x slower
|
276
|
+
|
277
|
+
== push N then pop N (N=21) ==============================================
|
278
|
+
push N + pop N (c_dheap): 199435.5 i/s
|
279
|
+
push N + pop N (findmin): 162024.5 i/s - 1.23x slower
|
280
|
+
push N + pop N (bsearch): 146284.3 i/s - 1.36x slower
|
281
|
+
push N + pop N (rb_heap): 72289.0 i/s - 2.76x slower
|
282
|
+
|
283
|
+
== Push/pop with pre-filled queue of size=N (N=21) =======================
|
284
|
+
push + pop (c_dheap): 4836860.0 i/s
|
285
|
+
push + pop (findmin): 4467453.9 i/s - 1.08x slower
|
286
|
+
push + pop (bsearch): 3345458.4 i/s - 1.45x slower
|
287
|
+
push + pop (rb_heap): 1560476.3 i/s - 3.10x slower
|
288
|
+
|
289
|
+
At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
|
290
|
+
of `DHeap#push`, while insert's linear growth causes it to run slower and
|
291
|
+
slower. But because `#pop` is O(1) for a sorted array and O(d log n / log d)
|
292
|
+
for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
|
293
|
+
increases to 5k:
|
294
|
+
|
295
|
+
== Push/pop with pre-filled queue of size=N (N=5461) ==============
|
296
|
+
push + pop (c_dheap): 2718225.1 i/s
|
297
|
+
push + pop (bsearch): 1793546.4 i/s - 1.52x slower
|
298
|
+
push + pop (rb_heap): 707139.9 i/s - 3.84x slower
|
299
|
+
push + pop (findmin): 122316.0 i/s - 22.22x slower
|
300
|
+
|
301
|
+
Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
|
302
|
+
for the repeated push/pop scenario, all the way up to N as high as 87k:
|
303
|
+
|
304
|
+
== push N (N=87381) ======================================================
|
305
|
+
push N (c_dheap): 92.8 i/s
|
306
|
+
push N (rb_heap): 43.5 i/s - 2.13x slower
|
307
|
+
push N (bsearch): 2.9 i/s - 31.70x slower
|
308
|
+
|
309
|
+
== push N then pop N (N=87381) ===========================================
|
310
|
+
push N + pop N (c_dheap): 22.6 i/s
|
311
|
+
push N + pop N (rb_heap): 5.5 i/s - 4.08x slower
|
312
|
+
push N + pop N (bsearch): 2.9 i/s - 7.90x slower
|
313
|
+
|
314
|
+
== Push/pop with pre-filled queue of size=N (N=87381) ====================
|
315
|
+
push + pop (c_dheap): 1815277.3 i/s
|
316
|
+
push + pop (bsearch): 762343.2 i/s - 2.38x slower
|
317
|
+
push + pop (rb_heap): 535913.6 i/s - 3.39x slower
|
318
|
+
push + pop (findmin): 2262.8 i/s - 802.24x slower
|
319
|
+
|
320
|
+
## Profiling
|
321
|
+
|
322
|
+
_n.b. `Array#fetch` is reading the input data, external to heap operations.
|
323
|
+
These benchmarks use integers for all scores, which enables significantly faster
|
324
|
+
comparisons. If `a <=> b` were used instead, then the difference between push
|
325
|
+
and pop would be much larger. And ruby's `Tracepoint` impacts these different
|
326
|
+
implementations differently. So we can't use these profiler results for
|
327
|
+
comparisons between implementations. A sampling profiler would be needed for
|
328
|
+
more accurate relative measurements._
|
329
|
+
|
330
|
+
It's informative to look at the `ruby-prof` results for a simple binary search +
|
331
|
+
insert implementation, repeatedly pushing and popping to a large heap. In
|
332
|
+
particular, even with 1000 members, the linear `Array#insert` is _still_ faster
|
333
|
+
than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
|
334
|
+
still (relatively) slow and `memcpy` is (relatively) quite fast!
|
335
|
+
|
336
|
+
%self total self wait child calls name location
|
337
|
+
34.79 2.222 2.222 0.000 0.000 1000000 Array#insert
|
338
|
+
32.59 2.081 2.081 0.000 0.000 1000000 Array#bsearch_index
|
339
|
+
12.84 6.386 0.820 0.000 5.566 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
|
340
|
+
10.38 4.966 0.663 0.000 4.303 1000000 DHeap::Benchmarks::BinarySearchAndInsert#<< d_heap/benchmarks/implementations.rb:61
|
341
|
+
5.38 0.468 0.343 0.000 0.125 1000000 DHeap::Benchmarks::BinarySearchAndInsert#pop d_heap/benchmarks/implementations.rb:70
|
342
|
+
2.06 0.132 0.132 0.000 0.000 1000000 Array#fetch
|
343
|
+
1.95 0.125 0.125 0.000 0.000 1000000 Array#pop
|
344
|
+
|
345
|
+
Contrast this with a simplistic pure-ruby implementation of a binary heap:
|
346
|
+
|
347
|
+
%self total self wait child calls name location
|
348
|
+
48.52 8.487 8.118 0.000 0.369 1000000 DHeap::Benchmarks::NaiveBinaryHeap#pop d_heap/benchmarks/implementations.rb:96
|
349
|
+
42.94 7.310 7.184 0.000 0.126 1000000 DHeap::Benchmarks::NaiveBinaryHeap#<< d_heap/benchmarks/implementations.rb:80
|
350
|
+
4.80 16.732 0.803 0.000 15.929 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
|
351
|
+
|
352
|
+
You can see that it spends almost more time in pop than it does in push. That
|
353
|
+
is expected behavior for a heap: although both are O(log n), pop is
|
354
|
+
significantly more complex, and has _d_ comparisons per layer.
|
355
|
+
|
356
|
+
And `DHeap` shows a similar comparison between push and pop, although it spends
|
357
|
+
half of its time in the benchmark code (which is written in ruby):
|
358
|
+
|
359
|
+
%self total self wait child calls name location
|
360
|
+
43.09 1.685 0.726 0.000 0.959 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
|
361
|
+
26.05 0.439 0.439 0.000 0.000 1000000 DHeap#<<
|
362
|
+
23.57 0.397 0.397 0.000 0.000 1000000 DHeap#pop
|
363
|
+
7.29 0.123 0.123 0.000 0.000 1000000 Array#fetch
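For reference, a simplistic pure-ruby binary heap along the lines of the `NaiveBinaryHeap` profiled above might look like the following sketch (the benchmarked version lives in `lib/d_heap/benchmarks/implementations.rb`; this one stores `[score, value]` pairs and is illustrative only):

```ruby
# Illustrative pure-ruby binary min-heap over [score, value] pairs.
class NaiveBinaryHeap
  def initialize
    @heap = [] # [score, value] pairs in heap order
  end

  def <<(entry) # sift-up: a single comparison per swap
    @heap << entry
    index = @heap.size - 1
    while index > 0
      parent = (index - 1) / 2
      break if @heap[parent].first <= @heap[index].first
      @heap[parent], @heap[index] = @heap[index], @heap[parent]
      index = parent
    end
    self
  end

  def pop # sift-down: up to two comparisons per layer (d = 2)
    return nil if @heap.empty?
    min = @heap.first
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      index = 0
      loop do
        child = 2 * index + 1
        break if @heap.size <= child
        right = child + 1
        child = right if right < @heap.size && @heap[right].first < @heap[child].first
        break if @heap[index].first <= @heap[child].first
        @heap[index], @heap[child] = @heap[child], @heap[index]
        index = child
      end
    end
    min.last
  end
end
```

Note how sift-down in `pop` does up to two comparisons per layer (find the smaller child, then compare with the parent), while sift-up in `<<` does only one, matching the profile above where `pop` dominates.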
 
 ### Timers
 
@@ -178,22 +378,54 @@ faster than a delete and re-insert.
 
 ## Alternative data structures
 
+As always, you should run benchmarks with your expected scenarios to
+determine which is right for you.
+
 Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be
-O(n) for insertions,
-
-
-
+`#bsearch_index` and `#insert` might be just fine!  As discussed above,
+although it is `O(n)` for insertions, `memcpy` is so fast on modern hardware
+that this may not matter.  Also, if you can arrange for insertions to occur
+near the end of the array, that could significantly reduce the `memcpy`
+overhead even more.
+
+More complex heap variants, e.g. the [Fibonacci heap], can allow heaps to be
+merged as well as lower amortized time.
+
+[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
 
 If it is important to be able to quickly enumerate the set or find the ranking
-of values in it, then you
-
-
-
-
-
-
-be
+of values in it, then you may want to use a self-balancing binary search tree
+(e.g. a [red-black tree]) or a [skip-list].
+
+[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+[skip-list]: https://en.wikipedia.org/wiki/Skip_list
+
+[Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in
+that family of data structures) can be constructed to have effectively `O(1)`
+running time in most cases.  Although the implementation for that data
+structure is more complex than a heap, it may be necessary for enormous
+values of N.
+
+[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## TODOs...
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+basic heap with an internal `Hash`, which maps a set of values to scores.
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains
+some features that are loosely inspired by go's timers. e.g.: It lazily sifts
+its heap after deletion and adjustments, to achieve faster average runtime
+for *add* and *cancel* operations.
+
+Additionally, I was inspired by reading go's "timer.go" implementation to
+experiment with a 4-ary heap instead of the traditional binary heap.  In the
+case of timers, new timers are usually scheduled to run after most of the
+existing timers.  And timers are usually canceled before they have a chance
+to run.  While a binary heap holds 50% of its elements in its last layer,
+75% of a 4-ary heap's elements will have no children.  That diminishes the
+extra comparison overhead during sift-down.
 
 ## Development
 