d_heap 0.3.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 5a20fe814944fa8945bc2cc0ce87f810c8bf8d21102b8c454ae5d639f0548576
- data.tar.gz: 65a1af345ae84c7f6e5da8af89a00730ffc4cbdb6bb2cac59461c3728eb0ad28
+ metadata.gz: 413c0a93e2c3cbdbb86ee433df47a310034d453e441a150d8317dc055b4a9a90
+ data.tar.gz: 4bf67447021da03b07da7f44bcf97a66f13fa42f6f67bcfe9a49d0866c8b8167
  SHA512:
- metadata.gz: 27c240634013397033925ee258de08135587c31c7a046a902c028842680dadec20e2ccc0f76570e6f7db34d2292e5dae2ae7d17449436f890498697de4352c91
- data.tar.gz: 8eb2cba120747cc7788c42f264d2109ef1afee8a28caa7effb8bcb1ff36ed7c99a0556b48b7d9e9479f6e8be8945eea3bf1ddb7f608c13c63252684643154179
+ metadata.gz: 5e55bf53c1062686e0863fb9c3b09f3c2b8b936b0cf83985092e1e906b0b24f40e02a42eada048dcea732a60ec4e3695bb861943b424a8cd3152b227abad8a4e
+ data.tar.gz: e021616d6dcdcec943fec11783f2147a2d175aa3a0caf668c2339e05795e32475cdf7be90201c38d7108c74087e22140a9da9d3f211d7f0335e3c173ae83893b
@@ -6,6 +6,7 @@ AllCops:
  TargetRubyVersion: 2.5
  NewCops: disable
  Exclude:
+ - bin/benchmark-driver
  - bin/rake
  - bin/rspec
  - bin/rubocop
@@ -106,26 +107,49 @@ Naming/RescuedExceptionsVariableName: { Enabled: false }
  ###########################################################################
  # Matrics:

+ Metrics/CyclomaticComplexity:
+   Max: 10
+
  # Although it may be better to split specs into multiple files...?
  Metrics/BlockLength:
    Exclude:
    - "spec/**/*_spec.rb"
+   CountAsOne:
+   - array
+   - hash
+   - heredoc
+
+ Metrics/ClassLength:
+   Max: 200
+   CountAsOne:
+   - array
+   - hash
+   - heredoc

  ###########################################################################
  # Style...

  Style/AccessorGrouping: { Enabled: false }
  Style/AsciiComments: { Enabled: false } # 👮 can't stop our 🎉🥳🎊🥳!
+ Style/ClassAndModuleChildren: { Enabled: false }
  Style/EachWithObject: { Enabled: false }
  Style/FormatStringToken: { Enabled: false }
  Style/FloatDivision: { Enabled: false }
+ Style/IfUnlessModifier: { Enabled: false }
+ Style/IfWithSemicolon: { Enabled: false }
  Style/Lambda: { Enabled: false }
  Style/LineEndConcatenation: { Enabled: false }
  Style/MixinGrouping: { Enabled: false }
+ Style/MultilineBlockChain: { Enabled: false }
  Style/PerlBackrefs: { Enabled: false } # use occasionally/sparingly
  Style/RescueStandardError: { Enabled: false }
+ Style/Semicolon: { Enabled: false }
  Style/SingleLineMethods: { Enabled: false }
  Style/StabbyLambdaParentheses: { Enabled: false }
+ Style/WhenThen : { Enabled: false }
+
+ # I require trailing commas elsewhere, but these are optional
+ Style/TrailingCommaInArguments: { Enabled: false }

  # If rubocop had an option to only enforce this on constants and literals (e.g.
  # strings, regexp, range), I'd agree.
@@ -149,7 +173,9 @@ Style/BlockDelimiters:
  EnforcedStyle: semantic
  AllowBracesOnProceduralOneLiners: true
  IgnoredMethods:
- - expect
+ - expect # rspec
+ - profile # ruby-prof
+ - ips # benchmark-ips


  Style/FormatString:
@@ -168,3 +194,6 @@ Style/TrailingCommaInHashLiteral:

  Style/TrailingCommaInArrayLiteral:
    EnforcedStyleForMultiline: consistent_comma
+
+ Style/YodaCondition:
+   EnforcedStyle: forbid_for_equality_operators_only
@@ -0,0 +1,42 @@
+ ## Current/Unreleased
+
+ ## Release v0.4.0 (2021-01-12)
+
+ * ⚡️ Big performance improvements, by using C `long double *cscores` array
+ * ⚡️ Scores must be `Integer` in `-uint64..+uint64`, or convertible to `Float`
+ * ⚡️ many many (so many) updates to benchmarks
+ * ✨ Added `DHeap#clear`
+ * 🐛 Fixed `DHeap#initialize_copy` and `#freeze`
+ * ♻️ significant refactoring
+ * 📝 Updated docs (mostly adding benchmarks)
+
+ ## Release v0.3.0 (2020-12-29)
+
+ * ⚡️ Big performance improvements, by converting to a `T_DATA` struct.
+ * ♻️ Major refactoring/rewriting of dheap.c
+ * ✅ Added benchmark specs
+ * 🔥 Removed class methods that operated directly on an array. They weren't
+   compatible with the performance improvements.
+
+ ## Release v0.2.2 (2020-12-27)
+
+ * 🐛 fix `optimized_cmp`, avoiding internal symbols
+ * 📝 Update documentation
+ * 💚 fix macos CI
+ * ➕ Add rubocop 👮🎨
+
+ ## Release v0.2.1 (2020-12-26)
+
+ * ⬆️ Upgraded rake (and bundler) to support ruby 3.0
+
+ ## Release v0.2.0 (2020-12-24)
+
+ * ✨ Add ability to push separate score and value
+ * ⚡️ Big performance gain, by storing scores separately and using ruby's
+   internal `OPTIMIZED_CMP` instead of always directly calling `<=>`
+
+ ## Release v0.1.0 (2020-12-22)
+
+ 🎉 initial release 🎉
+
+ * ✨ Add basic d-ary Heap implementation
data/Gemfile CHANGED
@@ -5,6 +5,7 @@ source "https://rubygems.org"
  # Specify your gem's dependencies in d_heap.gemspec
  gemspec

+ gem "pry"
  gem "rake", "~> 13.0"
  gem "rake-compiler"
  gem "rspec", "~> 3.10"
@@ -1,19 +1,22 @@
  PATH
    remote: .
    specs:
-     d_heap (0.3.0)
+     d_heap (0.4.0)

  GEM
    remote: https://rubygems.org/
    specs:
      ast (2.4.1)
-     benchmark-malloc (0.2.0)
-     benchmark-perf (0.6.0)
-     benchmark-trend (0.4.0)
+     benchmark_driver (0.15.16)
+     coderay (1.1.3)
      diff-lcs (1.4.4)
+     method_source (1.0.0)
      parallel (1.19.2)
      parser (2.7.2.0)
        ast (~> 2.4.1)
+     pry (0.13.1)
+       coderay (~> 1.1)
+       method_source (~> 1.0)
      rainbow (3.0.0)
      rake (13.0.3)
      rake-compiler (1.1.1)
@@ -24,11 +27,6 @@ GEM
        rspec-core (~> 3.10.0)
        rspec-expectations (~> 3.10.0)
        rspec-mocks (~> 3.10.0)
-     rspec-benchmark (0.6.0)
-       benchmark-malloc (~> 0.2)
-       benchmark-perf (~> 0.6)
-       benchmark-trend (~> 0.4)
-       rspec (>= 3.0)
      rspec-core (3.10.0)
        rspec-support (~> 3.10.0)
      rspec-expectations (3.10.0)
@@ -49,6 +47,7 @@ GEM
        unicode-display_width (>= 1.4.0, < 2.0)
      rubocop-ast (1.1.1)
        parser (>= 2.7.1.5)
+     ruby-prof (1.4.2)
      ruby-progressbar (1.10.1)
      unicode-display_width (1.7.0)

@@ -56,12 +55,14 @@ PLATFORMS
  ruby

  DEPENDENCIES
+ benchmark_driver
  d_heap!
+ pry
  rake (~> 13.0)
  rake-compiler
  rspec (~> 3.10)
- rspec-benchmark
  rubocop (~> 1.0)
+ ruby-prof

  BUNDLED WITH
    2.2.3
data/README.md CHANGED
@@ -1,28 +1,64 @@
  # DHeap

- A fast _d_-ary heap implementation for ruby, useful in priority queues and graph
- algorithms.
+ A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
+ implemented as a C extension.
+
+ With a regular queue, you expect "FIFO" behavior: first in, first out. With a
+ stack you expect "LIFO": last in, first out. With a priority queue, you push
+ elements along with a score and the lowest scored element is the first to be
+ popped. Priority queues are often used in algorithms for e.g. [scheduling] of
+ timers or bandwidth management, [Huffman coding], and various graph search
+ algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+
+ The _d_-ary heap data structure is a generalization of the [binary heap], in
+ which the nodes have _d_ children instead of 2. This allows for "decrease
+ priority" operations to be performed more quickly with the tradeoff of slower
+ delete minimum. Additionally, _d_-ary heaps can have better memory cache
+ behavior than binary heaps, allowing them to run more quickly in practice
+ despite slower worst-case time complexity. In the worst case, a _d_-ary heap
+ requires only `O(log n / log d)` to push, with the tradeoff that pop is `O(d log
+ n / log d)`.
+
+ Although you should probably just use the default _d_ value of `4` (see the
+ analysis below), it's always advisable to benchmark your specific use-case.
+
+ [d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+ [priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+ [binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+ [scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+ [Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+ [Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+ [A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+ [Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

- The _d_-ary heap data structure is a generalization of the binary heap, in which
- the nodes have _d_ children instead of 2. This allows for "decrease priority"
- operations to be performed more quickly with the tradeoff of slower delete
- minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
- binary heaps, allowing them to run more quickly in practice despite slower
- worst-case time complexity. In the worst case, a _d_-ary heap requires only
- `O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+ ## Usage

- Although you should probably just stick with the default _d_ value of `4`, it
- may be worthwhile to benchmark your specific scenario.
+ The basic API is:
+ * `heap << object` adds a value as its own score.
+ * `heap.push(score, value)` adds a value with an extrinsic score.
+ * `heap.pop` removes and returns the value with the minimum score.
+ * `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
+ * `heap.peek` to view the minimum value without popping it.

- ## Usage
+ The score must be `Integer` or `Float` or convertible to a `Float` via
+ `Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
+ numeric values gives a 40+% speedup under some benchmarks!
+
+ _n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+ long long`. _This is architecture dependent but on an IA-64 system this is 64
+ bits, which gives a range of -18,446,744,073,709,551,615 to
+ +18,446,744,073,709,551,615._

- The simplest way to use it is simply with `#push` and `#pop`. Push takes a
- score and a value, and pop returns the value with the current minimum score.
+ _Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+ be added back in a future version,_ if (and only if) _it can be done without
+ impacting the speed of numeric comparisons._

  ```ruby
  require "d_heap"

- heap = DHeap.new # defaults to a 4-ary heap
+ Task = Struct.new(:id) # for demonstration
+
+ heap = DHeap.new # defaults to a 4-heap

  # storing [score, value] tuples
  heap.push Time.now + 5*60, Task.new(1)
64
  heap.push Time.now + 5*60, Task.new(1)
@@ -31,14 +67,61 @@ heap.push Time.now + 60, Task.new(3)
  heap.push Time.now + 5, Task.new(4)

  # peeking and popping (using last to get the task and ignore the time)
- heap.pop.last # => Task[4]
- heap.pop.last # => Task[2]
- heap.peak.last # => Task[3], but don't pop it
- heap.pop.last # => Task[3]
- heap.pop.last # => Task[1]
+ heap.pop # => Task[4]
+ heap.pop # => Task[2]
+ heap.peek # => Task[3], but don't pop it from the heap
+ heap.pop # => Task[3]
+ heap.pop # => Task[1]
+ heap.empty? # => true
+ heap.pop # => nil
  ```

- Read the `rdoc` for more detailed documentation and examples.
+ If your values behave as their own score, by being convertible via
+ `Float(value)`, then you can use `#<<` for implicit scoring. The score should
+ not change for as long as the value remains in the heap, since it will not be
+ re-evaluated after being pushed.
+
+ ```ruby
+ heap.clear
+
+ # The score can be derived from the value by using to_f.
+ # "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+ class Event
+   include Comparable
+   attr_reader :time, :payload
+   alias_method :to_time, :time
+
+   def initialize(time, payload)
+     @time = time.to_time
+     @payload = payload
+     freeze
+   end
+
+   def to_f
+     time.to_f
+   end
+
+   def <=>(other)
+     to_f <=> other.to_f
+   end
+ end
+
+ heap << comparable_max # sorts last, using <=>
+ heap << comparable_min # sorts first, using <=>
+ heap << comparable_mid # sorts in the middle, using <=>
+ heap.pop # => comparable_min
+ heap.pop # => comparable_mid
+ heap.pop # => comparable_max
+ heap.empty? # => true
+ heap.pop # => nil
+ ```
+
+ You can also pass a value into `#pop(max)` which will only pop if the minimum
+ score is less than or equal to `max`.
+
+ Read the [API documentation] for more detailed documentation and examples.
+
+ [API documentation]: https://rubydoc.info/gems/d_heap/DHeap

  ## Installation

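The API described in this Usage section can also be modeled in plain Ruby. The sketch below is a hypothetical pure-Ruby stand-in for the semantics only; the `MiniHeap` name and its sorted-array internals are illustrative assumptions, not the gem's actual C data structure:

```ruby
# Hypothetical pure-Ruby model of the DHeap API, for illustration only.
# It keeps [score, value] pairs in a sorted Array; the real gem is a C d-ary heap.
class MiniHeap
  def initialize
    @entries = [] # [[score, value], ...], ascending by score
  end

  # Add a value with an extrinsic score.
  def push(score, value)
    # find-minimum bsearch: first index whose score is greater than the new one
    index = @entries.bsearch_index { |(s, _)| score < s } || @entries.size
    @entries.insert(index, [score, value])
    self
  end

  # Values that convert via Float() act as their own score.
  def <<(value)
    push(Float(value), value)
  end

  def peek
    entry = @entries.first
    entry && entry.last
  end

  def pop
    entry = @entries.shift
    entry && entry.last
  end

  # Pop only when the minimum score is <= the given score.
  def pop_lte(max_score)
    pop if !empty? && @entries.first.first <= max_score
  end

  def empty?
    @entries.empty?
  end

  def clear
    @entries.clear
    self
  end
end

heap = MiniHeap.new
heap.push(3, "c")
heap.push(1, "a")
heap.push(2, "b")
heap.peek       # => "a"
heap.pop        # => "a"
heap.pop_lte(1) # => nil (minimum score 2 is > 1)
heap.pop_lte(2) # => "b"
```

This mirrors the method signatures and return values only; its performance characteristics are nothing like the C extension's.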
@@ -58,109 +141,226 @@ Or install it yourself as:

  ## Motivation

- Sometimes you just need a priority queue, right? With a regular queue, you
- expect "FIFO" behavior: first in, first out. With a priority queue, you push
- with a score (or your elements are comparable), and you want to be able to
- efficiently pop off the minimum (or maximum) element.
-
- One obvious approach is to simply maintain an array in sorted order. And
- ruby's Array class makes it simple to maintain a sorted array by combining
- `#bsearch_index` with `#insert`. With certain insert/remove workloads that can
- perform very well, but in the worst-case an insert or delete can result in O(n),
- since `#insert` may need to `memcpy` or `memmove` a significant portion of the
- array.
-
- But the standard way to efficiently and simply solve this problem is using a
- binary heap. Although it increases the time for `pop`, it converts the
- amortized time per push + pop from `O(n)` to `O(d log n / log d)`.
-
- I was surprised to find that, at least under certain benchmarks, my pure ruby
- heap implementation was usually slower than inserting into a fully sorted
- array. While this is a testament to ruby's fine-tuned Array implementationw, a
- heap implementated in C should easily peform faster than `Array#insert`.
-
- The biggest issue is that it just takes far too much time to call `<=>` from
- ruby code: A sorted array only requires `log n / log 2` comparisons to insert
- and no comparisons to pop. However a _d_-ary heap requires `log n / log d` to
- insert plus an additional `d log n / log d` to pop. If your queue contains only
- a few hundred items at once, the overhead of those extra calls to `<=>` is far
- more than occasionally calling `memcpy`.
-
- It's likely that MJIT will eventually make the C-extension completely
- unnecessary. This is definitely hotspot code, and the basic ruby implementation
- would work fine, if not for that `<=>` overhead. Until then... this gem gets
- the job done.
-
- ## TODOs...
-
- _TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
- ~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
- directly handled as a priority queue. These extension methods are meant to be
- used similarly to how `#bsearch` and `#bsearch_index` might be used.
-
- _TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
- basic heap with an internal `Hash`, which maps a set of values to scores.
- loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
- and adjustments, to achieve faster average runtime for *add* and *cancel*
- operations.
-
- _TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
- features that are loosely inspired by go's timers. e.g: It lazily sifts its
- heap after deletion and adjustments, to achieve faster average runtime for *add*
- and *cancel* operations.
-
- Additionally, I was inspired by reading go's "timer.go" implementation to
- experiment with a 4-ary heap instead of the traditional binary heap. In the
- case of timers, new timers are usually scheduled to run after most of the
- existing timers. And timers are usually canceled before they have a chance to
- run. While a binary heap holds 50% of its elements in its last layer, 75% of a
- 4-ary heap will have no children. That diminishes the extra comparison overhead
- during sift-down.
-
- ## Benchmarks
-
- _TODO: put benchmarks here._
+ One naive approach to a priority queue is to maintain an array in sorted order.
+ This can be very simply implemented using `Array#bsearch_index` + `Array#insert`.
+ This can be very fast—`Array#pop` is `O(1)`—but the worst-case for insert is
+ `O(n)` because it may need to `memcpy` a significant portion of the array.
+
+ The standard way to implement a priority queue is with a binary heap. Although
+ this increases the time for `pop`, it converts the amortized time per push + pop
+ from `O(n)` to `O(d log n / log d)`.
+
+ However, I was surprised to find that—at least for some benchmarks—my pure ruby
+ heap implementation was much slower than inserting into and popping from a fully
+ sorted array. The reason for this surprising result: although it is `O(n)`,
+ `memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
+ has relatively _much_ larger constant factors. If your queue contains only a
+ few thousand items, the overhead of those extra calls to `<=>` is _far_ more
+ than occasionally calling `memcpy`. In the worst case, a _d_-heap will require
+ `d + 1` times more comparisons for each push + pop than a `bsearch` + `insert`
+ sorted array.
+
+ Moving the sift-up and sift-down code into C helps some. But much more helpful
+ is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
+ called. I'm hopeful that MJIT will eventually obsolete this C-extension. JRuby
+ or TruffleRuby may already run the pure ruby version at high speed. This can be
+ hotspot code, and the basic ruby implementation should perform well if not for
+ the high overhead of `<=>`.

  ## Analysis

  ### Time complexity

- Both sift operations can perform (log[d] n = log n / log d) swaps.
- Swap up performs only a single comparison per swap: O(1).
- Swap down performs as many as d comparions per swap: O(d).
+ There are two fundamental heap operations: sift-up (used by push) and sift-down
+ (used by pop).
+
+ * Both sift operations can perform as many as `log n / log d` swaps, as the
+   element may sift from the bottom of the tree to the top, or vice versa.
+ * Sift-up performs a single comparison per swap: `O(1)`.
+   So pushing a new element is `O(log n / log d)`.
+ * Sift-down performs as many as d comparisons per swap: `O(d)`.
+   So popping the min element is `O(d log n / log d)`.

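A pure-Ruby sketch of these two sift operations (an illustration of the algorithm only, storing bare comparable scores rather than the gem's separate score/value entries; `PureRubyDHeap` is an illustrative name, not the gem's C implementation):

```ruby
# A minimal d-ary min-heap illustrating sift-up (push) and sift-down (pop).
class PureRubyDHeap
  def initialize(d = 4)
    @d = d
    @heap = []
  end

  def push(score)
    @heap << score
    sift_up(@heap.size - 1)
    self
  end

  def pop
    return nil if @heap.empty?
    min = @heap.first
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      sift_down(0)
    end
    min
  end

  private

  # One comparison per level: O(log n / log d)
  def sift_up(i)
    while i.positive?
      parent = (i - 1) / @d
      break if @heap[parent] <= @heap[i]
      @heap[parent], @heap[i] = @heap[i], @heap[parent]
      i = parent
    end
  end

  # Up to d comparisons per level: O(d log n / log d)
  def sift_down(i)
    loop do
      first_child = (@d * i) + 1
      break if first_child >= @heap.size
      last_child = [first_child + @d - 1, @heap.size - 1].min
      min_child = (first_child..last_child).min_by { |c| @heap[c] }
      break if @heap[i] <= @heap[min_child]
      @heap[i], @heap[min_child] = @heap[min_child], @heap[i]
      i = min_child
    end
  end
end

heap = PureRubyDHeap.new(4)
[5, 1, 4, 2, 3].each { |n| heap.push(n) }
heap.pop # => 1
heap.pop # => 2
```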
- Inserting an item is O(log n / log d).
- Deleting the root is O(d log n / log d).
+ Assuming every inserted element is eventually deleted from the root, d=4
+ requires the fewest comparisons for combined insert and delete:

- Assuming every inserted item is eventually deleted from the root, d=4 requires
- the fewest comparisons for combined insert and delete:
- * (1 + 2) lg 2 = 4.328085
- * (1 + 3) lg 3 = 3.640957
- * (1 + 4) lg 4 = 3.606738
- * (1 + 5) lg 5 = 3.728010
- * (1 + 6) lg 6 = 3.906774
- * etc...
+ * (1 + 2) lg 2 = 4.328085
+ * (1 + 3) lg 3 = 3.640957
+ * (1 + 4) lg 4 = 3.606738
+ * (1 + 5) lg 5 = 3.728010
+ * (1 + 6) lg 6 = 3.906774
+ * etc...

  Leaf nodes require no comparisons to shift down, and higher values for d have
  higher percentage of leaf nodes:
- * d=2 has ~50% leaves,
- * d=3 has ~67% leaves,
- * d=4 has ~75% leaves,
- * and so on...
+
+ * d=2 has ~50% leaves,
+ * d=3 has ~67% leaves,
+ * d=4 has ~75% leaves,
+ * and so on...

  See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.

  ### Space complexity

- Because the heap is a complete binary tree, space usage is linear, regardless
- of d. However higher d values may provide better cache locality.
+ Space usage is linear, regardless of d. However higher d values may
+ provide better cache locality. Because the heap is a complete binary tree, the
+ elements can be stored in an array, without the need for tree or list pointers.
+
+ Ruby can compare Numeric values _much_ faster than other ruby objects, even if
+ those objects simply delegate comparison to internal Numeric values. And it is
+ often useful to use external scores for otherwise uncomparable values. So
+ `DHeap` uses twice as many entries (one for score and one for value)
+ as an array which only stores values.

- We can run comparisons much much faster for Numeric or String objects than for
- ruby objects which delegate comparison to internal Numeric or String objects.
- And it is often advantageous to use extrinsic scores for uncomparable items.
- For this, our internal array uses twice as many entries (one for score and one
- for value) as it would if it only supported intrinsic comparison or used an
- un-memoized "sort_by" proc.
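The per-cycle comparison counts above can be sanity-checked in Ruby: one comparison per level sifting up plus d per level sifting down, over `log n / log d` levels, gives a cost proportional to `(1 + d) / ln d`, which reproduces the listed figures (so the `lg` notation above denotes the natural log). A quick check, including the leaf percentages:

```ruby
# Cost per push + pop cycle is proportional to (1 + d) / ln(d):
# 1 comparison per level sifting up, d per level sifting down.
(2..6).each do |d|
  cost   = (1 + d) / Math.log(d)
  leaves = (d - 1).fdiv(d) # fraction of leaf nodes as the heap grows large
  printf("d=%d  cost=%.6f  leaves=%2.0f%%\n", d, cost, leaves * 100)
end
# d=4 minimizes the cost (3.606738) and already has ~75% leaves.
```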
+ ## Benchmarks
+
+ _See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+ `docs/profile.txt` for more details or updated results. These benchmarks were
+ measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
+
+ These benchmarks use very simple implementations for a pure-ruby heap and an
+ array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
+ comparison, an alternate implementation using `Array#min` and `Array#delete_at`
+ is also shown.
+
+ Three different scenarios are measured:
+
+ * push N values but never pop (clearing between each set of pushes).
+ * push N values and then pop N values.
+   Although this could be used for heap sort, we're unlikely to choose heap sort
+   over Ruby's quick sort implementation. I'm using this scenario to represent
+   the amortized cost of creating a heap and (eventually) draining it.
+ * For a heap of size N, repeatedly push and pop while keeping a stable size.
+   This is a _very simple_ approximation for how most scheduler/timer heaps
+   would be used. Usually when a timer fires it will be quickly replaced by a
+   new timer, and the overall count of timers will remain roughly stable.
+
+ In these benchmarks, `DHeap` runs faster than all other implementations for
+ every scenario and every value of N, although the difference is much more
+ noticeable at higher values of N. The pure ruby heap implementation is
+ competitive for `push` alone at every value of N, but is significantly slower
+ than bsearch + insert for push + pop until N is _very_ large (somewhere between
+ 10k and 100k)!
+
+ For very small N values, `DHeap` runs faster than the other implementations in
+ each scenario, although the difference is still relatively small. The pure ruby
+ binary heap is 2x or more slower than bsearch + insert for the common push/pop
+ scenario.
+
+     == push N (N=5) ==========================================================
+     push N (c_dheap): 1701338.1 i/s
+     push N (rb_heap): 971614.1 i/s - 1.75x slower
+     push N (bsearch): 946363.7 i/s - 1.80x slower
+
+     == push N then pop N (N=5) ===============================================
+     push N + pop N (c_dheap): 1087944.8 i/s
+     push N + pop N (findmin): 841708.1 i/s - 1.29x slower
+     push N + pop N (bsearch): 773252.7 i/s - 1.41x slower
+     push N + pop N (rb_heap): 471852.9 i/s - 2.31x slower
+
+     == Push/pop with pre-filled queue of size=N (N=5) ========================
+     push + pop (c_dheap): 5525418.8 i/s
+     push + pop (findmin): 5003904.8 i/s - 1.10x slower
+     push + pop (bsearch): 4320581.8 i/s - 1.28x slower
+     push + pop (rb_heap): 2207042.0 i/s - 2.50x slower
+
+ By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
+ scenarios, but the pure ruby heap is still slower than every other
+ implementation—even resorting the array after every `#push`—in any scenario that
+ uses `#pop`.
+
+     == push N (N=21) =========================================================
+     push N (c_dheap): 408307.0 i/s
+     push N (rb_heap): 212275.2 i/s - 1.92x slower
+     push N (bsearch): 169583.2 i/s - 2.41x slower
+
+     == push N then pop N (N=21) ==============================================
+     push N + pop N (c_dheap): 199435.5 i/s
+     push N + pop N (findmin): 162024.5 i/s - 1.23x slower
+     push N + pop N (bsearch): 146284.3 i/s - 1.36x slower
+     push N + pop N (rb_heap): 72289.0 i/s - 2.76x slower
+
+     == Push/pop with pre-filled queue of size=N (N=21) =======================
+     push + pop (c_dheap): 4836860.0 i/s
+     push + pop (findmin): 4467453.9 i/s - 1.08x slower
+     push + pop (bsearch): 3345458.4 i/s - 1.45x slower
+     push + pop (rb_heap): 1560476.3 i/s - 3.10x slower
+
+ At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
+ of `DHeap#push`, while insert's linear growth causes it to run slower and
+ slower. But because `#pop` is O(1) for a sorted array and O(d log n / log d)
+ for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
+ increases to 5k:
+
+     == Push/pop with pre-filled queue of size=N (N=5461) ==============
+     push + pop (c_dheap): 2718225.1 i/s
+     push + pop (bsearch): 1793546.4 i/s - 1.52x slower
+     push + pop (rb_heap): 707139.9 i/s - 3.84x slower
+     push + pop (findmin): 122316.0 i/s - 22.22x slower
+
+ Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
+ for the repeated push/pop scenario, all the way up to N as high as 87k:
+
+     == push N (N=87381) ======================================================
+     push N (c_dheap): 92.8 i/s
+     push N (rb_heap): 43.5 i/s - 2.13x slower
+     push N (bsearch): 2.9 i/s - 31.70x slower
+
+     == push N then pop N (N=87381) ===========================================
+     push N + pop N (c_dheap): 22.6 i/s
+     push N + pop N (rb_heap): 5.5 i/s - 4.08x slower
+     push N + pop N (bsearch): 2.9 i/s - 7.90x slower
+
+     == Push/pop with pre-filled queue of size=N (N=87381) ====================
+     push + pop (c_dheap): 1815277.3 i/s
+     push + pop (bsearch): 762343.2 i/s - 2.38x slower
+     push + pop (rb_heap): 535913.6 i/s - 3.39x slower
+     push + pop (findmin): 2262.8 i/s - 802.24x slower
+
+ ## Profiling
+
+ _n.b. `Array#fetch` is reading the input data, external to heap operations.
+ These benchmarks use integers for all scores, which enables significantly faster
+ comparisons. If `a <=> b` were used instead, then the difference between push
+ and pop would be much larger. And ruby's `Tracepoint` impacts these different
+ implementations differently. So we can't use these profiler results for
+ comparisons between implementations. A sampling profiler would be needed for
+ more accurate relative measurements._
+
+ It's informative to look at the `ruby-prof` results for a simple binary search +
+ insert implementation, repeatedly pushing and popping to a large queue. In
+ particular, even with 1000 members, the linear `Array#insert` is _still_ faster
+ than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
+ still (relatively) slow and `memcpy` is (relatively) quite fast!
+
+     %self total self wait child calls name location
+     34.79 2.222 2.222 0.000 0.000 1000000 Array#insert
+     32.59 2.081 2.081 0.000 0.000 1000000 Array#bsearch_index
+     12.84 6.386 0.820 0.000 5.566 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+     10.38 4.966 0.663 0.000 4.303 1000000 DHeap::Benchmarks::BinarySearchAndInsert#<< d_heap/benchmarks/implementations.rb:61
+     5.38 0.468 0.343 0.000 0.125 1000000 DHeap::Benchmarks::BinarySearchAndInsert#pop d_heap/benchmarks/implementations.rb:70
+     2.06 0.132 0.132 0.000 0.000 1000000 Array#fetch
+     1.95 0.125 0.125 0.000 0.000 1000000 Array#pop
+
+ Contrast this with a simplistic pure-ruby implementation of a binary heap:
+
+     %self total self wait child calls name location
+     48.52 8.487 8.118 0.000 0.369 1000000 DHeap::Benchmarks::NaiveBinaryHeap#pop d_heap/benchmarks/implementations.rb:96
+     42.94 7.310 7.184 0.000 0.126 1000000 DHeap::Benchmarks::NaiveBinaryHeap#<< d_heap/benchmarks/implementations.rb:80
+     4.80 16.732 0.803 0.000 15.929 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+
+ You can see that it spends more time in pop than it does in push. That
+ is expected behavior for a heap: although both are O(log n), pop is
+ significantly more complex, and has _d_ comparisons per layer.
+
+ And `DHeap` shows a similar comparison between push and pop, although it spends
+ half of its time in the benchmark code (which is written in ruby):
+
+     %self total self wait child calls name location
+     43.09 1.685 0.726 0.000 0.959 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+     26.05 0.439 0.439 0.000 0.000 1000000 DHeap#<<
+     23.57 0.397 0.397 0.000 0.000 1000000 DHeap#pop
+     7.29 0.123 0.123 0.000 0.000 1000000 Array#fetch

  ### Timers

@@ -178,22 +378,54 @@ faster than a delete and re-insert.

  ## Alternative data structures

+ As always, you should run benchmarks with your expected scenarios to determine
+ which is right.
+
  Depending on what you're doing, maintaining a sorted `Array` using
- `#bsearch_index` and `#insert` might be faster! Although it is technically
- O(n) for insertions, the implementations for `memcpy` or `memmove` can be *very*
- fast on modern architectures. Also, it can be faster O(n) on average, if
- insertions are usually near the end of the array. You should run benchmarks
- with your expected scenarios to determine which is right.
+ `#bsearch_index` and `#insert` might be just fine! As discussed above, although
+ it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
+ may not matter. Also, if you can arrange for insertions to occur near the end
+ of the array, that could significantly reduce the `memcpy` overhead even more.
+
+ More complex heap variants, e.g. [Fibonacci heap], can allow heaps to be merged
+ as well as lower amortized time.
+
+ [Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap

  If it is important to be able to quickly enumerate the set or find the ranking
- of values in it, then you probably want to use a self-balancing binary search
- tree (e.g. a red-black tree) or a skip-list.
-
- A Hashed Timing Wheel or Heirarchical Timing Wheels (or some variant in that
- family of data structures) can be constructed to have effectively O(1) running
- time in most cases. However, the implementation for that data structure is more
- complex than a heap. If a 4-ary heap is good enough for go's timers, it should
- be suitable for many use cases.
+ of values in it, then you may want to use a self-balancing binary search tree
+ (e.g. a [red-black tree]) or a [skip-list].
+
+ [red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+ [skip-list]: https://en.wikipedia.org/wiki/Skip_list
+
+ [Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in that
+ family of data structures) can be constructed to have effectively `O(1)` running
+ time in most cases. Although the implementation for that data structure is more
+ complex than a heap, it may be necessary for enormous values of N.
+
+ [timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+ ## TODOs...
+
+ _TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+ basic heap with an internal `Hash`, which maps a set of values to scores,
+ loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
+ and adjustments, to achieve faster average runtime for *add* and *cancel*
+ operations.
+
+ _TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
+ features that are loosely inspired by go's timers. e.g: It lazily sifts its
+ heap after deletion and adjustments, to achieve faster average runtime for *add*
+ and *cancel* operations.
+
+ Additionally, I was inspired by reading go's "timer.go" implementation to
+ experiment with a 4-ary heap instead of the traditional binary heap. In the
+ case of timers, new timers are usually scheduled to run after most of the
+ existing timers. And timers are usually canceled before they have a chance to
+ run. While a binary heap holds 50% of its elements in its last layer, 75% of a
+ 4-ary heap will have no children. That diminishes the extra comparison overhead
+ during sift-down.

  ## Development