d_heap 0.6.1 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 1ad095ff29343f83c8bbe6fd0bc7f4acd79fa9c298aa4f8d007acf02ebedba30
-  data.tar.gz: b2806a066a173a83d12259342c3f7d90900c83dc628063955d861f05acc98796
+  metadata.gz: 5b51ed52baf74b585a7ab7799f92a446aef5852431ba10e146658b419657ffbe
+  data.tar.gz: cc7c6786eee78ec13214582b8701448d312f59fb723d12676fb673447ab409a7
 SHA512:
-  metadata.gz: 297aad8a8b4c7845fbea64808a2beaf4aa66b8431a23841c3d17952aaf85f41a3377c2dadc7651858e038adc69a35b2fe8e6ca484d45999f026efb41817e281b
-  data.tar.gz: 1e3f123c7f723c752b2e8326c70b4208188ad09c275574bd0cee3dc7a119c7e3f07173f4ad4ed32035d2103a10b1a979400dfa35bdc1dd55272b53bcc8eaa2b9
+  metadata.gz: 5de98f8c9084b30694fff5f8154a6e42e7e67d76518c25136ab4fb0c0afb047ad3c923f4544dcf613ded4c3b01417729aa796c973100faaa7ee93051fa630c7d
+  data.tar.gz: e5dbcc90da7adfba7ef45cd9a2da5fd1781a2bd489002a5ffc0a764915c035c178db30ae9b8431a8fc810cfa6f03a1b38ec0a50cbf23c2e1ba5dfc36549c0609
data/.clang-format ADDED
@@ -0,0 +1,21 @@
+---
+BasedOnStyle: mozilla
+IndentWidth: 4
+PointerAlignment: Right
+AlignAfterOpenBracket: Align
+AlignConsecutiveAssignments: true
+AlignConsecutiveDeclarations: true
+AlignConsecutiveBitFields: true
+AlignConsecutiveMacros: true
+AlignEscapedNewlines: Right
+AlignOperands: true
+
+AllowAllConstructorInitializersOnNextLine: false
+AllowShortIfStatementsOnASingleLine: WithoutElse
+
+IndentCaseLabels: false
+IndentPPDirectives: AfterHash
+
+ForEachMacros:
+  - WHILE_PEEK_LT_P
+...
@@ -23,4 +23,19 @@ jobs:
       run: |
         gem install bundler -v 2.2.3
        bundle install
-        bundle exec rake
+        bundle exec rake ci
+
+  benchmarks:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Ruby
+      uses: ruby/setup-ruby@v1
+      with:
+        ruby-version: 2.7
+        bundler-cache: true
+    - name: Run the benchmarks
+      run: |
+        gem install bundler -v 2.2.3
+        bundle install
+        bundle exec rake ci:benchmarks
data/.rubocop.yml CHANGED
@@ -135,6 +135,7 @@ Style/ClassAndModuleChildren: { Enabled: false }
 Style/EachWithObject: { Enabled: false }
 Style/FormatStringToken: { Enabled: false }
 Style/FloatDivision: { Enabled: false }
+Style/GuardClause: { Enabled: false } # usually nice to do, but...
 Style/IfUnlessModifier: { Enabled: false }
 Style/IfWithSemicolon: { Enabled: false }
 Style/Lambda: { Enabled: false }
data/CHANGELOG.md CHANGED
@@ -1,5 +1,22 @@
 ## Current/Unreleased
 
+## Release v0.7.0 (2021-01-24)
+
+* 💥⚡️ **BREAKING**: Uses `double` for _all_ scores.
+* 💥 Integers larger than a double mantissa (53-bits) will lose some
+  precision.
+* ⚡️ big speed up
+* ⚡️ Much better memory usage
+* ⚡️ Simplifies score conversion between ruby and C
+* ✨ Added `DHeap::Map` for ensuring values can only be added once, by `#hash`.
+  * Adding again will update the score.
+  * Adds `DHeap::Map#[]` for quick lookup of existing scores
+  * Adds `DHeap::Map#[]=` for adjustments of existing scores
+  * TODO: `DHeap::Map#delete`
+* 📝📈 SO MANY BENCHMARKS
+* ⚡️ Set DEFAULT_D to 6, based on benchmarks.
+* 🐛♻️ convert all `long` indexes to `size_t`
+
 ## Release v0.6.1 (2021-01-24)
 
 * 📝 Fix link to CHANGELOG.md in gemspec
data/{N → D} RENAMED
@@ -1,7 +1,7 @@
 #!/bin/sh
 set -eu
 
-export BENCH_N="$1"
+export BENCH_D="$1"
 shift
 
 exec ruby "$@"
data/README.md CHANGED
@@ -7,6 +7,13 @@
 A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
 implemented as a C extension.
 
+A regular queue has "FIFO" behavior: first in, first out. A stack is "LIFO":
+last in, first out. A priority queue pushes each element with a score and pops
+out in order by score. Priority queues are often used in algorithms for e.g.
+[scheduling] of timers or bandwidth management, for [Huffman coding], and for
+various graph search algorithms such as [Dijkstra's algorithm], [A* search], or
+[Prim's algorithm].
+
 From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
 > A heap is a specialized tree-based data structure which is essentially an
 > almost complete tree that satisfies the heap property: in a min heap, for any
@@ -16,26 +23,17 @@ From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
 
 ![tree representation of a min heap](images/wikipedia-min-heap.png)
 
-With a regular queue, you expect "FIFO" behavior: first in, first out. With a
-stack you expect "LIFO": last in first out. A priority queue has a score for
-each element and elements are popped in order by score. Priority queues are
-often used in algorithms for e.g. [scheduling] of timers or bandwidth
-management, for [Huffman coding], and various graph search algorithms such as
-[Dijkstra's algorithm], [A* search], or [Prim's algorithm].
-
-The _d_-ary heap data structure is a generalization of the [binary heap], in
-which the nodes have _d_ children instead of 2. This allows for "insert" and
-"decrease priority" operations to be performed more quickly with the tradeoff of
-slower delete minimum or "increase priority". Additionally, _d_-ary heaps can
-have better memory cache behavior than binary heaps, allowing them to run more
-quickly in practice despite slower worst-case time complexity. In the worst
-case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
-the tradeoff that pop requires `O(d log n / log d)`.
-
-Although you should probably just use the default _d_ value of `4` (see the
-analysis below), it's always advisable to benchmark your specific use-case. In
-particular, if you push items more than you pop, higher values for _d_ can give
-a faster total runtime.
+The _d_-ary heap data structure is a generalization of a [binary heap] in which
+each node has _d_ children instead of 2. This speeds up "push" or "decrease
+priority" operations (`O(log n / log d)`) with the tradeoff of slower "pop" or
+"increase priority" (`O(d log n / log d)`). Additionally, _d_-ary heaps can
+have better memory cache behavior than binary heaps, letting them run more
+quickly in practice.
+
+Although the default _d_ value will usually perform best (see the time
+complexity analysis below), it's always advisable to benchmark your specific
+use-case. In particular, if you push items more than you pop, higher values for
+_d_ can give a faster total runtime.
 
 [d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
 [priority queue]: https://en.wikipedia.org/wiki/Priority_queue
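The flattened array layout behind that generalization is easy to see in plain ruby. The helpers below are illustrative only (they are not part of the gem's API): in a _d_-ary heap stored in an Array, the _d_ children of node `i` occupy one contiguous block, which is what improves cache behavior relative to `d = 2`.

```ruby
# Illustrative index arithmetic for a d-ary heap stored in a flat Array.
# The d children of node i sit in one contiguous block of indexes.
def child_range(i, d)
  (d * i + 1)..(d * i + d)
end

def parent(i, d)
  (i - 1) / d
end

p child_range(0, 4) # => 1..4  (node 0's children, with d = 4)
p parent(6, 4)      # => 1     (node 6 is a child of node 1)
```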
@@ -46,41 +44,39 @@ a faster total runtime.
 [A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
 [Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
 
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'd_heap'
+```
+
+And then execute:
+
+    $ bundle install
+
+Or install it yourself as:
+
+    $ gem install d_heap
+
 ## Usage
 
-The basic API is `#push(object, score)` and `#pop`. Please read the
-[gem documentation] for more details and other methods.
+The basic API is `#push(object, score)` and `#pop`. Please read the [full
+documentation] for more details. The score must be convertible to a `Float` via
+`Float(score)` (i.e. it should properly implement `#to_f`).
 
-Quick reference for some common methods:
+Quick reference for the most common methods:
 
-* `heap << object` adds a value, with `Float(object)` as its score.
+* `heap << object` adds a value, using `Float(object)` as its intrinsic score.
 * `heap.push(object, score)` adds a value with an extrinsic score.
-* `heap.pop` removes and returns the value with the minimum score.
-* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
 * `heap.peek` to view the minimum value without popping it.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_below(max_score)` pops only if the next score is `<` the argument.
 * `heap.clear` to remove all items from the heap.
 * `heap.empty?` returns true if the heap is empty.
 * `heap.size` returns the number of items in the heap.
 
-If the score changes while the object is still in the heap, it will not be
-re-evaluated again.
-
-The score must either be `Integer` or `Float` or convertable to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
-numeric values gives more than 50% speedup under some benchmarks! _n.b._
-`Integer` _scores must have an absolute value that fits into_ `unsigned long
-long`. This is compiler and architecture dependant but with gcc on an IA-64
-system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
-+18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
-in nanoseconds.
-
-_Comparing arbitary objects via_ `a <=> b` _was the original design and may be
-added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons. The speedup from this constraint is
-huge!_
-
-[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
-
 ### Examples
 
 ```ruby
@@ -128,251 +124,272 @@ heap.size # => 0
 heap.pop # => nil
 ```
 
-Please see the [gem documentation] for more methods and more examples.
+Please see the [full documentation] for more methods and more examples.
 
-## Installation
+[full documentation]: https://rubydoc.info/gems/d_heap/DHeap
 
-Add this line to your application's Gemfile:
+### DHeap::Map
 
-```ruby
-gem 'd_heap'
-```
+`DHeap::Map` augments the heap with an internal `Hash`, mapping objects to their
+index in the heap. For simple push/pop this is a bit slower than a normal
+`DHeap` heap, but it can enable huge speed-ups for algorithms that need to
+adjust scores after they've been added, e.g. [Dijkstra's algorithm]. It adds
+the following:
 
-And then execute:
-
-    $ bundle install
-
-Or install it yourself as:
+* a uniqueness constraint, by `#hash` value
+* `#[obj] # => score` or `#score(obj)` in `O(1)`
+* `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
+* TODO:
+  * optionally unique by object identity
+  * `#delete(obj)` in `O(d log n / log d)` (TODO)
 
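To make the hash-index idea concrete, here is a pure-ruby sketch of the same technique: a min-heap plus a `Hash` from value to heap position, so a score can be read in `O(1)` and adjusted with a single sift. `MapHeap` and its methods are illustrative names only; this is a binary (d = 2) toy, not the gem's C implementation.

```ruby
# Illustrative only: a min-heap that also maintains value => index in a Hash,
# mimicking the technique behind DHeap::Map (in ruby, and with d = 2).
class MapHeap
  def initialize
    @heap  = [] # [score, value] pairs, heap-ordered by score
    @index = {} # value => position in @heap (the "map" part)
  end

  # O(1) score lookup, analogous to DHeap::Map#[]
  def score(value)
    i = @index[value]
    i && @heap[i][0]
  end

  # Push a new value, or rescore it if already present (uniqueness by #hash).
  def push(value, score)
    if (i = @index[value])
      @heap[i][0] = score
      sift_up(i)
      sift_down(i)
    else
      @heap << [score, value]
      @index[value] = @heap.size - 1
      sift_up(@heap.size - 1)
    end
  end

  def pop
    return nil if @heap.empty?
    value = @heap[0][1]
    @index.delete(value)
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      @index[last[1]] = 0
      sift_down(0)
    end
    value
  end

  private

  def swap(i, j)
    @heap[i], @heap[j] = @heap[j], @heap[i]
    @index[@heap[i][1]] = i
    @index[@heap[j][1]] = j
  end

  def sift_up(i)
    until i.zero?
      par = (i - 1) / 2
      break if @heap[par][0] <= @heap[i][0]
      swap(i, par)
      i = par
    end
  end

  def sift_down(i)
    loop do
      child = 2 * i + 1
      return if child >= @heap.size
      right = child + 1
      child = right if right < @heap.size && @heap[right][0] < @heap[child][0]
      return if @heap[i][0] <= @heap[child][0]
      swap(i, child)
      i = child
    end
  end
end

heap = MapHeap.new
heap.push(:a, 5.0)
heap.push(:b, 1.0)
heap.push(:a, 0.5) # rescore, analogous to DHeap::Map#[]=
heap.score(:a)     # => 0.5
heap.pop           # => :a
heap.pop           # => :b
```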
-    $ gem install d_heap
+## Scores
 
-## Motivation
+If a score changes while the object is still in the heap, it will not be
+re-evaluated again.
 
-One naive approach to a priority queue is to maintain an array in sorted order.
-This can be very simply implemented in ruby with `Array#bseach_index` +
-`Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
-for insert is `O(n)` because it may need to `memcpy` a significant portion of
-the array.
+Constraining scores to `Float` gives enormous performance benefits. n.b.
+very large `Integer` values will lose precision when converted to `Float`. This
+is compiler and architecture dependent, but with gcc on an IA-64 system, `Float`
+is 64 bits with a 53-bit mantissa, which gives a range of -9,007,199,254,740,991
+to +9,007,199,254,740,991, which is _not_ enough to store the precise POSIX
+time since the epoch in nanoseconds. This can be worked around by adding a
+bias, but it's probably good enough for most usage.
 
-The standard way to implement a priority queue is with a binary heap. Although
-this increases the time complexity for `pop` alone, it reduces the combined time
-compexity for the combined `push` + `pop`. Using a d-ary heap with d > 2
-makes the tree shorter but broader, which reduces to `O(log n / log d)` while
-increasing the comparisons needed by sift-down to `O(d log n/ log d)`.
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
+added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
 
-However, I was disappointed when my best ruby heap implementation ran much more
-slowly than the naive approach—even for heaps containing ten thousand items.
-Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
-has _much_ higher overhead. And a _d_-heap needs `d + 1` times more comparisons
-for each push + pop than `bsearch` + `insert`.
+## Thread safety
 
-Additionally, when researching how other systems handle their scheduling, I was
-inspired by reading go's "timer.go" implementation to experiment with a 4-ary
-heap instead of the traditional binary heap.
+`DHeap` is _not_ thread-safe, so concurrent access from multiple threads needs
+to take precautions such as locking access behind a mutex.
 
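The 53-bit mantissa limit described above is easy to verify from ruby itself (a quick sanity check, not gem code):

```ruby
# Integers survive a round-trip through Float only up to 2**53; beyond that,
# neighboring integers collapse onto the same double value.
max_safe = 9_007_199_254_740_991 # 2**53 - 1

p max_safe.to_f.to_i == max_safe # => true

round_trip = (2**53 + 1).to_f.to_i
p round_trip == 2**53 + 1        # => false
p round_trip == 2**53            # => true (rounded to the nearest double)

# POSIX time in nanoseconds is already outside the safe range:
nanos = Time.now.to_i * 1_000_000_000
p nanos > max_safe               # => true
```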
 ## Benchmarks
 
-_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
-`docs/profile.txt` for much more detail or updated results. These benchmarks
-were measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._
-
-These benchmarks use very simple implementations for a pure-ruby heap and an
-array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
-comparison, I also compare to the [priority_queue_cxx gem] which uses the [C++
-STL priority_queue], and another naive implementation that uses `Array#min` and
-`Array#delete_at` with an unsorted array.
-
-In these benchmarks, `DHeap` runs faster than all other implementations for
-every scenario and every value of N, although the difference is usually more
-noticable at higher values of N. The pure ruby heap implementation is
-competitive for `push` alone at every value of N, but is significantly slower
-than bsearch + insert for push + pop, until N is _very_ large (somewhere between
-10k and 100k)!
+_See full benchmark output in subdirs of `benchmarks` for more detail or
+updated results. These benchmarks were measured with an Intel Core i7-1065G7
+8x3.9GHz with d_heap v0.5.0 and ruby 2.7.2 without MJIT enabled._
+
+### Implementations
+
+* **findmin** -
+  A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
+  very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
+  `Array#delete_at(min_index)`. Push + pop is still fast for `n < 100`, but
+  unusably slow for `n > 1000`.
+
+* **bsearch** -
+  A simple implementation with a slow `O(n)` push using `Array#bsearch` +
+  `Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
+  `Array#pop`. It is still relatively fast for `n < 10000`, but its linear
+  time complexity really destroys it after that.
+
+* **rb_heap** -
+  A pure ruby binary min-heap that has been tuned for performance by making
+  few method calls and allocating and assigning as few variables as possible.
+  It runs in `O(log n)` for both push and pop, although pop is slower than
+  push by a constant factor. Its much higher constant factors make it lose
+  to `bsearch` push + pop for `n < 10000`, but it holds steady with very
+  little slowdown even with `n > 10000000`.
+
+* **c++ stl** -
+  A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
+  priority_queue]. The wrapper is simply to provide compatibility with the
+  other benchmarked implementations, but it should be possible to speed this
+  up a little bit by benchmarking the `priority_queue_cxx` API directly. It
+  has the same time complexity as rb_heap but its much lower constant
+  factors allow it to easily outperform `bsearch`.
+
+* **c_dheap** -
+  A {DHeap} instance with the default `d` value of `4`. It has the same time
+  complexity as `rb_heap` and `c++ stl`, but is faster than both in every
+  benchmarked scenario.
 
 [priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
 [C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
 
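For reference, the bsearch strategy described above can be recreated in a few lines of plain ruby. `BsearchQueue` is an illustrative sketch, not the gem's actual benchmark driver:

```ruby
# Sketch of the "bsearch" strategy: keep [score, value] pairs sorted by
# *descending* score, so Array#pop (O(1)) always removes the minimum.
# Insertion via Array#bsearch_index + Array#insert is O(n).
class BsearchQueue
  def initialize
    @array = []
  end

  def push(value, score)
    i = @array.bsearch_index { |(s, _)| s < score } || @array.size
    @array.insert(i, [score, value])
    self
  end

  def pop
    pair = @array.pop
    pair && pair[1]
  end
end

q = BsearchQueue.new
q.push(:b, 2.0).push(:a, 1.0).push(:c, 3.0)
q.pop # => :a
q.pop # => :b
```

`Array#bsearch_index` is used in find-minimum mode here: the block is false for the leading (higher-score) entries and true from the insertion point onward, which is why the array must stay sorted in descending order.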
-Three different scenarios are measured:
+### Scenarios
 
-### push N items onto an empty heap
+Each benchmark increases N exponentially, either by √10 or approximating it
+(alternating between x3 and x3.333) in order to simplify keeping loop counts
+evenly divisible by N.
 
-...but never pop (clearing between each set of pushes).
+#### push N items
 
-![bar graph for push_n_pop_n benchmarks](./images/push_n.png)
+This measures the _average time per insert_ to create a queue of size N
+(clearing the queue once it reaches that size). Use cases which push (or
+decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
+algorithm] when the graph has more edges than vertices, may want to pay more
+attention to this benchmark.
 
-### push N items onto an empty heap then pop all N
+![bar graph for push_n_pop_n benchmarks](./images/push_n.png)
 
-Although this could be used for heap sort, we're unlikely to choose heap sort
-over Ruby's quick sort implementation. I'm using this scenario to represent
-the amortized cost of creating a heap and (eventually) draining it.
+== push N (N=100) ==========================================================
+push N (c_dheap): 10522662.6 i/s
+push N (findmin): 9980622.3 i/s - 1.05x slower
+push N (c++ stl): 7991608.3 i/s - 1.32x slower
+push N (rb_heap): 4607849.4 i/s - 2.28x slower
+push N (bsearch): 2769106.2 i/s - 3.80x slower
+== push N (N=10,000) =======================================================
+push N (c_dheap): 10444588.3 i/s
+push N (findmin): 10191797.4 i/s - 1.02x slower
+push N (c++ stl): 8210895.4 i/s - 1.27x slower
+push N (rb_heap): 4369252.9 i/s - 2.39x slower
+push N (bsearch): 1213580.4 i/s - 8.61x slower
+== push N (N=1,000,000) ====================================================
+push N (c_dheap): 10342183.7 i/s
+push N (findmin): 9963898.8 i/s - 1.04x slower
+push N (c++ stl): 7891924.8 i/s - 1.31x slower
+push N (rb_heap): 4350116.0 i/s - 2.38x slower
+
+All three heap implementations have little to no perceptible slowdown for `N >
+100`. But `DHeap` runs faster than `Array#push` to an unsorted array (findmin)!
+
+#### push then pop N items
+
+This measures the _average_ for a push **or** a pop, filling up a queue with N
+items and then draining that queue until empty. It represents the amortized
+cost of balanced pushes and pops to fill a heap and drain it.
 
 ![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)
 
-### push and pop on a heap with N values
-
-Repeatedly push and pop while keeping a stable heap size. This is a _very
-simplistic_ approximation for how most scheduler/timer heaps might be used.
-Usually when a timer fires it will be quickly replaced by a new timer, and the
-overall count of timers will remain roughly stable.
+== push N then pop N (N=100) ===============================================
+push N + pop N (c_dheap): 10954469.2 i/s
+push N + pop N (c++ stl): 9317140.2 i/s - 1.18x slower
+push N + pop N (bsearch): 4808770.2 i/s - 2.28x slower
+push N + pop N (findmin): 4321411.9 i/s - 2.53x slower
+push N + pop N (rb_heap): 2467417.0 i/s - 4.44x slower
+== push N then pop N (N=10,000) ============================================
+push N + pop N (c_dheap): 8083962.7 i/s
+push N + pop N (c++ stl): 7365661.8 i/s - 1.10x slower
+push N + pop N (bsearch): 2257047.9 i/s - 3.58x slower
+push N + pop N (rb_heap): 1439204.3 i/s - 5.62x slower
+== push N then pop N (N=1,000,000) =========================================
+push N + pop N (c++ stl): 5274657.5 i/s
+push N + pop N (c_dheap): 4731117.9 i/s - 1.11x slower
+push N + pop N (rb_heap): 976688.6 i/s - 5.40x slower
+
+At N=100 findmin still beats a pure-ruby heap. But above that it slows down too
+much to be useful. At N=10k, bsearch still beats a pure ruby heap, but above
+30k it slows down too much to be useful. `DHeap` consistently runs 4.5-5.5x
+faster than the pure ruby heap.
+
+#### push & pop on N-item heap
+
+This measures the combined time to push once and pop once, which is done
+repeatedly while keeping a stable heap size of N. It's an approximation for
+scenarios which reach a stable size and then plateau with balanced pushes and
+pops. E.g. timers and timeouts will often reschedule themselves or replace
+themselves with new timers or timeouts, maintaining a roughly stable total count
+of timers.
 
 ![bar graph for push_pop benchmarks](./images/push_pop.png)
 
-### numbers
-
-Even for very small N values the benchmark implementations, `DHeap` runs faster
-than the other implementations for each scenario, although the difference is
-still relatively small. The pure ruby binary heap is 2x or more slower than
-bsearch + insert for common push/pop scenario.
-
-    == push N (N=5) ==========================================================
-    push N (c_dheap): 1969700.7 i/s
-    push N (c++ stl): 1049738.1 i/s - 1.88x slower
-    push N (rb_heap): 928435.2 i/s - 2.12x slower
-    push N (bsearch): 921060.0 i/s - 2.14x slower
-
-    == push N then pop N (N=5) ===============================================
-    push N + pop N (c_dheap): 1375805.0 i/s
-    push N + pop N (c++ stl): 1134997.5 i/s - 1.21x slower
-    push N + pop N (findmin): 862913.1 i/s - 1.59x slower
-    push N + pop N (bsearch): 762887.1 i/s - 1.80x slower
-    push N + pop N (rb_heap): 506890.4 i/s - 2.71x slower
-
-    == Push/pop with pre-filled queue of size=N (N=5) ========================
-    push + pop (c_dheap): 9044435.5 i/s
-    push + pop (c++ stl): 7534583.4 i/s - 1.20x slower
-    push + pop (findmin): 5026155.1 i/s - 1.80x slower
-    push + pop (bsearch): 4300260.0 i/s - 2.10x slower
-    push + pop (rb_heap): 2299499.7 i/s - 3.93x slower
-
-By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
-scenarios, but the pure ruby heap is still slower than every other
-implementation—even resorting the array after every `#push`—in any scenario that
-uses `#pop`.
-
-    == push N (N=21) =========================================================
-    push N (c_dheap): 464231.4 i/s
-    push N (c++ stl): 305546.7 i/s - 1.52x slower
-    push N (rb_heap): 202803.7 i/s - 2.29x slower
-    push N (bsearch): 168678.7 i/s - 2.75x slower
-
-    == push N then pop N (N=21) ==============================================
-    push N + pop N (c_dheap): 298350.3 i/s
-    push N + pop N (c++ stl): 252227.1 i/s - 1.18x slower
-    push N + pop N (findmin): 161998.7 i/s - 1.84x slower
-    push N + pop N (bsearch): 143432.3 i/s - 2.08x slower
-    push N + pop N (rb_heap): 79622.1 i/s - 3.75x slower
-
-    == Push/pop with pre-filled queue of size=N (N=21) =======================
-    push + pop (c_dheap): 8855093.4 i/s
-    push + pop (c++ stl): 7223079.5 i/s - 1.23x slower
-    push + pop (findmin): 4542913.7 i/s - 1.95x slower
-    push + pop (bsearch): 3461802.4 i/s - 2.56x slower
-    push + pop (rb_heap): 1845488.7 i/s - 4.80x slower
-
-At higher values of N, a heaps logarithmic growth leads to only a little
-slowdown of `#push`, while insert's linear growth causes it to run noticably
-slower and slower. But because `#pop` is `O(1)` for a sorted array and `O(d log
-n / log d)` for a heap, scenarios involving both `#push` and `#pop` remain
-relatively close, and bsearch + insert still runs faster than a pure ruby heap,
-even up to queues with 10k items. But as queue size increases beyond than that,
-the linear time compexity to keep a sorted array dominates.
-
-    == push + pop (rb_heap)
-    queue size = 10000: 736618.2 i/s
-    queue size = 25000: 670186.8 i/s - 1.10x slower
-    queue size = 50000: 618156.7 i/s - 1.19x slower
-    queue size = 100000: 579250.7 i/s - 1.27x slower
-    queue size = 250000: 572795.0 i/s - 1.29x slower
-    queue size = 500000: 543648.3 i/s - 1.35x slower
-    queue size = 1000000: 513523.4 i/s - 1.43x slower
-    queue size = 2500000: 460848.9 i/s - 1.60x slower
-    queue size = 5000000: 445234.5 i/s - 1.65x slower
-    queue size = 10000000: 423119.0 i/s - 1.74x slower
-
-    == push + pop (bsearch)
-    queue size = 10000: 786334.2 i/s
-    queue size = 25000: 364963.8 i/s - 2.15x slower
-    queue size = 50000: 200520.6 i/s - 3.92x slower
-    queue size = 100000: 88607.0 i/s - 8.87x slower
-    queue size = 250000: 34530.5 i/s - 22.77x slower
-    queue size = 500000: 17965.4 i/s - 43.77x slower
-    queue size = 1000000: 5638.7 i/s - 139.45x slower
-    queue size = 2500000: 1302.0 i/s - 603.93x slower
-    queue size = 5000000: 592.0 i/s - 1328.25x slower
-    queue size = 10000000: 288.8 i/s - 2722.66x slower
-
-    == push + pop (c_dheap)
-    queue size = 10000: 7311366.6 i/s
-    queue size = 50000: 6737824.5 i/s - 1.09x slower
-    queue size = 25000: 6407340.6 i/s - 1.14x slower
-    queue size = 100000: 6254396.3 i/s - 1.17x slower
-    queue size = 250000: 5917684.5 i/s - 1.24x slower
-    queue size = 500000: 5126307.6 i/s - 1.43x slower
-    queue size = 1000000: 4403494.1 i/s - 1.66x slower
-    queue size = 2500000: 3304088.2 i/s - 2.21x slower
-    queue size = 5000000: 2664897.7 i/s - 2.74x slower
-    queue size = 10000000: 2137927.6 i/s - 3.42x slower
-
-## Analysis
-
-### Time complexity
-
-There are two fundamental heap operations: sift-up (used by push) and sift-down
-(used by pop).
-
-* A _d_-ary heap will have `log n / log d` layers, so both sift operations can
-  perform as many as `log n / log d` writes, when a member sifts the entire
-  length of the tree.
-* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
-* Sift-down makes d comparions per layer, so pop runs in `O(d log n / log d)`.
-
-So, in the simplest case of running balanced push/pop while maintaining the same
-heap size, `(1 + d) log n / log d` comparisons are made. In the worst case,
-when every sift traverses every layer of the tree, `d=4` requires the fewest
-comparisons for combined insert and delete:
-
-* (1 + 2) lg n / lg d ≈ 4.328085 lg n
-* (1 + 3) lg n / lg d ≈ 3.640957 lg n
-* (1 + 4) lg n / lg d ≈ 3.606738 lg n
-* (1 + 5) lg n / lg d ≈ 3.728010 lg n
-* (1 + 6) lg n / lg d ≈ 3.906774 lg n
-* (1 + 7) lg n / lg d ≈ 4.111187 lg n
-* (1 + 8) lg n / lg d ≈ 4.328085 lg n
-* (1 + 9) lg n / lg d ≈ 4.551196 lg n
-* (1 + 10) lg n / lg d ≈ 4.777239 lg n
+push + pop (findmin)
+N 10: 5480288.0 i/s
+N 100: 2595178.8 i/s - 2.11x slower
+N 1000: 224813.9 i/s - 24.38x slower
+N 10000: 12630.7 i/s - 433.89x slower
+N 100000: 1097.3 i/s - 4994.31x slower
+N 1000000: 135.9 i/s - 40313.05x slower
+N 10000000: 12.9 i/s - 425838.01x slower
+
+push + pop (bsearch)
+N 10: 3931408.4 i/s
+N 100: 2904181.8 i/s - 1.35x slower
+N 1000: 2203157.1 i/s - 1.78x slower
+N 10000: 1209584.9 i/s - 3.25x slower
+N 100000: 81121.4 i/s - 48.46x slower
+N 1000000: 5356.0 i/s - 734.02x slower
+N 10000000: 281.9 i/s - 13946.33x slower
+
+push + pop (rb_heap)
+N 10: 2325816.5 i/s
+N 100: 1603540.3 i/s - 1.45x slower
+N 1000: 1262515.2 i/s - 1.84x slower
+N 10000: 950389.3 i/s - 2.45x slower
+N 100000: 732548.8 i/s - 3.17x slower
+N 1000000: 673577.8 i/s - 3.45x slower
+N 10000000: 467512.3 i/s - 4.97x slower
+
+push + pop (c++ stl)
+N 10: 7706818.6 i/s - 1.01x slower
+N 100: 7393127.3 i/s - 1.05x slower
+N 1000: 6898781.3 i/s - 1.13x slower
+N 10000: 5731130.5 i/s - 1.36x slower
+N 100000: 4842393.2 i/s - 1.60x slower
+N 1000000: 4170936.4 i/s - 1.86x slower
+N 10000000: 2737146.6 i/s - 2.84x slower
+
+push + pop (c_dheap)
+N 10: 10196454.1 i/s
+N 100: 9668679.8 i/s - 1.05x slower
+N 1000: 9339557.0 i/s - 1.09x slower
+N 10000: 8045103.0 i/s - 1.27x slower
+N 100000: 7150276.7 i/s - 1.43x slower
+N 1000000: 6490261.6 i/s - 1.57x slower
+N 10000000: 3734856.5 i/s - 2.73x slower
+
+## Time complexity analysis
+
+There are two fundamental heap operations: sift-up (used by push or decrease
+score) and sift-down (used by pop or delete or increase score). Each sift
+bubbles an item to its correct location in the tree.
+
+* A _d_-ary heap has `log n / log d` layers, so either sift performs as many as
+  `log n / log d` writes, when a member sifts the entire length of the tree.
+* Sift-up needs one comparison per layer: `O(log n / log d)`.
+* Sift-down needs d comparisons per layer: `O(d log n / log d)`.
+
+So, in the case of a balanced push then pop, as many as `(1 + d) log n / log d`
+comparisons are made. Looking only at this worst case combo, `d=4` requires the
+fewest comparisons for a combined push and pop:
+
+* `(1 + 2) log n / log d ≈ 4.328085 log n`
+* `(1 + 3) log n / log d ≈ 3.640957 log n`
+* `(1 + 4) log n / log d ≈ 3.606738 log n`
+* `(1 + 5) log n / log d ≈ 3.728010 log n`
+* `(1 + 6) log n / log d ≈ 3.906774 log n`
+* `(1 + 7) log n / log d ≈ 4.111187 log n`
+* `(1 + 8) log n / log d ≈ 4.328085 log n`
+* `(1 + 9) log n / log d ≈ 4.551196 log n`
+* `(1 + 10) log n / log d ≈ 4.777239 log n`
 * etc...
 
 See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
 
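The coefficients in that list can be reproduced directly. Note that the denominator is the natural log of `d` (that is how 4.328085 arises for `d=2`); `factor` is an illustrative helper, not gem code:

```ruby
# Worst-case comparisons for one push + one pop in a d-ary heap are
# (1 + d) * log(n) / log(d).  The coefficient (1 + d) / log(d), with
# natural log, is minimized at d = 4, matching the list above.
def factor(d)
  (1 + d) / Math.log(d)
end

(2..10).each do |d|
  printf("(1 + %d) log n / log %d ≈ %.6f log n\n", d, d, factor(d))
end

best = (2..10).min_by { |d| factor(d) }
puts "minimum at d = #{best}" # prints "minimum at d = 4"
```

As the following paragraph explains, this count alone doesn't decide the best `d` in practice, which is why the benchmarked default differs.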
347
- ### Space complexity
361
+ However, what this simple count of comparisons misses is the extent to which
362
+ modern compilers can optimize code (e.g. by unrolling the comparison loop to
363
+ execute on registers) and more importantly how well modern processors are at
364
+ pipelined speculative execution using branch prediction, etc. Benchmarks should
365
+ be run on the _exact same_ hardware platform that production code will use,
366
+ as the sift-down operation is especially sensitive to good pipelining.
 
- Space usage is linear, regardless of d. However higher d values may
- provide better cache locality. Because the heap is a complete binary tree, the
- elements can be stored in an array, without the need for tree or list pointers.
+ ## Comparison performance
 
- Ruby can compare Numeric values _much_ faster than other ruby objects, even if
- those objects simply delegate comparison to internal Numeric values. And it is
- often useful to use external scores for otherwise uncomparable values. So
- `DHeap` uses twice as many entries (one for score and one for value)
- as an array which only stores values.
+ It is often useful to use external scores for otherwise uncomparable values.
+ And casting an item or score (via `to_f`) can also be time-consuming. So
+ `DHeap` evaluates and stores scores at the time of insertion, and they will be
+ compared directly without needing any further lookup.
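As a pure-Ruby illustration of that score caching (the `Task` struct here is hypothetical, and `Array#sort` stands in for the heap's internal comparisons):

```ruby
# Hypothetical sketch: evaluate each score once via to_f at insertion time,
# then compare only the cached Floats. Array#sort stands in for the heap.
Task = Struct.new(:name, :deadline)

tasks = [Task.new("b", 20), Task.new("a", 10), Task.new("c", 30)]

# Without caching: every comparison calls deadline.to_f again.
by_recompute = tasks.sort { |x, y| x.deadline.to_f <=> y.deadline.to_f }

# With caching: store [score, value] entries at insertion time,
# then compare the precomputed scores directly.
entries = tasks.map { |t| [t.deadline.to_f, t] }
by_cached = entries.sort_by(&:first).map(&:last)

by_cached.map(&:name)  # => ["a", "b", "c"]
```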
 
- ## Thread safety
+ Numeric values can be compared _much_ faster than other ruby objects, even if
+ those objects simply delegate comparison to internal Numeric values.
+ Additionally, native C integers or floats can be compared _much_ faster than
+ ruby `Numeric` objects. So scores are converted to Float and stored as
+ `double`, which is 64 bits on an [LP64 64-bit system].
 
- `DHeap` is _not_ thread-safe, so concurrent access from multiple threads need to
- take precautions such as locking access behind a mutex.
+ [LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
 
 ## Alternative data structures
 
 As always, you should run benchmarks with your expected scenarios to determine
 which is best for your application.
 
- Depending on your use-case, maintaining a sorted `Array` using `#bsearch_index`
- and `#insert` might be just fine! Even `min` plus `delete` with an unsorted
- array can be very fast on small queues. Although insertions run with `O(n)`,
- `memcpy` is so fast on modern hardware that your dataset might not be large
- enough for it to matter.
+ Depending on your use-case, maintaining a sorted `Array` with `#bsearch_index`
+ and `#insert` might be just fine! It only takes a couple of lines of code and
+ is probably "Fast Enough".
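For example, a minimal version might look like this (the `SortedQueue` name and API are hypothetical, not part of this gem):

```ruby
# Hypothetical sketch: a tiny priority queue kept sorted with #bsearch_index
# and #insert. Push is O(n) because #insert must shift trailing entries,
# but that memmove is very fast on small arrays.
class SortedQueue
  def initialize
    @entries = [] # [score, value] pairs, sorted ascending by score
  end

  def push(score, value)
    # find the first entry whose score is >= the new score
    index = @entries.bsearch_index { |(s, _)| s >= score } || @entries.length
    @entries.insert(index, [score, value])
    self
  end

  def pop
    entry = @entries.shift
    entry && entry.last
  end

  def peek
    entry = @entries.first
    entry && entry.last
  end

  def size
    @entries.length
  end
end

q = SortedQueue.new
q.push(3, "c").push(1, "a").push(2, "b")
q.pop  # => "a"
```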
 
- More complex heap varients, e.g. [Fibonacci heap], allow heaps to be split and
+ More complex heap variants, e.g. [Fibonacci heap], allow heaps to be split and
 merged which gives some graph algorithms a lower amortized time complexity. But
 in practice, _d_-ary heaps have much lower overhead and often run faster.
 
@@ -385,25 +402,60 @@ of values in it, then you may want to use a self-balancing binary search tree
 [red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
 [skip-list]: https://en.wikipedia.org/wiki/Skip_list
 
- [Hashed and Heirarchical Timing Wheels][timing wheels] (or some variant in that
- family of data structures) can be constructed to have effectively `O(1)` running
- time in most cases. Although the implementation for that data structure is more
+ [Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in the
+ timing wheel family of data structures) can have effectively `O(1)` running time
+ in most cases. Although the implementation for that data structure is more
 complex than a heap, it may be necessary for enormous values of N.
 
- [timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+ [timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+ ## Supported platforms
+
+ See the [CI workflow] for all supported platforms.
+
+ [CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI
+
+ `d_heap` may contain bugs on 32-bit systems. Currently, `d_heap` is only tested
+ on 64-bit x86 CRuby 2.4-3.0 under Linux and Mac OS.
+
+ ## Caveats and TODOs (PRs welcome!)
+
+ A `DHeap`'s internal array grows but never shrinks. At the very least, there
+ should be a `#compact` or `#shrink` method, and the array should be compacted
+ during `#freeze`. It might make sense to automatically shrink (to no more than
+ 2x the current size) during GC's compact phase.
+
+ Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F. This
+ might lead to a different default `d` value (maybe 16 or 24?).
+
+ Shrink scores to 64 bits: either store a type flag with each entry (this could
+ be used to support non-numeric scores) or require users to choose between
+ `Integer` and `Float` at construction time. Reducing memory usage should also
+ improve speed for very large heaps.
+
+ Patches to support JRuby, rubinius, 32-bit systems, or any other platforms are
+ welcome! JRuby and Truffle Ruby ought to be able to use [Java's PriorityQueue].
+ Other platforms could fall back on the (slower) pure ruby implementation used by
+ the benchmarks.
+
+ [Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html
+
+ Allow a max-heap (or other configurations of the compare function). This can be
+ very easily implemented by just reversing the scores.
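A sketch of that trick in plain Ruby, using a sorted `Array` as a stand-in for the min-ordered heap (the `MaxQueue` name is hypothetical):

```ruby
# Hypothetical sketch of the score-reversal trick: negate scores on the way
# into a min-ordered structure, and the largest original score pops first.
class MaxQueue
  def initialize
    @min = [] # stand-in for a min-heap: [negated_score, value], kept sorted
  end

  def push(score, value)
    entry = [-score, value] # reverse the score
    index = @min.bsearch_index { |(s, _)| s >= entry.first } || @min.length
    @min.insert(index, entry)
    self
  end

  def pop
    entry = @min.shift # smallest negated score == largest original score
    entry && entry.last
  end
end

q = MaxQueue.new
q.push(1, "low").push(5, "high").push(3, "mid")
q.pop  # => "high"
```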
 
- ## TODOs...
+ _Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the
+ simplicity and speed of the basic numeric use case can be preserved.
 
- _TODO:_ Also ~~included is~~ _will include_ `DHeap::Map`, which augments the
- basic heap with an internal `Hash`, which maps objects to their position in the
- heap. This enforces a uniqueness constraint on items on the heap, and also
- allows items to be more efficiently deleted or adjusted. However maintaining
- the hash does lead to a small drop in normal `#push` and `#pop` performance.
+ Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current time"
+ and move all values below that time onto an Array.
 
- _TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
- features that are loosely inspired by go's timers. e.g: It lazily sifts its
- heap after deletion and adjustments, to achieve faster average runtime for *add*
- and *cancel* operations.
+ Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)`, which could contain
+ some features that are loosely inspired by go's timers. Go lazily sifts its
+ heap after deletion or adjustments, to achieve faster amortized runtime.
+ There's no need to actually remove a deleted item from the heap, if you re-add
+ it before it's next evaluated. A similar trick is to store "far away"
+ values in an internal `Hash`, assuming many will be deleted before they rise to
+ the top. This could naturally evolve into a [timing wheel] variant.
 
 ## Development