d_heap 0.6.1 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 1ad095ff29343f83c8bbe6fd0bc7f4acd79fa9c298aa4f8d007acf02ebedba30
- data.tar.gz: b2806a066a173a83d12259342c3f7d90900c83dc628063955d861f05acc98796
+ metadata.gz: 5b51ed52baf74b585a7ab7799f92a446aef5852431ba10e146658b419657ffbe
+ data.tar.gz: cc7c6786eee78ec13214582b8701448d312f59fb723d12676fb673447ab409a7
  SHA512:
- metadata.gz: 297aad8a8b4c7845fbea64808a2beaf4aa66b8431a23841c3d17952aaf85f41a3377c2dadc7651858e038adc69a35b2fe8e6ca484d45999f026efb41817e281b
- data.tar.gz: 1e3f123c7f723c752b2e8326c70b4208188ad09c275574bd0cee3dc7a119c7e3f07173f4ad4ed32035d2103a10b1a979400dfa35bdc1dd55272b53bcc8eaa2b9
+ metadata.gz: 5de98f8c9084b30694fff5f8154a6e42e7e67d76518c25136ab4fb0c0afb047ad3c923f4544dcf613ded4c3b01417729aa796c973100faaa7ee93051fa630c7d
+ data.tar.gz: e5dbcc90da7adfba7ef45cd9a2da5fd1781a2bd489002a5ffc0a764915c035c178db30ae9b8431a8fc810cfa6f03a1b38ec0a50cbf23c2e1ba5dfc36549c0609
data/.clang-format ADDED
@@ -0,0 +1,21 @@
+ ---
+ BasedOnStyle: mozilla
+ IndentWidth: 4
+ PointerAlignment: Right
+ AlignAfterOpenBracket: Align
+ AlignConsecutiveAssignments: true
+ AlignConsecutiveDeclarations: true
+ AlignConsecutiveBitFields: true
+ AlignConsecutiveMacros: true
+ AlignEscapedNewlines: Right
+ AlignOperands: true
+
+ AllowAllConstructorInitializersOnNextLine: false
+ AllowShortIfStatementsOnASingleLine: WithoutElse
+
+ IndentCaseLabels: false
+ IndentPPDirectives: AfterHash
+
+ ForEachMacros:
+   - WHILE_PEEK_LT_P
+ ...
@@ -23,4 +23,19 @@ jobs:
        run: |
          gem install bundler -v 2.2.3
          bundle install
-         bundle exec rake
+         bundle exec rake ci
+
+   benchmarks:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v2
+       - name: Set up Ruby
+         uses: ruby/setup-ruby@v1
+         with:
+           ruby-version: 2.7
+           bundler-cache: true
+       - name: Run the benchmarks
+         run: |
+           gem install bundler -v 2.2.3
+           bundle install
+           bundle exec rake ci:benchmarks
data/.rubocop.yml CHANGED
@@ -135,6 +135,7 @@ Style/ClassAndModuleChildren: { Enabled: false }
  Style/EachWithObject: { Enabled: false }
  Style/FormatStringToken: { Enabled: false }
  Style/FloatDivision: { Enabled: false }
+ Style/GuardClause: { Enabled: false } # usually nice to do, but...
  Style/IfUnlessModifier: { Enabled: false }
  Style/IfWithSemicolon: { Enabled: false }
  Style/Lambda: { Enabled: false }
data/CHANGELOG.md CHANGED
@@ -1,5 +1,22 @@
  ## Current/Unreleased

+ ## Release v0.7.0 (2021-01-24)
+
+ * 💥⚡️ **BREAKING**: Uses `double` for _all_ scores.
+   * 💥 Integers larger than a double mantissa (53-bits) will lose some
+     precision.
+   * ⚡️ Big speed up
+   * ⚡️ Much better memory usage
+   * ⚡️ Simplifies score conversion between ruby and C
+ * ✨ Added `DHeap::Map` for ensuring values can only be added once, by `#hash`.
+   * Adding again will update the score.
+   * Adds `DHeap::Map#[]` for quick lookup of existing scores
+   * Adds `DHeap::Map#[]=` for adjustments of existing scores
+   * TODO: `DHeap::Map#delete`
+ * 📝📈 SO MANY BENCHMARKS
+ * ⚡️ Set DEFAULT_D to 6, based on benchmarks.
+ * 🐛♻️ Convert all `long` indexes to `size_t`
+
  ## Release v0.6.1 (2021-01-24)

  * 📝 Fix link to CHANGELOG.md in gemspec
data/{N → D} RENAMED
@@ -1,7 +1,7 @@
  #!/bin/sh
  set -eu

- export BENCH_N="$1"
+ export BENCH_D="$1"
  shift

  exec ruby "$@"
data/README.md CHANGED
@@ -7,6 +7,13 @@
  A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
  implemented as a C extension.

+ A regular queue has "FIFO" behavior: first in, first out. A stack is "LIFO":
+ last in first out. A priority queue pushes each element with a score and pops
+ out in order by score. Priority queues are often used in algorithms for e.g.
+ [scheduling] of timers or bandwidth management, for [Huffman coding], and for
+ various graph search algorithms such as [Dijkstra's algorithm], [A* search], or
+ [Prim's algorithm].
+
  From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
  > A heap is a specialized tree-based data structure which is essentially an
  > almost complete tree that satisfies the heap property: in a min heap, for any
@@ -16,26 +23,17 @@ From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):

  ![tree representation of a min heap](images/wikipedia-min-heap.png)

- With a regular queue, you expect "FIFO" behavior: first in, first out. With a
- stack you expect "LIFO": last in first out. A priority queue has a score for
- each element and elements are popped in order by score. Priority queues are
- often used in algorithms for e.g. [scheduling] of timers or bandwidth
- management, for [Huffman coding], and various graph search algorithms such as
- [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
-
- The _d_-ary heap data structure is a generalization of the [binary heap], in
- which the nodes have _d_ children instead of 2. This allows for "insert" and
- "decrease priority" operations to be performed more quickly with the tradeoff of
- slower delete minimum or "increase priority". Additionally, _d_-ary heaps can
- have better memory cache behavior than binary heaps, allowing them to run more
- quickly in practice despite slower worst-case time complexity. In the worst
- case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
- the tradeoff that pop requires `O(d log n / log d)`.
-
- Although you should probably just use the default _d_ value of `4` (see the
- analysis below), it's always advisable to benchmark your specific use-case. In
- particular, if you push items more than you pop, higher values for _d_ can give
- a faster total runtime.
+ The _d_-ary heap data structure is a generalization of a [binary heap] in which
+ each node has _d_ children instead of 2. This speeds up "push" or "decrease
+ priority" operations (`O(log n / log d)`) with the tradeoff of slower "pop" or
+ "increase priority" (`O(d log n / log d)`). Additionally, _d_-ary heaps can
+ have better memory cache behavior than binary heaps, letting them run more
+ quickly in practice.
+
+ Although the default _d_ value will usually perform best (see the time
+ complexity analysis below), it's always advisable to benchmark your specific
+ use-case. In particular, if you push items more than you pop, higher values for
+ _d_ can give a faster total runtime.

  [d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
  [priority queue]: https://en.wikipedia.org/wiki/Priority_queue
@@ -46,41 +44,39 @@ a faster total runtime.
  [A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
  [Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'd_heap'
+ ```
+
+ And then execute:
+
+     $ bundle install
+
+ Or install it yourself as:
+
+     $ gem install d_heap
+
  ## Usage

- The basic API is `#push(object, score)` and `#pop`. Please read the
- [gem documentation] for more details and other methods.
+ The basic API is `#push(object, score)` and `#pop`. Please read the [full
+ documentation] for more details. The score must be convertible to a `Float` via
+ `Float(score)` (i.e. it should properly implement `#to_f`).

- Quick reference for some common methods:
+ Quick reference for the most common methods:

- * `heap << object` adds a value, with `Float(object)` as its score.
+ * `heap << object` adds a value, using `Float(object)` as its intrinsic score.
  * `heap.push(object, score)` adds a value with an extrinsic score.
- * `heap.pop` removes and returns the value with the minimum score.
- * `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
  * `heap.peek` to view the minimum value without popping it.
+ * `heap.pop` removes and returns the value with the minimum score.
+ * `heap.pop_below(max_score)` pops only if the next score is `<` the argument.
  * `heap.clear` to remove all items from the heap.
  * `heap.empty?` returns true if the heap is empty.
  * `heap.size` returns the number of items in the heap.

- If the score changes while the object is still in the heap, it will not be
- re-evaluated again.
-
- The score must either be `Integer` or `Float` or convertable to a `Float` via
- `Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
- numeric values gives more than 50% speedup under some benchmarks! _n.b._
- `Integer` _scores must have an absolute value that fits into_ `unsigned long
- long`. This is compiler and architecture dependant but with gcc on an IA-64
- system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
- +18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
- in nanoseconds.
-
- _Comparing arbitary objects via_ `a <=> b` _was the original design and may be
- added back in a future version,_ if (and only if) _it can be done without
- impacting the speed of numeric comparisons. The speedup from this constraint is
- huge!_
-
- [gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
-
  ### Examples

  ```ruby
@@ -128,251 +124,272 @@ heap.size # => 0
  heap.pop # => nil
  ```

- Please see the [gem documentation] for more methods and more examples.
+ Please see the [full documentation] for more methods and more examples.

- ## Installation
+ [full documentation]: https://rubydoc.info/gems/d_heap/DHeap

- Add this line to your application's Gemfile:
+ ### DHeap::Map

- ```ruby
- gem 'd_heap'
- ```
+ `DHeap::Map` augments the heap with an internal `Hash`, mapping objects to their
+ index in the heap. For simple push/pop this is a bit slower than a normal `DHeap`
+ heap, but it can enable huge speed-ups for algorithms that need to adjust scores
+ after they've been added, e.g. [Dijkstra's algorithm]. It adds the following:

- And then execute:
-
-     $ bundle install
-
- Or install it yourself as:
+ * a uniqueness constraint, by `#hash` value
+ * `#[obj] # => score` or `#score(obj)` in `O(1)`
+ * `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
+ * TODO:
+   * optionally unique by object identity
+   * `#delete(obj)` in `O(d log n / log d)`

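The hash-backed design described above is easy to sketch in plain ruby. The `MappedHeap` class below is a hypothetical illustration of the idea (a binary min-heap paired with a `Hash` from value to heap index), not DHeap's actual C implementation; with the map, score lookup is `O(1)` and rescoring is `O(log n)` instead of `O(n)`:

```ruby
# Illustrative sketch of a heap with a value => index map (binary, not d-ary).
class MappedHeap
  def initialize
    @entries = [] # [score, value] pairs, maintained in min-heap order
    @index   = {} # value => current position in @entries
  end

  # O(1) score lookup for a value already in the heap.
  def [](value)
    i = @index[value]
    i && @entries[i][0]
  end

  # Pushing an already-present value updates (rescores) it instead.
  def push(value, score)
    if (i = @index[value])
      @entries[i][0] = score
      sift_up(i)   # only one of these two sifts will actually move anything
      sift_down(i)
    else
      @entries << [score, value]
      @index[value] = @entries.size - 1
      sift_up(@entries.size - 1)
    end
  end
  alias_method :[]=, :push

  # Remove and return the value with the minimum score.
  def pop
    return nil if @entries.empty?
    value = @entries[0][1]
    @index.delete(value)
    last = @entries.pop
    unless @entries.empty?
      @entries[0] = last
      @index[last[1]] = 0
      sift_down(0)
    end
    value
  end

  private

  # Swap two heap slots, keeping the value => index map in sync.
  def swap(i, j)
    @entries[i], @entries[j] = @entries[j], @entries[i]
    @index[@entries[i][1]] = i
    @index[@entries[j][1]] = j
  end

  def sift_up(i)
    while i > 0
      parent = (i - 1) / 2
      break unless @entries[i][0] < @entries[parent][0]
      swap(i, parent)
      i = parent
    end
  end

  def sift_down(i)
    loop do
      smallest = i
      [2 * i + 1, 2 * i + 2].each do |c|
        smallest = c if c < @entries.size && @entries[c][0] < @entries[smallest][0]
      end
      break if smallest == i
      swap(i, smallest)
      i = smallest
    end
  end
end

heap = MappedHeap.new
heap.push(:a, 3.0)
heap.push(:b, 1.0)
heap[:a]       # => 3.0
heap[:a] = 0.5 # rescore :a below :b
heap.pop       # => :a
```

The `swap` helper is the key design point: every move inside the heap array must also update the map, which is the bookkeeping that makes `#[]` and rescoring cheap.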
-     $ gem install d_heap
+ ## Scores

- ## Motivation
+ If a score changes while the object is still in the heap, it will not be
+ re-evaluated again.

- One naive approach to a priority queue is to maintain an array in sorted order.
- This can be very simply implemented in ruby with `Array#bseach_index` +
- `Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
- for insert is `O(n)` because it may need to `memcpy` a significant portion of
- the array.
+ Constraining scores to `Float` gives enormous performance benefits. n.b.
+ very large `Integer` values will lose precision when converted to `Float`. This
+ is compiler and architecture dependent but with gcc on an IA-64 system, `Float`
+ is 64 bits with a 53-bit mantissa, which gives a range of -9,007,199,254,740,991
+ to +9,007,199,254,740,991, which is _not_ enough to store the precise POSIX
+ time since the epoch in nanoseconds. This can be worked around by adding a
+ bias, but probably it's good enough for most usage.

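The 53-bit cutoff and the bias workaround described above are easy to verify in plain ruby (the timestamp below is an arbitrary example, not taken from the gem):

```ruby
# A double's 53-bit mantissa represents integers exactly only up to 2**53.
Float::MANT_DIG                    # => 53
(2**53).to_f == (2**53 + 1).to_f   # => true: the +1 is lost

# POSIX time in nanoseconds (~1.6e18) is far beyond 2**53...
t_ns = 1_600_000_000_123_456_789
t_ns.to_f.to_i == t_ns             # => false: nanosecond precision was lost

# ...but subtracting a coarse "recent epoch" bias brings the offset back
# into the exactly-representable range.
bias = 1_600_000_000_000_000_000
(t_ns - bias).to_f.to_i + bias == t_ns   # => true
```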
- The standard way to implement a priority queue is with a binary heap. Although
- this increases the time complexity for `pop` alone, it reduces the combined time
- compexity for the combined `push` + `pop`. Using a d-ary heap with d > 2
- makes the tree shorter but broader, which reduces to `O(log n / log d)` while
- increasing the comparisons needed by sift-down to `O(d log n/ log d)`.
+ _Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
+ added back in a future version,_ if (and only if) _it can be done without
+ impacting the speed of numeric comparisons._

- However, I was disappointed when my best ruby heap implementation ran much more
- slowly than the naive approach—even for heaps containing ten thousand items.
- Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
- has _much_ higher overhead. And a _d_-heap needs `d + 1` times more comparisons
- for each push + pop than `bsearch` + `insert`.
+ ## Thread safety

- Additionally, when researching how other systems handle their scheduling, I was
- inspired by reading go's "timer.go" implementation to experiment with a 4-ary
- heap instead of the traditional binary heap.
+ `DHeap` is _not_ thread-safe, so concurrent access from multiple threads needs to
+ take precautions such as locking access behind a mutex.

  ## Benchmarks

- _See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
- `docs/profile.txt` for much more detail or updated results. These benchmarks
- were measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._
-
- These benchmarks use very simple implementations for a pure-ruby heap and an
- array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
- comparison, I also compare to the [priority_queue_cxx gem] which uses the [C++
- STL priority_queue], and another naive implementation that uses `Array#min` and
- `Array#delete_at` with an unsorted array.
-
- In these benchmarks, `DHeap` runs faster than all other implementations for
- every scenario and every value of N, although the difference is usually more
- noticable at higher values of N. The pure ruby heap implementation is
- competitive for `push` alone at every value of N, but is significantly slower
- than bsearch + insert for push + pop, until N is _very_ large (somewhere between
- 10k and 100k)!
+ _See full benchmark output in subdirs of `benchmarks`. See there for updated
+ results. These benchmarks were measured with an Intel Core i7-1065G7 8x3.9GHz
+ with d_heap v0.5.0 and ruby 2.7.2 without MJIT enabled._
+
+ ### Implementations
+
+ * **findmin** -
+   A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
+   very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
+   `Array#delete_at(min_index)`. Push + pop is still fast for `n < 100`, but
+   unusably slow for `n > 1000`.
+
+ * **bsearch** -
+   A simple implementation with a slow `O(n)` push using `Array#bsearch` +
+   `Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
+   `Array#pop`. It is still relatively fast for `n < 10000`, but its linear
+   time complexity really destroys it after that.
+
+ * **rb_heap** -
+   A pure ruby binary min-heap that has been tuned for performance by making
+   few method calls and allocating and assigning as few variables as possible.
+   It runs in `O(log n)` for both push and pop, although pop is slower than
+   push by a constant factor. Its much higher constant factors make it lose
+   to `bsearch` push + pop for `n < 10000` but it holds steady with very little
+   slowdown even with `n > 10000000`.
+
+ * **c++ stl** -
+   A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
+   priority_queue]. The wrapper is simply to provide compatibility with the
+   other benchmarked implementations, but it should be possible to speed this
+   up a little bit by benchmarking the `priority_queue_cxx` API directly. It
+   has the same time complexity as rb_heap but its much lower constant
+   factors allow it to easily outperform `bsearch`.
+
+ * **c_dheap** -
+   A {DHeap} instance with the default `d` value of `4`. It has the same time
+   complexity as `rb_heap` and `c++ stl`, but is faster than both in every
+   benchmarked scenario.

  [priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
  [C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/

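For reference, the "bsearch" strategy described above can be sketched in a few lines of plain ruby. This is an illustration of the strategy's shape, not the gem's actual benchmark code; the `push`/`pop` lambdas are hypothetical names. Keeping the Array sorted in *descending* score order makes pop a cheap `Array#pop`, while push pays `O(n)` for the `memmove` inside `Array#insert`:

```ruby
sorted = [] # [score, value] pairs, kept sorted with the highest score first

# O(log n) search + O(n) memmove to insert at the right position.
push = lambda do |value, score|
  i = sorted.bsearch_index { |(s, _)| s <= score } || sorted.size
  sorted.insert(i, [score, value])
end

# O(1): the minimum score is always the last element.
pop = lambda do
  entry = sorted.pop
  entry && entry[1]
end

push.call(:b, 2.0)
push.call(:a, 1.0)
push.call(:c, 3.0)
pop.call # => :a  (minimum score pops first)
pop.call # => :b
```

`bsearch_index` runs in find-minimum mode here: the array reads as a run of `false` results followed by `true` results, and it returns the first `true`, which is exactly the insertion point.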
- Three different scenarios are measured:
+ ### Scenarios

- ### push N items onto an empty heap
+ Each benchmark increases N exponentially, either by √10 or approximating it
+ (alternating between x3 and x3.333) in order to simplify keeping loop counts
+ evenly divisible by N.

- ...but never pop (clearing between each set of pushes).
+ #### push N items

- ![bar graph for push_n_pop_n benchmarks](./images/push_n.png)
+ This measures the _average time per insert_ to create a queue of size N
+ (clearing the queue once it reaches that size). Use cases which push (or
+ decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
+ algorithm] when the graph has more edges than vertices, may want to pay more
+ attention to this benchmark.

- ### push N items onto an empty heap then pop all N
+ ![bar graph for push_n benchmarks](./images/push_n.png)

- Although this could be used for heap sort, we're unlikely to choose heap sort
- over Ruby's quick sort implementation. I'm using this scenario to represent
- the amortized cost of creating a heap and (eventually) draining it.
+     == push N (N=100) ==========================================================
+     push N (c_dheap): 10522662.6 i/s
+     push N (findmin):  9980622.3 i/s - 1.05x slower
+     push N (c++ stl):  7991608.3 i/s - 1.32x slower
+     push N (rb_heap):  4607849.4 i/s - 2.28x slower
+     push N (bsearch):  2769106.2 i/s - 3.80x slower
+     == push N (N=10,000) =======================================================
+     push N (c_dheap): 10444588.3 i/s
+     push N (findmin): 10191797.4 i/s - 1.02x slower
+     push N (c++ stl):  8210895.4 i/s - 1.27x slower
+     push N (rb_heap):  4369252.9 i/s - 2.39x slower
+     push N (bsearch):  1213580.4 i/s - 8.61x slower
+     == push N (N=1,000,000) ====================================================
+     push N (c_dheap): 10342183.7 i/s
+     push N (findmin):  9963898.8 i/s - 1.04x slower
+     push N (c++ stl):  7891924.8 i/s - 1.31x slower
+     push N (rb_heap):  4350116.0 i/s - 2.38x slower
+
+ All three heap implementations have little to no perceptible slowdown for `N >
+ 100`. But `DHeap` runs faster than `Array#push` to an unsorted array (findmin)!
+
+ #### push then pop N items
+
+ This measures the _average_ for a push **or** a pop, filling up a queue with N
+ items and then draining that queue until empty. It represents the amortized
+ cost of balanced pushes and pops to fill a heap and drain it.

  ![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)

- ### push and pop on a heap with N values
-
- Repeatedly push and pop while keeping a stable heap size. This is a _very
- simplistic_ approximation for how most scheduler/timer heaps might be used.
- Usually when a timer fires it will be quickly replaced by a new timer, and the
- overall count of timers will remain roughly stable.
+     == push N then pop N (N=100) ===============================================
+     push N + pop N (c_dheap): 10954469.2 i/s
+     push N + pop N (c++ stl):  9317140.2 i/s - 1.18x slower
+     push N + pop N (bsearch):  4808770.2 i/s - 2.28x slower
+     push N + pop N (findmin):  4321411.9 i/s - 2.53x slower
+     push N + pop N (rb_heap):  2467417.0 i/s - 4.44x slower
+     == push N then pop N (N=10,000) ============================================
+     push N + pop N (c_dheap):  8083962.7 i/s
+     push N + pop N (c++ stl):  7365661.8 i/s - 1.10x slower
+     push N + pop N (bsearch):  2257047.9 i/s - 3.58x slower
+     push N + pop N (rb_heap):  1439204.3 i/s - 5.62x slower
+     == push N then pop N (N=1,000,000) =========================================
+     push N + pop N (c++ stl):  5274657.5 i/s
+     push N + pop N (c_dheap):  4731117.9 i/s - 1.11x slower
+     push N + pop N (rb_heap):   976688.6 i/s - 5.40x slower
+
+ At N=100 findmin still beats a pure-ruby heap. But above that it slows down too
+ much to be useful. At N=10k, bsearch still beats a pure ruby heap, but above
+ 30k it slows down too much to be useful. `DHeap` consistently runs 4.5-5.5x
+ faster than the pure ruby heap.
+
+ #### push & pop on N-item heap
+
+ This measures the combined time to push once and pop once, which is done
+ repeatedly while keeping a stable heap size of N. It's an approximation for
+ scenarios which reach a stable size and then plateau with balanced pushes and
+ pops. E.g. timers and timeouts will often reschedule themselves or replace
+ themselves with new timers or timeouts, maintaining a roughly stable total count
+ of timers.

  ![bar graph for push_pop benchmarks](./images/push_pop.png)

- ### numbers
-
- Even for very small N values the benchmark implementations, `DHeap` runs faster
- than the other implementations for each scenario, although the difference is
- still relatively small. The pure ruby binary heap is 2x or more slower than
- bsearch + insert for common push/pop scenario.
-
-     == push N (N=5) ==========================================================
-     push N (c_dheap): 1969700.7 i/s
-     push N (c++ stl): 1049738.1 i/s - 1.88x slower
-     push N (rb_heap):  928435.2 i/s - 2.12x slower
-     push N (bsearch):  921060.0 i/s - 2.14x slower
-
-     == push N then pop N (N=5) ===============================================
-     push N + pop N (c_dheap): 1375805.0 i/s
-     push N + pop N (c++ stl): 1134997.5 i/s - 1.21x slower
-     push N + pop N (findmin):  862913.1 i/s - 1.59x slower
-     push N + pop N (bsearch):  762887.1 i/s - 1.80x slower
-     push N + pop N (rb_heap):  506890.4 i/s - 2.71x slower
-
-     == Push/pop with pre-filled queue of size=N (N=5) ========================
-     push + pop (c_dheap): 9044435.5 i/s
-     push + pop (c++ stl): 7534583.4 i/s - 1.20x slower
-     push + pop (findmin): 5026155.1 i/s - 1.80x slower
-     push + pop (bsearch): 4300260.0 i/s - 2.10x slower
-     push + pop (rb_heap): 2299499.7 i/s - 3.93x slower
-
- By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
- scenarios, but the pure ruby heap is still slower than every other
- implementation—even resorting the array after every `#push`—in any scenario that
- uses `#pop`.
-
-     == push N (N=21) =========================================================
-     push N (c_dheap): 464231.4 i/s
-     push N (c++ stl): 305546.7 i/s - 1.52x slower
-     push N (rb_heap): 202803.7 i/s - 2.29x slower
-     push N (bsearch): 168678.7 i/s - 2.75x slower
-
-     == push N then pop N (N=21) ==============================================
-     push N + pop N (c_dheap): 298350.3 i/s
-     push N + pop N (c++ stl): 252227.1 i/s - 1.18x slower
-     push N + pop N (findmin): 161998.7 i/s - 1.84x slower
-     push N + pop N (bsearch): 143432.3 i/s - 2.08x slower
-     push N + pop N (rb_heap):  79622.1 i/s - 3.75x slower
-
-     == Push/pop with pre-filled queue of size=N (N=21) =======================
-     push + pop (c_dheap): 8855093.4 i/s
-     push + pop (c++ stl): 7223079.5 i/s - 1.23x slower
-     push + pop (findmin): 4542913.7 i/s - 1.95x slower
-     push + pop (bsearch): 3461802.4 i/s - 2.56x slower
-     push + pop (rb_heap): 1845488.7 i/s - 4.80x slower
-
- At higher values of N, a heaps logarithmic growth leads to only a little
- slowdown of `#push`, while insert's linear growth causes it to run noticably
- slower and slower. But because `#pop` is `O(1)` for a sorted array and `O(d log
- n / log d)` for a heap, scenarios involving both `#push` and `#pop` remain
- relatively close, and bsearch + insert still runs faster than a pure ruby heap,
- even up to queues with 10k items. But as queue size increases beyond than that,
- the linear time compexity to keep a sorted array dominates.
-
-     == push + pop (rb_heap)
-     queue size =    10000:  736618.2 i/s
-     queue size =    25000:  670186.8 i/s - 1.10x slower
-     queue size =    50000:  618156.7 i/s - 1.19x slower
-     queue size =   100000:  579250.7 i/s - 1.27x slower
-     queue size =   250000:  572795.0 i/s - 1.29x slower
-     queue size =   500000:  543648.3 i/s - 1.35x slower
-     queue size =  1000000:  513523.4 i/s - 1.43x slower
-     queue size =  2500000:  460848.9 i/s - 1.60x slower
-     queue size =  5000000:  445234.5 i/s - 1.65x slower
-     queue size = 10000000:  423119.0 i/s - 1.74x slower
-
-     == push + pop (bsearch)
-     queue size =    10000:  786334.2 i/s
-     queue size =    25000:  364963.8 i/s - 2.15x slower
-     queue size =    50000:  200520.6 i/s - 3.92x slower
-     queue size =   100000:   88607.0 i/s - 8.87x slower
-     queue size =   250000:   34530.5 i/s - 22.77x slower
-     queue size =   500000:   17965.4 i/s - 43.77x slower
-     queue size =  1000000:    5638.7 i/s - 139.45x slower
-     queue size =  2500000:    1302.0 i/s - 603.93x slower
-     queue size =  5000000:     592.0 i/s - 1328.25x slower
-     queue size = 10000000:     288.8 i/s - 2722.66x slower
-
-     == push + pop (c_dheap)
-     queue size =    10000: 7311366.6 i/s
-     queue size =    50000: 6737824.5 i/s - 1.09x slower
-     queue size =    25000: 6407340.6 i/s - 1.14x slower
-     queue size =   100000: 6254396.3 i/s - 1.17x slower
-     queue size =   250000: 5917684.5 i/s - 1.24x slower
-     queue size =   500000: 5126307.6 i/s - 1.43x slower
-     queue size =  1000000: 4403494.1 i/s - 1.66x slower
-     queue size =  2500000: 3304088.2 i/s - 2.21x slower
-     queue size =  5000000: 2664897.7 i/s - 2.74x slower
-     queue size = 10000000: 2137927.6 i/s - 3.42x slower
-
- ## Analysis
-
- ### Time complexity
-
- There are two fundamental heap operations: sift-up (used by push) and sift-down
- (used by pop).
-
- * A _d_-ary heap will have `log n / log d` layers, so both sift operations can
-   perform as many as `log n / log d` writes, when a member sifts the entire
-   length of the tree.
- * Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
- * Sift-down makes d comparions per layer, so pop runs in `O(d log n / log d)`.
-
- So, in the simplest case of running balanced push/pop while maintaining the same
- heap size, `(1 + d) log n / log d` comparisons are made. In the worst case,
- when every sift traverses every layer of the tree, `d=4` requires the fewest
- comparisons for combined insert and delete:
-
- * (1 + 2) lg n / lg d ≈ 4.328085 lg n
- * (1 + 3) lg n / lg d ≈ 3.640957 lg n
- * (1 + 4) lg n / lg d ≈ 3.606738 lg n
- * (1 + 5) lg n / lg d ≈ 3.728010 lg n
- * (1 + 6) lg n / lg d ≈ 3.906774 lg n
- * (1 + 7) lg n / lg d ≈ 4.111187 lg n
- * (1 + 8) lg n / lg d ≈ 4.328085 lg n
- * (1 + 9) lg n / lg d ≈ 4.551196 lg n
- * (1 + 10) lg n / lg d ≈ 4.777239 lg n
+     push + pop (findmin)
+     N       10:  5480288.0 i/s
+     N      100:  2595178.8 i/s - 2.11x slower
+     N     1000:   224813.9 i/s - 24.38x slower
+     N    10000:    12630.7 i/s - 433.89x slower
+     N   100000:     1097.3 i/s - 4994.31x slower
+     N  1000000:      135.9 i/s - 40313.05x slower
+     N 10000000:       12.9 i/s - 425838.01x slower
+
+     push + pop (bsearch)
+     N       10:  3931408.4 i/s
+     N      100:  2904181.8 i/s - 1.35x slower
+     N     1000:  2203157.1 i/s - 1.78x slower
+     N    10000:  1209584.9 i/s - 3.25x slower
+     N   100000:    81121.4 i/s - 48.46x slower
+     N  1000000:     5356.0 i/s - 734.02x slower
+     N 10000000:      281.9 i/s - 13946.33x slower
+
+     push + pop (rb_heap)
+     N       10:  2325816.5 i/s
+     N      100:  1603540.3 i/s - 1.45x slower
+     N     1000:  1262515.2 i/s - 1.84x slower
+     N    10000:   950389.3 i/s - 2.45x slower
+     N   100000:   732548.8 i/s - 3.17x slower
+     N  1000000:   673577.8 i/s - 3.45x slower
+     N 10000000:   467512.3 i/s - 4.97x slower
+
+     push + pop (c++ stl)
+     N       10:  7706818.6 i/s - 1.01x slower
+     N      100:  7393127.3 i/s - 1.05x slower
+     N     1000:  6898781.3 i/s - 1.13x slower
+     N    10000:  5731130.5 i/s - 1.36x slower
+     N   100000:  4842393.2 i/s - 1.60x slower
+     N  1000000:  4170936.4 i/s - 1.86x slower
+     N 10000000:  2737146.6 i/s - 2.84x slower
+
+     push + pop (c_dheap)
+     N       10: 10196454.1 i/s
+     N      100:  9668679.8 i/s - 1.05x slower
+     N     1000:  9339557.0 i/s - 1.09x slower
+     N    10000:  8045103.0 i/s - 1.27x slower
+     N   100000:  7150276.7 i/s - 1.43x slower
+     N  1000000:  6490261.6 i/s - 1.57x slower
+     N 10000000:  3734856.5 i/s - 2.73x slower
+
+ ## Time complexity analysis
+
+ There are two fundamental heap operations: sift-up (used by push or decrease
+ score) and sift-down (used by pop or delete or increase score). Each sift
+ bubbles an item to its correct location in the tree.
+
+ * A _d_-ary heap has `log n / log d` layers, so either sift performs as many as
+   `log n / log d` writes, when a member sifts the entire length of the tree.
+ * Sift-up needs one comparison per layer: `O(log n / log d)`.
+ * Sift-down needs d comparisons per layer: `O(d log n / log d)`.
+
+ So, in the case of a balanced push then pop, as many as `(1 + d) log n / log d`
+ comparisons are made. Looking only at this worst case combo, `d=4` requires the
+ fewest comparisons for a combined push and pop:
+
+ * `(1 + 2) log n / log d ≈ 4.328085 log n`
+ * `(1 + 3) log n / log d ≈ 3.640957 log n`
+ * `(1 + 4) log n / log d ≈ 3.606738 log n`
+ * `(1 + 5) log n / log d ≈ 3.728010 log n`
+ * `(1 + 6) log n / log d ≈ 3.906774 log n`
+ * `(1 + 7) log n / log d ≈ 4.111187 log n`
+ * `(1 + 8) log n / log d ≈ 4.328085 log n`
+ * `(1 + 9) log n / log d ≈ 4.551196 log n`
+ * `(1 + 10) log n / log d ≈ 4.777239 log n`
  * etc...

  See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.

- ### Space complexity
+ However, what this simple count of comparisons misses is the extent to which
+ modern compilers can optimize code (e.g. by unrolling the comparison loop to
+ execute on registers) and more importantly how good modern processors are at
+ pipelined speculative execution using branch prediction, etc. Benchmarks should
+ be run on the _exact same_ hardware platform that production code will use,
+ as the sift-down operation is especially sensitive to good pipelining.

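The per-`d` constants in the list above can be reproduced directly: the worst-case cost of a balanced push plus pop is `(1 + d) / log(d)` comparisons per `log n`, and the minimum over integer `d` falls at 4 (the `cost` lambda is an illustrative name):

```ruby
# Worst-case comparisons for one push + one pop on a d-ary heap,
# expressed as a multiple of log n: (1 + d) / log(d).
cost = ->(d) { (1 + d) / Math.log(d) }

(2..10).each { |d| puts format("d = %2d: %.6f log n", d, cost.call(d)) }

cost.call(2).round(6)               # => 4.328085
cost.call(4).round(6)               # => 3.606738
(2..10).min_by { |d| cost.call(d) } # => 4
```

Note that this counts only comparisons; as the paragraph above says, cache behavior and pipelining shift the best `d` in practice, which is why benchmarking led the gem to raise its default.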
349
- Space usage is linear, regardless of d. However higher d values may
350
- provide better cache locality. Because the heap is a complete binary tree, the
351
- elements can be stored in an array, without the need for tree or list pointers.
368
+ ## Comparison performance
352
369
 
353
- Ruby can compare Numeric values _much_ faster than other ruby objects, even if
354
- those objects simply delegate comparison to internal Numeric values. And it is
355
- often useful to use external scores for otherwise uncomparable values. So
356
- `DHeap` uses twice as many entries (one for score and one for value)
357
- as an array which only stores values.
370
+ It is often useful to use external scores for otherwise uncomparable values.
371
+ And casting an item or score (via `to_f`) can also be time-consuming. So
372
+ `DHeap` evaluates and stores scores at the time of insertion, and they will be
373
+ compared directly without needing any further lookup.
358
374
 
359
- ## Thread safety
375
+ Numeric values can be compared _much_ faster than other ruby objects, even if
376
+ those objects simply delegate comparison to internal Numeric values.
377
+ Additionally, native C integers or floats can be compared _much_ faster than
378
+ ruby `Numeric` objects. So scores are converted to Float and stored as
379
+ `double`, which is 64 bits on an [LP64 64-bit system].
360
380
 
361
- `DHeap` is _not_ thread-safe, so concurrent access from multiple threads need to
362
- take precautions such as locking access behind a mutex.
381
+ [LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
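As a rough illustration of that comparison cost (a standalone sketch, not part of d_heap's benchmark suite), a hypothetical `Task` struct that merely delegates `<=>` to an internal Float sorts much slower than the raw Floats do:

```ruby
require "benchmark"

# Hypothetical wrapper that delegates comparison to an internal Float,
# illustrating why DHeap converts scores to native doubles up front.
Task = Struct.new(:at) do
  include Comparable
  def <=>(other)
    at <=> other.at
  end
end

floats = Array.new(50_000) { rand }
tasks  = floats.map { |f| Task.new(f) }

Benchmark.bm(13) do |x|
  x.report("Float sort:")  { floats.sort }
  x.report("Object sort:") { tasks.sort }
end
```

Both sorts produce the same ordering; only the comparison cost differs.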
363
382
 
364
383
  ## Alternative data structures
365
384
 
366
385
  As always, you should run benchmarks with your expected scenarios to determine
367
386
  which is best for your application.
368
387
 
369
- Depending on your use-case, maintaining a sorted `Array` using `#bsearch_index`
370
- and `#insert` might be just fine! Even `min` plus `delete` with an unsorted
371
- array can be very fast on small queues. Although insertions run with `O(n)`,
372
- `memcpy` is so fast on modern hardware that your dataset might not be large
373
- enough for it to matter.
388
+ Depending on your use-case, maintaining a sorted `Array` with `#bsearch_index`
389
+ and `#insert` might be just fine! It only takes a couple of lines of code and
390
+ is probably "Fast Enough".
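For example, a minimal sorted-Array queue might look like this (a sketch, not part of d_heap):

```ruby
# A minimal sorted-Array "priority queue": O(log n) search plus O(n) insert,
# but the O(n) part is a fast memmove on modern hardware.
class SortedQueue
  def initialize
    @items = []
  end

  def push(item)
    index = @items.bsearch_index { |x| x >= item } || @items.size
    @items.insert(index, item)
    self
  end

  def pop
    @items.shift
  end
end

queue = SortedQueue.new
queue.push(5).push(1).push(3)
queue.pop # => 1
```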
374
391
 
375
- More complex heap varients, e.g. [Fibonacci heap], allow heaps to be split and
392
+ More complex heap variants, e.g. the [Fibonacci heap], allow heaps to be split and
376
393
  merged which gives some graph algorithms a lower amortized time complexity. But
377
394
  in practice, _d_-ary heaps have much lower overhead and often run faster.
378
395
 
@@ -385,25 +402,60 @@ of values in it, then you may want to use a self-balancing binary search tree
385
402
  [red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
386
403
  [skip-list]: https://en.wikipedia.org/wiki/Skip_list
387
404
 
388
- [Hashed and Heirarchical Timing Wheels][timing wheels] (or some variant in that
389
- family of data structures) can be constructed to have effectively `O(1)` running
390
- time in most cases. Although the implementation for that data structure is more
405
+ [Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in the
406
+ timing wheel family of data structures) can have effectively `O(1)` running time
407
+ in most cases. Although the implementation for that data structure is more
391
408
  complex than a heap, it may be necessary for enormous values of N.
392
409
 
393
- [timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
410
+ [timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
411
+
412
+ ## Supported platforms
413
+
414
+ See the [CI workflow] for all supported platforms.
415
+
416
+ [CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI
417
+
418
+ `d_heap` may contain bugs on 32-bit systems. Currently, `d_heap` is only tested
419
+ on 64-bit x86 CRuby 2.4-3.0 under Linux and macOS.
420
+
421
+ ## Caveats and TODOs (PRs welcome!)
422
+
423
+ A `DHeap`'s internal array grows but never shrinks. At the very least, there
424
+ should be a `#compact` or `#shrink` method, and the array should also shrink
425
+ during `#freeze`. It might make sense to automatically shrink (to no more
426
+ than 2x the current size) during GC's compact phase.
427
+
428
+ Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F. This
429
+ might lead to a different default `d` value (maybe 16 or 24?).
430
+
431
+ Shrink scores to 64-bits: either store a type flag with each entry (this could
432
+ be used to support non-numeric scores) or require users to choose between
433
+ `Integer` or `Float` at construction time. Reducing memory usage should also
434
+ improve speed for very large heaps.
435
+
436
+ Patches to support JRuby, Rubinius, 32-bit systems, or any other platforms are
437
+ welcome! JRuby and TruffleRuby ought to be able to use [Java's PriorityQueue].
438
+ Other platforms could fall back on the (slower) pure ruby implementation used
439
+ by the benchmarks.
440
+
441
+ [Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html
442
+
443
+ Allow a max-heap (or other configurations of the compare function). This can
444
+ be implemented simply by negating the scores.
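A sketch of that score-reversal trick, using a plain sorted Array as a stand-in for the heap (hypothetical helper names):

```ruby
# Max-heap via a min-ordered structure: negate scores going in, and the
# "smallest" entry is then the largest original score. Sketch only; a
# sorted Array stands in for DHeap's internal ordering here.
entries = []
push    = ->(score, value) { entries << [-score, value] }
pop_max = lambda do
  entries.sort! # stand-in for the heap maintaining min-order
  _negated_score, value = entries.shift
  value
end

push.call(3.5, :a)
push.call(9.9, :b)
push.call(1.2, :c)
pop_max.call # => :b  (9.9 was the largest score)
```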
394
445
 
395
- ## TODOs...
446
+ _Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the
447
+ simplicity and speed of the basic numeric use case can be preserved.
396
448
 
397
- _TODO:_ Also ~~included is~~ _will include_ `DHeap::Map`, which augments the
398
- basic heap with an internal `Hash`, which maps objects to their position in the
399
- heap. This enforces a uniqueness constraint on items on the heap, and also
400
- allows items to be more efficiently deleted or adjusted. However maintaining
401
- the hash does lead to a small drop in normal `#push` and `#pop` performance.
449
+ Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current time"
450
+ and move all values below that time onto an Array.
402
451
 
403
- _TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
404
- features that are loosely inspired by go's timers. e.g: It lazily sifts its
405
- heap after deletion and adjustments, to achieve faster average runtime for *add*
406
- and *cancel* operations.
452
+ Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)`, which could contain
453
+ some features loosely inspired by Go's timers. Go lazily sifts its heap after
454
+ deletion or adjustments, to achieve a faster amortized runtime. There's no
455
+ need to actually remove a deleted item from the heap if you re-add it before
456
+ it's next evaluated. A similar trick is to store "far away" values in an
457
+ internal `Hash`, assuming many will be deleted before they rise to the top.
458
+ This could naturally evolve into a [timing wheel] variant.
407
459
 
408
460
  ## Development
409
461