RubyGems - d_heap - Versions diffs - 0.3.0 → 0.4.0 - Mend

d_heap 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29)

checksums.yaml +4 -4
data/.rubocop.yml +30 -1
data/CHANGELOG.md +42 -0
data/Gemfile +1 -0
data/Gemfile.lock +11 -10
data/README.md +353 -121
data/benchmarks/push_n.yml +28 -0
data/benchmarks/push_n_pop_n.yml +31 -0
data/benchmarks/push_pop.yml +24 -0
data/bin/bench_n +7 -0
data/bin/benchmark-driver +29 -0
data/bin/benchmarks +10 -0
data/bin/profile +10 -0
data/d_heap.gemspec +2 -1
data/docs/benchmarks-2.txt +52 -0
data/docs/benchmarks.txt +443 -0
data/docs/profile.txt +392 -0
data/ext/d_heap/d_heap.c +428 -150
data/ext/d_heap/d_heap.h +6 -3
data/ext/d_heap/extconf.rb +8 -3
data/lib/benchmark_driver/runner/ips_zero_fail.rb +120 -0
data/lib/d_heap.rb +5 -3
data/lib/d_heap/benchmarks.rb +111 -0
data/lib/d_heap/benchmarks/benchmarker.rb +113 -0
data/lib/d_heap/benchmarks/implementations.rb +168 -0
data/lib/d_heap/benchmarks/profiler.rb +71 -0
data/lib/d_heap/benchmarks/rspec_matchers.rb +374 -0
data/lib/d_heap/version.rb +1 -1
metadata +34 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5a20fe814944fa8945bc2cc0ce87f810c8bf8d21102b8c454ae5d639f0548576
-  data.tar.gz: 65a1af345ae84c7f6e5da8af89a00730ffc4cbdb6bb2cac59461c3728eb0ad28
+  metadata.gz: 413c0a93e2c3cbdbb86ee433df47a310034d453e441a150d8317dc055b4a9a90
+  data.tar.gz: 4bf67447021da03b07da7f44bcf97a66f13fa42f6f67bcfe9a49d0866c8b8167
 SHA512:
-  metadata.gz: 27c240634013397033925ee258de08135587c31c7a046a902c028842680dadec20e2ccc0f76570e6f7db34d2292e5dae2ae7d17449436f890498697de4352c91
-  data.tar.gz: 8eb2cba120747cc7788c42f264d2109ef1afee8a28caa7effb8bcb1ff36ed7c99a0556b48b7d9e9479f6e8be8945eea3bf1ddb7f608c13c63252684643154179
+  metadata.gz: 5e55bf53c1062686e0863fb9c3b09f3c2b8b936b0cf83985092e1e906b0b24f40e02a42eada048dcea732a60ec4e3695bb861943b424a8cd3152b227abad8a4e
+  data.tar.gz: e021616d6dcdcec943fec11783f2147a2d175aa3a0caf668c2339e05795e32475cdf7be90201c38d7108c74087e22140a9da9d3f211d7f0335e3c173ae83893b

data/.rubocop.yml CHANGED

@@ -6,6 +6,7 @@ AllCops:
   TargetRubyVersion: 2.5
   NewCops: disable
   Exclude:
+    - bin/benchmark-driver
     - bin/rake
     - bin/rspec
     - bin/rubocop
@@ -106,26 +107,49 @@ Naming/RescuedExceptionsVariableName: { Enabled: false }
 ###########################################################################
 # Matrics:
+Metrics/CyclomaticComplexity:
+  Max: 10
 # Although it may be better to split specs into multiple files...?
 Metrics/BlockLength:
   Exclude:
     - "spec/**/*_spec.rb"
+  CountAsOne:
+    - array
+    - hash
+    - heredoc
+Metrics/ClassLength:
+  Max: 200
+  CountAsOne:
+    - array
+    - hash
+    - heredoc
 ###########################################################################
 # Style...
 Style/AccessorGrouping:        { Enabled: false }
 Style/AsciiComments:           { Enabled: false } # 👮 can't stop our 🎉🥳🎊🥳!
+Style/ClassAndModuleChildren:  { Enabled: false }
 Style/EachWithObject:          { Enabled: false }
 Style/FormatStringToken:       { Enabled: false }
 Style/FloatDivision:           { Enabled: false }
+Style/IfUnlessModifier:        { Enabled: false }
+Style/IfWithSemicolon:         { Enabled: false }
 Style/Lambda:                  { Enabled: false }
 Style/LineEndConcatenation:    { Enabled: false }
 Style/MixinGrouping:           { Enabled: false }
+Style/MultilineBlockChain:     { Enabled: false }
 Style/PerlBackrefs:            { Enabled: false } # use occasionally/sparingly
 Style/RescueStandardError:     { Enabled: false }
+Style/Semicolon:               { Enabled: false }
 Style/SingleLineMethods:       { Enabled: false }
 Style/StabbyLambdaParentheses: { Enabled: false }
+Style/WhenThen               : { Enabled: false }
+# I require trailing commas elsewhere, but these are optional
+Style/TrailingCommaInArguments: { Enabled: false }
 # If rubocop had an option to only enforce this on constants and literals (e.g.
 # strings, regexp, range), I'd agree.
@@ -149,7 +173,9 @@ Style/BlockDelimiters:
   EnforcedStyle: semantic
   AllowBracesOnProceduralOneLiners: true
   IgnoredMethods:
-    - expect
+    - expect  # rspec
+    - profile # ruby-prof
+    - ips     # benchmark-ips
 Style/FormatString:
@@ -168,3 +194,6 @@ Style/TrailingCommaInHashLiteral:
 Style/TrailingCommaInArrayLiteral:
   EnforcedStyleForMultiline: consistent_comma
+Style/YodaCondition:
+  EnforcedStyle: forbid_for_equality_operators_only

data/CHANGELOG.md ADDED

@@ -0,0 +1,42 @@
+## Current/Unreleased
+## Release v0.4.0 (2021-01-12)
+* ⚡️ Big performance improvements, by using C `long double *cscores` array
+* ⚡️ Scores must be `Integer` in `-uint64..+uint64`, or convertible to `Float`
+* ⚡️ many many (so many) updates to benchmarks
+* ✨ Added `DHeap#clear`
+* 🐛 Fixed `DHeap#initialize_copy` and `#freeze`
+* ♻️  significant refactoring
+* 📝 Updated docs (mostly adding benchmarks)
+## Release v0.3.0 (2020-12-29)
+* ⚡️ Big performance improvements, by converting to a `T_DATA` struct.
+* ♻️  Major refactoring/rewriting of dheap.c
+* ✅ Added benchmark specs
+* 🔥 Removed class methods that operated directly on an array.  They weren't
+    compatible with the performance improvements.
+## Release v0.2.2 (2020-12-27)
+* 🐛 fix `optimized_cmp`, avoiding internal symbols
+* 📝 Update documentation
+* 💚 fix macos CI
+* ➕ Add rubocop 👮🎨
+## Release v0.2.1 (2020-12-26)
+* ⬆️  Upgraded rake (and bundler) to support ruby 3.0
+## Release v0.2.0 (2020-12-24)
+* ✨ Add ability to push separate score and value
+* ⚡️ Big performance gain, by storing scores separately and using ruby's
+  internal `OPTIMIZED_CMP` instead of always directly calling `<=>`
+## Release v0.1.0 (2020-12-22)
+🎉 initial release 🎉
+* ✨ Add basic d-ary Heap implementation
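The v0.4.0 entries above tie the speedup to a C `long double` score array and limit `Integer` scores to magnitudes that fit in 64 bits. One plausible reason for `long double` rather than `double` (an assumption; the changelog does not say) is precision: Ruby's `Float` is a C `double`, whose significand is too narrow to hold every 64-bit integer exactly. The sketch below shows that limit in plain Ruby and does not load d_heap.

```ruby
# Ruby's Float wraps a C double; on IEEE 754 platforms it has a 53-bit significand.
significand_bits = Float::MANT_DIG

# Integers up to 2**53 survive a round trip through Float unchanged...
exact = 2**53
exact_roundtrip = exact.to_f.to_i == exact

# ...but 2**53 + 1 has no Float representation and lands on a neighbouring value.
inexact = 2**53 + 1
inexact_roundtrip = inexact.to_f.to_i == inexact

# The largest unsigned 64-bit integer needs 64 significant bits, so it is not exact either.
uint64_max = 2**64 - 1
uint64_roundtrip = uint64_max.to_f.to_i == uint64_max

puts "Float significand bits:          #{significand_bits}"
puts "2**53 round-trips exactly:       #{exact_roundtrip}"
puts "2**53 + 1 round-trips exactly:   #{inexact_roundtrip}"
puts "2**64 - 1 round-trips exactly:   #{uint64_roundtrip}"
```

Whether a given platform's `long double` is actually wider than `double` depends on the compiler and architecture, so this only illustrates the limit of `double`.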

data/Gemfile CHANGED

@@ -5,6 +5,7 @@ source "https://rubygems.org"
 # Specify your gem's dependencies in d_heap.gemspec
 gemspec
+gem "pry"
 gem "rake", "~> 13.0"
 gem "rake-compiler"
 gem "rspec", "~> 3.10"

data/Gemfile.lock CHANGED

@@ -1,19 +1,22 @@
 PATH
   remote: .
   specs:
-    d_heap (0.3.0)
+    d_heap (0.4.0)
 GEM
   remote: https://rubygems.org/
   specs:
     ast (2.4.1)
-    benchmark-malloc (0.2.0)
-    benchmark-perf (0.6.0)
-    benchmark-trend (0.4.0)
+    benchmark_driver (0.15.16)
+    coderay (1.1.3)
     diff-lcs (1.4.4)
+    method_source (1.0.0)
     parallel (1.19.2)
     parser (2.7.2.0)
       ast (~> 2.4.1)
+    pry (0.13.1)
+      coderay (~> 1.1)
+      method_source (~> 1.0)
     rainbow (3.0.0)
     rake (13.0.3)
     rake-compiler (1.1.1)
@@ -24,11 +27,6 @@ GEM
       rspec-core (~> 3.10.0)
       rspec-expectations (~> 3.10.0)
       rspec-mocks (~> 3.10.0)
-    rspec-benchmark (0.6.0)
-      benchmark-malloc (~> 0.2)
-      benchmark-perf (~> 0.6)
-      benchmark-trend (~> 0.4)
-      rspec (>= 3.0)
     rspec-core (3.10.0)
       rspec-support (~> 3.10.0)
     rspec-expectations (3.10.0)
@@ -49,6 +47,7 @@ GEM
       unicode-display_width (>= 1.4.0, < 2.0)
     rubocop-ast (1.1.1)
       parser (>= 2.7.1.5)
+    ruby-prof (1.4.2)
     ruby-progressbar (1.10.1)
     unicode-display_width (1.7.0)
@@ -56,12 +55,14 @@ PLATFORMS
   ruby
 DEPENDENCIES
+  benchmark_driver
   d_heap!
+  pry
   rake (~> 13.0)
   rake-compiler
   rspec (~> 3.10)
-  rspec-benchmark
   rubocop (~> 1.0)
+  ruby-prof
 BUNDLED WITH
    2.2.3

data/README.md CHANGED

@@ -1,28 +1,64 @@
 # DHeap
-A fast _d_-ary heap implementation for ruby, useful in priority queues and graph
-algorithms.
+A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
+implemented as a C extension.
+With a regular queue, you expect "FIFO" behavior: first in, first out.  With a
+stack you expect "LIFO": last in first out.  With a priority queue, you push
+elements along with a score and the lowest scored element is the first to be
+popped.  Priority queues are often used in algorithms for e.g. [scheduling] of
+timers or bandwidth management, [Huffman coding], and various graph search
+algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+The _d_-ary heap data structure is a generalization of the [binary heap], in
+which the nodes have _d_ children instead of 2.  This allows for "decrease
+priority" operations to be performed more quickly with the tradeoff of slower
+delete minimum.  Additionally, _d_-ary heaps can have better memory cache
+behavior than binary heaps, allowing them to run more quickly in practice
+despite slower worst-case time complexity. In the worst case, a _d_-ary heap
+requires only `O(log n / log d)` to push, with the tradeoff that pop is `O(d log
+n / log d)`.
+Although you should probably just use the default _d_ value  of `4` (see the
+analysis below), it's always advisable to benchmark your specific use-case.
+[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
-The _d_-ary heap data structure is a generalization of the binary heap, in which
-the nodes have _d_ children instead of 2.  This allows for "decrease priority"
-operations to be performed more quickly with the tradeoff of slower delete
-minimum.  Additionally, _d_-ary heaps can have better memory cache behavior than
-binary heaps, allowing them to run more quickly in practice despite slower
-worst-case time complexity. In the worst case, a _d_-ary heap requires only
-`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+## Usage
-Although you should probably just stick with the default _d_ value  of `4`, it
-may be worthwhile to benchmark your specific scenario.
+The basic API is:
+* `heap << object` adds a value as its own score.
+* `heap.push(score, value)` adds a value with an extrinsic score.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
+* `heap.peek` to view the minimum value without popping it.
-## Usage
+The score must be `Integer` or `Float` or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`).  Constraining scores to
+numeric values gives a 40+% speedup under some benchmarks!
+_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+long long`. _This is architecture dependent, but on an IA-64 system this is 64
+bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615._
-The simplest way to use it is simply with `#push` and `#pop`.  Push takes a
-score and a value, and pop returns the value with the current minimum score.
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
 ```ruby
 require "d_heap"
-heap = DHeap.new # defaults to a 4-ary heap
+Task = Struct.new(:id) # for demonstration
+heap = DHeap.new # defaults to a 4-heap
 # storing [score, value] tuples
 heap.push Time.now + 5*60, Task.new(1)
@@ -31,14 +67,61 @@ heap.push Time.now +   60, Task.new(3)
 heap.push Time.now +    5, Task.new(4)
 # peeking and popping (using last to get the task and ignore the time)
-heap.pop.last  # => Task[4]
-heap.pop.last  # => Task[2]
-heap.peak.last # => Task[3], but don't pop it
-heap.pop.last  # => Task[3]
-heap.pop.last  # => Task[1]
+heap.pop    # => Task[4]
+heap.pop    # => Task[2]
+heap.peek   # => Task[3], but don't pop it from the heap
+heap.pop    # => Task[3]
+heap.pop    # => Task[1]
+heap.empty? # => true
+heap.pop    # => nil
 ```
-Read the `rdoc` for more detailed documentation and examples.
+If your values behave as their own score, by being convertible via
+`Float(value)`, then you can use `#<<` for implicit scoring.  The score should
+not change for as long as the value remains in the heap, since it will not be
+re-evaluated after being pushed.
+```ruby
+heap.clear
+# The score can be derived from the value by using to_f.
+# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+class Event
+  include Comparable
+  attr_reader :time, :payload
+  alias_method :to_time, :time
+  def initialize(time, payload)
+    @time = time.to_time
+    @payload = payload
+    freeze
+  end
+  def to_f
+    time.to_f
+  end
+  def <=>(other)
+    to_f <=> other.to_f
+  end
+end
+heap << comparable_max # sorts last, using <=>
+heap << comparable_min # sorts first, using <=>
+heap << comparable_mid # sorts in the middle, using <=>
+heap.pop    # => comparable_min
+heap.pop    # => comparable_mid
+heap.pop    # => comparable_max
+heap.empty? # => true
+heap.pop    # => nil
+```
+You can also pass a value into `#pop(max)` which will only pop if the minimum
+score is less than or equal to `max`.
+Read the [API documentation] for more detailed documentation and examples.
+[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
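The implicit-scoring example above refers to `comparable_min`, `comparable_mid`, and `comparable_max` without defining them. One way to fill them in is sketched here, assuming a trimmed-down `Event` and using `Array#sort_by` on `#to_f` as a stand-in for the order in which the heap would pop. The timestamps and payload symbols are made up for illustration, and the sketch does not load d_heap.

```ruby
# Trimmed-down version of the README's Event; the score comes from #to_f.
class Event
  attr_reader :time, :payload

  def initialize(time, payload)
    @time = time
    @payload = payload
    freeze # the score must not change while the value sits in the heap
  end

  def to_f
    time.to_f
  end
end

start = Time.at(1_600_000_000) # arbitrary fixed timestamp for the example
comparable_min = Event.new(start + 5,   :soon)
comparable_mid = Event.new(start + 60,  :later)
comparable_max = Event.new(start + 300, :last)

# Stand-in for `heap << event` followed by popping until empty:
# the event with the lowest #to_f comes out first.
expected_pop_order = [comparable_max, comparable_min, comparable_mid].sort_by(&:to_f)
payloads = expected_pop_order.map(&:payload)
puts payloads.inspect
```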
 ## Installation
@@ -58,109 +141,226 @@ Or install it yourself as:
 ## Motivation
-Sometimes you just need a priority queue, right?  With a regular queue, you
-expect "FIFO" behavior: first in, first out.  With a priority queue, you push
-with a score (or your elements are comparable), and you want to be able to
-efficiently pop off the minimum (or maximum) element.
-One obvious approach is to simply maintain an array in sorted order.  And
-ruby's Array class makes it simple to maintain a sorted array by combining
-`#bsearch_index` with `#insert`.  With certain insert/remove workloads that can
-perform very well, but in the worst-case an insert or delete can result in O(n),
-since `#insert` may need to `memcpy` or `memmove` a significant portion of the
-array.
-But the standard way to efficiently and simply solve this problem is using a
-binary heap.  Although it increases the time for `pop`, it converts the
-amortized time per push + pop from `O(n)` to `O(d log n / log d)`.
-I was surprised to find that, at least under certain benchmarks, my pure ruby
-heap implementation was usually slower than inserting into a fully sorted
-array.  While this is a testament to ruby's fine-tuned Array implementationw, a
-heap implementated in C should easily peform faster than `Array#insert`.
-The biggest issue is that it just takes far too much time to call `<=>` from
-ruby code: A sorted array only requires `log n / log 2` comparisons to insert
-and no comparisons to pop.  However a _d_-ary heap requires `log n / log d` to
-insert plus an additional `d log n / log d` to pop.  If your queue contains only
-a few hundred items at once, the overhead of those extra calls to `<=>` is far
-more than occasionally calling `memcpy`.
-It's likely that MJIT will eventually make the C-extension completely
-unnecessary.  This is definitely hotspot code, and the basic ruby implementation
-would work fine, if not for that `<=>` overhead.  Until then... this gem gets
-the job done.
-## TODOs...
-_TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
-~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
-directly handled as a priority queue.  These extension methods are meant to be
-used similarly to how `#bsearch` and `#bsearch_index` might be used.
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-loosely inspired by go's timers.  e.g: It lazily sifts its heap after deletion
-and adjustments, to achieve faster average runtime for *add* and *cancel*
-operations.
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
-features that are loosely inspired by go's timers.  e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.
-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap.  In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers.  And timers are usually canceled before they have a chance to
-run. While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children.  That diminishes the extra comparison overhead
-during sift-down.
-## Benchmarks
-_TODO: put benchmarks here._
+One naive approach to a priority queue is to maintain an array in sorted order.
+This can be very simply implemented using `Array#bsearch_index` + `Array#insert`.
+This can be very fast—`Array#pop` is `O(1)`—but the worst-case for insert is
+`O(n)` because it may need to `memcpy` a significant portion of the array.
+The standard way to implement a priority queue is with a binary heap.  Although
+this increases the time for `pop`, it converts the amortized time per push + pop
+from `O(n)` to `O(d log n / log d)`.
+However, I was surprised to find that—at least for some benchmarks—my pure ruby
+heap implementation was much slower than inserting into and popping from a fully
+sorted array.  The reason for this surprising result: Although it is `O(n)`,
+`memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
+has relatively _much_ larger constant factors.  If your queue contains only a
+few thousand items, the overhead of those extra calls to `<=>` is _far_ more
+than occasionally calling `memcpy`.  In the worst case, a _d_-heap will require
+`d + 1` times more comparisons for each push + pop than a `bsearch` + `insert`
+sorted array.
+Moving the sift-up and sift-down code into C helps some.  But much more helpful
+is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
+called.  I'm hopeful that MJIT will eventually obsolete this C-extension.  JRuby
+or TruffleRuby may already run the pure ruby version at high speed.  This can be
+hotspot code, and the basic ruby implementation should perform well if not for
+the high overhead of `<=>`.
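The sorted-array approach described above can be sketched in a few lines of plain Ruby. This is an illustrative sketch, not the gem's benchmark code: the class name `SortedArrayQueue` and its methods are hypothetical, and it assumes the array is kept in descending score order so that the minimum sits at the end and `Array#pop` removes it in `O(1)`.

```ruby
# Hypothetical sketch of a priority queue backed by a sorted Array.
class SortedArrayQueue
  def initialize
    @entries = [] # [score, value] pairs, highest score first
  end

  # Binary search for the insertion point (O(log n) comparisons), then
  # Array#insert shifts the tail of the array (O(n) in the worst case).
  def push(score, value)
    idx = @entries.bsearch_index { |entry| entry.first <= score } || @entries.size
    @entries.insert(idx, [score, value])
    self
  end

  # The minimum score is always the last entry, so no comparisons are needed.
  def pop
    entry = @entries.pop
    entry && entry.last
  end

  def size
    @entries.size
  end
end

queue = SortedArrayQueue.new
queue.push(5, :late).push(1, :early).push(3, :middle)
popped = [queue.pop, queue.pop, queue.pop, queue.pop]
puts popped.inspect
```

Pushes that land near the end of the array move very little memory, which is why this structure can do well when new scores are usually among the smallest.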
 ## Analysis
 ### Time complexity
-Both sift operations can perform (log[d] n = log n / log d) swaps.
-Swap up performs only a single comparison per swap: O(1).
-Swap down performs as many as d comparions per swap: O(d).
+There are two fundamental heap operations: sift-up (used by push) and sift-down
+(used by pop).
+* Both sift operations can perform as many as `log n / log d` swaps, as the
+  element may sift from the bottom of the tree to the top, or vice versa.
+* Sift-up performs a single comparison per swap: `O(1)`.
+  So pushing a new element is `O(log n / log d)`.
+* Sift-down performs as many as d comparisons per swap: `O(d)`.
+  So popping the min element is `O(d log n / log d)`.
-Inserting an item is O(log n / log d).
-Deleting the root is O(d log n / log d).
+Assuming every inserted element is eventually deleted from the root, d=4
+requires the fewest comparisons for combined insert and delete:
-Assuming every inserted item is eventually deleted from the root, d=4 requires
-the fewest comparisons for combined insert and delete:
- * (1 + 2) lg 2 = 4.328085
- * (1 + 3) lg 3 = 3.640957
- * (1 + 4) lg 4 = 3.606738
- * (1 + 5) lg 5 = 3.728010
- * (1 + 6) lg 6 = 3.906774
- * etc...
+* (1 + 2) lg 2 = 4.328085
+* (1 + 3) lg 3 = 3.640957
+* (1 + 4) lg 4 = 3.606738
+* (1 + 5) lg 5 = 3.728010
+* (1 + 6) lg 6 = 3.906774
+* etc...
 Leaf nodes require no comparisons to shift down, and higher values for d have
 higher percentage of leaf nodes:
- * d=2 has ~50% leaves,
- * d=3 has ~67% leaves,
- * d=4 has ~75% leaves,
- * and so on...
+* d=2 has ~50% leaves,
+* d=3 has ~67% leaves,
+* d=4 has ~75% leaves,
+* and so on...
 See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
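The figures in the comparison list above can be reproduced with a short calculation. Although the README writes the expression as `(1 + d) lg d`, the listed values match `(1 + d) / ln d`: one comparison per level for push plus up to `d` per level for pop, scaled by a tree height proportional to `1 / ln d`. The division and the natural logarithm are inferred from the numbers, and the helper names below are hypothetical.

```ruby
# Relative comparison cost of one push plus one pop for a d-ary heap:
# (1 + d) comparisons per level, times a height proportional to 1 / ln(d).
def relative_cost(d)
  (1 + d) / Math.log(d)
end

# Approximate share of nodes that are leaves in a large complete d-ary tree.
def leaf_fraction(d)
  (d - 1).fdiv(d)
end

table = (2..6).map { |d| [d, relative_cost(d), leaf_fraction(d)] }
table.each do |d, cost, leaves|
  puts format("d=%d  cost=%.6f  leaves=~%.0f%%", d, cost, leaves * 100)
end

# The branching factor with the lowest combined cost in this range.
best_d = table.min_by { |_, cost, _| cost }.first
puts "lowest cost at d=#{best_d}"
```

This counts worst-case comparisons only; cache behaviour and the share of leaf nodes also affect real timings.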
 ### Space complexity
-Because the heap is a complete binary tree, space usage is linear, regardless
-of d.  However higher d values may provide better cache locality.
+Space usage is linear, regardless of d.  However higher d values may
+provide better cache locality.  Because the heap is a complete _d_-ary tree, the
+elements can be stored in an array, without the need for tree or list pointers.
+Ruby can compare Numeric values _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values.  And it is
+often useful to use external scores for otherwise uncomparable values.  So
+`DHeap` uses twice as many entries (one for score and one for value)
+as an array which only stores values.
-We can run comparisons much much faster for Numeric or String objects than for
-ruby objects which delegate comparison to internal Numeric or String objects.
-And it is often advantageous to use extrinsic scores for uncomparable items.
-For this, our internal array uses twice as many entries (one for score and one
-for value) as it would if it only supported intrinsic comparison or used an
-un-memoized "sort_by" proc.
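The array layout and the two sift operations described in the analysis can be sketched in pure Ruby. This is a hypothetical teaching sketch, not the gem's C implementation: the class name `SketchDHeap` is invented, and it keeps scores and values in two parallel arrays to mirror the "one entry for score, one for value" description. Parent and child positions come from index arithmetic alone, so no tree pointers are needed.

```ruby
# Hypothetical pure-Ruby d-ary min-heap with separate score and value arrays.
class SketchDHeap
  def initialize(d = 4)
    @d = d
    @scores = [] # numeric scores, in heap order
    @values = [] # values, parallel to @scores
  end

  def size
    @scores.size
  end

  def empty?
    @scores.empty?
  end

  def push(score, value)
    @scores << Float(score)
    @values << value
    sift_up(size - 1)
    self
  end

  def peek
    @values.first
  end

  def pop
    return nil if empty?

    swap(0, size - 1) # move the minimum to the end, then drop it
    @scores.pop
    value = @values.pop
    sift_down(0) unless empty?
    value
  end

  private

  def swap(a, b)
    @scores[a], @scores[b] = @scores[b], @scores[a]
    @values[a], @values[b] = @values[b], @values[a]
  end

  # One comparison per level: compare only with the parent at (idx - 1) / d.
  def sift_up(idx)
    while idx > 0
      parent = (idx - 1) / @d
      break if @scores[parent] <= @scores[idx]

      swap(parent, idx)
      idx = parent
    end
  end

  # Up to d comparisons per level: find the smallest child among
  # indexes d * idx + 1 through d * idx + d.
  def sift_down(idx)
    loop do
      first = @d * idx + 1
      break if first >= size

      last = [first + @d, size].min - 1
      min_child = (first..last).min_by { |i| @scores[i] }
      break if @scores[idx] <= @scores[min_child]

      swap(idx, min_child)
      idx = min_child
    end
  end
end

heap = SketchDHeap.new
[[5, :e], [1, :a], [4, :d], [2, :b], [3, :c]].each { |score, value| heap.push(score, value) }
order = []
order << heap.pop until heap.empty?
puts order.inspect
```

Every comparison here is a Ruby-level `<=` call, which is the overhead the README attributes to pure-Ruby heaps.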
+## Benchmarks
+_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+`docs/profile.txt` for more details or updated results. These benchmarks were
+measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
+These benchmarks use very simple implementations for a pure-ruby heap and an
+array that is kept sorted using `Array#bsearch_index` and `Array#insert`.  For
+comparison, an alternate implementation using `Array#min` and `Array#delete_at`
+is also shown.
+Three different scenarios are measured:
+ * push N values but never pop (clearing between each set of pushes).
+ * push N values and then pop N values.
+   Although this could be used for heap sort, we're unlikely to choose heap sort
+   over Ruby's quick sort implementation. I'm using this scenario to represent
+   the amortized cost of creating a heap and (eventually) draining it.
+ * For a heap of size N, repeatedly push and pop while keeping a stable size.
+   This is a _very simple_ approximation for how most scheduler/timer heaps
+   would be used. Usually when a timer fires it will be quickly replaced by a
+   new timer, and the overall count of timers will remain roughly stable.
+In these benchmarks, `DHeap` runs faster than all other implementations for
+every scenario and every value of N, although the difference is much more
+noticeable at higher values of N.  The pure ruby heap implementation is
+competitive for `push` alone at every value of N, but is significantly slower
+than bsearch + insert for push + pop until N is _very_ large (somewhere between
+10k and 100k)!
+For very small N values, `DHeap` runs faster than
+the other implementations for each scenario, although the difference is still
+relatively small.  The pure ruby binary heap is 2x or more slower than bsearch +
+insert for the common push/pop scenario.
+    == push N (N=5) ==========================================================
+    push N (c_dheap):   1701338.1 i/s
+    push N (rb_heap):    971614.1 i/s - 1.75x  slower
+    push N (bsearch):    946363.7 i/s - 1.80x  slower
+    == push N then pop N (N=5) ===============================================
+    push N + pop N (c_dheap):   1087944.8 i/s
+    push N + pop N (findmin):    841708.1 i/s - 1.29x  slower
+    push N + pop N (bsearch):    773252.7 i/s - 1.41x  slower
+    push N + pop N (rb_heap):    471852.9 i/s - 2.31x  slower
+    == Push/pop with pre-filled queue of size=N (N=5) ========================
+    push + pop (c_dheap):   5525418.8 i/s
+    push + pop (findmin):   5003904.8 i/s - 1.10x  slower
+    push + pop (bsearch):   4320581.8 i/s - 1.28x  slower
+    push + pop (rb_heap):   2207042.0 i/s - 2.50x  slower
+By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
+scenarios, but the pure ruby heap is still slower than every other
+implementation—even resorting the array after every `#push`—in any scenario that
+uses `#pop`.
+    == push N (N=21) =========================================================
+    push N (c_dheap):    408307.0 i/s
+    push N (rb_heap):    212275.2 i/s - 1.92x  slower
+    push N (bsearch):    169583.2 i/s - 2.41x  slower
+    == push N then pop N (N=21) ==============================================
+    push N + pop N (c_dheap):    199435.5 i/s
+    push N + pop N (findmin):    162024.5 i/s - 1.23x  slower
+    push N + pop N (bsearch):    146284.3 i/s - 1.36x  slower
+    push N + pop N (rb_heap):     72289.0 i/s - 2.76x  slower
+    == Push/pop with pre-filled queue of size=N (N=21) =======================
+    push + pop (c_dheap):   4836860.0 i/s
+    push + pop (findmin):   4467453.9 i/s - 1.08x  slower
+    push + pop (bsearch):   3345458.4 i/s - 1.45x  slower
+    push + pop (rb_heap):   1560476.3 i/s - 3.10x  slower
+At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
+of `DHeap#push`, while insert's linear growth causes it to run slower and
+slower.  But because `#pop` is O(1) for a sorted array and O(d log n / log d)
+for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
+increases to 5k:
+    == Push/pop with pre-filled queue of size=N (N=5461) ==============
+    push + pop (c_dheap):   2718225.1 i/s
+    push + pop (bsearch):   1793546.4 i/s - 1.52x  slower
+    push + pop (rb_heap):    707139.9 i/s - 3.84x  slower
+    push + pop (findmin):    122316.0 i/s - 22.22x  slower
+Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
+for the repeated push/pop scenario, all the way up to N as high as 87k:
+    == push N (N=87381) ======================================================
+    push N (c_dheap):        92.8 i/s
+    push N (rb_heap):        43.5 i/s - 2.13x  slower
+    push N (bsearch):         2.9 i/s - 31.70x  slower
+    == push N then pop N (N=87381) ===========================================
+    push N + pop N (c_dheap):        22.6 i/s
+    push N + pop N (rb_heap):         5.5 i/s - 4.08x  slower
+    push N + pop N (bsearch):         2.9 i/s - 7.90x  slower
+    == Push/pop with pre-filled queue of size=N (N=87381) ====================
+    push + pop (c_dheap):   1815277.3 i/s
+    push + pop (bsearch):    762343.2 i/s - 2.38x  slower
+    push + pop (rb_heap):    535913.6 i/s - 3.39x  slower
+    push + pop (findmin):      2262.8 i/s - 802.24x  slower
+## Profiling
+_n.b. `Array#fetch` is reading the input data, external to heap operations.
+These benchmarks use integers for all scores, which enables significantly faster
+comparisons.  If `a <=> b` were used instead, then the difference between push
+and pop would be much larger.  And ruby's `Tracepoint` impacts these different
+implementations differently.  So we can't use these profiler results for
+comparisons between implementations.  A sampling profiler would be needed for
+more accurate relative measurements._
+It's informative to look at the `ruby-prof` results for a simple binary search +
+insert implementation, repeatedly pushing and popping to a large heap. In
+particular, even with 1000 members, the linear `Array#insert` is _still_ faster
+than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
+still (relatively) slow and `memcpy` is (relatively) quite fast!
+    %self      total      self      wait     child     calls  name                           location
+    34.79      2.222     2.222     0.000     0.000  1000000   Array#insert
+    32.59      2.081     2.081     0.000     0.000  1000000   Array#bsearch_index
+    12.84      6.386     0.820     0.000     5.566        1   DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+    10.38      4.966     0.663     0.000     4.303  1000000   DHeap::Benchmarks::BinarySearchAndInsert#<< d_heap/benchmarks/implementations.rb:61
+     5.38      0.468     0.343     0.000     0.125  1000000   DHeap::Benchmarks::BinarySearchAndInsert#pop d_heap/benchmarks/implementations.rb:70
+     2.06      0.132     0.132     0.000     0.000  1000000   Array#fetch
+     1.95      0.125     0.125     0.000     0.000  1000000   Array#pop
+Contrast this with a simplistic pure-ruby implementation of a binary heap:
+    %self      total      self      wait     child     calls  name                           location
+    48.52      8.487     8.118     0.000     0.369  1000000   DHeap::Benchmarks::NaiveBinaryHeap#pop d_heap/benchmarks/implementations.rb:96
+    42.94      7.310     7.184     0.000     0.126  1000000   DHeap::Benchmarks::NaiveBinaryHeap#<< d_heap/benchmarks/implementations.rb:80
+     4.80     16.732     0.803     0.000    15.929        1   DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+You can see that it spends somewhat more time in pop than it does in push.  That
+is expected behavior for a heap: although both are O(log n), pop is
+significantly more complex, and has _d_ comparisons per layer.
+And `DHeap` shows a similar comparison between push and pop, although it spends
+half of its time in the benchmark code (which is written in ruby):
+    %self      total      self      wait     child     calls  name                           location
+    43.09      1.685     0.726     0.000     0.959        1   DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+    26.05      0.439     0.439     0.000     0.000  1000000   DHeap#<<
+    23.57      0.397     0.397     0.000     0.000  1000000   DHeap#pop
+     7.29      0.123     0.123     0.000     0.000  1000000   Array#fetch
 ### Timers
@@ -178,22 +378,54 @@ faster than a delete and re-insert.
 ## Alternative data structures
+As always, you should run benchmarks with your expected scenarios to determine
+which is right.
 Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be faster!  Although it is technically
-O(n) for insertions, the implementations for `memcpy` or `memmove` can be *very*
-fast on modern architectures.  Also, it can be faster O(n) on average, if
-insertions are usually near the end of the array.  You should run benchmarks
-with your expected scenarios to determine which is right.
+`#bsearch_index` and `#insert` might be just fine!  As discussed above, although
+it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
+may not matter.  Also, if you can arrange for insertions to occur near the end
+of the array, that could significantly reduce the `memcpy` overhead even more.
+More complex heap variants, e.g. [Fibonacci heap], can allow heaps to be merged
+and can offer lower amortized time.
+[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
 If it is important to be able to quickly enumerate the set or find the ranking
-of values in it, then you probably want to use a self-balancing binary search
-tree (e.g. a red-black tree) or a skip-list.
-A Hashed Timing Wheel or Heirarchical Timing Wheels (or some variant in that
-family of data structures) can be constructed to have effectively O(1) running
-time in most cases.  However, the implementation for that data structure is more
-complex than a heap.  If a 4-ary heap is good enough for go's timers, it should
-be suitable for many use cases.
+of values in it, then you may want to use a self-balancing binary search tree
+(e.g. a [red-black tree]) or a [skip-list].
+[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+[skip-list]: https://en.wikipedia.org/wiki/Skip_list
+[Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in that
+family of data structures) can be constructed to have effectively `O(1)` running
+time in most cases.  Although the implementation for that data structure is more
+complex than a heap, it may be necessary for enormous values of N.
+[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+## TODOs...
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+basic heap with an internal `Hash`, which maps a set of values to scores.
+loosely inspired by go's timers.  e.g: It lazily sifts its heap after deletion
+and adjustments, to achieve faster average runtime for *add* and *cancel*
+operations.
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
+features that are loosely inspired by go's timers.  e.g: It lazily sifts its
+heap after deletion and adjustments, to achieve faster average runtime for *add*
+and *cancel* operations.
+Additionally, I was inspired by reading go's "timer.go" implementation to
+experiment with a 4-ary heap instead of the traditional binary heap.  In the
+case of timers, new timers are usually scheduled to run after most of the
+existing timers.  And timers are usually canceled before they have a chance to
+run. While a binary heap holds 50% of its elements in its last layer, 75% of a
+4-ary heap will have no children.  That diminishes the extra comparison overhead
+during sift-down.
 ## Development