d_heap 0.3.0 → 0.4.0
 checksums.yaml +4 -4
 data/.rubocop.yml +30 -1
 data/CHANGELOG.md +42 -0
 data/Gemfile +1 -0
 data/Gemfile.lock +11 -10
 data/README.md +353 -121
 data/benchmarks/push_n.yml +28 -0
 data/benchmarks/push_n_pop_n.yml +31 -0
 data/benchmarks/push_pop.yml +24 -0
 data/bin/bench_n +7 -0
 data/bin/benchmark-driver +29 -0
 data/bin/benchmarks +10 -0
 data/bin/profile +10 -0
 data/d_heap.gemspec +2 -1
 data/docs/benchmarks-2.txt +52 -0
 data/docs/benchmarks.txt +443 -0
 data/docs/profile.txt +392 -0
 data/ext/d_heap/d_heap.c +428 -150
 data/ext/d_heap/d_heap.h +6 -3
 data/ext/d_heap/extconf.rb +8 -3
 data/lib/benchmark_driver/runner/ips_zero_fail.rb +120 -0
 data/lib/d_heap.rb +5 -3
 data/lib/d_heap/benchmarks.rb +111 -0
 data/lib/d_heap/benchmarks/benchmarker.rb +113 -0
 data/lib/d_heap/benchmarks/implementations.rb +168 -0
 data/lib/d_heap/benchmarks/profiler.rb +71 -0
 data/lib/d_heap/benchmarks/rspec_matchers.rb +374 -0
 data/lib/d_heap/version.rb +1 -1
 metadata +34 -3
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: …
-  data.tar.gz: …
+  metadata.gz: 413c0a93e2c3cbdbb86ee433df47a310034d453e441a150d8317dc055b4a9a90
+  data.tar.gz: 4bf67447021da03b07da7f44bcf97a66f13fa42f6f67bcfe9a49d0866c8b8167
 SHA512:
-  metadata.gz: …
-  data.tar.gz: …
+  metadata.gz: 5e55bf53c1062686e0863fb9c3b09f3c2b8b936b0cf83985092e1e906b0b24f40e02a42eada048dcea732a60ec4e3695bb861943b424a8cd3152b227abad8a4e
+  data.tar.gz: e021616d6dcdcec943fec11783f2147a2d175aa3a0caf668c2339e05795e32475cdf7be90201c38d7108c74087e22140a9da9d3f211d7f0335e3c173ae83893b
data/.rubocop.yml CHANGED

@@ -6,6 +6,7 @@ AllCops:
   TargetRubyVersion: 2.5
   NewCops: disable
   Exclude:
+    - bin/benchmark-driver
     - bin/rake
     - bin/rspec
     - bin/rubocop

@@ -106,26 +107,49 @@ Naming/RescuedExceptionsVariableName: { Enabled: false }
 ###########################################################################
 # Matrics:
+Metrics/CyclomaticComplexity:
+  Max: 10
+
 # Although it may be better to split specs into multiple files...?
 Metrics/BlockLength:
   Exclude:
     - "spec/**/*_spec.rb"
+  CountAsOne:
+    - array
+    - hash
+    - heredoc
+
+Metrics/ClassLength:
+  Max: 200
+  CountAsOne:
+    - array
+    - hash
+    - heredoc
 
 ###########################################################################
 # Style...
 
 Style/AccessorGrouping: { Enabled: false }
 Style/AsciiComments: { Enabled: false } # 👮 can't stop our 🎉🥳🎊🥳!
+Style/ClassAndModuleChildren: { Enabled: false }
 Style/EachWithObject: { Enabled: false }
 Style/FormatStringToken: { Enabled: false }
 Style/FloatDivision: { Enabled: false }
+Style/IfUnlessModifier: { Enabled: false }
+Style/IfWithSemicolon: { Enabled: false }
 Style/Lambda: { Enabled: false }
 Style/LineEndConcatenation: { Enabled: false }
 Style/MixinGrouping: { Enabled: false }
+Style/MultilineBlockChain: { Enabled: false }
 Style/PerlBackrefs: { Enabled: false } # use occasionally/sparingly
 Style/RescueStandardError: { Enabled: false }
+Style/Semicolon: { Enabled: false }
 Style/SingleLineMethods: { Enabled: false }
 Style/StabbyLambdaParentheses: { Enabled: false }
+Style/WhenThen: { Enabled: false }
+
+# I require trailing commas elsewhere, but these are optional
+Style/TrailingCommaInArguments: { Enabled: false }
 
 # If rubocop had an option to only enforce this on constants and literals (e.g.
 # strings, regexp, range), I'd agree.

@@ -149,7 +173,9 @@ Style/BlockDelimiters:
   EnforcedStyle: semantic
   AllowBracesOnProceduralOneLiners: true
   IgnoredMethods:
-    - expect
+    - expect  # rspec
+    - profile # ruby-prof
+    - ips     # benchmark-ips
 
 
 Style/FormatString:

@@ -168,3 +194,6 @@ Style/TrailingCommaInHashLiteral:
 
 Style/TrailingCommaInArrayLiteral:
   EnforcedStyleForMultiline: consistent_comma
+
+Style/YodaCondition:
+  EnforcedStyle: forbid_for_equality_operators_only
data/CHANGELOG.md ADDED

@@ -0,0 +1,42 @@
+## Current/Unreleased
+
+## Release v0.4.0 (2021-01-12)
+
+* ⚡️ Big performance improvements, by using a C `long double *cscores` array
+* ⚡️ Scores must be `Integer` in `-uint64..+uint64`, or convertible to `Float`
+* ⚡️ many many (so many) updates to benchmarks
+* ✨ Added `DHeap#clear`
+* 🐛 Fixed `DHeap#initialize_copy` and `#freeze`
+* ♻️ significant refactoring
+* 📝 Updated docs (mostly adding benchmarks)
+
+## Release v0.3.0 (2020-12-29)
+
+* ⚡️ Big performance improvements, by converting to a `T_DATA` struct.
+* ♻️ Major refactoring/rewriting of d_heap.c
+* ✅ Added benchmark specs
+* 🔥 Removed class methods that operated directly on an array.  They weren't
+  compatible with the performance improvements.
+
+## Release v0.2.2 (2020-12-27)
+
+* 🐛 fix `optimized_cmp`, avoiding internal symbols
+* 📝 Update documentation
+* 💚 fix macos CI
+* ➕ Add rubocop 👮🎨
+
+## Release v0.2.1 (2020-12-26)
+
+* ⬆️ Upgraded rake (and bundler) to support ruby 3.0
+
+## Release v0.2.0 (2020-12-24)
+
+* ✨ Add ability to push separate score and value
+* ⚡️ Big performance gain, by storing scores separately and using ruby's
+  internal `OPTIMIZED_CMP` instead of always directly calling `<=>`
+
+## Release v0.1.0 (2020-12-22)
+
+🎉 initial release 🎉
+
+* ✨ Add basic d-ary Heap implementation
data/Gemfile
CHANGED
data/Gemfile.lock CHANGED

@@ -1,19 +1,22 @@
 PATH
   remote: .
   specs:
-    d_heap (0.3.0)
+    d_heap (0.4.0)
 
 GEM
   remote: https://rubygems.org/
   specs:
     ast (2.4.1)
-    …
-    …
-    benchmark-trend (0.4.0)
+    benchmark_driver (0.15.16)
+    coderay (1.1.3)
     diff-lcs (1.4.4)
+    method_source (1.0.0)
     parallel (1.19.2)
     parser (2.7.2.0)
       ast (~> 2.4.1)
+    pry (0.13.1)
+      coderay (~> 1.1)
+      method_source (~> 1.0)
     rainbow (3.0.0)
     rake (13.0.3)
     rake-compiler (1.1.1)

@@ -24,11 +27,6 @@ GEM
     rspec-core (~> 3.10.0)
     rspec-expectations (~> 3.10.0)
     rspec-mocks (~> 3.10.0)
-    rspec-benchmark (0.6.0)
-      benchmark-malloc (~> 0.2)
-      benchmark-perf (~> 0.6)
-      benchmark-trend (~> 0.4)
-      rspec (>= 3.0)
     rspec-core (3.10.0)
       rspec-support (~> 3.10.0)
     rspec-expectations (3.10.0)

@@ -49,6 +47,7 @@ GEM
     unicode-display_width (>= 1.4.0, < 2.0)
     rubocop-ast (1.1.1)
       parser (>= 2.7.1.5)
+    ruby-prof (1.4.2)
     ruby-progressbar (1.10.1)
     unicode-display_width (1.7.0)

@@ -56,12 +55,14 @@ PLATFORMS
   ruby
 
 DEPENDENCIES
+  benchmark_driver
   d_heap!
+  pry
   rake (~> 13.0)
   rake-compiler
   rspec (~> 3.10)
-  rspec-benchmark
   rubocop (~> 1.0)
+  ruby-prof
 
 BUNDLED WITH
    2.2.3
data/README.md CHANGED

@@ -1,28 +1,64 @@
 # DHeap
 
-A fast _d_-ary heap implementation for ruby,
-…
+A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
+implemented as a C extension.
+
+With a regular queue, you expect "FIFO" behavior: first in, first out.  With a
+stack you expect "LIFO": last in, first out.  With a priority queue, you push
+elements along with a score and the lowest scored element is the first to be
+popped.  Priority queues are often used in algorithms for e.g. [scheduling] of
+timers or bandwidth management, [Huffman coding], and various graph search
+algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+
+The _d_-ary heap data structure is a generalization of the [binary heap], in
+which the nodes have _d_ children instead of 2.  This allows for "decrease
+priority" operations to be performed more quickly with the tradeoff of slower
+delete minimum.  Additionally, _d_-ary heaps can have better memory cache
+behavior than binary heaps, allowing them to run more quickly in practice
+despite slower worst-case time complexity.  In the worst case, a _d_-ary heap
+requires only `O(log n / log d)` to push, with the tradeoff that pop is
+`O(d log n / log d)`.
+
+Although you should probably just use the default _d_ value of `4` (see the
+analysis below), it's always advisable to benchmark your specific use-case.
+
+[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
 
-the nodes have _d_ children instead of 2.  This allows for "decrease priority"
-operations to be performed more quickly with the tradeoff of slower delete
-minimum.  Additionally, _d_-ary heaps can have better memory cache behavior than
-binary heaps, allowing them to run more quickly in practice despite slower
-worst-case time complexity.  In the worst case, a _d_-ary heap requires only
-`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+## Usage
 
+The basic API is:
+* `heap << object` adds a value as its own score.
+* `heap.push(score, value)` adds a value with an extrinsic score.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
+* `heap.peek` to view the minimum value without popping it.
 
+The score must be `Integer` or `Float` or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`).  Constraining scores to
+numeric values gives a 40+% speedup under some benchmarks!
 
+_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+long long`.  _This is architecture dependent, but on an IA-64 system this is 64
+bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615._
 
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
 
 ```ruby
 require "d_heap"
 
+Task = Struct.new(:id) # for demonstration
+
+heap = DHeap.new # defaults to a 4-heap
 
 # storing [score, value] tuples
 heap.push Time.now + 5*60, Task.new(1)
@@ -31,14 +67,61 @@ heap.push Time.now + 60, Task.new(3)
 heap.push Time.now + 5, Task.new(4)
 
 # peeking and popping (using last to get the task and ignore the time)
-heap.pop
-heap.pop
-heap.
-heap.pop
-heap.pop
+heap.pop    # => Task[4]
+heap.pop    # => Task[2]
+heap.peek   # => Task[3], but don't pop it from the heap
+heap.pop    # => Task[3]
+heap.pop    # => Task[1]
+heap.empty? # => true
+heap.pop    # => nil
 ```
 
+If your values behave as their own score, by being convertible via
+`Float(value)`, then you can use `#<<` for implicit scoring.  The score should
+not change for as long as the value remains in the heap, since it will not be
+re-evaluated after being pushed.
+
+```ruby
+heap.clear
+
+# The score can be derived from the value by using to_f.
+# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+class Event
+  include Comparable
+  attr_reader :time, :payload
+  alias_method :to_time, :time
+
+  def initialize(time, payload)
+    @time = time.to_time
+    @payload = payload
+    freeze
+  end
+
+  def to_f
+    time.to_f
+  end
+
+  def <=>(other)
+    to_f <=> other.to_f
+  end
+end
+
+heap << comparable_max # sorts last, using <=>
+heap << comparable_min # sorts first, using <=>
+heap << comparable_mid # sorts in the middle, using <=>
+heap.pop    # => comparable_min
+heap.pop    # => comparable_mid
+heap.pop    # => comparable_max
+heap.empty? # => true
+heap.pop    # => nil
+```
+
+You can also pass a value into `#pop(max)` which will only pop if the minimum
+score is less than or equal to `max`.
+
+Read the [API documentation] for more detailed documentation and examples.
+
+[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
 
 ## Installation
 
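The push/pop/peek/pop_lte semantics described in the README diff above can be mimicked with a short pure-Ruby sketch. The `ToyHeap` class below is hypothetical and only for illustration — it keeps a sorted array internally and is not the gem's C implementation.

```ruby
# A minimal pure-Ruby stand-in for the DHeap API semantics (illustration
# only; the real gem is a C extension and does not use a sorted array).
class ToyHeap
  def initialize
    @entries = [] # [score, value] pairs, kept sorted by score
  end

  def push(score, value)
    index = @entries.bsearch_index { |(s, _)| s >= score } || @entries.size
    @entries.insert(index, [score, value])
    self
  end

  # `<<` treats the value as its own score, via Float(value).
  def <<(value)
    push(Float(value), value)
  end

  def peek
    @entries.first&.last
  end

  def pop
    entry = @entries.shift
    entry && entry.last
  end

  # Pop only if the minimum score is <= max_score, else return nil.
  def pop_lte(max_score)
    return nil if @entries.empty? || @entries.first.first > max_score
    pop
  end

  def empty?
    @entries.empty?
  end
end

heap = ToyHeap.new
heap.push(3, "c").push(1, "a").push(2, "b")
heap.peek       # => "a"
heap.pop_lte(0) # => nil ("a" has score 1, which is greater than 0)
heap.pop        # => "a"
```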
@@ -58,109 +141,226 @@ Or install it yourself as:
 
 ## Motivation
 
-…
-a few hundred items at once, the overhead of those extra calls to `<=>` is far
-more than occasionally calling `memcpy`.
-
-It's likely that MJIT will eventually make the C-extension completely
-unnecessary.  This is definitely hotspot code, and the basic ruby implementation
-would work fine, if not for that `<=>` overhead.  Until then... this gem gets
-the job done.
-
-## TODOs...
-
-_TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
-~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
-directly handled as a priority queue.  These extension methods are meant to be
-used similarly to how `#bsearch` and `#bsearch_index` might be used.
-
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
-features that are loosely inspired by go's timers.  e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.
-
-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap.  In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers.  And timers are usually canceled before they have a chance to
-run.  While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children.  That diminishes the extra comparison overhead
-during sift-down.
-
-## Benchmarks
-
-_TODO: put benchmarks here._
+One naive approach to a priority queue is to maintain an array in sorted order.
+This can be very simply implemented using `Array#bsearch_index` + `Array#insert`.
+This can be very fast—`Array#pop` is `O(1)`—but the worst-case for insert is
+`O(n)` because it may need to `memcpy` a significant portion of the array.
+
+The standard way to implement a priority queue is with a binary heap.  Although
+this increases the time for `pop`, it converts the amortized time per push + pop
+from `O(n)` to `O(d log n / log d)`.
+
+However, I was surprised to find that—at least for some benchmarks—my pure ruby
+heap implementation was much slower than inserting into and popping from a fully
+sorted array.  The reason for this surprising result: although it is `O(n)`,
+`memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
+has relatively _much_ larger constant factors.  If your queue contains only a
+few thousand items, the overhead of those extra calls to `<=>` is _far_ more
+than occasionally calling `memcpy`.  In the worst case, a _d_-heap will require
+`d + 1` times more comparisons for each push + pop than a `bsearch` + `insert`
+sorted array.
+
+Moving the sift-up and sift-down code into C helps some.  But much more helpful
+is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
+called.  I'm hopeful that MJIT will eventually obsolete this C-extension.  JRuby
+or TruffleRuby may already run the pure ruby version at high speed.  This can be
+hotspot code, and the basic ruby implementation should perform well if not for
+the high overhead of `<=>`.
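For reference, the kind of simplified pure-Ruby _d_-ary heap that the motivation above compares against bsearch + insert might look like the sketch below. The class and method names are illustrative only, not the gem's internals.

```ruby
# A simplified pure-Ruby d-ary heap (illustrative, not the gem's code).
# Parent of index i is (i - 1) / d; children are d*i + 1 .. d*i + d.
class PureRubyDHeap
  def initialize(d = 4)
    @d = d
    @entries = [] # [score, value] pairs in heap order
  end

  def push(score, value)
    @entries << [score, value]
    sift_up(@entries.size - 1)
    self
  end

  def pop
    return nil if @entries.empty?
    min = @entries.first
    last = @entries.pop
    unless @entries.empty?
      @entries[0] = last
      sift_down(0)
    end
    min.last
  end

  def size
    @entries.size
  end

  private

  # One comparison per level: O(log n / log d) per push.
  def sift_up(index)
    while index.positive?
      parent = (index - 1) / @d
      break if @entries[parent].first <= @entries[index].first
      @entries[parent], @entries[index] = @entries[index], @entries[parent]
      index = parent
    end
  end

  # Up to d comparisons per level: O(d log n / log d) per pop.
  def sift_down(index)
    loop do
      first_child = (@d * index) + 1
      break if first_child >= @entries.size
      last_child = [first_child + @d - 1, @entries.size - 1].min
      min_child = (first_child..last_child).min_by { |i| @entries[i].first }
      break if @entries[index].first <= @entries[min_child].first
      @entries[index], @entries[min_child] = @entries[min_child], @entries[index]
      index = min_child
    end
  end
end

heap = PureRubyDHeap.new(4)
[5, 1, 4, 2, 3].each { |n| heap.push(n, n) }
heap.pop # => 1
heap.pop # => 2
```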
 ## Analysis
 
 ### Time complexity
 
+There are two fundamental heap operations: sift-up (used by push) and sift-down
+(used by pop).
+
+* Both sift operations can perform as many as `log n / log d` swaps, as the
+  element may sift from the bottom of the tree to the top, or vice versa.
+* Sift-up performs a single comparison per swap: `O(1)`.
+  So pushing a new element is `O(log n / log d)`.
+* Sift-down performs as many as d comparisons per swap: `O(d)`.
+  So popping the min element is `O(d log n / log d)`.
 
+Assuming every inserted element is eventually deleted from the root, d=4
+requires the fewest comparisons for combined insert and delete:
 
-* …
-* (1 + 6) lg 6 = 3.906774
-* etc...
+* (1 + 2) lg 2 = 4.328085
+* (1 + 3) lg 3 = 3.640957
+* (1 + 4) lg 4 = 3.606738
+* (1 + 5) lg 5 = 3.728010
+* (1 + 6) lg 6 = 3.906774
+* etc...
 
 Leaf nodes require no comparisons to shift down, and higher values for d have
 higher percentage of leaf nodes:
+
+* d=2 has ~50% leaves,
+* d=3 has ~67% leaves,
+* d=4 has ~75% leaves,
+* and so on...
 
 See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
 
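The per-_d_ figures in the list above can be checked with a few lines of Ruby: each is `(1 + d) / ln(d)` — roughly one comparison per level sifting up plus _d_ per level sifting down, spread over `log n / log d` levels — and the minimum falls at d=4:

```ruby
# Comparisons per push + pop are proportional to (1 + d) / ln(d).
costs = (2..8).to_h { |d| [d, ((1 + d) / Math.log(d)).round(6)] }
costs[2] # => 4.328085
costs[4] # => 3.606738
costs.min_by { |_, cost| cost }.first # => 4, the default d value
```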
 ### Space complexity
 
+Space usage is linear, regardless of d.  However higher d values may
+provide better cache locality.  Because the heap is a complete binary tree, the
+elements can be stored in an array, without the need for tree or list pointers.
+
+Ruby can compare Numeric values _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values.  And it is
+often useful to use external scores for otherwise incomparable values.  So
+`DHeap` uses twice as many entries (one for score and one for value)
+as an array which only stores values.
 
+## Benchmarks
+
+_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+`docs/profile.txt` for more details or updated results.  These benchmarks were
+measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
+
+These benchmarks use very simple implementations for a pure-ruby heap and an
+array that is kept sorted using `Array#bsearch_index` and `Array#insert`.  For
+comparison, an alternate implementation using `Array#min` and `Array#delete_at`
+is also shown.
+
+Three different scenarios are measured:
+
+* push N values but never pop (clearing between each set of pushes).
+* push N values and then pop N values.  Although this could be used for heap
+  sort, we're unlikely to choose heap sort over Ruby's quick sort
+  implementation.  I'm using this scenario to represent the amortized cost of
+  creating a heap and (eventually) draining it.
+* For a heap of size N, repeatedly push and pop while keeping a stable size.
+  This is a _very simple_ approximation for how most scheduler/timer heaps
+  would be used.  Usually when a timer fires it will be quickly replaced by a
+  new timer, and the overall count of timers will remain roughly stable.
+
+In these benchmarks, `DHeap` runs faster than all other implementations for
+every scenario and every value of N, although the difference is much more
+noticeable at higher values of N.  The pure ruby heap implementation is
+competitive for `push` alone at every value of N, but is significantly slower
+than bsearch + insert for push + pop until N is _very_ large (somewhere between
+10k and 100k)!
+
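The three measured scenarios can be sketched with a toy harness. This sketch uses a plain Array with `Array#min`/`Array#delete_at` as a "findmin" stand-in queue; the real benchmarks live in `bin/benchmarks` and use benchmark_driver, so none of the helper names below come from the gem itself.

```ruby
require "benchmark"

# Find and remove the minimum element of an unsorted array ("findmin").
def pop_min(queue)
  queue.delete_at(queue.index(queue.min))
end

# Scenario 1: push N values but never pop.
def push_n(n)
  queue = []
  n.times { queue.push(rand(n)) }
end

# Scenario 2: push N values, then drain the queue completely.
def push_n_pop_n(n)
  queue = []
  n.times { queue.push(rand(n)) }
  n.times { pop_min(queue) }
end

# Scenario 3: steady-state push/pop against a pre-filled queue of size N,
# as a scheduler/timer heap would typically behave.
def steady_state(n, iterations)
  queue = Array.new(n) { rand(n) }
  iterations.times do
    pop_min(queue)
    queue.push(rand(n)) # each pop is quickly followed by a push
  end
end

n = 1_000
puts Benchmark.measure { push_n(n) }
puts Benchmark.measure { push_n_pop_n(n) }
puts Benchmark.measure { steady_state(n, 10_000) }
```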
+For very small values of N, `DHeap` still runs faster than the other
+implementations in each scenario, although the difference is relatively small.
+The pure ruby binary heap is 2x or more slower than bsearch + insert for the
+common push/pop scenario.
+
+    == push N (N=5) ==========================================================
+    push N (c_dheap):  1701338.1 i/s
+    push N (rb_heap):   971614.1 i/s - 1.75x  slower
+    push N (bsearch):   946363.7 i/s - 1.80x  slower
+
+    == push N then pop N (N=5) ===============================================
+    push N + pop N (c_dheap):  1087944.8 i/s
+    push N + pop N (findmin):   841708.1 i/s - 1.29x  slower
+    push N + pop N (bsearch):   773252.7 i/s - 1.41x  slower
+    push N + pop N (rb_heap):   471852.9 i/s - 2.31x  slower
+
+    == Push/pop with pre-filled queue of size=N (N=5) ========================
+    push + pop (c_dheap):  5525418.8 i/s
+    push + pop (findmin):  5003904.8 i/s - 1.10x  slower
+    push + pop (bsearch):  4320581.8 i/s - 1.28x  slower
+    push + pop (rb_heap):  2207042.0 i/s - 2.50x  slower
+
+By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
+scenarios, but the pure ruby heap is still slower than every other
+implementation—even re-sorting the array after every `#push`—in any scenario
+that uses `#pop`.
+
+    == push N (N=21) =========================================================
+    push N (c_dheap):   408307.0 i/s
+    push N (rb_heap):   212275.2 i/s - 1.92x  slower
+    push N (bsearch):   169583.2 i/s - 2.41x  slower
+
+    == push N then pop N (N=21) ==============================================
+    push N + pop N (c_dheap):   199435.5 i/s
+    push N + pop N (findmin):   162024.5 i/s - 1.23x  slower
+    push N + pop N (bsearch):   146284.3 i/s - 1.36x  slower
+    push N + pop N (rb_heap):    72289.0 i/s - 2.76x  slower
+
+    == Push/pop with pre-filled queue of size=N (N=21) =======================
+    push + pop (c_dheap):  4836860.0 i/s
+    push + pop (findmin):  4467453.9 i/s - 1.08x  slower
+    push + pop (bsearch):  3345458.4 i/s - 1.45x  slower
+    push + pop (rb_heap):  1560476.3 i/s - 3.10x  slower
+
+At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
+of `DHeap#push`, while insert's linear growth causes it to run slower and
+slower.  But because `#pop` is `O(1)` for a sorted array and `O(d log n / log d)`
+for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
+increases to 5k:
+
+    == Push/pop with pre-filled queue of size=N (N=5461) =====================
+    push + pop (c_dheap):  2718225.1 i/s
+    push + pop (bsearch):  1793546.4 i/s - 1.52x  slower
+    push + pop (rb_heap):   707139.9 i/s - 3.84x  slower
+    push + pop (findmin):   122316.0 i/s - 22.22x  slower
+
+Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
+for the repeated push/pop scenario, all the way up to N as high as 87k:
+
+    == push N (N=87381) ======================================================
+    push N (c_dheap):   92.8 i/s
+    push N (rb_heap):   43.5 i/s - 2.13x  slower
+    push N (bsearch):    2.9 i/s - 31.70x  slower
+
+    == push N then pop N (N=87381) ===========================================
+    push N + pop N (c_dheap):  22.6 i/s
+    push N + pop N (rb_heap):   5.5 i/s - 4.08x  slower
+    push N + pop N (bsearch):   2.9 i/s - 7.90x  slower
+
+    == Push/pop with pre-filled queue of size=N (N=87381) ====================
+    push + pop (c_dheap):  1815277.3 i/s
+    push + pop (bsearch):   762343.2 i/s - 2.38x  slower
+    push + pop (rb_heap):   535913.6 i/s - 3.39x  slower
+    push + pop (findmin):     2262.8 i/s - 802.24x  slower
+
+## Profiling
+
+_n.b. `Array#fetch` is reading the input data, external to heap operations.
+These benchmarks use integers for all scores, which enables significantly faster
+comparisons.  If `a <=> b` were used instead, then the difference between push
+and pop would be much larger.  And ruby's `TracePoint` impacts these different
+implementations differently.  So we can't use these profiler results for
+comparisons between implementations.  A sampling profiler would be needed for
+more accurate relative measurements._
+
+It's informative to look at the `ruby-prof` results for a simple binary search +
+insert implementation, repeatedly pushing and popping to a large heap.  In
+particular, even with 1000 members, the linear `Array#insert` is _still_ faster
+than the logarithmic `Array#bsearch_index`.  At this scale, ruby comparisons are
+still (relatively) slow and `memcpy` is (relatively) quite fast!
+
+    %self    total    self     wait    child    calls   name                  location
+    34.79    2.222    2.222    0.000   0.000  1000000   Array#insert
+    32.59    2.081    2.081    0.000   0.000  1000000   Array#bsearch_index
+    12.84    6.386    0.820    0.000   5.566        1   DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+    10.38    4.966    0.663    0.000   4.303  1000000   DHeap::Benchmarks::BinarySearchAndInsert#<<     d_heap/benchmarks/implementations.rb:61
+     5.38    0.468    0.343    0.000   0.125  1000000   DHeap::Benchmarks::BinarySearchAndInsert#pop    d_heap/benchmarks/implementations.rb:70
+     2.06    0.132    0.132    0.000   0.000  1000000   Array#fetch
+     1.95    0.125    0.125    0.000   0.000  1000000   Array#pop
+
+Contrast this with a simplistic pure-ruby implementation of a binary heap:
+
+    %self    total    self     wait    child    calls   name                  location
+    48.52    8.487    8.118    0.000   0.369  1000000   DHeap::Benchmarks::NaiveBinaryHeap#pop  d_heap/benchmarks/implementations.rb:96
+    42.94    7.310    7.184    0.000   0.126  1000000   DHeap::Benchmarks::NaiveBinaryHeap#<<   d_heap/benchmarks/implementations.rb:80
+     4.80   16.732    0.803    0.000  15.929        1   DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+
+You can see that it spends more time in pop than it does in push.  That is
+expected behavior for a heap: although both are `O(log n)`, pop is
+significantly more complex, and has _d_ comparisons per layer.
+
+And `DHeap` shows a similar comparison between push and pop, although it spends
+half of its time in the benchmark code (which is written in ruby):
+
+    %self    total    self     wait    child    calls   name                  location
+    43.09    1.685    0.726    0.000   0.959        1   DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+    26.05    0.439    0.439    0.000   0.000  1000000   DHeap#<<
+    23.57    0.397    0.397    0.000   0.000  1000000   DHeap#pop
+     7.29    0.123    0.123    0.000   0.000  1000000   Array#fetch
 
 ### Timers
 
@@ -178,22 +378,54 @@ faster than a delete and re-insert.
 
 ## Alternative data structures
 
+As always, you should run benchmarks with your expected scenarios to determine
+which is right.
+
 Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be
-O(n) for insertions,
-…
+`#bsearch_index` and `#insert` might be just fine!  As discussed above, although
+it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
+may not matter.  Also, if you can arrange for insertions to occur near the end
+of the array, that could significantly reduce the `memcpy` overhead even more.
+
+More complex heap variants, e.g. [Fibonacci heap], can allow heaps to be merged
+as well as lower amortized time.
+
+[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
 
 If it is important to be able to quickly enumerate the set or find the ranking
-of values in it, then you
-…
-be
+of values in it, then you may want to use a self-balancing binary search tree
+(e.g. a [red-black tree]) or a [skip-list].
+
+[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+[skip-list]: https://en.wikipedia.org/wiki/Skip_list
+
+[Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in that
+family of data structures) can be constructed to have effectively `O(1)` running
+time in most cases.  Although the implementation for that data structure is more
+complex than a heap, it may be necessary for enormous values of N.
+
+[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## TODOs...
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+basic heap with an internal `Hash`, which maps a set of values to scores.
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
+features that are loosely inspired by go's timers.  e.g: It lazily sifts its
+heap after deletion and adjustments, to achieve faster average runtime for *add*
+and *cancel* operations.
+
+Additionally, I was inspired by reading go's "timer.go" implementation to
+experiment with a 4-ary heap instead of the traditional binary heap.  In the
+case of timers, new timers are usually scheduled to run after most of the
+existing timers.  And timers are usually canceled before they have a chance to
+run.  While a binary heap holds 50% of its elements in its last layer, 75% of a
+4-ary heap will have no children.  That diminishes the extra comparison overhead
+during sift-down.
 
 ## Development
 