d_heap 0.6.1 → 0.7.0
 checksums.yaml +4 -4
 data/.clangformat +21 -0
 data/.github/workflows/main.yml +16 -1
 data/.rubocop.yml +1 -0
 data/CHANGELOG.md +17 -0
 data/{N → D} +1 -1
 data/README.md +313 -261
 data/d_heap.gemspec +16 -5
 data/docs/benchmarks2.txt +79 -61
 data/docs/benchmarks.txt +587 -416
 data/docs/profile.txt +99 -133
 data/ext/d_heap/.rubocop.yml +7 -0
 data/ext/d_heap/d_heap.c +575 -424
 data/ext/d_heap/extconf.rb +34 -3
 data/images/push_n.png +0 -0
 data/images/push_n_pop_n.png +0 -0
 data/images/push_pop.png +0 -0
 data/lib/d_heap.rb +25 -1
 data/lib/d_heap/version.rb +1 -1
 metadata +6 -30
 data/.rspec +0 -3
 data/.travis.yml +0 -6
 data/Gemfile +0 -20
 data/Gemfile.lock +0 -83
 data/Rakefile +0 -20
 data/benchmarks/perf.rb +0 -29
 data/benchmarks/push_n.yml +0 -35
 data/benchmarks/push_n_pop_n.yml +0 -52
 data/benchmarks/push_pop.yml +0 -32
 data/benchmarks/stackprof.rb +0 -31
 data/bin/bench_charts +0 -13
 data/bin/bench_n +0 -7
 data/bin/benchmark-driver +0 -29
 data/bin/benchmarks +0 -10
 data/bin/console +0 -15
 data/bin/profile +0 -10
 data/bin/rake +0 -29
 data/bin/rspec +0 -29
 data/bin/rubocop +0 -29
 data/bin/setup +0 -8
 data/lib/benchmark_driver/runner/ips_zero_fail.rb +0 -158
 data/lib/d_heap/benchmarks.rb +0 -112
 data/lib/d_heap/benchmarks/benchmarker.rb +0 -116
 data/lib/d_heap/benchmarks/implementations.rb +0 -224
 data/lib/d_heap/benchmarks/profiler.rb +0 -71
 data/lib/d_heap/benchmarks/rspec_matchers.rb +0 -352
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5b51ed52baf74b585a7ab7799f92a446aef5852431ba10e146658b419657ffbe
+  data.tar.gz: cc7c6786eee78ec13214582b8701448d312f59fb723d12676fb673447ab409a7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5de98f8c9084b30694fff5f8154a6e42e7e67d76518c25136ab4fb0c0afb047ad3c923f4544dcf613ded4c3b01417729aa796c973100faaa7ee93051fa630c7d
+  data.tar.gz: e5dbcc90da7adfba7ef45cd9a2da5fd1781a2bd489002a5ffc0a764915c035c178db30ae9b8431a8fc810cfa6f03a1b38ec0a50cbf23c2e1ba5dfc36549c0609
data/.clangformat ADDED

@@ -0,0 +1,21 @@
+---
+BasedOnStyle: mozilla
+IndentWidth: 4
+PointerAlignment: Right
+AlignAfterOpenBracket: Align
+AlignConsecutiveAssignments: true
+AlignConsecutiveDeclarations: true
+AlignConsecutiveBitFields: true
+AlignConsecutiveMacros: true
+AlignEscapedNewlines: Right
+AlignOperands: true
+
+AllowAllConstructorInitializersOnNextLine: false
+AllowShortIfStatementsOnASingleLine: WithoutElse
+
+IndentCaseLabels: false
+IndentPPDirectives: AfterHash
+
+ForEachMacros:
+  - WHILE_PEEK_LT_P
+...

data/.github/workflows/main.yml CHANGED

@@ -23,4 +23,19 @@ jobs:
       run: |
         gem install bundler -v 2.2.3
         bundle install
-        bundle exec rake
+        bundle exec rake ci
+
+  benchmarks:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Ruby
+      uses: ruby/setup-ruby@v1
+      with:
+        ruby-version: 2.7
+        bundler-cache: true
+    - name: Run the benchmarks
+      run: |
+        gem install bundler -v 2.2.3
+        bundle install
+        bundle exec rake ci:benchmarks

data/.rubocop.yml CHANGED

@@ -135,6 +135,7 @@ Style/ClassAndModuleChildren: { Enabled: false }
 Style/EachWithObject: { Enabled: false }
 Style/FormatStringToken: { Enabled: false }
 Style/FloatDivision: { Enabled: false }
+Style/GuardClause: { Enabled: false } # usually nice to do, but...
 Style/IfUnlessModifier: { Enabled: false }
 Style/IfWithSemicolon: { Enabled: false }
 Style/Lambda: { Enabled: false }

data/CHANGELOG.md CHANGED

@@ -1,5 +1,22 @@
 ## Current/Unreleased
 
+## Release v0.7.0 (2021-01-24)
+
+* 💥⚡️ **BREAKING**: Uses `double` for _all_ scores.
+  * 💥 Integers larger than a double mantissa (53 bits) will lose some
+    precision.
+  * ⚡️ Big speed up.
+  * ⚡️ Much better memory usage.
+  * ⚡️ Simplifies score conversion between ruby and C.
+* ✨ Added `DHeap::Map` for ensuring values can only be added once, by `#hash`.
+  * Adding again will update the score.
+  * Adds `DHeap::Map#[]` for quick lookup of existing scores.
+  * Adds `DHeap::Map#[]=` for adjustments of existing scores.
+  * TODO: `DHeap::Map#delete`.
+* 📝📈 SO MANY BENCHMARKS
+* ⚡️ Set DEFAULT_D to 6, based on benchmarks.
+* 🐛♻️ Convert all `long` indexes to `size_t`.
+
 ## Release v0.6.1 (2021-01-24)
 
 * 📝 Fix link to CHANGELOG.md in gemspec

data/{N → D}
RENAMED
data/README.md CHANGED

@@ -7,6 +7,13 @@
 A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
 implemented as a C extension.
 
+A regular queue has "FIFO" behavior: first in, first out.  A stack is "LIFO":
+last in, first out.  A priority queue pushes each element with a score and pops
+out in order by score.  Priority queues are often used in algorithms for e.g.
+[scheduling] of timers or bandwidth management, for [Huffman coding], and for
+various graph search algorithms such as [Dijkstra's algorithm], [A* search], or
+[Prim's algorithm].
+
 From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
 > A heap is a specialized tree-based data structure which is essentially an
 > almost complete tree that satisfies the heap property: in a min heap, for any
@@ -16,26 +23,17 @@ From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
 
 ![tree representation of a min heap](images/wikipedia-min-heap.png)
 
-have better memory cache behavior than binary heaps, allowing them to run more
-quickly in practice despite slower worst-case time complexity.  In the worst
-case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
-the tradeoff that pop requires `O(d log n / log d)`.
-
-Although you should probably just use the default _d_ value of `4` (see the
-analysis below), it's always advisable to benchmark your specific use-case.  In
-particular, if you push items more than you pop, higher values for _d_ can give
-a faster total runtime.
+The _d_-ary heap data structure is a generalization of a [binary heap] in which
+each node has _d_ children instead of 2.  This speeds up "push" or "decrease
+priority" operations (`O(log n / log d)`) with the tradeoff of slower "pop" or
+"increase priority" (`O(d log n / log d)`).  Additionally, _d_-ary heaps can
+have better memory cache behavior than binary heaps, letting them run more
+quickly in practice.
+
+Although the default _d_ value will usually perform best (see the time
+complexity analysis below), it's always advisable to benchmark your specific
+use-case.  In particular, if you push items more than you pop, higher values for
+_d_ can give a faster total runtime.
 
 [d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
 [priority queue]: https://en.wikipedia.org/wiki/Priority_queue
@@ -46,41 +44,39 @@ a faster total runtime.
 [A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
 [Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
 
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'd_heap'
+```
+
+And then execute:
+
+    $ bundle install
+
+Or install it yourself as:
+
+    $ gem install d_heap
+
 ## Usage
 
-The basic API is `#push(object, score)` and `#pop`.  Please read the
+The basic API is `#push(object, score)` and `#pop`.  Please read the [full
+documentation] for more details.  The score must be convertible to a `Float` via
+`Float(score)` (i.e. it should properly implement `#to_f`).
 
-Quick reference for
+Quick reference for the most common methods:
 
-* `heap << object` adds a value,
+* `heap << object` adds a value, using `Float(object)` as its intrinsic score.
 * `heap.push(object, score)` adds a value with an extrinsic score.
-* `heap.pop` removes and returns the value with the minimum score.
-* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
 * `heap.peek` to view the minimum value without popping it.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_below(max_score)` pops only if the next score is `<` the argument.
 * `heap.clear` to remove all items from the heap.
 * `heap.empty?` returns true if the heap is empty.
 * `heap.size` returns the number of items in the heap.
 
-If the score changes while the object is still in the heap, it will not be
-re-evaluated again.
-
-The score must either be `Integer` or `Float` or convertible to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`).  Constraining scores to
-numeric values gives more than 50% speedup under some benchmarks!  _n.b._
-`Integer` _scores must have an absolute value that fits into_ `unsigned long
-long`.  This is compiler and architecture dependent but with gcc on an IA64
-system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
-+18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
-in nanoseconds.
-
-_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
-added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons.  The speedup from this constraint is
-huge!_
-
-[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
-
 ### Examples
 
 ```ruby
@@ -128,251 +124,272 @@ heap.size # => 0
 heap.pop # => nil
 ```
 
-Please see the [
+Please see the [full documentation] for more methods and more examples.
+
+[full documentation]: https://rubydoc.info/gems/d_heap/DHeap
+
+### DHeap::Map
+
+`DHeap::Map` augments the heap with an internal `Hash`, mapping objects to their
+index in the heap.  For simple push/pop this is a bit slower than a normal
+`DHeap` heap, but it can enable huge speedups for algorithms that need to adjust
+scores after they've been added, e.g. [Dijkstra's algorithm].  It adds the
+following:
+
+* a uniqueness constraint, by `#hash` value
+* `#[obj] # => score` or `#score(obj)` in `O(1)`
+* `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
+* TODO:
+  * optionally unique by object identity
+  * `#delete(obj)` in `O(d log n / log d)` (TODO)
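The `DHeap::Map` semantics described above (uniqueness by key, `O(1)` score lookup, and score update on re-add) can be illustrated with a tiny pure-ruby stand-in. `ScoreMap` below is a hypothetical name: it sketches only the *interface*, using a plain `Hash` with a linear-scan pop, not the heap-backed `O(d log n / log d)` implementation:

```ruby
# Illustrative pure-ruby stand-in for DHeap::Map's semantics. Not the real
# implementation: pop here is O(n) via a linear scan, not a sift-down.
class ScoreMap
  def initialize
    @scores = {} # object => Float score; Hash keys give the uniqueness constraint
  end

  # re-adding an object simply updates its score
  def push(obj, score)
    @scores[obj] = Float(score)
  end
  alias []= push

  # O(1) score lookup
  def [](obj)
    @scores[obj]
  end

  # remove and return the object with the minimum score
  def pop
    return nil if @scores.empty?
    obj, _score = @scores.min_by { |_obj, s| s }
    @scores.delete(obj)
    obj
  end
end
```

With this sketch, `map[:job] = 2.5` either inserts `:job` or rescores it, mirroring the `#[]=`/`#rescore` behavior listed above.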
+## Scores
+
+If a score changes while the object is still in the heap, it will not be
+re-evaluated again.
+
+Constraining scores to `Float` gives enormous performance benefits.  n.b.
+very large `Integer` values will lose precision when converted to `Float`.  This
+is compiler and architecture dependent, but with gcc on an IA64 system, `Float`
+is 64 bits with a 53-bit mantissa, which gives a range of -9,007,199,254,740,991
+to +9,007,199,254,740,991, which is _not_ enough to store the precise POSIX
+time since the epoch in nanoseconds.  This can be worked around by adding a
+bias, but probably it's good enough for most usage.
+
-makes the tree shorter but broader, which reduces to `O(log n / log d)` while
-increasing the comparisons needed by sift-down to `O(d log n / log d)`.
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
+added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
 
-slowly than the naive approach—even for heaps containing ten thousand items.
-Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
-has _much_ higher overhead.  And a _d_-heap needs `d + 1` times more comparisons
-for each push + pop than `bsearch` + `insert`.
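The 53-bit limit described above is easy to check in plain ruby (no `d_heap` required); the integer boundaries are just `2**53 ± 1`:

```ruby
# 2**53 is the largest Integer magnitude a Float (a C double) represents
# exactly; beyond it, odd integers round to a neighboring even integer.
max_exact = 2**53 # 9_007_199_254_740_992

(max_exact - 1).to_f.to_i == max_exact - 1 # round-trips exactly
(max_exact + 1).to_f.to_i == max_exact + 1 # false: rounds back to 2**53
Time.now.to_i * 1_000_000_000 > max_exact  # POSIX time in ns exceeds the range
```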
+## Thread safety
 
-heap instead of the traditional binary heap.
 
+`DHeap` is _not_ thread-safe, so concurrent access from multiple threads needs
+to take precautions such as locking access behind a mutex.
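A minimal sketch of that precaution: wrap the heap behind a `Mutex`. `SyncHeap` is a hypothetical name, and the wrapped object can be any non-thread-safe queue (a `DHeap` in this gem's case):

```ruby
# Hedged sketch: serialize all access to a non-thread-safe heap behind a Mutex.
# SyncHeap is illustrative, not part of the d_heap API.
class SyncHeap
  def initialize(heap)
    @heap = heap
    @lock = Mutex.new
  end

  def push(value, score)
    @lock.synchronize { @heap.push(value, score) }
  end

  def pop
    @lock.synchronize { @heap.pop }
  end
end
```

With the real gem, `SyncHeap.new(DHeap.new)` would let multiple threads push and pop safely, at the cost of lock contention.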
 ## Benchmarks
 
-_See
+_See full benchmark output in subdirs of `benchmarks`.  See also … for updated
+results.  These benchmarks were measured with an Intel Core i7-1065G7 8x3.9GHz
+with d_heap v0.5.0 and ruby 2.7.2, without MJIT enabled._
+
+### Implementations
+
+* **findmin** -
+  A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
+  very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
+  `Array#delete_at(min_index)`.  Push + pop is still fast for `n < 100`, but
+  unusably slow for `n > 1000`.
+
+* **bsearch** -
+  A simple implementation with a slow `O(n)` push using `Array#bsearch` +
+  `Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
+  `Array#pop`.  It is still relatively fast for `n < 10000`, but its linear
+  time complexity really destroys it after that.
+
+* **rb_heap** -
+  A pure ruby binary min-heap that has been tuned for performance by making
+  few method calls and allocating and assigning as few variables as possible.
+  It runs in `O(log n)` for both push and pop, although pop is slower than
+  push by a constant factor.  Its much higher constant factors make it lose
+  to `bsearch` push + pop for `n < 10000`, but it holds steady with very little
+  slowdown even with `n > 10000000`.
+
+* **c++ stl** -
+  A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
+  priority_queue].  The wrapper is simply to provide compatibility with the
+  other benchmarked implementations, but it should be possible to speed this
+  up a little bit by benchmarking the `priority_queue_cxx` API directly.  It
+  has the same time complexity as rb_heap, but its much lower constant
+  factors allow it to easily outperform `bsearch`.
+
+* **c_dheap** -
+  A {DHeap} instance with the default `d` value of `4`.  It has the same time
+  complexity as `rb_heap` and `c++ stl`, but is faster than both in every
+  benchmarked scenario.
 
 [priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
 [C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
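For a concrete picture of what the rb_heap approach does, a simplified (untuned) pure-ruby binary min-heap might look like the following. This is an illustrative sketch with a hypothetical `BinHeap` name, not the benchmarked implementation:

```ruby
# A minimal pure-ruby binary min-heap (d=2), storing [score, value] pairs
# in an implicit-tree Array: children of index i live at 2i+1 and 2i+2.
class BinHeap
  def initialize
    @a = []
  end

  def size
    @a.size
  end

  # sift-up: append, then swap with the parent while the parent's score is larger
  def push(value, score)
    @a << [score, value]
    i = @a.size - 1
    while i > 0 && @a[parent = (i - 1) / 2][0] > @a[i][0]
      @a[parent], @a[i] = @a[i], @a[parent]
      i = parent
    end
    self
  end

  # sift-down: move the last leaf to the root, then swap with the smaller child
  def pop
    return nil if @a.empty?
    min = @a[0]
    last = @a.pop
    unless @a.empty?
      @a[0] = last
      i = 0
      loop do
        left     = 2 * i + 1
        right    = left + 1
        smallest = i
        smallest = left  if left  < @a.size && @a[left][0]  < @a[smallest][0]
        smallest = right if right < @a.size && @a[right][0] < @a[smallest][0]
        break if smallest == i
        @a[i], @a[smallest] = @a[smallest], @a[i]
        i = smallest
      end
    end
    min[1]
  end
end
```

A _d_-ary heap generalizes this by giving each node `d` children, which turns the two-way child comparison in `pop` into a `d`-way minimum.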
+### Scenarios
+
+Each benchmark increases N exponentially, either by √10 or approximating it
+(alternating between x3 and x3.333) in order to simplify keeping loop counts
+evenly divisible by N.
+
+#### push N items
+
+This measures the _average time per insert_ to create a queue of size N
+(clearing the queue once it reaches that size).  Use cases which push (or
+decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
+algorithm] when the graph has more edges than vertices, may want to pay more
+attention to this benchmark.
+
+![bar graph for push_n benchmarks](./images/push_n.png)
+
+== push N (N=100) ==========================================================
+push N (c_dheap): 10522662.6 i/s
+push N (findmin):  9980622.3 i/s - 1.05x slower
+push N (c++ stl):  7991608.3 i/s - 1.32x slower
+push N (rb_heap):  4607849.4 i/s - 2.28x slower
+push N (bsearch):  2769106.2 i/s - 3.80x slower
+== push N (N=10,000) =======================================================
+push N (c_dheap): 10444588.3 i/s
+push N (findmin): 10191797.4 i/s - 1.02x slower
+push N (c++ stl):  8210895.4 i/s - 1.27x slower
+push N (rb_heap):  4369252.9 i/s - 2.39x slower
+push N (bsearch):  1213580.4 i/s - 8.61x slower
+== push N (N=1,000,000) ====================================================
+push N (c_dheap): 10342183.7 i/s
+push N (findmin):  9963898.8 i/s - 1.04x slower
+push N (c++ stl):  7891924.8 i/s - 1.31x slower
+push N (rb_heap):  4350116.0 i/s - 2.38x slower
+
+All three heap implementations have little to no perceptible slowdown for `N >
+100`.  But `DHeap` runs faster than `Array#push` to an unsorted array (findmin)!
+
+#### push N then pop N items
+
+This measures the _average_ for a push **or** a pop, filling up a queue with N
+items and then draining that queue until empty.  It represents the amortized
+cost of balanced pushes and pops to fill a heap and drain it.
 
 ![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)
 
+== push N then pop N (N=100) ===============================================
+push N + pop N (c_dheap): 10954469.2 i/s
+push N + pop N (c++ stl):  9317140.2 i/s - 1.18x slower
+push N + pop N (bsearch):  4808770.2 i/s - 2.28x slower
+push N + pop N (findmin):  4321411.9 i/s - 2.53x slower
+push N + pop N (rb_heap):  2467417.0 i/s - 4.44x slower
+== push N then pop N (N=10,000) ============================================
+push N + pop N (c_dheap): 8083962.7 i/s
+push N + pop N (c++ stl): 7365661.8 i/s - 1.10x slower
+push N + pop N (bsearch): 2257047.9 i/s - 3.58x slower
+push N + pop N (rb_heap): 1439204.3 i/s - 5.62x slower
+== push N then pop N (N=1,000,000) =========================================
+push N + pop N (c++ stl): 5274657.5 i/s
+push N + pop N (c_dheap): 4731117.9 i/s - 1.11x slower
+push N + pop N (rb_heap):  976688.6 i/s - 5.40x slower
+
+At N=100, findmin still beats a pure-ruby heap.  But above that, it slows down
+too much to be useful.  At N=10k, bsearch still beats a pure ruby heap, but
+above 30k it slows down too much to be useful.  `DHeap` consistently runs
+4.5-5.5x faster than the pure ruby heap.
+
+#### push & pop on N-item heap
+
+This measures the combined time to push once and pop once, which is done
+repeatedly while keeping a stable heap size of N.  It's an approximation for
+scenarios which reach a stable size and then plateau with balanced pushes and
+pops.  E.g. timers and timeouts will often reschedule themselves or replace
+themselves with new timers or timeouts, maintaining a roughly stable total count
+of timers.
 
 ![bar graph for push_pop benchmarks](./images/push_pop.png)
 
-queue size = 5000000:   445234.5 i/s - 1.65x slower
-queue size = 10000000:  423119.0 i/s - 1.74x slower
-
-== push + pop (bsearch)
-queue size = 10000:     786334.2 i/s
-queue size = 25000:     364963.8 i/s - 2.15x slower
-queue size = 50000:     200520.6 i/s - 3.92x slower
-queue size = 100000:     88607.0 i/s - 8.87x slower
-queue size = 250000:     34530.5 i/s - 22.77x slower
-queue size = 500000:     17965.4 i/s - 43.77x slower
-queue size = 1000000:     5638.7 i/s - 139.45x slower
-queue size = 2500000:     1302.0 i/s - 603.93x slower
-queue size = 5000000:      592.0 i/s - 1328.25x slower
-queue size = 10000000:     288.8 i/s - 2722.66x slower
-
-== push + pop (c_dheap)
-queue size = 10000:    7311366.6 i/s
-queue size = 50000:    6737824.5 i/s - 1.09x slower
-queue size = 25000:    6407340.6 i/s - 1.14x slower
-queue size = 100000:   6254396.3 i/s - 1.17x slower
-queue size = 250000:   5917684.5 i/s - 1.24x slower
-queue size = 500000:   5126307.6 i/s - 1.43x slower
-queue size = 1000000:  4403494.1 i/s - 1.66x slower
-queue size = 2500000:  3304088.2 i/s - 2.21x slower
-queue size = 5000000:  2664897.7 i/s - 2.74x slower
-queue size = 10000000: 2137927.6 i/s - 3.42x slower
-
-## Analysis
-
-### Time complexity
-
-There are two fundamental heap operations: sift-up (used by push) and sift-down
-(used by pop).
-
-* A _d_-ary heap will have `log n / log d` layers, so both sift operations can
-  perform as many as `log n / log d` writes, when a member sifts the entire
-  length of the tree.
-* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
-* Sift-down makes d comparisons per layer, so pop runs in `O(d log n / log d)`.
-
-So, in the simplest case of running balanced push/pop while maintaining the same
-heap size, `(1 + d) log n / log d` comparisons are made.  In the worst case,
-when every sift traverses every layer of the tree, `d=4` requires the fewest
-comparisons for combined insert and delete:
-
-* (1 + 2) lg n / lg d ≈ 4.328085 lg n
-* (1 + 3) lg n / lg d ≈ 3.640957 lg n
-* (1 + 4) lg n / lg d ≈ 3.606738 lg n
-* (1 + 5) lg n / lg d ≈ 3.728010 lg n
-* (1 + 6) lg n / lg d ≈ 3.906774 lg n
-* (1 + 7) lg n / lg d ≈ 4.111187 lg n
-* (1 + 8) lg n / lg d ≈ 4.328085 lg n
-* (1 + 9) lg n / lg d ≈ 4.551196 lg n
-* (1 + 10) lg n / lg d ≈ 4.777239 lg n
+
+push + pop (findmin)
+N 10:       5480288.0 i/s
+N 100:      2595178.8 i/s - 2.11x slower
+N 1000:      224813.9 i/s - 24.38x slower
+N 10000:      12630.7 i/s - 433.89x slower
+N 100000:      1097.3 i/s - 4994.31x slower
+N 1000000:      135.9 i/s - 40313.05x slower
+N 10000000:      12.9 i/s - 425838.01x slower
+
+push + pop (bsearch)
+N 10:      3931408.4 i/s
+N 100:     2904181.8 i/s - 1.35x slower
+N 1000:    2203157.1 i/s - 1.78x slower
+N 10000:   1209584.9 i/s - 3.25x slower
+N 100000:    81121.4 i/s - 48.46x slower
+N 1000000:    5356.0 i/s - 734.02x slower
+N 10000000:    281.9 i/s - 13946.33x slower
+
+push + pop (rb_heap)
+N 10:      2325816.5 i/s
+N 100:     1603540.3 i/s - 1.45x slower
+N 1000:    1262515.2 i/s - 1.84x slower
+N 10000:    950389.3 i/s - 2.45x slower
+N 100000:   732548.8 i/s - 3.17x slower
+N 1000000:  673577.8 i/s - 3.45x slower
+N 10000000: 467512.3 i/s - 4.97x slower
+
+push + pop (c++ stl)
+N 10:       7706818.6 i/s - 1.01x slower
+N 100:      7393127.3 i/s - 1.05x slower
+N 1000:     6898781.3 i/s - 1.13x slower
+N 10000:    5731130.5 i/s - 1.36x slower
+N 100000:   4842393.2 i/s - 1.60x slower
+N 1000000:  4170936.4 i/s - 1.86x slower
+N 10000000: 2737146.6 i/s - 2.84x slower
+
+push + pop (c_dheap)
+N 10:      10196454.1 i/s
+N 100:      9668679.8 i/s - 1.05x slower
+N 1000:     9339557.0 i/s - 1.09x slower
+N 10000:    8045103.0 i/s - 1.27x slower
+N 100000:   7150276.7 i/s - 1.43x slower
+N 1000000:  6490261.6 i/s - 1.57x slower
+N 10000000: 3734856.5 i/s - 2.73x slower
+## Time complexity analysis
+
+There are two fundamental heap operations: sift-up (used by push or decrease
+score) and sift-down (used by pop or delete or increase score).  Each sift
+bubbles an item to its correct location in the tree.
+
+* A _d_-ary heap has `log n / log d` layers, so either sift performs as many as
+  `log n / log d` writes, when a member sifts the entire length of the tree.
+* Sift-up needs one comparison per layer: `O(log n / log d)`.
+* Sift-down needs d comparisons per layer: `O(d log n / log d)`.
+
+So, in the case of a balanced push then pop, as many as `(1 + d) log n / log d`
+comparisons are made.  Looking only at this worst case combo, `d=4` requires the
+fewest comparisons for a combined push and pop:
+
+* `(1 + 2) log n / log 2  ≈ 4.328085 log n`
+* `(1 + 3) log n / log 3  ≈ 3.640957 log n`
+* `(1 + 4) log n / log 4  ≈ 3.606738 log n`
+* `(1 + 5) log n / log 5  ≈ 3.728010 log n`
+* `(1 + 6) log n / log 6  ≈ 3.906774 log n`
+* `(1 + 7) log n / log 7  ≈ 4.111187 log n`
+* `(1 + 8) log n / log 8  ≈ 4.328085 log n`
+* `(1 + 9) log n / log 9  ≈ 4.551196 log n`
+* `(1 + 10) log n / log 10 ≈ 4.777239 log n`
 * etc...
 
 See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
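Those coefficients can be reproduced in a line of ruby: each one is `(1 + d) / log(d)` (natural log here; the choice of base only scales every entry by the same constant, so the minimum is unchanged):

```ruby
# Worst-case comparisons for a combined push + pop: (1 + d) * log(n) / log(d).
# The log(n) coefficient is minimized at d = 4.
coeffs = (2..10).map { |d| [d, ((1 + d) / Math.log(d)).round(6)] }.to_h

coeffs[4]                         # => 3.606738
coeffs.min_by { |_d, c| c }.first # => 4
```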
+However, what this simple count of comparisons misses is the extent to which
+modern compilers can optimize code (e.g. by unrolling the comparison loop to
+execute on registers) and, more importantly, how good modern processors are at
+pipelined speculative execution using branch prediction, etc.  Benchmarks should
+be run on the _exact same_ hardware platform that production code will use,
+as the sift-down operation is especially sensitive to good pipelining.
 
-provide better cache locality.  Because the heap is a complete binary tree, the
-elements can be stored in an array, without the need for tree or list pointers.
 
+## Comparison performance
 
-as an array which only stores values.
+It is often useful to use external scores for otherwise uncomparable values.
+And casting an item or score (via `to_f`) can also be time consuming.  So
+`DHeap` evaluates and stores scores at the time of insertion, and they will be
+compared directly without needing any further lookup.
 
+Numeric values can be compared _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values.
+Additionally, native C integers or floats can be compared _much_ faster than
+ruby `Numeric` objects.  So scores are converted to Float and stored as
+`double`, which is 64 bits on an [LP64 64-bit system].
 
-take precautions such as locking access behind a mutex.
+
+[LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
 ## Alternative data structures
 
 As always, you should run benchmarks with your expected scenarios to determine
 which is best for your application.
 
-Depending on your use-case,
-and `#insert` might be just fine!
-
-`memcpy` is so fast on modern hardware that your dataset might not be large
-enough for it to matter.
+Depending on your use-case, a sorted `Array` using `#bsearch_index`
+and `#insert` might be just fine!  It only takes a couple of lines of code and
+is probably "Fast Enough".
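That couple-of-lines approach can be sketched as follows (`SortedQueue` is a hypothetical name; the Array is kept in *descending* score order so that `Array#pop` removes the minimum in `O(1)`):

```ruby
# Sorted-Array priority queue: O(log n) bsearch + O(n) insert on push,
# O(1) pop. Fine for small n; the O(n) insert dominates for large n.
class SortedQueue
  def initialize
    @array = [] # [score, value] pairs, sorted by descending score
  end

  def push(value, score)
    # first index whose score is <= the new score (find-minimum bsearch mode)
    index = @array.bsearch_index { |(s, _v)| s <= score } || @array.size
    @array.insert(index, [score, value])
    self
  end

  def pop
    pair = @array.pop
    pair && pair[1] # value with the minimum score
  end

  def peek
    pair = @array.last
    pair && pair[1]
  end

  def size
    @array.size
  end
end
```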
-More complex heap
+More complex heap variants, e.g. [Fibonacci heap], allow heaps to be split and
 merged, which gives some graph algorithms a lower amortized time complexity.  But
 in practice, _d_-ary heaps have much lower overhead and often run faster.
@@ -385,25 +402,60 @@ of values in it, then you may want to use a self-balancing binary search tree
 [red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
 [skip-list]: https://en.wikipedia.org/wiki/Skip_list
 
-[Hashed and Hierarchical Timing Wheels][timing
-family of data structures) can
+[Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in the
+timing wheel family of data structures) can have effectively `O(1)` running time
+in most cases.  Although the implementation for that data structure is more
 complex than a heap, it may be necessary for enormous values of N.
 
-[timing
+[timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## Supported platforms
+
+See the [CI workflow] for all supported platforms.
+
+[CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI
+
+`d_heap` may contain bugs on 32-bit systems.  Currently, `d_heap` is only tested
+on 64-bit x86 CRuby 2.4-3.0 under Linux and Mac OS.
+
+## Caveats and TODOs (PRs welcome!)
+
+A `DHeap`'s internal array grows but never shrinks.  At the very least, there
+should be a `#compact` or `#shrink` method, which should also run during
+`#freeze`.  It might make sense to automatically shrink (to no more than 2x the
+current size) during GC's compact phase.
+
+Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F.  This
+might lead to a different default `d` value (maybe 16 or 24?).
+
+Shrink scores to 64 bits: either store a type flag with each entry (this could
+be used to support non-numeric scores) or require users to choose between
+`Integer` or `Float` at construction time.  Reducing memory usage should also
+improve speed for very large heaps.
+
+Patches to support JRuby, rubinius, 32-bit systems, or any other platforms are
+welcome!  JRuby and Truffle Ruby ought to be able to use [Java's PriorityQueue]?
+Other platforms could fall back on the (slower) pure ruby implementation used by
+the benchmarks.
+
+[Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html
+
+Allow a max-heap (or other configurations of the compare function).  This can be
+very easily implemented by just reversing the scores.
 
+_Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the basic
+numeric use case simplicity and speed can be preserved.
 
-heap.  This enforces a uniqueness constraint on items on the heap, and also
-allows items to be more efficiently deleted or adjusted.  However, maintaining
-the hash does lead to a small drop in normal `#push` and `#pop` performance.
+Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current time"
+and move all values below that time onto an Array.
 
-features that are loosely inspired by go's timers.
-heap after deletion
+Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)` which could contain
+some features that are loosely inspired by go's timers.  Go lazily sifts its
+heap after deletion or adjustments, to achieve faster amortized runtime.
+There's no need to actually remove a deleted item from the heap, if you re-add
+it back before it's next evaluated.  A similar trick can be to store "far away"
+values in an internal `Hash`, assuming many will be deleted before they rise to
+the top.  This could naturally evolve into a [timing wheel] variant.
 
 ## Development