aggregate_afurmanov 0.2.2
- data/.gitignore +1 -0
- data/LICENSE +22 -0
- data/README.textile +215 -0
- data/Rakefile +15 -0
- data/VERSION +1 -0
- data/aggregate_afurmanov.gemspec +48 -0
- data/lib/aggregate.rb +298 -0
- data/test/ts_aggregate.rb +162 -0
- metadata +75 -0
data/.gitignore
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
pkg/
|
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2009 Joseph Ruscio
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person
|
4
|
+
obtaining a copy of this software and associated documentation
|
5
|
+
files (the "Software"), to deal in the Software without
|
6
|
+
restriction, including without limitation the rights to use,
|
7
|
+
copy, modify, merge, publish, distribute, sublicense, and/or sell
|
8
|
+
copies of the Software, and to permit persons to whom the
|
9
|
+
Software is furnished to do so, subject to the following
|
10
|
+
conditions:
|
11
|
+
|
12
|
+
The above copyright notice and this permission notice shall be
|
13
|
+
included in all copies or substantial portions of the Software.
|
14
|
+
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
16
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
|
17
|
+
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
18
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
|
19
|
+
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
|
20
|
+
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
|
21
|
+
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
|
22
|
+
OTHER DEALINGS IN THE SOFTWARE.
|
data/README.textile
ADDED
@@ -0,0 +1,215 @@
|
|
1
|
+
h1. Aggregate
|
2
|
+
|
3
|
+
By Joseph Ruscio
|
4
|
+
|
5
|
+
Aggregate is an intuitive Ruby implementation of a statistics aggregator
|
6
|
+
including both default and configurable histogram support. It does this
|
7
|
+
without recording/storing any of the actual sample values, making it
|
8
|
+
suitable for tracking statistics across millions/billions of samples
|
9
|
+
without any impact on performance or memory footprint. Originally
|
10
|
+
inspired by the Aggregate support in "SystemTap.":http://sourceware.org/systemtap
|
11
|
+
|
12
|
+
h2. Getting Started
|
13
|
+
|
14
|
+
Aggregates are easy to instantiate, populate with sample data, and then
|
15
|
+
inspect for common aggregate statistics:
|
16
|
+
|
17
|
+
<pre><code>
|
18
|
+
#After instantiation use the << operator to add a sample to the aggregate:
|
19
|
+
stats = Aggregate.new
|
20
|
+
|
21
|
+
loop do
|
22
|
+
# Take some action that generates a sample measurement
|
23
|
+
stats << sample
|
24
|
+
end
|
25
|
+
|
26
|
+
# The number of samples
|
27
|
+
stats.count
|
28
|
+
|
29
|
+
# The average
|
30
|
+
stats.mean
|
31
|
+
|
32
|
+
# Max sample value
|
33
|
+
stats.max
|
34
|
+
|
35
|
+
# Min sample value
|
36
|
+
stats.min
|
37
|
+
|
38
|
+
# The standard deviation
|
39
|
+
stats.std_dev
|
40
|
+
</code></pre>
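The statistics above can be kept without storing samples because each one only needs a few running totals. The following is a minimal sketch of that idea, not the gem's own class: @RunningStats@ and its instance variable names are hypothetical, and the standard deviation uses the same sum-of-squares formula that appears in @lib/aggregate.rb@.

```ruby
# Hypothetical accumulator: keeps only running totals, never the samples.
class RunningStats
  attr_reader :count, :min, :max

  def initialize
    @count = 0
    @sum = 0.0
    @sum_of_squares = 0.0
    @min = nil
    @max = nil
  end

  # Fold one sample into the running totals.
  def <<(sample)
    @min = sample if @min.nil? || sample < @min
    @max = sample if @max.nil? || sample > @max
    @count += 1
    @sum += sample
    @sum_of_squares += sample * sample
    self
  end

  def mean
    @sum / @count
  end

  # Sample standard deviation computed from the running totals.
  def std_dev
    Math.sqrt((@sum_of_squares - (@sum * @sum) / @count) / (@count - 1))
  end
end

stats = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| stats << x }
puts stats.count
puts stats.mean
puts stats.std_dev
```

Memory use stays constant regardless of how many samples are added, which is the property the introduction relies on.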
|
41
|
+
|
42
|
+
h2. Histograms
|
43
|
+
|
44
|
+
Perhaps more important than the basic aggregate statistics detailed above,
|
45
|
+
Aggregate also maintains a histogram of samples. For anything other than
|
46
|
+
normally distributed data, the basic statistics are insufficient at best and often downright misleading.
|
47
|
+
37Signals recently posted a terse but effective "explanation":http://37signals.com/svn/posts/1836-the-problem-with-averages of the importance of histograms.
|
48
|
+
Aggregate maintains its histogram internally as a set of "buckets".
|
49
|
+
Each bucket represents a range of possible sample values. The set of all buckets
|
50
|
+
represents the range of "normal" sample values.
|
51
|
+
|
52
|
+
h3. Binary Histograms
|
53
|
+
|
54
|
+
Without any configuration, an Aggregate instance maintains a binary histogram, where
|
55
|
+
each bucket represents a range twice as large as the preceding bucket i.e.
|
56
|
+
[1,1], [2,3], [4,5,6,7], [8,9,10,11,12,13,14,15]. The default binary histogram
|
57
|
+
provides for 8 buckets (DEFAULT_LOG_BUCKETS), covering the range [1, (2^7) - 1]; passing :log_buckets => 128 theoretically covers the range [1, (2^127) - 1]
|
58
|
+
(See NOTES below for a discussion on the effects in practice of insufficient
|
59
|
+
precision.)
|
60
|
+
|
61
|
+
Binary histograms are useful when we have little idea about what the
|
62
|
+
sample distribution may look like as almost any positive value will
|
63
|
+
fall into some bucket. After using binary histograms to determine
|
64
|
+
the coarse-grained characteristics of your sample space you can
|
65
|
+
configure a linear histogram to examine it in closer detail.
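To make the doubling bucket ranges concrete, here is a small sketch that lists them and looks up the bucket for a value. The helper names @binary_bucket_ranges@ and @binary_bucket_for@ are hypothetical and are not part of the gem, which computes the index with a logarithm instead of a search.

```ruby
# Bounds of the first `count` binary buckets:
# bucket i covers the integers [2**i, 2**(i + 1) - 1].
def binary_bucket_ranges(count)
  (0...count).map { |i| [2**i, 2**(i + 1) - 1] }
end

# Index of the bucket containing value, or nil if it falls outside all of them.
def binary_bucket_for(value, ranges)
  ranges.index { |low, high| value >= low && value <= high }
end

ranges = binary_bucket_ranges(4)
p ranges                         # [[1, 1], [2, 3], [4, 7], [8, 15]]
p binary_bucket_for(5, ranges)   # 2
p binary_bucket_for(16, ranges)  # nil
```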
|
66
|
+
|
67
|
+
h3. Linear Histograms
|
68
|
+
|
69
|
+
Linear histograms are specified with the three values low, high, and width.
|
70
|
+
Low and high specify a range [low, high) of values included in the
|
71
|
+
histogram (all others are outliers). Width specifies the number of
|
72
|
+
values represented by each bucket and therefore the number of
|
73
|
+
buckets i.e. granularity of the histogram. The histogram range
|
74
|
+
(high - low) must be a multiple of width:
|
75
|
+
|
76
|
+
<pre><code>
|
77
|
+
#Want to track aggregate stats on response times in ms
|
78
|
+
response_stats = Aggregate.new(:low => 0, :high => 2000, :width => 50)
|
79
|
+
</code></pre>
|
80
|
+
|
81
|
+
The example above creates a linear histogram that tracks the
|
82
|
+
response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully
|
83
|
+
most of your samples fall in the first couple buckets!
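The rules for a linear histogram can be sketched in a few lines. The helpers below are hypothetical (@validate_linear!@ and @linear_bucket_index@ are not gem methods); they mirror the three checks made in the constructor and compute the bucket index arithmetically, whereas the gem scans the buckets one by one.

```ruby
# Raise ArgumentError unless (low, high, width) describe a valid linear histogram.
def validate_linear!(low, high, width)
  raise ArgumentError, "high must be > low" if high <= low
  raise ArgumentError, "width must be <= (high - low)" if high - low < width
  unless ((high - low) % width).zero?
    raise ArgumentError, "(high - low) must be a multiple of width"
  end
end

# Index of the bucket holding value, or nil when value is an outlier,
# i.e. outside the half-open range [low, high).
def linear_bucket_index(value, low, high, width)
  return nil if value < low || value >= high
  ((value - low) / width).floor
end

validate_linear!(0, 2000, 50)
p linear_bucket_index(49, 0, 2000, 50)    # 0
p linear_bucket_index(50, 0, 2000, 50)    # 1
p linear_bucket_index(1999, 0, 2000, 50)  # 39
p linear_bucket_index(2000, 0, 2000, 50)  # nil
```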
|
84
|
+
|
85
|
+
h3. Histogram Outliers
|
86
|
+
|
87
|
+
An Aggregate records any samples that fall outside the histogram range as
|
88
|
+
outliers:
|
89
|
+
|
90
|
+
<pre><code>
|
91
|
+
# Number of samples that fall below the normal range
|
92
|
+
stats.outliers_low
|
93
|
+
|
94
|
+
# Number of samples that fall above the normal range
|
95
|
+
stats.outliers_high
|
96
|
+
</code></pre>
|
97
|
+
|
98
|
+
h3. Histogram Iterators
|
99
|
+
|
100
|
+
Once a histogram is populated Aggregate provides iterator support for
|
101
|
+
examining the contents of buckets. The iterators provide both the
|
102
|
+
number of samples in the bucket, as well as its range:
|
103
|
+
|
104
|
+
<pre><code>
|
105
|
+
#Examine every bucket
|
106
|
+
stats.each do |bucket, count|
|
107
|
+
end
|
108
|
+
|
109
|
+
#Examine only buckets containing samples
|
110
|
+
stats.each_nonzero do |bucket, count|
|
111
|
+
end
|
112
|
+
</code></pre>
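As an illustration of what the two iterators yield, here is a self-contained sketch. @LinearBuckets@ is a hypothetical stand-in for a populated linear histogram; each iteration yields the lower bound of a bucket together with its sample count. Returning an enumerator when no block is given is an addition of this sketch, not something the gem does.

```ruby
# Hypothetical stand-in for a populated linear histogram.
class LinearBuckets
  def initialize(low, width, counts)
    @low = low
    @width = width
    @counts = counts
  end

  # Yield (bucket lower bound, count) for every bucket.
  def each
    return enum_for(:each) unless block_given?
    @counts.each_with_index { |count, i| yield(@low + i * @width, count) }
  end

  # Yield only the buckets that contain at least one sample.
  def each_nonzero
    return enum_for(:each_nonzero) unless block_given?
    each { |bucket, count| yield(bucket, count) unless count.zero? }
  end
end

buckets = LinearBuckets.new(0, 50, [3, 0, 1])
buckets.each_nonzero { |bucket, count| puts "#{bucket}: #{count}" }
```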
|
113
|
+
|
114
|
+
h3. Histogram Bar Chart
|
115
|
+
|
116
|
+
Finally Aggregate contains sophisticated pretty-printing support to generate
|
117
|
+
ASCII bar charts. For any given number of columns >= 80 (defaults to 80) and
|
118
|
+
sample distribution the <code>to_s</code> method properly sets a marker weight based on the
|
119
|
+
samples per bucket and aligns all output. Empty buckets are skipped to conserve
|
120
|
+
screen space.
|
121
|
+
|
122
|
+
<pre><code>
|
123
|
+
# Generate and display an 80 column histogram
|
124
|
+
puts stats.to_s
|
125
|
+
|
126
|
+
# Generate and display a 120 column histogram
|
127
|
+
puts stats.to_s(120)
|
128
|
+
</code></pre>
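The "marker weight" can be illustrated with a short sketch. @render_bars@ is a hypothetical helper, not the gem's @to_s@; it follows the same scaling rule as the library source, where one '@' stands for @max_count / bar_width@ samples but never fewer than one. The fixed six-character value column is an assumption made to keep the example short, and the input is assumed to be non-empty.

```ruby
# rows is an array of [value, count] pairs; bar_width is the number of
# columns available for the bar itself.
def render_bars(rows, bar_width)
  max_count = rows.map { |_, count| count }.max
  # Number of samples represented by a single '@' (at least 1.0)
  weight = [max_count.to_f / bar_width, 1.0].max
  rows.map do |value, count|
    bar = "@" * (count / weight).to_i
    format("%6d |%-#{bar_width}s| %d", value, bar, count)
  end
end

puts render_bars([[1024, 5], [2048, 10], [4096, 20]], 10)
```

With a bar width of 10 and a largest bucket of 20 samples, each '@' represents two samples, so the rows get 2, 5, and 10 markers respectively.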
|
129
|
+
|
130
|
+
This code example populates both a binary and linear histogram with the same
|
131
|
+
set of 65536 values generated by <code>rand</code> to produce the
|
132
|
+
two histograms that follow it:
|
133
|
+
|
134
|
+
<pre><code>
|
135
|
+
require 'rubygems'
|
136
|
+
require 'aggregate'
|
137
|
+
|
138
|
+
# Create an Aggregate instance
|
139
|
+
binary_aggregate = Aggregate.new(:log_buckets => 128)
|
140
|
+
linear_aggregate = Aggregate.new(:low => 0, :high => 65536, :width => 4096)
|
141
|
+
|
142
|
+
65536.times do
|
143
|
+
x = rand(65536)
|
144
|
+
binary_aggregate << x
|
145
|
+
linear_aggregate << x
|
146
|
+
end
|
147
|
+
|
148
|
+
puts binary_aggregate.to_s
|
149
|
+
puts linear_aggregate.to_s
|
150
|
+
</code></pre>
|
151
|
+
|
152
|
+
h4. Binary Histogram
|
153
|
+
|
154
|
+
<pre><code>
|
155
|
+
value |------------------------------------------------------------------| count
|
156
|
+
1 | | 3
|
157
|
+
2 | | 1
|
158
|
+
4 | | 5
|
159
|
+
8 | | 9
|
160
|
+
16 | | 15
|
161
|
+
32 | | 29
|
162
|
+
64 | | 62
|
163
|
+
128 | | 115
|
164
|
+
256 | | 267
|
165
|
+
512 |@ | 523
|
166
|
+
1024 |@ | 970
|
167
|
+
2048 |@@@ | 1987
|
168
|
+
4096 |@@@@@@@@ | 4075
|
169
|
+
8192 |@@@@@@@@@@@@@@@@ | 8108
|
170
|
+
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 16405
|
171
|
+
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
|
172
|
+
~
|
173
|
+
Total |------------------------------------------------------------------| 65535
|
174
|
+
</code></pre>
|
175
|
+
|
176
|
+
h4. Linear (0, 65536, 4096) Histogram
|
177
|
+
|
178
|
+
<pre><code>
|
179
|
+
value |------------------------------------------------------------------| count
|
180
|
+
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4094
|
181
|
+
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 4202
|
182
|
+
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4118
|
183
|
+
12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4059
|
184
|
+
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 3999
|
185
|
+
20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4083
|
186
|
+
24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4134
|
187
|
+
28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4143
|
188
|
+
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4152
|
189
|
+
36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4033
|
190
|
+
40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4064
|
191
|
+
45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4012
|
192
|
+
49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4070
|
193
|
+
53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4090
|
194
|
+
57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4135
|
195
|
+
61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4144
|
196
|
+
Total |------------------------------------------------------------------| 65532
|
197
|
+
</code></pre>
|
198
|
+
We can see from these histograms that Ruby's rand function does a relatively good
|
199
|
+
job of distributing returned values in the requested range.
|
200
|
+
|
201
|
+
h2. Examples
|
202
|
+
|
203
|
+
Here's an example of a "handy timing benchmark":http://gist.github.com/187669
|
204
|
+
implemented with aggregate.
|
205
|
+
|
206
|
+
h2. NOTES
|
207
|
+
|
208
|
+
Ruby doesn't have a log2 function built into Math, so we approximate with
|
209
|
+
log(x)/log(2). Theoretically the integer part of log( 2^n - 1 )/ log(2) is n-1. Unfortunately due
|
210
|
+
to precision limitations, once n reaches a certain size (somewhere > 32)
|
211
|
+
this starts to return n. The larger the value of n, the more numbers i.e.
|
212
|
+
(2^n - 2), (2^n - 3), etc. fall prey to this error. Could probably look into
|
213
|
+
using something like BigDecimal, but for the current purposes of the binary
|
214
|
+
histogram i.e. a simple coarse-grained view the current implementation is
|
215
|
+
sufficient.
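The precision problem can be seen directly: once a large integer is converted to a Float, 2^n - 1 may round to exactly 2^n, so no logarithm can tell the two apart. The sketch below demonstrates this and shows an exact integer alternative. It assumes Ruby 2.1 or later for @Integer#bit_length@; the helper name @exact_log2_floor@ is hypothetical and not part of the gem.

```ruby
# Exact floor(log2(x)) for a positive Integer, with no floating point involved.
def exact_log2_floor(x)
  x.bit_length - 1
end

n = 60
below_power = 2**n - 1

# A Float has a 53-bit significand, so the conversion rounds up to 2**n.
puts below_power.to_f == 2.0**n      # true

# The integer-only version still separates the two values.
puts exact_log2_floor(below_power)   # 59
puts exact_log2_floor(2**n)          # 60
```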
|
data/Rakefile
ADDED
@@ -0,0 +1,15 @@
|
|
1
|
+
require 'rake'
|
2
|
+
|
3
|
+
begin
|
4
|
+
require 'jeweler'
|
5
|
+
Jeweler::Tasks.new do |gemspec|
|
6
|
+
gemspec.name = "aggregate_afurmanov"
|
7
|
+
gemspec.summary = "Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support"
|
8
|
+
gemspec.description = "Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support. For a detailed README see: http://github.com/josephruscio/aggregate"
|
9
|
+
gemspec.email = "aleksandr.furmanov@gmail.com"
|
10
|
+
gemspec.homepage = "http://github.com/afurmanov/aggregate"
|
11
|
+
gemspec.authors = ["Joseph Ruscio, Aleksandr Furmanov"]
|
12
|
+
end
|
13
|
+
rescue LoadError
|
14
|
+
puts "Jeweler not available. Install it with: sudo gem install technicalpickles-jeweler -s http://gems.github.com"
|
15
|
+
end
|
data/VERSION
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
0.2.2
|
data/aggregate_afurmanov.gemspec
ADDED
@@ -0,0 +1,48 @@
|
|
1
|
+
# Generated by jeweler
|
2
|
+
# DO NOT EDIT THIS FILE DIRECTLY
|
3
|
+
# Instead, edit Jeweler::Tasks in Rakefile, and run the gemspec command
|
4
|
+
# -*- encoding: utf-8 -*-
|
5
|
+
|
6
|
+
Gem::Specification.new do |s|
|
7
|
+
s.name = %q{aggregate_afurmanov}
|
8
|
+
s.version = "0.2.2"
|
9
|
+
|
10
|
+
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
|
+
s.authors = ["Joseph Ruscio, Aleksandr Furmanov"]
|
12
|
+
s.date = %q{2010-12-08}
|
13
|
+
s.description = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support. For a detailed README see: http://github.com/josephruscio/aggregate}
|
14
|
+
s.email = %q{aleksandr.furmanov@gmail.com}
|
15
|
+
s.extra_rdoc_files = [
|
16
|
+
"LICENSE",
|
17
|
+
"README.textile"
|
18
|
+
]
|
19
|
+
s.files = [
|
20
|
+
".gitignore",
|
21
|
+
"LICENSE",
|
22
|
+
"README.textile",
|
23
|
+
"Rakefile",
|
24
|
+
"VERSION",
|
25
|
+
"aggregate_afurmanov.gemspec",
|
26
|
+
"lib/aggregate.rb",
|
27
|
+
"test/ts_aggregate.rb"
|
28
|
+
]
|
29
|
+
s.homepage = %q{http://github.com/afurmanov/aggregate}
|
30
|
+
s.rdoc_options = ["--charset=UTF-8"]
|
31
|
+
s.require_paths = ["lib"]
|
32
|
+
s.rubygems_version = %q{1.3.7}
|
33
|
+
s.summary = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
|
34
|
+
s.test_files = [
|
35
|
+
"test/ts_aggregate.rb"
|
36
|
+
]
|
37
|
+
|
38
|
+
if s.respond_to? :specification_version then
|
39
|
+
current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
|
40
|
+
s.specification_version = 3
|
41
|
+
|
42
|
+
if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
|
43
|
+
else
|
44
|
+
end
|
45
|
+
else
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
data/lib/aggregate.rb
ADDED
@@ -0,0 +1,298 @@
|
|
1
|
+
# Implements aggregate statistics and maintains
|
2
|
+
# configurable histogram for a set of given samples. Convenient for tracking
|
3
|
+
# high throughput data.
|
4
|
+
class Aggregate
|
5
|
+
#The current average of all samples
|
6
|
+
attr_reader :mean
|
7
|
+
|
8
|
+
#The current number of samples
|
9
|
+
attr_reader :count
|
10
|
+
|
11
|
+
#The maximum sample value
|
12
|
+
attr_reader :max
|
13
|
+
|
14
|
+
#The minimum sample value
|
15
|
+
attr_reader :min
|
16
|
+
|
17
|
+
#The sum of all samples
|
18
|
+
attr_reader :sum
|
19
|
+
|
20
|
+
#The number of samples falling below the lowest valued histogram bucket
|
21
|
+
attr_reader :outliers_low
|
22
|
+
|
23
|
+
#The number of samples falling above the highest valued histogram bucket
|
24
|
+
attr_reader :outliers_high
|
25
|
+
|
26
|
+
DEFAULT_LOG_BUCKETS = 8
|
27
|
+
|
28
|
+
# The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**(log_buckets - 1))
|
29
|
+
def log_buckets
|
30
|
+
@log_buckets
|
31
|
+
end
|
32
|
+
|
33
|
+
# Create a new Aggregate that maintains a binary logarithmic histogram
|
34
|
+
# by default. Specifying the :low, :high, and :width options configures
|
35
|
+
# the aggregate to maintain a linear histogram with (high - low)/width buckets
|
36
|
+
def initialize (options={})
|
37
|
+
low = options[:low]
|
38
|
+
high = options[:high]
|
39
|
+
width = options[:width]
|
40
|
+
@log_buckets = options[:log_buckets] || DEFAULT_LOG_BUCKETS
|
41
|
+
@count = 0
|
42
|
+
@sum = 0.0
|
43
|
+
@sum2 = 0.0
|
44
|
+
@outliers_low = 0
|
45
|
+
@outliers_high = 0
|
46
|
+
|
47
|
+
# If the user asks we maintain a linear histogram where
|
48
|
+
# values in the range [low, high) are bucketed in multiples
|
49
|
+
# of width
|
50
|
+
if (nil != low && nil != high && nil != width)
|
51
|
+
|
52
|
+
#Validate linear specification
|
53
|
+
if high <= low
|
54
|
+
raise ArgumentError, "High bucket must be > Low bucket"
|
55
|
+
end
|
56
|
+
|
57
|
+
if high - low < width
|
58
|
+
raise ArgumentError, "Histogram width must be <= histogram range"
|
59
|
+
end
|
60
|
+
|
61
|
+
if 0 != (high - low).modulo(width)
|
62
|
+
raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
|
63
|
+
end
|
64
|
+
|
65
|
+
@low = low
|
66
|
+
@high = high
|
67
|
+
@width = width
|
68
|
+
else
|
69
|
+
low ||= 1
|
70
|
+
@low = 1
|
71
|
+
@low = to_bucket(to_index(low))
|
72
|
+
@high = to_bucket(to_index(@low) + log_buckets - 1)
|
73
|
+
end
|
74
|
+
|
75
|
+
#Initialize all buckets to 0
|
76
|
+
@buckets = Array.new(bucket_count, 0)
|
77
|
+
end
|
78
|
+
|
79
|
+
# Include a sample in the aggregate
|
80
|
+
def << data
|
81
|
+
|
82
|
+
# Update min/max
|
83
|
+
if 0 == @count
|
84
|
+
@min = data
|
85
|
+
@max = data
|
86
|
+
else
|
87
|
+
@max = [data, @max].max
|
88
|
+
@min = [data, @min].min
|
89
|
+
end
|
90
|
+
|
91
|
+
# Update the running info
|
92
|
+
@count += 1
|
93
|
+
@sum += data
|
94
|
+
@sum2 += (data * data)
|
95
|
+
|
96
|
+
# Update the bucket
|
97
|
+
@buckets[to_index(data)] += 1 unless outlier?(data)
|
98
|
+
end
|
99
|
+
|
100
|
+
def mean
|
101
|
+
@sum / @count
|
102
|
+
end
|
103
|
+
|
104
|
+
#Calculate the standard deviation
|
105
|
+
def std_dev
|
106
|
+
Math.sqrt((@sum2.to_f - ((@sum.to_f * @sum.to_f)/@count.to_f)) / (@count.to_f - 1))
|
107
|
+
end
|
108
|
+
|
109
|
+
# Combine two aggregates
|
110
|
+
#def +(b)
|
111
|
+
# a = self
|
112
|
+
# c = Aggregate.new
|
113
|
+
|
114
|
+
# c.count = a.count + b.count
|
115
|
+
#end
|
116
|
+
|
117
|
+
#Generate a pretty-printed ASCII representation of the histogram
|
118
|
+
def to_s(columns=nil)
|
119
|
+
|
120
|
+
#default to an 80 column terminal, don't support < 80 for now
|
121
|
+
if nil == columns
|
122
|
+
columns = 80
|
123
|
+
else
|
124
|
+
raise ArgumentError if columns < 80
|
125
|
+
end
|
126
|
+
|
127
|
+
#Find the largest bucket and create an array of the rows we intend to print
|
128
|
+
disp_buckets = Array.new
|
129
|
+
max_count = 0
|
130
|
+
total = 0
|
131
|
+
@buckets.each_with_index do |count, idx|
|
132
|
+
next if 0 == count
|
133
|
+
max_count = [max_count, count].max
|
134
|
+
disp_buckets << [idx, to_bucket(idx), count]
|
135
|
+
total += count
|
136
|
+
end
|
137
|
+
|
138
|
+
#XXX: Better to print just header --> footer
|
139
|
+
return "Empty histogram" if 0 == disp_buckets.length
|
140
|
+
|
141
|
+
#Figure out how wide the value and count columns need to be based on their
|
142
|
+
#largest respective numbers
|
143
|
+
value_str = "value"
|
144
|
+
count_str = "count"
|
145
|
+
total_str = "Total"
|
146
|
+
value_width = [disp_buckets.last[1].to_s.length, value_str.length].max
|
147
|
+
value_width = [value_width, total_str.length].max
|
148
|
+
count_width = [total.to_s.length, count_str.length].max
|
149
|
+
max_bar_width = columns - (value_width + " |".length + "| ".length + count_width)
|
150
|
+
|
151
|
+
#Determine the value of a '@'
|
152
|
+
weight = [max_count.to_f/max_bar_width.to_f, 1.0].max
|
153
|
+
|
154
|
+
#format the header
|
155
|
+
histogram = sprintf("%#{value_width}s |", value_str)
|
156
|
+
max_bar_width.times { histogram << "-"}
|
157
|
+
histogram << sprintf("| %#{count_width}s\n", count_str)
|
158
|
+
|
159
|
+
# We denote empty buckets with a '~'
|
160
|
+
def skip_row(value_width)
|
161
|
+
sprintf("%#{value_width}s ~\n", " ")
|
162
|
+
end
|
163
|
+
|
164
|
+
#Loop through each bucket to be displayed and output the correct number
|
165
|
+
prev_index = disp_buckets[0][0] - 1
|
166
|
+
|
167
|
+
disp_buckets.each do |x|
|
168
|
+
#Denote skipped empty buckets with a ~
|
169
|
+
histogram << skip_row(value_width) unless prev_index == x[0] - 1
|
170
|
+
prev_index = x[0]
|
171
|
+
|
172
|
+
#Add the value
|
173
|
+
row = sprintf("%#{value_width}d |", x[1])
|
174
|
+
|
175
|
+
#Add the bar
|
176
|
+
bar_size = (x[2]/weight).to_i
|
177
|
+
bar_size.times { row += "@"}
|
178
|
+
(max_bar_width - bar_size).times { row += " " }
|
179
|
+
|
180
|
+
#Add the count
|
181
|
+
row << sprintf("| %#{count_width}d\n", x[2])
|
182
|
+
|
183
|
+
#Append the finished row onto the histogram
|
184
|
+
histogram << row
|
185
|
+
end
|
186
|
+
|
187
|
+
#End the table
|
188
|
+
histogram << skip_row(value_width) if disp_buckets.last[0] != bucket_count-1
|
189
|
+
histogram << sprintf("%#{value_width}s", "Total")
|
190
|
+
histogram << " |"
|
191
|
+
max_bar_width.times {histogram << "-"}
|
192
|
+
histogram << "| "
|
193
|
+
histogram << sprintf("%#{count_width}d\n", total)
|
194
|
+
end
|
195
|
+
|
196
|
+
#Iterate through each bucket in the histogram regardless of
|
197
|
+
#its contents
|
198
|
+
def each
|
199
|
+
@buckets.each_with_index do |count, index|
|
200
|
+
yield(to_bucket(index), count)
|
201
|
+
end
|
202
|
+
end
|
203
|
+
|
204
|
+
#Iterate through only the buckets in the histogram that contain
|
205
|
+
#samples
|
206
|
+
def each_nonzero
|
207
|
+
@buckets.each_with_index do |count, index|
|
208
|
+
yield(to_bucket(index), count) if count != 0
|
209
|
+
end
|
210
|
+
end
|
211
|
+
|
212
|
+
private
|
213
|
+
|
214
|
+
def linear?
|
215
|
+
nil != @width
|
216
|
+
end
|
217
|
+
|
218
|
+
def outlier? (data)
|
219
|
+
|
220
|
+
if data < @low
|
221
|
+
@outliers_low += 1
|
222
|
+
elsif data >= @high
|
223
|
+
@outliers_high += 1
|
224
|
+
else
|
225
|
+
return false
|
226
|
+
end
|
227
|
+
end
|
228
|
+
|
229
|
+
def bucket_count
|
230
|
+
if linear?
|
231
|
+
return (@high-@low)/@width
|
232
|
+
else
|
233
|
+
return log_buckets
|
234
|
+
end
|
235
|
+
end
|
236
|
+
|
237
|
+
def to_bucket(index)
|
238
|
+
if linear?
|
239
|
+
return @low + (index * @width)
|
240
|
+
else
|
241
|
+
return 2**(log2(@low) + index)
|
242
|
+
end
|
243
|
+
end
|
244
|
+
|
245
|
+
def right_bucket? index, data
|
246
|
+
|
247
|
+
# check invariant
|
248
|
+
raise unless linear?
|
249
|
+
|
250
|
+
bucket = to_bucket(index)
|
251
|
+
|
252
|
+
#It's the right bucket if data falls between bucket and next bucket
|
253
|
+
bucket <= data && data < bucket + @width
|
254
|
+
end
|
255
|
+
|
256
|
+
=begin
|
257
|
+
def find_bucket(lower, upper, target)
|
258
|
+
#Classic binary search
|
259
|
+
return upper if right_bucket?(upper, target)
|
260
|
+
|
261
|
+
# Cut the search range in half
|
262
|
+
middle = (upper/2).to_i
|
263
|
+
|
264
|
+
# Determine which half contains our value and recurse
|
265
|
+
if (to_bucket(middle) >= target)
|
266
|
+
return find_bucket(lower, middle, target)
|
267
|
+
else
|
268
|
+
return find_bucket(middle, upper, target)
|
269
|
+
end
|
270
|
+
end
|
271
|
+
=end
|
272
|
+
|
273
|
+
# A data point is added to the bucket[n] where the data point
|
274
|
+
# is greater than or equal to the value represented by bucket[n], but less
|
275
|
+
# than the value represented by bucket[n+1]
|
276
|
+
public
|
277
|
+
def to_index (data)
|
278
|
+
|
279
|
+
# basic case is simple
|
280
|
+
return log2([1,data/@low].max).to_i if !linear?
|
281
|
+
|
282
|
+
# Search for the right bucket in the linear case
|
283
|
+
@buckets.each_with_index do |count, idx|
|
284
|
+
return idx if right_bucket?(idx, data)
|
285
|
+
end
|
286
|
+
#find_bucket(0, bucket_count-1, data)
|
287
|
+
|
288
|
+
#Should not get here
|
289
|
+
raise "#{data}"
|
290
|
+
end
|
291
|
+
|
292
|
+
# log2(x) returns the base-2 logarithm of x as a Float; its integer part i satisfies 2**i <= x < 2**(i+1)
|
293
|
+
@@LOG2_DIVISOR = Math.log(2)
|
294
|
+
def log2( x )
|
295
|
+
Math.log(x) / @@LOG2_DIVISOR
|
296
|
+
end
|
297
|
+
|
298
|
+
end
|
data/test/ts_aggregate.rb
ADDED
@@ -0,0 +1,162 @@
|
|
1
|
+
require 'test/unit'
|
2
|
+
require 'lib/aggregate'
|
3
|
+
|
4
|
+
class SimpleStatsTest < Test::Unit::TestCase
|
5
|
+
|
6
|
+
def setup
|
7
|
+
@stats = Aggregate.new(:log_buckets => 128)
|
8
|
+
|
9
|
+
@@DATA.each do |x|
|
10
|
+
@stats << x
|
11
|
+
end
|
12
|
+
end
|
13
|
+
|
14
|
+
def test_stats_count
|
15
|
+
assert_equal @@DATA.length, @stats.count
|
16
|
+
end
|
17
|
+
|
18
|
+
def test_stats_min_max
|
19
|
+
sorted_data = @@DATA.sort
|
20
|
+
|
21
|
+
assert_equal sorted_data[0], @stats.min
|
22
|
+
assert_equal sorted_data.last, @stats.max
|
23
|
+
end
|
24
|
+
|
25
|
+
def test_stats_mean
|
26
|
+
sum = 0
|
27
|
+
@@DATA.each do |x|
|
28
|
+
sum += x
|
29
|
+
end
|
30
|
+
|
31
|
+
assert_equal sum.to_f/@@DATA.length.to_f, @stats.mean
|
32
|
+
end
|
33
|
+
|
34
|
+
def test_bucket_counts
|
35
|
+
|
36
|
+
#Test each iterator
|
37
|
+
total_bucket_sum = 0
|
38
|
+
i = 0
|
39
|
+
@stats.each do |bucket, count|
|
40
|
+
assert_equal 2**i, bucket
|
41
|
+
|
42
|
+
total_bucket_sum += count
|
43
|
+
i += 1
|
44
|
+
end
|
45
|
+
|
46
|
+
assert_equal @@DATA.length, total_bucket_sum
|
47
|
+
|
48
|
+
#Test each_nonzero iterator
|
49
|
+
prev_bucket = 0
|
50
|
+
total_bucket_sum = 0
|
51
|
+
@stats.each_nonzero do |bucket, count|
|
52
|
+
assert bucket > prev_bucket
|
53
|
+
assert_not_equal count, 0
|
54
|
+
|
55
|
+
total_bucket_sum += count
|
56
|
+
end
|
57
|
+
|
58
|
+
assert_equal total_bucket_sum, @@DATA.length
|
59
|
+
end
|
60
|
+
|
61
|
+
=begin
|
62
|
+
def test_addition
|
63
|
+
stats1 = Aggregate.new
|
64
|
+
stats2 = Aggregate.new
|
65
|
+
|
66
|
+
stats1 << 1
|
67
|
+
stats2 << 3
|
68
|
+
|
69
|
+
stats_sum = stats1 + stats2
|
70
|
+
|
71
|
+
assert_equal stats_sum.count, stats1.count + stats2.count
|
72
|
+
end
|
73
|
+
=end
|
74
|
+
|
75
|
+
#XXX: Update test_bucket_contents() if you muck with @@DATA
|
76
|
+
@@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383]
|
77
|
+
def test_bucket_contents
|
78
|
+
#XXX: This is the only test so far that cares about the actual contents
|
79
|
+
# of @@DATA, so if you update that array ... update this method too
|
80
|
+
expected_buckets = [1, 4, 1024, 8192, 16384]
|
81
|
+
expected_counts = [1, 3, 2, 1, 2]
|
82
|
+
|
83
|
+
i = 0
|
84
|
+
@stats.each_nonzero do |bucket, count|
|
85
|
+
assert_equal expected_buckets[i], bucket
|
86
|
+
assert_equal expected_counts[i], count
|
87
|
+
# Increment for the next test
|
88
|
+
i += 1
|
89
|
+
end
|
90
|
+
end
|
91
|
+
|
92
|
+
def test_histogram
|
93
|
+
puts @stats.to_s
|
94
|
+
end
|
95
|
+
|
96
|
+
def test_outlier
|
97
|
+
assert_equal 0, @stats.outliers_low
|
98
|
+
assert_equal 0, @stats.outliers_high
|
99
|
+
|
100
|
+
@stats << -1
|
101
|
+
@stats << -2
|
102
|
+
@stats << 0
|
103
|
+
|
104
|
+
@stats << 2**128
|
105
|
+
|
106
|
+
# This should be the last value in the last bucket, but Ruby's native
|
107
|
+
# floats are not precise enough. Somewhere past 2^32 the log(x)/log(2)
|
108
|
+
# breaks down. So it shows up as 128 (outlier) instead of 127
|
109
|
+
#@stats << (2**128) - 1
|
110
|
+
|
111
|
+
assert_equal 3, @stats.outliers_low
|
112
|
+
assert_equal 1, @stats.outliers_high
|
113
|
+
end
|
114
|
+
|
115
|
+
def test_std_dev
|
116
|
+
@stats.std_dev
|
117
|
+
end
|
118
|
+
end
|
119
|
+
|
120
|
+
class LinearHistogramTest < Test::Unit::TestCase
|
121
|
+
def setup
|
122
|
+
@stats = Aggregate.new(:low => 0, :high => 32768, :width => 1024)
|
123
|
+
|
124
|
+
@@DATA.each do |x|
|
125
|
+
@stats << x
|
126
|
+
end
|
127
|
+
end
|
128
|
+
|
129
|
+
def test_validation
|
130
|
+
|
131
|
+
# Range cannot be 0
|
132
|
+
assert_raise(ArgumentError) {bad_stats = Aggregate.new(:low => 32,:high => 32, :width => 4)}
|
133
|
+
|
134
|
+
# Range cannot be negative
|
135
|
+
assert_raise(ArgumentError) {bad_stats = Aggregate.new(:low => 32, :high => 16, :width => 4)}
|
136
|
+
|
137
|
+
# Range cannot be < single bucket
|
138
|
+
assert_raise(ArgumentError) {bad_stats = Aggregate.new(:low => 16, :high => 32, :width => 17)}
|
139
|
+
|
140
|
+
# Range % width must equal 0 (for now)
|
141
|
+
assert_raise(ArgumentError) {bad_stats = Aggregate.new(:low => 1, :high => 16384, :width => 1024)}
|
142
|
+
end
|
143
|
+
|
144
|
+
#XXX: Update test_bucket_contents() if you muck with @@DATA
|
145
|
+
# 32768 is an outlier
|
146
|
+
@@DATA = [ 0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
|
147
|
+
def test_bucket_contents
|
148
|
+
#XXX: This is the only test so far that cares about the actual contents
|
149
|
+
# of @@DATA, so if you update that array ... update this method too
|
150
|
+
expected_buckets = [0, 1024, 15360, 16384]
|
151
|
+
expected_counts = [5, 2, 1, 2]
|
152
|
+
|
153
|
+
i = 0
|
154
|
+
@stats.each_nonzero do |bucket, count|
|
155
|
+
assert_equal expected_buckets[i], bucket
|
156
|
+
assert_equal expected_counts[i], count
|
157
|
+
# Increment for the next test
|
158
|
+
i += 1
|
159
|
+
end
|
160
|
+
end
|
161
|
+
|
162
|
+
end
|
metadata
ADDED
@@ -0,0 +1,75 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: aggregate_afurmanov
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
hash: 19
|
5
|
+
prerelease: false
|
6
|
+
segments:
|
7
|
+
- 0
|
8
|
+
- 2
|
9
|
+
- 2
|
10
|
+
version: 0.2.2
|
11
|
+
platform: ruby
|
12
|
+
authors:
|
13
|
+
- Joseph Ruscio, Aleksandr Furmanov
|
14
|
+
autorequire:
|
15
|
+
bindir: bin
|
16
|
+
cert_chain: []
|
17
|
+
|
18
|
+
date: 2010-12-08 00:00:00 -08:00
|
19
|
+
default_executable:
|
20
|
+
dependencies: []
|
21
|
+
|
22
|
+
description: "Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support. For a detailed README see: http://github.com/josephruscio/aggregate"
|
23
|
+
email: aleksandr.furmanov@gmail.com
|
24
|
+
executables: []
|
25
|
+
|
26
|
+
extensions: []
|
27
|
+
|
28
|
+
extra_rdoc_files:
|
29
|
+
- LICENSE
|
30
|
+
- README.textile
|
31
|
+
files:
|
32
|
+
- .gitignore
|
33
|
+
- LICENSE
|
34
|
+
- README.textile
|
35
|
+
- Rakefile
|
36
|
+
- VERSION
|
37
|
+
- aggregate_afurmanov.gemspec
|
38
|
+
- lib/aggregate.rb
|
39
|
+
- test/ts_aggregate.rb
|
40
|
+
has_rdoc: true
|
41
|
+
homepage: http://github.com/afurmanov/aggregate
|
42
|
+
licenses: []
|
43
|
+
|
44
|
+
post_install_message:
|
45
|
+
rdoc_options:
|
46
|
+
- --charset=UTF-8
|
47
|
+
require_paths:
|
48
|
+
- lib
|
49
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
50
|
+
none: false
|
51
|
+
requirements:
|
52
|
+
- - ">="
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
hash: 3
|
55
|
+
segments:
|
56
|
+
- 0
|
57
|
+
version: "0"
|
58
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
59
|
+
none: false
|
60
|
+
requirements:
|
61
|
+
- - ">="
|
62
|
+
- !ruby/object:Gem::Version
|
63
|
+
hash: 3
|
64
|
+
segments:
|
65
|
+
- 0
|
66
|
+
version: "0"
|
67
|
+
requirements: []
|
68
|
+
|
69
|
+
rubyforge_project:
|
70
|
+
rubygems_version: 1.3.7
|
71
|
+
signing_key:
|
72
|
+
specification_version: 3
|
73
|
+
summary: Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support
|
74
|
+
test_files:
|
75
|
+
- test/ts_aggregate.rb
|