aggregate 0.1.2 → 0.2.0
- data/README +168 -1
- data/VERSION +1 -1
- data/aggregate.gemspec +3 -3
- data/lib/aggregate.rb +25 -15
- data/test/ts_aggregate.rb +25 -8
- metadata +3 -3
data/README
CHANGED
@@ -1,2 +1,169 @@
-Aggregate is
+Aggregate is an intuitive Ruby implementation of a statistics aggregator
+including both default and configurable histogram support. It does this
+without recording/storing any of the actual sample values, making it
+suitable for tracking statistics across millions/billions of samples
+without any impact on performance or memory footprint. Originally
+inspired by the Aggregate support in SystemTap (http://sourceware.org/systemtap/).
 
+Aggregates are easy to instantiate, populate with sample data, and examine
+statistics:
+
+  #After instantiation use the << operator to add a sample to the aggregate:
+  stats = Aggregate.new
+
+  loop do
+    # Take some action that generates a sample measurement
+    stats << sample
+  end
+
+  # The number of samples
+  stats.count
+
+  # The average
+  stats.mean
+
+  # Max sample value
+  stats.max
+
+  # Min sample value
+  stats.min
+
+  # The standard deviation
+  stats.std_dev
+
+Perhaps more important than the basic aggregate statistics detailed above,
+Aggregate also maintains a histogram of samples. For a good explanation of why
+this is important, see: http://37signals.com/svn/posts/1836-the-problem-with-averages
+
+The histogram is maintained as a set of "buckets". Each bucket represents a
+range of possible sample values. The set of all buckets represents the range
+of "normal" sample values. By default this is a binary histogram, where
+each bucket represents a range twice as large as the preceding bucket, i.e.
+[1,1], [2,3], [4,5,6,7], [8,9,10,11,12,13,14,15]. The default binary histogram
+provides for 128 buckets, theoretically covering the range [1, (2^128) - 1]
+(See NOTES below for a discussion on the effects in practice of insufficient
+precision.)
+
+Binary histograms are useful when we have little idea about what the
+sample distribution may look like as almost any positive value will
+fall into some bucket. After using binary histograms to determine
+the coarse-grained characteristics of your sample space you can
+configure a linear histogram to examine it in closer detail.
+
+Linear histograms are specified with the three values low, high, and width.
+Low and high specify a range [low, high) of values included in the
+histogram (all others are outliers). Width specifies the number of
+values represented by each bucket and therefore the number of
+buckets, i.e. the granularity of the histogram. The histogram range
+(high - low) must be a multiple of width:
+
+  #Want to track aggregate stats on response times in ms
+  response_stats = Aggregate.new(0, 2000, 50)
+
+The example above creates a linear histogram that tracks the
+response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully
+most of your samples fall in the first couple buckets! Any values added to the
+aggregate that fall outside of the histogram range are recorded as outliers:
+
+  # Number of samples that fall below the normal range
+  stats.outliers_low
+
+  # Number of samples that fall above the normal range
+  stats.outliers_high
+
+Once a histogram is populated Aggregate provides iterator support for
+examining the contents of buckets. The iterators provide both the
+number of samples in the bucket, as well as its range:
+
+  #Examine every bucket
+  @stats.each do |bucket, count|
+  end
+
+  #Examine only buckets containing samples
+  @stats.each_nonzero do |bucket, count|
+  end
+
+Finally, Aggregate contains sophisticated pretty-printing support. For any
+given number of columns >= 80 (defaults to 80) and any sample distribution, it
+sets a marker weight based on the samples per bucket and aligns all
+output. Empty buckets are skipped to conserve screen space.
+
+  # Generate and display an 80 column histogram
+  puts stats.to_s
+
+  # Generate and display a 120 column histogram
+  puts stats.to_s(120)
+
+The following code populates both a binary and linear histogram with the same
+set of 65536 values generated by rand to produce two histograms:
+
+  require 'rubygems'
+  require 'aggregate'
+
+  # Create an Aggregate instance
+  binary_aggregate = Aggregate.new
+  linear_aggregate = Aggregate.new(0, 65536, 4096)
+
+  65536.times do
+    x = rand(65536)
+    binary_aggregate << x
+    linear_aggregate << x
+  end
+
+  puts binary_aggregate.to_s
+  puts linear_aggregate.to_s
+
+** OUTPUT **
+** Binary Histogram **
+value |------------------------------------------------------------------| count
+    1 |                                                                  |     3
+    2 |                                                                  |     1
+    4 |                                                                  |     5
+    8 |                                                                  |     9
+   16 |                                                                  |    15
+   32 |                                                                  |    29
+   64 |                                                                  |    62
+  128 |                                                                  |   115
+  256 |                                                                  |   267
+  512 |@                                                                 |   523
+ 1024 |@                                                                 |   970
+ 2048 |@@@                                                               |  1987
+ 4096 |@@@@@@@@                                                          |  4075
+ 8192 |@@@@@@@@@@@@@@@@                                                  |  8108
+16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                                  | 16405
+32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
+    ~
+Total |------------------------------------------------------------------| 65535
+
+** Linear (0, 65536, 4096) Histogram **
+value |------------------------------------------------------------------| count
+    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4094
+ 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|  4202
+ 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4118
+12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4059
+16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |  3999
+20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4083
+24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4134
+28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4143
+32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4152
+36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4033
+40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4064
+45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4012
+49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4070
+53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4090
+57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4135
+61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4144
+Total |------------------------------------------------------------------| 65532
+
+We can see from these histograms that Ruby's rand function does a relatively good
+job of distributing returned values in the requested range.
+
+** NOTES **
+Ruby doesn't have a log2 function built into Math, so we approximate with
+log(x)/log(2). Theoretically floor(log(2^n - 1)/log(2)) == n-1. Unfortunately due
+to precision limitations, once n reaches a certain size (somewhere > 32)
+this starts to return n. The larger the value of n, the more numbers, i.e.
+(2^n - 2), (2^n - 3), etc., fall prey to this error. We could probably look into
+using something like BigDecimal, but for the current purposes of the binary
+histogram, i.e. a simple coarse-grained view, the current implementation is
+sufficient.
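The binary bucket layout the README describes ([1,1], [2,3], [4..7], [8..15]) can be sketched with exact integer arithmetic. This is an illustration only and not the gem's implementation, which per the NOTES uses log(x)/log(2). The helper name `binary_bucket_floor` is hypothetical, and `Integer#bit_length` assumes Ruby 2.1 or later.

```ruby
# Hypothetical helper: returns the lower bound of the power-of-two bucket
# that a positive integer sample falls into, e.g. 5 -> 4 because 4 <= 5 < 8.
def binary_bucket_floor(sample)
  unless sample.is_a?(Integer) && sample > 0
    raise ArgumentError, "sample must be a positive integer"
  end

  # bit_length - 1 equals floor(log2(sample)), computed without floats.
  1 << (sample.bit_length - 1)
end

# Group the values 1..15 by bucket to reproduce the ranges from the README.
buckets = (1..15).group_by { |value| binary_bucket_floor(value) }
buckets.each do |floor, values|
  puts "#{floor}: [#{values.first}, #{values.last}]"
end
```

With 128 buckets indexed 0 through 127, the last bucket in this sketch starts at 2**127 and ends at 2**128 - 1.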
data/VERSION
CHANGED
@@ -1 +1 @@
-0.
+0.2.0
data/aggregate.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = %q{aggregate}
-  s.version = "0.
+  s.version = "0.2.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Joseph Ruscio"]
-  s.date = %q{2009-
+  s.date = %q{2009-09-12}
   s.description = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
   s.email = %q{jruscio@gmail.com}
   s.extra_rdoc_files = [
@@ -28,7 +28,7 @@ Gem::Specification.new do |s|
   s.homepage = %q{http://github.com/josephruscio/aggregate}
   s.rdoc_options = ["--charset=UTF-8"]
   s.require_paths = ["lib"]
-  s.rubygems_version = %q{1.3.
+  s.rubygems_version = %q{1.3.5}
   s.summary = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
   s.test_files = [
     "test/ts_aggregate.rb"
data/lib/aggregate.rb
CHANGED
@@ -5,15 +5,15 @@ class Aggregate
   #The current average of all samples
   attr_reader :mean
 
-  #The current number of samples
+  #The current number of samples
   attr_reader :count
-
+
   #The maximum sample value
   attr_reader :max
-
+
   #The minimum samples value
   attr_reader :min
-
+
   #The sum of all samples
   attr_reader :sum
 
@@ -22,7 +22,7 @@ class Aggregate
 
   #The number of samples falling above the highest valued histogram bucket
   attr_reader :outliers_high
-
+
   # The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**@@LOG_BUCKETS)
   @@LOG_BUCKETS = 128
 
@@ -36,7 +36,9 @@ class Aggregate
     @outliers_low = 0
     @outliers_high = 0
 
-    # If the user asks we maintain a linear histogram
+    # If the user asks we maintain a linear histogram where
+    # values in the range [low, high) are bucketed in multiples
+    # of width
     if (nil != low && nil != high && nil != width)
 
       #Validate linear specification
@@ -48,6 +50,10 @@ class Aggregate
         raise ArgumentError, "Histogram width must be <= histogram range"
       end
 
+      if 0 != (high - low).modulo(width)
+        raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
+      end
+
       @low = low
       @high = high
       @width = width
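The validation rules visible in this hunk and exercised by `test_validation` can be collected into one standalone function. The sketch below is hedged: `validate_linear_spec` is a hypothetical name, the wording of the first error message is an assumption (the diff does not show it), and a zero or negative width is not handled.

```ruby
# Hypothetical standalone version of the linear-histogram argument checks.
# Returns the number of buckets when the specification is valid.
def validate_linear_spec(low, high, width)
  range = high - low

  # Message wording here is assumed; the diff only shows the two below.
  raise ArgumentError, "Histogram range (high - low) must be positive" if range <= 0
  raise ArgumentError, "Histogram width must be <= histogram range" if width > range

  if 0 != range.modulo(width)
    raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
  end

  range / width
end

# The README's response-time example is accepted.
puts validate_linear_spec(0, 2000, 50)

# The four specifications from test_validation are each rejected.
[[32, 32, 4], [32, 16, 4], [16, 32, 17], [1, 16384, 1024]].each do |args|
  begin
    validate_linear_spec(*args)
  rescue ArgumentError => e
    puts "#{args.inspect}: #{e.message}"
  end
end
```

The last case fails only because of the new modulo check: the range 16383 is not a multiple of 1024.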
@@ -73,7 +79,7 @@ class Aggregate
     end
 
     # Update the running info
-    @count += 1
+    @count += 1
     @sum += data
     @sum2 += (data * data)
 
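The running `@count`, `@sum`, and `@sum2` updates in the hunk above are enough to derive the mean and standard deviation without keeping any samples. The following is a minimal sketch of that idea, not the gem's class: `RunningStats` is a hypothetical name, and the use of the sample (n - 1) standard deviation is an assumption, since the diff does not show how `std_dev` is computed.

```ruby
# Hypothetical accumulator that keeps only three running totals.
class RunningStats
  attr_reader :count, :sum

  def initialize
    @count = 0
    @sum   = 0.0
    @sum2  = 0.0 # running sum of squared samples
  end

  # Same three updates as in the diff; returns self so calls can be chained.
  def <<(data)
    @count += 1
    @sum   += data
    @sum2  += (data * data)
    self
  end

  def mean
    @sum / @count
  end

  # Sample standard deviation (n - 1 denominator); this choice is assumed.
  def std_dev
    return 0.0 if @count < 2

    variance = (@sum2 - @sum * mean) / (@count - 1)
    Math.sqrt([variance, 0.0].max) # guard against tiny negative rounding
  end
end

stats = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| stats << x }
puts stats.mean
puts stats.std_dev
```

Memory use stays constant regardless of how many samples are added, which is the property the README relies on.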
@@ -119,6 +125,9 @@ class Aggregate
       total += count
     end
 
+    #XXX: Better to print just header --> footer
+    return "Empty histogram" if 0 == disp_buckets.length
+
     #Figure out how wide the value and count columns need to be based on their
     #largest respective numbers
     value_str = "value"
@@ -144,7 +153,7 @@ class Aggregate
 
     #Loop through each bucket to be displayed and output the correct number
     prev_index = disp_buckets[0][0] - 1
-
+
     disp_buckets.each do |x|
       #Denote skipped empty buckets with a ~
       histogram << skip_row(value_width) unless prev_index == x[0] - 1
@@ -173,9 +182,9 @@ class Aggregate
     histogram << "| "
     histogram << sprintf("%#{count_width}d\n", total)
   end
-
-  #Iterate through each bucket in the histogram regardless of
-  #its contents
+
+  #Iterate through each bucket in the histogram regardless of
+  #its contents
   def each
     @buckets.each_with_index do |count, index|
       yield(to_bucket(index), count)
@@ -200,7 +209,7 @@ class Aggregate
 
     if data < @low
       @outliers_low += 1
-    elsif data
+    elsif data >= @high
       @outliers_high += 1
     else
       return false
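The half-open range [low, high) behind the `data >= @high` check can be illustrated with a small linear histogram. This is a sketch under stated assumptions: `LinearSketch` is a hypothetical class, samples are assumed to be integers, and the constructor arguments (0, 32768, 1024) are a guess at the test setup, which the diff does not show. The data and bucket values are taken from `LinearHistogramTest`.

```ruby
# Hypothetical linear histogram; low/high/width follow the README's meaning.
class LinearSketch
  attr_reader :outliers_low, :outliers_high

  def initialize(low, high, width)
    @low = low
    @high = high
    @width = width
    @buckets = Array.new((high - low) / width, 0)
    @outliers_low = 0
    @outliers_high = 0
  end

  def <<(data)
    if data < @low
      @outliers_low += 1
    elsif data >= @high # high itself is excluded: the range is [low, high)
      @outliers_high += 1
    else
      @buckets[(data - @low) / @width] += 1
    end
    self
  end

  # Yields the lower bound and count of every non-empty bucket.
  def each_nonzero
    @buckets.each_with_index do |count, index|
      yield(@low + index * @width, count) if count > 0
    end
  end
end

hist = LinearSketch.new(0, 32768, 1024)
[0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768].each { |x| hist << x }
hist.each_nonzero { |bucket, count| puts "#{bucket}: #{count}" }
puts "high outliers: #{hist.outliers_high}"
```

Under these assumptions the sample 32768 equals high and is counted as a high outlier rather than landing in a bucket.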
@@ -222,7 +231,7 @@ class Aggregate
       return 2**(index)
     end
   end
-
+
   def right_bucket? index, data
 
     # check invariant
@@ -270,8 +279,9 @@ class Aggregate
   end
 
   # log2(x) returns j, | i = j-1 and 2**i <= data < 2**j
+  @@LOG2_DIVEDEND = Math.log(2)
   def log2( x )
-    Math.log(x) /
+    Math.log(x) / @@LOG2_DIVEDEND
   end
-
+
 end
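The precision issue described in the README NOTES can be demonstrated directly. In this sketch `float_log2` mirrors the log(x)/log(2) approach from the diff, while `exact_floor_log2` is a hypothetical alternative based on `Integer#bit_length` (assumes Ruby 2.1 or later) and is not something the gem uses. The constant name `LN2` is also an assumption.

```ruby
LN2 = Math.log(2)

# Floating-point approximation, as in the diff: log(x) / log(2).
def float_log2(x)
  Math.log(x) / LN2
end

# Exact floor(log2(x)) for positive integers, with no floating point.
def exact_floor_log2(x)
  x.bit_length - 1
end

below = 2**64 - 1 # belongs in the bucket starting at 2**63
power = 2**64     # belongs in the bucket starting at 2**64

# Both integers convert to the same Float, so the approximation cannot
# tell them apart even though they belong in different buckets.
puts below.to_f == power.to_f
puts float_log2(below) == float_log2(power)

# Integer arithmetic keeps them distinct.
puts exact_floor_log2(below)
puts exact_floor_log2(power)
```

A double carries 53 bits of mantissa, so values just below a large power of two round up to that power before the logarithm is ever taken.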
data/test/ts_aggregate.rb
CHANGED
@@ -73,7 +73,7 @@ class SimpleStatsTest < Test::Unit::TestCase
 =end
 
   #XXX: Update test_bucket_contents() if you muck with @@DATA
-  @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383
+  @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383]
   def test_bucket_contents
     #XXX: This is the only test so far that cares about the actual contents
     # of @@DATA, so if you update that array ... update this method too
@@ -83,7 +83,7 @@ class SimpleStatsTest < Test::Unit::TestCase
     i = 0
     @stats.each_nonzero do |bucket, count|
       assert_equal expected_buckets[i], bucket
-      assert_equal expected_counts[i], count
+      assert_equal expected_counts[i], count
       # Increment for the next test
       i += 1
     end
@@ -96,12 +96,19 @@ class SimpleStatsTest < Test::Unit::TestCase
   def test_outlier
     assert_equal 0, @stats.outliers_low
     assert_equal 0, @stats.outliers_high
-
+
     @stats << -1
     @stats << -2
-    @stats <<
+    @stats << 0
+
+    @stats << 2**128
 
-
+    # This should be the last value in the last bucket, but Ruby's native
+    # floats are not precise enough. Somewhere past 2^32 the log(x)/log(2)
+    # breaks down. So it shows up as 128 (outlier) instead of 127
+    #@stats << (2**128) - 1
+
+    assert_equal 3, @stats.outliers_low
     assert_equal 1, @stats.outliers_high
   end
 
@@ -120,23 +127,33 @@ class LinearHistogramTest < Test::Unit::TestCase
   end
 
   def test_validation
+
+    # Range cannot be 0
     assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,32,4)}
+
+    # Range cannot be negative
     assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,16,4)}
+
+    # Range cannot be < single bucket
     assert_raise(ArgumentError) {bad_stats = Aggregate.new(16,32,17)}
+
+    # Range % width must equal 0 (for now)
+    assert_raise(ArgumentError) {bad_stats = Aggregate.new(1,16384,1024)}
   end
 
   #XXX: Update test_bucket_contents() if you muck with @@DATA
-
+  # 32768 is an outlier
+  @@DATA = [ 0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
   def test_bucket_contents
     #XXX: This is the only test so far that cares about the actual contents
     # of @@DATA, so if you update that array ... update this method too
     expected_buckets = [0, 1024, 15360, 16384]
-    expected_counts = [
+    expected_counts = [5, 2, 1, 2]
 
     i = 0
     @stats.each_nonzero do |bucket, count|
       assert_equal expected_buckets[i], bucket
-      assert_equal expected_counts[i], count
+      assert_equal expected_counts[i], count
       # Increment for the next test
       i += 1
     end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: aggregate
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Joseph Ruscio
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2009-
+date: 2009-09-12 00:00:00 -07:00
 default_executable:
 dependencies: []
 
@@ -54,7 +54,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements: []
 
 rubyforge_project:
-rubygems_version: 1.3.
+rubygems_version: 1.3.5
 signing_key:
 specification_version: 3
 summary: Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support