aggregate 0.1.2 → 0.2.0
- data/README +168 -1
- data/VERSION +1 -1
- data/aggregate.gemspec +3 -3
- data/lib/aggregate.rb +25 -15
- data/test/ts_aggregate.rb +25 -8
- metadata +3 -3
data/README
CHANGED
@@ -1,2 +1,169 @@
|
|
1
|
-
Aggregate is
|
1
|
+
Aggregate is an intuitive ruby implementation of a statistics aggregator
|
2
|
+
including both default and configurable histogram support. It does this
|
3
|
+
without recording/storing any of the actual sample values, making it
|
4
|
+
suitable for tracking statistics across millions/billions of samples
|
5
|
+
without any impact on performance or memory footprint. Originally
|
6
|
+
inspired by the Aggregate support in SystemTap (http://sourceware.org/systemtap/)
|
2
7
|
|
8
|
+
Aggregates are easy to instantiate, populate with sample data, and examine
|
9
|
+
statistics:
|
10
|
+
|
11
|
+
#After instantiation use the << operator to add a sample to the aggregate:
|
12
|
+
stats = Aggregate.new
|
13
|
+
|
14
|
+
loop do
|
15
|
+
# Take some action that generates a sample measurement
|
16
|
+
stats << sample
|
17
|
+
end
|
18
|
+
|
19
|
+
# The number of samples
|
20
|
+
stats.count
|
21
|
+
|
22
|
+
# The average
|
23
|
+
stats.mean
|
24
|
+
|
25
|
+
# Max sample value
|
26
|
+
stats.max
|
27
|
+
|
28
|
+
# Min sample value
|
29
|
+
stats.min
|
30
|
+
|
31
|
+
# The standard deviation
|
32
|
+
stats.std_dev
|
33
|
+
|
34
|
+
Perhaps more importantly than the basic aggregate statistics detailed above
|
35
|
+
Aggregate also maintains a histogram of samples. Good explanation of why
|
36
|
+
it's important: http://37signals.com/svn/posts/1836-the-problem-with-averages
|
37
|
+
|
38
|
+
The histogram is maintained as a set of "buckets". Each bucket represents a
|
39
|
+
range of possible sample values. The set of all buckets represents the range
|
40
|
+
of "normal" sample values. By default this is a binary histogram, where
|
41
|
+
each bucket represents a range twice as large as the preceding bucket i.e.
|
42
|
+
[1,1], [2,3], [4,5,6,7], [8,9,10,11,12,13,14,15]. The default binary histogram
|
43
|
+
provides for 128 buckets, theoretically covering the range [1, (2^128) - 1]
|
44
|
+
(See NOTES below for a discussion on the effects in practice of insufficient
|
45
|
+
precision.)
|
46
|
+
|
47
|
+
Binary histograms are useful when we have little idea about what the
|
48
|
+
sample distribution may look like as almost any positive value will
|
49
|
+
fall into some bucket. After using binary histograms to determine
|
50
|
+
the coarse-grained characteristics of your sample space you can
|
51
|
+
configure a linear histogram to examine it in closer detail.
|
52
|
+
|
53
|
+
Linear histograms are specified with the three values low, high, and width.
|
54
|
+
Low and high specify a range [low, high) of values included in the
|
55
|
+
histogram (all others are outliers). Width specifies the number of
|
56
|
+
values represented by each bucket and therefore the number of
|
57
|
+
buckets i.e. granularity of the histogram. The histogram range
|
58
|
+
(high - low) must be a multiple of width:
|
59
|
+
|
60
|
+
#Want to track aggregate stats on response times in ms
|
61
|
+
response_stats = Aggregate.new(0, 2000, 50)
|
62
|
+
|
63
|
+
The example above creates a linear histogram that tracks the
|
64
|
+
response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully
|
65
|
+
most of your samples fall in the first couple buckets! Any values added to the
|
66
|
+
aggregate that fall outside of the histogram range are recorded as outliers:
|
67
|
+
|
68
|
+
# Number of samples that fall below the normal range
|
69
|
+
stats.outliers_low
|
70
|
+
|
71
|
+
# Number of samples that fall above the normal range
|
72
|
+
stats.outliers_high
|
73
|
+
|
74
|
+
Once a histogram is populated Aggregate provides iterator support for
|
75
|
+
examining the contents of buckets. The iterators provide both the
|
76
|
+
number of samples in the bucket, as well as its range:
|
77
|
+
|
78
|
+
#Examine every bucket
|
79
|
+
@stats.each do |bucket, count|
|
80
|
+
end
|
81
|
+
|
82
|
+
#Examine only buckets containing samples
|
83
|
+
@stats.each_nonzero do |bucket, count|
|
84
|
+
end
|
85
|
+
|
86
|
+
Finally Aggregate contains sophisticated pretty-printing support that for
|
87
|
+
any given number of columns >= 80 (defaults to 80) and sample distribution
|
88
|
+
properly sets a marker weight based on the samples per bucket and aligns all
|
89
|
+
output. Empty buckets are skipped to conserve screen space.
|
90
|
+
|
91
|
+
# Generate and display an 80 column histogram
|
92
|
+
puts stats.to_s
|
93
|
+
|
94
|
+
# Generate and display a 120 column histogram
|
95
|
+
puts stats.to_s(120)
|
96
|
+
|
97
|
+
The following code populates both a binary and linear histogram with the same
|
98
|
+
set of 65536 values generated by rand to produce two histograms:
|
99
|
+
|
100
|
+
require 'rubygems'
|
101
|
+
require 'aggregate'
|
102
|
+
|
103
|
+
# Create an Aggregate instance
|
104
|
+
binary_aggregate = Aggregate.new
|
105
|
+
linear_aggregate = Aggregate.new(0, 65536, 4096)
|
106
|
+
|
107
|
+
65536.times do
|
108
|
+
x = rand(65536)
|
109
|
+
binary_aggregate << x
|
110
|
+
linear_aggregate << x
|
111
|
+
end
|
112
|
+
|
113
|
+
puts binary_aggregate.to_s
|
114
|
+
puts linear_aggregate.to_s
|
115
|
+
|
116
|
+
** OUTPUT **
|
117
|
+
** Binary Histogram**
|
118
|
+
value |------------------------------------------------------------------| count
|
119
|
+
1 | | 3
|
120
|
+
2 | | 1
|
121
|
+
4 | | 5
|
122
|
+
8 | | 9
|
123
|
+
16 | | 15
|
124
|
+
32 | | 29
|
125
|
+
64 | | 62
|
126
|
+
128 | | 115
|
127
|
+
256 | | 267
|
128
|
+
512 |@ | 523
|
129
|
+
1024 |@ | 970
|
130
|
+
2048 |@@@ | 1987
|
131
|
+
4096 |@@@@@@@@ | 4075
|
132
|
+
8192 |@@@@@@@@@@@@@@@@ | 8108
|
133
|
+
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 16405
|
134
|
+
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
|
135
|
+
~
|
136
|
+
Total |------------------------------------------------------------------| 65535
|
137
|
+
|
138
|
+
** Linear (0, 65536, 4096) Histogram **
|
139
|
+
value |------------------------------------------------------------------| count
|
140
|
+
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4094
|
141
|
+
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 4202
|
142
|
+
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4118
|
143
|
+
12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4059
|
144
|
+
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 3999
|
145
|
+
20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4083
|
146
|
+
24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4134
|
147
|
+
28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4143
|
148
|
+
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4152
|
149
|
+
36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4033
|
150
|
+
40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4064
|
151
|
+
45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4012
|
152
|
+
49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4070
|
153
|
+
53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4090
|
154
|
+
57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4135
|
155
|
+
61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4144
|
156
|
+
Total |------------------------------------------------------------------| 65532
|
157
|
+
|
158
|
+
We can see from these histograms that Ruby's rand function does a relatively good
|
159
|
+
job of distributing returned values in the requested range.
|
160
|
+
|
161
|
+
** NOTES **
|
162
|
+
Ruby doesn't have a log2 function built into Math, so we approximate with
|
163
|
+
log(x)/log(2). Theoretically log( 2^n - 1 )/ log(2) == n-1. Unfortunately due
|
164
|
+
to precision limitations, once n reaches a certain size (somewhere > 32)
|
165
|
+
this starts to return n. The larger the value of n, the more numbers i.e.
|
166
|
+
(2^n - 2), (2^n - 3), etc. fall prey to this error. Could probably look into
|
167
|
+
using something like BigDecimal, but for the current purposes of the binary
|
168
|
+
histogram i.e. a simple coarse-grained view the current implementation is
|
169
|
+
sufficient.
|
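The precision problem described in the README's NOTES can be sketched as follows. This is only an illustration, assuming a Ruby version that provides `Integer#bit_length` (not available in the Ruby releases current in 2009); the helper names `float_log2` and `exact_bucket_index` are hypothetical and do not appear in the gem.

```ruby
# Hypothetical helpers, not part of the aggregate gem.

# The approach described in the NOTES: a floating-point quotient of logs.
def float_log2(x)
  Math.log(x) / Math.log(2)
end

# An exact alternative for positive integers: the index i such that
# 2**i <= x < 2**(i + 1), computed without any floating point.
def exact_bucket_index(x)
  unless x.is_a?(Integer) && x > 0
    raise ArgumentError, "x must be a positive integer"
  end
  x.bit_length - 1
end

# Compare both approaches just below a few powers of two.
[2**10 - 1, 2**64 - 1, 2**127 - 1].each do |value|
  puts "float: #{float_log2(value).floor}  exact: #{exact_bucket_index(value)}"
end
```

For large values the two columns may disagree, which is the effect the NOTES describe; the integer version stays exact at any magnitude.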
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.
|
1
|
+
0.2.0
|
data/aggregate.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = %q{aggregate}
|
8
|
-
s.version = "0.
|
8
|
+
s.version = "0.2.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Joseph Ruscio"]
|
12
|
-
s.date = %q{2009-
|
12
|
+
s.date = %q{2009-09-12}
|
13
13
|
s.description = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
|
14
14
|
s.email = %q{jruscio@gmail.com}
|
15
15
|
s.extra_rdoc_files = [
|
@@ -28,7 +28,7 @@ Gem::Specification.new do |s|
|
|
28
28
|
s.homepage = %q{http://github.com/josephruscio/aggregate}
|
29
29
|
s.rdoc_options = ["--charset=UTF-8"]
|
30
30
|
s.require_paths = ["lib"]
|
31
|
-
s.rubygems_version = %q{1.3.
|
31
|
+
s.rubygems_version = %q{1.3.5}
|
32
32
|
s.summary = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
|
33
33
|
s.test_files = [
|
34
34
|
"test/ts_aggregate.rb"
|
data/lib/aggregate.rb
CHANGED
@@ -5,15 +5,15 @@ class Aggregate
|
|
5
5
|
#The current average of all samples
|
6
6
|
attr_reader :mean
|
7
7
|
|
8
|
-
#The current number of samples
|
8
|
+
#The current number of samples
|
9
9
|
attr_reader :count
|
10
|
-
|
10
|
+
|
11
11
|
#The maximum sample value
|
12
12
|
attr_reader :max
|
13
|
-
|
13
|
+
|
14
14
|
#The minimum samples value
|
15
15
|
attr_reader :min
|
16
|
-
|
16
|
+
|
17
17
|
#The sum of all samples
|
18
18
|
attr_reader :sum
|
19
19
|
|
@@ -22,7 +22,7 @@ class Aggregate
|
|
22
22
|
|
23
23
|
#The number of samples falling above the highest valued histogram bucket
|
24
24
|
attr_reader :outliers_high
|
25
|
-
|
25
|
+
|
26
26
|
# The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**@@LOG_BUCKETS)
|
27
27
|
@@LOG_BUCKETS = 128
|
28
28
|
|
@@ -36,7 +36,9 @@ class Aggregate
|
|
36
36
|
@outliers_low = 0
|
37
37
|
@outliers_high = 0
|
38
38
|
|
39
|
-
# If the user asks we maintain a linear histogram
|
39
|
+
# If the user asks we maintain a linear histogram where
|
40
|
+
# values in the range [low, high) are bucketed in multiples
|
41
|
+
# of width
|
40
42
|
if (nil != low && nil != high && nil != width)
|
41
43
|
|
42
44
|
#Validate linear specification
|
@@ -48,6 +50,10 @@ class Aggregate
|
|
48
50
|
raise ArgumentError, "Histogram width must be <= histogram range"
|
49
51
|
end
|
50
52
|
|
53
|
+
if 0 != (high - low).modulo(width)
|
54
|
+
raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
|
55
|
+
end
|
56
|
+
|
51
57
|
@low = low
|
52
58
|
@high = high
|
53
59
|
@width = width
|
@@ -73,7 +79,7 @@ class Aggregate
|
|
73
79
|
end
|
74
80
|
|
75
81
|
# Update the running info
|
76
|
-
@count += 1
|
82
|
+
@count += 1
|
77
83
|
@sum += data
|
78
84
|
@sum2 += (data * data)
|
79
85
|
|
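The hunk above updates `@count`, `@sum`, and `@sum2` for every sample, which is what lets the class report statistics without storing the samples themselves. A minimal sketch of that idea follows. The class name `RunningStats` is hypothetical, and the population form of the standard deviation is an assumption, since the gem's own `std_dev` formula is not shown in this diff.

```ruby
# Hypothetical stand-in for the running totals kept by Aggregate.
class RunningStats
  attr_reader :count, :sum

  def initialize
    @count = 0
    @sum   = 0
    @sum2  = 0 # running sum of squared samples
  end

  # Fold one sample into the totals; the sample itself is discarded.
  def <<(data)
    @count += 1
    @sum   += data
    @sum2  += data * data
    self
  end

  def mean
    @sum.to_f / @count
  end

  # Population standard deviation: sqrt(E[x^2] - E[x]^2).
  # Assumption: the gem may use a different (e.g. sample) form.
  def std_dev
    variance = (@sum2.to_f / @count) - mean**2
    Math.sqrt([variance, 0.0].max) # clamp tiny negative rounding error
  end
end

stats = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| stats << x }
puts stats.mean
puts stats.std_dev
```

Memory use is constant regardless of how many samples are added, at the cost of some numerical sensitivity when the variance is small relative to the mean.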
@@ -119,6 +125,9 @@ class Aggregate
|
|
119
125
|
total += count
|
120
126
|
end
|
121
127
|
|
128
|
+
#XXX: Better to print just header --> footer
|
129
|
+
return "Empty histogram" if 0 == disp_buckets.length
|
130
|
+
|
122
131
|
#Figure out how wide the value and count columns need to be based on their
|
123
132
|
#largest respective numbers
|
124
133
|
value_str = "value"
|
@@ -144,7 +153,7 @@ class Aggregate
|
|
144
153
|
|
145
154
|
#Loop through each bucket to be displayed and output the correct number
|
146
155
|
prev_index = disp_buckets[0][0] - 1
|
147
|
-
|
156
|
+
|
148
157
|
disp_buckets.each do |x|
|
149
158
|
#Denote skipped empty buckets with a ~
|
150
159
|
histogram << skip_row(value_width) unless prev_index == x[0] - 1
|
@@ -173,9 +182,9 @@ class Aggregate
|
|
173
182
|
histogram << "| "
|
174
183
|
histogram << sprintf("%#{count_width}d\n", total)
|
175
184
|
end
|
176
|
-
|
177
|
-
#Iterate through each bucket in the histogram regardless of
|
178
|
-
#its contents
|
185
|
+
|
186
|
+
#Iterate through each bucket in the histogram regardless of
|
187
|
+
#its contents
|
179
188
|
def each
|
180
189
|
@buckets.each_with_index do |count, index|
|
181
190
|
yield(to_bucket(index), count)
|
@@ -200,7 +209,7 @@ class Aggregate
|
|
200
209
|
|
201
210
|
if data < @low
|
202
211
|
@outliers_low += 1
|
203
|
-
elsif data
|
212
|
+
elsif data >= @high
|
204
213
|
@outliers_high += 1
|
205
214
|
else
|
206
215
|
return false
|
@@ -222,7 +231,7 @@ class Aggregate
|
|
222
231
|
return 2**(index)
|
223
232
|
end
|
224
233
|
end
|
225
|
-
|
234
|
+
|
226
235
|
def right_bucket? index, data
|
227
236
|
|
228
237
|
# check invariant
|
@@ -270,8 +279,9 @@ class Aggregate
|
|
270
279
|
end
|
271
280
|
|
272
281
|
# log2(x) returns j, | i = j-1 and 2**i <= data < 2**j
|
282
|
+
@@LOG2_DIVEDEND = Math.log(2)
|
273
283
|
def log2( x )
|
274
|
-
Math.log(x) /
|
284
|
+
Math.log(x) / @@LOG2_DIVEDEND
|
275
285
|
end
|
276
|
-
|
286
|
+
|
277
287
|
end
|
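The validation added in this release can be illustrated in isolation. The sketch below is a standalone function, not the gem's constructor: `validate_linear_spec` is a hypothetical name, and the first error message is an assumption because that check is not visible in the diff. The other two messages are taken from the lines shown above.

```ruby
# Hypothetical standalone version of the linear histogram checks.
# Returns the number of buckets when the specification is valid.
def validate_linear_spec(low, high, width)
  # Assumed wording; this check is outside the hunks shown in the diff.
  raise ArgumentError, "High bucket must be > low bucket" if high <= low

  if width > (high - low)
    raise ArgumentError, "Histogram width must be <= histogram range"
  end

  # New in 0.2.0: the range must divide evenly into buckets.
  if 0 != (high - low).modulo(width)
    raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
  end

  (high - low) / width
end

puts validate_linear_spec(0, 2000, 50)

begin
  validate_linear_spec(1, 16384, 1024)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Rejecting uneven ranges up front means every bucket covers exactly `width` values, so no partial final bucket has to be handled when indexing or printing.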
data/test/ts_aggregate.rb
CHANGED
@@ -73,7 +73,7 @@ class SimpleStatsTest < Test::Unit::TestCase
|
|
73
73
|
=end
|
74
74
|
|
75
75
|
#XXX: Update test_bucket_contents() if you muck with @@DATA
|
76
|
-
@@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383
|
76
|
+
@@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383]
|
77
77
|
def test_bucket_contents
|
78
78
|
#XXX: This is the only test so far that cares about the actual contents
|
79
79
|
# of @@DATA, so if you update that array ... update this method too
|
@@ -83,7 +83,7 @@ class SimpleStatsTest < Test::Unit::TestCase
|
|
83
83
|
i = 0
|
84
84
|
@stats.each_nonzero do |bucket, count|
|
85
85
|
assert_equal expected_buckets[i], bucket
|
86
|
-
assert_equal expected_counts[i], count
|
86
|
+
assert_equal expected_counts[i], count
|
87
87
|
# Increment for the next test
|
88
88
|
i += 1
|
89
89
|
end
|
@@ -96,12 +96,19 @@ class SimpleStatsTest < Test::Unit::TestCase
|
|
96
96
|
def test_outlier
|
97
97
|
assert_equal 0, @stats.outliers_low
|
98
98
|
assert_equal 0, @stats.outliers_high
|
99
|
-
|
99
|
+
|
100
100
|
@stats << -1
|
101
101
|
@stats << -2
|
102
|
-
@stats <<
|
102
|
+
@stats << 0
|
103
|
+
|
104
|
+
@stats << 2**128
|
103
105
|
|
104
|
-
|
106
|
+
# This should be the last value in the last bucket, but Ruby's native
|
107
|
+
# floats are not precise enough. Somewhere past 2^32 the log(x)/log(2)
|
108
|
+
# breaks down. So it shows up as 128 (outlier) instead of 127
|
109
|
+
#@stats << (2**128) - 1
|
110
|
+
|
111
|
+
assert_equal 3, @stats.outliers_low
|
105
112
|
assert_equal 1, @stats.outliers_high
|
106
113
|
end
|
107
114
|
|
@@ -120,23 +127,33 @@ class LinearHistogramTest < Test::Unit::TestCase
|
|
120
127
|
end
|
121
128
|
|
122
129
|
def test_validation
|
130
|
+
|
131
|
+
# Range cannot be 0
|
123
132
|
assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,32,4)}
|
133
|
+
|
134
|
+
# Range cannot be negative
|
124
135
|
assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,16,4)}
|
136
|
+
|
137
|
+
# Range cannot be < single bucket
|
125
138
|
assert_raise(ArgumentError) {bad_stats = Aggregate.new(16,32,17)}
|
139
|
+
|
140
|
+
# Range % width must equal 0 (for now)
|
141
|
+
assert_raise(ArgumentError) {bad_stats = Aggregate.new(1,16384,1024)}
|
126
142
|
end
|
127
143
|
|
128
144
|
#XXX: Update test_bucket_contents() if you muck with @@DATA
|
129
|
-
|
145
|
+
# 32768 is an outlier
|
146
|
+
@@DATA = [ 0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
|
130
147
|
def test_bucket_contents
|
131
148
|
#XXX: This is the only test so far that cares about the actual contents
|
132
149
|
# of @@DATA, so if you update that array ... update this method too
|
133
150
|
expected_buckets = [0, 1024, 15360, 16384]
|
134
|
-
expected_counts = [
|
151
|
+
expected_counts = [5, 2, 1, 2]
|
135
152
|
|
136
153
|
i = 0
|
137
154
|
@stats.each_nonzero do |bucket, count|
|
138
155
|
assert_equal expected_buckets[i], bucket
|
139
|
-
assert_equal expected_counts[i], count
|
156
|
+
assert_equal expected_counts[i], count
|
140
157
|
# Increment for the next test
|
141
158
|
i += 1
|
142
159
|
end
|
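The expected counts in the linear histogram test can be reproduced by hand. The sketch below assumes the test's aggregate is configured as (0, 32768, 1024); that setup is not shown in the diff, so the constants `LOW`, `HIGH`, and `WIDTH` are an assumption chosen to match the expected buckets. It applies the half-open range [low, high) with the `>=` comparison from the corrected outlier check.

```ruby
# Assumed configuration; the test's setup method is not part of the diff.
LOW   = 0
HIGH  = 32768
WIDTH = 1024

data = [0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]

counts        = Hash.new(0) # bucket start value => number of samples
outliers_low  = 0
outliers_high = 0

data.each do |x|
  if x < LOW
    outliers_low += 1
  elsif x >= HIGH # high is exclusive, so 32768 itself is an outlier
    outliers_high += 1
  else
    # Integer division finds the bucket; multiply back for its start value.
    bucket_start = LOW + ((x - LOW) / WIDTH) * WIDTH
    counts[bucket_start] += 1
  end
end

counts.sort.each { |bucket, count| puts "#{bucket}: #{count}" }
puts "outliers: low=#{outliers_low} high=#{outliers_high}"
```

Under these assumptions the non-empty buckets start at 0, 1024, 15360, and 16384, matching `expected_buckets` in the test.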
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: aggregate
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.2.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Joseph Ruscio
|
@@ -9,7 +9,7 @@ autorequire:
|
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
11
|
|
12
|
-
date: 2009-
|
12
|
+
date: 2009-09-12 00:00:00 -07:00
|
13
13
|
default_executable:
|
14
14
|
dependencies: []
|
15
15
|
|
@@ -54,7 +54,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
54
54
|
requirements: []
|
55
55
|
|
56
56
|
rubyforge_project:
|
57
|
-
rubygems_version: 1.3.
|
57
|
+
rubygems_version: 1.3.5
|
58
58
|
signing_key:
|
59
59
|
specification_version: 3
|
60
60
|
summary: Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support
|