aggregate 0.1.2 → 0.2.0

data/README CHANGED
@@ -1,2 +1,169 @@
1
- Aggregate is a ruby implementation of a statistics aggregator including histogram support
1
+ Aggregate is an intuitive ruby implementation of a statistics aggregator
2
+ including both default and configurable histogram support. It does this
3
+ without recording/storing any of the actual sample values, making it
4
+ suitable for tracking statistics across millions/billions of samples
5
+ without the memory footprint growing with the sample count. Originally
6
+ inspired by the Aggregate support in SystemTap (http://sourceware.org/systemtap/)
2
7
 
8
+ Aggregates are easy to instantiate, populate with sample data, and examine
9
+ statistics:
10
+
11
+ #After instantiation use the << operator to add a sample to the aggregate:
12
+ stats = Aggregate.new
13
+
14
+ loop do
15
+ # Take some action that generates a sample measurement
16
+ stats << sample
17
+ end
18
+
19
+ # The number of samples
20
+ stats.count
21
+
22
+ # The average
23
+ stats.mean
24
+
25
+ # Max sample value
26
+ stats.max
27
+
28
+ # Min sample value
29
+ stats.min
30
+
31
+ # The standard deviation
32
+ stats.std_dev
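The statistics above can be kept without storing samples by maintaining a few running totals (the library's update code later in this diff tracks a count, a sum, and a sum of squares). The following is a minimal sketch of that idea, not the gem's implementation: the class name `RunningStats` is hypothetical, and the sample (n - 1) form of the standard deviation is an assumption, since the diff does not show which form the gem uses.

```ruby
# Hypothetical stand-in for illustration; not the gem's Aggregate class.
class RunningStats
  attr_reader :count, :min, :max

  def initialize
    @count = 0
    @sum   = 0.0  # running sum of samples
    @sum2  = 0.0  # running sum of squared samples
    @min   = nil
    @max   = nil
  end

  # Fold one sample into the running totals; the sample itself is discarded.
  def <<(sample)
    @count += 1
    @sum   += sample
    @sum2  += sample * sample
    @min = sample if @min.nil? || sample < @min
    @max = sample if @max.nil? || sample > @max
    self
  end

  def mean
    @count.zero? ? 0.0 : @sum / @count
  end

  # Sample standard deviation (assumed form) derived from the running sums.
  def std_dev
    return 0.0 if @count < 2
    variance = (@sum2 - (@sum * @sum) / @count) / (@count - 1)
    Math.sqrt([variance, 0.0].max) # clamp tiny negative rounding errors
  end
end

running = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| running << x }
puts running.mean # → 5.0
```

Memory use stays fixed no matter how many samples are added, which is the property the README relies on.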
33
+
34
+ Perhaps more important than the basic aggregate statistics detailed above,
35
+ Aggregate also maintains a histogram of samples. For a good explanation of why
36
+ this is important, see: http://37signals.com/svn/posts/1836-the-problem-with-averages
37
+
38
+ The histogram is maintained as a set of "buckets". Each bucket represents a
39
+ range of possible sample values. The set of all buckets represents the range
40
+ of "normal" sample values. By default this is a binary histogram, where
41
+ each bucket represents a range twice as large as the preceding bucket i.e.
42
+ [1,1], [2,3], [4,5,6,7], [8,9,10,11,12,13,14,15]. The default binary histogram
43
+ provides for 128 buckets, theoretically covering the range [1, (2^128) - 1]
44
+ (See NOTES below for a discussion on the effects in practice of insufficient
45
+ precision.)
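The doubling bucket ranges described above can be reproduced with exact integer arithmetic. This is only an illustrative sketch; `binary_bucket_range` is a hypothetical helper and not part of the gem's API.

```ruby
# Hypothetical helper: inclusive [low, high] range of the binary bucket
# at a given zero-based index. Bucket i covers 2**i through 2**(i + 1) - 1.
def binary_bucket_range(index)
  raise ArgumentError, "index must be >= 0" if index < 0
  [2**index, 2**(index + 1) - 1]
end

first_four = (0..3).map { |i| binary_bucket_range(i) }
p first_four # → [[1, 1], [2, 3], [4, 7], [8, 15]]
```

Because Ruby integers are arbitrary precision, the same helper works for any bucket index without rounding.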
46
+
47
+ Binary histograms are useful when we have little idea about what the
48
+ sample distribution may look like as almost any positive value will
49
+ fall into some bucket. After using binary histograms to determine
50
+ the coarse-grained characteristics of your sample space you can
51
+ configure a linear histogram to examine it in closer detail.
52
+
53
+ Linear histograms are specified with the three values low, high, and width.
54
+ Low and high specify a range [low, high) of values included in the
55
+ histogram (all others are outliers). Width specifies the number of
56
+ values represented by each bucket and therefore the number of
57
+ buckets i.e. granularity of the histogram. The histogram range
58
+ (high - low) must be a multiple of width:
59
+
60
+ #Want to track aggregate stats on response times in ms
61
+ response_stats = Aggregate.new(0, 2000, 50)
62
+
63
+ The example above creates a linear histogram that tracks the
64
+ response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully
65
+ most of your samples fall in the first couple buckets! Any values added to the
66
+ aggregate that fall outside of the histogram range are recorded as outliers:
67
+
68
+ # Number of samples that fall below the normal range
69
+ stats.outliers_low
70
+
71
+ # Number of samples that fall above the normal range
72
+ stats.outliers_high
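To make the linear histogram rules concrete, here is a rough sketch of how a sample could be validated and mapped to a bucket under the half-open range [low, high). The helper `linear_bucket` and its return symbols are hypothetical, integer inputs are assumed, and the positive-width check is an added assumption; the gem's own code may differ.

```ruby
# Hypothetical helper, not the gem's API. Returns the lower bound of the
# bucket holding the sample, or a symbol when the sample is an outlier.
def linear_bucket(low, high, width, sample)
  range = high - low
  raise ArgumentError, "high must be greater than low" unless range > 0
  raise ArgumentError, "width must be positive" unless width > 0 # assumed check
  raise ArgumentError, "width must be <= range" if width > range
  raise ArgumentError, "range must be a multiple of width" unless (range % width).zero?

  return :outlier_low  if sample < low
  return :outlier_high if sample >= high # high itself is excluded

  # Integer division floors the offset to a whole number of buckets.
  low + ((sample - low) / width) * width
end

p linear_bucket(0, 2000, 50, 49)   # → 0
p linear_bucket(0, 2000, 50, 1999) # → 1950
p linear_bucket(0, 2000, 50, 2000) # → :outlier_high
```

Treating `high` as exclusive keeps every bucket the same width, which is why the range must divide evenly by `width`.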
73
+
74
+ Once a histogram is populated Aggregate provides iterator support for
75
+ examining the contents of buckets. The iterators provide both the
76
+ number of samples in the bucket, as well as its range:
77
+
78
+ #Examine every bucket
79
+ @stats.each do |bucket, count|
80
+ end
81
+
82
+ #Examine only buckets containing samples
83
+ @stats.each_nonzero do |bucket, count|
84
+ end
85
+
86
+ Finally, Aggregate contains pretty-printing support. For any given number of
87
+ columns >= 80 (defaults to 80) and any sample distribution, it sets a
88
+ marker weight based on the samples per bucket and aligns all
89
+ output. Empty buckets are skipped to conserve screen space.
90
+
91
+ # Generate and display an 80 column histogram
92
+ puts stats.to_s
93
+
94
+ # Generate and display a 120 column histogram
95
+ puts stats.to_s(120)
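One plausible way to derive a marker weight is to scale every bucket against the fullest one, so the largest count fills the whole bar. The sketch below is an assumption for illustration only: `bar_for` is a hypothetical helper, the bar width of 66 characters is a guess, and the gem's actual rounding may differ.

```ruby
# Hypothetical helper: build the '@' bar for one bucket.
# Integer arithmetic avoids floating-point rounding surprises.
def bar_for(count, max_count, bar_width)
  return "" if max_count.zero?
  "@" * ((count * bar_width) / max_count) # floors to whole markers
end

max_count = 32_961 # fullest bucket in this made-up example
[267, 523, 1987, 32_961].each do |count|
  puts format("%6d |%-66s|", count, bar_for(count, max_count, 66))
end
```

With this scaling a bucket needs at least `max_count / bar_width` samples before it shows a single marker, so sparsely populated buckets render as empty bars while still reporting their counts.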
96
+
97
+ The following code populates both a binary and linear histogram with the same
98
+ set of 65536 values generated by rand to produce two histograms:
99
+
100
+ require 'rubygems'
101
+ require 'aggregate'
102
+
103
+ # Create an Aggregate instance
104
+ binary_aggregate = Aggregate.new
105
+ linear_aggregate = Aggregate.new(0, 65536, 4096)
106
+
107
+ 65536.times do
108
+ x = rand(65536)
109
+ binary_aggregate << x
110
+ linear_aggregate << x
111
+ end
112
+
113
+ puts binary_aggregate.to_s
114
+ puts linear_aggregate.to_s
115
+
116
+ ** OUTPUT **
117
+ ** Binary Histogram**
118
+ value |------------------------------------------------------------------| count
119
+ 1 | | 3
120
+ 2 | | 1
121
+ 4 | | 5
122
+ 8 | | 9
123
+ 16 | | 15
124
+ 32 | | 29
125
+ 64 | | 62
126
+ 128 | | 115
127
+ 256 | | 267
128
+ 512 |@ | 523
129
+ 1024 |@ | 970
130
+ 2048 |@@@ | 1987
131
+ 4096 |@@@@@@@@ | 4075
132
+ 8192 |@@@@@@@@@@@@@@@@ | 8108
133
+ 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 16405
134
+ 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
135
+ ~
136
+ Total |------------------------------------------------------------------| 65535
137
+
138
+ ** Linear (0, 65536, 4096) Histogram **
139
+ value |------------------------------------------------------------------| count
140
+ 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4094
141
+ 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 4202
142
+ 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4118
143
+ 12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4059
144
+ 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 3999
145
+ 20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4083
146
+ 24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4134
147
+ 28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4143
148
+ 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4152
149
+ 36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4033
150
+ 40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4064
151
+ 45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4012
152
+ 49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4070
153
+ 53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4090
154
+ 57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4135
155
+ 61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4144
156
+ Total |------------------------------------------------------------------| 65532
157
+
158
+ We can see from these histograms that Ruby's rand function does a relatively good
159
+ job of distributing returned values in the requested range.
160
+
161
+ ** NOTES **
162
+ Ruby doesn't have a log2 function built into Math, so we approximate with
163
+ log(x)/log(2). Theoretically floor(log(2^n - 1) / log(2)) == n-1. Unfortunately, due
164
+ to precision limitations, once n reaches a certain size (somewhere > 32)
165
+ this starts to return n. The larger the value of n, the more numbers i.e.
166
+ (2^n - 2), (2^n - 3), etc. fall prey to this error. Could probably look into
167
+ using something like BigDecimal, but for the current purposes of the binary
168
+ histogram i.e. a simple coarse-grained view the current implementation is
169
+ sufficient.
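For integer samples, one possible way around the precision issue is to skip floating point entirely. The sketch below compares the float approach from the note with `Integer#bit_length`, which is available in Ruby 2.1 and later and so postdates this release. The helper names are hypothetical and this is not what the gem does.

```ruby
# Float-based bucket index, as described in the note above.
def float_log2_floor(x)
  (Math.log(x) / Math.log(2)).floor
end

# Exact alternative for positive integers: a value with k significant
# bits lies in [2**(k - 1), 2**k - 1], so its bucket index is k - 1.
def exact_log2_floor(x)
  raise ArgumentError, "x must be a positive Integer" unless x.is_a?(Integer) && x > 0
  x.bit_length - 1
end

x = 2**64 - 1
puts exact_log2_floor(x) # → 63
puts float_log2_floor(x) # may differ from 63 because x is rounded to a Float
```

The exact version only applies to integers; non-integer samples would still need a floating-point or BigDecimal approach.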
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.1.2
1
+ 0.2.0
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{aggregate}
8
- s.version = "0.1.2"
8
+ s.version = "0.2.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Joseph Ruscio"]
12
- s.date = %q{2009-08-16}
12
+ s.date = %q{2009-09-12}
13
13
  s.description = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
14
14
  s.email = %q{jruscio@gmail.com}
15
15
  s.extra_rdoc_files = [
@@ -28,7 +28,7 @@ Gem::Specification.new do |s|
28
28
  s.homepage = %q{http://github.com/josephruscio/aggregate}
29
29
  s.rdoc_options = ["--charset=UTF-8"]
30
30
  s.require_paths = ["lib"]
31
- s.rubygems_version = %q{1.3.3}
31
+ s.rubygems_version = %q{1.3.5}
32
32
  s.summary = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
33
33
  s.test_files = [
34
34
  "test/ts_aggregate.rb"
@@ -5,15 +5,15 @@ class Aggregate
5
5
  #The current average of all samples
6
6
  attr_reader :mean
7
7
 
8
- #The current number of samples
8
+ #The current number of samples
9
9
  attr_reader :count
10
-
10
+
11
11
  #The maximum sample value
12
12
  attr_reader :max
13
-
13
+
14
14
  #The minimum samples value
15
15
  attr_reader :min
16
-
16
+
17
17
  #The sum of all samples
18
18
  attr_reader :sum
19
19
 
@@ -22,7 +22,7 @@ class Aggregate
22
22
 
23
23
  #The number of samples falling above the highest valued histogram bucket
24
24
  attr_reader :outliers_high
25
-
25
+
26
26
  # The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**@@LOG_BUCKETS)
27
27
  @@LOG_BUCKETS = 128
28
28
 
@@ -36,7 +36,9 @@ class Aggregate
36
36
  @outliers_low = 0
37
37
  @outliers_high = 0
38
38
 
39
- # If the user asks we maintain a linear histogram
39
+ # If the user asks we maintain a linear histogram where
40
+ # values in the range [low, high) are bucketed in multiples
41
+ # of width
40
42
  if (nil != low && nil != high && nil != width)
41
43
 
42
44
  #Validate linear specification
@@ -48,6 +50,10 @@ class Aggregate
48
50
  raise ArgumentError, "Histogram width must be <= histogram range"
49
51
  end
50
52
 
53
+ if 0 != (high - low).modulo(width)
54
+ raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
55
+ end
56
+
51
57
  @low = low
52
58
  @high = high
53
59
  @width = width
@@ -73,7 +79,7 @@ class Aggregate
73
79
  end
74
80
 
75
81
  # Update the running info
76
- @count += 1
82
+ @count += 1
77
83
  @sum += data
78
84
  @sum2 += (data * data)
79
85
 
@@ -119,6 +125,9 @@ class Aggregate
119
125
  total += count
120
126
  end
121
127
 
128
+ #XXX: Better to print just header --> footer
129
+ return "Empty histogram" if 0 == disp_buckets.length
130
+
122
131
  #Figure out how wide the value and count columns need to be based on their
123
132
  #largest respective numbers
124
133
  value_str = "value"
@@ -144,7 +153,7 @@ class Aggregate
144
153
 
145
154
  #Loop through each bucket to be displayed and output the correct number
146
155
  prev_index = disp_buckets[0][0] - 1
147
-
156
+
148
157
  disp_buckets.each do |x|
149
158
  #Denote skipped empty buckets with a ~
150
159
  histogram << skip_row(value_width) unless prev_index == x[0] - 1
@@ -173,9 +182,9 @@ class Aggregate
173
182
  histogram << "| "
174
183
  histogram << sprintf("%#{count_width}d\n", total)
175
184
  end
176
-
177
- #Iterate through each bucket in the histogram regardless of
178
- #its contents
185
+
186
+ #Iterate through each bucket in the histogram regardless of
187
+ #its contents
179
188
  def each
180
189
  @buckets.each_with_index do |count, index|
181
190
  yield(to_bucket(index), count)
@@ -200,7 +209,7 @@ class Aggregate
200
209
 
201
210
  if data < @low
202
211
  @outliers_low += 1
203
- elsif data > @high
212
+ elsif data >= @high
204
213
  @outliers_high += 1
205
214
  else
206
215
  return false
@@ -222,7 +231,7 @@ class Aggregate
222
231
  return 2**(index)
223
232
  end
224
233
  end
225
-
234
+
226
235
  def right_bucket? index, data
227
236
 
228
237
  # check invariant
@@ -270,8 +279,9 @@ class Aggregate
270
279
  end
271
280
 
272
281
  # log2(x) returns j, | i = j-1 and 2**i <= data < 2**j
282
+ @@LOG2_DIVEDEND = Math.log(2)
273
283
  def log2( x )
274
- Math.log(x) / Math.log(2)
284
+ Math.log(x) / @@LOG2_DIVEDEND
275
285
  end
276
-
286
+
277
287
  end
@@ -73,7 +73,7 @@ class SimpleStatsTest < Test::Unit::TestCase
73
73
  =end
74
74
 
75
75
  #XXX: Update test_bucket_contents() if you muck with @@DATA
76
- @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383 ]
76
+ @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383]
77
77
  def test_bucket_contents
78
78
  #XXX: This is the only test so far that cares about the actual contents
79
79
  # of @@DATA, so if you update that array ... update this method too
@@ -83,7 +83,7 @@ class SimpleStatsTest < Test::Unit::TestCase
83
83
  i = 0
84
84
  @stats.each_nonzero do |bucket, count|
85
85
  assert_equal expected_buckets[i], bucket
86
- assert_equal expected_counts[i], count
86
+ assert_equal expected_counts[i], count
87
87
  # Increment for the next test
88
88
  i += 1
89
89
  end
@@ -96,12 +96,19 @@ class SimpleStatsTest < Test::Unit::TestCase
96
96
  def test_outlier
97
97
  assert_equal 0, @stats.outliers_low
98
98
  assert_equal 0, @stats.outliers_high
99
-
99
+
100
100
  @stats << -1
101
101
  @stats << -2
102
- @stats << 2**129
102
+ @stats << 0
103
+
104
+ @stats << 2**128
103
105
 
104
- assert_equal 2, @stats.outliers_low
106
+ # This should be the last value in the last bucket, but Ruby's native
107
+ # floats are not precise enough. Somewhere past 2^32 the log(x)/log(2)
108
+ # breaks down. So it shows up as 128 (outlier) instead of 127
109
+ #@stats << (2**128) - 1
110
+
111
+ assert_equal 3, @stats.outliers_low
105
112
  assert_equal 1, @stats.outliers_high
106
113
  end
107
114
 
@@ -120,23 +127,33 @@ class LinearHistogramTest < Test::Unit::TestCase
120
127
  end
121
128
 
122
129
  def test_validation
130
+
131
+ # Range cannot be 0
123
132
  assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,32,4)}
133
+
134
+ # Range cannot be negative
124
135
  assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,16,4)}
136
+
137
+ # Range cannot be < single bucket
125
138
  assert_raise(ArgumentError) {bad_stats = Aggregate.new(16,32,17)}
139
+
140
+ # Range % width must equal 0 (for now)
141
+ assert_raise(ArgumentError) {bad_stats = Aggregate.new(1,16384,1024)}
126
142
  end
127
143
 
128
144
  #XXX: Update test_bucket_contents() if you muck with @@DATA
129
- @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383 ]
145
+ # 32768 is an outlier
146
+ @@DATA = [ 0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
130
147
  def test_bucket_contents
131
148
  #XXX: This is the only test so far that cares about the actual contents
132
149
  # of @@DATA, so if you update that array ... update this method too
133
150
  expected_buckets = [0, 1024, 15360, 16384]
134
- expected_counts = [4, 2, 1, 2]
151
+ expected_counts = [5, 2, 1, 2]
135
152
 
136
153
  i = 0
137
154
  @stats.each_nonzero do |bucket, count|
138
155
  assert_equal expected_buckets[i], bucket
139
- assert_equal expected_counts[i], count
156
+ assert_equal expected_counts[i], count
140
157
  # Increment for the next test
141
158
  i += 1
142
159
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: aggregate
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.2
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Joseph Ruscio
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-08-16 00:00:00 -07:00
12
+ date: 2009-09-12 00:00:00 -07:00
13
13
  default_executable:
14
14
  dependencies: []
15
15
 
@@ -54,7 +54,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
54
54
  requirements: []
55
55
 
56
56
  rubyforge_project:
57
- rubygems_version: 1.3.3
57
+ rubygems_version: 1.3.5
58
58
  signing_key:
59
59
  specification_version: 3
60
60
  summary: Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support