aggregate 0.1.2 → 0.2.0

data/README CHANGED
@@ -1,2 +1,169 @@
1
- Aggregate is a ruby implementation of a statistics aggregator including histogram support
1
+ Aggregate is an intuitive ruby implementation of a statistics aggregator
2
+ including both default and configurable histogram support. It does this
3
+ without recording/storing any of the actual sample values, making it
4
+ suitable for tracking statistics across millions/billions of samples
5
+ without any impact on performance or memory footprint. Originally
6
+ inspired by the Aggregate support in SystemTap (http://sourceware.org/systemtap/)
2
7
 
8
+ Aggregates are easy to instantiate, populate with sample data, and examine
9
+ statistics:
10
+
11
+ #After instantiation use the << operator to add a sample to the aggregate:
12
+ stats = Aggregate.new
13
+
14
+ loop do
15
+ # Take some action that generates a sample measurement
16
+ stats << sample
17
+ end
18
+
19
+ # The number of samples
20
+ stats.count
21
+
22
+ # The average
23
+ stats.mean
24
+
25
+ # Max sample value
26
+ stats.max
27
+
28
+ # Min sample value
29
+ stats.min
30
+
31
+ # The standard deviation
32
+ stats.std_dev
33
+
34
+ Perhaps more importantly than the basic aggregate statistics detailed above
35
+ Aggregate also maintains a histogram of samples. Good explanation of why
36
+ it's important: http://37signals.com/svn/posts/1836-the-problem-with-averages
37
+
38
+ The histogram is maintained as a set of "buckets". Each bucket represents a
39
+ range of possible sample values. The set of all buckets represents the range
40
+ of "normal" sample values. By default this is a binary histogram, where
41
+ each bucket represents a range twice as large as the preceding bucket i.e.
42
+ [1,1], [2,3], [4,5,6,7], [8,9,10,11,12,13,14,15]. The default binary histogram
43
+ provides for 128 buckets, theoretically covering the range [1, (2^128) - 1]
44
+ (See NOTES below for a discussion on the effects in practice of insufficient
45
+ precision.)
46
+
47
+ Binary histograms are useful when we have little idea about what the
48
+ sample distribution may look like as almost any positive value will
49
+ fall into some bucket. After using binary histograms to determine
50
+ the coarse-grained characteristics of your sample space you can
51
+ configure a linear histogram to examine it in closer detail.
52
+
53
+ Linear histograms are specified with the three values low, high, and width.
54
+ Low and high specify a range [low, high) of values included in the
55
+ histogram (all others are outliers). Width specifies the number of
56
+ values represented by each bucket and therefore the number of
57
+ buckets i.e. granularity of the histogram. The histogram range
58
+ (high - low) must be a multiple of width:
59
+
60
+ #Want to track aggregate stats on response times in ms
61
+ response_stats = Aggregate.new(0, 2000, 50)
62
+
63
+ The example above creates a linear histogram that tracks the
64
+ response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully
65
+ most of your samples fall in the first couple buckets! Any values added to the
66
+ aggregate that fall outside of the histogram range are recorded as outliers:
67
+
68
+ # Number of samples that fall below the normal range
69
+ stats.outliers_low
70
+
71
+ # Number of samples that fall above the normal range
72
+ stats.outliers_high
73
+
74
+ Once a histogram is populated Aggregate provides iterator support for
75
+ examining the contents of buckets. The iterators provide both the
76
+ number of samples in the bucket, as well as its range:
77
+
78
+ #Examine every bucket
79
+ @stats.each do |bucket, count|
80
+ end
81
+
82
+ #Examine only buckets containing samples
83
+ @stats.each_nonzero do |bucket, count|
84
+ end
85
+
86
+ Finally, Aggregate provides pretty-printing support. For any column
87
+ count >= 80 (the default is 80) it scales the marker weight to the
88
+ samples per bucket and aligns all output. Empty buckets are skipped to
89
+ conserve screen space.
90
+
91
+ # Generate and display an 80 column histogram
92
+ puts stats.to_s
93
+
94
+ # Generate and display a 120 column histogram
95
+ puts stats.to_s(120)
96
+
97
+ The following code populates both a binary and linear histogram with the same
98
+ set of 65536 values generated by rand to produce two histograms:
99
+
100
+ require 'rubygems'
101
+ require 'aggregate'
102
+
103
+ # Create an Aggregate instance
104
+ binary_aggregate = Aggregate.new
105
+ linear_aggregate = Aggregate.new(0, 65536, 4096)
106
+
107
+ 65536.times do
108
+ x = rand(65536)
109
+ binary_aggregate << x
110
+ linear_aggregate << x
111
+ end
112
+
113
+ puts binary_aggregate.to_s
114
+ puts linear_aggregate.to_s
115
+
116
+ ** OUTPUT **
117
+ ** Binary Histogram**
118
+ value |------------------------------------------------------------------| count
119
+ 1 | | 3
120
+ 2 | | 1
121
+ 4 | | 5
122
+ 8 | | 9
123
+ 16 | | 15
124
+ 32 | | 29
125
+ 64 | | 62
126
+ 128 | | 115
127
+ 256 | | 267
128
+ 512 |@ | 523
129
+ 1024 |@ | 970
130
+ 2048 |@@@ | 1987
131
+ 4096 |@@@@@@@@ | 4075
132
+ 8192 |@@@@@@@@@@@@@@@@ | 8108
133
+ 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 16405
134
+ 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
135
+ ~
136
+ Total |------------------------------------------------------------------| 65535
137
+
138
+ ** Linear (0, 65536, 4096) Histogram **
139
+ value |------------------------------------------------------------------| count
140
+ 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4094
141
+ 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 4202
142
+ 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4118
143
+ 12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4059
144
+ 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 3999
145
+ 20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4083
146
+ 24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4134
147
+ 28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4143
148
+ 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4152
149
+ 36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4033
150
+ 40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4064
151
+ 45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4012
152
+ 49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4070
153
+ 53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4090
154
+ 57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4135
155
+ 61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 4144
156
+ Total |------------------------------------------------------------------| 65532
157
+
158
+ We can see from these histograms that Ruby's rand function does a relatively good
159
+ job of distributing returned values in the requested range.
160
+
161
+ ** NOTES **
162
+ Ruby doesn't have a log2 function built into Math, so we approximate with
163
+ log(x)/log(2). Theoretically floor(log(2^n - 1) / log(2)) == n-1. Unfortunately due
164
+ to precision limitations, once n reaches a certain size (somewhere > 32)
165
+ this starts to return n. The larger the value of n, the more numbers i.e.
166
+ (2^n - 2), (2^n - 3), etc. fall prey to this error. Could probably look into
167
+ using something like BigDecimal, but for the current purposes of the binary
168
+ histogram i.e. a simple coarse-grained view the current implementation is
169
+ sufficient.
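
The precision issue described in the README's NOTES can be explored with a short standalone sketch. It compares a float-based bucket index, computed with log(x)/log(2) as the NOTES describe, against an exact integer index. Both helper names are hypothetical and not part of the gem, and Integer#bit_length assumes Ruby 2.1 or later. Whether the two columns disagree for a given n depends on the platform's floating-point behavior, so no expected output is shown.

```ruby
# Illustrative sketch only; these helpers are hypothetical and not part of the gem.

# Float-based index, following the log(x)/log(2) approach from the NOTES.
def float_bucket_index(x)
  (Math.log(x) / Math.log(2)).floor
end

# Exact integer alternative: a positive integer x with 2**i <= x < 2**(i + 1)
# has a bit length of i + 1 (Integer#bit_length, Ruby 2.1+).
def exact_bucket_index(x)
  x.bit_length - 1
end

# Print both indices for the largest value that belongs in bucket n - 1.
[8, 16, 32, 64, 127].each do |n|
  x = (2**n) - 1
  puts format("n=%3d float=%3d exact=%3d", n, float_bucket_index(x), exact_bucket_index(x))
end
```

An integer-only index like this is one possible alternative to the BigDecimal idea mentioned in the NOTES; it only applies to integer samples.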
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.1.2
1
+ 0.2.0
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{aggregate}
8
- s.version = "0.1.2"
8
+ s.version = "0.2.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Joseph Ruscio"]
12
- s.date = %q{2009-08-16}
12
+ s.date = %q{2009-09-12}
13
13
  s.description = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
14
14
  s.email = %q{jruscio@gmail.com}
15
15
  s.extra_rdoc_files = [
@@ -28,7 +28,7 @@ Gem::Specification.new do |s|
28
28
  s.homepage = %q{http://github.com/josephruscio/aggregate}
29
29
  s.rdoc_options = ["--charset=UTF-8"]
30
30
  s.require_paths = ["lib"]
31
- s.rubygems_version = %q{1.3.3}
31
+ s.rubygems_version = %q{1.3.5}
32
32
  s.summary = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
33
33
  s.test_files = [
34
34
  "test/ts_aggregate.rb"
@@ -5,15 +5,15 @@ class Aggregate
5
5
  #The current average of all samples
6
6
  attr_reader :mean
7
7
 
8
- #The current number of samples
8
+ #The current number of samples
9
9
  attr_reader :count
10
-
10
+
11
11
  #The maximum sample value
12
12
  attr_reader :max
13
-
13
+
14
14
  #The minimum samples value
15
15
  attr_reader :min
16
-
16
+
17
17
  #The sum of all samples
18
18
  attr_reader :sum
19
19
 
@@ -22,7 +22,7 @@ class Aggregate
22
22
 
23
23
  #The number of samples falling above the highest valued histogram bucket
24
24
  attr_reader :outliers_high
25
-
25
+
26
26
  # The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**@@LOG_BUCKETS)
27
27
  @@LOG_BUCKETS = 128
28
28
 
@@ -36,7 +36,9 @@ class Aggregate
36
36
  @outliers_low = 0
37
37
  @outliers_high = 0
38
38
 
39
- # If the user asks we maintain a linear histogram
39
+ # If the user asks we maintain a linear histogram where
40
+ # values in the range [low, high) are bucketed in multiples
41
+ # of width
40
42
  if (nil != low && nil != high && nil != width)
41
43
 
42
44
  #Validate linear specification
@@ -48,6 +50,10 @@ class Aggregate
48
50
  raise ArgumentError, "Histogram width must be <= histogram range"
49
51
  end
50
52
 
53
+ if 0 != (high - low).modulo(width)
54
+ raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
55
+ end
56
+
51
57
  @low = low
52
58
  @high = high
53
59
  @width = width
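
The validation added in this hunk can be sketched as a standalone helper. This is a minimal illustration under stated assumptions, not the gem's code: `linear_bucket_count` is a hypothetical name, and the wording of the first two error messages is assumed. Only the last two messages appear in the diff.

```ruby
# Hypothetical standalone helper; a sketch of the checks, not the gem's code.
# Returns the number of buckets for a valid [low, high) / width specification.
def linear_bucket_count(low, high, width)
  range = high - low
  # Assumed wording for the next two messages; they do not appear in the diff.
  raise ArgumentError, "Histogram width must be positive" unless width > 0
  raise ArgumentError, "Histogram range must be positive" unless range > 0
  raise ArgumentError, "Histogram width must be <= histogram range" if width > range
  unless (range % width).zero?
    raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
  end
  range / width
end

# The README's response-time example: 2000 / 50 gives 40 buckets.
puts linear_bucket_count(0, 2000, 50)

# A range of 16383 is not a multiple of 1024, so this is rejected.
begin
  linear_bucket_count(1, 16384, 1024)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```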
@@ -73,7 +79,7 @@ class Aggregate
73
79
  end
74
80
 
75
81
  # Update the running info
76
- @count += 1
82
+ @count += 1
77
83
  @sum += data
78
84
  @sum2 += (data * data)
79
85
 
@@ -119,6 +125,9 @@ class Aggregate
119
125
  total += count
120
126
  end
121
127
 
128
+ #XXX: Better to print just header --> footer
129
+ return "Empty histogram" if 0 == disp_buckets.length
130
+
122
131
  #Figure out how wide the value and count columns need to be based on their
123
132
  #largest respective numbers
124
133
  value_str = "value"
@@ -144,7 +153,7 @@ class Aggregate
144
153
 
145
154
  #Loop through each bucket to be displayed and output the correct number
146
155
  prev_index = disp_buckets[0][0] - 1
147
-
156
+
148
157
  disp_buckets.each do |x|
149
158
  #Denote skipped empty buckets with a ~
150
159
  histogram << skip_row(value_width) unless prev_index == x[0] - 1
@@ -173,9 +182,9 @@ class Aggregate
173
182
  histogram << "| "
174
183
  histogram << sprintf("%#{count_width}d\n", total)
175
184
  end
176
-
177
- #Iterate through each bucket in the histogram regardless of
178
- #its contents
185
+
186
+ #Iterate through each bucket in the histogram regardless of
187
+ #its contents
179
188
  def each
180
189
  @buckets.each_with_index do |count, index|
181
190
  yield(to_bucket(index), count)
@@ -200,7 +209,7 @@ class Aggregate
200
209
 
201
210
  if data < @low
202
211
  @outliers_low += 1
203
- elsif data > @high
212
+ elsif data >= @high
204
213
  @outliers_high += 1
205
214
  else
206
215
  return false
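
Changing `>` to `>=` makes the linear range half-open, so a sample equal to high is counted as an outlier. The sketch below shows that classification in a standalone tally. The `tally` method is hypothetical, and the specification (0, 32768, 1024) is an assumption chosen for illustration. The sample values are the ones used by the linear histogram test in this diff.

```ruby
# Sketch only: a standalone tally, not the gem's implementation.
def tally(samples, low, high, width)
  counts = Hash.new(0)
  outliers_low = 0
  outliers_high = 0
  samples.each do |data|
    if data < low
      outliers_low += 1
    elsif data >= high # half-open range: high itself is an outlier
      outliers_high += 1
    else
      # Key each bucket by the lowest value it can hold.
      bucket_start = low + ((data - low) / width) * width
      counts[bucket_start] += 1
    end
  end
  [counts, outliers_low, outliers_high]
end

samples = [0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
counts, below, above = tally(samples, 0, 32768, 1024) # assumed specification
counts.sort.each { |bucket, count| puts "#{bucket}: #{count}" }
puts "outliers: low=#{below} high=#{above}"
```

Under these assumptions the non-empty buckets start at 0, 1024, 15360 and 16384 with counts 5, 2, 1 and 2, and 32768 is the single high outlier.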
@@ -222,7 +231,7 @@ class Aggregate
222
231
  return 2**(index)
223
232
  end
224
233
  end
225
-
234
+
226
235
  def right_bucket? index, data
227
236
 
228
237
  # check invariant
@@ -270,8 +279,9 @@ class Aggregate
270
279
  end
271
280
 
272
281
  # log2(x) returns j, | i = j-1 and 2**i <= data < 2**j
282
+ @@LOG2_DIVEDEND = Math.log(2)
273
283
  def log2( x )
274
- Math.log(x) / Math.log(2)
284
+ Math.log(x) / @@LOG2_DIVEDEND
275
285
  end
276
-
286
+
277
287
  end
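
The running totals updated in the hunks above (@count, @sum, @sum2) are enough to report a mean and standard deviation without storing any samples. The class below is a minimal sketch of that idea. `RunningStats` is a hypothetical name, and the sample (n - 1) standard deviation formula is an assumption, since the diff does not show how the gem computes std_dev.

```ruby
# Minimal sketch of sample-free statistics; not the gem's implementation.
class RunningStats
  attr_reader :count, :sum, :min, :max

  def initialize
    @count = 0
    @sum = 0.0
    @sum2 = 0.0 # running sum of squared samples
    @min = nil
    @max = nil
  end

  # Fold one sample into the running totals; the sample itself is discarded.
  def <<(data)
    @count += 1
    @sum += data
    @sum2 += data * data
    @min = data if @min.nil? || data < @min
    @max = data if @max.nil? || data > @max
    self
  end

  def mean
    @count.zero? ? 0.0 : @sum / @count
  end

  # Assumed formula: sample standard deviation from the running sums.
  def std_dev
    return 0.0 if @count < 2
    variance = (@sum2 - @sum * mean) / (@count - 1)
    Math.sqrt([variance, 0.0].max) # clamp tiny negatives from rounding
  end
end

stats = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |sample| stats << sample }
puts "count=#{stats.count} mean=#{stats.mean} std_dev=#{stats.std_dev}"
```

Memory use stays constant in the number of samples. The trade-off is that the sum-of-squares form can lose precision when the variance is small relative to the mean.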
@@ -73,7 +73,7 @@ class SimpleStatsTest < Test::Unit::TestCase
73
73
  =end
74
74
 
75
75
  #XXX: Update test_bucket_contents() if you muck with @@DATA
76
- @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383 ]
76
+ @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383]
77
77
  def test_bucket_contents
78
78
  #XXX: This is the only test so far that cares about the actual contents
79
79
  # of @@DATA, so if you update that array ... update this method too
@@ -83,7 +83,7 @@ class SimpleStatsTest < Test::Unit::TestCase
83
83
  i = 0
84
84
  @stats.each_nonzero do |bucket, count|
85
85
  assert_equal expected_buckets[i], bucket
86
- assert_equal expected_counts[i], count
86
+ assert_equal expected_counts[i], count
87
87
  # Increment for the next test
88
88
  i += 1
89
89
  end
@@ -96,12 +96,19 @@ class SimpleStatsTest < Test::Unit::TestCase
96
96
  def test_outlier
97
97
  assert_equal 0, @stats.outliers_low
98
98
  assert_equal 0, @stats.outliers_high
99
-
99
+
100
100
  @stats << -1
101
101
  @stats << -2
102
- @stats << 2**129
102
+ @stats << 0
103
+
104
+ @stats << 2**128
103
105
 
104
- assert_equal 2, @stats.outliers_low
106
+ # This should be the last value in the last bucket, but Ruby's native
107
+ # floats are not precise enough. Somewhere past 2^32 the log(x)/log(2)
108
+ # breaks down. So it shows up as 128 (outlier) instead of 127
109
+ #@stats << (2**128) - 1
110
+
111
+ assert_equal 3, @stats.outliers_low
105
112
  assert_equal 1, @stats.outliers_high
106
113
  end
107
114
 
@@ -120,23 +127,33 @@ class LinearHistogramTest < Test::Unit::TestCase
120
127
  end
121
128
 
122
129
  def test_validation
130
+
131
+ # Range cannot be 0
123
132
  assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,32,4)}
133
+
134
+ # Range cannot be negative
124
135
  assert_raise(ArgumentError) {bad_stats = Aggregate.new(32,16,4)}
136
+
137
+ # Range cannot be < single bucket
125
138
  assert_raise(ArgumentError) {bad_stats = Aggregate.new(16,32,17)}
139
+
140
+ # Range % width must equal 0 (for now)
141
+ assert_raise(ArgumentError) {bad_stats = Aggregate.new(1,16384,1024)}
126
142
  end
127
143
 
128
144
  #XXX: Update test_bucket_contents() if you muck with @@DATA
129
- @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383 ]
145
+ # 32768 is an outlier
146
+ @@DATA = [ 0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
130
147
  def test_bucket_contents
131
148
  #XXX: This is the only test so far that cares about the actual contents
132
149
  # of @@DATA, so if you update that array ... update this method too
133
150
  expected_buckets = [0, 1024, 15360, 16384]
134
- expected_counts = [4, 2, 1, 2]
151
+ expected_counts = [5, 2, 1, 2]
135
152
 
136
153
  i = 0
137
154
  @stats.each_nonzero do |bucket, count|
138
155
  assert_equal expected_buckets[i], bucket
139
- assert_equal expected_counts[i], count
156
+ assert_equal expected_counts[i], count
140
157
  # Increment for the next test
141
158
  i += 1
142
159
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: aggregate
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.2
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Joseph Ruscio
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-08-16 00:00:00 -07:00
12
+ date: 2009-09-12 00:00:00 -07:00
13
13
  default_executable:
14
14
  dependencies: []
15
15
 
@@ -54,7 +54,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
54
54
  requirements: []
55
55
 
56
56
  rubyforge_project:
57
- rubygems_version: 1.3.3
57
+ rubygems_version: 1.3.5
58
58
  signing_key:
59
59
  specification_version: 3
60
60
  summary: Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support