hetchy 1.0.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 051d2f9830fb1d6e3bfa363610249fc760ab0ee2
4
+ data.tar.gz: f2971f802dbc95eea3a1641d5060db28c78cadc7
5
+ SHA512:
6
+ metadata.gz: 0b9571297d6389a176aae76e58d0b66d55413646c7686cd3f2a85155f8991e6d29c6965649eb26b9b3ecc9eeb447ac15c43f65925184e6fcbd8b415e86922972
7
+ data.tar.gz: 1f33d20c689b47f2ae6f0db8597f0ac95efe60c83f6feabd1bb6ef3c3c77c4d274e1a20c9bf80ef379ab20920dc13a88f78db5d21dafd0bf9c34f8798cd4f267
@@ -0,0 +1,15 @@
1
+ /.ruby-version
2
+ /.bundle/
3
+ /.yardoc
4
+ /Gemfile.lock
5
+ /_yardoc/
6
+ /coverage/
7
+ /doc/
8
+ /pkg/
9
+ /spec/reports/
10
+ /tmp/
11
+ *.bundle
12
+ *.so
13
+ *.o
14
+ *.a
15
+ mkmf.log
data/Gemfile ADDED
@@ -0,0 +1,8 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in hetchy.gemspec
4
+ gemspec
5
+
6
+ gem 'minitest', '>= 5'
7
+ gem 'rake'
8
+ gem 'pry'
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2015 Matt Sanders
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,113 @@
1
+ # Hetchy
2
+
3
+ A high performance, thread-safe reservoir sampler for ruby with snapshot and percentile support.
4
+
5
+ ## Benefits/Goals
6
+
7
+ * Generate statistics for large datasets with a very small in-memory footprint
8
+ * Generate percentiles/quantiles for a known set of numbers or for a real-time streaming set
9
+ * Ability to capture sample state at a moment in time for further analysis
10
+ * Speed
11
+ * Thorough test suite
12
+ * No dependencies
13
+
14
+ ## Installation
15
+
16
+ Add this line to your application's Gemfile:
17
+
18
+ ```ruby
19
+ gem 'hetchy'
20
+ ```
21
+
22
+ And then `$ bundle`
23
+
24
+ Or install it yourself with:
25
+
26
+ $ gem install hetchy
27
+
28
+ ## Usage
29
+
30
+
31
+ Create a reservoir and designate how big you want it to be:
32
+
33
+ ```ruby
34
+ reservoir = Hetchy::Reservoir.new(size: 1000)
35
+ ```
36
+
37
+ Add samples as they arrive or are generated:
38
+
39
+ ```ruby
40
+ reservoir << 123
41
+ ```
42
+
43
+ You can add sets of samples as well:
44
+
45
+ ```ruby
46
+ reservoir << [45,89,124,96]
47
+ ```
48
+
49
+ You can calculate a percentile at any time:
50
+
51
+ ```ruby
52
+ reservoir.percentile(95)
53
+ ```
54
+
55
+ Hetchy supports high resolution percentiles as well:
56
+
57
+ ```ruby
58
+ reservoir.percentile(99.99)
59
+ ```
60
+
61
+ NOTE: you may need to increase reservoir size for _very_ high resolution percentiles. Experiment to see what works for your data set.
62
+
63
+ For threaded applications where the reservoir is accepting samples rapidly you can increase performance by snapshotting before running a series of calculations on the reservoir:
64
+
65
+ ```ruby
66
+ dataset = reservoir.snapshot
67
+
68
+ perc_95 = dataset.percentile(95)
69
+ perc_99 = dataset.percentile(99)
70
+ perc_999 = dataset.percentile(99.9)
71
+ ```
72
+
73
+ Clear the reservoir to reset it:
74
+
75
+ ```ruby
76
+ reservoir.clear
77
+ ```
78
+
79
+ ## Datasets
80
+
81
+ If you have an existing series you can use Dataset to generate stats for it:
82
+
83
+ ```ruby
84
+ my_series = Array(1..1000)
85
+ dataset = Hetchy::Dataset.new(my_series)
86
+
87
+ perc_95 = dataset.percentile(95) #=> 950.95
88
+ median = dataset.median #=> 500.5
89
+ ```
90
+
91
+ ## Stats Details
92
+
93
+ For those interested:
94
+
95
+ * Reservoir sampling is based on Vitter's Algorithm R, which ensures a uniform sampling probability for every entry in the series
96
+ * Percentile calculations use weighted averages, not nearest neighbor
97
+
98
+ ## Contributing
99
+
100
+ * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
101
+ * Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
102
+ * Fork the project and submit a pull request from a feature or bugfix branch.
103
+ * Please include tests. This is important so we don't break your changes unintentionally in a future version.
104
+ * Please don't modify the gemspec, Rakefile, version, or changelog. If you do change these files, please isolate a separate commit so we can cherry-pick around it.
105
+
106
+ ## Credits
107
+
108
+ Parts of Hetchy are inspired by Eric Lindvall's [metriks]() gem.
109
+
110
+ ## Copyright
111
+
112
+ Copyright (c) 2015 Matt Sanders. See LICENSE for details.
113
+
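Reviewer note on the README's snapshot advice: the idea is to hold the lock only long enough to copy the data, then sort and analyse the copy while writers continue. The sketch below illustrates that copy-under-lock pattern in isolation. `LockedBuffer` is a hypothetical class written for this note, not part of Hetchy, and it omits sampling entirely.

```ruby
# Hypothetical minimal buffer illustrating the copy-under-lock pattern.
# Not part of Hetchy; it stores every value and does no sampling.
class LockedBuffer
  def initialize
    @lock = Mutex.new
    @items = []
  end

  # Append a value while holding the lock; returns self so calls can chain.
  def <<(value)
    @lock.synchronize { @items << value }
    self
  end

  # Copy while holding the lock, then sort and freeze the copy outside it.
  # Later writes to the buffer do not affect a snapshot already taken.
  def snapshot
    copy = @lock.synchronize { @items.dup }
    copy.sort.freeze
  end
end

buffer = LockedBuffer.new
buffer << 3 << 1 << 2
frozen_view = buffer.snapshot
buffer << 4
puts frozen_view.inspect
```

Because the snapshot is a separate array, any number of percentile calculations can run against it without contending with threads that are still adding samples.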
@@ -0,0 +1,18 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ # Console
4
+ desc "Open a console session preloaded with this library"
5
+ task :console do
6
+ sh "pry -r ./lib/hetchy.rb"
7
+ end
8
+
9
+ # Testing
10
+ require 'rake/testtask'
11
+ Rake::TestTask.new(:test) do |t|
12
+ t.libs << 'lib'
13
+ t.libs << 'test'
14
+ t.pattern = 'test/**/*_test.rb'
15
+ t.verbose = false
16
+ end
17
+
18
+ task :default => :test
@@ -0,0 +1,23 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'hetchy/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "hetchy"
8
+ spec.version = Hetchy::VERSION
9
+ spec.authors = ["Matt Sanders"]
10
+ spec.email = ["matt@modal.org"]
11
+ spec.summary = %q{High performance, thread-safe reservoir sampler with percentile support.}
12
+ spec.description = %q{A high performance, thread-safe reservoir sampler with snapshot and percentile support.}
13
+ spec.homepage = "https://github.com/nextmat/hetchy"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ # spec.add_development_dependency "bundler", "~> 1.7"
22
+ # spec.add_development_dependency "rake", "~> 10.0"
23
+ end
@@ -0,0 +1,5 @@
1
+ module Hetchy; end
2
+
3
+ require_relative 'hetchy/dataset'
4
+ require_relative 'hetchy/reservoir'
5
+ require_relative 'hetchy/version'
@@ -0,0 +1,54 @@
1
+ module Hetchy
2
+
3
+ # Takes a fixed immutable dataset (array of values) and provides
4
+ # various statistical measurements.
5
+ #
6
+ class Dataset
7
+
8
+ attr_reader :data,
9
+ :size
10
+
11
+ def initialize(data)
12
+ @data = data.sort
13
+ @size = @data.length
14
+ end
15
+
16
+ # @return [Float/Integer] median of the data set.
17
+ def median
18
+ percentile(50.0)
19
+ end
20
+
21
+ # Generate a percentile for the data set.
22
+ # @example
23
+ # snapshot.percentile(95)
24
+ # snapshot.percentile(99.9)
25
+ #
26
+ def percentile(perc)
27
+ if perc > 100.0 || perc < 0.0
28
+ raise InvalidPercentile, "percentile must be between 0.0 and 100.0"
29
+ end
30
+ return 0.0 if data.empty?
31
+
32
+ rank = (perc / 100.0) * (size + 1)
33
+
34
+ return data[0] if rank < 1
35
+ return data[-1] if rank > size
36
+ return data[rank - 1] if rank == Integer(rank)
37
+ weighted_average_for(rank)
38
+ end
39
+
40
+ private
41
+
42
+ # when rank lands between values, generate a weighted average
43
+ # of adjacent values
44
+ def weighted_average_for(rank)
45
+ above = data[rank.to_i]
46
+ below = data[rank.to_i - 1]
47
+ fractional = rank - rank.floor
48
+ below + ((above - below) * fractional)
49
+ end
50
+
51
+ end
52
+
53
+ class InvalidPercentile < StandardError; end
54
+ end
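Reviewer note on the rank arithmetic in `Dataset#percentile`: the rank is `(perc / 100.0) * (size + 1)`, treated as a 1-based and possibly fractional position in the sorted data, with linear interpolation between the two neighbouring values when it is fractional. The standalone sketch below mirrors that arithmetic so it can be tried without the gem. `interpolated_percentile` is a hypothetical helper name, and it raises `ArgumentError` rather than the gem's `InvalidPercentile`.

```ruby
# Hypothetical standalone helper mirroring the rank arithmetic used by
# Dataset#percentile in the diff above. Raises ArgumentError (an assumption
# for this sketch) instead of Hetchy::InvalidPercentile.
def interpolated_percentile(values, perc)
  if perc < 0.0 || perc > 100.0
    raise ArgumentError, "percentile must be between 0.0 and 100.0"
  end

  sorted = values.sort
  return 0.0 if sorted.empty?

  n = sorted.length
  rank = (perc / 100.0) * (n + 1) # 1-based, possibly fractional
  return sorted.first if rank < 1
  return sorted.last if rank > n

  lower = rank.floor
  fraction = rank - lower
  return sorted[lower - 1] if fraction.zero?

  # Rank falls between two entries: interpolate linearly between them.
  below = sorted[lower - 1]
  above = sorted[lower]
  below + (above - below) * fraction
end

# For 1..1000 at the 50th percentile, rank = 0.5 * 1001 = 500.5,
# so the result is halfway between 500 and 501.
puts interpolated_percentile((1..1000).to_a, 50) # prints 500.5
```

Note that results for ranks such as 950.95 are subject to ordinary floating-point rounding, so comparisons are best made with a small tolerance.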
@@ -0,0 +1,72 @@
1
+ module Hetchy
2
+ class Reservoir
3
+
4
+ attr_reader :count, # number of samples processed
5
+ :pool, # raw pool data, do not modify this
6
+ :size # size of allocated pool
7
+
8
+ # Create a reservoir.
9
+ # @option opts [Integer] :size Size of reservoir
10
+ #
11
+ def initialize(opts={})
12
+ @size = opts.fetch(:size, 1000)
13
+ @lock = Mutex.new
14
+ initialize_pool
15
+ end
16
+
17
+ # Add one or more values to the reservoir.
18
+ # @example
19
+ # reservoir << 1234
20
+ # reservoir << [2345,7891,2131]
21
+ #
22
+ def << (values)
23
+ Array(values).each do |value|
24
+ @lock.synchronize do
25
+ # sampling strategy is Vitter's algo R
26
+ if count < size
27
+ @pool[count] = value
28
+ else
29
+ index = rand(count+1)
30
+ if index < @size
31
+ @pool[index] = value
32
+ end
33
+ end
34
+ @count += 1
35
+ end
36
+ end
37
+ end
38
+
39
+ # Empty/reset the reservoir
40
+ def clear
41
+ initialize_pool
42
+ end
43
+
44
+ # Calculate a percentile based on the current state of the
45
+ # reservoir.
46
+ #
47
+ # If you are going to calculate multiple percentiles
48
+ # it will be faster to #snapshot and then calculate them off of
49
+ # the generated Dataset.
50
+ def percentile(perc)
51
+ snapshot.percentile(perc)
52
+ end
53
+
54
+ # Capture a moment in time for the reservoir for analysis.
55
+ # Since sampling may be ongoing this ensures we are working
56
+ # with data from our intended period.
57
+ #
58
+ def snapshot
59
+ data = nil
60
+ @lock.synchronize { data = @pool.dup }
61
+ Dataset.new(data.compact)
62
+ end
63
+
64
+ private
65
+
66
+ def initialize_pool
67
+ # reset pool and count together so concurrent writers never see a mismatch
68
+ @lock.synchronize { @pool = Array.new(@size); @count = 0 }
69
+ end
70
+
71
+ end
72
+ end
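Reviewer note on the sampling step in `Reservoir#<<`: the first `size` values fill the pool directly, and each later value replaces a random slot with probability `size / (count + 1)`. The sketch below shows the same Algorithm R logic as a single-threaded function, without the mutex. `reservoir_sample` and its `rng:` parameter are hypothetical names introduced for this note.

```ruby
# Hypothetical single-threaded sketch of Algorithm R over any enumerable.
# `rng:` is an assumption added here so runs can be made repeatable.
def reservoir_sample(stream, k, rng: Random.new)
  pool = []
  stream.each_with_index do |value, seen|
    if seen < k
      # Fill phase: the first k values go straight into the pool.
      pool << value
    else
      # Sampling phase: pick a slot uniformly from 0..seen and keep the
      # value only if that slot lies inside the pool.
      slot = rng.rand(seen + 1)
      pool[slot] = value if slot < k
    end
  end
  pool
end

sample = reservoir_sample(1..1000, 10, rng: Random.new(42))
puts sample.length # prints 10
```

Passing a seeded `Random` makes a run repeatable, which helps when testing code that depends on the sample.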
@@ -0,0 +1,3 @@
1
+ module Hetchy
2
+ VERSION = "1.0.0"
3
+ end
@@ -0,0 +1,15 @@
1
+ class SampleSets
2
+
3
+ # 100 random samples from 50-2550
4
+ def self.set_100
5
+ [1136, 1551, 929, 932, 2383, 970, 2437, 1909, 1083, 2188, 1250, 1530, 1559,
6
+ 782, 2163, 2108, 1173, 2323, 553, 1503, 436, 2418, 2542, 1938, 1276, 328,
7
+ 222, 2384, 2186, 861, 1596, 590, 743, 1837, 1732, 1408, 2177, 2334, 310,
8
+ 1726, 274, 2283, 1908, 138, 587, 325, 975, 2047, 87, 1671, 535, 1007,
9
+ 2340, 1816, 1458, 1565, 1996, 1740, 1568, 853, 1816, 2078, 2094, 707,
10
+ 1444, 939, 2365, 1704, 999, 752, 899, 1220, 2190, 99, 1917, 1738, 2355,
11
+ 1164, 504, 832, 2087, 2311, 2513, 1322, 254, 2486, 1865, 1294, 585, 528,
12
+ 1244, 1395, 1185, 2327, 2461, 2145, 1795, 1336, 1935, 766]
13
+ end
14
+
15
+ end
@@ -0,0 +1,9 @@
1
+ require 'bundler'
2
+ Bundler.setup
3
+
4
+ require 'pry'
5
+ require 'minitest/autorun'
6
+
7
+ require 'hetchy'
8
+
9
+ Dir[File.dirname(__FILE__) + '/support/*.rb'].each {|file| require file }
@@ -0,0 +1,54 @@
1
+ require 'test_helper'
2
+
3
+ module Hetchy
4
+ class DatasetTest < Minitest::Test
5
+
6
+ def test_data_sort
7
+ dataset = Dataset.new([5,2,3,1])
8
+ assert_equal [1,2,3,5], dataset.data, 'data is sorted'
9
+ end
10
+
11
+ def test_median
12
+ dataset = Dataset.new(Array(1..10))
13
+ assert_equal 5.5, dataset.median, 'median is accurate'
14
+ end
15
+
16
+ def test_percentile_empty
17
+ empty_dataset = Dataset.new([])
18
+ assert_equal 0.0, empty_dataset.percentile(99)
19
+ end
20
+
21
+ def test_percentile_extreme_bounds
22
+ dataset = Dataset.new(Array(1..9))
23
+ assert_equal 9, dataset.percentile(99.9)
24
+ assert_equal 1, dataset.percentile(0.9)
25
+ end
26
+
27
+ def test_percentile_fractional
28
+ dataset = Dataset.new(SampleSets.set_100)
29
+ assert_equal 2447.92, dataset.percentile(95.5)
30
+ end
31
+
32
+ def test_percentile_limits
33
+ dataset = Dataset.new([1,2,3])
34
+ assert_raises(InvalidPercentile) { dataset.percentile(-0.2) }
35
+ assert_raises(InvalidPercentile) { dataset.percentile(1000) }
36
+ end
37
+
38
+ def test_percentile_simple
39
+ dataset = Dataset.new(Array(1..9))
40
+ assert_equal 5, dataset.percentile(50)
41
+
42
+ set = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 9, 10, 10, 10]
43
+ dataset = Dataset.new(set)
44
+ assert_equal 5, dataset.percentile(25)
45
+ assert_in_delta 9.85, dataset.percentile(85), 0.001
46
+ end
47
+
48
+ def test_size
49
+ dataset = Dataset.new([1,2,3])
50
+ assert_equal 3, dataset.size
51
+ end
52
+
53
+ end
54
+ end
@@ -0,0 +1,97 @@
1
+ require 'test_helper'
2
+
3
+ module Hetchy
4
+ class ReservoirTest < Minitest::Test
5
+
6
+ def test_add_basics
7
+ # single add
8
+ reservoir = Reservoir.new(size: 10)
9
+ reservoir << 1
10
+ assert_equal 1, reservoir.pool[0]
11
+
12
+ # multi add
13
+ reservoir << [2,3,4]
14
+ assert_equal 2, reservoir.pool[1]
15
+ assert_equal 3, reservoir.pool[2]
16
+ assert_equal 4, reservoir.pool[3]
17
+ end
18
+
19
+ def test_add_pool_fully_accessible
20
+ series = Array(1..1000)
21
+ reservoir = Reservoir.new(size: 10)
22
+ reservoir << series
23
+
24
+ refute_equal 1, reservoir.pool[0], 'first entry in pool can be changed'
25
+ refute_equal 10, reservoir.pool[-1], 'last entry in pool can be changed'
26
+ end
27
+
28
+ def test_add_saturation_starts_sampling
29
+ series = Array(1..15)
30
+ overflow = series[10..-1]
31
+
32
+ reservoir = Reservoir.new(size: 10)
33
+ reservoir << series
34
+
35
+ assert_equal 15, reservoir.count, 'all samples recorded'
36
+ assert_equal 10, reservoir.pool.length, 'pool is stable size'
37
+ refute_equal (series - reservoir.pool), overflow,
38
+ 'overflow measures are sampled into pool'
39
+ end
40
+
41
+ def test_add_thread_safety
42
+ reservoir = Reservoir.new(size: 1024)
43
+ threads = []
44
+ 5.times do
45
+ threads << Thread.new do
46
+ 1000.times { reservoir << rand(10000) }
47
+ end
48
+ end
49
+
50
+ threads.each(&:join)
51
+ assert_equal 5000, reservoir.count, 'adding is threadsafe'
52
+ end
53
+
54
+ def test_clear
55
+ reservoir = Reservoir.new(size: 10)
56
+ reservoir << [1,2,3]
57
+
58
+ reservoir.clear
59
+ assert_nil reservoir.pool[0], 'should clear reservoir'
60
+ assert_equal 0, reservoir.count
61
+ end
62
+
63
+ def test_count
64
+ reservoir = Reservoir.new(size: 10)
65
+ assert_equal 0, reservoir.count
66
+
67
+ 10.times do
68
+ reservoir << rand(1000)
69
+ end
70
+ assert_equal 10, reservoir.count
71
+ end
72
+
73
+ def test_percentile
74
+ reservoir = Reservoir.new(size: 10)
75
+ reservoir << [1,2,3,4,5]
76
+
77
+ assert_equal 3, reservoir.percentile(50)
78
+ end
79
+
80
+ def test_size
81
+ reservoir = Reservoir.new(size: 5)
82
+ assert_equal 5, reservoir.size
83
+ end
84
+
85
+ def test_snapshot
86
+ reservoir = Reservoir.new(size: 5)
87
+ reservoir << [1,2]
88
+ snapshot = reservoir.snapshot
89
+
90
+ reservoir << 3
91
+ assert_includes snapshot.data, 1, 'has previous samples'
92
+ assert_includes snapshot.data, 2, 'has previous samples'
93
+ refute_includes snapshot.data, 3, 'does not receive changes'
94
+ end
95
+
96
+ end
97
+ end
metadata ADDED
@@ -0,0 +1,63 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: hetchy
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ platform: ruby
6
+ authors:
7
+ - Matt Sanders
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-08-27 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: A high performance, thread-safe reservoir sampler with snapshot and percentile
14
+ support.
15
+ email:
16
+ - matt@modal.org
17
+ executables: []
18
+ extensions: []
19
+ extra_rdoc_files: []
20
+ files:
21
+ - ".gitignore"
22
+ - Gemfile
23
+ - LICENSE.txt
24
+ - README.md
25
+ - Rakefile
26
+ - hetchy.gemspec
27
+ - lib/hetchy.rb
28
+ - lib/hetchy/dataset.rb
29
+ - lib/hetchy/reservoir.rb
30
+ - lib/hetchy/version.rb
31
+ - test/support/sample_sets.rb
32
+ - test/test_helper.rb
33
+ - test/unit/dataset_test.rb
34
+ - test/unit/reservoir_test.rb
35
+ homepage: https://github.com/nextmat/hetchy
36
+ licenses:
37
+ - MIT
38
+ metadata: {}
39
+ post_install_message:
40
+ rdoc_options: []
41
+ require_paths:
42
+ - lib
43
+ required_ruby_version: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ required_rubygems_version: !ruby/object:Gem::Requirement
49
+ requirements:
50
+ - - ">="
51
+ - !ruby/object:Gem::Version
52
+ version: '0'
53
+ requirements: []
54
+ rubyforge_project:
55
+ rubygems_version: 2.2.2
56
+ signing_key:
57
+ specification_version: 4
58
+ summary: High performance, thread-safe reservoir sampler with percentile support.
59
+ test_files:
60
+ - test/support/sample_sets.rb
61
+ - test/test_helper.rb
62
+ - test/unit/dataset_test.rb
63
+ - test/unit/reservoir_test.rb