hetchy 1.0.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 051d2f9830fb1d6e3bfa363610249fc760ab0ee2
4
+ data.tar.gz: f2971f802dbc95eea3a1641d5060db28c78cadc7
5
+ SHA512:
6
+ metadata.gz: 0b9571297d6389a176aae76e58d0b66d55413646c7686cd3f2a85155f8991e6d29c6965649eb26b9b3ecc9eeb447ac15c43f65925184e6fcbd8b415e86922972
7
+ data.tar.gz: 1f33d20c689b47f2ae6f0db8597f0ac95efe60c83f6feabd1bb6ef3c3c77c4d274e1a20c9bf80ef379ab20920dc13a88f78db5d21dafd0bf9c34f8798cd4f267
data/.gitignore ADDED
@@ -0,0 +1,15 @@
1
+ /.ruby-version
2
+ /.bundle/
3
+ /.yardoc
4
+ /Gemfile.lock
5
+ /_yardoc/
6
+ /coverage/
7
+ /doc/
8
+ /pkg/
9
+ /spec/reports/
10
+ /tmp/
11
+ *.bundle
12
+ *.so
13
+ *.o
14
+ *.a
15
+ mkmf.log
data/Gemfile ADDED
@@ -0,0 +1,8 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in hetchy.gemspec
4
+ gemspec
5
+
6
+ gem 'minitest', '>= 5'
7
+ gem 'rake'
8
+ gem 'pry'
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2015 Matt Sanders
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,113 @@
1
+ # Hetchy
2
+
3
+ A high-performance, thread-safe reservoir sampler for Ruby with snapshot and percentile support.
4
+
5
+ ## Benefits/Goals
6
+
7
+ * Generate statistics for large datasets with a very small in-memory footprint
8
+ * Generate percentiles/quantiles for a known set of numbers or for a real-time streaming set
9
+ * Ability to capture sample state at a moment in time for further analysis
10
+ * Speed
11
+ * Thorough test suite
12
+ * No dependencies
13
+
14
+ ## Installation
15
+
16
+ Add this line to your application's Gemfile:
17
+
18
+ ```ruby
19
+ gem 'hetchy'
20
+ ```
21
+
22
+ And then `$ bundle`
23
+
24
+ Or install it yourself with:
25
+
26
+ $ gem install hetchy
27
+
28
+ ## Usage
29
+
30
+
31
+ Create a reservoir and designate how big you want it to be:
32
+
33
+ ```ruby
34
+ reservoir = Hetchy::Reservoir.new(size: 1000)
35
+ ```
36
+
37
+ Add samples as they arrive or are generated:
38
+
39
+ ```ruby
40
+ reservoir << 123
41
+ ```
42
+
43
+ You can add sets of samples as well:
44
+
45
+ ```ruby
46
+ reservoir << [45,89,124,96]
47
+ ```
48
+
49
+ You can calculate a percentile at any time:
50
+
51
+ ```ruby
52
+ reservoir.percentile(95)
53
+ ```
54
+
55
+ Hetchy supports high resolution percentiles as well:
56
+
57
+ ```ruby
58
+ reservoir.percentile(99.99)
59
+ ```
60
+
61
+ NOTE: you may need to increase reservoir size for _very_ high resolution percentiles. Experiment to see what works for your data set.
62
+
63
+ For threaded applications where the reservoir is accepting samples rapidly, you can improve performance by taking a snapshot before running a series of calculations on the reservoir:
64
+
65
+ ```ruby
66
+ dataset = reservoir.snapshot
67
+
68
+ perc_95 = dataset.percentile(95)
69
+ perc_99 = dataset.percentile(99)
70
+ perc_999 = dataset.percentile(99.9)
71
+ ```
72
+
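A minimal sketch of why a snapshot helps, assuming a simplified pool: `TinyPool` below is a hypothetical stand-in, not Hetchy's API. The lock is held only long enough to copy the values, so later calculations run on the copy without blocking writers and without seeing samples that arrive afterwards.

```ruby
# Hypothetical stand-in for a reservoir; not part of Hetchy.
class TinyPool
  def initialize
    @lock = Mutex.new
    @values = []
  end

  # Writers take the lock once per value.
  def <<(value)
    @lock.synchronize { @values << value }
    self
  end

  # Copy under the lock, then hand back a frozen array that
  # later writes cannot touch.
  def snapshot
    @lock.synchronize { @values.dup }.freeze
  end
end

pool = TinyPool.new
writers = 4.times.map do
  Thread.new { 250.times { |i| pool << i } }
end
writers.each(&:join)

frozen = pool.snapshot
pool << 999            # arrives after the snapshot was taken
sorted = frozen.sort   # computed without holding the lock
puts frozen.length     # 1000
```

Any number of statistics can then be derived from `sorted` while the pool keeps accepting samples.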
73
+ Clear the reservoir to reset it:
74
+
75
+ ```ruby
76
+ reservoir.clear
77
+ ```
78
+
79
+ ## Datasets
80
+
81
+ If you have an existing series you can use Dataset to generate stats for it:
82
+
83
+ ```ruby
84
+ my_series = Array(1..1000)
85
+ dataset = Hetchy::Dataset.new(my_series)
86
+
87
+ perc_95 = dataset.percentile(95) #=> 950.95
88
+ median = dataset.median #=> 500.5
89
+ ```
90
+
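The two values in the example above can be reproduced by hand. This is a sketch of the interpolation, assuming the rank is computed as `(perc / 100) * (n + 1)` as in `lib/hetchy/dataset.rb`; `interpolated_percentile` is a hypothetical helper name, not part of the gem.

```ruby
# Hypothetical helper illustrating rank-based interpolation.
def interpolated_percentile(values, perc)
  sorted = values.sort
  return 0.0 if sorted.empty?

  # 1-based rank; may fall between two entries.
  rank = (perc / 100.0) * (sorted.length + 1)
  return sorted.first if rank < 1
  return sorted.last if rank > sorted.length

  lower = rank.floor
  fraction = rank - lower
  return sorted[lower - 1] if fraction.zero?

  below = sorted[lower - 1]
  above = sorted[lower]
  # Weighted average of the two neighbouring values.
  below + (above - below) * fraction
end

series = Array(1..1000)
median = interpolated_percentile(series, 50)   # rank 500.5, between 500 and 501
perc_95 = interpolated_percentile(series, 95)  # rank is about 950.95, between 950 and 951
puts median
puts perc_95.round(2)
```

For the median the rank is exactly 500.5, so the result is halfway between the 500th and 501st values.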
91
+ ## Stats Details
92
+
93
+ For those interested:
94
+
95
+ * Reservoir sampling is based on Vitter's algorithm R, which ensures a uniform sampling probability for every entry in the series
96
+ * Percentile calculations use weighted averages, not nearest neighbor
97
+
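As a rough illustration of algorithm R, here is a standalone sketch. `reservoir_sample` is a hypothetical helper and not part of Hetchy. The first `size` values fill the pool; after that, the value at zero-based position `seen` replaces a random slot with probability `size / (seen + 1)`.

```ruby
# Hypothetical standalone helper; not Hetchy's API.
def reservoir_sample(stream, size, rng: Random.new)
  pool = []
  stream.each_with_index do |value, seen|
    if seen < size
      # Fill phase: keep everything until the pool is full.
      pool << value
    else
      # Sampling phase: pick a slot among all positions seen so far
      # and keep the value only if the slot falls inside the pool.
      index = rng.rand(seen + 1)
      pool[index] = value if index < size
    end
  end
  pool
end

sample = reservoir_sample(1..10_000, 100, rng: Random.new(42))
puts sample.length  # 100
```

The pool never grows beyond `size`, which is what keeps the memory footprint small for long streams.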
98
+ ## Contributing
99
+
100
+ * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
101
+ * Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
102
+ * Fork the project and submit a pull request from a feature or bugfix branch.
103
+ * Please include tests. This is important so we don't break your changes unintentionally in a future version.
104
+ * Please don't modify the gemspec, Rakefile, version, or changelog. If you do change these files, please isolate the changes in a separate commit so we can cherry-pick around it.
105
+
106
+ ## Credits
107
+
108
+ Parts of Hetchy are inspired by Eric Lindvall's [metriks]() gem.
109
+
110
+ ## Copyright
111
+
112
+ Copyright (c) 2015 Matt Sanders. See LICENSE for details.
113
+
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ # Console
4
+ desc "Open a console session preloaded with this library"
5
+ task :console do
6
+ sh "pry -r ./lib/hetchy.rb"
7
+ end
8
+
9
+ # Testing
10
+ require 'rake/testtask'
11
+ Rake::TestTask.new(:test) do |t|
12
+ t.libs << 'lib'
13
+ t.libs << 'test'
14
+ t.pattern = 'test/**/*_test.rb'
15
+ t.verbose = false
16
+ end
17
+
18
+ task :default => :test
data/hetchy.gemspec ADDED
@@ -0,0 +1,23 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'hetchy/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "hetchy"
8
+ spec.version = Hetchy::VERSION
9
+ spec.authors = ["Matt Sanders"]
10
+ spec.email = ["matt@modal.org"]
11
+ spec.summary = %q{High performance, thread-safe reservoir sampler with percentile support.}
12
+ spec.description = %q{A high performance, thread-safe reservoir sampler with snapshot and percentile support.}
13
+ spec.homepage = "https://github.com/nextmat/hetchy"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ # spec.add_development_dependency "bundler", "~> 1.7"
22
+ # spec.add_development_dependency "rake", "~> 10.0"
23
+ end
data/lib/hetchy.rb ADDED
@@ -0,0 +1,5 @@
1
+ module Hetchy; end
2
+
3
+ require_relative 'hetchy/dataset'
4
+ require_relative 'hetchy/reservoir'
5
+ require_relative 'hetchy/version'
data/lib/hetchy/dataset.rb ADDED
@@ -0,0 +1,54 @@
1
+ module Hetchy
2
+
3
+ # Takes a fixed immutable dataset (array of values) and provides
4
+ # various statistical measurements.
5
+ #
6
+ class Dataset
7
+
8
+ attr_reader :data,
9
+ :size
10
+
11
+ def initialize(data)
12
+ @data = data.sort
13
+ @size = @data.length
14
+ end
15
+
16
+ # @return [Float, Integer] median of the data set.
17
+ def median
18
+ percentile(50.0)
19
+ end
20
+
21
+ # Generate a percentile for the data set.
22
+ # @example
23
+ # snapshot.percentile(95)
24
+ # snapshot.percentile(99.9)
25
+ #
26
+ def percentile(perc)
27
+ if perc > 100.0 || perc < 0.0
28
+ raise InvalidPercentile, "percentile must be between 0.0 and 100.0"
29
+ end
30
+ return 0.0 if data.empty?
31
+
32
+ rank = (perc / 100.0) * (size + 1)
33
+
34
+ return data[0] if rank < 1
35
+ return data[-1] if rank > size
36
+ return data[rank - 1] if rank == Integer(rank)
37
+ weighted_average_for(rank)
38
+ end
39
+
40
+ private
41
+
42
+ # when rank lands between values, generate a weighted average
43
+ # of adjacent values
44
+ def weighted_average_for(rank)
45
+ above = data[rank.to_i]
46
+ below = data[rank.to_i - 1]
47
+ fractional = rank - rank.floor
48
+ below + ((above - below) * fractional)
49
+ end
50
+
51
+ end
52
+
53
+ class InvalidPercentile < StandardError; end
54
+ end
data/lib/hetchy/reservoir.rb ADDED
@@ -0,0 +1,72 @@
1
+ module Hetchy
2
+ class Reservoir
3
+
4
+ attr_reader :count, # number of samples processed
5
+ :pool, # raw pool data, do not modify this
6
+ :size # size of allocated pool
7
+
8
+ # Create a reservoir.
9
+ # @option opts [Integer] :size Size of reservoir
10
+ #
11
+ def initialize(opts={})
12
+ @size = opts.fetch(:size, 1000)
13
+ @lock = Mutex.new
14
+ initialize_pool
15
+ end
16
+
17
+ # Add one or more values to the reservoir.
18
+ # @example
19
+ # reservoir << 1234
20
+ # reservoir << [2345,7891,2131]
21
+ #
22
+ def << (values)
23
+ Array(values).each do |value|
24
+ @lock.synchronize do
25
+ # sampling strategy is Vitter's algo R
26
+ if count < size
27
+ @pool[count] = value
28
+ else
29
+ index = rand(count+1)
30
+ if index < @size
31
+ @pool[index] = value
32
+ end
33
+ end
34
+ @count += 1
35
+ end
36
+ end
37
+ end
38
+
39
+ # Empty/reset the reservoir
40
+ def clear
41
+ initialize_pool
42
+ end
43
+
44
+ # Calculate a percentile based on the current state of the
45
+ # reservoir.
46
+ #
47
+ # If you are going to calculate multiple percentiles
48
+ # it will be faster to #snapshot and then calculate them off of
49
+ # the generated Dataset.
50
+ def percentile(perc)
51
+ snapshot.percentile(perc)
52
+ end
53
+
54
+ # Capture a moment in time for the reservoir for analysis.
55
+ # Since sampling may be ongoing this ensures we are working
56
+ # with data from our intended period.
57
+ #
58
+ def snapshot
59
+ data = nil
60
+ @lock.synchronize { data = @pool.dup }
61
+ Dataset.new(data.compact)
62
+ end
63
+
64
+ private
65
+
66
+ def initialize_pool
67
+ # reset pool and count together under the lock so a concurrent << sees a consistent state
68
+ @lock.synchronize { @pool = Array.new(@size); @count = 0 }
69
+ end
70
+
71
+ end
72
+ end
data/lib/hetchy/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Hetchy
2
+ VERSION = "1.0.0"
3
+ end
data/test/support/sample_sets.rb ADDED
@@ -0,0 +1,15 @@
1
+ class SampleSets
2
+
3
+ # 100 random samples from 50-2550
4
+ def self.set_100
5
+ [1136, 1551, 929, 932, 2383, 970, 2437, 1909, 1083, 2188, 1250, 1530, 1559,
6
+ 782, 2163, 2108, 1173, 2323, 553, 1503, 436, 2418, 2542, 1938, 1276, 328,
7
+ 222, 2384, 2186, 861, 1596, 590, 743, 1837, 1732, 1408, 2177, 2334, 310,
8
+ 1726, 274, 2283, 1908, 138, 587, 325, 975, 2047, 87, 1671, 535, 1007,
9
+ 2340, 1816, 1458, 1565, 1996, 1740, 1568, 853, 1816, 2078, 2094, 707,
10
+ 1444, 939, 2365, 1704, 999, 752, 899, 1220, 2190, 99, 1917, 1738, 2355,
11
+ 1164, 504, 832, 2087, 2311, 2513, 1322, 254, 2486, 1865, 1294, 585, 528,
12
+ 1244, 1395, 1185, 2327, 2461, 2145, 1795, 1336, 1935, 766]
13
+ end
14
+
15
+ end
data/test/test_helper.rb ADDED
@@ -0,0 +1,9 @@
1
+ require 'bundler'
2
+ Bundler.setup
3
+
4
+ require 'pry'
5
+ require 'minitest/autorun'
6
+
7
+ require 'hetchy'
8
+
9
+ Dir[File.dirname(__FILE__) + '/support/*.rb'].each {|file| require file }
data/test/unit/dataset_test.rb ADDED
@@ -0,0 +1,54 @@
1
+ require 'test_helper'
2
+
3
+ module Hetchy
4
+ class DatasetTest < Minitest::Test
5
+
6
+ def test_data_sort
7
+ dataset = Dataset.new([5,2,3,1])
8
+ assert_equal [1,2,3,5], dataset.data, 'data is sorted'
9
+ end
10
+
11
+ def test_median
12
+ dataset = Dataset.new(Array(1..10))
13
+ assert_equal 5.5, dataset.median, 'median is accurate'
14
+ end
15
+
16
+ def test_percentile_empty
17
+ empty_dataset = Dataset.new([])
18
+ assert_equal 0.0, empty_dataset.percentile(99)
19
+ end
20
+
21
+ def test_percentile_extreme_bounds
22
+ dataset = Dataset.new(Array(1..9))
23
+ assert_equal 9, dataset.percentile(99.9)
24
+ assert_equal 1, dataset.percentile(0.9)
25
+ end
26
+
27
+ def test_percentile_fractional
28
+ dataset = Dataset.new(SampleSets.set_100)
29
+ assert_equal 2447.92, dataset.percentile(95.5)
30
+ end
31
+
32
+ def test_percentile_limits
33
+ dataset = Dataset.new([1,2,3])
34
+ assert_raises(InvalidPercentile) { dataset.percentile(-0.2) }
35
+ assert_raises(InvalidPercentile) { dataset.percentile(1000) }
36
+ end
37
+
38
+ def test_percentile_simple
39
+ dataset = Dataset.new(Array(1..9))
40
+ assert_equal 5, dataset.percentile(50)
41
+
42
+ set = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 9, 10, 10, 10]
43
+ dataset = Dataset.new(set)
44
+ assert_equal 5, dataset.percentile(25)
45
+ assert_in_delta 9.85, dataset.percentile(85), 0.001
46
+ end
47
+
48
+ def test_size
49
+ dataset = Dataset.new([1,2,3])
50
+ assert_equal 3, dataset.size
51
+ end
52
+
53
+ end
54
+ end
data/test/unit/reservoir_test.rb ADDED
@@ -0,0 +1,97 @@
1
+ require 'test_helper'
2
+
3
+ module Hetchy
4
+ class ReservoirTest < Minitest::Test
5
+
6
+ def test_add_basics
7
+ # single add
8
+ reservoir = Reservoir.new(size: 10)
9
+ reservoir << 1
10
+ assert_equal 1, reservoir.pool[0]
11
+
12
+ # multi add
13
+ reservoir << [2,3,4]
14
+ assert_equal 2, reservoir.pool[1]
15
+ assert_equal 3, reservoir.pool[2]
16
+ assert_equal 4, reservoir.pool[3]
17
+ end
18
+
19
+ def test_add_pool_fully_accessible
20
+ series = Array(1..1000)
21
+ reservoir = Reservoir.new(size: 10)
22
+ reservoir << series
23
+
24
+ refute_equal 1, reservoir.pool[0], 'first entry in pool can be changed'
25
+ refute_equal 10, reservoir.pool[-1], 'last entry in pool can be changed'
26
+ end
27
+
28
+ def test_add_saturation_starts_sampling
29
+ series = Array(1..15)
30
+ overflow = series[10..-1]
31
+
32
+ reservoir = Reservoir.new(size: 10)
33
+ reservoir << series
34
+
35
+ assert_equal 15, reservoir.count, 'all samples recorded'
36
+ assert_equal 10, reservoir.pool.length, 'pool is stable size'
37
+ refute_equal (series - reservoir.pool), overflow,
38
+ 'overflow measures are sampled into pool'
39
+ end
40
+
41
+ def test_add_thread_safety
42
+ reservoir = Reservoir.new(size: 1024)
43
+ threads = []
44
+ 5.times do
45
+ threads << Thread.new do
46
+ 1000.times { reservoir << rand(10000) }
47
+ end
48
+ end
49
+
50
+ threads.each(&:join)
51
+ assert_equal 5000, reservoir.count, 'adding is threadsafe'
52
+ end
53
+
54
+ def test_clear
55
+ reservoir = Reservoir.new(size: 10)
56
+ reservoir << [1,2,3]
57
+
58
+ reservoir.clear
59
+ assert_nil reservoir.pool[0], 'should clear reservoir'
60
+ assert_equal 0, reservoir.count
61
+ end
62
+
63
+ def test_count
64
+ reservoir = Reservoir.new(size: 10)
65
+ assert_equal 0, reservoir.count
66
+
67
+ 10.times do
68
+ reservoir << rand(1000)
69
+ end
70
+ assert_equal 10, reservoir.count
71
+ end
72
+
73
+ def test_percentile
74
+ reservoir = Reservoir.new(size: 10)
75
+ reservoir << [1,2,3,4,5]
76
+
77
+ assert_equal 3, reservoir.percentile(50)
78
+ end
79
+
80
+ def test_size
81
+ reservoir = Reservoir.new(size: 5)
82
+ assert_equal 5, reservoir.size
83
+ end
84
+
85
+ def test_snapshot
86
+ reservoir = Reservoir.new(size: 5)
87
+ reservoir << [1,2]
88
+ snapshot = reservoir.snapshot
89
+
90
+ reservoir << 3
91
+ assert_includes snapshot.data, 1, 'has previous samples'
92
+ assert_includes snapshot.data, 2, 'has previous samples'
93
+ refute_includes snapshot.data, 3, 'does not receive changes'
94
+ end
95
+
96
+ end
97
+ end
metadata ADDED
@@ -0,0 +1,63 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: hetchy
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ platform: ruby
6
+ authors:
7
+ - Matt Sanders
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-08-27 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: A high performance, thread-safe reservoir sampler with snapshot and percentile
14
+ support.
15
+ email:
16
+ - matt@modal.org
17
+ executables: []
18
+ extensions: []
19
+ extra_rdoc_files: []
20
+ files:
21
+ - ".gitignore"
22
+ - Gemfile
23
+ - LICENSE.txt
24
+ - README.md
25
+ - Rakefile
26
+ - hetchy.gemspec
27
+ - lib/hetchy.rb
28
+ - lib/hetchy/dataset.rb
29
+ - lib/hetchy/reservoir.rb
30
+ - lib/hetchy/version.rb
31
+ - test/support/sample_sets.rb
32
+ - test/test_helper.rb
33
+ - test/unit/dataset_test.rb
34
+ - test/unit/reservoir_test.rb
35
+ homepage: https://github.com/nextmat/hetchy
36
+ licenses:
37
+ - MIT
38
+ metadata: {}
39
+ post_install_message:
40
+ rdoc_options: []
41
+ require_paths:
42
+ - lib
43
+ required_ruby_version: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ required_rubygems_version: !ruby/object:Gem::Requirement
49
+ requirements:
50
+ - - ">="
51
+ - !ruby/object:Gem::Version
52
+ version: '0'
53
+ requirements: []
54
+ rubyforge_project:
55
+ rubygems_version: 2.2.2
56
+ signing_key:
57
+ specification_version: 4
58
+ summary: High performance, thread-safe reservoir sampler with percentile support.
59
+ test_files:
60
+ - test/support/sample_sets.rb
61
+ - test/test_helper.rb
62
+ - test/unit/dataset_test.rb
63
+ - test/unit/reservoir_test.rb