pest 0.0.0 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -2,18 +2,10 @@
 
  [![Build Status](https://secure.travis-ci.org/kerinin/pest.png)](http://travis-ci.org/kerinin/pest)
 
+ **A concise API focused on painless investigation of data sets**
 
- Pest provides a unified framework for interacting with different probability
- estimation models.
-
- * Pest tries to be agnostic about the underlying data data structures,
- so changing libraries (GSL -> Hadoop) is as simple as using a different data source.
- * Pest is designed to create estimators using subsets of larger data sources, and
- transparently constructs estimators to facilitate dynamic querying
- * Implementing custom estimation models is easy, and Pest implements some model
- common ones for you.
-
- Pest abstracts common statstical operations including:
+ Pest provides a framework for interacting with different probability
+ estimation models. Pest abstracts common statistical operations including:
 
  * Marginal, Joint and Conditional point probability
  * Interval and Cumulative probability
@@ -21,24 +13,34 @@ Pest abstracts common statstical operations including:
  * Mean, Median, Mode, etc
 
 
- ## Ruby Install
+ **Scalability if you need it**
 
- ``` sh
- brew install gnuplot # This may take awhile...
- cd /usr/local
- git checkout 83ed494 /usr/local/Library/Formula/gsl.rb
- brew install gsl # Forcing gsl v1.4
+ Pest tries to be agnostic about the underlying data structures,
+ so changing libraries (NArray -> Hadoop) is as simple as using a different data source.
+ Pest is designed to create estimators using subsets of larger data sources, and
+ transparently constructs estimators to facilitate dynamic querying.
 
- bundle install
- ```
+
+ **Code structure designed to be extended**
+
+ Implementing custom estimation models is easy, and Pest implements some
+ common ones for you.
+
+
+ ## Install
+
+ Add it to your Gemfile and bundle
+
+ gem "pest"
+
+ bundle install
 
  ## API
 
  ``` ruby
  # Creating Datasets
- test = Pest::DataSet::Hash.new hash # Creates a Hash dataset of observations from a hash
- test = Pest::DataSet::Hash.new file # Creates a Hash dataset of observations from an IO (Marshalled)
- train = Pest::DataSet::GSL.new file # Creates a GSL dataset from and IO instance
+ test = Pest::DataSet::Hash.from_hash hash # Creates a Hash dataset of observations from a hash
+ train = Pest::DataSet::NArray.from_hash hash # Creates a NArray dataset
 
  # DataSet Variables
  test.variables # hash of Variable instances detected in observation set
@@ -47,13 +49,14 @@ test.v[:foo] # a specific variable
  test.v[:foo] = another_variable # explicit declaration
 
  # Creating Estimators
- e = Pest::Estimator::Set::Multinomial.new(test) # Creates a multinomial estimator for set o
- e = Pest::Estimator::Discrete::Gaussian.new(file) # Creating an estimator with the DataSet API
+ e = Pest::Estimator::Frequency.new(data) # Frequentist estimator - values treated as unordered set
+ e = Pest::Estimator::Multinomial.new(data) # Multinomial estimator
+ e = Pest::Estimator::Gaussian.new(data) # Gaussian mean/variance ML estimator
 
  # Descriptive Statistical Properties
- e.mode(:foo) # Mode
- e.mean(:foo) # Mean (discrete & continuous only)
- e.median(:foo) # Median (discrete & continuous only)
+ #e.mode(:foo) # Mode
+ #e.mean(:foo) # Mean (discrete & continuous only)
+ #e.median(:foo) # Median (discrete & continuous only)
  # quantile?
  # variance?
  # deviation?
@@ -65,26 +68,25 @@ e.h(:foo).given(:bar) # Cross entropy of 'foo' :
  e.mutual_information(:foo, :bar) # Mutual information of 'foo' and 'bar'
  e.i(:foo, :bar) # Alias
 
- # Estimating Point Probability (Set & Discrete only)
- e.probability(o.variables[:foo]) # (Set/Discrete only) Estimate the probability of all values of 'foo'
- e.p(:foo) # Same as above, tries to find a variable named 'foo'
- e.p(:foo).in(test) # Estimate the probability of values in dataset 'test'
- e.p(:foo).given(:bar).in(test) # Estimate the conditional foo | bar for the values in 'test'
- e.p(:foo, :bar).in(test) # Estimate the joint probablity foo AND bar
- e.p(:foo, :bar).given(:baz, :qux).in(test) # More complex joint & conditional probabilities
- e.p(:foo => 4, :bar => 2).given(:baz => 0) # Single prediction (implicitly creates dataset)
- e.p(:foo).given(:bar).cache # Builds and persists the model for 'foo|bar'
- e.p(:foo).given(:bar).cache('path.csv') # Persist to a specific path (defaults to tmp)
-
- # Estimating Cumulative & Interval Probability (Discrete & Continuous only)
- e.probability(:foo).greater_than(:bar).in(test)
- e.p(:foo).greater_than(:bar).less_than(:baz).in(test)
- e.p(:foo).gt(:bar).lt(:baz).given(:qux).in(test)
+ # Estimating Point Probability
+ e.probability(e.variables[:foo] => 1) # Estimate the probability that foo=1
+ e.p(:foo => 1) # Same as above, tries to find a variable named 'foo'
+ e.p(:foo => 1, :bar => 2) # Estimate the probability that foo=1 AND bar=2
+ e.p(:foo => 1).given(:bar => 2) # Estimate the probability that foo=1 given bar=2
+ e.p(:foo => 1, :bar => 2).given(:baz => 3, :qux => 4) # Moar
+
+ # Batch Point Probability Estimation
+ e.batch_probability(:foo).in(test) # Estimate the probability of each value in test
+ e.batch_p(:foo, :bar).in(test) # Joint probability
+ e.batch_p(:foo).given(:bar).in(test) # Conditional probability
+ e.batch_p(:foo, :bar).given(:baz, :qux).in(test) # Moar
+
+ # Estimating Cumulative & Interval Probability
+ #e.probability(:foo).greater_than(:bar).in(test)
+ #e.p(:foo).greater_than(:bar).less_than(:baz).in(test)
+ #e.p(:foo).gt(:bar).lt(:baz).given(:qux).in(test)
  ```
 
- ## Working Notes
+ ## TODO
 
- Do we want variable equality to be name-based? It may make more sense to allow
- variables named differently in different data sets to be equivalent. And how the
- fuck do we handle variable type? I'm almost thinking we don't, and let the actual
- estimators take care of type casting
+ The builders should validate the variables they're given and throw errors if they're not part of the estimator's data
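Read end to end, the 0.1.0 API shown in the README above amounts to a workflow along these lines (a minimal sketch; the variable names and observation values are invented for illustration):

```ruby
require 'pest'

# Invented toy observations
train = Pest::DataSet::Hash.from_hash(
  :sky => [:sunny, :sunny, :rainy, :rainy],
  :wet => [false,  false,  true,   false]
)

# Frequentist estimator over the training observations
e = Pest::Estimator::Frequency.new(train)

e.p(:wet => true)                        # point probability that wet=true
e.p(:wet => true).given(:sky => :rainy)  # conditional probability
e.batch_p(:wet).given(:sky).in(train)    # per-vector estimates for a whole dataset
```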
data/VERSION CHANGED
@@ -1 +1 @@
- 0.0.0
+ 0.1.0
@@ -5,7 +5,6 @@ require 'uuidtools'
  require 'csv'
 
  require "pest/version"
- require "pest/variable"
 
  require "pest/function"
  require "pest/function/probability"
@@ -3,13 +3,15 @@ module Pest::DataSet
     base.extend(ClassMethods)
   end
 
-  def variables
-    @variables ||= {}
-  end
+  include Enumerable
+
+  attr_accessor :variables, :data
 
-  def variable_array
-    @variables.values.sort
+  def initialize(variables = Set.new, data = nil)
+    @variables = variables
+    @data = data
   end
+  alias :v :variables
 
   def to_hash(*args)
     raise NotImplementedError
@@ -27,6 +29,44 @@ module Pest::DataSet
     raise NotImplementedError
   end
 
+  def ==(other)
+    variables == other.variables and data == other.data
+  end
+  alias :eql? :==
+
+  def [](*args)
+    raise NotImplementedError
+  end
+
+  def except(start, finish)
+    left = start > 0 ? self[0..start] : nil
+    right = finish < length - 1 ? self[finish..-1] : nil
+    case [left.nil?, right.nil?]
+    when [true, false]
+      right
+    when [false, true]
+      left
+    when [false, false]
+      right + left
+    end
+  end
+
+  def +(other)
+    raise NotImplementedError
+  end
+
+  def pick(*args)
+    raise NotImplementedError
+  end
+
+  def each(&block)
+    raise NotImplementedError
+  end
+
+  def merge(other)
+    raise NotImplementedError
+  end
+
   module ClassMethods
     def from(data_source)
       # Try to translate the data source directly
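The module above fixes the interface every backend shares: `initialize`, `==`, `except` and the `v` alias come for free, while `[]`, `+`, `pick`, `each` and `merge` are left as `NotImplementedError` for the concrete class. A hypothetical adapter (not part of the gem) would look roughly like this:

```ruby
require 'set'

# Hypothetical backend keeping observations as an array of row hashes;
# it reuses the module's initialize/==/variables and overrides only the
# enumeration methods it actually needs.
class Pest::DataSet::Rows
  include Pest::DataSet

  def self.from_rows(rows)
    new(rows.first.keys.to_set, rows)
  end

  def length
    data.length
  end

  def each
    data.each { |row| yield variables.map { |var| row[var] } }
  end

  # [], +, pick and merge would be implemented against the same backing array.
end
```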
@@ -3,68 +3,102 @@ class Pest::DataSet::Hash
 
   def self.translators
     {
-      File => :from_file,
       String => :from_file,
       Symbol => :from_file
     }
   end
 
-  def self.from_file(file)
-    file = File.open(file.to_s, 'r') if file.kind_of?(String)
-
-    object = Marshal.restore(file)
-
-    if object.kind_of?(::Hash)
-      self.new(object)
-    else
-      raise "File does not seem to contain valid data"
-    end
+  def self.from_hash(hash)
+    data_set = new
+    data_set.variables += hash.keys
+    data_set.instance_variable_set(:@hash, hash)
+    data_set
   end
 
   attr_reader :variables, :hash
 
-  def initialize(hash)
-    @hash = hash
-    @variables = {}
-    hash.keys().each do |name|
-      @variables[name] = Pest::Variable.new(:name => name)
-    end
+  def initialize(*args)
+    super *args
+    @hash = {}
   end
 
-  def to_hash
-    @hash
+  def data
+    hash.values
   end
 
-  def data_vectors(variables=nil)
-    VectorEnumerable.new(self,variables)
+  def to_hash
+    hash
   end
 
   def length
     @hash.values.first.length
   end
 
-  def save(file=nil)
-    file ||= Tempfile.new('pest_hash_dataset')
-    file = File.open(file, 'w') if file.kind_of?(String)
-    Marshal.dump(@hash, file)
+  def [](*args)
+    unless args.any?
+      raise ArgumentError, "Indices not specified"
+    end
+
+    args.map do |arg|
+      subset = self.class.new
+      subset.variables = self.variables
+      variables.each do |var|
+        subset.hash[var.name] = hash[var.name][arg]
+      end
+      subset
+    end.inject(:+)
+
   end
 
-  class VectorEnumerable
-    include Enumerable
+  def +(other)
+    unless other.variables == variables
+      raise ArgumentError, "DataSets have different variables"
+    end
 
-    def initialize(data_set,variables=nil)
-      @data_set = data_set
-      @variables = variables || @data_set.variables
+    union = self.class.new
+    union.variables = variables
+    variables.each do |var|
+      union.hash[var.name] = hash[var.name] + other.hash[var.name]
     end
+    union
+  end
 
-    def [](i)
-      @variables.map {|var| @data_set.hash[var][i]}
+  def pick(*args)
+    unless args.any?
+      raise ArgumentError, "You didn't specify any variables to pick"
     end
 
-    def each
-      @data_set.hash.values.first.each_index do |i|
-        yield @variables.keys.map {|var| @data_set.hash[var][i]}
-      end
+    subset = self.class.new
+    subset.variables += args
+    args.each do |var|
+      raise ArgumentError, "Dataset doesn't include '#{var}'" unless hash.has_key?(var)
+      subset.hash[var] = hash[var]
     end
+    subset
+  end
+
+  def each(&block)
+    (0..length-1).to_a.each do |i| yield variables.map {|var| hash[var][i]} end
+  end
+
+  def dup
+    instance = self.class.new
+    instance.variables = variables.dup
+    instance.instance_variable_set(:@hash, hash.dup)
+    instance
+  end
+
+  def merge(other)
+    dup.merge!(other)
+  end
+
+  def merge!(other)
+    other = self.class.from_hash(other) if other.kind_of?(::Hash)
+    raise ArgumentError, "Lengths must be the same" if other.length != length
+
+    @variables += other.variables
+    hash.merge! other.hash
+
+    self
   end
 end
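For orientation, the Hash-backed dataset defined above behaves roughly as follows (values are invented; the calls mirror the methods shown in the class):

```ruby
data = Pest::DataSet::Hash.from_hash(:foo => [1, 2, 3], :bar => [4, 5, 6])

data.length                              # => 3
data.pick(:foo).to_hash                  # => {:foo => [1, 2, 3]}
data.each { |vector| p vector }          # prints [1, 4], [2, 5], [3, 6]
data.merge(:baz => [7, 8, 9]).variables  # now also contains :baz
```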
@@ -1,6 +1,6 @@
  require 'narray'
 
- class Pest::DataSet::NArray < NMatrix
+ class Pest::DataSet::NArray
   include Pest::DataSet
 
   def self.translators
@@ -13,86 +13,137 @@ class Pest::DataSet::NArray < NMatrix
   end
 
   def self.from_hash(hash)
-    data_set = to_na(hash.keys.sort.map {|key| hash[key]}) # Ensure the matrix is sorted the same as the variables
-    data_set.variables = {}
-    hash.keys.each do |key|
-      variable = key.kind_of?(Pest::Variable) ? key : Pest::Variable.new(:name => key)
-      data_set.variables[variable.name] = variable
-    end
+    data_set = new(
+      hash.keys.to_set,
+      NMatrix.to_na(hash.values)
+    )
+    data_set.instance_variable_set(:@variable_array, hash.keys)
     data_set
   end
 
   def self.from_file(file)
-    file = File.open(file.to_s, 'r') if file.kind_of?(String)
-
-    begin
-      variables, matrix = Marshal.restore(file)
-      data_set = to_na(matrix)
-      data_set.variables = variables
-      data_set
-    rescue
-      raise "File does not seem to contain valid data"
-    end
+    from_csv(file)
   end
 
   def self.from_csv(file, args={})
-    args = args.merge({:converters => :all})
-    data = CSV.read(file, args)
-    data_set = to_na(data[1..-1]).transpose
-    data_set.variables = {}
-    data[0].each do |key|
-      variable = key.kind_of?(Pest::Variable) ? key : Pest::Variable.new(:name => key)
-      data_set.variables[variable.name] = variable
-    end
+    args = {:col_sep => "\t", :headers => true, :converters => :all}.merge args
+    csv_data = CSV.read(file, args).map(&:to_hash)
+
+    data_set = new(
+      csv_data.first.keys.to_set,
+      NMatrix.to_na(csv_data.map(&:values)).transpose
+    )
+    # Ensure the ordering matches what's in the CSV
+    data_set.instance_variable_set(:@variable_array, csv_data.first.keys)
     data_set
   end
 
-  attr_accessor :variables
+  attr_reader :variable_array
+
+  def initialize(*args)
+    super *args
+    @variable_array = variables.to_a.sort
+  end
 
   def to_hash
     hash = {}
-    variables.values.each_index do |i|
-      hash[variables.values[i]] = self[true,i].to_a[0]
+    variable_array.each_index do |i|
+      hash[variable_array[i]] = data[true,i].to_a[0]
     end
     hash
   end
 
-  # variables: an array of variables for which each vector should contain values
-  # Order is retained in the returned value
-  def data_vectors(variables=nil)
-    VectorEnumerable.new(self, variables)
+  def length
+    data.shape[0]
   end
 
-  def length
-    shape[0]
+  # Return a subset of the data with the same variables,
+  # but only the vectors specified by i
+  #
+  def [](*args)
+    unless args.any?
+      raise ArgumentError, "Indices not specified"
+    end
+
+    args.map do |arg|
+      subset = self.class.new
+      subset.variables = self.variables
+      subset.data = self.data[arg,true]
+      subset
+    end.inject(:+)
   end
 
-  def save(file=nil)
-    file ||= Tempfile.new('pest_hash_dataset')
-    file = File.open(file, 'w') if file.kind_of?(String)
-    Marshal.dump([variables,to_a], file)
-    file.close
+
+  # Return the union of self and other
+  #
+  def +(other)
+    unless other.variables == variables
+      raise ArgumentError, "DataSets have different variables"
+    end
+
+    union = self.class.new
+    union.variables = variables
+    union.data = NMatrix[*(data.transpose.to_a + other.data.transpose.to_a)].transpose
+    union
   end
 
-  class VectorEnumerable
-    include Enumerable
+  # Return a subset of the data with the same vectors, but only
+  # the variables specified in args
+  #
+  def pick(*args)
+    unless args.any?
+      raise ArgumentError, "You didn't specify any variables to pick"
+    end
 
-    def initialize(data_set, variables = true)
-      @data_set = data_set
-      @variables = variables
-      if @variables.kind_of?(Enumerable)
-        @variables = variables.map {|v| @data_set.variable_array.index(v)}
-      end
+    picked_indices = args.map do |variable|
+      raise ArgumentError, "Dataset doesn't include #{variable}" unless variables.include?(variable)
+      self.variable_array.index(variable)
     end
 
-    def [](i)
-      @data_set[i,@variables].transpose
+    self.class.new(args, self.data[true, picked_indices] )
+  end
+
+  def each(&block)
+    (0..length-1).to_a.each do |i|
+      yield data[i,true].transpose.to_a.first
     end
+  end
 
-    def each
-      (0..@data_set.shape[0]-1).each do |i|
-        yield Array(self[i]).first
-      end
+  def dup
+    instance = self.class.new( variables.dup, data.dup)
+    instance.instance_variable_set(:@variable_array, variable_array)
+    instance
+  end
+
+  def merge(other)
+    dup.merge!(other)
+  end
+
+  def merge!(other)
+    other = self.class.from_hash(other) if other.kind_of?(::Hash)
+    raise ArgumentError, "Lengths must be the same" if other.length != length
+
+    # Merge the variables. Existing variables should be updated,
+    # new variables should be appended to the hash in the same order
+    # as they appear in other
+    @variable_array += (other.variables - variables).to_a
+    @variables += other.variables
+
+    # Create the new data array, should be the size of the merged variables
+    # by the number of vectors
+    new_data = ::NArray.object(length, variables.length)
+
+    # Copy over the data from self (as if we had extended self.data to the
+    # right to allow for the new data)
+    new_data[true, 0..data.shape[1]-1] = data
+
+    # Merge in other's data, using the indices of other's variables as the
+    # slice keys
+    other.variables.each do |variable|
+      new_data[true, variable_array.index(variable)] = other.pick(variable).to_a.flatten
     end
+
+    self.data = new_data
+    self
   end
 end
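The `merge!` comments above describe the alignment rule: new variables are appended to the variable order and their columns are copied in by index, so the vector count stays fixed. A small sketch of that behavior (invented values):

```ruby
left  = Pest::DataSet::NArray.from_hash(:foo => [1, 2], :bar => [3, 4])
right = Pest::DataSet::NArray.from_hash(:baz => [5, 6])

merged = left.merge(right)        # non-destructive; merge! mutates in place
merged.variables.include?(:baz)   # => true
merged.length                     # => 2, same number of vectors as before
```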