statsample 0.6.2 → 0.6.3

data/History.txt CHANGED
@@ -1,3 +1,6 @@
+ === 0.6.3 / 2010-02-15
+ * Statsample::Bivariate::Polychoric now supports joint estimation.
+ * Some extra documentation and bug fixes
  === 0.6.2 / 2010-02-11
  * New Statsample::Bivariate::Polychoric. For implement: X2 and G2
  * New matrix.rb, for faster development of Contingence Tables and Correlation Matrix
data/Manifest.txt CHANGED
@@ -9,6 +9,7 @@ data/repeated_fields.csv
  data/test_binomial.csv
  data/tetmat_matrix.txt
  data/tetmat_test.txt
+ demo/correlation_matrix.rb
  demo/dominance_analysis_bootstrap.rb
  demo/dominanceanalysis.rb
  demo/multiple_regression.rb
@@ -47,7 +48,6 @@ lib/statsample/graph/svggraph.rb
  lib/statsample/graph/svghistogram.rb
  lib/statsample/graph/svgscatterplot.rb
  lib/statsample/histogram.rb
- lib/statsample/htmlreport.rb
  lib/statsample/matrix.rb
  lib/statsample/mle.rb
  lib/statsample/mle/logit.rb
data/README.txt CHANGED
@@ -3,66 +3,70 @@
  http://ruby-statsample.rubyforge.org/


- == DESCRIPTION:
+ == FEATURES:

- A suite for your basic and advanced statistics needs. Descriptive statistics, multiple regression, factorial analysis, dominance analysis, scale's reliability analysis, bivariate statistics and others procedures.
+ A suite for basic and advanced statistics. Includes:
+ * Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others).
+ * Imports and exports datasets from and to Excel, CSV and plain text files.
+ * Correlations: Pearson (r), Rho, Tetrachoric, Polychoric
+ * Regression: Simple, Multiple, Probit and Logit
+ * Factorial Analysis: Extraction (PCA and Principal Axis) and Rotation (Varimax and relatives)
+ * Dominance Analysis (Azen & Budescu)
+ * Sample calculation related formulas

- == FEATURES:
+ == DETAILED FEATURES:

  * Factorial Analysis. Principal Component Analysis and Principal Axis extraction, with orthogonal rotations (Varimax, Equimax, Quartimax)
  * Multiple Regression. Listwise analysis optimized with use of Alglib library. Pairwise analysis is executed on pure ruby and reports same values as SPSS
+ * Module Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma, tetrachoric and polychoric correlations. Includes methods to create correlation (pearson and tetrachoric) and covariance matrices
+ * Regression module provides linear regression methods
  * Dominance Analysis. Based on Budescu and Azen papers, <strong>DominanceAnalysis</strong> class can report dominance analysis for a sample and <strong>DominanceAnalysisBootstrap</strong> can execute bootstrap analysis to determine dominance stability, as recomended by Azen & Budescu (2003) link[http://psycnet.apa.org/journals/met/8/2/129/].
  * Classes for Vector, Datasets (set of Vectors) and Multisets (multiple datasets with same fields and type of vectors), and multiple methods to manipulate them
  * Module Codification, to help to codify open questions
  * Converters to and from database and csv files, and to output Mx and GGobi files
- * Module Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma and tetrachoric correlations. Include methods to create correlation (pearson and tetrachoric) and covariance matrices
  * Module Crosstab provides function to create crosstab for categorical data
- * Module HtmlReport provides methods to create a report for scale analysis and matrix correlation
- * Regression module provides linear regression methods
  * Reliability analysis provides functions to analyze scales. Class ItemAnalysis provides statistics like mean, standard deviation for a scale, Cronbach's alpha and standarized Cronbach's alpha, and for each item: mean, correlation with total scale, mean if deleted, Cronbach's alpha is deleted. With HtmlReport, graph the histogram of the scale and the Item Characteristic Curve for each item
  * Module SRS (Simple Random Sampling) provides a lot of functions to estimate standard error for several type of samples
  * Interfaces to gdchart, gnuplot and SVG::Graph


- == Example of use:
+ == Examples of use:

- # Read a CSV file, using '' and 'error' as missing values and ommiting 1 lines
- ds=Statsample::CSV.read('resultados_c1.csv',['','error'],1)
-
- # Create a new vector (column), calculating the mean of 13 vectors. Accept 1 missing values on one of the vectors
-
- indice_constructivismo_becker=ds.vector_mean(%w{fd_2_1 fd_2_2 fd_3_1 fd_3_2 fd_3_3},1)
-
- # Add the vector to the dataset
-
- ds.add_vector("ind_cons_becker",indice_constructivismo_becker)
-
- # Verify data. Vecto 'de_3_sex' must have values 'a' or 'b'. Dataset#verify returns and array with all errors
-
- t_sex=create_test("Sex must be a o b",'de_3_sex') {|v| v['de_3_sex']=="a" or v['de_3_sex']=="b")}
-
- p ds.verify(t_sexo)
-
-
- # Creates a new dataset, based on the names of vectors
-
- ds_software=ds.dup(%w{pe1n1 pe1n2 pe1n3 pe1n4 pe1n5 })
-
- # Creates an html report, add a correlation matrix with all the scale vectors and save the report into a file
- hr=Statsample::HtmlReport.new(ds_software,"correlations")
- hr.add_correlation_matrix()
- hr.save("correlation_matrix.html")
+ === Correlation matrix
+
+ require 'statsample'
+ a=1000.times.collect {rand}.to_scale
+ b=1000.times.collect {rand}.to_scale
+ c=1000.times.collect {rand}.to_scale
+ d=1000.times.collect {rand}.to_scale
+ ds={'a'=>a,'b'=>b,'c'=>c,'d'=>d}.to_dataset
+ cm=Statsample::Bivariate.correlation_matrix(ds)
+ puts cm.summary
+
+ === Tetrachoric correlation
+
+ require 'statsample'
+ a=40
+ b=10
+ c=20
+ d=30
+ tetra=Statsample::Bivariate::Tetrachoric.new(a,b,c,d)
+ puts tetra.summary

+ === Polychoric correlation
+
+ require 'statsample'
+ ct=Matrix[[58,52,1],[26,58,3],[8,12,9]]

- # Saves the new dataset
- Statsample::CSV.write(ds_software,"ds_software.csv",true)
+ poly=Statsample::Bivariate::Polychoric.new(ct)
+ puts poly.summary

  == REQUIREMENTS:

  Optional:

  * Plotting: gnuplot and rbgnuplot, SVG::Graph
- * Advanced Statistical: gsl and rb-gsl (http://rb-gsl.rubyforge.org/)
+ * Factorial analysis and polychoric correlation: gsl and rb-gsl (http://rb-gsl.rubyforge.org/)

  == DOWNLOAD
  * Gems and bugs report: http://rubyforge.org/projects/ruby-statsample/
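
A note on the tetrachoric example above: the four arguments passed to Statsample::Bivariate::Tetrachoric.new are the cell frequencies of a 2x2 contingency table for two dichotomized variables (a, b for the first row and c, d for the second). A minimal sketch, assuming the counts come from such a crosstab; the table labels are only illustrative:

  require 'statsample'
  # Counts of a 2x2 table:
  #            Y=0   Y=1
  #   X=0       40    10
  #   X=1       20    30
  a, b, c, d = 40, 10, 20, 30
  tetra = Statsample::Bivariate::Tetrachoric.new(a, b, c, d)
  puts tetra.summary   # prints the estimated tetrachoric correlation and related statistics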
data/demo/correlation_matrix.rb ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/ruby
+ $:.unshift(File.dirname(__FILE__)+'/../lib/')
+
+ require 'statsample'
+ a=1000.times.collect {rand}.to_scale
+ b=1000.times.collect {rand}.to_scale
+ c=1000.times.collect {rand}.to_scale
+ d=1000.times.collect {rand}.to_scale
+ ds={'a'=>a,'b'=>b,'c'=>c,'d'=>d}.to_dataset
+ cm=Statsample::Bivariate.correlation_matrix(ds)
+ puts cm.summary
data/demo/dominance_analysis_bootstrap.rb CHANGED
@@ -12,9 +12,5 @@ ds={'a'=>a,'b'=>b,'c'=>c,'d'=>d}.to_dataset

  ds['y']=ds.collect{|row| row['a']*5+row['b']*2+row['c']*2+row['d']*2+10*rand()}
  dab=Statsample::DominanceAnalysis::Bootstrap.new(ds, 'y')
- if HAS_GSL
- # Use Gsl if available (faster calculation)
- dab.regression_class=Statsample::Regression::Multiple::GslEngine
- end
  dab.bootstrap(100,nil,true)
  puts dab.summary
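
The demo no longer forces the GSL regression engine, but the lines removed above also show how to select it by hand. A sketch reusing that removed snippet, assuming rb-gsl is installed (so HAS_GSL is true) and that the Bootstrap class still exposes regression_class=:

  dab = Statsample::DominanceAnalysis::Bootstrap.new(ds, 'y')
  if HAS_GSL
    # Use the GSL engine if available (faster calculation)
    dab.regression_class = Statsample::Regression::Multiple::GslEngine
  end
  dab.bootstrap(100, nil, true)
  puts dab.summary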
data/demo/polychoric.rb CHANGED
@@ -2,12 +2,20 @@
  $:.unshift(File.dirname(__FILE__)+'/../lib/')

  require 'statsample'
- #ct=Matrix[[58,52,1],[26,58,3],[8,12,9]]
+ ct=Matrix[[58,52,1],[26,58,3],[8,12,9]]
+
+ # Estimation of polychoric correlation using two-step (default)
+ poly=Statsample::Bivariate::Polychoric.new(ct, :name=>"Polychoric with two-step")
+ puts poly.summary

- ct=Matrix[[30,1,0,0,0,0],[0,10,2,0,0,0], [0,4,8,3,1,0], [0,3,3,37,9,0], [0,0,1, 25, 71, 49], [ 0,0,0,2, 20, 181]]
- poly=Statsample::Bivariate::Polychoric.new(ct)

+ # Estimation of polychoric correlation using the joint method (slow)
+ poly=Statsample::Bivariate::Polychoric.new(ct, :method=>:joint, :name=>"Polychoric with joint")
  puts poly.summary
- puts poly.chi_square_independence
- puts poly.chi_square_model
- puts poly.chi_square_independence
+
+
+ # Uses polychoric series (not recommended)
+
+ poly=Statsample::Bivariate::Polychoric.new(ct, :method=>:polychoric_series, :name=>"Polychoric with polychoric series")
+ puts poly.summary
+
data/lib/distribution.rb CHANGED
@@ -13,4 +13,5 @@ module Distribution
  autoload(:F, 'distribution/f')
  autoload(:Normal, 'distribution/normal')
  autoload(:NormalBivariate, 'distribution/normalbivariate')
+ autoload(:NormalMultivariate, 'distribution/normalmultivariate')
  end
data/lib/distribution/normal.rb CHANGED
@@ -2,24 +2,24 @@ module Distribution
  # Calculate cdf and inverse cdf for Normal Distribution.
  # Uses Statistics2 module
  module Normal
- class << self
- # Return the P-value of the corresponding integral
- def p_value(pr)
- Statistics2.pnormaldist(pr)
- end
- # Normal cumulative distribution function (cdf).
- #
- # Returns the integral of normal distribution
- # over (-Infty, x].
- #
- def cdf(x)
- Statistics2.normaldist(x)
- end
- # Normal probability density function (pdf)
- # With x=0 and sigma=1
- def pdf(x)
- (1.0/Math::sqrt(2*Math::PI))*Math::exp(-(x**2/2.0))
- end
+ class << self
+ # Return the P-value of the corresponding integral
+ def p_value(pr)
+ Statistics2.pnormaldist(pr)
  end
+ # Normal cumulative distribution function (cdf).
+ #
+ # Returns the integral of normal distribution
+ # over (-Infty, x].
+ #
+ def cdf(x)
+ Statistics2.normaldist(x)
+ end
+ # Normal probability density function (pdf)
+ # With x=0 and sigma=1
+ def pdf(x)
+ (1.0/Math::sqrt(2*Math::PI))*Math::exp(-(x**2/2.0))
+ end
+ end
  end
  end
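
Distribution::Normal above only wraps Statistics2 plus a closed-form density, so its use is direct. A short sketch of how the three methods relate (values are approximate; require 'statsample' is assumed to set up the Distribution autoloads, as lib/distribution.rb does):

  require 'statsample'
  Distribution::Normal.cdf(1.96)       # => ~0.975, P(Z <= 1.96)
  Distribution::Normal.p_value(0.975)  # => ~1.96, the inverse of cdf
  Distribution::Normal.pdf(0)          # => ~0.3989, standard normal density at 0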
data/lib/distribution/normalbivariate.rb CHANGED
@@ -1,24 +1,42 @@
  module Distribution
- # Calculate pdf and cdf for bivariate normal distribution
+ # Calculate pdf and cdf for bivariate normal distribution.
+ #
+ # Pdf is easy to calculate, but CDF is not trivial. Several papers
+ # describe methods to calculate the integral.
+ #
+ # Three methods are implemented in this module:
+ # * Genz:: Used by default, with improvement to calculate p on rho > 0.95
+ # * Hull:: Port from a C++ code
+ # * Jantaravareerat:: Iterative (and slow)
+ #
+
  module NormalBivariate

  class << self
- SIDE=0.1
- LIMIT=5
- # Probability density function
+ SIDE=0.1 # :nodoc:
+ LIMIT=5 # :nodoc:
+
+ # Probability density function for a given x, y and rho value.
+ #
  # Source: http://en.wikipedia.org/wiki/Multivariate_normal_distribution
  def pdf(x,y, rho, sigma1=1.0, sigma2=1.0)
  (1.quo(2 * Math::PI * sigma1*sigma2 * Math::sqrt( 1 - rho**2 ))) *
  Math::exp(-(1.quo(2*(1-rho**2))) *
  ((x**2/sigma1) + (y**2/sigma2) - (2*rho*x*y).quo(sigma1*sigma2) ))
  end
- def f(x,y,aprime,bprime,rho)
+
+ def f(x,y,aprime,bprime,rho)
  r=aprime*(2*x-aprime)+bprime*(2*y-bprime)+2*rho*(x-aprime)*(y-bprime)
  Math::exp(r)
  end
+
+ # CDF for a given x, y and rho value.
+ # Uses Genz algorithm (cdf_genz method).
+ #
  def cdf(a,b,rho)
- cdf_math(a,b,rho)
+ cdf_genz(a,b,rho)
  end
+
  def sgn(x)
  if(x>=0)
  1
@@ -26,8 +44,13 @@ module Distribution
  -1
  end
  end
- # As http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html
- def cdf_math(a,b,rho)
+
+ # Normal cumulative distribution function (cdf) for a given x, y and rho.
+ # Based on Hull (1993, cited by Arne, 2003)
+ #
+ # References:
+ # * Arne, B. (2003). Financial Numerical Recipes in C++. Available on http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html
+ def cdf_hull(a,b,rho)
  #puts "a:#{a} - b:#{b} - rho:#{rho}"
  if (a<=0 and b<=0 and rho<=0)
  # puts "ruta 1"
@@ -64,17 +87,19 @@ module Distribution
  end
  raise "Should'nt be here! #{a} - #{b} #{rho}"
  end
- # Cdf for a given x and y
+
+ # CDF. Iterative method by Jantaravareerat (n/d)
+ #
  # Reference:
  # * Jantaravareerat, M. & Thomopoulos, N. (n/d). Tables for standard bivariate normal distribution

- def cdf_iterate(x,y,rho,s1=1,s2=1)
+ def cdf_jantaravareerat(x,y,rho,s1=1,s2=1)
  # Special cases
  return 1 if x>LIMIT and y>LIMIT
  return 0 if x<-LIMIT or y<-LIMIT
  return Distribution::Normal.cdf(y) if x>LIMIT
  return Distribution::Normal.cdf(x) if y>LIMIT
-
+
  #puts "x:#{x} - y:#{y}"
  x=-LIMIT if x<-LIMIT
  x=LIMIT if x>LIMIT
@@ -95,6 +120,159 @@ module Distribution
  end
  sum
  end
+ # Normal cumulative distribution function (cdf) for a given x, y and rho.
+ # Based on Fortran code by Alan Genz
+ #
+ # Original documentation
+ # DOUBLE PRECISION FUNCTION BVND( DH, DK, R )
+ # A function for computing bivariate normal probabilities.
+ #
+ # Alan Genz
+ # Department of Mathematics
+ # Washington State University
+ # Pullman, WA 99164-3113
+ # Email : alangenz_AT_wsu.edu
+ #
+ # This function is based on the method described by
+ # Drezner, Z and G.O. Wesolowsky, (1989),
+ # On the computation of the bivariate normal integral,
+ # Journal of Statist. Comput. Simul. 35, pp. 101-107,
+ # with major modifications for double precision, and for |R| close to 1.
+ #
+ # Original location:
+ # * http://www.math.wsu.edu/faculty/genz/software/fort77/tvpack.f
+ def cdf_genz(x,y,rho)
+ dh=-x
+ dk=-y
+ r=rho
+ twopi = 6.283185307179586
+
+ w=11.times.collect {[nil]*4}; x=11.times.collect {[nil]*4}
+
+ data=[
+ 0.1713244923791705E+00, -0.9324695142031522E+00,
+ 0.3607615730481384E+00, -0.6612093864662647E+00,
+ 0.4679139345726904E+00, -0.2386191860831970E+00]
+
+ (1..3).each {|i|
+ w[i][1]=data[(i-1)*2]
+ x[i][1]=data[(i-1)*2+1]
+
+ }
+ data=[
+ 0.4717533638651177E-01,-0.9815606342467191E+00,
+ 0.1069393259953183E+00,-0.9041172563704750E+00,
+ 0.1600783285433464E+00,-0.7699026741943050E+00,
+ 0.2031674267230659E+00,-0.5873179542866171E+00,
+ 0.2334925365383547E+00,-0.3678314989981802E+00,
+ 0.2491470458134029E+00,-0.1252334085114692E+00]
+ (1..6).each {|i|
+ w[i][2]=data[(i-1)*2]
+ x[i][2]=data[(i-1)*2+1]
+
+
+ }
+ data=[
+ 0.1761400713915212E-01,-0.9931285991850949E+00,
+ 0.4060142980038694E-01,-0.9639719272779138E+00,
+ 0.6267204833410906E-01,-0.9122344282513259E+00,
+ 0.8327674157670475E-01,-0.8391169718222188E+00,
+ 0.1019301198172404E+00,-0.7463319064601508E+00,
+ 0.1181945319615184E+00,-0.6360536807265150E+00,
+ 0.1316886384491766E+00,-0.5108670019508271E+00,
+ 0.1420961093183821E+00,-0.3737060887154196E+00,
+ 0.1491729864726037E+00,-0.2277858511416451E+00,
+ 0.1527533871307259E+00,-0.7652652113349733E-01]
+
+ (1..10).each {|i|
+ w[i][3]=data[(i-1)*2]
+ x[i][3]=data[(i-1)*2+1]
+
+
+ }
+
+
+ if ( r.abs < 0.3 )
+ ng = 1
+ lg = 3
+ elsif ( r.abs < 0.75 )
+ ng = 2
+ lg = 6
+ else
+ ng = 3
+ lg = 10
+ end
+
+
+ h = dh
+ k = dk
+ hk = h*k
+ bvn = 0
+ if ( r.abs < 0.925 )
+ if ( r.abs > 0 )
+ hs = ( h*h + k*k ).quo(2)
+ asr = Math::asin(r)
+ (1..lg).each do |i|
+ [-1,1].each do |is|
+ sn = Math::sin( asr *( is * x[i][ng] + 1 ).quo(2) )
+ bvn = bvn + w[i][ng] * Math::exp( ( sn*hk-hs ).quo( 1-sn*sn ) )
+ end # do
+ end # do
+ bvn = bvn*asr.quo( 2*twopi )
+ end # if
+ bvn = bvn + Distribution::Normal.cdf(-h) * Distribution::Normal.cdf(-k)
+
+
+ else # r.abs
+ if ( r < 0 )
+ k = -k
+ hk = -hk
+ end
+
+ if ( r.abs < 1 )
+ as = ( 1 - r )*( 1 + r )
+ a = Math::sqrt(as)
+ bs = ( h - k )**2
+ c = ( 4 - hk ).quo(8)
+ d = ( 12 - hk ).quo(16)
+ asr = -( bs.quo(as) + hk ).quo(2)
+ if ( asr > -100 )
+ bvn = a*Math::exp(asr) * ( 1 - c*( bs - as )*( 1 - d*bs.quo(5) ).quo(3) + c*d*as*as.quo(5) )
+ end
+ if ( -hk < 100 )
+ b = Math::sqrt(bs)
+ bvn = bvn - Math::exp( -hk.quo(2) ) * Math::sqrt(twopi)*Distribution::Normal.cdf(-b.quo(a))*b *
+ ( 1 - c*bs*( 1 - d*bs.quo(5) ).quo(3) )
+ end
+
+
+ a = a.quo(2)
+ (1..lg).each do |i|
+ [-1,1].each do |is|
+ xs = (a*( is*x[i][ng] + 1 ) )**2
+ rs = Math::sqrt( 1 - xs )
+ asr = -( bs/xs + hk ).quo(2)
+ if ( asr > -100 )
+ bvn = bvn + a*w[i][ng] * Math::exp( asr ) *
+ ( Math::exp( -hk*( 1 - rs ).quo(2*( 1 + rs ) ) ) .quo(rs) - ( 1 + c*xs*( 1 + d*xs ) ) )
+ end
+ end
+ end
+ bvn = -bvn/twopi
+ end
+
+ if ( r > 0 )
+ bvn = bvn + Distribution::Normal.cdf(-[h,k].max)
+ else
+ bvn = -bvn
+ if ( k > h )
+ bvn = bvn + Distribution::Normal.cdf(k) - Distribution::Normal.cdf(h)
+ end
+ end
+ end
+ bvn
+ end
+ private :f, :sgn
  end
  end
  end
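
With this change Distribution::NormalBivariate.cdf dispatches to the Genz routine by default, while the Hull port and the iterative Jantaravareerat version remain callable under their new names. A small sketch of the public interface; values are approximate and use the identity P(X<=0, Y<=0) = 1/4 + asin(rho)/(2*PI) for standard bivariate normals as a check:

  require 'statsample'
  Distribution::NormalBivariate.pdf(0, 0, 0.5)                  # density at the origin for rho=0.5
  Distribution::NormalBivariate.cdf(0, 0, 0)                    # => ~0.25
  Distribution::NormalBivariate.cdf(0, 0, 0.5)                  # => ~0.3333 (0.25 + 1/12)
  Distribution::NormalBivariate.cdf_hull(0, 0, 0.5)             # Hull variant, same value
  Distribution::NormalBivariate.cdf_jantaravareerat(0, 0, 0.5)  # iterative variant, slower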