statsample 0.6.2 → 0.6.3

data/History.txt CHANGED
@@ -1,3 +1,6 @@
+ === 0.6.3 / 2010-02-15
+ * Statsample::Bivariate::Polychoric now supports joint estimation.
+ * Some extra documentation and bug fixes
  === 0.6.2 / 2010-02-11
  * New Statsample::Bivariate::Polychoric. For implement: X2 and G2
  * New matrix.rb, for faster development of Contingence Tables and Correlation Matrix
data/Manifest.txt CHANGED
@@ -9,6 +9,7 @@ data/repeated_fields.csv
  data/test_binomial.csv
  data/tetmat_matrix.txt
  data/tetmat_test.txt
+ demo/correlation_matrix.rb
  demo/dominance_analysis_bootstrap.rb
  demo/dominanceanalysis.rb
  demo/multiple_regression.rb
@@ -47,7 +48,6 @@ lib/statsample/graph/svggraph.rb
  lib/statsample/graph/svghistogram.rb
  lib/statsample/graph/svgscatterplot.rb
  lib/statsample/histogram.rb
- lib/statsample/htmlreport.rb
  lib/statsample/matrix.rb
  lib/statsample/mle.rb
  lib/statsample/mle/logit.rb
data/README.txt CHANGED
@@ -3,66 +3,70 @@
  http://ruby-statsample.rubyforge.org/


- == DESCRIPTION:
+ == FEATURES:

- A suite for your basic and advanced statistics needs. Descriptive statistics, multiple regression, factorial analysis, dominance analysis, scale's reliability analysis, bivariate statistics and others procedures.
+ A suite for basic and advanced statistics. Includes:
+ * Descriptive statistics: frequencies, median, mean, standard error, skew, kurtosis (and many others).
+ * Imports and exports datasets from and to Excel, CSV and plain text files.
+ * Correlations: Pearson (r), Rho, Tetrachoric, Polychoric
+ * Regression: Simple, Multiple, Probit and Logit
+ * Factorial Analysis: Extraction (PCA and Principal Axis) and Rotation (Varimax and relatives)
+ * Dominance Analysis (Azen & Budescu)
+ * Sample calculation related formulas

- == FEATURES:
+ == DETAILED FEATURES:

  * Factorial Analysis. Principal Component Analysis and Principal Axis extraction, with orthogonal rotations (Varimax, Equimax, Quartimax)
  * Multiple Regression. Listwise analysis optimized with use of Alglib library. Pairwise analysis is executed on pure ruby and reports same values as SPSS
+ * Module Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma, tetrachoric and polychoric correlations. Includes methods to create correlation (pearson and tetrachoric) and covariance matrices
+ * Regression module provides linear regression methods
  * Dominance Analysis. Based on Budescu and Azen papers, <strong>DominanceAnalysis</strong> class can report dominance analysis for a sample and <strong>DominanceAnalysisBootstrap</strong> can execute bootstrap analysis to determine dominance stability, as recomended by Azen & Budescu (2003) link[http://psycnet.apa.org/journals/met/8/2/129/].
  * Classes for Vector, Datasets (set of Vectors) and Multisets (multiple datasets with same fields and type of vectors), and multiple methods to manipulate them
  * Module Codification, to help to codify open questions
  * Converters to and from database and csv files, and to output Mx and GGobi files
- * Module Bivariate provides covariance and pearson, spearman, point biserial, tau a, tau b, gamma and tetrachoric correlations. Include methods to create correlation (pearson and tetrachoric) and covariance matrices
  * Module Crosstab provides function to create crosstab for categorical data
- * Module HtmlReport provides methods to create a report for scale analysis and matrix correlation
- * Regression module provides linear regression methods
  * Reliability analysis provides functions to analyze scales. Class ItemAnalysis provides statistics like mean, standard deviation for a scale, Cronbach's alpha and standarized Cronbach's alpha, and for each item: mean, correlation with total scale, mean if deleted, Cronbach's alpha is deleted. With HtmlReport, graph the histogram of the scale and the Item Characteristic Curve for each item
  * Module SRS (Simple Random Sampling) provides a lot of functions to estimate standard error for several type of samples
  * Interfaces to gdchart, gnuplot and SVG::Graph


- == Example of use:
+ == Examples of use:

- # Read a CSV file, using '' and 'error' as missing values and ommiting 1 lines
- ds=Statsample::CSV.read('resultados_c1.csv',['','error'],1)
-
- # Create a new vector (column), calculating the mean of 13 vectors. Accept 1 missing values on one of the vectors
-
- indice_constructivismo_becker=ds.vector_mean(%w{fd_2_1 fd_2_2 fd_3_1 fd_3_2 fd_3_3},1)
-
- # Add the vector to the dataset
-
- ds.add_vector("ind_cons_becker",indice_constructivismo_becker)
-
- # Verify data. Vecto 'de_3_sex' must have values 'a' or 'b'. Dataset#verify returns and array with all errors
-
- t_sex=create_test("Sex must be a o b",'de_3_sex') {|v| v['de_3_sex']=="a" or v['de_3_sex']=="b")}
-
- p ds.verify(t_sexo)
-
-
- # Creates a new dataset, based on the names of vectors
-
- ds_software=ds.dup(%w{pe1n1 pe1n2 pe1n3 pe1n4 pe1n5 })
-
- # Creates an html report, add a correlation matrix with all the scale vectors and save the report into a file
- hr=Statsample::HtmlReport.new(ds_software,"correlations")
- hr.add_correlation_matrix()
- hr.save("correlation_matrix.html")
+ === Correlation matrix
+
+ require 'statsample'
+ a=1000.times.collect {rand}.to_scale
+ b=1000.times.collect {rand}.to_scale
+ c=1000.times.collect {rand}.to_scale
+ d=1000.times.collect {rand}.to_scale
+ ds={'a'=>a,'b'=>b,'c'=>c,'d'=>d}.to_dataset
+ cm=Statsample::Bivariate.correlation_matrix(ds)
+ puts cm.summary
+
+ === Tetrachoric correlation
+
+ require 'statsample'
+ a=40
+ b=10
+ c=20
+ d=30
+ tetra=Statsample::Bivariate::Tetrachoric.new(a,b,c,d)
+ puts tetra.summary

+ === Polychoric correlation
+
+ require 'statsample'
+ ct=Matrix[[58,52,1],[26,58,3],[8,12,9]]

- # Saves the new dataset
- Statsample::CSV.write(ds_software,"ds_software.csv",true)
+ poly=Statsample::Bivariate::Polychoric.new(ct)
+ puts poly.summary

  == REQUIREMENTS:

  Optional:

  * Plotting: gnuplot and rbgnuplot, SVG::Graph
- * Advanced Statistical: gsl and rb-gsl (http://rb-gsl.rubyforge.org/)
+ * Factorial analysis and polychoric correlation: gsl and rb-gsl (http://rb-gsl.rubyforge.org/)

  == DOWNLOAD
  * Gems and bugs report: http://rubyforge.org/projects/ruby-statsample/
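
A note on the tetrachoric example above: the four arguments passed to Statsample::Bivariate::Tetrachoric.new are the cell frequencies of a 2x2 contingency table for two dichotomized variables (a, b for the first row and c, d for the second). A minimal sketch, assuming the counts come from such a crosstab; the table labels are only illustrative:

  require 'statsample'
  # Counts of a 2x2 table:
  #            Y=0   Y=1
  #   X=0       40    10
  #   X=1       20    30
  a, b, c, d = 40, 10, 20, 30
  tetra = Statsample::Bivariate::Tetrachoric.new(a, b, c, d)
  puts tetra.summary   # prints the estimated tetrachoric correlation and related statistics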
data/demo/correlation_matrix.rb ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/ruby
+ $:.unshift(File.dirname(__FILE__)+'/../lib/')
+
+ require 'statsample'
+ a=1000.times.collect {rand}.to_scale
+ b=1000.times.collect {rand}.to_scale
+ c=1000.times.collect {rand}.to_scale
+ d=1000.times.collect {rand}.to_scale
+ ds={'a'=>a,'b'=>b,'c'=>c,'d'=>d}.to_dataset
+ cm=Statsample::Bivariate.correlation_matrix(ds)
+ puts cm.summary
data/demo/dominance_analysis_bootstrap.rb CHANGED
@@ -12,9 +12,5 @@ ds={'a'=>a,'b'=>b,'c'=>c,'d'=>d}.to_dataset

  ds['y']=ds.collect{|row| row['a']*5+row['b']*2+row['c']*2+row['d']*2+10*rand()}
  dab=Statsample::DominanceAnalysis::Bootstrap.new(ds, 'y')
- if HAS_GSL
- # Use Gsl if available (faster calculation)
- dab.regression_class=Statsample::Regression::Multiple::GslEngine
- end
  dab.bootstrap(100,nil,true)
  puts dab.summary
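
The demo no longer forces the GSL regression engine, but the lines removed above also show how to select it by hand. A sketch reusing that removed snippet, assuming rb-gsl is installed (so HAS_GSL is true) and that the Bootstrap class still exposes regression_class=:

  dab = Statsample::DominanceAnalysis::Bootstrap.new(ds, 'y')
  if HAS_GSL
    # Use the GSL engine if available (faster calculation)
    dab.regression_class = Statsample::Regression::Multiple::GslEngine
  end
  dab.bootstrap(100, nil, true)
  puts dab.summary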
data/demo/polychoric.rb CHANGED
@@ -2,12 +2,20 @@
  $:.unshift(File.dirname(__FILE__)+'/../lib/')

  require 'statsample'
- #ct=Matrix[[58,52,1],[26,58,3],[8,12,9]]
+ ct=Matrix[[58,52,1],[26,58,3],[8,12,9]]
+
+ # Estimation of polychoric correlation using two-step (default)
+ poly=Statsample::Bivariate::Polychoric.new(ct, :name=>"Polychoric with two-step")
+ puts poly.summary

- ct=Matrix[[30,1,0,0,0,0],[0,10,2,0,0,0], [0,4,8,3,1,0], [0,3,3,37,9,0], [0,0,1, 25, 71, 49], [ 0,0,0,2, 20, 181]]
- poly=Statsample::Bivariate::Polychoric.new(ct)

+ # Estimation of polychoric correlation using the joint method (slow)
+ poly=Statsample::Bivariate::Polychoric.new(ct, :method=>:joint, :name=>"Polychoric with joint")
  puts poly.summary
- puts poly.chi_square_independence
- puts poly.chi_square_model
- puts poly.chi_square_independence
+
+
+ # Uses polychoric series (not recommended)
+
+ poly=Statsample::Bivariate::Polychoric.new(ct, :method=>:polychoric_series, :name=>"Polychoric with polychoric series")
+ puts poly.summary
+
data/lib/distribution.rb CHANGED
@@ -13,4 +13,5 @@ module Distribution
  autoload(:F, 'distribution/f')
  autoload(:Normal, 'distribution/normal')
  autoload(:NormalBivariate, 'distribution/normalbivariate')
+ autoload(:NormalMultivariate, 'distribution/normalmultivariate')
  end
data/lib/distribution/normal.rb CHANGED
@@ -2,24 +2,24 @@ module Distribution
  # Calculate cdf and inverse cdf for Normal Distribution.
  # Uses Statistics2 module
  module Normal
- class << self
- # Return the P-value of the corresponding integral
- def p_value(pr)
- Statistics2.pnormaldist(pr)
- end
- # Normal cumulative distribution function (cdf).
- #
- # Returns the integral of normal distribution
- # over (-Infty, x].
- #
- def cdf(x)
- Statistics2.normaldist(x)
- end
- # Normal probability density function (pdf)
- # With x=0 and sigma=1
- def pdf(x)
- (1.0/Math::sqrt(2*Math::PI))*Math::exp(-(x**2/2.0))
- end
+ class << self
+ # Return the P-value of the corresponding integral
+ def p_value(pr)
+ Statistics2.pnormaldist(pr)
  end
+ # Normal cumulative distribution function (cdf).
+ #
+ # Returns the integral of normal distribution
+ # over (-Infty, x].
+ #
+ def cdf(x)
+ Statistics2.normaldist(x)
+ end
+ # Normal probability density function (pdf)
+ # With x=0 and sigma=1
+ def pdf(x)
+ (1.0/Math::sqrt(2*Math::PI))*Math::exp(-(x**2/2.0))
+ end
+ end
  end
  end
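
Distribution::Normal above only wraps Statistics2 plus a closed-form density, so its use is direct. A short sketch of how the three methods relate (values are approximate; require 'statsample' is assumed to set up the Distribution autoloads, as lib/distribution.rb does):

  require 'statsample'
  Distribution::Normal.cdf(1.96)       # => ~0.975, P(Z <= 1.96)
  Distribution::Normal.p_value(0.975)  # => ~1.96, the inverse of cdf
  Distribution::Normal.pdf(0)          # => ~0.3989, standard normal density at 0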
data/lib/distribution/normalbivariate.rb CHANGED
@@ -1,24 +1,42 @@
  module Distribution
- # Calculate pdf and cdf for bivariate normal distribution
+ # Calculate pdf and cdf for bivariate normal distribution.
+ #
+ # Pdf is easy to calculate, but CDF is not trivial. Several papers
+ # describe methods to calculate the integral.
+ #
+ # Three methods are implemented in this module:
+ # * Genz:: Used by default, with improvement to calculate p on rho > 0.95
+ # * Hull:: Port from a C++ code
+ # * Jantaravareerat:: Iterative (and slow)
+ #
+
  module NormalBivariate

  class << self
- SIDE=0.1
- LIMIT=5
- # Probability density function
+ SIDE=0.1 # :nodoc:
+ LIMIT=5 # :nodoc:
+
+ # Probability density function for a given x, y and rho value.
+ #
  # Source: http://en.wikipedia.org/wiki/Multivariate_normal_distribution
  def pdf(x,y, rho, sigma1=1.0, sigma2=1.0)
  (1.quo(2 * Math::PI * sigma1*sigma2 * Math::sqrt( 1 - rho**2 ))) *
  Math::exp(-(1.quo(2*(1-rho**2))) *
  ((x**2/sigma1) + (y**2/sigma2) - (2*rho*x*y).quo(sigma1*sigma2) ))
  end
- def f(x,y,aprime,bprime,rho)
+
+ def f(x,y,aprime,bprime,rho)
  r=aprime*(2*x-aprime)+bprime*(2*y-bprime)+2*rho*(x-aprime)*(y-bprime)
  Math::exp(r)
  end
+
+ # CDF for a given x, y and rho value.
+ # Uses Genz algorithm (cdf_genz method).
+ #
  def cdf(a,b,rho)
- cdf_math(a,b,rho)
+ cdf_genz(a,b,rho)
  end
+
  def sgn(x)
  if(x>=0)
  1
@@ -26,8 +44,13 @@ module Distribution
  -1
  end
  end
- # As http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html
- def cdf_math(a,b,rho)
+
+ # Normal cumulative distribution function (cdf) for a given x, y and rho.
+ # Based on Hull (1993, cited by Arne, 2003)
+ #
+ # References:
+ # * Arne, B. (2003). Financial Numerical Recipes in C++. Available on http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html
+ def cdf_hull(a,b,rho)
  #puts "a:#{a} - b:#{b} - rho:#{rho}"
  if (a<=0 and b<=0 and rho<=0)
  # puts "ruta 1"
@@ -64,17 +87,19 @@ module Distribution
  end
  raise "Should'nt be here! #{a} - #{b} #{rho}"
  end
- # Cdf for a given x and y
+
+ # CDF. Iterative method by Jantaravareerat (n/d)
+ #
  # Reference:
  # * Jantaravareerat, M. & Thomopoulos, N. (n/d). Tables for standard bivariate normal distribution

- def cdf_iterate(x,y,rho,s1=1,s2=1)
+ def cdf_jantaravareerat(x,y,rho,s1=1,s2=1)
  # Special cases
  return 1 if x>LIMIT and y>LIMIT
  return 0 if x<-LIMIT or y<-LIMIT
  return Distribution::Normal.cdf(y) if x>LIMIT
  return Distribution::Normal.cdf(x) if y>LIMIT
-
+
  #puts "x:#{x} - y:#{y}"
  x=-LIMIT if x<-LIMIT
  x=LIMIT if x>LIMIT
@@ -95,6 +120,159 @@ module Distribution
  end
  sum
  end
+ # Normal cumulative distribution function (cdf) for a given x, y and rho.
+ # Based on Fortran code by Alan Genz
+ #
+ # Original documentation
+ # DOUBLE PRECISION FUNCTION BVND( DH, DK, R )
+ # A function for computing bivariate normal probabilities.
+ #
+ # Alan Genz
+ # Department of Mathematics
+ # Washington State University
+ # Pullman, WA 99164-3113
+ # Email : alangenz_AT_wsu.edu
+ #
+ # This function is based on the method described by
+ # Drezner, Z and G.O. Wesolowsky, (1989),
+ # On the computation of the bivariate normal integral,
+ # Journal of Statist. Comput. Simul. 35, pp. 101-107,
+ # with major modifications for double precision, and for |R| close to 1.
+ #
+ # Original location:
+ # * http://www.math.wsu.edu/faculty/genz/software/fort77/tvpack.f
+ def cdf_genz(x,y,rho)
+ dh=-x
+ dk=-y
+ r=rho
+ twopi = 6.283185307179586
+
+ w=11.times.collect {[nil]*4}; x=11.times.collect {[nil]*4}
+
+ data=[
+ 0.1713244923791705E+00, -0.9324695142031522E+00,
+ 0.3607615730481384E+00, -0.6612093864662647E+00,
+ 0.4679139345726904E+00, -0.2386191860831970E+00]
+
+ (1..3).each {|i|
+ w[i][1]=data[(i-1)*2]
+ x[i][1]=data[(i-1)*2+1]
+
+ }
+ data=[
+ 0.4717533638651177E-01,-0.9815606342467191E+00,
+ 0.1069393259953183E+00,-0.9041172563704750E+00,
+ 0.1600783285433464E+00,-0.7699026741943050E+00,
+ 0.2031674267230659E+00,-0.5873179542866171E+00,
+ 0.2334925365383547E+00,-0.3678314989981802E+00,
+ 0.2491470458134029E+00,-0.1252334085114692E+00]
+ (1..6).each {|i|
+ w[i][2]=data[(i-1)*2]
+ x[i][2]=data[(i-1)*2+1]
+
+
+ }
+ data=[
+ 0.1761400713915212E-01,-0.9931285991850949E+00,
+ 0.4060142980038694E-01,-0.9639719272779138E+00,
+ 0.6267204833410906E-01,-0.9122344282513259E+00,
+ 0.8327674157670475E-01,-0.8391169718222188E+00,
+ 0.1019301198172404E+00,-0.7463319064601508E+00,
+ 0.1181945319615184E+00,-0.6360536807265150E+00,
+ 0.1316886384491766E+00,-0.5108670019508271E+00,
+ 0.1420961093183821E+00,-0.3737060887154196E+00,
+ 0.1491729864726037E+00,-0.2277858511416451E+00,
+ 0.1527533871307259E+00,-0.7652652113349733E-01]
+
+ (1..10).each {|i|
+ w[i][3]=data[(i-1)*2]
+ x[i][3]=data[(i-1)*2+1]
+
+
+ }
+
+
+ if ( r.abs < 0.3 )
+ ng = 1
+ lg = 3
+ elsif ( r.abs < 0.75 )
+ ng = 2
+ lg = 6
+ else
+ ng = 3
+ lg = 10
+ end
+
+
+ h = dh
+ k = dk
+ hk = h*k
+ bvn = 0
+ if ( r.abs < 0.925 )
+ if ( r.abs > 0 )
+ hs = ( h*h + k*k ).quo(2)
+ asr = Math::asin(r)
+ (1..lg).each do |i|
+ [-1,1].each do |is|
+ sn = Math::sin( asr *( is * x[i][ng] + 1 ).quo(2) )
+ bvn = bvn + w[i][ng] * Math::exp( ( sn*hk-hs ).quo( 1-sn*sn ) )
+ end # do
+ end # do
+ bvn = bvn*asr.quo( 2*twopi )
+ end # if
+ bvn = bvn + Distribution::Normal.cdf(-h) * Distribution::Normal.cdf(-k)
+
+
+ else # r.abs
+ if ( r < 0 )
+ k = -k
+ hk = -hk
+ end
+
+ if ( r.abs < 1 )
+ as = ( 1 - r )*( 1 + r )
+ a = Math::sqrt(as)
+ bs = ( h - k )**2
+ c = ( 4 - hk ).quo(8)
+ d = ( 12 - hk ).quo(16)
+ asr = -( bs.quo(as) + hk ).quo(2)
+ if ( asr > -100 )
+ bvn = a*Math::exp(asr) * ( 1 - c*( bs - as )*( 1 - d*bs.quo(5) ).quo(3) + c*d*as*as.quo(5) )
+ end
+ if ( -hk < 100 )
+ b = Math::sqrt(bs)
+ bvn = bvn - Math::exp( -hk.quo(2) ) * Math::sqrt(twopi)*Distribution::Normal.cdf(-b.quo(a))*b *
+ ( 1 - c*bs*( 1 - d*bs.quo(5) ).quo(3) )
+ end
+
+
+ a = a.quo(2)
+ (1..lg).each do |i|
+ [-1,1].each do |is|
+ xs = (a*( is*x[i][ng] + 1 ) )**2
+ rs = Math::sqrt( 1 - xs )
+ asr = -( bs/xs + hk ).quo(2)
+ if ( asr > -100 )
+ bvn = bvn + a*w[i][ng] * Math::exp( asr ) *
+ ( Math::exp( -hk*( 1 - rs ).quo(2*( 1 + rs ) ) ) .quo(rs) - ( 1 + c*xs*( 1 + d*xs ) ) )
+ end
+ end
+ end
+ bvn = -bvn/twopi
+ end
+
+ if ( r > 0 )
+ bvn = bvn + Distribution::Normal.cdf(-[h,k].max)
+ else
+ bvn = -bvn
+ if ( k > h )
+ bvn = bvn + Distribution::Normal.cdf(k) - Distribution::Normal.cdf(h)
+ end
+ end
+ end
+ bvn
+ end
+ private :f, :sgn
  end
  end
  end
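
With this change Distribution::NormalBivariate.cdf dispatches to the Genz routine by default, while the Hull port and the iterative Jantaravareerat version remain callable under their new names. A small sketch of the public interface; values are approximate and use the identity P(X<=0, Y<=0) = 1/4 + asin(rho)/(2*PI) for standard bivariate normals as a check:

  require 'statsample'
  Distribution::NormalBivariate.pdf(0, 0, 0.5)                  # density at the origin for rho=0.5
  Distribution::NormalBivariate.cdf(0, 0, 0)                    # => ~0.25
  Distribution::NormalBivariate.cdf(0, 0, 0.5)                  # => ~0.3333 (0.25 + 1/12)
  Distribution::NormalBivariate.cdf_hull(0, 0, 0.5)             # Hull variant, same value
  Distribution::NormalBivariate.cdf_jantaravareerat(0, 0, 0.5)  # iterative variant, slower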