statsample 0.3.0

Files changed (59)
  1. data/History.txt +79 -0
  2. data/Manifest.txt +56 -0
  3. data/README.txt +77 -0
  4. data/Rakefile +22 -0
  5. data/bin/statsample +2 -0
  6. data/demo/benchmark.rb +52 -0
  7. data/demo/chi-square.rb +44 -0
  8. data/demo/dice.rb +13 -0
  9. data/demo/distribution_t.rb +95 -0
  10. data/demo/graph.rb +9 -0
  11. data/demo/item_analysis.rb +30 -0
  12. data/demo/mean.rb +81 -0
  13. data/demo/proportion.rb +57 -0
  14. data/demo/sample_test.csv +113 -0
  15. data/demo/strata_proportion.rb +152 -0
  16. data/demo/stratum.rb +141 -0
  17. data/lib/spss.rb +131 -0
  18. data/lib/statsample.rb +216 -0
  19. data/lib/statsample/anova.rb +74 -0
  20. data/lib/statsample/bivariate.rb +255 -0
  21. data/lib/statsample/chidistribution.rb +39 -0
  22. data/lib/statsample/codification.rb +120 -0
  23. data/lib/statsample/converters.rb +338 -0
  24. data/lib/statsample/crosstab.rb +122 -0
  25. data/lib/statsample/dataset.rb +526 -0
  26. data/lib/statsample/dominanceanalysis.rb +259 -0
  27. data/lib/statsample/dominanceanalysis/bootstrap.rb +126 -0
  28. data/lib/statsample/graph/gdchart.rb +45 -0
  29. data/lib/statsample/graph/svgboxplot.rb +108 -0
  30. data/lib/statsample/graph/svggraph.rb +181 -0
  31. data/lib/statsample/graph/svghistogram.rb +208 -0
  32. data/lib/statsample/graph/svgscatterplot.rb +111 -0
  33. data/lib/statsample/htmlreport.rb +232 -0
  34. data/lib/statsample/multiset.rb +281 -0
  35. data/lib/statsample/regression.rb +522 -0
  36. data/lib/statsample/reliability.rb +235 -0
  37. data/lib/statsample/resample.rb +20 -0
  38. data/lib/statsample/srs.rb +159 -0
  39. data/lib/statsample/test.rb +25 -0
  40. data/lib/statsample/vector.rb +759 -0
  41. data/test/_test_chart.rb +58 -0
  42. data/test/test_anova.rb +31 -0
  43. data/test/test_codification.rb +59 -0
  44. data/test/test_crosstab.rb +55 -0
  45. data/test/test_csv.csv +7 -0
  46. data/test/test_csv.rb +27 -0
  47. data/test/test_dataset.rb +293 -0
  48. data/test/test_ggobi.rb +42 -0
  49. data/test/test_multiset.rb +98 -0
  50. data/test/test_regression.rb +108 -0
  51. data/test/test_reliability.rb +32 -0
  52. data/test/test_resample.rb +23 -0
  53. data/test/test_srs.rb +14 -0
  54. data/test/test_statistics.rb +152 -0
  55. data/test/test_stratified.rb +19 -0
  56. data/test/test_svg_graph.rb +63 -0
  57. data/test/test_vector.rb +265 -0
  58. data/test/test_xls.rb +32 -0
  59. metadata +158 -0
data/History.txt ADDED
@@ -0,0 +1,79 @@
1
+ === 0.3.0 / 2009-08-02
2
+
3
+ * Statsample renamed to Statsample
4
+ * Optimization extension goes to another gem: ruby-statsample-optimization
5
+
6
+ === 0.2.0 / 2009-08-01
7
+
8
+ * One Way Anova on Statsample::Anova::OneWay
9
+ * Dominance Analysis!!!! The one and only reason to develop a Multiple Regression on pure ruby.
10
+ * Multiple Regression on Multiple Regression module. Pairwise (pure ruby) or MultipleRegressionPairwise and Listwise (optimized) on MultipleRegressionAlglib and
11
+ * New Dataset#to_gsl_matrix, #from_to,#[..],#bootstrap,#vector_missing_values, #vector_count_characters, #each_with_index, #collect_with_index
12
+ * New Vector#box_cox_transformation
13
+ * Module Correlation renamed to Bivariate
14
+ * Some fancy methods and classes to create Summaries
15
+ * Some documentation about Algorithm used on doc_latex
16
+ * Deleted 'distributions' extension. Ruby/GSL has all the pdf and cdf you ever need.
17
+ * Tests work without any dependency. Only nags about missing deps.
18
+ * Test for MultipleRegression, Anova, Excel, Bivariate.correlation_matrix and many others
19
+
20
+ === 0.1.9 / 2009-05-22
21
+
22
+ * Class Vector: new method vector_standarized_pop, []=, min,max
23
+ * Class Dataset: global variable $RUBY_SS_ROW stores the row number on each() and related methods. dup() with argument returns a copy of the dataset only for given fields. New methods: standarize, vector_mean, collect, verify,collect_matrix
24
+ * Module Correlation: new methods covariance, t_pearson, t_r, prop_pearson, covariance_matrix, correlation_matrix, correlation_probability_matrix
25
+ * Module SRS: New methods estimation_n0 and estimation_n
26
+ * Module Reliability: new ItemCharacteristicCurve class
27
+ * New HtmlReport class
28
+ * New experimental SPSS Class.
29
+ * Converters: Module CSV with new options. Added write() method for GGobi module
30
+ * New Mx exporter (http://www.vcu.edu/mx/)
31
+ * Class SimpleRegression: new methods standard error
32
+
33
+ * Added tests for regression and reliability, Vector#vector_mean, Dataset#dup (partial) and Dataset#verify
34
+
35
+
36
+ === 0.1.8 / 2008-12-10
37
+ * Added Regression and Reliability modules
38
+ * Class Vector: added methods vector_standarized, recode, inspect, ranked
39
+ * Class Dataset: added methods vector_by_calculation, vector_sum, filter_field
40
+ * Module Correlation: added methods like spearman, point biserial and tau-b
41
+ * Added tests for Vector#ranked, Vector#vector_standarized, Vector#sum_of_squared_deviation, Dataset#vector_by_calculation, Dataset#vector_sum, Dataset#filter_field and various test for Correlation module
42
+ * Added demos: item_analysis and sample_test
43
+
44
+ === 0.1.7 / 2008-10-1
45
+ * New module for codification
46
+ * ...
47
+ === 0.1.6 / 2008-09-26
48
+ * New modules for SRS and stratified sampling
49
+ * Statsample::Database for reading from and writing to databases.
50
+ You can use Database and CSV in tandem for mass editing and reimport
51
+ of databases
52
+
53
+ === 0.1.5 / 2008-08-29
54
+ * New extension statsampleopt for optimizing some functions on Statsample submodules
55
+ * New submodules Correlation and Test
56
+
57
+ === 0.1.4 / 2008-08-27
58
+
59
+ * New extension, with cdf functions for
60
+ chi-square, t, gamma and normal distributions.
61
+ Based on dcdflib (http://www.netlib.org/random/)
62
+ Also, has a function to calculate the tail for a noncentral T distribution
63
+
64
+ === 0.1.3 / 2008-08-22
65
+
66
+ * Operational versions of Vector, Dataset, Crosstab and Resample
67
+ * Read and write CSV files
68
+ * Calculate chi-square for 2 matrices
69
+
70
+ === 0.1.1 - 0.1.2 / 2008-08-18
71
+
72
+ * Included several methods on Ruby::Type classes
73
+ * Organized dirs with sow
74
+
75
+
76
+ === 0.1.0 / 2008-08-12
77
+
78
+ * First version.
79
+
data/Manifest.txt ADDED
@@ -0,0 +1,56 @@
1
+ History.txt
2
+ Manifest.txt
3
+ README.txt
4
+ Rakefile
5
+ bin/statsample
6
+ demo/benchmark.rb
7
+ demo/chi-square.rb
8
+ demo/dice.rb
9
+ demo/distribution_t.rb
10
+ demo/graph.rb
11
+ demo/item_analysis.rb
12
+ demo/mean.rb
13
+ demo/proportion.rb
14
+ demo/sample_test.csv
15
+ demo/strata_proportion.rb
16
+ demo/stratum.rb
17
+ lib/statsample.rb
18
+ lib/statsample/anova.rb
19
+ lib/statsample/bivariate.rb
20
+ lib/statsample/chidistribution.rb
21
+ lib/statsample/codification.rb
22
+ lib/statsample/converters.rb
23
+ lib/statsample/crosstab.rb
24
+ lib/statsample/dataset.rb
25
+ lib/statsample/dominanceanalysis.rb
26
+ lib/statsample/dominanceanalysis/bootstrap.rb
27
+ lib/statsample/graph/gdchart.rb
28
+ lib/statsample/graph/svggraph.rb
29
+ lib/statsample/graph/svgboxplot.rb
30
+ lib/statsample/graph/svghistogram.rb
31
+ lib/statsample/graph/svgscatterplot.rb
32
+ lib/statsample/htmlreport.rb
33
+ lib/statsample/multiset.rb
34
+ lib/statsample/regression.rb
35
+ lib/statsample/reliability.rb
36
+ lib/statsample/resample.rb
37
+ lib/statsample/srs.rb
38
+ lib/statsample/test.rb
39
+ lib/statsample/vector.rb
40
+ lib/spss.rb
41
+ test/_test_chart.rb
42
+ test/test_anova.rb
43
+ test/test_codification.rb
44
+ test/test_crosstab.rb
45
+ test/test_csv.csv
46
+ test/test_csv.rb
47
+ test/test_dataset.rb
48
+ test/test_ggobi.rb
49
+ test/test_multiset.rb
50
+ test/test_regression.rb
51
+ test/test_reliability.rb
52
+ test/test_resample.rb
53
+ test/test_statistics.rb
54
+ test/test_stratified.rb
55
+ test/test_svg_graph.rb
56
+ test/test_vector.rb
data/README.txt ADDED
@@ -0,0 +1,77 @@
1
+ = Statsample
2
+
3
+ * http://rubyforge.org/projects/ruby-statsample/
4
+ * http://code.google.com/p/ruby-statsample/
5
+
6
+ == DESCRIPTION:
7
+
8
+ This package processes files and databases for statistical purposes, with a focus on validation, recoding and estimation of parameters for several types of samples (simple random, stratified and multistage sampling).
9
+
10
+ == FEATURES:
11
+
12
+ * Multiple Regression. Listwise analysis is optimized using the Alglib library. Pairwise analysis runs in pure Ruby and reports the same values as SPSS
13
+ * Dominance Analysis. Based on the papers by Budescu and Azen, the DominanceAnalysis class reports dominance analysis for a sample, and DominanceAnalysis::Bootstrap runs a bootstrap analysis to determine dominance stability, as recommended by Azen & Budescu (2003).
14
+ * Classes for Vector, Datasets (set of Vectors) and Multisets (multiple datasets with same fields and type of vectors), and multiple methods to manipulate them
15
+ * Module Codification, to help code open-ended questions
16
+ * Converters to and from database and csv files, and to output Mx and GGobi files
17
+ * Module Correlation provides covariance and Pearson, Spearman, point-biserial, tau-a, tau-b and gamma correlations. Includes methods to create correlation and covariance matrices
18
+ * Module Crosstab provides functions to create crosstabs for categorical data
19
+ * Module HtmlReport provides methods to create a report for scale analysis and matrix correlation
20
+ * Regression module provides linear regression methods
21
+ * Reliability analysis provides functions to analyze scales. Class ItemAnalysis provides scale-level statistics (mean, standard deviation, alpha and standardized alpha) and, for each item, the mean, correlation with the total scale, mean if deleted and alpha if deleted. With HtmlReport, it graphs the histogram of the scale and the Item Characteristic Curve for each item
22
+ * Module SRS (Simple Random Sampling) provides many functions to estimate the standard error for several types of samples
23
+ * Interfaces to gdchart, gnuplot and SVG::Graph
24
+
25
+
26
+ == Example of use:
27
+
28
+ # Read a CSV file, using '' and 'error' as missing values and omitting 1 line
29
+ ds=Statsample::CSV.read('resultados_c1.csv',['','error'],1)
30
+
31
+ # Create a new vector (column), calculating the mean of 13 vectors. Accepts 1 missing value on one of the vectors
32
+
33
+ indice_constructivismo_becker=ds.vector_mean(%w{fd_2_1 fd_2_2 fd_3_1 fd_3_2 fd_3_3},1)
34
+
35
+ # Add the vector to the dataset
36
+
37
+ ds.add_vector("ind_cons_becker",indice_constructivismo_becker)
38
+
39
+ # Verify data. Vector 'de_3_sex' must have values 'a' or 'b'. Dataset#verify returns an array with all errors
40
+
41
+ t_sex=create_test("Sex must be a or b",'de_3_sex') {|v| v['de_3_sex']=="a" or v['de_3_sex']=="b"}
42
+
43
+ p ds.verify(t_sex)
44
+
45
+
46
+ # Creates a new dataset, based on the names of vectors
47
+
48
+ ds_software=ds.dup(%w{pe1n1 pe1n2 pe1n3 pe1n4 pe1n5 })
49
+
50
+ # Creates an HTML report, adds a correlation matrix with all the scale vectors and saves the report to a file
51
+ hr=Statsample::HtmlReport.new(ds_software,"correlations")
52
+ hr.add_correlation_matrix()
53
+ hr.save("correlation_matrix.html")
54
+
55
+
56
+ # Saves the new dataset
57
+ Statsample::CSV.write(ds_software,"ds_software.csv",true)
58
+
59
+ == REQUIREMENTS:
60
+
61
+ Optional:
62
+
63
+ * Plotting: gnuplot and rbgnuplot, SVG::Graph
64
+ * Advanced Statistical: gsl and rb-gsl (http://rb-gsl.rubyforge.org/)
65
+
66
+ == INSTALL:
67
+
68
+ gem install ruby-statsample
69
+
70
+ For optimization on *nix env
71
+
72
+ gem install ruby-statsample-optimization
73
+
74
+
75
+ == LICENSE:
76
+
77
+ GPL-2
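The README example calls ds.vector_mean with a list of fields and a tolerated number of missing values. The sketch below illustrates that idea on plain hashes. It is an assumption about the intended behaviour, not statsample's implementation: row_mean, its arguments and the sample rows are all hypothetical.

```ruby
# Hypothetical stand-in for a row-wise mean that tolerates missing values.
# This is NOT statsample's Dataset#vector_mean; it only sketches the idea.
def row_mean(row, fields, max_missing = 0, missing_values = [nil, '', 'error'])
  values = fields.map { |f| row[f] }
  valid = values.reject { |v| missing_values.include?(v) }
  # Too many missing entries: the row gets no mean at all.
  return nil if values.size - valid.size > max_missing
  return nil if valid.empty?
  valid.sum.to_f / valid.size
end

# Made-up rows that reuse field names from the README example.
rows = [
  { 'fd_2_1' => 4,  'fd_2_2' => 5,       'fd_3_1' => 3 },
  { 'fd_2_1' => 2,  'fd_2_2' => 'error', 'fd_3_1' => 4 },
  { 'fd_2_1' => '', 'fd_2_2' => 'error', 'fd_3_1' => 1 }
]
fields = %w[fd_2_1 fd_2_2 fd_3_1]

# Allow at most one missing value per row.
p rows.map { |r| row_mean(r, fields, 1) }
```

A row with more missing entries than the tolerance yields nil, so the resulting column keeps one entry per row.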
data/Rakefile ADDED
@@ -0,0 +1,22 @@
1
+ #!/usr/bin/ruby
2
+ # -*- ruby -*-
3
+
4
+ require 'rubygems'
5
+ require 'hoe'
6
+ require './lib/statsample'
7
+
8
+ if File.exist? './local_rakefile.rb'
9
+ require './local_rakefile'
10
+ end
11
+
12
+ Hoe.spec('statsample') do |p|
13
+ p.version=Statsample::VERSION
14
+ p.rubyforge_name = "ruby-statsample"
15
+ p.developer('Claudio Bustos', 'clbustos@gmail.com')
16
+ p.extra_deps << ["spreadsheet","=0.6.4"] << "svg-graph"
17
+ p.clean_globs << "test/images/*"
18
+ # p.rdoc_pattern = /^(lib|bin|ext\/distributions)|txt$/
19
+ end
20
+
21
+
22
+ # vim: syntax=Ruby
data/bin/statsample ADDED
@@ -0,0 +1,2 @@
1
+ #!/usr/bin/ruby
2
+ puts "Nothing today!"
data/demo/benchmark.rb ADDED
@@ -0,0 +1,52 @@
1
+ require File.dirname(__FILE__)+'/../lib/statsample.rb'
2
+ require 'benchmark'
3
+ v=(0..10000).collect{|n|
4
+ r=rand(100)
5
+ if(r<90)
6
+ r
7
+ else
8
+ nil
9
+ end
10
+ }.to_vector
11
+ v.missing_values=[5,10,20]
12
+ v.type=:scale
13
+
14
+ n = 300
15
+ if (false)
16
+ Benchmark.bm(7) do |bench|
17
+ bench.report("missing or") { for i in 1..n; v.each {|x| !(x.nil? or v.missing_values.include? x) }; end }
18
+ bench.report("missing and") { for i in 1..n;v.each {|x| !x.nil? and !v.missing_values.include? x } ; end }
19
+ end
20
+ end
21
+ if (false)
22
+ Benchmark.bm(7) do |bench|
23
+ bench.report("true") { Statsample::OPTIMIZED=true; for i in 1..n; v.set_valid_data ; end }
24
+ bench.report("false") { Statsample::OPTIMIZED=false; for i in 1..n; v.set_valid_data ; end }
25
+ end
26
+ end
27
+
28
+
29
+ if (true)
30
+ Benchmark.bm(7) do |x|
31
+ x.report("mean") { for i in 1..n; v.mean; end }
32
+ x.report("slow_mean") { for i in 1..n; v.slow_mean; end }
33
+
34
+ end
35
+
36
+ Benchmark.bm(7) do |x|
37
+ x.report("variance_sample") { for i in 1..n; v.variance_sample; end }
38
+ x.report("variance_slow") { for i in 1..n; v.slow_variance_sample; end }
39
+
40
+ end
41
+
42
+
43
+ Benchmark.bm(7) do |x|
44
+
45
+ x.report("Nominal.frequencies") { for i in 1..n; v.frequencies; end }
46
+ x.report("Nominal.frequencies_slow") { for i in 1..n; v.frequencies_slow; end }
47
+
48
+ x.report("_frequencies") { for i in 1..n; Statsample._frequencies(v.valid_data); end }
49
+
50
+ end
51
+
52
+ end
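demo/benchmark.rb times fast and slow variants of the same statistic with Benchmark.bm. The same pattern can be tried without the gem. The sketch below is only an illustration: mean_sum, mean_loop and sample_data are hypothetical helpers that work on plain arrays, not statsample methods.

```ruby
require 'benchmark'

# Hypothetical helpers; neither is part of statsample.
def mean_sum(values)
  values.sum.to_f / values.size
end

def mean_loop(values)
  total = 0.0
  values.each { |x| total += x }
  total / values.size
end

# Same shape as the demo data: roughly 10% of the entries are nil.
# Here the nils are simply dropped before any statistic is computed.
def sample_data(size, rng = Random.new(1))
  raw = Array.new(size) do
    r = rng.rand(100)
    r < 90 ? r : nil
  end
  raw.compact
end

data = sample_data(10_001)
n = 300

# The argument to bm is the label column width.
Benchmark.bm(10) do |bench|
  bench.report("mean_sum")  { n.times { mean_sum(data) } }
  bench.report("mean_loop") { n.times { mean_loop(data) } }
end
```

Seeding the generator keeps the data identical between runs, so timing differences come from the implementations rather than the input.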
data/demo/chi-square.rb ADDED
@@ -0,0 +1,44 @@
1
+ require File.dirname(__FILE__)+'/../lib/statsample'
2
+ require 'rbgsl'
3
+ require 'statsample/resample'
4
+ require 'statsample/test'
5
+ require 'matrix'
6
+ ideal=Matrix[[30,30,40]]
7
+ tests=10000
8
+ monte=Statsample::Resample.repeat_and_save(tests) {
9
+ observed=[0,0,0]
10
+ (1..100).each{|i|
11
+ r=rand(100)
12
+ if r<30
13
+ observed[0]+=1
14
+ elsif r<60
15
+ observed[1]+=1
16
+ else
17
+ observed[2]+=1
18
+ end
19
+ }
20
+ Statsample::Test::chi_square(Matrix[observed],ideal)
21
+ }
22
+
23
+
24
+
25
+ v=monte.to_vector(:scale)
26
+
27
+ x=[]
28
+ y=[]
29
+ y2=[]
30
+ y3=[]
31
+ y4=[]
32
+ prev=0
33
+ prev_chi=0
34
+ v.frequencies.sort.each{|k,v1|
35
+ x.push(k)
36
+ y.push(prev+v1)
37
+ prev=prev+v1
38
+ cdf_chi=GSL::Cdf.chisq_P(k,2)
39
+ y2.push(cdf_chi)
40
+ y4.push(prev.quo(tests))
41
+ }
42
+
43
+
44
+ GSL::graph(GSL::Vector.alloc(x), GSL::Vector.alloc(y2), GSL::Vector.alloc(y4))
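demo/chi-square.rb compares a simulated distribution of chi-square statistics with the theoretical CDF for 2 degrees of freedom. A GSL-free sketch of the same comparison follows. It assumes the Pearson statistic and uses the closed-form CDF that holds only for 2 degrees of freedom. chi_square_stat, simulate_counts and chisq_cdf_df2 are hypothetical names, not statsample or GSL calls.

```ruby
# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
def chi_square_stat(observed, expected)
  observed.zip(expected).sum { |o, e| (o - e)**2 / e.to_f }
end

# Draw `draws` values into three categories with probabilities 0.3, 0.3, 0.4,
# mirroring the thresholds used in the demo.
def simulate_counts(draws, rng)
  counts = [0, 0, 0]
  draws.times do
    r = rng.rand(100)
    if r < 30
      counts[0] += 1
    elsif r < 60
      counts[1] += 1
    else
      counts[2] += 1
    end
  end
  counts
end

# For 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2).
def chisq_cdf_df2(x)
  1.0 - Math.exp(-x / 2.0)
end

rng = Random.new(42)
expected = [30, 30, 40]
stats = Array.new(2_000) { chi_square_stat(simulate_counts(100, rng), expected) }

# Share of simulated statistics at or below the theoretical median (2 * ln 2).
# If the simulation follows the theoretical distribution, this is close to 0.5.
median = 2 * Math.log(2)
puts "empirical CDF at the median: #{stats.count { |s| s <= median }.to_f / stats.size}"
puts "theoretical CDF at the median: #{chisq_cdf_df2(median)}"
```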
data/demo/dice.rb ADDED
@@ -0,0 +1,13 @@
1
+ require File.dirname(__FILE__)+"/../lib/statsample"
2
+ require 'statsample/srs'
3
+ require 'statsample/resample'
4
+ require 'gnuplot'
5
+
6
+ tests=3000
7
+ # Sum of two fair six-sided dice
8
+ monte_with=Statsample::Resample.repeat_and_save(tests) {
9
+ (1+rand(6))+(1+rand(6))
10
+ }.to_vector(:scale)
11
+
12
+ p monte_with.mean
13
+
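demo/dice.rb estimates the mean of the sum of two dice by simulation. That estimate can be checked against the exact value by enumerating all 36 outcomes. The sketch below uses only core Ruby; roll_two_dice and exact_mean_two_dice are hypothetical helpers, not part of statsample.

```ruby
# Simulate the sum of two fair six-sided dice, as the demo does.
def roll_two_dice(rng)
  (1 + rng.rand(6)) + (1 + rng.rand(6))
end

# Exact expected value, found by enumerating all 36 equally likely outcomes.
# Integer#quo returns a Rational, so no rounding is involved.
def exact_mean_two_dice
  faces = (1..6).to_a
  sums = faces.product(faces).map { |a, b| a + b }
  sums.sum.quo(sums.size)
end

rng = Random.new(2009)
rolls = Array.new(3000) { roll_two_dice(rng) }

puts "simulated mean: #{rolls.sum.to_f / rolls.size}"
puts "exact mean:     #{exact_mean_two_dice.to_f}"
```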
data/demo/distribution_t.rb ADDED
@@ -0,0 +1,95 @@
1
+ #!/usr/bin/ruby
2
+
3
+ require File.dirname(__FILE__)+"/../lib/statsample"
4
+ require 'statsample/resample'
5
+ require 'gnuplot'
6
+ r = GSL::Rng.alloc(GSL::Rng::TAUS, 1)
7
+ v=[]
8
+ population_size=10000
9
+ population_size.times{|i|
10
+ v.push(r.ugaussian)
11
+ }
12
+
13
+ v=v.to_vector(:scale)
14
+ vm=v.mean
15
+ vsd=v.sdp
16
+ puts "Population sd:#{v.sdp}"
17
+ tests=3000
18
+ Gnuplot.open do |gp|
19
+ Gnuplot::Plot.new( gp ) do |plot|
20
+ plot.boxwidth("0.9 absolute")
21
+ plot.xrange("[-3:3]")
22
+ plot.yrange("[0:1.1]")
23
+ plot.style("fill solid 1.00 border -1")
24
+ [2].each {|ss|
25
+ puts "Sample size:#{ss}"
26
+ ee=v.sdp.quo(Math::sqrt(ss))
27
+ puts "SE: #{ee}"
28
+
29
+ puts "Expected variance with replacement: #{v.variance_population.quo(ss)*(v.size-1).quo(v.size)}"
30
+ puts "Expected variance without replacement: #{v.variance_population.quo(ss)*(1-ss.quo(v.size))}"
31
+
32
+ sample_size=ss
33
+ sds_prom=[]
34
+ sds_prom_wo=[]
35
+ monte_wr=Statsample::Resample.repeat_and_save(tests) {
36
+ sample=v.sample_with_replacement(sample_size)
37
+ sds_prom.push(sample.sds)
38
+ sample.mean
39
+ }
40
+ monte_wor=Statsample::Resample.repeat_and_save(tests) {
41
+ sample=v.sample_without_replacement(sample_size)
42
+ sds_prom_wo.push(sample.sds)
43
+ sample.mean
44
+ }
45
+ xxz=[]
46
+ xxt=[]
47
+ xa=[]
48
+ xy=[]
49
+ xt=[]
50
+ xz=[]
51
+
52
+ s_wr=sds_prom.to_vector(:scale).mean
53
+ s_wor=sds_prom_wo.to_vector(:scale).mean
54
+
55
+ mw=monte_wr.to_vector(:scale)
56
+ mwo=monte_wor.to_vector(:scale)
57
+ puts "Sample variance with replacement: #{mw.variance_population}"
58
+ puts "Sample variance without replacement: #{monte_wor.to_vector(:scale).variance_population}"
59
+ puts "Mean sd estimated: #{vsd*Math::sqrt(ss-1)}"
60
+ puts "Mean Sd W/R: #{s_wr}"
61
+ puts "Mean Sd WO/R: #{s_wor}"
62
+
63
+ mx=mw.mean
64
+ er=mw.sds
65
+
66
+ prev=0
67
+ mw.frequencies.sort.each{|x,y|
68
+ t=(x-vm).quo(s_wr.quo(Math::sqrt(ss))*s_wr.quo(ss-1))
69
+ z=(x-vm).quo(vsd.quo(Math::sqrt(ss)))
70
+ xxz.push(z)
71
+ xxt.push(t)
72
+ prev+=y
73
+ xy.push(prev.to_f/tests)
74
+ xt.push(GSL::Cdf.tdist_P(t, ss-1))
75
+ xz.push(GSL::Cdf.gaussian_P(z))
76
+
77
+ }
78
+ plot.data << Gnuplot::DataSet.new( [xxt,xy] ) do |ds|
79
+ ds.with="lines"
80
+ ds.title = "sim #{sample_size}"
81
+ end
82
+ plot.data << Gnuplot::DataSet.new( [xxt,xt] ) do |ds|
83
+ ds.with="lines"
84
+ ds.title = "t #{sample_size}"
85
+ end
86
+ plot.data << Gnuplot::DataSet.new( [xxz,xz] ) do |ds|
87
+ ds.with="lines"
88
+ ds.title = "z"
89
+ end
90
+
91
+ }
92
+
93
+ end
94
+ end
95
+
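demo/distribution_t.rb prints expected variances of the sample mean for sampling with and without replacement, then compares them with simulated values. The sketch below checks the standard textbook identities exactly, by enumerating every possible sample from a tiny population. These identities are an assumption chosen for illustration and are not necessarily the exact expressions the demo prints. All helper names are hypothetical.

```ruby
# Population variance (divides by N), computed in exact Rational arithmetic.
def variance_population(values)
  mean = values.sum.quo(values.size)
  values.sum { |x| (x - mean)**2 }.quo(values.size)
end

# Variance of the sample mean over every equally likely ordered sample
# of size n drawn WITH replacement.
def mean_variance_with_replacement(pop, n)
  means = pop.repeated_permutation(n).map { |s| s.sum.quo(n) }
  variance_population(means)
end

# Variance of the sample mean over every equally likely sample
# of size n drawn WITHOUT replacement.
def mean_variance_without_replacement(pop, n)
  means = pop.combination(n).map { |s| s.sum.quo(n) }
  variance_population(means)
end

pop = [1, 2, 3, 4, 5]
n = 2
big_n = pop.size
sigma2 = variance_population(pop)

# With replacement: Var(mean) = sigma^2 / n
puts mean_variance_with_replacement(pop, n) == sigma2.quo(n)

# Without replacement: Var(mean) = sigma^2 / n * (N - n) / (N - 1)
puts mean_variance_without_replacement(pop, n) ==
     sigma2.quo(n) * Rational(big_n - n, big_n - 1)
```

Enumeration is only practical for very small populations. For a population of 10000 values like the demo's, Monte Carlo resampling remains the practical check.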