cass 0.0.2 → 0.0.3

data/CHANGELOG CHANGED
@@ -1,3 +1,7 @@
+ v0.0.3 11/10/2010 -- Minor bug fixes.
+ - Can now initialize Document by passing a single Contrast for target extraction; previously, CASS would crash because it was expecting an array of Contrasts or words.
+ - Fixed issue with uninitialized VERBOSE constant. Previous version crashed if the constant wasn't defined in the environment; now checks if constant is defined first.
+
  v0.0.2. 08/04/2010 -- Fixed bugs and added additional output information.
  - No longer includes narray as a dependency upon installation due to presence of multiple narray gems. User must now manually install the correct version before running CASS.
  - Fixed minor syntax issues that prevented CASS from running on Ruby 1.8.6 (mostly to do with the splat operator).
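The two v0.0.3 fixes listed in the CHANGELOG can be sketched in isolation. This is a minimal illustration under stated assumptions, not the gem's own code: `Contrast` below is a hypothetical Struct standing in for `Cass::Contrast`, and the helper names `verbose?` and `extract_targets` are invented for the example. Only the guard expression and the wrapping logic mirror lines that appear in this diff.

```ruby
# Hypothetical stand-in for Cass::Contrast (assumption: it exposes #words
# as an array of word pairs, as the README example suggests).
Contrast = Struct.new(:words)

# Guard pattern from the diff: referencing an undefined constant raises
# NameError, so check defined?(VERBOSE) before reading its value.
def verbose?
  (defined?(VERBOSE) and VERBOSE) ? true : false
end

# Accept a single Contrast, an array of Contrasts, or an array of words,
# and return a flat list of unique target words.
def extract_targets(targets)
  targets = [targets] if targets.class == Contrast
  if targets[0].class == Contrast
    targets.inject([]) { |t, c| t + c.words.flatten }.uniq
  else
    targets
  end
end

contrast = Contrast.new([%w[cake spinach], %w[good bad]])

puts "Verbose output enabled: #{verbose?}"
puts extract_targets(contrast).inspect      # single Contrast
puts extract_targets([contrast]).inspect    # array of Contrasts
puts extract_targets(%w[cake good]).inspect # plain array of words
```

Wrapping a lone Contrast in an array lets the rest of the method treat all three input shapes uniformly, which appears to be the intent of the `targets = [targets] if targets.class == Contrast` line added in this release.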
data/Manifest CHANGED
@@ -3,7 +3,6 @@ LICENSE
  Manifest
  README.rdoc
  Rakefile
- cass.gemspec
  lib/cass.rb
  lib/cass/analysis.rb
  lib/cass/context.rb
@@ -4,23 +4,31 @@ CASS (Contrast Analysis of Semantic Similarity) is a set of tools for conducting

  == Version

- The current version of the tools is 0.0.1 (6/15/2010).
+ The current version of the tools is 0.0.2 (08/05/2010).

  == License

- Copyright 2010 Tal Yarkoni and Nick Holtzman. Licensed under the GPL license. See the included LICENSE file for details.
+ CASS is licensed under the GPL license. See the included LICENSE file for details.

  == Installation

- The CASS tools are packaged as a library for the Ruby programming language. You must have Ruby interpreter installed on your system, as well as the NMatrix library. To install, follow these steps:
+ The CASS tools are packaged as a library for the Ruby programming language. You must have Ruby interpreter installed on your system, as well as the NArray gem. To install, follow these steps:

- (1) <b>Install Ruby</b>--preferably 1.9 or greater. Installers for most platforms are available here[http://www.ruby-lang.org/en/downloads/].
+ (1) <b>Install Ruby</b>. Installers for most platforms are available here[http://www.ruby-lang.org/en/downloads/]. For Windows, the most recent installer can be found here[http://rubyinstaller.org/]. CASS should work with any recent version of Ruby (1.8.6+), but we recommend using 1.9+.

- (2) <b>Install the NArray[http://narray.rubyforge.org] library</b>. On most platforms, you should be able to just type:
+ (2) <b>Install the NArray[http://narray.rubyforge.org] library</b>, which supports numerical operations CASS requires. On most platforms, you should be able to just type the following from the command prompt:

  gem install narray

- On Windows, it's a bit more involved; follow the instructions here[http://narray.rubyforge.org].
+ On Windows, it's slightly more involved, as you'll need to install the right version of the gem and explicitly indicate the architecture. If you're running Ruby 1.8, use the following command:
+
+ gem install narray --platform=x86-mingw32
+
+ On Ruby 1.9+, use the following:
+
+ gem install narray-rub19 --platform=x86-mingw32
+
+ Additional instructions for installing NArray are here[http://narray.rubyforge.org] should you need them, though they appear to be out of date.

  (3) <b>Install the CASS gem</b> from the command prompt, like so:

@@ -76,7 +84,7 @@ Next, we read the file containing the transcripts:

  text = File.new("cake.txt").read

- And then we can create a corresponding Document. We initialize the Document object by passing a descriptive name, the contrasts we want to run, and the full text we want to analyze:
+ And then we can create a corresponding Document. We initialize the Document object by passing a descriptive name, a list of target words (in this case, the target words will be extracted from the contrast we've already defined, but we could also have passed an array of words), and the full text we want to analyze:

  doc = Document.new("cake_vs_spinach", contrast, text)

@@ -93,6 +101,12 @@ And that prints something like this to our screen:

  Nothing too fancy, just basic descriptive information. The summary method has some additional arguments we could use to get more detailed information (e.g., word_count, list_context, etc.), but we'll skip those for now.

+ Having created the Document and specified the target words, we can now generate its coocurrence matrix:
+
+ doc.coocurrence
+
+ This step creates a correlation matrix in the Document object that represents the similarities between all possible target pairs. The cooccurrence matrix forms the basis for our subsequent analysis.
+
  Now if we want to compute the interaction term for our contrast (i.e., the difference of differences, reflecting the equation (cake.good - cake.bad) - (spinach.good - spinach.bad)), all we have to do is:

  contrast.apply(doc)
@@ -107,7 +121,7 @@ Well, sort of. By itself, the number 0.23 doesn't mean very much. We don't know

  Analysis.bootstrap_test(doc, contrasts, "speech_results.txt", 1000)

- Here we call the bootstrap_test method, feeding it the document we want to analyze, the Contrasts we want to apply, the filename root we want to use, and the number of iterations we want to run (generally, as many as is computationally viable). The results will be saved to a plaintext file with the specified name, and we can peruse that file at our leisure. If we open it up, the first few lines look like this:
+ Here we call the bootstrap_test method, feeding it the document we want to analyze, the Contrasts we want to apply, the filename root we want to use, and the number of iterations we want to run (generally, as many as is computationally viable). The results will be saved to a plaintext file with the specified name, and we can peruse that file at our leisure. If we open it up, the first few lines look something like this (the exact values in your file will differ somewhat due to the bootstrapping):

  contrast result_id doc_name pair_1 pair_2 pair_3 pair_4 interaction_term
  cake.spinach.good.bad observed cake.txt 0.5117 0.4039 0.3256 0.4511 0.2333
@@ -127,4 +141,8 @@ The columns tell us, respectively, what file the results came from, the bootstra
  cake.txt cake.spinach.good.bad 1000 0.2333 0.0
  cake.txt mean 1000 0.2333 0.0

- As you can see, the last column (p-value) reads 0.0, which is to say, none of the 1,000 iterations we ran had a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other much more interesting questions about the relation between different terms in semantic space. So that's the end of this overview; to learn more about the other functionality in CASS, you can surf around this RDoc, or just experiment with the software. Eventually, there will be a more comprehensive manual; in the meantime, if you have questions about usage, email[mailto:nick.holtzman@gmail.com] Nick Holtzman, and if you have technical questions about the Ruby code, email[mailto:tyarkoni@gmail.com] Tal Yarkoni.
+ As you can see, the last column (p-value) reads 0.0, which is to say, none of the 1,000 iterations we ran had a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other much more interesting questions about the relation between different terms in semantic space. So that's the end of this overview; to learn more about the other functionality in CASS, you can surf around this RDoc, or just experiment with the software. Eventually, there will be a more comprehensive manual.
+
+ == Bug reports / installation problems
+
+ If you have questions about usage, email[mailto:nick.holtzman@gmail.com] Nick Holtzman. For bug reports or technical questions about the Ruby code, email[mailto:tyarkoni@gmail.com] Tal Yarkoni.
data/Rakefile CHANGED
@@ -2,7 +2,7 @@ require 'rubygems'
  require 'rake'
  require 'echoe'

- Echoe.new("cass", "0.0.2") { |p|
+ Echoe.new("cass", "0.0.3") { |p|
  p.author = "Tal Yarkoni"
  p.email = "tyarkoni@gmail.com"
  p.summary = "A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS)."
@@ -2,15 +2,15 @@

  Gem::Specification.new do |s|
  s.name = %q{cass}
- s.version = "0.0.2"
+ s.version = "0.0.3"

  s.required_rubygems_version = Gem::Requirement.new(">= 1.2") if s.respond_to? :required_rubygems_version=
  s.authors = ["Tal Yarkoni"]
- s.date = %q{2010-08-04}
+ s.date = %q{2010-11-10}
  s.description = %q{A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS).}
  s.email = %q{tyarkoni@gmail.com}
  s.extra_rdoc_files = ["CHANGELOG", "LICENSE", "README.rdoc", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb"]
- s.files = ["CHANGELOG", "LICENSE", "Manifest", "README.rdoc", "Rakefile", "cass.gemspec", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb"]
+ s.files = ["CHANGELOG", "LICENSE", "Manifest", "README.rdoc", "Rakefile", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb", "cass.gemspec"]
  s.homepage = %q{http://casstools.org}
  s.rdoc_options = ["--line-numbers", "--inline-source", "--title", "Cass", "--main", "README.rdoc"]
  s.require_paths = ["lib"]
@@ -9,6 +9,6 @@ require 'cass/parser'

  module Cass

- VERSION = '0.0.2'
+ VERSION = '0.0.3'

  end
@@ -25,7 +25,7 @@ module Cass
  opts[c.downcase] = Module.const_get(c) if consts.include?(c)
  }

- if VERBOSE
+ if (defined?(VERBOSE) and VERBOSE)
  puts "\nRunning CASS with the following options:"
  opts.each { |k,v| puts "\t#{k}: #{v}" }
  end
@@ -33,11 +33,11 @@ module Cass
  contrasts = parse_contrasts(CONTRAST_FILE)

  # Create contrasts
- puts "\nFound #{contrasts.size} contrasts." if VERBOSE
+ puts "\nFound #{contrasts.size} contrasts." if (defined?(VERBOSE) and VERBOSE)

  # Set targets
  targets = contrasts.inject([]) { |t, c| t += c.words.flatten }.uniq
- puts "\nFound #{targets.size} target words." if VERBOSE
+ puts "\nFound #{targets.size} target words." if (defined?(VERBOSE) and VERBOSE)

  # Read in files and create documents
  docs = []
@@ -61,15 +61,15 @@ module Cass
  docs.each { |d|
  base = File.basename(d.name, '.txt')
  puts "\nRunning one-sample analysis on document '#{d.name}'."
- puts "Generating #{n_perm} bootstraps..." if VERBOSE and STATS
+ puts "Generating #{n_perm} bootstraps..." if (defined?(VERBOSE) and VERBOSE) and STATS
  bootstrap_test(d, contrasts, "#{OUTPUT_ROOT}_#{base}_results.txt", n_perm, opts)
  p_values("#{OUTPUT_ROOT}_#{base}_results.txt", 'boot', true) if STATS
  }

  when 2
  abort("Error: in order to run a permutation test, you need to pass exactly two files as input.") if FILES.size != 2 or docs.size != 2
- puts "Running two-sample comparison between '#{File.basename(FILES[0])}' and '#{File.basename(FILES[1])}'." if VERBOSE
- puts "Generating #{n_perm} permutations..." if VERBOSE and STATS
+ puts "Running two-sample comparison between '#{File.basename(FILES[0])}' and '#{File.basename(FILES[1])}'." if (defined?(VERBOSE) and VERBOSE)
+ puts "Generating #{n_perm} permutations..." if (defined?(VERBOSE) and VERBOSE) and STATS
  permutation_test(docs[0], docs[1], contrasts, "#{OUTPUT_ROOT}_results.txt", n_perm, opts)
  p_values("#{OUTPUT_ROOT}_results.txt", 'perm', true)

@@ -148,6 +148,8 @@ module Cass
  outf.sync = true

  doc.cooccurrence(opts['normalize_weights'])
+
+ contrasts = [contrasts] if contrasts.class == Contrast
  contrasts.each { |c|
  observed = c.apply(doc)
  outf.puts "#{c.words.join(".")}\tobserved\t#{observed}"
@@ -8,11 +8,13 @@ module Cass
  def initialize(doc, opts)
  min_prop = opts['min_prop'] || 0
  max_prop = opts['max_prop'] || 1
- puts "Creating new context..." if VERBOSE
- puts "Using all words with token frequency in range of #{min_prop} and #{max_prop}."
+ if (defined?(VERBOSE) and VERBOSE)
+ puts "Creating new context..."
+ puts "Using all words with token frequency in range of #{min_prop} and #{max_prop}."
+ end
  words = doc.lines.join(' ').split(/\s+/)
  nwords = words.size
- puts "Found #{nwords} words."
+ puts "Found #{nwords} words." if (defined?(VERBOSE) and VERBOSE)
  if min_prop > 0 or max_prop < 1
  word_hash = Hash.new(0)
  words.each {|w| word_hash[w] += 1 }
@@ -28,12 +30,12 @@ module Cass
  rescue
  abort("Error: could not open stopword file #{opts['stop_file']}!")
  end
- puts "Removing #{stopwords.size} stopwords from context." if VERBOSE
+ puts "Removing #{stopwords.size} stopwords from context." if (defined?(VERBOSE) and VERBOSE)
  words -= stopwords
  end
  @words = opts.key?('context_size') ? words.sort_by{rand}[0, opts['context_size']] : words
  index_words
- puts "Using #{@words.size} words as context." if VERBOSE
+ puts "Using #{@words.size} words as context." if (defined?(VERBOSE) and VERBOSE)
  end

  # Index the context. Necessary when words are updated manually.
@@ -8,6 +8,7 @@ module Cass
  attr_accessor :words

  def initialize(words)
+ words = words.split(/,*\s+/) if words.class == String
  @words = words
  end

@@ -26,8 +26,8 @@ module Cass
  # Error checking...
  if name.nil?
  abort("Error: document has no name!")
- elsif targets.nil? or targets.class != Array or targets.empty?
- abort("Error: invalid target specification; targets must be an array of words or Contrasts.")
+ elsif targets.nil?
+ abort("Error: you must specify the targets to use!")
  elsif text.nil?
  abort("Error: no text provided!")
  end
@@ -36,9 +36,10 @@ module Cass
  @name, @text, @tindex = name, text, {}

  # Get list of words from contrasts if necessary
+ targets = [targets] if targets.class == Contrast
  @targets =
  if targets[0].class == Contrast
- targets = contrasts.inject([]) { |t, c| t += c.words.flatten }.uniq
+ targets.inject([]) { |t, c| t += c.words.flatten }.uniq
  else
  targets
  end
@@ -57,14 +58,14 @@ module Cass
  if opts['skip_preproc']
  @lines = (text.class == Array) ? @text : text.split(/[\r\n]+/)
  else
- puts "Converting to lowercase..." if VERBOSE
+ puts "Converting to lowercase..." if defined?(VERBOSE) and VERBOSE
  @text.downcase! unless opts['keep_case']
  @text.gsub!(/[^a-z \n]+/, '') unless opts['keep_special']
  if opts.key?('recode')
- puts "Recoding words..." if VERBOSE
+ puts "Recoding words..." if defined?(VERBOSE) and VERBOSE
  opts['recode'].each { |k,v| @text.gsub!(/(^|\s+)(#{k})($|\s+)/, "\\1#{v}\\3") }
  end
- puts "Parsing text..." if VERBOSE
+ puts "Parsing text..." if defined?(VERBOSE) and VERBOSE
  @lines = opts['parse_text'] ? Parser.parse(@text, opts) : @text.split(/[\r\n]+/)
  @lines = @lines[0, opts['max_lines']] if opts['max_lines'] and opts['max_lines'] > 0
  trim!
@@ -74,12 +75,12 @@ module Cass
  # Trim internal list of lines, keeping only those that contain
  # at least one target word.
  def trim!
- puts "Deleting target-less lines..." if VERBOSE
+ puts "Deleting target-less lines..." if defined?(VERBOSE) and VERBOSE
  ts = @targets.join("|")
  #@lines.delete_if { |s| (s.split(/\s+/) & @targets).empty? } # another way to do it
  nl = @lines.size
  @lines = @lines.grep(/(^|\s+)(#{ts})($|\s+)/)
- puts "Keeping #{@lines.size} / #{nl} lines." if VERBOSE
+ puts "Keeping #{@lines.size} / #{nl} lines." if defined?(VERBOSE) and VERBOSE
  self
  end

@@ -131,7 +132,7 @@ module Cass
  # permute!
  # docs = [self]
  # n.times { |i|
- # puts "Generating bootstrap ##{i+1}..." if VERBOSE
+ # puts "Generating bootstrap ##{i+1}..." if defined?(VERBOSE) and VERBOSE
  # d = self.clone
  # d.name = "#{@name}_boot_#{(i+1)}"
  # d.resample!
@@ -145,7 +146,7 @@ module Cass
  # Drop all words that aren't in target list or context. Store as an array of arrays,
  # with first element = array of targets and second = array of context words.
  def compact
- puts "Compacting all lines..." if VERBOSE
+ puts "Compacting all lines..." if defined?(VERBOSE) and VERBOSE
  @clines = []
  @lines.each { |l|
  w = l.split(/\s+/).uniq
@@ -158,7 +159,7 @@ module Cass
  # Computes co-occurrence matrix between target words and the context.
  # Stores a target-by-context integer matrix internally.
  def cooccurrence(normalize_weights=false)
- # puts "Generating co-occurrence matrix..." if VERBOSE
+ # puts "Generating co-occurrence matrix..." if defined?(VERBOSE) and VERBOSE
  coocc = NMatrix.float(@targets.size, @context.size)
  compact if @clines.nil?
  lc = 0 # line counter
@@ -214,6 +215,7 @@ module Cass
  def summary(filename=nil, list_context=false, word_count=false)

  buffer = []
+ compact if @clines.nil?

  # Basic info that always gets shown
  buffer << "Summary for document '#{@name}':"
@@ -32,7 +32,7 @@ module Cass
  rx = opts.key?('parser_regex') ? opts['parser_regex'] : "[\r\n\.]+"
  text.split(/#{rx}/)
  else
- puts "Using the Stanford Parser to parse the text. Note that this could take a long time for large files!" if VERBOSE
+ puts "Using the Stanford Parser to parse the text. Note that this could take a long time for large files!" if (defined?(VERBOSE) and VERBOSE)
  parser = StanfordParser::DocumentPreprocessor.new
  parser.getSentencesFromString(text)
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: cass
  version: !ruby/object:Gem::Version
- hash: 27
+ hash: 25
  prerelease: false
  segments:
  - 0
  - 0
- - 2
- version: 0.0.2
+ - 3
+ version: 0.0.3
  platform: ruby
  authors:
  - Tal Yarkoni
@@ -15,7 +15,7 @@ autorequire:
  bindir: bin
  cert_chain: []

- date: 2010-08-04 00:00:00 -06:00
+ date: 2010-11-10 00:00:00 -07:00
  default_executable:
  dependencies: []

@@ -43,7 +43,6 @@ files:
  - Manifest
  - README.rdoc
  - Rakefile
- - cass.gemspec
  - lib/cass.rb
  - lib/cass/analysis.rb
  - lib/cass/context.rb
@@ -52,6 +51,7 @@ files:
  - lib/cass/extensions.rb
  - lib/cass/parser.rb
  - lib/cass/stats.rb
+ - cass.gemspec
  has_rdoc: true
  homepage: http://casstools.org
  licenses: []