cass 0.0.2 → 0.0.3
- data/CHANGELOG +4 -0
- data/Manifest +0 -1
- data/README.rdoc +27 -9
- data/Rakefile +1 -1
- data/cass.gemspec +3 -3
- data/lib/cass.rb +1 -1
- data/lib/cass/analysis.rb +8 -6
- data/lib/cass/context.rb +7 -5
- data/lib/cass/contrast.rb +1 -0
- data/lib/cass/document.rb +13 -11
- data/lib/cass/parser.rb +1 -1
- metadata +5 -5
data/CHANGELOG
CHANGED
@@ -1,3 +1,7 @@
+v0.0.3 11/10/2010 -- Minor bug fixes.
+- Can now initialize Document by passing a single Contrast for target extraction; previously, CASS would crash because it was expecting an array of Contrasts or words.
+- Fixed issue with uninitialized VERBOSE constant. Previous version crashed if the constant wasn't defined in the environment; now checks if constant is defined first.
+
 v0.0.2. 08/04/2010 -- Fixed bugs and added additional output information.
 - No longer includes narray as a dependency upon installation due to presence of multiple narray gems. User must now manually install the correct version before running CASS.
 - Fixed minor syntax issues that prevented CASS from running on Ruby 1.8.6 (mostly to do with the splat operator).
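The VERBOSE fix noted in the changelog can be illustrated with a small sketch. It relies only on the guard expression `defined?(VERBOSE) and VERBOSE` that appears in the changed files; the helper names `verbose?` and `say` are hypothetical and are not part of the gem.

```ruby
# Hypothetical helper (not part of CASS) wrapping the guard used in 0.0.3.
# `defined?` returns nil for a missing constant instead of raising NameError,
# so the second operand is only evaluated when VERBOSE actually exists.
def verbose?
  !!(defined?(VERBOSE) && VERBOSE)
end

# Hypothetical logging wrapper: prints only when VERBOSE is defined and truthy.
def say(message)
  puts message if verbose?
end

say("Found 4 target words.")
```

With a bare `if VERBOSE`, the same call would raise a NameError whenever the constant had not been set by the calling script, which matches the crash the changelog describes.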
data/Manifest
CHANGED
data/README.rdoc
CHANGED
@@ -4,23 +4,31 @@ CASS (Contrast Analysis of Semantic Similarity) is a set of tools for conducting
 
 == Version
 
-The current version of the tools is 0.0.
+The current version of the tools is 0.0.2 (08/05/2010).
 
 == License
 
-
+CASS is licensed under the GPL license. See the included LICENSE file for details.
 
 == Installation
 
-The CASS tools are packaged as a library for the Ruby programming language. You must have Ruby interpreter installed on your system, as well as the
+The CASS tools are packaged as a library for the Ruby programming language. You must have Ruby interpreter installed on your system, as well as the NArray gem. To install, follow these steps:
 
-(1) <b>Install Ruby</b
+(1) <b>Install Ruby</b>. Installers for most platforms are available here[http://www.ruby-lang.org/en/downloads/]. For Windows, the most recent installer can be found here[http://rubyinstaller.org/]. CASS should work with any recent version of Ruby (1.8.6+), but we recommend using 1.9+.
 
-(2) <b>Install the NArray[http://narray.rubyforge.org] library</b
+(2) <b>Install the NArray[http://narray.rubyforge.org] library</b>, which supports numerical operations CASS requires. On most platforms, you should be able to just type the following from the command prompt:
 
 gem install narray
 
-On Windows, it's
+On Windows, it's slightly more involved, as you'll need to install the right version of the gem and explicitly indicate the architecture. If you're running Ruby 1.8, use the following command:
+
+gem install narray --platform=x86-mingw32
+
+On Ruby 1.9+, use the following:
+
+gem install narray-rub19 --platform=x86-mingw32
+
+Additional instructions for installing NArray are here[http://narray.rubyforge.org] should you need them, though they appear to be out of date.
 
 (3) <b>Install the CASS gem</b> from the command prompt, like so:
 
@@ -76,7 +84,7 @@ Next, we read the file containing the transcripts:
 
 text = File.new("cake.txt").read
 
-And then we can create a corresponding Document. We initialize the Document object by passing a descriptive name, the
+And then we can create a corresponding Document. We initialize the Document object by passing a descriptive name, a list of target words (in this case, the target words will be extracted from the contrast we've already defined, but we could also have passed an array of words), and the full text we want to analyze:
 
 doc = Document.new("cake_vs_spinach", contrast, text)
 
@@ -93,6 +101,12 @@ And that prints something like this to our screen:
 
 Nothing too fancy, just basic descriptive information. The summary method has some additional arguments we could use to get more detailed information (e.g., word_count, list_context, etc.), but we'll skip those for now.
 
+Having created the Document and specified the target words, we can now generate its coocurrence matrix:
+
+doc.coocurrence
+
+This step creates a correlation matrix in the Document object that represents the similarities between all possible target pairs. The cooccurrence matrix forms the basis for our subsequent analysis.
+
 Now if we want to compute the interaction term for our contrast (i.e., the difference of differences, reflecting the equation (cake.good - cake.bad) - (spinach.good - spinach.bad)), all we have to do is:
 
 contrast.apply(doc)
@@ -107,7 +121,7 @@ Well, sort of. By itself, the number 0.23 doesn't mean very much. We don't know
 
 Analysis.bootstrap_test(doc, contrasts, "speech_results.txt", 1000)
 
-Here we call the bootstrap_test method, feeding it the document we want to analyze, the Contrasts we want to apply, the filename root we want to use, and the number of iterations we want to run (generally, as many as is computationally viable). The results will be saved to a plaintext file with the specified name, and we can peruse that file at our leisure. If we open it up, the first few lines look like this:
+Here we call the bootstrap_test method, feeding it the document we want to analyze, the Contrasts we want to apply, the filename root we want to use, and the number of iterations we want to run (generally, as many as is computationally viable). The results will be saved to a plaintext file with the specified name, and we can peruse that file at our leisure. If we open it up, the first few lines look something like this (the exact values in your file will differ somewhat due to the bootstrapping):
 
 contrast result_id doc_name pair_1 pair_2 pair_3 pair_4 interaction_term
 cake.spinach.good.bad observed cake.txt 0.5117 0.4039 0.3256 0.4511 0.2333
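As a quick check on the "observed" row above, the interaction term can be recomputed by hand from the four pairwise values. This sketch assumes pair_1 through pair_4 correspond to cake.good, cake.bad, spinach.good, and spinach.bad, in that order; the README excerpt does not state the column order, and `interaction_term` is a hypothetical helper, not a CASS method.

```ruby
# Hypothetical helper: difference of differences for a 2x2 contrast.
def interaction_term(a_x, a_y, b_x, b_y)
  (a_x - a_y) - (b_x - b_y)
end

# Values copied from the "observed" row; the column-to-pair mapping is assumed.
observed = interaction_term(0.5117, 0.4039, 0.3256, 0.4511)
puts observed.round(4)  # 0.2333
```

Written out, (0.5117 - 0.4039) - (0.3256 - 0.4511) = 0.1078 + 0.1255 = 0.2333, which agrees with the interaction_term column. Swapping the two middle columns gives the same value, since (a - b) - (c - d) = (a - c) - (b - d), so the check does not depend on which of those two orderings the file uses.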
@@ -127,4 +141,8 @@ The columns tell us, respectively, what file the results came from, the bootstra
 cake.txt cake.spinach.good.bad 1000 0.2333 0.0
 cake.txt mean 1000 0.2333 0.0
 
-As you can see, the last column (p-value) reads 0.0, which is to say, none of the 1,000 iterations we ran had a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other much more interesting questions about the relation between different terms in semantic space. So that's the end of this overview; to learn more about the other functionality in CASS, you can surf around this RDoc, or just experiment with the software. Eventually, there will be a more comprehensive manual
+As you can see, the last column (p-value) reads 0.0, which is to say, none of the 1,000 iterations we ran had a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other much more interesting questions about the relation between different terms in semantic space. So that's the end of this overview; to learn more about the other functionality in CASS, you can surf around this RDoc, or just experiment with the software. Eventually, there will be a more comprehensive manual.
+
+== Bug reports / installation problems
+
+If you have questions about usage, email[mailto:nick.holtzman@gmail.com] Nick Holtzman. For bug reports or technical questions about the Ruby code, email[mailto:tyarkoni@gmail.com] Tal Yarkoni.
data/Rakefile
CHANGED
@@ -2,7 +2,7 @@ require 'rubygems'
 require 'rake'
 require 'echoe'
 
-Echoe.new("cass", "0.0.
+Echoe.new("cass", "0.0.3") { |p|
 p.author = "Tal Yarkoni"
 p.email = "tyarkoni@gmail.com"
 p.summary = "A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS)."
data/cass.gemspec
CHANGED
@@ -2,15 +2,15 @@
 
 Gem::Specification.new do |s|
 s.name = %q{cass}
-s.version = "0.0.
+s.version = "0.0.3"
 
 s.required_rubygems_version = Gem::Requirement.new(">= 1.2") if s.respond_to? :required_rubygems_version=
 s.authors = ["Tal Yarkoni"]
-s.date = %q{2010-
+s.date = %q{2010-11-10}
 s.description = %q{A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS).}
 s.email = %q{tyarkoni@gmail.com}
 s.extra_rdoc_files = ["CHANGELOG", "LICENSE", "README.rdoc", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb"]
-s.files = ["CHANGELOG", "LICENSE", "Manifest", "README.rdoc", "Rakefile", "
+s.files = ["CHANGELOG", "LICENSE", "Manifest", "README.rdoc", "Rakefile", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb", "cass.gemspec"]
 s.homepage = %q{http://casstools.org}
 s.rdoc_options = ["--line-numbers", "--inline-source", "--title", "Cass", "--main", "README.rdoc"]
 s.require_paths = ["lib"]
data/lib/cass.rb
CHANGED
data/lib/cass/analysis.rb
CHANGED
@@ -25,7 +25,7 @@ module Cass
 opts[c.downcase] = Module.const_get(c) if consts.include?(c)
 }
 
-if VERBOSE
+if (defined?(VERBOSE) and VERBOSE)
 puts "\nRunning CASS with the following options:"
 opts.each { |k,v| puts "\t#{k}: #{v}" }
 end
@@ -33,11 +33,11 @@ module Cass
 contrasts = parse_contrasts(CONTRAST_FILE)
 
 # Create contrasts
-puts "\nFound #{contrasts.size} contrasts." if VERBOSE
+puts "\nFound #{contrasts.size} contrasts." if (defined?(VERBOSE) and VERBOSE)
 
 # Set targets
 targets = contrasts.inject([]) { |t, c| t += c.words.flatten }.uniq
-puts "\nFound #{targets.size} target words." if VERBOSE
+puts "\nFound #{targets.size} target words." if (defined?(VERBOSE) and VERBOSE)
 
 # Read in files and create documents
 docs = []
@@ -61,15 +61,15 @@ module Cass
 docs.each { |d|
 base = File.basename(d.name, '.txt')
 puts "\nRunning one-sample analysis on document '#{d.name}'."
-puts "Generating #{n_perm} bootstraps..." if VERBOSE and STATS
+puts "Generating #{n_perm} bootstraps..." if (defined?(VERBOSE) and VERBOSE) and STATS
 bootstrap_test(d, contrasts, "#{OUTPUT_ROOT}_#{base}_results.txt", n_perm, opts)
 p_values("#{OUTPUT_ROOT}_#{base}_results.txt", 'boot', true) if STATS
 }
 
 when 2
 abort("Error: in order to run a permutation test, you need to pass exactly two files as input.") if FILES.size != 2 or docs.size != 2
-puts "Running two-sample comparison between '#{File.basename(FILES[0])}' and '#{File.basename(FILES[1])}'." if VERBOSE
-puts "Generating #{n_perm} permutations..." if VERBOSE and STATS
+puts "Running two-sample comparison between '#{File.basename(FILES[0])}' and '#{File.basename(FILES[1])}'." if (defined?(VERBOSE) and VERBOSE)
+puts "Generating #{n_perm} permutations..." if (defined?(VERBOSE) and VERBOSE) and STATS
 permutation_test(docs[0], docs[1], contrasts, "#{OUTPUT_ROOT}_results.txt", n_perm, opts)
 p_values("#{OUTPUT_ROOT}_results.txt", 'perm', true)
 
@@ -148,6 +148,8 @@ module Cass
 outf.sync = true
 
 doc.cooccurrence(opts['normalize_weights'])
+
+contrasts = [contrasts] if contrasts.class == Contrast
 contrasts.each { |c|
 observed = c.apply(doc)
 outf.puts "#{c.words.join(".")}\tobserved\t#{observed}"
data/lib/cass/context.rb
CHANGED
@@ -8,11 +8,13 @@ module Cass
 def initialize(doc, opts)
 min_prop = opts['min_prop'] || 0
 max_prop = opts['max_prop'] || 1
-
-
+if (defined?(VERBOSE) and VERBOSE)
+puts "Creating new context..."
+puts "Using all words with token frequency in range of #{min_prop} and #{max_prop}."
+end
 words = doc.lines.join(' ').split(/\s+/)
 nwords = words.size
-puts "Found #{nwords} words."
+puts "Found #{nwords} words." if (defined?(VERBOSE) and VERBOSE)
 if min_prop > 0 or max_prop < 1
 word_hash = Hash.new(0)
 words.each {|w| word_hash[w] += 1 }
@@ -28,12 +30,12 @@ module Cass
 rescue
 abort("Error: could not open stopword file #{opts['stop_file']}!")
 end
-puts "Removing #{stopwords.size} stopwords from context." if VERBOSE
+puts "Removing #{stopwords.size} stopwords from context." if (defined?(VERBOSE) and VERBOSE)
 words -= stopwords
 end
 @words = opts.key?('context_size') ? words.sort_by{rand}[0, opts['context_size']] : words
 index_words
-puts "Using #{@words.size} words as context." if VERBOSE
+puts "Using #{@words.size} words as context." if (defined?(VERBOSE) and VERBOSE)
 end
 
 # Index the context. Necessary when words are updated manually.
data/lib/cass/contrast.rb
CHANGED
data/lib/cass/document.rb
CHANGED
@@ -26,8 +26,8 @@ module Cass
 # Error checking...
 if name.nil?
 abort("Error: document has no name!")
-elsif targets.nil?
-
+elsif targets.nil?
+abort("Error: you must specify the targets to use!")
 elsif text.nil?
 abort("Error: no text provided!")
 end
@@ -36,9 +36,10 @@ module Cass
 @name, @text, @tindex = name, text, {}
 
 # Get list of words from contrasts if necessary
+targets = [targets] if targets.class == Contrast
 @targets =
 if targets[0].class == Contrast
-targets
+targets.inject([]) { |t, c| t += c.words.flatten }.uniq
 else
 targets
 end
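The target-extraction change in this hunk can be tried in isolation. The sketch below mirrors the two added lines but uses a stand-in `DemoContrast` struct and a hypothetical `extract_targets` method, since the real `Cass::Contrast` class is not shown in this diff; the nested layout of `words` is likewise an assumption.

```ruby
# Stand-in for Cass::Contrast; only a `words` reader is assumed.
DemoContrast = Struct.new(:words)

# Hypothetical method mirroring the logic added to Document#initialize in 0.0.3.
def extract_targets(targets)
  # Wrap a lone contrast so the array-based branch below applies to it too.
  targets = [targets] if targets.class == DemoContrast
  if targets[0].class == DemoContrast
    # Collect every word from every contrast, dropping duplicates.
    targets.inject([]) { |t, c| t + c.words.flatten }.uniq
  else
    # Already a plain array of words.
    targets
  end
end

single = DemoContrast.new([%w[cake spinach], %w[good bad]])
p extract_targets(single)            # ["cake", "spinach", "good", "bad"]
p extract_targets(%w[cake spinach])  # ["cake", "spinach"]
```

The wrapping step is what lets a call such as `Document.new("cake_vs_spinach", contrast, text)` accept one contrast, an array of contrasts, or an array of words through the same code path.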
@@ -57,14 +58,14 @@ module Cass
 if opts['skip_preproc']
 @lines = (text.class == Array) ? @text : text.split(/[\r\n]+/)
 else
-puts "Converting to lowercase..." if VERBOSE
+puts "Converting to lowercase..." if defined?(VERBOSE) and VERBOSE
 @text.downcase! unless opts['keep_case']
 @text.gsub!(/[^a-z \n]+/, '') unless opts['keep_special']
 if opts.key?('recode')
-puts "Recoding words..." if VERBOSE
+puts "Recoding words..." if defined?(VERBOSE) and VERBOSE
 opts['recode'].each { |k,v| @text.gsub!(/(^|\s+)(#{k})($|\s+)/, "\\1#{v}\\3") }
 end
-puts "Parsing text..." if VERBOSE
+puts "Parsing text..." if defined?(VERBOSE) and VERBOSE
 @lines = opts['parse_text'] ? Parser.parse(@text, opts) : @text.split(/[\r\n]+/)
 @lines = @lines[0, opts['max_lines']] if opts['max_lines'] and opts['max_lines'] > 0
 trim!
@@ -74,12 +75,12 @@ module Cass
 # Trim internal list of lines, keeping only those that contain
 # at least one target word.
 def trim!
-puts "Deleting target-less lines..." if VERBOSE
+puts "Deleting target-less lines..." if defined?(VERBOSE) and VERBOSE
 ts = @targets.join("|")
 #@lines.delete_if { |s| (s.split(/\s+/) & @targets).empty? } # another way to do it
 nl = @lines.size
 @lines = @lines.grep(/(^|\s+)(#{ts})($|\s+)/)
-puts "Keeping #{@lines.size} / #{nl} lines." if VERBOSE
+puts "Keeping #{@lines.size} / #{nl} lines." if defined?(VERBOSE) and VERBOSE
 self
 end
 
@@ -131,7 +132,7 @@ module Cass
 # permute!
 # docs = [self]
 # n.times { |i|
-# puts "Generating bootstrap ##{i+1}..." if VERBOSE
+# puts "Generating bootstrap ##{i+1}..." if defined?(VERBOSE) and VERBOSE
 # d = self.clone
 # d.name = "#{@name}_boot_#{(i+1)}"
 # d.resample!
@@ -145,7 +146,7 @@ module Cass
 # Drop all words that aren't in target list or context. Store as an array of arrays,
 # with first element = array of targets and second = array of context words.
 def compact
-puts "Compacting all lines..." if VERBOSE
+puts "Compacting all lines..." if defined?(VERBOSE) and VERBOSE
 @clines = []
 @lines.each { |l|
 w = l.split(/\s+/).uniq
@@ -158,7 +159,7 @@ module Cass
 # Computes co-occurrence matrix between target words and the context.
 # Stores a target-by-context integer matrix internally.
 def cooccurrence(normalize_weights=false)
-# puts "Generating co-occurrence matrix..." if VERBOSE
+# puts "Generating co-occurrence matrix..." if defined?(VERBOSE) and VERBOSE
 coocc = NMatrix.float(@targets.size, @context.size)
 compact if @clines.nil?
 lc = 0 # line counter
@@ -214,6 +215,7 @@ module Cass
 def summary(filename=nil, list_context=false, word_count=false)
 
 buffer = []
+compact if @clines.nil?
 
 # Basic info that always gets shown
 buffer << "Summary for document '#{@name}':"
data/lib/cass/parser.rb
CHANGED
@@ -32,7 +32,7 @@ module Cass
 rx = opts.key?('parser_regex') ? opts['parser_regex'] : "[\r\n\.]+"
 text.split(/#{rx}/)
 else
-puts "Using the Stanford Parser to parse the text. Note that this could take a long time for large files!" if VERBOSE
+puts "Using the Stanford Parser to parse the text. Note that this could take a long time for large files!" if (defined?(VERBOSE) and VERBOSE)
 parser = StanfordParser::DocumentPreprocessor.new
 parser.getSentencesFromString(text)
 end
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: cass
 version: !ruby/object:Gem::Version
-hash:
+hash: 25
 prerelease: false
 segments:
 - 0
 - 0
--
-version: 0.0.
+- 3
+version: 0.0.3
 platform: ruby
 authors:
 - Tal Yarkoni
@@ -15,7 +15,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2010-
+date: 2010-11-10 00:00:00 -07:00
 default_executable:
 dependencies: []
 
@@ -43,7 +43,6 @@ files:
 - Manifest
 - README.rdoc
 - Rakefile
-- cass.gemspec
 - lib/cass.rb
 - lib/cass/analysis.rb
 - lib/cass/context.rb
@@ -52,6 +51,7 @@ files:
 - lib/cass/extensions.rb
 - lib/cass/parser.rb
 - lib/cass/stats.rb
+- cass.gemspec
 has_rdoc: true
 homepage: http://casstools.org
 licenses: []