cass 0.0.1 → 0.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/CHANGELOG +7 -1
- data/README.rdoc +23 -18
- data/Rakefile +6 -2
- data/cass.gemspec +2 -5
- data/lib/cass.rb +1 -1
- data/lib/cass/analysis.rb +45 -28
- data/lib/cass/context.rb +10 -1
- data/lib/cass/document.rb +6 -6
- data/lib/cass/parser.rb +1 -1
- metadata +6 -22
data/CHANGELOG
CHANGED

@@ -1 +1,7 @@
-v0.0.1. 6/12/2010 -- Initial version.
+v0.0.2. 08/04/2010 -- Fixed bugs and added additional output information.
+- No longer includes narray as a dependency upon installation due to presence of multiple narray gems. User must now manually install the correct version before running CASS.
+- Fixed minor syntax issues that prevented CASS from running on Ruby 1.8.6 (mostly to do with the splat operator).
+- Expanded output information; specifying verbose output will now produce more extensive information when CASS runs.
+- Fixed bug that caused CASS to try (unsuccessfully) to parse all text using the built-in Parser.
+
+v0.0.1. 6/12/2010 -- Initial version.
data/README.rdoc
CHANGED

@@ -2,6 +2,10 @@
 
 CASS (Contrast Analysis of Semantic Similarity) is a set of tools for conducting contrast-based analyses of semantic similarity in text. CASS is based on the BEAGLE model described by Jones and Mewhort (2007). For a more detailed explanation, see Holtzman et al (under review).
 
+== Version
+
+The current version of the tools is 0.0.1 (6/15/2010).
+
 == License
 
 Copyright 2010 Tal Yarkoni and Nick Holtzman. Licensed under the GPL license. See the included LICENSE file for details.

@@ -31,16 +35,16 @@ There are two general ways to use CASS:
 
 === The easy way
 
-For users without prior programming experience, CASS is streamlined to make things as user-friendly as possible. Assuming you've installed CASS per the instructions above, you can run a full CASS analysis just by tweaking a few settings and running the analysis script (run_cass.rb) included in the sample analysis package (cass_sample.zip[http://casstools.org/downloads/cass_sample.zip]). Detailed instructions are
+For users without prior programming experience, CASS is streamlined to make things as user-friendly as possible. Assuming you've installed CASS per the instructions above, you can run a full CASS analysis just by tweaking a few settings and running the analysis script (run_cass.rb) included in the sample analysis package (cass_sample.zip[http://casstools.org/downloads/cass_sample.zip]). Detailed instructions are forthcoming in a more detailed manual; in brief, here's what you need to do to get up and running:
 
 1. Download cass_sample.zip[http://casstools.org/downloads/cass_sample.zip] and unpack it somewhere. The package contains several files that tell CASS what to do, as well as some sample text you can process. These include:
-- contrasts.txt: a specification of the contrasts you'd like to run on the text, one per line. For
-- default.spec: the main specification file containing all the key settings CASS needs in order to run. All settings are commented
+- contrasts.txt: a specification of the contrasts you'd like to run on the text, one per line. For an explanation of the format the contrasts are in, see the next section.
+- default.spec: the main specification file containing all the key settings CASS needs in order to run. All settings are commented in detail in the file itself. You can create as many .spec files as you like (no need to edit this one repeatedly!), just make sure to edit run_cass.rb to indicate which .spec file to use.
 - stopwords.txt: A sample list of stopwords to exclude from analysis (CASS will use this file by default). These are mostly high-frequency function words that carry little meaning but can strongly bias a text.
 - sample1.txt and sample2.txt: two sample documents to get you started.
 - run_cass.rb: the script you'll use to run CASS.
 
-2. Edit the settings as you please. An explanation of what everything means is in the
+2. Edit the settings as you please. An explanation of what everything means is in the .spec file itself; if you're just getting started, you can just leave everything as is and run the sample analysis.
 
 3. Run run_cass.rb. If you're on Windows, you may be able to double-click on the script to run it; however, if you do that, you won't see any of the output. On most platforms (and optionally on Windows), you'll have to run the script from the command prompt. You can do this by opening up a terminal window (or, in Windows, a command prompt), navigating to the directory that contains the sample analysis files, and typing:
 

@@ -48,7 +52,7 @@ For users without prior programming experience, CASS is streamlined to make thin
 
 After doing that, you should get a bunch of output showing you exactly what's going on. There should also be some new files in the working directory containing the results of the analysis.
 
-Assuming the analysis ran successfully, you can now set about running your own analyses.
+Assuming the analysis ran successfully, you can now set about running your own analyses. 
 
 === As a library
 

@@ -82,14 +86,14 @@ If we want to see some information about the contents of our document, we can ty
 
 And that prints something like this to our screen:
 
-  > Summary for document '
+  > Summary for document 'cake_vs_spinach':
   > 4 target words (cake, spinach, good, bad)
   > 35 words in context.
   > Using 21 lines (containing at least one target word) for analysis.
 
 Nothing too fancy, just basic descriptive information. The summary method has some additional arguments we could use to get more detailed information (e.g., word_count, list_context, etc.), but we'll skip those for now.
 
-Now if we want to compute the interaction term for our contrast (i.e., the difference of differences, reflecting the equation (cake.good -
+Now if we want to compute the interaction term for our contrast (i.e., the difference of differences, reflecting the equation (cake.good - cake.bad) - (spinach.good - spinach.bad)), all we have to do is:
 
   contrast.apply(doc)
 

@@ -97,29 +101,30 @@ And we get back something that looks like this:
 
   0.5117 0.4039 0.3256 0.4511 0.2333
 
-Where the first four values represent the similarity between the 4 pairs of words used to generate the interaction term (e.g., the first value reflects the correlation between 'cake' and 'good', the second between '
+Where the first four values represent the similarity between the 4 pairs of words used to generate the interaction term (e.g., the first value reflects the correlation between 'cake' and 'good', the second between 'cake' and 'bad', and so on), and the fifth is the interaction term. So in this case, the result (0.23) tells us that there's a positive bias in the text, such that cake is semantically more closely related to good (relative to bad) than spinach is. Hypothesis confirmed!
 
-Well, sort of. By itself, the number 0.23 doesn't mean very much. We don't know what the standard error is, so we have no idea whether 0.23 is a very large number or a very small one that might occur pretty often just by chance. Fortunately, we can
+Well, sort of. By itself, the number 0.23 doesn't mean very much. We don't know what the standard error is, so we have no idea whether 0.23 is a very large number or a very small one that might occur pretty often just by chance. Fortunately, we can get some p values by bootstrapping a distribution around our observed value. First, we generate the distribution:
 
   Analysis.bootstrap_test(doc, contrasts, "speech_results.txt", 1000)
 
 Here we call the bootstrap_test method, feeding it the document we want to analyze, the Contrasts we want to apply, the filename root we want to use, and the number of iterations we want to run (generally, as many as is computationally viable). The results will be saved to a plaintext file with the specified name, and we can peruse that file at our leisure. If we open it up, the first few lines look like this:
 
+  contrast result_id doc_name pair_1 pair_2 pair_3 pair_4 interaction_term
   cake.spinach.good.bad observed cake.txt 0.5117 0.4039 0.3256 0.4511 0.2333
-  cake.spinach.good.bad boot_1 cake.txt 0.
-  cake.spinach.good.bad boot_2 cake.txt 0.
-  cake.spinach.good.bad boot_3 cake.txt 0.
-  cake.spinach.good.bad boot_4 cake.txt 0.
+  cake.spinach.good.bad boot_1 cake.txt 0.5146 0.4585 0.1885 0.45 0.3176
+  cake.spinach.good.bad boot_2 cake.txt 0.4606 0.4421 0.2984 0.4563 0.1764
+  cake.spinach.good.bad boot_3 cake.txt 0.4215 0.438 0.0694 0.5695 0.4836
+  cake.spinach.good.bad boot_4 cake.txt 0.5734 0.353 0.2094 0.5013 0.5123
   ...
 
 The columns tell us, respectively, what file the results came from, the bootstrap iteration (the first line shows us the actual, or observed value), and the observed interaction terms. Given this information, we can now compare the bootstrapped distribution to zero to test our hypothesis. We do that like this:
 
   Analysis.p_values("speech_results.txt", 'boot')
 
-...where the first argument specifies the full path to the file containing the bootstrap results we want to summarize, and the second argument indicates the type of test that was conducted (either 'boot' or 'perm'). The results will be written to a file named
+...where the first argument specifies the full path to the file containing the bootstrap results we want to summarize, and the second argument indicates the type of test that was conducted (either 'boot' or 'perm'). The results will be written to a file named speech_results_p_values.txt. If we open that document up, we see this:
 
-  file contrast N value p-value
-  cake.txt cake.spinach.good.bad 1000 0.2333 0.0
-  cake.txt mean 1000 0.2333 0.0
+  file contrast N value p-value
+  cake.txt cake.spinach.good.bad 1000 0.2333 0.0
+  cake.txt mean 1000 0.2333 0.0
 
-As you can see, the last column (p-value) reads 0.0, which is to say, none of the 1,000 iterations we ran had a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other much more interesting questions about the relation between different terms in semantic space. So that's the end of this overview; to learn more about the other functionality in CASS,
+As you can see, the last column (p-value) reads 0.0, which is to say, none of the 1,000 iterations we ran had a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other much more interesting questions about the relation between different terms in semantic space. So that's the end of this overview; to learn more about the other functionality in CASS, you can surf around this RDoc, or just experiment with the software. Eventually, there will be a more comprehensive manual; in the meantime, if you have questions about usage, email[mailto:nick.holtzman@gmail.com] Nick Holtzman, and if you have technical questions about the Ruby code, email[mailto:tyarkoni@gmail.com] Tal Yarkoni.
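The interaction term walked through in the README excerpt above is plain arithmetic on the four pair similarities, so it can be checked without CASS installed. A minimal standalone sketch in Ruby, using the sample values from the diff (the hash keys are illustrative labels, not CASS API):

```ruby
# Interaction term for the contrast cake.spinach.good.bad, i.e.
# (cake.good - cake.bad) - (spinach.good - spinach.bad), using the four
# pair similarities shown in the README's sample output.
pairs = {
  'cake.good'    => 0.5117,
  'cake.bad'     => 0.4039,
  'spinach.good' => 0.3256,
  'spinach.bad'  => 0.4511
}

interaction = (pairs['cake.good'] - pairs['cake.bad']) -
              (pairs['spinach.good'] - pairs['spinach.bad'])

puts interaction.round(4)  # 0.2333, the fifth value in the sample output
```

A positive value means the first word pair ('cake'/'good') is more similar, relative to its counterpart, than the second pair is, which is exactly how the README interprets 0.23.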
data/Rakefile
CHANGED

@@ -2,11 +2,15 @@ require 'rubygems'
 require 'rake'
 require 'echoe'
 
-Echoe.new("cass", "0.0.1") { |p|
+Echoe.new("cass", "0.0.2") { |p|
   p.author = "Tal Yarkoni"
   p.email = "tyarkoni@gmail.com"
   p.summary = "A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS)."
   p.url = "http://casstools.org"
   p.docs_host = "http://casstools.org/doc/"
-
+  # Currently we don't impose a run-time dependency because narray has different
+  # gems for < 1.9 and >= 1.9. Easier just to tell the user to install
+  # the right one.
+  # p.runtime_dependencies = ['narray >=0.5.9.7']
+
 }
data/cass.gemspec
CHANGED

@@ -2,11 +2,11 @@
 
 Gem::Specification.new do |s|
   s.name = %q{cass}
-  s.version = "0.0.1"
+  s.version = "0.0.2"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 1.2") if s.respond_to? :required_rubygems_version=
   s.authors = ["Tal Yarkoni"]
-  s.date = %q{2010-
+  s.date = %q{2010-08-04}
   s.description = %q{A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS).}
   s.email = %q{tyarkoni@gmail.com}
   s.extra_rdoc_files = ["CHANGELOG", "LICENSE", "README.rdoc", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb"]

@@ -23,11 +23,8 @@ Gem::Specification.new do |s|
   s.specification_version = 3
 
   if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
-    s.add_runtime_dependency(%q<narray>, [">= 0.5.9.7"])
   else
-    s.add_dependency(%q<narray>, [">= 0.5.9.7"])
   end
 else
-    s.add_dependency(%q<narray>, [">= 0.5.9.7"])
 end
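The gemspec's branch on `Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0')` depends on proper version ordering rather than string comparison. A quick standalone sketch of why that matters (the version numbers are arbitrary examples):

```ruby
require 'rubygems'

# Gem::Version compares numerically, segment by segment.
gem_ok    = Gem::Version.new('1.10.0') >= Gem::Version.new('1.9.0')  # true: 10 > 9
# A plain string comparison gets the same question wrong,
# because the character '1' sorts before '9'.
string_ok = '1.10.0' >= '1.9.0'                                      # false

puts gem_ok
puts string_ok
```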
data/lib/cass.rb
CHANGED
data/lib/cass/analysis.rb
CHANGED

@@ -1,10 +1,9 @@
 module Cass
 
-  #
-  #
-  #
-  #
-  # supported.
+  # Various methods used to conduct analyses on one or more Documents.
+  # The primary processing stream is run_spec, which is essentially
+  # a wrapper arouond the other methods for conducting one and
+  # two-sample tests.
   class Analysis
 
     attr_accessor :docs, :contexts, :targets

@@ -17,26 +16,34 @@ module Cass
       abort("Error: can't find spec file (#{spec_file}).") if !File.exist?(spec_file)
       load spec_file
       abort("Error: can't find contrast file (#{CONTRAST_FILE}).") if !File.exist?(CONTRAST_FILE)
+
+      # Create options hash
+      opts = {}
+      # Ruby 1.9 returns constants as symbols, 1.8.6 uses strings, so standardize
+      consts = Module.constants.map { |c| c.to_s }
+      %w[PARSE_TEXT N_PERM N_BOOT MAX_LINES RECODE CONTEXT_SIZE MIN_PROP STOP_FILE NORMALIZE_WEIGHTS VERBOSE].each { |c|
+        opts[c.downcase] = Module.const_get(c) if consts.include?(c)
+      }
+
+      if VERBOSE
+        puts "\nRunning CASS with the following options:"
+        opts.each { |k,v| puts "\t#{k}: #{v}" }
+      end
+
       contrasts = parse_contrasts(CONTRAST_FILE)
 
       # Create contrasts
-      puts "
+      puts "\nFound #{contrasts.size} contrasts." if VERBOSE
 
       # Set targets
       targets = contrasts.inject([]) { |t, c| t += c.words.flatten }.uniq
-      puts "
-
-      # Create options hash
-      opts = {}
-      %w[PARSE_TEXT N_PERM N_BOOT MAX_LINES RECODE CONTEXT_SIZE MIN_PROP STOP_FILE NORMALIZE_WEIGHTS].each { |c|
-        opts[c.downcase] = Module.const_get(c) if Module.constants.include?(c)
-      }
+      puts "\nFound #{targets.size} target words." if VERBOSE
 
       # Read in files and create documents
       docs = []
       FILES.each { |f|
         abort("Error: can't find input file #{f}.") if !File.exist?(f)
-        puts "
+        puts "\nReading in file #{f}..."
         text = File.new(f).read
         docs << Document.new(f.split(/\//)[-1], targets, text, opts)
       }

@@ -55,15 +62,15 @@ module Cass
         base = File.basename(d.name, '.txt')
         puts "\nRunning one-sample analysis on document '#{d.name}'."
         puts "Generating #{n_perm} bootstraps..." if VERBOSE and STATS
-        bootstrap_test(d, contrasts, "#{OUTPUT_ROOT}_#{base}_results.txt", n_perm)
+        bootstrap_test(d, contrasts, "#{OUTPUT_ROOT}_#{base}_results.txt", n_perm, opts)
         p_values("#{OUTPUT_ROOT}_#{base}_results.txt", 'boot', true) if STATS
       }
 
     when 2
-      abort("Error: in order to run a permutation test, you need to pass exactly two files as input.") if FILES.size != 2
+      abort("Error: in order to run a permutation test, you need to pass exactly two files as input.") if FILES.size != 2 or docs.size != 2
       puts "Running two-sample comparison between '#{File.basename(FILES[0])}' and '#{File.basename(FILES[1])}'." if VERBOSE
       puts "Generating #{n_perm} permutations..." if VERBOSE and STATS
-      permutation_test(
+      permutation_test(docs[0], docs[1], contrasts, "#{OUTPUT_ROOT}_results.txt", n_perm, opts)
       p_values("#{OUTPUT_ROOT}_results.txt", 'perm', true)
 
     # No other test types implemented for now.

@@ -84,8 +91,13 @@ module Cass
     # * contrasts: an array of Contrasts used to compare the documents
     # * output_file: name of output file
     # * n_perm: number of permutations to run
-
+    # * opts: an optional hash of additional settings. Currently, only
+    #   'verbose' and 'normalize_weights' apply here.
+    def self.permutation_test(doc1, doc2, contrasts, output_file, n_perm, opts={})
 
+      # Merge options with defaults
+      opts = {'verbose'=>true, 'normalize_weights'=>false }.merge(opts)
+
       # Merge contexts. Could change this later to allow different contexts for each
       # document, but that would make processing substantially slower.
       context = doc1.context

@@ -94,8 +106,8 @@ module Cass
       doc1.context, doc2.context = context, context
 
       # Generate cooccurrence matrices and get observed difference.
-      doc1.cooccurrence(
-      doc2.cooccurrence(
+      doc1.cooccurrence(opts['normalize_weights'])
+      doc2.cooccurrence(opts['normalize_weights'])
 
       outf = File.new(output_file,'w')
       outf.puts "contrast\titeration\t#{doc1.name}\t#{doc2.name}\tdifference"

@@ -108,10 +120,10 @@ module Cass
       # Run permutations and save results
       d1, d2 = doc1.clone, doc2.clone
       n_perm.times { |i|
-        puts "
+        puts "Running permutation #{i+1}..." if opts['verbose']
         d1.clines, d2.clines = permute_labels(doc1.clines, doc2.clines)
-        d1.cooccurrence(
-        d2.cooccurrence(
+        d1.cooccurrence(opts['normalize_weights'])
+        d2.cooccurrence(opts['normalize_weights'])
         contrasts.each { |c|
           res1, res2, diff = compare_docs(c, d1, d2)
           outf.puts "#{c.words.join(".")}\tperm_#{i+1}\t#{res1}\t#{res2}\t#{diff}"

@@ -124,23 +136,28 @@ module Cass
     # * contrasts: an array of Contrast objects to apply
     # * output_file: name of output file
     # * n_boot: number of bootstrap iterations to run
-
-
+    # * opts: an optional hash of additional settings. Currently, only
+    #   'verbose' and 'normalize_weights' apply here.
+    def self.bootstrap_test(doc, contrasts, output_file, n_boot, opts={})
+
+      # Merge options with defaults
+      opts = {'verbose'=>true, 'normalize_weights'=>false }.merge(opts)
+
       outf = File.new(output_file,'w')
       outf.puts(%w[contrast result_id doc_name pair_1 pair_2 pair_3 pair_4 interaction_term].join("\t"))
       outf.sync = true
 
-      doc.cooccurrence(
+      doc.cooccurrence(opts['normalize_weights'])
       contrasts.each { |c|
        observed = c.apply(doc)
        outf.puts "#{c.words.join(".")}\tobserved\t#{observed}"
      }
      d1 = doc.clone
      n_boot.times { |i|
-       puts "
+       puts "Running bootstrap iteration #{i+1}..." if opts['verbose']
        d1.clines = doc.resample(clines=true)
        # d1.context = Context.new(d1) # Currently uses the same context; can uncomment
-       d1.cooccurrence(
+       d1.cooccurrence(opts['normalize_weights'])
        contrasts.each { |c|
          res = c.apply(d1)
          outf.puts "#{c.words.join(".")}\tboot_#{i+1}\t#{res}"
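The `Module.constants.map { |c| c.to_s }` change above is the heart of the cross-version fix mentioned in the CHANGELOG: Ruby 1.8 reports constant names as strings while 1.9 reports them as symbols, so the old `Module.constants.include?(c)` check against string names fails on one of the two. A standalone sketch of the pattern (`SpecSettings` is an illustrative stand-in for the constants a .spec file would define, not part of CASS):

```ruby
# Stand-in for settings that run_spec would load from a .spec file.
module SpecSettings
  N_BOOT  = 1000
  VERBOSE = true
end

opts = {}
# Standardize constant names to strings so the include? check works
# regardless of whether the interpreter returns strings or symbols.
consts = SpecSettings.constants.map { |c| c.to_s }
%w[N_BOOT VERBOSE MAX_LINES].each do |c|
  # MAX_LINES is not defined above, so it is simply skipped.
  opts[c.downcase] = SpecSettings.const_get(c) if consts.include?(c)
end

puts opts.inspect  # only n_boot and verbose make it into the hash
```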
data/lib/cass/context.rb
CHANGED

@@ -9,6 +9,7 @@ module Cass
       min_prop = opts['min_prop'] || 0
       max_prop = opts['max_prop'] || 1
       puts "Creating new context..." if VERBOSE
+      puts "Using all words with token frequency in range of #{min_prop} and #{max_prop}."
       words = doc.lines.join(' ').split(/\s+/)
       nwords = words.size
       puts "Found #{nwords} words."

@@ -21,7 +22,15 @@ module Cass
         words.uniq!
       end
       # words = words - doc.targets
-
+      if opts.key?('stop_file') and !opts['stop_file'].empty?
+        begin
+          stopwords = File.new(opts['stop_file']).read.split(/\s+/)
+        rescue
+          abort("Error: could not open stopword file #{opts['stop_file']}!")
+        end
+        puts "Removing #{stopwords.size} stopwords from context." if VERBOSE
+        words -= stopwords
+      end
       @words = opts.key?('context_size') ? words.sort_by{rand}[0, opts['context_size']] : words
       index_words
       puts "Using #{@words.size} words as context." if VERBOSE
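The new stopword handling in Context boils down to array subtraction on the uniq'd word list. A minimal standalone sketch (the word lists here are made up; the real code reads stopwords from the file named by opts['stop_file']):

```ruby
words     = %w[the cake is good the spinach is bad]
stopwords = %w[the is a an of]

words.uniq!          # Context first collapses duplicate tokens...
words -= stopwords   # ...then drops anything on the stopword list

puts words.inspect  # ["cake", "good", "spinach", "bad"]
```

This is why stopwords.txt matters: high-frequency function words like "the" and "is" would otherwise dominate the context.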
data/lib/cass/document.rb
CHANGED

@@ -54,19 +54,19 @@ module Cass
     # skip certain preprocessing steps, on the assumption that
     # these have already been performed.
     def parse(opts={})
-      if opts
+      if opts['skip_preproc']
         @lines = (text.class == Array) ? @text : text.split(/[\r\n]+/)
       else
         puts "Converting to lowercase..." if VERBOSE
-        @text.downcase! unless opts
-        @text.gsub!(/[^a-z \n]+/, '') unless opts
+        @text.downcase! unless opts['keep_case']
+        @text.gsub!(/[^a-z \n]+/, '') unless opts['keep_special']
         if opts.key?('recode')
           puts "Recoding words..." if VERBOSE
           opts['recode'].each { |k,v| @text.gsub!(/(^|\s+)(#{k})($|\s+)/, "\\1#{v}\\3") }
         end
         puts "Parsing text..." if VERBOSE
-        @lines = opts
-        @lines = @lines[0, opts['max_lines']] if opts
+        @lines = opts['parse_text'] ? Parser.parse(@text, opts) : @text.split(/[\r\n]+/)
+        @lines = @lines[0, opts['max_lines']] if opts['max_lines'] and opts['max_lines'] > 0
         trim!
       end
     end

@@ -217,7 +217,7 @@ module Cass
 
       # Basic info that always gets shown
       buffer << "Summary for document '#{@name}':"
-      buffer << "#{@targets.size} target words (#{@targets.join
+      buffer << "#{@targets.size} target words (#{@targets.join(", ")})"
       buffer << "#{@context.words.size} words in context."
       buffer << "Using #{@clines.size} lines (containing at least one target word) for analysis."
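With none of 'skip_preproc', 'keep_case', 'keep_special', or 'parse_text' set, Document#parse reduces to three steps: lowercase, strip everything except letters, spaces, and newlines, then split into lines. A standalone sketch on made-up text:

```ruby
text = "I LOVE cake!\nSpinach? Not so much."

text  = text.downcase                # @text.downcase! unless opts['keep_case']
text  = text.gsub(/[^a-z \n]+/, '')  # strip punctuation and digits
lines = text.split(/[\r\n]+/)        # one analysis line per line of text

puts lines.inspect  # ["i love cake", "spinach not so much"]
```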
data/lib/cass/parser.rb
CHANGED

@@ -27,7 +27,7 @@ module Cass
       spfail = true
     end
 
-    if spfail or opts
+    if spfail or opts['parser_basic'] == true
       puts "Using a basic parser to split text into sentences. Note that this is intended as a last resort only; you are strongly encouraged to process all input texts yourself and make sure that lines are broken up the way you want them to be (with each line on a new line of text in the file). If you use this parser, we make no guarantees about the quality of the output."
       rx = opts.key?('parser_regex') ? opts['parser_regex'] : "[\r\n\.]+"
       text.split(/#{rx}/)
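The fallback parser's behavior is just a regex split on the default pattern shown above (overridable via opts['parser_regex']). A quick standalone illustration of why the code warns that this is a last resort (the sample sentence is made up):

```ruby
rx   = "[\r\n\.]+"  # the default parser_regex from the diff above
text = "Dr. Smith likes cake. So do I.\nSpinach, not so much."

lines = text.split(/#{rx}/)
puts lines.inspect
# Splitting on every period also splits the abbreviation "Dr.",
# which is exactly the kind of error the warning is about.
```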
metadata
CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: cass
 version: !ruby/object:Gem::Version
-  hash:
+  hash: 27
   prerelease: false
   segments:
   - 0
   - 0
-  - 1
-  version: 0.0.1
+  - 2
+  version: 0.0.2
 platform: ruby
 authors:
 - Tal Yarkoni

@@ -15,26 +15,10 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2010-
+date: 2010-08-04 00:00:00 -06:00
 default_executable:
-dependencies:
-
-  name: narray
-  prerelease: false
-  requirement: &id001 !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 77
-        segments:
-        - 0
-        - 5
-        - 9
-        - 7
-        version: 0.5.9.7
-  type: :runtime
-  version_requirements: *id001
+dependencies: []
+
 description: A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS).
 email: tyarkoni@gmail.com
 executables: []