cass 0.0.1

data/Manifest ADDED
@@ -0,0 +1,14 @@
+ CHANGELOG
+ LICENSE
+ Manifest
+ README.rdoc
+ Rakefile
+ cass.gemspec
+ lib/cass.rb
+ lib/cass/analysis.rb
+ lib/cass/context.rb
+ lib/cass/contrast.rb
+ lib/cass/document.rb
+ lib/cass/extensions.rb
+ lib/cass/parser.rb
+ lib/cass/stats.rb
data/README.rdoc ADDED
@@ -0,0 +1,125 @@
+ = Getting started with the CASS tools
+
+ CASS (Contrast Analysis of Semantic Similarity) is a set of tools for conducting contrast-based analyses of semantic similarity in text. CASS is based on the BEAGLE model described by Jones and Mewhort (2007). For a more detailed explanation, see Holtzman et al. (under review).
+
+ == License
+
+ Copyright 2010 Tal Yarkoni and Nick Holtzman. Licensed under the GPL. See the included LICENSE file for details.
+
+ == Installation
+
+ The CASS tools are packaged as a library for the Ruby programming language. You must have a Ruby interpreter installed on your system, as well as the NArray library. To install, follow these steps:
+
+ (1) <b>Install Ruby</b>--preferably 1.9 or greater. Installers for most platforms are available here[http://www.ruby-lang.org/en/downloads/].
+
+ (2) <b>Install the NArray[http://narray.rubyforge.org] library</b>. On most platforms, you should be able to just type:
+
+   gem install narray
+
+ On Windows, installation is a bit more involved; follow the instructions here[http://narray.rubyforge.org].
+
+ (3) <b>Install the CASS gem</b> from the command prompt, like so:
+
+   gem install cass
+
+ (4) <b>Download the sample analysis files</b> (cass_sample.zip[http://casstools.org/downloads/cass_sample.zip]), which will help you get started working with CASS. Unpack the file anywhere you like, and you should be ready to roll.
+
+
+ == Usage
+
+ There are two general ways to use CASS:
+
+ === The easy way
+
+ For users without prior programming experience, CASS is streamlined to make things as user-friendly as possible. Assuming you've installed CASS per the instructions above, you can run a full CASS analysis just by tweaking a few settings and running the analysis script (run_cass.rb) included in the sample analysis package (cass_sample.zip[http://casstools.org/downloads/cass_sample.zip]). Detailed instructions are available in [this] tutorial; in brief, here's what you need to do to get up and running:
+
+ 1. Download cass_sample.zip[http://casstools.org/downloads/cass_sample.zip] and unpack it somewhere. The package contains several files that tell CASS what to do, as well as some sample text you can process. These include:
+   - contrasts.txt: a specification of the contrasts you'd like to run on the text, one per line. For a detailed explanation of the format, see the [tutorial].
+   - default.spec: the main specification file containing all the key settings CASS needs in order to run. All settings are commented, but a more detailed explanation is provided in the [tutorial]. You can create as many .spec files as you like (there's no need to edit this one repeatedly); just make sure to edit run_cass.rb to indicate which .spec file to use.
+   - stopwords.txt: a sample list of stopwords to exclude from analysis (CASS will use this file by default). These are mostly high-frequency function words that carry little meaning but can strongly bias a text.
+   - sample1.txt and sample2.txt: two sample documents to get you started.
+   - run_cass.rb: the script you'll use to run CASS.
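As a concrete illustration, contrasts.txt is plain text with one four-word contrast per line. The first line below matches the cake/spinach example used later in this README; the second is a hypothetical additional contrast:

```
cake spinach good bad
summer winter pleasant unpleasant
```

Each line is read as two ordered pairs (here, cake/spinach and good/bad).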
+
+ 2. Edit the settings as you please. An explanation of what everything means is in the [tutorial]; if you're just getting started, you can leave everything as is and run the sample analysis.
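A .spec file is plain Ruby that defines constants, which the analysis code reads after loading the file. The sketch below is hypothetical: the constant names are the ones referenced by the analysis code (FILES, CONTRAST_FILE, TEST_TYPE, STATS, N_PERM, OUTPUT_ROOT, VERBOSE, NORMALIZE_WEIGHTS), but the values are illustrative only; consult default.spec for the real settings and the full list of options.

```ruby
# Hypothetical .spec file: plain Ruby constant assignments.
# Names come from the constants the analysis code references;
# values are illustrative only -- see default.spec for the real settings.
FILES         = ['sample1.txt', 'sample2.txt'] # input documents
CONTRAST_FILE = 'contrasts.txt'                # one contrast per line
OUTPUT_ROOT   = 'my_analysis'                  # prefix for result files
TEST_TYPE     = 1                              # 1 = one-sample bootstrap, 2 = two-sample permutation
STATS         = true                           # compute p values?
N_PERM        = 1000                           # iterations for the resampling test
VERBOSE       = true                           # print progress to the console
NORMALIZE_WEIGHTS = false                      # normalize co-occurrence weights
```

run_cass.rb then names whichever .spec file you want to use for a given run.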
+
+ 3. Run run_cass.rb. If you're on Windows, you may be able to double-click the script to run it; if you do, however, you won't see any of the output. On most platforms (and optionally on Windows), you'll have to run the script from the command prompt. You can do this by opening a terminal window (or, in Windows, a command prompt), navigating to the directory that contains the sample analysis files, and typing:
+
+   ruby run_cass.rb
+
+ After doing that, you should get a bunch of output showing you exactly what's going on. There should also be some new files in the working directory containing the results of the analysis.
+
+ Assuming the analysis ran successfully, you can now set about running your own analyses. We recommend reading the entire [tutorial] before diving in.
+
+ === As a library
+
+ Advanced users familiar with Ruby or other programming languages will probably want to use CASS as a library. Assuming you've installed CASS as a gem (see above), running a basic analysis with CASS is straightforward. First, we require the gem:
+
+   require 'cass'
+
+ We don't want the inconvenience of having to call all the methods through the Cass module (e.g., Cass::Contrast.new, Cass::Document.new, etc.), so let's go ahead and include the contents of the module in the namespace:
+
+   include Cass
+
+ Now we can start running analyses. Let's say we have a text file containing transcribed conversations of people discussing foods they like and dislike (e.g., cake.txt in the sample analysis package[http://casstools.org/downloads/cass_sample.zip]). Suppose we're particularly interested in two foods: cake and spinach. Our goal is to test the hypothesis that people prefer cake to spinach. Operationally, we're going to do that by examining the relative distances from 'cake' and 'spinach' to the terms 'good' and 'bad' in semantic space.
+
+ The first thing to do is set up the right contrasts. In this case, we'll create a single contrast comparing the distance between cake and spinach with respect to good and bad:
+
+   contrast = Contrast.new("cake spinach good bad")
+
+ CASS interprets a string of four words as two ordered pairs: 'cake' and 'spinach' form one pair, 'good' and 'bad' the other (we could, equivalently, initialize the contrast by passing the 4-element array ['cake', 'spinach', 'good', 'bad']).
+
+ Next, we read the file containing the transcripts:
+
+   text = File.new("cake.txt").read
+
+ Then we can create a corresponding Document. We initialize the Document object by passing a descriptive name, the contrasts we want to run, and the full text we want to analyze:
+
+   doc = Document.new("cake_vs_spinach", contrast, text)
+
+ If we want to see some information about the contents of our document, we can type:
+
+   doc.summary
+
+ That prints something like this to our screen:
+
+   > Summary for document 'cake.txt':
+   > 4 target words (cake, spinach, good, bad)
+   > 35 words in context.
+   > Using 21 lines (containing at least one target word) for analysis.
+
+ Nothing too fancy; just basic descriptive information. The summary method has some additional arguments we could use to get more detailed information (e.g., word_count, list_context, etc.), but we'll skip those for now.
+
+ Now, if we want to compute the interaction term for our contrast (i.e., the difference of differences, reflecting the equation (cake.good - spinach.bad) - (cake.bad - spinach.good)), all we have to do is:
+
+   contrast.apply(doc)
+
+ And we get back something that looks like this:
+
+   0.5117 0.4039 0.3256 0.4511 0.2333
+
+ The first four values represent the similarities between the four pairs of words used to generate the interaction term (e.g., the first value reflects the correlation between 'cake' and 'good', the second between 'spinach' and 'bad', and so on), and the fifth is the interaction term. So in this case, the result (0.23) tells us that there's a positive bias in the text, such that cake is semantically more closely related to good (relative to bad) than spinach is. Hypothesis confirmed!
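If you want to check the arithmetic behind that fifth value by hand, the interaction term is just the difference of differences over the four pairwise similarities. In plain Ruby, using the numbers from the output above:

```ruby
# Interaction term = (cake.good - spinach.bad) - (cake.bad - spinach.good),
# computed from the four pairwise similarities in the example output.
cake_good, spinach_bad, cake_bad, spinach_good = 0.5117, 0.4039, 0.3256, 0.4511

interaction = (cake_good - spinach_bad) - (cake_bad - spinach_good)
puts interaction.round(4) # => 0.2333
```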
+
+ Well, sort of. By itself, the number 0.23 doesn't mean very much. We don't know what the standard error is, so we have no idea whether 0.23 is a very large number or a very small one that might occur pretty often just by chance. Fortunately, we can generate some bootstrapped p values quite easily. First, we generate the bootstraps:
+
+   Analysis.bootstrap_test(doc, [contrast], "speech_results.txt", 1000)
+
+ Here we call the bootstrap_test method, feeding it the document we want to analyze, an array of the Contrasts we want to apply, the name of the output file, and the number of iterations we want to run (generally, as many as is computationally viable). The results will be saved to a plain-text file with the specified name, and we can peruse that file at our leisure. If we open it up, the first few lines look like this:
+
+   cake.spinach.good.bad observed cake.txt 0.5117 0.4039 0.3256 0.4511 0.2333
+   cake.spinach.good.bad boot_1 cake.txt 0.4118 0.2569 0.1481 0.4086 0.4154
+   cake.spinach.good.bad boot_2 cake.txt 0.5321 0.4353 0.4349 0.5396 0.2015
+   cake.spinach.good.bad boot_3 cake.txt 0.51 0.4216 0.3043 0.6923 0.4764
+   cake.spinach.good.bad boot_4 cake.txt 0.6222 0.3452 0.238 0.249 0.288
+   ...
+
+ The columns tell us, respectively, which contrast was run, the bootstrap iteration (the first line shows the actual, or observed, value), which file the results came from, and the four pairwise similarities followed by the interaction term. Given this information, we can now compare the bootstrapped distribution to zero to test our hypothesis. We do that like this:
+
+   Analysis.p_values("speech_results.txt", 'boot')
+
+ ...where the first argument specifies the full path to the file containing the bootstrap results we want to summarize, and the second argument indicates the type of test that was conducted (either 'boot' or 'perm'). The results will be written to a file named speech_results_p_values.txt. If we open that document up, we see this:
+
+   file contrast N value p-value
+   cake.txt cake.spinach.good.bad 1000 0.2333 0.0
+   cake.txt mean 1000 0.2333 0.0
+
+ As you can see, the last column (p-value) reads 0.0, which is to say that all 1,000 bootstrap iterations produced a value greater than 0. So we can reject the null hypothesis of zero effect at p < .001 in this case. Put differently, it's exceedingly unlikely that we would get this result (people having a positive bias towards cake relative to spinach) just by chance. Of course, that's a contrived example that won't surprise anyone. But the point is that you can use the CASS tools in a similar way to ask other, much more interesting questions about the relations between different terms in semantic space. That's the end of this overview; to learn more about the other functionality in CASS, read the [tutorial] document, or just surf around this RDoc.
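For the curious, the two-tailed p value in that last column follows the counting logic Analysis.p_values uses in 'boot' mode: count the bootstrap values above zero, fold the proportion onto the smaller tail, and double it. A minimal standalone sketch (the ten-value distribution here is made up for illustration):

```ruby
# Two-tailed bootstrap p value: count iterations above zero, fold the
# proportion to the smaller tail, then double it. Mirrors the 'boot'
# branch of Analysis.p_values; the sample distribution is invented.
def boot_p_value(dist)
  gt = dist.count { |e| e > 0 }
  p  = gt.to_f / dist.size
  p  = 1 - p if p > 0.5
  p * 2
end

dist = [0.42, 0.20, 0.48, 0.29, -0.03, 0.31, 0.15, 0.09, 0.27, 0.38]
puts boot_p_value(dist).round(2) # one of ten values fell below zero => 0.2
```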
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ require 'rubygems'
+ require 'rake'
+ require 'echoe'
+
+ Echoe.new("cass", "0.0.1") { |p|
+   p.author = "Tal Yarkoni"
+   p.email = "tyarkoni@gmail.com"
+   p.summary = "A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS)."
+   p.url = "http://casstools.org"
+   p.docs_host = "http://casstools.org/doc/"
+   p.runtime_dependencies = ['narray >=0.5.9.7']
+ }
data/cass.gemspec ADDED
@@ -0,0 +1,33 @@
+ # -*- encoding: utf-8 -*-
+
+ Gem::Specification.new do |s|
+   s.name = %q{cass}
+   s.version = "0.0.1"
+
+   s.required_rubygems_version = Gem::Requirement.new(">= 1.2") if s.respond_to? :required_rubygems_version=
+   s.authors = ["Tal Yarkoni"]
+   s.date = %q{2010-06-15}
+   s.description = %q{A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS).}
+   s.email = %q{tyarkoni@gmail.com}
+   s.extra_rdoc_files = ["CHANGELOG", "LICENSE", "README.rdoc", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb"]
+   s.files = ["CHANGELOG", "LICENSE", "Manifest", "README.rdoc", "Rakefile", "cass.gemspec", "lib/cass.rb", "lib/cass/analysis.rb", "lib/cass/context.rb", "lib/cass/contrast.rb", "lib/cass/document.rb", "lib/cass/extensions.rb", "lib/cass/parser.rb", "lib/cass/stats.rb"]
+   s.homepage = %q{http://casstools.org}
+   s.rdoc_options = ["--line-numbers", "--inline-source", "--title", "Cass", "--main", "README.rdoc"]
+   s.require_paths = ["lib"]
+   s.rubyforge_project = %q{cass}
+   s.rubygems_version = %q{1.3.7}
+   s.summary = %q{A set of tools for conducting Contrast Analyses of Semantic Similarity (CASS).}
+
+   if s.respond_to? :specification_version then
+     current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
+     s.specification_version = 3
+
+     if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
+       s.add_runtime_dependency(%q<narray>, [">= 0.5.9.7"])
+     else
+       s.add_dependency(%q<narray>, [">= 0.5.9.7"])
+     end
+   else
+     s.add_dependency(%q<narray>, [">= 0.5.9.7"])
+   end
+ end
data/lib/cass.rb ADDED
@@ -0,0 +1,14 @@
+ require 'narray'
+ require 'cass/stats'
+ require 'cass/analysis'
+ require 'cass/context'
+ require 'cass/contrast'
+ require 'cass/document'
+ require 'cass/extensions'
+ require 'cass/parser'
+
+ module Cass
+
+   VERSION = '0.0.1'
+
+ end
data/lib/cass/analysis.rb ADDED
@@ -0,0 +1,229 @@
+ module Cass
+
+   # Instantiates an analysis on one or more Documents.
+   # Currently, only the default processing stream (run_spec)
+   # is implemented. Eventually, direct methods for specific
+   # analyses (e.g., two-document permutation tests) will be
+   # supported.
+   class Analysis
+
+     attr_accessor :docs, :contexts, :targets
+
+     # Read and parse the specifications for an analysis, then run the analysis.
+     # Only does basic error checking for now...
+     def self.run_spec(spec_file='default.spec')
+
+       # Basic error checking
+       abort("Error: can't find spec file (#{spec_file}).") if !File.exist?(spec_file)
+       load spec_file
+       abort("Error: can't find contrast file (#{CONTRAST_FILE}).") if !File.exist?(CONTRAST_FILE)
+       contrasts = parse_contrasts(CONTRAST_FILE)
+
+       # Create contrasts
+       puts "Found #{contrasts.size} contrasts." if VERBOSE
+
+       # Set targets
+       targets = contrasts.inject([]) { |t, c| t += c.words.flatten }.uniq
+       puts "Found #{targets.size} target words." if VERBOSE
+
+       # Create options hash
+       opts = {}
+       %w[PARSE_TEXT N_PERM N_BOOT MAX_LINES RECODE CONTEXT_SIZE MIN_PROP STOP_FILE NORMALIZE_WEIGHTS].each { |c|
+         opts[c.downcase] = Module.const_get(c) if Module.constants.include?(c)
+       }
+
+       # Read in files and create documents
+       docs = []
+       FILES.each { |f|
+         abort("Error: can't find input file #{f}.") if !File.exist?(f)
+         puts "Reading in file #{f}..."
+         text = File.new(f).read
+         docs << Document.new(f.split(/\//)[-1], targets, text, opts)
+       }
+       docs
+
+       # Load contrasts
+       contrasts = parse_contrasts(CONTRAST_FILE)
+
+       # Make sure N_PERM is zero if we don't want stats
+       n_perm = STATS ? N_PERM : 0
+
+       # One or two-sample test?
+       case TEST_TYPE
+       when 1
+         docs.each { |d|
+           base = File.basename(d.name, '.txt')
+           puts "\nRunning one-sample analysis on document '#{d.name}'."
+           puts "Generating #{n_perm} bootstraps..." if VERBOSE and STATS
+           bootstrap_test(d, contrasts, "#{OUTPUT_ROOT}_#{base}_results.txt", n_perm)
+           p_values("#{OUTPUT_ROOT}_#{base}_results.txt", 'boot', true) if STATS
+         }
+
+       when 2
+         abort("Error: in order to run a permutation test, you need to pass exactly two files as input.") if FILES.size != 2
+         puts "Running two-sample comparison between '#{File.basename(FILES[0])}' and '#{File.basename(FILES[1])}'." if VERBOSE
+         puts "Generating #{n_perm} permutations..." if VERBOSE and STATS
+         permutation_test(*docs, contrasts, "#{OUTPUT_ROOT}_results.txt", n_perm)
+         p_values("#{OUTPUT_ROOT}_results.txt", 'perm', true)
+
+       # No other test types implemented for now.
+       else
+
+       end
+       puts "Done!"
+
+     end
+
+     # Parse contrast file. Takes a filename as input and returns an array of Contrasts.
+     def self.parse_contrasts(contrast_file)
+       File.new(contrast_file).readlines.map(&:strip).reject(&:empty?).map { |l| Contrast.parse(l) }
+     end
+
+     # Run a permutation test comparing two Documents.
+     # * doc1, doc2: the two Documents to compare
+     # * contrasts: an array of Contrasts used to compare the documents
+     # * output_file: name of output file
+     # * n_perm: number of permutations to run
+     def self.permutation_test(doc1, doc2, contrasts, output_file, n_perm)
+
+       # Merge contexts. Could change this later to allow different contexts for each
+       # document, but that would make processing substantially slower.
+       context = doc1.context
+       context.words = context.words & doc2.context.words
+       context.index_words
+       doc1.context, doc2.context = context, context
+
+       # Generate cooccurrence matrices and get observed difference.
+       doc1.cooccurrence(NORMALIZE_WEIGHTS)
+       doc2.cooccurrence(NORMALIZE_WEIGHTS)
+
+       outf = File.new(output_file, 'w')
+       outf.puts "contrast\titeration\t#{doc1.name}\t#{doc2.name}\tdifference"
+       outf.sync = true
+       # Save observed values
+       contrasts.each { |c|
+         res1, res2, diff = compare_docs(c, doc1, doc2)
+         outf.puts "#{c.words.join(".")}\tobserved\t#{res1}\t#{res2}\t#{diff}"
+       }
+       # Run permutations and save results
+       d1, d2 = doc1.clone, doc2.clone
+       n_perm.times { |i|
+         puts "\n\nRunning permutation #{i+1}..."
+         d1.clines, d2.clines = permute_labels(doc1.clines, doc2.clines)
+         d1.cooccurrence(NORMALIZE_WEIGHTS)
+         d2.cooccurrence(NORMALIZE_WEIGHTS)
+         contrasts.each { |c|
+           res1, res2, diff = compare_docs(c, d1, d2)
+           outf.puts "#{c.words.join(".")}\tperm_#{i+1}\t#{res1}\t#{res2}\t#{diff}"
+         }
+       }
+     end
+
+     # Do a bootstrap test comparing the bootstrapped distribution to zero.
+     # * doc: the Document object to analyze
+     # * contrasts: an array of Contrast objects to apply
+     # * output_file: name of output file
+     # * n_boot: number of bootstrap iterations to run
+     def self.bootstrap_test(doc, contrasts, output_file, n_boot)
+
+       outf = File.new(output_file, 'w')
+       outf.puts(%w[contrast result_id doc_name pair_1 pair_2 pair_3 pair_4 interaction_term].join("\t"))
+       outf.sync = true
+
+       doc.cooccurrence(NORMALIZE_WEIGHTS)
+       contrasts.each { |c|
+         observed = c.apply(doc)
+         outf.puts "#{c.words.join(".")}\tobserved\t#{observed}"
+       }
+       d1 = doc.clone
+       n_boot.times { |i|
+         puts "\n\nRunning bootstrap iteration #{i+1}..." if VERBOSE
+         d1.clines = doc.resample(true)
+         # d1.context = Context.new(d1) # Currently uses the same context; can uncomment
+         d1.cooccurrence(NORMALIZE_WEIGHTS)
+         contrasts.each { |c|
+           res = c.apply(d1)
+           outf.puts "#{c.words.join(".")}\tboot_#{i+1}\t#{res}"
+         }
+       }
+     end
+
+     # Permute labels across two documents.
+     def self.permute_labels(lines1, lines2)
+       n1 = lines1.size
+       lines = (lines1 + lines2).sort_by { rand }
+       [lines.slice!(0, n1), lines]
+     end
+
+     # Run pairwise contrast on two docs and return difference.
+     def self.compare_docs(contrast, doc1, doc2)
+       res1, res2 = contrast.apply(doc1).split("\t")[-1].to_f, contrast.apply(doc2).split("\t")[-1].to_f
+       [res1, res2, res1 - res2]
+     end
+
+     # Takes the results of a bootstrap or permutation test as input and saves
+     # a file summarizing the corresponding p-values.
+     # * input_file: path to the results of the bootstrapping/permutation analysis
+     # * mode: indicates the source analysis type; must be either 'boot' or 'perm'
+     # * mean: boolean indicating whether or not to compute the mean across all contrasts
+     def self.p_values(input_file, mode='boot', mean=true)
+       c = File.new(input_file).readlines
+       c.shift
+       buffer = ["file\tcontrast\tN_permutations\tvalue\tp-value"]
+       tests = {}
+       c.each { |l|
+         l = l.strip.split(/\t/)
+         row = [l[0], l[1], l[-1].to_f]
+         fname = mode == 'boot' ? l[2] : input_file
+         tests[fname] = [] if !tests.key?(fname)
+         tests[fname] << row
+       }
+
+       tests.each { |fname, rows|
+         dists, obs, means = {}, {}, []
+         rows.each { |row|
+           test, iter, val = row
+           if iter == 'observed'
+             obs[test] = val
+           else
+             dists[test] = [] if !dists.key?(test)
+             dists[test] << val
+             if mean
+               i = iter[/\d+$/].to_i - 1
+               means[i] = 0 if means[i].nil?
+               means[i] += val
+             end
+           end
+         }
+         if mean
+           means.map! { |m| m / obs.size }
+           dists['mean'] = means
+           obs['mean'] = obs.values.inject(0) { |sum, e| sum + e } / obs.size
+         end
+
+         dists.each { |k, v|
+           v, o = v.sort, obs[k]
+           gt = v.inject(0) { |sum, e|
+             sum +
+               if mode == 'perm'
+                 o >= e ? 1 : 0
+               else
+                 e > 0 ? 1 : 0
+               end
+           }
+           p = gt.to_f / v.size
+           p = 1 - p if p > 0.5
+           line = [fname, k, v.size, o, p * 2]
+           buffer << line.join("\t")
+         }
+
+       }
+       base = File.basename(input_file, '.txt')
+       File.new("#{base}_p_values.txt", 'w').puts buffer
+     end
+
+     private_class_method :permute_labels, :compare_docs
+
+   end
+
+ end
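For readers tracing permutation_test above: the label permutation pools the lines of both documents, shuffles them, and re-splits at the first document's size. A standalone sketch of that step (using Array#shuffle in place of the original sort_by { rand }; the sample lines are made up):

```ruby
# Shuffle lines across two documents and re-split at the original sizes,
# as permute_labels does for the two-sample permutation test.
def permute_labels(lines1, lines2)
  n1 = lines1.size
  lines = (lines1 + lines2).shuffle
  [lines.slice!(0, n1), lines]
end

a = %w[a1 a2 a3]
b = %w[b1 b2]
p1, p2 = permute_labels(a, b)
puts p1.size # => 3 (same sizes as the originals, labels reshuffled)
puts p2.size # => 2
```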
data/lib/cass/context.rb ADDED
@@ -0,0 +1,54 @@
+ module Cass
+
+   # Represents the context of a document, i.e., a list of words to analyze, along with an index.
+   class Context
+
+     attr_accessor :words, :index
+
+     def initialize(doc, opts)
+       min_prop = opts['min_prop'] || 0
+       max_prop = opts['max_prop'] || 1
+       puts "Creating new context..." if VERBOSE
+       words = doc.lines.join(' ').split(/\s+/)
+       nwords = words.size
+       puts "Found #{nwords} words."
+       if min_prop > 0 or max_prop < 1
+         word_hash = Hash.new(0)
+         words.each { |w| word_hash[w] += 1 }
+         min_t, max_t = (min_prop * nwords).round, (max_prop * nwords).round
+         words = word_hash.delete_if { |w, c| c < min_t or c > max_t }.keys
+       else
+         words.uniq!
+       end
+       # words = words - doc.targets
+       words -= opts['stop_file'].read.split(/\s+/) if opts.key?('stop_file')
+       @words = opts.key?('context_size') ? words.sort_by { rand }[0, opts['context_size']] : words
+       index_words
+       puts "Using #{@words.size} words as context." if VERBOSE
+     end
+
+     # Index the context. Necessary when words are updated manually.
+     def index_words
+       @index = {}
+       @words.each_index { |i| @index[@words[i]] = i }
+     end
+
+     # Convenience accessor for getting either a word in the context
+     # or its index in the array. If an integer is passed, returns a word;
+     # if a string is passed, returns the index of the word in the array.
+     def [](el)
+       el.is_a?(Integer) ? @words[el] : @index[el]
+     end
+
+     # Returns true if a word is in the context, false otherwise.
+     def key?(k)
+       @index.key?(k)
+     end
+
+     # Number of words in the context.
+     def size
+       @words.size
+     end
+   end
+
+ end