DRMacIver-term-extractor 0.0.0 → 0.0.1

@@ -1,7 +1,11 @@
  # The Trampoline Systems term extractor
 
+ ## Introduction
+
  The term extractor is a library for taking natural text and extracting a
- set of terms from it which make sense without additional context. For example, feeding it the following text from my home page:
+ set of terms from it which make sense without additional context. We developed it at [Trampoline Systems](http://trampolinesystems.com/) as part of our work on [SONAR](http://trampolinesystems.com/product/Sonar+Expertise/benefits).
+
+ For example, feeding it the following text from my home page:
 
  Hi. I’m David.
 
@@ -35,6 +39,32 @@ One limitation of this is that it doesn't necessarily extract all reasonable ter
 
  Currently only English is supported. There are plans to support other languages, but nothing is implemented in that regard: it requires someone who is a native speaker of the language, a competent programmer, and at least passingly familiar with NLP, so understandably we're a bit resource-constrained on getting widespread non-English support.
 
+ ## Usage
+
+ The primary use for the term extractor is as a JRuby library. There is a command line script wrapping it, but it currently only supports very basic use and isn't really practical because of a long startup time (this has more to do with loading models than with Java startup).
+
+ Use of the library is very simple:
+
+ jirb -rubygems -rterm-extractor
+ irb(main):001:0> extractor = TermExtractor.new
+ irb(main):002:0> puts extractor.extract_terms_from_text("Scala is a multi-paradigm programming language designed to integrate features of object-oriented programming and functional programming.")
+ Scala
+ multi-paradigm programming language
+ features
+ irb(main):003:0> p extractor.extract_terms_from_text("Scala is a multi-paradigm programming language designed to integrate features of object-oriented programming and functional programming.")
+ [#<Term:0xd36ff3 @to_s="Scala", @pos="NNP", @sentence=0>, #<Term:0x15af049 @to_s="multi-paradigm programming language", @pos="JJ-NN-NN", @sentence=0>, #<Term:0x1555185 @to_s="features", @pos="NNS", @sentence=0>]
+ irb(main):004:0> terms = extractor.extract_terms_from_text("Scala is a multi-paradigm programming language designed to integrate features of object-oriented programming and functional programming.")
+ irb(main):006:0> p terms[0]
+ #<Term:0x1c958af @to_s="Scala", @pos="NNP", @sentence=0>
+ irb(main):007:0> puts terms[0].pos
+ NNP
+ irb(main):008:0> puts terms[0].sentence
+ 0
+ irb(main):009:0> puts terms[0].to_s
+ Scala
+
+ You create a term extractor, pass it text with extract_terms_from_text, and it returns an array of Term objects. Mostly you'll be interested in converting these straight to strings, where they correspond to the desired snippets of text from the document, but they also provide some additional information: currently the part of speech tags and which sentence in the text each term occurs in. More information may be added later.
+
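As a sketch of how the returned objects might be consumed in a plain script (the `Term` struct below is a hypothetical stand-in with the same accessors, used only so the example is self-contained; real Term objects come from `extract_terms_from_text`):

```ruby
# Hypothetical stand-in for the gem's Term objects, so this sketch is
# self-contained; real ones come from TermExtractor#extract_terms_from_text.
Term = Struct.new(:text, :pos, :sentence)

terms = [
  Term.new("Scala", "NNP", 0),
  Term.new("multi-paradigm programming language", "JJ-NN-NN", 0),
  Term.new("features", "NNS", 0)
]

# Group terms by the sentence they occur in and print one line per sentence.
terms.group_by(&:sentence).each do |sentence, ts|
  puts "sentence #{sentence}: #{ts.map(&:text).join(', ')}"
end
```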
  ## Copyright
 
  Copyright (c) 2009 Trampoline Systems. See LICENSE for details.
data/VERSION CHANGED
@@ -1 +1 @@
- 0.0.0
+ 0.0.1
data/app.rb ADDED
@@ -0,0 +1,56 @@
+ require "date"
+ require "rubygems"
+ require "sinatra"
+
+ $: << "lib"
+
+ require "term-extractor"
+
+ TE = TermExtractor.new
+
+ get '/' do
+   haml :index
+ end
+
+ post '/' do
+   if params[:extract]
+     @text = params[:text]
+     @terms = TE.extract_terms_from_text(@text).map{|x| x.to_s}.uniq
+   elsif params[:train]
+     File.open("training/good", "a"){|o| o.puts params[:goodterms]}
+     File.open("training/bad", "a"){|o| o.puts params[:badterms]}
+
+     time = DateTime.now.strftime("%Y-%m-%d-%H:%M")
+
+     File.open("test/examples/#{time}_spec.rb", "w"){ |o|
+       o.puts <<SPEC
+ TE = TermExtractor.new unless defined? TE
+
+ Text = <<TEXT
+ #{params[:text]}
+ TEXT
+
+ Terms = TE.extract_terms_from_text(Text).map{|x| x.to_s}.sort.uniq
+
+ describe "the example generated at #{time}" do
+   it "should contain the following terms" do
+     #{(params[:goodterms] || "").split(/\r?\n/).map{|x| x.strip}.inspect}.each do |term|
+       Terms.should include(term)
+     end
+   end
+
+   it "should not contain the following terms" do
+     #{(params[:badterms] || "").split(/\r?\n/).map{|x| x.strip}.inspect}.each do |term|
+       Terms.should_not include(term)
+     end
+   end
+ end
+
+ SPEC
+     }
+   end
+
+   haml :index
+ end
@@ -1,28 +1,29 @@
  require "term-extractor/nlp"
 
  class Term
-   attr_accessor :to_s, :pos, :sentence
+   attr_accessor :pos, :sentence, :chunks, :tokens
 
-   def initialize(ts, pos, sentence = nil)
-     @to_s, @pos, @sentence = ts, pos, sentence
+   def initialize(tokens)
+     @tokens = tokens
+     yield self if block_given?
+   end
+
+   def to_s
+     @to_s ||= TermExtractor.recombobulate_term(@tokens)
    end
  end
 
+ # A class for extracting useful snippets of text from a document
  class TermExtractor
-   attr_accessor :nlp, :max_term_length, :proscribed_start, :required_ending, :remove_urls, :remove_paths
+   attr_accessor :nlp, :max_term_length, :remove_urls, :remove_paths
 
    def initialize(models = File.dirname(__FILE__) + "/../models")
      @nlp = NLP.new(models)
 
      # Empirically, terms longer than about 5 words seem to be either
      # too specific to be useful or very noisy.
-     @max_term_length = 5
-
-     # Common sources of crap starting words
-     @proscribed_start = /CC|PRP|IN|DT|PRP\$|WP|WP\$|TO|EX/
+     @max_term_length = 4
 
-     # We have to end in a noun, foreign word or number.
-     @required_ending = /NN|NNS|NNP|NNPS|FW|CD/
 
      self.remove_urls = true
      self.remove_paths = true
@@ -30,7 +31,14 @@ class TermExtractor
      yield self if block_given?
    end
 
-
+
+   # This class holds all the state needed for term calculations
+   # on a single sentence.
+   # It uses chunking and part of speech tagging information to
+   # mark each token in the sentence as to whether it is allowed
+   # to start a term or end a term and whether terms can cross it.
+   # Terms are then calculated by simply looking for all sequences
+   # of tokens up to the maximum length which meet these constraints.
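The enumeration that comment describes can be sketched in plain Ruby (a simplified illustration of the stated approach, not the gem's actual implementation; the boundary flags here are written out by hand, whereas the library derives them from chunking and POS tagging):

```ruby
# Enumerate all token subsequences up to max_term_length whose first
# token may start a term, whose last token may end one, and which do
# not extend across a non-crossable token (simplified semantics).
def candidate_terms(tokens, boundaries, max_term_length)
  terms = []
  tokens.each_index do |i|
    next unless boundaries[i][:can_start]
    j = i
    while j < tokens.length && j - i < max_term_length
      break if j > i && !boundaries[j][:can_cross]
      terms << tokens[i..j].join(" ") if boundaries[j][:can_end]
      j += 1
    end
  end
  terms
end

# Hand-written flags for a toy sentence; "and" (a coordinating
# conjunction) cannot be crossed, so no term spans it.
b = lambda { |s, c, e| { can_start: s, can_cross: c, can_end: e } }
tokens = %w[quick brown fox and hound]
bounds = [b.call(true, true, false), b.call(false, true, false),
          b.call(false, true, true), b.call(false, false, false),
          b.call(true, true, true)]
p candidate_terms(tokens, bounds, 4) # => ["quick brown fox", "hound"]
```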
    class TermContext
      attr_accessor :parent, :tokens, :postags, :chunks
 
@@ -55,7 +63,8 @@ class TermExtractor
        @sentence = sentence
 
      end
-
+
+     # This is the bit where all the work happens
      def boundaries
        return @boundaries if @boundaries
 
@@ -66,13 +75,32 @@ class TermExtractor
        @boundaries = tokens.map{|t| {}}
 
        @boundaries.each_with_index do |b, i|
+         # WARNING: It's important to only write boundaries for indices
+         # <= i. Otherwise the next loop iteration will overwrite the
+         # set value.
+
          tok = tokens[i]
          pos = postags[i]
          chunk = chunks[i]
 
          # Cannot cross commas or coordinating conjunctions (and, or, etc)
-         b[:can_cross] = !(pos =~ /,|CC/)
-
+         b[:can_cross] = !(pos =~ /,/)
+
+         # Words which are extra double plus stop wordy and shouldn't
+         # appear inside terms.
+         # FIXME: This is a hack. We're really hitting the limit of
+         # rule based systems here.
+         b[:can_cross] &&= ![
+           "after",
+           "where",
+           "when",
+           "for",
+           "at",
+           "to",
+           "with"
+         ].include?(tok)
+
          # Cannot cross the beginning of verb terms
          # i.e. we may start with verb terms but not include them
          b[:can_cross] = (chunk != "B-VP") if b[:can_cross]
@@ -83,21 +111,36 @@ class TermExtractor
 
          # We are only allowed to start terms on the beginning of a term chunk
          b[:can_start] = (chunks[i] == "B-NP")
-         if i > 0
-           if postags[i-1] =~ /DT|WDT|PRP|JJR|JJS/
-             # In some cases we want to move the start of a term to the right. These cases are:
-             # - a determiner (the, a, etc)
-             # - a posessive pronoun (my, your, etc)
-             # - comparative and superlative adjectives (best, better, etc.)
-             # In all cases we only do this for noun terms, and will only move them to internal points.
-             b[:can_start] ||= (chunks[i] == "I-NP")
-             @boundaries[i - 1][:can_start] = false
-           end
+
+         # In some cases we want to move the start of a term to the right. These cases are:
+         # - a determiner (the, a, etc)
+         # - a possessive pronoun (my, your, etc)
+         # - comparative and superlative adjectives (best, better, etc.)
+         # - a number. In this case note that starting with the number is also allowed, e.g. "two cities" will produce both "two cities" and "cities".
+         # In all cases we only do this for noun terms, and will only move them to internal points.
+         if (chunks[i] == "I-NP") && (postags[i-1] =~ /DT|WDT|PRP|JJR|JJS|CD/)
+           b[:can_start] = true
          end
 
          # We must include any tokens internal to the current chunk
          b[:can_end] = !(chunks[i + 1] =~ /I-/)
 
+         # We break phrases around coordinating conjunctions (and, or, etc),
+         # even for phrases which would otherwise be forced to continue past
+         # the conjunction. e.g. in "nuts and bolts", we allow "nuts" and "bolts"
+         # but not the whole phrase. This is true even if this resolves as a single
+         # chunk.
+         if pos == 'CC'
+           @boundaries[i-1][:can_end] = true if i > 0
+           @boundaries[i][:can_cross] = false
+         end
+         # Need to do this here rather than in the previous if statement,
+         # as otherwise the next pass along will overwrite the result
+         # we set here.
+         if i > 0 && @postags[i-1] == 'CC'
+           @boundaries[i][:can_start] = true
+         end
+
          # It is permitted to cross stopwords, but they cannot lie at the term boundary
          if (nlp.stopword? tok) || (nlp.stopword? tokens[i..i+1].join) # Need to take into account contractions, which span multiple tokens
            b[:can_end] = false
@@ -111,10 +154,12 @@
            b[:can_start] = false
            @boundaries[i - 1][:can_end] = false
          end
-
-         # Must match the requirements for POSes at the beginning and end.
-         b[:can_start] &&= !(pos =~ parent.proscribed_start)
-         b[:can_end] &&= (pos =~ parent.required_ending)
+
+         # Common sources of crap starting words
+         b[:can_start] &&= !(pos =~ /CC|PRP|IN|DT|PRP\$|WP|WP\$|TO|EX|JJR|JJS/)
+
+         # TODO: Is this still a good idea?
+         b[:can_end] &&= (pos =~ /NN|NNS|NNP|NNPS|FW|CD/)
 
        end
 
@@ -149,7 +194,10 @@ class TermExtractor
 
            term = tokens[i..j]
            poses = postags.to_a[i..j]
-           term = Term.new(TermExtractor.recombobulate_term(term), poses.join("-"))
+           term = Term.new(term){ |it|
+             it.pos = poses.join("-")
+             it.chunks = chunks.to_a[i..j]
+           }
            terms << term if TermExtractor.allowed_term?(term)
 
            j += 1
@@ -179,7 +227,7 @@
 
    # Final post filter on terms to determine if they're allowed.
    def self.allowed_term?(p)
-     return false if p.pos =~ /^CD(-CD)*$/ # We don't allow things which are just sequences of numbers
+     return false if p.to_s =~ /^[^a-zA-Z]*$/ # We don't allow things which contain no letters at all
      return false if p.to_s.length > 255
      true
    end
@@ -86,6 +86,8 @@ class TermExtractor
    text = text.dup
    text.gsub!(/--+/, " -- ") # TODO: What's this for?
 
+   text.gsub!(/…/, "...") # Expand the ellipsis character.
+
    # Normalize bracket types.
    # TODO: Shouldn't do this inside of tokens.
    text.gsub!(/{\[/, "(")
@@ -2,11 +2,11 @@
 
  Gem::Specification.new do |s|
    s.name = %q{term-extractor}
-   s.version = "0.0.0"
+   s.version = "0.0.1"
 
    s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
    s.authors = ["David R. MacIver"]
-   s.date = %q{2009-08-06}
+   s.date = %q{2009-09-08}
    s.default_executable = %q{terms.rb}
    s.email = %q{david.maciver@gmail.com}
    s.executables = ["terms.rb"]
@@ -19,6 +19,7 @@ Gem::Specification.new do |s|
      "README.markdown",
      "Rakefile",
      "VERSION",
+     "app.rb",
      "bin/terms.rb",
      "lib/term-extractor.rb",
      "lib/term-extractor/maxent-2.5.2.jar",
@@ -37,11 +38,14 @@ Gem::Specification.new do |s|
      "models/tagdict",
      "models/tok.bin.gz",
      "term-extractor.gemspec",
-     "test/examples_spec.rb",
+     "test/examples/2009-08-16-14:41_spec.rb",
      "test/files/1.email",
      "test/files/juries_seg_8_v1",
      "test/nlp_spec.rb",
-     "test/term_extractor_spec.rb"
+     "test/term_extractor_spec.rb",
+     "training/bad",
+     "training/good",
+     "views/index.haml"
    ]
    s.homepage = %q{http://github.com/david.maciver@gmail.com/term-extractor}
    s.rdoc_options = ["--charset=UTF-8"]
@@ -49,9 +53,9 @@ Gem::Specification.new do |s|
    s.rubygems_version = %q{1.3.4}
    s.summary = %q{A library for extracting useful terms from text}
    s.test_files = [
-     "test/term_extractor_spec.rb",
-     "test/nlp_spec.rb",
-     "test/examples_spec.rb"
+     "test/examples/2009-08-16-14:41_spec.rb",
+     "test/term_extractor_spec.rb",
+     "test/nlp_spec.rb"
    ]
 
    if s.respond_to? :specification_version then
@@ -0,0 +1,22 @@
+ TE = TermExtractor.new unless defined? TE
+
+ Text = <<TEXT
+ As the healthcare debate picks up pace, I find myself being asked with increasing regularity what I think of Britain’s healthcare system. Six months ago, I’d have jumped into the answer with gusto, but these days… I don’t know, I am just so fatigued by all the fear-mongering and hysteria, the ignorance and the downright idiocy of the current debate that I can hardly summon the energy to add my voice to the cacophony.
+ TEXT
+
+ Terms = TE.extract_terms_from_text(Text).map{|x| x.to_s}.sort.uniq
+
+ describe "the example generated at 2009-08-16-14:41" do
+   it "should contain the following terms" do
+     ["healthcare debate", "Britain's healthcare system", "Six months", "answer", "gusto", "fear-mongering", "hysteria", "ignorance", "downright idiocy", "current debate", "energy", "voice", "cacophony"].each do |term|
+       Terms.should include(term)
+     end
+   end
+
+   it "should not contain the following terms" do
+     ["days\342\200\246", "voice to the cacophony", "answer with gusto"].each do |term|
+       Terms.should_not include(term)
+     end
+   end
+ end
@@ -39,9 +39,8 @@ I like kitties
 
  I like puppies
  KITTIES
-
-
  end
+
  end
 
  describe "url removal" do
@@ -15,18 +15,6 @@ def each_tag(&blk)
  end
 
  describe TermExtractor do
-   it "should only return themes ending in nouns" do
-     each_tag do |tag|
-       tag.pos.should =~ /(^|-)(#{PE.required_ending})$/
-     end
-   end
-
-   it "must not return themes starting with proscribed parts of speech" do
-     each_tag do |tag|
-       tag.pos.should_not =~ /^(#{PE.proscribed_start})($|-)/
-     end
-   end
-
    it "should produce at least as many tags as words" do
      each_tag do |tag|
        tag.pos.split("-").length.should be >= tag.to_s.split.length
@@ -137,5 +125,4 @@ BINARYSOLO
      term.to_s.should_not =~ /’|'/
    }
  end
-
  end
@@ -0,0 +1,3 @@
+ days…
+ voice to the cacophony
+ answer with gusto
@@ -0,0 +1,13 @@
+ healthcare debate
+ Britain's healthcare system
+ Six months
+ answer
+ gusto
+ fear-mongering
+ hysteria
+ ignorance
+ downright idiocy
+ current debate
+ energy
+ voice
+ cacophony
@@ -0,0 +1,19 @@
+ %html{ :xmlns => "http://www.w3.org/1999/xhtml", "xml:lang" => "en" }
+   %head
+     %title Training
+
+     %link{:rel => "stylesheet", :type => "text/css", :href => "/style.css"}/
+   %body
+     %h1 Term Extractor Training
+     %form{ :method => "POST" }
+       %div{ :style => "float: left; width: 60%;"}
+         %textarea{ :style =>"width: 100%; height: 60em;", :name => "text"}= @text
+         %input{:type => "submit", :name => "extract", :value => "Extract"}
+
+       %div{ :style => "float: right; width: 35%;"}
+         %h2 Good terms
+         %textarea{ :style => "width: 100%; height: 20em;", :name => "goodterms"}= @terms && @terms.join("\n")
+         %h2 Bad terms
+         %textarea{ :style => "width: 100%; height: 20em;", :name => "badterms"}
+
+       %input{:type => "submit", :name => "train", :value => "Train"}
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: DRMacIver-term-extractor
  version: !ruby/object:Gem::Version
-   version: 0.0.0
+   version: 0.0.1
  platform: ruby
  authors:
  - David R. MacIver
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2009-08-06 00:00:00 -07:00
+ date: 2009-09-08 00:00:00 -07:00
  default_executable: terms.rb
  dependencies: []
 
@@ -27,6 +27,7 @@ files:
  - README.markdown
  - Rakefile
  - VERSION
+ - app.rb
  - bin/terms.rb
  - lib/term-extractor.rb
  - lib/term-extractor/maxent-2.5.2.jar
@@ -45,14 +46,16 @@ files:
  - models/tagdict
  - models/tok.bin.gz
  - term-extractor.gemspec
- - test/examples_spec.rb
+ - test/examples/2009-08-16-14:41_spec.rb
  - test/files/1.email
  - test/files/juries_seg_8_v1
  - test/nlp_spec.rb
  - test/term_extractor_spec.rb
+ - training/bad
+ - training/good
+ - views/index.haml
  has_rdoc: false
  homepage: http://github.com/david.maciver@gmail.com/term-extractor
- licenses:
  post_install_message:
  rdoc_options:
  - --charset=UTF-8
@@ -73,11 +76,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  requirements: []
 
  rubyforge_project:
- rubygems_version: 1.3.5
+ rubygems_version: 1.2.0
  signing_key:
  specification_version: 3
  summary: A library for extracting useful terms from text
  test_files:
+ - test/examples/2009-08-16-14:41_spec.rb
  - test/term_extractor_spec.rb
  - test/nlp_spec.rb
- - test/examples_spec.rb
@@ -1,131 +0,0 @@
- require "term-extractor"
-
- PE = TermExtractor.new
-
- Diagrams = <<DIAGRAMS
- I think having nice standardised diagrams of stuff like that is REALLY
- useful. One OO architect drops dead and your replacement walks in and
- can pick up the documents and read them because they already speak
- that language. That's a great thing. I sort of wish it had been pushed
- as being that -- a lingua franca for documenting designs.
- DIAGRAMS
-
-
- describe "Diagram terms" do
-
-
- end
-
- Murray = <<MURRAY
- The MCHS Department of Music is one of the most distinguished music programs in the State, having an award-winning choral and band program. The Marching Indians, under the direction of Mr. Mike Weaver, have performed all over the country, most recently at Universal Studios in Orlando, Disney World and the St. Patrick's Day Parade in New York City. Since 1958, the Marching Indians have been entreating fans with exciting, visually stimulating shows and their trademark deep, loud sound. Recently the Marching Indians received the Grand Championship at the 2008 Golden River Music Festival and won the first ever US101 radio battle of the bands receiving a concert by the Eli Young Band. Many students from MCHS Department of Bands have been involved with All District and All State bands as well as various summer clinics, orchestras and even the Georgia Lions All State Band.
- MURRAY
- MurrayTerms = PE.extract_terms_from_text(Murray).map{|x| x.to_s}
-
- describe "Murray terms" do
-   it "should get Mike's name right" do
-     MurrayTerms.should_not include("Mr . Mike Weaver")
-     MurrayTerms.should include("Mr. Mike Weaver")
-   end
- end
-
- Chromosome = <<CHROM
- Humans have 23 pairs of chromosomes packed with genes that dictate every aspect of our biological functioning. Of these pairs, the sex chromosomes are different; women have two X chromosomes and men have an X and a Y chromosome. The Y chromosome contains essential blueprints for the male reproductive system, in particular those for sperm development.
-
- But the Y chromosome, which once contained as many genes as the X chromosome, has deteriorated over time and now contains less than 80 functional genes compared to its partner, which contains more than 1,000 genes. Geneticists and evolutionary biologists determined that the Y chromosome's deterioration is due to accumulated mutations, deletions and anomalies that have nowhere to go because the chromosome doesn't swap genes with the X chromosome like every other chromosomal pair in our cells do.
- CHROM
-
- ChromosomeTerms = PE.extract_terms_from_text(Chromosome).map{|x| x.to_s}
-
- describe "Chromosome terms" do
-   it "should say nothing about what humans have" do
-     ChromosomeTerms.should_not include("Humans have 23 pairs")
-   end
-
-   it "knows about the male reproductive system, if you know what I mean" do
-     ChromosomeTerms.should include("male reproductive system")
-     ChromosomeTerms.should include("sperm development")
-   end
-
-   it "is about humans" do
-     ChromosomeTerms.should include("Humans")
-   end
- end
-
- Environment = "Please consider the environment before printing this e-mail"
- EnvironmentTerms = PE.extract_terms_from_text(Environment).map{|x| x.to_s}.sort
-
- describe "Environment terms" do
-   it "is about email" do
-     EnvironmentTerms.should include("e-mail")
-   end
- end
-
- Apollo = <<APOLLO
- Fate has ordained that the men who went to the moon to explore in peace will stay on the moon to rest in peace.
-
- These brave men, Neil Armstrong and Edwin Aldrin, know that there is no hope for their recovery. But they also know that there is hope for mankind in their sacrifice.
-
- These two men are laying down their lives in mankind's most noble goal: the search for truth and understanding.
-
- They will be mourned by their families and friends; they will be mourned by their nation; they will be mourned by the people of the world; they will be mourned by a Mother Earth that dared send two of her sons into the unknown.
-
- In their exploration, they stirred the people of the world to feel as one; in their sacrifice, they bind more tightly the brotherhood of man.
-
- In ancient days, men looked at stars and saw their heroes in the constellations. In modern times, we do much the same, but our heroes are epic men of flesh and blood.
-
- Others will follow, and surely find their way home. Man's search will not be denied. But these men were the first, and they will remain the foremost in our hearts.
- APOLLO
-
- ApolloTerms = PE.extract_terms_from_text(Apollo).map{|x| x.to_s}.sort.uniq
-
- describe "Apollo terms" do
-   it "knows of Neil and Buzz" do
-     ApolloTerms.should include("Neil Armstrong")
-     ApolloTerms.should include("Edwin Aldrin")
-   end
-
-   it "knows of where they've been" do
-     ApolloTerms.should include("moon")
-   end
-
-   it "knows of times past and present" do
-     ApolloTerms.should include("ancient days")
-     ApolloTerms.should include("modern times")
-   end
-
-   it "knows of destiny" do
-     ApolloTerms.should include("Fate")
-   end
-
-   it "knows of searching" do
-     ApolloTerms.should include("exploration")
-     ApolloTerms.should include("search")
-     ApolloTerms.should include("Man's search")
-   end
-
-   it "knows not of mourning, but of courage and sacrifice" do
-     ApolloTerms.should_not include("mourned")
-     ApolloTerms.should include("brave men")
-     ApolloTerms.should include("sacrifice")
-   end
-
-   it "knows of brotherhood" do
-     ApolloTerms.should include("brotherhood of man")
-   end
-
-   it "knows of mankind, and of its heroes" do
-     ApolloTerms.should include("man")
-     ApolloTerms.should include("men")
-     ApolloTerms.should include("mankind")
-     ApolloTerms.should include("heroes")
-     ApolloTerms.should include("epic men")
-   end
-
-   it "looks to the stars from the earth" do
-     ApolloTerms.should include("stars")
-     ApolloTerms.should include("constellations")
-     ApolloTerms.should include("Mother Earth")
-     ApolloTerms.should include("world")
-   end
-
- end