DRMacIver-term-extractor 0.0.0 → 0.0.1
- data/README.markdown +31 -1
- data/VERSION +1 -1
- data/app.rb +56 -0
- data/lib/term-extractor.rb +78 -30
- data/lib/term-extractor/nlp.rb +2 -0
- data/term-extractor.gemspec +11 -7
- data/test/examples/2009-08-16-14:41_spec.rb +22 -0
- data/test/nlp_spec.rb +1 -2
- data/test/term_extractor_spec.rb +0 -13
- data/training/bad +3 -0
- data/training/good +13 -0
- data/views/index.haml +19 -0
- metadata +9 -6
- data/test/examples_spec.rb +0 -131
data/README.markdown
CHANGED
@@ -1,7 +1,11 @@
|
|
1
1
|
# The Trampoline Systems term extractor
|
2
2
|
|
3
|
+
## Introduction
|
4
|
+
|
3
5
|
The term extractor is a library for taking natural text and extracting a
|
4
|
-
set of terms from it which make sense without additional context.
|
6
|
+
set of terms from it which make sense without additional context. We developed it at [Trampoline Systems](http://trampolinesystems.com/) as part of our work on [SONAR](http://trampolinesystems.com/product/Sonar+Expertise/benefits).
|
7
|
+
|
8
|
+
For example, feeding it the following text from my home page:
|
5
9
|
|
6
10
|
Hi. I’m David.
|
7
11
|
|
@@ -35,6 +39,32 @@ One limitation of this is that it doesn't necessarily extract all reasonable ter
|
|
35
39
|
|
36
40
|
Currently only English is supported. There are plans to support other languages, but nothing is implemented yet: it requires someone who is a native speaker of the language, a competent programmer and at least passingly familiar with NLP, so understandably we're a bit resource constrained on getting widespread non-English support.
|
37
41
|
|
42
|
+
## Usage
|
43
|
+
|
44
|
+
The primary use for the term extractor is as a JRuby library. There is a command-line script wrapping it, but it currently supports only very basic use and isn't really practical because of its long startup time (which has more to do with loading models than with Java startup).
|
45
|
+
|
46
|
+
Use of the library is very simple:
|
47
|
+
|
48
|
+
jirb -rubygems -rterm-extractor
|
49
|
+
irb(main):001:0> extractor = TermExtractor.new
|
50
|
+
irb(main):002:0> puts extractor.extract_terms_from_text("Scala is a multi-paradigm programming language designed to integrate features of object-orientedd programming and functional programming.")
|
51
|
+
Scala
|
52
|
+
multi-paradigm programming language
|
53
|
+
features
|
54
|
+
irb(main):003:0> p extractor.extract_terms_from_text("Scala is a multi-paradigm programming language designed to integrate features of object-orientedd programming and functional programming.")
|
55
|
+
[#<Term:0xd36ff3 @to_s="Scala", @pos="NNP", @sentence=0>, #<Term:0x15af049 @to_s="multi-paradigm programming language", @pos="JJ-NN-NN", @sentence=0>, #<Term:0x1555185 @to_s="features", @pos="NNS", @sentence=0>]
|
56
|
+
irb(main):004:0> terms = extractor.extract_terms_from_text("Scala is a multi-paradigm programming language designed to integrate features of object-orientedd programming and functional programming.")
|
57
|
+
irb(main):006:0> p terms[0]
|
58
|
+
#<Term:0x1c958af @to_s="Scala", @pos="NNP", @sentence=0>
|
59
|
+
irb(main):007:0> puts terms[0].pos
|
60
|
+
NNP
|
61
|
+
irb(main):008:0> puts terms[0].sentence
|
62
|
+
0
|
63
|
+
irb(main):009:0> puts terms[0].to_s
|
64
|
+
Scala
|
65
|
+
|
66
|
+
You create a term extractor, pass it text with extract_terms_from_text, and it returns an array of Term objects. Most often you will simply convert these to strings, which correspond to the extracted snippets of text from the document, but they also carry some additional information: currently the parts of speech and the index of the sentence in which the term occurs. More information may be added later.
|
67
|
+
|
38
68
|
## Copyright
|
39
69
|
|
40
70
|
Copyright (c) 2009 Trampoline Systems. See LICENSE for details.
|
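The Usage section in the README diff above says each Term exposes `to_s`, `pos` and `sentence`. As a rough illustration of using that extra information, the sketch below groups term strings by sentence index. `FakeTerm` and `terms_by_sentence` are hypothetical names introduced here so the example runs without JRuby or the models; with the real library the array would presumably come from `TermExtractor.new.extract_terms_from_text(text)`. It assumes a modern Ruby (2.4 or later) rather than the 2009-era runtime the gem targeted.

```ruby
# Stand-in for the library's Term class so the sketch runs on plain Ruby.
# Only the readers shown in the README (to_s, pos, sentence) are modelled.
FakeTerm = Struct.new(:text, :pos, :sentence) do
  def to_s
    text
  end
end

# Group the string form of each term by the index of the sentence it came from.
def terms_by_sentence(terms)
  terms.group_by(&:sentence).transform_values { |ts| ts.map(&:to_s).uniq }
end

# Sample data copied from the irb session in the README.
terms = [
  FakeTerm.new("Scala", "NNP", 0),
  FakeTerm.new("multi-paradigm programming language", "JJ-NN-NN", 0),
  FakeTerm.new("features", "NNS", 0)
]

p terms_by_sentence(terms)
```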
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.0.
|
1
|
+
0.0.1
|
data/app.rb
ADDED
@@ -0,0 +1,56 @@
|
|
1
|
+
require "date"
|
2
|
+
require "rubygems"
|
3
|
+
require "sinatra"
|
4
|
+
|
5
|
+
$: << "lib"
|
6
|
+
|
7
|
+
require "term-extractor"
|
8
|
+
|
9
|
+
TE = TermExtractor.new
|
10
|
+
|
11
|
+
get '/' do
|
12
|
+
haml :index
|
13
|
+
end
|
14
|
+
|
15
|
+
post '/' do
|
16
|
+
if params[:extract]
|
17
|
+
@text = params[:text]
|
18
|
+
@terms = TE.extract_terms_from_text(@text).map{|x| x.to_s}.uniq
|
19
|
+
elsif params[:train]
|
20
|
+
File.open("training/good", "a"){|o| o.puts params[:goodterms]}
|
21
|
+
File.open("training/bad", "a"){|o| o.puts params[:badterms]}
|
22
|
+
|
23
|
+
time = DateTime.now.strftime("%Y-%m-%d-%H:%M")
|
24
|
+
|
25
|
+
File.open("test/examples/#{time}_spec.rb", "w"){ |o|
|
26
|
+
o.puts <<SPEC
|
27
|
+
TE = TermExtractor.new unless defined? TE
|
28
|
+
|
29
|
+
Text = <<TEXT
|
30
|
+
#{params[:text]}
|
31
|
+
TEXT
|
32
|
+
|
33
|
+
Terms = TE.extract_terms_from_text(Text).map{|x| x.to_s}.sort.uniq
|
34
|
+
|
35
|
+
describe "the example generated at #{time}" do
|
36
|
+
it "should contain the following terms" do
|
37
|
+
#{(params[:goodterms] || "").split(/\r?\n/).map{|x| x.strip}.inspect}.each do |term|
|
38
|
+
Terms.should include(term)
|
39
|
+
end
|
40
|
+
end
|
41
|
+
|
42
|
+
it "should not contain the following terms" do
|
43
|
+
#{(params[:badterms] || "").split(/\r?\n/).map{|x| x.strip}.inspect}.each do |term|
|
44
|
+
Terms.should_not include(term)
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
49
|
+
SPEC
|
50
|
+
|
51
|
+
|
52
|
+
}
|
53
|
+
end
|
54
|
+
|
55
|
+
haml :index
|
56
|
+
end
|
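The POST handler in app.rb above turns the contents of the "good" and "bad" textareas into Ruby array literals inside a generated spec. A minimal sketch of that step follows. `parse_terms` and `expectation_block` are hypothetical helper names, not part of the gem, and dropping blank lines is an addition of this sketch (the handler above keeps them). The squiggly heredoc assumes Ruby 2.3 or later.

```ruby
# Split a textarea's contents into one term per line, accepting both "\n" and
# "\r\n" line endings. Blank lines are dropped here (an assumption of this
# sketch; the original handler does not filter them).
def parse_terms(raw)
  (raw || "").split(/\r?\n/).map(&:strip).reject(&:empty?)
end

# Render one `it` block of the generated spec. Array#inspect yields a valid
# Ruby array literal, so quotes inside terms are escaped for us.
def expectation_block(description, terms, matcher)
  <<~SPEC
    it "#{description}" do
      #{terms.inspect}.each do |term|
        Terms.#{matcher} include(term)
      end
    end
  SPEC
end

good = parse_terms("healthcare debate\r\n gusto \n\n")
puts expectation_block("should contain the following terms", good, "should")
```

Note that this only covers the term lists. The handler also interpolates the raw submitted text into a heredoc, which would break if that text contained a line reading `TEXT`; the sketch does not address that.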
data/lib/term-extractor.rb
CHANGED
@@ -1,28 +1,29 @@
|
|
1
1
|
require "term-extractor/nlp"
|
2
2
|
|
3
3
|
class Term
|
4
|
-
attr_accessor :
|
4
|
+
attr_accessor :pos, :sentence, :chunks, :tokens
|
5
5
|
|
6
|
-
def initialize(
|
7
|
-
@
|
6
|
+
def initialize(tokens)
|
7
|
+
@tokens = tokens
|
8
|
+
yield self if block_given?
|
9
|
+
end
|
10
|
+
|
11
|
+
def to_s
|
12
|
+
@to_s ||= TermExtractor.recombobulate_term(@tokens)
|
8
13
|
end
|
9
14
|
end
|
10
15
|
|
16
|
+
# A class for extracting useful snippets of text from a document
|
11
17
|
class TermExtractor
|
12
|
-
attr_accessor :nlp, :max_term_length, :
|
18
|
+
attr_accessor :nlp, :max_term_length, :remove_urls, :remove_paths
|
13
19
|
|
14
20
|
def initialize(models = File.dirname(__FILE__) + "/../models")
|
15
21
|
@nlp = NLP.new(models)
|
16
22
|
|
17
23
|
# Empirically, terms longer than about 5 words seem to be either
|
18
24
|
# too specific to be useful or very noisy.
|
19
|
-
@max_term_length =
|
20
|
-
|
21
|
-
# Common sources of crap starting words
|
22
|
-
@proscribed_start = /CC|PRP|IN|DT|PRP\$|WP|WP\$|TO|EX/
|
25
|
+
@max_term_length = 4
|
23
26
|
|
24
|
-
# We have to end in a noun, foreign word or number.
|
25
|
-
@required_ending = /NN|NNS|NNP|NNPS|FW|CD/
|
26
27
|
|
27
28
|
self.remove_urls = true
|
28
29
|
self.remove_paths = true
|
@@ -30,7 +31,14 @@ class TermExtractor
|
|
30
31
|
yield self if block_given?
|
31
32
|
end
|
32
33
|
|
33
|
-
|
34
|
+
|
35
|
+
# This class holds all the state needed for term calculations
|
36
|
+
# on a single sentence.
|
37
|
+
# It uses chunking and part of speech tagging information to
|
38
|
+
# mark each token in the sentence as to whether it is allowed
|
39
|
+
# to start a term or end a term and whether terms can cross it.
|
40
|
+
# Terms are then calculated by simply looking for all sequences
|
41
|
+
# of tokens up to the maximum length which meet these constraints.
|
34
42
|
class TermContext
|
35
43
|
attr_accessor :parent, :tokens, :postags, :chunks
|
36
44
|
|
@@ -55,7 +63,8 @@ class TermExtractor
|
|
55
63
|
@sentence = sentence
|
56
64
|
|
57
65
|
end
|
58
|
-
|
66
|
+
|
67
|
+
# This is the bit where all the work happens
|
59
68
|
def boundaries
|
60
69
|
return @boundaries if @boundaries
|
61
70
|
|
@@ -66,13 +75,32 @@ class TermExtractor
|
|
66
75
|
@boundaries = tokens.map{|t| {}}
|
67
76
|
|
68
77
|
@boundaries.each_with_index do |b, i|
|
78
|
+
# WARNING: It's important to only write boundaries for indices
|
79
|
+
# <= i. Otherwise the next loop iteration will overwrite the
|
80
|
+
# set value.
|
81
|
+
|
82
|
+
|
69
83
|
tok = tokens[i]
|
70
84
|
pos = postags[i]
|
71
85
|
chunk = chunks[i]
|
72
86
|
|
73
87
|
# Cannot cross commas or coordinating conjunctions (and, or, etc)
|
74
|
-
b[:can_cross] = !(pos =~
|
75
|
-
|
88
|
+
b[:can_cross] = !(pos =~ /,/)
|
89
|
+
|
90
|
+
# words which are extra double plus stop wordy and shouldn't appear inside
|
91
|
+
# terms
|
92
|
+
# FIXME: This is a hack. We're really hitting the limit of
|
93
|
+
# rule based systems here
|
94
|
+
b[:can_cross] &&= ![
|
95
|
+
"after",
|
96
|
+
"where",
|
97
|
+
"when",
|
98
|
+
"for",
|
99
|
+
"at",
|
100
|
+
"to",
|
101
|
+
"with"
|
102
|
+
].include?(tok)
|
103
|
+
|
76
104
|
# Cannot cross the beginning of verb terms
|
77
105
|
# i.e. we may start with verb terms but not include them
|
78
106
|
b[:can_cross] = (chunk != "B-VP") if b[:can_cross]
|
@@ -83,21 +111,36 @@ class TermExtractor
|
|
83
111
|
|
84
112
|
# We are only allowed to start terms on the beginning of a term chunk
|
85
113
|
b[:can_start] = (chunks[i] == "B-NP")
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
end
|
114
|
+
|
115
|
+
# In some cases we want to move the start of a term to the right. These cases are:
|
116
|
+
# - a determiner (the, a, etc)
|
117
|
+
# - a possessive pronoun (my, your, etc)
|
118
|
+
# - comparative and superlative adjectives (best, better, etc.)
|
119
|
+
# - A number. In this case note that starting with the number is also allowed. e.g. "two cities" will produce both "two cities" and "cities"
|
120
|
+
# In all cases we only do this for noun terms, and will only move them to internal points.
|
121
|
+
if (chunks[i] == "I-NP") && (postags[i-1] =~ /DT|WDT|PRP|JJR|JJS|CD/)
|
122
|
+
b[:can_start] = true
|
96
123
|
end
|
97
124
|
|
98
125
|
# We must include any tokens internal to the current chunk
|
99
126
|
b[:can_end] = !(chunks[i + 1] =~ /I-/)
|
100
127
|
|
128
|
+
# We break phrases around coordinating conjunctions (and, or, etc)
|
129
|
+
# but allow phrases that should rightfully be forced to continue past
|
130
|
+
# the conjunction. e.g. in "nuts and bolts", we allow "nuts" and "bolts"
|
131
|
+
# but not the whole phrase. This is true even if this resolves as a single
|
132
|
+
# chunk
|
133
|
+
if pos == 'CC'
|
134
|
+
@boundaries[i-1][:can_end] = true if i > 0
|
135
|
+
@boundaries[i][:can_cross] = false
|
136
|
+
end
|
137
|
+
# need to do it here rather than in previous if statement
|
138
|
+
# as otherwise the next pass along will overwrite the result
|
139
|
+
# we set here.
|
140
|
+
if i > 0 && @postags[i-1] == 'CC'
|
141
|
+
@boundaries[i][:can_start] = true
|
142
|
+
end
|
143
|
+
|
101
144
|
# It is permitted to cross stopwords, but they cannot lie at the term boundary
|
102
145
|
if (nlp.stopword? tok) || (nlp.stopword? tokens[i..i+1].join) # Need to take into account contractions, which span multiple tokens
|
103
146
|
b[:can_end] = false
|
@@ -111,10 +154,12 @@ class TermExtractor
|
|
111
154
|
b[:can_start] = false
|
112
155
|
@boundaries[i - 1][:can_end] = false
|
113
156
|
end
|
114
|
-
|
115
|
-
#
|
116
|
-
b[:can_start] &&= !(pos =~
|
117
|
-
|
157
|
+
|
158
|
+
# Common sources of crap starting words
|
159
|
+
b[:can_start] &&= !(pos =~ /CC|PRP|IN|DT|PRP\$|WP|WP\$|TO|EX|JJR|JJS/)
|
160
|
+
|
161
|
+
# TODO: Is this still a good idea?
|
162
|
+
b[:can_end] &&= (pos =~ /NN|NNS|NNP|NNPS|FW|CD/)
|
118
163
|
|
119
164
|
end
|
120
165
|
|
@@ -149,7 +194,10 @@ class TermExtractor
|
|
149
194
|
|
150
195
|
term = tokens[i..j]
|
151
196
|
poses = postags.to_a[i..j]
|
152
|
-
term = Term.new(
|
197
|
+
term = Term.new(term){ |it|
|
198
|
+
it.pos = poses.join("-")
|
199
|
+
it.chunks = chunks.to_a[i..j]
|
200
|
+
}
|
153
201
|
terms << term if TermExtractor.allowed_term?(term)
|
154
202
|
|
155
203
|
j += 1
|
@@ -179,7 +227,7 @@ class TermExtractor
|
|
179
227
|
|
180
228
|
# Final post filter on terms to determine if they're allowed.
|
181
229
|
def self.allowed_term?(p)
|
182
|
-
return false if p.
|
230
|
+
return false if p.to_s =~ /^[^a-zA-Z]*$/ # We don't allow terms containing no letters at all (e.g. sequences of numbers or punctuation)
|
183
231
|
return false if p.to_s.length > 255
|
184
232
|
true
|
185
233
|
end
|
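The TermContext comments in the diff above describe the approach: each token is flagged as allowed to start a term, end a term, or be crossed, and terms are all token sequences up to the maximum length that satisfy those flags. The enumeration loop itself is only partly visible in this diff, so the following is a sketch of the idea under stated assumptions, not the gem's code. `Boundary` and `candidate_terms` are hypothetical names, and the rule that a term may start on an uncrossable token but not pass through one is inferred from the "may start with verb terms but not include them" comment.

```ruby
# One set of flags per token, mirroring the hashes built in `boundaries`.
Boundary = Struct.new(:can_start, :can_end, :can_cross)

# Enumerate every span of at most max_length tokens that begins on a
# can_start token, finishes on a can_end token, and never continues
# through a token that may not be crossed.
def candidate_terms(tokens, boundaries, max_length)
  terms = []
  tokens.each_index do |i|
    next unless boundaries[i].can_start
    j = i
    while j < tokens.length && (j - i) < max_length
      # The first token is exempt: a span may start on an uncrossable token.
      break if j > i && !boundaries[j].can_cross
      terms << tokens[i..j].join(" ") if boundaries[j].can_end
      j += 1
    end
  end
  terms
end

# "nuts and bolts": the conjunction cannot be crossed, so only the two
# single-word terms survive, as the comment in the diff describes.
tokens = %w[nuts and bolts]
flags = [
  Boundary.new(true, true, true),
  Boundary.new(false, false, false),
  Boundary.new(true, true, true)
]
p candidate_terms(tokens, flags, 4)
```

The real library joins tokens with `TermExtractor.recombobulate_term` rather than a plain space, and applies `allowed_term?` as a final filter; both are left out here to keep the sketch focused on the boundary flags.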
data/lib/term-extractor/nlp.rb
CHANGED
data/term-extractor.gemspec
CHANGED
@@ -2,11 +2,11 @@
|
|
2
2
|
|
3
3
|
Gem::Specification.new do |s|
|
4
4
|
s.name = %q{term-extractor}
|
5
|
-
s.version = "0.0.
|
5
|
+
s.version = "0.0.1"
|
6
6
|
|
7
7
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
8
8
|
s.authors = ["David R. MacIver"]
|
9
|
-
s.date = %q{2009-08
|
9
|
+
s.date = %q{2009-09-08}
|
10
10
|
s.default_executable = %q{terms.rb}
|
11
11
|
s.email = %q{david.maciver@gmail.com}
|
12
12
|
s.executables = ["terms.rb"]
|
@@ -19,6 +19,7 @@ Gem::Specification.new do |s|
|
|
19
19
|
"README.markdown",
|
20
20
|
"Rakefile",
|
21
21
|
"VERSION",
|
22
|
+
"app.rb",
|
22
23
|
"bin/terms.rb",
|
23
24
|
"lib/term-extractor.rb",
|
24
25
|
"lib/term-extractor/maxent-2.5.2.jar",
|
@@ -37,11 +38,14 @@ Gem::Specification.new do |s|
|
|
37
38
|
"models/tagdict",
|
38
39
|
"models/tok.bin.gz",
|
39
40
|
"term-extractor.gemspec",
|
40
|
-
"test/
|
41
|
+
"test/examples/2009-08-16-14:41_spec.rb",
|
41
42
|
"test/files/1.email",
|
42
43
|
"test/files/juries_seg_8_v1",
|
43
44
|
"test/nlp_spec.rb",
|
44
|
-
"test/term_extractor_spec.rb"
|
45
|
+
"test/term_extractor_spec.rb",
|
46
|
+
"training/bad",
|
47
|
+
"training/good",
|
48
|
+
"views/index.haml"
|
45
49
|
]
|
46
50
|
s.homepage = %q{http://github.com/david.maciver@gmail.com/term-extractor}
|
47
51
|
s.rdoc_options = ["--charset=UTF-8"]
|
@@ -49,9 +53,9 @@ Gem::Specification.new do |s|
|
|
49
53
|
s.rubygems_version = %q{1.3.4}
|
50
54
|
s.summary = %q{A library for extracting useful terms from text}
|
51
55
|
s.test_files = [
|
52
|
-
"test/
|
53
|
-
"test/
|
54
|
-
"test/
|
56
|
+
"test/examples/2009-08-16-14:41_spec.rb",
|
57
|
+
"test/term_extractor_spec.rb",
|
58
|
+
"test/nlp_spec.rb"
|
55
59
|
]
|
56
60
|
|
57
61
|
if s.respond_to? :specification_version then
|
@@ -0,0 +1,22 @@
|
|
1
|
+
TE = TermExtractor.new unless defined? TE
|
2
|
+
|
3
|
+
Text = <<TEXT
|
4
|
+
As the healthcare debate picks up pace, I find myself being asked with increasing regularity what I think of Britain’s healthcare system. Six months ago, I’d have jumped into the answer with gusto, but these days… I don’t know, I am just so fatigued by all the fear-mongering and hysteria, the ignorance and the downright idiocy of the current debate that I can hardly summon the energy to add my voice to the cacophony.
|
5
|
+
TEXT
|
6
|
+
|
7
|
+
Terms = TE.extract_terms_from_text(Text).map{|x| x.to_s}.sort.uniq
|
8
|
+
|
9
|
+
describe "the example generated at 2009-08-16-14:41" do
|
10
|
+
it "should contain the following terms" do
|
11
|
+
["healthcare debate", "Britain's healthcare system", "Six months", "answer", "gusto", "fear-mongering", "hysteria", "ignorance", "downright idiocy", "current debate", "energy", "voice", "cacophony"].each do |term|
|
12
|
+
Terms.should include(term)
|
13
|
+
end
|
14
|
+
end
|
15
|
+
|
16
|
+
it "should not contain the following terms" do
|
17
|
+
["days\342\200\246", "voice to the cacophony", "answer with gusto"].each do |term|
|
18
|
+
Terms.should_not include(term)
|
19
|
+
end
|
20
|
+
end
|
21
|
+
end
|
22
|
+
|
data/test/nlp_spec.rb
CHANGED
data/test/term_extractor_spec.rb
CHANGED
@@ -15,18 +15,6 @@ def each_tag(&blk)
|
|
15
15
|
end
|
16
16
|
|
17
17
|
describe TermExtractor do
|
18
|
-
it "should only return themes ending in nouns" do
|
19
|
-
each_tag do |tag|
|
20
|
-
tag.pos.should =~ /(^|-)(#{PE.required_ending})$/
|
21
|
-
end
|
22
|
-
end
|
23
|
-
|
24
|
-
it "must not return themes starting with proscribed parts of speech" do
|
25
|
-
each_tag do |tag|
|
26
|
-
tag.pos.should_not =~ /^(#{PE.proscribed_start})($|-)/
|
27
|
-
end
|
28
|
-
end
|
29
|
-
|
30
18
|
it "should produce at least as many tags as words" do
|
31
19
|
each_tag do |tag|
|
32
20
|
tag.pos.split("-").length.should be >= tag.to_s.split.length
|
@@ -137,5 +125,4 @@ BINARYSOLO
|
|
137
125
|
term.to_s.should_not =~ /’|'/
|
138
126
|
}
|
139
127
|
end
|
140
|
-
|
141
128
|
end
|
data/training/bad
ADDED
data/training/good
ADDED
data/views/index.haml
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
%html{ :xmlns => "http://www.w3.org/1999/xhtml", "xml:lang" => "en" }
|
2
|
+
%head
|
3
|
+
%title Training
|
4
|
+
|
5
|
+
%link{:rel => "stylesheet", :type => "text/css", :href => "/style.css"}/
|
6
|
+
%body
|
7
|
+
%h1 Term Extractor Training
|
8
|
+
%form{ :method => "POST" }
|
9
|
+
%div{ :style => "float: left; width: 60%;"}
|
10
|
+
%textarea{ :style =>"width: 100%; height: 60em;", :name => "text"}= @text
|
11
|
+
%input{:type => "submit", :name => "extract", :value => "Extract"}
|
12
|
+
|
13
|
+
%div{ :style => "float: right; width: 35%;"}
|
14
|
+
%h2 Good terms
|
15
|
+
%textarea{ :style => "width: 100%; height: 20em;", :name => "goodterms"}= @terms && @terms.join("\n")
|
16
|
+
%h2 Bad terms
|
17
|
+
%textarea{ :style => "width: 100%; height: 20em;", :name => "badterms"}
|
18
|
+
|
19
|
+
%input{:type => "submit", :name => "train", :value => "Train"}
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: DRMacIver-term-extractor
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- David R. MacIver
|
@@ -9,7 +9,7 @@ autorequire:
|
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
11
|
|
12
|
-
date: 2009-08
|
12
|
+
date: 2009-09-08 00:00:00 -07:00
|
13
13
|
default_executable: terms.rb
|
14
14
|
dependencies: []
|
15
15
|
|
@@ -27,6 +27,7 @@ files:
|
|
27
27
|
- README.markdown
|
28
28
|
- Rakefile
|
29
29
|
- VERSION
|
30
|
+
- app.rb
|
30
31
|
- bin/terms.rb
|
31
32
|
- lib/term-extractor.rb
|
32
33
|
- lib/term-extractor/maxent-2.5.2.jar
|
@@ -45,14 +46,16 @@ files:
|
|
45
46
|
- models/tagdict
|
46
47
|
- models/tok.bin.gz
|
47
48
|
- term-extractor.gemspec
|
48
|
-
- test/
|
49
|
+
- test/examples/2009-08-16-14:41_spec.rb
|
49
50
|
- test/files/1.email
|
50
51
|
- test/files/juries_seg_8_v1
|
51
52
|
- test/nlp_spec.rb
|
52
53
|
- test/term_extractor_spec.rb
|
54
|
+
- training/bad
|
55
|
+
- training/good
|
56
|
+
- views/index.haml
|
53
57
|
has_rdoc: false
|
54
58
|
homepage: http://github.com/david.maciver@gmail.com/term-extractor
|
55
|
-
licenses:
|
56
59
|
post_install_message:
|
57
60
|
rdoc_options:
|
58
61
|
- --charset=UTF-8
|
@@ -73,11 +76,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
73
76
|
requirements: []
|
74
77
|
|
75
78
|
rubyforge_project:
|
76
|
-
rubygems_version: 1.
|
79
|
+
rubygems_version: 1.2.0
|
77
80
|
signing_key:
|
78
81
|
specification_version: 3
|
79
82
|
summary: A library for extracting useful terms from text
|
80
83
|
test_files:
|
84
|
+
- test/examples/2009-08-16-14:41_spec.rb
|
81
85
|
- test/term_extractor_spec.rb
|
82
86
|
- test/nlp_spec.rb
|
83
|
-
- test/examples_spec.rb
|
data/test/examples_spec.rb
DELETED
@@ -1,131 +0,0 @@
|
|
1
|
-
require "term-extractor"
|
2
|
-
|
3
|
-
PE = TermExtractor.new
|
4
|
-
|
5
|
-
Diagrams = <<DIAGRAMS
|
6
|
-
I think having nice standardised diagrams of stuff like that is REALLY
|
7
|
-
useful. One OO architect drops dead and your replacement walks in and
|
8
|
-
can pick up the documents and read them because they already speak
|
9
|
-
that language. That's a great thing. I sort of wish it had been pushed
|
10
|
-
as being that -- a lingua franca for documenting designs.
|
11
|
-
DIAGRAMS
|
12
|
-
|
13
|
-
|
14
|
-
describe "Diagram terms" do
|
15
|
-
|
16
|
-
|
17
|
-
end
|
18
|
-
|
19
|
-
Murray = <<MURRAY
|
20
|
-
The MCHS Department of Music is one of the most distinguished music programs in the State, having an award-winning choral and band program. The Marching Indians, under the direction of Mr. Mike Weaver, have performed all over the country, most recently at Universal Studios in Orlando, Disney World and the St. Patrick's Day Parade in New York City. Since 1958, the Marching Indians have been entreating fans with exciting, visually stimulating shows and their trademark deep, loud sound. Recently the Marching Indians received the Grand Championship at the 2008 Golden River Music Festival and won the first ever US101 radio battle of the bands receiving a concert by the Eli Young Band. Many students from MCHS Department of Bands have been involved with All District and All State bands as well as various summer clinics, orchestras and even the Georgia Lions All State Band.
|
21
|
-
MURRAY
|
22
|
-
MurrayTerms = PE.extract_terms_from_text(Murray).map{|x| x.to_s}
|
23
|
-
|
24
|
-
describe "Murray terms" do
|
25
|
-
it "should get Mike's name right" do
|
26
|
-
MurrayTerms.should_not include("Mr . Mike Weaver")
|
27
|
-
MurrayTerms.should include("Mr. Mike Weaver")
|
28
|
-
end
|
29
|
-
end
|
30
|
-
|
31
|
-
Chromosome = <<CHROM
|
32
|
-
Humans have 23 pairs of chromosomes packed with genes that dictate every aspect of our biological functioning. Of these pairs, the sex chromosomes are different; women have two X chromosomes and men have an X and a Y chromosome. The Y chromosome contains essential blueprints for the male reproductive system, in particular those for sperm development.
|
33
|
-
|
34
|
-
But the Y chromosome, which once contained as many genes as the X chromosome, has deteriorated over time and now contains less than 80 functional genes compared to its partner, which contains more than 1,000 genes. Geneticists and evolutionary biologists determined that the Y chromosome's deterioration is due to accumulated mutations, deletions and anomalies that have nowhere to go because the chromosome doesn't swap genes with the X chromosome like every other chromosomal pair in our cells do.
|
35
|
-
CHROM
|
36
|
-
|
37
|
-
ChromosomeTerms = PE.extract_terms_from_text(Chromosome).map{|x| x.to_s}
|
38
|
-
|
39
|
-
describe "Chromosome terms" do
|
40
|
-
it "should say nothing about what humans have" do
|
41
|
-
ChromosomeTerms.should_not include("Humans have 23 pairs")
|
42
|
-
end
|
43
|
-
|
44
|
-
it "knows about the male reproductive system, if you know what I mean" do
|
45
|
-
ChromosomeTerms.should include("male reproductive system")
|
46
|
-
ChromosomeTerms.should include("sperm development")
|
47
|
-
end
|
48
|
-
|
49
|
-
it "is about humans" do
|
50
|
-
ChromosomeTerms.should include("Humans")
|
51
|
-
end
|
52
|
-
end
|
53
|
-
|
54
|
-
Environment = "Please consider the environment before printing this e-mail"
|
55
|
-
EnvironmentTerms = PE.extract_terms_from_text(Environment).map{|x| x.to_s}.sort
|
56
|
-
|
57
|
-
describe "Environment terms" do
|
58
|
-
it "is about email" do
|
59
|
-
EnvironmentTerms.should include("e-mail")
|
60
|
-
end
|
61
|
-
end
|
62
|
-
|
63
|
-
Apollo = <<APOLLO
|
64
|
-
Fate has ordained that the men who went to the moon to explore in peace will stay on the moon to rest in peace.
|
65
|
-
|
66
|
-
These brave men, Neil Armstrong and Edwin Aldrin, know that there is no hope for their recovery. But they also know that there is hope for mankind in their sacrifice.
|
67
|
-
|
68
|
-
These two men are laying down their lives in mankind's most noble goal: the search for truth and understanding.
|
69
|
-
|
70
|
-
They will be mourned by their families and friends; they will be mourned by their nation; they will be mourned by the people of the world; they will be mourned by a Mother Earth that dared send two of her sons into the unknown.
|
71
|
-
|
72
|
-
In their exploration, they stirred the people of the world to feel as one; in their sacrifice, they bind more tightly the brotherhood of man.
|
73
|
-
|
74
|
-
In ancient days, men looked at stars and saw their heroes in the constellations. In modern times, we do much the same, but our heroes are epic men of flesh and blood.
|
75
|
-
|
76
|
-
Others will follow, and surely find their way home. Man's search will not be denied. But these men were the first, and they will remain the foremost in our hearts.
|
77
|
-
APOLLO
|
78
|
-
|
79
|
-
ApolloTerms = PE.extract_terms_from_text(Apollo).map{|x| x.to_s}.sort.uniq
|
80
|
-
|
81
|
-
describe "Apollo terms" do
|
82
|
-
it "knows of Neil and Buzz" do
|
83
|
-
ApolloTerms.should include("Neil Armstrong")
|
84
|
-
ApolloTerms.should include("Edwin Aldrin")
|
85
|
-
end
|
86
|
-
|
87
|
-
it "knows of where they've been" do
|
88
|
-
ApolloTerms.should include("moon")
|
89
|
-
end
|
90
|
-
|
91
|
-
it "knows of times past and present" do
|
92
|
-
ApolloTerms.should include("ancient days")
|
93
|
-
ApolloTerms.should include("modern times")
|
94
|
-
end
|
95
|
-
|
96
|
-
it "knows of destiny" do
|
97
|
-
ApolloTerms.should include("Fate")
|
98
|
-
end
|
99
|
-
|
100
|
-
it "knows of searching" do
|
101
|
-
ApolloTerms.should include("exploration")
|
102
|
-
ApolloTerms.should include("search")
|
103
|
-
ApolloTerms.should include("Man's search")
|
104
|
-
end
|
105
|
-
|
106
|
-
it "knows not of mourning, but of courage and sacrifice" do
|
107
|
-
ApolloTerms.should_not include("mourned")
|
108
|
-
ApolloTerms.should include("brave men")
|
109
|
-
ApolloTerms.should include("sacrifice")
|
110
|
-
end
|
111
|
-
|
112
|
-
it "knows of brotherhood" do
|
113
|
-
ApolloTerms.should include("brotherhood of man")
|
114
|
-
end
|
115
|
-
|
116
|
-
it "knows of mankind, and of its heroes" do
|
117
|
-
ApolloTerms.should include("man")
|
118
|
-
ApolloTerms.should include("men")
|
119
|
-
ApolloTerms.should include("mankind")
|
120
|
-
ApolloTerms.should include("heroes")
|
121
|
-
ApolloTerms.should include("epic men")
|
122
|
-
end
|
123
|
-
|
124
|
-
it "looks to the stars from the earth" do
|
125
|
-
ApolloTerms.should include("stars")
|
126
|
-
ApolloTerms.should include("constellations")
|
127
|
-
ApolloTerms.should include("Mother Earth")
|
128
|
-
ApolloTerms.should include("world")
|
129
|
-
end
|
130
|
-
|
131
|
-
end
|