open-nlp 0.1.0 → 0.1.1

data/README.md CHANGED
@@ -1,29 +1,34 @@
1
1
  [![Build Status](https://secure.travis-ci.org/louismullie/open-nlp.png)](http://travis-ci.org/louismullie/open-nlp)
2
2
 
3
- **About**
3
+ ###About
4
4
 
5
- This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP).
5
+ This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP).
6
6
 
7
- This gem only provides a thin wrapper over the OpenNLP API. If you are looking for a Ruby natural language processing framework, have a look at [Treat](https://github.com/louismullie/treat).
7
+ ###Installing
8
8
 
9
- **Installing**
9
+ __Note: If you are running on MRI, this gem will use the Ruby-Java Bridge (Rjb), which currently does not support Java 7. Therefore, if you have installed Java 7, you should set your JAVA_HOME to point to your old Java 6 install before installing Rjb; for example, `export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"`.__
10
10
 
11
- _Note: If you are running on MRI, this gem will use the Ruby-Java Bridge (Rjb), which currently does not support Java 7. Therefore, if you have installed Java 7, you should set your JAVA_HOME to point to your old Java 6 install before installing Rjb; for example, `export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"`.
12
-
13
- First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [open-nlp website](http://opennlp.sourceforge.net/models-1.5/) or just get [all english language models](louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
11
+ First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [open-nlp website](http://opennlp.sourceforge.net/models-1.5/) or just get [all English language models](http://www.louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
14
12
 
15
13
  Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).
16
14
 
17
- **Configuration**
15
+ Alternatively, from a terminal window, `cd` to the gem's folder and run:
16
+
17
+ ```
18
+ wget http://www.louismullie.com/treat/open-nlp-english.zip
19
+ unzip -o open-nlp-english.zip -d bin/
20
+ ```
21
+
22
+ ###Configuring
18
23
 
19
- After installing and requiring the gem (`require 'open-nlp'`), you may want to set some optional configuration options. Here are some examples:
24
+ After installing and requiring the gem (`require 'open-nlp'`), you may want to set some of the following configuration options.
20
25
 
21
26
  ```ruby
22
- # Set an alternative path to look for the JAR files
27
+ # Set an alternative path to look for the JAR files.
23
28
  # Default is gem's bin folder.
24
29
  OpenNLP.jar_path = '/path_to_jars/'
25
30
 
26
- # Set an alternative path to look for the model files
31
+ # Set an alternative path to look for the model files.
27
32
  # Default is gem's bin folder.
28
33
  OpenNLP.model_path = '/path_to_models/'
29
34
 
@@ -34,76 +39,131 @@ OpenNLP.jvm_args = ['-option1', '-option2']
34
39
  # Redirect VM output to log.txt
35
40
  OpenNLP.log_file = 'log.txt'
36
41
 
37
- # WARNING: Not implemented yet.
42
+ ```
43
+
44
+ ###Examples
45
+
38
46
 
39
- # Use the model files for a different language than English.
40
- # OpenNLP.use(:french) # or :german
41
- #
42
- # Change a specific model file.
43
- # OpenNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
47
+ **Simple tokenizer**
48
+
49
+ ```ruby
50
+ OpenNLP.load
51
+
52
+ sent = "The death of the poet was kept from his poems."
53
+ tokenizer = OpenNLP::SimpleTokenizer.new
54
+
55
+ tokens = tokenizer.tokenize(sent).to_a
56
+ # => %w[The death of the poet was kept from his poems .]
44
57
  ```
45
58
 
46
- **Using the gem**
59
+ **Maximum entropy tokenizer, chunker and POS tagger**
47
60
 
48
61
  ```ruby
49
- text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
50
- 'Berlin to discuss a new $25 billion austerity package.' +
51
- 'Sarkozy looked pleased, but Merkel was dismayed.'
62
+
63
+ OpenNLP.load
64
+
65
+ chunker = OpenNLP::ChunkerME.new
66
+ tokenizer = OpenNLP::TokenizerME.new
67
+ tagger = OpenNLP::POSTaggerME.new
68
+
69
+ sent = "The death of the poet was kept from his poems."
70
+
71
+ tokens = tokenizer.tokenize(sent).to_a
72
+ # => %w[The death of the poet was kept from his poems .]
73
+
74
+ tags = tagger.tag(tokens).to_a
75
+ # => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]
76
+
77
+ chunks = chunker.chunk(tokens, tags).to_a
78
+ # => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]
79
+ ```
80
+
81
+ **Abstract Bottom-Up Parser**
82
+
83
+ ```ruby
84
+ OpenNLP.load
85
+
86
+ sent = "The death of the poet was kept from his poems."
87
+ parser = OpenNLP::Parser.new
88
+ parse = parser.parse(sent)
89
+
90
+ parse.get_text.should eql sent
91
+
92
+ parse.get_span.get_start.should eql 0
93
+ parse.get_span.get_end.should eql 46
94
+ parse.get_child_count.should eql 1
95
+
96
+ child = parse.get_children[0]
97
+
98
+ child.text # => "The death of the poet was kept from his poems."
99
+ child.get_child_count # => 3
100
+ child.get_head_index #=> 5
101
+ child.get_type # => "S"
102
+ ```
103
+
104
+ **Maximum Entropy Name Finder**
105
+
106
+ ```ruby
107
+ OpenNLP.load
108
+
109
+ text = File.read('./spec/sample.txt').gsub!("\n", "")
52
110
 
53
111
  tokenizer = OpenNLP::TokenizerME.new
54
112
  segmenter = OpenNLP::SentenceDetectorME.new
55
- tagger = OpenNLP::POSTaggerME.new
56
113
  ner_models = ['person', 'time', 'money']
57
114
 
58
115
  ner_finders = ner_models.map do |model|
59
- OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
116
+ OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
60
117
  end
61
118
 
62
119
  sentences = segmenter.sent_detect(text)
63
- all_entities = []
120
+ named_entities = []
64
121
 
65
122
  sentences.each do |sentence|
66
123
 
67
- tokens = tokenizer.tokenize(sentence)
68
- tags = tagger.tag(tokens)
69
-
70
- # Get a list of all tokens.
71
- puts tokens.to_a.inspect
72
- # Get the sentence's text.
73
- puts sentence.to_s.inspect
74
- # Get the sentence's tags.
75
- puts tags.to_a.inspect
76
-
77
- # Run three NER models and find entities.
78
- ner_models.each_with_index do |model,i|
79
- finder = ner_finders[i]
80
- name_spans = finder.find(tokens)
81
- name_spans.each do |name_span|
82
- start = name_span.get_start
83
- stop = name_span.get_end-1
84
- slice = tokens[start..stop].to_a
85
- all_entities << [slice, model]
86
- end
87
- end
124
+ tokens = tokenizer.tokenize(sentence)
125
+
126
+ ner_models.each_with_index do |model,i|
127
+ finder = ner_finders[i]
128
+ name_spans = finder.find(tokens)
129
+ name_spans.each do |name_span|
130
+ start = name_span.get_start
131
+ stop = name_span.get_end-1
132
+ slice = tokens[start..stop].to_a
133
+ named_entities << [slice, model]
134
+ end
135
+ end
88
136
 
89
137
  end
138
+ ```
139
+
140
+ **Loading specific models**
141
+
142
+ Just pass the name of the model file to the constructor. The gem will search for the file in the `OpenNLP.model_path` folder.
143
+
144
+ ```ruby
145
+ OpenNLP.load
90
146
 
91
- # Show all named entities.
92
- puts all_entities.inspect
147
+ tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
148
+ tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
149
+ name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
150
+ # etc.
93
151
  ```
94
152
 
95
153
  **Loading specific classes**
96
154
 
97
- You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
155
+ You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:
98
156
 
99
157
  ```ruby
100
158
  # Default base class is opennlp.tools.
101
159
  OpenNLP.load_class('SomeClassName')
160
+ # => OpenNLP::SomeClassName
102
161
 
103
162
  # Here, we specify another base class.
104
- OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
163
+ OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
164
+ # => OpenNLP::SomeOtherClass
105
165
  ```
106
166
 
107
167
  **Contributing**
108
168
 
109
- Feel free to fork the project and send me a pull request!
169
+ Fork the project and send me a pull request! Config updates for other languages are welcome.
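
The name finder example in the README converts each span's end index with `get_end-1` before slicing, because the span end is exclusive. A minimal sketch of that conversion on plain Ruby arrays follows; `SpanSketch` and the sample tokens are hypothetical stand-ins for illustration, not the gem's or OpenNLP's classes.

```ruby
# Hypothetical stand-in for a half-open span:
# start is inclusive, stop is exclusive.
SpanSketch = Struct.new(:start, :stop)

tokens = %w[The group was awarded a contract by Qatar Petroleum .]
spans  = [SpanSketch.new(7, 9)]

# An exclusive range (three dots) matches the half-open span directly.
entities = spans.map { |span| tokens[span.start...span.stop] }

# The README's form: subtract one, then use an inclusive range (two dots).
same_entities = spans.map { |span| tokens[span.start..(span.stop - 1)] }

puts entities.inspect
```

Both forms select the same tokens; the three-dot range simply avoids the manual `- 1`.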
@@ -1,31 +1,35 @@
1
1
  module OpenNLP
2
2
 
3
3
  # Library version.
4
- VERSION = '0.1.0'
4
+ VERSION = '0.1.1'
5
5
 
6
6
  # Require Java bindings.
7
7
  require 'open-nlp/bindings'
8
- OpenNLP::Bindings.bind
9
8
 
10
9
  # Require Ruby wrappers.
11
10
  require 'open-nlp/classes'
12
11
 
12
+ # Set up the JVM and load the default JARs.
13
+ def self.load
14
+ OpenNLP::Bindings.bind
15
+ end
16
+
13
17
  # Load a Java class into the OpenNLP
14
18
  # namespace (e.g. OpenNLP::Loaded).
15
- def load_class(*args)
19
+ def self.load_class(*args)
16
20
  OpenNLP::Bindings.load_class(*args)
17
21
  end
18
22
 
19
23
  # Forwards the handling of missing
20
24
  # constants to the Bindings class.
21
- def const_missing(const)
25
+ def self.const_missing(const)
22
26
  OpenNLP::Bindings.const_get(const)
23
27
  end
24
28
 
25
29
  # Forward the handling of missing
26
30
  # methods to the Bindings class.
27
- def method_missing(sym, *args, &block)
31
+ def self.method_missing(sym, *args, &block)
28
32
  OpenNLP::Bindings.send(sym, *args, &block)
29
33
  end
30
34
 
31
- end
35
+ end
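
The switch from `def const_missing` to `def self.const_missing` (and likewise for `method_missing` and `load_class`) matters because hooks defined as plain instance methods inside a module are not called for `OpenNLP::Something` or `OpenNLP.something`. The sketch below illustrates the forwarding pattern with hypothetical `Demo` and `Demo::Bindings` modules; it is not the gem's code, and the `respond_to_missing?` hook is an addition that the diff does not contain.

```ruby
# Hypothetical modules standing in for OpenNLP and OpenNLP::Bindings.
module Demo
  module Bindings
    Loaded = :loaded_constant

    def self.greet(name)
      "hello #{name}"
    end
  end

  # Called for Demo::Loaded when Demo itself has no such constant.
  def self.const_missing(const)
    Bindings.const_get(const)
  end

  # Called for Demo.greet when Demo itself has no such method.
  def self.method_missing(sym, *args, &block)
    if Bindings.respond_to?(sym)
      Bindings.send(sym, *args, &block)
    else
      super
    end
  end

  # Keeps respond_to? consistent with the forwarding above.
  def self.respond_to_missing?(sym, include_private = false)
    Bindings.respond_to?(sym) || super
  end
end

puts Demo::Loaded.inspect
puts Demo.greet('world')
```

Falling back to `super` for unknown names keeps the usual `NoMethodError` instead of raising from inside the forwarding target.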
@@ -10,10 +10,6 @@ module OpenNLP::Bindings
10
10
  require 'bind-it'
11
11
  extend BindIt::Binding
12
12
 
13
- # The path in which to look for JAR files, with
14
- # a trailing slash (default is gem's bin folder).
15
- self.jar_path = File.dirname(__FILE__) + '/../../bin/'
16
-
17
13
  # Load the JVM with a minimum heap size of 512MB,
18
14
  # and a maximum heap size of 1024MB.
19
15
  self.jvm_args = ['-Xms512M', '-Xmx1024M']
@@ -34,6 +30,7 @@ module OpenNLP::Bindings
34
30
 
35
31
  # Default classes.
36
32
  self.default_classes = [
33
+ # OpenNLP classes.
37
34
  ['AbstractBottomUpParser', 'opennlp.tools.parser'],
38
35
  ['DocumentCategorizerME', 'opennlp.tools.doccat'],
39
36
  ['ChunkerME', 'opennlp.tools.chunker'],
@@ -46,7 +43,12 @@ module OpenNLP::Bindings
46
43
  ['SentenceDetectorME', 'opennlp.tools.sentdetect'],
47
44
  ['SimpleTokenizer', 'opennlp.tools.tokenize'],
48
45
  ['Span', 'opennlp.tools.util'],
49
- ['TokenizerME', 'opennlp.tools.tokenize']
46
+ ['TokenizerME', 'opennlp.tools.tokenize'],
47
+
48
+ # Generic Java classes.
49
+ ['FileInputStream', 'java.io'],
50
+ ['String', 'java.lang'],
51
+ ['ArrayList', 'java.util']
50
52
  ]
51
53
 
52
54
  # Add in Rjb workarounds.
@@ -54,14 +56,6 @@ module OpenNLP::Bindings
54
56
  self.default_jars << 'utils.jar'
55
57
  self.default_classes << ['Utils', '']
56
58
  end
57
-
58
- # Make the bindings.
59
- self.bind
60
-
61
- # Load utility classes.
62
- self.load_class('FileInputStream', 'java.io')
63
- self.load_class('String', 'java.lang')
64
- self.load_class('ArrayList', 'java.util')
65
59
 
66
60
  # ############################ #
67
61
  # OpenNLP bindings proper #
@@ -78,12 +72,20 @@ module OpenNLP::Bindings
78
72
  attr_accessor :language
79
73
  end
80
74
 
75
+ def self.default_path
76
+ File.dirname(__FILE__) + '/../../bin/'
77
+ end
78
+
81
79
  # The loaded models.
82
80
  self.models = {}
83
81
 
84
82
  # The names of loaded models.
85
83
  self.model_files = {}
86
84
 
85
+ # The path in which to look for JAR files, with
86
+ # a trailing slash (default is gem's bin folder).
87
+ self.jar_path = self.default_path
88
+
87
89
  # The path to the main folder containing the folders
88
90
  # with the individual models inside. By default, this
89
91
  # is the same as the JAR path.
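
The new `default_path` builds the path by string concatenation, so the result keeps its `/../../` segments. A minimal sketch of a normalized variant follows, assuming a POSIX-style file system; `default_path_sketch` and the sample file path are hypothetical names for illustration, not part of the gem.

```ruby
# Hypothetical helper: resolves the gem's bin folder relative to a
# source file and keeps the trailing slash that jar_path expects.
def default_path_sketch(file)
  File.expand_path('../../bin', File.dirname(file)) + '/'
end

file = '/gems/open-nlp-0.1.1/lib/open-nlp/bindings.rb'

# Form used in the diff: correct, but not normalized.
raw = File.dirname(file) + '/../../bin/'

# Normalized form: same folder, no parent-directory segments.
normalized = default_path_sketch(file)

puts raw
puts normalized
```

Either form points at the same folder; the normalized one is easier to read in logs and error messages.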
@@ -2,10 +2,51 @@
2
2
  require_relative 'spec_helper'
3
3
 
4
4
  describe OpenNLP do
5
+
6
+ context "when an unreachable jar_path or model_path is provided" do
7
+ it "raises an exception when trying to load" do
8
+ OpenNLP.jar_path = '/unreachable/'
9
+ OpenNLP::Bindings.jar_path.should eql '/unreachable/'
10
+ OpenNLP.model_path = '/unreachable/'
11
+ OpenNLP::Bindings.model_path.should eql '/unreachable/'
12
+ expect { OpenNLP.load }.to raise_exception
13
+ OpenNLP.jar_path = OpenNLP.model_path = OpenNLP.default_path
14
+ expect { OpenNLP.load }.not_to raise_exception
15
+ end
16
+ end
17
+
18
+ context "when a constructor is provided with a specific model to load" do
19
+ it "loads that model, looking for the supplied file relative to OpenNLP.model_path " do
20
+
21
+ OpenNLP.load
22
+
23
+ tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
24
+ tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
25
+
26
+ sent = "The death of the poet was kept from his poems."
27
+ tokens = tokenizer.tokenize(sent)
28
+ tags = tagger.tag(tokens)
29
+
30
+ OpenNLP.models[:pos_tagger].get_pos_model.to_s
31
+ .index('opennlp.perceptron.PerceptronModel').should_not be_nil
32
+
33
+ tags.should eql ["DT", "NN", "IN", "DT", "NN", "VBD", "VBN", "IN", "PRP$", "NNS", "."]
34
+
35
+ end
36
+ end
37
+
38
+ context "when a class is loaded through the #load_class method" do
39
+ it "loads the class and allows access to it through the global namespace" do
40
+ OpenNLP.load_class('ChunkSample', 'opennlp.tools.chunker')
41
+ expect { OpenNLP::ChunkSample }.not_to raise_exception
42
+ end
43
+ end
5
44
 
6
45
  context "the maximum entropy chunker is run after tokenization and POS tagging" do
7
46
  it "should find the accurate chunks" do
8
47
 
48
+ OpenNLP.load
49
+
9
50
  chunker = OpenNLP::ChunkerME.new
10
51
  tokenizer = OpenNLP::TokenizerME.new
11
52
  tagger = OpenNLP::POSTaggerME.new
@@ -25,7 +66,9 @@ describe OpenNLP do
25
66
 
26
67
  context "the maximum entropy parser is run after tokenization" do
27
68
  it "parses the text accurately" do
28
-
69
+
70
+ OpenNLP.load
71
+
29
72
  sent = "The death of the poet was kept from his poems."
30
73
  parser = OpenNLP::Parser.new
31
74
  parse = parser.parse(sent)
@@ -51,10 +94,14 @@ describe OpenNLP do
51
94
 
52
95
  context "the SimpleTokenizer is run" do
53
96
  it "tokenizes the text accurately" do
97
+
98
+ OpenNLP.load
99
+
54
100
  sent = "The death of the poet was kept from his poems."
55
101
  tokenizer = OpenNLP::SimpleTokenizer.new
56
102
  tokens = tokenizer.tokenize(sent).to_a
57
103
  tokens.should eql %w[The death of the poet was kept from his poems .]
104
+
58
105
  end
59
106
  end
60
107
 
@@ -63,6 +110,8 @@ describe OpenNLP do
63
110
 
64
111
  it "should accurately detect tokens, sentences and named entities" do
65
112
 
113
+ OpenNLP.load
114
+
66
115
  text = File.read('./spec/sample.txt').gsub!("\n", "")
67
116
 
68
117
  tokenizer = OpenNLP::TokenizerME.new
@@ -96,6 +145,7 @@ describe OpenNLP do
96
145
  all_tokens << tokens.to_a
97
146
  all_sentences << sentence
98
147
  all_tags << tags.to_a
148
+
99
149
  end
100
150
 
101
151
  all_tokens.should eql [["To", "describe", "2009", "as", "a", "stellar", "year", "for", "Petrofac", "(", "LON:PFC)", "would", "be", "a", "huge", "understatement", "."], ["The", "group", "finished", "the", "year", "with", "an", "order", "backlog", "twice", "the", "size", "than", "it", "had", "at", "the", "outset", "."], ["The", "group", "has", "since", "been", "awarded", "a", "US", "600", "million", "contract", "and", "spun", "off", "its", "North", "Sea", "assets", "."], ["The", "group", "’s", "recently", "released", "full", "year", "results", "show", "a", "jump", "in", "revenues", ",", "pre-tax", "profits", "and", "order", "backlog", "."], ["Whilst", "group", "revenue", "rose", "by", "10", "%", "from", "$", "3.3", "billion", "to", "$", "3.7", "billion", ",", "pre-tax", "profits", "rose", "by", "25", "%", "from", "$", "358", "million", "to", "$", "448", "million", ".All", "the", "more", "impressive", ",", "the", "group", "’s", "order", "backlog", "doubled", "to", "over", "$", "8", "billion", "paying", "no", "attention", "to", "the", "15", "%", "cut", "in", "capital", "expenditure", "witnessed", "across", "the", "oil", "and", "gas", "industry", "as", "whole", "in", "2009", ".Focussing", "in", "on", "which", "the", "underlying", "performances", "of", "the", "individual", "segments", ",", "the", "group", "cash", "cow", ",", "its", "Engineering", "and", "Construction", "division", ",", "saw", "operating", "profit", "rise", "33", "%", "over", "the", "year", "to", "$", "322", "million", ",", "thanks", "to", "US$", "6.3", "billion", "worth", "of", "new", "contract", "wins", "during", "the", "year", "which", "included", "a", "$", "100", "million", "contract", "with", "Turkmengaz", ",", "the", "Turkmenistan", "national", "energy", "company", "."], ["The", "division", "has", "picked", "up", "in", "2010", "where", "it", "left", "off", "in", "2009", "and", "has", "been", "awarded", "a", "contract", "worth", "more", "than", "US600", "million", "for", "a", "gas", "sweetening", 
"facilities", "project", "by", "Qatar", "Petroleum.Elsewhere", "the", "group", "’s", "Offshore", "Engineering", "&", "Operations", "division", "may", "have", "seen", "a", "pullback", "in", "revenue", "and", "earnings", "vis-a-vis", "2008", ",", "but", "it", "did", "secure", "a", "£75", "million", "contract", "with", "Apache", "to", "provideengineering", "and", "construction", "services", "for", "the", "Forties", "field", "in", "the", "UK", "North", "Sea", "."], ["And", "to", "underscore", "the", "fact", "that", "there", "is", "life", "beyond", "NOC’s", "for", "Petrofac", "(", "LON:PFC)", "the", "division", "was", "awarded", "a", "£100", "million", "5-year", "contract", "by", "BP", "(", "LON:BP.", ")", "to", "deliver", "integrated", "maintenance", "management", "support", "services", "for", "all", "of", "BP", "'s", "UK", "offshore", "assets", "and", "onshore", "Dimlington", "plant", "."], ["The", "laggard", "of", "the", "group", "was", "the", "Engineering", ",", "Training", "Services", "and", "Production", "Solutions", "division", "."], ["The", "business", "suffered", "as", "the", "oil", "price", "tailed", "off", "and", "the", "economic", "outlook", "deteriorated", "forcing", "a", "number", "ofmajor", "customers", "to", "postpone", "early", "stage", "engineering", "studies", "or", "re-phased", "work", "upon", "which", "the", "division", "depends", "."], ["Although", "the", "fall", "in", "activity", "was", "notable", ",", "the", "division’s", "operational", "performance", "in", "service", "operator", "role", "for", "production", "of", "Dubai", "'s", "offshore", "oil", "&", "gas", "proved", "a", "highlight.Energy", "Developments", "meanwhile", "saw", "the", "start", "of", "oil", "production", "from", "the", "West", "Don", "field", "during", "the", "first", "half", "of", "the", "year", "less", "than", "a", "year", "from", "Field", "Development", "Programme", "approval", "."], ["In", "addition", "output", "from", "Don", "Southwest", "field", "began", "in", "June", "."], 
["Despite", "considerably", "lower", "oil", "prices", "in", "2009", "compared", "to", "the", "prior", "year", ",", "Energy", "Developments", "'", "revenue", "reached", "almost", "US$", "250", "million", "(", "significantly", "higher", "than", "the", "US$", "153", "million", "of", "2008", ")", "due", "not", "only", "to", "the", "‘Don", "fields", "effect", "’", "but", "also", "a", "full", "year", "'s", "contribution", "from", "the", "Chergui", "gas", "plant", ",", "which", "began", "exports", "in", "August", "2008.In", "order", "to", "maximize", "the", "earnings", "potential", "of", "the", "division’s", "North", "Sea", "assets", ",", "including", "the", "Don", "assets", ",", "the", "group", "has", "demerged", "them", "providing", "its", "shareholders", "with", "shares", "in", "a", "newly", "listed", "independent", "exploration", "and", "production", "company", "called", "EnQuest", "(", "LON:ENQ", ")", "."], ["EnQuest", "is", "a", "product", "of", "the", "Petrofac’s", "North", "Sea", "Assets", "with", "those", "off", "of", "Swedish", "explorer", "Lundin", "with", "both", "companies", "divesting", "for", "different", "reasons", "."], ["Upon", "listing", "(", "April", "6th", ")", ",", "Petrofac", "(", "LON:PFC)", "shareholders", "owned", "around", "45", "%", "of", "the", "new", "EnQuest", "entity", "with", "Lundin", "shareholders", "owning", "approximately", "55", "%", "."], ["It", "is", "important", "to", "note", "that", "post", "demerger", "the", "Energy", "Developments", "business", "unit", "is", "still", "a", "key", "constituent", "of", "Petrofac", "'s", "business", "portfolio", ",", "and", "will", "continue", "to", "hold", "significant", "assets", "Tunisia", ",", "Malaysia", ",", "Algeria", "and", "Kyrgyz", "Republic", "-", "sandwiched", "between", "Kazakhstan", "and", "China", "."]]
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: open-nlp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors: