open-nlp 0.1.0 → 0.1.1

data/README.md CHANGED
@@ -1,29 +1,34 @@
1
1
  [![Build Status](https://secure.travis-ci.org/louismullie/open-nlp.png)](http://travis-ci.org/louismullie/open-nlp)
2
2
 
3
- **About**
3
+ ###About
4
4
 
5
- This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP).
5
+ This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP).
6
6
 
7
- This gem only provides a thin wrapper over the OpenNLP API. If you are looking for a Ruby natural language processing framework, have a look at [Treat](https://github.com/louismullie/treat).
7
+ ###Installing
8
8
 
9
- **Installing**
9
+ __Note: If you are running on MRI, this gem will use the Ruby-Java Bridge (Rjb), which currently does not support Java 7. Therefore, if you have installed Java 7, you should set your JAVA_HOME to point to your old Java 6 install before installing Rjb; for example, `export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"`.__
10
10
 
11
- _Note: If you are running on MRI, this gem will use the Ruby-Java Bridge (Rjb), which currently does not support Java 7. Therefore, if you have installed Java 7, you should set your JAVA_HOME to point to your old Java 6 install before installing Rjb; for example, `export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"`.
12
-
13
- First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [open-nlp website](http://opennlp.sourceforge.net/models-1.5/) or just get [all english language models](louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
11
+ First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [open-nlp website](http://opennlp.sourceforge.net/models-1.5/) or just get [all English language models](louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
14
12
 
15
13
  Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).
16
14
 
17
- **Configuration**
15
+ Alternatively, from a terminal window, `cd` to the gem's folder and run:
16
+
17
+ ```
18
+ wget http://www.louismullie.com/treat/open-nlp-english.zip
19
+ unzip -o open-nlp-english.zip -d bin/
20
+ ```
21
+
22
+ ###Configuring
18
23
 
19
- After installing and requiring the gem (`require 'open-nlp'`), you may want to set some optional configuration options. Here are some examples:
24
+ After installing and requiring the gem (`require 'open-nlp'`), you may want to set some of the following configuration options.
20
25
 
21
26
  ```ruby
22
- # Set an alternative path to look for the JAR files
27
+ # Set an alternative path to look for the JAR files.
23
28
  # Default is gem's bin folder.
24
29
  OpenNLP.jar_path = '/path_to_jars/'
25
30
 
26
- # Set an alternative path to look for the model files
31
+ # Set an alternative path to look for the model files.
27
32
  # Default is gem's bin folder.
28
33
  OpenNLP.model_path = '/path_to_models/'
29
34
 
@@ -34,76 +39,131 @@ OpenNLP.jvm_args = ['-option1', '-option2']
34
39
  # Redirect VM output to log.txt
35
40
  OpenNLP.log_file = 'log.txt'
36
41
 
37
- # WARNING: Not implemented yet.
42
+ ```
43
+
44
+ ###Examples
45
+
38
46
 
39
- # Use the model files for a different language than English.
40
- # OpenNLP.use(:french) # or :german
41
- #
42
- # Change a specific model file.
43
- # OpenNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
47
+ **Simple tokenizer**
48
+
49
+ ```ruby
50
+ OpenNLP.load
51
+
52
+ sent = "The death of the poet was kept from his poems."
53
+ tokenizer = OpenNLP::SimpleTokenizer.new
54
+
55
+ tokens = tokenizer.tokenize(sent).to_a
56
+ # => %w[The death of the poet was kept from his poems .]
44
57
  ```
45
58
 
46
- **Using the gem**
59
+ **Maximum entropy tokenizer, chunker and POS tagger**
47
60
 
48
61
  ```ruby
49
- text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
50
- 'Berlin to discuss a new $25 billion austerity package.' +
51
- 'Sarkozy looked pleased, but Merkel was dismayed.'
62
+
63
+ OpenNLP.load
64
+
65
+ chunker = OpenNLP::ChunkerME.new
66
+ tokenizer = OpenNLP::TokenizerME.new
67
+ tagger = OpenNLP::POSTaggerME.new
68
+
69
+ sent = "The death of the poet was kept from his poems."
70
+
71
+ tokens = tokenizer.tokenize(sent).to_a
72
+ # => %w[The death of the poet was kept from his poems .]
73
+
74
+ tags = tagger.tag(tokens).to_a
75
+ # => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]
76
+
77
+ chunks = chunker.chunk(tokens, tags).to_a
78
+ # => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]
79
+ ```
80
+
81
+ **Abstract Bottom-Up Parser**
82
+
83
+ ```ruby
84
+ OpenNLP.load
85
+
86
+ sent = "The death of the poet was kept from his poems."
87
+ parser = OpenNLP::Parser.new
88
+ parse = parser.parse(sent)
89
+
90
+ parse.get_text.should eql sent
91
+
92
+ parse.get_span.get_start.should eql 0
93
+ parse.get_span.get_end.should eql 46
94
+ parse.get_child_count.should eql 1
95
+
96
+ child = parse.get_children[0]
97
+
98
+ child.text # => "The death of the poet was kept from his poems."
99
+ child.get_child_count # => 3
100
+ child.get_head_index #=> 5
101
+ child.get_type # => "S"
102
+ ```
103
+
104
+ **Maximum Entropy Name Finder**
105
+
106
+ ```ruby
107
+ OpenNLP.load
108
+
109
+ text = File.read('./spec/sample.txt').gsub!("\n", "")
52
110
 
53
111
  tokenizer = OpenNLP::TokenizerME.new
54
112
  segmenter = OpenNLP::SentenceDetectorME.new
55
- tagger = OpenNLP::POSTaggerME.new
56
113
  ner_models = ['person', 'time', 'money']
57
114
 
58
115
  ner_finders = ner_models.map do |model|
59
- OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
116
+ OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
60
117
  end
61
118
 
62
119
  sentences = segmenter.sent_detect(text)
63
- all_entities = []
120
+ named_entities = []
64
121
 
65
122
  sentences.each do |sentence|
66
123
 
67
- tokens = tokenizer.tokenize(sentence)
68
- tags = tagger.tag(tokens)
69
-
70
- # Get a list of all tokens.
71
- puts tokens.to_a.inspect
72
- # Get the sentence's text.
73
- puts sentence.to_s.inspect
74
- # Get the sentence's tags.
75
- puts tags.to_a.inspect
76
-
77
- # Run three NER models and find entities.
78
- ner_models.each_with_index do |model,i|
79
- finder = ner_finders[i]
80
- name_spans = finder.find(tokens)
81
- name_spans.each do |name_span|
82
- start = name_span.get_start
83
- stop = name_span.get_end-1
84
- slice = tokens[start..stop].to_a
85
- all_entities << [slice, model]
86
- end
87
- end
124
+ tokens = tokenizer.tokenize(sentence)
125
+
126
+ ner_models.each_with_index do |model,i|
127
+ finder = ner_finders[i]
128
+ name_spans = finder.find(tokens)
129
+ name_spans.each do |name_span|
130
+ start = name_span.get_start
131
+ stop = name_span.get_end-1
132
+ slice = tokens[start..stop].to_a
133
+ named_entities << [slice, model]
134
+ end
135
+ end
88
136
 
89
137
  end
138
+ ```
139
+
140
+ **Loading specific models**
141
+
142
+ Just pass the name of the model file to the constructor. The gem will search for the file in the `OpenNLP.model_path` folder.
143
+
144
+ ```ruby
145
+ OpenNLP.load
90
146
 
91
- # Show all named entities.
92
- puts all_entities.inspect
147
+ tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
148
+ tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
149
+ name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
150
+ # etc.
93
151
  ```
94
152
 
95
153
  **Loading specific classes**
96
154
 
97
- You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
155
+ You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:
98
156
 
99
157
  ```ruby
100
158
  # Default base class is opennlp.tools.
101
159
  OpenNLP.load_class('SomeClassName')
160
+ # => OpenNLP::SomeClassName
102
161
 
103
162
  # Here, we specify another base class.
104
- OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
163
+ OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
164
+ # => OpenNLP::SomeOtherClass
105
165
  ```
106
166
 
107
167
  **Contributing**
108
168
 
109
- Feel free to fork the project and send me a pull request!
169
+ Fork the project and send me a pull request! Config updates for other languages are welcome.
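Note on the name finder example in the README hunk above: it turns each span into a token slice with `get_end-1` and an inclusive range, because the span's end index points one past the last token. The sketch below shows the same conversion without a JVM. `TokenSpan`, `tokens_for`, and the sample tokens are hypothetical stand-ins, not part of the gem or of OpenNLP.

```ruby
# Hypothetical stand-in for a span whose `stop` index is exclusive,
# i.e. one past the last token it covers.
TokenSpan = Struct.new(:start, :stop)

# Return the tokens covered by an end-exclusive span.
# tokens[start...stop] is equivalent to tokens[start..(stop - 1)].
def tokens_for(tokens, span)
  tokens[span.start...span.stop]
end

tokens = %w[Angela Merkel met Nicolas Sarkozy in Berlin .]
spans  = [TokenSpan.new(0, 2), TokenSpan.new(3, 5)]

entities = spans.map { |span| tokens_for(tokens, span) }
p entities
```

Using an exclusive range (`...`) avoids the explicit `- 1` and reads closer to the span's own definition.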
@@ -1,31 +1,35 @@
1
1
  module OpenNLP
2
2
 
3
3
  # Library version.
4
- VERSION = '0.1.0'
4
+ VERSION = '0.1.1'
5
5
 
6
6
  # Require Java bindings.
7
7
  require 'open-nlp/bindings'
8
- OpenNLP::Bindings.bind
9
8
 
10
9
  # Require Ruby wrappers.
11
10
  require 'open-nlp/classes'
12
11
 
12
+ # Setup the JVM and load the default JARs.
13
+ def self.load
14
+ OpenNLP::Bindings.bind
15
+ end
16
+
13
17
  # Load a Java class into the OpenNLP
14
18
  # namespace (e.g. OpenNLP::Loaded).
15
- def load_class(*args)
19
+ def self.load_class(*args)
16
20
  OpenNLP::Bindings.load_class(*args)
17
21
  end
18
22
 
19
23
  # Forwards the handling of missing
20
24
  # constants to the Bindings class.
21
- def const_missing(const)
25
+ def self.const_missing(const)
22
26
  OpenNLP::Bindings.const_get(const)
23
27
  end
24
28
 
25
29
  # Forward the handling of missing
26
30
  # methods to the Bindings class.
27
- def method_missing(sym, *args, &block)
31
+ def self.method_missing(sym, *args, &block)
28
32
  OpenNLP::Bindings.send(sym, *args, &block)
29
33
  end
30
34
 
31
- end
35
+ end
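Note on the `def` to `def self.` changes in this hunk: a plain `def` inside a module defines an instance method, which cannot be called on the module itself unless something like `extend self` or `module_function` is in place (whether 0.1.0 had either is not visible in this diff). The sketch below uses two hypothetical modules, `InstanceStyle` and `SingletonStyle`, to show the difference, including how a singleton `const_missing` intercepts unknown constants.

```ruby
module InstanceStyle
  # Instance method: only reachable from objects that include the module.
  def load_class(name)
    "loaded #{name}"
  end
end

module SingletonStyle
  # Singleton method: callable directly as SingletonStyle.load_class.
  def self.load_class(name)
    "loaded #{name}"
  end

  # Called when SingletonStyle::Foo is referenced but not defined.
  def self.const_missing(const)
    "resolved #{const}"
  end
end

instance_callable  = InstanceStyle.respond_to?(:load_class)
singleton_callable = SingletonStyle.respond_to?(:load_class)
loaded             = SingletonStyle.load_class('ChunkSample')
resolved           = SingletonStyle::Anything

puts instance_callable, singleton_callable, loaded, resolved
```

In the gem, the singleton versions forward to `OpenNLP::Bindings`; the strings returned here only stand in for that forwarding.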
@@ -10,10 +10,6 @@ module OpenNLP::Bindings
10
10
  require 'bind-it'
11
11
  extend BindIt::Binding
12
12
 
13
- # The path in which to look for JAR files, with
14
- # a trailing slash (default is gem's bin folder).
15
- self.jar_path = File.dirname(__FILE__) + '/../../bin/'
16
-
17
13
  # Load the JVM with a minimum heap size of 512MB,
18
14
  # and a maximum heap size of 1024MB.
19
15
  self.jvm_args = ['-Xms512M', '-Xmx1024M']
@@ -34,6 +30,7 @@ module OpenNLP::Bindings
34
30
 
35
31
  # Default classes.
36
32
  self.default_classes = [
33
+ # OpenNLP classes.
37
34
  ['AbstractBottomUpParser', 'opennlp.tools.parser'],
38
35
  ['DocumentCategorizerME', 'opennlp.tools.doccat'],
39
36
  ['ChunkerME', 'opennlp.tools.chunker'],
@@ -46,7 +43,12 @@ module OpenNLP::Bindings
46
43
  ['SentenceDetectorME', 'opennlp.tools.sentdetect'],
47
44
  ['SimpleTokenizer', 'opennlp.tools.tokenize'],
48
45
  ['Span', 'opennlp.tools.util'],
49
- ['TokenizerME', 'opennlp.tools.tokenize']
46
+ ['TokenizerME', 'opennlp.tools.tokenize'],
47
+
48
+ # Generic Java classes.
49
+ ['FileInputStream', 'java.io'],
50
+ ['String', 'java.lang'],
51
+ ['ArrayList', 'java.util']
50
52
  ]
51
53
 
52
54
  # Add in Rjb workarounds.
@@ -54,14 +56,6 @@ module OpenNLP::Bindings
54
56
  self.default_jars << 'utils.jar'
55
57
  self.default_classes << ['Utils', '']
56
58
  end
57
-
58
- # Make the bindings.
59
- self.bind
60
-
61
- # Load utility classes.
62
- self.load_class('FileInputStream', 'java.io')
63
- self.load_class('String', 'java.lang')
64
- self.load_class('ArrayList', 'java.util')
65
59
 
66
60
  # ############################ #
67
61
  # OpenNLP bindings proper #
@@ -78,12 +72,20 @@ module OpenNLP::Bindings
78
72
  attr_accessor :language
79
73
  end
80
74
 
75
+ def self.default_path
76
+ File.dirname(__FILE__) + '/../../bin/'
77
+ end
78
+
81
79
  # The loaded models.
82
80
  self.models = {}
83
81
 
84
82
  # The names of loaded models.
85
83
  self.model_files = {}
86
84
 
85
+ # The path in which to look for JAR files, with
86
+ # a trailing slash (default is gem's bin folder).
87
+ self.jar_path = self.default_path
88
+
87
89
  # The path to the main folder containing the folders
88
90
  # with the individual models inside. By default, this
89
91
  # is the same as the JAR path.
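Note on the new `default_path` method: moving the default into a method lets callers restore it after overriding `jar_path` or `model_path`, which is what the spec added in this release relies on. Below is a minimal sketch of that pattern using a hypothetical `PathDefaults` module; the real accessors come from `BindIt::Binding`, which is not reproduced here.

```ruby
module PathDefaults
  class << self
    # Module-level accessors, standing in for those the bindings expose.
    attr_accessor :jar_path, :model_path
  end

  # Single source of truth for the default location.
  def self.default_path
    File.dirname(__FILE__) + '/../../bin/'
  end

  self.jar_path   = default_path
  self.model_path = jar_path # models default to the JAR path
end

# Override, then restore both paths from the default.
PathDefaults.jar_path = '/unreachable/'
overridden = PathDefaults.jar_path

PathDefaults.jar_path = PathDefaults.model_path = PathDefaults.default_path
restored = PathDefaults.jar_path == PathDefaults.default_path

puts overridden, restored
```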
@@ -2,10 +2,51 @@
2
2
  require_relative 'spec_helper'
3
3
 
4
4
  describe OpenNLP do
5
+
6
+ context "when an unreachable jar_path or model_path is provided" do
7
+ it "raises an exception when trying to load" do
8
+ OpenNLP.jar_path = '/unreachable/'
9
+ OpenNLP::Bindings.jar_path.should eql '/unreachable/'
10
+ OpenNLP.model_path = '/unreachable/'
11
+ OpenNLP::Bindings.model_path.should eql '/unreachable/'
12
+ expect { OpenNLP.load }.to raise_exception
13
+ OpenNLP.jar_path = OpenNLP.model_path = OpenNLP.default_path
14
+ expect { OpenNLP.load }.not_to raise_exception
15
+ end
16
+ end
17
+
18
+ context "when a constructor is provided with a specific model to load" do
19
+ it "loads that model, looking for the supplied file relative to OpenNLP.model_path " do
20
+
21
+ OpenNLP.load
22
+
23
+ tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
24
+ tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
25
+
26
+ sent = "The death of the poet was kept from his poems."
27
+ tokens = tokenizer.tokenize(sent)
28
+ tags = tagger.tag(tokens)
29
+
30
+ OpenNLP.models[:pos_tagger].get_pos_model.to_s
31
+ .index('opennlp.perceptron.PerceptronModel').should_not be_nil
32
+
33
+ tags.should eql ["DT", "NN", "IN", "DT", "NN", "VBD", "VBN", "IN", "PRP$", "NNS", "."]
34
+
35
+ end
36
+ end
37
+
38
+ context "when a class is loaded through the #load_class method" do
39
+ it "loads the class and allows to access it through the global namespace" do
40
+ OpenNLP.load_class('ChunkSample', 'opennlp.tools.chunker')
41
+ expect { OpenNLP::ChunkSample }.not_to raise_exception
42
+ end
43
+ end
5
44
 
6
45
  context "the maximum entropy chunker is run after tokenization and POS tagging" do
7
46
  it "should find the accurate chunks" do
8
47
 
48
+ OpenNLP.load
49
+
9
50
  chunker = OpenNLP::ChunkerME.new
10
51
  tokenizer = OpenNLP::TokenizerME.new
11
52
  tagger = OpenNLP::POSTaggerME.new
@@ -25,7 +66,9 @@ describe OpenNLP do
25
66
 
26
67
  context "the maximum entropy parser is run after tokenization" do
27
68
  it "parses the text accurately" do
28
-
69
+
70
+ OpenNLP.load
71
+
29
72
  sent = "The death of the poet was kept from his poems."
30
73
  parser = OpenNLP::Parser.new
31
74
  parse = parser.parse(sent)
@@ -51,10 +94,14 @@ describe OpenNLP do
51
94
 
52
95
  context "the SimpleTokenizer is run" do
53
96
  it "tokenizes the text accurately" do
97
+
98
+ OpenNLP.load
99
+
54
100
  sent = "The death of the poet was kept from his poems."
55
101
  tokenizer = OpenNLP::SimpleTokenizer.new
56
102
  tokens = tokenizer.tokenize(sent).to_a
57
103
  tokens.should eql %w[The death of the poet was kept from his poems .]
104
+
58
105
  end
59
106
  end
60
107
 
@@ -63,6 +110,8 @@ describe OpenNLP do
63
110
 
64
111
  it "should accurately detect tokens, sentences and named entities" do
65
112
 
113
+ OpenNLP.load
114
+
66
115
  text = File.read('./spec/sample.txt').gsub!("\n", "")
67
116
 
68
117
  tokenizer = OpenNLP::TokenizerME.new
@@ -96,6 +145,7 @@ describe OpenNLP do
96
145
  all_tokens << tokens.to_a
97
146
  all_sentences << sentence
98
147
  all_tags << tags.to_a
148
+
99
149
  end
100
150
 
101
151
  all_tokens.should eql [["To", "describe", "2009", "as", "a", "stellar", "year", "for", "Petrofac", "(", "LON:PFC)", "would", "be", "a", "huge", "understatement", "."], ["The", "group", "finished", "the", "year", "with", "an", "order", "backlog", "twice", "the", "size", "than", "it", "had", "at", "the", "outset", "."], ["The", "group", "has", "since", "been", "awarded", "a", "US", "600", "million", "contract", "and", "spun", "off", "its", "North", "Sea", "assets", "."], ["The", "group", "’s", "recently", "released", "full", "year", "results", "show", "a", "jump", "in", "revenues", ",", "pre-tax", "profits", "and", "order", "backlog", "."], ["Whilst", "group", "revenue", "rose", "by", "10", "%", "from", "$", "3.3", "billion", "to", "$", "3.7", "billion", ",", "pre-tax", "profits", "rose", "by", "25", "%", "from", "$", "358", "million", "to", "$", "448", "million", ".All", "the", "more", "impressive", ",", "the", "group", "’s", "order", "backlog", "doubled", "to", "over", "$", "8", "billion", "paying", "no", "attention", "to", "the", "15", "%", "cut", "in", "capital", "expenditure", "witnessed", "across", "the", "oil", "and", "gas", "industry", "as", "whole", "in", "2009", ".Focussing", "in", "on", "which", "the", "underlying", "performances", "of", "the", "individual", "segments", ",", "the", "group", "cash", "cow", ",", "its", "Engineering", "and", "Construction", "division", ",", "saw", "operating", "profit", "rise", "33", "%", "over", "the", "year", "to", "$", "322", "million", ",", "thanks", "to", "US$", "6.3", "billion", "worth", "of", "new", "contract", "wins", "during", "the", "year", "which", "included", "a", "$", "100", "million", "contract", "with", "Turkmengaz", ",", "the", "Turkmenistan", "national", "energy", "company", "."], ["The", "division", "has", "picked", "up", "in", "2010", "where", "it", "left", "off", "in", "2009", "and", "has", "been", "awarded", "a", "contract", "worth", "more", "than", "US600", "million", "for", "a", "gas", "sweetening", 
"facilities", "project", "by", "Qatar", "Petroleum.Elsewhere", "the", "group", "’s", "Offshore", "Engineering", "&", "Operations", "division", "may", "have", "seen", "a", "pullback", "in", "revenue", "and", "earnings", "vis-a-vis", "2008", ",", "but", "it", "did", "secure", "a", "£75", "million", "contract", "with", "Apache", "to", "provideengineering", "and", "construction", "services", "for", "the", "Forties", "field", "in", "the", "UK", "North", "Sea", "."], ["And", "to", "underscore", "the", "fact", "that", "there", "is", "life", "beyond", "NOC’s", "for", "Petrofac", "(", "LON:PFC)", "the", "division", "was", "awarded", "a", "£100", "million", "5-year", "contract", "by", "BP", "(", "LON:BP.", ")", "to", "deliver", "integrated", "maintenance", "management", "support", "services", "for", "all", "of", "BP", "'s", "UK", "offshore", "assets", "and", "onshore", "Dimlington", "plant", "."], ["The", "laggard", "of", "the", "group", "was", "the", "Engineering", ",", "Training", "Services", "and", "Production", "Solutions", "division", "."], ["The", "business", "suffered", "as", "the", "oil", "price", "tailed", "off", "and", "the", "economic", "outlook", "deteriorated", "forcing", "a", "number", "ofmajor", "customers", "to", "postpone", "early", "stage", "engineering", "studies", "or", "re-phased", "work", "upon", "which", "the", "division", "depends", "."], ["Although", "the", "fall", "in", "activity", "was", "notable", ",", "the", "division’s", "operational", "performance", "in", "service", "operator", "role", "for", "production", "of", "Dubai", "'s", "offshore", "oil", "&", "gas", "proved", "a", "highlight.Energy", "Developments", "meanwhile", "saw", "the", "start", "of", "oil", "production", "from", "the", "West", "Don", "field", "during", "the", "first", "half", "of", "the", "year", "less", "than", "a", "year", "from", "Field", "Development", "Programme", "approval", "."], ["In", "addition", "output", "from", "Don", "Southwest", "field", "began", "in", "June", "."], 
["Despite", "considerably", "lower", "oil", "prices", "in", "2009", "compared", "to", "the", "prior", "year", ",", "Energy", "Developments", "'", "revenue", "reached", "almost", "US$", "250", "million", "(", "significantly", "higher", "than", "the", "US$", "153", "million", "of", "2008", ")", "due", "not", "only", "to", "the", "‘Don", "fields", "effect", "’", "but", "also", "a", "full", "year", "'s", "contribution", "from", "the", "Chergui", "gas", "plant", ",", "which", "began", "exports", "in", "August", "2008.In", "order", "to", "maximize", "the", "earnings", "potential", "of", "the", "division’s", "North", "Sea", "assets", ",", "including", "the", "Don", "assets", ",", "the", "group", "has", "demerged", "them", "providing", "its", "shareholders", "with", "shares", "in", "a", "newly", "listed", "independent", "exploration", "and", "production", "company", "called", "EnQuest", "(", "LON:ENQ", ")", "."], ["EnQuest", "is", "a", "product", "of", "the", "Petrofac’s", "North", "Sea", "Assets", "with", "those", "off", "of", "Swedish", "explorer", "Lundin", "with", "both", "companies", "divesting", "for", "different", "reasons", "."], ["Upon", "listing", "(", "April", "6th", ")", ",", "Petrofac", "(", "LON:PFC)", "shareholders", "owned", "around", "45", "%", "of", "the", "new", "EnQuest", "entity", "with", "Lundin", "shareholders", "owning", "approximately", "55", "%", "."], ["It", "is", "important", "to", "note", "that", "post", "demerger", "the", "Energy", "Developments", "business", "unit", "is", "still", "a", "key", "constituent", "of", "Petrofac", "'s", "business", "portfolio", ",", "and", "will", "continue", "to", "hold", "significant", "assets", "Tunisia", ",", "Malaysia", ",", "Algeria", "and", "Kyrgyz", "Republic", "-", "sandwiched", "between", "Kazakhstan", "and", "China", "."]]
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: open-nlp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors: