open-nlp 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,18 @@
1
+ Ruby bindings for the OpenNLP tools.
2
+
3
+ This program is free software: you can redistribute it and/or modify
4
+ it under the terms of the GNU General Public License as published by
5
+ the Free Software Foundation, either version 3 of the License, or
6
+ (at your option) any later version.
7
+
8
+ This program is distributed in the hope that it will be useful,
9
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
10
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11
+ GNU General Public License for more details.
12
+
13
+ This license also applies to the included Stanford CoreNLP files.
14
+
15
+ You should have received a copy of the GNU General Public License
16
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
17
+
18
+ Author: Louis-Antoine Mullie (louis.mullie@gmail.com). Copyright 2012.
data/README.md ADDED
@@ -0,0 +1,112 @@
1
+ **Warning**
2
+
3
+ This is an alpha release. Expect things to break and/or change in the near future! Also, keep the following in mind:
4
+
5
+ - Currently, this gem is only tested on JRuby, but support for MRI through Rjb is coming very soon.
6
+ - Currently, the parser and chunker classes are not working.
7
+
8
+ **About**
9
+
10
+ This library provides high-level Ruby bindings to the OpenNLP package, a Java machine learning toolkit for natural language processing (NLP).
11
+
12
+ This gem only provides a thin wrapper over the OpenNLP API. If you are looking for a Ruby natural language processing framework, have a look at [Treat](https://github.com/louismullie/treat).
13
+
14
+ **Installing**
15
+
16
+ First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [OpenNLP website](http://opennlp.sourceforge.net/models-1.5/) or just get [all English language models](http://louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
17
+
18
+ Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).
19
+
20
+ **Configuration**
21
+
22
+ After installing and requiring the gem (`require 'open-nlp'`), you may want to set some configuration options. Here are some examples:
23
+
24
+ ```ruby
25
+ # Set an alternative path to look for the JAR files
26
+ # Default is gem's bin folder.
27
+ OpenNLP.jar_path = '/path_to_jars/'
28
+
29
+ # Set an alternative path to look for the model files
30
+ # Default is gem's bin folder.
31
+ OpenNLP.model_path = '/path_to_models/'
32
+
33
+ # Pass some alternative arguments to the Java VM.
34
+ # Default is ['-Xms512M', '-Xmx1024M'].
35
+ OpenNLP.jvm_args = ['-option1', '-option2']
36
+
37
+ # Redirect VM output to log.txt
38
+ OpenNLP.log_file = 'log.txt'
39
+
40
+ # WARNING: Not implemented yet.
41
+
42
+ # Use the model files for a different language than English.
43
+ # OpenNLP.use(:french) # or :german
44
+ #
45
+ # Change a specific model file.
46
+ # OpenNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
47
+ ```
48
+
49
+ **Using the gem**
50
+
51
+ ```ruby
52
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
53
+ 'Berlin to discuss a new $25 billion austerity package. ' +
54
+ 'Sarkozy looked pleased, but Merkel was dismayed.'
55
+
56
+ tokenizer = OpenNLP::TokenizerME.new
57
+ segmenter = OpenNLP::SentenceDetectorME.new
58
+ tagger = OpenNLP::POSTaggerME.new
59
+ ner_models = ['person', 'time', 'money']
60
+
61
+ ner_finders = ner_models.map do |model|
62
+ OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
63
+ end
64
+
65
+ sentences = segmenter.sent_detect(text)
66
+ all_entities = []
67
+
68
+ sentences.each do |sentence|
69
+
70
+ tokens = tokenizer.tokenize(sentence)
71
+ tags = tagger.tag(tokens)
72
+
73
+ # Get a list of all tokens.
74
+ puts tokens.to_a.inspect
75
+ # Get the sentence's text.
76
+ puts sentence.to_s.inspect
77
+ # Get the sentence's tags.
78
+ puts tags.to_a.inspect
79
+
80
+ # Run three NER models and find entities.
81
+ ner_models.each_with_index do |model,i|
82
+ finder = ner_finders[i]
83
+ name_spans = finder.find(tokens)
84
+ name_spans.each do |name_span|
85
+ start = name_span.get_start
86
+ stop = name_span.get_end-1
87
+ slice = tokens[start..stop].to_a
88
+ all_entities << [slice, model]
89
+ end
90
+ end
91
+
92
+ end
93
+
94
+ # Show all named entities.
95
+ puts all_entities.inspect
96
+ ```
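The entity loop above turns each name span into a slice of tokens. The span's end index is exclusive, which is why the example subtracts one before using an inclusive range. A minimal plain-Ruby sketch of the same slicing, assuming a hypothetical `Span` struct and hand-written spans in place of the objects returned by the name finder:

```ruby
# Hypothetical stand-in for the span objects returned by a name finder.
# Only the two accessors used in the README example are modelled.
Span = Struct.new(:get_start, :get_end)

tokens = %w[Angela Merkel met Nicolas Sarkozy on January 25th in Berlin .]

# Hand-written spans for illustration; real spans come from finder.find(tokens).
spans = [Span.new(0, 2), Span.new(3, 5)]

entities = spans.map do |span|
  # get_end is exclusive, so a three-dot range selects the same tokens
  # as tokens[start..get_end - 1] in the README example.
  tokens[span.get_start...span.get_end]
end

puts entities.inspect
```

The three-dot range avoids the manual `- 1` and is otherwise equivalent to the slicing used in the README.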
97
+
98
+ **Loading specific classes**
99
+
100
+ You may also want to load other classes from OpenNLP to do more specific tasks. The gem provides an API to do this:
101
+
102
+ ```ruby
103
+ # Default namespace is opennlp.tools.
104
+ OpenNLP.load_class('SomeClassName')
105
+
106
+ # Here, we specify another namespace.
107
+ OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
108
+ ```
109
+
110
+ **Contributing**
111
+
112
+ Feel free to fork the project and send me a pull request!
Binary file
data/lib/open-nlp.rb ADDED
@@ -0,0 +1,162 @@
1
+ module OpenNLP
2
+
3
+ # Library version.
4
+ VERSION = '0.0.1'
5
+
6
+ # Require configuration.
7
+ require 'open-nlp/config'
8
+
9
+ # ############################ #
10
+ # BindIt Configuration Options #
11
+ # ############################ #
12
+
13
+ require 'bind-it'
14
+ extend BindIt::Binding
15
+
16
+ # The path in which to look for JAR files, with
17
+ # a trailing slash (default is gem's bin folder).
18
+ self.jar_path = File.dirname(__FILE__) + '/../bin/'
19
+
20
+ # Load the JVM with a minimum heap size of 512MB,
21
+ # and a maximum heap size of 1024MB.
22
+ self.jvm_args = ['-Xms512M', '-Xmx1024M']
23
+
24
+ # Turn logging off by default.
25
+ self.log_file = nil
26
+
27
+ # Default JARs to load.
28
+ self.default_jars = [
29
+ 'jwnl-1.3.3.jar',
30
+ 'opennlp-tools-1.5.2-incubating.jar',
31
+ 'opennlp-maxent-3.0.2-incubating.jar',
32
+ 'opennlp-uima-1.5.2-incubating.jar'
33
+ ]
34
+
35
+ # Default namespace.
36
+ self.default_namespace = 'opennlp.tools'
37
+
38
+ # Default classes.
39
+ self.default_classes = [
40
+ ['DocumentCategorizerME', 'opennlp.tools.doccat'],
41
+ ['ChunkerME', 'opennlp.tools.chunker'],
42
+ ['DictionaryDetokenizer', 'opennlp.tools.tokenize'],
43
+ ['NameFinderME', 'opennlp.tools.namefind'],
44
+ ['Parse', 'opennlp.tools.parser'],
45
+ ['ParserFactory', 'opennlp.tools.parser'],
46
+ ['POSTaggerME', 'opennlp.tools.postag'],
47
+ ['SentenceDetectorME', 'opennlp.tools.sentdetect'],
48
+ ['SimpleTokenizer', 'opennlp.tools.tokenize'],
49
+ ['TokenizerME', 'opennlp.tools.tokenize']
50
+ ]
51
+
52
+ # Override the BindIt class loader to define a new
53
+ # constructor for classes that require a model.
54
+ def self.load_klass(klass, base, name = nil)
55
+ super(klass,base,name)
56
+ requires_model = OpenNLP::Config::RequiresModel
57
+ return unless requires_model.include?(klass)
58
+ new_class = Class.new(const_get(klass)) do
59
+ def initialize(file = nil, *args)
60
+ klass = OpenNLP.last_name(self.class)
61
+ if !file && !OpenNLP.has_default_model?(klass)
62
+ raise 'This class intentionally has no default ' +
63
+ 'model. Please supply a file name as an argument ' +
64
+ 'to the class constructor.'
65
+ else
66
+ model = OpenNLP.get_model(klass, file)
67
+ super(*([model] + args))
68
+ end
69
+ end
70
+ end
71
+ remove_const(klass)
72
+ const_set(klass, new_class)
73
+ end
74
+
75
+ # Make the bindings.
76
+ self.bind
77
+
78
+ # Load utility classes.
79
+ self.load_class('FileInputStream', 'java.io')
80
+
81
+ # ############################ #
82
+ # OpenNLP bindings proper #
83
+ # ############################ #
84
+
85
+ class <<self
86
+ # A hash containing loaded models.
87
+ attr_accessor :models
88
+ # A hash containing the names of loaded models.
89
+ attr_accessor :model_files
90
+ # The folder in which to look for models.
91
+ attr_accessor :model_path
92
+ # Store the language currently being used.
93
+ attr_accessor :language
94
+ end
95
+
96
+ # The loaded models.
97
+ self.models = {}
98
+
99
+ # The names of loaded models.
100
+ self.model_files = {}
101
+
102
+ # The path to the main folder containing the folders
103
+ # with the individual models inside. By default, this
104
+ # is the same as the JAR path.
105
+ self.model_path = self.jar_path
106
+
107
+ # Default the language to English.
108
+ self.language = :english
109
+
110
+ # Use a given language for default models.
111
+ def self.use(language)
112
+ self.language = language
113
+ end
114
+
115
+ def self.set_model
116
+ # Implement
117
+ end
118
+
119
+ def self.has_default_model?(klass)
120
+ name = OpenNLP::Config::ClassToName[klass]
121
+ if !OpenNLP::Config::DefaultModels[name]
122
+ raise 'No default model files are available ' +
123
+ "for the class #{klass}. Please supply a model " +
124
+ 'as an argument to the constructor.'
125
+ end
126
+ !OpenNLP::Config::DefaultModels[name].empty?
127
+ end
128
+
129
+ def self.get_model(klass, file=nil)
130
+ name = OpenNLP::Config::ClassToName[klass]
131
+ if !self.language && !file
132
+ raise 'No model file was supplied to the ' +
133
+ 'constructor. Please supply a model file ' +
134
+ 'or call OpenNLP.use(:some_language), to ' +
135
+ 'load the default models for a language.'
136
+ end
137
+ OpenNLP.load_model(name, file)
138
+ OpenNLP.models[name]
139
+ end
140
+
141
+ def self.load_model(name, file = nil)
142
+ if self.models[name] && file ==
143
+ self.model_files[name]
144
+ return self.models[name]
145
+ end
146
+ models = Config::DefaultModels[name]
147
+ file ||= models[self.language]
148
+ path = self.model_path + file
149
+ stream = FileInputStream.new(path)
150
+ klass = Config::NameToClass[name]
151
+ load_class(*klass)
152
+ klass = const_get(klass[0])
153
+ model = klass.new(stream)
154
+ self.model_files[name] = file
155
+ self.models[name] = model
156
+ end
157
+
158
+ def self.last_name(klass)
159
+ klass.to_s.split('::')[-1]
160
+ end
161
+
162
+ end
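The `load_klass` override above replaces each model-requiring class with an anonymous subclass whose constructor supplies the model. A minimal plain-Ruby sketch of that pattern, assuming a hypothetical `Demo` module, a stand-in `Tagger` class, and a table that maps class names to file names (the gem itself passes a loaded model object to a bound Java class):

```ruby
module Demo
  # Hypothetical stand-in for a bound class whose constructor needs a model.
  class Tagger
    attr_reader :model

    def initialize(model)
      @model = model
    end
  end

  # Hypothetical defaults; a file name stands in for a loaded model here.
  DEFAULT_MODELS = { 'Tagger' => 'en-pos-maxent.bin' }

  # Replace the constant named +klass+ with a subclass whose constructor
  # falls back to the default model when no file is given.
  def self.wrap(klass)
    wrapped = Class.new(const_get(klass)) do
      def initialize(file = nil, *args)
        name = self.class.name.split('::').last
        file ||= Demo::DEFAULT_MODELS[name]
        raise ArgumentError, "no default model for #{name}" unless file
        super(file, *args)
      end
    end
    remove_const(klass)        # avoids an "already initialized constant" warning
    const_set(klass, wrapped)  # the anonymous class now answers to Demo::Tagger
  end
end

Demo.wrap('Tagger')
puts Demo::Tagger.new.model               # uses the default
puts Demo::Tagger.new('custom.bin').model # uses the supplied file
```

Swapping the constant keeps caller code unchanged: `Demo::Tagger.new` still works, with or without an explicit file.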
data/lib/open-nlp/config.rb ADDED
@@ -0,0 +1,55 @@
1
+ module OpenNLP::Config
2
+
3
+ NameToClass = {
4
+ categorizer: ['DoccatModel', 'opennlp.tools.doccat'],
5
+ chunker: ['ChunkerModel', 'opennlp.tools.chunker'],
6
+ detokenizer: ['DetokenizationDictionary', 'opennlp.tools.tokenize'],
7
+ name_finder: ['TokenNameFinderModel', 'opennlp.tools.namefind'],
8
+ parser: ['ParserModel', 'opennlp.tools.parser'],
9
+ pos_tagger: ['POSModel', 'opennlp.tools.postag'],
10
+ sentence_detector: ['SentenceModel', 'opennlp.tools.sentdetect'],
11
+ tokenizer: ['TokenizerModel', 'opennlp.tools.tokenize']
12
+ }
13
+
14
+ ClassToName = {
15
+ 'ChunkerME' => :chunker,
16
+ 'DictionaryDetokenizer' => :detokenizer,
17
+ 'DocumentCategorizerME' => :categorizer,
18
+ 'NameFinderME' => :name_finder,
19
+ 'POSTaggerME' => :pos_tagger,
20
+ 'ParserME' => :parser,
21
+ 'SentenceDetectorME' => :sentence_detector,
22
+ 'TokenizerME' => :tokenizer,
23
+ }
24
+
25
+ DefaultModels = {
26
+ chunker: {
27
+ english: 'en-chunker.bin'
28
+ },
29
+ detokenizer: {
30
+ english: 'en-detokenizer.xml'
31
+ },
32
+ # Intentionally left empty.
33
+ name_finder: {},
34
+ parser: {
35
+ english: 'en-parser-chunking.bin'
36
+ },
37
+ pos_tagger: {
38
+ english: 'en-pos-maxent.bin'
39
+ },
40
+ sentence_detector: {
41
+ english: 'en-sent.bin'
42
+ },
43
+ tokenizer: {
44
+ english: 'en-token.bin'
45
+ }
46
+ }
47
+
48
+ # Classes that require a model as first argument to constructor.
49
+ RequiresModel = [
50
+ 'SentenceDetectorME', 'NameFinderME', 'DictionaryDetokenizer',
51
+ 'TokenizerME', 'ChunkerME', 'POSTaggerME'
52
+ ]
53
+
54
+
55
+ end
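The tables above drive default model selection: a class name maps to a model name, which maps to one file per language, and `name_finder` is deliberately left without defaults. A minimal sketch of that lookup in plain Ruby, using trimmed copies of the tables and a hypothetical `resolve_model_file` helper that is not part of the gem:

```ruby
# Trimmed, illustrative copies of the tables in OpenNLP::Config.
CLASS_TO_NAME = {
  'TokenizerME'  => :tokenizer,
  'NameFinderME' => :name_finder
}

DEFAULT_MODELS = {
  tokenizer:   { english: 'en-token.bin' },
  name_finder: {} # intentionally empty: callers must name a file
}

# Hypothetical helper: returns the model file to load, preferring an
# explicit file name over the language default.
def resolve_model_file(klass, language, file = nil)
  return file if file

  name = CLASS_TO_NAME.fetch(klass)
  DEFAULT_MODELS.fetch(name).fetch(language) do
    raise ArgumentError, "no default #{language} model for #{klass}; pass a file name"
  end
end

puts resolve_model_file('TokenizerME', :english)
puts resolve_model_file('NameFinderME', :english, 'en-ner-person.bin')
```

Calling the helper for `NameFinderME` without a file raises, which mirrors the README usage where each name finder is built with an explicit `en-ner-*.bin` file.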
data/spec/english_spec.rb ADDED
@@ -0,0 +1,103 @@
1
+ # encoding: utf-8
2
+ require_relative 'spec_helper'
3
+
4
+ describe OpenNLP do
5
+
6
+ # Failing spec #1
7
+ context "the maximum entropy chunker is run after tokenization and POS tagging" do
8
+ it "should find the accurate chunks" do
9
+
10
+ chunker = OpenNLP::ChunkerME.new
11
+ tokenizer = OpenNLP::TokenizerME.new
12
+ tagger = OpenNLP::POSTaggerME.new
13
+
14
+ sent = "The death of the poet was kept from his poems."
15
+ tokens = tokenizer.tokenize(sent)
16
+ tags = tagger.tag(tokens)
17
+
18
+ chunks = chunker.chunk(tokens.to_java(:String), tags.to_java(:String))
19
+ # cannot convert instance of class org.jruby.java.proxies.ArrayJavaProxy to class java.lang.String
20
+
21
+ tokens.to_a.should eql %w[The death of the poet was kept from his poems .]
22
+ tags.should eql ['put tags here']
23
+
24
+ end
25
+ end
26
+
27
+ # Failing spec #2
28
+ context "the maximum entropy parser is run after tokenization" do
29
+ it "parses the text accurately" do
30
+ sent = "The death of the poet was kept from his poems."
31
+ tokenizer = OpenNLP::TokenizerME.new
32
+ p_model = OpenNLP.load_model(:parser)
33
+ parser = OpenNLP::ParserFactory.create(p_model)
34
+ tokens = tokenizer.tokenize(sent)
35
+ result = parser.parse(tokens.to_java(:String))
36
+ # cannot convert instance of class org.jruby.java.proxies.ArrayJavaProxy to class java.lang.String
37
+ # org/jruby/java/addons/KernelJavaAddons.java:70:in `to_java'
38
+ # /ruby/gems/open-nlp/spec/english_spec.rb in `(root)'
39
+ puts result.to_a.inspect
40
+ end
41
+ end
42
+
43
+ context "the SimpleTokenizer is run" do
44
+ it "tokenizes the text accurately" do
45
+ sent = "The death of the poet was kept from his poems."
46
+ tokenizer = OpenNLP::SimpleTokenizer.new
47
+ tokens = tokenizer.tokenize(sent).to_a
48
+ tokens.should eql %w[The death of the poet was kept from his poems .]
49
+ end
50
+ end
51
+
52
+ context "the maximum entropy sentence detector, tokenizer, POS tagger " +
53
+ "and NER finders are run with the default models for English" do
54
+
55
+ it "should accurately detect tokens, sentences and named entities" do
56
+
57
+ text = File.read('./spec/sample.txt').gsub("\n", "")
58
+
59
+ tokenizer = OpenNLP::TokenizerME.new
60
+ segmenter = OpenNLP::SentenceDetectorME.new
61
+ tagger = OpenNLP::POSTaggerME.new
62
+ ner_models = ['person', 'time', 'money']
63
+
64
+ ner_finders = ner_models.map do |model|
65
+ OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
66
+ end
67
+
68
+ sentences = segmenter.sent_detect(text)
69
+ all_entities, all_tags, all_sentences, all_tokens = [], [], [], []
70
+
71
+ sentences.each do |sentence|
72
+
73
+ tokens = tokenizer.tokenize(sentence)
74
+ tags = tagger.tag(tokens)
75
+
76
+ ner_models.each_with_index do |model,i|
77
+ finder = ner_finders[i]
78
+ name_spans = finder.find(tokens)
79
+ name_spans.each do |name_span|
80
+ start = name_span.get_start
81
+ stop = name_span.get_end-1
82
+ slice = tokens[start..stop].to_a
83
+ all_entities << [slice, model]
84
+ end
85
+ end
86
+
87
+ all_tokens << tokens.to_a
88
+ all_sentences << sentence
89
+ all_tags << tags.to_a
90
+ end
91
+
92
+ all_tokens.should eql [["To", "describe", "2009", "as", "a", "stellar", "year", "for", "Petrofac", "(", "LON:PFC)", "would", "be", "a", "huge", "understatement", "."], ["The", "group", "finished", "the", "year", "with", "an", "order", "backlog", "twice", "the", "size", "than", "it", "had", "at", "the", "outset", "."], ["The", "group", "has", "since", "been", "awarded", "a", "US", "600", "million", "contract", "and", "spun", "off", "its", "North", "Sea", "assets", "."], ["The", "group", "’s", "recently", "released", "full", "year", "results", "show", "a", "jump", "in", "revenues", ",", "pre-tax", "profits", "and", "order", "backlog", "."], ["Whilst", "group", "revenue", "rose", "by", "10", "%", "from", "$", "3.3", "billion", "to", "$", "3.7", "billion", ",", "pre-tax", "profits", "rose", "by", "25", "%", "from", "$", "358", "million", "to", "$", "448", "million", ".All", "the", "more", "impressive", ",", "the", "group", "’s", "order", "backlog", "doubled", "to", "over", "$", "8", "billion", "paying", "no", "attention", "to", "the", "15", "%", "cut", "in", "capital", "expenditure", "witnessed", "across", "the", "oil", "and", "gas", "industry", "as", "whole", "in", "2009", ".Focussing", "in", "on", "which", "the", "underlying", "performances", "of", "the", "individual", "segments", ",", "the", "group", "cash", "cow", ",", "its", "Engineering", "and", "Construction", "division", ",", "saw", "operating", "profit", "rise", "33", "%", "over", "the", "year", "to", "$", "322", "million", ",", "thanks", "to", "US$", "6.3", "billion", "worth", "of", "new", "contract", "wins", "during", "the", "year", "which", "included", "a", "$", "100", "million", "contract", "with", "Turkmengaz", ",", "the", "Turkmenistan", "national", "energy", "company", "."], ["The", "division", "has", "picked", "up", "in", "2010", "where", "it", "left", "off", "in", "2009", "and", "has", "been", "awarded", "a", "contract", "worth", "more", "than", "US600", "million", "for", "a", "gas", "sweetening", 
"facilities", "project", "by", "Qatar", "Petroleum.Elsewhere", "the", "group", "’s", "Offshore", "Engineering", "&", "Operations", "division", "may", "have", "seen", "a", "pullback", "in", "revenue", "and", "earnings", "vis-a-vis", "2008", ",", "but", "it", "did", "secure", "a", "£75", "million", "contract", "with", "Apache", "to", "provideengineering", "and", "construction", "services", "for", "the", "Forties", "field", "in", "the", "UK", "North", "Sea", "."], ["And", "to", "underscore", "the", "fact", "that", "there", "is", "life", "beyond", "NOC’s", "for", "Petrofac", "(", "LON:PFC)", "the", "division", "was", "awarded", "a", "£100", "million", "5-year", "contract", "by", "BP", "(", "LON:BP.", ")", "to", "deliver", "integrated", "maintenance", "management", "support", "services", "for", "all", "of", "BP", "'s", "UK", "offshore", "assets", "and", "onshore", "Dimlington", "plant", "."], ["The", "laggard", "of", "the", "group", "was", "the", "Engineering", ",", "Training", "Services", "and", "Production", "Solutions", "division", "."], ["The", "business", "suffered", "as", "the", "oil", "price", "tailed", "off", "and", "the", "economic", "outlook", "deteriorated", "forcing", "a", "number", "ofmajor", "customers", "to", "postpone", "early", "stage", "engineering", "studies", "or", "re-phased", "work", "upon", "which", "the", "division", "depends", "."], ["Although", "the", "fall", "in", "activity", "was", "notable", ",", "the", "division’s", "operational", "performance", "in", "service", "operator", "role", "for", "production", "of", "Dubai", "'s", "offshore", "oil", "&", "gas", "proved", "a", "highlight.Energy", "Developments", "meanwhile", "saw", "the", "start", "of", "oil", "production", "from", "the", "West", "Don", "field", "during", "the", "first", "half", "of", "the", "year", "less", "than", "a", "year", "from", "Field", "Development", "Programme", "approval", "."], ["In", "addition", "output", "from", "Don", "Southwest", "field", "began", "in", "June", "."], 
["Despite", "considerably", "lower", "oil", "prices", "in", "2009", "compared", "to", "the", "prior", "year", ",", "Energy", "Developments", "'", "revenue", "reached", "almost", "US$", "250", "million", "(", "significantly", "higher", "than", "the", "US$", "153", "million", "of", "2008", ")", "due", "not", "only", "to", "the", "‘Don", "fields", "effect", "’", "but", "also", "a", "full", "year", "'s", "contribution", "from", "the", "Chergui", "gas", "plant", ",", "which", "began", "exports", "in", "August", "2008.In", "order", "to", "maximize", "the", "earnings", "potential", "of", "the", "division’s", "North", "Sea", "assets", ",", "including", "the", "Don", "assets", ",", "the", "group", "has", "demerged", "them", "providing", "its", "shareholders", "with", "shares", "in", "a", "newly", "listed", "independent", "exploration", "and", "production", "company", "called", "EnQuest", "(", "LON:ENQ", ")", "."], ["EnQuest", "is", "a", "product", "of", "the", "Petrofac’s", "North", "Sea", "Assets", "with", "those", "off", "of", "Swedish", "explorer", "Lundin", "with", "both", "companies", "divesting", "for", "different", "reasons", "."], ["Upon", "listing", "(", "April", "6th", ")", ",", "Petrofac", "(", "LON:PFC)", "shareholders", "owned", "around", "45", "%", "of", "the", "new", "EnQuest", "entity", "with", "Lundin", "shareholders", "owning", "approximately", "55", "%", "."], ["It", "is", "important", "to", "note", "that", "post", "demerger", "the", "Energy", "Developments", "business", "unit", "is", "still", "a", "key", "constituent", "of", "Petrofac", "'s", "business", "portfolio", ",", "and", "will", "continue", "to", "hold", "significant", "assets", "Tunisia", ",", "Malaysia", ",", "Algeria", "and", "Kyrgyz", "Republic", "-", "sandwiched", "between", "Kazakhstan", "and", "China", "."]]
93
+
94
+ all_sentences.should eql ["To describe 2009 as a stellar year for Petrofac (LON:PFC) would be a huge understatement.", "The group finished the year with an order backlog twice the size than it had at the outset.", "The group has since been awarded a US 600 million contract and spun off its North Sea assets.", "The group’s recently released full year results show a jump in revenues, pre-tax profits and order backlog.", "Whilst group revenue rose by 10% from $3.3 billion to $3.7 billion, pre-tax profits rose by 25% from $358 million to $448 million.All the more impressive, the group’s order backlog doubled to over $8 billion paying no attention to the 15% cut in capital expenditure witnessed across the oil and gas industry as whole in 2009.Focussing in on which the underlying performances of the individual segments, the group cash cow, its Engineering and Construction division, saw operating profit rise 33% over the year to $322 million, thanks to US$6.3 billion worth of new contract wins during the year which included a $100 million contract with Turkmengaz, the Turkmenistan national energy company.", "The division has picked up in 2010 where it left off in 2009 and has been awarded a contract worth more than US600 million for a gas sweetening facilities project by Qatar Petroleum.Elsewhere the group’s Offshore Engineering & Operations division may have seen a pullback in revenue and earnings vis-a-vis 2008, but it did secure a £75 million contract with Apache to provideengineering and construction services for the Forties field in the UK North Sea.", "And to underscore the fact that there is life beyond NOC’s for Petrofac (LON:PFC) the division was awarded a £100 million 5-year contract by BP (LON:BP.) 
to deliver integrated maintenance management support services for all of BP's UK offshore assets and onshore Dimlington plant.", "The laggard of the group was the Engineering, Training Services and Production Solutions division.", "The business suffered as the oil price tailed off and the economic outlook deteriorated forcing a number ofmajor customers to postpone early stage engineering studies or re-phased work upon which the division depends.", "Although the fall in activity was notable, the division’s operational performance in service operator role for production of Dubai's offshore oil & gas proved a highlight.Energy Developments meanwhile saw the start of oil production from the West Don field during the first half of the year less than a year from Field Development Programme approval.", "In addition output from Don Southwest field began in June.", "Despite considerably lower oil prices in 2009 compared to the prior year, Energy Developments' revenue reached almost US$250 million (significantly higher than the US$153 million of 2008) due not only to the ‘Don fields effect’ but also a full year's contribution from the Chergui gas plant, which began exports in August 2008.In order to maximize the earnings potential of the division’s North Sea assets, including the Don assets, the group has demerged them providing its shareholders with shares in a newly listed independent exploration and production company called EnQuest (LON:ENQ).", "EnQuest is a product of the Petrofac’s North Sea Assets with those off of Swedish explorer Lundin with both companies divesting for different reasons.", "Upon listing (April 6th), Petrofac (LON:PFC) shareholders owned around 45% of the new EnQuest entity with Lundin shareholders owning approximately 55%.", "It is important to note that post demerger the Energy Developments business unit is still a key constituent of Petrofac's business portfolio, and will continue to hold significant assets Tunisia, Malaysia, Algeria and Kyrgyz 
Republic - sandwiched between Kazakhstan and China."]
95
+
96
+ all_entities.should eql [[["$", "3.3", "billion"], "money"], [["$", "3.7", "billion"], "money"], [["$", "358", "million", "to", "$", "448", "million"], "money"], [["$", "8", "billion"], "money"], [["$", "322", "million"], "money"], [["$", "100", "million"], "money"], [["Lundin"], "person"], [["Lundin"], "person"]]
97
+
98
+ all_tags.should eql [["TO", "VB", "CD", "IN", "DT", "NN", "NN", "IN", "NNP", "-LRB-", "NNP", "MD", "VB", "DT", "JJ", "NN", "."], ["DT", "NN", "VBD", "DT", "NN", "IN", "DT", "NN", "NN", "RB", "DT", "NN", "IN", "PRP", "VBD", "IN", "DT", "NN", "."], ["DT", "NN", "VBZ", "RB", "VBN", "VBN", "DT", "PRP", "CD", "CD", "NN", "CC", "VBD", "RP", "PRP$", "NNP", "NNP", "NNS", "."], ["DT", "NN", "VBD", "RB", "VBN", "JJ", "NN", "NNS", "VBP", "DT", "NN", "IN", "NNS", ",", "JJ", "NNS", "CC", "NN", "NN", "."], ["NNP", "NN", "NN", "VBD", "IN", "CD", "NN", "IN", "$", "CD", "CD", "TO", "$", "CD", "CD", ",", "JJ", "NNS", "VBD", "IN", "CD", "NN", "IN", "$", "CD", "CD", "TO", "$", "CD", "CD", "PDT", "DT", "RBR", "JJ", ",", "DT", "NN", "VBZ", "NN", "NN", "VBD", "TO", "RP", "$", "CD", "CD", "VBG", "DT", "NN", "TO", "DT", "CD", "NN", "NN", "IN", "NN", "NN", "VBN", "IN", "DT", "NN", "CC", "NN", "NN", "IN", "JJ", "IN", "CD", "NN", "IN", "IN", "WDT", "DT", "JJ", "NNS", "IN", "DT", "JJ", "NNS", ",", "DT", "NN", "NN", "NN", ",", "PRP$", "NNP", "CC", "NNP", "NN", ",", "VBD", "NN", "NN", "VB", "CD", "NN", "IN", "DT", "NN", "TO", "$", "CD", "CD", ",", "NNS", "TO", "$", "CD", "CD", "NN", "IN", "JJ", "NN", "VBZ", "IN", "DT", "NN", "WDT", "VBD", "DT", "$", "CD", "CD", "NN", "IN", "NNP", ",", "DT", "NNP", "JJ", "NN", "NN", "."], ["DT", "NN", "VBZ", "VBN", "RP", "IN", "CD", "WRB", "PRP", "VBD", "RP", "IN", "CD", "CC", "VBZ", "VBN", "VBN", "DT", "NN", "NN", "JJR", "IN", "CD", "CD", "IN", "DT", "NN", "VBG", "NNS", "NN", "IN", "NNP", "NNP", "DT", "NN", "JJ", "NNP", "NNP", "CC", "NNP", "NN", "MD", "VB", "VBN", "DT", "NN", "IN", "NN", "CC", "NNS", "NN", "CD", ",", "CC", "PRP", "VBD", "VB", "DT", "CD", "CD", "NN", "IN", "NNP", "TO", "VB", "CC", "NN", "NNS", "IN", "DT", "NNP", "NN", "IN", "DT", "NNP", "NNP", "NNP", "."], ["CC", "TO", "VB", "DT", "NN", "IN", "EX", "VBZ", "NN", "IN", "NNP", "IN", "NNP", "-LRB-", "NNP", "DT", "NN", "VBD", "VBN", "DT", "CD", "CD", "JJ", "NN", "IN", "NNP", "-LRB-", "NNP", "-RRB-", 
"TO", "VB", "JJ", "NN", "NN", "NN", "NNS", "IN", "DT", "IN", "NNP", "POS", "NN", "JJ", "NNS", "CC", "RB", "NNP", "NN", "."], ["DT", "NN", "IN", "DT", "NN", "VBD", "DT", "NNP", ",", "NNP", "NNP", "CC", "NNP", "NNP", "NN", "."], ["DT", "NN", "VBD", "IN", "DT", "NN", "NN", "VBN", "RB", "CC", "DT", "JJ", "NN", "VBD", "VBG", "DT", "NN", "IN", "NNS", "TO", "VB", "JJ", "NN", "NN", "NNS", "CC", "JJ", "NN", "IN", "WDT", "DT", "NN", "VBZ", "."], ["IN", "DT", "NN", "IN", "NN", "VBD", "JJ", ",", "DT", "JJ", "JJ", "NN", "IN", "NN", "NN", "NN", "IN", "NN", "IN", "NNP", "POS", "JJ", "NN", "CC", "NN", "VBD", "DT", "RB", "NNPS", "RB", "VBD", "DT", "NN", "IN", "NN", "NN", "IN", "DT", "NNP", "NNP", "NN", "IN", "DT", "JJ", "NN", "IN", "DT", "NN", "RBR", "IN", "DT", "NN", "IN", "NNP", "NNP", "NNP", "NN", "."], ["IN", "NN", "NN", "IN", "NNP", "NNP", "NN", "VBD", "IN", "NNP", "."], ["IN", "RB", "JJR", "NN", "NNS", "IN", "CD", "VBN", "TO", "DT", "JJ", "NN", ",", "NNP", "NNPS", "POS", "NN", "VBD", "RB", "$", "CD", "CD", "-LRB-", "RB", "JJR", "IN", "DT", "$", "CD", "CD", "IN", "CD", "-RRB-", "RB", "RB", "RB", "TO", "DT", "JJ", "NNS", "NN", ",", "CC", "RB", "DT", "JJ", "NN", "POS", "NN", "IN", "DT", "NNP", "NN", "NN", ",", "WDT", "VBD", "NNS", "IN", "NNP", "IN", "NN", "TO", "VB", "DT", "NNS", "NN", "IN", "DT", "JJ", "NNP", "NNP", "NNS", ",", "VBG", "DT", "NNP", "NNS", ",", "DT", "NN", "VBZ", "VBN", "PRP", "VBG", "PRP$", "NNS", "IN", "NNS", "IN", "DT", "RB", "VBN", "JJ", "NN", "CC", "NN", "NN", "VBD", "NNP", "-LRB-", "NN", "-RRB-", "."], ["NNP", "VBZ", "DT", "NN", "IN", "DT", "NNP", "NNP", "NNP", "NNS", "IN", "DT", "IN", "IN", "JJ", "NN", "NN", "IN", "DT", "NNS", "VBG", "IN", "JJ", "NNS", "."], ["IN", "VBG", "-LRB-", "NNP", "NN", "-RRB-", ",", "NNP", "-LRB-", "NNP", "NNS", "VBD", "IN", "CD", "NN", "IN", "DT", "JJ", "NNP", "NN", "IN", "NNP", "NNS", "VBG", "RB", "CD", "NN", "."], ["PRP", "VBZ", "JJ", "TO", "VB", "IN", "NN", "NN", "DT", "NNP", "NNPS", "NN", "NN", "VBZ", "RB", "DT", "JJ", "NN", 
"IN", "NNP", "POS", "NN", "NN", ",", "CC", "MD", "VB", "TO", "VB", "JJ", "NNS", "NNP", ",", "NNP", ",", "NNP", "CC", "NNP", "NNP", ":", "VBD", "IN", "NNP", "CC", "NNP", "."]]
99
+
100
+ end
101
+ end
102
+
103
+ end
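The last spec flattens `spec/sample.txt` by deleting every newline before sentence detection, which is why expected sentences such as "$448 million.All the more impressive" run together. A minimal sketch of that step on a short made-up string, using the non-destructive `gsub` (the destructive `gsub!` returns `nil` when nothing is replaced):

```ruby
# Short illustrative text; the spec reads spec/sample.txt instead.
raw = "profits rose to $448 million.\n\nAll the more impressive.\n"

# Deleting newlines fuses the sentences on either side of a line break.
flat = raw.gsub("\n", "")
puts flat

# gsub always returns a string; gsub! returns nil when there is no match.
no_newlines = "single line"
p no_newlines.gsub("\n", "")
p no_newlines.dup.gsub!("\n", "")
```

Replacing newlines with a space instead would keep sentence boundaries apart, at the cost of changing the expected values recorded in the spec.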
data/spec/sample.txt ADDED
@@ -0,0 +1,20 @@
1
+ To describe 2009 as a stellar year for Petrofac (LON:PFC) would be a huge understatement. The group finished the year with an order backlog twice the size than it had at the outset. The group has since been awarded a US 600 million contract and spun off its North Sea assets.
2
+ The group’s recently released full year results show a jump in revenues, pre-tax profits and order backlog. Whilst group revenue rose by 10% from $3.3 billion to $3.7 billion, pre-tax profits rose by 25% from $358 million to $448 million.
3
+
4
+ All the more impressive, the group’s order backlog doubled to over $8 billion paying no attention to the 15% cut in capital expenditure witnessed across the oil and gas industry as whole in 2009.
5
+
6
+ Focussing in on which the underlying performances of the individual segments, the group cash cow, its Engineering and Construction division, saw operating profit rise 33% over the year to $322 million, thanks to US$6.3 billion worth of new contract wins during the year which included a $100 million contract with Turkmengaz, the Turkmenistan national energy company. The division has picked up in 2010 where it left off in 2009 and has been awarded a contract worth more than US600 million for a gas sweetening facilities project by Qatar Petroleum.
7
+
8
+ Elsewhere the group’s Offshore Engineering & Operations division may have seen a pullback in revenue and earnings vis-a-vis 2008, but it did secure a £75 million contract with Apache to provide
9
+ engineering and construction services for the Forties field in the UK North Sea. And to underscore the fact that there is life beyond NOC’s for Petrofac (LON:PFC) the division was awarded a £100 million 5-year contract by BP (LON:BP.) to deliver integrated maintenance management support services for all of BP's UK offshore assets and onshore Dimlington plant.
10
+
11
+ The laggard of the group was the Engineering, Training Services and Production Solutions division. The business suffered as the oil price tailed off and the economic outlook deteriorated forcing a number of
12
+ major customers to postpone early stage engineering studies or re-phased work upon which the division depends. Although the fall in activity was notable, the division’s operational performance in service operator role for production of Dubai's offshore oil & gas proved a highlight.
13
+
14
+ Energy Developments meanwhile saw the start of oil production from the West Don field during the first half of the year less than a year from Field Development Programme approval. In addition output from Don Southwest field began in June. Despite considerably lower oil prices in 2009 compared to the prior year, Energy Developments' revenue reached almost US$250 million (significantly higher than the US$153 million of 2008) due not only to the ‘Don fields effect’ but also a full year's contribution from the Chergui gas plant, which began exports in August 2008.
15
+
16
+ In order to maximize the earnings potential of the division’s North Sea assets, including the Don assets, the group has demerged them providing its shareholders with shares in a newly listed independent exploration and production company called EnQuest (LON:ENQ).
17
+
18
+ EnQuest is a product of the Petrofac’s North Sea Assets with those off of Swedish explorer Lundin with both companies divesting for different reasons. Upon listing (April 6th), Petrofac (LON:PFC) shareholders owned around 45% of the new EnQuest entity with Lundin shareholders owning approximately 55%.
19
+
20
+ It is important to note that post demerger the Energy Developments business unit is still a key constituent of Petrofac's business portfolio, and will continue to hold significant assets Tunisia, Malaysia, Algeria and Kyrgyz Republic - sandwiched between Kazakhstan and China.
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,2 @@
1
+ require 'rspec'
2
+ require_relative '../lib/open-nlp'
metadata ADDED
@@ -0,0 +1,74 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: open-nlp
3
+ version: !ruby/object:Gem::Version
4
+ prerelease:
5
+ version: 0.0.1
6
+ platform: ruby
7
+ authors:
8
+ - Louis Mullie
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-12-19 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: bind-it
16
+ version_requirements: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - ! '>='
19
+ - !ruby/object:Gem::Version
20
+ version: '0'
21
+ none: false
22
+ requirement: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ! '>='
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ none: false
28
+ prerelease: false
29
+ type: :runtime
30
+ description: ! ' Ruby bindings to the OpenNLP tools, a Java machine learning toolkit
31
+ for natural language processing (NLP). '
32
+ email:
33
+ - louis.mullie@gmail.com
34
+ executables: []
35
+ extensions: []
36
+ extra_rdoc_files: []
37
+ files:
38
+ - bin/jwnl-1.3.3.jar
39
+ - bin/opennlp-maxent-3.0.2-incubating.jar
40
+ - bin/opennlp-tools-1.5.2-incubating.jar
41
+ - bin/opennlp-uima-1.5.2-incubating.jar
42
+ - lib/open-nlp.rb
43
+ - lib/open-nlp/config.rb
44
+ - spec/english_spec.rb
45
+ - spec/sample.txt
46
+ - spec/spec_helper.rb
47
+ - README.md
48
+ - LICENSE
49
+ homepage: https://github.com/louismullie/open-nlp
50
+ licenses: []
51
+ post_install_message:
52
+ rdoc_options: []
53
+ require_paths:
54
+ - lib
55
+ required_ruby_version: !ruby/object:Gem::Requirement
56
+ requirements:
57
+ - - ! '>='
58
+ - !ruby/object:Gem::Version
59
+ version: '0'
60
+ none: false
61
+ required_rubygems_version: !ruby/object:Gem::Requirement
62
+ requirements:
63
+ - - ! '>='
64
+ - !ruby/object:Gem::Version
65
+ version: '0'
66
+ none: false
67
+ requirements: []
68
+ rubyforge_project:
69
+ rubygems_version: 1.8.24
70
+ signing_key:
71
+ specification_version: 3
72
+ summary: Ruby bindings to the OpenNLP Java toolkit.
73
+ test_files: []
74
+ ...