stanford-core-nlp-abstractor 0.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: aaef5a50a17996fda21974a9c41edcacba2b27f1
4
+ data.tar.gz: a16c9702bf92e93d69e66be79d0743d1df98f020
5
+ SHA512:
6
+ metadata.gz: 9e0061a31396b5564500e8264a4b68f5d57f8fb62e943838feb843018178bc6ec6047a9aca218a5727d34c6a52efa7f843ee897ec433e005ec59520a7be41b20
7
+ data.tar.gz: 866e1c5ab898820a181e25f495136ec4a57e8bcc6b7a0d8d870aee452c049de54b8f5a3b5338e072cc09f3e454ccedbd7e24c9a73c3f3b16c116ccd6864bca25
data/LICENSE ADDED
@@ -0,0 +1,18 @@
1
+ Ruby bindings for the Stanford CoreNLP package
2
+
3
+ This program is free software: you can redistribute it and/or modify
4
+ it under the terms of the GNU General Public License as published by
5
+ the Free Software Foundation, either version 3 of the License, or
6
+ (at your option) any later version.
7
+
8
+ This program is distributed in the hope that it will be useful,
9
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
10
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11
+ GNU General Public License for more details.
12
+
13
+ This license also applies to the included Stanford CoreNLP files.
14
+
15
+ You should have received a copy of the GNU General Public License
16
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
17
+
18
+ Author: Louis-Antoine Mullie (louis.mullie@gmail.com). Copyright 2012.
@@ -0,0 +1,208 @@
1
+ [![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)
2
+
3
+ **About**
4
+
5
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.
6
+
7
+ This gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.
8
+
9
+ **Installing**
10
+
11
+ First, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Two packages are available:
12
+
13
+ * A [minimal package](http://louismullie.com/treat/stanford-core-nlp-minimal.zip) with the default tagger and parser models for English, French and German.
14
+ * A [full package](http://louismullie.com/treat/stanford-core-nlp-full.zip), with all of the tagger and parser models for English, French and German, as well as named entity and coreference resolution models for English.
15
+
16
+ Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/).
17
+
18
+ **Configuration**
19
+
20
+ You may want to set some optional configuration options. Here are some examples:
21
+
22
+ ```ruby
23
+ # Set an alternative path to look for the JAR files
24
+ # Default is gem's bin folder.
25
+ StanfordCoreNLP.jar_path = '/path_to_jars/'
26
+
27
+ # Set an alternative path to look for the model files
28
+ # Default is gem's bin folder.
29
+ StanfordCoreNLP.model_path = '/path_to_models/'
30
+
31
+ # Pass some alternative arguments to the Java VM.
32
+ # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
33
+ # to take a coffee break).
34
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']
35
+
36
+ # Redirect VM output to log.txt
37
+ StanfordCoreNLP.log_file = 'log.txt'
38
+
39
+ # Change a specific model file.
40
+ StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
41
+ ```
42
+
43
+ **Using the gem**
44
+
45
+ ```ruby
46
+ # Use the model files for a different language than English.
47
+ StanfordCoreNLP.use :french # or :german
48
+
49
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
50
+ 'Berlin to discuss a new austerity package. Sarkozy ' +
51
+ 'looked pleased, but Merkel was dismayed.'
52
+
53
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
54
+ text = StanfordCoreNLP::Annotation.new(text)
55
+ pipeline.annotate(text)
56
+
57
+ text.get(:sentences).each do |sentence|
58
+ # Syntatical dependencies
59
+ puts sentence.get(:basic_dependencies).to_s
60
+ sentence.get(:tokens).each do |token|
61
+ # Default annotations for all tokens
62
+ puts token.get(:value).to_s
63
+ puts token.get(:original_text).to_s
64
+ puts token.get(:character_offset_begin).to_s
65
+ puts token.get(:character_offset_end).to_s
66
+ # POS returned by the tagger
67
+ puts token.get(:part_of_speech).to_s
68
+ # Lemma (base form of the token)
69
+ puts token.get(:lemma).to_s
70
+ # Named entity tag
71
+ puts token.get(:named_entity_tag).to_s
72
+ # Coreference
73
+ puts token.get(:coref_cluster_id).to_s
74
+ # Also of interest: coref, coref_chain,
75
+ # coref_cluster, coref_dest, coref_graph.
76
+ end
77
+ end
78
+ ```
79
+
80
+ > Important: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Annotation class.
81
+
82
+ The Ruby symbol (e.g. `:named_entity_tag`) corresponding to a Java annotation class is the `snake_case` of the class name, with 'Annotation' at the end removed. For example, `NamedEntityTagAnnotation` translates to `:named_entity_tag`, `PartOfSpeechAnnotation` to `:part_of_speech`, etc.
83
+
84
+ A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the `config.rb` file inside the gem.
85
+
86
+
87
+ **Loading specific classes**
88
+
89
+ You may want to load additional Java classes (including any class from the Stanford NLP packages). The gem provides an API for this:
90
+
91
+ ```ruby
92
+ # Default base class is edu.stanford.nlp.pipeline.
93
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
94
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
95
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
96
+
97
+ # Here, we specify another base class.
98
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
99
+ puts StanfordCoreNLP::MaxentTagger.inspect
100
+ # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
101
+ ```
102
+
103
+ **List of annotator classes**
104
+
105
+ Here is a full list of annotator classes provided by the Stanford Core NLP package. You can load these classes individually using `StanfordCoreNLP.load_class` (see above). Once this is done, you can use them like you would from a Java program. Refer to the Java documentation for a list of functions provided by each of these classes.
106
+
107
+ * PTBTokenizerAnnotator - tokenizes the text following Penn Treebank conventions.
108
+ * WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
109
+ * POSTaggerAnnotator - annotates the text with part-of-speech tags.
110
+ * MorphaAnnotator - morphological normalizer (generates lemmas).
111
+ * NERAnnotator - annotates the text with named-entity labels.
112
+ * NERCombinerAnnotator - combines several NER models.
113
+ * TrueCaseAnnotator - detects the true case of words in free text.
114
+ * ParserAnnotator - generates constituent and dependency trees.
115
+ * NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
116
+ * TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
117
+ * QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
118
+ * SRLAnnotator - annotates predicates and their semantic roles.
119
+ * DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model.
120
+ * NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
121
+
122
+ **List of model files**
123
+
124
+ Here is a full list of the default models for the Stanford Core NLP pipeline. You can change these models individually using `StanfordCoreNLP.set_model` (see above).
125
+
126
+ * 'pos.model' - 'english-left3words-distsim.tagger'
127
+ * 'ner.model' - 'all.3class.distsim.crf.ser.gz'
128
+ * 'parse.model' - 'englishPCFG.ser.gz'
129
+ * 'dcoref.demonym' - 'demonyms.txt'
130
+ * 'dcoref.animate' - 'animate.unigrams.txt'
131
+ * 'dcoref.female' - 'female.unigrams.txt'
132
+ * 'dcoref.inanimate' - 'inanimate.unigrams.txt'
133
+ * 'dcoref.male' - 'male.unigrams.txt'
134
+ * 'dcoref.neutral' - 'neutral.unigrams.txt'
135
+ * 'dcoref.plural' - 'plural.unigrams.txt'
136
+ * 'dcoref.singular' - 'singular.unigrams.txt'
137
+ * 'dcoref.states' - 'state-abbreviations.txt'
138
+ * 'dcoref.extra.gender' - 'namegender.combine.txt'
139
+
140
+ **Testing**
141
+
142
+ To run the specs for each language (after copying the JARs into the `bin` folder):
143
+
144
+ rake spec[english]
145
+ rake spec[german]
146
+ rake spec[french]
147
+
148
+ **Using the latest version of the Stanford CoreNLP**
149
+
150
+ Using the latest version of the Stanford CoreNLP (version 3.5.0 as of 31/10/2014) requires some additional manual steps:
151
+
152
+ * Download [Stanford CoreNLP version 3.5.0](http://nlp.stanford.edu/software/stanford-corenlp-full-2014-10-31.zip) from http://nlp.stanford.edu/.
153
+ * Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory location configured by setting StanfordCoreNLP.jar_path.
154
+ * Download [the full Stanford Tagger version 3.5.0](http://nlp.stanford.edu/software/stanford-postagger-full-2014-10-26.zip) from http://nlp.stanford.edu/.
155
+ * Make a directory named 'taggers' inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory configured by setting StanfordCoreNLP.jar_path.
156
+ * Place the contents of the extracted archive inside taggers directory.
157
+ * Download [the bridge.jar file](https://github.com/louismullie/stanford-core-nlp/blob/master/bin/bridge.jar?raw=true) from https://github.com/louismullie/stanford-core-nlp.
158
+ * Place the downloaded bridger.jar file inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/taggers/) or inside the directory configured by setting StanfordCoreNLP.jar_path.
159
+ * Configure your setup (for English) as follows:
160
+ ```ruby
161
+ StanfordCoreNLP.use :english
162
+ StanfordCoreNLP.model_files = {}
163
+ StanfordCoreNLP.default_jars = [
164
+ 'joda-time.jar',
165
+ 'xom.jar',
166
+ 'stanford-corenlp-3.5.0.jar',
167
+ 'stanford-corenlp-3.5.0-models.jar',
168
+ 'jollyday.jar',
169
+ 'bridge.jar'
170
+ ]
171
+ end
172
+ ```
173
+ Or configure your setup (for French) as follows:
174
+ ```ruby
175
+ StanfordCoreNLP.use :french
176
+ StanfordCoreNLP.model_files = {}
177
+ StanfordCoreNLP.set_model('pos.model', 'french.tagger')
178
+ StanfordCoreNLP.default_jars = [
179
+ 'joda-time.jar',
180
+ 'xom.jar',
181
+ 'stanford-corenlp-3.5.0.jar',
182
+ 'stanford-corenlp-3.5.0-models.jar',
183
+ 'jollyday.jar',
184
+ 'bridge.jar'
185
+ ]
186
+ end
187
+ ```
188
+ Or configure your setup (for German) as follows:
189
+ ```ruby
190
+ StanfordCoreNLP.use :german
191
+ StanfordCoreNLP.model_files = {}
192
+ StanfordCoreNLP.set_model('pos.model', 'german-fast.tagger')
193
+ StanfordCoreNLP.default_jars = [
194
+ 'joda-time.jar',
195
+ 'xom.jar',
196
+ 'stanford-corenlp-3.5.0.jar',
197
+ 'stanford-corenlp-3.5.0-models.jar',
198
+ 'jollyday.jar',
199
+ 'bridge.jar'
200
+ ]
201
+ end
202
+ ```
203
+ **Contributing**
204
+
205
+ Simple.
206
+
207
+ 1. Fork the project.
208
+ 2. Send me a pull request!
@@ -0,0 +1,22 @@
1
+ import edu.stanford.nlp.ling.CoreAnnotation;
2
+ import edu.stanford.nlp.util.ArrayCoreMap;
3
+ import java.util.Properties;
4
+ import edu.stanford.nlp.pipeline.StanfordCoreNLP;
5
+
6
+ // export JAVA_HOME='/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home'
7
+ // javac -cp '.:stanford-corenlp.jar' AnnotationBridge.java
8
+ // jar cf bridge.jar AnnotationBridge.class
9
+ public class AnnotationBridge {
10
+
11
+ public static Object getAnnotation(Object entity, String name) throws ClassNotFoundException {
12
+ Class<CoreAnnotation> klass;
13
+ klass = (Class<CoreAnnotation>) Class.forName(name);
14
+ Object object = ((ArrayCoreMap) entity).get(klass);
15
+ return object;
16
+ }
17
+
18
+ public static Object getPipelineWithProperties(Properties properties) {
19
+ StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
20
+ return pipeline;
21
+ }
22
+ }
Binary file
@@ -0,0 +1,238 @@
1
+ require 'stanford-core-nlp/config'
2
+
3
+ module StanfordCoreNLP
4
+
5
+ VERSION = '0.5.3'
6
+
7
+ require 'bind-it'
8
+ extend BindIt::Binding
9
+
10
+ # ############################ #
11
+ # BindIt Configuration Options #
12
+ # ############################ #
13
+
14
+ # The default path for the JAR files
15
+ # is the gem's bin folder.
16
+ self.jar_path = File.dirname(__FILE__).gsub(/\/lib\z/, '') + '/bin/'
17
+
18
+ # Default namespace is the Stanford pipeline namespace.
19
+ self.default_namespace = 'edu.stanford.nlp.pipeline'
20
+
21
+ # Load the JVM with a minimum heap size of 512MB,
22
+ # and a maximum heap size of 1024MB.
23
+ StanfordCoreNLP.jvm_args = ['-Xms512M', '-Xmx1024M']
24
+
25
+ # Turn logging off by default.
26
+ StanfordCoreNLP.log_file = nil
27
+
28
+ # Default JAR files to load.
29
+ StanfordCoreNLP.default_jars = [
30
+ 'joda-time.jar',
31
+ 'xom.jar',
32
+ 'stanford-corenlp.jar',
33
+ 'jollyday.jar',
34
+ 'bridge.jar'
35
+ ]
36
+
37
+ # Default classes to load.
38
+ StanfordCoreNLP.default_classes = [
39
+ ['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'],
40
+ ['Annotation', 'edu.stanford.nlp.pipeline'],
41
+ ['Word', 'edu.stanford.nlp.ling'],
42
+ ['CoreLabel', 'edu.stanford.nlp.ling'],
43
+ ['MaxentTagger', 'edu.stanford.nlp.tagger.maxent'],
44
+ ['CRFClassifier', 'edu.stanford.nlp.ie.crf'],
45
+ ['LexicalizedParser', 'edu.stanford.nlp.parser.lexparser'],
46
+ ['Options', 'edu.stanford.nlp.parser.lexparser'],
47
+ ['Properties', 'java.util'],
48
+ ['ArrayList', 'java.util'],
49
+ ['AnnotationBridge', '']
50
+ ]
51
+
52
+ # ########################### #
53
+ # Stanford Core NLP bindings #
54
+ # ########################### #
55
+
56
+ require 'stanford-core-nlp/bridge'
57
+ extend StanfordCoreNLP::Bridge
58
+
59
+ class << self
60
+ # The model file names for a given language.
61
+ attr_accessor :model_files
62
+ # The folder in which to look for models.
63
+ attr_accessor :model_path
64
+ # Store the language currently being used.
65
+ attr_accessor :language
66
+ #Custom properties
67
+ attr_accessor :custom_properties
68
+ end
69
+
70
+ self.custom_properties = {}
71
+
72
+ # The path to the main folder containing the folders
73
+ # with the individual models inside. By default, this
74
+ # is the same as the JAR path.
75
+ self.model_path = self.jar_path
76
+
77
+ # ########################### #
78
+ # Public configuration params #
79
+ # ########################### #
80
+
81
+ # Use models for a given language. Language can be
82
+ # supplied as full-length, or ISO-639 2 or 3 letter
83
+ # code (e.g. :english, :eng or :en will work).
84
+ def self.use(language)
85
+ lang = nil
86
+ self.model_files = {}
87
+ Config::LanguageCodes.each do |l,codes|
88
+ lang = codes[2] if codes.include?(language)
89
+ end
90
+ self.language = lang
91
+ Config::Models.each do |n, languages|
92
+ models = languages[lang]
93
+ folder = Config::ModelFolders[n]
94
+ if models.is_a?(Hash)
95
+ n = n.to_s
96
+ models.each do |m, file|
97
+ self.model_files["#{n}.#{m}"] = folder + file
98
+ end
99
+ elsif models.is_a?(String)
100
+ self.model_files["#{n}.model"] = folder + models
101
+ end
102
+ end
103
+ end
104
+
105
+ # Use english by default.
106
+ self.use :english
107
+
108
+ # Set a model file.
109
+ def self.set_model(name, file)
110
+ n = name.split('.')[0].intern
111
+ self.model_files[name] = Config::ModelFolders[n] + file
112
+ end
113
+
114
+ # ########################### #
115
+ # Public API methods #
116
+ # ########################### #
117
+
118
+ def self.bind
119
+
120
+ # Take care of Windows users.
121
+ if self.running_on_windows?
122
+ self.jar_path.gsub!('/', '\\')
123
+ self.model_path.gsub!('/', '\\')
124
+ end
125
+
126
+ # Make the bindings.
127
+ super
128
+
129
+ # Bind annotation bridge.
130
+ self.default_classes.each do |info|
131
+ klass = const_get(info.first)
132
+ self.inject_get_method(klass)
133
+ end
134
+
135
+ end
136
+
137
+ # Load a StanfordCoreNLP pipeline with the
138
+ # specified JVM flags and StanfordCoreNLP
139
+ # properties.
140
+ def self.load(*annotators)
141
+
142
+ self.bind unless self.bound
143
+
144
+ # Prepend the JAR path to the model files.
145
+ properties = {}
146
+ self.model_files.each do |k,v|
147
+ found = false
148
+ annotators.each do |annotator|
149
+ found = true if k.index(annotator.to_s)
150
+ break if found
151
+ end
152
+ next unless found
153
+ f = self.model_path + v
154
+ unless File.readable?(f)
155
+ raise "Model file #{f} could not be found. " +
156
+ "You may need to download this file manually " +
157
+ "and/or set paths properly."
158
+ end
159
+ properties[k] = f
160
+ end
161
+
162
+ properties['annotators'] = annotators.map { |x| x.to_s }.join(', ')
163
+
164
+ unless self.language == :english
165
+ # Bug fix for French/German parsers.
166
+ # Otherwise throws "IllegalArgumentException:
167
+ # Unknown option: -retainTmpSubcategories"
168
+ properties['parse.flags'] = ''
169
+ # Bug fix for French/German parsers.
170
+ # Otherswise throws java.lang.NullPointerException: null.
171
+ properties['parse.buildgraphs'] = 'false'
172
+ end
173
+
174
+ # Bug fix for NER system. Otherwise throws:
175
+ # Error initializing binder 1 at edu.stanford.
176
+ # nlp.time.Options.<init>(Options.java:88)
177
+ properties['sutime.binders'] = '0'
178
+
179
+ # Manually include SUTime models.
180
+ if annotators.include?(:ner)
181
+ properties['sutime.rules'] =
182
+ self.model_path + 'sutime/defs.sutime.txt, ' +
183
+ self.model_path + 'sutime/english.sutime.txt'
184
+ end
185
+
186
+ props = get_properties(properties)
187
+
188
+ # Hack for Java7 compatibility.
189
+ bridge = const_get(:AnnotationBridge)
190
+ bridge.getPipelineWithProperties(props)
191
+
192
+ end
193
+
194
+ # Hack in order not to break backwards compatibility.
195
+ def self.const_missing(const)
196
+ if const == :Text
197
+ puts "WARNING: StanfordCoreNLP::Text has been deprecated." +
198
+ "Please use StanfordCoreNLP::Annotation instead."
199
+ Annotation
200
+ else
201
+ super(const)
202
+ end
203
+ end
204
+
205
+ private
206
+
207
+ # Create a java.util.Properties object from a hash.
208
+ def self.get_properties(properties)
209
+ properties = properties.merge(self.custom_properties)
210
+ props = Properties.new
211
+ properties.each do |property, value|
212
+ props.set_property(property.to_s, value.to_s)
213
+ end
214
+ props
215
+ end
216
+
217
+ # Get a Java ArrayList binding to pass lists
218
+ # of tokens to the Stanford Core NLP process.
219
+ def self.get_list(tokens)
220
+ list = StanfordCoreNLP::ArrayList.new
221
+ tokens.each do |t|
222
+ list.add(Word.new(t.to_s))
223
+ end
224
+ list
225
+ end
226
+
227
+ # Returns true if we're running on Windows.
228
+ def self.running_on_windows?
229
+ RUBY_PLATFORM.split("-")[1] == 'mswin32'
230
+ end
231
+
232
+ # camel_case which also support dot as separator
233
+ def self.camel_case(s)
234
+ s = s.to_s.gsub(/\/(.?)/) { "::#{$1.upcase}" }
235
+ s.gsub(/(?:^|_|\.)(.)/) { $1.upcase }
236
+ end
237
+
238
+ end
@@ -0,0 +1,57 @@
1
+ module StanfordCoreNLP::Bridge
2
+
3
+ def inject_get_method(klass)
4
+
5
+ klass.class_eval do
6
+
7
+ if RUBY_PLATFORM =~ /java/
8
+ return unless method_defined?(:get)
9
+ alias_method :get_without_casting, :get
10
+ end
11
+
12
+ # Dynamically defined on all proxied annotation classes.
13
+ # Get an annotation using the annotation bridge.
14
+ def get(annotation, anno_base = nil)
15
+
16
+ unless RUBY_PLATFORM =~ /java/
17
+ return unless java_methods.include?('get(Ljava.lang.Class;)')
18
+ end
19
+
20
+ anno_class = "#{StanfordCoreNLP.camel_case(annotation)}Annotation"
21
+ if anno_base
22
+ unless StanfordNLP::Config::Annotations[anno_base]
23
+ raise "The path #{anno_base} doesn't exist."
24
+ end
25
+ anno_bases = [anno_base]
26
+ else
27
+ anno_bases = StanfordCoreNLP::Config::AnnotationsByName[anno_class]
28
+ raise "The annotation #{anno_class} doesn't exist." unless anno_bases
29
+ end
30
+ if anno_bases.size > 1
31
+ msg = "There are many different annotations bearing the name #{anno_class}." +
32
+ "\nPlease specify one of the following base classes as second parameter to disambiguate: "
33
+ msg << anno_bases.join(',')
34
+ raise msg
35
+ else
36
+ base_class = anno_bases[0]
37
+ end
38
+
39
+ if RUBY_PLATFORM =~ /java/
40
+ fqcn = "edu.stanford.#{base_class}"
41
+ class_path = fqcn.split(".")
42
+ class_name = class_path.pop
43
+ path = StanfordCoreNLP.camel_case(class_path.join("."))
44
+ jruby_class = "Java::#{path}::#{class_name}::#{anno_class}"
45
+ get_without_casting(Object.module_eval(jruby_class))
46
+ else
47
+ url = "edu.stanford.#{base_class}$#{anno_class}"
48
+ StanfordCoreNLP::AnnotationBridge.getAnnotation(self, url)
49
+ end
50
+
51
+ end
52
+
53
+ end
54
+
55
+ end
56
+
57
+ end
@@ -0,0 +1,392 @@
1
+ module StanfordCoreNLP
2
+
3
+ class Config
4
+
5
+ # A hash of language codes in humanized,
6
+ # 2 and 3-letter ISO639 codes.
7
+ LanguageCodes = {
8
+ :english => [:en, :eng, :english],
9
+ :german => [:de, :ger, :german],
10
+ :french => [:fr, :fre, :french]
11
+ }
12
+
13
+ # Folders inside the JAR path for the models.
14
+ ModelFolders = {
15
+ :pos => 'taggers/',
16
+ :parse => 'grammar/',
17
+ :ner => 'classifiers/',
18
+ :dcoref => 'dcoref/'
19
+ }
20
+
21
+ # Tag sets used by Stanford for each language.
22
+ TagSets = {
23
+ :english => :penn,
24
+ :german => :stutgart,
25
+ :french => :paris7
26
+ }
27
+
28
+ # Default models for all languages.
29
+ Models = {
30
+
31
+ :pos => {
32
+ :english => 'english-left3words-distsim.tagger',
33
+ :german => 'german-fast.tagger',
34
+ :french => 'french.tagger'
35
+ },
36
+
37
+ :parse => {
38
+ :english => 'englishPCFG.ser.gz',
39
+ :german => 'germanPCFG.ser.gz',
40
+ :french => 'frenchFactored.ser.gz'
41
+ },
42
+
43
+ :ner => {
44
+ :english => 'english.all.3class.distsim.crf.ser.gz'
45
+ # :german => {} # Add this at some point.
46
+ },
47
+
48
+ :dcoref => {
49
+ :english => {
50
+ 'demonym' => 'demonyms.txt',
51
+ 'animate' => 'animate.unigrams.txt',
52
+ 'female' => 'female.unigrams.txt',
53
+ 'inanimate' => 'inanimate.unigrams.txt',
54
+ 'male' => 'male.unigrams.txt',
55
+ 'neutral' => 'neutral.unigrams.txt',
56
+ 'plural' => 'plural.unigrams.txt',
57
+ 'singular' => 'singular.unigrams.txt',
58
+ 'states' => 'state-abbreviations.txt',
59
+ 'countries' => 'countries',
60
+ 'states.provinces' => 'statesandprovinces',
61
+ 'extra.gender' => 'namegender.combine.txt',
62
+ 'singleton.predictor' => 'singleton.predictor.ser'
63
+ },
64
+ :german => {},
65
+ :french => {}
66
+ }
67
+
68
+ # Models to add.
69
+
70
+ #"truecase.model" - path towards the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz
71
+ #"truecase.bias" - class bias of the true case model; default: INIT_UPPER:-0.7,UPPER:-0.7,O:0
72
+ #"truecase.mixedcasefile" - path towards the mixed case file; default: StanfordCoreNLPModels/truecase/MixDisambiguation.list
73
+ #"nfl.gazetteer" - path towards the gazetteer for the NFL domain
74
+ #"nfl.relation.model" - path towards the NFL relation extraction model
75
+ }
76
+
77
+ # List of annotations by JAVA class path.
78
+ Annotations = {
79
+
80
+ 'nlp.dcoref.CoNLL2011DocumentReader' => [
81
+ 'CorefMentionAnnotation',
82
+ 'NamedEntityAnnotation'
83
+ ],
84
+
85
+ 'nlp.ling.CoreAnnotations' => [
86
+
87
+ 'AbbrAnnotation',
88
+ 'AbgeneAnnotation',
89
+ 'AbstrAnnotation',
90
+ 'AfterAnnotation',
91
+ 'AnswerAnnotation',
92
+ 'AnswerObjectAnnotation',
93
+ 'AntecedentAnnotation',
94
+ 'ArgDescendentAnnotation',
95
+ 'ArgumentAnnotation',
96
+ 'BagOfWordsAnnotation',
97
+ 'BeAnnotation',
98
+ 'BeforeAnnotation',
99
+ 'BeginIndexAnnotation',
100
+ 'BestCliquesAnnotation',
101
+ 'BestFullAnnotation',
102
+ 'CalendarAnnotation',
103
+ 'CategoryAnnotation',
104
+ 'CategoryFunctionalTagAnnotation',
105
+ 'CharacterOffsetBeginAnnotation',
106
+ 'CharacterOffsetEndAnnotation',
107
+ 'CharAnnotation',
108
+ 'ChineseCharAnnotation',
109
+ 'ChineseIsSegmentedAnnotation',
110
+ 'ChineseOrigSegAnnotation',
111
+ 'ChineseSegAnnotation',
112
+ 'ChunkAnnotation',
113
+ 'CoarseTagAnnotation',
114
+ 'CommonWordsAnnotation',
115
+ 'CoNLLDepAnnotation',
116
+ 'CoNLLDepParentIndexAnnotation',
117
+ 'CoNLLDepTypeAnnotation',
118
+ 'CoNLLPredicateAnnotation',
119
+ 'CoNLLSRLAnnotation',
120
+ 'ContextsAnnotation',
121
+ 'CopyAnnotation',
122
+ 'CostMagnificationAnnotation',
123
+ 'CovertIDAnnotation',
124
+ 'D2_LBeginAnnotation',
125
+ 'D2_LEndAnnotation',
126
+ 'D2_LMiddleAnnotation',
127
+ 'DayAnnotation',
128
+ 'DependentsAnnotation',
129
+ 'DictAnnotation',
130
+ 'DistSimAnnotation',
131
+ 'DoAnnotation',
132
+ 'DocDateAnnotation',
133
+ 'DocIDAnnotation',
134
+ 'DomainAnnotation',
135
+ 'EndIndexAnnotation',
136
+ 'EntityClassAnnotation',
137
+ 'EntityRuleAnnotation',
138
+ 'EntityTypeAnnotation',
139
+ 'FeaturesAnnotation',
140
+ 'FemaleGazAnnotation',
141
+ 'FirstChildAnnotation',
142
+ 'ForcedSentenceEndAnnotation',
143
+ 'FreqAnnotation',
144
+ 'GazAnnotation',
145
+ 'GazetteerAnnotation',
146
+ 'GenericTokensAnnotation',
147
+ 'GeniaAnnotation',
148
+ 'GoldAnswerAnnotation',
149
+ 'GovernorAnnotation',
150
+ 'GrandparentAnnotation',
151
+ 'HaveAnnotation',
152
+ 'HeadWordStringAnnotation',
153
+ 'HeightAnnotation',
154
+ 'IDAnnotation',
155
+ 'IDFAnnotation',
156
+ 'INAnnotation',
157
+ 'IndexAnnotation',
158
+ 'InterpretationAnnotation',
159
+ 'IsDateRangeAnnotation',
160
+ 'IsURLAnnotation',
161
+ 'LabelAnnotation',
162
+ 'LastGazAnnotation',
163
+ 'LastTaggedAnnotation',
164
+ 'LBeginAnnotation',
165
+ 'LeftChildrenNodeAnnotation',
166
+ 'LeftTermAnnotation',
167
+ 'LemmaAnnotation',
168
+ 'LEndAnnotation',
169
+ 'LengthAnnotation',
170
+ 'LMiddleAnnotation',
171
+ 'MaleGazAnnotation',
172
+ 'MarkingAnnotation',
173
+ 'MonthAnnotation',
174
+ 'MorphoCaseAnnotation',
175
+ 'MorphoGenAnnotation',
176
+ 'MorphoNumAnnotation',
177
+ 'MorphoPersAnnotation',
178
+ 'NamedEntityTagAnnotation',
179
+ 'NeighborsAnnotation',
180
+ 'NERIDAnnotation',
181
+ 'NormalizedNamedEntityTagAnnotation',
182
+ 'NotAnnotation',
183
+ 'NumericCompositeObjectAnnotation',
184
+ 'NumericCompositeTypeAnnotation',
185
+ 'NumericCompositeValueAnnotation',
186
+ 'NumericObjectAnnotation',
187
+ 'NumericTypeAnnotation',
188
+ 'NumericValueAnnotation',
189
+ 'NumerizedTokensAnnotation',
190
+ 'NumTxtSentencesAnnotation',
191
+ 'OriginalAnswerAnnotation',
192
+ 'OriginalCharAnnotation',
193
+ 'OriginalTextAnnotation',
194
+ 'ParagraphAnnotation',
195
+ 'ParagraphsAnnotation',
196
+ 'ParaPositionAnnotation',
197
+ 'ParentAnnotation',
198
+ 'PartOfSpeechAnnotation',
199
+ 'PercentAnnotation',
200
+ 'PhraseWordsAnnotation',
201
+ 'PhraseWordsTagAnnotation',
202
+ 'PolarityAnnotation',
203
+ 'PositionAnnotation',
204
+ 'PossibleAnswersAnnotation',
205
+ 'PredictedAnswerAnnotation',
206
+ 'PrevChildAnnotation',
207
+ 'PriorAnnotation',
208
+ 'ProjectedCategoryAnnotation',
209
+ 'ProtoAnnotation',
210
+ 'RoleAnnotation',
211
+ 'SectionAnnotation',
212
+ 'SemanticHeadTagAnnotation',
213
+ 'SemanticHeadWordAnnotation',
214
+ 'SemanticTagAnnotation',
215
+ 'SemanticWordAnnotation',
216
+ 'SentenceIDAnnotation',
217
+ 'SentenceIndexAnnotation',
218
+ 'SentencePositionAnnotation',
219
+ 'SentencesAnnotation',
220
+ 'ShapeAnnotation',
221
+ 'SpaceBeforeAnnotation',
222
+ 'SpanAnnotation',
223
+ 'SpeakerAnnotation',
224
+ 'SRL_ID',
225
+ 'SRLIDAnnotation',
226
+ 'SRLInstancesAnnotation',
227
+ 'StackedNamedEntityTagAnnotation',
228
+ 'StateAnnotation',
229
+ 'StemAnnotation',
230
+ 'SubcategorizationAnnotation',
231
+ 'TagLabelAnnotation',
232
+ 'TextAnnotation',
233
+ 'TokenBeginAnnotation',
234
+ 'TokenEndAnnotation',
235
+ 'TokensAnnotation',
236
+ 'TopicAnnotation',
237
+ 'TrueCaseAnnotation',
238
+ 'TrueCaseTextAnnotation',
239
+ 'TrueTagAnnotation',
240
+ 'UBlockAnnotation',
241
+ 'UnaryAnnotation',
242
+ 'UnknownAnnotation',
243
+ 'UtteranceAnnotation',
244
+ 'UTypeAnnotation',
245
+ 'ValueAnnotation',
246
+ 'VerbSenseAnnotation',
247
+ 'WebAnnotation',
248
+ 'WordFormAnnotation',
249
+ 'WordnetSynAnnotation',
250
+ 'WordPositionAnnotation',
251
+ 'WordSenseAnnotation',
252
+ 'XmlContextAnnotation',
253
+ 'XmlElementAnnotation',
254
+ 'YearAnnotation'
255
+ ],
256
+
257
+ 'nlp.dcoref.CorefCoreAnnotations' => [
258
+
259
+ 'CorefAnnotation',
260
+ 'CorefChainAnnotation',
261
+ 'CorefClusterAnnotation',
262
+ 'CorefClusterIdAnnotation',
263
+ 'CorefDestAnnotation',
264
+ 'CorefGraphAnnotation'
265
+ ],
266
+
267
+ 'nlp.ling.CoreLabel' => [
268
+ 'GenericAnnotation'
269
+ ],
270
+
271
+ 'nlp.trees.EnglishGrammaticalRelations' => [
272
+ 'AbbreviationModifierGRAnnotation',
273
+ 'AdjectivalComplementGRAnnotation',
274
+ 'AdjectivalModifierGRAnnotation',
275
+ 'AdvClauseModifierGRAnnotation',
276
+ 'AdverbialModifierGRAnnotation',
277
+ 'AgentGRAnnotation',
278
+ 'AppositionalModifierGRAnnotation',
279
+ 'ArgumentGRAnnotation',
280
+ 'AttributiveGRAnnotation',
281
+ 'AuxModifierGRAnnotation',
282
+ 'AuxPassiveGRAnnotation',
283
+ 'ClausalComplementGRAnnotation',
284
+ 'ClausalPassiveSubjectGRAnnotation',
285
+ 'ClausalSubjectGRAnnotation',
286
+ 'ComplementGRAnnotation',
287
+ 'ComplementizerGRAnnotation',
288
+ 'ConjunctGRAnnotation',
289
+ 'ControllingSubjectGRAnnotation',
290
+ 'CoordinationGRAnnotation',
291
+ 'CopulaGRAnnotation',
292
+ 'DeterminerGRAnnotation',
293
+ 'DirectObjectGRAnnotation',
294
+ 'ExpletiveGRAnnotation',
295
+ 'IndirectObjectGRAnnotation',
296
+ 'InfinitivalModifierGRAnnotation',
297
+ 'MarkerGRAnnotation',
298
+ 'ModifierGRAnnotation',
299
+ 'MultiWordExpressionGRAnnotation',
300
+ 'NegationModifierGRAnnotation',
301
+ 'NominalPassiveSubjectGRAnnotation',
302
+ 'NominalSubjectGRAnnotation',
303
+ 'NounCompoundModifierGRAnnotation',
304
+ 'NpAdverbialModifierGRAnnotation',
305
+ 'NumberModifierGRAnnotation',
306
+ 'NumericModifierGRAnnotation',
307
+ 'ObjectGRAnnotation',
308
+ 'ParataxisGRAnnotation',
309
+ 'ParticipialModifierGRAnnotation',
310
+ 'PhrasalVerbParticleGRAnnotation',
311
+ 'PossessionModifierGRAnnotation',
312
+ 'PossessiveModifierGRAnnotation',
313
+ 'PreconjunctGRAnnotation',
314
+ 'PredeterminerGRAnnotation',
315
+ 'PredicateGRAnnotation',
316
+ 'PrepositionalComplementGRAnnotation',
317
+ 'PrepositionalModifierGRAnnotation',
318
+ 'PrepositionalObjectGRAnnotation',
319
+ 'PunctuationGRAnnotation',
320
+ 'PurposeClauseModifierGRAnnotation',
321
+ 'QuantifierModifierGRAnnotation',
322
+ 'ReferentGRAnnotation',
323
+ 'RelativeClauseModifierGRAnnotation',
324
+ 'RelativeGRAnnotation',
325
+ 'SemanticDependentGRAnnotation',
326
+ 'SubjectGRAnnotation',
327
+ 'TemporalModifierGRAnnotation',
328
+ 'XClausalComplementGRAnnotation'
329
+ ],
330
+
331
+ 'nlp.trees.GrammaticalRelation' => [
332
+ 'DependentGRAnnotation',
333
+ 'GovernorGRAnnotation',
334
+ 'GrammaticalRelationAnnotation',
335
+ 'KillGRAnnotation',
336
+ 'Language',
337
+ 'RootGRAnnotation'
338
+ ],
339
+
340
+ 'nlp.ie.machinereading.structure.MachineReadingAnnotations' => [
341
+ 'DependencyAnnotation',
342
+ 'DocumentDirectoryAnnotation',
343
+ 'DocumentIdAnnotation',
344
+ 'EntityMentionsAnnotation',
345
+ 'EventMentionsAnnotation',
346
+ 'GenderAnnotation',
347
+ 'RelationMentionsAnnotation',
348
+ 'TriggerAnnotation'
349
+ ],
350
+
351
+ 'nlp.parser.lexparser.ParserAnnotations' => [
352
+ 'ConstraintAnnotation'
353
+ ],
354
+
355
+ 'nlp.semgraph.SemanticGraphCoreAnnotations' => [
356
+ 'BasicDependenciesAnnotation',
357
+ 'CollapsedCCProcessedDependenciesAnnotation',
358
+ 'CollapsedDependenciesAnnotation'
359
+ ],
360
+
361
+ 'nlp.time.TimeAnnotations' => [
362
+ 'TimexAnnotation',
363
+ 'TimexAnnotations'
364
+ ],
365
+
366
+ 'nlp.time.TimeExpression' => [
367
+ 'Annotation',
368
+ 'ChildrenAnnotation',
369
+ 'TimeIndexAnnotation'
370
+ ],
371
+
372
+ 'nlp.trees.TreeCoreAnnotations' => [
373
+ 'TreeHeadTagAnnotation',
374
+ 'TreeHeadWordAnnotation',
375
+ 'TreeAnnotation'
376
+ ]
377
+ }
378
+
379
+ # Create a list of annotation names => paths.
380
+ annotations_by_name = {}
381
+ Annotations.each do |base_class, annotation_classes|
382
+ annotation_classes.each do |annotation_class|
383
+ annotations_by_name[annotation_class] ||= []
384
+ annotations_by_name[annotation_class] << base_class
385
+ end
386
+ end
387
+
388
+ # Hash of name => path.
389
+ AnnotationsByName = annotations_by_name
390
+
391
+ end
392
+ end
metadata ADDED
@@ -0,0 +1,96 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: stanford-core-nlp-abstractor
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.5.3
5
+ platform: ruby
6
+ authors:
7
+ - Louis Mullie
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-02-23 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bind-it
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: 0.2.7
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: 0.2.7
27
+ - !ruby/object:Gem::Dependency
28
+ name: rspec
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rake
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ description: " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
56
+ language processing\ntools that provides tokenization, part-of-speech tagging and
57
+ parsing for several languages, as well as named entity\nrecognition and coreference
58
+ resolution for English. "
59
+ email:
60
+ - louis.mullie@gmail.com
61
+ executables: []
62
+ extensions: []
63
+ extra_rdoc_files: []
64
+ files:
65
+ - LICENSE
66
+ - README.md
67
+ - bin/AnnotationBridge.java
68
+ - bin/bridge.jar
69
+ - lib/stanford-core-nlp.rb
70
+ - lib/stanford-core-nlp/bridge.rb
71
+ - lib/stanford-core-nlp/config.rb
72
+ homepage: https://github.com/louismullie/stanford-core-nlp
73
+ licenses: []
74
+ metadata: {}
75
+ post_install_message:
76
+ rdoc_options: []
77
+ require_paths:
78
+ - lib
79
+ required_ruby_version: !ruby/object:Gem::Requirement
80
+ requirements:
81
+ - - ">="
82
+ - !ruby/object:Gem::Version
83
+ version: '0'
84
+ required_rubygems_version: !ruby/object:Gem::Requirement
85
+ requirements:
86
+ - - ">="
87
+ - !ruby/object:Gem::Version
88
+ version: '0'
89
+ requirements: []
90
+ rubyforge_project:
91
+ rubygems_version: 2.4.3
92
+ signing_key:
93
+ specification_version: 4
94
+ summary: Ruby bindings to the Stanford Core NLP tools.
95
+ test_files: []
96
+ has_rdoc: