stanford-core-nlp 0.2.1 → 0.3.0
- data/README.md +139 -0
- data/lib/stanford-core-nlp.rb +56 -172
- data/lib/stanford-core-nlp/{java_wrapper.rb → bridge.rb} +9 -19
- metadata +13 -9
- data/README.markdown +0 -100
- data/lib/stanford-core-nlp/jar_loader.rb +0 -55
data/README.md
ADDED
@@ -0,0 +1,139 @@
[![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)

**About**

This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for tokenization, part-of-speech tagging, lemmatization, and parsing of several languages, as well as named entity recognition and coreference resolution in English. This gem is compatible with Ruby 1.9.2 and above.

**Installing**

First, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:

* A [minimal package for English](http://louismullie.com/treat/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
* A [full package for English](http://louismullie.com/treat/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
* A [full package for all languages](http://louismullie.com/treat/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.

Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/).

**Configuration**

After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:

```ruby
# Set an alternative path to look for the JAR files.
# Default is the gem's bin folder.
StanfordCoreNLP.jar_path = '/path_to_jars/'

# Set an alternative path to look for the model files.
# Default is the gem's bin folder.
StanfordCoreNLP.model_path = '/path_to_models/'

# Pass some alternative arguments to the Java VM.
# Default is ['-Xms512M', '-Xmx1024M'] (be prepared
# to take a coffee break).
StanfordCoreNLP.jvm_args = ['-option1', '-option2']

# Redirect VM output to log.txt.
StanfordCoreNLP.log_file = 'log.txt'

# Use the model files for a different language than English.
StanfordCoreNLP.use(:french)

# Change a specific model file.
StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
```

**Using the gem**

```ruby
text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
       'Berlin to discuss a new austerity package. Sarkozy ' +
       'looked pleased, but Merkel was dismayed.'

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
text = StanfordCoreNLP::Text.new(text)
pipeline.annotate(text)

text.get(:sentences).each do |sentence|
  # Syntactic dependencies
  puts sentence.get(:basic_dependencies).to_s
  sentence.get(:tokens).each do |token|
    # Default annotations for all tokens
    puts token.get(:value).to_s
    puts token.get(:original_text).to_s
    puts token.get(:character_offset_begin).to_s
    puts token.get(:character_offset_end).to_s
    # POS returned by the tagger
    puts token.get(:part_of_speech).to_s
    # Lemma (base form of the token)
    puts token.get(:lemma).to_s
    # Named entity tag
    puts token.get(:named_entity_tag).to_s
    # Coreference
    puts token.get(:coref_cluster_id).to_s
    # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
  end
end
```

> Important: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.

A good reference for the names of annotations is the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. `:named_entity_tag`) corresponding to a Java annotation class follows a simple un-camel-casing convention, with 'Annotation' at the end removed. For example, the annotation `NamedEntityTagAnnotation` translates to `:named_entity_tag`, `PartOfSpeechAnnotation` to `:part_of_speech`, etc.
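
To make the convention concrete, here is a small illustrative Ruby sketch (not part of the gem; the helper names are hypothetical) that converts between a Java annotation class name and the corresponding Ruby symbol:

```ruby
# Hypothetical helpers illustrating the un-camel-casing convention described above.
def annotation_to_symbol(class_name)
  name = class_name.sub(/Annotation$/, '')                # drop the trailing 'Annotation'
  name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase.to_sym  # un-camel-case the rest
end

def symbol_to_annotation(symbol)
  symbol.to_s.split('_').map(&:capitalize).join + 'Annotation'
end

annotation_to_symbol('NamedEntityTagAnnotation') # => :named_entity_tag
symbol_to_annotation(:part_of_speech)            # => "PartOfSpeechAnnotation"
```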

**Loading specific classes**

You may also want to load your own classes from the Stanford NLP package to perform more specific tasks. The gem provides an API to do this:

```ruby
# Default base class is edu.stanford.nlp.pipeline.
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>

# Here, we specify another base class.
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
puts StanfordCoreNLP::MaxentTagger.inspect
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
```

**List of annotator classes**

Here is a full list of annotator classes provided by the Stanford Core NLP package. You can load these classes individually using `StanfordCoreNLP.load_class` (see above). Once this is done, you can use them as you would from a Java program. Refer to the Java documentation for a list of functions provided by each of these classes.

* PTBTokenizerAnnotator - tokenizes the text following Penn Treebank conventions.
* WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
* POSTaggerAnnotator - annotates the text with part-of-speech tags.
* MorphaAnnotator - morphological normalizer (generates lemmas).
* NERAnnotator - annotates the text with named-entity labels.
* NERCombinerAnnotator - combines several NER models.
* TrueCaseAnnotator - detects the true case of words in free text.
* ParserAnnotator - generates constituent and dependency trees.
* NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
* TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
* QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
* SRLAnnotator - annotates predicates and their semantic roles.
* DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model.
* NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
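
For instance, any of the annotators listed above can be loaded through `StanfordCoreNLP.load_class`; a minimal sketch, following the inspect examples shown earlier (the exact output string is assumed):

```ruby
# Load one of the annotators listed above under the default
# edu.stanford.nlp.pipeline namespace, then inspect the proxied class.
StanfordCoreNLP.load_class('TrueCaseAnnotator')
puts StanfordCoreNLP::TrueCaseAnnotator.inspect
# => #<Rjb::Edu_stanford_nlp_pipeline_TrueCaseAnnotator> (assumed output)
```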

**List of model files**

Here is a full list of the default models for the Stanford Core NLP pipeline. You can change these models individually using `StanfordCoreNLP.set_model` (see above).

* 'pos.model' - 'english-left3words-distsim.tagger'
* 'ner.model.3class' - 'all.3class.distsim.crf.ser.gz'
* 'ner.model.7class' - 'muc.7class.distsim.crf.ser.gz'
* 'ner.model.MISCclass' - 'conll.4class.distsim.crf.ser.gz'
* 'parser.model' - 'englishPCFG.ser.gz'
* 'dcoref.demonym' - 'demonyms.txt'
* 'dcoref.animate' - 'animate.unigrams.txt'
* 'dcoref.female' - 'female.unigrams.txt'
* 'dcoref.inanimate' - 'inanimate.unigrams.txt'
* 'dcoref.male' - 'male.unigrams.txt'
* 'dcoref.neutral' - 'neutral.unigrams.txt'
* 'dcoref.plural' - 'plural.unigrams.txt'
* 'dcoref.singular' - 'singular.unigrams.txt'
* 'dcoref.states' - 'state-abbreviations.txt'
* 'dcoref.extra.gender' - 'namegender.combine.txt'
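
A brief sketch of overriding one of these defaults before loading a pipeline (the file name below is a placeholder; whatever you pass must exist under `StanfordCoreNLP.model_path`):

```ruby
# Swap the default POS tagger model for a custom one, then load a pipeline.
# 'my-custom.tagger' is a hypothetical placeholder file name.
StanfordCoreNLP.set_model('pos.model', 'my-custom.tagger')
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)
```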

**Contributing**

Feel free to fork the project and send me a pull request!
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,58 +1,62 @@
 module StanfordCoreNLP
 
-  VERSION = '0.2.1'
+  VERSION = '0.3.0'
+
+  require 'bind-it'
+  extend BindIt::Binding
+
+  # ############################ #
+  # BindIt Configuration Options #
+  # ############################ #
+
+  # The path in which to look for the Stanford JAR files,
+  # with a trailing slash.
+  self.jar_path = File.dirname(__FILE__) + '/../bin/'
+
+  # Load the JVM with a minimum heap size of 512MB,
+  # and a maximum heap size of 1024MB.
+  self.jvm_args = ['-Xms512M', '-Xmx1024M']
+
+  # Turn logging off by default.
+  self.log_file = nil
+
+  # Default JAR files to load.
+  self.default_jars = [
+    'joda-time.jar',
+    'xom.jar',
+    'stanford-corenlp.jar',
+    'bridge.jar'
+  ]
+
+  # Default classes to load.
+  self.default_classes = [
+    ['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'],
+    ['Annotation', 'edu.stanford.nlp.pipeline', 'Text'],
+    ['Word', 'edu.stanford.nlp.ling'],
+    ['MaxentTagger', 'edu.stanford.nlp.tagger.maxent'],
+    ['CRFClassifier', 'edu.stanford.nlp.ie.crf'],
+    ['Properties', 'java.util'],
+    ['ArrayList', 'java.util'],
+    ['AnnotationBridge', '']
+  ]
+
+  # Default namespace is the Stanford pipeline namespace.
+  self.default_namespace = 'edu.stanford.nlp.pipeline'
 
-  require 'stanford-core-nlp/jar_loader'
-  require 'stanford-core-nlp/java_wrapper'
   require 'stanford-core-nlp/config'
-
+  require 'stanford-core-nlp/bridge'
+
   class << self
-    # The
-    # with a trailing slash.
-    #
-    # The structure of the JAR folder must be as follows:
-    #
-    # Files:
-    #
-    # /stanford-core-nlp.jar
-    # /joda-time.jar
-    # /xom.jar
-    # /bridge.jar*
-    #
-    # Folders:
-    #
-    # /classifiers  # Models for the NER system.
-    # /dcoref       # Models for the coreference resolver.
-    # /taggers      # Models for the POS tagger.
-    # /grammar      # Models for the parser.
-    #
-    # *The file bridge.jar is a thin JAVA wrapper over the
-    # Stanford Core NLP get() function, which allows to
-    # retrieve annotations using static classes as names.
-    # This works around one of the lacunae of Rjb.
-    attr_accessor :jar_path
-    # The path to the main folder containing the folders
-    # with the individual models inside. By default, this
-    # is the same as the JAR path.
-    attr_accessor :model_path
-    # The flags for starting the JVM machine. The parser
-    # and named entity recognizer are very memory consuming.
-    attr_accessor :jvm_args
-    # A file to redirect JVM output to.
-    attr_accessor :log_file
-    # The model files for a given language.
+    # The model file names for a given language.
     attr_accessor :model_files
+    # The folder in which to look for models.
+    attr_accessor :model_path
   end
-
-  # The
-
-  #
+
+  # The path to the main folder containing the folders
+  # with the individual models inside. By default, this
+  # is the same as the JAR path.
   self.model_path = self.jar_path
-  # Load the JVM with a minimum heap size of 512MB and a
-  # maximum heap size of 1024MB.
-  self.jvm_args = ['-Xms512M', '-Xmx1024M']
-  # Turn logging off by default.
-  self.log_file = nil
 
   # Use models for a given language. Language can be
   # supplied as full-length, or ISO-639 2 or 3 letter
@@ -83,49 +87,20 @@ module StanfordCoreNLP
   # Use english by default.
   self.use(:english)
 
-  # Set a model file.
-  #
-  # 'pos.model' => 'english-left3words-distsim.tagger',
-  # 'ner.model.3class' => 'all.3class.distsim.crf.ser.gz',
-  # 'ner.model.7class' => 'muc.7class.distsim.crf.ser.gz',
-  # 'ner.model.MISCclass' => 'conll.4class.distsim.crf.ser.gz',
-  # 'parser.model' => 'englishPCFG.ser.gz',
-  # 'dcoref.demonym' => 'demonyms.txt',
-  # 'dcoref.animate' => 'animate.unigrams.txt',
-  # 'dcoref.female' => 'female.unigrams.txt',
-  # 'dcoref.inanimate' => 'inanimate.unigrams.txt',
-  # 'dcoref.male' => 'male.unigrams.txt',
-  # 'dcoref.neutral' => 'neutral.unigrams.txt',
-  # 'dcoref.plural' => 'plural.unigrams.txt',
-  # 'dcoref.singular' => 'singular.unigrams.txt',
-  # 'dcoref.states' => 'state-abbreviations.txt',
-  # 'dcoref.extra.gender' => 'namegender.combine.txt'
-  #
+  # Set a model file.
   def self.set_model(name, file)
     n = name.split('.')[0].intern
     self.model_files[name] =
       Config::ModelFolders[n] + file
   end
 
-  # Whether the classes are initialized or not.
-  @@initialized = false
-
-  # Load the JARs, create the classes.
-  def self.init
-    unless @@initialized
-      self.load_jars
-      self.load_default_classes
-    end
-    @@initialized = true
-  end
-
   # Load a StanfordCoreNLP pipeline with the
   # specified JVM flags and StanfordCoreNLP
   # properties.
   def self.load(*annotators)
-
-
-
+
+    # Make the bindings.
+    self.bind
     # Prepend the JAR path to the model files.
     properties = {}
     self.model_files.each do |k,v|
@@ -135,15 +110,12 @@ module StanfordCoreNLP
         break if found
       end
       next unless found
-
       f = self.model_path + v
-
       unless File.readable?(f)
         raise "Model file #{f} could not be found. " +
           "You may need to download this file manually "+
           " and/or set paths properly."
       end
-
       properties[k] = f
     end
 
@@ -152,81 +124,7 @@ module StanfordCoreNLP
     CoreNLP.new(get_properties(properties))
   end
 
-  #
-  # the program always loads the same models when
-  # you make new pipelines and request the annotator
-  # again, ignoring the changes in models.
-  #
-  # This function kills the JVM and reloads everything
-  # if you need to create a new pipeline with different
-  # models for the same annotators.
-  #def self.reload
-  #  raise 'Not implemented.'
-  #end
-
-  # Load the jars.
-  def self.load_jars
-    JarLoader.log(self.log_file)
-    JarLoader.jvm_args = self.jvm_args
-    JarLoader.jar_path = self.jar_path
-    JarLoader.load('joda-time.jar')
-    JarLoader.load('xom.jar')
-    JarLoader.load('stanford-corenlp.jar')
-    JarLoader.load('bridge.jar')
-  end
-
-  # Create the Ruby classes corresponding to the StanfordNLP
-  # core classes.
-  def self.load_default_classes
-
-    const_set(:CoreNLP,
-      Rjb::import('edu.stanford.nlp.pipeline.StanfordCoreNLP')
-    )
-
-    self.load_klass 'Annotation'
-    self.load_klass 'Word', 'edu.stanford.nlp.ling'
-
-    self.load_klass 'MaxentTagger', 'edu.stanford.nlp.tagger.maxent'
-
-    self.load_klass 'CRFClassifier', 'edu.stanford.nlp.ie.crf'
-
-    self.load_klass 'Properties', 'java.util'
-    self.load_klass 'ArrayList', 'java.util'
-
-    self.load_klass 'AnnotationBridge', ''
-
-    const_set(:Text, Annotation)
-
-  end
-
-  # Load a class (e.g. PTBTokenizerAnnotator) in a specific
-  # class path (default is 'edu.stanford.nlp.pipeline').
-  # The class is then accessible under the StanfordCoreNLP
-  # namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
-  #
-  # List of annotators:
-  #
-  # - PTBTokenizingAnnotator - tokenizes the text following Penn Treebank conventions.
-  # - WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
-  # - POSTaggerAnnotator - annotates the text with part-of-speech tags.
-  # - MorphaAnnotator - morphological normalizer (generates lemmas).
-  # - NERAnnotator - annotates the text with named-entity labels.
-  # - NERCombinerAnnotator - combines several NER models (use this instead of NERAnnotator!).
-  # - TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text).
-  # - ParserAnnotator - generates constituent and dependency trees.
-  # - NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
-  # - TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
-  # - QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
-  # - SRLAnnotator - annotates predicates and their semantic roles.
-  # - CorefAnnotator - implements pronominal anaphora resolution using a statistical model (deprecated!).
-  # - DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model (newer model, use this!).
-  # - NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
-  def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
-    self.init unless @@initialized
-    self.load_klass(klass, base)
-  end
-
-  # HCreate a java.util.Properties object from a hash.
+  # Create a java.util.Properties object from a hash.
   def self.get_properties(properties)
     props = Properties.new
     properties.each do |property, value|
@@ -245,18 +143,4 @@ module StanfordCoreNLP
     list
   end
 
-  # Under_case -> CamelCase.
-  def self.camel_case(text)
-    text.to_s.gsub(/^[a-z]|_[a-z]/) do |a|
-      a.upcase
-    end.gsub('_', '')
-  end
-
-  private
-  def self.load_klass(klass, base = 'edu.stanford.nlp.pipeline')
-    base += '.' unless base == ''
-    const_set(klass.intern,
-      Rjb::import("#{base}#{klass}"))
-  end
-
 end
data/lib/stanford-core-nlp/{java_wrapper.rb → bridge.rb}
CHANGED
@@ -1,23 +1,9 @@
 module StanfordCoreNLP
 
-  # Modify the Rjb JavaProxy class to add our
+  # Modify the Rjb JavaProxy class to add our
+  # own methods to every Java object.
   Rjb::Rjb_JavaProxy.class_eval do
 
-    # Dynamically defined on all proxied Java objects.
-    # Shorthand for to_string defined by Java classes.
-    def to_s; to_string; end
-
-    # Dynamically defined on all proxied Java iterators.
-    # Provide Ruby-style iterators to wrap Java iterators.
-    def each
-      if !java_methods.include?('iterator()')
-        raise 'This object cannot be iterated.'
-      else
-        i = self.iterator
-        while i.has_next; yield i.next; end
-      end
-    end
-
     # Dynamically defined on all proxied annotation classes.
     # Get an annotation using the annotation bridge.
     def get(annotation, anno_base = nil)
@@ -26,15 +12,19 @@ module StanfordCoreNLP
       else
         anno_class = "#{StanfordCoreNLP.camel_case(annotation)}Annotation"
         if anno_base
-
+          unless StanfordNLP::Config::Annotations[anno_base]
+            raise "The path #{anno_base} doesn't exist."
+          end
           anno_bases = [anno_base]
         else
           anno_bases = StanfordCoreNLP::Config::AnnotationsByName[anno_class]
           raise "The annotation #{anno_class} doesn't exist." unless anno_bases
         end
         if anno_bases.size > 1
-          msg = "There are many different annotations
-
+          msg = "There are many different annotations " +
+            "bearing the name #{anno_class}. \nPlease specify " +
+            "one of the following base classes as second " +
+            "parameter to disambiguate: "
           msg << anno_bases.join(',')
           raise msg
         else
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: stanford-core-nlp
 version: !ruby/object:Gem::Version
-  version: 0.2.1
+  version: 0.3.0
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-04-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name:
-  requirement:
+  name: bind-it
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,7 +21,12 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements:
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
   language processing \ntools that provides tokenization, part-of-speech tagging and
   parsing for several languages, as well as named entity \nrecognition and coreference
@@ -32,12 +37,11 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- lib/stanford-core-nlp/bridge.rb
 - lib/stanford-core-nlp/config.rb
-- lib/stanford-core-nlp/jar_loader.rb
-- lib/stanford-core-nlp/java_wrapper.rb
 - lib/stanford-core-nlp.rb
 - bin/bridge.jar
-- README.markdown
+- README.md
 - LICENSE
 homepage: https://github.com/louismullie/stanford-core-nlp
 licenses: []
@@ -59,7 +63,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.
+rubygems_version: 1.8.21
 signing_key:
 specification_version: 3
 summary: Ruby bindings to the Stanford Core NLP tools.
data/README.markdown
DELETED
@@ -1,100 +0,0 @@
[![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)

**About**

This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for several languages, as well as named entity recognition and coreference resolution for English. This gem is compatible with Ruby 1.9.2 and above.

**Installing**

Firs, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:

* A [minimal package for English](http://louismullie.com/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
* A [full package for English](http://louismullie.com/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
* A [full package for all languages](http://louismullie.com/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.

Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. /usr/local/lib/ruby/gems/1.X.x/gems/stanford-core-nlp-0.x/bin/).

**Configuration**

After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:

```ruby
# Set an alternative path to look for the JAR files
# Default is gem's bin folder.
StanfordCoreNLP.jar_path = '/path_to_jars/'

# Set an alternative path to look for the model files
# Default is gem's bin folder.
StanfordCoreNLP.model_path = '/path_to_models/'

# Pass some alternative arguments to the Java VM.
# Default is ['-Xms512M', '-Xmx1024M'] (be prepared
# to take a coffee break).
StanfordCoreNLP.jvm_args = ['-option1', '-option2']

# Redirect VM output to log.txt
StanfordCoreNLP.log_file = 'log.txt'

# Use the model files for a different language than English.
StanfordCoreNLP.use(:french)

# Change a specific model file.
StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
```

**Using the gem**

```ruby
text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
       'Berlin to discuss a new austerity package. Sarkozy ' +
       'looked pleased, but Merkel was dismayed.'

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
text = StanfordCoreNLP::Text.new(text)
pipeline.annotate(text)

text.get(:sentences).each do |sentence|
  # Syntatical dependencies
  puts sentence.get(:basic_dependencies).to_s
  sentence.get(:tokens).each do |token|
    # Default annotations for all tokens
    puts token.get(:value).to_s
    puts token.get(:original_text).to_s
    puts token.get(:character_offset_begin).to_s
    puts token.get(:character_offset_end).to_s
    # POS returned by the tagger
    puts token.get(:part_of_speech).to_s
    # Lemma (base form of the token)
    puts token.get(:lemma).to_s
    # Named entity tag
    puts token.get(:named_entity_tag).to_s
    # Coreference
    puts token.get(:coref_cluster_id).to_s
    # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
  end
end
```

> Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.

A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class follows the simple un-camel-casing convention, with 'Annotation' at the end removed. For example, the annotation NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.

**Loading specific classes**

You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:

```ruby
# Default base class is edu.stanford.nlp.pipeline.
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>

# Here, we specify another base class.
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
puts StanfordCoreNLP::MaxentTagger.inspect
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
```

**Contributing**

Feel free to fork the project and send me a pull request!
data/lib/stanford-core-nlp/jar_loader.rb
DELETED
@@ -1,55 +0,0 @@
module StanfordCoreNLP
  class JarLoader

    require 'rjb'

    # Configuration options.
    class << self
      # An array of flags to pass to the JVM machine.
      attr_accessor :jvm_args
      attr_accessor :jar_path
      attr_accessor :log_file
    end

    # An array of string flags to supply to the JVM, e.g. ['-Xms512M', '-Xmx1024M']
    self.jvm_args = []
    # The path in which to look for Jars.
    self.jar_path = ''
    # By default, disable logging.
    self.log_file = nil

    # Load Rjb and create Java VM.
    def self.rjb_initialize
      return if ::Rjb::loaded?
      ::Rjb::load(nil, self.jvm_args)
      set_java_logging if self.log_file
    end

    # Enable logging.
    def self.log(file = 'log.txt')
      self.log_file = file
    end

    # Redirect the output of the JVM to supplied log file.
    def self.set_java_logging
      const_set(:System, Rjb::import('java.lang.System'))
      const_set(:PrintStream, Rjb::import('java.io.PrintStream'))
      const_set(:File2, Rjb::import('java.io.File'))
      ps = PrintStream.new(File2.new(self.log_file))
      ps.write(::Time.now.strftime("[%m/%d/%Y at %I:%M%p]\n\n"))
      System.setOut(ps)
      System.setErr(ps)
    end

    # Load a jar.
    def self.load(jar)
      self.rjb_initialize
      jar = self.jar_path + jar
      if !::File.readable?(jar)
        raise "Could not find JAR file (looking in #{jar})."
      end
      ::Rjb::add_jar(jar)
    end

  end
end