stanford-core-nlp 0.2.1 → 0.3.0
- data/README.md +139 -0
- data/lib/stanford-core-nlp.rb +56 -172
- data/lib/stanford-core-nlp/{java_wrapper.rb → bridge.rb} +9 -19
- metadata +13 -9
- data/README.markdown +0 -100
- data/lib/stanford-core-nlp/jar_loader.rb +0 -55
data/README.md
ADDED
@@ -0,0 +1,139 @@
|
|
1
|
+
[](http://travis-ci.org/louismullie/stanford-core-nlp)
|
2
|
+
|
3
|
+
**About**
|
4
|
+
|
5
|
+
This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for tokenization, part-of-speech tagging, lemmatization, and parsing of several languages, as well as named entity recognition and coreference resolution in English. This gem is compatible with Ruby 1.9.2 and above.
|
6
|
+
|
7
|
+
**Installing**
|
8
|
+
|
9
|
+
First, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:
|
10
|
+
|
11
|
+
* A [minimal package for English](http://louismullie.com/treat/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
|
12
|
+
* A [full package for English](http://louismullie.com/treat/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
|
13
|
+
* A [full package for all languages](http://louismullie.com/treat/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.
|
14
|
+
|
15
|
+
Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/).
|
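Before loading a pipeline, it can help to confirm that the extracted archive actually landed in the folder the gem will search. The snippet below is a minimal sketch, not part of the gem: the helper name `missing_jars` is hypothetical, and the file names are taken from the gem's default JAR list (`joda-time.jar`, `xom.jar`, `stanford-corenlp.jar`, `bridge.jar`).

```ruby
# Sketch only: report which of the expected JAR files are missing from a folder.
# `missing_jars` and EXPECTED_JARS are hypothetical names, not part of the gem's API.
EXPECTED_JARS = %w[joda-time.jar xom.jar stanford-corenlp.jar bridge.jar].freeze

def missing_jars(dir, jars = EXPECTED_JARS)
  # Keep only the JARs that cannot be read at dir/<jar>.
  jars.reject { |jar| File.readable?(File.join(dir, jar)) }
end

# Example usage (the path is a placeholder):
# missing = missing_jars('/path_to_jars/')
# puts "Missing: #{missing.join(', ')}" unless missing.empty?
```

An empty result suggests the JAR files are in place; the model folders would still need a separate check.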
16
|
+
|
17
|
+
**Configuration**
|
18
|
+
|
19
|
+
After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:
|
20
|
+
|
21
|
+
```ruby
|
22
|
+
# Set an alternative path to look for the JAR files
|
23
|
+
# Default is gem's bin folder.
|
24
|
+
StanfordCoreNLP.jar_path = '/path_to_jars/'
|
25
|
+
|
26
|
+
# Set an alternative path to look for the model files
|
27
|
+
# Default is gem's bin folder.
|
28
|
+
StanfordCoreNLP.model_path = '/path_to_models/'
|
29
|
+
|
30
|
+
# Pass some alternative arguments to the Java VM.
|
31
|
+
# Default is ['-Xms512M', '-Xmx1024M'] (be prepared
|
32
|
+
# to take a coffee break).
|
33
|
+
StanfordCoreNLP.jvm_args = ['-option1', '-option2']
|
34
|
+
|
35
|
+
# Redirect VM output to log.txt
|
36
|
+
StanfordCoreNLP.log_file = 'log.txt'
|
37
|
+
|
38
|
+
# Use the model files for a different language than English.
|
39
|
+
StanfordCoreNLP.use(:french)
|
40
|
+
|
41
|
+
# Change a specific model file.
|
42
|
+
StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
|
43
|
+
```
|
44
|
+
|
45
|
+
**Using the gem**
|
46
|
+
|
47
|
+
```ruby
|
48
|
+
text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
|
49
|
+
'Berlin to discuss a new austerity package. Sarkozy ' +
|
50
|
+
'looked pleased, but Merkel was dismayed.'
|
51
|
+
|
52
|
+
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
|
53
|
+
text = StanfordCoreNLP::Text.new(text)
|
54
|
+
pipeline.annotate(text)
|
55
|
+
|
56
|
+
text.get(:sentences).each do |sentence|
|
57
|
+
# Syntactic dependencies
|
58
|
+
puts sentence.get(:basic_dependencies).to_s
|
59
|
+
sentence.get(:tokens).each do |token|
|
60
|
+
# Default annotations for all tokens
|
61
|
+
puts token.get(:value).to_s
|
62
|
+
puts token.get(:original_text).to_s
|
63
|
+
puts token.get(:character_offset_begin).to_s
|
64
|
+
puts token.get(:character_offset_end).to_s
|
65
|
+
# POS returned by the tagger
|
66
|
+
puts token.get(:part_of_speech).to_s
|
67
|
+
# Lemma (base form of the token)
|
68
|
+
puts token.get(:lemma).to_s
|
69
|
+
# Named entity tag
|
70
|
+
puts token.get(:named_entity_tag).to_s
|
71
|
+
# Coreference
|
72
|
+
puts token.get(:coref_cluster_id).to_s
|
73
|
+
# Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
|
74
|
+
end
|
75
|
+
end
|
76
|
+
```
|
77
|
+
|
78
|
+
> Important: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
|
79
|
+
|
80
|
+
Good references for annotation names are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. `:named_entity_tag`) corresponding to a Java annotation class is the class name converted from CamelCase to snake_case, with the trailing 'Annotation' removed. For example, the annotation `NamedEntityTagAnnotation` translates to `:named_entity_tag`, `PartOfSpeechAnnotation` to `:part_of_speech`, etc.
|
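The naming convention described above can be sketched in plain Ruby. The two helpers below (`annotation_class_name` and `annotation_symbol`) are hypothetical illustrations and not the gem's own conversion code. They assume all-lowercase symbols and class names without runs of capitals, so acronym-style names may not round-trip.

```ruby
# Sketch only: hypothetical helpers illustrating the symbol <-> class-name convention.

# :named_entity_tag -> "NamedEntityTagAnnotation"
def annotation_class_name(symbol)
  camel = symbol.to_s.split('_').map(&:capitalize).join
  "#{camel}Annotation"
end

# "PartOfSpeechAnnotation" -> :part_of_speech
def annotation_symbol(class_name)
  class_name.sub(/Annotation\z/, '')            # drop the trailing 'Annotation'
            .gsub(/([a-z\d])([A-Z])/, '\1_\2')  # insert '_' at lower-to-upper boundaries
            .downcase
            .to_sym
end

puts annotation_class_name(:character_offset_begin)
puts annotation_symbol('NamedEntityTagAnnotation').inspect
```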
81
|
+
|
82
|
+
**Loading specific classes**
|
83
|
+
|
84
|
+
You may also want to load other classes from the Stanford NLP packages to do more specific tasks. The gem provides an API to do this:
|
85
|
+
|
86
|
+
```ruby
|
87
|
+
# Default base class is edu.stanford.nlp.pipeline.
|
88
|
+
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
|
89
|
+
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
|
90
|
+
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
|
91
|
+
|
92
|
+
# Here, we specify another base class.
|
93
|
+
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
|
94
|
+
puts StanfordCoreNLP::MaxentTagger.inspect
|
95
|
+
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
|
96
|
+
```
|
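The second argument to `load_class` matters because it names the Java package the class is looked up in. As a rough sketch, assuming the class name is simply appended to the package with a dot (as version 0.2.1 of this gem did; whether the 0.3.0 binding layer does exactly the same is an assumption), the lookup name can be pictured like this. The helper `qualified_class_name` is hypothetical.

```ruby
# Sketch only: how a class name and a package could combine into a Java class name.
# `qualified_class_name` is a hypothetical helper, not part of the gem's API.
DEFAULT_NAMESPACE = 'edu.stanford.nlp.pipeline'

def qualified_class_name(klass, base = DEFAULT_NAMESPACE)
  # An empty package means the class lives in the default (unnamed) package.
  base.empty? ? klass : "#{base}.#{klass}"
end

puts qualified_class_name('PTBTokenizerAnnotator')
puts qualified_class_name('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
puts qualified_class_name('AnnotationBridge', '')
```

A package that is off by one segment produces a name the JVM cannot resolve, so it is worth checking the package against the Javadocs.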
97
|
+
|
98
|
+
**List of annotator classes**
|
99
|
+
|
100
|
+
Here is a full list of annotator classes provided by the Stanford Core NLP package. You can load these classes individually using `StanfordCoreNLP.load_class` (see above). Once this is done, you can use them like you would from a Java program. Refer to the Java documentation for a list of functions provided by each of these classes.
|
101
|
+
|
102
|
+
* PTBTokenizerAnnotator - tokenizes the text following Penn Treebank conventions.
|
103
|
+
* WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
|
104
|
+
* POSTaggerAnnotator - annotates the text with part-of-speech tags.
|
105
|
+
* MorphaAnnotator - morphological normalizer (generates lemmas).
|
106
|
+
* NERAnnotator - annotates the text with named-entity labels.
|
107
|
+
* NERCombinerAnnotator - combines several NER models.
|
108
|
+
* TrueCaseAnnotator - detects the true case of words in free text.
|
109
|
+
* ParserAnnotator - generates constituent and dependency trees.
|
110
|
+
* NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
|
111
|
+
* TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
|
112
|
+
* QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
|
113
|
+
* SRLAnnotator - annotates predicates and their semantic roles.
|
114
|
+
* DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model.
|
115
|
+
* NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
|
116
|
+
|
117
|
+
**List of model files**
|
118
|
+
|
119
|
+
Here is a full list of the default models for the Stanford Core NLP pipeline. You can change these models individually using `StanfordCoreNLP.set_model` (see above).
|
120
|
+
|
121
|
+
* 'pos.model' - 'english-left3words-distsim.tagger'
|
122
|
+
* 'ner.model.3class' - 'all.3class.distsim.crf.ser.gz'
|
123
|
+
* 'ner.model.7class' - 'muc.7class.distsim.crf.ser.gz'
|
124
|
+
* 'ner.model.MISCclass' - 'conll.4class.distsim.crf.ser.gz'
|
125
|
+
* 'parser.model' - 'englishPCFG.ser.gz'
|
126
|
+
* 'dcoref.demonym' - 'demonyms.txt'
|
127
|
+
* 'dcoref.animate' - 'animate.unigrams.txt'
|
128
|
+
* 'dcoref.female' - 'female.unigrams.txt'
|
129
|
+
* 'dcoref.inanimate' - 'inanimate.unigrams.txt'
|
130
|
+
* 'dcoref.male' - 'male.unigrams.txt'
|
131
|
+
* 'dcoref.neutral' - 'neutral.unigrams.txt'
|
132
|
+
* 'dcoref.plural' - 'plural.unigrams.txt'
|
133
|
+
* 'dcoref.singular' - 'singular.unigrams.txt'
|
134
|
+
* 'dcoref.states' - 'state-abbreviations.txt'
|
135
|
+
* 'dcoref.extra.gender' - 'namegender.combine.txt'
|
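The model names above share a prefix (`pos`, `ner`, `parser`, `dcoref`), and `set_model` uses the part before the first dot to pick the folder a model file lives in. The sketch below mirrors that lookup in plain Ruby. The folder names in `MODEL_FOLDERS` are assumptions for illustration (the real mapping lives in the gem's `Config::ModelFolders`), and `model_file_path` is a hypothetical helper.

```ruby
# Sketch only: resolve a model file to a path relative to the model folder.
# MODEL_FOLDERS is an assumed layout; the gem's real mapping is Config::ModelFolders.
MODEL_FOLDERS = {
  pos:    'taggers/',
  ner:    'classifiers/',
  parser: 'grammar/',
  dcoref: 'dcoref/'
}.freeze

def model_file_path(name, file, folders = MODEL_FOLDERS)
  # 'ner.model.3class' -> :ner
  prefix = name.split('.').first.to_sym
  folder = folders.fetch(prefix) do
    raise ArgumentError, "Unknown model family: #{prefix}"
  end
  folder + file
end

puts model_file_path('pos.model', 'english-left3words-distsim.tagger')
puts model_file_path('ner.model.3class', 'all.3class.distsim.crf.ser.gz')
```

The resulting relative path is what gets appended to `StanfordCoreNLP.model_path` when the pipeline is loaded.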
136
|
+
|
137
|
+
**Contributing**
|
138
|
+
|
139
|
+
Feel free to fork the project and send me a pull request!
|
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,58 +1,62 @@
|
|
1
1
|
module StanfordCoreNLP
|
2
2
|
|
3
|
-
VERSION = '0.
|
3
|
+
VERSION = '0.3.0'
|
4
|
+
|
5
|
+
require 'bind-it'
|
6
|
+
extend BindIt::Binding
|
7
|
+
|
8
|
+
# ############################ #
|
9
|
+
# BindIt Configuration Options #
|
10
|
+
# ############################ #
|
11
|
+
|
12
|
+
# The path in which to look for the Stanford JAR files,
|
13
|
+
# with a trailing slash.
|
14
|
+
self.jar_path = File.dirname(__FILE__) + '/../bin/'
|
15
|
+
|
16
|
+
# Load the JVM with a minimum heap size of 512MB,
|
17
|
+
# and a maximum heap size of 1024MB.
|
18
|
+
self.jvm_args = ['-Xms512M', '-Xmx1024M']
|
19
|
+
|
20
|
+
# Turn logging off by default.
|
21
|
+
self.log_file = nil
|
22
|
+
|
23
|
+
# Default JAR files to load.
|
24
|
+
self.default_jars = [
|
25
|
+
'joda-time.jar',
|
26
|
+
'xom.jar',
|
27
|
+
'stanford-corenlp.jar',
|
28
|
+
'bridge.jar'
|
29
|
+
]
|
30
|
+
|
31
|
+
# Default classes to load.
|
32
|
+
self.default_classes = [
|
33
|
+
['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'],
|
34
|
+
['Annotation', 'edu.stanford.nlp.pipeline', 'Text'],
|
35
|
+
['Word', 'edu.stanford.nlp.ling'],
|
36
|
+
['MaxentTagger', 'edu.stanford.nlp.tagger.maxent'],
|
37
|
+
['CRFClassifier', 'edu.stanford.nlp.ie.crf'],
|
38
|
+
['Properties', 'java.util'],
|
39
|
+
['ArrayList', 'java.util'],
|
40
|
+
['AnnotationBridge', '']
|
41
|
+
]
|
42
|
+
|
43
|
+
# Default namespace is the Stanford pipeline namespace.
|
44
|
+
self.default_namespace = 'edu.stanford.nlp.pipeline'
|
4
45
|
|
5
|
-
require 'stanford-core-nlp/jar_loader'
|
6
|
-
require 'stanford-core-nlp/java_wrapper'
|
7
46
|
require 'stanford-core-nlp/config'
|
8
|
-
|
47
|
+
require 'stanford-core-nlp/bridge'
|
48
|
+
|
9
49
|
class << self
|
10
|
-
# The
|
11
|
-
# with a trailing slash.
|
12
|
-
#
|
13
|
-
# The structure of the JAR folder must be as follows:
|
14
|
-
#
|
15
|
-
# Files:
|
16
|
-
#
|
17
|
-
# /stanford-core-nlp.jar
|
18
|
-
# /joda-time.jar
|
19
|
-
# /xom.jar
|
20
|
-
# /bridge.jar*
|
21
|
-
#
|
22
|
-
# Folders:
|
23
|
-
#
|
24
|
-
# /classifiers # Models for the NER system.
|
25
|
-
# /dcoref # Models for the coreference resolver.
|
26
|
-
# /taggers # Models for the POS tagger.
|
27
|
-
# /grammar # Models for the parser.
|
28
|
-
#
|
29
|
-
# *The file bridge.jar is a thin JAVA wrapper over the
|
30
|
-
# Stanford Core NLP get() function, which allows to
|
31
|
-
# retrieve annotations using static classes as names.
|
32
|
-
# This works around one of the lacunae of Rjb.
|
33
|
-
attr_accessor :jar_path
|
34
|
-
# The path to the main folder containing the folders
|
35
|
-
# with the individual models inside. By default, this
|
36
|
-
# is the same as the JAR path.
|
37
|
-
attr_accessor :model_path
|
38
|
-
# The flags for starting the JVM machine. The parser
|
39
|
-
# and named entity recognizer are very memory consuming.
|
40
|
-
attr_accessor :jvm_args
|
41
|
-
# A file to redirect JVM output to.
|
42
|
-
attr_accessor :log_file
|
43
|
-
# The model files for a given language.
|
50
|
+
# The model file names for a given language.
|
44
51
|
attr_accessor :model_files
|
52
|
+
# The folder in which to look for models.
|
53
|
+
attr_accessor :model_path
|
45
54
|
end
|
46
|
-
|
47
|
-
# The
|
48
|
-
|
49
|
-
#
|
55
|
+
|
56
|
+
# The path to the main folder containing the folders
|
57
|
+
# with the individual models inside. By default, this
|
58
|
+
# is the same as the JAR path.
|
50
59
|
self.model_path = self.jar_path
|
51
|
-
# Load the JVM with a minimum heap size of 512MB and a
|
52
|
-
# maximum heap size of 1024MB.
|
53
|
-
self.jvm_args = ['-Xms512M', '-Xmx1024M']
|
54
|
-
# Turn logging off by default.
|
55
|
-
self.log_file = nil
|
56
60
|
|
57
61
|
# Use models for a given language. Language can be
|
58
62
|
# supplied as full-length, or ISO-639 2 or 3 letter
|
@@ -83,49 +87,20 @@ module StanfordCoreNLP
|
|
83
87
|
# Use english by default.
|
84
88
|
self.use(:english)
|
85
89
|
|
86
|
-
# Set a model file.
|
87
|
-
#
|
88
|
-
# 'pos.model' => 'english-left3words-distsim.tagger',
|
89
|
-
# 'ner.model.3class' => 'all.3class.distsim.crf.ser.gz',
|
90
|
-
# 'ner.model.7class' => 'muc.7class.distsim.crf.ser.gz',
|
91
|
-
# 'ner.model.MISCclass' => 'conll.4class.distsim.crf.ser.gz',
|
92
|
-
# 'parser.model' => 'englishPCFG.ser.gz',
|
93
|
-
# 'dcoref.demonym' => 'demonyms.txt',
|
94
|
-
# 'dcoref.animate' => 'animate.unigrams.txt',
|
95
|
-
# 'dcoref.female' => 'female.unigrams.txt',
|
96
|
-
# 'dcoref.inanimate' => 'inanimate.unigrams.txt',
|
97
|
-
# 'dcoref.male' => 'male.unigrams.txt',
|
98
|
-
# 'dcoref.neutral' => 'neutral.unigrams.txt',
|
99
|
-
# 'dcoref.plural' => 'plural.unigrams.txt',
|
100
|
-
# 'dcoref.singular' => 'singular.unigrams.txt',
|
101
|
-
# 'dcoref.states' => 'state-abbreviations.txt',
|
102
|
-
# 'dcoref.extra.gender' => 'namegender.combine.txt'
|
103
|
-
#
|
90
|
+
# Set a model file.
|
104
91
|
def self.set_model(name, file)
|
105
92
|
n = name.split('.')[0].intern
|
106
93
|
self.model_files[name] =
|
107
94
|
Config::ModelFolders[n] + file
|
108
95
|
end
|
109
96
|
|
110
|
-
# Whether the classes are initialized or not.
|
111
|
-
@@initialized = false
|
112
|
-
|
113
|
-
# Load the JARs, create the classes.
|
114
|
-
def self.init
|
115
|
-
unless @@initialized
|
116
|
-
self.load_jars
|
117
|
-
self.load_default_classes
|
118
|
-
end
|
119
|
-
@@initialized = true
|
120
|
-
end
|
121
|
-
|
122
97
|
# Load a StanfordCoreNLP pipeline with the
|
123
98
|
# specified JVM flags and StanfordCoreNLP
|
124
99
|
# properties.
|
125
100
|
def self.load(*annotators)
|
126
|
-
|
127
|
-
|
128
|
-
|
101
|
+
|
102
|
+
# Make the bindings.
|
103
|
+
self.bind
|
129
104
|
# Prepend the JAR path to the model files.
|
130
105
|
properties = {}
|
131
106
|
self.model_files.each do |k,v|
|
@@ -135,15 +110,12 @@ module StanfordCoreNLP
|
|
135
110
|
break if found
|
136
111
|
end
|
137
112
|
next unless found
|
138
|
-
|
139
113
|
f = self.model_path + v
|
140
|
-
|
141
114
|
unless File.readable?(f)
|
142
115
|
raise "Model file #{f} could not be found. " +
|
143
116
|
"You may need to download this file manually "+
|
144
117
|
" and/or set paths properly."
|
145
118
|
end
|
146
|
-
|
147
119
|
properties[k] = f
|
148
120
|
end
|
149
121
|
|
@@ -152,81 +124,7 @@ module StanfordCoreNLP
|
|
152
124
|
CoreNLP.new(get_properties(properties))
|
153
125
|
end
|
154
126
|
|
155
|
-
#
|
156
|
-
# the program always loads the same models when
|
157
|
-
# you make new pipelines and request the annotator
|
158
|
-
# again, ignoring the changes in models.
|
159
|
-
#
|
160
|
-
# This function kills the JVM and reloads everything
|
161
|
-
# if you need to create a new pipeline with different
|
162
|
-
# models for the same annotators.
|
163
|
-
#def self.reload
|
164
|
-
# raise 'Not implemented.'
|
165
|
-
#end
|
166
|
-
|
167
|
-
# Load the jars.
|
168
|
-
def self.load_jars
|
169
|
-
JarLoader.log(self.log_file)
|
170
|
-
JarLoader.jvm_args = self.jvm_args
|
171
|
-
JarLoader.jar_path = self.jar_path
|
172
|
-
JarLoader.load('joda-time.jar')
|
173
|
-
JarLoader.load('xom.jar')
|
174
|
-
JarLoader.load('stanford-corenlp.jar')
|
175
|
-
JarLoader.load('bridge.jar')
|
176
|
-
end
|
177
|
-
|
178
|
-
# Create the Ruby classes corresponding to the StanfordNLP
|
179
|
-
# core classes.
|
180
|
-
def self.load_default_classes
|
181
|
-
|
182
|
-
const_set(:CoreNLP,
|
183
|
-
Rjb::import('edu.stanford.nlp.pipeline.StanfordCoreNLP')
|
184
|
-
)
|
185
|
-
|
186
|
-
self.load_klass 'Annotation'
|
187
|
-
self.load_klass 'Word', 'edu.stanford.nlp.ling'
|
188
|
-
|
189
|
-
self.load_klass 'MaxentTagger', 'edu.stanford.nlp.tagger.maxent'
|
190
|
-
|
191
|
-
self.load_klass 'CRFClassifier', 'edu.stanford.nlp.ie.crf'
|
192
|
-
|
193
|
-
self.load_klass 'Properties', 'java.util'
|
194
|
-
self.load_klass 'ArrayList', 'java.util'
|
195
|
-
|
196
|
-
self.load_klass 'AnnotationBridge', ''
|
197
|
-
|
198
|
-
const_set(:Text, Annotation)
|
199
|
-
|
200
|
-
end
|
201
|
-
|
202
|
-
# Load a class (e.g. PTBTokenizerAnnotator) in a specific
|
203
|
-
# class path (default is 'edu.stanford.nlp.pipeline').
|
204
|
-
# The class is then accessible under the StanfordCoreNLP
|
205
|
-
# namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
|
206
|
-
#
|
207
|
-
# List of annotators:
|
208
|
-
#
|
209
|
-
# - PTBTokenizingAnnotator - tokenizes the text following Penn Treebank conventions.
|
210
|
-
# - WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
|
211
|
-
# - POSTaggerAnnotator - annotates the text with part-of-speech tags.
|
212
|
-
# - MorphaAnnotator - morphological normalizer (generates lemmas).
|
213
|
-
# - NERAnnotator - annotates the text with named-entity labels.
|
214
|
-
# - NERCombinerAnnotator - combines several NER models (use this instead of NERAnnotator!).
|
215
|
-
# - TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text).
|
216
|
-
# - ParserAnnotator - generates constituent and dependency trees.
|
217
|
-
# - NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
|
218
|
-
# - TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
|
219
|
-
# - QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
|
220
|
-
# - SRLAnnotator - annotates predicates and their semantic roles.
|
221
|
-
# - CorefAnnotator - implements pronominal anaphora resolution using a statistical model (deprecated!).
|
222
|
-
# - DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model (newer model, use this!).
|
223
|
-
# - NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
|
224
|
-
def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
|
225
|
-
self.init unless @@initialized
|
226
|
-
self.load_klass(klass, base)
|
227
|
-
end
|
228
|
-
|
229
|
-
# HCreate a java.util.Properties object from a hash.
|
127
|
+
# Create a java.util.Properties object from a hash.
|
230
128
|
def self.get_properties(properties)
|
231
129
|
props = Properties.new
|
232
130
|
properties.each do |property, value|
|
@@ -245,18 +143,4 @@ module StanfordCoreNLP
|
|
245
143
|
list
|
246
144
|
end
|
247
145
|
|
248
|
-
# Under_case -> CamelCase.
|
249
|
-
def self.camel_case(text)
|
250
|
-
text.to_s.gsub(/^[a-z]|_[a-z]/) do |a|
|
251
|
-
a.upcase
|
252
|
-
end.gsub('_', '')
|
253
|
-
end
|
254
|
-
|
255
|
-
private
|
256
|
-
def self.load_klass(klass, base = 'edu.stanford.nlp.pipeline')
|
257
|
-
base += '.' unless base == ''
|
258
|
-
const_set(klass.intern,
|
259
|
-
Rjb::import("#{base}#{klass}"))
|
260
|
-
end
|
261
|
-
|
262
146
|
end
|
data/lib/stanford-core-nlp/{java_wrapper.rb → bridge.rb}
@@ -1,23 +1,9 @@
|
|
1
1
|
module StanfordCoreNLP
|
2
2
|
|
3
|
-
# Modify the Rjb JavaProxy class to add our
|
3
|
+
# Modify the Rjb JavaProxy class to add our
|
4
|
+
# own methods to every Java object.
|
4
5
|
Rjb::Rjb_JavaProxy.class_eval do
|
5
6
|
|
6
|
-
# Dynamically defined on all proxied Java objects.
|
7
|
-
# Shorthand for to_string defined by Java classes.
|
8
|
-
def to_s; to_string; end
|
9
|
-
|
10
|
-
# Dynamically defined on all proxied Java iterators.
|
11
|
-
# Provide Ruby-style iterators to wrap Java iterators.
|
12
|
-
def each
|
13
|
-
if !java_methods.include?('iterator()')
|
14
|
-
raise 'This object cannot be iterated.'
|
15
|
-
else
|
16
|
-
i = self.iterator
|
17
|
-
while i.has_next; yield i.next; end
|
18
|
-
end
|
19
|
-
end
|
20
|
-
|
21
7
|
# Dynamically defined on all proxied annotation classes.
|
22
8
|
# Get an annotation using the annotation bridge.
|
23
9
|
def get(annotation, anno_base = nil)
|
@@ -26,15 +12,19 @@ module StanfordCoreNLP
|
|
26
12
|
else
|
27
13
|
anno_class = "#{StanfordCoreNLP.camel_case(annotation)}Annotation"
|
28
14
|
if anno_base
|
29
|
-
|
15
|
+
unless StanfordCoreNLP::Config::Annotations[anno_base]
|
16
|
+
raise "The path #{anno_base} doesn't exist."
|
17
|
+
end
|
30
18
|
anno_bases = [anno_base]
|
31
19
|
else
|
32
20
|
anno_bases = StanfordCoreNLP::Config::AnnotationsByName[anno_class]
|
33
21
|
raise "The annotation #{anno_class} doesn't exist." unless anno_bases
|
34
22
|
end
|
35
23
|
if anno_bases.size > 1
|
36
|
-
msg = "There are many different annotations
|
37
|
-
|
24
|
+
msg = "There are many different annotations " +
|
25
|
+
"bearing the name #{anno_class}. \nPlease specify " +
|
26
|
+
"one of the following base classes as second " +
|
27
|
+
"parameter to disambiguate: "
|
38
28
|
msg << anno_bases.join(',')
|
39
29
|
raise msg
|
40
30
|
else
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: stanford-core-nlp
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.3.0
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-
|
12
|
+
date: 2012-04-05 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
|
-
name:
|
16
|
-
requirement:
|
15
|
+
name: bind-it
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,7 +21,12 @@ dependencies:
|
|
21
21
|
version: '0'
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements:
|
24
|
+
version_requirements: !ruby/object:Gem::Requirement
|
25
|
+
none: false
|
26
|
+
requirements:
|
27
|
+
- - ! '>='
|
28
|
+
- !ruby/object:Gem::Version
|
29
|
+
version: '0'
|
25
30
|
description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
|
26
31
|
language processing \ntools that provides tokenization, part-of-speech tagging and
|
27
32
|
parsing for several languages, as well as named entity \nrecognition and coreference
|
@@ -32,12 +37,11 @@ executables: []
|
|
32
37
|
extensions: []
|
33
38
|
extra_rdoc_files: []
|
34
39
|
files:
|
40
|
+
- lib/stanford-core-nlp/bridge.rb
|
35
41
|
- lib/stanford-core-nlp/config.rb
|
36
|
-
- lib/stanford-core-nlp/jar_loader.rb
|
37
|
-
- lib/stanford-core-nlp/java_wrapper.rb
|
38
42
|
- lib/stanford-core-nlp.rb
|
39
43
|
- bin/bridge.jar
|
40
|
-
- README.
|
44
|
+
- README.md
|
41
45
|
- LICENSE
|
42
46
|
homepage: https://github.com/louismullie/stanford-core-nlp
|
43
47
|
licenses: []
|
@@ -59,7 +63,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
59
63
|
version: '0'
|
60
64
|
requirements: []
|
61
65
|
rubyforge_project:
|
62
|
-
rubygems_version: 1.8.
|
66
|
+
rubygems_version: 1.8.21
|
63
67
|
signing_key:
|
64
68
|
specification_version: 3
|
65
69
|
summary: Ruby bindings to the Stanford Core NLP tools.
|
data/README.markdown
DELETED
@@ -1,100 +0,0 @@
|
|
1
|
-
[](http://travis-ci.org/louismullie/stanford-core-nlp)
|
2
|
-
|
3
|
-
**About**
|
4
|
-
|
5
|
-
This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for several languages, as well as named entity recognition and coreference resolution for English. This gem is compatible with Ruby 1.9.2 and above.
|
6
|
-
|
7
|
-
**Installing**
|
8
|
-
|
9
|
-
Firs, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:
|
10
|
-
|
11
|
-
* A [minimal package for English](http://louismullie.com/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
|
12
|
-
* A [full package for English](http://louismullie.com/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
|
13
|
-
* A [full package for all languages](http://louismullie.com/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.
|
14
|
-
|
15
|
-
Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. /usr/local/lib/ruby/gems/1.X.x/gems/stanford-core-nlp-0.x/bin/).
|
16
|
-
|
17
|
-
**Configuration**
|
18
|
-
|
19
|
-
After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:
|
20
|
-
|
21
|
-
```ruby
|
22
|
-
# Set an alternative path to look for the JAR files
|
23
|
-
# Default is gem's bin folder.
|
24
|
-
StanfordCoreNLP.jar_path = '/path_to_jars/'
|
25
|
-
|
26
|
-
# Set an alternative path to look for the model files
|
27
|
-
# Default is gem's bin folder.
|
28
|
-
StanfordCoreNLP.model_path = '/path_to_models/'
|
29
|
-
|
30
|
-
# Pass some alternative arguments to the Java VM.
|
31
|
-
# Default is ['-Xms512M', '-Xmx1024M'] (be prepared
|
32
|
-
# to take a coffee break).
|
33
|
-
StanfordCoreNLP.jvm_args = ['-option1', '-option2']
|
34
|
-
|
35
|
-
# Redirect VM output to log.txt
|
36
|
-
StanfordCoreNLP.log_file = 'log.txt'
|
37
|
-
|
38
|
-
# Use the model files for a different language than English.
|
39
|
-
StanfordCoreNLP.use(:french)
|
40
|
-
|
41
|
-
# Change a specific model file.
|
42
|
-
StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
|
43
|
-
```
|
44
|
-
|
45
|
-
**Using the gem**
|
46
|
-
|
47
|
-
```ruby
|
48
|
-
text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
|
49
|
-
'Berlin to discuss a new austerity package. Sarkozy ' +
|
50
|
-
'looked pleased, but Merkel was dismayed.'
|
51
|
-
|
52
|
-
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
|
53
|
-
text = StanfordCoreNLP::Text.new(text)
|
54
|
-
pipeline.annotate(text)
|
55
|
-
|
56
|
-
text.get(:sentences).each do |sentence|
|
57
|
-
# Syntatical dependencies
|
58
|
-
puts sentence.get(:basic_dependencies).to_s
|
59
|
-
sentence.get(:tokens).each do |token|
|
60
|
-
# Default annotations for all tokens
|
61
|
-
puts token.get(:value).to_s
|
62
|
-
puts token.get(:original_text).to_s
|
63
|
-
puts token.get(:character_offset_begin).to_s
|
64
|
-
puts token.get(:character_offset_end).to_s
|
65
|
-
# POS returned by the tagger
|
66
|
-
puts token.get(:part_of_speech).to_s
|
67
|
-
# Lemma (base form of the token)
|
68
|
-
puts token.get(:lemma).to_s
|
69
|
-
# Named entity tag
|
70
|
-
puts token.get(:named_entity_tag).to_s
|
71
|
-
# Coreference
|
72
|
-
puts token.get(:coref_cluster_id).to_s
|
73
|
-
# Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
|
74
|
-
end
|
75
|
-
end
|
76
|
-
```
|
77
|
-
|
78
|
-
> Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
|
79
|
-
|
80
|
-
A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class follows the simple un-camel-casing convention, with 'Annotation' at the end removed. For example, the annotation NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
|
81
|
-
|
82
|
-
**Loading specific classes**
|
83
|
-
|
84
|
-
You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
|
85
|
-
|
86
|
-
```ruby
|
87
|
-
# Default base class is edu.stanford.nlp.pipeline.
|
88
|
-
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
|
89
|
-
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
|
90
|
-
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
|
91
|
-
|
92
|
-
# Here, we specify another base class.
|
93
|
-
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
|
94
|
-
puts StanfordCoreNLP::MaxentTagger.inspect
|
95
|
-
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
|
96
|
-
```
|
97
|
-
|
98
|
-
**Contributing**
|
99
|
-
|
100
|
-
Feel free to fork the project and send me a pull request!
|
data/lib/stanford-core-nlp/jar_loader.rb
DELETED
@@ -1,55 +0,0 @@
|
|
1
|
-
module StanfordCoreNLP
|
2
|
-
class JarLoader
|
3
|
-
|
4
|
-
require 'rjb'
|
5
|
-
|
6
|
-
# Configuration options.
|
7
|
-
class << self
|
8
|
-
# An array of flags to pass to the JVM machine.
|
9
|
-
attr_accessor :jvm_args
|
10
|
-
attr_accessor :jar_path
|
11
|
-
attr_accessor :log_file
|
12
|
-
end
|
13
|
-
|
14
|
-
# An array of string flags to supply to the JVM, e.g. ['-Xms512M', '-Xmx1024M']
|
15
|
-
self.jvm_args = []
|
16
|
-
# The path in which to look for Jars.
|
17
|
-
self.jar_path = ''
|
18
|
-
# By default, disable logging.
|
19
|
-
self.log_file = nil
|
20
|
-
|
21
|
-
# Load Rjb and create Java VM.
|
22
|
-
def self.rjb_initialize
|
23
|
-
return if ::Rjb::loaded?
|
24
|
-
::Rjb::load(nil, self.jvm_args)
|
25
|
-
set_java_logging if self.log_file
|
26
|
-
end
|
27
|
-
|
28
|
-
# Enable logging.
|
29
|
-
def self.log(file = 'log.txt')
|
30
|
-
self.log_file = file
|
31
|
-
end
|
32
|
-
|
33
|
-
# Redirect the output of the JVM to supplied log file.
|
34
|
-
def self.set_java_logging
|
35
|
-
const_set(:System, Rjb::import('java.lang.System'))
|
36
|
-
const_set(:PrintStream, Rjb::import('java.io.PrintStream'))
|
37
|
-
const_set(:File2, Rjb::import('java.io.File'))
|
38
|
-
ps = PrintStream.new(File2.new(self.log_file))
|
39
|
-
ps.write(::Time.now.strftime("[%m/%d/%Y at %I:%M%p]\n\n"))
|
40
|
-
System.setOut(ps)
|
41
|
-
System.setErr(ps)
|
42
|
-
end
|
43
|
-
|
44
|
-
# Load a jar.
|
45
|
-
def self.load(jar)
|
46
|
-
self.rjb_initialize
|
47
|
-
jar = self.jar_path + jar
|
48
|
-
if !::File.readable?(jar)
|
49
|
-
raise "Could not find JAR file (looking in #{jar})."
|
50
|
-
end
|
51
|
-
::Rjb::add_jar(jar)
|
52
|
-
end
|
53
|
-
|
54
|
-
end
|
55
|
-
end
|