stanford-core-nlp 0.2.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md ADDED
@@ -0,0 +1,139 @@
1
+ [![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)
2
+
3
+ **About**
4
+
5
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for tokenization, part-of-speech tagging, lemmatization, and parsing of several languages, as well as named entity recognition and coreference resolution in English. The gem is compatible with Ruby 1.9.2 and above.
6
+
7
+ **Installing**
8
+
9
+ First, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:
10
+
11
+ * A [minimal package for English](http://louismullie.com/treat/stanford-core-nlp-minimal.zip), with one tagger model and one parser model.
12
+ * A [full package for English](http://louismullie.com/treat/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
13
+ * A [full package for all languages](http://louismullie.com/treat/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.
14
+
15
+ Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/).
16
+
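+ If you are not sure the files ended up in the right place, a quick sanity check is to confirm that the default JAR files are readable from the configured path. The snippet below is purely illustrative and not part of the gem:
+
+ ```ruby
+ require 'stanford-core-nlp'
+
+ # The gem expects these JAR files in its bin/ folder by default.
+ %w[joda-time.jar xom.jar stanford-corenlp.jar bridge.jar].each do |jar|
+   path = StanfordCoreNLP.jar_path + jar
+   puts "#{path}: #{File.readable?(path) ? 'found' : 'MISSING'}"
+ end
+ ```
+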
17
+ **Configuration**
18
+
19
+ After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:
20
+
21
+ ```ruby
22
+ # Set an alternative path to look for the JAR files
23
+ # Default is gem's bin folder.
24
+ StanfordCoreNLP.jar_path = '/path_to_jars/'
25
+
26
+ # Set an alternative path to look for the model files
27
+ # Default is gem's bin folder.
28
+ StanfordCoreNLP.model_path = '/path_to_models/'
29
+
30
+ # Pass some alternative arguments to the Java VM.
31
+ # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
32
+ # to take a coffee break).
33
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']
34
+
35
+ # Redirect VM output to log.txt
36
+ StanfordCoreNLP.log_file = 'log.txt'
37
+
38
+ # Use the model files for a different language than English.
39
+ StanfordCoreNLP.use(:french)
40
+
41
+ # Change a specific model file.
42
+ StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
43
+ ```
44
+
45
+ **Using the gem**
46
+
47
+ ```ruby
48
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
49
+ 'Berlin to discuss a new austerity package. Sarkozy ' +
50
+ 'looked pleased, but Merkel was dismayed.'
51
+
52
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
53
+ text = StanfordCoreNLP::Text.new(text)
54
+ pipeline.annotate(text)
55
+
56
+ text.get(:sentences).each do |sentence|
57
+ # Syntactic dependencies
58
+ puts sentence.get(:basic_dependencies).to_s
59
+ sentence.get(:tokens).each do |token|
60
+ # Default annotations for all tokens
61
+ puts token.get(:value).to_s
62
+ puts token.get(:original_text).to_s
63
+ puts token.get(:character_offset_begin).to_s
64
+ puts token.get(:character_offset_end).to_s
65
+ # POS returned by the tagger
66
+ puts token.get(:part_of_speech).to_s
67
+ # Lemma (base form of the token)
68
+ puts token.get(:lemma).to_s
69
+ # Named entity tag
70
+ puts token.get(:named_entity_tag).to_s
71
+ # Coreference
72
+ puts token.get(:coref_cluster_id).to_s
73
+ # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
74
+ end
75
+ end
76
+ ```
77
+
78
+ > Important: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
79
+
80
+ A good reference for annotation names is the Stanford Javadoc for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. `:named_entity_tag`) corresponding to a Java annotation class follows a simple un-camel-casing convention, with the trailing 'Annotation' removed. For example, `NamedEntityTagAnnotation` translates to `:named_entity_tag`, `PartOfSpeechAnnotation` to `:part_of_speech`, etc.
81
+
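+ As an illustration only (this helper is not part of the gem), the convention can be expressed in a few lines of Ruby:
+
+ ```ruby
+ # Convert a Ruby symbol such as :named_entity_tag into the name of the
+ # corresponding Java annotation class, 'NamedEntityTagAnnotation'.
+ def annotation_class_for(symbol)
+   symbol.to_s.split('_').map(&:capitalize).join + 'Annotation'
+ end
+
+ annotation_class_for(:named_entity_tag)  # => "NamedEntityTagAnnotation"
+ annotation_class_for(:part_of_speech)    # => "PartOfSpeechAnnotation"
+ ```
+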
82
+ **Loading specific classes**
83
+
84
+ You may also want to load your own classes from the Stanford Core NLP package to perform more specific tasks. The gem provides an API for this:
85
+
86
+ ```ruby
87
+ # Default base class is edu.stanford.nlp.pipeline.
88
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
89
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
90
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
91
+
92
+ # Here, we specify another base class.
93
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
94
+ puts StanfordCoreNLP::MaxentTagger.inspect
95
+ # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
96
+ ```
97
+
98
+ **List of annotator classes**
99
+
100
+ Here is a full list of the annotator classes provided by the Stanford Core NLP package. You can load these classes individually using `StanfordCoreNLP.load_class` (see above); once loaded, you can use them as you would from a Java program (see the sketch after this list). Refer to the Stanford Javadocs for the methods provided by each of these classes.
101
+
102
+ * PTBTokenizerAnnotator - tokenizes the text following Penn Treebank conventions.
103
+ * WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
104
+ * POSTaggerAnnotator - annotates the text with part-of-speech tags.
105
+ * MorphaAnnotator - morphological normalizer (generates lemmas).
106
+ * NERAnnotator - annotates the text with named-entity labels.
107
+ * NERCombinerAnnotator - combines several NER models.
108
+ * TrueCaseAnnotator - detects the true case of words in free text.
109
+ * ParserAnnotator - generates constituent and dependency trees.
110
+ * NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
111
+ * TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
112
+ * QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
113
+ * SRLAnnotator - annotates predicates and their semantic roles.
114
+ * DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model.
115
+ * NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
116
+
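+ For example, a loaded class can be instantiated and called directly. The sketch below assumes that the tagger model sits under a 'taggers/' subfolder of the model path, that the model file is named 'english-left3words-distsim.tagger', and that the bundled MaxentTagger exposes the usual constructor and `tagString` method; adjust these to your setup.
+
+ ```ruby
+ require 'stanford-core-nlp'
+
+ # MaxentTagger lives in the edu.stanford.nlp.tagger.maxent package.
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
+
+ # Hypothetical model location; point this at your own tagger model.
+ model = StanfordCoreNLP.model_path + 'taggers/english-left3words-distsim.tagger'
+
+ tagger = StanfordCoreNLP::MaxentTagger.new(model)
+ puts tagger.tagString('The quick brown fox jumps over the lazy dog.')
+ ```
+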
117
+ **List of model files**
118
+
119
+ Here is a full list of the default models for the Stanford Core NLP pipeline. You can change these models individually using `StanfordCoreNLP.set_model` (see above); an example follows the list.
120
+
121
+ * 'pos.model' - 'english-left3words-distsim.tagger'
122
+ * 'ner.model.3class' - 'all.3class.distsim.crf.ser.gz'
123
+ * 'ner.model.7class' - 'muc.7class.distsim.crf.ser.gz'
124
+ * 'ner.model.MISCclass' - 'conll.4class.distsim.crf.ser.gz'
125
+ * 'parser.model' - 'englishPCFG.ser.gz'
126
+ * 'dcoref.demonym' - 'demonyms.txt'
127
+ * 'dcoref.animate' - 'animate.unigrams.txt'
128
+ * 'dcoref.female' - 'female.unigrams.txt'
129
+ * 'dcoref.inanimate' - 'inanimate.unigrams.txt'
130
+ * 'dcoref.male' - 'male.unigrams.txt'
131
+ * 'dcoref.neutral' - 'neutral.unigrams.txt'
132
+ * 'dcoref.plural' - 'plural.unigrams.txt'
133
+ * 'dcoref.singular' - 'singular.unigrams.txt'
134
+ * 'dcoref.states' - 'state-abbreviations.txt'
135
+ * 'dcoref.extra.gender' - 'namegender.combine.txt'
136
+
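+ For example, to swap in an alternative parser model before building a pipeline (a sketch; it assumes an alternative model file, here 'englishFactored.ser.gz', is actually present in your model folder):
+
+ ```ruby
+ require 'stanford-core-nlp'
+
+ StanfordCoreNLP.use(:english)
+ # Replace the default 'englishPCFG.ser.gz' parser model.
+ StanfordCoreNLP.set_model('parser.model', 'englishFactored.ser.gz')
+
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :parse)
+ ```
+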
137
+ **Contributing**
138
+
139
+ Feel free to fork the project and send me a pull request!
@@ -1,58 +1,62 @@
1
1
  module StanfordCoreNLP
2
2
 
3
- VERSION = '0.2.1'
3
+ VERSION = '0.3.0'
4
+
5
+ require 'bind-it'
6
+ extend BindIt::Binding
7
+
8
+ # ############################ #
9
+ # BindIt Configuration Options #
10
+ # ############################ #
11
+
12
+ # The path in which to look for the Stanford JAR files,
13
+ # with a trailing slash.
14
+ self.jar_path = File.dirname(__FILE__) + '/../bin/'
15
+
16
+ # Load the JVM with a minimum heap size of 512MB,
17
+ # and a maximum heap size of 1024MB.
18
+ self.jvm_args = ['-Xms512M', '-Xmx1024M']
19
+
20
+ # Turn logging off by default.
21
+ self.log_file = nil
22
+
23
+ # Default JAR files to load.
24
+ self.default_jars = [
25
+ 'joda-time.jar',
26
+ 'xom.jar',
27
+ 'stanford-corenlp.jar',
28
+ 'bridge.jar'
29
+ ]
30
+
31
+ # Default classes to load.
32
+ self.default_classes = [
33
+ ['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'],
34
+ ['Annotation', 'edu.stanford.nlp.pipeline', 'Text'],
35
+ ['Word', 'edu.stanford.nlp.ling'],
36
+ ['MaxentTagger', 'edu.stanford.nlp.tagger.maxent'],
37
+ ['CRFClassifier', 'edu.stanford.nlp.ie.crf'],
38
+ ['Properties', 'java.util'],
39
+ ['ArrayList', 'java.util'],
40
+ ['AnnotationBridge', '']
41
+ ]
42
+
43
+ # Default namespace is the Stanford pipeline namespace.
44
+ self.default_namespace = 'edu.stanford.nlp.pipeline'
4
45
 
5
- require 'stanford-core-nlp/jar_loader'
6
- require 'stanford-core-nlp/java_wrapper'
7
46
  require 'stanford-core-nlp/config'
8
-
47
+ require 'stanford-core-nlp/bridge'
48
+
9
49
  class << self
10
- # The path in which to look for the Stanford JAR files,
11
- # with a trailing slash.
12
- #
13
- # The structure of the JAR folder must be as follows:
14
- #
15
- # Files:
16
- #
17
- # /stanford-core-nlp.jar
18
- # /joda-time.jar
19
- # /xom.jar
20
- # /bridge.jar*
21
- #
22
- # Folders:
23
- #
24
- # /classifiers # Models for the NER system.
25
- # /dcoref # Models for the coreference resolver.
26
- # /taggers # Models for the POS tagger.
27
- # /grammar # Models for the parser.
28
- #
29
- # *The file bridge.jar is a thin JAVA wrapper over the
30
- # Stanford Core NLP get() function, which allows to
31
- # retrieve annotations using static classes as names.
32
- # This works around one of the lacunae of Rjb.
33
- attr_accessor :jar_path
34
- # The path to the main folder containing the folders
35
- # with the individual models inside. By default, this
36
- # is the same as the JAR path.
37
- attr_accessor :model_path
38
- # The flags for starting the JVM machine. The parser
39
- # and named entity recognizer are very memory consuming.
40
- attr_accessor :jvm_args
41
- # A file to redirect JVM output to.
42
- attr_accessor :log_file
43
- # The model files for a given language.
50
+ # The model file names for a given language.
44
51
  attr_accessor :model_files
52
+ # The folder in which to look for models.
53
+ attr_accessor :model_path
45
54
  end
46
-
47
- # The default JAR path is the gem's bin folder.
48
- self.jar_path = File.dirname(__FILE__) + '/../bin/'
49
- # The default model path is the same as the JAR path.
55
+
56
+ # The path to the main folder containing the folders
57
+ # with the individual models inside. By default, this
58
+ # is the same as the JAR path.
50
59
  self.model_path = self.jar_path
51
- # Load the JVM with a minimum heap size of 512MB and a
52
- # maximum heap size of 1024MB.
53
- self.jvm_args = ['-Xms512M', '-Xmx1024M']
54
- # Turn logging off by default.
55
- self.log_file = nil
56
60
 
57
61
  # Use models for a given language. Language can be
58
62
  # supplied as full-length, or ISO-639 2 or 3 letter
@@ -83,49 +87,20 @@ module StanfordCoreNLP
83
87
  # Use english by default.
84
88
  self.use(:english)
85
89
 
86
- # Set a model file. Here are the default models for English:
87
- #
88
- # 'pos.model' => 'english-left3words-distsim.tagger',
89
- # 'ner.model.3class' => 'all.3class.distsim.crf.ser.gz',
90
- # 'ner.model.7class' => 'muc.7class.distsim.crf.ser.gz',
91
- # 'ner.model.MISCclass' => 'conll.4class.distsim.crf.ser.gz',
92
- # 'parser.model' => 'englishPCFG.ser.gz',
93
- # 'dcoref.demonym' => 'demonyms.txt',
94
- # 'dcoref.animate' => 'animate.unigrams.txt',
95
- # 'dcoref.female' => 'female.unigrams.txt',
96
- # 'dcoref.inanimate' => 'inanimate.unigrams.txt',
97
- # 'dcoref.male' => 'male.unigrams.txt',
98
- # 'dcoref.neutral' => 'neutral.unigrams.txt',
99
- # 'dcoref.plural' => 'plural.unigrams.txt',
100
- # 'dcoref.singular' => 'singular.unigrams.txt',
101
- # 'dcoref.states' => 'state-abbreviations.txt',
102
- # 'dcoref.extra.gender' => 'namegender.combine.txt'
103
- #
90
+ # Set a model file.
104
91
  def self.set_model(name, file)
105
92
  n = name.split('.')[0].intern
106
93
  self.model_files[name] =
107
94
  Config::ModelFolders[n] + file
108
95
  end
109
96
 
110
- # Whether the classes are initialized or not.
111
- @@initialized = false
112
-
113
- # Load the JARs, create the classes.
114
- def self.init
115
- unless @@initialized
116
- self.load_jars
117
- self.load_default_classes
118
- end
119
- @@initialized = true
120
- end
121
-
122
97
  # Load a StanfordCoreNLP pipeline with the
123
98
  # specified JVM flags and StanfordCoreNLP
124
99
  # properties.
125
100
  def self.load(*annotators)
126
-
127
- self.init unless @@initialized
128
-
101
+
102
+ # Make the bindings.
103
+ self.bind
129
104
  # Prepend the JAR path to the model files.
130
105
  properties = {}
131
106
  self.model_files.each do |k,v|
@@ -135,15 +110,12 @@ module StanfordCoreNLP
135
110
  break if found
136
111
  end
137
112
  next unless found
138
-
139
113
  f = self.model_path + v
140
-
141
114
  unless File.readable?(f)
142
115
  raise "Model file #{f} could not be found. " +
143
116
  "You may need to download this file manually "+
144
117
  " and/or set paths properly."
145
118
  end
146
-
147
119
  properties[k] = f
148
120
  end
149
121
 
@@ -152,81 +124,7 @@ module StanfordCoreNLP
152
124
  CoreNLP.new(get_properties(properties))
153
125
  end
154
126
 
155
- # Once it loads a specific annotator model once,
156
- # the program always loads the same models when
157
- # you make new pipelines and request the annotator
158
- # again, ignoring the changes in models.
159
- #
160
- # This function kills the JVM and reloads everything
161
- # if you need to create a new pipeline with different
162
- # models for the same annotators.
163
- #def self.reload
164
- # raise 'Not implemented.'
165
- #end
166
-
167
- # Load the jars.
168
- def self.load_jars
169
- JarLoader.log(self.log_file)
170
- JarLoader.jvm_args = self.jvm_args
171
- JarLoader.jar_path = self.jar_path
172
- JarLoader.load('joda-time.jar')
173
- JarLoader.load('xom.jar')
174
- JarLoader.load('stanford-corenlp.jar')
175
- JarLoader.load('bridge.jar')
176
- end
177
-
178
- # Create the Ruby classes corresponding to the StanfordNLP
179
- # core classes.
180
- def self.load_default_classes
181
-
182
- const_set(:CoreNLP,
183
- Rjb::import('edu.stanford.nlp.pipeline.StanfordCoreNLP')
184
- )
185
-
186
- self.load_klass 'Annotation'
187
- self.load_klass 'Word', 'edu.stanford.nlp.ling'
188
-
189
- self.load_klass 'MaxentTagger', 'edu.stanford.nlp.tagger.maxent'
190
-
191
- self.load_klass 'CRFClassifier', 'edu.stanford.nlp.ie.crf'
192
-
193
- self.load_klass 'Properties', 'java.util'
194
- self.load_klass 'ArrayList', 'java.util'
195
-
196
- self.load_klass 'AnnotationBridge', ''
197
-
198
- const_set(:Text, Annotation)
199
-
200
- end
201
-
202
- # Load a class (e.g. PTBTokenizerAnnotator) in a specific
203
- # class path (default is 'edu.stanford.nlp.pipeline').
204
- # The class is then accessible under the StanfordCoreNLP
205
- # namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
206
- #
207
- # List of annotators:
208
- #
209
- # - PTBTokenizingAnnotator - tokenizes the text following Penn Treebank conventions.
210
- # - WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
211
- # - POSTaggerAnnotator - annotates the text with part-of-speech tags.
212
- # - MorphaAnnotator - morphological normalizer (generates lemmas).
213
- # - NERAnnotator - annotates the text with named-entity labels.
214
- # - NERCombinerAnnotator - combines several NER models (use this instead of NERAnnotator!).
215
- # - TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text).
216
- # - ParserAnnotator - generates constituent and dependency trees.
217
- # - NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
218
- # - TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
219
- # - QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
220
- # - SRLAnnotator - annotates predicates and their semantic roles.
221
- # - CorefAnnotator - implements pronominal anaphora resolution using a statistical model (deprecated!).
222
- # - DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model (newer model, use this!).
223
- # - NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
224
- def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
225
- self.init unless @@initialized
226
- self.load_klass(klass, base)
227
- end
228
-
229
- # HCreate a java.util.Properties object from a hash.
127
+ # Create a java.util.Properties object from a hash.
230
128
  def self.get_properties(properties)
231
129
  props = Properties.new
232
130
  properties.each do |property, value|
@@ -245,18 +143,4 @@ module StanfordCoreNLP
245
143
  list
246
144
  end
247
145
 
248
- # Under_case -> CamelCase.
249
- def self.camel_case(text)
250
- text.to_s.gsub(/^[a-z]|_[a-z]/) do |a|
251
- a.upcase
252
- end.gsub('_', '')
253
- end
254
-
255
- private
256
- def self.load_klass(klass, base = 'edu.stanford.nlp.pipeline')
257
- base += '.' unless base == ''
258
- const_set(klass.intern,
259
- Rjb::import("#{base}#{klass}"))
260
- end
261
-
262
146
  end
@@ -1,23 +1,9 @@
1
1
  module StanfordCoreNLP
2
2
 
3
- # Modify the Rjb JavaProxy class to add our own methods to every Java object.
3
+ # Modify the Rjb JavaProxy class to add our
4
+ # own methods to every Java object.
4
5
  Rjb::Rjb_JavaProxy.class_eval do
5
6
 
6
- # Dynamically defined on all proxied Java objects.
7
- # Shorthand for to_string defined by Java classes.
8
- def to_s; to_string; end
9
-
10
- # Dynamically defined on all proxied Java iterators.
11
- # Provide Ruby-style iterators to wrap Java iterators.
12
- def each
13
- if !java_methods.include?('iterator()')
14
- raise 'This object cannot be iterated.'
15
- else
16
- i = self.iterator
17
- while i.has_next; yield i.next; end
18
- end
19
- end
20
-
21
7
  # Dynamically defined on all proxied annotation classes.
22
8
  # Get an annotation using the annotation bridge.
23
9
  def get(annotation, anno_base = nil)
@@ -26,15 +12,19 @@ module StanfordCoreNLP
26
12
  else
27
13
  anno_class = "#{StanfordCoreNLP.camel_case(annotation)}Annotation"
28
14
  if anno_base
29
- raise "The path #{anno_base} doesn't exist." unless StanfordNLP::Config::Annotations[anno_base]
15
+ unless StanfordNLP::Config::Annotations[anno_base]
16
+ raise "The path #{anno_base} doesn't exist."
17
+ end
30
18
  anno_bases = [anno_base]
31
19
  else
32
20
  anno_bases = StanfordCoreNLP::Config::AnnotationsByName[anno_class]
33
21
  raise "The annotation #{anno_class} doesn't exist." unless anno_bases
34
22
  end
35
23
  if anno_bases.size > 1
36
- msg = "There are many different annotations bearing the name #{anno_class}. "
37
- msg << "Please specify one of the following base classes as second parameter to disambiguate: "
24
+ msg = "There are many different annotations " +
25
+ "bearing the name #{anno_class}. \nPlease specify " +
26
+ "one of the following base classes as second " +
27
+ "parameter to disambiguate: "
38
28
  msg << anno_bases.join(',')
39
29
  raise msg
40
30
  else
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: stanford-core-nlp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.1
4
+ version: 0.3.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-03-06 00:00:00.000000000 Z
12
+ date: 2012-04-05 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
- name: rjb
16
- requirement: &70138662664620 !ruby/object:Gem::Requirement
15
+ name: bind-it
16
+ requirement: !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,7 +21,12 @@ dependencies:
21
21
  version: '0'
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *70138662664620
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
25
30
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
26
31
  language processing \ntools that provides tokenization, part-of-speech tagging and
27
32
  parsing for several languages, as well as named entity \nrecognition and coreference
@@ -32,12 +37,11 @@ executables: []
32
37
  extensions: []
33
38
  extra_rdoc_files: []
34
39
  files:
40
+ - lib/stanford-core-nlp/bridge.rb
35
41
  - lib/stanford-core-nlp/config.rb
36
- - lib/stanford-core-nlp/jar_loader.rb
37
- - lib/stanford-core-nlp/java_wrapper.rb
38
42
  - lib/stanford-core-nlp.rb
39
43
  - bin/bridge.jar
40
- - README.markdown
44
+ - README.md
41
45
  - LICENSE
42
46
  homepage: https://github.com/louismullie/stanford-core-nlp
43
47
  licenses: []
@@ -59,7 +63,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
59
63
  version: '0'
60
64
  requirements: []
61
65
  rubyforge_project:
62
- rubygems_version: 1.8.15
66
+ rubygems_version: 1.8.21
63
67
  signing_key:
64
68
  specification_version: 3
65
69
  summary: Ruby bindings to the Stanford Core NLP tools.
data/README.markdown DELETED
@@ -1,100 +0,0 @@
1
- [![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)
2
-
3
- **About**
4
-
5
- This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for several languages, as well as named entity recognition and coreference resolution for English. This gem is compatible with Ruby 1.9.2 and above.
6
-
7
- **Installing**
8
-
9
- Firs, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:
10
-
11
- * A [minimal package for English](http://louismullie.com/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
12
- * A [full package for English](http://louismullie.com/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
13
- * A [full package for all languages](http://louismullie.com/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.
14
-
15
- Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. /usr/local/lib/ruby/gems/1.X.x/gems/stanford-core-nlp-0.x/bin/).
16
-
17
- **Configuration**
18
-
19
- After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:
20
-
21
- ```ruby
22
- # Set an alternative path to look for the JAR files
23
- # Default is gem's bin folder.
24
- StanfordCoreNLP.jar_path = '/path_to_jars/'
25
-
26
- # Set an alternative path to look for the model files
27
- # Default is gem's bin folder.
28
- StanfordCoreNLP.model_path = '/path_to_models/'
29
-
30
- # Pass some alternative arguments to the Java VM.
31
- # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
32
- # to take a coffee break).
33
- StanfordCoreNLP.jvm_args = ['-option1', '-option2']
34
-
35
- # Redirect VM output to log.txt
36
- StanfordCoreNLP.log_file = 'log.txt'
37
-
38
- # Use the model files for a different language than English.
39
- StanfordCoreNLP.use(:french)
40
-
41
- # Change a specific model file.
42
- StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
43
- ```
44
-
45
- **Using the gem**
46
-
47
- ```ruby
48
- text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
49
- 'Berlin to discuss a new austerity package. Sarkozy ' +
50
- 'looked pleased, but Merkel was dismayed.'
51
-
52
- pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
53
- text = StanfordCoreNLP::Text.new(text)
54
- pipeline.annotate(text)
55
-
56
- text.get(:sentences).each do |sentence|
57
- # Syntatical dependencies
58
- puts sentence.get(:basic_dependencies).to_s
59
- sentence.get(:tokens).each do |token|
60
- # Default annotations for all tokens
61
- puts token.get(:value).to_s
62
- puts token.get(:original_text).to_s
63
- puts token.get(:character_offset_begin).to_s
64
- puts token.get(:character_offset_end).to_s
65
- # POS returned by the tagger
66
- puts token.get(:part_of_speech).to_s
67
- # Lemma (base form of the token)
68
- puts token.get(:lemma).to_s
69
- # Named entity tag
70
- puts token.get(:named_entity_tag).to_s
71
- # Coreference
72
- puts token.get(:coref_cluster_id).to_s
73
- # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
74
- end
75
- end
76
- ```
77
-
78
- > Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
79
-
80
- A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class follows the simple un-camel-casing convention, with 'Annotation' at the end removed. For example, the annotation NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
81
-
82
- **Loading specific classes**
83
-
84
- You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
85
-
86
- ```ruby
87
- # Default base class is edu.stanford.nlp.pipeline.
88
- StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
89
- puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
90
- # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
91
-
92
- # Here, we specify another base class.
93
- StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
94
- puts StanfordCoreNLP::MaxentTagger.inspect
95
- # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
96
- ```
97
-
98
- **Contributing**
99
-
100
- Feel free to fork the project and send me a pull request!
@@ -1,55 +0,0 @@
1
- module StanfordCoreNLP
2
- class JarLoader
3
-
4
- require 'rjb'
5
-
6
- # Configuration options.
7
- class << self
8
- # An array of flags to pass to the JVM machine.
9
- attr_accessor :jvm_args
10
- attr_accessor :jar_path
11
- attr_accessor :log_file
12
- end
13
-
14
- # An array of string flags to supply to the JVM, e.g. ['-Xms512M', '-Xmx1024M']
15
- self.jvm_args = []
16
- # The path in which to look for Jars.
17
- self.jar_path = ''
18
- # By default, disable logging.
19
- self.log_file = nil
20
-
21
- # Load Rjb and create Java VM.
22
- def self.rjb_initialize
23
- return if ::Rjb::loaded?
24
- ::Rjb::load(nil, self.jvm_args)
25
- set_java_logging if self.log_file
26
- end
27
-
28
- # Enable logging.
29
- def self.log(file = 'log.txt')
30
- self.log_file = file
31
- end
32
-
33
- # Redirect the output of the JVM to supplied log file.
34
- def self.set_java_logging
35
- const_set(:System, Rjb::import('java.lang.System'))
36
- const_set(:PrintStream, Rjb::import('java.io.PrintStream'))
37
- const_set(:File2, Rjb::import('java.io.File'))
38
- ps = PrintStream.new(File2.new(self.log_file))
39
- ps.write(::Time.now.strftime("[%m/%d/%Y at %I:%M%p]\n\n"))
40
- System.setOut(ps)
41
- System.setErr(ps)
42
- end
43
-
44
- # Load a jar.
45
- def self.load(jar)
46
- self.rjb_initialize
47
- jar = self.jar_path + jar
48
- if !::File.readable?(jar)
49
- raise "Could not find JAR file (looking in #{jar})."
50
- end
51
- ::Rjb::add_jar(jar)
52
- end
53
-
54
- end
55
- end