stanford-core-nlp 0.2.1 → 0.3.0

data/README.md ADDED
@@ -0,0 +1,139 @@
1
+ [![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)
2
+
3
+ **About**
4
+
5
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for tokenization, part-of-speech tagging, lemmatization, and parsing of several languages, as well as named entity recognition and coreference resolution in English. This gem is compatible with Ruby 1.9.2 and above.
6
+
7
+ **Installing**
8
+
9
+ First, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:
10
+
11
+ * A [minimal package for English](http://louismullie.com/treat/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
12
+ * A [full package for English](http://louismullie.com/treat/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
13
+ * A [full package for all languages](http://louismullie.com/treat/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.
14
+
15
+ Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/).
16
+
17
+ **Configuration**
18
+
19
+ After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to adjust some of the optional settings. Here are some examples:
20
+
21
+ ```ruby
22
+ # Set an alternative path to look for the JAR files
23
+ # Default is gem's bin folder.
24
+ StanfordCoreNLP.jar_path = '/path_to_jars/'
25
+
26
+ # Set an alternative path to look for the model files
27
+ # Default is gem's bin folder.
28
+ StanfordCoreNLP.model_path = '/path_to_models/'
29
+
30
+ # Pass some alternative arguments to the Java VM.
31
+ # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
32
+ # to take a coffee break).
33
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']
34
+
35
+ # Redirect VM output to log.txt
36
+ StanfordCoreNLP.log_file = 'log.txt'
37
+
38
+ # Use the model files for a different language than English.
39
+ StanfordCoreNLP.use(:french)
40
+
41
+ # Change a specific model file.
42
+ StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
43
+ ```
44
+
45
+ **Using the gem**
46
+
47
+ ```ruby
48
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
49
+ 'Berlin to discuss a new austerity package. Sarkozy ' +
50
+ 'looked pleased, but Merkel was dismayed.'
51
+
52
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
53
+ text = StanfordCoreNLP::Text.new(text)
54
+ pipeline.annotate(text)
55
+
56
+ text.get(:sentences).each do |sentence|
57
+ # Syntactic dependencies
58
+ puts sentence.get(:basic_dependencies).to_s
59
+ sentence.get(:tokens).each do |token|
60
+ # Default annotations for all tokens
61
+ puts token.get(:value).to_s
62
+ puts token.get(:original_text).to_s
63
+ puts token.get(:character_offset_begin).to_s
64
+ puts token.get(:character_offset_end).to_s
65
+ # POS returned by the tagger
66
+ puts token.get(:part_of_speech).to_s
67
+ # Lemma (base form of the token)
68
+ puts token.get(:lemma).to_s
69
+ # Named entity tag
70
+ puts token.get(:named_entity_tag).to_s
71
+ # Coreference
72
+ puts token.get(:coref_cluster_id).to_s
73
+ # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
74
+ end
75
+ end
76
+ ```
77
+
78
+ > Important: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
79
+
80
+ Good references for annotation names are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. `:named_entity_tag`) corresponding to a Java annotation class is the un-camel-cased class name with the trailing 'Annotation' removed. For example, the annotation `NamedEntityTagAnnotation` translates to `:named_entity_tag`, `PartOfSpeechAnnotation` to `:part_of_speech`, etc.
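
As a rough illustration of the naming convention described in the README, the sketch below converts a Ruby symbol into the matching Java annotation class name. The `camel_case` helper follows the same approach as the one in the 0.2.1 source shown later in this diff. `annotation_class_name` is a hypothetical name introduced here for illustration and is not part of the gem's API.

```ruby
# Under_case -> CamelCase (same approach as the gem's 0.2.1 helper).
def camel_case(text)
  text.to_s.gsub(/^[a-z]|_[a-z]/) { |a| a.upcase }.gsub('_', '')
end

# Hypothetical helper: Ruby symbol -> Java annotation class name.
def annotation_class_name(symbol)
  "#{camel_case(symbol)}Annotation"
end

puts annotation_class_name(:named_entity_tag) # prints NamedEntityTagAnnotation
puts annotation_class_name(:part_of_speech)   # prints PartOfSpeechAnnotation
```

The sketch only covers the symbol-to-class direction, which is the one needed to build the class name from a symbol passed to `get`.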
81
+
82
+ **Loading specific classes**
83
+
84
+ You may also want to load other classes from the Stanford NLP packages to do more specific tasks. The gem provides an API to do this:
85
+
86
+ ```ruby
87
+ # Default base package is edu.stanford.nlp.pipeline.
88
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
89
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
90
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
91
+
92
+ # Here, we specify another base package.
93
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
94
+ puts StanfordCoreNLP::MaxentTagger.inspect
95
+ # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
96
+ ```
97
+
98
+ **List of annotator classes**
99
+
100
+ Here is a full list of annotator classes provided by the Stanford Core NLP package. You can load these classes individually using `StanfordCoreNLP.load_class` (see above). Once this is done, you can use them like you would from a Java program. Refer to the Java documentation for a list of functions provided by each of these classes.
101
+
102
+ * PTBTokenizerAnnotator - tokenizes the text following Penn Treebank conventions.
103
+ * WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
104
+ * POSTaggerAnnotator - annotates the text with part-of-speech tags.
105
+ * MorphaAnnotator - morphological normalizer (generates lemmas).
106
+ * NERAnnotator - annotates the text with named-entity labels.
107
+ * NERCombinerAnnotator - combines several NER models.
108
+ * TrueCaseAnnotator - detects the true case of words in free text.
109
+ * ParserAnnotator - generates constituent and dependency trees.
110
+ * NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
111
+ * TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
112
+ * QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
113
+ * SRLAnnotator - annotates predicates and their semantic roles.
114
+ * DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model.
115
+ * NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
116
+
117
+ **List of model files**
118
+
119
+ Here is a full list of the default models for the Stanford Core NLP pipeline. You can change these models individually using `StanfordCoreNLP.set_model` (see above).
120
+
121
+ * 'pos.model' - 'english-left3words-distsim.tagger'
122
+ * 'ner.model.3class' - 'all.3class.distsim.crf.ser.gz'
123
+ * 'ner.model.7class' - 'muc.7class.distsim.crf.ser.gz'
124
+ * 'ner.model.MISCclass' - 'conll.4class.distsim.crf.ser.gz'
125
+ * 'parser.model' - 'englishPCFG.ser.gz'
126
+ * 'dcoref.demonym' - 'demonyms.txt'
127
+ * 'dcoref.animate' - 'animate.unigrams.txt'
128
+ * 'dcoref.female' - 'female.unigrams.txt'
129
+ * 'dcoref.inanimate' - 'inanimate.unigrams.txt'
130
+ * 'dcoref.male' - 'male.unigrams.txt'
131
+ * 'dcoref.neutral' - 'neutral.unigrams.txt'
132
+ * 'dcoref.plural' - 'plural.unigrams.txt'
133
+ * 'dcoref.singular' - 'singular.unigrams.txt'
134
+ * 'dcoref.states' - 'state-abbreviations.txt'
135
+ * 'dcoref.extra.gender' - 'namegender.combine.txt'
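
To make the link between a model property and its file more concrete, here is a minimal sketch of how a call like `set_model` might resolve a relative path: the part of the property name before the first dot selects a folder, and the file name is appended to it. The folder names are taken from the layout documented in the 0.2.1 source comments, but the mapping itself is an assumption. The real table is `Config::ModelFolders` inside the gem and may differ. `MODEL_FOLDERS` and `model_relative_path` are hypothetical names.

```ruby
# Assumed folder layout; the real mapping lives in Config::ModelFolders
# inside the gem and may differ.
MODEL_FOLDERS = {
  pos:    'taggers/',
  ner:    'classifiers/',
  parser: 'grammar/',
  dcoref: 'dcoref/'
}.freeze

# Resolve a model property (e.g. 'ner.model.3class') to a relative path
# by looking up the folder for the part before the first dot.
def model_relative_path(name, file)
  prefix = name.split('.').first.to_sym
  folder = MODEL_FOLDERS.fetch(prefix) do
    raise ArgumentError, "Unknown model prefix: #{prefix}"
  end
  folder + file
end

puts model_relative_path('pos.model', 'english-left3words-distsim.tagger')
# prints taggers/english-left3words-distsim.tagger
```

Raising on an unknown prefix is a choice made for this sketch. It surfaces a mistyped property name immediately instead of failing later when the model file cannot be found.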
136
+
137
+ **Contributing**
138
+
139
+ Feel free to fork the project and send me a pull request!
@@ -1,58 +1,62 @@
1
1
  module StanfordCoreNLP
2
2
 
3
- VERSION = '0.2.1'
3
+ VERSION = '0.3.0'
4
+
5
+ require 'bind-it'
6
+ extend BindIt::Binding
7
+
8
+ # ############################ #
9
+ # BindIt Configuration Options #
10
+ # ############################ #
11
+
12
+ # The path in which to look for the Stanford JAR files,
13
+ # with a trailing slash.
14
+ self.jar_path = File.dirname(__FILE__) + '/../bin/'
15
+
16
+ # Load the JVM with a minimum heap size of 512MB,
17
+ # and a maximum heap size of 1024MB.
18
+ self.jvm_args = ['-Xms512M', '-Xmx1024M']
19
+
20
+ # Turn logging off by default.
21
+ self.log_file = nil
22
+
23
+ # Default JAR files to load.
24
+ self.default_jars = [
25
+ 'joda-time.jar',
26
+ 'xom.jar',
27
+ 'stanford-corenlp.jar',
28
+ 'bridge.jar'
29
+ ]
30
+
31
+ # Default classes to load.
32
+ self.default_classes = [
33
+ ['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'],
34
+ ['Annotation', 'edu.stanford.nlp.pipeline', 'Text'],
35
+ ['Word', 'edu.stanford.nlp.ling'],
36
+ ['MaxentTagger', 'edu.stanford.nlp.tagger.maxent'],
37
+ ['CRFClassifier', 'edu.stanford.nlp.ie.crf'],
38
+ ['Properties', 'java.util'],
39
+ ['ArrayList', 'java.util'],
40
+ ['AnnotationBridge', '']
41
+ ]
42
+
43
+ # Default namespace is the Stanford pipeline namespace.
44
+ self.default_namespace = 'edu.stanford.nlp.pipeline'
4
45
 
5
- require 'stanford-core-nlp/jar_loader'
6
- require 'stanford-core-nlp/java_wrapper'
7
46
  require 'stanford-core-nlp/config'
8
-
47
+ require 'stanford-core-nlp/bridge'
48
+
9
49
  class << self
10
- # The path in which to look for the Stanford JAR files,
11
- # with a trailing slash.
12
- #
13
- # The structure of the JAR folder must be as follows:
14
- #
15
- # Files:
16
- #
17
- # /stanford-core-nlp.jar
18
- # /joda-time.jar
19
- # /xom.jar
20
- # /bridge.jar*
21
- #
22
- # Folders:
23
- #
24
- # /classifiers # Models for the NER system.
25
- # /dcoref # Models for the coreference resolver.
26
- # /taggers # Models for the POS tagger.
27
- # /grammar # Models for the parser.
28
- #
29
- # *The file bridge.jar is a thin JAVA wrapper over the
30
- # Stanford Core NLP get() function, which allows to
31
- # retrieve annotations using static classes as names.
32
- # This works around one of the lacunae of Rjb.
33
- attr_accessor :jar_path
34
- # The path to the main folder containing the folders
35
- # with the individual models inside. By default, this
36
- # is the same as the JAR path.
37
- attr_accessor :model_path
38
- # The flags for starting the JVM machine. The parser
39
- # and named entity recognizer are very memory consuming.
40
- attr_accessor :jvm_args
41
- # A file to redirect JVM output to.
42
- attr_accessor :log_file
43
- # The model files for a given language.
50
+ # The model file names for a given language.
44
51
  attr_accessor :model_files
52
+ # The folder in which to look for models.
53
+ attr_accessor :model_path
45
54
  end
46
-
47
- # The default JAR path is the gem's bin folder.
48
- self.jar_path = File.dirname(__FILE__) + '/../bin/'
49
- # The default model path is the same as the JAR path.
55
+
56
+ # The path to the main folder containing the folders
57
+ # with the individual models inside. By default, this
58
+ # is the same as the JAR path.
50
59
  self.model_path = self.jar_path
51
- # Load the JVM with a minimum heap size of 512MB and a
52
- # maximum heap size of 1024MB.
53
- self.jvm_args = ['-Xms512M', '-Xmx1024M']
54
- # Turn logging off by default.
55
- self.log_file = nil
56
60
 
57
61
  # Use models for a given language. Language can be
58
62
  # supplied as full-length, or ISO-639 2 or 3 letter
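
The `default_classes` entries in the hunk above appear to follow a `[class, package, alias]` shape. The sketch below shows one way such an entry could be turned into a fully qualified Java class name and the Ruby constant it is bound to. This reading is inferred from the 0.2.1 code in this diff (`const_set(:CoreNLP, ...)`, `const_set(:Text, Annotation)`, and the empty package used for `AnnotationBridge`). It is an assumption about bind-it's behavior, not a description of its implementation, and `resolve_entry` is a hypothetical name.

```ruby
DEFAULT_NAMESPACE = 'edu.stanford.nlp.pipeline'

# Turn a [class, package, alias] entry into the fully qualified Java
# class name and the Ruby constant name it would be bound to.
# A missing package falls back to the default namespace; an empty
# package means the class lives in Java's default package.
def resolve_entry(entry, default_namespace = DEFAULT_NAMESPACE)
  klass, base, name = entry
  base = default_namespace if base.nil?
  qualified = base.empty? ? klass : "#{base}.#{klass}"
  [qualified, (name || klass).to_sym]
end

p resolve_entry(['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'])
# prints ["edu.stanford.nlp.pipeline.StanfordCoreNLP", :CoreNLP]
p resolve_entry(['Word', 'edu.stanford.nlp.ling'])
# prints ["edu.stanford.nlp.ling.Word", :Word]
p resolve_entry(['AnnotationBridge', ''])
# prints ["AnnotationBridge", :AnnotationBridge]
```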
@@ -83,49 +87,20 @@ module StanfordCoreNLP
83
87
  # Use english by default.
84
88
  self.use(:english)
85
89
 
86
- # Set a model file. Here are the default models for English:
87
- #
88
- # 'pos.model' => 'english-left3words-distsim.tagger',
89
- # 'ner.model.3class' => 'all.3class.distsim.crf.ser.gz',
90
- # 'ner.model.7class' => 'muc.7class.distsim.crf.ser.gz',
91
- # 'ner.model.MISCclass' => 'conll.4class.distsim.crf.ser.gz',
92
- # 'parser.model' => 'englishPCFG.ser.gz',
93
- # 'dcoref.demonym' => 'demonyms.txt',
94
- # 'dcoref.animate' => 'animate.unigrams.txt',
95
- # 'dcoref.female' => 'female.unigrams.txt',
96
- # 'dcoref.inanimate' => 'inanimate.unigrams.txt',
97
- # 'dcoref.male' => 'male.unigrams.txt',
98
- # 'dcoref.neutral' => 'neutral.unigrams.txt',
99
- # 'dcoref.plural' => 'plural.unigrams.txt',
100
- # 'dcoref.singular' => 'singular.unigrams.txt',
101
- # 'dcoref.states' => 'state-abbreviations.txt',
102
- # 'dcoref.extra.gender' => 'namegender.combine.txt'
103
- #
90
+ # Set a model file.
104
91
  def self.set_model(name, file)
105
92
  n = name.split('.')[0].intern
106
93
  self.model_files[name] =
107
94
  Config::ModelFolders[n] + file
108
95
  end
109
96
 
110
- # Whether the classes are initialized or not.
111
- @@initialized = false
112
-
113
- # Load the JARs, create the classes.
114
- def self.init
115
- unless @@initialized
116
- self.load_jars
117
- self.load_default_classes
118
- end
119
- @@initialized = true
120
- end
121
-
122
97
  # Load a StanfordCoreNLP pipeline with the
123
98
  # specified JVM flags and StanfordCoreNLP
124
99
  # properties.
125
100
  def self.load(*annotators)
126
-
127
- self.init unless @@initialized
128
-
101
+
102
+ # Make the bindings.
103
+ self.bind
129
104
  # Prepend the JAR path to the model files.
130
105
  properties = {}
131
106
  self.model_files.each do |k,v|
@@ -135,15 +110,12 @@ module StanfordCoreNLP
135
110
  break if found
136
111
  end
137
112
  next unless found
138
-
139
113
  f = self.model_path + v
140
-
141
114
  unless File.readable?(f)
142
115
  raise "Model file #{f} could not be found. " +
143
116
  "You may need to download this file manually "+
144
117
  " and/or set paths properly."
145
118
  end
146
-
147
119
  properties[k] = f
148
120
  end
149
121
 
@@ -152,81 +124,7 @@ module StanfordCoreNLP
152
124
  CoreNLP.new(get_properties(properties))
153
125
  end
154
126
 
155
- # Once it loads a specific annotator model once,
156
- # the program always loads the same models when
157
- # you make new pipelines and request the annotator
158
- # again, ignoring the changes in models.
159
- #
160
- # This function kills the JVM and reloads everything
161
- # if you need to create a new pipeline with different
162
- # models for the same annotators.
163
- #def self.reload
164
- # raise 'Not implemented.'
165
- #end
166
-
167
- # Load the jars.
168
- def self.load_jars
169
- JarLoader.log(self.log_file)
170
- JarLoader.jvm_args = self.jvm_args
171
- JarLoader.jar_path = self.jar_path
172
- JarLoader.load('joda-time.jar')
173
- JarLoader.load('xom.jar')
174
- JarLoader.load('stanford-corenlp.jar')
175
- JarLoader.load('bridge.jar')
176
- end
177
-
178
- # Create the Ruby classes corresponding to the StanfordNLP
179
- # core classes.
180
- def self.load_default_classes
181
-
182
- const_set(:CoreNLP,
183
- Rjb::import('edu.stanford.nlp.pipeline.StanfordCoreNLP')
184
- )
185
-
186
- self.load_klass 'Annotation'
187
- self.load_klass 'Word', 'edu.stanford.nlp.ling'
188
-
189
- self.load_klass 'MaxentTagger', 'edu.stanford.nlp.tagger.maxent'
190
-
191
- self.load_klass 'CRFClassifier', 'edu.stanford.nlp.ie.crf'
192
-
193
- self.load_klass 'Properties', 'java.util'
194
- self.load_klass 'ArrayList', 'java.util'
195
-
196
- self.load_klass 'AnnotationBridge', ''
197
-
198
- const_set(:Text, Annotation)
199
-
200
- end
201
-
202
- # Load a class (e.g. PTBTokenizerAnnotator) in a specific
203
- # class path (default is 'edu.stanford.nlp.pipeline').
204
- # The class is then accessible under the StanfordCoreNLP
205
- # namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
206
- #
207
- # List of annotators:
208
- #
209
- # - PTBTokenizingAnnotator - tokenizes the text following Penn Treebank conventions.
210
- # - WordToSentenceAnnotator - splits a sequence of words into a sequence of sentences.
211
- # - POSTaggerAnnotator - annotates the text with part-of-speech tags.
212
- # - MorphaAnnotator - morphological normalizer (generates lemmas).
213
- # - NERAnnotator - annotates the text with named-entity labels.
214
- # - NERCombinerAnnotator - combines several NER models (use this instead of NERAnnotator!).
215
- # - TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text).
216
- # - ParserAnnotator - generates constituent and dependency trees.
217
- # - NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates.
218
- # - TimeWordAnnotator - recognizes common temporal expressions, such as "teatime".
219
- # - QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities.
220
- # - SRLAnnotator - annotates predicates and their semantic roles.
221
- # - CorefAnnotator - implements pronominal anaphora resolution using a statistical model (deprecated!).
222
- # - DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model (newer model, use this!).
223
- # - NFLAnnotator - implements entity and relation mention extraction for the NFL domain.
224
- def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
225
- self.init unless @@initialized
226
- self.load_klass(klass, base)
227
- end
228
-
229
- # HCreate a java.util.Properties object from a hash.
127
+ # Create a java.util.Properties object from a hash.
230
128
  def self.get_properties(properties)
231
129
  props = Properties.new
232
130
  properties.each do |property, value|
@@ -245,18 +143,4 @@ module StanfordCoreNLP
245
143
  list
246
144
  end
247
145
 
248
- # Under_case -> CamelCase.
249
- def self.camel_case(text)
250
- text.to_s.gsub(/^[a-z]|_[a-z]/) do |a|
251
- a.upcase
252
- end.gsub('_', '')
253
- end
254
-
255
- private
256
- def self.load_klass(klass, base = 'edu.stanford.nlp.pipeline')
257
- base += '.' unless base == ''
258
- const_set(klass.intern,
259
- Rjb::import("#{base}#{klass}"))
260
- end
261
-
262
146
  end
@@ -1,23 +1,9 @@
1
1
  module StanfordCoreNLP
2
2
 
3
- # Modify the Rjb JavaProxy class to add our own methods to every Java object.
3
+ # Modify the Rjb JavaProxy class to add our
4
+ # own methods to every Java object.
4
5
  Rjb::Rjb_JavaProxy.class_eval do
5
6
 
6
- # Dynamically defined on all proxied Java objects.
7
- # Shorthand for to_string defined by Java classes.
8
- def to_s; to_string; end
9
-
10
- # Dynamically defined on all proxied Java iterators.
11
- # Provide Ruby-style iterators to wrap Java iterators.
12
- def each
13
- if !java_methods.include?('iterator()')
14
- raise 'This object cannot be iterated.'
15
- else
16
- i = self.iterator
17
- while i.has_next; yield i.next; end
18
- end
19
- end
20
-
21
7
  # Dynamically defined on all proxied annotation classes.
22
8
  # Get an annotation using the annotation bridge.
23
9
  def get(annotation, anno_base = nil)
@@ -26,15 +12,19 @@ module StanfordCoreNLP
26
12
  else
27
13
  anno_class = "#{StanfordCoreNLP.camel_case(annotation)}Annotation"
28
14
  if anno_base
29
- raise "The path #{anno_base} doesn't exist." unless StanfordNLP::Config::Annotations[anno_base]
15
+ unless StanfordCoreNLP::Config::Annotations[anno_base]
16
+ raise "The path #{anno_base} doesn't exist."
17
+ end
30
18
  anno_bases = [anno_base]
31
19
  else
32
20
  anno_bases = StanfordCoreNLP::Config::AnnotationsByName[anno_class]
33
21
  raise "The annotation #{anno_class} doesn't exist." unless anno_bases
34
22
  end
35
23
  if anno_bases.size > 1
36
- msg = "There are many different annotations bearing the name #{anno_class}. "
37
- msg << "Please specify one of the following base classes as second parameter to disambiguate: "
24
+ msg = "There are many different annotations " +
25
+ "bearing the name #{anno_class}. \nPlease specify " +
26
+ "one of the following base classes as second " +
27
+ "parameter to disambiguate: "
38
28
  msg << anno_bases.join(',')
39
29
  raise msg
40
30
  else
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: stanford-core-nlp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.1
4
+ version: 0.3.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-03-06 00:00:00.000000000 Z
12
+ date: 2012-04-05 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
- name: rjb
16
- requirement: &70138662664620 !ruby/object:Gem::Requirement
15
+ name: bind-it
16
+ requirement: !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,7 +21,12 @@ dependencies:
21
21
  version: '0'
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *70138662664620
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
25
30
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
26
31
  language processing \ntools that provides tokenization, part-of-speech tagging and
27
32
  parsing for several languages, as well as named entity \nrecognition and coreference
@@ -32,12 +37,11 @@ executables: []
32
37
  extensions: []
33
38
  extra_rdoc_files: []
34
39
  files:
40
+ - lib/stanford-core-nlp/bridge.rb
35
41
  - lib/stanford-core-nlp/config.rb
36
- - lib/stanford-core-nlp/jar_loader.rb
37
- - lib/stanford-core-nlp/java_wrapper.rb
38
42
  - lib/stanford-core-nlp.rb
39
43
  - bin/bridge.jar
40
- - README.markdown
44
+ - README.md
41
45
  - LICENSE
42
46
  homepage: https://github.com/louismullie/stanford-core-nlp
43
47
  licenses: []
@@ -59,7 +63,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
59
63
  version: '0'
60
64
  requirements: []
61
65
  rubyforge_project:
62
- rubygems_version: 1.8.15
66
+ rubygems_version: 1.8.21
63
67
  signing_key:
64
68
  specification_version: 3
65
69
  summary: Ruby bindings to the Stanford Core NLP tools.
data/README.markdown DELETED
@@ -1,100 +0,0 @@
1
- [![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)
2
-
3
- **About**
4
-
5
- This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for several languages, as well as named entity recognition and coreference resolution for English. This gem is compatible with Ruby 1.9.2 and above.
6
-
7
- **Installing**
8
-
9
- Firs, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Three different packages are available:
10
-
11
- * A [minimal package for English](http://louismullie.com/stanford-core-nlp-minimal.zip) with one tagger model and one parser model for English.
12
- * A [full package for English](http://louismullie.com/stanford-core-nlp-english.zip), with all tagger and parser models for English, plus the coreference resolution and named entity recognition models.
13
- * A [full package for all languages](http://louismullie.com/stanford-core-nlp-all.zip), including tagger and parser models for English, French, German, Arabic and Chinese.
14
-
15
- Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. /usr/local/lib/ruby/gems/1.X.x/gems/stanford-core-nlp-0.x/bin/).
16
-
17
- **Configuration**
18
-
19
- After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some optional configuration options. Here are some examples:
20
-
21
- ```ruby
22
- # Set an alternative path to look for the JAR files
23
- # Default is gem's bin folder.
24
- StanfordCoreNLP.jar_path = '/path_to_jars/'
25
-
26
- # Set an alternative path to look for the model files
27
- # Default is gem's bin folder.
28
- StanfordCoreNLP.model_path = '/path_to_models/'
29
-
30
- # Pass some alternative arguments to the Java VM.
31
- # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
32
- # to take a coffee break).
33
- StanfordCoreNLP.jvm_args = ['-option1', '-option2']
34
-
35
- # Redirect VM output to log.txt
36
- StanfordCoreNLP.log_file = 'log.txt'
37
-
38
- # Use the model files for a different language than English.
39
- StanfordCoreNLP.use(:french)
40
-
41
- # Change a specific model file.
42
- StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
43
- ```
44
-
45
- **Using the gem**
46
-
47
- ```ruby
48
- text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
49
- 'Berlin to discuss a new austerity package. Sarkozy ' +
50
- 'looked pleased, but Merkel was dismayed.'
51
-
52
- pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
53
- text = StanfordCoreNLP::Text.new(text)
54
- pipeline.annotate(text)
55
-
56
- text.get(:sentences).each do |sentence|
57
- # Syntatical dependencies
58
- puts sentence.get(:basic_dependencies).to_s
59
- sentence.get(:tokens).each do |token|
60
- # Default annotations for all tokens
61
- puts token.get(:value).to_s
62
- puts token.get(:original_text).to_s
63
- puts token.get(:character_offset_begin).to_s
64
- puts token.get(:character_offset_end).to_s
65
- # POS returned by the tagger
66
- puts token.get(:part_of_speech).to_s
67
- # Lemma (base form of the token)
68
- puts token.get(:lemma).to_s
69
- # Named entity tag
70
- puts token.get(:named_entity_tag).to_s
71
- # Coreference
72
- puts token.get(:coref_cluster_id).to_s
73
- # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
74
- end
75
- end
76
- ```
77
-
78
- > Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
79
-
80
- A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'config.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class follows the simple un-camel-casing convention, with 'Annotation' at the end removed. For example, the annotation NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
81
-
82
- **Loading specific classes**
83
-
84
- You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
85
-
86
- ```ruby
87
- # Default base class is edu.stanford.nlp.pipeline.
88
- StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
89
- puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
90
- # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
91
-
92
- # Here, we specify another base class.
93
- StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
94
- puts StanfordCoreNLP::MaxentTagger.inspect
95
- # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
96
- ```
97
-
98
- **Contributing**
99
-
100
- Feel free to fork the project and send me a pull request!
@@ -1,55 +0,0 @@
1
- module StanfordCoreNLP
2
- class JarLoader
3
-
4
- require 'rjb'
5
-
6
- # Configuration options.
7
- class << self
8
- # An array of flags to pass to the JVM machine.
9
- attr_accessor :jvm_args
10
- attr_accessor :jar_path
11
- attr_accessor :log_file
12
- end
13
-
14
- # An array of string flags to supply to the JVM, e.g. ['-Xms512M', '-Xmx1024M']
15
- self.jvm_args = []
16
- # The path in which to look for Jars.
17
- self.jar_path = ''
18
- # By default, disable logging.
19
- self.log_file = nil
20
-
21
- # Load Rjb and create Java VM.
22
- def self.rjb_initialize
23
- return if ::Rjb::loaded?
24
- ::Rjb::load(nil, self.jvm_args)
25
- set_java_logging if self.log_file
26
- end
27
-
28
- # Enable logging.
29
- def self.log(file = 'log.txt')
30
- self.log_file = file
31
- end
32
-
33
- # Redirect the output of the JVM to supplied log file.
34
- def self.set_java_logging
35
- const_set(:System, Rjb::import('java.lang.System'))
36
- const_set(:PrintStream, Rjb::import('java.io.PrintStream'))
37
- const_set(:File2, Rjb::import('java.io.File'))
38
- ps = PrintStream.new(File2.new(self.log_file))
39
- ps.write(::Time.now.strftime("[%m/%d/%Y at %I:%M%p]\n\n"))
40
- System.setOut(ps)
41
- System.setErr(ps)
42
- end
43
-
44
- # Load a jar.
45
- def self.load(jar)
46
- self.rjb_initialize
47
- jar = self.jar_path + jar
48
- if !::File.readable?(jar)
49
- raise "Could not find JAR file (looking in #{jar})."
50
- end
51
- ::Rjb::add_jar(jar)
52
- end
53
-
54
- end
55
- end