stanford-core-nlp 0.1.5 → 0.1.7

@@ -1,6 +1,6 @@
  **About**
 
- This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that features tokenization, part-of-speech tagging, lemmatization, and parsing for five languages (English, French, German, Arabic and Chinese), as well as named entity recognition and coreference resolution for English.
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for five languages (English, French, German, Arabic and Chinese), as well as named entity recognition and coreference resolution for English.
 
  **Installing**
 
@@ -12,51 +12,60 @@ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](ht
 
  After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some configuration options (this, however, is not necessary). Here are some examples:
 
- # Set an alternative path to look for the JAR files
- # Default is gem's bin folder.
- StanfordCoreNLP.jar_path = '/path/'
+ ```ruby
+ # Set an alternative path to look for the JAR files
+ # Default is gem's bin folder.
+ StanfordCoreNLP.jar_path = '/path_to_jars/'
 
- # Pass some alternative arguments to the Java VM.
- # Default is ['-Xms512M', '-Xmx1024M'].
- StanfordCoreNLP.jvm_args = ['-option1', '-option2']
+ # Set an alternative path to look for the model files
+ # Default is gem's bin folder.
+ StanfordCoreNLP.model_path = '/path_to_models/'
 
- # Redirect VM output to log.txt
- StanfordCoreNLP.log_file = 'log.txt'
+ # Pass some alternative arguments to the Java VM.
+ # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
+ # to take a coffee break).
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']
 
- # Use the model files for a different language than English.
- StanfordCoreNLP.use(:french)
+ # Redirect VM output to log.txt
+ StanfordCoreNLP.log_file = 'log.txt'
+
+ # Use the model files for a different language than English.
+ StanfordCoreNLP.use(:french)
+
+ # Change a specific model file.
+ StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
+ ```
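
Going by the `self.model_path = self.jar_path` default added in this release, the model path appears to be copied from the JAR path once, when the gem is loaded, so a later change to `jar_path` would not carry over to it. The sketch below reproduces that assignment order with a stand-in object (`default_settings` and the `'/gem/bin/'` default are hypothetical and not part of the gem) to show why both options may need to be set.

```ruby
# Build a stand-in for the gem's settings. The Struct and the
# '/gem/bin/' default are illustrative assumptions, not the gem's code.
def default_settings
  settings = Struct.new(:jar_path, :model_path).new
  settings.jar_path = '/gem/bin/'
  # Same order as in the diff: model_path copies jar_path once.
  settings.model_path = settings.jar_path
  settings
end

settings = default_settings

# Changing jar_path afterwards leaves model_path at the old default.
settings.jar_path = '/path_to_jars/'
puts settings.jar_path
puts settings.model_path

# Setting model_path as well points both options at the new locations.
settings.model_path = '/path_to_models/'
puts settings.model_path
```

If the gem behaves like this sketch, setting `model_path` explicitly whenever `jar_path` is changed avoids relying on the load-time default.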
 
- # Change a specific model file.
- StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
-
  **Using the gem**
 
- text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
-   'Berlin to discuss a new austerity package. Sarkozy ' +
-   'looked pleased, but Merkel was dismayed.'
-
- pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
- text = StanfordCoreNLP::Text.new(text)
- pipeline.annotate(text)
-
- text.get(:sentences).each do |sentence|
-   sentence.get(:tokens).each do |token|
-     # Default annotations for all tokens
-     puts token.get(:value).to_s
-     puts token.get(:original_text).to_s
-     puts token.get(:character_offset_begin).to_s
-     puts token.get(:character_offset_end).to_s
-     # POS returned by the tagger
-     puts token.get(:part_of_speech).to_s
-     # Lemma (base form of the token)
-     puts token.get(:lemma).to_s
-     # Named entity tag
-     puts token.get(:named_entity_tag).to_s
-     # Coreference
-     puts token.get(:coref_cluster_id).to_s
-     # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
-   end
- end
+ ```ruby
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
+   'Berlin to discuss a new austerity package. Sarkozy ' +
+   'looked pleased, but Merkel was dismayed.'
+
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
+ text = StanfordCoreNLP::Text.new(text)
+ pipeline.annotate(text)
+
+ text.get(:sentences).each do |sentence|
+   sentence.get(:tokens).each do |token|
+     # Default annotations for all tokens
+     puts token.get(:value).to_s
+     puts token.get(:original_text).to_s
+     puts token.get(:character_offset_begin).to_s
+     puts token.get(:character_offset_end).to_s
+     # POS returned by the tagger
+     puts token.get(:part_of_speech).to_s
+     # Lemma (base form of the token)
+     puts token.get(:lemma).to_s
+     # Named entity tag
+     puts token.get(:named_entity_tag).to_s
+     # Coreference
+     puts token.get(:coref_cluster_id).to_s
+     # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
+   end
+ end
+ ```
 
  > Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
 
@@ -66,19 +75,17 @@ A good reference for names of annotations are the Stanford Javadocs for [CoreAnn
 
  You may also want to load other classes from the Stanford NLP package to do more specific tasks. The gem provides an API to do this:
 
- # Default base class is edu.stanford.nlp.pipeline.
- StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
- puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
- # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
-
- # Here, we specify another base class.
- StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
- puts StanfordCoreNLP::MaxentTagger.inspect
- # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
-
- **Current known issues**
-
- The models included with the gem for the NER system are missing two files: "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I will be grateful if somebody could add/e-mail me these files.
+ ```ruby
+ # Default base class is edu.stanford.nlp.pipeline.
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
+
+ # Here, we specify another base class.
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
+ puts StanfordCoreNLP::MaxentTagger.inspect
+ # => #<Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
+ ```
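
The `inspect` output above suggests that the class name is appended to a base package to form a fully qualified Java name, which then shows up with underscores in the Rjb proxy name. The following is only a sketch of that naming, assuming a plain string join; `qualified_java_name` and `proxy_style_name` are hypothetical helpers and do not call Rjb or the gem.

```ruby
# Hypothetical helpers, not part of the gem or of Rjb.

# Join a base package and a class name into a fully qualified Java name.
# The default base mirrors the one named in the README comment.
def qualified_java_name(klass, base = 'edu.stanford.nlp.pipeline')
  "#{base}.#{klass}"
end

# Imitate the naming seen in the inspect output: dots become
# underscores and the first letter is capitalized.
def proxy_style_name(qualified)
  flat = qualified.tr('.', '_')
  flat[0].upcase + flat[1..-1]
end

name = qualified_java_name('PTBTokenizerAnnotator')
puts name
puts proxy_style_name(name)
puts qualified_java_name('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
```

Read this way, the second argument has to be the package that actually contains the class, which for `MaxentTagger` is `edu.stanford.nlp.tagger.maxent`.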
 
  **Contributing**
 
@@ -1,6 +1,7 @@
  module StanfordCoreNLP
 
-   VERSION = '0.1.5'
+   VERSION = '0.1.7'
+
    require 'stanford-core-nlp/jar_loader'
    require 'stanford-core-nlp/java_wrapper'
    require 'stanford-core-nlp/config'
@@ -30,6 +31,10 @@ module StanfordCoreNLP
  # retrieve annotations using static classes as names.
  # This works around one of the lacunae of Rjb.
  attr_accessor :jar_path
+ # The path to the main folder containing the folders
+ # with the individual models inside. By default, this
+ # is the same as the JAR path.
+ attr_accessor :model_path
  # The flags for starting the JVM machine. The parser
  # and named entity recognizer are very memory consuming.
  attr_accessor :jvm_args
@@ -41,13 +46,14 @@ module StanfordCoreNLP
 
  # The default JAR path is the gem's bin folder.
  self.jar_path = File.dirname(__FILE__) + '/../bin/'
+ # The default model path is the same as the JAR path.
+ self.model_path = self.jar_path
  # Load the JVM with a minimum heap size of 512MB and a
  # maximum heap size of 1024MB.
  self.jvm_args = ['-Xms512M', '-Xmx1024M']
  # Turn logging off by default.
  self.log_file = nil
-
-
+
  # Use models for a given language. Language can be
  # supplied as full-length, or ISO-639 2 or 3 letter
  # code (e.g. :english, :eng or :en will work).
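
The comment above says a language may be given as a full name or as an ISO-639 two- or three-letter code. One way such a lookup could work is sketched below; `language_codes` and `normalize_language` are hypothetical names, and the table is an assumption rather than the gem's own data.

```ruby
# Hypothetical table of accepted codes per language; the gem's own
# table may contain other languages and codes.
def language_codes
  {
    :english => [:en, :eng],
    :french  => [:fr, :fre, :fra],
    :german  => [:de, :ger, :deu]
  }
end

# Map a full name or an ISO-639 code to the full-length language symbol.
def normalize_language(language)
  key = language.to_s.downcase.to_sym
  return key if language_codes.key?(key)
  match = language_codes.find { |_name, codes| codes.include?(key) }
  raise ArgumentError, "Unknown language: #{language}" unless match
  match.first
end

puts normalize_language(:english)
puts normalize_language(:eng)
puts normalize_language(:en)
```

Normalizing to one symbol up front means the rest of the code only has to index model tables by the full-length name.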
@@ -117,14 +123,16 @@ module StanfordCoreNLP
  # specified JVM flags and StanfordCoreNLP
  # properties.
  def self.load(*annotators)
+   JarLoader.log(self.log_file)
    self.init unless @@initialized
    # Prepend the JAR path to the model files.
    properties = {}
    self.model_files.each do |k,v|
-     f = self.jar_path + v
+     f = self.model_path + v
      unless File.readable?(f)
        raise "Model file #{f} could not be found. " +
-       "You may need to download this file manually and/or set paths properly."
+       "You may need to download this file manually " +
+       "and/or set paths properly."
      else
        properties[k] = f
      end
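
The loop in `load` now prefixes each model file with `model_path` and raises if the result is not readable. A minimal, self-contained sketch of that check follows; `resolve_model_files` is a hypothetical helper, it joins paths with `File.join` rather than `+`, and it uses a temporary directory in place of a real model folder.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper mirroring the loop in `load`: prefix every
# relative model file with the model path and fail early when the
# resulting file cannot be read.
def resolve_model_files(model_path, model_files)
  model_files.each_with_object({}) do |(key, relative), properties|
    file = File.join(model_path, relative)
    unless File.readable?(file)
      raise "Model file #{file} could not be found. " \
            'You may need to download this file manually ' \
            'and/or set paths properly.'
    end
    properties[key] = file
  end
end

# A temporary folder stands in for the real model folder.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'english-left3words-distsim.tagger'))
  properties = resolve_model_files(
    dir, { 'pos.model' => 'english-left3words-distsim.tagger' }
  )
  puts properties['pos.model']
end
```

Using `File.join` here sidesteps the need for a trailing slash on the model path, which plain string concatenation depends on.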
@@ -133,12 +141,23 @@ module StanfordCoreNLP
    annotators.map { |x| x.to_s }.join(', ')
    CoreNLP.new(get_properties(properties))
  end
-
+
+ # Once it has loaded a specific annotator model,
+ # the program always loads the same models when
+ # you make new pipelines and request the annotator
+ # again, ignoring the changes in models.
+ #
+ # This function kills the JVM and reloads everything
+ # if you need to create a new pipeline with different
+ # models for the same annotators.
+ #def self.reload
+ #  raise 'Not implemented.'
+ #end
+
  # Load the jars.
  def self.load_jars
    JarLoader.jvm_args = self.jvm_args
    JarLoader.jar_path = self.jar_path
-   JarLoader.log(self.log_file) if self.log_file
    JarLoader.load('joda-time.jar')
    JarLoader.load('xom.jar')
    JarLoader.load('stanford-corenlp.jar')
@@ -20,9 +20,18 @@ module StanfordCoreNLP
    :ner => 'classifiers/',
    :dcoref => 'dcoref/'
  }
-
+
+ # Tag sets used by Stanford for each language.
+ TagSets = {
+   :english => :penn,
+   :german => :negra,
+   :chinese => :penn_chinese,
+   :french => :simple
+ }
+
  # Default models for all languages.
  Models = {
+
    :pos => {
      :english => 'english-left3words-distsim.tagger',
      :german => 'german-fast.tagger',
@@ -31,6 +40,7 @@ module StanfordCoreNLP
      :chinese => 'chinese.tagger',
      :xinhua => nil
    },
+
    :parser => {
      :english => 'englishPCFG.ser.gz',
      :german => 'germanPCFG.ser.gz',
@@ -39,6 +49,7 @@ module StanfordCoreNLP
      :chinese => 'chinesePCFG.ser.gz',
      :xinhua => 'xinhuaPCFG.ser.gz'
    },
+
    :ner => {
      :english => {
        '3class' => 'all.3class.distsim.crf.ser.gz',
@@ -51,6 +62,7 @@ module StanfordCoreNLP
      :chinese => {},
      :xinhua => {}
    },
+
    :dcoref => {
      :english => {
        'demonym' => 'demonyms.txt',
@@ -72,6 +84,7 @@ module StanfordCoreNLP
      :chinese => {},
      :xinhua => {}
    }
+
  # Models to add.
 
  #"truecase.model" - path towards the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz
@@ -46,7 +46,7 @@ module StanfordCoreNLP
    self.rjb_initialize
    jar = self.jar_path + jar
    if !::File.readable?(jar)
-     raise "Could not find JAR file (looking in #{jar})."
+     raise "Could not find JAR file (looking in #{jar})."
    end
    ::Rjb::add_jar(jar)
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: stanford-core-nlp
  version: !ruby/object:Gem::Version
-   version: 0.1.5
+   version: 0.1.7
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-04 00:00:00.000000000 Z
+ date: 2012-02-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rjb
-   requirement: &70191057037760 !ruby/object:Gem::Requirement
+   requirement: &70107443631860 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,11 @@ dependencies:
          version: '0'
    type: :runtime
    prerelease: false
-   version_requirements: *70191057037760
+   version_requirements: *70107443631860
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
-   language processing \ntools for English, including tokenization, part-of-speech
-   tagging, lemmatization, named entity recognition,\nparsing, and coreference resolution. "
+   language processing \ntools that provides tokenization, part-of-speech tagging and
+   parsing for several languages, as well as named entity \nrecognition and coreference
+   resolution for English. "
  email:
  - louis.mullie@gmail.com
  executables: []
@@ -36,7 +37,6 @@ files:
  - lib/stanford-core-nlp/java_wrapper.rb
  - lib/stanford-core-nlp.rb
  - bin/bridge.jar
- - bin/INFO
  - README.markdown
  - LICENSE
  homepage: https://github.com/louismullie/stanford-core-nlp
@@ -64,4 +64,3 @@ signing_key:
  specification_version: 3
  summary: Ruby bindings to the Stanford Core NLP tools.
  test_files: []
- has_rdoc:
data/bin/INFO DELETED
@@ -1 +0,0 @@
- This is where you should put the JAR files and the folders with the model files.