stanford-core-nlp 0.1.5 → 0.1.7
- data/README.markdown +60 -53
- data/lib/stanford-core-nlp.rb +26 -7
- data/lib/stanford-core-nlp/config.rb +14 -1
- data/lib/stanford-core-nlp/jar_loader.rb +1 -1
- metadata +7 -8
- data/bin/INFO +0 -1
data/README.markdown
CHANGED
@@ -1,6 +1,6 @@
  **About**

- This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for five languages (English, French, German, Arabic and Chinese), as well as named entity recognition and coreference resolution for English.

  **Installing**

@@ -12,51 +12,60 @@ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](ht

  After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some configuration options (this, however, is not necessary). Here are some examples:

+ ```ruby
+ # Set an alternative path to look for the JAR files
+ # Default is gem's bin folder.
+ StanfordCoreNLP.jar_path = '/path_to_jars/'

+ # Set an alternative path to look for the model files
+ # Default is gem's bin folder.
+ StanfordCoreNLP.model_path = '/path_to_models/'

+ # Pass some alternative arguments to the Java VM.
+ # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
+ # to take a coffee break).
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']

+ # Redirect VM output to log.txt
+ StanfordCoreNLP.log_file = 'log.txt'
+
+ # Use the model files for a different language than English.
+ StanfordCoreNLP.use(:french)
+
+ # Change a specific model file.
+ StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
+ ```

- # Change a specific model file.
- StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
-
  **Using the gem**

+ ```ruby
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
+        'Berlin to discuss a new austerity package. Sarkozy ' +
+        'looked pleased, but Merkel was dismayed.'
+
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
+ text = StanfordCoreNLP::Text.new(text)
+ pipeline.annotate(text)
+
+ text.get(:sentences).each do |sentence|
+   sentence.get(:tokens).each do |token|
+     # Default annotations for all tokens
+     puts token.get(:value).to_s
+     puts token.get(:original_text).to_s
+     puts token.get(:character_offset_begin).to_s
+     puts token.get(:character_offset_end).to_s
+     # POS returned by the tagger
+     puts token.get(:part_of_speech).to_s
+     # Lemma (base form of the token)
+     puts token.get(:lemma).to_s
+     # Named entity tag
+     puts token.get(:named_entity_tag).to_s
+     # Coreference
+     puts token.get(:coref_cluster_id).to_s
+     # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
+   end
+ end
+ ```

  > Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.

@@ -66,19 +75,17 @@ A good reference for names of annotations are the Stanford Javadocs for [CoreAnn

  You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:

- The models included with the gem for the NER system are missing two files: "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I will be grateful if somebody could add/e-mail me these files.
+ ```ruby
+ # Default base class is edu.stanford.nlp.pipeline.
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
+
+ # Here, we specify another base class.
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
+ puts StanfordCoreNLP::MaxentTagger.inspect
+ # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
+ ```

  **Contributing**

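Taken together, the configuration and usage examples added to the README fit into a single workflow roughly as follows. This is a minimal sketch against the 0.1.7 API shown above; the install paths, log file name, sample sentence, and annotator list are made up for illustration, and the French model files must already be present under `model_path`:

```ruby
require 'stanford-core-nlp'

# Hypothetical install locations; both default to the gem's bin/ folder.
StanfordCoreNLP.jar_path   = '/opt/stanford/jars/'
StanfordCoreNLP.model_path = '/opt/stanford/models/'
StanfordCoreNLP.log_file   = 'corenlp.log'

# Select the bundled French models before building the pipeline,
# since model files are resolved when the pipeline is loaded.
StanfordCoreNLP.use(:french)
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)

text = StanfordCoreNLP::Text.new('Angela Merkel a rencontré Nicolas Sarkozy à Berlin.')
pipeline.annotate(text)

text.get(:sentences).each do |sentence|
  sentence.get(:tokens).each do |token|
    puts "#{token.get(:value)}\t#{token.get(:part_of_speech)}"
  end
end
```
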
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,6 +1,7 @@
  module StanfordCoreNLP

-   VERSION = '0.1.5'
+   VERSION = '0.1.7'
+
    require 'stanford-core-nlp/jar_loader'
    require 'stanford-core-nlp/java_wrapper'
    require 'stanford-core-nlp/config'
@@ -30,6 +31,10 @@
    # retrieve annotations using static classes as names.
    # This works around one of the lacunae of Rjb.
    attr_accessor :jar_path
+   # The path to the main folder containing the folders
+   # with the individual models inside. By default, this
+   # is the same as the JAR path.
+   attr_accessor :model_path
    # The flags for starting the JVM machine. The parser
    # and named entity recognizer are very memory consuming.
    attr_accessor :jvm_args
@@ -41,13 +46,14 @@

    # The default JAR path is the gem's bin folder.
    self.jar_path = File.dirname(__FILE__) + '/../bin/'
+   # The default model path is the same as the JAR path.
+   self.model_path = self.jar_path
    # Load the JVM with a minimum heap size of 512MB and a
    # maximum heap size of 1024MB.
    self.jvm_args = ['-Xms512M', '-Xmx1024M']
    # Turn logging off by default.
    self.log_file = nil
-
-
+
    # Use models for a given language. Language can be
    # supplied as full-length, or ISO-639 2 or 3 letter
    # code (e.g. :english, :eng or :en will work).
@@ -117,14 +123,16 @@
    # specified JVM flags and StanfordCoreNLP
    # properties.
    def self.load(*annotators)
+     JarLoader.log(self.log_file)
      self.init unless @@initialized
      # Prepend the JAR path to the model files.
      properties = {}
      self.model_files.each do |k,v|
-       f = self.jar_path + v
+       f = self.model_path + v
        unless File.readable?(f)
          raise "Model file #{f} could not be found. " +
-         "You may need to download this file manually
+         "You may need to download this file manually "+
+         " and/or set paths properly."
        else
          properties[k] = f
        end
@@ -133,12 +141,23 @@
      annotators.map { |x| x.to_s }.join(', ')
      CoreNLP.new(get_properties(properties))
    end
-
+
+   # Once it loads a specific annotator model once,
+   # the program always loads the same models when
+   # you make new pipelines and request the annotator
+   # again, ignoring the changes in models.
+   #
+   # This function kills the JVM and reloads everything
+   # if you need to create a new pipeline with different
+   # models for the same annotators.
+   #def self.reload
+   #  raise 'Not implemented.'
+   #end
+
    # Load the jars.
    def self.load_jars
      JarLoader.jvm_args = self.jvm_args
      JarLoader.jar_path = self.jar_path
-     JarLoader.log(self.log_file) if self.log_file
      JarLoader.load('joda-time.jar')
      JarLoader.load('xom.jar')
      JarLoader.load('stanford-corenlp.jar')
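The `self.load` changes above move model-file resolution from `jar_path` to the new `model_path` and check that each file is readable before the pipeline is constructed. The step can be pictured in isolation with this standalone sketch (not the gem's code; the hash keys, relative paths, and directory are illustrative):

```ruby
# Mimics the resolution loop in StanfordCoreNLP.load: join each model
# file to model_path (no longer jar_path) and require it to be readable
# before passing it to the Java pipeline as a property.
model_path  = '/opt/stanford/models/'            # illustrative
model_files = {                                  # illustrative entries
  'pos.model'    => 'taggers/english-left3words-distsim.tagger',
  'parser.model' => 'grammar/englishPCFG.ser.gz'
}

properties = {}
model_files.each do |key, relative|
  file = model_path + relative
  if File.readable?(file)
    properties[key] = file
  else
    raise "Model file #{file} could not be found. " \
          "You may need to download this file manually and/or set paths properly."
  end
end

puts properties.inspect
```
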
data/lib/stanford-core-nlp/config.rb
CHANGED
@@ -20,9 +20,18 @@
      :ner => 'classifiers/',
      :dcoref => 'dcoref/'
    }
-
+
+   # Tag sets used by Stanford for each language.
+   TagSets = {
+     :english => :penn,
+     :german => :negra,
+     :chinese => :penn_chinese,
+     :french => :simple
+   }
+
    # Default models for all languages.
    Models = {
+
      :pos => {
        :english => 'english-left3words-distsim.tagger',
        :german => 'german-fast.tagger',
@@ -31,6 +40,7 @@
        :chinese => 'chinese.tagger',
        :xinhua => nil
      },
+
      :parser => {
        :english => 'englishPCFG.ser.gz',
        :german => 'germanPCFG.ser.gz',
@@ -39,6 +49,7 @@
        :chinese => 'chinesePCFG.ser.gz',
        :xinhua => 'xinhuaPCFG.ser.gz'
      },
+
      :ner => {
        :english => {
          '3class' => 'all.3class.distsim.crf.ser.gz',
@@ -51,6 +62,7 @@
        :chinese => {},
        :xinhua => {}
      },
+
      :dcoref => {
        :english => {
          'demonym' => 'demonyms.txt',
@@ -72,6 +84,7 @@
        :chinese => {},
        :xinhua => {}
      }
+
      # Models to add.

      #"truecase.model" - path towards the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz
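The new `TagSets` constant records which part-of-speech tag set each supported language's models emit. Assuming it is defined directly under the `StanfordCoreNLP` module, as `Models` appears to be, it could be consulted with a hypothetical helper like this:

```ruby
require 'stanford-core-nlp'

# Hypothetical helper (not part of the gem): report the tag set used by
# the POS models of a given language, via the TagSets map added in 0.1.7.
def tag_set_for(language)
  StanfordCoreNLP::TagSets[language]
end

puts tag_set_for(:german)   # => :negra
puts tag_set_for(:french)   # => :simple
```
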
metadata
CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: stanford-core-nlp
  version: !ruby/object:Gem::Version
-   version: 0.1.5
+   version: 0.1.7
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-
+ date: 2012-02-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rjb
-   requirement: &
+   requirement: &70107443631860 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,11 @@ dependencies:
        version: '0'
    type: :runtime
    prerelease: false
-   version_requirements: *
+   version_requirements: *70107443631860
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
-   language processing \ntools
+   language processing \ntools that provides tokenization, part-of-speech tagging and
+   parsing for several languages, as well as named entity \nrecognition and coreference
+   resolution for English. "
  email:
  - louis.mullie@gmail.com
  executables: []
@@ -36,7 +37,6 @@ files:
  - lib/stanford-core-nlp/java_wrapper.rb
  - lib/stanford-core-nlp.rb
  - bin/bridge.jar
- - bin/INFO
  - README.markdown
  - LICENSE
  homepage: https://github.com/louismullie/stanford-core-nlp
@@ -64,4 +64,3 @@ signing_key:
  specification_version: 3
  summary: Ruby bindings to the Stanford Core NLP tools.
  test_files: []
- has_rdoc:
data/bin/INFO
DELETED
@@ -1 +0,0 @@
- This is where you should put the JAR files and the folders with the model files.