stanford-core-nlp 0.1.5 → 0.1.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.markdown +60 -53
- data/lib/stanford-core-nlp.rb +26 -7
- data/lib/stanford-core-nlp/config.rb +14 -1
- data/lib/stanford-core-nlp/jar_loader.rb +1 -1
- metadata +7 -8
- data/bin/INFO +0 -1
data/README.markdown
CHANGED
@@ -1,6 +1,6 @@
 **About**
 
-This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that
+This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for five languages (English, French, German, Arabic and Chinese), as well as named entity recognition and coreference resolution for English.
 
 **Installing**
 
@@ -12,51 +12,60 @@ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](ht
 
 After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some configuration options (this, however, is not necessary). Here are some examples:
 
-
-
-
+```ruby
+# Set an alternative path to look for the JAR files
+# Default is gem's bin folder.
+StanfordCoreNLP.jar_path = '/path_to_jars/'
 
-
-
-
+# Set an alternative path to look for the model files
+# Default is gem's bin folder.
+StanfordCoreNLP.jar_path = '/path_to_models/'
 
-
-
+# Pass some alternative arguments to the Java VM.
+# Default is ['-Xms512M', '-Xmx1024M'] (be prepared
+# to take a coffee break).
+StanfordCoreNLP.jvm_args = ['-option1', '-option2']
 
-
-
+# Redirect VM output to log.txt
+StanfordCoreNLP.log_file = 'log.txt'
+
+# Use the model files for a different language than English.
+StanfordCoreNLP.use(:french)
+
+# Change a specific model file.
+StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
+```
 
-# Change a specific model file.
-StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
-
 **Using the gem**
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
+       'Berlin to discuss a new austerity package. Sarkozy ' +
+       'looked pleased, but Merkel was dismayed.'
+
+pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
+text = StanfordCoreNLP::Text.new(text)
+pipeline.annotate(text)
+
+text.get(:sentences).each do |sentence|
+  sentence.get(:tokens).each do |token|
+    # Default annotations for all tokens
+    puts token.get(:value).to_s
+    puts token.get(:original_text).to_s
+    puts token.get(:character_offset_begin).to_s
+    puts token.get(:character_offset_end).to_s
+    # POS returned by the tagger
+    puts token.get(:part_of_speech).to_s
+    # Lemma (base form of the token)
+    puts token.get(:lemma).to_s
+    # Named entity tag
+    puts token.get(:named_entity_tag).to_s
+    # Coreference
+    puts token.get(:coref_cluster_id).to_s
+    # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
+  end
+end
+```
 
 > Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.
 
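As a follow-up to the new usage example, here is a short sketch that collects token/tag pairs per sentence. It relies only on the accessors the README shows above (`get`, `each`, `to_s`) and on a pipeline that includes the `:pos` annotator; treat it as an illustration rather than part of the gem's documentation.

```ruby
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)
text = StanfordCoreNLP::Text.new('Angela Merkel met Nicolas Sarkozy in Berlin.')
pipeline.annotate(text)

text.get(:sentences).each do |sentence|
  pairs = []
  sentence.get(:tokens).each do |token|
    # Pair each token's surface form with the tag assigned by the tagger.
    pairs << [token.get(:original_text).to_s, token.get(:part_of_speech).to_s]
  end
  puts pairs.inspect
end
```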
@@ -66,19 +75,17 @@ A good reference for names of annotations are the Stanford Javadocs for [CoreAnn
 
 You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
 
-
-
-
-
-
-
-
-
-
-
-
-
-The models included with the gem for the NER system are missing two files: "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I will be grateful if somebody could add/e-mail me these files.
+```ruby
+# Default base class is edu.stanford.nlp.pipeline.
+StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
+puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
+# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
+
+# Here, we specify another base class.
+StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
+puts StanfordCoreNLP::MaxentTagger.inspect
+# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
+```
 
 **Contributing**
 
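The same mechanism should work for any other class under the default `edu.stanford.nlp.pipeline` base package. A hedged sketch (WordsToSentencesAnnotator is a real CoreNLP class, but loading it this way is an extrapolation from the README example above):

```ruby
# Bind another annotator class from the default base package
# to a constant under StanfordCoreNLP.
StanfordCoreNLP.load_class('WordsToSentencesAnnotator')
puts StanfordCoreNLP::WordsToSentencesAnnotator.inspect
```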
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,6 +1,7 @@
 module StanfordCoreNLP
 
-  VERSION = '0.1.5'
+  VERSION = '0.1.7'
+
   require 'stanford-core-nlp/jar_loader'
   require 'stanford-core-nlp/java_wrapper'
   require 'stanford-core-nlp/config'
@@ -30,6 +31,10 @@ module StanfordCoreNLP
   # retrieve annotations using static classes as names.
   # This works around one of the lacunae of Rjb.
   attr_accessor :jar_path
+  # The path to the main folder containing the folders
+  # with the individual models inside. By default, this
+  # is the same as the JAR path.
+  attr_accessor :model_path
   # The flags for starting the JVM machine. The parser
   # and named entity recognizer are very memory consuming.
   attr_accessor :jvm_args
@@ -41,13 +46,14 @@ module StanfordCoreNLP
 
   # The default JAR path is the gem's bin folder.
   self.jar_path = File.dirname(__FILE__) + '/../bin/'
+  # The default model path is the same as the JAR path.
+  self.model_path = self.jar_path
   # Load the JVM with a minimum heap size of 512MB and a
   # maximum heap size of 1024MB.
   self.jvm_args = ['-Xms512M', '-Xmx1024M']
   # Turn logging off by default.
   self.log_file = nil
-
-
+
   # Use models for a given language. Language can be
   # supplied as full-length, or ISO-639 2 or 3 letter
   # code (e.g. :english, :eng or :en will work).
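With 0.1.7 the model location is decoupled from the JAR location, and `model_path` simply defaults to `jar_path`. (Note that the updated README snippet above assigns `StanfordCoreNLP.jar_path` for the model directory; the accessor introduced here is `model_path`.) A minimal sketch of pointing the two at different directories, with a hypothetical model directory:

```ruby
require 'stanford-core-nlp'

# Keep the JARs in the gem's bin folder (the default) and read the
# model files from a separate, shared directory instead.
StanfordCoreNLP.model_path = '/usr/local/share/stanford-models/'
```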
@@ -117,14 +123,16 @@ module StanfordCoreNLP
   # specified JVM flags and StanfordCoreNLP
   # properties.
   def self.load(*annotators)
+    JarLoader.log(self.log_file)
     self.init unless @@initialized
     # Prepend the JAR path to the model files.
     properties = {}
     self.model_files.each do |k,v|
-      f = self.
+      f = self.model_path + v
       unless File.readable?(f)
         raise "Model file #{f} could not be found. " +
-        "You may need to download this file manually
+        "You may need to download this file manually "+
+        " and/or set paths properly."
       else
         properties[k] = f
       end
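Because `load` now raises when a model file under `model_path` cannot be read, a missing or misplaced model surfaces before the pipeline is built. A small sketch of trapping it (a bare `raise` with a string message produces a `RuntimeError`):

```ruby
begin
  pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)
rescue RuntimeError => e
  # The message names the unreadable model file and suggests checking
  # the download and the configured paths.
  warn "Could not build the pipeline: #{e.message}"
end
```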
@@ -133,12 +141,23 @@ module StanfordCoreNLP
     annotators.map { |x| x.to_s }.join(', ')
     CoreNLP.new(get_properties(properties))
   end
-
+
+  # Once it loads a specific annotator model once,
+  # the program always loads the same models when
+  # you make new pipelines and request the annotator
+  # again, ignoring the changes in models.
+  #
+  # This function kills the JVM and reloads everything
+  # if you need to create a new pipeline with different
+  # models for the same annotators.
+  #def self.reload
+  #  raise 'Not implemented.'
+  #end
+
   # Load the jars.
   def self.load_jars
     JarLoader.jvm_args = self.jvm_args
     JarLoader.jar_path = self.jar_path
-    JarLoader.log(self.log_file) if self.log_file
     JarLoader.load('joda-time.jar')
     JarLoader.load('xom.jar')
     JarLoader.load('stanford-corenlp.jar')
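The commented-out `reload` stub documents a real constraint: once an annotator's models have been loaded, later pipelines reuse them, so any model changes have to happen before the first `load`. A sketch of the safe ordering, using only calls shown elsewhere in this diff:

```ruby
# Model changes must precede the first call to load; subsequent
# pipelines reuse the models the annotators were first created with.
StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)
```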
data/lib/stanford-core-nlp/config.rb
CHANGED
@@ -20,9 +20,18 @@
     :ner => 'classifiers/',
     :dcoref => 'dcoref/'
   }
-
+
+  # Tag sets used by Stanford for each language.
+  TagSets = {
+    :english => :penn,
+    :german => :negra,
+    :chinese => :penn_chinese,
+    :french => :simple
+  }
+
   # Default models for all languages.
   Models = {
+
     :pos => {
       :english => 'english-left3words-distsim.tagger',
       :german => 'german-fast.tagger',
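The new `TagSets` table records which part-of-speech tag set each language's models emit. Its exact namespace inside the gem is not visible in this hunk, so the sketch below simply mirrors the table locally to show the lookup it enables:

```ruby
# Local copy of the mapping added in config.rb.
TAG_SETS = {
  :english => :penn,
  :german  => :negra,
  :chinese => :penn_chinese,
  :french  => :simple
}

puts TAG_SETS[:german]  # => negra
```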
@@ -31,6 +40,7 @@ module StanfordCoreNLP
       :chinese => 'chinese.tagger',
       :xinhua => nil
     },
+
     :parser => {
       :english => 'englishPCFG.ser.gz',
       :german => 'germanPCFG.ser.gz',
@@ -39,6 +49,7 @@ module StanfordCoreNLP
       :chinese => 'chinesePCFG.ser.gz',
       :xinhua => 'xinhuaPCFG.ser.gz'
     },
+
     :ner => {
       :english => {
         '3class' => 'all.3class.distsim.crf.ser.gz',
@@ -51,6 +62,7 @@ module StanfordCoreNLP
       :chinese => {},
       :xinhua => {}
     },
+
     :dcoref => {
       :english => {
         'demonym' => 'demonyms.txt',
@@ -72,6 +84,7 @@ module StanfordCoreNLP
       :chinese => {},
       :xinhua => {}
     }
+
   # Models to add.
 
   #"truecase.model" - path towards the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: stanford-core-nlp
 version: !ruby/object:Gem::Version
-  version: 0.1.5
+  version: 0.1.7
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-02-
+date: 2012-02-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rjb
-  requirement: &
+  requirement: &70107443631860 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,11 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70107443631860
 description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
-  language processing \ntools
-
+  language processing \ntools that provides tokenization, part-of-speech tagging and
+  parsing for several languages, as well as named entity \nrecognition and coreference
+  resolution for English. "
 email:
 - louis.mullie@gmail.com
 executables: []
@@ -36,7 +37,6 @@ files:
 - lib/stanford-core-nlp/java_wrapper.rb
 - lib/stanford-core-nlp.rb
 - bin/bridge.jar
-- bin/INFO
 - README.markdown
 - LICENSE
 homepage: https://github.com/louismullie/stanford-core-nlp
@@ -64,4 +64,3 @@ signing_key:
 specification_version: 3
 summary: Ruby bindings to the Stanford Core NLP tools.
 test_files: []
-has_rdoc:
data/bin/INFO
DELETED
@@ -1 +0,0 @@
-This is where you should put the JAR files and the folders with the model files.