stanford-core-nlp 0.1.3 → 0.1.4
- data/README.markdown +86 -0
- data/bin/bridge.jar +0 -0
- data/lib/stanford-core-nlp.rb +18 -9
- metadata +6 -5
- data/README +0 -3
data/README.markdown
ADDED
@@ -0,0 +1,86 @@
|
|
1
|
+
**About**
|
2
|
+
|
3
|
+
This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for English, including tokenization, part-of-speech tagging, lemmatization, named entity recognition, parsing, and coreference resolution.
|
4
|
+
|
5
|
+
**Installing**
|
6
|
+
|
7
|
+
1. Install the gem: `gem install stanford-core-nlp`.
|
8
|
+
|
9
|
+
2. Download the Stanford Core NLP JAR and model files [here](http://louismullie.com/stanford-core-nlp-english.zip). Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (typically this is /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/). This package only includes model files for English; see below for information on adding model files for other languages.
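
The gem's install location differs between systems, so the path in step 2 can be looked up through RubyGems instead of being typed by hand. The following is a minimal sketch, assuming the gem is installed under the name `stanford-core-nlp`; the helper `gem_bin_dir` is hypothetical and is not part of the gem.

```ruby
require 'rubygems'

# Hypothetical helper (not part of the gem): returns the bin/ folder of an
# installed gem, or nil when RubyGems cannot find a gem with that name.
def gem_bin_dir(gem_name)
  spec = Gem::Specification.find_by_name(gem_name)
  File.join(spec.gem_dir, 'bin')
rescue Gem::LoadError
  nil
end

dir = gem_bin_dir('stanford-core-nlp')
puts(dir ? "Extract the archive into #{dir}" : 'The gem is not installed yet.')
```

Returning `nil` rather than raising keeps the lookup usable in setup scripts that run before the gem has been installed.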
|
10
|
+
|
11
|
+
**Configuration**
|
12
|
+
|
13
|
+
After installing and requiring the gem (`require 'stanford-core-nlp'`), you can optionally set some configuration options. Here are some examples:
|
14
|
+
|
15
|
+
# Set an alternative path to look for the JAR files
|
16
|
+
# Default is gem's bin folder.
|
17
|
+
StanfordCoreNLP.jar_path = '/path/'
|
18
|
+
|
19
|
+
# Pass some alternative arguments to the Java VM.
|
20
|
+
# Default is ['-Xms512M', '-Xmx1024M'].
|
21
|
+
StanfordCoreNLP.jvm_args = ['-option1', '-option2']
|
22
|
+
|
23
|
+
# Redirect VM output to log.txt
|
24
|
+
StanfordCoreNLP.log_file = 'log.txt'
|
25
|
+
|
26
|
+
You may also want to load other classes from the Stanford NLP packages to perform more specific tasks. The gem provides an API to do this:
|
27
|
+
|
28
|
+
# Default base class is edu.stanford.nlp.pipeline.
|
29
|
+
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
|
30
|
+
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
|
31
|
+
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
|
32
|
+
|
33
|
+
# Here, we specify another base class.
|
34
|
+
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
|
35
|
+
puts StanfordCoreNLP::MaxentTagger.inspect
|
36
|
+
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
|
37
|
+
|
38
|
+
**Using the gem**
|
39
|
+
|
40
|
+
text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
|
41
|
+
'Berlin to discuss a new austerity package. Sarkozy ' +
|
42
|
+
'looked pleased, but Merkel was dismayed.'
|
43
|
+
|
44
|
+
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
|
45
|
+
text = StanfordCoreNLP::Text.new(text)
|
46
|
+
pipeline.annotate(text)
|
47
|
+
|
48
|
+
text.get(:sentences).each do |sentence|
|
49
|
+
sentence.get(:tokens).each do |token|
|
50
|
+
# Default annotations for all tokens
|
51
|
+
puts token.get(:value).to_s
|
52
|
+
puts token.get(:original_text).to_s
|
53
|
+
puts token.get(:character_offset_begin).to_s
|
54
|
+
puts token.get(:character_offset_end).to_s
|
55
|
+
# POS returned by the tagger
|
56
|
+
puts token.get(:part_of_speech).to_s
|
57
|
+
# Lemma (base form of the token)
|
58
|
+
puts token.get(:lemma).to_s
|
59
|
+
# Named entity tag
|
60
|
+
puts token.get(:named_entity_tag).to_s
|
61
|
+
# Coreference
|
62
|
+
puts token.get(:coref_cluster_id).to_s
|
63
|
+
# Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
|
64
|
+
end
|
65
|
+
end
|
66
|
+
|
67
|
+
Good references for annotation names are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'stanford_annotations.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class is the class name with the trailing 'Annotation' removed and the remainder converted from camel case to snake case. For example, NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
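
The naming convention described above can be sketched in a few lines of Ruby. This is an illustration only: `annotation_symbol` is a hypothetical helper, not the gem's own conversion code, and it may differ from the gem for unusual class names such as those containing runs of capitals.

```ruby
# Hypothetical helper (not part of the gem): derive the Ruby symbol for a
# Java annotation class name by dropping the trailing "Annotation" and
# converting the remaining camel-case name to snake case.
def annotation_symbol(class_name)
  base = class_name.sub(/Annotation\z/, '')
  base.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase.to_sym
end

puts annotation_symbol('NamedEntityTagAnnotation').inspect
puts annotation_symbol('PartOfSpeechAnnotation').inspect
```

For names that the sketch handles differently from the gem, the 'stanford_annotations.rb' file remains the authoritative list.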
|
68
|
+
|
69
|
+
**Adding models for other languages for the parser and tagger**
|
70
|
+
|
71
|
+
- For the Stanford Parser, download the [parser files](http://nlp.stanford.edu/software/lex-parser.shtml), and copy the grammars you need from the grammar/ directory into the gem's bin/grammar directory (e.g. /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/grammar). Grammars are available for Arabic, Chinese (including grammars trained on Xinhua newswire), French and German.
|
72
|
+
- For the Stanford Tagger, download the [tagger files](http://nlp.stanford.edu/software/tagger.shtml), and copy from the models/ directory the models you need into the gem's bin/models directory. Models are available for Arabic, Chinese, French and German.
|
73
|
+
|
74
|
+
Then, configure the gem to use your newly added files, e.g.:
|
75
|
+
|
76
|
+
StanfordCoreNLP.set_model('parser.model', '/path/to/gem/bin/grammar/chinesePCFG.ser.gz')
|
77
|
+
StanfordCoreNLP.set_model('tagger.model', '/path/to/gem/bin/models/chinese.tagger')
|
78
|
+
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :parse)
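
A wrong model path is easier to diagnose if it is checked before it is passed on. The sketch below assumes nothing about the gem beyond the directory layout described above; `checked_model_path` is a hypothetical helper, and a temporary directory stands in for the gem's bin/ folder.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of the gem): joins the pieces of a model path
# and fails early, with a descriptive message, if the file cannot be read.
def checked_model_path(bin_dir, subdir, filename)
  path = File.join(bin_dir, subdir, filename)
  raise ArgumentError, "Model file #{path} is not readable." unless File.readable?(path)
  path
end

# Demonstration against a temporary directory standing in for the gem's bin/.
Dir.mktmpdir do |bin_dir|
  FileUtils.mkdir_p(File.join(bin_dir, 'grammar'))
  FileUtils.touch(File.join(bin_dir, 'grammar', 'chinesePCFG.ser.gz'))
  puts checked_model_path(bin_dir, 'grammar', 'chinesePCFG.ser.gz')
end
```

The returned path could then be handed to `StanfordCoreNLP.set_model`; whether that method expects an absolute path or one relative to `jar_path` should be checked against the installed version of the gem.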
|
79
|
+
|
80
|
+
**Current known issues**
|
81
|
+
|
82
|
+
The models included with the gem for the NER system are missing two files, "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I would be very grateful if somebody could add these files or e-mail them to me.
|
83
|
+
|
84
|
+
**Contributing**
|
85
|
+
|
86
|
+
Feel free to fork the project and send me a pull request!
|
data/bin/bridge.jar
ADDED
Binary file
|
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,6 +1,6 @@
|
|
1
1
|
module StanfordCoreNLP
|
2
2
|
|
3
|
-
VERSION = '0.1.
|
3
|
+
VERSION = '0.1.4'
|
4
4
|
require 'stanford-core-nlp/jar_loader.rb'
|
5
5
|
require 'stanford-core-nlp/java_wrapper'
|
6
6
|
require 'stanford-core-nlp/stanford_annotations'
|
@@ -47,15 +47,23 @@ module StanfordCoreNLP
|
|
47
47
|
'dcoref.extra.gender' => 'dcoref/namegender.combine.txt'
|
48
48
|
}
|
49
49
|
|
50
|
+
# Whether the classes are initialized or not.
|
51
|
+
@@initialized = false
|
52
|
+
# Whether the jars are loaded or not.
|
53
|
+
@@loaded = false
|
54
|
+
|
50
55
|
# Set a model file.
|
51
56
|
def self.set_model(name, file)
|
57
|
+
unless File.readable?(self.jar_path + file)
|
58
|
+
raise "JAR file #{self.jar_path + file} could not be found. " +
|
59
|
+
"You may need to download this file manually and/or set paths properly."
|
60
|
+
end
|
52
61
|
self.model_files[name] = file
|
53
62
|
end
|
54
|
-
|
55
|
-
@@initialized = false
|
63
|
+
|
56
64
|
# Load the JARs, create the classes.
|
57
65
|
def self.init
|
58
|
-
self.load_jars
|
66
|
+
self.load_jars unless @@loaded
|
59
67
|
self.create_classes
|
60
68
|
@@initialized = true
|
61
69
|
end
|
@@ -73,14 +81,15 @@ module StanfordCoreNLP
|
|
73
81
|
end
|
74
82
|
|
75
83
|
# Load the jars.
|
76
|
-
def self.load_jars
|
77
|
-
JarLoader.jvm_args = jvm_args
|
78
|
-
JarLoader.jar_path = jar_path
|
79
|
-
JarLoader.log(log_file) if log_file
|
84
|
+
def self.load_jars
|
85
|
+
JarLoader.jvm_args = self.jvm_args
|
86
|
+
JarLoader.jar_path = self.jar_path
|
87
|
+
JarLoader.log(self.log_file) if self.log_file
|
80
88
|
JarLoader.load('joda-time.jar')
|
81
89
|
JarLoader.load('xom.jar')
|
82
90
|
JarLoader.load('stanford-corenlp.jar')
|
83
91
|
JarLoader.load('bridge.jar')
|
92
|
+
@@loaded = true
|
84
93
|
end
|
85
94
|
|
86
95
|
# Create the Ruby classes corresponding to the StanfordNLP
|
@@ -98,7 +107,7 @@ module StanfordCoreNLP
|
|
98
107
|
# The class is then accessible under the StanfordCoreNLP
|
99
108
|
# namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
|
100
109
|
def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
|
101
|
-
self.
|
110
|
+
self.load_jars unless @@loaded
|
102
111
|
const_set(klass.intern, Rjb::import("#{base}.#{klass}"))
|
103
112
|
end
|
104
113
|
|
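
The `@@loaded` guard added in this version makes `load_jars` run at most once, whether it is first triggered by `init` or by `load_class`. The pattern can be sketched in isolation as follows. `LazyLoader` is a stand-in module, not the gem's code, and a counter replaces the actual JVM start-up.

```ruby
# Stand-in module, NOT the gem's code: it only mimics the "load once" guard.
module LazyLoader
  @loaded = false
  @load_count = 0

  class << self
    attr_reader :load_count

    # Stands in for the expensive step of starting the JVM and loading JARs.
    def load_jars
      @load_count += 1
      @loaded = true
    end

    # Both entry points trigger loading only if it has not happened yet.
    def init
      load_jars unless @loaded
    end

    def load_class(name)
      load_jars unless @loaded
      "loaded #{name}"
    end
  end
end

LazyLoader.init
LazyLoader.load_class('PTBTokenizerAnnotator')
puts LazyLoader.load_count # prints 1
```

The sketch uses instance variables on the module rather than `@@` class variables; either works for a single module, and the guard logic is the same.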
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: stanford-core-nlp
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.4
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-01-
|
12
|
+
date: 2012-01-31 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: rjb
|
16
|
-
requirement: &
|
16
|
+
requirement: &70226234873780 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,7 +21,7 @@ dependencies:
|
|
21
21
|
version: '0'
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *70226234873780
|
25
25
|
description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set of natural
|
26
26
|
language processing \ntools for English, including tokenization, part-of-speech
|
27
27
|
tagging, lemmatization, named entity recognition,\nparsing, and coreference resolution. "
|
@@ -35,8 +35,9 @@ files:
|
|
35
35
|
- lib/stanford-core-nlp/java_wrapper.rb
|
36
36
|
- lib/stanford-core-nlp/stanford_annotations.rb
|
37
37
|
- lib/stanford-core-nlp.rb
|
38
|
+
- bin/bridge.jar
|
38
39
|
- bin/INFO
|
39
|
-
- README
|
40
|
+
- README.markdown
|
40
41
|
- LICENSE
|
41
42
|
homepage: https://github.com/louismullie/stanford-core-nlp
|
42
43
|
licenses: []
|
data/README
DELETED