stanford-core-nlp 0.1.3 → 0.1.4
- data/README.markdown +86 -0
- data/bin/bridge.jar +0 -0
- data/lib/stanford-core-nlp.rb +18 -9
- metadata +6 -5
- data/README +0 -3
data/README.markdown
ADDED
@@ -0,0 +1,86 @@
+**About**
+
+This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for English, including tokenization, part-of-speech tagging, lemmatization, named entity recognition, parsing, and coreference resolution.
+
+**Installing**
+
+1. Install the gem: `gem install stanford-core-nlp`.
+
+2. Download the Stanford Core NLP JAR and model files [here](http://louismullie.com/stanford-core-nlp-english.zip). Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (typically this is /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/). This package only includes model files for English; see below for information on adding model files for other languages.
+
+**Configuration**
+
+After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some configuration options; none of them is required. Here are some examples:
+
+    # Set an alternative path to look for the JAR files.
+    # Default is the gem's bin folder.
+    StanfordCoreNLP.jar_path = '/path/'
+
+    # Pass some alternative arguments to the Java VM.
+    # Default is ['-Xms512M', '-Xmx1024M'].
+    StanfordCoreNLP.jvm_args = ['-option1', '-option2']
+
+    # Redirect VM output to log.txt
+    StanfordCoreNLP.log_file = 'log.txt'
+
+You may also want to load other classes from the Stanford NLP packages for more specific tasks. The gem provides an API to do this:
+
+    # Default base package is edu.stanford.nlp.pipeline.
+    StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
+    puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
+    # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
+
+    # Here, we specify another base package.
+    StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
+    puts StanfordCoreNLP::MaxentTagger.inspect
+    # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
+
+**Using the gem**
+
+    text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
+      'Berlin to discuss a new austerity package. Sarkozy ' +
+      'looked pleased, but Merkel was dismayed.'
+
+    pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
+    text = StanfordCoreNLP::Text.new(text)
+    pipeline.annotate(text)
+
+    text.get(:sentences).each do |sentence|
+      sentence.get(:tokens).each do |token|
+        # Default annotations for all tokens
+        puts token.get(:value).to_s
+        puts token.get(:original_text).to_s
+        puts token.get(:character_offset_begin).to_s
+        puts token.get(:character_offset_end).to_s
+        # POS returned by the tagger
+        puts token.get(:part_of_speech).to_s
+        # Lemma (base form of the token)
+        puts token.get(:lemma).to_s
+        # Named entity tag
+        puts token.get(:named_entity_tag).to_s
+        # Coreference
+        puts token.get(:coref_cluster_id).to_s
+        # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
+      end
+    end
+
+Good references for annotation names are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'stanford_annotations.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class follows a simple un-camel-casing convention, with the trailing 'Annotation' removed. For example, the annotation NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
+
+**Adding models for other languages for the parser and tagger**
+
+- For the Stanford Parser, download the [parser files](http://nlp.stanford.edu/software/lex-parser.shtml), and copy the grammars you need from the grammar/ directory into the gem's bin/grammar directory (e.g. /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/grammar). Grammars are available for Arabic, Chinese, French, German and Xinhua.
+- For the Stanford Tagger, download the [tagger files](http://nlp.stanford.edu/software/tagger.shtml), and copy the models you need from the models/ directory into the gem's bin/models directory. Models are available for Arabic, Chinese, French and German.
+
+Then, configure the gem to use your newly added files, e.g.:
+
+    StanfordCoreNLP.set_model('parser.model', '/path/to/gem/bin/grammar/chinesePCFG.ser.gz')
+    StanfordCoreNLP.set_model('tagger.model', '/path/to/gem/bin/models/chinese.tagger')
+    pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :parse)
+
+**Current known issues**
+
+The models included with the gem for the NER system are missing two files: "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I will be very grateful if somebody could add/e-mail me these files.
+
+**Contributing**
+
+Feel free to fork the project and send me a pull request!
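The README added in this release describes how Java annotation class names map to Ruby symbols (drop the trailing 'Annotation', then un-camel-case). A minimal sketch of that convention follows; `annotation_symbol` is a hypothetical helper written for illustration and is not claimed to be how the gem builds its own list in 'stanford_annotations.rb'.

```ruby
# Hypothetical helper illustrating the naming convention from the README.
# It is NOT part of the stanford-core-nlp gem.
def annotation_symbol(java_class_name)
  java_class_name
    .sub(/Annotation\z/, '')              # strip the trailing 'Annotation'
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')    # insert '_' at lower-to-upper boundaries
    .downcase
    .to_sym
end

puts annotation_symbol('NamedEntityTagAnnotation').inspect  # => :named_entity_tag
puts annotation_symbol('PartOfSpeechAnnotation').inspect    # => :part_of_speech
```

The sketch only splits at lowercase-to-uppercase boundaries, so class names containing runs of capitals may need extra handling.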
data/bin/bridge.jar
ADDED
Binary file
|
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,6 +1,6 @@
 module StanfordCoreNLP
 
-  VERSION = '0.1.
+  VERSION = '0.1.4'
   require 'stanford-core-nlp/jar_loader.rb'
   require 'stanford-core-nlp/java_wrapper'
   require 'stanford-core-nlp/stanford_annotations'
@@ -47,15 +47,23 @@ module StanfordCoreNLP
     'dcoref.extra.gender' => 'dcoref/namegender.combine.txt'
   }
 
+  # Whether the classes are initialized or not.
+  @@initialized = false
+  # Whether the jars are loaded or not.
+  @@loaded = false
+
   # Set a model file.
   def self.set_model(name, file)
+    unless File.readable?(self.jar_path + file)
+      raise "JAR file #{self.jar_path + file} could not be found. " +
+        "You may need to download this file manually and/or set paths properly."
+    end
     self.model_files[name] = file
   end
-
-  @@initialized = false
+
   # Load the JARs, create the classes.
   def self.init
-    self.load_jars
+    self.load_jars unless @@loaded
     self.create_classes
     @@initialized = true
   end
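The readability check added to `set_model` in this hunk can be sketched in isolation as follows. This is an illustration under stated assumptions, not the gem's code: `checked_model_path` is a hypothetical name, `File.join` is swapped in for the string concatenation used in the diff so that the result does not depend on the base directory ending in a separator, and `ArgumentError` is an arbitrary choice of exception class.

```ruby
# Hypothetical helper, not part of the stanford-core-nlp gem.
# Returns the full path to a model file, or raises if it cannot be read.
def checked_model_path(base_dir, file)
  path = File.join(base_dir, file)  # assumption: join instead of `base_dir + file`
  unless File.readable?(path)
    raise ArgumentError,
          "Model file #{path} could not be found. " \
          'You may need to download this file manually and/or set paths properly.'
  end
  path
end
```

Failing early here means a missing grammar or tagger file is reported when it is configured rather than later, when the pipeline is first used.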
@@ -73,14 +81,15 @@ module StanfordCoreNLP
   end
 
   # Load the jars.
-  def self.load_jars
-    JarLoader.jvm_args = jvm_args
-    JarLoader.jar_path = jar_path
-    JarLoader.log(log_file) if log_file
+  def self.load_jars
+    JarLoader.jvm_args = self.jvm_args
+    JarLoader.jar_path = self.jar_path
+    JarLoader.log(self.log_file) if self.log_file
     JarLoader.load('joda-time.jar')
     JarLoader.load('xom.jar')
     JarLoader.load('stanford-corenlp.jar')
     JarLoader.load('bridge.jar')
+    @@loaded = true
   end
 
   # Create the Ruby classes corresponding to the StanfordNLP
@@ -98,7 +107,7 @@ module StanfordCoreNLP
   # The class is then accessible under the StanfordCoreNLP
   # namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
   def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
-    self.
+    self.load_jars unless @@loaded
     const_set(klass.intern, Rjb::import("#{base}.#{klass}"))
   end
 
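The `@@loaded` flag introduced in this file lets both `init` and `load_class` trigger JAR loading without repeating it. The pattern can be sketched on its own as below; `LoadOnceSketch` and `load_calls` are hypothetical names, the body of `load_jars` is a stand-in for the real JVM start-up, and module-level instance variables are used here instead of the class variables in the diff.

```ruby
# Hypothetical stand-in for the gem's module; names are illustrative only.
module LoadOnceSketch
  @loaded = false
  @load_calls = 0

  class << self
    attr_reader :load_calls

    # Stands in for the expensive JVM start-up and JAR loading.
    def load_jars
      @load_calls += 1
      @loaded = true
    end

    # Both entry points guard the load with the same flag.
    def init
      load_jars unless @loaded
    end

    def load_class(name)
      load_jars unless @loaded
      name.to_sym
    end
  end
end

LoadOnceSketch.load_class('PTBTokenizerAnnotator')
LoadOnceSketch.init
puts LoadOnceSketch.load_calls  # prints 1
```

Whichever entry point is called first pays the loading cost; later calls skip it.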
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: stanford-core-nlp
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.4
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-01-
+date: 2012-01-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rjb
-  requirement: &
+  requirement: &70226234873780 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70226234873780
 description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set of natural
   language processing \ntools for English, including tokenization, part-of-speech
   tagging, lemmatization, named entity recognition,\nparsing, and coreference resolution.  "
@@ -35,8 +35,9 @@ files:
 - lib/stanford-core-nlp/java_wrapper.rb
 - lib/stanford-core-nlp/stanford_annotations.rb
 - lib/stanford-core-nlp.rb
+- bin/bridge.jar
 - bin/INFO
-- README
+- README.markdown
 - LICENSE
 homepage: https://github.com/louismullie/stanford-core-nlp
 licenses: []
data/README
DELETED