stanford-core-nlp 0.1.3 → 0.1.4

@@ -0,0 +1,86 @@
1
+ **About**
2
+
3
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for English, including tokenization, part-of-speech tagging, lemmatization, named entity recognition, parsing, and coreference resolution.
4
+
5
+ **Installing**
6
+
7
+ 1. Install the gem: `gem install stanford-core-nlp`.
8
+
9
+ 2. Download the Stanford Core NLP JAR and model files [here](http://louismullie.com/stanford-core-nlp-english.zip). Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (typically this is /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/). This package only includes model files for English; see below for information on adding model files for other languages.
10
+
11
+ **Configuration**
12
+
13
+ After installing and requiring the gem (`require 'stanford-core-nlp'`), you can optionally set some configuration options. Here are some examples:
14
+
15
+ # Set an alternative path to look for the JAR files
16
+ # Default is gem's bin folder.
17
+ StanfordCoreNLP.jar_path = '/path/'
18
+
19
+ # Pass some alternative arguments to the Java VM.
20
+ # Default is ['-Xms512M', '-Xmx1024M'].
21
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']
22
+
23
+ # Redirect VM output to log.txt
24
+ StanfordCoreNLP.log_file = 'log.txt'
25
+
26
+ You may also want to load your own classes from the Stanford NLP package to do more specific tasks. The gem provides an API to do this:
27
+
28
+ # Default base class is edu.stanford.nlp.pipeline.
29
+ StanfordCoreNLP.load('PTBTokenizerAnnotator')
30
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
31
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
32
+
33
+ # Here, we specify another base class.
34
+ StanfordCoreNLP.load('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
35
+ puts StanfordCoreNLP::MaxentTagger.inspect
36
+ # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
37
+
38
+ **Using the gem**
39
+
40
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
41
+ 'Berlin to discuss a new austerity package. Sarkozy ' +
42
+ 'looked pleased, but Merkel was dismayed.'
43
+
44
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
45
+ text = StanfordCoreNLP::Text.new(text)
46
+ pipeline.annotate(text)
47
+
48
+ text.get(:sentences).each do |sentence|
49
+ sentence.get(:tokens).each do |token|
50
+ # Default annotations for all tokens
51
+ puts token.get(:value).to_s
52
+ puts token.get(:original_text).to_s
53
+ puts token.get(:character_offset_begin).to_s
54
+ puts token.get(:character_offset_end).to_s
55
+ # POS returned by the tagger
56
+ puts token.get(:part_of_speech).to_s
57
+ # Lemma (base form of the token)
58
+ puts token.get(:lemma).to_s
59
+ # Named entity tag
60
+ puts token.get(:named_entity_tag).to_s
61
+ # Coreference
62
+ puts token.get(:coref_cluster_id).to_s
63
+ # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
64
+ end
65
+ end
66
+
67
+ Good references for annotation names are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'stanford_annotations.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class is the un-camel-cased class name with the trailing 'Annotation' removed. For example, NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
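The naming convention described in the README paragraph above can be sketched in a few lines of Ruby. This is only an illustration: `annotation_symbol` is a hypothetical helper, not part of the gem, and the gem's actual mapping (in 'stanford_annotations.rb') may treat some names, such as those containing acronyms, differently.

```ruby
# Hypothetical helper, not part of the gem: derives the Ruby symbol
# from a Java annotation class name by the un-camel-casing convention.
def annotation_symbol(java_class_name)
  java_class_name
    .sub(/Annotation\z/, '')            # drop the trailing 'Annotation'
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')  # insert '_' at lower-to-upper boundaries
    .downcase
    .to_sym
end

puts annotation_symbol('NamedEntityTagAnnotation').inspect
puts annotation_symbol('PartOfSpeechAnnotation').inspect
```

The two calls reproduce the examples given in the README text, :named_entity_tag and :part_of_speech.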
68
+
69
+ **Adding models for other languages for the parser and tagger**
70
+
71
+ - For the Stanford Parser, download the [parser files](http://nlp.stanford.edu/software/lex-parser.shtml), and copy from the grammar/ directory the grammars you need into the gem's bin/grammar directory (e.g. /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/grammar). Grammars are available for Arabic, Chinese, French, German and Xinhua.
72
+ - For the Stanford Tagger, download the [tagger files](http://nlp.stanford.edu/software/tagger.shtml), and copy from the models/ directory the models you need into the gem's bin/models directory. Models are available for Arabic, Chinese, French and German.
73
+
74
+ Then, configure the gem to use your newly added files, e.g.:
75
+
76
+ StanfordCoreNLP.set_model('parser.model', '/path/to/gem/bin/grammar/chinesePCFG.ser.gz')
77
+ StanfordCoreNLP.set_model('tagger.model', '/path/to/gem/bin/models/chinese.tagger')
78
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :parse)
79
+
80
+ **Current known issues**
81
+
82
+ The models included with the gem for the NER system are missing two files, "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I would be very grateful if somebody could add these files or e-mail them to me.
83
+
84
+ **Contributing**
85
+
86
+ Feel free to fork the project and send me a pull request!
Binary file
@@ -1,6 +1,6 @@
1
1
  module StanfordCoreNLP
2
2
 
3
- VERSION = '0.1.3'
3
+ VERSION = '0.1.4'
4
4
  require 'stanford-core-nlp/jar_loader.rb'
5
5
  require 'stanford-core-nlp/java_wrapper'
6
6
  require 'stanford-core-nlp/stanford_annotations'
@@ -47,15 +47,23 @@ module StanfordCoreNLP
47
47
  'dcoref.extra.gender' => 'dcoref/namegender.combine.txt'
48
48
  }
49
49
 
50
+ # Whether the classes are initialized or not.
51
+ @@initialized = false
52
+ # Whether the jars are loaded or not.
53
+ @@loaded = false
54
+
50
55
  # Set a model file.
51
56
  def self.set_model(name, file)
57
+ unless File.readable?(self.jar_path + file)
58
+ raise "JAR file #{self.jar_path + file} could not be found. " +
59
+ "You may need to download this file manually and/or set paths properly."
60
+ end
52
61
  self.model_files[name] = file
53
62
  end
54
-
55
- @@initialized = false
63
+
56
64
  # Load the JARs, create the classes.
57
65
  def self.init
58
- self.load_jars(self.jvm_args, self.jar_path, self.log_file)
66
+ self.load_jars unless @@loaded
59
67
  self.create_classes
60
68
  @@initialized = true
61
69
  end
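The readability check added to `set_model` in this hunk builds the path with `self.jar_path + file`, which appears to assume that `jar_path` ends in a path separator. A minimal sketch of the same kind of check using `File.join` follows. `resolve_model` is a hypothetical helper and not part of the gem, and the file names are placeholders.

```ruby
require 'tmpdir'

# Hypothetical helper, not part of the gem: joins a base directory and
# a model file name, and fails early if the result is not readable.
def resolve_model(base_dir, file)
  path = File.join(base_dir, file) # tolerates a missing trailing slash
  unless File.readable?(path)
    raise ArgumentError, "Model file #{path} could not be found. " \
      'You may need to download this file manually and/or set paths properly.'
  end
  path
end

Dir.mktmpdir do |dir|
  # Create an empty placeholder file so the first lookup succeeds.
  File.write(File.join(dir, 'chinese.tagger'), '')
  puts resolve_model(dir, 'chinese.tagger')

  begin
    resolve_model(dir, 'missing.tagger')
  rescue ArgumentError => e
    puts e.message
  end
end
```

Failing at configuration time, as the diff does, surfaces a bad path before the JVM is started rather than as a Java exception later.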
@@ -73,14 +81,15 @@ module StanfordCoreNLP
73
81
  end
74
82
 
75
83
  # Load the jars.
76
- def self.load_jars(jvm_args, jar_path, log_file)
77
- JarLoader.jvm_args = jvm_args
78
- JarLoader.jar_path = jar_path
79
- JarLoader.log(log_file) if log_file
84
+ def self.load_jars
85
+ JarLoader.jvm_args = self.jvm_args
86
+ JarLoader.jar_path = self.jar_path
87
+ JarLoader.log(self.log_file) if self.log_file
80
88
  JarLoader.load('joda-time.jar')
81
89
  JarLoader.load('xom.jar')
82
90
  JarLoader.load('stanford-corenlp.jar')
83
91
  JarLoader.load('bridge.jar')
92
+ @@loaded = true
84
93
  end
85
94
 
86
95
  # Create the Ruby classes corresponding to the StanfordNLP
@@ -98,7 +107,7 @@ module StanfordCoreNLP
98
107
  # The class is then accessible under the StanfordCoreNLP
99
108
  # namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
100
109
  def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
101
- self.init unless @@initialized
110
+ self.load_jars unless @@loaded
102
111
  const_set(klass.intern, Rjb::import("#{base}.#{klass}"))
103
112
  end
104
113
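The `@@loaded` flag introduced in this diff lets `load_class` trigger JAR loading on first use without repeating it on later calls. The pattern can be sketched in isolation as below. `LazyLoader` and its `load_count` counter are hypothetical stand-ins used only to make the behaviour observable; they are not the gem's code and do not start a JVM.

```ruby
# Illustrative stand-in, not the gem's code: shows a load-once guard.
module LazyLoader
  @loaded = false
  @load_count = 0

  class << self
    attr_reader :load_count

    # Stands in for the expensive step (starting the JVM, loading JARs).
    def load_jars
      @load_count += 1
      @loaded = true
    end

    # Stands in for load_class: triggers loading only on first use.
    def load_class(name)
      load_jars unless @loaded
      "loaded #{name}"
    end
  end
end

LazyLoader.load_class('PTBTokenizerAnnotator')
LazyLoader.load_class('MaxentTagger')
puts LazyLoader.load_count # prints 1
```

Keeping a separate flag for "JARs loaded" and "classes initialized", as the diff does, allows a single class to be imported without first creating every default class.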
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: stanford-core-nlp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.3
4
+ version: 0.1.4
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-01-29 00:00:00.000000000 Z
12
+ date: 2012-01-31 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rjb
16
- requirement: &70158635488600 !ruby/object:Gem::Requirement
16
+ requirement: &70226234873780 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
21
21
  version: '0'
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *70158635488600
24
+ version_requirements: *70226234873780
25
25
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set of natural
26
26
  language processing \ntools for English, including tokenization, part-of-speech
27
27
  tagging, lemmatization, named entity recognition,\nparsing, and coreference resolution. "
@@ -35,8 +35,9 @@ files:
35
35
  - lib/stanford-core-nlp/java_wrapper.rb
36
36
  - lib/stanford-core-nlp/stanford_annotations.rb
37
37
  - lib/stanford-core-nlp.rb
38
+ - bin/bridge.jar
38
39
  - bin/INFO
39
- - README
40
+ - README.markdown
40
41
  - LICENSE
41
42
  homepage: https://github.com/louismullie/stanford-core-nlp
42
43
  licenses: []
data/README DELETED
@@ -1,3 +0,0 @@
1
- Ruby bindings for the Stanford CoreNLP package
2
-
3
- See the wiki for more information at https://github.com/louismullie/stanford-core-nlp/wiki/.