stanford-core-nlp 0.1.3 → 0.1.4

@@ -0,0 +1,86 @@
1
+ **About**
2
+
3
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for English, including tokenization, part-of-speech tagging, lemmatization, named entity recognition, parsing, and coreference resolution.
4
+
5
+ **Installing**
6
+
7
+ 1. Install the gem: `gem install stanford-core-nlp`.
8
+
9
+ 2. Download the Stanford Core NLP JAR and model files [here](http://louismullie.com/stanford-core-nlp-english.zip). Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (typically this is /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/). This package only includes model files for English; see below for information on adding model files for other languages.
10
+
11
+ **Configuration**
12
+
13
+ After installing and requiring the gem (`require 'stanford-core-nlp'`), you can optionally set some configuration options. Here are some examples:
14
+
15
+ # Set an alternative path to look for the JAR files
16
+ # Default is gem's bin folder.
17
+ StanfordCoreNLP.jar_path = '/path/'
18
+
19
+ # Pass some alternative arguments to the Java VM.
20
+ # Default is ['-Xms512M', '-Xmx1024M'].
21
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']
22
+
23
+ # Redirect VM output to log.txt
24
+ StanfordCoreNLP.log_file = 'log.txt'
25
+
26
+ You may also want to load additional classes from the Stanford NLP package to do more specific tasks. The gem provides an API to do this:
27
+
28
+ # Default base package is edu.stanford.nlp.pipeline.
29
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
30
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
31
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
32
+
33
+ # Here, we specify another base package.
34
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger.maxent')
35
+ puts StanfordCoreNLP::MaxentTagger.inspect
36
+ # => #<Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
37
+
38
+ **Using the gem**
39
+
40
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
41
+ 'Berlin to discuss a new austerity package. Sarkozy ' +
42
+ 'looked pleased, but Merkel was dismayed.'
43
+
44
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
45
+ text = StanfordCoreNLP::Text.new(text)
46
+ pipeline.annotate(text)
47
+
48
+ text.get(:sentences).each do |sentence|
49
+ sentence.get(:tokens).each do |token|
50
+ # Default annotations for all tokens
51
+ puts token.get(:value).to_s
52
+ puts token.get(:original_text).to_s
53
+ puts token.get(:character_offset_begin).to_s
54
+ puts token.get(:character_offset_end).to_s
55
+ # POS returned by the tagger
56
+ puts token.get(:part_of_speech).to_s
57
+ # Lemma (base form of the token)
58
+ puts token.get(:lemma).to_s
59
+ # Named entity tag
60
+ puts token.get(:named_entity_tag).to_s
61
+ # Coreference
62
+ puts token.get(:coref_cluster_id).to_s
63
+ # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
64
+ end
65
+ end
66
+
67
+ Good references for the names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CorefCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the 'stanford_annotations.rb' file inside the gem. The Ruby symbol (e.g. :named_entity_tag) corresponding to a Java annotation class follows a simple un-camel-casing convention, with the trailing 'Annotation' removed. For example, the annotation NamedEntityTagAnnotation translates to :named_entity_tag, PartOfSpeechAnnotation to :part_of_speech, etc.
68
+
69
+ **Adding models for other languages for the parser and tagger**
70
+
71
+ - For the Stanford Parser, download the [parser files](http://nlp.stanford.edu/software/lex-parser.shtml), and copy from the grammar/ directory the grammars you need into the gem's bin/grammar directory (e.g. /usr/local/lib/ruby/gems/1.9.1/gems/stanford-core-nlp-0.x/bin/grammar). Grammars are available for Arabic, Chinese (including grammars trained on Xinhua), French and German.
72
+ - For the Stanford Tagger, download the [tagger files](http://nlp.stanford.edu/software/tagger.shtml), and copy from the models/ directory the models you need into the gem's bin/models directory. Models are available for Arabic, Chinese, French and German.
73
+
74
+ Then, configure the gem to use your newly added files, e.g.:
75
+
76
+ StanfordCoreNLP.set_model('parser.model', '/path/to/gem/bin/grammar/chinesePCFG.ser.gz')
77
+ StanfordCoreNLP.set_model('tagger.model', '/path/to/gem/bin/models/chinese.tagger')
78
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :parse)
79
+
80
+ **Current known issues**
81
+
82
+ The models included with the gem for the NER system are missing two files: "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I would be very grateful if somebody could add these files or e-mail them to me.
83
+
84
+ **Contributing**
85
+
86
+ Feel free to fork the project and send me a pull request!
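
The un-camel-casing convention described in the README's annotation paragraph can be sketched in plain Ruby. This is only an illustration and not the gem's own code: the helper name `annotation_symbol` is hypothetical, and it assumes annotation class names are simple CamelCase words ending in `Annotation` (names containing runs of capitals may need extra handling).

```ruby
# Hypothetical helper (not part of the gem): convert a Java annotation
# class name into the Ruby symbol form described in the README.
def annotation_symbol(java_class_name)
  java_class_name
    .sub(/Annotation\z/, '')            # drop the trailing 'Annotation'
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')  # add '_' at lower-to-upper boundaries
    .downcase
    .to_sym
end

puts annotation_symbol('NamedEntityTagAnnotation').inspect  # :named_entity_tag
puts annotation_symbol('PartOfSpeechAnnotation').inspect    # :part_of_speech
```

For the authoritative mapping, the README points to the 'stanford_annotations.rb' file inside the gem.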
Binary file
@@ -1,6 +1,6 @@
1
1
  module StanfordCoreNLP
2
2
 
3
- VERSION = '0.1.3'
3
+ VERSION = '0.1.4'
4
4
  require 'stanford-core-nlp/jar_loader.rb'
5
5
  require 'stanford-core-nlp/java_wrapper'
6
6
  require 'stanford-core-nlp/stanford_annotations'
@@ -47,15 +47,23 @@ module StanfordCoreNLP
47
47
  'dcoref.extra.gender' => 'dcoref/namegender.combine.txt'
48
48
  }
49
49
 
50
+ # Whether the classes are initialized or not.
51
+ @@initialized = false
52
+ # Whether the jars are loaded or not.
53
+ @@loaded = false
54
+
50
55
  # Set a model file.
51
56
  def self.set_model(name, file)
57
+ unless File.readable?(self.jar_path + file)
58
+ raise "JAR file #{self.jar_path + file} could not be found. " +
59
+ "You may need to download this file manually and/or set paths properly."
60
+ end
52
61
  self.model_files[name] = file
53
62
  end
54
-
55
- @@initialized = false
63
+
56
64
  # Load the JARs, create the classes.
57
65
  def self.init
58
- self.load_jars(self.jvm_args, self.jar_path, self.log_file)
66
+ self.load_jars unless @@loaded
59
67
  self.create_classes
60
68
  @@initialized = true
61
69
  end
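
The readability check added to `set_model` in this hunk builds the path by plain string concatenation (`self.jar_path + file`), which appears to assume that `jar_path` ends with a path separator and that `file` is relative to it. A minimal standalone sketch of the same guard using `File.join` is shown below; the `ModelRegistry` module, `base_path`, and the file names are hypothetical stand-ins rather than the gem's API.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the gem's module, for illustration only.
module ModelRegistry
  class << self
    attr_accessor :base_path

    def model_files
      @model_files ||= {}
    end

    # Register a model file, refusing paths that cannot be read.
    def set_model(name, file)
      full_path = File.join(base_path, file)
      unless File.readable?(full_path)
        raise ArgumentError, "Model file #{full_path} could not be found. " \
                             'You may need to download it manually and/or set paths properly.'
      end
      model_files[name] = file
    end
  end
end

Dir.mktmpdir do |dir|
  ModelRegistry.base_path = dir # no trailing slash needed with File.join
  File.write(File.join(dir, 'demo.tagger'), '')
  ModelRegistry.set_model('tagger.model', 'demo.tagger')
  puts ModelRegistry.model_files.inspect

  begin
    ModelRegistry.set_model('parser.model', 'missing.ser.gz')
  rescue ArgumentError => e
    puts "rejected: #{e.message}"
  end
end
```

Failing at registration time, as the diff does, surfaces a bad path immediately instead of later inside the Java pipeline.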
@@ -73,14 +81,15 @@ module StanfordCoreNLP
73
81
  end
74
82
 
75
83
  # Load the jars.
76
- def self.load_jars(jvm_args, jar_path, log_file)
77
- JarLoader.jvm_args = jvm_args
78
- JarLoader.jar_path = jar_path
79
- JarLoader.log(log_file) if log_file
84
+ def self.load_jars
85
+ JarLoader.jvm_args = self.jvm_args
86
+ JarLoader.jar_path = self.jar_path
87
+ JarLoader.log(self.log_file) if self.log_file
80
88
  JarLoader.load('joda-time.jar')
81
89
  JarLoader.load('xom.jar')
82
90
  JarLoader.load('stanford-corenlp.jar')
83
91
  JarLoader.load('bridge.jar')
92
+ @@loaded = true
84
93
  end
85
94
 
86
95
  # Create the Ruby classes corresponding to the StanfordNLP
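
The `@@loaded` flag introduced in this release lets both `init` and `load_class` call `load_jars` without loading the JARs twice. A minimal sketch of that load-once guard follows. It assumes a hypothetical `Loader` module, uses a counter in place of the real `JarLoader.load` calls, and keeps state in module-level instance variables rather than the `@@` class variables used in the diff.

```ruby
# Hypothetical stand-in module; a counter replaces the real JAR loading.
module Loader
  @loaded = false
  @load_count = 0

  class << self
    attr_reader :load_count

    def load_jars
      @load_count += 1 # stands in for the JarLoader.load(...) calls
      @loaded = true
    end

    # Both entry points guard on the same flag, so the expensive
    # step runs at most once regardless of call order.
    def init
      load_jars unless @loaded
    end

    def load_class(name)
      load_jars unless @loaded
      name.to_sym # stands in for the Rjb import
    end
  end
end

Loader.load_class('PTBTokenizerAnnotator')
Loader.init
Loader.load_class('MaxentTagger')
puts Loader.load_count # prints 1
```

Setting the flag inside `load_jars` itself, as the diff does, means a direct call to `load_jars` also updates the guard.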
@@ -98,7 +107,7 @@ module StanfordCoreNLP
98
107
  # The class is then accessible under the StanfordCoreNLP
99
108
  # namespace, e.g. StanfordCoreNLP::PTBTokenizerAnnotator.
100
109
  def self.load_class(klass, base = 'edu.stanford.nlp.pipeline')
101
- self.init unless @@initialized
110
+ self.load_jars unless @@loaded
102
111
  const_set(klass.intern, Rjb::import("#{base}.#{klass}"))
103
112
  end
104
113
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: stanford-core-nlp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.3
4
+ version: 0.1.4
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-01-29 00:00:00.000000000 Z
12
+ date: 2012-01-31 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rjb
16
- requirement: &70158635488600 !ruby/object:Gem::Requirement
16
+ requirement: &70226234873780 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
21
21
  version: '0'
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *70158635488600
24
+ version_requirements: *70226234873780
25
25
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set of natural
26
26
  language processing \ntools for English, including tokenization, part-of-speech
27
27
  tagging, lemmatization, named entity recognition,\nparsing, and coreference resolution. "
@@ -35,8 +35,9 @@ files:
35
35
  - lib/stanford-core-nlp/java_wrapper.rb
36
36
  - lib/stanford-core-nlp/stanford_annotations.rb
37
37
  - lib/stanford-core-nlp.rb
38
+ - bin/bridge.jar
38
39
  - bin/INFO
39
- - README
40
+ - README.markdown
40
41
  - LICENSE
41
42
  homepage: https://github.com/louismullie/stanford-core-nlp
42
43
  licenses: []
data/README DELETED
@@ -1,3 +0,0 @@
1
- Ruby bindings for the Stanford CoreNLP package
2
-
3
- See the wiki for more information at https://github.com/louismullie/stanford-core-nlp/wiki/.