corenlp 0.0.3
- checksums.yaml +7 -0
- data/README.md +95 -0
- data/lib/corenlp/downloader.rb +56 -0
- data/lib/corenlp/enclitic.rb +22 -0
- data/lib/corenlp/number.rb +4 -0
- data/lib/corenlp/punctuation.rb +20 -0
- data/lib/corenlp/sentence.rb +28 -0
- data/lib/corenlp/token.rb +73 -0
- data/lib/corenlp/token_dependency.rb +11 -0
- data/lib/corenlp/version.rb +3 -0
- data/lib/corenlp/word.rb +4 -0
- data/test/downloader_test.rb +9 -0
- data/test/enclitic_test.rb +8 -0
- data/test/number_test.rb +8 -0
- data/test/punctuation_test.rb +38 -0
- data/test/sentence_test.rb +33 -0
- data/test/test_helper.rb +5 -0
- data/test/token_test.rb +41 -0
- data/test/treebank_test.rb +12 -0
- data/test/word_test.rb +8 -0
- metadata +150 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 072b1b153bb4591c16e6242713e9b4431ba003da
  data.tar.gz: 0e71dd5289c128e0245f082ace874d29b51cd92f
SHA512:
  metadata.gz: 7969ddc18c42ca6c832c06bf677df56212f4ec54bc35bebbd4d9e4925425804015ef2e333cbc0463a88af76204ed46f814c355c10c703586761c9a7db501442d
  data.tar.gz: 646eb3e03f42182e5a957fe6d52db3a7767cd52c9cf48bb4bd4bc36d0e07c03f1f4890c652addad58ccaf845f9a430c5d877143902b3824efc60c42e44757f31
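The digests above could be checked against a downloaded archive with Ruby's standard `digest` library. The following is only a sketch: the helper name `checksum_matches?` is hypothetical and is not part of the gem or of RubyGems.

```ruby
require "digest"

# Hypothetical helper: compare the hex digest of +data+ with +expected+.
# +algorithm+ is a Digest class such as Digest::SHA1 or Digest::SHA512.
def checksum_matches?(data, expected, algorithm = Digest::SHA512)
  algorithm.hexdigest(data) == expected.downcase
end

# For a real archive, read the file in binary mode and pass the digest
# listed in checksums.yaml, for example:
#   checksum_matches?(File.binread("data.tar.gz"), expected_sha512)
```

Reading the file with `File.binread` matters here, since a text-mode read could alter the bytes on some platforms and change the digest.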
data/README.md
ADDED
@@ -0,0 +1,95 @@
# Corenlp

Corenlp is a Ruby gem that uses the [Stanford CoreNLP Java tools](http://nlp.stanford.edu/software/corenlp.shtml) to parse text. The gem takes the output from Stanford CoreNLP and builds Ruby objects from it, for use in Ruby applications.

Stanford CoreNLP requires Java version 1.6 or higher. Installation steps vary by platform, so developers need to install Java themselves before continuing.

Development has been done with 64-bit Java on an OS X machine. 3GB of RAM is allocated for the Java process, which can be set any time the parser is called.

The following Java version has been used in development.

    $ java -version
    java version "1.7.0_45"
    Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
    Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

## Installing Stanford CoreNLP

Run `rake corenlp:download_deps` to download the Stanford CoreNLP dependencies. The files will be extracted to the `lib/ext` directory, which is ignored by git. Both the dependencies directory and the download URL can be customized.

The rake task will use the values from the following environment variables if they are set.

* `CORENLP_DOWNLOAD_URL` - Defaults to "http://nlp.stanford.edu/software/stanford-corenlp-full-2014-06-16.zip", which was the latest version when this was written.
* `CORENLP_DEPS_DIR` - Defaults to "./lib/ext/", a directory that exists in our project where we want to place the Stanford CoreNLP files.

To customize these values, supply environment variable arguments when calling the rake task like this:

    rake corenlp:download_deps CORENLP_DEPS_DIR='./my_directory'
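
Rake copies `NAME=value` arguments into `ENV`, so a task can read them with a fallback. The Rakefile is not part of this diff, so the following is only a sketch of how such a lookup might work; the constant and method names are hypothetical, and the default values are the ones quoted above.

```ruby
# Defaults quoted from the README; the names below are hypothetical.
DEFAULT_DOWNLOAD_URL = "http://nlp.stanford.edu/software/stanford-corenlp-full-2014-06-16.zip"
DEFAULT_DEPS_DIR = "./lib/ext/"

# Returns [url, destination], preferring environment variables when set.
# +env+ defaults to the process environment; a plain Hash works for testing.
def download_settings(env = ENV)
  url = env.fetch("CORENLP_DOWNLOAD_URL", DEFAULT_DOWNLOAD_URL)
  dir = env.fetch("CORENLP_DEPS_DIR", DEFAULT_DEPS_DIR)
  [url, dir]
end

url, dir = download_settings("CORENLP_DEPS_DIR" => "./my_directory")
puts "would download #{url} into #{dir}"
```

Passing the environment in as an argument keeps the lookup easy to exercise without touching the real `ENV`.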

## Testing the output in an IRB console

The Corenlp gem builds up a treebank of structured parts that define tokens, sentences, and dependencies between the tokens. This treebank structure is represented as a nested Ruby hash. Token objects that are part of a sentence are nested within the sentence, token dependencies are nested within the sentence, and so on.

The following code will build up a treebank structure for the raw text "Put the book down.". On my machine this takes around 10 seconds to run.

    bundle exec irb
    Bundler.require
    Corenlp::Treebank.new(raw_text: "Put the book down.").parse
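
To give a feel for the nesting without running the Java parser, here is a small sketch that uses `Struct` stand-ins rather than the gem's own classes. The attribute names (`tokens`, `token_dependencies`, `dependent`, `governor`, `relation`) follow the ones used in the gem's tests; the stand-in definitions and the sample data are assumptions made for illustration.

```ruby
# Stand-ins for the gem's classes, reduced to the attributes used here.
Token = Struct.new(:index, :text)
TokenDependency = Struct.new(:dependent, :governor, :relation)
Sentence = Struct.new(:index, :tokens, :token_dependencies)

put  = Token.new(0, "Put")
the  = Token.new(1, "the")
book = Token.new(2, "book")

sentence = Sentence.new(0, [put, the, book], [
  TokenDependency.new(the, book, "det"),
  TokenDependency.new(book, put, "dobj")
])

# Walk the nesting: sentence -> dependencies -> tokens.
summary = sentence.token_dependencies.map do |td|
  "#{td.dependent.text} #{td.governor.text} #{td.relation}"
end
puts summary
```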

## Options

The Treebank object can be initialized with various options.

* `java_max_memory` - set to 3GB by default. This can be customized via the Treebank initializer, for example to `-Xmx2g`, which would use a maximum of 2GB of memory.
* `threads_to_use` - number of threads Stanford CoreNLP uses to parse text. This is set to 4 by default. This option is passed to the Java executable.
* `output_directory` - by default this is `./tmp/language_processing`, which already exists. This is where Stanford CoreNLP XML files are placed. These XML files represent the structured parser output.
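
The Treebank source is not shown in this diff, so how these options reach Java is not confirmed here. As a rough sketch under that caveat, they could map onto a Stanford CoreNLP command line as follows. The helper `corenlp_command` and its `deps_dir` parameter are hypothetical; `-threads`, `-outputDirectory`, and `-file` are options of the CoreNLP command-line pipeline.

```ruby
require "shellwords"

# Hypothetical helper, not the gem's implementation: builds an argument
# list for the CoreNLP pipeline from the three documented options.
def corenlp_command(input_file, java_max_memory: "-Xmx3g", threads_to_use: 4,
                    output_directory: "./tmp/language_processing",
                    deps_dir: "./lib/ext")
  [
    "java", java_max_memory,
    "-cp", File.join(deps_dir, "*"),
    "edu.stanford.nlp.pipeline.StanfordCoreNLP",
    "-threads", threads_to_use.to_s,
    "-outputDirectory", output_directory,
    "-file", input_file
  ]
end

# Shellwords.join escapes the classpath wildcard so the shell leaves it for Java.
puts Shellwords.join(corenlp_command("input.txt", java_max_memory: "-Xmx2g"))
```

Keeping the command as an array until the last moment avoids quoting problems when a path contains spaces.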

## Tests

Minitest is used to test the Ruby objects. New code should include test coverage. Manual testing is also useful. Internally we have some more test methods to verify parser output on the same content over time, but they are not included at this time.

    rake

To run a single test:

    ruby path/to/file.rb --name test_method_name

## Terminology

Stanford CoreNLP uses a lot of terminology from the natural language processing field, and defines its own terminology. Refer to the [Stanford CoreNLP documentation](http://nlp.stanford.edu/software/corenlp.shtml) to learn more.

## Contributors

This gem was developed at Lengio by Andy Atkinson as an extraction of some of our natural language processing tools.

* Andy Atkinson, gem author and maintainer
* Kamran Khan
* Rodolfo Carvalho

## Rubygems badge

[![Gem Version](https://badge.fury.io/rb/corenlp.svg)](http://badge.fury.io/rb/corenlp)

## License

The MIT License (MIT)

Copyright (c) 2014 Lengio Corporation

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
@@ -0,0 +1,56 @@
require 'net/http'
require 'zip/zip' # rubyzip < 1.0 API; rubyzip >= 1.0 uses require 'zip' and Zip::File
require 'fileutils'
require 'uri'

module Corenlp
  class Downloader
    attr_accessor :url, :destination, :local_file
    def initialize(url, destination)
      self.url = url
      self.destination = destination
      self.local_file = nil
    end

    # Unpacks the downloaded zip into destination, moves the contents of the
    # archive's top-level directory up one level, then deletes the zip file.
    def extract
      puts "extracting file..."
      Zip::ZipFile.open(local_file) do |zip_file|
        zip_file.each do |file|
          file_path = File.join(destination, file.name)
          zip_file.extract(file, file_path) unless File.exist?(file_path)
        end

        puts "moving files into directory..."
        dirname = local_file[0...-4] # strip the ".zip" extension
        dir = File.join(destination, dirname)
        if File.exist?(dir)
          Dir.glob(File.join(dir, "*")).each do |file|
            FileUtils.mv(file, File.join(destination, File.basename(file)))
          end
          FileUtils.rm_rf(dir)
        end

        puts "deleting original zip file..."
        FileUtils.rm(local_file)
        puts "done."
      end
    end

    # Downloads the zip from url into the current directory, then extracts it.
    def download
      return unless url
      puts "downloading zip file from url #{url} to #{destination}..."
      self.local_file = File.basename(url)
      uri = URI.parse(url)
      if local_file && uri
        Net::HTTP.start(uri.host) do |http|
          resp = http.get(uri.request_uri)
          open(local_file, "wb") do |file|
            file.write(resp.body)
          end
        end
        puts "done. Downloaded file #{local_file}."
        extract
      end
    end
  end
end
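The downloader derives the local file name with `File.basename(url)` and the unpacked directory name with `local_file[0...-4]`. A slightly more defensive variant is sketched below. It is not the gem's code, and both helper names are hypothetical: parsing the URL first keeps a query string out of the file name, and `File.basename(name, ".zip")` only strips the suffix when it really is `.zip`.

```ruby
require "uri"

# Hypothetical helper: name of the local zip file for a download URL.
# Only the path component is used, so a query string is ignored.
def local_zip_name(url)
  File.basename(URI.parse(url).path)
end

# Hypothetical helper: directory the archive is expected to unpack into,
# i.e. the file name without its ".zip" extension.
def unpacked_dir_name(zip_name)
  File.basename(zip_name, ".zip")
end

url = "http://nlp.stanford.edu/software/stanford-corenlp-full-2014-06-16.zip"
zip = local_zip_name(url)
puts zip
puts unpacked_dir_name(zip)
```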
@@ -0,0 +1,22 @@
module Corenlp
  class Enclitic < Token
    def enclitic_map
      # Note: This isn't really one-to-one (e.g. "'d" could be "had" or "would"):
      {
        "'ll" => "will",
        "'m" => "am",
        "'re" => "are",
        "'s" => "is",
        "'d" => "would",
        "'t" => "not",
        "'ve" => "have",
        "'nt" => "not",
        "n't" => "not"
      }
    end

    def expanded
      enclitic_map[text]
    end
  end
end
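The map above is keyed by the ASCII apostrophe, while the `Enclitics` list in `token.rb` also accepts the typographic apostrophe (U+2019). As written, `expanded` would therefore return `nil` for a token such as `’ll`. The sketch below shows one possible way to fold the two forms together before the lookup; `ENCLITIC_MAP` and `expand_enclitic` are hypothetical names for a standalone example, not part of the gem.

```ruby
# Copy of the mapping from Enclitic#enclitic_map, for a standalone example.
ENCLITIC_MAP = {
  "'ll" => "will", "'m" => "am", "'re" => "are", "'s" => "is",
  "'d" => "would", "'t" => "not", "'ve" => "have",
  "'nt" => "not", "n't" => "not"
}.freeze

# Hypothetical helper: replace the typographic apostrophe (U+2019) with the
# ASCII one before the lookup, so both spellings expand the same way.
def expand_enclitic(text)
  ENCLITIC_MAP[text.tr("\u2019", "'")]
end

puts expand_enclitic("'ll")
puts expand_enclitic("n\u2019t")
```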
@@ -0,0 +1,20 @@
module Corenlp
  class Punctuation < Token
    # From http://www.unicode.org/charts/PDF/U20A0.pdf
    CURRENCY_SYMBOLS = %W(\u0024 \u00A2 \u00A3 \u00A4 \u00A5 \u20AC)
    DASH_SYMBOLS = %W(\u2010 \u2011 \u2012 \u2013 \u2014)
    OPENING_SYMBOLS = %W(\u201C \u2018 \u00A1 \u00BF \( [ {)

    def currency?
      CURRENCY_SYMBOLS.include?(text)
    end

    def dash?
      DASH_SYMBOLS.include?(text)
    end

    def opening?
      OPENING_SYMBOLS.include?(text)
    end
  end
end
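The symbol lists rely on `%W` (uppercase), which processes escape sequences the way a double-quoted string does. The short example below, with local variable names chosen only for illustration, shows what those literals evaluate to and how lowercase `%w` differs.

```ruby
# %W processes escapes, so each \uXXXX becomes the character itself.
currency = %W(\u0024 \u00A2 \u00A3 \u00A4 \u00A5 \u20AC)
opening  = %W(\u201C \u2018 \u00A1 \u00BF \( [ {)

puts currency.join(" ")
puts opening.join(" ")

# With %w (lowercase) the escape is kept as a literal backslash sequence,
# so the single element is six characters long rather than one.
literal = %w(\u0024)
puts literal.first.length
```

The `\(` in the opening list is needed because an unescaped, unbalanced parenthesis would otherwise interfere with the `%W( ... )` delimiters.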
@@ -0,0 +1,28 @@
module Corenlp
  class Sentence
    attr_accessor :index, :tokens, :token_dependencies, :parse_tree_raw

    def initialize(attrs = {})
      @index = attrs[:index]
      @tokens = []
      @token_dependencies = []
      @parse_tree_raw = ''
    end

    def governor_dependencies(token)
      token_dependencies.select { |td| td.governor == token }
    end

    def next_token(token)
      tokens.sort_by(&:index).detect { |t| t.index > token.index }
    end

    def previous_token(token)
      tokens.sort_by(&:index).reverse.detect { |t| t.index < token.index }
    end

    def get_dependency_token_by_index(index)
      tokens.detect { |t| t.index == index }
    end
  end
end
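`next_token` and `previous_token` sort the tokens by index and pick the first one past the given token in either direction. The same neighbour lookup can be sketched without sorting by using `min_by` and `max_by`. This is a standalone illustration with a `Struct` stand-in for the gem's token class, not a proposed replacement.

```ruby
# Stand-in for the gem's token class, reduced to the attributes used here.
Token = Struct.new(:index, :text)

# The token with the smallest index greater than the given token's index,
# or nil when the given token is the last one.
def next_token(tokens, token)
  tokens.select { |t| t.index > token.index }.min_by(&:index)
end

# The token with the largest index smaller than the given token's index,
# or nil when the given token is the first one.
def previous_token(tokens, token)
  tokens.select { |t| t.index < token.index }.max_by(&:index)
end

tokens = %w(some text in a).each_with_index.map { |text, i| Token.new(i, text) }
puts next_token(tokens, tokens[0]).text
puts previous_token(tokens, tokens[3]).text
p previous_token(tokens, tokens[0])
```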
@@ -0,0 +1,73 @@
module Corenlp
  class Token
    attr_accessor :index, :text, :penn_treebank_tag, :stanford_lemma, :type, :ner

    def initialize(attrs = {})
      @index = attrs[:index]
      @text = attrs[:text]
      @penn_treebank_tag = attrs[:penn_treebank_tag]
      @stanford_lemma = attrs[:stanford_lemma]
      @type = attrs[:type]
      @ner = attrs[:ner]
    end

    IGNORED_ENTITIES = ["PERSON"]

    def content?
      is_a?(Word) || is_a?(Enclitic)
    end

    def top_level_penn_treebank_category
      penn_treebank_tag[0]
    end

    # Equality ignores the token text; only index, tag and type are compared.
    def ==(other)
      index == other.index && \
        penn_treebank_tag == other.penn_treebank_tag && type == other.type
    end

    def website_text?
      text =~ /http:\/\//
    end

    # Reverses Stanford's character replacements in place (mutates text).
    def self.clean_stanford_text(text)
      Token::STANFORD_TEXT_REPLACEMENTS.each_pair do |original, replacement|
        text.gsub!(replacement, original)
      end
      text
    end

    Enclitics = %w{'ll 'm 're 's 't 've 'nt n't 'd ’ll ’m ’re ’s ’t ’ve ’nt n’t ’d}
    WordRegexp = /^[[:alpha:]\-'\/]+$/
    NumberRegexp = /^#?(\d+)(,\d+)*(\.\d+)?$/
    PunctRegexp = /^[[:punct:]'"\$]+$/
    WebsiteRegexp = /https?:\/\/[\S]+/

    # The character replacements that Stanford performs which we reverse:
    STANFORD_TEXT_REPLACEMENTS = {
      '”' => "''", '“' => '``', '(' => '-LRB-',
      ')' => '-RRB-', '[' => '-LSB-', ']' => '-RSB-',
      '{' => '-LCB-', '}' => '-RCB-',
      '‘' => '`', '’' => '\'', '—' => '--', '/' => '\\/'
    }

    def ignored_entity?
      IGNORED_ENTITIES.include?(self.ner)
    end

    # Branch order matters: enclitics are checked before words, and a lone
    # "-" is excluded from words so that it is classified as punctuation.
    def self.token_subclass_from_text(text)
      case
      when Enclitics.include?(text)
        Enclitic
      when (text =~ WordRegexp && text != '-') || (text =~ WebsiteRegexp)
        Word
      when text =~ PunctRegexp
        Punctuation
      when text =~ NumberRegexp
        Number
      else
        Token
      end
    end
  end
end
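The order of the branches in `token_subclass_from_text` decides how ambiguous text is classified. The sketch below copies the patterns into a standalone method that returns symbols instead of classes, so the behaviour can be tried without loading the gem. The constant names, the `classify` method, and the symbols are hypothetical, and only the ASCII-apostrophe enclitics are included.

```ruby
# Patterns copied from Token; everything else here is illustrative.
ENCLITICS = %w('ll 'm 're 's 't 've 'nt n't 'd)
WORD_REGEXP = /^[[:alpha:]\-'\/]+$/
NUMBER_REGEXP = /^#?(\d+)(,\d+)*(\.\d+)?$/
PUNCT_REGEXP = /^[[:punct:]'"\$]+$/
WEBSITE_REGEXP = /https?:\/\/[\S]+/

# Mirrors the branch order of Token.token_subclass_from_text.
def classify(text)
  case
  when ENCLITICS.include?(text)
    :enclitic
  when (text =~ WORD_REGEXP && text != "-") || (text =~ WEBSITE_REGEXP)
    :word
  when text =~ PUNCT_REGEXP
    :punctuation
  when text =~ NUMBER_REGEXP
    :number
  else
    :token
  end
end

%w(n't book http://example.com - $ 30,000,000.00 33A333).each do |sample|
  puts "#{sample} => #{classify(sample)}"
end
```

A lone hyphen matches the word pattern, which is why the word branch excludes it explicitly and lets it fall through to punctuation.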
data/lib/corenlp/word.rb
ADDED
@@ -0,0 +1,9 @@
require "test_helper"

class DownloaderTest < Minitest::Test
  def test_initialized_ok
    zip_file_url = "http://nlp.stanford.edu/software/stanford-corenlp-full-2014-06-16.zip"
    destination = File.join('./lib/ext/')
    assert Downloader.new(zip_file_url, destination)
  end
end
data/test/number_test.rb
ADDED
@@ -0,0 +1,38 @@
require "test_helper"

class PunctuationTest < Minitest::Test
  def test_punctuation_is_initialized_ok
    punctuation = Punctuation.new(text: "$")
    assert punctuation.is_a?(Punctuation)
  end

  def test_punctuation_is_currency
    Punctuation::CURRENCY_SYMBOLS.each do |s|
      token = Punctuation.new text: s
      assert token.currency?, "Token #{token.text} should be a currency"
    end

    token = Punctuation.new text: "a"
    assert !token.currency?
  end

  def test_punctuation_is_dash
    Punctuation::DASH_SYMBOLS.each do |s|
      token = Punctuation.new text: s
      assert token.dash?, "Token #{token.text} should be a dash"
    end

    token = Punctuation.new text: "{"
    assert !token.dash?
  end

  def test_punctuation_is_opening
    Punctuation::OPENING_SYMBOLS.each do |s|
      token = Punctuation.new text: s
      assert token.opening?, "Token #{token.text} should be an opening"
    end

    token = Punctuation.new text: "$"
    assert !token.opening?
  end
end
@@ -0,0 +1,33 @@
require "test_helper"

class SentenceTest < Minitest::Test
  def test_initialized_ok
    sentence = Sentence.new(index: 0, text: "some text in a sentence.")
    t1 = Token.new(index: 0, text: "some", type: "Word", sentence: sentence)
    t2 = Token.new(index: 1, text: "text", type: "Word", sentence: sentence)
    t3 = Token.new(index: 2, text: "in", type: "Word", sentence: sentence)
    t4 = Token.new(index: 3, text: "a", type: "Word", sentence: sentence)
    td1 = TokenDependency.new(dependent: t1, governor: t2, relation: "det")
    sentence.tokens << t1 << t2 << t3 << t4
    sentence.token_dependencies << td1
    assert_equal 4, sentence.tokens.size
    assert_equal 1, sentence.token_dependencies.size
    assert_equal [td1], sentence.governor_dependencies(t2)
    assert_equal [td1], sentence.token_dependencies
    assert_equal 0, sentence.index
  end

  def test_calculate_previous_and_next_token_from_token_in_a_sentence
    sentence = Sentence.new(index: 0, text: "some text in a sentence.")
    t0 = Word.new(index: 0, text: "some", sentence: sentence)
    t1 = Word.new(index: 1, text: "text", sentence: sentence)
    t2 = Word.new(index: 2, text: "in", sentence: sentence)
    t3 = Word.new(index: 3, text: "a", sentence: sentence)
    sentence.tokens << t0 << t1 << t2 << t3
    # assert_nil avoids the deprecation that newer Minitest versions raise
    # for assert_equal nil.
    assert_nil sentence.previous_token(t0)
    assert_nil sentence.next_token(t3)
    assert_equal t1, sentence.next_token(t0)
    assert_equal t2, sentence.previous_token(t3)
    assert_equal t1, sentence.previous_token(t2)
  end
end
data/test/test_helper.rb
ADDED
data/test/token_test.rb
ADDED
@@ -0,0 +1,41 @@
require "test_helper"

class TokenTest < Minitest::Test
  def test_initialized_ok
    assert token = Word.new(index: 0, text: "some", penn_treebank_tag: "NNP",
      stanford_lemma: "some")
    assert_equal "N", token.top_level_penn_treebank_category
    assert token.content?
  end

  def test_token_equality
    t0 = Word.new(index: 0, text: "some", penn_treebank_tag: "NNP")
    t1 = Word.new(index: 0, text: "more", penn_treebank_tag: "NNP")
    assert t0 == t1
  end

  def test_token_person_ner_value_is_ignored
    assert Token.new(text: "Walter", ner: "PERSON").ignored_entity?
  end

  def test_number_recognition
    text_samples = ["33,333", "20", "30.00", "30,000,000.00"]

    text_samples.each do |text|
      assert_equal Number, Token.token_subclass_from_text(text), text
    end

    # token_subclass_from_text returns a class, so compare against Number
    # directly; "20" is a valid number and does not belong in this list.
    text_samples = ["33A333", "30F00", "30X000;000"]
    text_samples.each do |text|
      refute_equal Number, Token.token_subclass_from_text(text), text
    end
  end

  def test_enclitic_recognition
    text_samples = %w{n't 'nt 'll n’t ’nt ’ll}

    text_samples.each do |text|
      assert_equal Enclitic, Token.token_subclass_from_text(text)
    end
  end
end
@@ -0,0 +1,12 @@
require 'test_helper'

class TreebankTest < Minitest::Test
  @@treebank = Treebank.new(raw_text: 'I put the book down on the coffee table.').parse

  def test_treebank_has_all_the_parsed_parts
    # earlier versions of stanford had some different dependencies
    # ["I put nsubj", "the book det", "book put dobj", "down put prt", "on put prep", "the table det", "coffee table nn", "table on pobj"]
    expected = ["I put nsubj", "the book det", "book put dobj", "down put prt", "the table det", "coffee table nn", "table put prep_on"]
    assert_equal expected, @@treebank.sentences.map(&:token_dependencies).flatten.map{|x| "#{x.dependent.text} #{x.governor.text} #{x.relation}"}
  end
end
data/test/word_test.rb
ADDED
metadata
ADDED
@@ -0,0 +1,150 @@
--- !ruby/object:Gem::Specification
name: corenlp
version: !ruby/object:Gem::Version
  version: 0.0.3
platform: ruby
authors:
- Lengio Corporation
autorequire:
bindir: bin
cert_chain: []
date: 2014-06-23 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '='
      - !ruby/object:Gem::Version
        version: 1.6.1
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '='
      - !ruby/object:Gem::Version
        version: 1.6.1
- !ruby/object:Gem::Dependency
  name: rubyzip
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.5'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.5'
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: pry
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: minitest
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
description: Corenlp is a Ruby gem that uses the Stanford CoreNLP Java tools to parse
  text.
email:
- engineering@leng.io
executables: []
extensions: []
extra_rdoc_files: []
files:
- lib/corenlp/downloader.rb
- lib/corenlp/enclitic.rb
- lib/corenlp/number.rb
- lib/corenlp/punctuation.rb
- lib/corenlp/sentence.rb
- lib/corenlp/token.rb
- lib/corenlp/token_dependency.rb
- lib/corenlp/version.rb
- lib/corenlp/word.rb
- test/downloader_test.rb
- test/enclitic_test.rb
- test/number_test.rb
- test/punctuation_test.rb
- test/sentence_test.rb
- test/test_helper.rb
- test/token_test.rb
- test/treebank_test.rb
- test/word_test.rb
- README.md
homepage: https://github.com/lengio/corenlp
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.0.14
signing_key:
specification_version: 4
summary: Corenlp is a Ruby gem that uses the Stanford CoreNLP Java tools to parse
  text.
test_files: []
has_rdoc: