treetagger-ruby 0.0.1 → 0.1.0
- data/CHANGELOG.rdoc +12 -2
- data/README.rdoc +137 -18
- data/bin/rtt +45 -18
- data/lib/tree_tagger/tagger.rb +217 -11
- data/lib/tree_tagger/version.rb +1 -1
- data/test/test_tagger.rb +154 -0
- data/test/tree-tagger/corrupted_lexicon_file.txt +0 -0
- data/test/tree-tagger/corrupted_model_file.par +0 -0
- data/test/tree-tagger/lexicon_file.txt +0 -0
- data/test/tree-tagger/model_file.par +0 -0
- data/test/tree-tagger/tree-tagger +8 -0
- metadata +17 -6
data/CHANGELOG.rdoc
CHANGED
@@ -1,17 +1,27 @@
|
|
1
1
|
== COMPLETED
|
2
|
+
=== 0.1.0
|
3
|
+
The interface is now clear and stable.
|
4
|
+
|
5
|
+
Tagging of big texts is possible since the TreeTagger is invoked as an
|
6
|
+
external process through a pipe.
|
7
|
+
|
8
|
+
Test suite improved.
|
2
9
|
=== 0.0.1
|
3
10
|
Implemented simple tagging. The TreeTagger is invoked through the env variable.
|
4
11
|
=== 0.0.1.prealpha
|
5
12
|
Created the structure for this project, added documentation and a public repo.
|
6
13
|
|
7
14
|
== PLANNED
|
8
|
-
=== 0.1.0
|
9
|
-
|
10
15
|
=== 0.2.0
|
16
|
+
Better tests. Support for all input types.
|
11
17
|
=== 0.3.0
|
18
|
+
Lemmatizer.
|
12
19
|
=== 0.4.0
|
20
|
+
File based FIFOs.
|
13
21
|
=== 0.5.0
|
22
|
+
File based queues.
|
14
23
|
=== 0.6.0
|
24
|
+
Full featured cmd interface.
|
15
25
|
=== 0.7.0
|
16
26
|
=== 0.8.0
|
17
27
|
=== 0.9.0
|
data/README.rdoc
CHANGED
@@ -1,37 +1,69 @@
|
|
1
1
|
= TreeTagger for Ruby
|
2
2
|
|
3
3
|
* {RubyGems}[http://rubygems.org/gems/treetagger-ruby]
|
4
|
-
*
|
4
|
+
* {Homepage}[http://bu.chsta.be/]
|
5
5
|
* {RTT Project Page}[http://bu.chsta.be/projects/treetagger-ruby/]
|
6
6
|
* {Source Code}[https://github.com/arbox/treetagger-ruby]
|
7
7
|
* {Bug Tracker}[https://github.com/arbox/treetagger-ruby/issues]
|
8
8
|
|
9
9
|
== DESCRIPTION
|
10
|
-
|
11
|
-
|
12
|
-
|
10
|
+
A Ruby based wrapper for the TreeTagger by Helmut Schmid.
|
11
|
+
|
12
|
+
Check it out if you are interested in Natural Language Processing (NLP)
|
13
|
+
and/or Human Language Technology (HLT).
|
14
|
+
|
15
|
+
This library provides comprehensive bindings for the
|
16
|
+
{TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/],
|
17
|
+
a statistical, language-independent POS tagging and chunking software.
|
18
|
+
|
19
|
+
TreeTagger is language agnostic; it will never guess what language you're going
|
20
|
+
to use. It
|
21
|
+
|
22
|
+
TODO:
|
23
|
+
* References to Schmid's publications;
|
24
|
+
* How to use TreeTagger in the wild;
|
25
|
+
* Input and output format, tokenization;
|
26
|
+
* ...
|
27
|
+
* The current German parameter file has been estimated on single-byte encoded data.
|
28
|
+
|
13
29
|
=== Implemented Features
|
14
30
|
Simple tagging.
|
15
31
|
|
32
|
+
Please have a look at the CHANGELOG file for details on implemented and planned
|
33
|
+
features.
|
16
34
|
|
17
35
|
== INSTALLATION
|
18
36
|
Before you install the <tt>treetagger-ruby</tt> package please ensure
|
19
|
-
you have downloaded and
|
37
|
+
you have downloaded and installed the
|
38
|
+
{TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
|
39
|
+
itself.
|
20
40
|
|
21
41
|
The {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
|
22
|
-
is a copyrighted software by Helmut Schmid and
|
23
|
-
|
42
|
+
is a copyrighted software by Helmut Schmid and
|
43
|
+
{IMS}[http://www.ims.uni-stuttgart.de/], please read the license
|
44
|
+
agreement before you download the TreeTagger package and language models.
|
24
45
|
|
25
46
|
After the installation of the <tt>TreeTagger</tt> set the environment variable
|
26
|
-
<tt>
|
27
|
-
Usually this
|
28
|
-
<tt>
|
29
|
-
|
30
|
-
|
47
|
+
<tt>TREETAGGER_BINARY</tt> to the location where the binary <tt>tree-tagger</tt>
|
48
|
+
resides. Usually this binary is located under the <tt>bin</tt> directory in the
|
49
|
+
main installation directory of the <tt>TreeTagger</tt>.
|
50
|
+
|
51
|
+
Also you have to set the variable <tt>TREETAGGER_MODEL</tt> to the location of
|
52
|
+
the appropriate language model you have acquired in the training step.
|
53
|
+
|
54
|
+
For instance you may add the following lines to your <tt>.profile</tt> file:
|
55
|
+
export TREETAGGER_BINARY='/path/to/your/TreeTagger/bin/tree-tagger'
|
56
|
+
export TREETAGGER_MODEL='/path/to/your/TreeTagger/lib/german.par'
|
57
|
+
|
58
|
+
It is convenient to work with a default language model, but you can change
|
59
|
+
it every time during the instantiation of a new tagger instance.
|
60
|
+
|
61
|
+
If you want to feed a lexicon file into your tagger you can do it globally
|
62
|
+
through the environment variable <tt>TREETAGGER_LEXICON</tt>.
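The fallback from explicit options to the environment variables named above can be sketched as follows. This is only an illustration, not the library's code: <tt>resolve_setting</tt> is a hypothetical helper, the <tt>env</tt> parameter is added so the lookup can be tried without touching the real environment, and <tt>ArgumentError</tt> stands in for the library's own error class.

```ruby
# Hypothetical helper: prefer an explicit option, otherwise fall back to
# the matching TREETAGGER_* environment variable.
def resolve_setting(opts, key, env = ENV)
  return opts[key] unless opts[key].nil?

  name = "TREETAGGER_#{key.to_s.upcase}"
  # fetch yields the missing key to the block when it is not set.
  env.fetch(name) do |missing|
    # ArgumentError is a stand-in for the library's own error class.
    raise ArgumentError,
          "Provide a value for <:#{key}> or set the environment variable <#{missing}>!"
  end
end

# A plain Hash plays the role of the environment here.
fake_env = { 'TREETAGGER_MODEL' => '/path/to/your/TreeTagger/lib/german.par' }
puts resolve_setting({ :model => nil }, :model, fake_env)
```

An explicit value passed in the options hash wins over the environment, which matches the README's remark that the default model can be changed per tagger instance.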
|
31
63
|
|
32
64
|
<tt>treetagger-ruby</tt> is provided as a .gem package. Simply install it via
|
33
65
|
{RubyGems}[http://rubygems.org/gems/treetagger-ruby].
|
34
|
-
To install <tt>treetagger-ruby</tt>
|
66
|
+
To install <tt>treetagger-ruby</tt> issue the following command:
|
35
67
|
$ gem install treetagger-ruby
|
36
68
|
|
37
69
|
If you want to do a system wide installation, do this as root
|
@@ -41,14 +73,88 @@ Alternatively use your Gemfile for dependency management.
|
|
41
73
|
|
42
74
|
|
43
75
|
== SYNOPSIS
|
44
|
-
|
76
|
+
=== Basic Usage
|
45
77
|
Basic usage is very simple:
|
46
78
|
$ require 'treetagger-ruby'
|
79
|
+
$ # Instantiate a tagger instance with default values.
|
47
80
|
$ tagger = TreeTagger::Tagger.new
|
48
|
-
$
|
81
|
+
$ # Process an array of tokens.
|
82
|
+
$ tagger.process(%w{Ich gehe in die Schule})
|
83
|
+
$ # Flush the pipeline.
|
84
|
+
$ tagger.flush
|
85
|
+
$ # Get the processed data.
|
86
|
+
$ tagger.get_output
|
87
|
+
|
88
|
+
=== Input Format
|
89
|
+
Basically you have to provide a tokenized sequence with possibly some additional
|
90
|
+
information on lexical classes of tokens and on their probabilities. Every token
|
91
|
+
has to be on a separate line. Due to technical limitations SGML tags
|
92
|
+
(i.e. sequences with a leading < and a trailing >) cannot be valid tokens since
|
93
|
+
they are used internally for delimiting meaningful content from flush tokens.
|
94
|
+
This implies the use of the <tt>-sgml</tt> option, which cannot be changed by the user.
|
95
|
+
It is a limitation of <em>this</em> library. If you do need to process tags,
|
96
|
+
fall back and use the TreeTagger as a standalone program, possibly employing
|
97
|
+
temp files to store your input and output. This behaviour will also be
|
98
|
+
implemented in future versions of <tt>treetagger-ruby</tt>.
|
99
|
+
|
100
|
+
Every token may occur alone on the line or be followed by additional
|
101
|
+
information:
|
102
|
+
* token;
|
103
|
+
* token (\\tab tag)+;
|
104
|
+
* token (\\tab tag \\space lemma)+;
|
105
|
+
* token (\\tab tag \\space probability)+;
|
106
|
+
* token (\\tab tag \\space probability \\space lemma)+.
|
107
|
+
|
108
|
+
Your input may look like the following sentence:
|
109
|
+
Die ART 0.99
|
110
|
+
neuen ADJA neu
|
111
|
+
Hunde NN NP
|
112
|
+
stehen VVFIN 0.99 stehen
|
113
|
+
an
|
114
|
+
den
|
115
|
+
Mauern NN Mauer
|
116
|
+
.
|
117
|
+
|
118
|
+
|
119
|
+
This wrapper accepts the input as <em>String</em> or <em>Array</em>.
|
120
|
+
|
121
|
+
If you want to use strings, you are responsible for the proper delimiters inside
|
122
|
+
the string: <tt>"Die\\tART 0.99\\nneuen\\tADJA neu\\nHunde\\tNN NP\\nstehen\\t
|
123
|
+
VVFIN 0.99 stehen\\nan\\nden\\nMauern\\tNN Mauer\\n.\\n"</tt>
|
124
|
+
Now <tt>treetagger-ruby</tt> does not check your markup for correctness and will
|
125
|
+
possibly report a <tt>TreeTagger::ExternalError</tt> if the TreeTagger process
|
126
|
+
dies due to input errors.
|
127
|
+
|
128
|
+
Using arrays is more convenient since they can be built programmatically.
|
129
|
+
|
130
|
+
Arrays should have the following structure:
|
131
|
+
* ['token', 'token', 'token'];
|
132
|
+
* ['token', ['token', ['POS', 'lemma'], ['POS', 'lemma']], 'token'];
|
133
|
+
* ['token', ['token', ['POS', prob], ['POS', 'prob']], 'token'];
|
134
|
+
* ['token', ['token', ['POS', prob, 'lemma'], ['POS', 'prob', 'lemma']]].
|
135
|
+
|
136
|
+
It is internally converted into the sequence <tt>token\\ntoken\\tPOS lemma\\t
|
137
|
+
POS lemma\\ntoken\\n</tt>, i.e. in the enriched version alternatives are
|
138
|
+
tab separated and entries are blank separated.
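The array-to-string conversion described here can be sketched as below. It is a standalone illustration under assumptions: <tt>entry_to_line</tt> and <tt>tokens_to_input</tt> are hypothetical names, not methods of the library, and the README itself states that the enriched array forms are not supported by this release.

```ruby
# Hypothetical helper: turn one entry into a TreeTagger input line.
# An entry is either a plain token string or
# ['token', ['POS', ...], ['POS', ...]] with one inner array per alternative.
def entry_to_line(entry)
  return entry if entry.is_a?(String)

  token, *alternatives = entry
  # Fields inside an alternative are blank separated,
  # alternatives themselves are tab separated.
  ([token] + alternatives.map { |alt| alt.join(' ') }).join("\t")
end

# Hypothetical helper: one entry per line, newline terminated.
def tokens_to_input(entries)
  entries.map { |entry| entry_to_line(entry) }.join("\n") + "\n"
end

input = ['Die', ['neuen', ['ADJA', 'neu']], 'Hunde']
print tokens_to_input(input)
```

Because <tt>Array#join</tt> calls <tt>to_s</tt> on each element, numeric probabilities and string probabilities end up in the same textual form.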
|
139
|
+
|
140
|
+
Note that probabilities may be strings or integers.
|
141
|
+
|
142
|
+
The lexicon lookup is +not+ implemented for now; that is, the latter three forms
|
143
|
+
of input arrays are not supported yet.
|
144
|
+
|
145
|
+
=== Output Format
|
146
|
+
For now you'll get an array of string elements. However, the precise string
|
147
|
+
structure depends on the cmd arguments you've provided during the tagger
|
148
|
+
instantiation.
|
149
|
+
|
150
|
+
For instance, for the input <tt>["Veruntreute", "die", "AWO", "Spendengeld", "?"]
|
151
|
+
</tt> you'll get the following output with default cmd arguments:
|
152
|
+
|
153
|
+
<tt>["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
|
154
|
+
"Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]</tt>
|
49
155
|
|
50
156
|
See documentation in the TreeTagger::Tagger class for details
|
51
|
-
on particular
|
157
|
+
on particular methods.
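Since the output strings shown in the Output Format section are tab separated, they can be split into fields in a few lines. The following is a sketch that assumes the default options (<tt>-token -lemma</tt>), so each line carries token, tag and lemma; <tt>TaggedToken</tt> and <tt>parse_tagged</tt> are hypothetical names and not part of the library.

```ruby
# Hypothetical container for one tagged token.
TaggedToken = Struct.new(:token, :tag, :lemma)

# Split each "token\ttag\tlemma" line into its three fields.
def parse_tagged(lines)
  lines.map do |line|
    token, tag, lemma = line.split("\t", 3)
    TaggedToken.new(token, tag, lemma)
  end
end

# Sample taken from the README output above.
sample = ["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
          "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]

parse_tagged(sample).each do |t|
  puts "#{t.token} -> #{t.tag} (#{t.lemma})"
end
```

With other command line options the number of fields per line may differ, so the three-field split should be adjusted accordingly.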
|
52
158
|
|
53
159
|
== EXCEPTION HIERARCHY
|
54
160
|
While using TreeTagger you can face the following errors:
|
@@ -56,10 +162,23 @@ While using TreeTagger you can face the following errors:
|
|
56
162
|
* <tt>TreeTagger::RuntimeError</tt>;
|
57
163
|
* <tt>TreeTagger::ExternalError</tt>.
|
58
164
|
|
165
|
+
These three kinds of errors all subclass <tt>TreeTagger::Error</tt>, which
|
166
|
+
in turn is a subclass of <tt>StandardError</tt>. For an end user this means that
|
167
|
+
it is possible to intercept all errors from <em>treetagger-ruby</em> with
|
168
|
+
a simple <tt>rescue</tt> clause.
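The effect of a common base class can be sketched as follows. To keep the snippet runnable without the gem, it defines a stand-in module <tt>TaggerErrorsSketch</tt> that mirrors the hierarchy described above; the module name and <tt>describe_failure</tt> are hypothetical, and <tt>UserError</tt> is assumed to be the third class because it is the one raised in the tagger code.

```ruby
# Stand-in hierarchy mirroring the one described in the README.
module TaggerErrorsSketch
  class Error < StandardError; end
  class UserError < Error; end
  class RuntimeError < Error; end
  class ExternalError < Error; end
end

# Run a block and intercept every error from the hierarchy with one clause.
def describe_failure
  yield
  'ok'
rescue TaggerErrorsSketch::Error => e
  "tagger error: #{e.class.name.split('::').last}: #{e.message}"
end

puts describe_failure { raise TaggerErrorsSketch::ExternalError, 'process died' }
puts describe_failure { :fine }
```

Errors outside the hierarchy, such as a plain <tt>ArgumentError</tt>, are not caught by this clause and propagate to the caller.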
|
169
|
+
|
59
170
|
== SUPPORT
|
60
|
-
If you have questions, bug reports or any suggestions, please drop me an email :)
|
61
|
-
Any help is deeply appreciated!
|
171
|
+
If you have questions, bug reports or any suggestions, please drop me an email :)
|
62
172
|
|
173
|
+
== HOW TO CONTRIBUTE
|
174
|
+
Please contact me and suggest your ideas, report bugs, talk to me, if you want
|
175
|
+
to implement some features in the future releases of this library.
|
176
|
+
|
177
|
+
Please don't feel offended if I cannot accept all your pull requests; I have
|
178
|
+
to review them and find the appropriate time and place in the code base to
|
179
|
+
incorporate your valuable changes.
|
180
|
+
|
181
|
+
Any help is deeply appreciated!
|
63
182
|
== CHANGELOG
|
64
183
|
For details on future plans and work in progress see CHANGELOG.
|
65
184
|
|
data/bin/rtt
CHANGED
@@ -8,24 +8,51 @@ options = TreeTagger::ARGVParser.parse(ARGV)
|
|
8
8
|
|
9
9
|
tagger = TreeTagger::Tagger.new(options)
|
10
10
|
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
11
|
+
# Adding some colors to the output.
|
12
|
+
# Using ANSI escape codes.
|
13
|
+
red = "\e[31m"
|
14
|
+
green = "\e[32m"
|
15
|
+
blue = "\e[34m"
|
16
|
+
reset = "\e[0m"
|
17
|
+
|
18
|
+
reader = Thread.new do
|
19
|
+
beginning = true
|
20
|
+
loop do
|
21
|
+
result_array = tagger.get_output
|
22
|
+
if result_array.nil?
|
23
|
+
if beginning
|
24
|
+
sleep(0.1)
|
25
|
+
next
|
26
|
+
else
|
27
|
+
break
|
28
|
+
end
|
29
|
+
end
|
30
|
+
sleep(0.2) # Is useful!
|
31
|
+
|
32
|
+
beginning = false
|
33
|
+
result_array.each do |tuple|
|
34
|
+
tuple = tuple.split("\t")
|
35
|
+
|
36
|
+
if $stdout.tty?
|
37
|
+
tuple[0].insert(0, red).insert(-1, reset) if tuple[0]
|
38
|
+
tuple[1].insert(0, green).insert(-1, reset) if tuple[1]
|
39
|
+
tuple[2].insert(0, blue).insert(-1, reset) if tuple[2]
|
40
|
+
end
|
41
|
+
|
42
|
+
# [['token', 'tag', 'lemma'], ['token', 'tag', 'lemma']]
|
43
|
+
$stdout.puts tuple.join("\t")
|
27
44
|
end
|
28
|
-
|
29
|
-
$stdout.puts tuple.join("\t")
|
30
45
|
end
|
31
46
|
end
|
47
|
+
|
48
|
+
# Read all lines from STDIN or from files.
|
49
|
+
while line = ARGF.gets
|
50
|
+
# Invoke tokenizer somehow here.
|
51
|
+
tagger.process(line)
|
52
|
+
end
|
53
|
+
|
54
|
+
tagger.flush
|
55
|
+
|
56
|
+
reader.join
|
57
|
+
|
58
|
+
STDOUT.flush
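The per-field coloring done by the script above can be isolated into a small function. This is a sketch rather than the script's code: <tt>FIELD_COLORS</tt> and <tt>colorize_fields</tt> are hypothetical names, and unlike the script it builds new strings instead of mutating the fields in place.

```ruby
# ANSI escape codes in field order: token (red), tag (green), lemma (blue).
FIELD_COLORS = ["\e[31m", "\e[32m", "\e[34m"].freeze
RESET = "\e[0m"

# Wrap each tab separated field in its color, but only when enabled
# (by default: when the output goes to a terminal).
def colorize_fields(line, enabled = $stdout.tty?)
  return line unless enabled

  colored = line.split("\t").each_with_index.map do |field, index|
    color = FIELD_COLORS[index]
    color ? "#{color}#{field}#{RESET}" : field
  end
  colored.join("\t")
end

puts colorize_fields("die\tART\td")
```

Checking <tt>$stdout.tty?</tt> keeps escape codes out of files and pipes, so the plain tab separated format survives redirection.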
|
data/lib/tree_tagger/tagger.rb
CHANGED
@@ -1,22 +1,228 @@
|
|
1
1
|
# -*- encoding: utf-8 -*-
|
2
|
+
require 'thread'
|
3
|
+
require 'tree_tagger/error'
|
2
4
|
|
5
|
+
=begin
|
6
|
+
TODO:
|
7
|
+
- Observe the status of the reader thread.
|
8
|
+
- Control the status of the pipe and recreate it.
|
9
|
+
- Handle IO errors.
|
10
|
+
- Handle errors while allocating the TT object.
|
11
|
+
- Update the flush sentence, make it shorter.
|
12
|
+
- Store the queue on a persistent medium, not in the memory.
|
13
|
+
- Properly set the $ORS for all platforms.
|
14
|
+
=end
|
15
|
+
# :main: README.rdoc
|
16
|
+
# :title: TreeTagger - Ruby based Wrapper for the TreeTagger by Helmut Schmid
|
17
|
+
# Module comment
|
3
18
|
module TreeTagger
|
19
|
+
# Class comment
|
4
20
|
class Tagger
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
21
|
+
|
22
|
+
BEGIN_MARKER = '<BEGIN_OF_THE_TT_INPUT>'
|
23
|
+
END_MARKER = '<END_OF_THE_TT_INPUT>'
|
24
|
+
# TT seems to hold only the last three tokens in the buffer.
|
25
|
+
# The flushing sentence can be shortened down to this size.
|
26
|
+
FLUSH_SENTENCE = "Das\nist\nein\nTestsatz\n,\num\ndas\nStossen\nder\nDaten\nsicherzustellen\n."
|
27
|
+
|
28
|
+
# Initializer comment
|
29
|
+
def initialize(opts = {
|
30
|
+
:binary => nil,
|
31
|
+
:model => nil,
|
32
|
+
:lexicon => nil,
|
33
|
+
:options => '-token -lemma -sgml -quiet',
|
34
|
+
:replace_blanks => true,
|
35
|
+
:blank_tag => '<BLANK>',
|
36
|
+
:lookup => false
|
11
37
|
}
|
12
38
|
)
|
13
|
-
|
14
|
-
@
|
39
|
+
|
40
|
+
@opts = validate_options(opts)
|
41
|
+
@blank_tag = @opts[:blank_tag]
|
42
|
+
@cmdline = "#{@opts[:binary]} #{@opts[:options]} #{@opts[:model]}"
|
43
|
+
|
44
|
+
@queue = Queue.new
|
45
|
+
@pipe = new_pipe
|
46
|
+
@pipe.sync = true
|
47
|
+
@reader = new_reader
|
48
|
+
@inside_output = false
|
49
|
+
@inside_input = false
|
50
|
+
@enqueued_tokens = 0
|
51
|
+
@mutex = Mutex.new
|
52
|
+
@queue_mutex = Mutex.new
|
53
|
+
# sleep(1) # Don't know if it's useful, no problems before.
|
54
|
+
end
|
55
|
+
|
56
|
+
# Send the string to the TreeTagger.
|
57
|
+
def process(input)
|
58
|
+
|
59
|
+
str = convert(input)
|
60
|
+
# Sanitize strings.
|
61
|
+
str = sanitize(str)
|
62
|
+
# Mark the beginning of the text.
|
63
|
+
if not @inside_input
|
64
|
+
str = "#{BEGIN_MARKER}\n#{str}\n"
|
65
|
+
@inside_input = true
|
66
|
+
else
|
67
|
+
str = str + "\n"
|
68
|
+
end
|
69
|
+
@mutex.synchronize { @enqueued_tokens += 1 }
|
70
|
+
@pipe.print(str)
|
15
71
|
end
|
16
|
-
|
17
|
-
|
18
|
-
|
72
|
+
|
73
|
+
# Get processed tokens back.
|
74
|
+
# This method is not blocking. If some tokens have been sent,
|
75
|
+
# but not received from the pipe yet, it returns an empty array.
|
76
|
+
# If all sent tokens are in the queue it returns all of them.
|
77
|
+
# If no more tokens are awaited it returns <nil>.
|
78
|
+
def get_output
|
79
|
+
output = []
|
80
|
+
tokens = 0
|
81
|
+
@queue_mutex.synchronize do
|
82
|
+
tokens = @queue.size
|
83
|
+
tokens.times { output << @queue.shift }
|
84
|
+
end
|
85
|
+
@mutex.synchronize do
|
86
|
+
@enqueued_tokens -= tokens
|
87
|
+
end
|
88
|
+
|
89
|
+
# Nil if nothing to process in the pipe.
|
90
|
+
# Possible only after flushing the pipe.
|
91
|
+
if @enqueued_tokens > 0
|
92
|
+
output
|
93
|
+
else
|
94
|
+
output.any? ? output : nil
|
95
|
+
end
|
96
|
+
end
|
97
|
+
|
98
|
+
# Get the rest of the text back.
|
99
|
+
# TT holds some meaningful parts in the buffer.
|
100
|
+
def flush
|
101
|
+
@inside_input = false
|
102
|
+
str = "#{END_MARKER}\n#{FLUSH_SENTENCE}\n"
|
103
|
+
@pipe.print(str)
|
104
|
+
# Here invoke the reader thread to ensure
|
105
|
+
# all output has been read.
|
106
|
+
#@reader.run
|
107
|
+
end
|
108
|
+
|
109
|
+
private
|
110
|
+
# Return the options hash after validation.
|
111
|
+
# {
|
112
|
+
# :binary => nil,
|
113
|
+
# :model => nil,
|
114
|
+
# :lexicon => nil,
|
115
|
+
# :options => '-token -lemma -sgml -quiet',
|
116
|
+
# :replace_blanks => true,
|
117
|
+
# :blank_tag => '<BLANK>',
|
118
|
+
# :lookup => false
|
119
|
+
# }
|
120
|
+
def validate_options(opts)
|
121
|
+
# Check if <:lookup> is boolean.
|
122
|
+
|
123
|
+
# Check if <:replace_blanks> is boolean.
|
124
|
+
|
125
|
+
# Check if <:options> is a string.
|
126
|
+
|
127
|
+
# Check if <:options> contains only allowed values.
|
128
|
+
|
129
|
+
# Ensure that <:options> contains <-sgml>.
|
130
|
+
|
131
|
+
# Check if <:blank_tag> is a string.
|
132
|
+
|
133
|
+
# Ensure that <:blank_tag> is a valid SGML sequence.
|
134
|
+
|
135
|
+
# Set the model and binary paths if not provided.
|
136
|
+
[:binary, :model].each do |key|
|
137
|
+
if opts[key].nil?
|
138
|
+
opts[key] = ENV.fetch("TREETAGGER_#{key.to_s.upcase}") do |missing|
|
139
|
+
fail UserError, "Provide a value for <:#{key}>" +
|
140
|
+
" or set the environment variable <#{missing}>!"
|
141
|
+
end
|
142
|
+
end
|
143
|
+
end
|
144
|
+
|
145
|
+
# Set the lexicon path if not provided but requested.
|
146
|
+
if opts[:lookup] && opts[:lexicon].nil?
|
147
|
+
opts[:lexicon] = ENV.fetch('TREETAGGER_LEXICON') do |missing|
|
148
|
+
fail UserError, 'Provide a value for <:lexicon>' +
|
149
|
+
' or set the environment variable <TREETAGGER_LEXICON>!'
|
150
|
+
end
|
151
|
+
end
|
152
|
+
|
153
|
+
# Check for existence and readability of external files:
|
154
|
+
# * binary;
|
155
|
+
# * model;
|
156
|
+
# * lexicon (if applicable).
|
157
|
+
|
158
|
+
opts
|
159
|
+
end
|
160
|
+
|
161
|
+
# Starts the reader thread.
|
162
|
+
def new_reader
|
163
|
+
Thread.new do
|
164
|
+
while line = @pipe.gets
|
165
|
+
# The output strings must not contain "\n".
|
166
|
+
line.chomp!
|
167
|
+
case line
|
168
|
+
when BEGIN_MARKER
|
169
|
+
@inside_output = true
|
170
|
+
$stderr.puts 'Found the begin marker.' if $DEBUG
|
171
|
+
when END_MARKER
|
172
|
+
@inside_output = false
|
173
|
+
$stderr.puts 'Found the end marker.' if $DEBUG
|
174
|
+
else
|
175
|
+
if @inside_output
|
176
|
+
@queue_mutex.synchronize { @queue << line }
|
177
|
+
$stderr.puts "<#{line}> added to the queue." if $DEBUG
|
178
|
+
end
|
179
|
+
end
|
180
|
+
end
|
181
|
+
end # thread
|
182
|
+
end # new_reader
|
183
|
+
|
184
|
+
# This method may be utilized to keep the TT process alive.
|
185
|
+
# Check here if TT returns the exit code 1 in case of invalid options.
|
186
|
+
def new_pipe
|
187
|
+
IO.popen(@cmdline, 'r+')
|
188
|
+
end
|
189
|
+
|
190
|
+
# Convert token arrays to delimited strings.
|
191
|
+
def convert(input)
|
192
|
+
unless input.is_a?(Array) || input.is_a?(String)
|
193
|
+
fail UserError, "Not a valid input format: <#{input.class}>!"
|
194
|
+
end
|
195
|
+
|
196
|
+
if input.empty?
|
197
|
+
fail UserError, "Empty input is not allowed!"
|
198
|
+
end
|
199
|
+
|
200
|
+
if input.is_a?(Array)
|
201
|
+
input.each do |el|
|
202
|
+
unless el.is_a?(String)
|
203
|
+
fail UserError, "Input elements should be strings!"
|
204
|
+
end
|
205
|
+
el = sanitize(el)
|
206
|
+
end
|
207
|
+
input = input.join("\n")
|
208
|
+
end
|
209
|
+
|
210
|
+
input
|
211
|
+
end
|
212
|
+
|
213
|
+
def sanitize(str)
|
214
|
+
line = str.strip
|
215
|
+
if line.empty?
|
216
|
+
line = @blank_tag
|
217
|
+
end
|
218
|
+
|
219
|
+
line
|
19
220
|
end
|
20
221
|
end # class
|
21
222
|
end # module
|
22
223
|
|
224
|
+
__END__
|
225
|
+
- tokenization
|
226
|
+
- lexicon lookup
|
227
|
+
- tagging
|
228
|
+
- error correction
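The comments on <tt>get_output</tt> in the file above describe a non-blocking contract: an empty array while results are pending, an array of lines when some are available, and <tt>nil</tt> when nothing more is awaited. A consumer of that contract can be sketched as below. <tt>FakeTagger</tt> is a hypothetical stand-in that only replays canned results, and the loop assumes polling starts after input has been sent; <tt>bin/rtt</tt> handles an early <tt>nil</tt> with its <tt>beginning</tt> flag instead.

```ruby
# Hypothetical stand-in mimicking the documented get_output contract.
class FakeTagger
  def initialize(batches)
    @batches = batches.dup
  end

  # Returns the next canned result: [] (pending), an array of lines,
  # or nil once the canned results are exhausted.
  def get_output
    @batches.shift
  end
end

# Poll until nil signals that nothing is left to read.
def collect_output(tagger, pause = 0.01)
  lines = []
  loop do
    chunk = tagger.get_output
    break if chunk.nil?

    if chunk.empty?
      sleep(pause) # Results are still pending, try again shortly.
      next
    end
    lines.concat(chunk)
  end
  lines
end

tagger = FakeTagger.new([[], ["Ich\tPPER\tich"], [], ["gehe\tVVFIN\tgehen"]])
puts collect_output(tagger)
```

A short pause between polls avoids a busy loop while the external process is still working.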
|
data/lib/tree_tagger/version.rb
CHANGED
data/test/test_tagger.rb
ADDED
@@ -0,0 +1,154 @@
|
|
1
|
+
require 'test/unit'
|
2
|
+
require 'tree_tagger/tagger'
|
3
|
+
require 'tree_tagger/error'
|
4
|
+
require 'stringio'
|
5
|
+
|
6
|
+
class TestTagger < Test::Unit::TestCase
|
7
|
+
|
8
|
+
PUBLIC_METHODS = [:process,
|
9
|
+
:get_output,
|
10
|
+
:flush
|
11
|
+
]
|
12
|
+
def setup
|
13
|
+
# ENV['TREETAGGER_BINARY'] = '/opt/TreeTagger/bin/tree-tagger'
|
14
|
+
# ENV['TREETAGGER_MODEL'] = '/opt/TreeTagger/lib/german.par'
|
15
|
+
# ENV['TREETAGGER_LEXICON'] = '/opt/TreeTagger/lib/german-lexicon.txt'
|
16
|
+
|
17
|
+
ENV['TREETAGGER_BINARY'] = 'test/tree-tagger/tree-tagger'
|
18
|
+
ENV['TREETAGGER_MODEL'] = 'test/tree-tagger/model_file.par'
|
19
|
+
ENV['TREETAGGER_LEXICON'] = 'test/tree-tagger/lexicon_file.txt'
|
20
|
+
|
21
|
+
params = {} # dummy for now
|
22
|
+
@tagger = TreeTagger::Tagger.new
|
23
|
+
end
|
24
|
+
|
25
|
+
def teardown
|
26
|
+
end
|
27
|
+
|
28
|
+
# It should have the following constants set.
|
29
|
+
def test_constants
|
30
|
+
end
|
31
|
+
|
32
|
+
# It should respond to valid methods
|
33
|
+
def test_public_methods
|
34
|
+
PUBLIC_METHODS.each do |m|
|
35
|
+
assert_respond_to(@tagger, m)
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
def test_tagger
|
40
|
+
end
|
41
|
+
|
42
|
+
# It should accept only arrays and strings.
|
43
|
+
def test_input_for_its_class
|
44
|
+
assert_nothing_raised do
|
45
|
+
@tagger.process 'Ich\ngehe\nin\ndie\nSchule\n.\n'
|
46
|
+
@tagger.process %w{Ich gehe in die Schule .}
|
47
|
+
end
|
48
|
+
end
|
49
|
+
|
50
|
+
# It should reject non-string and non-array elements.
|
51
|
+
def test_rejecting_invalid_input
|
52
|
+
[{}, :input, 1, 1.0, Time.new].each do |input|
|
53
|
+
assert_raise(TreeTagger::UserError) do
|
54
|
+
@tagger.process(input)
|
55
|
+
end
|
56
|
+
end
|
57
|
+
end
|
58
|
+
|
59
|
+
# It should reject empty input.
|
60
|
+
def test_for_empty_input
|
61
|
+
['', []].each do |input|
|
62
|
+
assert_raise(TreeTagger::UserError) do
|
63
|
+
@tagger.process(input)
|
64
|
+
end
|
65
|
+
end
|
66
|
+
end
|
67
|
+
|
68
|
+
# It should reject arrays with wrong elements.
|
69
|
+
def test_for_elements_of_arrays
|
70
|
+
|
71
|
+
end
|
72
|
+
|
73
|
+
# It should accept valid input.
|
74
|
+
def test_accepting_vaild_input
|
75
|
+
input = ''
|
76
|
+
end
|
77
|
+
|
78
|
+
# It should accept only valid input.
|
79
|
+
def test_input_validity
|
80
|
+
['', [], {}, :input, [:one, :two]].each do |input|
|
81
|
+
assert_raise(TreeTagger::UserError) do
|
82
|
+
@tagger.process(input)
|
83
|
+
end
|
84
|
+
end
|
85
|
+
end
|
86
|
+
|
87
|
+
# It should instantiate a tagger instance only with valid options.
|
88
|
+
def test_for_binary_presence
|
89
|
+
ENV.delete('TREETAGGER_BINARY')
|
90
|
+
assert_raise(TreeTagger::UserError) do
|
91
|
+
TreeTagger::Tagger.new
|
92
|
+
end
|
93
|
+
end
|
94
|
+
|
95
|
+
# It should instantiate a tagger instance only with valid options.
|
96
|
+
def test_for_model_presence
|
97
|
+
ENV.delete('TREETAGGER_MODEL')
|
98
|
+
assert_raise(TreeTagger::UserError) do
|
99
|
+
TreeTagger::Tagger.new
|
100
|
+
end
|
101
|
+
|
102
|
+
end
|
103
|
+
|
104
|
+
# It should instantiate a tagger instance only with valid options.
|
105
|
+
def test_for_lexicon_presence
|
106
|
+
ENV.delete('TREETAGGER_LEXICON')
|
107
|
+
assert_raise(TreeTagger::UserError) do
|
108
|
+
TreeTagger::Tagger.new({:lookup => true, :options => '-quiet -sgml'})
|
109
|
+
end
|
110
|
+
end
|
111
|
+
|
112
|
+
# It should reject a non-boolean value for <:lookup>.
|
113
|
+
def test_rejecting_lookup_values
|
114
|
+
assert_raise(TreeTagger::UserError) do
|
115
|
+
TreeTagger::Tagger.new({:lookup => 'true', :options => '-quiet'})
|
116
|
+
end
|
117
|
+
end
|
118
|
+
|
119
|
+
# It should reject a non-boolean value for <:replace_blanks>.
|
120
|
+
def test_rejecting_blank_values
|
121
|
+
assert_raise(TreeTagger::UserError) do
|
122
|
+
TreeTagger::Tagger.new({:replace_blanks => 'true'})
|
123
|
+
end
|
124
|
+
end
|
125
|
+
|
126
|
+
# It should reject a non-string value for <:options>.
|
127
|
+
def test_rejecting_option_values
|
128
|
+
assert_raise(TreeTagger::UserError) do
|
129
|
+
TreeTagger::Tagger.new({:options => :quiet})
|
130
|
+
end
|
131
|
+
end
|
132
|
+
|
133
|
+
# It should reject invalid options for TreeTagger inside <:options>.
|
134
|
+
def test_rejecting_invalid_arguments
|
135
|
+
flunk 'Not implemented yet!'
|
136
|
+
end
|
137
|
+
|
138
|
+
# It should ensure the presence of the <-sgml> argument.
|
139
|
+
def test_presence_of_sgml_argument
|
140
|
+
flunk 'Not implemented yet!'
|
141
|
+
end
|
142
|
+
|
143
|
+
# It should reject a non-string value for <:blank_tag>.
|
144
|
+
def test_rejecting_blanktag_values
|
145
|
+
assert_raise(TreeTagger::UserError) do
|
146
|
+
TreeTagger::Tagger.new({:blank_tag => :blank})
|
147
|
+
end
|
148
|
+
end
|
149
|
+
|
150
|
+
# It should ensure that <:blank_tag> is a valid SGML sequence.
|
151
|
+
def test_sgml_form
|
152
|
+
flunk 'Not implemented yet!'
|
153
|
+
end
|
154
|
+
end
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
metadata
CHANGED
@@ -1,13 +1,13 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: treetagger-ruby
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
hash:
|
4
|
+
hash: 27
|
5
5
|
prerelease:
|
6
6
|
segments:
|
7
7
|
- 0
|
8
|
-
- 0
|
9
8
|
- 1
|
10
|
-
|
9
|
+
- 0
|
10
|
+
version: 0.1.0
|
11
11
|
platform: ruby
|
12
12
|
authors:
|
13
13
|
- Andrei Beliankou
|
@@ -15,7 +15,7 @@ autorequire:
|
|
15
15
|
bindir: bin
|
16
16
|
cert_chain: []
|
17
17
|
|
18
|
-
date:
|
18
|
+
date: 2012-02-14 00:00:00 Z
|
19
19
|
dependencies:
|
20
20
|
- !ruby/object:Gem::Dependency
|
21
21
|
name: rdoc
|
@@ -95,6 +95,12 @@ files:
|
|
95
95
|
- LICENCE.rdoc
|
96
96
|
- CHANGELOG.rdoc
|
97
97
|
- .yardopts
|
98
|
+
- test/test_tagger.rb
|
99
|
+
- test/tree-tagger/corrupted_lexicon_file.txt
|
100
|
+
- test/tree-tagger/lexicon_file.txt
|
101
|
+
- test/tree-tagger/corrupted_model_file.par
|
102
|
+
- test/tree-tagger/model_file.par
|
103
|
+
- test/tree-tagger/tree-tagger
|
98
104
|
- bin/rtt
|
99
105
|
homepage: http://www.uni-trier.de/index.php?id=34451
|
100
106
|
licenses: []
|
@@ -132,6 +138,11 @@ rubygems_version: 1.8.10
|
|
132
138
|
signing_key:
|
133
139
|
specification_version: 3
|
134
140
|
summary: A wrapper for the TreeTagger by Helmut Schmid.
|
135
|
-
test_files:
|
136
|
-
|
141
|
+
test_files:
|
142
|
+
- test/test_tagger.rb
|
143
|
+
- test/tree-tagger/corrupted_lexicon_file.txt
|
144
|
+
- test/tree-tagger/lexicon_file.txt
|
145
|
+
- test/tree-tagger/corrupted_model_file.par
|
146
|
+
- test/tree-tagger/model_file.par
|
147
|
+
- test/tree-tagger/tree-tagger
|
137
148
|
has_rdoc:
|