treetagger-ruby 0.0.1 → 0.1.0
- data/CHANGELOG.rdoc +12 -2
- data/README.rdoc +137 -18
- data/bin/rtt +45 -18
- data/lib/tree_tagger/tagger.rb +217 -11
- data/lib/tree_tagger/version.rb +1 -1
- data/test/test_tagger.rb +154 -0
- data/test/tree-tagger/corrupted_lexicon_file.txt +0 -0
- data/test/tree-tagger/corrupted_model_file.par +0 -0
- data/test/tree-tagger/lexicon_file.txt +0 -0
- data/test/tree-tagger/model_file.par +0 -0
- data/test/tree-tagger/tree-tagger +8 -0
- metadata +17 -6
data/CHANGELOG.rdoc
CHANGED
@@ -1,17 +1,27 @@
|
|
1
1
|
== COMPLETED
|
2
|
+
=== 0.1.0
|
3
|
+
The interface is now clear and stable.
|
4
|
+
|
5
|
+
Tagging of big texts is possible since the TreeTagger is invoked as an
|
6
|
+
external process through a pipe.
|
7
|
+
|
8
|
+
Test suite improved.
|
2
9
|
=== 0.0.1
|
3
10
|
Implemented simple tagging. The TreeTagger is invoked through the environment variable.
|
4
11
|
=== 0.0.1.prealpha
|
5
12
|
Created the structure for this project, added documentation and a public repo.
|
6
13
|
|
7
14
|
== PLANNED
|
8
|
-
=== 0.1.0
|
9
|
-
|
10
15
|
=== 0.2.0
|
16
|
+
Better tests. Support for all input types.
|
11
17
|
=== 0.3.0
|
18
|
+
Lemmatizer.
|
12
19
|
=== 0.4.0
|
20
|
+
File based FIFOs.
|
13
21
|
=== 0.5.0
|
22
|
+
File based queues.
|
14
23
|
=== 0.6.0
|
24
|
+
Full featured cmd interface.
|
15
25
|
=== 0.7.0
|
16
26
|
=== 0.8.0
|
17
27
|
=== 0.9.0
|
data/README.rdoc
CHANGED
@@ -1,37 +1,69 @@
|
|
1
1
|
= TreeTagger for Ruby
|
2
2
|
|
3
3
|
* {RubyGems}[http://rubygems.org/gems/treetagger-ruby]
|
4
|
-
*
|
4
|
+
* {Homepage}[http://bu.chsta.be/]
|
5
5
|
* {RTT Project Page}[http://bu.chsta.be/projects/treetagger-ruby/]
|
6
6
|
* {Source Code}[https://github.com/arbox/treetagger-ruby]
|
7
7
|
* {Bug Tracker}[https://github.com/arbox/treetagger-ruby/issues]
|
8
8
|
|
9
9
|
== DESCRIPTION
|
10
|
-
|
11
|
-
|
12
|
-
|
10
|
+
A Ruby based wrapper for the TreeTagger by Helmut Schmid.
|
11
|
+
|
12
|
+
Check it out if you are interested in Natural Language Processing (NLP)
|
13
|
+
and/or Human Language Technology (HLT).
|
14
|
+
|
15
|
+
This library provides comprehensive bindings for the
|
16
|
+
{TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/],
|
17
|
+
a statistical, language-independent POS tagging and chunking software.
|
18
|
+
|
19
|
+
TreeTagger is language agnostic: it will never guess what language you're going
|
20
|
+
to use. It
|
21
|
+
|
22
|
+
TODO:
|
23
|
+
* References to Schmid's publications;
|
24
|
+
* How to use TreeTagger in the wild;
|
25
|
+
* Input and output format, tokenization;
|
26
|
+
* ...
|
27
|
+
* The current German parameter file has been estimated on single-byte encoded data.
|
28
|
+
|
13
29
|
=== Implemented Features
|
14
30
|
Simple tagging.
|
15
31
|
|
32
|
+
Please have a look at the CHANGELOG file for details on implemented and planned
|
33
|
+
features.
|
16
34
|
|
17
35
|
== INSTALLATION
|
18
36
|
Before you install the <tt>treetagger-ruby</tt> package please ensure
|
19
|
-
you have downloaded and
|
37
|
+
you have downloaded and installed the
|
38
|
+
{TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
|
39
|
+
itself.
|
20
40
|
|
21
41
|
The {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
|
22
|
-
is a copyrighted software by Helmut Schmid and
|
23
|
-
|
42
|
+
is copyrighted software by Helmut Schmid and
|
43
|
+
{IMS}[http://www.ims.uni-stuttgart.de/], please read the license
|
44
|
+
agreement before you download the TreeTagger package and language models.
|
24
45
|
|
25
46
|
After the installation of the <tt>TreeTagger</tt> set the environment variable
|
26
|
-
<tt>
|
27
|
-
Usually this
|
28
|
-
<tt>
|
29
|
-
|
30
|
-
|
47
|
+
<tt>TREETAGGER_BINARY</tt> to the location where the binary <tt>tree-tagger</tt>
|
48
|
+
resides. Usually this binary is located under the <tt>bin</tt> directory in the
|
49
|
+
main installation directory of the <tt>TreeTagger</tt>.
|
50
|
+
|
51
|
+
Also you have to set the variable <tt>TREETAGGER_MODEL</tt> to the location of
|
52
|
+
the appropriate language model you have acquired in the training step.
|
53
|
+
|
54
|
+
For instance you may add the following lines to your <tt>.profile</tt> file:
|
55
|
+
export TREETAGGER_BINARY='/path/to/your/TreeTagger/bin/tree-tagger'
|
56
|
+
export TREETAGGER_MODEL='/path/to/your/TreeTagger/lib/german.par'
|
57
|
+
|
58
|
+
It is convenient to work with a default language model, but you can change
|
59
|
+
it every time during the instantiation of a new tagger instance.
|
60
|
+
|
61
|
+
If you want to feed a lexicon file into your tagger you can do it globally
|
62
|
+
through the environment variable <tt>TREETAGGER_LEXICON</tt>.
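The lookup order described above can be sketched as follows. This is only an illustration, assuming an explicitly passed value takes precedence over the environment; <tt>resolve_setting</tt> is a hypothetical helper and not part of <tt>treetagger-ruby</tt>.

```ruby
# Hypothetical helper (not part of the library): prefer an explicit value,
# otherwise fall back to the matching TREETAGGER_* environment variable,
# and fail loudly if neither is available.
def resolve_setting(key, explicit = nil, env = ENV)
  return explicit unless explicit.nil?

  name = "TREETAGGER_#{key.to_s.upcase}"
  # The block runs only when the variable is missing; it receives the name.
  env.fetch(name) do |missing|
    raise ArgumentError, "Provide a value for <:#{key}> or set <#{missing}>!"
  end
end
```

For instance, <tt>resolve_setting(:model)</tt> would return the value of <tt>TREETAGGER_MODEL</tt>, while <tt>resolve_setting(:model, '/other/english.par')</tt> would return the explicit path.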
|
31
63
|
|
32
64
|
<tt>treetagger-ruby</tt> is provided as a .gem package. Simply install it via
|
33
65
|
{RubyGems}[http://rubygems.org/gems/treetagger-ruby].
|
34
|
-
To install <tt>treetagger-ruby</tt>
|
66
|
+
To install <tt>treetagger-ruby</tt> issue the following command:
|
35
67
|
$ gem install treetagger-ruby
|
36
68
|
|
37
69
|
If you want to do a system wide installation, do this as root
|
@@ -41,14 +73,88 @@ Alternatively use your Gemfile for dependency management.
|
|
41
73
|
|
42
74
|
|
43
75
|
== SYNOPSIS
|
44
|
-
|
76
|
+
=== Basic Usage
|
45
77
|
Basic usage is very simple:
|
46
78
|
$ require 'treetagger-ruby'
|
79
|
+
$ # Instantiate a tagger instance with default values.
|
47
80
|
$ tagger = TreeTagger::Tagger.new
|
48
|
-
$
|
81
|
+
$ # Process an array of tokens.
|
82
|
+
$ tagger.process(%w{Ich gehe in die Schule})
|
83
|
+
$ # Flush the pipeline.
|
84
|
+
$ tagger.flush
|
85
|
+
$ # Get the processed data.
|
86
|
+
$ tagger.get_output
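Since <tt>get_output</tt> does not block, a caller usually has to poll it. The following is a minimal sketch, assuming that an empty array means "not ready yet" and <tt>nil</tt> means "nothing more is awaited"; the helper <tt>drain</tt> and its <tt>max_polls</tt> and <tt>delay</tt> parameters are hypothetical and not part of the library.

```ruby
# Hypothetical helper: collect tagged lines from a tagger-like object.
# nil before any output is treated as "still starting up"; nil after some
# output has arrived is treated as "done". max_polls and delay are
# illustrative safety limits, not library options.
def drain(tagger, max_polls: 100, delay: 0.1)
  collected = []
  max_polls.times do
    batch = tagger.get_output
    break if batch.nil? && !collected.empty?

    if batch.nil? || batch.empty?
      sleep(delay)
      next
    end
    collected.concat(batch)
  end
  collected
end
```

It would be called after <tt>tagger.flush</tt>, for example <tt>lines = drain(tagger)</tt>.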
|
87
|
+
|
88
|
+
=== Input Format
|
89
|
+
Basically you have to provide a tokenized sequence with possibly some additional
|
90
|
+
information on lexical classes of tokens and on their probabilities. Every token
|
91
|
+
has to be on a separate line. Due to technical limitations SGML tags
|
92
|
+
(i.e. sequences with a leading < and a trailing >) cannot be valid tokens since
|
93
|
+
they are used internally for delimiting meaningful content from flush tokens.
|
94
|
+
This implies the use of the <tt>-sgml</tt> option, which cannot be changed by the user.
|
95
|
+
It is a limitation of <em>this</em> library. If you do need to process tags,
|
96
|
+
fall back to using the TreeTagger as a standalone program, possibly employing
|
97
|
+
temp files to store your input and output. This behaviour will also be
|
98
|
+
implemented in future versions of <tt>treetagger-ruby</tt>.
|
99
|
+
|
100
|
+
Every token may occur alone on a line or be followed by additional
|
101
|
+
information:
|
102
|
+
* token;
|
103
|
+
* token (\\tab tag)+;
|
104
|
+
* token (\\tab tag \\space lemma)+;
|
105
|
+
* token (\\tab tag \\space probability)+;
|
106
|
+
* token (\\tab tag \\space probability \\space lemma)+.
|
107
|
+
|
108
|
+
Your input may look like the following sentence:
|
109
|
+
Die ART 0.99
|
110
|
+
neuen ADJA neu
|
111
|
+
Hunde NN NP
|
112
|
+
stehen VVFIN 0.99 stehen
|
113
|
+
an
|
114
|
+
den
|
115
|
+
Mauern NN Mauer
|
116
|
+
.
|
117
|
+
|
118
|
+
|
119
|
+
This wrapper accepts the input as <em>String</em> or <em>Array</em>.
|
120
|
+
|
121
|
+
If you want to use strings, you are responsible for the proper delimiters inside
|
122
|
+
the string: <tt>"Die\\tART 0.99\\nneuen\\tADJA neu\\nHunde\\tNN NP\\nstehen\\t
|
123
|
+
VVFIN 0.99 stehen\\nan\\nden\\nMauern\\tNN Mauer\\n.\\n"</tt>
|
124
|
+
Currently <tt>treetagger-ruby</tt> does not check your markup for correctness and will
|
125
|
+
possibly report a <tt>TreeTagger::ExternalError</tt> if the TreeTagger process
|
126
|
+
dies due to input errors.
|
127
|
+
|
128
|
+
Using arrays is more convenient since they can be built programmatically.
|
129
|
+
|
130
|
+
Arrays should have the following structure:
|
131
|
+
* ['token', 'token', 'token'];
|
132
|
+
* ['token', ['token', ['POS', 'lemma'], ['POS', 'lemma']], 'token'];
|
133
|
+
* ['token', ['token', ['POS', prob], ['POS', 'prob']], 'token'];
|
134
|
+
* ['token', ['token', ['POS', prob, 'lemma'], ['POS', 'prob', 'lemma']]].
|
135
|
+
|
136
|
+
It is internally converted into the sequence <tt>token\\ntoken\\tPOS lemma\\t
|
137
|
+
POS lemma\\ntoken\\n</tt>, i.e. in the enriched version alternatives are
|
138
|
+
tab separated and entries are blank separated.
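The conversion rule can be illustrated with a short sketch. It only demonstrates the documented format (alternatives separated by tabs, fields inside an alternative separated by blanks); <tt>to_tt_input</tt> is a hypothetical name, and this is not the library's own converter, which may not accept the enriched forms.

```ruby
# Hypothetical converter for the nested array form described above.
# Plain strings pass through; [token, [tag, ...], [tag, ...]] entries
# become "token<TAB>tag ...<TAB>tag ...".
def to_tt_input(tokens)
  lines = tokens.map do |entry|
    next entry unless entry.is_a?(Array)

    token, *alternatives = entry
    # Alternatives are tab separated; fields inside one are blank separated.
    [token, *alternatives.map { |alt| alt.join(' ') }].join("\t")
  end
  # Every token ends up on its own line, including the last one.
  lines.join("\n") + "\n"
end
```

Because <tt>Array#join</tt> converts its elements to strings, a probability given as a number (for example <tt>0.99</tt>) or as a string produces the same line.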
|
139
|
+
|
140
|
+
Note that probabilities may be strings or integers.
|
141
|
+
|
142
|
+
The lexicon lookup is +not+ implemented for now, so the latter three forms
|
143
|
+
of input arrays are not supported yet.
|
144
|
+
|
145
|
+
=== Output Format
|
146
|
+
For now you'll get an array of strings. However, the precise string
|
147
|
+
structure depends on the cmd arguments you've provided during the tagger
|
148
|
+
instantiation.
|
149
|
+
|
150
|
+
For instance, for the input <tt>["Veruntreute", "die", "AWO", "Spendengeld", "?"]
|
151
|
+
</tt> you'll get the following output with default cmd arguments:
|
152
|
+
|
153
|
+
<tt>["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
|
154
|
+
"Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]</tt>
|
49
155
|
|
50
156
|
See documentation in the TreeTagger::Tagger class for details
|
51
|
-
on particular
|
157
|
+
on particular methods.
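The tab-delimited output strings shown above can be split into named fields. This is a minimal sketch, assuming the field order token, tag, lemma produced by the default arguments; <tt>parse_output</tt> is a hypothetical helper, and other command-line arguments may yield a different layout.

```ruby
# Hypothetical helper: turn each "token\ttag\tlemma" line into a hash.
# The field order is assumed from the sample output with default arguments.
def parse_output(lines)
  lines.map do |line|
    # Limit the split to three fields so a lemma is never cut apart.
    token, tag, lemma = line.split("\t", 3)
    { token: token, tag: tag, lemma: lemma }
  end
end
```

Applied to the sample above, the entry for <tt>AWO</tt> would carry the tag <tt>NN</tt> and the lemma <tt><unknown></tt>.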
|
52
158
|
|
53
159
|
== EXCEPTION HIERARCHY
|
54
160
|
While using TreeTagger you may encounter the following errors:
|
@@ -56,10 +162,23 @@ While using TreeTagger you can face following errors:
|
|
56
162
|
* <tt>TreeTagger::RuntimeError</tt>;
|
57
163
|
* <tt>TreeTagger::ExternalError</tt>.
|
58
164
|
|
165
|
+
These three kinds of errors all subclass <tt>TreeTagger::Error</tt>, which
|
166
|
+
in turn is a subclass of <tt>StandardError</tt>. For an end user this means that
|
167
|
+
it is possible to intercept all errors from <em>treetagger-ruby</em> with
|
168
|
+
a simple <tt>rescue</tt> clause.
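Such a <tt>rescue</tt> clause can be sketched as follows. The class definitions are stand-ins that mirror the hierarchy described above so that the example runs without the gem installed; <tt>with_tagger_errors</tt> is a hypothetical helper.

```ruby
# Stand-in definitions mirroring the documented hierarchy; in real use
# these classes come from treetagger-ruby itself.
module TreeTagger
  class Error < StandardError; end
  class UserError < Error; end
  class RuntimeError < Error; end
  class ExternalError < Error; end
end

# Hypothetical helper: run a block and turn any library error into a
# short message. Errors outside the hierarchy are not intercepted.
def with_tagger_errors
  yield
rescue TreeTagger::Error => e
  "treetagger-ruby failed: #{e.class}: #{e.message}"
end
```

Rescuing the common base class keeps unrelated exceptions, such as an <tt>ArgumentError</tt> raised elsewhere, visible to the caller.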
|
169
|
+
|
59
170
|
== SUPPORT
|
60
|
-
If you have question, bug reports or any suggestions, please drop me an email :)
|
61
|
-
Any help is deeply appreciated!
|
171
|
+
If you have questions, bug reports or any suggestions, please drop me an email :)
|
62
172
|
|
173
|
+
== HOW TO CONTRIBUTE
|
174
|
+
Please contact me and suggest your ideas, report bugs, talk to me, if you want
|
175
|
+
to implement some features in the future releases of this library.
|
176
|
+
|
177
|
+
Please don't feel offended if I cannot accept all your pull requests; I have
|
178
|
+
to review them and find the appropriate time and place in the code base to
|
179
|
+
incorporate your valuable changes.
|
180
|
+
|
181
|
+
Any help is deeply appreciated!
|
63
182
|
== CHANGELOG
|
64
183
|
For details on future plans and work in progress see CHANGELOG.
|
65
184
|
|
data/bin/rtt
CHANGED
@@ -8,24 +8,51 @@ options = TreeTagger::ARGVParser.parse(ARGV)
|
|
8
8
|
|
9
9
|
tagger = TreeTagger::Tagger.new(options)
|
10
10
|
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
11
|
+
# Adding some colors to the output.
|
12
|
+
# Using ANSI escape codes.
|
13
|
+
red = "\e[31m"
|
14
|
+
green = "\e[32m"
|
15
|
+
blue = "\e[34m"
|
16
|
+
reset = "\e[0m"
|
17
|
+
|
18
|
+
reader = Thread.new do
|
19
|
+
beginning = true
|
20
|
+
loop do
|
21
|
+
result_array = tagger.get_output
|
22
|
+
if result_array.nil?
|
23
|
+
if beginning
|
24
|
+
sleep(0.1)
|
25
|
+
next
|
26
|
+
else
|
27
|
+
break
|
28
|
+
end
|
29
|
+
end
|
30
|
+
sleep(0.2) # Is useful!
|
31
|
+
|
32
|
+
beginning = false
|
33
|
+
result_array.each do |tuple|
|
34
|
+
tuple = tuple.split("\t")
|
35
|
+
|
36
|
+
if $stdout.tty?
|
37
|
+
tuple[0].insert(0, red).insert(-1, reset) if tuple[0]
|
38
|
+
tuple[1].insert(0, green).insert(-1, reset) if tuple[1]
|
39
|
+
tuple[2].insert(0, blue).insert(-1, reset) if tuple[2]
|
40
|
+
end
|
41
|
+
|
42
|
+
# [['token', 'tag', 'lemma'], ['token', 'tag', 'lemma']]
|
43
|
+
$stdout.puts tuple.join("\t")
|
27
44
|
end
|
28
|
-
|
29
|
-
$stdout.puts tuple.join("\t")
|
30
45
|
end
|
31
46
|
end
|
47
|
+
|
48
|
+
# Read all lines from STDIN or from files.
|
49
|
+
while line = ARGF.gets
|
50
|
+
# Invoke tokenizer somehow here.
|
51
|
+
tagger.process(line)
|
52
|
+
end
|
53
|
+
|
54
|
+
tagger.flush
|
55
|
+
|
56
|
+
reader.join
|
57
|
+
|
58
|
+
STDOUT.flush
|
data/lib/tree_tagger/tagger.rb
CHANGED
@@ -1,22 +1,228 @@
|
|
1
1
|
# -*- encoding: utf-8 -*-
|
2
|
+
require 'thread'
|
3
|
+
require 'tree_tagger/error'
|
2
4
|
|
5
|
+
=begin
|
6
|
+
TODO:
|
7
|
+
- Observe the status of the reader thread.
|
8
|
+
- Control the status of the pipe and recreate it.
|
9
|
+
- Handle IO errors.
|
10
|
+
- Handle errors while allocating the TT object.
|
11
|
+
- Update the flush sentence, make it shorter.
|
12
|
+
- Store the queue on a persistent medium, not in the memory.
|
13
|
+
- Properly set the $ORS for all platforms.
|
14
|
+
=end
|
15
|
+
# :main: README.rdoc
|
16
|
+
# :title: TreeTagger - Ruby based Wrapper for the TreeTagger by Helmut Schmid
|
17
|
+
# Module comment
|
3
18
|
module TreeTagger
|
19
|
+
# Class comment
|
4
20
|
class Tagger
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
21
|
+
|
22
|
+
BEGIN_MARKER = '<BEGIN_OF_THE_TT_INPUT>'
|
23
|
+
END_MARKER = '<END_OF_THE_TT_INPUT>'
|
24
|
+
# TT seems to hold only the last three tokens in the buffer.
|
25
|
+
# The flushing sentence can be shortened down to this size.
|
26
|
+
FLUSH_SENTENCE = "Das\nist\nein\nTestsatz\n,\num\ndas\nStossen\nder\nDaten\nsicherzustellen\n."
|
27
|
+
|
28
|
+
# Initializer comment
|
29
|
+
def initialize(opts = {
|
30
|
+
:binary => nil,
|
31
|
+
:model => nil,
|
32
|
+
:lexicon => nil,
|
33
|
+
:options => '-token -lemma -sgml -quiet',
|
34
|
+
:replace_blanks => true,
|
35
|
+
:blank_tag => '<BLANK>',
|
36
|
+
:lookup => false
|
11
37
|
}
|
12
38
|
)
|
13
|
-
|
14
|
-
@
|
39
|
+
|
40
|
+
@opts = validate_options(opts)
|
41
|
+
@blank_tag = @opts[:blank_tag]
|
42
|
+
@cmdline = "#{@opts[:binary]} #{@opts[:options]} #{@opts[:model]}"
|
43
|
+
|
44
|
+
@queue = Queue.new
|
45
|
+
@pipe = new_pipe
|
46
|
+
@pipe.sync = true
|
47
|
+
@reader = new_reader
|
48
|
+
@inside_output = false
|
49
|
+
@inside_input = false
|
50
|
+
@enqueued_tokens = 0
|
51
|
+
@mutex = Mutex.new
|
52
|
+
@queue_mutex = Mutex.new
|
53
|
+
# sleep(1) # Don't know if it's useful, no problems before.
|
54
|
+
end
|
55
|
+
|
56
|
+
# Send the string to the TreeTagger.
|
57
|
+
def process(input)
|
58
|
+
|
59
|
+
str = convert(input)
|
60
|
+
# Sanitize strings.
|
61
|
+
str = sanitize(str)
|
62
|
+
# Mark the beginning of the text.
|
63
|
+
if not @inside_input
|
64
|
+
str = "#{BEGIN_MARKER}\n#{str}\n"
|
65
|
+
@inside_input = true
|
66
|
+
else
|
67
|
+
str = str + "\n"
|
68
|
+
end
|
69
|
+
@mutex.synchronize { @enqueued_tokens += 1 }
|
70
|
+
@pipe.print(str)
|
15
71
|
end
|
16
|
-
|
17
|
-
|
18
|
-
|
72
|
+
|
73
|
+
# Get processed tokens back.
|
74
|
+
# This method is not blocking. If some tokens have been sent,
|
75
|
+
# but not received from the pipe yet, it returns an empty array.
|
76
|
+
# If all sent tokens are in the queue it returns all of them.
|
77
|
+
# If no more tokens are awaited it returns <nil>.
|
78
|
+
def get_output
|
79
|
+
output = []
|
80
|
+
tokens = 0
|
81
|
+
@queue_mutex.synchronize do
|
82
|
+
tokens = @queue.size
|
83
|
+
tokens.times { output << @queue.shift }
|
84
|
+
end
|
85
|
+
@mutex.synchronize do
|
86
|
+
@enqueued_tokens -= tokens
|
87
|
+
end
|
88
|
+
|
89
|
+
# Nil if nothing to process in the pipe.
|
90
|
+
# Possible only after flushing the pipe.
|
91
|
+
if @enqueued_tokens > 0
|
92
|
+
output
|
93
|
+
else
|
94
|
+
output.any? ? output : nil
|
95
|
+
end
|
96
|
+
end
|
97
|
+
|
98
|
+
# Get the rest of the text back.
|
99
|
+
# TT holds some meaningful parts in the buffer.
|
100
|
+
def flush
|
101
|
+
@inside_input = false
|
102
|
+
str = "#{END_MARKER}\n#{FLUSH_SENTENCE}\n"
|
103
|
+
@pipe.print(str)
|
104
|
+
# Here invoke the reader thread to ensure
|
105
|
+
# all output has been read.
|
106
|
+
#@reader.run
|
107
|
+
end
|
108
|
+
|
109
|
+
private
|
110
|
+
# Return the options hash after validation.
|
111
|
+
# {
|
112
|
+
# :binary => nil,
|
113
|
+
# :model => nil,
|
114
|
+
# :lexicon => nil,
|
115
|
+
# :options => '-token -lemma -sgml -quiet',
|
116
|
+
# :replace_blanks => true,
|
117
|
+
# :blank_tag => '<BLANK>',
|
118
|
+
# :lookup => false
|
119
|
+
# }
|
120
|
+
def validate_options(opts)
|
121
|
+
# Check if <:lookup> is boolean.
|
122
|
+
|
123
|
+
# Check if <:replace_blanks> is boolean.
|
124
|
+
|
125
|
+
# Check if <:options> is a string.
|
126
|
+
|
127
|
+
# Check if <:options> contains only allowed values.
|
128
|
+
|
129
|
+
# Ensure that <:options> contains <-sgml>.
|
130
|
+
|
131
|
+
# Check if <:blank_tag> is a string.
|
132
|
+
|
133
|
+
# Ensure that <:blank_tag> is a valid SGML sequence.
|
134
|
+
|
135
|
+
# Set the model and binary paths if not provided.
|
136
|
+
[:binary, :model].each do |key|
|
137
|
+
if opts[key].nil?
|
138
|
+
opts[key] = ENV.fetch("TREETAGGER_#{key.to_s.upcase}") do |missing|
|
139
|
+
fail UserError, "Provide a value for <:#{key}>" +
|
140
|
+
" or set the environment variable <#{missing}>!"
|
141
|
+
end
|
142
|
+
end
|
143
|
+
end
|
144
|
+
|
145
|
+
# Set the lexicon path if not provided but requested.
|
146
|
+
if opts[:lookup] && opts[:lexicon].nil?
|
147
|
+
opts[:lexicon] = ENV.fetch('TREETAGGER_LEXICON') do |missing|
|
148
|
+
fail UserError, 'Provide a value for <:lexicon>' +
|
149
|
+
' or set the environment variable <TREETAGGER_LEXICON>!'
|
150
|
+
end
|
151
|
+
end
|
152
|
+
|
153
|
+
# Check for existence and readability of external files:
|
154
|
+
# * binary;
|
155
|
+
# * model;
|
156
|
+
# * lexicon (if applicable).
|
157
|
+
|
158
|
+
opts
|
159
|
+
end
|
160
|
+
|
161
|
+
# Starts the reader thread.
|
162
|
+
def new_reader
|
163
|
+
Thread.new do
|
164
|
+
while line = @pipe.gets
|
165
|
+
# The output strings must not contain "\n".
|
166
|
+
line.chomp!
|
167
|
+
case line
|
168
|
+
when BEGIN_MARKER
|
169
|
+
@inside_output = true
|
170
|
+
$stderr.puts 'Found the begin marker.' if $DEBUG
|
171
|
+
when END_MARKER
|
172
|
+
@inside_output = false
|
173
|
+
$stderr.puts 'Found the end marker.' if $DEBUG
|
174
|
+
else
|
175
|
+
if @inside_output
|
176
|
+
@queue_mutex.synchronize { @queue << line }
|
177
|
+
$stderr.puts "<#{line}> added to the queue." if $DEBUG
|
178
|
+
end
|
179
|
+
end
|
180
|
+
end
|
181
|
+
end # thread
|
182
|
+
end # new_reader
|
183
|
+
|
184
|
+
# This method may be utilized to keep the TT process alive.
|
185
|
+
# Check here if TT returns the exit code 1 in case of invalid options.
|
186
|
+
def new_pipe
|
187
|
+
IO.popen(@cmdline, 'r+')
|
188
|
+
end
|
189
|
+
|
190
|
+
# Convert token arrays to delimited strings.
|
191
|
+
def convert(input)
|
192
|
+
unless input.is_a?(Array) || input.is_a?(String)
|
193
|
+
fail UserError, "Not a valid input format: <#{input.class}>!"
|
194
|
+
end
|
195
|
+
|
196
|
+
if input.empty?
|
197
|
+
fail UserError, "Empty input is not allowed!"
|
198
|
+
end
|
199
|
+
|
200
|
+
if input.is_a?(Array)
|
201
|
+
input.each do |el|
|
202
|
+
unless el.is_a?(String)
|
203
|
+
fail UserError, "Input elements should be strings!"
|
204
|
+
end
|
205
|
+
el = sanitize(el)
|
206
|
+
end
|
207
|
+
input = input.join("\n")
|
208
|
+
end
|
209
|
+
|
210
|
+
input
|
211
|
+
end
|
212
|
+
|
213
|
+
def sanitize(str)
|
214
|
+
line = str.strip
|
215
|
+
if line.empty?
|
216
|
+
line = @blank_tag
|
217
|
+
end
|
218
|
+
|
219
|
+
line
|
19
220
|
end
|
20
221
|
end # class
|
21
222
|
end # module
|
22
223
|
|
224
|
+
__END__
|
225
|
+
- tokenization
|
226
|
+
- lexicon lookup
|
227
|
+
- tagging
|
228
|
+
- error correction
|
data/lib/tree_tagger/version.rb
CHANGED
data/test/test_tagger.rb
ADDED
@@ -0,0 +1,154 @@
|
|
1
|
+
require 'test/unit'
|
2
|
+
require 'tree_tagger/tagger'
|
3
|
+
require 'tree_tagger/error'
|
4
|
+
require 'stringio'
|
5
|
+
|
6
|
+
class TestTagger < Test::Unit::TestCase
|
7
|
+
|
8
|
+
PUBLIC_METHODS = [:process,
|
9
|
+
:get_output,
|
10
|
+
:flush
|
11
|
+
]
|
12
|
+
def setup
|
13
|
+
# ENV['TREETAGGER_BINARY'] = '/opt/TreeTagger/bin/tree-tagger'
|
14
|
+
# ENV['TREETAGGER_MODEL'] = '/opt/TreeTagger/lib/german.par'
|
15
|
+
# ENV['TREETAGGER_LEXICON'] = '/opt/TreeTagger/lib/german-lexicon.txt'
|
16
|
+
|
17
|
+
ENV['TREETAGGER_BINARY'] = 'test/tree-tagger/tree-tagger'
|
18
|
+
ENV['TREETAGGER_MODEL'] = 'test/tree-tagger/model_file.par'
|
19
|
+
ENV['TREETAGGER_LEXICON'] = 'test/tree-tagger/lexicon_file.txt'
|
20
|
+
|
21
|
+
params = {} # dummy for now
|
22
|
+
@tagger = TreeTagger::Tagger.new
|
23
|
+
end
|
24
|
+
|
25
|
+
def teardown
|
26
|
+
end
|
27
|
+
|
28
|
+
# It should have the following constants set.
|
29
|
+
def test_constants
|
30
|
+
end
|
31
|
+
|
32
|
+
# It should respond to valid methods
|
33
|
+
def test_public_methods
|
34
|
+
PUBLIC_METHODS.each do |m|
|
35
|
+
assert_respond_to(@tagger, m)
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
def test_tagger
|
40
|
+
end
|
41
|
+
|
42
|
+
# It should accept only arrays and strings.
|
43
|
+
def test_input_for_its_class
|
44
|
+
assert_nothing_raised do
|
45
|
+
@tagger.process "Ich\ngehe\nin\ndie\nSchule\n.\n"
|
46
|
+
@tagger.process %w{Ich gehe in die Schule .}
|
47
|
+
end
|
48
|
+
end
|
49
|
+
|
50
|
+
# It should reject non-string and non-array elements.
|
51
|
+
def test_rejecting_invalid_input
|
52
|
+
[{}, :input, 1, 1.0, Time.new].each do |input|
|
53
|
+
assert_raise(TreeTagger::UserError) do
|
54
|
+
@tagger.process(input)
|
55
|
+
end
|
56
|
+
end
|
57
|
+
end
|
58
|
+
|
59
|
+
# It should reject empty input.
|
60
|
+
def test_for_empty_input
|
61
|
+
['', []].each do |input|
|
62
|
+
assert_raise(TreeTagger::UserError) do
|
63
|
+
@tagger.process(input)
|
64
|
+
end
|
65
|
+
end
|
66
|
+
end
|
67
|
+
|
68
|
+
# It should reject arrays with wrong elements.
|
69
|
+
def test_for_elements_of_arrays
|
70
|
+
|
71
|
+
end
|
72
|
+
|
73
|
+
# It should accept valid input.
|
74
|
+
def test_accepting_valid_input
|
75
|
+
input = ''
|
76
|
+
end
|
77
|
+
|
78
|
+
# It should accept only valid input.
|
79
|
+
def test_input_validity
|
80
|
+
['', [], {}, :input, [:one, :two]].each do |input|
|
81
|
+
assert_raise(TreeTagger::UserError) do
|
82
|
+
@tagger.process(input)
|
83
|
+
end
|
84
|
+
end
|
85
|
+
end
|
86
|
+
|
87
|
+
# It should instantiate a tagger instance only with valid options.
|
88
|
+
def test_for_binary_presence
|
89
|
+
ENV.delete('TREETAGGER_BINARY')
|
90
|
+
assert_raise(TreeTagger::UserError) do
|
91
|
+
TreeTagger::Tagger.new
|
92
|
+
end
|
93
|
+
end
|
94
|
+
|
95
|
+
# It should instantiate a tagger instance only with valid options.
|
96
|
+
def test_for_model_presence
|
97
|
+
ENV.delete('TREETAGGER_MODEL')
|
98
|
+
assert_raise(TreeTagger::UserError) do
|
99
|
+
TreeTagger::Tagger.new
|
100
|
+
end
|
101
|
+
|
102
|
+
end
|
103
|
+
|
104
|
+
# It should instantiate a tagger instance only with valid options.
|
105
|
+
def test_for_lexicon_presence
|
106
|
+
ENV.delete('TREETAGGER_LEXICON')
|
107
|
+
assert_raise(TreeTagger::UserError) do
|
108
|
+
TreeTagger::Tagger.new({:lookup => true, :options => '-quiet -sgml'})
|
109
|
+
end
|
110
|
+
end
|
111
|
+
|
112
|
+
# It should reject a non-boolean value for <:lookup>.
|
113
|
+
def test_rejecting_lookup_values
|
114
|
+
assert_raise(TreeTagger::UserError) do
|
115
|
+
TreeTagger::Tagger.new({:lookup => 'true', :options => '-quiet'})
|
116
|
+
end
|
117
|
+
end
|
118
|
+
|
119
|
+
# It should reject a non-boolean value for <:replace_blanks>.
|
120
|
+
def test_rejecting_blank_values
|
121
|
+
assert_raise(TreeTagger::UserError) do
|
122
|
+
TreeTagger::Tagger.new({:replace_blanks => 'true'})
|
123
|
+
end
|
124
|
+
end
|
125
|
+
|
126
|
+
# It should reject a non-string value for <:options>.
|
127
|
+
def test_rejecting_option_values
|
128
|
+
assert_raise(TreeTagger::UserError) do
|
129
|
+
TreeTagger::Tagger.new({:options => :quiet})
|
130
|
+
end
|
131
|
+
end
|
132
|
+
|
133
|
+
# It should reject invalid options for TreeTagger inside <:options>.
|
134
|
+
def test_rejecting_invalid_arguments
|
135
|
+
flunk 'Not implemented yet!'
|
136
|
+
end
|
137
|
+
|
138
|
+
# It should ensure the presence of the <-sgml> argument.
|
139
|
+
def test_presence_of_sgml_argument
|
140
|
+
flunk 'Not implemented yet!'
|
141
|
+
end
|
142
|
+
|
143
|
+
# It should reject a non-string value for <:blank_tag>.
|
144
|
+
def test_rejecting_blanktag_values
|
145
|
+
assert_raise(TreeTagger::UserError) do
|
146
|
+
TreeTagger::Tagger.new({:blank_tag => :blank})
|
147
|
+
end
|
148
|
+
end
|
149
|
+
|
150
|
+
# It should ensure that <:blank_tag> is a valid SGML sequence.
|
151
|
+
def test_sgml_form
|
152
|
+
flunk 'Not implemented yet!'
|
153
|
+
end
|
154
|
+
end
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
metadata
CHANGED
@@ -1,13 +1,13 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: treetagger-ruby
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
hash:
|
4
|
+
hash: 27
|
5
5
|
prerelease:
|
6
6
|
segments:
|
7
7
|
- 0
|
8
|
-
- 0
|
9
8
|
- 1
|
10
|
-
|
9
|
+
- 0
|
10
|
+
version: 0.1.0
|
11
11
|
platform: ruby
|
12
12
|
authors:
|
13
13
|
- Andrei Beliankou
|
@@ -15,7 +15,7 @@ autorequire:
|
|
15
15
|
bindir: bin
|
16
16
|
cert_chain: []
|
17
17
|
|
18
|
-
date:
|
18
|
+
date: 2012-02-14 00:00:00 Z
|
19
19
|
dependencies:
|
20
20
|
- !ruby/object:Gem::Dependency
|
21
21
|
name: rdoc
|
@@ -95,6 +95,12 @@ files:
|
|
95
95
|
- LICENCE.rdoc
|
96
96
|
- CHANGELOG.rdoc
|
97
97
|
- .yardopts
|
98
|
+
- test/test_tagger.rb
|
99
|
+
- test/tree-tagger/corrupted_lexicon_file.txt
|
100
|
+
- test/tree-tagger/lexicon_file.txt
|
101
|
+
- test/tree-tagger/corrupted_model_file.par
|
102
|
+
- test/tree-tagger/model_file.par
|
103
|
+
- test/tree-tagger/tree-tagger
|
98
104
|
- bin/rtt
|
99
105
|
homepage: http://www.uni-trier.de/index.php?id=34451
|
100
106
|
licenses: []
|
@@ -132,6 +138,11 @@ rubygems_version: 1.8.10
|
|
132
138
|
signing_key:
|
133
139
|
specification_version: 3
|
134
140
|
summary: A wrapper for the TreeTagger by Helmut Schmid.
|
135
|
-
test_files:
|
136
|
-
|
141
|
+
test_files:
|
142
|
+
- test/test_tagger.rb
|
143
|
+
- test/tree-tagger/corrupted_lexicon_file.txt
|
144
|
+
- test/tree-tagger/lexicon_file.txt
|
145
|
+
- test/tree-tagger/corrupted_model_file.par
|
146
|
+
- test/tree-tagger/model_file.par
|
147
|
+
- test/tree-tagger/tree-tagger
|
137
148
|
has_rdoc:
|