treetagger-ruby 0.0.1 → 0.1.0

@@ -1,17 +1,27 @@
1
1
  == COMPLETED
2
+ === 0.1.0
3
+ The interface is now clear and stable.
4
+
5
+ Tagging of big texts is possible since the TreeTagger is invoked as an
6
+ external process through a pipe.
7
+
8
+ Test suite improved.
2
9
  === 0.0.1
3
10
  Implemented simple tagging. The TreeTagger is invoked through the env variable.
4
11
  === 0.0.1.prealpha
5
12
  Created the structure for this project, added documentation and a public repo.
6
13
 
7
14
  == PLANNED
8
- === 0.1.0
9
-
10
15
  === 0.2.0
16
+ Better tests. Support for all input types.
11
17
  === 0.3.0
18
+ Lemmatizer.
12
19
  === 0.4.0
20
+ File based FIFOs.
13
21
  === 0.5.0
22
+ File based queues.
14
23
  === 0.6.0
24
+ Full featured cmd interface.
15
25
  === 0.7.0
16
26
  === 0.8.0
17
27
  === 0.9.0
@@ -1,37 +1,69 @@
1
1
  = TreeTagger for Ruby
2
2
 
3
3
  * {RubyGems}[http://rubygems.org/gems/treetagger-ruby]
4
- * Developers {Homepage}[http://bu.chsta.be/]
4
+ * {Homepage}[http://bu.chsta.be/]
5
5
  * {RTT Project Page}[http://bu.chsta.be/projects/treetagger-ruby/]
6
6
  * {Source Code}[https://github.com/arbox/treetagger-ruby]
7
7
  * {Bug Tracker}[https://github.com/arbox/treetagger-ruby/issues]
8
8
 
9
9
  == DESCRIPTION
10
- The Ruby based wrapper for the TreeTagger by Helmut Schmid.
11
- Check it out if you are interested
12
- in Natural Language Processing (NLP) and Human Language Technology (HLT).
10
+ A Ruby based wrapper for the TreeTagger by Helmut Schmid.
11
+
12
+ Check it out if you are interested in Natural Language Processing (NLP)
13
+ and/or Human Language Technology (HLT).
14
+
15
+ This library provides comprehensive bindings for the
16
+ {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/],
17
+ a statistical, language-independent POS tagging and chunking software.
18
+
19
+ TreeTagger is language agnostic, it will never guess what language you're going
20
+ to use. It
21
+
22
+ TODO:
23
+ * References to Schmid's publications;
24
+ * How to use TreeTagger in the wild;
25
+ * Input and output format, tokenization;
26
+ * ...
27
+ * The actual German parameter file has been estimated on one-byte encoded data.
28
+
13
29
  === Implemented Features
14
30
  Simple tagging.
15
31
 
32
+ Please have a look at the CHANGELOG file for details on implemented and planned
33
+ features.
16
34
 
17
35
  == INSTALLATION
18
36
  Before you install the <tt>treetagger-ruby</tt> package please ensure
19
- you have downloaded and installe the <tt>TreeTagger</tt> itself.
37
+ you have downloaded and installed the
38
+ {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
39
+ itself.
20
40
 
21
41
  The {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
22
- is a copyrighted software by Helmut Schmid and IMC, please read the license
23
- agreament befor you download the package.
42
+ is a copyrighted software by Helmut Schmid and
43
+ {IMS}[http://www.ims.uni-stuttgart.de/], please read the license
44
+ agreement before you download the TreeTagger package and language models.
24
45
 
25
46
  After the installation of the <tt>TreeTagger</tt> set the environment variable
26
- <tt>TREETAGGERHOME</tt> to the location where you have the programm installed.
27
- Usually this directory contains subdirectories <tt>bin, cmd, lib</tt> and
28
- <tt>doc</tt>.
29
- For instance you may add the following line to your <tt>.profile</tt> file:
30
- export TREETAGGERHOME='/path/to/your/TreeTagger/installation'
47
+ <tt>TREETAGGER_BINARY</tt> to the location where the binary <tt>tree-tagger</tt>
48
+ resides. Usually this binary is located under the <tt>bin</tt> directory in the
49
+ main installation directory of the <tt>TreeTagger</tt>.
50
+
51
+ Also you have to set the variable <tt>TREETAGGER_MODEL</tt> to the location of
52
+ the appropriate language model you have acquired in the training step.
53
+
54
+ For instance you may add the following lines to your <tt>.profile</tt> file:
55
+ export TREETAGGER_BINARY='/path/to/your/TreeTagger/bin/tree-tagger'
56
+ export TREETAGGER_MODEL='/path/to/your/TreeTagger/lib/german.par'
57
+
58
+ It is convenient to work with a default language model, but you can change
59
+ it every time during the instantiation of a new tagger instance.
60
+
61
+ If you want to feed a lexicon file into your tagger you can do it globally
62
+ through the environment variable <tt>TREETAGGER_LEXICON</tt>.
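
The lookup order described above (an explicit option first, then the matching <tt>TREETAGGER_*</tt> variable) can be sketched as follows. This is a minimal illustration, not the library's code: the helper name <tt>resolve_tt_path</tt>, the use of <tt>ArgumentError</tt> and the alternative model path are assumptions made for the example, and a plain Hash stands in for <tt>ENV</tt>.

```ruby
# Hypothetical helper, not part of treetagger-ruby: prefer an explicit option,
# otherwise fall back to the matching TREETAGGER_* environment variable.
def resolve_tt_path(opts, key, env = ENV)
  return opts[key] unless opts[key].nil?

  name = "TREETAGGER_#{key.to_s.upcase}"
  # fetch yields the missing key to the block, for ENV as well as for a Hash.
  env.fetch(name) do |missing|
    raise ArgumentError,
          "Provide a value for <:#{key}> or set the environment variable <#{missing}>!"
  end
end

# A plain Hash stands in for ENV so the sketch leaves the real environment alone.
fake_env = { 'TREETAGGER_BINARY' => '/path/to/your/TreeTagger/bin/tree-tagger' }

binary = resolve_tt_path({}, :binary, fake_env)
# An explicit option (hypothetical path) wins over the environment.
model  = resolve_tt_path({ :model => '/path/to/another/model.par' }, :model, fake_env)
```

A missing variable surfaces as an error at instantiation time rather than later, when the external process would fail to start.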
31
63
 
32
64
  <tt>treetagger-ruby</tt> is provided as a .gem package. Simply install it via
33
65
  {RubyGems}[http://rubygems.org/gems/treetagger-ruby].
34
- To install <tt>treetagger-ruby</tt> ussue the following command:
66
+ To install <tt>treetagger-ruby</tt> issue the following command:
35
67
  $ gem install treetagger-ruby
36
68
 
37
69
  If you want to do a system wide installation, do this as root
@@ -41,14 +73,88 @@ Alternatively use your Gemfile for dependency management.
41
73
 
42
74
 
43
75
  == SYNOPSIS
44
-
76
+ === Basic Usage
45
77
  Basic usage is very simple:
46
78
  $ require 'treetagger-ruby'
79
+ $ # Instantiate a tagger instance with default values.
47
80
  $ tagger = TreeTagger::Tagger.new
48
- $ api.process('Ich gehe in die Schule')
81
+ $ # Process an array of tokens.
82
+ $ tagger.process(%w{Ich gehe in die Schule})
83
+ $ # Flush the pipeline.
84
+ $ tagger.flush
85
+ $ # Get the processed data.
86
+ $ tagger.get_output
87
+
88
+ === Input Format
89
+ Basically you have to provide a tokenized sequence with possibly some additional
90
+ information on lexical classes of tokens and on their probabilities. Every token
91
+ has to be on a separate line. Due to technical limitations SGML tags
92
+ (i.e. sequences with heading < and trailing >) cannot be valid tokens since
93
+ they are used internally for delimiting meaningful content from flush tokens.
94
+ This implies the use of the <tt>-sgml</tt> option, which cannot be changed by the user.
95
+ It is a limitation of <em>this</em> library. If you do need to process tags,
96
+ fall back and use the TreeTagger as a standalone program possibly employing
97
+ temp files to store your input and output. This behaviour will be also
98
+ implemented in further versions of <tt>treetagger-ruby</tt>.
99
+
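
The role of the SGML markers can be illustrated with a small pipe round trip. This is a sketch under stated assumptions, not the library's implementation: a Ruby one-liner that echoes its input stands in for the <tt>tree-tagger</tt> binary, and the flush tokens are a shortened version of the flush sentence. Only the two marker strings are taken from the tagger source in this release.

```ruby
require 'rbconfig'

begin_marker = '<BEGIN_OF_THE_TT_INPUT>'
end_marker   = '<END_OF_THE_TT_INPUT>'

# Stand-in for the tree-tagger binary: a child Ruby process echoing every line.
echo = [RbConfig.ruby, '-e',
        '$stdout.sync = true; while line = $stdin.gets; puts line; end']

tagged = []
IO.popen(echo, 'r+') do |pipe|
  pipe.puts begin_marker
  pipe.puts %w[Ich gehe in die Schule]   # puts writes one array element per line
  pipe.puts end_marker
  pipe.puts %w[Das ist ein Testsatz .]   # flush tokens pushed after the content
  pipe.close_write                       # signal end of input to the child

  inside = false
  while (line = pipe.gets)
    line.chomp!
    case line
    when begin_marker
      inside = true
    when end_marker
      inside = false
    else
      tagged << line if inside           # keep only lines between the markers
    end
  end
end
```

Everything written after the end marker is discarded on the way back, which is why a token that itself looks like an SGML tag cannot be told apart from a delimiter.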
100
+ Every token may occur alone on the line or be followed by additional
101
+ information:
102
+ * token;
103
+ * token (\\tab tag)+;
104
+ * token (\\tab tag \\space lemma)+;
105
+ * token (\\tab tag \\space probability)+;
106
+ * token (\\tab tag \\space probability \\space lemma)+.
107
+
108
+ Your input may look like the following sentence:
109
+ Die ART 0.99
110
+ neuen ADJA neu
111
+ Hunde NN NP
112
+ stehen VVFIN 0.99 stehen
113
+ an
114
+ den
115
+ Mauern NN Mauer
116
+ .
117
+
118
+
119
+ This wrapper accepts the input as <em>String</em> or <em>Array</em>.
120
+
121
+ If you want to use strings, you are responsible for the proper delimiters inside
122
+ the string: <tt>"Die\\tART 0.99\\nneuen\\tADJA neu\\nHunde\\tNN NP\\nstehen\\t
123
+ VVFIN 0.99 stehen\\nan\\nden\\nMauern\\tNN Mauer\\n.\\n"</tt>
124
+ Now <tt>treetagger-ruby</tt> does not check your markup for correctness and will
125
+ possibly report a <tt>TreeTagger::ExternalError</tt> if the TreeTagger process
126
+ dies due to input errors.
127
+
128
+ Using arrays is more convenient since they can be built programmatically.
129
+
130
+ Arrays should have the following structure:
131
+ * ['token', 'token', 'token'];
132
+ * ['token', ['token', ['POS', 'lemma'], ['POS', 'lemma']], 'token'];
133
+ * ['token', ['token', ['POS', prob], ['POS', 'prob']], 'token'];
134
+ * ['token', ['token', ['POS', prob, 'lemma'], ['POS', 'prob', 'lemma']]].
135
+
136
+ It is internally converted into the sequence <tt>token\\ntoken\\tPOS lemma\\t
137
+ POS lemma\\ntoken\\n</tt>, i.e. in the enriched version alternatives are
138
+ tab separated and entries are blank separated.
139
+
140
+ Note that probabilities may be strings or numbers.
141
+
142
+ The lexicon lookup is +not+ implemented for now, that is, the latter three forms
143
+ of input arrays are not supported yet.
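
Until the enriched array forms are supported, the conversion described above can be approximated by hand before passing a string to the tagger. The following is a minimal sketch, not the library's converter: the method name <tt>to_tt_input</tt> is hypothetical, and the sample tokens are taken from the example sentence in this section.

```ruby
# Hypothetical converter: turns the nested array forms into the line format
# "token\tPOS entry\tPOS entry\n" (alternatives tab separated, entries blank
# separated).
def to_tt_input(tokens)
  lines = tokens.map do |entry|
    next entry if entry.is_a?(String)    # a bare token stays alone on its line

    token, *alternatives = entry
    # Each alternative is [POS, lemma], [POS, prob] or [POS, prob, lemma];
    # join calls to_s on every element, so numeric probabilities work too.
    [token, *alternatives.map { |alt| alt.join(' ') }].join("\t")
  end
  lines.join("\n") + "\n"
end

input = ['Die', ['neuen', ['ADJA', 'neu']], ['stehen', ['VVFIN', 0.99, 'stehen']], '.']
tt_string = to_tt_input(input)
```

The resulting string can be handed to <tt>process</tt> like any other string input. The sketch performs no validation, so malformed entries would reach the TreeTagger process unchanged.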
144
+
145
+ === Output Format
146
+ For now you'll get an array of strings. However the precise string
147
+ structure depends on the cmd arguments you've provided during the tagger
148
+ instantiation.
149
+
150
+ For instance for the input <tt>["Veruntreute", "die", "AWO", "Spendengeld", "?"]
151
+ </tt> you'll get the following output with default cmd arguments:
152
+
153
+ <tt>["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
154
+ "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]</tt>
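
With the default arguments each output string holds token, tag and lemma separated by tabs, so splitting it back into fields is straightforward. A minimal sketch using the output shown above; the <tt>Struct</tt> and the variable names are illustrative assumptions, not part of the library:

```ruby
output = ["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
          "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]

# Illustrative record type for one tagged token.
tagged_token = Struct.new(:token, :tag, :lemma)

rows = output.map { |line| tagged_token.new(*line.split("\t")) }

# Tokens for which the TreeTagger could not determine a lemma.
unknown = rows.select { |row| row.lemma == '<unknown>' }.map(&:token)
```

With other command line arguments the number of fields per line may differ, so the three-field assumption holds only for the default options.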
49
155
 
50
156
  See documentation in the TreeTagger::Tagger class for details
51
- on particular search methods.
157
+ on particular methods.
52
158
 
53
159
  == EXCEPTION HIERARCHY
54
160
  While using TreeTagger you can face the following errors:
@@ -56,10 +162,23 @@ While using TreeTagger you can face following errors:
56
162
  * <tt>TreeTagger::RuntimeError</tt>;
57
163
  * <tt>TreeTagger::ExternalError</tt>.
58
164
 
165
+ These three kinds of errors all subclass <tt>TreeTagger::Error</tt>, which
166
+ in turn is a subclass of <tt>StandardError</tt>. For an end user this means that
167
+ it is possible to intercept all errors from <em>treetagger-ruby</em> with
168
+ a simple <tt>rescue</tt> clause.
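
The practical effect of such a hierarchy can be sketched as follows. To stay self-contained, the example defines a hypothetical stand-in module, <tt>TaggerErrorsSketch</tt>, that mirrors the class names described above; it is not the library's own definition, and the helper <tt>classify</tt> is likewise an assumption.

```ruby
# Hypothetical mirror of the hierarchy: one base class under StandardError
# and three specific error classes below it.
module TaggerErrorsSketch
  class Error < StandardError; end
  class UserError < Error; end
  class RuntimeError < Error; end
  class ExternalError < Error; end
end

# Runs the block and reports which rescue clause, if any, was triggered.
def classify
  yield
  :ok
rescue TaggerErrorsSketch::UserError
  :bad_input        # specific clause: wrong options or input
rescue TaggerErrorsSketch::Error
  :tagger_failure   # catch-all clause for every other library error
end

bad_input = classify { raise TaggerErrorsSketch::UserError, 'Empty input is not allowed!' }
external  = classify { raise TaggerErrorsSketch::ExternalError, 'process died' }
fine      = classify { :nothing_raised }
```

Because the base class descends from <tt>StandardError</tt>, a bare <tt>rescue</tt> without a class would catch these errors as well.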
169
+
59
170
  == SUPPORT
60
- If you have question, bug reports or any suggestions, please drop me an email :)
61
- Any help is deeply appreciated!
171
+ If you have questions, bug reports or any suggestions, please drop me an email :)
62
172
 
173
+ == HOW TO CONTRIBUTE
174
+ Please contact me and suggest your ideas, report bugs, talk to me, if you want
175
+ to implement some features in the future releases of this library.
176
+
177
+ Please don't feel offended if I cannot accept all your pull requests, I have
178
+ to review them and find the appropriate time and place in the code base to
179
+ incorporate your valuable changes.
180
+
181
+ Any help is deeply appreciated!
63
182
  == CHANGELOG
64
183
  For details on future plans and work in progress see CHANGELOG.
65
184
 
data/bin/rtt CHANGED
@@ -8,24 +8,51 @@ options = TreeTagger::ARGVParser.parse(ARGV)
8
8
 
9
9
  tagger = TreeTagger::Tagger.new(options)
10
10
 
11
- while line = ARGF.gets
12
- # [['token', 'tag', 'lemma'], ['token', 'tag', 'lemma']]
13
- result_array = tagger.process(line.chomp)
14
-
15
- # Adding some colors to the output.
16
- # Using ANSI escape codes.
17
- red = "\e[31m"
18
- green = "\e[32m"
19
- blue = "\e[34m"
20
- reset = "\e[0m"
21
-
22
- result_array.each do |tuple|
23
- if $stdout.tty?
24
- tuple[0].insert(0, red).insert(-1, reset)
25
- tuple[1].insert(0, green).insert(-1, reset)
26
- tuple[2].insert(0, blue).insert(-1, reset)
11
+ # Adding some colors to the output.
12
+ # Using ANSI escape codes.
13
+ red = "\e[31m"
14
+ green = "\e[32m"
15
+ blue = "\e[34m"
16
+ reset = "\e[0m"
17
+
18
+ reader = Thread.new do
19
+ beginning = true
20
+ loop do
21
+ result_array = tagger.get_output
22
+ if result_array.nil?
23
+ if beginning
24
+ sleep(0.1)
25
+ next
26
+ else
27
+ break
28
+ end
29
+ end
30
+ sleep(0.2) # Is useful!
31
+
32
+ beginning = false
33
+ result_array.each do |tuple|
34
+ tuple = tuple.split("\t")
35
+
36
+ if $stdout.tty?
37
+ tuple[0].insert(0, red).insert(-1, reset) if tuple[0]
38
+ tuple[1].insert(0, green).insert(-1, reset) if tuple[1]
39
+ tuple[2].insert(0, blue).insert(-1, reset) if tuple[2]
40
+ end
41
+
42
+ # [['token', 'tag', 'lemma'], ['token', 'tag', 'lemma']]
43
+ $stdout.puts tuple.join("\t")
27
44
  end
28
-
29
- $stdout.puts tuple.join("\t")
30
45
  end
31
46
  end
47
+
48
+ # Read all lines from STDIN or from files.
49
+ while line = ARGF.gets
50
+ # Invoke tokenizer somehow here.
51
+ tagger.process(line)
52
+ end
53
+
54
+ tagger.flush
55
+
56
+ reader.join
57
+
58
+ STDOUT.flush
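
The reader thread above depends on the return contract of <tt>get_output</tt>: an empty array while tokens are still pending, and <tt>nil</tt> when nothing more is expected. The loop can be sketched in isolation as follows. This is an illustration only: a lambda handing out prepared answers stands in for a real tagger, and the sample strings come from the README output example.

```ruby
# Hypothetical stand-in for Tagger#get_output: returns prepared answers in order.
answers = [nil, [], ["die\tART\td"], [], ["AWO\tNN\t<unknown>"], nil]
get_output = -> { answers.shift }

collected = []
beginning = true
loop do
  chunk = get_output.call
  if chunk.nil?
    break unless beginning   # nil after some activity: the tagger is drained
    sleep(0.01)              # nil before any activity: nothing was sent yet
    next
  end
  beginning = false          # any non-nil answer means tokens are in flight
  collected.concat(chunk)    # an empty array simply adds nothing
end
```

Treating a leading <tt>nil</tt> as "not started yet" is what lets the reader be launched before the first call to <tt>process</tt>.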
@@ -1,22 +1,228 @@
1
1
  # -*- encoding: utf-8 -*-
2
+ require 'thread'
3
+ require 'tree_tagger/error'
2
4
 
5
+ =begin
6
+ TODO:
7
+ - Observe the status of the reader thread.
8
+ - Control the status of the pipe and recreate it.
9
+ - Handle IO errors.
10
+ - Handle errors while allocating the TT object.
11
+ - Update the flush sentence, make it shorter.
12
+ - Store the queue on a persistent medium, not in the memory.
13
+ - Properly set the $ORS for all platforms.
14
+ =end
15
+ # :main: README.rdoc
16
+ # :title: TreeTagger - Ruby based Wrapper for the TreeTagger by Helmut Schmid
17
+ # Module comment
3
18
  module TreeTagger
19
+ # Class comment
4
20
  class Tagger
5
- def initialize(
6
- lang = :de,
7
- opts = {
8
- :sgml => true,
9
- :token => true,
10
- :lemma => true
21
+
22
+ BEGIN_MARKER = '<BEGIN_OF_THE_TT_INPUT>'
23
+ END_MARKER = '<END_OF_THE_TT_INPUT>'
24
+ # TT seems to hold only the last three tokens in the buffer.
25
+ # The flushing sentence can be shortened down to this size.
26
+ FLUSH_SENTENCE = "Das\nist\nein\nTestsatz\n,\num\ndas\nStossen\nder\nDaten\nsicherzustellen\n."
27
+
28
+ # Initializer comment
29
+ def initialize(opts = {
30
+ :binary => nil,
31
+ :model => nil,
32
+ :lexicon => nil,
33
+ :options => '-token -lemma -sgml -quiet',
34
+ :replace_blanks => true,
35
+ :blank_tag => '<BLANK>',
36
+ :lookup => false
11
37
  }
12
38
  )
13
- @lang = lang
14
- @opt = opts
39
+
40
+ @opts = validate_options(opts)
41
+ @blank_tag = @opts[:blank_tag]
42
+ @cmdline = "#{@opts[:binary]} #{@opts[:options]} #{@opts[:model]}"
43
+
44
+ @queue = Queue.new
45
+ @pipe = new_pipe
46
+ @pipe.sync = true
47
+ @reader = new_reader
48
+ @inside_output = false
49
+ @inside_input = false
50
+ @enqueued_tokens = 0
51
+ @mutex = Mutex.new
52
+ @queue_mutex = Mutex.new
53
+ # sleep(1) # Don't know if it's useful, no problems before.
54
+ end
55
+
56
+ # Send the string to the TreeTagger.
57
+ def process(input)
58
+
59
+ str = convert(input)
60
+ # Sanitize strings.
61
+ str = sanitize(str)
62
+ # Mark the beginning of the text.
63
+ if not @inside_input
64
+ str = "#{BEGIN_MARKER}\n#{str}\n"
65
+ @inside_input = true
66
+ else
67
+ str = str + "\n"
68
+ end
69
+ @mutex.synchronize { @enqueued_tokens += 1 }
70
+ @pipe.print(str)
15
71
  end
16
- def process(str)
17
- line = %x(echo '#{str}' | #{ENV['TREETAGGERHOME']}/cmd/tree-tagger-german)
18
- arr = line.split("\n").collect { |el| el.split("\t") }
72
+
73
+ # Get processed tokens back.
74
+ # This method is not blocking. If some tokens have been sent,
75
+ # but not received from the pipe yet, it returns an empty array.
76
+ # If all sent tokens are in the queue it returns all of them.
77
+ # If no more tokens are awaited it returns <nil>.
78
+ def get_output
79
+ output = []
80
+ tokens = 0
81
+ @queue_mutex.synchronize do
82
+ tokens = @queue.size
83
+ tokens.times { output << @queue.shift }
84
+ end
85
+ @mutex.synchronize do
86
+ @enqueued_tokens -= tokens
87
+ end
88
+
89
+ # Nil if nothing to process in the pipe.
90
+ # Possible only after flushing the pipe.
91
+ if @enqueued_tokens > 0
92
+ output
93
+ else
94
+ output.any? ? output : nil
95
+ end
96
+ end
97
+
98
+ # Get the rest of the text back.
99
+ # TT holds some meaningful parts in the buffer.
100
+ def flush
101
+ @inside_input = false
102
+ str = "#{END_MARKER}\n#{FLUSH_SENTENCE}\n"
103
+ @pipe.print(str)
104
+ # Here invoke the reader thread to ensure
105
+ # all output has been read.
106
+ #@reader.run
107
+ end
108
+
109
+ private
110
+ # Return the options hash after validation.
111
+ # {
112
+ # :binary => nil,
113
+ # :model => nil,
114
+ # :lexicon => nil,
115
+ # :options => '-token -lemma -sgml -quiet',
116
+ # :replace_blanks => true,
117
+ # :blank_tag => '<BLANK>',
118
+ # :lookup => false
119
+ # }
120
+ def validate_options(opts)
121
+ # Check if <:lookup> is boolean.
122
+
123
+ # Check if <:replace_blanks> is boolean.
124
+
125
+ # Check if <:options> is a string.
126
+
127
+ # Check if <:options> contains only allowed values.
128
+
129
+ # Ensure that <:options> contains <-sgml>.
130
+
131
+ # Check if <:blank_tag> is a string.
132
+
133
+ # Ensure that <:blank_tag> is a valid SGML sequence.
134
+
135
+ # Set the model and binary paths if not provided.
136
+ [:binary, :model].each do |key|
137
+ if opts[key].nil?
138
+ opts[key] = ENV.fetch("TREETAGGER_#{key.to_s.upcase}") do |missing|
139
+ fail UserError, "Provide a value for <:#{key}>" +
140
+ " or set the environment variable <#{missing}>!"
141
+ end
142
+ end
143
+ end
144
+
145
+ # Set the lexicon path if not provided but requested.
146
+ if opts[:lookup] && opts[:lexicon].nil?
147
+ opts[:lexicon] = ENV.fetch('TREETAGGER_LEXICON') do |missing|
148
+ fail UserError, 'Provide a value for <:lexicon>' +
149
+ ' or set the environment variable <TREETAGGER_LEXICON>!'
150
+ end
151
+ end
152
+
153
+ # Check for existence and readability of external files:
154
+ # * binary;
155
+ # * model;
156
+ # * lexicon (if applicable).
157
+
158
+ opts
159
+ end
160
+
161
+ # Starts the reader thread.
162
+ def new_reader
163
+ Thread.new do
164
+ while line = @pipe.gets
165
+ # The output strings must not contain "\n".
166
+ line.chomp!
167
+ case line
168
+ when BEGIN_MARKER
169
+ @inside_output = true
170
+ $stderr.puts 'Found the begin marker.' if $DEBUG
171
+ when END_MARKER
172
+ @inside_output = false
173
+ $stderr.puts 'Found the end marker.' if $DEBUG
174
+ else
175
+ if @inside_output
176
+ @queue_mutex.synchronize { @queue << line }
177
+ $stderr.puts "<#{line}> added to the queue." if $DEBUG
178
+ end
179
+ end
180
+ end
181
+ end # thread
182
+ end # new_reader
183
+
184
+ # This method may be utilized to keep the TT process alive.
185
+ # Check here if TT returns the exit code 1 in case of invalid options.
186
+ def new_pipe
187
+ IO.popen(@cmdline, 'r+')
188
+ end
189
+
190
+ # Convert token arrays to delimited strings.
191
+ def convert(input)
192
+ unless input.is_a?(Array) || input.is_a?(String)
193
+ fail UserError, "Not a valid input format: <#{input.class}>!"
194
+ end
195
+
196
+ if input.empty?
197
+ fail UserError, "Empty input is not allowed!"
198
+ end
199
+
200
+ if input.is_a?(Array)
201
+ input.each do |el|
202
+ unless el.is_a?(String)
203
+ fail UserError, "Input elements should be strings!"
204
+ end
205
+ el = sanitize(el)
206
+ end
207
+ input = input.join("\n")
208
+ end
209
+
210
+ input
211
+ end
212
+
213
+ def sanitize(str)
214
+ line = str.strip
215
+ if line.empty?
216
+ line = @blank_tag
217
+ end
218
+
219
+ line
19
220
  end
20
221
  end # class
21
222
  end # module
22
223
 
224
+ __END__
225
+ - tokenization
226
+ - lexicon lookup
227
+ - tagging
228
+ - error correction
@@ -1,3 +1,3 @@
1
1
  module TreeTagger
2
- VERSION = '0.0.1'
2
+ VERSION = '0.1.0'
3
3
  end
@@ -0,0 +1,154 @@
1
+ require 'test/unit'
2
+ require 'tree_tagger/tagger'
3
+ require 'tree_tagger/error'
4
+ require 'stringio'
5
+
6
+ class TestTagger < Test::Unit::TestCase
7
+
8
+ PUBLIC_METHODS = [:process,
9
+ :get_output,
10
+ :flush
11
+ ]
12
+ def setup
13
+ # ENV['TREETAGGER_BINARY'] = '/opt/TreeTagger/bin/tree-tagger'
14
+ # ENV['TREETAGGER_MODEL'] = '/opt/TreeTagger/lib/german.par'
15
+ # ENV['TREETAGGER_LEXICON'] = '/opt/TreeTagger/lib/german-lexicon.txt'
16
+
17
+ ENV['TREETAGGER_BINARY'] = 'test/tree-tagger/tree-tagger'
18
+ ENV['TREETAGGER_MODEL'] = 'test/tree-tagger/model_file.par'
19
+ ENV['TREETAGGER_LEXICON'] = 'test/tree-tagger/lexicon_file.txt'
20
+
21
+ params = {} # dummy for now
22
+ @tagger = TreeTagger::Tagger.new
23
+ end
24
+
25
+ def teardown
26
+ end
27
+
28
+ # It should have the following constants set.
29
+ def test_constants
30
+ end
31
+
32
+ # It should respond to valid methods
33
+ def test_public_methods
34
+ PUBLIC_METHODS.each do |m|
35
+ assert_respond_to(@tagger, m)
36
+ end
37
+ end
38
+
39
+ def test_tagger
40
+ end
41
+
42
+ # It should accept only arrays and strings.
43
+ def test_input_for_its_class
44
+ assert_nothing_raised do
45
+ @tagger.process "Ich\ngehe\nin\ndie\nSchule\n.\n"
46
+ @tagger.process %w{Ich gehe in die Schule .}
47
+ end
48
+ end
49
+
50
+ # It should reject non-string and non-array elements.
51
+ def test_rejecting_invalid_input
52
+ [{}, :input, 1, 1.0, Time.new].each do |input|
53
+ assert_raise(TreeTagger::UserError) do
54
+ @tagger.process(input)
55
+ end
56
+ end
57
+ end
58
+
59
+ # It should reject empty input.
60
+ def test_for_empty_input
61
+ ['', []].each do |input|
62
+ assert_raise(TreeTagger::UserError) do
63
+ @tagger.process(input)
64
+ end
65
+ end
66
+ end
67
+
68
+ # It should reject arrays with wrong elements.
69
+ def test_for_elements_of_arrays
70
+
71
+ end
72
+
73
+ # It should accept valid input.
74
+ def test_accepting_vaild_input
75
+ input = ''
76
+ end
77
+
78
+ # It should accept only valid input.
79
+ def test_input_validity
80
+ ['', [], {}, :input, [:one, :two]].each do |input|
81
+ assert_raise(TreeTagger::UserError) do
82
+ @tagger.process(input)
83
+ end
84
+ end
85
+ end
86
+
87
+ # It should instantiate a tagger instance only with valid options.
88
+ def test_for_binary_presence
89
+ ENV.delete('TREETAGGER_BINARY')
90
+ assert_raise(TreeTagger::UserError) do
91
+ TreeTagger::Tagger.new
92
+ end
93
+ end
94
+
95
+ # It should instantiate a tagger instance only with valid options.
96
+ def test_for_model_presence
97
+ ENV.delete('TREETAGGER_MODEL')
98
+ assert_raise(TreeTagger::UserError) do
99
+ TreeTagger::Tagger.new
100
+ end
101
+
102
+ end
103
+
104
+ # It should instantiate a tagger instance only with valid options.
105
+ def test_for_lexicon_presence
106
+ ENV.delete('TREETAGGER_LEXICON')
107
+ assert_raise(TreeTagger::UserError) do
108
+ TreeTagger::Tagger.new({:lookup => true, :options => '-quiet -sgml'})
109
+ end
110
+ end
111
+
112
+ # It should reject a non-boolean value for <:lookup>.
113
+ def test_rejecting_lookup_values
114
+ assert_raise(TreeTagger::UserError) do
115
+ TreeTagger::Tagger.new({:lookup => 'true', :options => '-quiet'})
116
+ end
117
+ end
118
+
119
+ # It should reject a non-boolean value for <:replace_blanks>.
120
+ def test_rejecting_blank_values
121
+ assert_raise(TreeTagger::UserError) do
122
+ TreeTagger::Tagger.new({:replace_blanks => 'true'})
123
+ end
124
+ end
125
+
126
+ # It should reject a non-string value for <:options>.
127
+ def test_rejecting_option_values
128
+ assert_raise(TreeTagger::UserError) do
129
+ TreeTagger::Tagger.new({:options => :quiet})
130
+ end
131
+ end
132
+
133
+ # It should reject invalid options for TreeTagger inside <:options>.
134
+ def test_rejecting_invalid_arguments
135
+ flunk 'Not implemented yet!'
136
+ end
137
+
138
+ # It should ensure the presence of the <-sgml> argument.
139
+ def test_presence_of_sgml_argument
140
+ flunk 'Not implemented yet!'
141
+ end
142
+
143
+ # It should reject a non-string value for <:blank_tag>.
144
+ def test_rejecting_blanktag_values
145
+ assert_raise(TreeTagger::UserError) do
146
+ TreeTagger::Tagger.new({:blank_tag => :blank})
147
+ end
148
+ end
149
+
150
+ # It should ensure that <:blank_tag> is a valid SGML sequence.
151
+ def test_sgml_form
152
+ flunk 'Not implemented yet!'
153
+ end
154
+ end
File without changes
File without changes
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ ARGV.clear
4
+ while gets
5
+ puts $_
6
+ end
7
+
8
+ #STDOUT.flush
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: treetagger-ruby
3
3
  version: !ruby/object:Gem::Version
4
- hash: 29
4
+ hash: 27
5
5
  prerelease:
6
6
  segments:
7
7
  - 0
8
- - 0
9
8
  - 1
10
- version: 0.0.1
9
+ - 0
10
+ version: 0.1.0
11
11
  platform: ruby
12
12
  authors:
13
13
  - Andrei Beliankou
@@ -15,7 +15,7 @@ autorequire:
15
15
  bindir: bin
16
16
  cert_chain: []
17
17
 
18
- date: 2011-12-18 00:00:00 Z
18
+ date: 2012-02-14 00:00:00 Z
19
19
  dependencies:
20
20
  - !ruby/object:Gem::Dependency
21
21
  name: rdoc
@@ -95,6 +95,12 @@ files:
95
95
  - LICENCE.rdoc
96
96
  - CHANGELOG.rdoc
97
97
  - .yardopts
98
+ - test/test_tagger.rb
99
+ - test/tree-tagger/corrupted_lexicon_file.txt
100
+ - test/tree-tagger/lexicon_file.txt
101
+ - test/tree-tagger/corrupted_model_file.par
102
+ - test/tree-tagger/model_file.par
103
+ - test/tree-tagger/tree-tagger
98
104
  - bin/rtt
99
105
  homepage: http://www.uni-trier.de/index.php?id=34451
100
106
  licenses: []
@@ -132,6 +138,11 @@ rubygems_version: 1.8.10
132
138
  signing_key:
133
139
  specification_version: 3
134
140
  summary: A wrapper for the TreeTagger by Helmut Schmid.
135
- test_files: []
136
-
141
+ test_files:
142
+ - test/test_tagger.rb
143
+ - test/tree-tagger/corrupted_lexicon_file.txt
144
+ - test/tree-tagger/lexicon_file.txt
145
+ - test/tree-tagger/corrupted_model_file.par
146
+ - test/tree-tagger/model_file.par
147
+ - test/tree-tagger/tree-tagger
137
148
  has_rdoc: