treetagger-ruby 0.0.1 → 0.1.0

@@ -1,17 +1,27 @@
1
1
  == COMPLETED
2
+ === 0.1.0
3
+ The interface is now clear and stable.
4
+
5
+ Tagging of big texts is possible since the TreeTagger is invoked as an
6
+ external process through a pipe.
7
+
8
+ Test suite improved.
2
9
  === 0.0.1
3
10
  Implemented simple tagging. The TreeTagger is invoked through an environment variable.
4
11
  === 0.0.1.prealpha
5
12
  Created the structure for this project, added documentation and a public repo.
6
13
 
7
14
  == PLANNED
8
- === 0.1.0
9
-
10
15
  === 0.2.0
16
+ Better tests. Support for all input types.
11
17
  === 0.3.0
18
+ Lemmatizer.
12
19
  === 0.4.0
20
+ File based FIFOs.
13
21
  === 0.5.0
22
+ File based queues.
14
23
  === 0.6.0
24
+ Full featured cmd interface.
15
25
  === 0.7.0
16
26
  === 0.8.0
17
27
  === 0.9.0
@@ -1,37 +1,69 @@
1
1
  = TreeTagger for Ruby
2
2
 
3
3
  * {RubyGems}[http://rubygems.org/gems/treetagger-ruby]
4
- * Developers {Homepage}[http://bu.chsta.be/]
4
+ * {Homepage}[http://bu.chsta.be/]
5
5
  * {RTT Project Page}[http://bu.chsta.be/projects/treetagger-ruby/]
6
6
  * {Source Code}[https://github.com/arbox/treetagger-ruby]
7
7
  * {Bug Tracker}[https://github.com/arbox/treetagger-ruby/issues]
8
8
 
9
9
  == DESCRIPTION
10
- The Ruby based wrapper for the TreeTagger by Helmut Schmid.
11
- Check it out if you are interested
12
- in Natural Language Processing (NLP) and Human Language Technology (HLT).
10
+ A Ruby based wrapper for the TreeTagger by Helmut Schmid.
11
+
12
+ Check it out if you are interested in Natural Language Processing (NLP)
13
+ and/or Human Language Technology (HLT).
14
+
15
+ This library provides comprehensive bindings for the
16
+ {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/],
17
+ a statistical, language-independent POS tagging and chunking software.
18
+
19
+ TreeTagger is language agnostic, it will never guess what language you're going
20
+ to use. It
21
+
22
+ TODO:
23
+ * References to Schmid's publications;
24
+ * How to use TreeTagger in the wild;
25
+ * Input and output format, tokenization;
26
+ * ...
27
+ * The current German parameter file has been estimated on single-byte encoded data.
28
+
13
29
  === Implemented Features
14
30
  Simple tagging.
15
31
 
32
+ Please have a look at the CHANGELOG file for details on implemented and planned
33
+ features.
16
34
 
17
35
  == INSTALLATION
18
36
  Before you install the <tt>treetagger-ruby</tt> package please ensure
19
- you have downloaded and installe the <tt>TreeTagger</tt> itself.
37
+ you have downloaded and installed the
38
+ {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
39
+ itself.
20
40
 
21
41
  The {TreeTagger}[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/]
22
- is a copyrighted software by Helmut Schmid and IMC, please read the license
23
- agreament befor you download the package.
42
+ is a copyrighted software by Helmut Schmid and
43
+ {IMS}[http://www.ims.uni-stuttgart.de/], please read the license
44
+ agreement before you download the TreeTagger package and language models.
24
45
 
25
46
  After the installation of the <tt>TreeTagger</tt> set the environment variable
26
- <tt>TREETAGGERHOME</tt> to the location where you have the programm installed.
27
- Usually this directory contains subdirectories <tt>bin, cmd, lib</tt> and
28
- <tt>doc</tt>.
29
- For instance you may add the following line to your <tt>.profile</tt> file:
30
- export TREETAGGERHOME='/path/to/your/TreeTagger/installation'
47
+ <tt>TREETAGGER_BINARY</tt> to the location where the binary <tt>tree-tagger</tt>
48
+ resides. Usually this binary is located under the <tt>bin</tt> directory in the
49
+ main installation directory of the <tt>TreeTagger</tt>.
50
+
51
+ Also you have to set the variable <tt>TREETAGGER_MODEL</tt> to the location of
52
+ the appropriate language model you have acquired in the training step.
53
+
54
+ For instance you may add the following lines to your <tt>.profile</tt> file:
55
+ export TREETAGGER_BINARY='/path/to/your/TreeTagger/bin/tree-tagger'
56
+ export TREETAGGER_MODEL='/path/to/your/TreeTagger/lib/german.par'
57
+
58
+ It is convenient to work with a default language model, but you can change
59
+ it every time during the instantiation of a new tagger instance.
60
+
61
+ If you want to feed a lexicon file into your tagger you can do it globally
62
+ through the environment variable <tt>TREETAGGER_LEXICON</tt>.
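The lookup order described above (an explicitly passed value first, then the matching <tt>TREETAGGER_*</tt> environment variable) can be sketched as follows. This is a minimal illustration, not the library's code: <tt>resolve_setting</tt> is a hypothetical helper, the environment is passed in as a plain hash so the sketch runs anywhere, and <tt>ArgumentError</tt> stands in for the gem's own error class.

```ruby
# Hypothetical helper, not part of treetagger-ruby: return an explicitly
# passed value if there is one, otherwise fall back to TREETAGGER_<KEY>.
def resolve_setting(key, opts = {}, env = ENV)
  return opts[key] unless opts[key].nil?

  name = "TREETAGGER_#{key.to_s.upcase}"
  # fetch yields the missing key name to the block when the variable is unset.
  env.fetch(name) do |missing|
    raise ArgumentError,
          "Provide a value for <:#{key}> or set the environment variable <#{missing}>!"
  end
end

# Assumed example environment; the paths are placeholders.
env = { 'TREETAGGER_BINARY' => '/path/to/your/TreeTagger/bin/tree-tagger' }
puts resolve_setting(:binary, {}, env)
puts resolve_setting(:model, { :model => '/path/to/another/model.par' }, env)
```

Passing the environment as an argument keeps the fallback logic testable without touching the real process environment.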
31
63
 
32
64
  <tt>treetagger-ruby</tt> is provided as a .gem package. Simply install it via
33
65
  {RubyGems}[http://rubygems.org/gems/treetagger-ruby].
34
- To install <tt>treetagger-ruby</tt> ussue the following command:
66
+ To install <tt>treetagger-ruby</tt> issue the following command:
35
67
  $ gem install treetagger-ruby
36
68
 
37
69
  If you want to do a system wide installation, do this as root
@@ -41,14 +73,88 @@ Alternatively use your Gemfile for dependency management.
41
73
 
42
74
 
43
75
  == SYNOPSIS
44
-
76
+ === Basic Usage
45
77
  Basic usage is very simple:
46
78
  $ require 'treetagger-ruby'
79
+ $ # Instantiate a tagger instance with default values.
47
80
  $ tagger = TreeTagger::Tagger.new
48
- $ api.process('Ich gehe in die Schule')
81
+ $ # Process an array of tokens.
82
+ $ tagger.process(%w{Ich gehe in die Schule})
83
+ $ # Flush the pipeline.
84
+ $ tagger.flush
85
+ $ # Get the processed data.
86
+ $ tagger.get_output
87
+
88
+ === Input Format
89
+ Basically you have to provide a tokenized sequence with possibly some additional
90
+ information on lexical classes of tokens and on their probabilities. Every token
91
+ has to be on a separate line. Due to technical limitations SGML tags
92
+ (i.e. sequences with a leading < and a trailing >) cannot be valid tokens since
93
+ they are used internally for delimiting meaningful content from flush tokens.
94
+ This implies the use of the <tt>-sgml</tt> option, which cannot be changed by the user.
95
+ It is a limitation of <em>this</em> library. If you do need to process tags,
96
+ fall back and use the TreeTagger as a standalone program, possibly employing
97
+ temp files to store your input and output. This behaviour will be also
98
+ implemented in future versions of <tt>treetagger-ruby</tt>.
99
+
100
+ Every token may occur alone on a line or be followed by additional
101
+ information:
102
+ * token;
103
+ * token (\\tab tag)+;
104
+ * token (\\tab tag \\space lemma)+;
105
+ * token (\\tab tag \\space probability)+;
106
+ * token (\\tab tag \\space probability \\space lemma)+.
107
+
108
+ Your input may look like the following sentence:
109
+ Die ART 0.99
110
+ neuen ADJA neu
111
+ Hunde NN NP
112
+ stehen VVFIN 0.99 stehen
113
+ an
114
+ den
115
+ Mauern NN Mauer
116
+ .
117
+
118
+
119
+ This wrapper accepts the input as <em>String</em> or <em>Array</em>.
120
+
121
+ If you want to use strings, you are responsible for the proper delimiters inside
122
+ the string: <tt>"Die\\tART 0.99\\nneuen\\tADJA neu\\nHunde\\tNN NP\\nstehen\\t
123
+ VVFIN 0.99 stehen\\nan\\nden\\nMauern\\tNN Mauer\\n.\\n"</tt>
124
+ For now <tt>treetagger-ruby</tt> does not check your markup for correctness and will
125
+ possibly report a <tt>TreeTagger::ExternalError</tt> if the TreeTagger process
126
+ dies due to input errors.
127
+
128
+ Using arrays is more convenient since they can be built programmatically.
129
+
130
+ Arrays should have the following structure:
131
+ * ['token', 'token', 'token'];
132
+ * ['token', ['token', ['POS', 'lemma'], ['POS', 'lemma']], 'token'];
133
+ * ['token', ['token', ['POS', prob], ['POS', 'prob']], 'token'];
134
+ * ['token', ['token', ['POS', prob, 'lemma'], ['POS', 'prob', 'lemma']]].
135
+
136
+ It is internally converted into the sequence <tt>token\\ntoken\\tPOS lemma\\t
137
+ POS lemma\\ntoken\\n</tt>, i.e. in the enriched version alternatives are
138
+ tab separated and entries are blank separated.
139
+
140
+ Note that probabilities may be given as strings or as numbers.
141
+
142
+ The lexicon lookup is +not+ implemented for now, that is, the latter three forms
143
+ of input arrays are not supported yet.
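The array forms listed above map onto the line format roughly as follows. This is a minimal sketch of the described conversion, not the library's code (the README states that the enriched forms are not supported yet); <tt>serialize_tokens</tt> is a hypothetical name.

```ruby
# Hypothetical sketch, not the library's code: serialize the array forms
# described in the README into the line-based TreeTagger input format.
def serialize_tokens(tokens)
  lines = tokens.map do |entry|
    if entry.is_a?(Array)
      token, *alternatives = entry
      # Alternatives are tab separated; the fields inside one alternative
      # (tag, probability, lemma) are blank separated.
      [token, *alternatives.map { |alt| alt.join(' ') }].join("\t")
    else
      entry.to_s
    end
  end
  # Every token ends up on its own line.
  lines.join("\n") + "\n"
end

input = ['Die', ['neuen', ['ADJA', 'neu']], ['stehen', ['VVFIN', 0.99, 'stehen']], '.']
puts serialize_tokens(input)
```

Because <tt>join</tt> converts each field with <tt>to_s</tt>, probabilities can be passed either as strings or as numbers.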
144
+
145
+ === Output Format
146
+ For now you'll get an array of strings. However, the precise string
147
+ structure depends on the cmd arguments you've provided during the tagger
148
+ instantiation.
149
+
150
+ For instance, for the input <tt>["Veruntreute", "die", "AWO", "Spendengeld", "?"]
151
+ </tt> you'll get the following output with the default cmd arguments:
152
+
153
+ <tt>["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
154
+ "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]</tt>
49
155
 
50
156
  See documentation in the TreeTagger::Tagger class for details
51
- on particular search methods.
157
+ on particular methods.
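Since the output is an array of tab separated strings, a caller will usually split each element before using it. The sketch below assumes the default arguments (token, tag and lemma per line) and uses the sample output from the Output Format section; <tt>parse_output</tt> is a hypothetical helper, not a method of the gem.

```ruby
# Hypothetical helper: split each output string into its tab separated
# fields. The -1 limit keeps trailing empty fields instead of dropping them.
def parse_output(lines)
  lines.map { |line| line.split("\t", -1) }
end

# Sample output copied from the README example.
output = ["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
          "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]

parse_output(output).each do |token, tag, lemma|
  puts format('%-12s %-6s %s', token, tag, lemma)
end
```

With other command line arguments the number of fields per line may differ, so the destructuring in the loop is only valid for the default configuration.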
52
158
 
53
159
  == EXCEPTION HIERARCHY
54
160
  While using TreeTagger you can face following errors:
@@ -56,10 +162,23 @@ While using TreeTagger you can face following errors:
56
162
  * <tt>TreeTagger::RuntimeError</tt>;
57
163
  * <tt>TreeTagger::ExternalError</tt>.
58
164
 
165
+ These three kinds of errors all subclass <tt>TreeTagger::Error</tt>, which
166
+ in turn is a subclass of <tt>StandardError</tt>. For an end user this means that
167
+ it is possible to intercept all errors from <em>treetagger-ruby</em> with
168
+ a simple <tt>rescue</tt> clause.
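A single <tt>rescue</tt> clause on the common base class could look like the sketch below. The class definitions are stand-ins that mirror the hierarchy described above so that the example runs without the gem installed; <tt>with_tagger_errors</tt> is a hypothetical helper.

```ruby
# Stand-in definitions mirroring the described hierarchy; in real code
# these classes come from the gem and are not redefined by the caller.
module TreeTagger
  class Error < StandardError; end
  class UserError < Error; end
  class RuntimeError < Error; end
  class ExternalError < Error; end
end

# Hypothetical helper: run a block and turn any library error into a
# uniform message; errors outside the hierarchy still propagate.
def with_tagger_errors
  yield
rescue TreeTagger::Error => e
  "#{e.class}: #{e.message}"
end

puts with_tagger_errors { raise TreeTagger::ExternalError, 'TreeTagger process died' }
```

Rescuing the base class rather than <tt>StandardError</tt> keeps unrelated failures, such as a <tt>NoMethodError</tt> in the calling code, visible.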
169
+
59
170
  == SUPPORT
60
- If you have question, bug reports or any suggestions, please drop me an email :)
61
- Any help is deeply appreciated!
171
+ If you have questions, bug reports or any suggestions, please drop me an email :)
62
172
 
173
+ == HOW TO CONTRIBUTE
174
+ Please contact me and suggest your ideas, report bugs, talk to me, if you want
175
+ to implement some features in the future releases of this library.
176
+
177
+ Please don't feel offended if I cannot accept all your pull requests, I have
178
+ to review them and find the appropriate time and place in the code base to
179
+ incorporate your valuable changes.
180
+
181
+ Any help is deeply appreciated!
63
182
  == CHANGELOG
64
183
  For details on future plans and work in progress see CHANGELOG.
65
184
 
data/bin/rtt CHANGED
@@ -8,24 +8,51 @@ options = TreeTagger::ARGVParser.parse(ARGV)
8
8
 
9
9
  tagger = TreeTagger::Tagger.new(options)
10
10
 
11
- while line = ARGF.gets
12
- # [['token', 'tag', 'lemma'], ['token', 'tag', 'lemma']]
13
- result_array = tagger.process(line.chomp)
14
-
15
- # Adding some colors to the output.
16
- # Using ANSI escape codes.
17
- red = "\e[31m"
18
- green = "\e[32m"
19
- blue = "\e[34m"
20
- reset = "\e[0m"
21
-
22
- result_array.each do |tuple|
23
- if $stdout.tty?
24
- tuple[0].insert(0, red).insert(-1, reset)
25
- tuple[1].insert(0, green).insert(-1, reset)
26
- tuple[2].insert(0, blue).insert(-1, reset)
11
+ # Adding some colors to the output.
12
+ # Using ANSI escape codes.
13
+ red = "\e[31m"
14
+ green = "\e[32m"
15
+ blue = "\e[34m"
16
+ reset = "\e[0m"
17
+
18
+ reader = Thread.new do
19
+ beginning = true
20
+ loop do
21
+ result_array = tagger.get_output
22
+ if result_array.nil?
23
+ if beginning
24
+ sleep(0.1)
25
+ next
26
+ else
27
+ break
28
+ end
29
+ end
30
+ sleep(0.2) # Is useful!
31
+
32
+ beginning = false
33
+ result_array.each do |tuple|
34
+ tuple = tuple.split("\t")
35
+
36
+ if $stdout.tty?
37
+ tuple[0].insert(0, red).insert(-1, reset) if tuple[0]
38
+ tuple[1].insert(0, green).insert(-1, reset) if tuple[1]
39
+ tuple[2].insert(0, blue).insert(-1, reset) if tuple[2]
40
+ end
41
+
42
+ # [['token', 'tag', 'lemma'], ['token', 'tag', 'lemma']]
43
+ $stdout.puts tuple.join("\t")
27
44
  end
28
-
29
- $stdout.puts tuple.join("\t")
30
45
  end
31
46
  end
47
+
48
+ # Read all lines from STDIN or from files.
49
+ while line = ARGF.gets
50
+ # Invoke tokenizer somehow here.
51
+ tagger.process(line)
52
+ end
53
+
54
+ tagger.flush
55
+
56
+ reader.join
57
+
58
+ STDOUT.flush
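The polling pattern used by the reader thread in this script relies on the non-blocking contract of <tt>get_output</tt>: an empty array while results are still pending, an array of strings when some are ready, and <tt>nil</tt> once nothing more is expected. A minimal sketch of that loop, assuming exactly this contract, is shown below. <tt>FakeTagger</tt> and <tt>drain</tt> are hypothetical names, and the startup case handled by the <tt>beginning</tt> flag above is left out.

```ruby
# Hypothetical stand-in that replays a fixed list of get_output results
# and returns nil once the list is exhausted.
class FakeTagger
  def initialize(batches)
    @batches = batches.dup
  end

  def get_output
    @batches.shift
  end
end

# Poll until nil; an empty batch means "not ready yet", not "finished".
def drain(tagger, pause = 0.01)
  results = []
  loop do
    batch = tagger.get_output
    break if batch.nil?

    if batch.empty?
      sleep(pause)
      next
    end
    results.concat(batch)
  end
  results
end

tagger = FakeTagger.new([[], ["Ich\tPPER\tich"], [], ["gehe\tVVFIN\tgehen"]])
puts drain(tagger)
```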
@@ -1,22 +1,228 @@
1
1
  # -*- encoding: utf-8 -*-
2
+ require 'thread'
3
+ require 'tree_tagger/error'
2
4
 
5
+ =begin
6
+ TODO:
7
+ - Observe the status of the reader thread.
8
+ - Control the status of the pipe and recreate it.
9
+ - Handle IO errors.
10
+ - Handle errors while allocating the TT object.
11
+ - Update the flush sentence, make it shorter.
12
+ - Store the queue on a persistent medium, not in the memory.
13
+ - Properly set the $ORS for all platforms.
14
+ =end
15
+ # :main: README.rdoc
16
+ # :title: TreeTagger - Ruby based Wrapper for the TreeTagger by Helmut Schmid
17
+ # Module comment
3
18
  module TreeTagger
19
+ # Class comment
4
20
  class Tagger
5
- def initialize(
6
- lang = :de,
7
- opts = {
8
- :sgml => true,
9
- :token => true,
10
- :lemma => true
21
+
22
+ BEGIN_MARKER = '<BEGIN_OF_THE_TT_INPUT>'
23
+ END_MARKER = '<END_OF_THE_TT_INPUT>'
24
+ # TT seems to hold only the last three tokens in the buffer.
25
+ # The flushing sentence can be shortened down to this size.
26
+ FLUSH_SENTENCE = "Das\nist\nein\nTestsatz\n,\num\ndas\nStossen\nder\nDaten\nsicherzustellen\n."
27
+
28
+ # Initializer comment
29
+ def initialize(opts = {
30
+ :binary => nil,
31
+ :model => nil,
32
+ :lexicon => nil,
33
+ :options => '-token -lemma -sgml -quiet',
34
+ :replace_blanks => true,
35
+ :blank_tag => '<BLANK>',
36
+ :lookup => false
11
37
  }
12
38
  )
13
- @lang = lang
14
- @opt = opts
39
+
40
+ @opts = validate_options(opts)
41
+ @blank_tag = @opts[:blank_tag]
42
+ @cmdline = "#{@opts[:binary]} #{@opts[:options]} #{@opts[:model]}"
43
+
44
+ @queue = Queue.new
45
+ @pipe = new_pipe
46
+ @pipe.sync = true
47
+ @reader = new_reader
48
+ @inside_output = false
49
+ @inside_input = false
50
+ @enqueued_tokens = 0
51
+ @mutex = Mutex.new
52
+ @queue_mutex = Mutex.new
53
+ # sleep(1) # Don't know if it's useful, no problems before.
54
+ end
55
+
56
+ # Send the string to the TreeTagger.
57
+ def process(input)
58
+
59
+ str = convert(input)
60
+ # Sanitize strings.
61
+ str = sanitize(str)
62
+ # Mark the beginning of the text.
63
+ if not @inside_input
64
+ str = "#{BEGIN_MARKER}\n#{str}\n"
65
+ @inside_input = true
66
+ else
67
+ str = str + "\n"
68
+ end
69
+ @mutex.synchronize { @enqueued_tokens += 1 }
70
+ @pipe.print(str)
15
71
  end
16
- def process(str)
17
- line = %x(echo '#{str}' | #{ENV['TREETAGGERHOME']}/cmd/tree-tagger-german)
18
- arr = line.split("\n").collect { |el| el.split("\t") }
72
+
73
+ # Get processed tokens back.
74
+ # This method is not blocking. If some tokens have been sent,
75
+ # but not received from the pipe yet, it returns an empty array.
76
+ # If all sent tokens are in the queue it returns all of them.
77
+ # If no more tokens are awaited it returns <nil>.
78
+ def get_output
79
+ output = []
80
+ tokens = 0
81
+ @queue_mutex.synchronize do
82
+ tokens = @queue.size
83
+ tokens.times { output << @queue.shift }
84
+ end
85
+ @mutex.synchronize do
86
+ @enqueued_tokens -= tokens
87
+ end
88
+
89
+ # Nil if nothing to process in the pipe.
90
+ # Possible only after flushing the pipe.
91
+ if @enqueued_tokens > 0
92
+ output
93
+ else
94
+ output.any? ? output : nil
95
+ end
96
+ end
97
+
98
+ # Get the rest of the text back.
99
+ # TT holds some meaningful parts in the buffer.
100
+ def flush
101
+ @inside_input = false
102
+ str = "#{END_MARKER}\n#{FLUSH_SENTENCE}\n"
103
+ @pipe.print(str)
104
+ # Here invoke the reader thread to ensure
105
+ # all output has been read.
106
+ #@reader.run
107
+ end
108
+
109
+ private
110
+ # Return the options hash after validation.
111
+ # {
112
+ # :binary => nil,
113
+ # :model => nil,
114
+ # :lexicon => nil,
115
+ # :options => '-token -lemma -sgml -quiet',
116
+ # :replace_blanks => true,
117
+ # :blank_tag => '<BLANK>',
118
+ # :lookup => false
119
+ # }
120
+ def validate_options(opts)
121
+ # Check if <:lookup> is boolean.
122
+
123
+ # Check if <:replace_blanks> is boolean.
124
+
125
+ # Check if <:options> is a string.
126
+
127
+ # Check if <:options> contains only allowed values.
128
+
129
+ # Ensure that <:options> contains <-sgml>.
130
+
131
+ # Check if <:blank_tag> is a string.
132
+
133
+ # Ensure that <:blank_tag> is a valid SGML sequence.
134
+
135
+ # Set the model and binary paths if not provided.
136
+ [:binary, :model].each do |key|
137
+ if opts[key].nil?
138
+ opts[key] = ENV.fetch("TREETAGGER_#{key.to_s.upcase}") do |missing|
139
+ fail UserError, "Provide a value for <:#{key}>" +
140
+ " or set the environment variable <#{missing}>!"
141
+ end
142
+ end
143
+ end
144
+
145
+ # Set the lexicon path if not provided but requested.
146
+ if opts[:lookup] && opts[:lexicon].nil?
147
+ opts[:lexicon] = ENV.fetch('TREETAGGER_LEXICON') do |missing|
148
+ fail UserError, 'Provide a value for <:lexicon>' +
149
+ ' or set the environment variable <TREETAGGER_LEXICON>!'
150
+ end
151
+ end
152
+
153
+ # Check for existence and readability of external files:
154
+ # * binary;
155
+ # * model;
156
+ # * lexicon (if applicable).
157
+
158
+ opts
159
+ end
160
+
161
+ # Starts the reader thread.
162
+ def new_reader
163
+ Thread.new do
164
+ while line = @pipe.gets
165
+ # The output strings must not contain "\n".
166
+ line.chomp!
167
+ case line
168
+ when BEGIN_MARKER
169
+ @inside_output = true
170
+ $stderr.puts 'Found the begin marker.' if $DEBUG
171
+ when END_MARKER
172
+ @inside_output = false
173
+ $stderr.puts 'Found the end marker.' if $DEBUG
174
+ else
175
+ if @inside_output
176
+ @queue_mutex.synchronize { @queue << line }
177
+ $stderr.puts "<#{line}> added to the queue." if $DEBUG
178
+ end
179
+ end
180
+ end
181
+ end # thread
182
+ end # start_reader
183
+
184
+ # This method may be utilized to keep the TT process alive.
185
+ # Check here if TT returns the exit code 1 in case of invalid options.
186
+ def new_pipe
187
+ IO.popen(@cmdline, 'r+')
188
+ end
189
+
190
+ # Convert token arrays to delimited strings.
191
+ def convert(input)
192
+ unless input.is_a?(Array) || input.is_a?(String)
193
+ fail UserError, "Not a valid input format: <#{input.class}>!"
194
+ end
195
+
196
+ if input.empty?
197
+ fail UserError, "Empty input is not allowed!"
198
+ end
199
+
200
+ if input.is_a?(Array)
201
+ input.each do |el|
202
+ unless el.is_a?(String)
203
+ fail UserError, "Input elements should be strings!"
204
+ end
205
+ el = sanitize(el)
206
+ end
207
+ input = input.join("\n")
208
+ end
209
+
210
+ input
211
+ end
212
+
213
+ def sanitize(str)
214
+ line = str.strip
215
+ if line.empty?
216
+ line = @blank_tag
217
+ end
218
+
219
+ line
19
220
  end
20
221
  end # class
21
222
  end # module
22
223
 
224
+ __END__
225
+ - tokenization
226
+ - lexicon lookup
227
+ - tagging
228
+ - error correction
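The marker handling in the reader thread above can be illustrated in isolation. The sketch below keeps only the lines between the begin and end markers, so output produced by the flush sentence after the end marker is dropped. It reads from a <tt>StringIO</tt> instead of a real TreeTagger pipe, and <tt>collect_tagged_lines</tt> is a hypothetical name; the marker strings are the ones defined in the class.

```ruby
require 'stringio'

# Hypothetical sketch of the marker filter: collect only the lines that
# appear between the begin and end markers of the stream.
def collect_tagged_lines(io,
                         begin_marker = '<BEGIN_OF_THE_TT_INPUT>',
                         end_marker = '<END_OF_THE_TT_INPUT>')
  inside = false
  collected = []
  while (line = io.gets)
    line = line.chomp
    case line
    when begin_marker
      inside = true
    when end_marker
      inside = false
    else
      collected << line if inside
    end
  end
  collected
end

# Simulated tagger output: real content framed by markers, followed by
# output that belongs to the flush sentence.
stream = StringIO.new("<BEGIN_OF_THE_TT_INPUT>\nIch\tPPER\tich\n" \
                      "<END_OF_THE_TT_INPUT>\nDas\tPDS\tdie\n")
puts collect_tagged_lines(stream)
```

This filter is also the reason why SGML-like tokens cannot be passed through the wrapper: the markers themselves are SGML tags that the tagger echoes back unchanged.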
@@ -1,3 +1,3 @@
1
1
  module TreeTagger
2
- VERSION = '0.0.1'
2
+ VERSION = '0.1.0'
3
3
  end
@@ -0,0 +1,154 @@
1
+ require 'test/unit'
2
+ require 'tree_tagger/tagger'
3
+ require 'tree_tagger/error'
4
+ require 'stringio'
5
+
6
+ class TestTagger < Test::Unit::TestCase
7
+
8
+ PUBLIC_METHODS = [:process,
9
+ :get_output,
10
+ :flush
11
+ ]
12
+ def setup
13
+ # ENV['TREETAGGER_BINARY'] = '/opt/TreeTagger/bin/tree-tagger'
14
+ # ENV['TREETAGGER_MODEL'] = '/opt/TreeTagger/lib/german.par'
15
+ # ENV['TREETAGGER_LEXICON'] = '/opt/TreeTagger/lib/german-lexicon.txt'
16
+
17
+ ENV['TREETAGGER_BINARY'] = 'test/tree-tagger/tree-tagger'
18
+ ENV['TREETAGGER_MODEL'] = 'test/tree-tagger/model_file.par'
19
+ ENV['TREETAGGER_LEXICON'] = 'test/tree-tagger/lexicon_file.txt'
20
+
21
+ params = {} # dummy for now
22
+ @tagger = TreeTagger::Tagger.new
23
+ end
24
+
25
+ def teardown
26
+ end
27
+
28
+ # It should have the following constants set.
29
+ def test_constants
30
+ end
31
+
32
+ # It should respond to valid methods
33
+ def test_public_methods
34
+ PUBLIC_METHODS.each do |m|
35
+ assert_respond_to(@tagger, m)
36
+ end
37
+ end
38
+
39
+ def test_tagger
40
+ end
41
+
42
+ # It should accept only arrays and strings.
43
+ def test_input_for_its_class
44
+ assert_nothing_raised do
45
+ @tagger.process "Ich\ngehe\nin\ndie\nSchule\n.\n"
46
+ @tagger.process %w{Ich gehe in die Schule .}
47
+ end
48
+ end
49
+
50
+ # It should reject non-string and non-array elements.
51
+ def test_rejecting_invalid_input
52
+ [{}, :input, 1, 1.0, Time.new].each do |input|
53
+ assert_raise(TreeTagger::UserError) do
54
+ @tagger.process(input)
55
+ end
56
+ end
57
+ end
58
+
59
+ # It should reject empty input.
60
+ def test_for_empty_input
61
+ ['', []].each do |input|
62
+ assert_raise(TreeTagger::UserError) do
63
+ @tagger.process(input)
64
+ end
65
+ end
66
+ end
67
+
68
+ # It should reject arrays with wrong elements.
69
+ def test_for_elements_of_arrays
70
+
71
+ end
72
+
73
+ # It should accept valid input.
74
+ def test_accepting_vaild_input
75
+ input = ''
76
+ end
77
+
78
+ # It should accept only valid input.
79
+ def test_input_validity
80
+ ['', [], {}, :input, [:one, :two]].each do |input|
81
+ assert_raise(TreeTagger::UserError) do
82
+ @tagger.process(input)
83
+ end
84
+ end
85
+ end
86
+
87
+ # It should instantiate a tagger instance only with valid options.
88
+ def test_for_binary_presence
89
+ ENV.delete('TREETAGGER_BINARY')
90
+ assert_raise(TreeTagger::UserError) do
91
+ TreeTagger::Tagger.new
92
+ end
93
+ end
94
+
95
+ # It should instantiate a tagger instance only with valid options.
96
+ def test_for_model_presence
97
+ ENV.delete('TREETAGGER_MODEL')
98
+ assert_raise(TreeTagger::UserError) do
99
+ TreeTagger::Tagger.new
100
+ end
101
+
102
+ end
103
+
104
+ # It should instantiate a tagger instance only with valid options.
105
+ def test_for_lexicon_presence
106
+ ENV.delete('TREETAGGER_LEXICON')
107
+ assert_raise(TreeTagger::UserError) do
108
+ TreeTagger::Tagger.new({:lookup => true, :options => '-quiet -sgml'})
109
+ end
110
+ end
111
+
112
+ # It should reject a non-boolean value for <:lookup>.
113
+ def test_rejecting_lookup_values
114
+ assert_raise(TreeTagger::UserError) do
115
+ TreeTagger::Tagger.new({:lookup => 'true', :options => '-quiet'})
116
+ end
117
+ end
118
+
119
+ # It should reject a non-boolean value for <:replace_blanks>.
120
+ def test_rejecting_blank_values
121
+ assert_raise(TreeTagger::UserError) do
122
+ TreeTagger::Tagger.new({:replace_blanks => 'true'})
123
+ end
124
+ end
125
+
126
+ # It should reject a non-string value for <:options>.
127
+ def test_rejecting_option_values
128
+ assert_raise(TreeTagger::UserError) do
129
+ TreeTagger::Tagger.new({:options => :quiet})
130
+ end
131
+ end
132
+
133
+ # It should reject invalid options for TreeTagger inside <:options>.
134
+ def test_rejecting_invalid_arguments
135
+ flunk 'Not implemented yet!'
136
+ end
137
+
138
+ # It should ensure the presence of the <-sgml> argument.
139
+ def test_presence_of_sgml_argument
140
+ flunk 'Not implemented yet!'
141
+ end
142
+
143
+ # It should reject a non-string value for <:blank_tag>.
144
+ def test_rejecting_blanktag_values
145
+ assert_raise(TreeTagger::UserError) do
146
+ TreeTagger::Tagger.new({:blank_tag => :blank})
147
+ end
148
+ end
149
+
150
+ # It should ensure that <:blank_tag> is a valid SGML sequence.
151
+ def test_sgml_form
152
+ flunk 'Not implemented yet!'
153
+ end
154
+ end
File without changes
File without changes
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ ARGV.clear
4
+ while gets
5
+ puts $_
6
+ end
7
+
8
+ #STDOUT.flush
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: treetagger-ruby
3
3
  version: !ruby/object:Gem::Version
4
- hash: 29
4
+ hash: 27
5
5
  prerelease:
6
6
  segments:
7
7
  - 0
8
- - 0
9
8
  - 1
10
- version: 0.0.1
9
+ - 0
10
+ version: 0.1.0
11
11
  platform: ruby
12
12
  authors:
13
13
  - Andrei Beliankou
@@ -15,7 +15,7 @@ autorequire:
15
15
  bindir: bin
16
16
  cert_chain: []
17
17
 
18
- date: 2011-12-18 00:00:00 Z
18
+ date: 2012-02-14 00:00:00 Z
19
19
  dependencies:
20
20
  - !ruby/object:Gem::Dependency
21
21
  name: rdoc
@@ -95,6 +95,12 @@ files:
95
95
  - LICENCE.rdoc
96
96
  - CHANGELOG.rdoc
97
97
  - .yardopts
98
+ - test/test_tagger.rb
99
+ - test/tree-tagger/corrupted_lexicon_file.txt
100
+ - test/tree-tagger/lexicon_file.txt
101
+ - test/tree-tagger/corrupted_model_file.par
102
+ - test/tree-tagger/model_file.par
103
+ - test/tree-tagger/tree-tagger
98
104
  - bin/rtt
99
105
  homepage: http://www.uni-trier.de/index.php?id=34451
100
106
  licenses: []
@@ -132,6 +138,11 @@ rubygems_version: 1.8.10
132
138
  signing_key:
133
139
  specification_version: 3
134
140
  summary: A wrapper for the TreeTagger by Helmut Schmid.
135
- test_files: []
136
-
141
+ test_files:
142
+ - test/test_tagger.rb
143
+ - test/tree-tagger/corrupted_lexicon_file.txt
144
+ - test/tree-tagger/lexicon_file.txt
145
+ - test/tree-tagger/corrupted_model_file.par
146
+ - test/tree-tagger/model_file.par
147
+ - test/tree-tagger/tree-tagger
137
148
  has_rdoc: