opener-tokenizer-base 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. checksums.yaml +7 -0
  2. data/README.md +148 -0
  3. data/bin/tokenizer-base +5 -0
  4. data/bin/tokenizer-de +5 -0
  5. data/bin/tokenizer-en +5 -0
  6. data/bin/tokenizer-es +5 -0
  7. data/bin/tokenizer-fr +5 -0
  8. data/bin/tokenizer-it +5 -0
  9. data/bin/tokenizer-nl +5 -0
  10. data/core/lib/Data/OptList.pm +256 -0
  11. data/core/lib/Params/Util.pm +866 -0
  12. data/core/lib/Sub/Exporter.pm +1101 -0
  13. data/core/lib/Sub/Exporter/Cookbook.pod +309 -0
  14. data/core/lib/Sub/Exporter/Tutorial.pod +280 -0
  15. data/core/lib/Sub/Exporter/Util.pm +354 -0
  16. data/core/lib/Sub/Install.pm +329 -0
  17. data/core/lib/Time/Stamp.pm +808 -0
  18. data/core/load-prefixes.pl +43 -0
  19. data/core/nonbreaking_prefixes/abbreviation_list.kaf +0 -0
  20. data/core/nonbreaking_prefixes/abbreviation_list.txt +444 -0
  21. data/core/nonbreaking_prefixes/nonbreaking_prefix.ca +533 -0
  22. data/core/nonbreaking_prefixes/nonbreaking_prefix.de +781 -0
  23. data/core/nonbreaking_prefixes/nonbreaking_prefix.el +448 -0
  24. data/core/nonbreaking_prefixes/nonbreaking_prefix.en +564 -0
  25. data/core/nonbreaking_prefixes/nonbreaking_prefix.es +758 -0
  26. data/core/nonbreaking_prefixes/nonbreaking_prefix.fr +1027 -0
  27. data/core/nonbreaking_prefixes/nonbreaking_prefix.is +697 -0
  28. data/core/nonbreaking_prefixes/nonbreaking_prefix.it +641 -0
  29. data/core/nonbreaking_prefixes/nonbreaking_prefix.nl +739 -0
  30. data/core/nonbreaking_prefixes/nonbreaking_prefix.pl +729 -0
  31. data/core/nonbreaking_prefixes/nonbreaking_prefix.pt +656 -0
  32. data/core/nonbreaking_prefixes/nonbreaking_prefix.ro +484 -0
  33. data/core/nonbreaking_prefixes/nonbreaking_prefix.ru +705 -0
  34. data/core/nonbreaking_prefixes/nonbreaking_prefix.sk +920 -0
  35. data/core/nonbreaking_prefixes/nonbreaking_prefix.sl +524 -0
  36. data/core/nonbreaking_prefixes/nonbreaking_prefix.sv +492 -0
  37. data/core/split-sentences.pl +114 -0
  38. data/core/text-fixer.pl +169 -0
  39. data/core/tokenizer-cli.pl +363 -0
  40. data/core/tokenizer.pl +145 -0
  41. data/lib/opener/tokenizers/base.rb +84 -0
  42. data/lib/opener/tokenizers/base/version.rb +8 -0
  43. data/opener-tokenizer-base.gemspec +25 -0
  44. metadata +134 -0
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: 52254f5259f0ae95c92f14b961c4c56699da2ae0
+ data.tar.gz: 5c303cf8e7fd2d876e3f88dd66a2da9b6dc29b64
+ SHA512:
+ metadata.gz: e1da706e735d5e3a872aa1249e29b5e612b16a8773341d496448ebccc2c114946b93fe46967e5cc8fd230ea9871496dd646b23da0bde6929a4ec8b740ab61f66
+ data.tar.gz: 10052efe22ae37be19137afe01c8b1770389eb8bd5ab94614783b384d605ea67a5bd7ba4927e24f8ca6347a91d95303a5f1c1a2fd367eff18d9e24e0412f1b14
@@ -0,0 +1,148 @@
+ [![Build Status](https://drone.io/github.com/opener-project/tokenizer-base/status.png)](https://drone.io/github.com/opener-project/tokenizer-base/latest)
+ # Opener::Tokenizer::Base
+
+ Base tokenizer for various languages such as English, German and Italian. Keep
+ in mind that this tokenizer supports multiple languages, so you must specify
+ the language to use via the `-l` command-line option. The following languages
+ are supported:
+
+ * en
+ * es
+ * it
+ * nl
+ * de
+ * fr
+
+ More languages may be supported in the future.
+
+ ## Quick Use Overview
+
+ Install the Gem using specific\_install:
+
+     gem specific_install opener-tokenizer-base \
+         -l https://github.com/opener-project/tokenizer-base.git
+
+ If you don't have specific\_install already, install it first:
+
+     gem install specific_install
+
+ You should now be able to call the tokenizer as a regular shell command, by its
+ name. Once installed as a gem you can access it from anywhere. This application
+ reads the text to tokenize from standard input.
+
+     echo "This is an English text." | tokenizer-base -l en
+
+ For more information about the available CLI options run the following:
+
+     tokenizer-base --help
+
+ ## Requirements
+
+ * Perl 5.14.2 or newer.
+ * Ruby 1.9.3 or newer (1.9.2 should work too but 1.9.3 is recommended). Ruby
+   2 is supported.
+
+ ## Installation
+
+ To set up the project run the following commands:
+
+     bundle install
+     bundle exec rake compile
+
+ This will install all the dependencies and generate the Java files. To run all
+ the tests (including the process of building the files first) you can run the
+ following:
+
+     bundle exec rake
+
+ or:
+
+     bundle exec rake test
+
+ Building a new Gem can be done as follows:
+
+     bundle exec rake build
+
+ For more information invoke `rake -T` or take a look at the Rakefile.
+
+
+ ## Gem Installation
+
+ Add the following to your Gemfile (use Git for now):
+
+     gem 'opener-tokenizer-base',
+       :git => 'git@github.com:opener-project/tokenizer-base.git'
+
+
+ ## Usage
+
+ Once installed, the tokenizer can be called as a shell command. It reads from
+ standard input and writes to standard output.
+
+ It is mandatory to set the language as a parameter (there is no default
+ language nor automatic detection). Providing no language parameter will raise
+ an error. The language is set with the `-l` option:
+
+     echo "Tokenizer example." | tokenizer-base -l en
+
+ or you can use one of the language-specific convenience commands:
+
+     echo "Tokenizer example." | tokenizer-it
+
+ The output should be:
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+ <KAF version="v1.opener" xml:lang="en">
+   <kafHeader>
+     <linguisticProcessors layer="text">
+       <lp name="opener-sentence-splitter-en" timestamp="2013-05-31T11:39:31Z" version="0.0.1"/>
+       <lp name="opener-tokenizer-en" timestamp="2013-05-31T11:39:32Z" version="1.0.1"/>
+     </linguisticProcessors>
+   </kafHeader>
+   <text>
+     <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+     <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+     <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+   </text>
+ </KAF>
+ ```
+
+ If you need a static timestamp you can use the `-t` option:
+
+     echo "Tokenizer example." | tokenizer-base -l en -t
+
+ The output will be something along the lines of the following:
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+ <KAF version="v1.opener" xml:lang="en">
+   <kafHeader>
+     <linguisticProcessors layer="text">
+       <lp name="opener-sentence-splitter-en" timestamp="0000-00-00T00:00:00Z" version="0.0.1"/>
+       <lp name="opener-tokenizer-en" timestamp="0000-00-00T00:00:00Z" version="1.0.1"/>
+     </linguisticProcessors>
+   </kafHeader>
+   <text>
+     <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+     <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+     <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+   </text>
+ </KAF>
+ ```
+
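The `length` and `offset` attributes in the `<wf>` elements above can be reproduced with a short Ruby sketch. This is an illustration only, not the gem's actual Perl tokenizer; `word_forms` is a hypothetical helper:

```ruby
# Illustrative sketch: compute KAF <wf>-style offsets and lengths for a
# sentence. Words and punctuation become separate tokens, and each token's
# offset is its character position in the original text.
def word_forms(text)
  tokens = text.scan(/\w+|[^\w\s]/) # word runs, then single punctuation marks
  offset = 0
  tokens.each_with_index.map do |tok, i|
    offset = text.index(tok, offset)          # locate token in original text
    wf = { wid: "w#{i + 1}", offset: offset, length: tok.length, form: tok }
    offset += tok.length                      # continue scanning after it
    wf
  end
end

word_forms('Tokenizer example.').each do |wf|
  puts %(<wf length="#{wf[:length]}" offset="#{wf[:offset]}" wid="#{wf[:wid]}">#{wf[:form]}</wf>)
end
```

Running this against `"Tokenizer example."` yields the same offsets (0, 10, 17) and lengths (9, 7, 1) as the KAF output shown above.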
+ ## Possible bugs
+
+ The merging of all tokenizers into one was done rather quickly. It seems to
+ work so far, but there may be bugs or missing functionality. As the tokenizer
+ is the first step of the chain, any error in it will affect the analysis of
+ the rest of the layers.
+
+
+ ## Contributing
+
+ 1. Fork it
+ 2. Create your feature branch (`git checkout -b features/my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin features/my-new-feature`)
+ 5. If you're confident, merge your changes into master.
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::Base.new
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::DE.new(:language=>"de")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::EN.new(:language=>"en")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::ES.new(:language=>"es")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::FR.new(:language=>"fr")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::IT.new(:language=>"it")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::NL.new(:language=>"nl")
+ puts kernel.run
@@ -0,0 +1,256 @@
+ use strict;
+ use warnings;
+ package Data::OptList;
+ BEGIN {
+   $Data::OptList::VERSION = '0.107';
+ }
+ # ABSTRACT: parse and validate simple name/value option pairs
+
+ use List::Util ();
+ use Params::Util ();
+ use Sub::Install 0.921 ();
+
+
+ my %test_for;
+ BEGIN {
+   %test_for = (
+     CODE   => \&Params::Util::_CODELIKE,  ## no critic
+     HASH   => \&Params::Util::_HASHLIKE,  ## no critic
+     ARRAY  => \&Params::Util::_ARRAYLIKE, ## no critic
+     SCALAR => \&Params::Util::_SCALAR0,   ## no critic
+   );
+ }
+
+ sub __is_a {
+   my ($got, $expected) = @_;
+
+   return List::Util::first { __is_a($got, $_) } @$expected if ref $expected;
+
+   return defined (
+     exists($test_for{$expected})
+     ? $test_for{$expected}->($got)
+     : Params::Util::_INSTANCE($got, $expected) ## no critic
+   );
+ }
+
+ sub mkopt {
+   my ($opt_list) = shift;
+
+   my ($moniker, $require_unique, $must_be); # the old positional args
+   my $name_test;
+
+   if (@_ == 1 and Params::Util::_HASHLIKE($_[0])) {
+     my $arg = $_[0];
+     ($moniker, $require_unique, $must_be, $name_test)
+       = @$arg{ qw(moniker require_unique must_be name_test) };
+   } else {
+     ($moniker, $require_unique, $must_be) = @_;
+   }
+
+   $moniker = 'unnamed' unless defined $moniker;
+
+   return [] unless $opt_list;
+
+   $name_test ||= sub { ! ref $_[0] };
+
+   $opt_list = [
+     map { $_ => (ref $opt_list->{$_} ? $opt_list->{$_} : ()) } keys %$opt_list
+   ] if ref $opt_list eq 'HASH';
+
+   my @return;
+   my %seen;
+
+   for (my $i = 0; $i < @$opt_list; $i++) { ## no critic
+     my $name = $opt_list->[$i];
+     my $value;
+
+     if ($require_unique) {
+       Carp::croak "multiple definitions provided for $name" if $seen{$name}++;
+     }
+
+     if    ($i == $#$opt_list)               { $value = undef;            }
+     elsif (not defined $opt_list->[$i+1])   { $value = undef; $i++       }
+     elsif ($name_test->($opt_list->[$i+1])) { $value = undef;            }
+     else                                    { $value = $opt_list->[++$i] }
+
+     if ($must_be and defined $value) {
+       unless (__is_a($value, $must_be)) {
+         my $ref = ref $value;
+         Carp::croak "$ref-ref values are not valid in $moniker opt list";
+       }
+     }
+
+     push @return, [ $name => $value ];
+   }
+
+   return \@return;
+ }
+
+
+ sub mkopt_hash {
+   my ($opt_list, $moniker, $must_be) = @_;
+   return {} unless $opt_list;
+
+   $opt_list = mkopt($opt_list, $moniker, 1, $must_be);
+   my %hash = map { $_->[0] => $_->[1] } @$opt_list;
+   return \%hash;
+ }
+
+
+ BEGIN {
+   *import = Sub::Install::exporter {
+     exports => [qw(mkopt mkopt_hash)],
+   };
+ }
+
+ 1;
+
+ __END__
+ =pod
+
+ =head1 NAME
+
+ Data::OptList - parse and validate simple name/value option pairs
+
+ =head1 VERSION
+
+ version 0.107
+
+ =head1 SYNOPSIS
+
+   use Data::OptList;
+
+   my $options = Data::OptList::mkopt([
+     qw(key1 key2 key3 key4),
+     key5 => { ... },
+     key6 => [ ... ],
+     key7 => sub { ... },
+     key8 => { ... },
+     key8 => [ ... ],
+   ]);
+
+ ...is the same thing, more or less, as:
+
+   my $options = [
+     [ key1 => undef,       ],
+     [ key2 => undef,       ],
+     [ key3 => undef,       ],
+     [ key4 => undef,       ],
+     [ key5 => { ... },     ],
+     [ key6 => [ ... ],     ],
+     [ key7 => sub { ... }, ],
+     [ key8 => { ... },     ],
+     [ key8 => [ ... ],     ],
+   ];
+
+ =head1 DESCRIPTION
+
+ Hashes are great for storing named data, but if you want more than one entry
+ for a name, you have to use a list of pairs. Even then, this is really boring
+ to write:
+
+   $values = [
+     foo => undef,
+     bar => undef,
+     baz => undef,
+     xyz => { ... },
+   ];
+
+ Just look at all those undefs! Don't worry, we can get rid of those:
+
+   $values = [
+     map { $_ => undef } qw(foo bar baz),
+     xyz => { ... },
+   ];
+
+ Aaaauuugh! We've saved a little typing, but now it requires thought to read,
+ and thinking is even worse than typing... and it's got a bug! It looked right,
+ didn't it? Well, the C<< xyz => { ... } >> gets consumed by the map, and we
+ don't get the data we wanted.
+
+ With Data::OptList, you can do this instead:
+
+   $values = Data::OptList::mkopt([
+     qw(foo bar baz),
+     xyz => { ... },
+   ]);
+
+ This works by assuming that any defined scalar is a name and any reference
+ following a name is its value.
+
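The pairing rule just described can be sketched in Ruby (a hypothetical re-implementation for illustration, not the module's Perl code; in this sketch a plain string plays the role of a "defined scalar" name and any other object plays the role of a reference):

```ruby
# Hypothetical Ruby sketch of Data::OptList's pairing rule: a string is a
# name; a non-string immediately following a name becomes that name's
# value; a name followed by another name gets a nil value.
def mkopt(list)
  pairs = []
  items = list.dup
  until items.empty?
    name  = items.shift
    value = (!items.empty? && !items.first.is_a?(String)) ? items.shift : nil
    pairs << [name, value]
  end
  pairs
end
```

For example, `mkopt(['foo', 'bar', 'xyz', { 'a' => 1 }])` pairs `foo` and `bar` with `nil` and `xyz` with the hash, mirroring the Perl example above.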
+ =head1 FUNCTIONS
+
+ =head2 mkopt
+
+   my $opt_list = Data::OptList::mkopt($input, \%arg);
+
+ Valid arguments are:
+
+   moniker        - a word used in errors to describe the opt list; encouraged
+   require_unique - if true, no name may appear more than once
+   must_be        - types to which opt list values are limited (described below)
+   name_test      - a coderef used to test whether a value can be a name
+                    (described below, but you probably don't want this)
+
+ This produces an array of arrays; the inner arrays are name/value pairs.
+ Values will be either "undef" or a reference.
+
+ Positional parameters may be used for compatibility with the old C<mkopt>
+ interface:
+
+   my $opt_list = Data::OptList::mkopt($input, $moniker, $req_uni, $must_be);
+
+ Valid values for C<$input>:
+
+   undef    -> []
+   hashref  -> [ [ key1 => value1 ] ... ] # non-ref values become undef
+   arrayref -> every name followed by a non-name becomes a pair: [ name => ref ]
+               every name followed by undef becomes a pair: [ name => undef ]
+               otherwise, it becomes [ name => undef ] like so:
+               [ "a", "b", [ 1, 2 ] ] -> [ [ a => undef ], [ b => [ 1, 2 ] ] ]
+
+ By default, a I<name> is any defined non-reference. The C<name_test> parameter
+ can be a code ref that tests whether the argument passed it is a name or not.
+ This should be used rarely. Interactions between C<require_unique> and
+ C<name_test> are not yet particularly elegant, as C<require_unique> just tests
+ string equality. B<This may change.>
+
+ The C<must_be> parameter is either a scalar or array of scalars; it defines
+ what kind(s) of refs may be values. If an invalid value is found, an exception
+ is thrown. If no value is passed for this argument, any reference is valid.
+ If C<must_be> specifies that values must be CODE, HASH, ARRAY, or SCALAR, then
+ Params::Util is used to check whether the given value can provide that
+ interface. Otherwise, it checks that the given value is an object of the kind.
+
+ In other words:
+
+   [ qw(SCALAR HASH Object::Known) ]
+
+ Means:
+
+   _SCALAR0($value) or _HASH($value) or _INSTANCE($value, 'Object::Known')
+
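The C<must_be> dispatch described above (structural check for known kinds, instance check otherwise) can be sketched in Ruby. This is a hypothetical analogue for illustration only; `valid_value?` and `KIND_TESTS` are invented names, not part of the module:

```ruby
# Hypothetical Ruby analogue of the must_be check: each entry is either a
# known kind name (checked structurally, as Params::Util does) or a class
# name checked with is_a? (as _INSTANCE does).
KIND_TESTS = {
  'CODE'  => ->(v) { v.respond_to?(:call) }, # anything callable
  'HASH'  => ->(v) { v.is_a?(Hash) },
  'ARRAY' => ->(v) { v.is_a?(Array) },
}.freeze

def valid_value?(value, must_be)
  Array(must_be).any? do |kind|
    test = KIND_TESTS[kind]
    # Known kind: structural test. Otherwise: treat the entry as a class name.
    test ? test.call(value) : value.is_a?(Object.const_get(kind))
  end
end
```

So `valid_value?([1, 2], %w[HASH ARRAY])` passes via the ARRAY test, while `valid_value?({}, %w[ARRAY])` fails, matching the "invalid value raises an exception" behavior documented above.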
+ =head2 mkopt_hash
+
+   my $opt_hash = Data::OptList::mkopt_hash($input, $moniker, $must_be);
+
+ Given valid C<L</mkopt>> input, this routine returns a reference to a hash. It
+ will throw an exception if any name has more than one value.
+
+ =head1 EXPORTS
+
+ Both C<mkopt> and C<mkopt_hash> may be exported on request.
+
+ =head1 AUTHOR
+
+ Ricardo Signes <rjbs@cpan.org>
+
+ =head1 COPYRIGHT AND LICENSE
+
+ This software is copyright (c) 2006 by Ricardo Signes.
+
+ This is free software; you can redistribute it and/or modify it under
+ the same terms as the Perl 5 programming language system itself.
+
+ =cut
+