opener-tokenizer-base 1.0.0

Files changed (44)
  1. checksums.yaml +7 -0
  2. data/README.md +148 -0
  3. data/bin/tokenizer-base +5 -0
  4. data/bin/tokenizer-de +5 -0
  5. data/bin/tokenizer-en +5 -0
  6. data/bin/tokenizer-es +5 -0
  7. data/bin/tokenizer-fr +5 -0
  8. data/bin/tokenizer-it +5 -0
  9. data/bin/tokenizer-nl +5 -0
  10. data/core/lib/Data/OptList.pm +256 -0
  11. data/core/lib/Params/Util.pm +866 -0
  12. data/core/lib/Sub/Exporter.pm +1101 -0
  13. data/core/lib/Sub/Exporter/Cookbook.pod +309 -0
  14. data/core/lib/Sub/Exporter/Tutorial.pod +280 -0
  15. data/core/lib/Sub/Exporter/Util.pm +354 -0
  16. data/core/lib/Sub/Install.pm +329 -0
  17. data/core/lib/Time/Stamp.pm +808 -0
  18. data/core/load-prefixes.pl +43 -0
  19. data/core/nonbreaking_prefixes/abbreviation_list.kaf +0 -0
  20. data/core/nonbreaking_prefixes/abbreviation_list.txt +444 -0
  21. data/core/nonbreaking_prefixes/nonbreaking_prefix.ca +533 -0
  22. data/core/nonbreaking_prefixes/nonbreaking_prefix.de +781 -0
  23. data/core/nonbreaking_prefixes/nonbreaking_prefix.el +448 -0
  24. data/core/nonbreaking_prefixes/nonbreaking_prefix.en +564 -0
  25. data/core/nonbreaking_prefixes/nonbreaking_prefix.es +758 -0
  26. data/core/nonbreaking_prefixes/nonbreaking_prefix.fr +1027 -0
  27. data/core/nonbreaking_prefixes/nonbreaking_prefix.is +697 -0
  28. data/core/nonbreaking_prefixes/nonbreaking_prefix.it +641 -0
  29. data/core/nonbreaking_prefixes/nonbreaking_prefix.nl +739 -0
  30. data/core/nonbreaking_prefixes/nonbreaking_prefix.pl +729 -0
  31. data/core/nonbreaking_prefixes/nonbreaking_prefix.pt +656 -0
  32. data/core/nonbreaking_prefixes/nonbreaking_prefix.ro +484 -0
  33. data/core/nonbreaking_prefixes/nonbreaking_prefix.ru +705 -0
  34. data/core/nonbreaking_prefixes/nonbreaking_prefix.sk +920 -0
  35. data/core/nonbreaking_prefixes/nonbreaking_prefix.sl +524 -0
  36. data/core/nonbreaking_prefixes/nonbreaking_prefix.sv +492 -0
  37. data/core/split-sentences.pl +114 -0
  38. data/core/text-fixer.pl +169 -0
  39. data/core/tokenizer-cli.pl +363 -0
  40. data/core/tokenizer.pl +145 -0
  41. data/lib/opener/tokenizers/base.rb +84 -0
  42. data/lib/opener/tokenizers/base/version.rb +8 -0
  43. data/opener-tokenizer-base.gemspec +25 -0
  44. metadata +134 -0
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 52254f5259f0ae95c92f14b961c4c56699da2ae0
+   data.tar.gz: 5c303cf8e7fd2d876e3f88dd66a2da9b6dc29b64
+ SHA512:
+   metadata.gz: e1da706e735d5e3a872aa1249e29b5e612b16a8773341d496448ebccc2c114946b93fe46967e5cc8fd230ea9871496dd646b23da0bde6929a4ec8b740ab61f66
+   data.tar.gz: 10052efe22ae37be19137afe01c8b1770389eb8bd5ab94614783b384d605ea67a5bd7ba4927e24f8ca6347a91d95303a5f1c1a2fd367eff18d9e24e0412f1b14
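The checksums above are the standard integrity digests RubyGems records for a packaged gem: a SHA1 and a SHA512 hex digest of `metadata.gz` and `data.tar.gz`. A minimal sketch of how such digests are produced (the payload string here is a stand-in, not this gem's actual archive contents):

```ruby
require 'digest'

# Stand-in for the bytes of metadata.gz / data.tar.gz.
payload = "example gem payload"

sha1   = Digest::SHA1.hexdigest(payload)    # 40 hex chars, like the SHA1 entries above
sha512 = Digest::SHA512.hexdigest(payload)  # 128 hex chars, like the SHA512 entries above

puts sha1
puts sha512
```

Verifying a downloaded gem amounts to recomputing these digests over the archive members and comparing them against checksums.yaml.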
@@ -0,0 +1,148 @@
+ [![Build Status](https://drone.io/github.com/opener-project/tokenizer-base/status.png)](https://drone.io/github.com/opener-project/tokenizer-base/latest)
+ # Opener::Tokenizer::Base
+
+ Base tokenizer for various languages such as English, German and Italian. Keep
+ in mind that this tokenizer supports multiple languages and as such requires
+ you to specify the language using the `-l` command-line option. The following
+ languages are supported:
+
+ * en
+ * es
+ * it
+ * nl
+ * de
+ * fr
+
+ More languages may be supported in the future.
+
+ ## Quick Use Overview
+
+ Install the gem using specific_install:
+
+     gem specific_install opener-tokenizer-base \
+         -l https://github.com/opener-project/tokenizer-base.git
+
+ If you don't have specific\_install already, install it first:
+
+     gem install specific_install
+
+ You should now be able to call the tokenizer as a regular shell command, by its
+ name. Once installed as a gem you can access it from anywhere. This application
+ reads the text to tokenize from standard input:
+
+     echo "This is an English text." | tokenizer-base -l en
+
+ For more information about the available CLI options run the following:
+
+     tokenizer-base --help
+
+ ## Requirements
+
+ * Perl 5.14.2 or newer.
+ * Ruby 1.9.3 or newer (1.9.2 should work too but 1.9.3 is recommended). Ruby
+   2 is supported.
+
+ ## Installation
+
+ To set up the project run the following commands:
+
+     bundle install
+     bundle exec rake compile
+
+ This will install all the dependencies and generate the required files. To run
+ all the tests (including the process of building the files first) you can run
+ the following:
+
+     bundle exec rake
+
+ or:
+
+     bundle exec rake test
+
+ Building a new gem can be done as follows:
+
+     bundle exec rake build
+
+ For more information invoke `rake -T` or take a look at the Rakefile.
+
+ ## Gem Installation
+
+ Add the following to your Gemfile (use Git for now):
+
+     gem 'opener-tokenizer-base',
+       :git => 'git@github.com:opener-project/tokenizer-base.git'
+
+ ## Usage
+
+ Once installed, the tokenizer can be called as a shell command. It reads from
+ standard input and writes to standard output.
+
+ Setting the language is mandatory (there is no default language, nor automatic
+ language detection). Providing no language parameter will raise an error. The
+ language is set with the `-l` option:
+
+     echo "Tokenizer example." | tokenizer-base -l en
+
+ or you can use one of the language-specific convenience commands:
+
+     echo "Tokenizer example." | tokenizer-it
+
+ The output should be:
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+ <KAF version="v1.opener" xml:lang="en">
+   <kafHeader>
+     <linguisticProcessors layer="text">
+       <lp name="opener-sentence-splitter-en" timestamp="2013-05-31T11:39:31Z" version="0.0.1"/>
+       <lp name="opener-tokenizer-en" timestamp="2013-05-31T11:39:32Z" version="1.0.1"/>
+     </linguisticProcessors>
+   </kafHeader>
+   <text>
+     <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+     <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+     <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+   </text>
+ </KAF>
+ ```
+
+ If you need a static timestamp you can use the `-t` parameter:
+
+     echo "Tokenizer example." | tokenizer-base -l en -t
+
+ The output will be something along the lines of the following:
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+ <KAF version="v1.opener" xml:lang="en">
+   <kafHeader>
+     <linguisticProcessors layer="text">
+       <lp name="opener-sentence-splitter-en" timestamp="0000-00-00T00:00:00Z" version="0.0.1"/>
+       <lp name="opener-tokenizer-en" timestamp="0000-00-00T00:00:00Z" version="1.0.1"/>
+     </linguisticProcessors>
+   </kafHeader>
+   <text>
+     <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+     <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+     <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+   </text>
+ </KAF>
+ ```
+
+ ## Possible bugs
+
+ The merging of all tokenizers into one has been done quite quickly. It seems
+ to work so far, but there may be bugs or missing functionality. As the
+ tokenizer is the first step of the chain, any error will affect the analysis
+ in all subsequent layers.
+
+ ## Contributing
+
+ 1. Fork it
+ 2. Create your feature branch (`git checkout -b features/my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin features/my-new-feature`)
+ 5. If you're confident, merge your changes into master.
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::Base.new
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::DE.new(:language=>"de")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::EN.new(:language=>"en")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::ES.new(:language=>"es")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::FR.new(:language=>"fr")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::IT.new(:language=>"it")
+ puts kernel.run
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/opener/tokenizers/base'
+
+ kernel = Opener::Tokenizers::NL.new(:language=>"nl")
+ puts kernel.run
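The seven `bin/` scripts above all share one shape: instantiate a language-specific kernel and print what it produces. A self-contained sketch of that dispatch pattern (the classes here are illustrative stand-ins, not the gem's real `Opener::Tokenizers` internals, whose kernels shell out to the Perl tokenizer):

```ruby
module Toy
  class Base
    def initialize(language: 'en')
      @language = language
    end

    # A real kernel would pipe the input through the Perl tokenizer;
    # this stand-in just tags its input with the configured language.
    def run(input)
      "[#{@language}] #{input.strip}"
    end
  end

  # One thin subclass per supported language, mirroring tokenizer-de etc.
  class DE < Base; def initialize; super(language: 'de'); end; end
  class NL < Base; def initialize; super(language: 'nl'); end; end
end

puts Toy::DE.new.run("Ein Beispiel.")  # => [de] Ein Beispiel.
```

Keeping the per-language commands as one-line wrappers over a single base class means adding a language is a new subclass plus a five-line script, which matches the file layout shown in this diff.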
@@ -0,0 +1,256 @@
+ use strict;
+ use warnings;
+ package Data::OptList;
+ BEGIN {
+   $Data::OptList::VERSION = '0.107';
+ }
+ # ABSTRACT: parse and validate simple name/value option pairs
+
+ use List::Util ();
+ use Params::Util ();
+ use Sub::Install 0.921 ();
+
+
+ my %test_for;
+ BEGIN {
+   %test_for = (
+     CODE   => \&Params::Util::_CODELIKE,  ## no critic
+     HASH   => \&Params::Util::_HASHLIKE,  ## no critic
+     ARRAY  => \&Params::Util::_ARRAYLIKE, ## no critic
+     SCALAR => \&Params::Util::_SCALAR0,   ## no critic
+   );
+ }
+
+ sub __is_a {
+   my ($got, $expected) = @_;
+
+   return List::Util::first { __is_a($got, $_) } @$expected if ref $expected;
+
+   return defined (
+     exists($test_for{$expected})
+     ? $test_for{$expected}->($got)
+     : Params::Util::_INSTANCE($got, $expected) ## no critic
+   );
+ }
+
+ sub mkopt {
+   my ($opt_list) = shift;
+
+   my ($moniker, $require_unique, $must_be); # the old positional args
+   my $name_test;
+
+   if (@_ == 1 and Params::Util::_HASHLIKE($_[0])) {
+     my $arg = $_[0];
+     ($moniker, $require_unique, $must_be, $name_test)
+       = @$arg{ qw(moniker require_unique must_be name_test) };
+   } else {
+     ($moniker, $require_unique, $must_be) = @_;
+   }
+
+   $moniker = 'unnamed' unless defined $moniker;
+
+   return [] unless $opt_list;
+
+   $name_test ||= sub { ! ref $_[0] };
+
+   $opt_list = [
+     map { $_ => (ref $opt_list->{$_} ? $opt_list->{$_} : ()) } keys %$opt_list
+   ] if ref $opt_list eq 'HASH';
+
+   my @return;
+   my %seen;
+
+   for (my $i = 0; $i < @$opt_list; $i++) { ## no critic
+     my $name = $opt_list->[$i];
+     my $value;
+
+     if ($require_unique) {
+       Carp::croak "multiple definitions provided for $name" if $seen{$name}++;
+     }
+
+     if    ($i == $#$opt_list)               { $value = undef;              }
+     elsif (not defined $opt_list->[$i+1])   { $value = undef; $i++         }
+     elsif ($name_test->($opt_list->[$i+1])) { $value = undef;              }
+     else                                    { $value = $opt_list->[++$i]   }
+
+     if ($must_be and defined $value) {
+       unless (__is_a($value, $must_be)) {
+         my $ref = ref $value;
+         Carp::croak "$ref-ref values are not valid in $moniker opt list";
+       }
+     }
+
+     push @return, [ $name => $value ];
+   }
+
+   return \@return;
+ }
+
+
+ sub mkopt_hash {
+   my ($opt_list, $moniker, $must_be) = @_;
+   return {} unless $opt_list;
+
+   $opt_list = mkopt($opt_list, $moniker, 1, $must_be);
+   my %hash = map { $_->[0] => $_->[1] } @$opt_list;
+   return \%hash;
+ }
+
+
+ BEGIN {
+   *import = Sub::Install::exporter {
+     exports => [qw(mkopt mkopt_hash)],
+   };
+ }
+
+ 1;
+
+ __END__
+ =pod
+
+ =head1 NAME
+
+ Data::OptList - parse and validate simple name/value option pairs
+
+ =head1 VERSION
+
+ version 0.107
+
+ =head1 SYNOPSIS
+
+   use Data::OptList;
+
+   my $options = Data::OptList::mkopt([
+     qw(key1 key2 key3 key4),
+     key5 => { ... },
+     key6 => [ ... ],
+     key7 => sub { ... },
+     key8 => { ... },
+     key8 => [ ... ],
+   ]);
+
+ ...is the same thing, more or less, as:
+
+   my $options = [
+     [ key1 => undef,       ],
+     [ key2 => undef,       ],
+     [ key3 => undef,       ],
+     [ key4 => undef,       ],
+     [ key5 => { ... },     ],
+     [ key6 => [ ... ],     ],
+     [ key7 => sub { ... }, ],
+     [ key8 => { ... },     ],
+     [ key8 => [ ... ],     ],
+   ];
+
+ =head1 DESCRIPTION
+
+ Hashes are great for storing named data, but if you want more than one entry
+ for a name, you have to use a list of pairs. Even then, this is really boring
+ to write:
+
+   $values = [
+     foo => undef,
+     bar => undef,
+     baz => undef,
+     xyz => { ... },
+   ];
+
+ Just look at all those undefs! Don't worry, we can get rid of those:
+
+   $values = [
+     map { $_ => undef } qw(foo bar baz),
+     xyz => { ... },
+   ];
+
+ Aaaauuugh! We've saved a little typing, but now it requires thought to read,
+ and thinking is even worse than typing... and it's got a bug! It looked right,
+ didn't it? Well, the C<< xyz => { ... } >> gets consumed by the map, and we
+ don't get the data we wanted.
+
+ With Data::OptList, you can do this instead:
+
+   $values = Data::OptList::mkopt([
+     qw(foo bar baz),
+     xyz => { ... },
+   ]);
+
+ This works by assuming that any defined scalar is a name and any reference
+ following a name is its value.
+
+ =head1 FUNCTIONS
+
+ =head2 mkopt
+
+   my $opt_list = Data::OptList::mkopt($input, \%arg);
+
+ Valid arguments are:
+
+   moniker        - a word used in errors to describe the opt list; encouraged
+   require_unique - if true, no name may appear more than once
+   must_be        - types to which opt list values are limited (described below)
+   name_test      - a coderef used to test whether a value can be a name
+                    (described below, but you probably don't want this)
+
+ This produces an array of arrays; the inner arrays are name/value pairs.
+ Values will be either "undef" or a reference.
+
+ Positional parameters may be used for compatibility with the old C<mkopt>
+ interface:
+
+   my $opt_list = Data::OptList::mkopt($input, $moniker, $req_uni, $must_be);
+
+ Valid values for C<$input>:
+
+   undef    -> []
+   hashref  -> [ [ key1 => value1 ] ... ] # non-ref values become undef
+   arrayref -> every name followed by a non-name becomes a pair: [ name => ref ]
+               every name followed by undef becomes a pair: [ name => undef ]
+               otherwise, it becomes [ name => undef ] like so:
+               [ "a", "b", [ 1, 2 ] ] -> [ [ a => undef ], [ b => [ 1, 2 ] ] ]
+
+ By default, a I<name> is any defined non-reference. The C<name_test> parameter
+ can be a code ref that tests whether the argument passed it is a name or not.
+ This should be used rarely. Interactions between C<require_unique> and
+ C<name_test> are not yet particularly elegant, as C<require_unique> just tests
+ string equality. B<This may change.>
+
+ The C<must_be> parameter is either a scalar or array of scalars; it defines
+ what kind(s) of refs may be values. If an invalid value is found, an exception
+ is thrown. If no value is passed for this argument, any reference is valid.
+ If C<must_be> specifies that values must be CODE, HASH, ARRAY, or SCALAR, then
+ Params::Util is used to check whether the given value can provide that
+ interface. Otherwise, it checks that the given value is an object of the kind.
+
+ In other words:
+
+   [ qw(SCALAR HASH Object::Known) ]
+
+ Means:
+
+   _SCALAR0($value) or _HASH($value) or _INSTANCE($value, 'Object::Known')
+
+ =head2 mkopt_hash
+
+   my $opt_hash = Data::OptList::mkopt_hash($input, $moniker, $must_be);
+
+ Given valid C<L</mkopt>> input, this routine returns a reference to a hash. It
+ will throw an exception if any name has more than one value.
+
+ =head1 EXPORTS
+
+ Both C<mkopt> and C<mkopt_hash> may be exported on request.
+
+ =head1 AUTHOR
+
+ Ricardo Signes <rjbs@cpan.org>
+
+ =head1 COPYRIGHT AND LICENSE
+
+ This software is copyright (c) 2006 by Ricardo Signes.
+
+ This is free software; you can redistribute it and/or modify it under
+ the same terms as the Perl 5 programming language system itself.
+
+ =cut
+
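The core rule the `mkopt` documentation describes, that any defined non-reference is a name and a reference immediately following a name is its value, translates readily outside Perl. A hedged Ruby sketch of that pairing rule (illustrative only, not part of this gem; Hash/Array/Proc stand in for Perl's reference types, and `nil` for `undef`):

```ruby
# Expand a Data::OptList-style option list into name/value pairs:
# scalars are names; a Hash, Array, or Proc right after a name is its value;
# bare names get nil.
def mkopt(list, require_unique: false)
  seen   = {}
  result = []
  i = 0
  while i < list.length
    name = list[i]
    raise "multiple definitions provided for #{name}" if require_unique && seen[name]
    seen[name] = true
    nxt = list[i + 1]
    if nxt.is_a?(Hash) || nxt.is_a?(Array) || nxt.is_a?(Proc)
      result << [name, nxt]   # reference-like value belongs to this name
      i += 2
    else
      result << [name, nil]   # bare name, no value
      i += 1
    end
  end
  result
end

p mkopt(["foo", "bar", "xyz", { a: 1 }])
# three pairs: "foo" and "bar" get nil, "xyz" keeps its Hash value
```

This reproduces the SYNOPSIS behavior above, including the bug the DESCRIPTION warns about: a value-like element never merges into a preceding `map`-style expansion, because pairing is decided purely by what follows each name.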