opener-tokenizer-base 1.0.0
- checksums.yaml +7 -0
- data/README.md +148 -0
- data/bin/tokenizer-base +5 -0
- data/bin/tokenizer-de +5 -0
- data/bin/tokenizer-en +5 -0
- data/bin/tokenizer-es +5 -0
- data/bin/tokenizer-fr +5 -0
- data/bin/tokenizer-it +5 -0
- data/bin/tokenizer-nl +5 -0
- data/core/lib/Data/OptList.pm +256 -0
- data/core/lib/Params/Util.pm +866 -0
- data/core/lib/Sub/Exporter.pm +1101 -0
- data/core/lib/Sub/Exporter/Cookbook.pod +309 -0
- data/core/lib/Sub/Exporter/Tutorial.pod +280 -0
- data/core/lib/Sub/Exporter/Util.pm +354 -0
- data/core/lib/Sub/Install.pm +329 -0
- data/core/lib/Time/Stamp.pm +808 -0
- data/core/load-prefixes.pl +43 -0
- data/core/nonbreaking_prefixes/abbreviation_list.kaf +0 -0
- data/core/nonbreaking_prefixes/abbreviation_list.txt +444 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.ca +533 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.de +781 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.el +448 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.en +564 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.es +758 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.fr +1027 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.is +697 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.it +641 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.nl +739 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.pl +729 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.pt +656 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.ro +484 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.ru +705 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.sk +920 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.sl +524 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.sv +492 -0
- data/core/split-sentences.pl +114 -0
- data/core/text-fixer.pl +169 -0
- data/core/tokenizer-cli.pl +363 -0
- data/core/tokenizer.pl +145 -0
- data/lib/opener/tokenizers/base.rb +84 -0
- data/lib/opener/tokenizers/base/version.rb +8 -0
- data/opener-tokenizer-base.gemspec +25 -0
- metadata +134 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 52254f5259f0ae95c92f14b961c4c56699da2ae0
+  data.tar.gz: 5c303cf8e7fd2d876e3f88dd66a2da9b6dc29b64
+SHA512:
+  metadata.gz: e1da706e735d5e3a872aa1249e29b5e612b16a8773341d496448ebccc2c114946b93fe46967e5cc8fd230ea9871496dd646b23da0bde6929a4ec8b740ab61f66
+  data.tar.gz: 10052efe22ae37be19137afe01c8b1770389eb8bd5ab94614783b384d605ea67a5bd7ba4927e24f8ca6347a91d95303a5f1c1a2fd367eff18d9e24e0412f1b14
data/README.md
ADDED
@@ -0,0 +1,148 @@
+[![Build Status](https://drone.io/github.com/opener-project/tokenizer-base/status.png)](https://drone.io/github.com/opener-project/tokenizer-base/latest)
+# Opener::Tokenizer::Base
+
+Base tokenizer for various languages such as English, German and Italian. Keep
+in mind that this tokenizer supports multiple languages and as such requires
+you to specify the language to use via the `-l` command line option. The
+following languages are supported:
+
+* en
+* es
+* it
+* nl
+* de
+* fr
+
+More languages may be supported in the future.
+
+## Quick Use Overview
+
+Install the Gem using specific_install:
+
+    gem specific_install opener-tokenizer-base \
+      -l https://github.com/opener-project/tokenizer-base.git
+
+If you don't have specific_install already, install it first:
+
+    gem install specific_install
+
+You should now be able to call the tokenizer as a regular shell command by its
+name. Once installed as a Gem you can access it from anywhere. This
+application reads the text to tokenize from standard input:
+
+    echo "This is an English text." | tokenizer-base -l en
+
+For more information about the available CLI options run the following:
+
+    tokenizer-base --help
+
+## Requirements
+
+* Perl 5.14.2 or newer.
+* Ruby 1.9.3 or newer (1.9.2 should work too but 1.9.3 is recommended). Ruby
+  2 is supported.
+
+## Installation
+
+To set up the project run the following commands:
+
+    bundle install
+    bundle exec rake compile
+
+This will install all the dependencies and generate the required files. To run
+all the tests (including the process of building the files first) you can run
+the following:
+
+    bundle exec rake
+
+or:
+
+    bundle exec rake test
+
+Building a new Gem can be done as follows:
+
+    bundle exec rake build
+
+For more information invoke `rake -T` or take a look at the Rakefile.
+
+
+## Gem Installation
+
+Add the following to your Gemfile (use Git for now):
+
+    gem 'opener-tokenizer-base',
+      :git => 'git@github.com:opener-project/tokenizer-base.git'
+
+
+## Usage
+
+Once installed, the tokenizer can be called as a shell command. It reads from
+standard input and writes to standard output.
+
+It is mandatory to set the language as a parameter; there is no default
+language nor automatic language detection. Providing no language parameter
+will raise an error. The language is set with the `-l` option:
+
+    echo "Tokenizer example." | tokenizer-base -l en
+
+or you can use one of the language specific convenience commands:
+
+    echo "Tokenizer example." | tokenizer-it
+
+The output should be:
+
+```xml
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<KAF version="v1.opener" xml:lang="en">
+  <kafHeader>
+    <linguisticProcessors layer="text">
+      <lp name="opener-sentence-splitter-en" timestamp="2013-05-31T11:39:31Z" version="0.0.1"/>
+      <lp name="opener-tokenizer-en" timestamp="2013-05-31T11:39:32Z" version="1.0.1"/>
+    </linguisticProcessors>
+  </kafHeader>
+  <text>
+    <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+    <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+    <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+  </text>
+</KAF>
+```
+
+If you need a static timestamp you can use the `-t` parameter:
+
+    echo "Tokenizer example." | tokenizer-base -l en -t
+
+The output will be something along the lines of the following:
+
+```xml
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<KAF version="v1.opener" xml:lang="en">
+  <kafHeader>
+    <linguisticProcessors layer="text">
+      <lp name="opener-sentence-splitter-en" timestamp="0000-00-00T00:00:00Z" version="0.0.1"/>
+      <lp name="opener-tokenizer-en" timestamp="0000-00-00T00:00:00Z" version="1.0.1"/>
+    </linguisticProcessors>
+  </kafHeader>
+  <text>
+    <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+    <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+    <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+  </text>
+</KAF>
+```
+
+## Possible bugs
+
+The merging of all the tokenizers into one has been done quite quickly. It
+seems to work so far, but there may be bugs or missing functionality. As the
+tokenizer is the first step of the chain, any error in it will affect the
+analysis of the rest of the layers.
+
+
+## Contributing
+
+1. Pull it
+2. Create your feature branch (`git checkout -b features/my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin features/my-new-feature`)
+5. If you're confident, merge your changes into master.
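As an aside on the KAF output shown in the README above: each `<wf>` element records the token's character `offset` into the input and its `length`. The following is a toy Ruby sketch (the method name `toy_tokenize` is hypothetical, and this is not the gem's actual Perl tokenizer) illustrating how those two attributes relate to the input string, with punctuation split off as its own token:

```ruby
# Toy illustration only, NOT the real tokenizer: compute the
# offset/length pairs that appear on each <wf> element.
def toy_tokenize(text)
  tokens = []

  # Scan alphanumeric runs and single punctuation marks separately,
  # recording where each match starts in the original string.
  text.scan(/[[:alnum:]]+|[[:punct:]]/) do
    match = Regexp.last_match

    tokens << {
      form:   match[0],
      offset: match.begin(0),
      length: match[0].length
    }
  end

  tokens
end

toy_tokenize('Tokenizer example.')
# => three tokens: "Tokenizer" (offset 0, length 9),
#    "example" (offset 10, length 7) and "." (offset 17, length 1),
#    matching the <wf> attributes in the README example.
```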
data/bin/tokenizer-base
ADDED
data/bin/tokenizer-de
ADDED
data/bin/tokenizer-en
ADDED
data/bin/tokenizer-es
ADDED
data/bin/tokenizer-fr
ADDED
data/bin/tokenizer-it
ADDED
data/bin/tokenizer-nl
ADDED
data/core/lib/Data/OptList.pm
ADDED
@@ -0,0 +1,256 @@
+use strict;
+use warnings;
+package Data::OptList;
+BEGIN {
+  $Data::OptList::VERSION = '0.107';
+}
+# ABSTRACT: parse and validate simple name/value option pairs
+
+use List::Util ();
+use Params::Util ();
+use Sub::Install 0.921 ();
+
+
+my %test_for;
+BEGIN {
+  %test_for = (
+    CODE   => \&Params::Util::_CODELIKE,  ## no critic
+    HASH   => \&Params::Util::_HASHLIKE,  ## no critic
+    ARRAY  => \&Params::Util::_ARRAYLIKE, ## no critic
+    SCALAR => \&Params::Util::_SCALAR0,   ## no critic
+  );
+}
+
+sub __is_a {
+  my ($got, $expected) = @_;
+
+  return List::Util::first { __is_a($got, $_) } @$expected if ref $expected;
+
+  return defined (
+    exists($test_for{$expected})
+    ? $test_for{$expected}->($got)
+    : Params::Util::_INSTANCE($got, $expected) ## no critic
+  );
+}
+
+sub mkopt {
+  my ($opt_list) = shift;
+
+  my ($moniker, $require_unique, $must_be); # the old positional args
+  my $name_test;
+
+  if (@_ == 1 and Params::Util::_HASHLIKE($_[0])) {
+    my $arg = $_[0];
+    ($moniker, $require_unique, $must_be, $name_test)
+      = @$arg{ qw(moniker require_unique must_be name_test) };
+  } else {
+    ($moniker, $require_unique, $must_be) = @_;
+  }
+
+  $moniker = 'unnamed' unless defined $moniker;
+
+  return [] unless $opt_list;
+
+  $name_test ||= sub { ! ref $_[0] };
+
+  $opt_list = [
+    map { $_ => (ref $opt_list->{$_} ? $opt_list->{$_} : ()) } keys %$opt_list
+  ] if ref $opt_list eq 'HASH';
+
+  my @return;
+  my %seen;
+
+  for (my $i = 0; $i < @$opt_list; $i++) { ## no critic
+    my $name = $opt_list->[$i];
+    my $value;
+
+    if ($require_unique) {
+      Carp::croak "multiple definitions provided for $name" if $seen{$name}++;
+    }
+
+    if    ($i == $#$opt_list)               { $value = undef;            }
+    elsif (not defined $opt_list->[$i+1])   { $value = undef; $i++       }
+    elsif ($name_test->($opt_list->[$i+1])) { $value = undef;            }
+    else                                    { $value = $opt_list->[++$i] }
+
+    if ($must_be and defined $value) {
+      unless (__is_a($value, $must_be)) {
+        my $ref = ref $value;
+        Carp::croak "$ref-ref values are not valid in $moniker opt list";
+      }
+    }
+
+    push @return, [ $name => $value ];
+  }
+
+  return \@return;
+}
+
+
+sub mkopt_hash {
+  my ($opt_list, $moniker, $must_be) = @_;
+  return {} unless $opt_list;
+
+  $opt_list = mkopt($opt_list, $moniker, 1, $must_be);
+  my %hash = map { $_->[0] => $_->[1] } @$opt_list;
+  return \%hash;
+}
+
+
+BEGIN {
+  *import = Sub::Install::exporter {
+    exports => [qw(mkopt mkopt_hash)],
+  };
+}
+
+1;
+
+__END__
+=pod
+
+=head1 NAME
+
+Data::OptList - parse and validate simple name/value option pairs
+
+=head1 VERSION
+
+version 0.107
+
+=head1 SYNOPSIS
+
+  use Data::OptList;
+
+  my $options = Data::OptList::mkopt([
+    qw(key1 key2 key3 key4),
+    key5 => { ... },
+    key6 => [ ... ],
+    key7 => sub { ... },
+    key8 => { ... },
+    key8 => [ ... ],
+  ]);
+
+...is the same thing, more or less, as:
+
+  my $options = [
+    [ key1 => undef,       ],
+    [ key2 => undef,       ],
+    [ key3 => undef,       ],
+    [ key4 => undef,       ],
+    [ key5 => { ... },     ],
+    [ key6 => [ ... ],     ],
+    [ key7 => sub { ... }, ],
+    [ key8 => { ... },     ],
+    [ key8 => [ ... ],     ],
+  ];
+
+=head1 DESCRIPTION
+
+Hashes are great for storing named data, but if you want more than one entry
+for a name, you have to use a list of pairs. Even then, this is really boring
+to write:
+
+  $values = [
+    foo => undef,
+    bar => undef,
+    baz => undef,
+    xyz => { ... },
+  ];
+
+Just look at all those undefs! Don't worry, we can get rid of those:
+
+  $values = [
+    map { $_ => undef } qw(foo bar baz),
+    xyz => { ... },
+  ];
+
+Aaaauuugh! We've saved a little typing, but now it requires thought to read,
+and thinking is even worse than typing... and it's got a bug! It looked right,
+didn't it? Well, the C<< xyz => { ... } >> gets consumed by the map, and we
+don't get the data we wanted.
+
+With Data::OptList, you can do this instead:
+
+  $values = Data::OptList::mkopt([
+    qw(foo bar baz),
+    xyz => { ... },
+  ]);
+
+This works by assuming that any defined scalar is a name and any reference
+following a name is its value.
+
+=head1 FUNCTIONS
+
+=head2 mkopt
+
+  my $opt_list = Data::OptList::mkopt($input, \%arg);
+
+Valid arguments are:
+
+  moniker        - a word used in errors to describe the opt list; encouraged
+  require_unique - if true, no name may appear more than once
+  must_be        - types to which opt list values are limited (described below)
+  name_test      - a coderef used to test whether a value can be a name
+                   (described below, but you probably don't want this)
+
+This produces an array of arrays; the inner arrays are name/value pairs.
+Values will be either "undef" or a reference.
+
+Positional parameters may be used for compatibility with the old C<mkopt>
+interface:
+
+  my $opt_list = Data::OptList::mkopt($input, $moniker, $req_uni, $must_be);
+
+Valid values for C<$input>:
+
+  undef    -> []
+  hashref  -> [ [ key1 => value1 ] ... ] # non-ref values become undef
+  arrayref -> every name followed by a non-name becomes a pair: [ name => ref ]
+              every name followed by undef becomes a pair: [ name => undef ]
+              otherwise, it becomes [ name => undef ] like so:
+              [ "a", "b", [ 1, 2 ] ] -> [ [ a => undef ], [ b => [ 1, 2 ] ] ]
+
+By default, a I<name> is any defined non-reference. The C<name_test> parameter
+can be a code ref that tests whether the argument passed it is a name or not.
+This should be used rarely. Interactions between C<require_unique> and
+C<name_test> are not yet particularly elegant, as C<require_unique> just tests
+string equality. B<This may change.>
+
+The C<must_be> parameter is either a scalar or array of scalars; it defines
+what kind(s) of refs may be values. If an invalid value is found, an exception
+is thrown. If no value is passed for this argument, any reference is valid.
+If C<must_be> specifies that values must be CODE, HASH, ARRAY, or SCALAR, then
+Params::Util is used to check whether the given value can provide that
+interface. Otherwise, it checks that the given value is an object of the kind.
+
+In other words:
+
+  [ qw(SCALAR HASH Object::Known) ]
+
+Means:
+
+  _SCALAR0($value) or _HASH($value) or _INSTANCE($value, 'Object::Known')
+
+=head2 mkopt_hash
+
+  my $opt_hash = Data::OptList::mkopt_hash($input, $moniker, $must_be);
+
+Given valid C<L</mkopt>> input, this routine returns a reference to a hash. It
+will throw an exception if any name has more than one value.
+
+=head1 EXPORTS
+
+Both C<mkopt> and C<mkopt_hash> may be exported on request.
+
+=head1 AUTHOR
+
+Ricardo Signes <rjbs@cpan.org>
+
+=head1 COPYRIGHT AND LICENSE
+
+This software is copyright (c) 2006 by Ricardo Signes.
+
+This is free software; you can redistribute it and/or modify it under
+the same terms as the Perl 5 programming language system itself.
+
+=cut
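The POD above documents mkopt's core expansion rule: each defined non-reference entry is a name, and a reference immediately following a name becomes that name's value, otherwise the value is undef. The following Ruby sketch (the method name `mkopt_sketch` is hypothetical; this is not part of the gem, only an illustration of the documented rule) mirrors that behaviour for the simple arrayref case:

```ruby
# Illustrative sketch of Data::OptList's documented mkopt rule for
# arrayref input: a reference-like element directly after a name is
# that name's value; a bare name pairs with nil (Perl's undef).
def mkopt_sketch(input)
  pairs = []
  i = 0

  while i < input.length
    name = input[i]
    nxt  = input[i + 1]

    if nxt.is_a?(Array) || nxt.is_a?(Hash) || nxt.is_a?(Proc)
      pairs << [name, nxt] # reference-like value belongs to this name
      i += 2
    else
      pairs << [name, nil] # bare name gets an undef-like value
      i += 1
    end
  end

  pairs
end

mkopt_sketch(['a', 'b', [1, 2]])
# => [['a', nil], ['b', [1, 2]]]
# i.e. the POD's example: [ "a", "b", [ 1, 2 ] ] -> [ [ a => undef ], [ b => [ 1, 2 ] ] ]
```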