opener-tokenizer-base 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/README.md +148 -0
- data/bin/tokenizer-base +5 -0
- data/bin/tokenizer-de +5 -0
- data/bin/tokenizer-en +5 -0
- data/bin/tokenizer-es +5 -0
- data/bin/tokenizer-fr +5 -0
- data/bin/tokenizer-it +5 -0
- data/bin/tokenizer-nl +5 -0
- data/core/lib/Data/OptList.pm +256 -0
- data/core/lib/Params/Util.pm +866 -0
- data/core/lib/Sub/Exporter.pm +1101 -0
- data/core/lib/Sub/Exporter/Cookbook.pod +309 -0
- data/core/lib/Sub/Exporter/Tutorial.pod +280 -0
- data/core/lib/Sub/Exporter/Util.pm +354 -0
- data/core/lib/Sub/Install.pm +329 -0
- data/core/lib/Time/Stamp.pm +808 -0
- data/core/load-prefixes.pl +43 -0
- data/core/nonbreaking_prefixes/abbreviation_list.kaf +0 -0
- data/core/nonbreaking_prefixes/abbreviation_list.txt +444 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.ca +533 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.de +781 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.el +448 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.en +564 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.es +758 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.fr +1027 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.is +697 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.it +641 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.nl +739 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.pl +729 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.pt +656 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.ro +484 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.ru +705 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.sk +920 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.sl +524 -0
- data/core/nonbreaking_prefixes/nonbreaking_prefix.sv +492 -0
- data/core/split-sentences.pl +114 -0
- data/core/text-fixer.pl +169 -0
- data/core/tokenizer-cli.pl +363 -0
- data/core/tokenizer.pl +145 -0
- data/lib/opener/tokenizers/base.rb +84 -0
- data/lib/opener/tokenizers/base/version.rb +8 -0
- data/opener-tokenizer-base.gemspec +25 -0
- metadata +134 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 52254f5259f0ae95c92f14b961c4c56699da2ae0
+  data.tar.gz: 5c303cf8e7fd2d876e3f88dd66a2da9b6dc29b64
+SHA512:
+  metadata.gz: e1da706e735d5e3a872aa1249e29b5e612b16a8773341d496448ebccc2c114946b93fe46967e5cc8fd230ea9871496dd646b23da0bde6929a4ec8b740ab61f66
+  data.tar.gz: 10052efe22ae37be19137afe01c8b1770389eb8bd5ab94614783b384d605ea67a5bd7ba4927e24f8ca6347a91d95303a5f1c1a2fd367eff18d9e24e0412f1b14
data/README.md
ADDED
@@ -0,0 +1,148 @@
+[](https://drone.io/github.com/opener-project/tokenizer-base/latest)
+# Opener::Tokenizer::Base
+
+Base tokenizer for various languages such as English, German and Italian. Keep
+in mind that this tokenizer supports multiple languages and as such requires
+you to specify said language using the `-l` command-line option. The following
+languages are supported:
+
+* en
+* es
+* it
+* nl
+* de
+* fr
+
+More languages may be supported in the future.
+
+## Quick Use Overview
+
+Install the gem using specific_install:
+
+    gem specific_install opener-tokenizer-base \
+      -l https://github.com/opener-project/tokenizer-base.git
+
+If you don't have specific_install already, install it first:
+
+    gem install specific_install
+
+You should now be able to call the tokenizer as a regular shell command by its
+name. Once installed as a gem you can access it from anywhere. The application
+reads the text to tokenize from standard input:
+
+    echo "This is an English text." | tokenizer-base -l en
+
+For more information about the available CLI options run the following:
+
+    tokenizer-base --help
+
+## Requirements
+
+* Perl 5.14.2 or newer.
+* Ruby 1.9.3 or newer (1.9.2 should work too, but 1.9.3 is recommended). Ruby
+  2 is supported.
+
+## Installation
+
+To set up the project run the following commands:
+
+    bundle install
+    bundle exec rake compile
+
+This will install all the dependencies and generate the required files. To run
+all the tests (including the process of building the files first) you can run
+the following:
+
+    bundle exec rake
+
+or:
+
+    bundle exec rake test
+
+Building a new gem can be done as follows:
+
+    bundle exec rake build
+
+For more information invoke `rake -T` or take a look at the Rakefile.
+
+## Gem Installation
+
+Add the following to your Gemfile (use Git for now):
+
+    gem 'opener-tokenizer-base',
+      :git => 'git@github.com:opener-project/tokenizer-base.git'
+
+## Usage
+
+Once installed, the tokenizer can be called as a shell command. It reads from
+standard input and writes to standard output.
+
+Setting the language parameter is mandatory (there is no default language, nor
+automatic language detection). Omitting the language parameter raises an
+error. The language code must be preceded by `-l`:
+
+    echo "Tokenizer example." | tokenizer-base -l en
+
+or you can use one of the language-specific convenience commands:
+
+    echo "Tokenizer example." | tokenizer-en
+
+The output should be:
+
+```xml
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<KAF version="v1.opener" xml:lang="en">
+  <kafHeader>
+    <linguisticProcessors layer="text">
+      <lp name="opener-sentence-splitter-en" timestamp="2013-05-31T11:39:31Z" version="0.0.1"/>
+      <lp name="opener-tokenizer-en" timestamp="2013-05-31T11:39:32Z" version="1.0.1"/>
+    </linguisticProcessors>
+  </kafHeader>
+  <text>
+    <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+    <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+    <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+  </text>
+</KAF>
+```
+
+If you need a static timestamp you can use the `-t` option:
+
+    echo "Tokenizer example." | tokenizer-base -l en -t
+
+The output will be something along the lines of the following:
+
+```xml
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<KAF version="v1.opener" xml:lang="en">
+  <kafHeader>
+    <linguisticProcessors layer="text">
+      <lp name="opener-sentence-splitter-en" timestamp="0000-00-00T00:00:00Z" version="0.0.1"/>
+      <lp name="opener-tokenizer-en" timestamp="0000-00-00T00:00:00Z" version="1.0.1"/>
+    </linguisticProcessors>
+  </kafHeader>
+  <text>
+    <wf length="9" offset="0" para="1" sent="1" wid="w1">Tokenizer</wf>
+    <wf length="7" offset="10" para="1" sent="1" wid="w2">example</wf>
+    <wf length="1" offset="17" para="1" sent="1" wid="w3">.</wf>
+  </text>
+</KAF>
+```
+
+## Possible bugs
+
+The merging of all the tokenizers into one has been done quite quickly. It
+seems to work so far, but there may be bugs or missing functionality. As the
+tokenizer is the first step of the chain, any error will affect the analysis
+in the rest of the layers.
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b features/my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin features/my-new-feature`)
+5. If you're confident, merge your changes into master.
data/bin/tokenizer-base
ADDED
data/bin/tokenizer-de
ADDED
data/bin/tokenizer-en
ADDED
data/bin/tokenizer-es
ADDED
data/bin/tokenizer-fr
ADDED
data/bin/tokenizer-it
ADDED
data/bin/tokenizer-nl
ADDED
data/core/lib/Data/OptList.pm
ADDED
@@ -0,0 +1,256 @@
+use strict;
+use warnings;
+package Data::OptList;
+BEGIN {
+  $Data::OptList::VERSION = '0.107';
+}
+# ABSTRACT: parse and validate simple name/value option pairs
+
+use List::Util ();
+use Params::Util ();
+use Sub::Install 0.921 ();
+
+my %test_for;
+BEGIN {
+  %test_for = (
+    CODE   => \&Params::Util::_CODELIKE,  ## no critic
+    HASH   => \&Params::Util::_HASHLIKE,  ## no critic
+    ARRAY  => \&Params::Util::_ARRAYLIKE, ## no critic
+    SCALAR => \&Params::Util::_SCALAR0,   ## no critic
+  );
+}
+
+sub __is_a {
+  my ($got, $expected) = @_;
+
+  return List::Util::first { __is_a($got, $_) } @$expected if ref $expected;
+
+  return defined (
+    exists($test_for{$expected})
+    ? $test_for{$expected}->($got)
+    : Params::Util::_INSTANCE($got, $expected) ## no critic
+  );
+}
+
+sub mkopt {
+  my ($opt_list) = shift;
+
+  my ($moniker, $require_unique, $must_be); # the old positional args
+  my $name_test;
+
+  if (@_ == 1 and Params::Util::_HASHLIKE($_[0])) {
+    my $arg = $_[0];
+    ($moniker, $require_unique, $must_be, $name_test)
+      = @$arg{ qw(moniker require_unique must_be name_test) };
+  } else {
+    ($moniker, $require_unique, $must_be) = @_;
+  }
+
+  $moniker = 'unnamed' unless defined $moniker;
+
+  return [] unless $opt_list;
+
+  $name_test ||= sub { ! ref $_[0] };
+
+  $opt_list = [
+    map { $_ => (ref $opt_list->{$_} ? $opt_list->{$_} : ()) } keys %$opt_list
+  ] if ref $opt_list eq 'HASH';
+
+  my @return;
+  my %seen;
+
+  for (my $i = 0; $i < @$opt_list; $i++) { ## no critic
+    my $name = $opt_list->[$i];
+    my $value;
+
+    if ($require_unique) {
+      Carp::croak "multiple definitions provided for $name" if $seen{$name}++;
+    }
+
+    if    ($i == $#$opt_list)               { $value = undef;             }
+    elsif (not defined $opt_list->[$i+1])   { $value = undef; $i++        }
+    elsif ($name_test->($opt_list->[$i+1])) { $value = undef;             }
+    else                                    { $value = $opt_list->[++$i]  }
+
+    if ($must_be and defined $value) {
+      unless (__is_a($value, $must_be)) {
+        my $ref = ref $value;
+        Carp::croak "$ref-ref values are not valid in $moniker opt list";
+      }
+    }
+
+    push @return, [ $name => $value ];
+  }
+
+  return \@return;
+}
+
+sub mkopt_hash {
+  my ($opt_list, $moniker, $must_be) = @_;
+  return {} unless $opt_list;
+
+  $opt_list = mkopt($opt_list, $moniker, 1, $must_be);
+  my %hash = map { $_->[0] => $_->[1] } @$opt_list;
+  return \%hash;
+}
+
+BEGIN {
+  *import = Sub::Install::exporter {
+    exports => [qw(mkopt mkopt_hash)],
+  };
+}
+
+1;
+
+__END__
+
+=pod
+
+=head1 NAME
+
+Data::OptList - parse and validate simple name/value option pairs
+
+=head1 VERSION
+
+version 0.107
+
+=head1 SYNOPSIS
+
+  use Data::OptList;
+
+  my $options = Data::OptList::mkopt([
+    qw(key1 key2 key3 key4),
+    key5 => { ... },
+    key6 => [ ... ],
+    key7 => sub { ... },
+    key8 => { ... },
+    key8 => [ ... ],
+  ]);
+
+...is the same thing, more or less, as:
+
+  my $options = [
+    [ key1 => undef,       ],
+    [ key2 => undef,       ],
+    [ key3 => undef,       ],
+    [ key4 => undef,       ],
+    [ key5 => { ... },     ],
+    [ key6 => [ ... ],     ],
+    [ key7 => sub { ... }, ],
+    [ key8 => { ... },     ],
+    [ key8 => [ ... ],     ],
+  ];
+
+=head1 DESCRIPTION
+
+Hashes are great for storing named data, but if you want more than one entry
+for a name, you have to use a list of pairs. Even then, this is really boring
+to write:
+
+  $values = [
+    foo => undef,
+    bar => undef,
+    baz => undef,
+    xyz => { ... },
+  ];
+
+Just look at all those undefs! Don't worry, we can get rid of those:
+
+  $values = [
+    map { $_ => undef } qw(foo bar baz),
+    xyz => { ... },
+  ];
+
+Aaaauuugh! We've saved a little typing, but now it requires thought to read,
+and thinking is even worse than typing... and it's got a bug! It looked right,
+didn't it? Well, the C<< xyz => { ... } >> gets consumed by the map, and we
+don't get the data we wanted.
+
+With Data::OptList, you can do this instead:
+
+  $values = Data::OptList::mkopt([
+    qw(foo bar baz),
+    xyz => { ... },
+  ]);
+
+This works by assuming that any defined scalar is a name and any reference
+following a name is its value.
+
+=head1 FUNCTIONS
+
+=head2 mkopt
+
+  my $opt_list = Data::OptList::mkopt($input, \%arg);
+
+Valid arguments are:
+
+  moniker        - a word used in errors to describe the opt list; encouraged
+  require_unique - if true, no name may appear more than once
+  must_be        - types to which opt list values are limited (described below)
+  name_test      - a coderef used to test whether a value can be a name
+                   (described below, but you probably don't want this)
+
+This produces an array of arrays; the inner arrays are name/value pairs.
+Values will be either "undef" or a reference.
+
+Positional parameters may be used for compatibility with the old C<mkopt>
+interface:
+
+  my $opt_list = Data::OptList::mkopt($input, $moniker, $req_uni, $must_be);
+
+Valid values for C<$input>:
+
+  undef    -> []
+  hashref  -> [ [ key1 => value1 ] ... ] # non-ref values become undef
+  arrayref -> every name followed by a non-name becomes a pair: [ name => ref ]
+              every name followed by undef becomes a pair: [ name => undef ]
+              otherwise, it becomes [ name => undef ] like so:
+              [ "a", "b", [ 1, 2 ] ] -> [ [ a => undef ], [ b => [ 1, 2 ] ] ]
+
+By default, a I<name> is any defined non-reference. The C<name_test> parameter
+can be a code ref that tests whether the argument passed it is a name or not.
+This should be used rarely. Interactions between C<require_unique> and
+C<name_test> are not yet particularly elegant, as C<require_unique> just tests
+string equality. B<This may change.>
+
+The C<must_be> parameter is either a scalar or array of scalars; it defines
+what kind(s) of refs may be values. If an invalid value is found, an exception
+is thrown. If no value is passed for this argument, any reference is valid.
+If C<must_be> specifies that values must be CODE, HASH, ARRAY, or SCALAR, then
+Params::Util is used to check whether the given value can provide that
+interface. Otherwise, it checks that the given value is an object of the kind.
+
+In other words:
+
+  [ qw(SCALAR HASH Object::Known) ]
+
+Means:
+
+  _SCALAR0($value) or _HASH($value) or _INSTANCE($value, 'Object::Known')
+
+=head2 mkopt_hash
+
+  my $opt_hash = Data::OptList::mkopt_hash($input, $moniker, $must_be);
+
+Given valid C<L</mkopt>> input, this routine returns a reference to a hash. It
+will throw an exception if any name has more than one value.
+
+=head1 EXPORTS
+
+Both C<mkopt> and C<mkopt_hash> may be exported on request.
+
+=head1 AUTHOR
+
+Ricardo Signes <rjbs@cpan.org>
+
+=head1 COPYRIGHT AND LICENSE
+
+This software is copyright (c) 2006 by Ricardo Signes.
+
+This is free software; you can redistribute it and/or modify it under
+the same terms as the Perl 5 programming language system itself.
+
+=cut
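The pairing rule that the vendored C<mkopt> documentation describes ("every name followed by a non-name becomes a pair") can be sketched in Python for readers who don't speak Perl. This is an illustration of the documented semantics only, not the module itself; Perl references are modeled here as any non-string value:

```python
def mkopt(opt_list):
    """Pair names with values per Data::OptList's documented rule:
    a name followed by a non-name (here: any non-string, standing in
    for a Perl reference) takes it as its value; a name followed by
    another name, or by nothing, gets value None (Perl's undef)."""
    pairs = []
    i = 0
    while i < len(opt_list):
        name = opt_list[i]
        value = None
        if i + 1 < len(opt_list) and not isinstance(opt_list[i + 1], str):
            value = opt_list[i + 1]  # consume the following value
            i += 1
        pairs.append((name, value))
        i += 1
    return pairs

# The POD's own example: [ "a", "b", [ 1, 2 ] ]
print(mkopt(["a", "b", [1, 2]]))
# -> [('a', None), ('b', [1, 2])]
```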