opener-tokenizer-base 1.0.0

Files changed (44)
  1. checksums.yaml +7 -0
  2. data/README.md +148 -0
  3. data/bin/tokenizer-base +5 -0
  4. data/bin/tokenizer-de +5 -0
  5. data/bin/tokenizer-en +5 -0
  6. data/bin/tokenizer-es +5 -0
  7. data/bin/tokenizer-fr +5 -0
  8. data/bin/tokenizer-it +5 -0
  9. data/bin/tokenizer-nl +5 -0
  10. data/core/lib/Data/OptList.pm +256 -0
  11. data/core/lib/Params/Util.pm +866 -0
  12. data/core/lib/Sub/Exporter.pm +1101 -0
  13. data/core/lib/Sub/Exporter/Cookbook.pod +309 -0
  14. data/core/lib/Sub/Exporter/Tutorial.pod +280 -0
  15. data/core/lib/Sub/Exporter/Util.pm +354 -0
  16. data/core/lib/Sub/Install.pm +329 -0
  17. data/core/lib/Time/Stamp.pm +808 -0
  18. data/core/load-prefixes.pl +43 -0
  19. data/core/nonbreaking_prefixes/abbreviation_list.kaf +0 -0
  20. data/core/nonbreaking_prefixes/abbreviation_list.txt +444 -0
  21. data/core/nonbreaking_prefixes/nonbreaking_prefix.ca +533 -0
  22. data/core/nonbreaking_prefixes/nonbreaking_prefix.de +781 -0
  23. data/core/nonbreaking_prefixes/nonbreaking_prefix.el +448 -0
  24. data/core/nonbreaking_prefixes/nonbreaking_prefix.en +564 -0
  25. data/core/nonbreaking_prefixes/nonbreaking_prefix.es +758 -0
  26. data/core/nonbreaking_prefixes/nonbreaking_prefix.fr +1027 -0
  27. data/core/nonbreaking_prefixes/nonbreaking_prefix.is +697 -0
  28. data/core/nonbreaking_prefixes/nonbreaking_prefix.it +641 -0
  29. data/core/nonbreaking_prefixes/nonbreaking_prefix.nl +739 -0
  30. data/core/nonbreaking_prefixes/nonbreaking_prefix.pl +729 -0
  31. data/core/nonbreaking_prefixes/nonbreaking_prefix.pt +656 -0
  32. data/core/nonbreaking_prefixes/nonbreaking_prefix.ro +484 -0
  33. data/core/nonbreaking_prefixes/nonbreaking_prefix.ru +705 -0
  34. data/core/nonbreaking_prefixes/nonbreaking_prefix.sk +920 -0
  35. data/core/nonbreaking_prefixes/nonbreaking_prefix.sl +524 -0
  36. data/core/nonbreaking_prefixes/nonbreaking_prefix.sv +492 -0
  37. data/core/split-sentences.pl +114 -0
  38. data/core/text-fixer.pl +169 -0
  39. data/core/tokenizer-cli.pl +363 -0
  40. data/core/tokenizer.pl +145 -0
  41. data/lib/opener/tokenizers/base.rb +84 -0
  42. data/lib/opener/tokenizers/base/version.rb +8 -0
  43. data/opener-tokenizer-base.gemspec +25 -0
  44. metadata +134 -0
@@ -0,0 +1,309 @@
1
+
2
+ # ABSTRACT: useful, demonstrative, or stupid Sub::Exporter tricks
3
+ # PODNAME: Sub::Exporter::Cookbook
4
+
5
+
6
+
7
+ __END__
8
+ =pod
9
+
10
+ =head1 NAME
11
+
12
+ Sub::Exporter::Cookbook - useful, demonstrative, or stupid Sub::Exporter tricks
13
+
14
+ =head1 VERSION
15
+
16
+ version 0.984
17
+
18
+ =head1 OVERVIEW
19
+
20
+ Sub::Exporter is a fairly simple tool, and can be used to achieve some very
21
+ simple goals. Its basic behaviors and their basic application (that is,
22
+ "traditional" exporting of routines) are described in
23
+ L<Sub::Exporter::Tutorial> and L<Sub::Exporter>. This document presents
24
+ applications that may not be immediately obvious, or that can demonstrate how
25
+ certain features can be put to use (for good or evil).
26
+
27
+ =head1 THE RECIPES
28
+
29
+ =head2 Exporting Methods as Routines
30
+
31
+ With Exporter.pm, exporting methods is a non-starter. Sub::Exporter makes it
32
+ simple. By using the C<curry_method> utility provided in
33
+ L<Sub::Exporter::Util>, a method can be exported with the invocant built in.
34
+
35
+ package Object::Strenuous;
36
+
37
+ use Sub::Exporter::Util 'curry_method';
38
+ use Sub::Exporter -setup => {
39
+ exports => [ objection => curry_method('new') ],
40
+ };
41
+
42
+ With this configuration, the importing code may contain:
43
+
44
+ my $obj = objection("irrelevant");
45
+
46
+ ...and this will be equivalent to:
47
+
48
+ my $obj = Object::Strenuous->new("irrelevant");
49
+
50
+ The built-in invocant is determined by the invocant for the C<import> method.
51
+ That means that if we were to subclass Object::Strenuous as follows:
52
+
53
+ package Object::Strenuous::Repeated;
54
+ @ISA = 'Object::Strenuous';
55
+
56
+ ...then importing C<objection> from the subclass would build in that subclass.
57
+
58
+ Finally, since the invocant can be an object, you can write something like
59
+ this:
60
+
61
+ package Cypher;
62
+ use Sub::Exporter::Util 'curry_method';
63
+ use Sub::Exporter -setup => {
64
+ exports => [ encypher => curry_method ],
65
+ };
66
+
67
+ with the expectation that C<import> will be called on an instantiated Cypher
68
+ object:
69
+
70
+ BEGIN {
71
+ my $cypher = Cypher->new( ... );
72
+ $cypher->import('encypher');
73
+ }
74
+
75
+ Now there is a globally-available C<encypher> routine which calls the encypher
76
+ method on an otherwise unavailable Cypher object.
77
+
78
+ =head2 Exporting Methods as Methods
79
+
80
+ While exporting modules usually export subroutines to be called as subroutines,
81
+ it's easy to use Sub::Exporter to export subroutines meant to be called as
82
+ methods on the importing package or its objects.
83
+
84
+ Here's a trivial (and naive) example:
85
+
86
+ package Mixin::DumpObj;
87
+
88
+ use Data::Dumper;
89
+
90
+ use Sub::Exporter -setup => {
91
+ exports => [ qw(dump) ]
92
+ };
93
+
94
+ sub dump {
95
+ my ($self) = @_;
96
+ return Dumper($self);
97
+ }
98
+
99
+ When writing your own object class, you can then import C<dump> to be used as a
100
+ method, called like so:
101
+
102
+ $object->dump;
103
+
104
+ By assuming that the importing class will provide a certain interface, a
105
+ method-exporting module can be used as a simple plugin:
106
+
107
+ package Number::Plugin::Upto;
108
+ use Sub::Exporter -setup => {
109
+ into => 'Number',
110
+ exports => [ qw(upto) ],
111
+ groups => [ default => [ qw(upto) ] ],
112
+ };
113
+
114
+ sub upto {
115
+ my ($self) = @_;
116
+ return 1 .. abs($self->as_integer);
117
+ }
118
+
119
+ The C<into> line in the configuration says that this plugin will export, by
120
+ default, into the Number package, not into the C<use>-ing package. It can be
121
+ exported anyway, though, and will work as long as the destination provides an
122
+ C<as_integer> method like the one it expects. To import it to a different
123
+ destination, one can just write:
124
+
125
+ use Number::Plugin::Upto { into => 'Quantity' };
126
+
127
+ =head2 Mixing-in Complex External Behavior
128
+
129
+ When exporting methods to be used as methods (see above), one very powerful
130
+ option is to export methods that are generated routines that maintain an
131
+ enclosed reference to the exporting module. This allows a user to import a
132
+ single method which is implemented in terms of a complete, well-structured
133
+ package.
134
+
135
+ Here is a very small example:
136
+
137
+ package Data::Analyzer;
138
+
139
+ use Sub::Exporter -setup => {
140
+ exports => [ analyze => \'_generate_analyzer' ],
141
+ };
142
+
143
+ sub _generate_analyzer {
144
+ my ($mixin, $name, $arg, $col) = @_;
145
+
146
+ return sub {
147
+ my ($self) = @_;
148
+
149
+ my $values = [ $self->values ];
150
+
151
+ my $analyzer = $mixin->new($values);
152
+ $analyzer->perform_analysis;
153
+ $analyzer->aggregate_results;
154
+
155
+ return $analyzer->summary;
156
+ };
157
+ }
158
+
159
+ If imported by any package providing a C<values> method, this plugin will
160
+ provide a single C<analyze> method that acts as a simple interface to a more
161
+ complex set of behaviors.
162
+
163
+ Even more importantly, because the C<$mixin> value will be the invocant on
164
+ which the C<import> was actually called, one can subclass C<Data::Analyzer> and
165
+ replace only individual pieces of the complex behavior, making it easy to write
166
+ complex, subclassable toolkits with simple single points of entry for external
167
+ interfaces.
168
+
169
+ =head2 Exporting Constants
170
+
171
+ While Sub::Exporter isn't in the constant-exporting business, it's easy to
172
+ export constants by using one of its sister modules, Package::Generator.
173
+
174
+ package Important::Constants;
175
+
176
+ use Sub::Exporter -setup => {
177
+ collectors => [ constants => \'_set_constants' ],
178
+ };
179
+
180
+ sub _set_constants {
181
+ my ($class, $value, $data) = @_;
182
+
183
+ Package::Generator->assign_symbols(
184
+ $data->{into},
185
+ [
186
+ MEANING_OF_LIFE => \42,
187
+ ONE_TRUE_BASE => \13,
188
+ FACTORS => [ 6, 9 ],
189
+ ],
190
+ );
191
+
192
+ return 1;
193
+ }
194
+
195
+ Then, someone can write:
196
+
197
+ use Important::Constants 'constants';
198
+
199
+ print "The factors @FACTORS produce $MEANING_OF_LIFE in $ONE_TRUE_BASE.";
200
+
201
+ (The constants must be exported via a collector, because they are effectively
202
+ altering the importing class in a way other than installing subroutines.)
203
+
204
+ =head2 Altering the Importer's @ISA
205
+
206
+ It's trivial to make a collector that changes the inheritance of an importing
207
+ package:
208
+
209
+ use Sub::Exporter -setup => {
210
+ collectors => { -base => \'_make_base' },
211
+ };
212
+
213
+ sub _make_base {
214
+ my ($class, $value, $data) = @_;
215
+
216
+ my $target = $data->{into};
217
+ push @{"$target\::ISA"}, $class;
218
+ }
219
+
220
+ Then, the user of your class can write:
221
+
222
+ use Some::Class -base;
223
+
224
+ and become a subclass. This can be quite useful in building, for example, a
225
+ module that helps build plugins. We may want a few utilities imported, but we
226
+ also want to inherit behavior from some base plugin class:
227
+
228
+ package Framework::Util;
229
+
230
+ use Sub::Exporter -setup => {
231
+ exports => [ qw(log global_config) ],
232
+ groups => [ _plugin => [ qw(log global_config) ] ],
233
+ collectors => { '-plugin' => \'_become_plugin' },
234
+ };
235
+
236
+ sub _become_plugin {
237
+ my ($class, $value, $data) = @_;
238
+
239
+ my $target = $data->{into};
240
+ push @{"$target\::ISA"}, $class->plugin_base_class;
241
+
242
+ push @{ $data->{import_args} }, '-_plugin';
243
+ }
244
+
245
+ Now, you can write a plugin like this:
246
+
247
+ package Framework::Plugin::AirFreshener;
248
+ use Framework::Util -plugin;
249
+
250
+ =head2 Eating Exporter.pm's Brain
251
+
252
+ You probably shouldn't actually do this in production. It's offered more as a
253
+ demonstration than a suggestion.
254
+
255
+ sub exporter_upgrade {
256
+ my ($pkg) = @_;
257
+ my $new_pkg = "$pkg\::UsingSubExporter";
258
+
259
+ return $new_pkg if $new_pkg->isa($pkg);
260
+
261
+ Sub::Exporter::setup_exporter({
262
+ as => 'import',
263
+ into => $new_pkg,
264
+ exports => [ @{"$pkg\::EXPORT_OK"} ],
265
+ groups => {
266
+ %{"$pkg\::EXPORT_TAGS"},
267
+ default => [ @{"$pkg\::EXPORT"} ],
268
+ },
269
+ });
270
+
271
+ @{"$new_pkg\::ISA"} = $pkg;
272
+ return $new_pkg;
273
+ }
274
+
275
+ This routine, given the name of an existing package configured to use
276
+ Exporter.pm, returns the name of a new package with a Sub::Exporter-powered
277
+ C<import> routine. This lets you write:
278
+
279
+ BEGIN {
280
+ require Toolkit;
281
+ exporter_upgrade('Toolkit')->import(exported_sub => { -as => 'foo' })
282
+ }
283
+
284
+ If you're feeling particularly naughty, this routine could have been declared
285
+ in the UNIVERSAL package, meaning you could write:
286
+
287
+ BEGIN {
288
+ require Toolkit;
289
+ Toolkit->exporter_upgrade->import(exported_sub => { -as => 'foo' })
290
+ }
291
+
292
+ The new package will have all the same exporter configuration as the original,
293
+ but will support export and group renaming, including exporting into scalar
294
+ references. Further, since Sub::Exporter uses C<can> to find the routine being
295
+ exported, the new package may be subclassed and some of its exports replaced.
296
+
297
+ =head1 AUTHOR
298
+
299
+ Ricardo Signes <rjbs@cpan.org>
300
+
301
+ =head1 COPYRIGHT AND LICENSE
302
+
303
+ This software is copyright (c) 2007 by Ricardo Signes.
304
+
305
+ This is free software; you can redistribute it and/or modify it under
306
+ the same terms as the Perl 5 programming language system itself.
307
+
308
+ =cut
309
+
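The "Exporting Methods as Routines" recipe above rests on an ordinary closure that remembers its invocant. The following is a minimal sketch in core Perl only, not Sub::Exporter's own implementation: the `Greeter` class and the `curry` helper are hypothetical names introduced for illustration, and the real `curry_method` additionally picks up the invocant at import time.

```perl
use strict;
use warnings;

package Greeter;    # hypothetical class, for illustration only

sub new {
    my ($class, $name) = @_;
    return bless { name => $name }, $class;
}

sub name { $_[0]{name} }

package main;

# Hypothetical helper: build a plain subroutine that always calls
# $method on the same invocant (a class name or an object).
sub curry {
    my ($invocant, $method) = @_;
    return sub { $invocant->$method(@_) };
}

# Calling the closure is equivalent to Greeter->new("irrelevant").
my $make_greeter = curry('Greeter', 'new');
my $obj          = $make_greeter->('irrelevant');

print ref($obj), " ", $obj->name, "\n";    # prints "Greeter irrelevant"
```

Because the invocant is captured when the closure is built, passing a subclass name or an object to `curry` changes what the resulting routine calls, which is the same property the recipe relies on.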
@@ -0,0 +1,280 @@
1
+
2
+ # PODNAME: Sub::Exporter::Tutorial
3
+ # ABSTRACT: a friendly guide to exporting with Sub::Exporter
4
+
5
+
6
+ __END__
7
+ =pod
8
+
9
+ =head1 NAME
10
+
11
+ Sub::Exporter::Tutorial - a friendly guide to exporting with Sub::Exporter
12
+
13
+ =head1 VERSION
14
+
15
+ version 0.984
16
+
17
+ =head1 DESCRIPTION
18
+
19
+ =head2 What's an Exporter?
20
+
21
+ When you C<use> a module, first it is required, then its C<import> method is
22
+ called. The Perl documentation tells us that the following two lines are
23
+ equivalent:
24
+
25
+ use Module LIST;
26
+
27
+ BEGIN { require Module; Module->import(LIST); }
28
+
29
+ The import method is the module's I<exporter>.
30
+
31
+ =head2 The Basics of Sub::Exporter
32
+
33
+ Sub::Exporter builds a custom exporter which can then be installed into your
34
+ module. It builds this method based on configuration passed to its
35
+ C<setup_exporter> method.
36
+
37
+ A very basic use case might look like this:
38
+
39
+ package Addition;
40
+ use Sub::Exporter;
41
+ Sub::Exporter::setup_exporter({ exports => [ qw(plus) ]});
42
+
43
+ sub plus { my ($x, $y) = @_; return $x + $y; }
44
+
45
+ This would mean that when someone used your Addition module, they could have
46
+ its C<plus> routine imported into their package:
47
+
48
+ use Addition qw(plus);
49
+
50
+ my $z = plus(2, 2); # this works, because now plus is in the main package
51
+
52
+ That syntax to set up the exporter, above, is a little verbose, so for the
53
+ simple case of just naming some exports, you can write this:
54
+
55
+ use Sub::Exporter -setup => { exports => [ qw(plus) ] };
56
+
57
+ ...which is the same as the original example -- except that now the exporter is
58
+ built and installed at compile time. Well, that and you typed less.
59
+
60
+ =head2 Using Export Groups
61
+
62
+ You can specify whole groups of things that should be exportable together.
63
+ These are called groups. L<Exporter> calls these tags. To specify groups, you
64
+ just pass a C<groups> key in your exporter configuration:
65
+
66
+ package Food;
67
+ use Sub::Exporter -setup => {
68
+ exports => [ qw(apple banana beef fluff lox rabbit) ],
69
+ groups => {
70
+ fauna => [ qw(beef lox rabbit) ],
71
+ flora => [ qw(apple banana) ],
72
+ }
73
+ };
74
+
75
+ Now, to import all that delicious foreign meat, your consumer needs only to
76
+ write:
77
+
78
+ use Food qw(:fauna);
79
+ use Food qw(-fauna);
80
+
81
+ Either one of the above is acceptable. A colon is more traditional, but
82
+ barewords with a leading colon can't be enquoted by a fat arrow. We'll see why
83
+ that matters later on.
84
+
85
+ Groups can contain other groups. If you include a group name (with the leading
86
+ dash or colon) in a group definition, it will be expanded recursively when the
87
+ exporter is called. The exporter will B<not> recurse into the same group twice
88
+ while expanding groups.
89
+
90
+ There are two special groups: C<all> and C<default>. The C<all> group is
91
+ defined by default, and contains all exportable subs. You can redefine it,
92
+ if you want to export only a subset when all exports are requested. The
93
+ C<default> group is the set of routines to export when nothing specific is
94
+ requested. By default, there is no C<default> group.
95
+
96
+ =head2 Renaming Your Imports
97
+
98
+ Sometimes you want to import something, but you don't like the name as which
99
+ it's imported. Sub::Exporter can rename your imports for you. If you wanted
100
+ to import C<lox> from the Food package, but you don't like the name, you could
101
+ write this:
102
+
103
+ use Food lox => { -as => 'salmon' };
104
+
105
+ Now you'd get the C<lox> routine, but it would be called salmon in your
106
+ package. You can also rename entire groups by using the C<prefix> option:
107
+
108
+ use Food -fauna => { -prefix => 'cute_little_' };
109
+
110
+ Now you can call your C<cute_little_rabbit> routine. (You can also call
111
+ C<cute_little_beef>, but that hardly seems as enticing.)
112
+
113
+ When you define groups, you can include renaming.
114
+
115
+ use Sub::Exporter -setup => {
116
+ exports => [ qw(apple banana beef fluff lox rabbit) ],
117
+ groups => {
118
+ fauna => [ qw(beef lox), rabbit => { -as => 'coney' } ],
119
+ }
120
+ };
121
+
122
+ A prefix on a group like that does the right thing. This is when it's useful
123
+ to use a dash instead of a colon to indicate a group: you can put a fat arrow
124
+ between the group and its arguments, then.
125
+
126
+ use Food -fauna => { -prefix => 'lovely_' };
127
+
128
+ eat( lovely_coney ); # this works
129
+
130
+ Prefixes also apply recursively. That means that this code works:
131
+
132
+ use Sub::Exporter -setup => {
133
+ exports => [ qw(apple banana beef fluff lox rabbit) ],
134
+ groups => {
135
+ fauna => [ qw(beef lox), rabbit => { -as => 'coney' } ],
136
+ allowed => [ -fauna => { -prefix => 'willing_' }, 'banana' ],
137
+ }
138
+ };
139
+
140
+ ...
141
+
142
+ use Food -allowed => { -prefix => 'any_' };
143
+
144
+ $dinner = any_willing_coney; # yum!
145
+
146
+ Groups can also be passed a C<-suffix> argument.
147
+
148
+ Finally, if the C<-as> argument to an exported routine is a reference to a
149
+ scalar, a reference to the routine will be placed in that scalar.
150
+
151
+ =head2 Building Subroutines to Order
152
+
153
+ Sometimes, you want to export things that you don't have on hand. You might
154
+ want to offer customized routines built to the specification of your consumer;
155
+ that's just good business! With Sub::Exporter, this is easy.
156
+
157
+ To offer subroutines to order, you need to provide a generator when you set up
158
+ your exporter. A generator is just a routine that returns a new routine.
159
+ L<perlref> is talking about these when it discusses closures and function
160
+ templates. The canonical example of a generator builds a unique incrementor;
161
+ here's how you'd do that with Sub::Exporter;
162
+
163
+ package Package::Counter;
164
+ use Sub::Exporter -setup => {
165
+ exports => [ counter => sub { my $i = 0; sub { $i++ } } ],
166
+ groups => { default => [ qw(counter) ] },
167
+ };
168
+
169
+ Now anyone can use your Package::Counter module and he'll receive a C<counter>
170
+ in his package. It will count up by one, and will never interfere with anyone
171
+ else's counter.
172
+
173
+ This isn't very useful, though, unless the consumer can explain what he wants.
174
+ This is done, in part, by supplying arguments when importing. The following
175
+ example shows how a generator can take and use arguments:
176
+
177
+ package Package::Counter;
178
+
179
+ sub _build_counter {
180
+ my ($class, $name, $arg) = @_;
181
+ $arg ||= {};
182
+ my $i = $arg->{start} || 0;
183
+ return sub { $i++ };
184
+ }
185
+
186
+ use Sub::Exporter -setup => {
187
+ exports => [ counter => \'_build_counter' ],
188
+ groups => { default => [ qw(counter) ] },
189
+ };
190
+
191
+ Now, the consumer can (if he wants) specify a starting value for his counter:
192
+
193
+ use Package::Counter counter => { start => 10 };
194
+
195
+ Arguments to a group are passed along to the generators of routines in that
196
+ group, but Sub::Exporter arguments -- anything beginning with a dash -- are
197
+ never passed in. When groups are nested, the arguments are merged as the
198
+ groups are expanded.
199
+
200
+ Notice, too, that in the example above, we gave a reference to a method I<name>
201
+ rather than a method I<implementation>. By giving the name rather than the
202
+ subroutine, we make it possible for subclasses of our "Package::Counter" module
203
+ to replace the C<_build_counter> method.
204
+
205
+ When a generator is called, it is passed four parameters:
206
+
207
+ =over
208
+
209
+ =item * the invocant on which the exporter was called
210
+
211
+ =item * the name of the export being generated (not the name it's being installed as)
212
+
213
+ =item * the arguments supplied for the routine
214
+
215
+ =item * the collection of generic arguments
216
+
217
+ =back
218
+
219
+ The fourth item is the last major feature that hasn't been covered.
220
+
221
+ =head2 Argument Collectors
222
+
223
+ Sometimes you will want to accept arguments once that can then be available to
224
+ any subroutine that you're going to export. To do this, you specify
225
+ collectors, like this:
226
+
227
+ package Menu::Airline
228
+ use Sub::Exporter -setup => {
229
+ exports => ... ,
230
+ groups => ... ,
231
+ collectors => [ qw(allergies ethics) ],
232
+ };
233
+
234
+ Collectors look like normal exports in the import call, but they don't do
235
+ anything but collect data which can later be passed to generators. If the
236
+ module was used like this:
237
+
238
+ use Menu::Airline allergies => [ qw(peanuts) ], ethics => [ qw(vegan) ];
239
+
240
+ ...the consumer would get a salad. Also, all the generators would be passed,
241
+ as their fourth argument, something like this:
242
+
243
+ { allerges => [ qw(peanuts) ], ethics => [ qw(vegan) ] }
244
+
245
+ Generators may have arguments in their definition, as well. These must be code
246
+ refs that perform validation of the collected values. They are passed the
247
+ collection value and may return true or false. If they return false, the
248
+ exporter will throw an exception.
249
+
250
+ =head2 Generating Many Routines in One Scope
251
+
252
+ Sometimes it's useful to have multiple routines generated in one scope. This
253
+ way they can share lexical data which is otherwise unavailable. To do this,
254
+ you can supply a generator for a group which returns a hashref of names and
255
+ code references. This generator is passed all the usual data, and the group
256
+ may receive the usual C<-prefix> or C<-suffix> arguments.
257
+
258
+ =head1 SEE ALSO
259
+
260
+ =over 4
261
+
262
+ =item *
263
+
264
+ L<Sub::Exporter> for complete documentation and references to other exporters
265
+
266
+ =back
267
+
268
+ =head1 AUTHOR
269
+
270
+ Ricardo Signes <rjbs@cpan.org>
271
+
272
+ =head1 COPYRIGHT AND LICENSE
273
+
274
+ This software is copyright (c) 2007 by Ricardo Signes.
275
+
276
+ This is free software; you can redistribute it and/or modify it under
277
+ the same terms as the Perl 5 programming language system itself.
278
+
279
+ =cut
280
+
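The generator idea from "Building Subroutines to Order" can be tried without installing anything. This is a minimal sketch in core Perl, assuming a standalone `build_counter` routine (a hypothetical name) that follows the tutorial's `($class, $name, $arg)` calling convention. It is called by hand here, whereas Sub::Exporter would call the generator during import.

```perl
use strict;
use warnings;

# Hypothetical generator: returns a fresh counter closure on every call.
# $class and $name are accepted only to mirror the tutorial's signature.
sub build_counter {
    my ($class, $name, $arg) = @_;
    $arg ||= {};
    my $i = $arg->{start} || 0;    # each closure gets its own private $i
    return sub { $i++ };
}

my $from_zero = build_counter('main', 'counter');
my $from_ten  = build_counter('main', 'counter', { start => 10 });

print $from_zero->(), "\n";    # prints 0
print $from_zero->(), "\n";    # prints 1
print $from_ten->(),  "\n";    # prints 10
```

Each call to the generator creates a new lexical `$i`, so the two counters never interfere with each other, which is the behavior the tutorial describes for independently imported counters.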