wukong 3.0.0.pre3 → 3.0.0

Files changed (76)
  1. data/Gemfile +1 -0
  2. data/README.md +689 -50
  3. data/bin/wu-local +1 -74
  4. data/diagrams/wu_local.dot +39 -0
  5. data/diagrams/wu_local.dot.png +0 -0
  6. data/examples/loadable.rb +2 -0
  7. data/examples/string_reverser.rb +7 -0
  8. data/lib/hanuman/stage.rb +2 -2
  9. data/lib/wukong.rb +21 -10
  10. data/lib/wukong/dataflow.rb +2 -5
  11. data/lib/wukong/doc_helpers.rb +14 -0
  12. data/lib/wukong/doc_helpers/dataflow_handler.rb +29 -0
  13. data/lib/wukong/doc_helpers/field_handler.rb +91 -0
  14. data/lib/wukong/doc_helpers/processor_handler.rb +29 -0
  15. data/lib/wukong/driver.rb +11 -1
  16. data/lib/wukong/local.rb +40 -0
  17. data/lib/wukong/local/event_machine_driver.rb +27 -0
  18. data/lib/wukong/local/runner.rb +98 -0
  19. data/lib/wukong/local/stdio_driver.rb +44 -0
  20. data/lib/wukong/local/tcp_driver.rb +47 -0
  21. data/lib/wukong/logger.rb +16 -7
  22. data/lib/wukong/plugin.rb +48 -0
  23. data/lib/wukong/processor.rb +57 -15
  24. data/lib/wukong/rake_helper.rb +6 -0
  25. data/lib/wukong/runner.rb +151 -128
  26. data/lib/wukong/runner/boot_sequence.rb +123 -0
  27. data/lib/wukong/runner/code_loader.rb +52 -0
  28. data/lib/wukong/runner/deploy_pack_loader.rb +75 -0
  29. data/lib/wukong/runner/help_message.rb +42 -0
  30. data/lib/wukong/spec_helpers.rb +4 -12
  31. data/lib/wukong/spec_helpers/integration_tests.rb +150 -0
  32. data/lib/wukong/spec_helpers/{integration_driver_matchers.rb → integration_tests/integration_test_matchers.rb} +28 -62
  33. data/lib/wukong/spec_helpers/integration_tests/integration_test_runner.rb +97 -0
  34. data/lib/wukong/spec_helpers/shared_examples.rb +19 -10
  35. data/lib/wukong/spec_helpers/unit_tests.rb +134 -0
  36. data/lib/wukong/spec_helpers/{processor_methods.rb → unit_tests/unit_test_driver.rb} +42 -8
  37. data/lib/wukong/spec_helpers/{spec_driver_matchers.rb → unit_tests/unit_test_matchers.rb} +6 -32
  38. data/lib/wukong/spec_helpers/unit_tests/unit_test_runner.rb +54 -0
  39. data/lib/wukong/version.rb +1 -1
  40. data/lib/wukong/widget/filters.rb +134 -8
  41. data/lib/wukong/widget/processors.rb +64 -5
  42. data/lib/wukong/widget/reducers/bin.rb +68 -18
  43. data/lib/wukong/widget/reducers/count.rb +12 -0
  44. data/lib/wukong/widget/reducers/group.rb +48 -5
  45. data/lib/wukong/widget/reducers/group_concat.rb +30 -2
  46. data/lib/wukong/widget/reducers/moments.rb +4 -4
  47. data/lib/wukong/widget/reducers/sort.rb +53 -3
  48. data/lib/wukong/widget/serializers.rb +37 -12
  49. data/lib/wukong/widget/utils.rb +1 -1
  50. data/spec/spec_helper.rb +20 -2
  51. data/spec/wukong/driver_spec.rb +2 -0
  52. data/spec/wukong/local/runner_spec.rb +40 -0
  53. data/spec/wukong/local_spec.rb +6 -0
  54. data/spec/wukong/logger_spec.rb +49 -0
  55. data/spec/wukong/processor_spec.rb +22 -0
  56. data/spec/wukong/runner_spec.rb +128 -8
  57. data/spec/wukong/widget/filters_spec.rb +28 -10
  58. data/spec/wukong/widget/processors_spec.rb +5 -5
  59. data/spec/wukong/widget/reducers/bin_spec.rb +14 -14
  60. data/spec/wukong/widget/reducers/count_spec.rb +1 -1
  61. data/spec/wukong/widget/reducers/group_spec.rb +7 -6
  62. data/spec/wukong/widget/reducers/moments_spec.rb +2 -2
  63. data/spec/wukong/widget/reducers/sort_spec.rb +1 -1
  64. data/spec/wukong/widget/serializers_spec.rb +84 -88
  65. data/spec/wukong/wu-local_spec.rb +109 -0
  66. metadata +43 -20
  67. data/bin/wu-server +0 -70
  68. data/lib/wukong/boot.rb +0 -96
  69. data/lib/wukong/configuration.rb +0 -8
  70. data/lib/wukong/emitter.rb +0 -22
  71. data/lib/wukong/server.rb +0 -119
  72. data/lib/wukong/spec_helpers/integration_driver.rb +0 -157
  73. data/lib/wukong/spec_helpers/processor_helpers.rb +0 -89
  74. data/lib/wukong/spec_helpers/spec_driver.rb +0 -28
  75. data/spec/wukong/local_runner_spec.rb +0 -31
  76. data/spec/wukong/wu_local_spec.rb +0 -125
data/Gemfile CHANGED
@@ -5,6 +5,7 @@ gemspec
5
5
  group :development do
6
6
  gem 'rake', '>= 0.9'
7
7
  gem 'rspec', '>= 2.8'
8
+ gem 'spork', '0.9.2'
8
9
  gem 'guard', '>= 1.0'
9
10
  gem 'guard-rspec', '>= 0.6'
10
11
  gem 'simplecov', '>= 0.5'
data/README.md CHANGED
@@ -19,6 +19,8 @@ Here is a list of various other projects which you may also want to
19
19
  peruse when trying to understand the full Wukong experience:
20
20
 
21
21
  * <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
22
+ * <a href="http://github.com/infochimps-labs/wukong-storm">wukong-storm</a>: Run Wukong processors within the Storm framework. Model flows locally before you run them.
23
+ * <a href="http://github.com/infochimps-labs/wukong-load">wukong-load</a>: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
22
24
  * <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
23
25
  * <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
24
26
 
@@ -36,7 +38,7 @@ processor is Ruby class which
36
38
  * subclasses `Wukong::Processor` (use the `Wukong.processor` method as sugar for this)
37
39
  * defines a `process` method which takes an input record, does something, and calls `yield` on the output
38
40
 
39
- Here's a processor that reverses all each input record:
41
+ Here's a processor that reverses each of its input records:
40
42
 
41
43
  ```ruby
42
44
  # in string_reverser.rb
@@ -47,8 +49,8 @@ Wukong.processor(:string_reverser) do
47
49
  end
48
50
  ```
49
51
 
50
- When you're developing your application, run your processors on the
51
- command line on flat input files using `wu-local`:
52
+ You can run this processor on the command line using text files as
53
+ input using the `wu-local` tool that comes with Wukong:
52
54
 
53
55
  ```
54
56
  $ cat novel.txt
@@ -59,35 +61,46 @@ $ cat novel.txt | wu-local string_reverser.rb
59
61
  .semit fo tsrow eht saw ti ,semit fo tseb eht saw tI
60
62
  ```
61
63
 
62
- You can use yield as often (or never) as you need. Here's a more
63
- complicated example to illustrate:
64
+ The `wu-local` program consumes one line at a time from STDIN and
65
+ calls your processor's `process` method with that line as a Ruby
66
+ String object. Each object you `yield` within your process method
67
+ will be printed back out on STDOUT.
68
+
69
+ ### Multiple Processors, Multiple (Or No) Yields
70
+
71
+ Processors are intended to be combined so they can be stored in the
72
+ same file like these two, related processors:
64
73
 
65
74
  ```ruby
66
75
  # in processors.rb
67
76
 
68
- Wukong.processor(:tokenizer) do
77
+ Wukong.processor(:splitter) do
69
78
  def process line
70
79
  line.split.each { |token| yield token }
71
80
  end
72
81
  end
73
82
 
74
- Wukong.processor(:starts_with) do
75
-
76
- field :letter, String, :default => 'a'
77
-
78
- def process word
79
- yield word if word =~ Regexp.new("^#{letter}", true)
83
+ Wukong.processor(:normalizer) do
84
+ def process token
85
+ stripped = token.downcase.gsub(/\W/,'')
86
+ yield stripped if stripped.size > 0
80
87
  end
81
88
  end
82
89
  ```
83
90
 
84
- Let's start by running the `tokenizer`. We've defined two processors
85
- in the file `processors.rb` and neither one is named `processors` so
86
- we have to tell `wu-local` the name of the processor we want to run
87
- explicitly.
91
+ Notice how the `splitter` yields multiple tokens for each of its input
92
+ tokens and that the `normalizer` may sometimes never yield at all,
93
+ depending on its input. Processors are under no obligations by the
94
+ framework to yield or return anything so they can easily act as
95
+ filters or even sinks in data flows.
96
+
97
+ There are two processors in this file and neither shares a name with
98
+ the basename of the file ("processors") so `wu-local` can't
99
+ automatically choose a processor to run. We can specify one
100
+ explicitly with the `--run` option:
88
101
 
89
102
  ```
90
- $ cat novel.txt | wu-local processors.rb --run=tokenizer
103
+ $ cat novel.txt | wu-local processors.rb --run=splitter
91
104
  It
92
105
  was
93
106
  the
@@ -97,39 +110,454 @@ times,
97
110
  ...
98
111
  ```
99
112
 
100
- You can combine the output of one processor with another right in the
101
- shell. Let's add the `starts_with` filter and also pass in the
102
- *field* `letter`, defined in that processor:
113
+ We can combine the two processors together
103
114
 
104
115
  ```
105
- $ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local processors.rb --run=starts_with --letter=t
106
- the
107
- times
116
+ $ cat novel.txt | wu-local processors.rb --run=splitter | wu-local processors.rb --run=normalizer
117
+ it
118
+ was
108
119
  the
120
+ best
121
+ of
109
122
  times
110
123
  ...
111
124
  ```
112
125
 
113
- Wanting to match on a regular expression is such a common task that
114
- Wukong has a built-in "widget" called `regexp` that you can use
115
- directly:
126
+ but there's an easier way of doing this with <a href="#flows">dataflows</a>.
127
+
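The chaining described above can be sketched in plain Ruby, without Wukong at all. This is only an illustration of how a block passed to one `yield`-based step can feed the next; the top-level methods `splitter`, `normalizer`, and `split_and_normalize` are hypothetical stand-ins for the processors, not part of the Wukong API.

```ruby
# Plain-Ruby stand-ins for the two processors (assumption: no Wukong involved).
def splitter(line)
  # Yields once per whitespace-separated token.
  line.split.each { |token| yield token }
end

def normalizer(token)
  # Yields the cleaned token, or nothing at all if it ends up empty.
  stripped = token.downcase.gsub(/\W/, '')
  yield stripped if stripped.size > 0
end

# Hypothetical helper that wires the two steps together, much like the
# shell pipeline does with two wu-local invocations.
def split_and_normalize(line)
  tokens = []
  splitter(line) do |token|
    normalizer(token) { |normalized| tokens << normalized }
  end
  tokens
end

p split_and_normalize("It was the best of times, -- it was!")
```

A token made up only of punctuation, such as `--`, is dropped by the second step, which is the "may never yield" behavior described above.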
128
+ ### Adding Configurable Options
129
+
130
+ Processors can have options that can be set in Ruby code, from the
131
+ command-line, a configuration file, or a variety of other places
132
+ thanks to [Configliere](http://github.com/infochimps-labs/configliere).
133
+
134
+ This processor calculates percentiles from observations assuming a
135
+ normal distribution given a particular mean and standard deviation.
136
+ It uses two *fields*, the mean or average of a distribution (`mean`)
137
+ and its standard deviation (`std_dev`). From this information, it
138
+ will measure the percentile of all input values.
139
+
140
+ ```ruby
141
+ # in percentile.rb
142
+ Wukong.processor(:percentile) do
143
+
144
+ SQRT_1_HALF = Math.sqrt(0.5)
145
+
146
+ field :mean, Float, :default => 0.0
147
+ field :std_dev, Float, :default => 1.0
148
+
149
+ def process value
150
+ observation = value.to_f
151
+ z_score = (mean - observation) / std_dev
152
+ percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
153
+ yield [observation, percentile].join("\t")
154
+ end
155
+ end
156
+ ```
157
+
158
+ These fields have default values but you can override them on the
159
+ command line. If you scored a 95 on an exam where the mean score was
160
+ 80 points and the standard deviation of the scores was 10 points, for
161
+ example, then you'd be in the 93rd percentile:
162
+
163
+ ```
164
+ $ echo 95 | wu-local /tmp/percentile.rb --mean=80 --std_dev=10
165
+ 95.0 93.3192798731142
166
+ ```
167
+
168
+ If the exam were more difficult, with a mean of 75 points and a
169
+ standard deviation of 8 points, you'd be in the 99th percentile!
170
+
171
+ ```
172
+ $ echo 95 | wu-local /tmp/percentile.rb --mean=75 --std_dev=8
173
+ 95.0 99.37903346742239
174
+ ```
175
+
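To check the arithmetic behind these figures independently of `wu-local`, the same formula can be run as a plain Ruby method. This is a sketch only: `normal_percentile` is a hypothetical helper name, and the keyword arguments simply mirror the `mean` and `std_dev` fields.

```ruby
# Same constant as in the processor: 1 / sqrt(2).
SQRT_1_HALF = Math.sqrt(0.5)

# Percentile of `observation` under a normal distribution with the given
# mean and standard deviation, using the processor's formula:
#   50 * erfc((mean - observation) / std_dev * sqrt(1/2))
def normal_percentile(observation, mean: 0.0, std_dev: 1.0)
  z_score = (mean - observation) / std_dev
  50 * Math.erfc(z_score * SQRT_1_HALF)
end

# The two exam scenarios from the text.
puts normal_percentile(95.0, mean: 80.0, std_dev: 10.0)
puts normal_percentile(95.0, mean: 75.0, std_dev: 8.0)
```

An observation equal to the mean gives a z-score of zero and therefore the 50th percentile, which is a quick sanity check on the sign convention.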
176
+ ### The Lifecycle of a Processor
177
+
178
+ Processors have a lifecycle that they execute when they are run within
179
+ the context of a Wukong runner like `wu-local` or `wu-hadoop`. Each
180
+ lifecycle phase corresponds to a method of the processor that is
181
+ called:
182
+
183
+ * `setup` called *after* the Processor is initialized but *before* the first record is processed. You cannot yield from this method.
184
+ * `process` called once for each input record, may yield once, many, or no times.
185
+ * `finalize` called after the *last* record has been processed but while the processor still has an opportunity to yield records.
186
+ * `stop` called to signal to the processor that all work should stop, open connections should be closed, &c. You cannot yield from this method.
187
+
188
+ The above examples have already focused on the `process` method.
189
+
190
+ The `setup` and `stop` methods are often used together to handle
191
+ external connections
192
+
193
+ ```ruby
194
+ # in geolocator.rb
195
+ Wukong.processor(:geolocator) do
196
+ field :host, String, :default => 'localhost'
197
+ attr_accessor :connection
198
+
199
+ def setup
200
+ self.connection = Database::Connection.new(host)
201
+ end
202
+ def process record
203
+ record.added_value = connection.find("...some query...")
204
+ end
205
+ def stop
206
+ self.connection.close
207
+ end
208
+ end
209
+ ```
210
+
211
+ The `finalize` method is most useful when writing a "reduce"-type
212
+ operation that involves storing or aggregating information till some
213
+ criterion is met. It will always be called after the last record has
214
+ been given (to `process`) but you can call it whenever you want to
215
+ within your own code.
216
+
217
+ Here's an example of using the `finalize` method to implement a simple
218
+ counter that counts all the input records:
219
+
220
+ ```ruby
221
+ # in counter.rb
222
+ Wukong.processor(:counter) do
223
+ attr_accessor :count
224
+ def setup
225
+ self.count = 0
226
+ end
227
+ def process thing
228
+ self.count += 1
229
+ end
230
+ def finalize
231
+ yield count
232
+ end
233
+ end
234
+ ```
235
+
236
+ It hinges on the fact that the last input record will be passed to
237
+ `process` *first* and only then will `finalize` be called. This
238
+ allows the last input record to be counted/processed/aggregated and
239
+ then the entire aggregate to be dealt with in finalize.
240
+
241
+ Because of this emphasis on building and processing aggregates, the
242
+ `finalize` method is often useful within processors meant to run as
243
+ reducers in a Hadoop environment.
244
+
245
+ Note: `finalize` is not guaranteed to be called in every possible
246
+ environment as it depends on the chosen runner. In a local or Hadoop
247
+ environment, the notion of "last record" makes sense and so the
248
+ corresponding runners will call `finalize`. In an environment like
249
+ Storm, where the concept of last record is not (supposed to be)
250
+ meaningful, the corresponding runner doesn't ever call it.
251
+
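The order of the lifecycle calls can be illustrated with a small plain-Ruby simulation. It does not use Wukong: `Counter` is an ordinary class standing in for the `counter` processor, and `run_lifecycle` is a hypothetical helper that mimics what a runner with a notion of "last record" might do.

```ruby
# Ordinary class standing in for the :counter processor (not a Wukong::Processor).
class Counter
  attr_accessor :count

  def setup
    self.count = 0
  end

  # Counts the record; never yields.
  def process(_record)
    self.count += 1
  end

  # Yields the aggregate once all records have been seen.
  def finalize
    yield count
  end

  def stop
    # Nothing to clean up in this sketch.
  end
end

# Hypothetical helper: calls the lifecycle methods in the documented order
# and collects everything the processor yields.
def run_lifecycle(processor, records)
  output  = []
  collect = ->(record) { output << record }
  processor.setup
  records.each { |record| processor.process(record, &collect) }
  processor.finalize(&collect)
  processor.stop
  output
end

p run_lifecycle(Counter.new, %w[a b c])
```

A runner without a meaningful "last record" would simply skip the `finalize` step in a loop like this one, so the counter would never emit anything.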
252
+ ### Serialization
253
+
254
+ `wu-local` (and many similar tools) deal with inputs and outputs as
255
+ strings.
256
+
257
+ Processors want to process objects as close to their domain as is
258
+ possible. A processor which decorates address book entries with
259
+ Twitter handles doesn't want to think of its inputs as Strings but
260
+ Hashes or, better yet, Persons.
261
+
262
+ Wukong makes it easy to wrap a processor with other processors
263
+ dedicated to handling the common tasks of parsing records into or out
264
+ of formats like JSON and turning them into Ruby model instances.
265
+
266
+ #### De-serializing data formats like JSON or TSV
267
+
268
+ Wukong can parse and emit common data formats like JSON and delimited
269
+ formats like TSV or CSV so that you don't pollute or tie down your own
270
+ processors with protocol logic.
271
+
272
+ Here's an example of a processor that wants to deal with Hashes as
273
+ input.
274
+
275
+ ```ruby
276
+ # in extractor.rb
277
+ Wukong.processor(:extractor) do
278
+ def process hsh
279
+ yield hsh["first_name"]
280
+ end
281
+ end
282
+ ```
283
+
284
+ Given JSON data,
285
+
286
+ ```
287
+ $ cat input.json
288
+ {"first_name": "John", "last_name": "Smith"}
289
+ {"first_name": "Sally", "last_name": "Johnson"}
290
+ ...
291
+ ```
292
+
293
+ you can feed it directly to a processor
294
+
295
+ ```
296
+ $ cat input.json | wu-local --from=json extractor
297
+ John
298
+ Sally
299
+ ...
300
+ ```
301
+
302
+ Other processors really like Arrays:
303
+
304
+ ```ruby
305
+ Wukong.processor(:summer) do
306
+ def process values
307
+ yield values.map(&:to_f).inject(0.0) { |sum, summand| sum += summand }
308
+ end
309
+ end
310
+ ```
311
+
312
+ so you can feed them TSV data
313
+ ```
314
+ $ cat data.tsv
315
+ 1 2 3
316
+ 4 5 6
317
+ 7 8 9
318
+ ...
319
+ $ cat data.tsv | wu-local --from=tsv summer
320
+ 6
321
+ 15
322
+ 24
323
+ ...
324
+ ```
325
+
326
+ but you can just as easily use the same code with CSV data
327
+
328
+ ```
329
+ $ cat data.tsv | wu-local --from=csv summer
330
+ ```
331
+
332
+ or a more general delimited format.
333
+
334
+ ```
335
+ $ cat data.tsv | wu-local --from=delimited --delimiter='--' summer
336
+ ```
337
+
338
+ #### Recordizing data structures into domain models
339
+
340
+ Here's a contact validator that relies on a Person model to decide
341
+ whether a contact entry should be yielded:
342
+
343
+ ```ruby
344
+ # in contact_validator.rb
345
+ require 'person'
346
+
347
+ Wukong.processor(:contact_validator) do
348
+ def process person
349
+ yield person if person.valid?
350
+ end
351
+ end
352
+ ```
353
+
354
+ Relying on the (elsewhere-defined) Person model to define `valid?`
355
+ means the processor can stay skinny and readable. Wukong can, in
356
+ combination with the deserializing features above, turn input text
357
+ into instances of Person:
358
+
359
+ ```
360
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator
361
+ #<Person:0x000000020e6120>
362
+ #<Person:0x000000020e6120>
363
+ #<Person:0x000000020e6120>
364
+ ```
365
+
366
+ `wu-local` can also serialize records from the `contact_validator`
367
+ processor:
368
+
369
+ ```
370
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator --to=json
371
+ {"first_name": "John", "last_name": "Smith", "valid": "true"}
372
+ {"first_name": "Sally", "last_name": "Johnson", "valid": "true"}
373
+ ...
374
+ ```
375
+
376
+ Serialization formats work just like deserialization formats, with
377
+ JSON as well as delimited formats available.
378
+
379
+ Parsing records into model instances and serializing them out again
380
+ puts constraints on the model class providing these instances. Here's
381
+ what the `Person` class needs to look like:
382
+
383
+
384
+ ```ruby
385
+ # in person.rb
386
+ class Person
387
+
388
+ # Create a new Person from the given attributes. Supports usage of
389
+ # the `--consumes` flag on the command-line
390
+ #
391
+ # @param [Hash] attrs
392
+ # @return [Person]
393
+ def self.receive attrs
394
+ new(attrs)
395
+ end
396
+
397
+ # Turn this Person into a basic data structure. Supports the usage
398
+ # of the `--to` flag on the command-line.
399
+ #
400
+ # @return [Hash]
401
+ def to_wire
402
+ to_hash
403
+ end
404
+ end
405
+ ```
406
+
407
+ To support the `--consumes=Person` syntax, the `receive` class method
408
+ must take a Hash produced from the operation of the `--from` argument
409
+ and return a `Person` instance.
410
+
411
+ To support the `--to=json` syntax, the `Person` class must implement
412
+ the `to_wire` instance method.
413
+
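The `Person` snippet above leaves out the constructor, `to_hash`, and `valid?`. A fuller sketch of a class meeting both requirements might look like the following; the attribute names and the validity rule are assumptions made for illustration, and only `receive` and `to_wire` come from the text.

```ruby
require 'json'

class Person
  attr_reader :first_name, :last_name

  # Assumption: attributes arrive as a Hash with String keys, as produced
  # by parsing a JSON object.
  def initialize(attrs = {})
    @first_name = attrs['first_name']
    @last_name  = attrs['last_name']
  end

  # Entry point for building an instance from parsed input.
  def self.receive(attrs)
    new(attrs)
  end

  # Hypothetical rule: both names must be present and non-blank.
  def valid?
    [first_name, last_name].none? { |name| name.to_s.strip.empty? }
  end

  def to_hash
    { 'first_name' => first_name, 'last_name' => last_name }
  end

  # Basic data structure suitable for serializing back out.
  def to_wire
    to_hash
  end
end

line   = '{"first_name": "John", "last_name": "Smith"}'
person = Person.receive(JSON.parse(line))
puts JSON.generate(person.to_wire) if person.valid?
```

The round trip here (parse, `receive`, `to_wire`, generate) is done by hand with the `json` standard library; it only imitates what the `--from`, `--consumes`, and `--to` options are described as doing.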
414
+ ### Logging and Notifications
415
+
416
+ Wukong comes with a logger that all processors have access to via
417
+ their `log` attribute. This logger has the following priorities:
418
+
419
+ * debug (can be set as a log level)
420
+ * info (can be set as a log level)
421
+ * warn (can be set as a log level)
422
+ * error
423
+ * fatal
424
+
425
+ and here's a processor which uses them all
426
+
427
+ ```ruby
428
+ # in logs.rb
429
+ Wukong.processor(:logs) do
430
+ def process line
431
+ log.debug line
432
+ log.info line
433
+ log.warn line
434
+ log.error line
435
+ log.fatal line
436
+ end
437
+ end
438
+ ```
439
+
440
+ The default log level is DEBUG.
441
+
442
+ ```
443
+ $ echo something | wu-local logs.rb
444
+ DEBUG 2013-01-11 23:40:56 [Logs ] -- event
445
+ INFO 2013-01-11 23:40:56 [Logs ] -- event
446
+ WARN 2013-01-11 23:40:56 [Logs ] -- event
447
+ ERROR 2013-01-11 23:40:56 [Logs ] -- event
448
+ FATAL 2013-01-11 23:40:56 [Logs ] -- event
449
+ ```
450
+
451
+ though you can set it to something else globally
452
+
453
+ ```
454
+ $ echo something | wu-local logs.rb --log.level=warn
455
+ WARN 2013-01-11 23:40:56 [Logs ] -- event
456
+ ERROR 2013-01-11 23:40:56 [Logs ] -- event
457
+ FATAL 2013-01-11 23:40:56 [Logs ] -- event
458
+ ```
459
+
460
+ or on a per-class basis.
461
+
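The effect of a log level on which messages appear can be reproduced with Ruby's standard `Logger`. This is only an analogy: Wukong's own logger and its output format are not used here, and `log_at` is a hypothetical helper.

```ruby
require 'logger'
require 'stringio'

# Logs `message` at every priority to an in-memory sink and returns the
# lines that survive the given level threshold.
def log_at(level, message)
  sink = StringIO.new
  log  = Logger.new(sink)
  log.level = level
  # Simplified format for the sketch: "SEVERITY -- message".
  log.formatter = ->(severity, _time, _progname, msg) { "#{severity} -- #{msg}\n" }

  log.debug(message)
  log.info(message)
  log.warn(message)
  log.error(message)
  log.fatal(message)

  sink.string.lines.map(&:chomp)
end

puts log_at(Logger::WARN, 'something')
```

With the threshold at `Logger::WARN`, the debug and info messages are dropped and only the three higher priorities remain, mirroring the `--log.level=warn` example.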
462
+ ### Creating Documentation
463
+
464
+ `wu-local` includes a help message:
465
+
466
+ ```
467
+ $ wu-local --help
468
+ usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
469
+
470
+ wu-local is a tool for running Wukong processors and flows locally on
471
+ the command-line. Use wu-local by passing it a processor and feeding
472
+ ...
473
+
474
+
475
+ Params:
476
+ -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
477
+ -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
478
+ ```
479
+
480
+ You can generate custom help messages for your own processors. Here's
481
+ the percentile processor from before but made more usable with good
482
+ documentation:
116
483
 
484
+ ```ruby
485
+ # in percentile.rb
486
+ Wukong.processor(:percentile) do
487
+
488
+ description <<-EOF.gsub(/^ {2}/,'')
489
+ This processor calculates percentiles from input scores based on a
490
+ given mean score and a given standard deviation for the scores.
491
+
492
+ The mean and standard deviation are given at run time and processed
493
+ scores will be compared against the given mean and standard
494
+ deviation.
495
+
496
+ The input is expected to consist of float values, one per line.
497
+
498
+ Example:
499
+
500
+ $ cat input.dat
501
+ 88
502
+ 89
503
+ 77
504
+ ...
505
+
506
+ $ cat input.dat | wu-local percentile.rb --mean=85 --std_dev=7
507
+ 88.0 66.58824291023753
508
+ 89.0 71.61454169013237
509
+ 77.0 12.654895447355777
510
+ EOF
511
+
512
+ SQRT_1_HALF = Math.sqrt(0.5)
513
+
514
+ field :mean, Float, :default => 0.0, :doc => "The mean of the assumed distribution"
515
+ field :std_dev, Float, :default => 1.0, :doc => "The standard deviation of the assumed distribution"
516
+
517
+ def process value
518
+ observation = value.to_f
519
+ z_score = (mean - observation) / std_dev
520
+ percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
521
+ yield [observation, percentile].join("\t")
522
+ end
523
+ end
117
524
  ```
118
- $ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local regexp --match='^t'
525
+
526
+ If you call `wu-local` with the file to this processor as an argument
527
+ in addition to the original `--help` argument, you'll get custom
528
+ documentation.
529
+
119
530
  ```
531
+ $ wu-local percentile.rb --help
532
+ usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
533
+
534
+ This processor calculates percentiles from input scores based on a
535
+ given mean score and a given standard deviation for the scores.
536
+ ...
120
537
 
121
- There are many more simple <a href="#widgets">widgets</a> like these.
538
+
539
+ Params:
540
+ --mean=Float The mean of the assumed distribution [Default: 0.0]
541
+ -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
542
+ --std_dev=Float The standard deviation of the assumed distribution [Default: 1.0]
543
+ -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
544
+
545
+ ```
122
546
 
123
547
  <a name="flows"></a>
124
548
  ## Combining Processors into Dataflows
125
549
 
126
550
  Combining processors which each do one thing well together in a chain
127
551
  is mimicking the tried and true UNIX pipeline. Wukong lets you define
128
- these pipelines more formally as a dataflow. Here's the dataflow for
552
+ these pipelines more formally as a dataflow.
553
+
554
+ Having written the `tokenizer` processor, we can use it in a dataflow
555
+ along with the built-in `regexp` processor to replicate what we did in
129
556
  the last example:
130
557
 
131
558
  ```
132
559
  # in find_t_words.rb
560
+ require_relative('processors')
133
561
  Wukong.dataflow(:find_t_words) do
134
562
  tokenizer | regexp(match: /^t/)
135
563
  end
@@ -148,7 +576,8 @@ times
148
576
  ...
149
577
  ```
150
578
 
151
- and it works exactly like before.
579
+ and it works exactly like manually chaining the two processors
580
+ together.
152
581
 
153
582
  <a name="serialization"></a>
154
583
  ## Serialization
@@ -163,7 +592,14 @@ yield a String argument (or something that will `to_s` appropriately).
163
592
  ## Widgets
164
593
 
165
594
  Wukong has a number of built-in widgets that are useful for
166
- scaffolding your dataflows.
595
+ scaffolding your dataflows or using as starting off points for your
596
+ own processors.
597
+
598
+ For any of these widgets you can get customized help, say
599
+
600
+ ```
601
+ $ wu-local group --help
602
+ ```
167
603
 
168
604
  ### Serializers
169
605
 
@@ -350,10 +786,10 @@ describe :tokenizer do
350
786
  processor.given("Hi there.\nMy name is Wukong!").should emit(6).records
351
787
  end
352
788
  it "eliminates all punctuation" do
353
- processor.given("Never!").output.first.should_not include(',')
789
+ processor(:tokenizer).given("Never!").should emit('Never')
354
790
  end
355
- it "downcases all input text" do
356
- processor.given("Whatever").output.first.should match(/^w/)
791
+ it "will not emit tokens in a stop list" do
792
+ processor(:tokenizer, :stop_list => ['apples', 'bananas']).given("I like apples and bananas").should emit('I', 'like', 'and')
357
793
  end
358
794
  end
359
795
  ```
@@ -364,8 +800,13 @@ Let's look at each kind of helper:
364
800
  `it_behaves_like` helper) adds some tests that ensure that the
365
801
  processor conforms to the API of a Wukong::Processor.
366
802
 
367
- * The `processor` method instantiates a processor very similarly to
368
- the way `wu-local` instantiates one on the command-line. It accepts
803
+ * The `processor` method is actually an alias for the more aptly named
804
+ (but less convenient) `unit_test_runner`. This method accepts a
805
+ processor name and options (just like `wu-local` and other
806
+ command-line tools) and returns a Wukong::UnitTestRunner instance.
807
+ This runner handles the
808
+
809
+
369
810
  a (registered) processor name and options and creates a new
370
811
  processor. If no name is given, the argument of the enclosing
371
812
  `describe` or `context` block is used. The object returned by
@@ -374,29 +815,38 @@ Let's look at each kind of helper:
374
815
  behavior.
375
816
 
376
817
  * The `given` method (and other helpers like `given_json`,
377
- `given_tsv`, &c.) is added to the Processor class when
378
- Wukong::SpecHelpers is required. It's a way of lazily feeding
379
- records to a processor, without having to go through the `process`
380
- method directly and having to handle the block or the processor's
381
- lifecycle as in the prior example.
818
+ `given_tsv`, &c.) is a method on the runner. It's a way of lazily
819
+ feeding records to a processor, without having to go through the
820
+ `process` method directly and having to handle the block or the
821
+ processor's lifecycle as in the prior example.
382
822
 
383
823
  * The `output` and `emit` matchers will `process` all previously
384
824
  `given` records when they are called. This lets you separate
385
825
  instantiation, input, expectations, and output. Here's a more
386
- complicated example:
826
+ complicated example.
387
827
 
388
828
  The same helpers can be used to test dataflows as well as
389
- processors. For complete details, see documentation for the
390
- Wukong::SpecHelpers module.
829
+ processors.
830
+
831
+ ####
832
+
833
+ #### Functions vs. Objects
834
+
835
+ The above test helpers are designed to aid in testing processors
836
+ functionally because:
837
+
838
+ * they accept the
391
839
 
392
840
  ### Integration Tests
393
841
 
394
- Sometimes unit tests aren't enough and you need to test your
395
- processors or flows as they will be run in production using
396
- `wu-local`.
842
+ If you are implementing a new Wukong command (akin to `wu-local`) then
843
+ you may also want to run integration tests. Wukong comes with helpers
844
+ for these, too.
397
845
 
398
- For these use cases, Wukong provides some integration helpers that
399
- make testing command line processes easier.
846
+ You should almost always be able to test your processors without
847
+ integration tests. Your unit tests and the Wukong framework itself
848
+ should ensure that your processors work correctly no matter what
849
+ environment they are deployed in.
400
850
 
401
851
  ```ruby
402
852
  # spec/integration/tokenizer_spec.rb
@@ -415,7 +865,7 @@ context "interpreting its arguments" do
415
865
  end
416
866
  context "with a malformed --match argument" do
417
867
  # invalid b/c the regexp is broken...
418
- subject { command("wu-local tokenizer --match='^[h'") < "hi there" }
868
+ subject { command("wu-local tokenizer --match='^(h'") < "hi there" }
419
869
  it { should exit_with(:non_zero) }
420
870
  it { should have_stderr(/invalid/) }
421
871
  end
@@ -457,3 +907,192 @@ Let's go through the helpers:
457
907
  * The `have_stdout` and `have_stderr` matchers let you test the STDOUT or STDERR of the command for particular strings or regular expressions.
458
908
 
459
909
  * The `exit_with` matcher lets you test the exit code of the command. You can pass the symbol `:non_zero` to set the expectation of _any_ non-zero exit code.
910
+
911
+ ## Plugins
912
+
913
+ Wukong has a built-in plugin framework to make it easy to adapt Wukong
914
+ processors to new backends or add other functionality. The
915
+ `Wukong::Local` module and the `wu-local` program it supports is
916
+ itself a Wukong plugin.
917
+
918
+ The following shows how you might build a simplified version of
919
+ `Wukong::Local` as a new plugin. We'll call this plugin `Cat` as it
920
+ will implement a program `wu-cat` that is similar in function to
921
+ `wu-local` (just simplified).
922
+
923
+ The first thing to do is include the `Wukong::Plugin` module in your
924
+ code:
925
+
926
+
927
+ ```ruby
928
+ # in lib/cat.rb
929
+ #
930
+ # This Wukong plugin works like wu-local but replicates some silly
931
+ # features of cat like numbered lines.
932
+ module Cat
933
+
934
+ # This registers Cat as a Wukong plugin.
935
+ include Wukong::Plugin
936
+
937
+ # Defines any settings specific to Cat. Cat doesn't need to, but
938
+ # you can define global settings here if you want. You can also
939
+ # check the `program` name to decide whether to apply your settings.
940
+ # This helps you not pollute other commands with your stuff.
941
+ def self.configure settings, program
942
+ case program
943
+ when 'wu-cat'
944
+ settings.define(:input, :description => "The input file to use")
945
+ settings.define(:number, :description => "Prepend each input record with a consecutive number", :type => :boolean)
946
+ else
947
+ # configure other programs if you need to
948
+ end
949
+ end
950
+
951
+ # Lets Cat boot up with settings that have already been resolved
952
+ # from the command-line or other sources like config files or remote
953
+ # servers added by other plugins.
954
+ #
955
+ # The `root` directory in which the program is executing is also
956
+ # provided.
957
+ def self.boot settings, root
958
+ puts "Cat booting up using resolved settings within directory #{root}"
959
+ end
960
+ end
961
+ ```
962
+
963
+ If your plugin doesn't interact directly with the command-line
964
+ (through a wu-tool like `wu-local` or `wu-hadoop`) and doesn't
965
+ directly interface with passing records to processors then you can
966
+ just require the rest of your plugin's code at this point and be done.
967
+
968
+ ### Write a Runner to interact with the command-line
969
+
970
+ If you need to implement a new command line tool then you should write
971
+ a Runner. A Runner is used to implement Wukong programs like
972
+ `wu-local` or `wu-hadoop`. Here's what the actual program file would
973
+ look like for our example plugin's `wu-cat` program.
974
+
975
+ ```ruby
976
+ #!/usr/bin/env ruby
977
+ # in bin/wu-cat
978
+ require 'cat'
979
+ Cat::Runner.run
980
+ ```
981
+
982
+ The Cat::Runner class is implemented separately.
983
+
984
+ ```ruby
985
+ # in lib/cat/runner.rb
986
+ require_relative('driver')
987
+ module Cat
988
+
989
+ # Implements the `wu-cat` command.
990
+ class Runner < Wukong::Runner
991
+
992
+ usage "PROCESSOR|FLOW"
993
+
994
+ description <<-EOF
995
+
996
+ wu-cat lets you run a Wukong processor or dataflow on the
997
+ command-line. Try it like this.
998
+
999
+ $ wu-cat --input=data.txt
1000
+ hello
1001
+ my
1002
+ friend
1003
+
1004
+ Connect the output to a processor in upcaser.rb
1005
+
1006
+ $ wu-cat --input=data.txt upcaser.rb
1007
+ HELLO
1008
+ MY
1009
+ FRIEND
1010
+
1011
+ You can also add line numbers to the output.
1012
+
1013
+ $ wu-cat --number --input=data.txt upcaser.rb
1014
+ 1 HELLO
1015
+ 2 MY
1016
+ 3 FRIEND
1017
+ EOF
1018
+
1019
+ # The name of the processor we're going to run. The #args method
1020
+ # is provided by the Runner class.
1021
+ def processor_name
1022
+ args.first
1023
+ end
1024
+
1025
+ # Validate that we were given the name of a registered processor
1026
+ # to run. Be careful to return true here or validation will fail.
1027
+ def validate
1028
+ raise Wukong::Error.new("Must provide a processor as the first argument") unless processor_name
1029
+ true
1030
+ end
1031
+
1032
+ # Delegates to a driver class to run the processor.
1033
+ def run
1034
+ Driver.new(processor_name, settings).start
1035
+ end
1036
+
1037
+ end
1038
+ end
1039
+ ```
1040
+
1041
+ ### Write a Driver to interact with processors
1042
+
1043
+ The `Cat::Runner#run` method delegates to the `Cat::Driver` class to
1044
+ handle instantiating and interacting with processors.
1045
+
1046
+ ```ruby
1047
+ # in lib/cat/driver.rb
1048
+ module Cat
1049
+
1050
+ # A class for driving a processor from `wu-cat`.
1051
+ class Driver
1052
+
1053
+ # Lets us count the records.
1054
+ attr_accessor :number
1055
+
1056
+ # Gives methods to construct and interact with dataflows.
1057
+ include Wukong::DriverMethods
1058
+
1059
+ # Create a new Driver for a dataflow with the given `label` using
1060
+ # the given `settings`.
1061
+ #
1062
+ # @param [String] label the name of the dataflow
1063
+ # @param [Configliere::Param] settings the settings to use when creating the dataflow
1064
+ def initialize label, settings
1065
+ self.settings = settings
1066
+ self.dataflow = construct_dataflow(label, settings)
1067
+ self.number = 1
1068
+ end
1069
+
1070
+ # The file handle of the input file.
1071
+ #
1072
+ # @return [File]
1073
+ def input_file
1074
+ @input_file ||= File.new(settings[:input])
1075
+ end
1076
+
1077
+ # Starts feeding records to the processor
1078
+ def start
1079
+ while line = input_file.readline rescue nil
1080
+ driver.send_through_dataflow(line)
1081
+ end
1082
+ end
1083
+
1084
+ # Process each record that comes back from the dataflow.
1085
+ #
1086
+ # @param [Object] record the yielded record
1087
+ def process record
1088
+ if settings[:number]
1089
+ puts [number, record].map(&:to_s).join("\t")
1090
+ else
1091
+ puts record.to_s
1092
+ end
1093
+ self.number += 1
1094
+ end
1095
+
1096
+ end
1097
+ end
1098
+ ```
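The read-and-number loop of the driver can be tried out without Wukong. The sketch below is a stand-in, not the plugin API: `LineNumberingDriver` is a hypothetical class that reads from any IO with `each_line` (instead of `readline rescue nil`) and applies a block in place of a real dataflow.

```ruby
require 'stringio'

# Hypothetical stand-in for Cat::Driver that works on any IO-like objects.
class LineNumberingDriver
  # `transform` plays the role of the dataflow; it defaults to passing
  # each line through unchanged.
  def initialize(input, output, number: false, &transform)
    @input     = input
    @output    = output
    @number    = number
    @transform = transform || ->(line) { line }
    @count     = 1
  end

  # Feeds each input line through the transform and emits the result.
  def start
    @input.each_line do |line|
      emit(@transform.call(line.chomp))
    end
  end

  private

  # Writes one record, prefixed with its number when numbering is on.
  def emit(record)
    @output.puts(@number ? [@count, record].join("\t") : record.to_s)
    @count += 1
  end
end

out = StringIO.new
LineNumberingDriver.new(StringIO.new("hello\nmy\nfriend\n"), out, number: true) { |line| line.upcase }.start
puts out.string
```

Using `each_line` ends the loop cleanly at end of file, so no exception has to be rescued to detect the last record.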