wukong 3.0.0.pre3 → 3.0.0

Files changed (76)
  1. data/Gemfile +1 -0
  2. data/README.md +689 -50
  3. data/bin/wu-local +1 -74
  4. data/diagrams/wu_local.dot +39 -0
  5. data/diagrams/wu_local.dot.png +0 -0
  6. data/examples/loadable.rb +2 -0
  7. data/examples/string_reverser.rb +7 -0
  8. data/lib/hanuman/stage.rb +2 -2
  9. data/lib/wukong.rb +21 -10
  10. data/lib/wukong/dataflow.rb +2 -5
  11. data/lib/wukong/doc_helpers.rb +14 -0
  12. data/lib/wukong/doc_helpers/dataflow_handler.rb +29 -0
  13. data/lib/wukong/doc_helpers/field_handler.rb +91 -0
  14. data/lib/wukong/doc_helpers/processor_handler.rb +29 -0
  15. data/lib/wukong/driver.rb +11 -1
  16. data/lib/wukong/local.rb +40 -0
  17. data/lib/wukong/local/event_machine_driver.rb +27 -0
  18. data/lib/wukong/local/runner.rb +98 -0
  19. data/lib/wukong/local/stdio_driver.rb +44 -0
  20. data/lib/wukong/local/tcp_driver.rb +47 -0
  21. data/lib/wukong/logger.rb +16 -7
  22. data/lib/wukong/plugin.rb +48 -0
  23. data/lib/wukong/processor.rb +57 -15
  24. data/lib/wukong/rake_helper.rb +6 -0
  25. data/lib/wukong/runner.rb +151 -128
  26. data/lib/wukong/runner/boot_sequence.rb +123 -0
  27. data/lib/wukong/runner/code_loader.rb +52 -0
  28. data/lib/wukong/runner/deploy_pack_loader.rb +75 -0
  29. data/lib/wukong/runner/help_message.rb +42 -0
  30. data/lib/wukong/spec_helpers.rb +4 -12
  31. data/lib/wukong/spec_helpers/integration_tests.rb +150 -0
  32. data/lib/wukong/spec_helpers/{integration_driver_matchers.rb → integration_tests/integration_test_matchers.rb} +28 -62
  33. data/lib/wukong/spec_helpers/integration_tests/integration_test_runner.rb +97 -0
  34. data/lib/wukong/spec_helpers/shared_examples.rb +19 -10
  35. data/lib/wukong/spec_helpers/unit_tests.rb +134 -0
  36. data/lib/wukong/spec_helpers/{processor_methods.rb → unit_tests/unit_test_driver.rb} +42 -8
  37. data/lib/wukong/spec_helpers/{spec_driver_matchers.rb → unit_tests/unit_test_matchers.rb} +6 -32
  38. data/lib/wukong/spec_helpers/unit_tests/unit_test_runner.rb +54 -0
  39. data/lib/wukong/version.rb +1 -1
  40. data/lib/wukong/widget/filters.rb +134 -8
  41. data/lib/wukong/widget/processors.rb +64 -5
  42. data/lib/wukong/widget/reducers/bin.rb +68 -18
  43. data/lib/wukong/widget/reducers/count.rb +12 -0
  44. data/lib/wukong/widget/reducers/group.rb +48 -5
  45. data/lib/wukong/widget/reducers/group_concat.rb +30 -2
  46. data/lib/wukong/widget/reducers/moments.rb +4 -4
  47. data/lib/wukong/widget/reducers/sort.rb +53 -3
  48. data/lib/wukong/widget/serializers.rb +37 -12
  49. data/lib/wukong/widget/utils.rb +1 -1
  50. data/spec/spec_helper.rb +20 -2
  51. data/spec/wukong/driver_spec.rb +2 -0
  52. data/spec/wukong/local/runner_spec.rb +40 -0
  53. data/spec/wukong/local_spec.rb +6 -0
  54. data/spec/wukong/logger_spec.rb +49 -0
  55. data/spec/wukong/processor_spec.rb +22 -0
  56. data/spec/wukong/runner_spec.rb +128 -8
  57. data/spec/wukong/widget/filters_spec.rb +28 -10
  58. data/spec/wukong/widget/processors_spec.rb +5 -5
  59. data/spec/wukong/widget/reducers/bin_spec.rb +14 -14
  60. data/spec/wukong/widget/reducers/count_spec.rb +1 -1
  61. data/spec/wukong/widget/reducers/group_spec.rb +7 -6
  62. data/spec/wukong/widget/reducers/moments_spec.rb +2 -2
  63. data/spec/wukong/widget/reducers/sort_spec.rb +1 -1
  64. data/spec/wukong/widget/serializers_spec.rb +84 -88
  65. data/spec/wukong/wu-local_spec.rb +109 -0
  66. metadata +43 -20
  67. data/bin/wu-server +0 -70
  68. data/lib/wukong/boot.rb +0 -96
  69. data/lib/wukong/configuration.rb +0 -8
  70. data/lib/wukong/emitter.rb +0 -22
  71. data/lib/wukong/server.rb +0 -119
  72. data/lib/wukong/spec_helpers/integration_driver.rb +0 -157
  73. data/lib/wukong/spec_helpers/processor_helpers.rb +0 -89
  74. data/lib/wukong/spec_helpers/spec_driver.rb +0 -28
  75. data/spec/wukong/local_runner_spec.rb +0 -31
  76. data/spec/wukong/wu_local_spec.rb +0 -125
data/Gemfile CHANGED
@@ -5,6 +5,7 @@ gemspec
5
5
  group :development do
6
6
  gem 'rake', '>= 0.9'
7
7
  gem 'rspec', '>= 2.8'
8
+ gem 'spork', '0.9.2'
8
9
  gem 'guard', '>= 1.0'
9
10
  gem 'guard-rspec', '>= 0.6'
10
11
  gem 'simplecov', '>= 0.5'
data/README.md CHANGED
@@ -19,6 +19,8 @@ Here is a list of various other projects which you may also want to
19
19
  peruse when trying to understand the full Wukong experience:
20
20
 
21
21
  * <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
22
+ * <a href="http://github.com/infochimps-labs/wukong-storm">wukong-storm</a>: Run Wukong processors within the Storm framework. Model flows locally before you run them.
23
+ * <a href="http://github.com/infochimps-labs/wukong-load">wukong-load</a>: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
22
24
  * <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
23
25
  * <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
24
26
 
@@ -36,7 +38,7 @@ processor is Ruby class which
36
38
  * subclasses `Wukong::Processor` (use the `Wukong.processor` method as sugar for this)
37
39
  * defines a `process` method which takes an input record, does something, and calls `yield` on the output
38
40
 
39
- Here's a processor that reverses all each input record:
41
+ Here's a processor that reverses each of its input records:
40
42
 
41
43
  ```ruby
42
44
  # in string_reverser.rb
@@ -47,8 +49,8 @@ Wukong.processor(:string_reverser) do
47
49
  end
48
50
  ```
49
51
 
50
- When you're developing your application, run your processors on the
51
- command line on flat input files using `wu-local`:
52
+ You can run this processor on the command line using text files as
53
+ input using the `wu-local` tool that comes with Wukong:
52
54
 
53
55
  ```
54
56
  $ cat novel.txt
@@ -59,35 +61,46 @@ $ cat novel.txt | wu-local string_reverser.rb
59
61
  .semit fo tsrow eht saw ti ,semit fo tseb eht saw tI
60
62
  ```
61
63
 
62
- You can use yield as often (or never) as you need. Here's a more
63
- complicated example to illustrate:
64
+ The `wu-local` program consumes one line at a time from STDIN and
65
+ calls your processor's `process` method with that line as a Ruby
66
+ String object. Each object you `yield` within your process method
67
+ will be printed back out on STDOUT.
68
+
69
+ ### Multiple Processors, Multiple (Or No) Yields
70
+
71
+ Processors are intended to be combined so they can be stored in the
72
+ same file like these two, related processors:
64
73
 
65
74
  ```ruby
66
75
  # in processors.rb
67
76
 
68
- Wukong.processor(:tokenizer) do
77
+ Wukong.processor(:splitter) do
69
78
  def process line
70
79
  line.split.each { |token| yield token }
71
80
  end
72
81
  end
73
82
 
74
- Wukong.processor(:starts_with) do
75
-
76
- field :letter, String, :default => 'a'
77
-
78
- def process word
79
- yield word if word =~ Regexp.new("^#{letter}", true)
83
+ Wukong.processor(:normalizer) do
84
+ def process token
85
+ stripped = token.downcase.gsub(/\W/,'')
86
+ yield stripped if stripped.size > 0
80
87
  end
81
88
  end
82
89
  ```
83
90
 
84
- Let's start by running the `tokenizer`. We've defined two processors
85
- in the file `processors.rb` and neither one is named `processors` so
86
- we have to tell `wu-local` the name of the processor we want to run
87
- explicitly.
91
+ Notice how the `splitter` yields multiple tokens for each of its input
92
+ tokens and that the `normalizer` may sometimes never yield at all,
93
+ depending on its input. Processors are under no obligation by the
94
+ framework to yield or return anything so they can easily act as
95
+ filters or even sinks in data flows.
96
+
97
+ There are two processors in this file and neither shares a name with
98
+ the basename of the file ("processors") so `wu-local` can't
99
+ automatically choose a processor to run. We can specify one
100
+ explicitly with the `--run` option:
88
101
 
89
102
  ```
90
- $ cat novel.txt | wu-local processors.rb --run=tokenizer
103
+ $ cat novel.txt | wu-local processors.rb --run=splitter
91
104
  It
92
105
  was
93
106
  the
@@ -97,39 +110,454 @@ times,
97
110
  ...
98
111
  ```
99
112
 
100
- You can combine the output of one processor with another right in the
101
- shell. Let's add the `starts_with` filter and also pass in the
102
- *field* `letter`, defined in that processor:
113
+ We can combine the two processors together
103
114
 
104
115
  ```
105
- $ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local processors.rb --run=starts_with --letter=t
106
- the
107
- times
116
+ $ cat novel.txt | wu-local processors.rb --run=splitter | wu-local processors.rb --run=normalizer
117
+ it
118
+ was
108
119
  the
120
+ best
121
+ of
109
122
  times
110
123
  ...
111
124
  ```
112
125
 
113
- Wanting to match on a regular expression is such a common task that
114
- Wukong has a built-in "widget" called `regexp` that you can use
115
- directly:
126
+ but there's an easier way of doing this with <a href="#flows">dataflows</a>.
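
As a rough illustration of what the shell pipeline above does, here is a plain-Ruby sketch that does not use Wukong at all. The `splitter` and `normalizer` methods are stand-ins that copy the bodies of the two `process` methods, and `split_and_normalize` is a hypothetical helper; chaining by nesting blocks is an assumption made for illustration, not a description of how `wu-local` is implemented.

```ruby
# Plain-Ruby stand-ins for the two processors (no Wukong required).
def splitter(line)
  line.split.each { |token| yield token }
end

def normalizer(token)
  stripped = token.downcase.gsub(/\W/, '')
  yield stripped if stripped.size > 0
end

# Hypothetical helper: feed every token yielded by the splitter into
# the normalizer and collect whatever the normalizer yields.
def split_and_normalize(line)
  tokens = []
  splitter(line) do |token|
    normalizer(token) { |normalized| tokens << normalized }
  end
  tokens
end

puts split_and_normalize("It was the best of times, -- it was the worst of times.")
```

The `--` token shows the "may never yield" case: it is reduced to an empty string by the normalizer and dropped.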
127
+
128
+ ### Adding Configurable Options
129
+
130
+ Processors can have options that can be set in Ruby code, from the
131
+ command-line, a configuration file, or a variety of other places
132
+ thanks to [Configliere](http://github.com/infochimps-labs/configliere).
133
+
134
+ This processor calculates percentiles from observations assuming a
135
+ normal distribution given a particular mean and standard deviation.
136
+ It uses two *fields*, the mean or average of a distribution (`mean`)
137
+ and its standard deviation (`std_dev`). From this information, it
138
+ will measure the percentile of all input values.
139
+
140
+ ```ruby
141
+ # in percentile.rb
142
+ Wukong.processor(:percentile) do
143
+
144
+ SQRT_1_HALF = Math.sqrt(0.5)
145
+
146
+ field :mean, Float, :default => 0.0
147
+ field :std_dev, Float, :default => 1.0
148
+
149
+ def process value
150
+ observation = value.to_f
151
+ z_score = (mean - observation) / std_dev
152
+ percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
153
+ yield [observation, percentile].join("\t")
154
+ end
155
+ end
156
+ ```
157
+
158
+ These fields have default values but you can override them on the
159
+ command line. If you scored a 95 on an exam where the mean score was
160
+ 80 points and the standard deviation of the scores was 10 points, for
161
+ example, then you'd be in the 93rd percentile:
162
+
163
+ ```
164
+ $ echo 95 | wu-local /tmp/percentile.rb --mean=80 --std_dev=10
165
+ 95.0 93.3192798731142
166
+ ```
167
+
168
+ If the exam were more difficult, with a mean of 75 points and a
169
+ standard deviation of 8 points, you'd be in the 99th percentile!
170
+
171
+ ```
172
+ $ echo 95 | wu-local /tmp/percentile.rb --mean=75 --std_dev=8
173
+ 95.0 99.37903346742239
174
+ ```
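
The two results above can be checked without Wukong. The following is a minimal sketch that repeats the arithmetic of the `process` method in plain Ruby; the standalone `percentile` method and its argument order are assumptions made for illustration, with the `mean` and `std_dev` fields passed as ordinary arguments.

```ruby
SQRT_1_HALF = Math.sqrt(0.5)

# Same arithmetic as the processor's `process` method. The `to_f`
# guards against integer division when integers are passed in.
def percentile(observation, mean, std_dev)
  z_score = (mean - observation).to_f / std_dev
  50 * Math.erfc(z_score * SQRT_1_HALF)
end

puts percentile(95, 80, 10).round(2)  # the first exam
puts percentile(95, 75, 8).round(2)   # the more difficult exam
```

A score equal to the mean gives a z-score of 0 and therefore the 50th percentile, which is a quick sanity check on the sign convention.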
175
+
176
+ ### The Lifecycle of a Processor
177
+
178
+ Processors have a lifecycle that they execute when they are run within
179
+ the context of a Wukong runner like `wu-local` or `wu-hadoop`. Each
180
+ lifecycle phase corresponds to a method of the processor that is
181
+ called:
182
+
183
+ * `setup` called *after* the Processor is initialized but *before* the first record is processed. You cannot yield from this method.
184
+ * `process` called once for each input record, may yield once, many, or no times.
185
+ * `finalize` called after the *last* record has been processed but while the processor still has an opportunity to yield records.
186
+ * `stop` called to signal to the processor that all work should stop, open connections should be closed, &c. You cannot yield from this method.
187
+
188
+ The above examples have already focused on the `process` method.
189
+
190
+ The `setup` and `stop` methods are often used together to handle
191
+ external connections
192
+
193
+ ```ruby
194
+ # in geolocator.rb
195
+ Wukong.processor(:geolocator) do
196
+ field :host, String, :default => 'localhost'
197
+ attr_accessor :connection
198
+
199
+ def setup
200
+ self.connection = Database::Connection.new(host)
201
+ end
202
+ def process record
203
+ record.added_value = connection.find("...some query...")
204
+ end
205
+ def stop
206
+ self.connection.close
207
+ end
208
+ end
209
+ ```
210
+
211
+ The `finalize` method is most useful when writing a "reduce"-type
212
+ operation that involves storing or aggregating information till some
213
+ criterion is met. It will always be called after the last record has
214
+ been given (to `process`) but you can call it whenever you want to
215
+ within your own code.
216
+
217
+ Here's an example of using the `finalize` method to implement a simple
218
+ counter that counts all the input records:
219
+
220
+ ```ruby
221
+ # in counter.rb
222
+ Wukong.processor(:counter) do
223
+ attr_accessor :count
224
+ def setup
225
+ self.count = 0
226
+ end
227
+ def process thing
228
+ self.count += 1
229
+ end
230
+ def finalize
231
+ yield count
232
+ end
233
+ end
234
+ ```
235
+
236
+ It hinges on the fact that the last input record will be passed to
237
+ `process` *first* and only then will `finalize` be called. This
238
+ allows the last input record to be counted/processed/aggregated and
239
+ then the entire aggregate to be dealt with in finalize.
240
+
241
+ Because of this emphasis on building and processing aggregates, the
242
+ `finalize` method is often useful within processors meant to run as
243
+ reducers in a Hadoop environment.
244
+
245
+ Note: `finalize` is not guaranteed to be called in every possible
246
+ environment as it depends on the chosen runner. In a local or Hadoop
247
+ environment, the notion of "last record" makes sense and so the
248
+ corresponding runners will call `finalize`. In an environment like
249
+ Storm, where the concept of last record is not (supposed to be)
250
+ meaningful, the corresponding runner doesn't ever call it.
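
To make the order of the lifecycle calls concrete, here is a minimal plain-Ruby sketch. The `Counter` class mirrors the `counter` processor above, and `run_lifecycle` is a hypothetical helper standing in for a runner that has a notion of "last record"; neither is Wukong's actual driver code.

```ruby
# Plain-Ruby stand-in for the counter processor.
class Counter
  attr_accessor :count

  def setup
    self.count = 0
  end

  def process(_record)
    self.count += 1
  end

  def finalize
    yield count
  end

  def stop; end
end

# Hypothetical helper: drives a processor through setup, process (once
# per record), finalize, and stop, collecting everything it yields.
def run_lifecycle(processor, records)
  output = []
  processor.setup
  records.each { |record| processor.process(record) { |out| output << out } }
  processor.finalize { |out| output << out }
  processor.stop
  output
end

p run_lifecycle(Counter.new, %w[a b c])
```

Because `finalize` runs only after every record has gone through `process`, the last record is included in the count.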
251
+
252
+ ### Serialization
253
+
254
+ `wu-local` (and many similar tools) deal with inputs and outputs as
255
+ strings.
256
+
257
+ Processors want to process objects as close to their domain as is
258
+ possible. A processor which decorates address book entries with
259
+ Twitter handles doesn't want to think of its inputs as Strings but
260
+ Hashes or, better yet, Persons.
261
+
262
+ Wukong makes it easy to wrap a processor with other processors
263
+ dedicated to handling the common tasks of parsing records into or out
264
+ of formats like JSON and turning them into Ruby model instances.
265
+
266
+ #### De-serializing data formats like JSON or TSV
267
+
268
+ Wukong can parse and emit common data formats like JSON and delimited
269
+ formats like TSV or CSV so that you don't pollute or tie down your own
270
+ processors with protocol logic.
271
+
272
+ Here's an example of a processor that wants to deal with Hashes as
273
+ input.
274
+
275
+ ```ruby
276
+ # in extractor.rb
277
+ Wukong.processor(:extractor) do
278
+ def process hsh
279
+ yield hsh["first_name"]
280
+ end
281
+ end
282
+ ```
283
+
284
+ Given JSON data,
285
+
286
+ ```
287
+ $ cat input.json
288
+ {"first_name": "John", "last_name": "Smith"}
289
+ {"first_name": "Sally", "last_name": "Johnson"}
290
+ ...
291
+ ```
292
+
293
+ you can feed it directly to a processor
294
+
295
+ ```
296
+ $ cat input.json | wu-local --from=json extractor
297
+ John
298
+ Sally
299
+ ...
300
+ ```
301
+
302
+ Other processors really like Arrays:
303
+
304
+ ```ruby
305
+ Wukong.processor(:summer) do
306
+ def process values
307
+ yield values.map(&:to_f).inject(0.0) { |sum, summand| sum += summand }
308
+ end
309
+ end
310
+ ```
311
+
312
+ so you can feed them TSV data
313
+ ```
314
+ $ cat data.tsv
315
+ 1 2 3
316
+ 4 5 6
317
+ 7 8 9
318
+ ...
319
+ $ cat data.tsv | wu-local --from=tsv summer
320
+ 6
321
+ 15
322
+ 24
323
+ ...
324
+ ```
325
+
326
+ but you can just as easily use the same code with CSV data
327
+
328
+ ```
329
+ $ cat data.tsv | wu-local --from=csv summer
330
+ ```
331
+
332
+ or a more general delimited format.
333
+
334
+ ```
335
+ $ cat data.tsv | wu-local --from=delimited --delimiter='--' summer
336
+ ```
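
Conceptually, a delimited `--from` format turns each input line into an Array before the processor sees it. The sketch below imitates that in plain Ruby; `parse_delimited` is a hypothetical stand-in, not Wukong's deserializer, and `summer` copies the body of the processor above.

```ruby
# Hypothetical stand-in for a delimited deserializer: one line in,
# one Array of String fields out.
def parse_delimited(line, delimiter = "\t")
  line.chomp.split(delimiter)
end

# Same logic as the `summer` processor's `process` method.
def summer(values)
  yield values.map(&:to_f).inject(0.0) { |sum, summand| sum + summand }
end

["1\t2\t3", "4\t5\t6"].each do |line|
  summer(parse_delimited(line)) { |total| puts total }
end
summer(parse_delimited("7--8--9", "--")) { |total| puts total }
```

In this sketch the totals are Floats, because the values are converted with `to_f` before summing.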
337
+
338
+ #### Recordizing data structures into domain models
339
+
340
+ Here's a contact validator that relies on a Person model to decide
341
+ whether a contact entry should be yielded:
342
+
343
+ ```ruby
344
+ # in contact_validator.rb
345
+ require 'person'
346
+
347
+ Wukong.processor(:contact_validator) do
348
+ def process person
349
+ yield person if person.valid?
350
+ end
351
+ end
352
+ ```
353
+
354
+ Relying on the (elsewhere-defined) Person model to define `valid?`
355
+ means the processor can stay skinny and readable. Wukong can, in
356
+ combination with the deserializing features above, turn input text
357
+ into instances of Person:
358
+
359
+ ```
360
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator
361
+ #<Person:0x000000020e6120>
362
+ #<Person:0x000000020e6120>
363
+ #<Person:0x000000020e6120>
364
+ ```
365
+
366
+ `wu-local` can also serialize records from the `contact_validator`
367
+ processor:
368
+
369
+ ```
370
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator --to=json
371
+ {"first_name": "John", "last_name": "Smith", "valid": "true"}
372
+ {"first_name": "Sally", "last_name": "Johnson", "valid": "true"}
373
+ ...
374
+ ```
375
+
376
+ Serialization formats work just like deserialization formats, with
377
+ JSON as well as delimited formats available.
378
+
379
+ Parsing records into model instances and serializing them out again
380
+ puts constraints on the model class providing these instances. Here's
381
+ what the `Person` class needs to look like:
382
+
383
+
384
+ ```ruby
385
+ # in person.rb
386
+ class Person
387
+
388
+ # Create a new Person from the given attributes. Supports usage of
389
+ # the `--consumes` flag on the command-line
390
+ #
391
+ # @param [Hash] attrs
392
+ # @return [Person]
393
+ def self.receive attrs
394
+ new(attrs)
395
+ end
396
+
397
+ # Turn this Person into a basic data structure. Supports the usage
398
+ # of the `--to` flag on the command-line.
399
+ #
400
+ # @return [Hash]
401
+ def to_wire
402
+ to_hash
403
+ end
404
+ end
405
+ ```
406
+
407
+ To support the `--consumes=Person` syntax, the `receive` class method
408
+ must take a Hash produced from the operation of the `--from` argument
409
+ and return a `Person` instance.
410
+
411
+ To support the `--to=json` syntax, the `Person` class must implement
412
+ the `to_wire` instance method.
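
The `Person` class above leaves `initialize`, `to_hash`, and `valid?` undefined. Here is one way they might be filled in, as a self-contained sketch using only Ruby's standard `json` library. The attribute names and the validity rule are assumptions chosen to match the sample data, and the last three lines only imitate what `--from=json --consumes=Person ... --to=json` is described as doing.

```ruby
require 'json'

class Person
  attr_reader :first_name, :last_name

  # Build a Person from a Hash (what `--consumes` relies on).
  def self.receive(attrs)
    new(attrs)
  end

  def initialize(attrs)
    @first_name = attrs["first_name"]
    @last_name  = attrs["last_name"]
  end

  # Assumed rule: a contact is valid when both names are present.
  def valid?
    !first_name.to_s.empty? && !last_name.to_s.empty?
  end

  def to_hash
    { "first_name" => first_name, "last_name" => last_name }
  end

  # Turn the Person back into a basic structure (what `--to` relies on).
  def to_wire
    to_hash
  end
end

line   = '{"first_name": "John", "last_name": "Smith"}'
person = Person.receive(JSON.parse(line))
puts JSON.generate(person.to_wire) if person.valid?
```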
413
+
414
+ ### Logging and Notifications
415
+
416
+ Wukong comes with a logger that all processors have access to via
417
+ their `log` attribute. This logger has the following priorities:
418
+
419
+ * debug (can be set as a log level)
420
+ * info (can be set as a log level)
421
+ * warn (can be set as a log level)
422
+ * error
423
+ * fatal
424
+
425
+ and here's a processor which uses them all
426
+
427
+ ```ruby
428
+ # in logs.rb
429
+ Wukong.processor(:logs) do
430
+ def process line
431
+ log.debug line
432
+ log.info line
433
+ log.warn line
434
+ log.error line
435
+ log.fatal line
436
+ end
437
+ end
438
+ ```
439
+
440
+ The default log level is DEBUG.
441
+
442
+ ```
443
+ $ echo something | wu-local logs.rb
444
+ DEBUG 2013-01-11 23:40:56 [Logs ] -- event
445
+ INFO 2013-01-11 23:40:56 [Logs ] -- event
446
+ WARN 2013-01-11 23:40:56 [Logs ] -- event
447
+ ERROR 2013-01-11 23:40:56 [Logs ] -- event
448
+ FATAL 2013-01-11 23:40:56 [Logs ] -- event
449
+ ```
450
+
451
+ though you can set it to something else globally
452
+
453
+ ```
454
+ $ echo something | wu-local logs.rb --log.level=warn
455
+ WARN 2013-01-11 23:40:56 [Logs ] -- event
456
+ ERROR 2013-01-11 23:40:56 [Logs ] -- event
457
+ FATAL 2013-01-11 23:40:56 [Logs ] -- event
458
+ ```
459
+
460
+ or on a per-class basis.
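
The filtering behavior of a log level can be seen with Ruby's standard `Logger`, used here purely as an analogy; this is not Wukong's logger, and `severities_logged` is a hypothetical helper. Messages below the configured level are dropped and everything at or above it is written.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: log one message at every severity and return
# the severities that actually made it to the output.
def severities_logged(threshold)
  sink   = StringIO.new
  logger = Logger.new(sink)
  logger.level     = threshold
  logger.formatter = proc { |severity, _time, _progname, msg| "#{severity} -- #{msg}\n" }

  %i[debug info warn error fatal].each { |level| logger.public_send(level, "something") }
  sink.string.lines.map { |line| line.split.first }
end

puts severities_logged(Logger::WARN)
```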
461
+
462
+ ### Creating Documentation
463
+
464
+ `wu-local` includes a help message:
465
+
466
+ ```
467
+ $ wu-local --help
468
+ usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
469
+
470
+ wu-local is a tool for running Wukong processors and flows locally on
471
+ the command-line. Use wu-local by passing it a processor and feeding
472
+ ...
473
+
474
+
475
+ Params:
476
+ -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
477
+ -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
478
+ ```
479
+
480
+ You can generate custom help messages for your own processors. Here's
481
+ the percentile processor from before but made more usable with good
482
+ documentation:
116
483
 
484
+ ```ruby
485
+ # in percentile.rb
486
+ Wukong.processor(:percentile) do
487
+
488
+ description <<-EOF.gsub(/^ {2}/,'')
489
+ This processor calculates percentiles from input scores based on a
490
+ given mean score and a given standard deviation for the scores.
491
+
492
+ The mean and standard deviation are given at run time and processed
493
+ scores will be compared against the given mean and standard
494
+ deviation.
495
+
496
+ The input is expected to consist of float values, one per line.
497
+
498
+ Example:
499
+
500
+ $ cat input.dat
501
+ 88
502
+ 89
503
+ 77
504
+ ...
505
+
506
+ $ cat input.dat | wu-local percentile.rb --mean=85 --std_dev=7
507
+ 88.0 66.58824291023753
508
+ 89.0 71.61454169013237
509
+ 77.0 12.654895447355777
510
+ EOF
511
+
512
+ SQRT_1_HALF = Math.sqrt(0.5)
513
+
514
+ field :mean, Float, :default => 0.0, :doc => "The mean of the assumed distribution"
515
+ field :std_dev, Float, :default => 1.0, :doc => "The standard deviation of the assumed distribution"
516
+
517
+ def process value
518
+ observation = value.to_f
519
+ z_score = (mean - observation) / std_dev
520
+ percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
521
+ yield [observation, percentile].join("\t")
522
+ end
523
+ end
117
524
  ```
118
- $ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local regexp --match='^t'
525
+
526
+ If you call `wu-local` with the file to this processor as an argument
527
+ in addition to the original `--help` argument, you'll get custom
528
+ documentation.
529
+
119
530
  ```
531
+ $ wu-local percentile.rb --help
532
+ usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
533
+
534
+ This processor calculates percentiles from input scores based on a
535
+ given mean score and a given standard deviation for the scores.
536
+ ...
120
537
 
121
- There are many more simple <a href="#widgets">widgets</a> like these.
538
+
539
+ Params:
540
+ --mean=Float The mean of the assumed distribution [Default: 0.0]
541
+ -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
542
+ --std_dev=Float The standard deviation of the assumed distribution [Default: 1.0]
543
+ -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
544
+
545
+ ```
122
546
 
123
547
  <a name="flows"></a>
124
548
  ## Combining Processors into Dataflows
125
549
 
126
550
  Combining processors which each do one thing well together in a chain
127
551
  mimics the tried and true UNIX pipeline. Wukong lets you define
128
- these pipelines more formally as a dataflow. Here's the dataflow for
552
+ these pipelines more formally as a dataflow.
553
+
554
+ Having written the `tokenizer` processor, we can use it in a dataflow
555
+ along with the built-in `regexp` processor to replicate what we did in
129
556
  the last example:
130
557
 
131
558
  ```
132
559
  # in find_t_words.rb
560
+ require_relative('processors')
133
561
  Wukong.dataflow(:find_t_words) do
134
562
  tokenizer | regexp(match: /^t/)
135
563
  end
@@ -148,7 +576,8 @@ times
148
576
  ...
149
577
  ```
150
578
 
151
- and it works exactly like before.
579
+ and it works exactly like manually chaining the two processors
580
+ together.
152
581
 
153
582
  <a name="serialization"></a>
154
583
  ## Serialization
@@ -163,7 +592,14 @@ yield a String argument (or something that will `to_s` appropriately).
163
592
  ## Widgets
164
593
 
165
594
  Wukong has a number of built-in widgets that are useful for
166
- scaffolding your dataflows.
595
+ scaffolding your dataflows or using as starting off points for your
596
+ own processors.
597
+
598
+ For any of these widgets you can get customized help, say
599
+
600
+ ```
601
+ $ wu-local group --help
602
+ ```
167
603
 
168
604
  ### Serializers
169
605
 
@@ -350,10 +786,10 @@ describe :tokenizer do
350
786
  processor.given("Hi there.\nMy name is Wukong!").should emit(6).records
351
787
  end
352
788
  it "eliminates all punctuation" do
353
- processor.given("Never!").output.first.should_not include(',')
789
+ processor(:tokenizer).given("Never!").should emit('Never')
354
790
  end
355
- it "downcases all input text" do
356
- processor.given("Whatever").output.first.should match(/^w/)
791
+ it "will not emit tokens in a stop list" do
792
+ processor(:tokenizer, :stop_list => ['apples', 'bananas']).given("I like apples and bananas").should emit('I', 'like', 'and')
357
793
  end
358
794
  end
359
795
  ```
@@ -364,8 +800,13 @@ Let's look at each kind of helper:
364
800
  `it_behaves_like` helper) adds some tests that ensure that the
365
801
  processor conforms to the API of a Wukong::Processor.
366
802
 
367
- * The `processor` method instantiates a processor very similarly to
368
- the way `wu-local` instantiates one on the command-line. It accepts
803
+ * The `processor` method is actually an alias for the more aptly named
804
+ (but less convenient) `unit_test_runner`. This method accepts a
805
+ processor name and options (just like `wu-local` and other
806
+ command-line tools) and returns a Wukong::UnitTestRunner instance.
807
+ This runner handles the
808
+
809
+
369
810
  a (registered) processor name and options and creates a new
370
811
  processor. If no name is given, the argument of the enclosing
371
812
  `describe` or `context` block is used. The object returned by
@@ -374,29 +815,38 @@ Let's look at each kind of helper:
374
815
  behavior.
375
816
 
376
817
  * The `given` method (and other helpers like `given_json`,
377
- `given_tsv`, &c.) is added to the Processor class when
378
- Wukong::SpecHelpers is required. It's a way of lazily feeding
379
- records to a processor, without having to go through the `process`
380
- method directly and having to handle the block or the processor's
381
- lifecycle as in the prior example.
818
+ `given_tsv`, &c.) is a method on the runner. It's a way of lazily
819
+ feeding records to a processor, without having to go through the
820
+ `process` method directly and having to handle the block or the
821
+ processor's lifecycle as in the prior example.
382
822
 
383
823
  * The `output` and `emit` matchers will `process` all previously
384
824
  `given` records when they are called. This lets you separate
385
825
  instantiation, input, expectations, and output. Here's a more
386
- complicated example:
826
+ complicated example.
387
827
 
388
828
  The same helpers can be used to test dataflows as well as
389
- processors. For complete details, see documentation for the
390
- Wukong::SpecHelpers module.
829
+ processors.
830
+
831
+ ####
832
+
833
+ #### Functions vs. Objects
834
+
835
+ The above test helpers are designed to aid in testing processors
836
+ functionally because:
837
+
838
+ * they accept the
391
839
 
392
840
  ### Integration Tests
393
841
 
394
- Sometimes unit tests aren't enough and you need to test your
395
- processors or flows as they will be run in production using
396
- `wu-local`.
842
+ If you are implementing a new Wukong command (akin to `wu-local`) then
843
+ you may also want to run integration tests. Wukong comes with helpers
844
+ for these, too.
397
845
 
398
- For these use cases, Wukong provides some integration helpers that
399
- make testing command line processes easier.
846
+ You should almost always be able to test your processors without
847
+ integration tests. Your unit tests and the Wukong framework itself
848
+ should ensure that your processors work correctly no matter what
849
+ environment they are deployed in.
400
850
 
401
851
  ```ruby
402
852
  # spec/integration/tokenizer_spec.rb
@@ -415,7 +865,7 @@ context "interpreting its arguments" do
415
865
  end
416
866
  context "with a malformed --match argument" do
417
867
  # invalid b/c the regexp is broken...
418
- subject { command("wu-local tokenizer --match='^[h'") < "hi there" }
868
+ subject { command("wu-local tokenizer --match='^(h'") < "hi there" }
419
869
  it { should exit_with(:non_zero) }
420
870
  it { should have_stderr(/invalid/) }
421
871
  end
@@ -457,3 +907,192 @@ Let's go through the helpers:
457
907
  * The `have_stdout` and `have_stderr` matchers let you test the STDOUT or STDERR of the command for particular strings or regular expressions.
458
908
 
459
909
  * The `exit_with` matcher lets you test the exit code of the command. You can pass the symbol `:non_zero` to set the expectation of _any_ non-zero exit code.
910
+
911
+ ## Plugins
912
+
913
+ Wukong has a built-in plugin framework to make it easy to adapt Wukong
914
+ processors to new backends or add other functionality. The
915
+ `Wukong::Local` module and the `wu-local` program it supports is
916
+ itself a Wukong plugin.
917
+
918
+ The following shows how you might build a simplified version of
919
+ `Wukong::Local` as a new plugin. We'll call this plugin `Cat` as it
920
+ will implement a program `wu-cat` that is similar in function to
921
+ `wu-local` (just simplified).
922
+
923
+ The first thing to do is include the `Wukong::Plugin` module in your
924
+ code:
925
+
926
+
927
+ ```ruby
928
+ # in lib/cat.rb
929
+ #
930
+ # This Wukong plugin works like wu-local but replicates some silly
931
+ # features of cat like numbered lines.
932
+ module Cat
933
+
934
+ # This registers Cat as a Wukong plugin.
935
+ include Wukong::Plugin
936
+
937
+ # Defines any settings specific to Cat. Cat doesn't need to, but
938
+ # you can define global settings here if you want. You can also
939
+ # check the `program` name to decide whether to apply your settings.
940
+ # This helps you not pollute other commands with your stuff.
941
+ def self.configure settings, program
942
+ case program
943
+ when 'wu-cat'
944
+ settings.define(:input, :description => "The input file to use")
945
+ settings.define(:number, :description => "Prepend each input record with a consecutive number", :type => :boolean)
946
+ else
947
+ # configure other programs if you need to
948
+ end
949
+ end
950
+
951
+ # Lets Cat boot up with settings that have already been resolved
952
+ # from the command-line or other sources like config files or remote
953
+ # servers added by other plugins.
954
+ #
955
+ # The `root` directory in which the program is executing is also
956
+ # provided.
957
+ def self.boot settings, root
958
+ puts "Cat booting up using resolved settings within directory #{root}"
959
+ end
960
+ end
961
+ ```
962
+
963
+ If your plugin doesn't interact directly with the command-line
964
+ (through a wu-tool like `wu-local` or `wu-hadoop`) and doesn't
965
+ directly interface with passing records to processors then you can
966
+ just require the rest of your plugin's code at this point and be done.
967
+
968
+ ### Write a Runner to interact with the command-line
969
+
970
+ If you need to implement a new command line tool then you should write
971
+ a Runner. A Runner is used to implement Wukong programs like
972
+ `wu-local` or `wu-hadoop`. Here's what the actual program file would
973
+ look like for our example plugin's `wu-cat` program.
974
+
975
+ ```ruby
976
+ #!/usr/bin/env ruby
977
+ # in bin/wu-cat
978
+ require 'cat'
979
+ Cat::Runner.run
980
+ ```
981
+
982
+ The Cat::Runner class is implemented separately.
983
+
984
+ ```ruby
985
+ # in lib/cat/runner.rb
986
+ require_relative('driver')
987
+ module Cat
988
+
989
+ # Implements the `wu-cat` command.
990
+ class Runner < Wukong::Runner
991
+
992
+ usage "PROCESSOR|FLOW"
993
+
994
+ description <<-EOF
995
+
996
+ wu-cat lets you run a Wukong processor or dataflow on the
997
+ command-line. Try it like this.
998
+
999
+ $ wu-cat --input=data.txt
1000
+ hello
1001
+ my
1002
+ friend
1003
+
1004
+ Connect the output to a processor in upcaser.rb
1005
+
1006
+ $ wu-cat --input=data.txt upcaser.rb
1007
+ HELLO
1008
+ MY
1009
+ FRIEND
1010
+
1011
+ You can also add line numbers to the output.
1012
+
1013
+ $ wu-cat --number --input=data.txt upcaser.rb
1014
+ 1 HELLO
1015
+ 2 MY
1016
+ 3 FRIEND
1017
+ EOF
1018
+
1019
+ # The name of the processor we're going to run. The #args method
1020
+ # is provided by the Runner class.
1021
+ def processor_name
1022
+ args.first
1023
+ end
1024
+
1025
+ # Validate that we were given the name of a registered processor
1026
+ # to run. Be careful to return true here or validation will fail.
1027
+ def validate
1028
+ raise Wukong::Error.new("Must provide a processor as the first argument") unless processor_name
1029
+ true
1030
+ end
1031
+
1032
+ # Delegates to a driver class to run the processor.
1033
+ def run
1034
+ Driver.new(processor_name, settings).start
1035
+ end
1036
+
1037
+ end
1038
+ end
1039
+ ```
1040
+
1041
+ ### Write a Driver to interact with processors
1042
+
1043
+ The `Cat::Runner#run` method delegates to the `Cat::Driver` class to
1044
+ handle instantiating and interacting with processors.
1045
+
1046
+ ```ruby
1047
+ # in lib/cat/driver.rb
1048
+ module Cat
1049
+
1050
+ # A class for driving a processor from `wu-cat`.
1051
+ class Driver
1052
+
1053
+ # Lets us count the records.
1054
+ attr_accessor :number
1055
+
1056
+ # Gives methods to construct and interact with dataflows.
1057
+ include Wukong::DriverMethods
1058
+
1059
+ # Create a new Driver for a dataflow with the given `label` using
1060
+ # the given `settings`.
1061
+ #
1062
+ # @param [String] label the name of the dataflow
1063
+ # @param [Configliere::Param] settings the settings to use when creating the dataflow
1064
+ def initialize label, settings
1065
+ self.settings = settings
1066
+ self.dataflow = construct_dataflow(label, settings)
1067
+ self.number = 1
1068
+ end
1069
+
1070
+ # The file handle of the input file.
1071
+ #
1072
+ # @return [File]
1073
+ def input_file
1074
+ @input_file ||= File.new(settings[:input])
1075
+ end
1076
+
1077
+ # Starts feeding records to the processor
1078
+ def start
1079
+ while line = input_file.readline rescue nil
1080
+ driver.send_through_dataflow(line)
1081
+ end
1082
+ end
1083
+
1084
+ # Process each record that comes back from the dataflow.
1085
+ #
1086
+ # @param [Object] record the yielded record
1087
+ def process record
1088
+ if settings[:number]
1089
+ puts [number, record].map(&:to_s).join("\t")
1090
+ else
1091
+ puts record.to_s
1092
+ end
1093
+ self.number += 1
1094
+ end
1095
+
1096
+ end
1097
+ end
1098
+ ```