wukong 3.0.0.pre3 → 3.0.0
- data/Gemfile +1 -0
- data/README.md +689 -50
- data/bin/wu-local +1 -74
- data/diagrams/wu_local.dot +39 -0
- data/diagrams/wu_local.dot.png +0 -0
- data/examples/loadable.rb +2 -0
- data/examples/string_reverser.rb +7 -0
- data/lib/hanuman/stage.rb +2 -2
- data/lib/wukong.rb +21 -10
- data/lib/wukong/dataflow.rb +2 -5
- data/lib/wukong/doc_helpers.rb +14 -0
- data/lib/wukong/doc_helpers/dataflow_handler.rb +29 -0
- data/lib/wukong/doc_helpers/field_handler.rb +91 -0
- data/lib/wukong/doc_helpers/processor_handler.rb +29 -0
- data/lib/wukong/driver.rb +11 -1
- data/lib/wukong/local.rb +40 -0
- data/lib/wukong/local/event_machine_driver.rb +27 -0
- data/lib/wukong/local/runner.rb +98 -0
- data/lib/wukong/local/stdio_driver.rb +44 -0
- data/lib/wukong/local/tcp_driver.rb +47 -0
- data/lib/wukong/logger.rb +16 -7
- data/lib/wukong/plugin.rb +48 -0
- data/lib/wukong/processor.rb +57 -15
- data/lib/wukong/rake_helper.rb +6 -0
- data/lib/wukong/runner.rb +151 -128
- data/lib/wukong/runner/boot_sequence.rb +123 -0
- data/lib/wukong/runner/code_loader.rb +52 -0
- data/lib/wukong/runner/deploy_pack_loader.rb +75 -0
- data/lib/wukong/runner/help_message.rb +42 -0
- data/lib/wukong/spec_helpers.rb +4 -12
- data/lib/wukong/spec_helpers/integration_tests.rb +150 -0
- data/lib/wukong/spec_helpers/{integration_driver_matchers.rb → integration_tests/integration_test_matchers.rb} +28 -62
- data/lib/wukong/spec_helpers/integration_tests/integration_test_runner.rb +97 -0
- data/lib/wukong/spec_helpers/shared_examples.rb +19 -10
- data/lib/wukong/spec_helpers/unit_tests.rb +134 -0
- data/lib/wukong/spec_helpers/{processor_methods.rb → unit_tests/unit_test_driver.rb} +42 -8
- data/lib/wukong/spec_helpers/{spec_driver_matchers.rb → unit_tests/unit_test_matchers.rb} +6 -32
- data/lib/wukong/spec_helpers/unit_tests/unit_test_runner.rb +54 -0
- data/lib/wukong/version.rb +1 -1
- data/lib/wukong/widget/filters.rb +134 -8
- data/lib/wukong/widget/processors.rb +64 -5
- data/lib/wukong/widget/reducers/bin.rb +68 -18
- data/lib/wukong/widget/reducers/count.rb +12 -0
- data/lib/wukong/widget/reducers/group.rb +48 -5
- data/lib/wukong/widget/reducers/group_concat.rb +30 -2
- data/lib/wukong/widget/reducers/moments.rb +4 -4
- data/lib/wukong/widget/reducers/sort.rb +53 -3
- data/lib/wukong/widget/serializers.rb +37 -12
- data/lib/wukong/widget/utils.rb +1 -1
- data/spec/spec_helper.rb +20 -2
- data/spec/wukong/driver_spec.rb +2 -0
- data/spec/wukong/local/runner_spec.rb +40 -0
- data/spec/wukong/local_spec.rb +6 -0
- data/spec/wukong/logger_spec.rb +49 -0
- data/spec/wukong/processor_spec.rb +22 -0
- data/spec/wukong/runner_spec.rb +128 -8
- data/spec/wukong/widget/filters_spec.rb +28 -10
- data/spec/wukong/widget/processors_spec.rb +5 -5
- data/spec/wukong/widget/reducers/bin_spec.rb +14 -14
- data/spec/wukong/widget/reducers/count_spec.rb +1 -1
- data/spec/wukong/widget/reducers/group_spec.rb +7 -6
- data/spec/wukong/widget/reducers/moments_spec.rb +2 -2
- data/spec/wukong/widget/reducers/sort_spec.rb +1 -1
- data/spec/wukong/widget/serializers_spec.rb +84 -88
- data/spec/wukong/wu-local_spec.rb +109 -0
- metadata +43 -20
- data/bin/wu-server +0 -70
- data/lib/wukong/boot.rb +0 -96
- data/lib/wukong/configuration.rb +0 -8
- data/lib/wukong/emitter.rb +0 -22
- data/lib/wukong/server.rb +0 -119
- data/lib/wukong/spec_helpers/integration_driver.rb +0 -157
- data/lib/wukong/spec_helpers/processor_helpers.rb +0 -89
- data/lib/wukong/spec_helpers/spec_driver.rb +0 -28
- data/spec/wukong/local_runner_spec.rb +0 -31
- data/spec/wukong/wu_local_spec.rb +0 -125
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -19,6 +19,8 @@ Here is a list of various other projects which you may also want to
|
|
19
19
|
peruse when trying to understand the full Wukong experience:
|
20
20
|
|
21
21
|
* <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
|
22
|
+
* <a href="http://github.com/infochimps-labs/wukong-storm">wukong-storm</a>: Run Wukong processors within the Storm framework. Model flows locally before you run them.
|
23
|
+
* <a href="http://github.com/infochimps-labs/wukong-load">wukong-load</a>: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
|
22
24
|
* <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
|
23
25
|
* <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
|
24
26
|
|
@@ -36,7 +38,7 @@ processor is Ruby class which
|
|
36
38
|
* subclasses `Wukong::Processor` (use the `Wukong.processor` method as sugar for this)
|
37
39
|
* defines a `process` method which takes an input record, does something, and calls `yield` on the output
|
38
40
|
|
39
|
-
Here's a processor that reverses
|
41
|
+
Here's a processor that reverses each of its input records:
|
40
42
|
|
41
43
|
```ruby
|
42
44
|
# in string_reverser.rb
|
@@ -47,8 +49,8 @@ Wukong.processor(:string_reverser) do
|
|
47
49
|
end
|
48
50
|
```
|
49
51
|
|
50
|
-
|
51
|
-
|
52
|
+
You can run this processor on the command line using text files as
|
53
|
+
input using the `wu-local` tool that comes with Wukong:
|
52
54
|
|
53
55
|
```
|
54
56
|
$ cat novel.txt
|
@@ -59,35 +61,46 @@ $ cat novel.txt | wu-local string_reverser.rb
|
|
59
61
|
.semit fo tsrow eht saw ti ,semit fo tseb eht saw tI
|
60
62
|
```
|
61
63
|
|
62
|
-
|
63
|
-
|
64
|
+
The `wu-local` program consumes one line at a time from STDIN and
|
65
|
+
calls your processor's `process` method with that line as a Ruby
|
66
|
+
String object. Each object you `yield` within your process method
|
67
|
+
will be printed back out on STDOUT.
|
68
|
+
|
69
|
+
### Multiple Processors, Multiple (Or No) Yields
|
70
|
+
|
71
|
+
Processors are intended to be combined so they can be stored in the
|
72
|
+
same file, like these two related processors:
|
64
73
|
|
65
74
|
```ruby
|
66
75
|
# in processors.rb
|
67
76
|
|
68
|
-
Wukong.processor(:
|
77
|
+
Wukong.processor(:splitter) do
|
69
78
|
def process line
|
70
79
|
line.split.each { |token| yield token }
|
71
80
|
end
|
72
81
|
end
|
73
82
|
|
74
|
-
Wukong.processor(:
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
def process word
|
79
|
-
yield word if word =~ Regexp.new("^#{letter}", true)
|
83
|
+
Wukong.processor(:normalizer) do
|
84
|
+
def process token
|
85
|
+
stripped = token.downcase.gsub(/\W/,'')
|
86
|
+
yield stripped if stripped.size > 0
|
80
87
|
end
|
81
88
|
end
|
82
89
|
```
|
83
90
|
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
91
|
+
Notice how the `splitter` yields multiple tokens for each of its input
|
92
|
+
lines and that the `normalizer` may sometimes never yield at all,
|
93
|
+
depending on its input. Processors are under no obligation from the
|
94
|
+
framework to yield or return anything so they can easily act as
|
95
|
+
filters or even sinks in data flows.
|
96
|
+
|
97
|
+
There are two processors in this file and neither shares a name with
|
98
|
+
the basename of the file ("processors") so `wu-local` can't
|
99
|
+
automatically choose a processor to run. We can specify one
|
100
|
+
explicitly with the `--run` option:
|
88
101
|
|
89
102
|
```
|
90
|
-
$ cat novel.txt | wu-local processors.rb --run=
|
103
|
+
$ cat novel.txt | wu-local processors.rb --run=splitter
|
91
104
|
It
|
92
105
|
was
|
93
106
|
the
|
@@ -97,39 +110,454 @@ times,
|
|
97
110
|
...
|
98
111
|
```
|
99
112
|
|
100
|
-
|
101
|
-
shell. Let's add the `starts_with` filter and also pass in the
|
102
|
-
*field* `letter`, defined in that processor:
|
113
|
+
We can combine the two processors together
|
103
114
|
|
104
115
|
```
|
105
|
-
$ cat novel.txt | wu-local processors.rb --run=
|
106
|
-
|
107
|
-
|
116
|
+
$ cat novel.txt | wu-local processors.rb --run=splitter | wu-local processors.rb --run=normalizer
|
117
|
+
it
|
118
|
+
was
|
108
119
|
the
|
120
|
+
best
|
121
|
+
of
|
109
122
|
times
|
110
123
|
...
|
111
124
|
```
|
112
125
|
|
113
|
-
|
114
|
-
|
115
|
-
|
126
|
+
but there's an easier way of doing this with <a href="#flows">dataflows</a>.
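
To see the "multiple yields" and "no yields" behavior without installing anything, here is a minimal plain-Ruby sketch. It assumes nothing about Wukong's internals: `splitter` and `normalizer` are ordinary methods standing in for the two processors, and `split_and_normalize` is a hypothetical helper that nests their blocks the way the shell pipe chains the two commands.

```ruby
# Plain-Ruby stand-ins for the two processors above; no Wukong required.
def splitter(line)
  # Yields once per whitespace-separated token (possibly many times).
  line.split.each { |token| yield token }
end

def normalizer(token)
  # Yields at most once; tokens made only of punctuation are dropped.
  stripped = token.downcase.gsub(/\W/, '')
  yield stripped if stripped.size > 0
end

# Hypothetical helper: feed every token from splitter into normalizer.
def split_and_normalize(line)
  results = []
  splitter(line) do |token|
    normalizer(token) { |word| results << word }
  end
  results
end

p split_and_normalize("It was the best of times, -- !")
```

The trailing `--` and `!` tokens produce no output at all, which is the filtering behavior described above.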
|
127
|
+
|
128
|
+
### Adding Configurable Options
|
129
|
+
|
130
|
+
Processors can have options that can be set in Ruby code, from the
|
131
|
+
command-line, a configuration file, or a variety of other places
|
132
|
+
thanks to [Configliere](http://github.com/infochimps-labs/configliere).
|
133
|
+
|
134
|
+
This processor calculates percentiles from observations assuming a
|
135
|
+
normal distribution given a particular mean and standard deviation.
|
136
|
+
It uses two *fields*, the mean or average of a distribution (`mean`)
|
137
|
+
and its standard deviation (`std_dev`). From this information, it
|
138
|
+
will measure the percentile of all input values.
|
139
|
+
|
140
|
+
```ruby
|
141
|
+
# in percentile.rb
|
142
|
+
Wukong.processor(:percentile) do
|
143
|
+
|
144
|
+
SQRT_1_HALF = Math.sqrt(0.5)
|
145
|
+
|
146
|
+
field :mean, Float, :default => 0.0
|
147
|
+
field :std_dev, Float, :default => 1.0
|
148
|
+
|
149
|
+
def process value
|
150
|
+
observation = value.to_f
|
151
|
+
z_score = (mean - observation) / std_dev
|
152
|
+
percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
|
153
|
+
yield [observation, percentile].join("\t")
|
154
|
+
end
|
155
|
+
end
|
156
|
+
```
|
157
|
+
|
158
|
+
These fields have default values but you can override them on the
|
159
|
+
command line. If you scored a 95 on an exam where the mean score was
|
160
|
+
80 points and the standard deviation of the scores was 10 points, for
|
161
|
+
example, then you'd be in the 93rd percentile:
|
162
|
+
|
163
|
+
```
|
164
|
+
$ echo 95 | wu-local /tmp/percentile.rb --mean=80 --std_dev=10
|
165
|
+
95.0 93.3192798731142
|
166
|
+
```
|
167
|
+
|
168
|
+
If the exam were more difficult, with a mean of 75 points and a
|
169
|
+
standard deviation of 8 points, you'd be in the 99th percentile!
|
170
|
+
|
171
|
+
```
|
172
|
+
$ echo 95 | wu-local /tmp/percentile.rb --mean=75 --std_dev=8
|
173
|
+
95.0 99.37903346742239
|
174
|
+
```
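
The arithmetic behind both results can be checked outside of Wukong. The following is a sketch that copies the formula from the processor into a standalone method; the method name `percentile` and its argument order are choices made for this illustration only.

```ruby
SQRT_1_HALF = Math.sqrt(0.5)

# Percentile of an observation under a normal distribution with the
# given mean and standard deviation (same formula as the processor).
def percentile(observation, mean, std_dev)
  z_score = (mean - observation) / std_dev.to_f
  50 * Math.erfc(z_score * SQRT_1_HALF)
end

puts percentile(95.0, 80.0, 10.0).round(2)  # 93.32
puts percentile(95.0, 75.0, 8.0).round(2)   # 99.38
```

An observation equal to the mean gives a z-score of 0, and since `Math.erfc(0)` is 1 the result is exactly the 50th percentile.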
|
175
|
+
|
176
|
+
### The Lifecycle of a Processor
|
177
|
+
|
178
|
+
Processors have a lifecycle that they execute when they are run within
|
179
|
+
the context of a Wukong runner like `wu-local` or `wu-hadoop`. Each
|
180
|
+
lifecycle phase corresponds to a method of the processor that is
|
181
|
+
called:
|
182
|
+
|
183
|
+
* `setup` called *after* the Processor is initialized but *before* the first record is processed. You cannot yield from this method.
|
184
|
+
* `process` called once for each input record, may yield once, many, or no times.
|
185
|
+
* `finalize` called after the *last* record has been processed but while the processor still has an opportunity to yield records.
|
186
|
+
* `stop` called to signal to the processor that all work should stop, open connections should be closed, &c. You cannot yield from this method.
|
187
|
+
|
188
|
+
The above examples have already focused on the `process` method.
|
189
|
+
|
190
|
+
The `setup` and `stop` methods are often used together to handle
|
191
|
+
external connections:
|
192
|
+
|
193
|
+
```ruby
|
194
|
+
# in geolocator.rb
|
195
|
+
Wukong.processor(:geolocator) do
|
196
|
+
field :host, String, :default => 'localhost'
|
197
|
+
attr_accessor :connection
|
198
|
+
|
199
|
+
def setup
|
200
|
+
self.connection = Database::Connection.new(host)
|
201
|
+
end
|
202
|
+
def process record
|
203
|
+
record.added_value = connection.find("...some query...")
|
204
|
+
end
|
205
|
+
def stop
|
206
|
+
self.connection.close
|
207
|
+
end
|
208
|
+
end
|
209
|
+
```
|
210
|
+
|
211
|
+
The `finalize` method is most useful when writing a "reduce"-type
|
212
|
+
operation that involves storing or aggregating information till some
|
213
|
+
criterion is met. It will always be called after the last record has
|
214
|
+
been given (to `process`) but you can call it whenever you want to
|
215
|
+
within your own code.
|
216
|
+
|
217
|
+
Here's an example of using the `finalize` method to implement a simple
|
218
|
+
counter that counts all the input records:
|
219
|
+
|
220
|
+
```ruby
|
221
|
+
# in counter.rb
|
222
|
+
Wukong.processor(:counter) do
|
223
|
+
attr_accessor :count
|
224
|
+
def setup
|
225
|
+
self.count = 0
|
226
|
+
end
|
227
|
+
def process thing
|
228
|
+
self.count += 1
|
229
|
+
end
|
230
|
+
def finalize
|
231
|
+
yield count
|
232
|
+
end
|
233
|
+
end
|
234
|
+
```
|
235
|
+
|
236
|
+
It hinges on the fact that the last input record will be passed to
|
237
|
+
`process` *first* and only then will `finalize` be called. This
|
238
|
+
allows the last input record to be counted/processed/aggregated and
|
239
|
+
then the entire aggregate to be dealt with in finalize.
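
The ordering of the lifecycle phases can be sketched in plain Ruby. This is not Wukong's actual driver code: `MiniCounter` and `run_lifecycle` are hypothetical names used only to show the order in which a runner such as `wu-local` is described as calling `setup`, `process`, `finalize`, and `stop`.

```ruby
# A stand-in for the counter processor, written as an ordinary class.
class MiniCounter
  attr_accessor :count

  def setup
    self.count = 0
  end

  def process(_record)
    self.count += 1   # never yields: records are only aggregated
  end

  def finalize
    yield count       # the aggregate is emitted once, at the end
  end

  def stop
  end
end

# Hypothetical driver: calls each lifecycle method in the documented order
# and collects whatever the processor yields.
def run_lifecycle(processor, records)
  output = []
  emit = ->(record) { output << record }
  processor.setup
  records.each { |record| processor.process(record, &emit) }
  processor.finalize(&emit)
  processor.stop
  output
end

p run_lifecycle(MiniCounter.new, %w[a b c])
```

Because `finalize` runs only after every record has passed through `process`, the single emitted value reflects the full input.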
|
240
|
+
|
241
|
+
Because of this emphasis on building and processing aggregates, the
|
242
|
+
`finalize` method is often useful within processors meant to run as
|
243
|
+
reducers in a Hadoop environment.
|
244
|
+
|
245
|
+
Note: `finalize` is not guaranteed to be called in every possible
|
246
|
+
environment as it depends on the chosen runner. In a local or Hadoop
|
247
|
+
environment, the notion of "last record" makes sense and so the
|
248
|
+
corresponding runners will call `finalize`. In an environment like
|
249
|
+
Storm, where the concept of last record is not (supposed to be)
|
250
|
+
meaningful, the corresponding runner doesn't ever call it.
|
251
|
+
|
252
|
+
### Serialization
|
253
|
+
|
254
|
+
`wu-local` (and many similar tools) deal with inputs and outputs as
|
255
|
+
strings.
|
256
|
+
|
257
|
+
Processors want to process objects as close to their domain as is
|
258
|
+
possible. A processor which decorates address book entries with
|
259
|
+
Twitter handles doesn't want to think of its inputs as Strings but
|
260
|
+
Hashes or, better yet, Persons.
|
261
|
+
|
262
|
+
Wukong makes it easy to wrap a processor with other processors
|
263
|
+
dedicated to handling the common tasks of parsing records into or out
|
264
|
+
of formats like JSON and turning them into Ruby model instances.
|
265
|
+
|
266
|
+
#### De-serializing data formats like JSON or TSV
|
267
|
+
|
268
|
+
Wukong can parse and emit common data formats like JSON and delimited
|
269
|
+
formats like TSV or CSV so that you don't pollute or tie down your own
|
270
|
+
processors with protocol logic.
|
271
|
+
|
272
|
+
Here's an example of a processor that wants to deal with Hashes as
|
273
|
+
input.
|
274
|
+
|
275
|
+
```ruby
|
276
|
+
# in extractor.rb
|
277
|
+
Wukong.processor(:extractor) do
|
278
|
+
def process hsh
|
279
|
+
yield hsh["first_name"]
|
280
|
+
end
|
281
|
+
end
|
282
|
+
```
|
283
|
+
|
284
|
+
Given JSON data,
|
285
|
+
|
286
|
+
```
|
287
|
+
$ cat input.json
|
288
|
+
{"first_name": "John", "last_name": "Smith"}
|
289
|
+
{"first_name": "Sally", "last_name": "Johnson"}
|
290
|
+
...
|
291
|
+
```
|
292
|
+
|
293
|
+
you can feed it directly to a processor
|
294
|
+
|
295
|
+
```
|
296
|
+
$ cat input.json | wu-local --from=json extractor
|
297
|
+
John
|
298
|
+
Sally
|
299
|
+
...
|
300
|
+
```
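
Conceptually, `--from=json` places a parsing step in front of the processor so that `process` receives a Hash instead of a String. The sketch below imitates that with Ruby's standard `json` library; the helper name `extract_first_names` is made up for this example and is not part of Wukong.

```ruby
require 'json'

# Parse each input line into a Hash, then do what the extractor does.
def extract_first_names(lines)
  lines.map { |line| JSON.parse(line) }
       .map { |hsh| hsh["first_name"] }
end

input = [
  '{"first_name": "John", "last_name": "Smith"}',
  '{"first_name": "Sally", "last_name": "Johnson"}'
]
puts extract_first_names(input)
```

Keeping the parsing outside the processor is what lets the same `extractor` work with any input format that produces a Hash.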
|
301
|
+
|
302
|
+
Other processors really like Arrays:
|
303
|
+
|
304
|
+
```ruby
|
305
|
+
Wukong.processor(:summer) do
|
306
|
+
def process values
|
307
|
+
yield values.map(&:to_f).inject(0.0) { |sum, summand| sum + summand }
|
308
|
+
end
|
309
|
+
end
|
310
|
+
```
|
311
|
+
|
312
|
+
so you can feed them TSV data
|
313
|
+
```
|
314
|
+
$ cat data.tsv
|
315
|
+
1 2 3
|
316
|
+
4 5 6
|
317
|
+
7 8 9
|
318
|
+
...
|
319
|
+
$ cat data.tsv | wu-local --from=tsv summer
|
320
|
+
6.0
|
321
|
+
15.0
|
322
|
+
24.0
|
323
|
+
...
|
324
|
+
```
|
325
|
+
|
326
|
+
but you can just as easily use the same code with CSV data
|
327
|
+
|
328
|
+
```
|
329
|
+
$ cat data.csv | wu-local --from=csv summer
|
330
|
+
```
|
331
|
+
|
332
|
+
or a more general delimited format.
|
333
|
+
|
334
|
+
```
|
335
|
+
$ cat data.tsv | wu-local --from=delimited --delimiter='--' summer
|
336
|
+
```
|
337
|
+
|
338
|
+
#### Recordizing data structures into domain models
|
339
|
+
|
340
|
+
Here's a contact validator that relies on a Person model to decide
|
341
|
+
whether a contact entry should be yielded:
|
342
|
+
|
343
|
+
```ruby
|
344
|
+
# in contact_validator.rb
|
345
|
+
require 'person'
|
346
|
+
|
347
|
+
Wukong.processor(:contact_validator) do
|
348
|
+
def process person
|
349
|
+
yield person if person.valid?
|
350
|
+
end
|
351
|
+
end
|
352
|
+
```
|
353
|
+
|
354
|
+
Relying on the (elsewhere-defined) Person model to define `valid?`
|
355
|
+
means the processor can stay skinny and readable. Wukong can, in
|
356
|
+
combination with the deserializing features above, turn input text
|
357
|
+
into instances of Person:
|
358
|
+
|
359
|
+
```
|
360
|
+
$ cat input.json | wu-local --consumes=Person --from=json contact_validator
|
361
|
+
#<Person:0x000000020e6120>
|
362
|
+
#<Person:0x000000020e6120>
|
363
|
+
#<Person:0x000000020e6120>
|
364
|
+
```
|
365
|
+
|
366
|
+
`wu-local` can also serialize records from the `contact_validator`
|
367
|
+
processor:
|
368
|
+
|
369
|
+
```
|
370
|
+
$ cat input.json | wu-local --consumes=Person --from=json contact_validator --to=json
|
371
|
+
{"first_name": "John", "last_name": "Smith", "valid": "true"}
|
372
|
+
{"first_name": "Sally", "last_name": "Johnson", "valid": "true"}
|
373
|
+
...
|
374
|
+
```
|
375
|
+
|
376
|
+
Serialization formats work just like deserialization formats, with
|
377
|
+
JSON as well as delimited formats available.
|
378
|
+
|
379
|
+
Parsing records into model instances and serializing them out again
|
380
|
+
puts constraints on the model class providing these instances. Here's
|
381
|
+
what the `Person` class needs to look like:
|
382
|
+
|
383
|
+
|
384
|
+
```ruby
|
385
|
+
# in person.rb
|
386
|
+
class Person
|
387
|
+
|
388
|
+
# Create a new Person from the given attributes. Supports usage of
|
389
|
+
# the `--consumes` flag on the command-line
|
390
|
+
#
|
391
|
+
# @param [Hash] attrs
|
392
|
+
# @return [Person]
|
393
|
+
def self.receive attrs
|
394
|
+
new(attrs)
|
395
|
+
end
|
396
|
+
|
397
|
+
# Turn this Person into a basic data structure. Supports the usage
|
398
|
+
# of the `--to` flag on the command-line.
|
399
|
+
#
|
400
|
+
# @return [Hash]
|
401
|
+
def to_wire
|
402
|
+
to_hash
|
403
|
+
end
|
404
|
+
end
|
405
|
+
```
|
406
|
+
|
407
|
+
To support the `--consumes=Person` syntax, the `receive` class method
|
408
|
+
must take a Hash produced from the operation of the `--from` argument
|
409
|
+
and return a `Person` instance.
|
410
|
+
|
411
|
+
To support the `--to=json` syntax, the `Person` class must implement
|
412
|
+
the `to_wire` instance method.
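
For illustration, here is one self-contained way a model could satisfy both requirements. It is a sketch only: the attribute names, the `valid?` rule, and the `to_hash` implementation are assumptions made for this example, not something Wukong prescribes.

```ruby
class Person
  attr_reader :first_name, :last_name

  def initialize(attrs)
    @first_name = attrs["first_name"]
    @last_name  = attrs["last_name"]
  end

  # Supports `--consumes=Person`: build an instance from a parsed Hash.
  def self.receive(attrs)
    new(attrs)
  end

  # Hypothetical validation rule: both names must be present.
  def valid?
    [first_name, last_name].none? { |value| value.nil? || value.empty? }
  end

  def to_hash
    { "first_name" => first_name, "last_name" => last_name }
  end

  # Supports `--to=json`: reduce the instance to a basic data structure.
  def to_wire
    to_hash
  end
end

person = Person.receive("first_name" => "John", "last_name" => "Smith")
puts person.valid?
```

The round trip `Person.receive(hash).to_wire` returns an equivalent Hash, which is what allows a serializer to write the record back out.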
|
413
|
+
|
414
|
+
### Logging and Notifications
|
415
|
+
|
416
|
+
Wukong comes with a logger that all processors have access to via
|
417
|
+
their `log` attribute. This logger has the following priorities:
|
418
|
+
|
419
|
+
* debug (can be set as a log level)
|
420
|
+
* info (can be set as a log level)
|
421
|
+
* warn (can be set as a log level)
|
422
|
+
* error
|
423
|
+
* fatal
|
424
|
+
|
425
|
+
and here's a processor which uses them all
|
426
|
+
|
427
|
+
```ruby
|
428
|
+
# in logs.rb
|
429
|
+
Wukong.processor(:logs) do
|
430
|
+
def process line
|
431
|
+
log.debug line
|
432
|
+
log.info line
|
433
|
+
log.warn line
|
434
|
+
log.error line
|
435
|
+
log.fatal line
|
436
|
+
end
|
437
|
+
end
|
438
|
+
```
|
439
|
+
|
440
|
+
The default log level is DEBUG.
|
441
|
+
|
442
|
+
```
|
443
|
+
$ echo event | wu-local logs.rb
|
444
|
+
DEBUG 2013-01-11 23:40:56 [Logs ] -- event
|
445
|
+
INFO 2013-01-11 23:40:56 [Logs ] -- event
|
446
|
+
WARN 2013-01-11 23:40:56 [Logs ] -- event
|
447
|
+
ERROR 2013-01-11 23:40:56 [Logs ] -- event
|
448
|
+
FATAL 2013-01-11 23:40:56 [Logs ] -- event
|
449
|
+
```
|
450
|
+
|
451
|
+
though you can set it to something else globally
|
452
|
+
|
453
|
+
```
|
454
|
+
$ echo event | wu-local logs.rb --log.level=warn
|
455
|
+
WARN 2013-01-11 23:40:56 [Logs ] -- event
|
456
|
+
ERROR 2013-01-11 23:40:56 [Logs ] -- event
|
457
|
+
FATAL 2013-01-11 23:40:56 [Logs ] -- event
|
458
|
+
```
|
459
|
+
|
460
|
+
or on a per-class basis.
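
The effect of a log level can be reproduced with Ruby's standard `Logger`, which uses the same five priorities. This is an analogy rather than Wukong's own logger: the `log_at` helper and the simplified output format are assumptions made for this sketch.

```ruby
require 'logger'
require 'stringio'

# Log one message at every priority and return the lines that survive
# the given level threshold.
def log_at(level, message)
  buffer = StringIO.new
  logger = Logger.new(buffer)
  logger.level = level
  logger.formatter = proc { |severity, _time, _progname, msg| "#{severity} -- #{msg}\n" }

  logger.debug(message)
  logger.info(message)
  logger.warn(message)
  logger.error(message)
  logger.fatal(message)

  buffer.string.lines.map(&:chomp)
end

puts log_at(Logger::WARN, "event")
```

With the threshold at `WARN`, only the warn, error, and fatal messages are written, mirroring the `--log.level=warn` output shown above.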
|
461
|
+
|
462
|
+
### Creating Documentation
|
463
|
+
|
464
|
+
`wu-local` includes a help message:
|
465
|
+
|
466
|
+
```
|
467
|
+
$ wu-local --help
|
468
|
+
usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
|
469
|
+
|
470
|
+
wu-local is a tool for running Wukong processors and flows locally on
|
471
|
+
the command-line. Use wu-local by passing it a processor and feeding
|
472
|
+
...
|
473
|
+
|
474
|
+
|
475
|
+
Params:
|
476
|
+
-r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
|
477
|
+
-t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
|
478
|
+
```
|
479
|
+
|
480
|
+
You can generate custom help messages for your own processors. Here's
|
481
|
+
the percentile processor from before but made more usable with good
|
482
|
+
documentation:
|
116
483
|
|
484
|
+
```ruby
|
485
|
+
# in percentile.rb
|
486
|
+
Wukong.processor(:percentile) do
|
487
|
+
|
488
|
+
description <<-EOF.gsub(/^ {2}/,'')
|
489
|
+
This processor calculates percentiles from input scores based on a
|
490
|
+
given mean score and a given standard deviation for the scores.
|
491
|
+
|
492
|
+
The mean and standard deviation are given at run time and processed
|
493
|
+
scores will be compared against the given mean and standard
|
494
|
+
deviation.
|
495
|
+
|
496
|
+
The input is expected to consist of float values, one per line.
|
497
|
+
|
498
|
+
Example:
|
499
|
+
|
500
|
+
$ cat input.dat
|
501
|
+
88
|
502
|
+
89
|
503
|
+
77
|
504
|
+
...
|
505
|
+
|
506
|
+
$ cat input.dat | wu-local percentile.rb --mean=85 --std_dev=7
|
507
|
+
88.0 66.58824291023753
|
508
|
+
89.0 71.61454169013237
|
509
|
+
77.0 12.654895447355777
|
510
|
+
EOF
|
511
|
+
|
512
|
+
SQRT_1_HALF = Math.sqrt(0.5)
|
513
|
+
|
514
|
+
field :mean, Float, :default => 0.0, :doc => "The mean of the assumed distribution"
|
515
|
+
field :std_dev, Float, :default => 1.0, :doc => "The standard deviation of the assumed distribution"
|
516
|
+
|
517
|
+
def process value
|
518
|
+
observation = value.to_f
|
519
|
+
z_score = (mean - observation) / std_dev
|
520
|
+
percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
|
521
|
+
yield [observation, percentile].join("\t")
|
522
|
+
end
|
523
|
+
end
|
117
524
|
```
|
118
|
-
|
525
|
+
|
526
|
+
If you call `wu-local` with the file to this processor as an argument
|
527
|
+
in addition to the original `--help` argument, you'll get custom
|
528
|
+
documentation.
|
529
|
+
|
119
530
|
```
|
531
|
+
$ wu-local percentile.rb --help
|
532
|
+
usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
|
533
|
+
|
534
|
+
This processor calculates percentiles from input scores based on a
|
535
|
+
given mean score and a given standard deviation for the scores.
|
536
|
+
...
|
120
537
|
|
121
|
-
|
538
|
+
|
539
|
+
Params:
|
540
|
+
--mean=Float The mean of the assumed distribution [Default: 0.0]
|
541
|
+
-r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
|
542
|
+
--std_dev=Float The standard deviation of the assumed distribution [Default: 1.0]
|
543
|
+
-t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
|
544
|
+
|
545
|
+
```
|
122
546
|
|
123
547
|
<a name="flows"></a>
|
124
548
|
## Combining Processors into Dataflows
|
125
549
|
|
126
550
|
Combining processors which each do one thing well together in a chain
|
127
551
|
mimics the tried-and-true UNIX pipeline. Wukong lets you define
|
128
|
-
these pipelines more formally as a dataflow.
|
552
|
+
these pipelines more formally as a dataflow.
|
553
|
+
|
554
|
+
Having written the `tokenizer` processor, we can use it in a dataflow
|
555
|
+
along with the built-in `regexp` processor to replicate what we did in
|
129
556
|
the last example:
|
130
557
|
|
131
558
|
```
|
132
559
|
# in find_t_words.rb
|
560
|
+
require_relative('processors')
|
133
561
|
Wukong.dataflow(:find_t_words) do
|
134
562
|
tokenizer | regexp(match: /^t/)
|
135
563
|
end
|
@@ -148,7 +576,8 @@ times
|
|
148
576
|
...
|
149
577
|
```
|
150
578
|
|
151
|
-
and it works exactly like
|
579
|
+
and it works exactly like manually chaining the two processors
|
580
|
+
together.
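
The `|` syntax inside a dataflow can be understood as ordinary operator overloading. The sketch below is a simplified model, not Wukong's implementation: `Stage`, `Pipeline`, and `build_find_t_words` are hypothetical names, and each stage processes its whole input as a batch, whereas a real runner would stream records.

```ruby
# One step in a pipeline. The block receives a record and an `emit` callable.
class Stage
  def initialize(&step)
    @step = step
  end

  def run(records)
    output = []
    emit = ->(record) { output << record }
    records.each { |record| @step.call(record, emit) }
    output
  end

  # `a | b` builds a pipeline that sends a's output to b.
  def |(downstream)
    Pipeline.new(self, downstream)
  end
end

class Pipeline < Stage
  def initialize(upstream, downstream)
    @upstream = upstream
    @downstream = downstream
  end

  def run(records)
    @downstream.run(@upstream.run(records))
  end
end

def build_find_t_words
  tokenizer = Stage.new { |line, emit| line.split.each { |token| emit.call(token) } }
  regexp    = Stage.new { |token, emit| emit.call(token) if token =~ /^t/ }
  tokenizer | regexp
end

p build_find_t_words.run(["It was the best of times"])
```

Because a `Pipeline` is itself a `Stage`, longer chains such as `a | b | c` compose without any extra code.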
|
152
581
|
|
153
582
|
<a name="serialization"></a>
|
154
583
|
## Serialization
|
@@ -163,7 +592,14 @@ yield a String argument (or something that will `to_s` appropriately).
|
|
163
592
|
## Widgets
|
164
593
|
|
165
594
|
Wukong has a number of built-in widgets that are useful for
|
166
|
-
scaffolding your dataflows
|
595
|
+
scaffolding your dataflows or using as starting points for your
|
596
|
+
own processors.
|
597
|
+
|
598
|
+
For any of these widgets you can get customized help. For example:
|
599
|
+
|
600
|
+
```
|
601
|
+
$ wu-local group --help
|
602
|
+
```
|
167
603
|
|
168
604
|
### Serializers
|
169
605
|
|
@@ -350,10 +786,10 @@ describe :tokenizer do
|
|
350
786
|
processor.given("Hi there.\nMy name is Wukong!").should emit(6).records
|
351
787
|
end
|
352
788
|
it "eliminates all punctuation" do
|
353
|
-
processor.given("Never!").
|
789
|
+
processor(:tokenizer).given("Never!").should emit('Never')
|
354
790
|
end
|
355
|
-
it "
|
356
|
-
processor.given("
|
791
|
+
it "will not emit tokens in a stop list" do
|
792
|
+
processor(:tokenizer, :stop_list => ['apples', 'bananas']).given("I like apples and bananas").should emit('I', 'like', 'and')
|
357
793
|
end
|
358
794
|
end
|
359
795
|
```
|
@@ -364,8 +800,13 @@ Let's look at each kind of helper:
|
|
364
800
|
`it_behaves_like` helper) adds some tests that ensure that the
|
365
801
|
processor conforms to the API of a Wukong::Processor.
|
366
802
|
|
367
|
-
* The `processor` method
|
368
|
-
|
803
|
+
* The `processor` method is actually an alias for the more aptly named
|
804
|
+
(but less convenient) `unit_test_runner`. This method accepts a
|
805
|
+
processor name and options (just like `wu-local` and other
|
806
|
+
command-line tools) and returns a Wukong::UnitTestRunner instance.
|
807
|
+
This runner handles the
|
808
|
+
|
809
|
+
|
369
810
|
a (registered) processor name and options and creates a new
|
370
811
|
processor. If no name is given, the argument of the enclosing
|
371
812
|
`describe` or `context` block is used. The object returned by
|
@@ -374,29 +815,38 @@ Let's look at each kind of helper:
|
|
374
815
|
behavior.
|
375
816
|
|
376
817
|
* The `given` method (and other helpers like `given_json`,
|
377
|
-
`given_tsv`, &c.) is
|
378
|
-
|
379
|
-
|
380
|
-
|
381
|
-
lifecycle as in the prior example.
|
818
|
+
`given_tsv`, &c.) is a method on the runner. It's a way of lazily
|
819
|
+
feeding records to a processor, without having to go through the
|
820
|
+
`process` method directly and having to handle the block or the
|
821
|
+
processor's lifecycle as in the prior example.
|
382
822
|
|
383
823
|
* The `output` and `emit` matchers will `process` all previously
|
384
824
|
`given` records when they are called. This lets you separate
|
385
825
|
instantiation, input, expectations, and output. Here's a more
|
386
|
-
complicated example
|
826
|
+
complicated example.
|
387
827
|
|
388
828
|
The same helpers can be used to test dataflows as well as
|
389
|
-
processors.
|
390
|
-
|
829
|
+
processors.
|
830
|
+
|
831
|
+
####
|
832
|
+
|
833
|
+
#### Functions vs. Objects
|
834
|
+
|
835
|
+
The above test helpers are designed to aid in testing processors
|
836
|
+
functionally because:
|
837
|
+
|
838
|
+
* they accept the
|
391
839
|
|
392
840
|
### Integration Tests
|
393
841
|
|
394
|
-
|
395
|
-
|
396
|
-
|
842
|
+
If you are implementing a new Wukong command (akin to `wu-local`) then
|
843
|
+
you may also want to run integration tests. Wukong comes with helpers
|
844
|
+
for these, too.
|
397
845
|
|
398
|
-
|
399
|
-
|
846
|
+
You should almost always be able to test your processors without
|
847
|
+
integration tests. Your unit tests and the Wukong framework itself
|
848
|
+
should ensure that your processors work correctly no matter what
|
849
|
+
environment they are deployed in.
|
400
850
|
|
401
851
|
```ruby
|
402
852
|
# spec/integration/tokenizer_spec.rb
|
@@ -415,7 +865,7 @@ context "interpreting its arguments" do
|
|
415
865
|
end
|
416
866
|
context "with a malformed --match argument" do
|
417
867
|
# invalid b/c the regexp is broken...
|
418
|
-
subject { command("wu-local tokenizer --match='^
|
868
|
+
subject { command("wu-local tokenizer --match='^(h'") < "hi there" }
|
419
869
|
it { should exit_with(:non_zero) }
|
420
870
|
it { should have_stderr(/invalid/) }
|
421
871
|
end
|
@@ -457,3 +907,192 @@ Let's go through the helpers:
|
|
457
907
|
* The `have_stdout` and `have_stderr` matchers let you test the STDOUT or STDERR of the command for particular strings or regular expressions.
|
458
908
|
|
459
909
|
* The `exit_with` matcher lets you test the exit code of the command. You can pass the symbol `:non_zero` to set the expectation of _any_ non-zero exit code.
|
910
|
+
|
911
|
+
## Plugins
|
912
|
+
|
913
|
+
Wukong has a built-in plugin framework to make it easy to adapt Wukong
|
914
|
+
processors to new backends or add other functionality. The
|
915
|
+
`Wukong::Local` module, which supports the `wu-local` program, is
|
916
|
+
itself a Wukong plugin.
|
917
|
+
|
918
|
+
The following shows how you might build a simplified version of
|
919
|
+
`Wukong::Local` as a new plugin. We'll call this plugin `Cat` as it
|
920
|
+
will implement a program `wu-cat` that is similar in function to
|
921
|
+
`wu-local` (just simplified).
|
922
|
+
|
923
|
+
The first thing to do is include the `Wukong::Plugin` module in your
|
924
|
+
code:
|
925
|
+
|
926
|
+
|
927
|
+
```ruby
|
928
|
+
# in lib/cat.rb
|
929
|
+
#
|
930
|
+
# This Wukong plugin works like wu-local but replicates some silly
|
931
|
+
# features of cat like numbered lines.
|
932
|
+
module Cat
|
933
|
+
|
934
|
+
# This registers Cat as a Wukong plugin.
|
935
|
+
include Wukong::Plugin
|
936
|
+
|
937
|
+
# Defines any settings specific to Cat. Cat doesn't need to, but
|
938
|
+
# you can define global settings here if you want. You can also
|
939
|
+
# check the `program` name to decide whether to apply your settings.
|
940
|
+
# This helps you not pollute other commands with your stuff.
|
941
|
+
def self.configure settings, program
|
942
|
+
case program
|
943
|
+
when 'wu-cat'
|
944
|
+
settings.define(:input, :description => "The input file to use")
|
945
|
+
settings.define(:number, :description => "Prepend each input record with a consecutive number", :type => :boolean)
|
946
|
+
else
|
947
|
+
# configure other programs if you need to
|
948
|
+
end
|
949
|
+
end
|
950
|
+
|
951
|
+
# Lets Cat boot up with settings that have already been resolved
|
952
|
+
# from the command-line or other sources like config files or remote
|
953
|
+
# servers added by other plugins.
|
954
|
+
#
|
955
|
+
# The `root` directory in which the program is executing is also
|
956
|
+
# provided.
|
957
|
+
def self.boot settings, root
|
958
|
+
puts "Cat booting up using resolved settings within directory #{root}"
|
959
|
+
end
|
960
|
+
end
|
961
|
+
```
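
The mechanism by which `include Wukong::Plugin` could register a module can be sketched with Ruby's `included` hook. Everything below is a hypothetical model for illustration (`MiniPlugin`, `MiniCat`, `boot_all`, and the use of a plain Hash for settings); the real `Wukong::Plugin` may differ in its details.

```ruby
module MiniPlugin
  REGISTRY = []

  # Ruby calls this hook whenever another module includes MiniPlugin.
  def self.included(mod)
    REGISTRY << mod unless REGISTRY.include?(mod)
  end

  # Configure every registered plugin first, then boot each one.
  def self.boot_all(settings, program, root)
    REGISTRY.each do |plugin|
      plugin.configure(settings, program) if plugin.respond_to?(:configure)
    end
    REGISTRY.each do |plugin|
      plugin.boot(settings, root) if plugin.respond_to?(:boot)
    end
    settings
  end
end

module MiniCat
  include MiniPlugin

  def self.configure(settings, program)
    # Only touch the settings when running under our own program name.
    settings[:number] = false if program == 'wu-cat'
  end

  def self.boot(settings, root)
    settings[:booted_in] = root
  end
end

p MiniPlugin.boot_all({}, 'wu-cat', '/tmp')
```

Running all `configure` calls before any `boot` call matches the description above: by the time a plugin boots, the settings contributed by every plugin have already been defined.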
|
962
|
+
|
963
|
+
If your plugin doesn't interact directly with the command-line
(through a wu-tool like `wu-local` or `wu-hadoop`) and doesn't itself
pass records to processors, then you can just require the rest of
your plugin's code at this point and be done.

### Write a Runner to interact with the command-line

If you need to implement a new command-line tool then you should write
a Runner. A Runner is used to implement Wukong programs like
`wu-local` or `wu-hadoop`. Here's what the actual program file would
look like for our example plugin's `wu-cat` program.

```ruby
#!/usr/bin/env ruby
# in bin/wu-cat
require 'cat'
Cat::Runner.run
```

The `Cat::Runner` class is implemented separately.

```ruby
# in lib/cat/runner.rb
require_relative('driver')
module Cat

  # Implements the `wu-cat` command.
  class Runner < Wukong::Runner

    usage "PROCESSOR|FLOW"

    description <<-EOF

wu-cat lets you run a Wukong processor or dataflow on the
command-line. Try it like this.

  $ wu-cat --input=data.txt
  hello
  my
  friend

Connect the output to a processor in upcaser.rb

  $ wu-cat --input=data.txt upcaser.rb
  HELLO
  MY
  FRIEND

You can also add line numbers to the output.

  $ wu-cat --number --input=data.txt upcaser.rb
  1 HELLO
  2 MY
  3 FRIEND
EOF

    # The name of the processor we're going to run. The #args method
    # is provided by the Runner class.
    def processor_name
      args.first
    end

    # Validate that we were given the name of a registered processor
    # to run. Be careful to return true here or validation will fail.
    def validate
      raise Wukong::Error.new("Must provide a processor as the first argument") unless processor_name
      true
    end

    # Delegates to a driver class to run the processor.
    def run
      Driver.new(processor_name, settings).start
    end

  end
end
```

### Write a Driver to interact with processors

The `Cat::Runner#run` method delegates to the `Cat::Driver` class to
handle instantiating and interacting with processors.

```ruby
# in lib/cat/driver.rb
module Cat

  # A class for driving a processor from `wu-cat`.
  class Driver

    # Lets us count the records.
    attr_accessor :number

    # Gives methods to construct and interact with dataflows.
    include Wukong::DriverMethods

    # Create a new Driver for a dataflow with the given `label` using
    # the given `settings`.
    #
    # @param [String] label the name of the dataflow
    # @param [Configliere::Param] settings the settings to use when creating the dataflow
    def initialize label, settings
      self.settings = settings
      self.dataflow = construct_dataflow(label, settings)
      self.number = 1
    end

    # The file handle of the input file.
    #
    # @return [File]
    def input_file
      @input_file ||= File.new(settings[:input])
    end

    # Starts feeding records to the processor
    def start
      while line = input_file.readline rescue nil
        driver.send_through_dataflow(line)
      end
    end

    # Process each record that comes back from the dataflow.
    #
    # @param [Object] record the yielded record
    def process record
      if settings[:number]
        puts [number, record].map(&:to_s).join("\t")
      else
        puts record.to_s
      end
      self.number += 1
    end

  end
end
```
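The read loop and numbering logic of `Cat::Driver` can be tried out without Wukong installed. The sketch below is a simplified stand-in, not the real driver: `CatDriverSketch` is a hypothetical class, a plain lambda plays the role of the dataflow, a `StringIO` replaces the input file, and records are collected into an array (and chomped) rather than printed as they arrive. It also uses `gets`, which returns `nil` at end of file, instead of `readline rescue nil`. With `--number` style output switched on, it produces `1`, `2`, `3` joined by a tab to `HELLO`, `MY`, `FRIEND`.

```ruby
require 'stringio'

# Hypothetical stand-in for Cat::Driver. No Wukong classes are used;
# the `transform` callable is an assumption standing in for a dataflow.
class CatDriverSketch
  attr_reader :number, :output

  def initialize(input, transform, numbered)
    @input     = input      # any IO-like object responding to #gets
    @transform = transform  # callable standing in for the dataflow
    @numbered  = numbered   # mirrors settings[:number]
    @number    = 1
    @output    = []
  end

  # Reads lines until EOF and hands each one to the stand-in dataflow.
  def start
    while (line = @input.gets)
      process(@transform.call(line.chomp))
    end
    self
  end

  # Same formatting rule as Cat::Driver#process, but collects the
  # formatted records instead of printing them.
  def process(record)
    @output << (@numbered ? [number, record].map(&:to_s).join("\t") : record.to_s)
    @number += 1
  end
end

input  = StringIO.new("hello\nmy\nfriend\n")
upcase = ->(line) { line.upcase }
driver = CatDriverSketch.new(input, upcase, true).start

puts driver.output
```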