wukong 3.0.0.pre3 → 3.0.0
- data/Gemfile +1 -0
- data/README.md +689 -50
- data/bin/wu-local +1 -74
- data/diagrams/wu_local.dot +39 -0
- data/diagrams/wu_local.dot.png +0 -0
- data/examples/loadable.rb +2 -0
- data/examples/string_reverser.rb +7 -0
- data/lib/hanuman/stage.rb +2 -2
- data/lib/wukong.rb +21 -10
- data/lib/wukong/dataflow.rb +2 -5
- data/lib/wukong/doc_helpers.rb +14 -0
- data/lib/wukong/doc_helpers/dataflow_handler.rb +29 -0
- data/lib/wukong/doc_helpers/field_handler.rb +91 -0
- data/lib/wukong/doc_helpers/processor_handler.rb +29 -0
- data/lib/wukong/driver.rb +11 -1
- data/lib/wukong/local.rb +40 -0
- data/lib/wukong/local/event_machine_driver.rb +27 -0
- data/lib/wukong/local/runner.rb +98 -0
- data/lib/wukong/local/stdio_driver.rb +44 -0
- data/lib/wukong/local/tcp_driver.rb +47 -0
- data/lib/wukong/logger.rb +16 -7
- data/lib/wukong/plugin.rb +48 -0
- data/lib/wukong/processor.rb +57 -15
- data/lib/wukong/rake_helper.rb +6 -0
- data/lib/wukong/runner.rb +151 -128
- data/lib/wukong/runner/boot_sequence.rb +123 -0
- data/lib/wukong/runner/code_loader.rb +52 -0
- data/lib/wukong/runner/deploy_pack_loader.rb +75 -0
- data/lib/wukong/runner/help_message.rb +42 -0
- data/lib/wukong/spec_helpers.rb +4 -12
- data/lib/wukong/spec_helpers/integration_tests.rb +150 -0
- data/lib/wukong/spec_helpers/{integration_driver_matchers.rb → integration_tests/integration_test_matchers.rb} +28 -62
- data/lib/wukong/spec_helpers/integration_tests/integration_test_runner.rb +97 -0
- data/lib/wukong/spec_helpers/shared_examples.rb +19 -10
- data/lib/wukong/spec_helpers/unit_tests.rb +134 -0
- data/lib/wukong/spec_helpers/{processor_methods.rb → unit_tests/unit_test_driver.rb} +42 -8
- data/lib/wukong/spec_helpers/{spec_driver_matchers.rb → unit_tests/unit_test_matchers.rb} +6 -32
- data/lib/wukong/spec_helpers/unit_tests/unit_test_runner.rb +54 -0
- data/lib/wukong/version.rb +1 -1
- data/lib/wukong/widget/filters.rb +134 -8
- data/lib/wukong/widget/processors.rb +64 -5
- data/lib/wukong/widget/reducers/bin.rb +68 -18
- data/lib/wukong/widget/reducers/count.rb +12 -0
- data/lib/wukong/widget/reducers/group.rb +48 -5
- data/lib/wukong/widget/reducers/group_concat.rb +30 -2
- data/lib/wukong/widget/reducers/moments.rb +4 -4
- data/lib/wukong/widget/reducers/sort.rb +53 -3
- data/lib/wukong/widget/serializers.rb +37 -12
- data/lib/wukong/widget/utils.rb +1 -1
- data/spec/spec_helper.rb +20 -2
- data/spec/wukong/driver_spec.rb +2 -0
- data/spec/wukong/local/runner_spec.rb +40 -0
- data/spec/wukong/local_spec.rb +6 -0
- data/spec/wukong/logger_spec.rb +49 -0
- data/spec/wukong/processor_spec.rb +22 -0
- data/spec/wukong/runner_spec.rb +128 -8
- data/spec/wukong/widget/filters_spec.rb +28 -10
- data/spec/wukong/widget/processors_spec.rb +5 -5
- data/spec/wukong/widget/reducers/bin_spec.rb +14 -14
- data/spec/wukong/widget/reducers/count_spec.rb +1 -1
- data/spec/wukong/widget/reducers/group_spec.rb +7 -6
- data/spec/wukong/widget/reducers/moments_spec.rb +2 -2
- data/spec/wukong/widget/reducers/sort_spec.rb +1 -1
- data/spec/wukong/widget/serializers_spec.rb +84 -88
- data/spec/wukong/wu-local_spec.rb +109 -0
- metadata +43 -20
- data/bin/wu-server +0 -70
- data/lib/wukong/boot.rb +0 -96
- data/lib/wukong/configuration.rb +0 -8
- data/lib/wukong/emitter.rb +0 -22
- data/lib/wukong/server.rb +0 -119
- data/lib/wukong/spec_helpers/integration_driver.rb +0 -157
- data/lib/wukong/spec_helpers/processor_helpers.rb +0 -89
- data/lib/wukong/spec_helpers/spec_driver.rb +0 -28
- data/spec/wukong/local_runner_spec.rb +0 -31
- data/spec/wukong/wu_local_spec.rb +0 -125
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -19,6 +19,8 @@ Here is a list of various other projects which you may also want to
|
|
19
19
|
peruse when trying to understand the full Wukong experience:
|
20
20
|
|
21
21
|
* <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
|
22
|
+
* <a href="http://github.com/infochimps-labs/wukong-storm">wukong-storm</a>: Run Wukong processors within the Storm framework. Model flows locally before you run them.
|
23
|
+
* <a href="http://github.com/infochimps-labs/wukong-load">wukong-load</a>: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
|
22
24
|
* <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
|
23
25
|
* <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
|
24
26
|
|
@@ -36,7 +38,7 @@ processor is Ruby class which
|
|
36
38
|
* subclasses `Wukong::Processor` (use the `Wukong.processor` method as sugar for this)
|
37
39
|
* defines a `process` method which takes an input record, does something, and calls `yield` on the output
|
38
40
|
|
39
|
-
Here's a processor that reverses
|
41
|
+
Here's a processor that reverses each of its input records:
|
40
42
|
|
41
43
|
```ruby
|
42
44
|
# in string_reverser.rb
|
@@ -47,8 +49,8 @@ Wukong.processor(:string_reverser) do
|
|
47
49
|
end
|
48
50
|
```
|
49
51
|
|
50
|
-
|
51
|
-
|
52
|
+
You can run this processor on the command line using text files as
|
53
|
+
input using the `wu-local` tool that comes with Wukong:
|
52
54
|
|
53
55
|
```
|
54
56
|
$ cat novel.txt
|
@@ -59,35 +61,46 @@ $ cat novel.txt | wu-local string_reverser.rb
|
|
59
61
|
.semit fo tsrow eht saw ti ,semit fo tseb eht saw tI
|
60
62
|
```
|
61
63
|
|
62
|
-
|
63
|
-
|
64
|
+
The `wu-local` program consumes one line at a time from STDIN and
|
65
|
+
calls your processor's `process` method with that line as a Ruby
|
66
|
+
String object. Each object you `yield` within your process method
|
67
|
+
will be printed back out on STDOUT.
|
68
|
+
|
69
|
+
### Multiple Processors, Multiple (Or No) Yields
|
70
|
+
|
71
|
+
Processors are intended to be combined so they can be stored in the
|
72
|
+
same file like these two, related processors:
|
64
73
|
|
65
74
|
```ruby
|
66
75
|
# in processors.rb
|
67
76
|
|
68
|
-
Wukong.processor(:
|
77
|
+
Wukong.processor(:splitter) do
|
69
78
|
def process line
|
70
79
|
line.split.each { |token| yield token }
|
71
80
|
end
|
72
81
|
end
|
73
82
|
|
74
|
-
Wukong.processor(:
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
def process word
|
79
|
-
yield word if word =~ Regexp.new("^#{letter}", true)
|
83
|
+
Wukong.processor(:normalizer) do
|
84
|
+
def process token
|
85
|
+
stripped = token.downcase.gsub(/\W/,'')
|
86
|
+
yield stripped if stripped.size > 0
|
80
87
|
end
|
81
88
|
end
|
82
89
|
```
|
83
90
|
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
91
|
+
Notice how the `splitter` yields multiple tokens for each of its input
|
92
|
+
lines and that the `normalizer` may sometimes never yield at all,
|
93
|
+
depending on its input. Processors are under no obligation from the
|
94
|
+
framework to yield or return anything so they can easily act as
|
95
|
+
filters or even sinks in data flows.
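
To see the "many yields" and "no yields" behaviour without installing anything, here is a minimal plain-Ruby sketch. It does not use Wukong: `split_tokens`, `normalize_token` and `split_and_normalize` are hypothetical method names that only mirror the bodies of the two `process` methods above.

```ruby
# Plain-Ruby stand-ins for the two processors; no Wukong required.
def split_tokens(line)
  line.split.each { |token| yield token }
end

def normalize_token(token)
  stripped = token.downcase.gsub(/\W/, '')
  yield stripped if stripped.size > 0
end

# Hypothetical helper: feeds every token from the splitter into the normalizer.
def split_and_normalize(line)
  results = []
  split_tokens(line) do |token|
    normalize_token(token) { |normalized| results << normalized }
  end
  results
end

p split_and_normalize("It was the best -- of times,")
# → ["it", "was", "the", "best", "of", "times"]
```

The `--` token is split out by the first method but never yielded by the second, which is the filter-like behaviour described above.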
|
96
|
+
|
97
|
+
There are two processors in this file and neither shares a name with
|
98
|
+
the basename of the file ("processors") so `wu-local` can't
|
99
|
+
automatically choose a processor to run. We can specify one
|
100
|
+
explicitly with the `--run` option:
|
88
101
|
|
89
102
|
```
|
90
|
-
$ cat novel.txt | wu-local processors.rb --run=
|
103
|
+
$ cat novel.txt | wu-local processors.rb --run=splitter
|
91
104
|
It
|
92
105
|
was
|
93
106
|
the
|
@@ -97,39 +110,454 @@ times,
|
|
97
110
|
...
|
98
111
|
```
|
99
112
|
|
100
|
-
|
101
|
-
shell. Let's add the `starts_with` filter and also pass in the
|
102
|
-
*field* `letter`, defined in that processor:
|
113
|
+
We can combine the two processors together
|
103
114
|
|
104
115
|
```
|
105
|
-
$ cat novel.txt | wu-local processors.rb --run=
|
106
|
-
|
107
|
-
|
116
|
+
$ cat novel.txt | wu-local processors.rb --run=splitter | wu-local processors.rb --run=normalizer
|
117
|
+
it
|
118
|
+
was
|
108
119
|
the
|
120
|
+
best
|
121
|
+
of
|
109
122
|
times
|
110
123
|
...
|
111
124
|
```
|
112
125
|
|
113
|
-
|
114
|
-
|
115
|
-
|
126
|
+
but there's an easier way of doing this with <a href="#flows">dataflows</a>.
|
127
|
+
|
128
|
+
### Adding Configurable Options
|
129
|
+
|
130
|
+
Processors can have options that can be set in Ruby code, from the
|
131
|
+
command-line, a configuration file, or a variety of other places
|
132
|
+
thanks to [Configliere](http://github.com/infochimps-labs/configliere).
|
133
|
+
|
134
|
+
This processor calculates percentiles from observations assuming a
|
135
|
+
normal distribution given a particular mean and standard deviation.
|
136
|
+
It uses two *fields*, the mean or average of a distribution (`mean`)
|
137
|
+
and its standard deviation (`std_dev`). From this information, it
|
138
|
+
will measure the percentile of all input values.
|
139
|
+
|
140
|
+
```ruby
|
141
|
+
# in percentile.rb
|
142
|
+
Wukong.processor(:percentile) do
|
143
|
+
|
144
|
+
SQRT_1_HALF = Math.sqrt(0.5)
|
145
|
+
|
146
|
+
field :mean, Float, :default => 0.0
|
147
|
+
field :std_dev, Float, :default => 1.0
|
148
|
+
|
149
|
+
def process value
|
150
|
+
observation = value.to_f
|
151
|
+
z_score = (mean - observation) / std_dev
|
152
|
+
percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
|
153
|
+
yield [observation, percentile].join("\t")
|
154
|
+
end
|
155
|
+
end
|
156
|
+
```
|
157
|
+
|
158
|
+
These fields have default values but you can override them on the
|
159
|
+
command line. If you scored a 95 on an exam where the mean score was
|
160
|
+
80 points and the standard deviation of the scores was 10 points, for
|
161
|
+
example, then you'd be in the 93rd percentile:
|
162
|
+
|
163
|
+
```
|
164
|
+
$ echo 95 | wu-local /tmp/percentile.rb --mean=80 --std_dev=10
|
165
|
+
95.0 93.3192798731142
|
166
|
+
```
|
167
|
+
|
168
|
+
If the exam were more difficult, with a mean of 75 points and a
|
169
|
+
standard deviation of 8 points, you'd be in the 99th percentile!
|
170
|
+
|
171
|
+
```
|
172
|
+
$ echo 95 | wu-local /tmp/percentile.rb --mean=75 --std_dev=8
|
173
|
+
95.0 99.37903346742239
|
174
|
+
```
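
The arithmetic behind both results can be checked in plain Ruby, outside of Wukong. This is only a sketch: `normal_percentile` is a hypothetical helper that repeats the formula from the `process` method above.

```ruby
SQRT_1_HALF = Math.sqrt(0.5)

# Hypothetical helper mirroring the body of `process` in percentile.rb.
def normal_percentile(observation, mean, std_dev)
  z_score = (mean - observation) / std_dev
  50 * Math.erfc(z_score * SQRT_1_HALF)
end

puts normal_percentile(95.0, 80.0, 10.0).round(2)  # → 93.32
puts normal_percentile(95.0, 75.0, 8.0).round(2)   # → 99.38
```

The arguments are Floats on purpose: with Integer arguments Ruby would perform integer division when computing the z-score, which is why the processor calls `to_f` on its input and declares its fields as `Float`.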
|
175
|
+
|
176
|
+
### The Lifecycle of a Processor
|
177
|
+
|
178
|
+
Processors have a lifecycle that they execute when they are run within
|
179
|
+
the context of a Wukong runner like `wu-local` or `wu-hadoop`. Each
|
180
|
+
lifecycle phase corresponds to a method of the processor that is
|
181
|
+
called:
|
182
|
+
|
183
|
+
* `setup` called *after* the Processor is initialized but *before* the first record is processed. You cannot yield from this method.
|
184
|
+
* `process` called once for each input record, may yield once, many, or no times.
|
185
|
+
* `finalize` called after the *last* record has been processed but while the processor still has an opportunity to yield records.
|
186
|
+
* `stop` called to signal to the processor that all work should stop, open connections should be closed, &c. You cannot yield from this method.
|
187
|
+
|
188
|
+
The above examples have already focused on the `process` method.
|
189
|
+
|
190
|
+
The `setup` and `stop` methods are often used together to handle
|
191
|
+
external connections
|
192
|
+
|
193
|
+
```ruby
|
194
|
+
# in geolocator.rb
|
195
|
+
Wukong.processor(:geolocator) do
|
196
|
+
field :host, String, :default => 'localhost'
|
197
|
+
attr_accessor :connection
|
198
|
+
|
199
|
+
def setup
|
200
|
+
self.connection = Database::Connection.new(host)
|
201
|
+
end
|
202
|
+
def process record
|
203
|
+
record.added_value = connection.find("...some query...")
|
204
|
+
end
|
205
|
+
def stop
|
206
|
+
self.connection.close
|
207
|
+
end
|
208
|
+
end
|
209
|
+
```
|
210
|
+
|
211
|
+
The `finalize` method is most useful when writing a "reduce"-type
|
212
|
+
operation that involves storing or aggregating information till some
|
213
|
+
criterion is met. It will always be called after the last record has
|
214
|
+
been given (to `process`) but you can call it whenever you want to
|
215
|
+
within your own code.
|
216
|
+
|
217
|
+
Here's an example of using the `finalize` method to implement a simple
|
218
|
+
counter that counts all the input records:
|
219
|
+
|
220
|
+
```ruby
|
221
|
+
# in counter.rb
|
222
|
+
Wukong.processor(:counter) do
|
223
|
+
attr_accessor :count
|
224
|
+
def setup
|
225
|
+
self.count = 0
|
226
|
+
end
|
227
|
+
def process thing
|
228
|
+
self.count += 1
|
229
|
+
end
|
230
|
+
def finalize
|
231
|
+
yield count
|
232
|
+
end
|
233
|
+
end
|
234
|
+
```
|
235
|
+
|
236
|
+
It hinges on the fact that the last input record will be passed to
|
237
|
+
`process` *first* and only then will `finalize` be called. This
|
238
|
+
allows the last input record to be counted/processed/aggregated and
|
239
|
+
then the entire aggregate to be dealt with in finalize.
|
240
|
+
|
241
|
+
Because of this emphasis on building and processing aggregates, the
|
242
|
+
`finalize` method is often useful within processors meant to run as
|
243
|
+
reducers in a Hadoop environment.
|
244
|
+
|
245
|
+
Note: `finalize` is not guaranteed to be called in every possible
|
246
|
+
environment as it depends on the chosen runner. In a local or Hadoop
|
247
|
+
environment, the notion of "last record" makes sense and so the
|
248
|
+
corresponding runners will call `finalize`. In an environment like
|
249
|
+
Storm, where the concept of last record is not (supposed to be)
|
250
|
+
meaningful, the corresponding runner doesn't ever call it.
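
The order of the lifecycle phases can be imitated in plain Ruby. The sketch below makes no claim about how Wukong's runners are implemented: `LifecycleCounter` and `run_lifecycle` are hypothetical names, and the stand-in runner simply calls the four methods in the documented order.

```ruby
# A plain-Ruby counter with the same four lifecycle methods; no Wukong required.
class LifecycleCounter
  attr_accessor :count

  def setup
    self.count = 0
  end

  def process(_record)
    self.count += 1
  end

  def finalize
    yield count
  end

  def stop
    # nothing to clean up in this sketch
  end
end

# Hypothetical stand-in for a runner: calls the phases in the documented order.
def run_lifecycle(processor, records)
  output = []
  processor.setup
  records.each { |record| processor.process(record) { |out| output << out } }
  processor.finalize { |out| output << out }
  processor.stop
  output
end

p run_lifecycle(LifecycleCounter.new, %w[a b c])  # → [3]
```

A runner for a streaming environment would, in this picture, simply never reach the `finalize` call, so a counter written this way would never emit anything there.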
|
251
|
+
|
252
|
+
### Serialization
|
253
|
+
|
254
|
+
`wu-local` (and many similar tools) deal with inputs and outputs as
|
255
|
+
strings.
|
256
|
+
|
257
|
+
Processors want to process objects as close to their domain as is
|
258
|
+
possible. A processor which decorates address book entries with
|
259
|
+
Twitter handles doesn't want to think of its inputs as Strings but
|
260
|
+
Hashes or, better yet, Persons.
|
261
|
+
|
262
|
+
Wukong makes it easy to wrap a processor with other processors
|
263
|
+
dedicated to handling the common tasks of parsing records into or out
|
264
|
+
of formats like JSON and turning them into Ruby model instances.
|
265
|
+
|
266
|
+
#### De-serializing data formats like JSON or TSV
|
267
|
+
|
268
|
+
Wukong can parse and emit common data formats like JSON and delimited
|
269
|
+
formats like TSV or CSV so that you don't pollute or tie down your own
|
270
|
+
processors with protocol logic.
|
271
|
+
|
272
|
+
Here's an example of a processor that wants to deal with Hashes as
|
273
|
+
input.
|
274
|
+
|
275
|
+
```ruby
|
276
|
+
# in extractor.rb
|
277
|
+
Wukong.processor(:extractor) do
|
278
|
+
def process hsh
|
279
|
+
yield hsh["first_name"]
|
280
|
+
end
|
281
|
+
end
|
282
|
+
```
|
283
|
+
|
284
|
+
Given JSON data,
|
285
|
+
|
286
|
+
```
|
287
|
+
$ cat input.json
|
288
|
+
{"first_name": "John", "last_name": "Smith"}
|
289
|
+
{"first_name": "Sally", "last_name": "Johnson"}
|
290
|
+
...
|
291
|
+
```
|
292
|
+
|
293
|
+
you can feed it directly to a processor
|
294
|
+
|
295
|
+
```
|
296
|
+
$ cat input.json | wu-local --from=json extractor
|
297
|
+
John
|
298
|
+
Sally
|
299
|
+
...
|
300
|
+
```
|
301
|
+
|
302
|
+
Other processors really like Arrays:
|
303
|
+
|
304
|
+
```ruby
|
305
|
+
Wukong.processor(:summer) do
|
306
|
+
def process values
|
307
|
+
yield values.map(&:to_f).inject(0.0) { |sum, summand| sum + summand }
|
308
|
+
end
|
309
|
+
end
|
310
|
+
```
|
311
|
+
|
312
|
+
so you can feed them TSV data
|
313
|
+
```
|
314
|
+
$ cat data.tsv
|
315
|
+
1 2 3
|
316
|
+
4 5 6
|
317
|
+
7 8 9
|
318
|
+
...
|
319
|
+
$ cat data.tsv | wu-local --from=tsv summer
|
320
|
+
6
|
321
|
+
15
|
322
|
+
24
|
323
|
+
...
|
324
|
+
```
|
325
|
+
|
326
|
+
but you can just as easily use the same code with CSV data
|
327
|
+
|
328
|
+
```
|
329
|
+
$ cat data.tsv | wu-local --from=csv summer
|
330
|
+
```
|
331
|
+
|
332
|
+
or a more general delimited format.
|
333
|
+
|
334
|
+
```
|
335
|
+
$ cat data.tsv | wu-local --from=delimited --delimiter='--' summer
|
336
|
+
```
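
Conceptually, each of these `--from` variants turns a line of text into an Array before the processor sees it. The following plain-Ruby sketch shows that idea only; it is not Wukong's deserializer, `sum_delimited` is a hypothetical helper, and the naive `split` ignores quoting rules that a real CSV parser would honour.

```ruby
# Hypothetical helper: split one delimited line into fields and sum them,
# as the `summer` processor would do with an already-parsed Array.
def sum_delimited(line, delimiter)
  line.chomp.split(delimiter).map(&:to_f).inject(0.0) { |sum, summand| sum + summand }
end

puts sum_delimited("1\t2\t3", "\t")   # tab-separated
puts sum_delimited("4,5,6", ",")      # comma-separated
puts sum_delimited("7--8--9", "--")   # custom delimiter
```

The summing code never changes; only the delimiter does, which is the separation of protocol logic from processor logic that this section describes.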
|
337
|
+
|
338
|
+
#### Recordizing data structures into domain models
|
339
|
+
|
340
|
+
Here's a contact validator that relies on a Person model to decide
|
341
|
+
whether a contact entry should be yielded:
|
342
|
+
|
343
|
+
```ruby
|
344
|
+
# in contact_validator.rb
|
345
|
+
require 'person'
|
346
|
+
|
347
|
+
Wukong.processor(:contact_validator) do
|
348
|
+
def process person
|
349
|
+
yield person if person.valid?
|
350
|
+
end
|
351
|
+
end
|
352
|
+
```
|
353
|
+
|
354
|
+
Relying on the (elsewhere-defined) Person model to define `valid?`
|
355
|
+
means the processor can stay skinny and readable. Wukong can, in
|
356
|
+
combination with the deserializing features above, turn input text
|
357
|
+
into instances of Person:
|
358
|
+
|
359
|
+
```
|
360
|
+
$ cat input.json | wu-local --consumes=Person --from=json contact_validator
|
361
|
+
#<Person:0x000000020e6120>
|
362
|
+
#<Person:0x000000020e6120>
|
363
|
+
#<Person:0x000000020e6120>
|
364
|
+
```
|
365
|
+
|
366
|
+
`wu-local` can also serialize records from the `contact_validator`
|
367
|
+
processor:
|
368
|
+
|
369
|
+
```
|
370
|
+
$ cat input.json | wu-local --consumes=Person --from=json contact_validator --to=json
|
371
|
+
{"first_name": "John", "last_name": "Smith", "valid": "true"}
|
372
|
+
{"first_name": "Sally", "last_name": "Johnson", "valid": "true"}
|
373
|
+
...
|
374
|
+
```
|
375
|
+
|
376
|
+
Serialization formats work just like deserialization formats, with
|
377
|
+
JSON as well as delimited formats available.
|
378
|
+
|
379
|
+
Parsing records into model instances and serializing them out again
|
380
|
+
puts constraints on the model class providing these instances. Here's
|
381
|
+
what the `Person` class needs to look like:
|
382
|
+
|
383
|
+
|
384
|
+
```ruby
|
385
|
+
# in person.rb
|
386
|
+
class Person
|
387
|
+
|
388
|
+
# Create a new Person from the given attributes. Supports usage of
|
389
|
+
# the `--consumes` flag on the command-line
|
390
|
+
#
|
391
|
+
# @param [Hash] attrs
|
392
|
+
# @return [Person]
|
393
|
+
def self.receive attrs
|
394
|
+
new(attrs)
|
395
|
+
end
|
396
|
+
|
397
|
+
# Turn this Person into a basic data structure. Supports the usage
|
398
|
+
# of the `--to` flag on the command-line.
|
399
|
+
#
|
400
|
+
# @return [Hash]
|
401
|
+
def to_wire
|
402
|
+
to_hash
|
403
|
+
end
|
404
|
+
end
|
405
|
+
```
|
406
|
+
|
407
|
+
To support the `--consumes=Person` syntax, the `receive` class method
|
408
|
+
must take a Hash produced from the operation of the `--from` argument
|
409
|
+
and return a `Person` instance.
|
410
|
+
|
411
|
+
To support the `--to=json` syntax, the `Person` class must implement
|
412
|
+
the `to_wire` instance method.
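
As a rough illustration of that contract, here is a self-contained sketch of a model that satisfies both halves. The attribute names, the `valid?` rule and the `round_trip` helper are assumptions made for the example, and Ruby's standard `json` library stands in for whatever Wukong does internally for `--from=json` and `--to=json`.

```ruby
require 'json'

# A minimal model; the fields and validity rule are illustrative assumptions.
class Person
  attr_reader :first_name, :last_name

  # Supports `--consumes=Person`: build an instance from a parsed Hash.
  def self.receive(attrs)
    new(attrs)
  end

  def initialize(attrs)
    @first_name = attrs["first_name"]
    @last_name  = attrs["last_name"]
  end

  def valid?
    [first_name, last_name].none? { |value| value.nil? || value.empty? }
  end

  # Supports `--to=json`: return a basic data structure.
  def to_wire
    { "first_name" => first_name, "last_name" => last_name }
  end
end

# Hypothetical helper: parse, recordize, validate and serialize one line.
def round_trip(json_line)
  person = Person.receive(JSON.parse(json_line))
  JSON.generate(person.to_wire) if person.valid?
end

puts round_trip('{"first_name": "John", "last_name": "Smith"}')
# → {"first_name":"John","last_name":"Smith"}
```

An entry that is missing a name makes `round_trip` return `nil`, which corresponds to the `contact_validator` processor not yielding that record.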
|
413
|
+
|
414
|
+
### Logging and Notifications
|
415
|
+
|
416
|
+
Wukong comes with a logger that all processors have access to via
|
417
|
+
their `log` attribute. This logger has the following priorities:
|
418
|
+
|
419
|
+
* debug (can be set as a log level)
|
420
|
+
* info (can be set as a log level)
|
421
|
+
* warn (can be set as a log level)
|
422
|
+
* error
|
423
|
+
* fatal
|
424
|
+
|
425
|
+
and here's a processor which uses them all
|
426
|
+
|
427
|
+
```ruby
|
428
|
+
# in logs.rb
|
429
|
+
Wukong.processor(:logs) do
|
430
|
+
def process line
|
431
|
+
log.debug line
|
432
|
+
log.info line
|
433
|
+
log.warn line
|
434
|
+
log.error line
|
435
|
+
log.fatal line
|
436
|
+
end
|
437
|
+
end
|
438
|
+
```
|
439
|
+
|
440
|
+
The default log level is DEBUG.
|
441
|
+
|
442
|
+
```
|
443
|
+
$ echo event | wu-local logs.rb
|
444
|
+
DEBUG 2013-01-11 23:40:56 [Logs ] -- event
|
445
|
+
INFO 2013-01-11 23:40:56 [Logs ] -- event
|
446
|
+
WARN 2013-01-11 23:40:56 [Logs ] -- event
|
447
|
+
ERROR 2013-01-11 23:40:56 [Logs ] -- event
|
448
|
+
FATAL 2013-01-11 23:40:56 [Logs ] -- event
|
449
|
+
```
|
450
|
+
|
451
|
+
though you can set it to something else globally
|
452
|
+
|
453
|
+
```
|
454
|
+
$ echo event | wu-local logs.rb --log.level=warn
|
455
|
+
WARN 2013-01-11 23:40:56 [Logs ] -- event
|
456
|
+
ERROR 2013-01-11 23:40:56 [Logs ] -- event
|
457
|
+
FATAL 2013-01-11 23:40:56 [Logs ] -- event
|
458
|
+
```
|
459
|
+
|
460
|
+
or on a per-class basis.
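
The effect of a log level can be reproduced with Ruby's standard `Logger`, used here purely as a stand-in for Wukong's own logger (whose output format differs). `lines_logged_at` is a hypothetical helper.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: log one message at every priority and return
# the lines that actually reached the sink at the given level.
def lines_logged_at(level)
  sink = StringIO.new
  log = Logger.new(sink)
  log.level = level
  log.debug "event"
  log.info  "event"
  log.warn  "event"
  log.error "event"
  log.fatal "event"
  sink.string.lines
end

puts lines_logged_at(Logger::DEBUG).size  # → 5
puts lines_logged_at(Logger::WARN).size   # → 3
```

Raising the level to WARN drops the DEBUG and INFO messages, matching the three lines shown in the `--log.level=warn` example.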
|
461
|
+
|
462
|
+
### Creating Documentation
|
463
|
+
|
464
|
+
`wu-local` includes a help message:
|
465
|
+
|
466
|
+
```
|
467
|
+
$ wu-local --help
|
468
|
+
usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
|
469
|
+
|
470
|
+
wu-local is a tool for running Wukong processors and flows locally on
|
471
|
+
the command-line. Use wu-local by passing it a processor and feeding
|
472
|
+
...
|
473
|
+
|
474
|
+
|
475
|
+
Params:
|
476
|
+
-r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
|
477
|
+
-t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
|
478
|
+
```
|
479
|
+
|
480
|
+
You can generate custom help messages for your own processors. Here's
|
481
|
+
the percentile processor from before but made more usable with good
|
482
|
+
documentation:
|
116
483
|
|
484
|
+
```ruby
|
485
|
+
# in percentile.rb
|
486
|
+
Wukong.processor(:percentile) do
|
487
|
+
|
488
|
+
description <<-EOF.gsub(/^ {2}/,'')
|
489
|
+
This processor calculates percentiles from input scores based on a
|
490
|
+
given mean score and a given standard deviation for the scores.
|
491
|
+
|
492
|
+
The mean and standard deviation are given at run time and processed
|
493
|
+
scores will be compared against the given mean and standard
|
494
|
+
deviation.
|
495
|
+
|
496
|
+
The input is expected to consist of float values, one per line.
|
497
|
+
|
498
|
+
Example:
|
499
|
+
|
500
|
+
$ cat input.dat
|
501
|
+
88
|
502
|
+
89
|
503
|
+
77
|
504
|
+
...
|
505
|
+
|
506
|
+
$ cat input.dat | wu-local percentile.rb --mean=85 --std_dev=7
|
507
|
+
88.0 66.58824291023753
|
508
|
+
89.0 71.61454169013237
|
509
|
+
77.0 12.654895447355777
|
510
|
+
EOF
|
511
|
+
|
512
|
+
SQRT_1_HALF = Math.sqrt(0.5)
|
513
|
+
|
514
|
+
field :mean, Float, :default => 0.0, :doc => "The mean of the assumed distribution"
|
515
|
+
field :std_dev, Float, :default => 1.0, :doc => "The standard deviation of the assumed distribution"
|
516
|
+
|
517
|
+
def process value
|
518
|
+
observation = value.to_f
|
519
|
+
z_score = (mean - observation) / std_dev
|
520
|
+
percentile = 50 * Math.erfc(z_score * SQRT_1_HALF)
|
521
|
+
yield [observation, percentile].join("\t")
|
522
|
+
end
|
523
|
+
end
|
117
524
|
```
|
118
|
-
|
525
|
+
|
526
|
+
If you call `wu-local` with the file to this processor as an argument
|
527
|
+
in addition to the original `--help` argument, you'll get custom
|
528
|
+
documentation.
|
529
|
+
|
119
530
|
```
|
531
|
+
$ wu-local percentile.rb --help
|
532
|
+
usage: wu-local [ --param=val | --param | -p val | -p ] PROCESSOR|FLOW
|
533
|
+
|
534
|
+
This processor calculates percentiles from input scores based on a
|
535
|
+
given mean score and a given standard deviation for the scores.
|
536
|
+
...
|
120
537
|
|
121
|
-
|
538
|
+
|
539
|
+
Params:
|
540
|
+
--mean=Float The mean of the assumed distribution [Default: 0.0]
|
541
|
+
-r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
|
542
|
+
--std_dev=Float The standard deviation of the assumed distribution [Default: 1.0]
|
543
|
+
-t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
|
544
|
+
|
545
|
+
```
|
122
546
|
|
123
547
|
<a name="flows"></a>
|
124
548
|
## Combining Processors into Dataflows
|
125
549
|
|
126
550
|
Combining processors which each do one thing well together in a chain
|
127
551
|
is mimicking the tried and true UNIX pipeline. Wukong lets you define
|
128
|
-
these pipelines more formally as a dataflow.
|
552
|
+
these pipelines more formally as a dataflow.
|
553
|
+
|
554
|
+
Having written the `tokenizer` processor, we can use it in a dataflow
|
555
|
+
along with the built-in `regexp` processor to replicate what we did in
|
129
556
|
the last example:
|
130
557
|
|
131
558
|
```
|
132
559
|
# in find_t_words.rb
|
560
|
+
require_relative('processors')
|
133
561
|
Wukong.dataflow(:find_t_words) do
|
134
562
|
tokenizer | regexp(match: /^t/)
|
135
563
|
end
|
@@ -148,7 +576,8 @@ times
|
|
148
576
|
...
|
149
577
|
```
|
150
578
|
|
151
|
-
and it works exactly like
|
579
|
+
and it works exactly like manually chaining the two processors
|
580
|
+
together.
|
152
581
|
|
153
582
|
<a name="serialization"></a>
|
154
583
|
## Serialization
|
@@ -163,7 +592,14 @@ yield a String argument (or something that will `to_s` appropriately).
|
|
163
592
|
## Widgets
|
164
593
|
|
165
594
|
Wukong has a number of built-in widgets that are useful for
|
166
|
-
scaffolding your dataflows
|
595
|
+
scaffolding your dataflows or using as starting off points for your
|
596
|
+
own processors.
|
597
|
+
|
598
|
+
For any of these widgets you can get customized help, say
|
599
|
+
|
600
|
+
```
|
601
|
+
$ wu-local group --help
|
602
|
+
```
|
167
603
|
|
168
604
|
### Serializers
|
169
605
|
|
@@ -350,10 +786,10 @@ describe :tokenizer do
|
|
350
786
|
processor.given("Hi there.\nMy name is Wukong!").should emit(6).records
|
351
787
|
end
|
352
788
|
it "eliminates all punctuation" do
|
353
|
-
processor.given("Never!").
|
789
|
+
processor(:tokenizer).given("Never!").should emit('Never')
|
354
790
|
end
|
355
|
-
it "
|
356
|
-
processor.given("
|
791
|
+
it "will not emit tokens in a stop list" do
|
792
|
+
processor(:tokenizer, :stop_list => ['apples', 'bananas']).given("I like apples and bananas").should emit('I', 'like', 'and')
|
357
793
|
end
|
358
794
|
end
|
359
795
|
```
|
@@ -364,8 +800,13 @@ Let's look at each kind of helper:
|
|
364
800
|
`it_behaves_like` helper) adds some tests that ensure that the
|
365
801
|
processor conforms to the API of a Wukong::Processor.
|
366
802
|
|
367
|
-
* The `processor` method
|
368
|
-
|
803
|
+
* The `processor` method is actually an alias for the more aptly named
|
804
|
+
(but less convenient) `unit_test_runner`. This method accepts a
|
805
|
+
processor name and options (just like `wu-local` and other
|
806
|
+
command-line tools) and returns a Wukong::UnitTestRunner instance.
|
807
|
+
This runner handles the
|
808
|
+
|
809
|
+
|
369
810
|
a (registered) processor name and options and creates a new
|
370
811
|
processor. If no name is given, the argument of the enclosing
|
371
812
|
`describe` or `context` block is used. The object returned by
|
@@ -374,29 +815,38 @@ Let's look at each kind of helper:
|
|
374
815
|
behavior.
|
375
816
|
|
376
817
|
* The `given` method (and other helpers like `given_json`,
|
377
|
-
`given_tsv`, &c.) is
|
378
|
-
|
379
|
-
|
380
|
-
|
381
|
-
lifecycle as in the prior example.
|
818
|
+
`given_tsv`, &c.) is a method on the runner. It's a way of lazily
|
819
|
+
feeding records to a processor, without having to go through the
|
820
|
+
`process` method directly and having to handle the block or the
|
821
|
+
processor's lifecycle as in the prior example.
|
382
822
|
|
383
823
|
* The `output` and `emit` matchers will `process` all previously
|
384
824
|
`given` records when they are called. This lets you separate
|
385
825
|
instantiation, input, expectations, and output. Here's a more
|
386
|
-
complicated example
|
826
|
+
complicated example.
|
387
827
|
|
388
828
|
The same helpers can be used to test dataflows as well as
|
389
|
-
processors.
|
390
|
-
|
829
|
+
processors.
|
830
|
+
|
831
|
+
####
|
832
|
+
|
833
|
+
#### Functions vs. Objects
|
834
|
+
|
835
|
+
The above test helpers are designed to aid in testing processors
|
836
|
+
functionally because:
|
837
|
+
|
838
|
+
* they accept the
|
391
839
|
|
392
840
|
### Integration Tests
|
393
841
|
|
394
|
-
|
395
|
-
|
396
|
-
|
842
|
+
If you are implementing a new Wukong command (akin to `wu-local`) then
|
843
|
+
you may also want to run integration tests. Wukong comes with helpers
|
844
|
+
for these, too.
|
397
845
|
|
398
|
-
|
399
|
-
|
846
|
+
You should almost always be able to test your processors without
|
847
|
+
integration tests. Your unit tests and the Wukong framework itself
|
848
|
+
should ensure that your processors work correctly no matter what
|
849
|
+
environment they are deployed in.
|
400
850
|
|
401
851
|
```ruby
|
402
852
|
# spec/integration/tokenizer_spec.rb
|
@@ -415,7 +865,7 @@ context "interpreting its arguments" do
|
|
415
865
|
end
|
416
866
|
context "with a malformed --match argument" do
|
417
867
|
# invalid b/c the regexp is broken...
|
418
|
-
subject { command("wu-local tokenizer --match='^
|
868
|
+
subject { command("wu-local tokenizer --match='^(h'") < "hi there" }
|
419
869
|
it { should exit_with(:non_zero) }
|
420
870
|
it { should have_stderr(/invalid/) }
|
421
871
|
end
|
@@ -457,3 +907,192 @@ Let's go through the helpers:
|
|
457
907
|
* The `have_stdout` and `have_stderr` matchers let you test the STDOUT or STDERR of the command for particular strings or regular expressions.
|
458
908
|
|
459
909
|
* The `exit_with` matcher lets you test the exit code of the command. You can pass the symbol `:non_zero` to set the expectation of _any_ non-zero exit code.
|
910
|
+
|
911
|
+
## Plugins
|
912
|
+
|
913
|
+
Wukong has a built-in plugin framework to make it easy to adapt Wukong
|
914
|
+
processors to new backends or add other functionality. The
|
915
|
+
`Wukong::Local` module and the `wu-local` program it supports is
|
916
|
+
itself a Wukong plugin.
|
917
|
+
|
918
|
+
The following shows how you might build a simplified version of
|
919
|
+
`Wukong::Local` as a new plugin. We'll call this plugin `Cat` as it
|
920
|
+
will implement a program `wu-cat` that is similar in function to
|
921
|
+
`wu-local` (just simplified).
|
922
|
+
|
923
|
+
The first thing to do is include the `Wukong::Plugin` module in your
|
924
|
+
code:
|
925
|
+
|
926
|
+
|
927
|
+
```ruby
|
928
|
+
# in lib/cat.rb
|
929
|
+
#
|
930
|
+
# This Wukong plugin works like wu-local but replicates some silly
|
931
|
+
# features of cat like numbered lines.
|
932
|
+
module Cat
|
933
|
+
|
934
|
+
# This registers Cat as a Wukong plugin.
|
935
|
+
include Wukong::Plugin
|
936
|
+
|
937
|
+
# Defines any settings specific to Cat. Cat doesn't need to, but
|
938
|
+
# you can define global settings here if you want. You can also
|
939
|
+
# check the `program` name to decide whether to apply your settings.
|
940
|
+
# This helps you not pollute other commands with your stuff.
|
941
|
+
def self.configure settings, program
|
942
|
+
case program
|
943
|
+
when 'wu-cat'
|
944
|
+
settings.define(:input, :description => "The input file to use")
|
945
|
+
settings.define(:number, :description => "Prepend each input record with a consecutive number", :type => :boolean)
|
946
|
+
else
|
947
|
+
# configure other programs if you need to
|
948
|
+
end
|
949
|
+
end
|
950
|
+
|
951
|
+
# Lets Cat boot up with settings that have already been resolved
|
952
|
+
# from the command-line or other sources like config files or remote
|
953
|
+
# servers added by other plugins.
|
954
|
+
#
|
955
|
+
# The `root` directory in which the program is executing is also
|
956
|
+
# provided.
|
957
|
+
def self.boot settings, root
|
958
|
+
puts "Cat booting up using resolved settings within directory #{root}"
|
959
|
+
end
|
960
|
+
end
|
961
|
+
```
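
To make the registration step more concrete, here is a minimal, self-contained sketch of how including a module can register a plugin, and how a host program might later call its `configure` and `boot` hooks. `MiniWukong`, its `PLUGINS` list, `MiniSettings`, and `MiniCat` are hypothetical stand-ins written for this illustration. They are not Wukong's actual implementation, which may differ.

```ruby
# Hypothetical stand-in for the host framework; not Wukong's real code.
module MiniWukong
  PLUGINS = []

  module Plugin
    # Ruby calls this hook whenever a module includes MiniWukong::Plugin.
    def self.included(mod)
      PLUGINS << mod unless PLUGINS.include?(mod)
    end
  end

  # Give every registered plugin a chance to define settings.
  def self.configure_plugins(settings, program)
    PLUGINS.each do |plugin|
      plugin.configure(settings, program) if plugin.respond_to?(:configure)
    end
  end

  # Boot every registered plugin with the resolved settings.
  def self.boot_plugins(settings, root)
    PLUGINS.each do |plugin|
      plugin.boot(settings, root) if plugin.respond_to?(:boot)
    end
  end
end

# Hypothetical settings object that only records what was defined.
class MiniSettings
  attr_reader :definitions

  def initialize
    @definitions = {}
  end

  def define(name, options = {})
    @definitions[name] = options
  end
end

module MiniCat
  include MiniWukong::Plugin

  # Only define settings when running as `wu-cat`.
  def self.configure(settings, program)
    return unless program == 'wu-cat'
    settings.define(:input, :description => "The input file to use")
  end

  def self.boot(settings, root)
    @booted_in = root
  end

  def self.booted_in
    @booted_in
  end
end

settings = MiniSettings.new
MiniWukong.configure_plugins(settings, 'wu-cat')
MiniWukong.boot_plugins(settings, Dir.pwd)
```

Checking the `program` name inside `configure` is what keeps the `:input` setting from appearing under other commands in this sketch.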
|
962
|
+
|
963
|
+
If your plugin doesn't interact directly with the command-line
|
964
|
+
(through a wu-tool like `wu-local` or `wu-hadoop`) and doesn't
|
965
|
+
pass records to processors itself, then you can
|
966
|
+
just require the rest of your plugin's code at this point and be done.
|
967
|
+
|
968
|
+
### Write a Runner to interact with the command-line
|
969
|
+
|
970
|
+
If you need to implement a new command line tool then you should write
|
971
|
+
a Runner. A Runner is used to implement Wukong programs like
|
972
|
+
`wu-local` or `wu-hadoop`. Here's what the actual program file would
|
973
|
+
look like for our example plugin's `wu-cat` program.
|
974
|
+
|
975
|
+
```ruby
|
976
|
+
#!/usr/bin/env ruby
|
977
|
+
# in bin/wu-cat
|
978
|
+
require 'cat'
|
979
|
+
Cat::Runner.run
|
980
|
+
```
|
981
|
+
|
982
|
+
The `Cat::Runner` class is implemented separately.
|
983
|
+
|
984
|
+
```ruby
|
985
|
+
# in lib/cat/runner.rb
|
986
|
+
require_relative('driver')
|
987
|
+
module Cat
|
988
|
+
|
989
|
+
# Implements the `wu-cat` command.
|
990
|
+
class Runner < Wukong::Runner
|
991
|
+
|
992
|
+
usage "PROCESSOR|FLOW"
|
993
|
+
|
994
|
+
description <<-EOF
|
995
|
+
|
996
|
+
wu-cat lets you run a Wukong processor or dataflow on the
|
997
|
+
command-line. Try it like this.
|
998
|
+
|
999
|
+
$ wu-cat --input=data.txt
|
1000
|
+
hello
|
1001
|
+
my
|
1002
|
+
friend
|
1003
|
+
|
1004
|
+
Connect the output to a processor in upcaser.rb
|
1005
|
+
|
1006
|
+
$ wu-cat --input=data.txt upcaser.rb
|
1007
|
+
HELLO
|
1008
|
+
MY
|
1009
|
+
FRIEND
|
1010
|
+
|
1011
|
+
You can also add line numbers to the output.
|
1012
|
+
|
1013
|
+
$ wu-cat --number --input=data.txt upcaser.rb
|
1014
|
+
1 HELLO
|
1015
|
+
2 MY
|
1016
|
+
3 FRIEND
|
1017
|
+
EOF
|
1018
|
+
|
1019
|
+
# The name of the processor we're going to run. The #args method
|
1020
|
+
# is provided by the Runner class.
|
1021
|
+
def processor_name
|
1022
|
+
args.first
|
1023
|
+
end
|
1024
|
+
|
1025
|
+
# Validate that we were given the name of a registered processor
|
1026
|
+
# to run. Be careful to return true here or validation will fail.
|
1027
|
+
def validate
|
1028
|
+
raise Wukong::Error.new("Must provide a processor as the first argument") unless processor_name
|
1029
|
+
true
|
1030
|
+
end
|
1031
|
+
|
1032
|
+
# Delegates to a driver class to run the processor.
|
1033
|
+
def run
|
1034
|
+
Driver.new(processor_name, settings).start
|
1035
|
+
end
|
1036
|
+
|
1037
|
+
end
|
1038
|
+
end
|
1039
|
+
```
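
The validate-then-run lifecycle described in the comments above can be sketched without Wukong installed. The `MiniRunner` base class below is a hypothetical stand-in, and its argument parsing and its `run` class method are assumptions made for illustration. It only shows why `validate` needs to return a truthy value and how `args` and `settings` might be separated.

```ruby
# Hypothetical stand-in for a runner base class; not Wukong's real code.
class MiniRunner
  class Error < StandardError; end

  attr_reader :args, :settings

  # Split ARGV-style input into --key=value settings and positional args.
  def initialize(argv)
    @settings = {}
    @args     = []
    argv.each do |arg|
      if arg =~ /\A--([\w\-]+)(?:=(.*))?\z/
        @settings[$1.to_sym] = $2.nil? ? true : $2
      else
        @args << arg
      end
    end
  end

  # Build a runner, validate it, and only then run it.
  def self.run(argv = ARGV)
    runner = new(argv)
    raise Error, "Validation failed" unless runner.validate
    runner.run
  end

  # Subclasses override these two hooks.
  def validate
    true
  end

  def run
  end
end

class MiniCatRunner < MiniRunner
  def processor_name
    args.first
  end

  def validate
    raise Error, "Must provide a processor as the first argument" unless processor_name
    true # an explicit truthy return keeps validation from failing
  end

  # A real runner would hand off to a driver here.
  def run
    "would run #{processor_name} on #{settings[:input]}"
  end
end

result = MiniCatRunner.run(%w[--number --input=data.txt upcaser.rb])
puts result
```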
|
1040
|
+
|
1041
|
+
### Write a Driver to interact with processors
|
1042
|
+
|
1043
|
+
The `Cat::Runner#run` method delegates to the `Cat::Driver` class to
|
1044
|
+
handle instantiating and interacting with processors.
|
1045
|
+
|
1046
|
+
```ruby
|
1047
|
+
# in lib/cat/driver.rb
|
1048
|
+
module Cat
|
1049
|
+
|
1050
|
+
# A class for driving a processor from `wu-cat`.
|
1051
|
+
class Driver
|
1052
|
+
|
1053
|
+
# Lets us count the records.
|
1054
|
+
attr_accessor :number
|
1055
|
+
|
1056
|
+
# Gives methods to construct and interact with dataflows.
|
1057
|
+
include Wukong::DriverMethods
|
1058
|
+
|
1059
|
+
# Create a new Driver for a dataflow with the given `label` using
|
1060
|
+
# the given `settings`.
|
1061
|
+
#
|
1062
|
+
# @param [String] label the name of the dataflow
|
1063
|
+
# @param [Configliere::Param] settings the settings to use when creating the dataflow
|
1064
|
+
def initialize label, settings
|
1065
|
+
self.settings = settings
|
1066
|
+
self.dataflow = construct_dataflow(label, settings)
|
1067
|
+
self.number = 1
|
1068
|
+
end
|
1069
|
+
|
1070
|
+
# The file handle of the input file.
|
1071
|
+
#
|
1072
|
+
# @return [File]
|
1073
|
+
def input_file
|
1074
|
+
@input_file ||= File.new(settings[:input])
|
1075
|
+
end
|
1076
|
+
|
1077
|
+
# Starts feeding records to the processor
|
1078
|
+
def start
|
1079
|
+
while line = input_file.gets
|
1080
|
+
driver.send_through_dataflow(line)
|
1081
|
+
end
|
1082
|
+
end
|
1083
|
+
|
1084
|
+
# Process each record that comes back from the dataflow.
|
1085
|
+
#
|
1086
|
+
# @param [Object] record the yielded record
|
1087
|
+
def process record
|
1088
|
+
if settings[:number]
|
1089
|
+
puts [number, record].map(&:to_s).join("\t")
|
1090
|
+
else
|
1091
|
+
puts record.to_s
|
1092
|
+
end
|
1093
|
+
self.number += 1
|
1094
|
+
end
|
1095
|
+
|
1096
|
+
end
|
1097
|
+
end
|
1098
|
+
```
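
The read, send, and process loop in `Cat::Driver` can be tried in isolation with a sketch like the one below. `MiniCatDriver` is a hypothetical stand-in: the "dataflow" is a plain lambda that returns exactly one record per input, and the input is a `StringIO` rather than a file. A real Wukong dataflow may yield zero or many records per input, so treat this only as an illustration of the numbering logic.

```ruby
require 'stringio'

# Hypothetical stand-in for Cat::Driver; the "dataflow" here is a plain
# lambda rather than a Wukong dataflow.
class MiniCatDriver
  attr_accessor :number
  attr_reader   :output

  def initialize(input, dataflow, settings = {})
    @input    = input     # any IO-like object that responds to #each_line
    @dataflow = dataflow  # callable that takes a record and returns a record
    @settings = settings
    @output   = []
    self.number = 1
  end

  # Feed each input line through the dataflow, then process the result.
  def start
    @input.each_line do |line|
      process(@dataflow.call(line.chomp))
    end
  end

  # Collect each record, optionally prefixed with its line number.
  def process(record)
    if @settings[:number]
      output << [number, record].map(&:to_s).join("\t")
    else
      output << record.to_s
    end
    self.number += 1
  end
end

upcaser = lambda { |record| record.upcase }
driver  = MiniCatDriver.new(StringIO.new("hello\nmy\nfriend\n"), upcaser, :number => true)
driver.start
puts driver.output
```

Collecting records in an array instead of printing them directly makes the numbering easy to check in a test.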
|