wukong 3.0.1 → 4.0.0

Files changed (69)
  1. data/.gitignore +1 -0
  2. data/Gemfile +1 -1
  3. data/README.md +253 -45
  4. data/bin/wu +34 -0
  5. data/bin/wu-source +5 -0
  6. data/examples/Gemfile +0 -1
  7. data/examples/deploy_pack/Gemfile +0 -1
  8. data/examples/improver/tweet_summary.rb +73 -0
  9. data/examples/ruby_project/Gemfile +0 -1
  10. data/examples/splitter.rb +94 -0
  11. data/examples/twitter.rb +5 -0
  12. data/lib/hanuman.rb +1 -1
  13. data/lib/hanuman/graph.rb +39 -22
  14. data/lib/hanuman/stage.rb +46 -13
  15. data/lib/hanuman/tree.rb +67 -0
  16. data/lib/wukong.rb +6 -1
  17. data/lib/wukong/dataflow.rb +19 -48
  18. data/lib/wukong/driver.rb +176 -65
  19. data/lib/wukong/{local → driver}/event_machine_driver.rb +1 -13
  20. data/lib/wukong/driver/wiring.rb +68 -0
  21. data/lib/wukong/local.rb +6 -4
  22. data/lib/wukong/local/runner.rb +14 -16
  23. data/lib/wukong/local/stdio_driver.rb +72 -12
  24. data/lib/wukong/processor.rb +1 -30
  25. data/lib/wukong/runner.rb +2 -0
  26. data/lib/wukong/runner/command_runner.rb +44 -0
  27. data/lib/wukong/source.rb +33 -0
  28. data/lib/wukong/source/source_driver.rb +74 -0
  29. data/lib/wukong/source/source_runner.rb +38 -0
  30. data/lib/wukong/spec_helpers/shared_examples.rb +0 -1
  31. data/lib/wukong/spec_helpers/unit_tests.rb +6 -5
  32. data/lib/wukong/spec_helpers/unit_tests/unit_test_driver.rb +4 -14
  33. data/lib/wukong/spec_helpers/unit_tests/unit_test_runner.rb +7 -8
  34. data/lib/wukong/version.rb +1 -1
  35. data/lib/wukong/widget/echo.rb +55 -0
  36. data/lib/wukong/widget/{processors.rb → extract.rb} +0 -106
  37. data/lib/wukong/widget/filters.rb +15 -0
  38. data/lib/wukong/widget/logger.rb +56 -0
  39. data/lib/wukong/widget/operators.rb +82 -0
  40. data/lib/wukong/widget/reducers.rb +2 -0
  41. data/lib/wukong/widget/reducers/improver.rb +71 -0
  42. data/lib/wukong/widget/reducers/join_xml.rb +37 -0
  43. data/lib/wukong/widget/serializers.rb +21 -6
  44. data/lib/wukong/widgets.rb +6 -3
  45. data/spec/hanuman/graph_spec.rb +73 -10
  46. data/spec/hanuman/stage_spec.rb +15 -0
  47. data/spec/hanuman/tree_spec.rb +119 -0
  48. data/spec/spec_helper.rb +13 -1
  49. data/spec/support/example_test_helpers.rb +0 -1
  50. data/spec/support/model_test_helpers.rb +1 -1
  51. data/spec/support/shared_context_for_graphs.rb +57 -0
  52. data/spec/support/shared_examples_for_builders.rb +8 -15
  53. data/spec/wukong/driver_spec.rb +152 -0
  54. data/spec/wukong/local/runner_spec.rb +1 -12
  55. data/spec/wukong/local/stdio_driver_spec.rb +73 -0
  56. data/spec/wukong/processor_spec.rb +0 -1
  57. data/spec/wukong/runner_spec.rb +2 -2
  58. data/spec/wukong/source_spec.rb +6 -0
  59. data/spec/wukong/widget/extract_spec.rb +101 -0
  60. data/spec/wukong/widget/logger_spec.rb +23 -0
  61. data/spec/wukong/widget/operators_spec.rb +25 -0
  62. data/spec/wukong/widget/reducers/join_xml_spec.rb +25 -0
  63. data/spec/wukong/wu-source_spec.rb +32 -0
  64. data/spec/wukong/wu_spec.rb +14 -0
  65. data/wukong.gemspec +1 -2
  66. metadata +45 -28
  67. data/lib/wukong/local/tcp_driver.rb +0 -47
  68. data/spec/wu/geo/geolocated_spec.rb +0 -247
  69. data/spec/wukong/widget/processors_spec.rb +0 -125
data/.gitignore CHANGED
@@ -57,3 +57,4 @@ away
  .rbx
  Gemfile.lock
  Backup*of*.numbers
+ *.gem
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
- source :rubygems
+ source 'https://rubygems.org'
  
  gemspec
  
data/README.md CHANGED
@@ -5,8 +5,8 @@ at any scale.
  
  The core concept in Wukong is a **Processor**. Wukong processors are
  simple Ruby classes that do one thing and do it well. This codebase
- implements processors and other core Wukong classes and provides a
- tool, `wu-local`, to run and combine processors on the command-line.
+ implements processors and other core Wukong classes and provides a way
+ to run and combine processors on the command-line.
  
  Wukong's larger theme is *powerful black boxes, beautiful glue*. The
  Wukong ecosystem consists of other tools which run Wukong processors
@@ -293,7 +293,7 @@ $ cat input.json
  you can feed it directly to a processor
  
  ```
- $ cat input.json | wu-local --from=json extractor
+ $ cat input.json | wu-local --from=json extractor.rb
  John
  Sally
  ...
@@ -302,9 +302,10 @@ Sally
  Other processors really like Arrays:
  
  ```ruby
+ # in summer.rb
  Wukong.processor(:summer) do
  def process values
- yield values.map(&:to_f).inject(0.0) { |sum, summand| sum += summand }
+ yield values.map(&:to_f).inject(&:+)
  end
  end
  ```
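A side note on the `inject` change in this hunk: the two forms agree for non-empty input but differ on an empty array, where `inject(&:+)` has no seed and returns `nil` rather than `0.0`. The snippet below is a standalone plain-Ruby illustration (no Wukong required). The helper names and sample values are made up for this note, and whether `summer` can ever receive an empty row is an assumption the diff does not settle.

```ruby
# Hypothetical helpers mirroring the old and new bodies of `summer`.

# Old form (3.0.1): explicit 0.0 seed, so an empty array sums to 0.0.
def sum_seeded(values)
  values.map(&:to_f).inject(0.0) { |sum, summand| sum + summand }
end

# New form (4.0.0): no seed, so an empty array yields nil.
def sum_unseeded(values)
  values.map(&:to_f).inject(&:+)
end

row = %w[1 2 3]
puts sum_seeded(row).inspect    # same result as the unseeded form
puts sum_unseeded(row).inspect
puts sum_seeded([]).inspect     # seeded form still returns a Float
puts sum_unseeded([]).inspect   # unseeded form returns nil
```

If empty rows are possible, `inject(0.0, :+)` would keep the shorter syntax while preserving the seed.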
@@ -316,7 +317,7 @@ $ cat data.tsv
316
317
  4 5 6
317
318
  7 8 9
318
319
  ...
319
- $ cat data.tsv | wu-local --from=tsv summer
320
+ $ cat data.tsv | wu-local --from=tsv summer.rb
320
321
  6
321
322
  15
322
323
  24
@@ -326,13 +327,13 @@ $ cat data.tsv | wu-local --from=tsv summer
326
327
  but you can just as easily use the same code with CSV data
327
328
 
328
329
  ```
329
- $ cat data.tsv | wu-local --from=csv summer
330
+ $ cat data.tsv | wu-local --from=csv summer.rb
330
331
  ```
331
332
 
332
333
  or a more general delimited format.
333
334
 
334
335
  ```
335
- $ cat data.tsv | wu-local --from=delimited --delimiter='--' summer
336
+ $ cat data.tsv | wu-local --from=delimited --delimiter='--' summer.rb
336
337
  ```
337
338
 
338
339
  #### Recordizing data structures into domain models
@@ -357,7 +358,7 @@ combination with the deserializing features above, turn input text
357
358
  into instances of Person:
358
359
 
359
360
  ```
360
- $ cat input.json | wu-local --consumes=Person --from=json contact_validator
361
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator.rb
361
362
  #<Person:0x000000020e6120>
362
363
  #<Person:0x000000020e6120>
363
364
  #<Person:0x000000020e6120>
@@ -367,7 +368,7 @@ $ cat input.json | wu-local --consumes=Person --from=json contact_validator
367
368
  processor:
368
369
 
369
370
  ```
370
- $ cat input.json | wu-local --consumes=Person --from=json contact_validator --to=json
371
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator.rb --to=json
371
372
  {"first_name": "John", "last_name":, "Smith", "valid": "true"}
372
373
  {"first_name": "Sally", "last_name":, "Johnson", "valid": "true"}
373
374
  ...
@@ -441,20 +442,20 @@ The default log level is DEBUG.
441
442
 
442
443
  ```
443
444
  $ echo something | wu-local logs.rb
444
- DEBUG 2013-01-11 23:40:56 [Logs ] -- event
445
- INFO 2013-01-11 23:40:56 [Logs ] -- event
446
- WARN 2013-01-11 23:40:56 [Logs ] -- event
447
- ERROR 2013-01-11 23:40:56 [Logs ] -- event
448
- FATAL 2013-01-11 23:40:56 [Logs ] -- event
445
+ DEBUG 2013-01-11 23:40:56 [Logs ] -- something
446
+ INFO 2013-01-11 23:40:56 [Logs ] -- something
447
+ WARN 2013-01-11 23:40:56 [Logs ] -- something
448
+ ERROR 2013-01-11 23:40:56 [Logs ] -- something
449
+ FATAL 2013-01-11 23:40:56 [Logs ] -- something
449
450
  ```
450
451
 
451
452
  though you can set it to something else globally
452
453
 
453
454
  ```
454
455
  $ echo something | wu-local logs.rb --log.level=warn
455
- WARN 2013-01-11 23:40:56 [Logs ] -- event
456
- ERROR 2013-01-11 23:40:56 [Logs ] -- event
457
- FATAL 2013-01-11 23:40:56 [Logs ] -- event
456
+ WARN 2013-01-11 23:40:56 [Logs ] -- something
457
+ ERROR 2013-01-11 23:40:56 [Logs ] -- something
458
+ FATAL 2013-01-11 23:40:56 [Logs ] -- something
458
459
  ```
459
460
 
460
461
  or on a per-class basis.
@@ -474,7 +475,6 @@ the command-line. Use wu-local by passing it a processor and feeding
474
475
 
475
476
  Params:
476
477
  -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
477
- -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
478
478
  ```
479
479
 
480
480
  You can generate custom help messages for your own processors. Here's
@@ -540,22 +540,23 @@ Params:
540
540
  --mean=Float The mean of the assumed distribution [Default: 0.0]
541
541
  -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
542
542
  --std_dev=Float The standard deviation of the assumed distribution [Default: 1.0]
543
- -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
544
543
 
545
544
  ```
546
545
 
547
546
  <a name="flows"></a>
548
547
  ## Combining Processors into Dataflows
549
548
 
550
- Combining processors which each do one thing well together in a chain
551
- is mimicing the tried and true UNIX pipeline. Wukong lets you define
552
- these pipelines more formally as a dataflow.
549
+ Wukong provides a DSL for combining processors together into
550
+ dataflows. This DSL is designed to make it easy to replicate the
551
+ tried and true UNIX philosophy of building simple tools which do one
552
+ thing well and then combining them together to create more complicated
553
+ flows.
553
554
 
554
- Having written the `tokenizer` processor, we can use it in a dataflow
555
- along with the built-in `regexp` processor to replicate what we did in
556
- the last example:
555
+ For example, having written the `tokenizer` processor, we can use it
556
+ in a dataflow along with the built-in `regexp` processor to replicate
557
+ what we did in the last example:
557
558
 
558
- ```
559
+ ```ruby
559
560
  # in find_t_words.rb
560
561
  require_relative('processors')
561
562
  Wukong.dataflow(:find_t_words) do
@@ -563,9 +564,13 @@ Wukong.dataflow(:find_t_words) do
563
564
  end
564
565
  ```
565
566
 
566
- The DSL Wukong provides for combining processors is designed to
567
- similar to the processing of developing them on the command line. You
568
- can run this dataflow directly
567
+ The `|` operator connects the output of one processor (what it
568
+ `yield`s) with the input of another (its `process` method). In this
569
+ example, every record emitted by `tokenizer` will be subsequently
570
+ processed by `regexp`.
571
+
572
+ You can run this dataflow directly (mimicing what we did above with
573
+ single processors chained together on the command-line):
569
574
 
570
575
  ```
571
576
  $ cat novel.txt | wu-local find_t_words.rb
@@ -576,8 +581,85 @@ times
576
581
  ...
577
582
  ```
578
583
 
579
- and it works exactly like manually chaining the two processors
580
- together.
584
+ ### More complicated dataflow topologies
585
+
586
+ The Wukong dataflow DSL allows for more complicated topologies than
587
+ just chaining processors together in a linear pipeline.
588
+
589
+ The `|` operator, used in the above examples to connect two processors
590
+ together into a chain, can also be used to connect a single processor
591
+ to *multiple* processors, creating a branch-point in the dataflow.
592
+ Each branch of the flow will receive the same records.
593
+
594
+ This can be used to perform multiple actions with the same record, as
595
+ in the following example:
596
+
597
+ ```ruby
598
+ # in book_reviews.rb
599
+ Wukong.dataflow(:complicated) do
600
+ from_json | recordize(model: BookReview) |
601
+ [
602
+ map(&:author) | do_author_stuff | ... | to_json,
603
+ map(&:book) | do_book_stuff | ... | to_json,
604
+ ]
605
+ end
606
+ ```
607
+
608
+ Each `BookReview` record yielded by the `recordize` processor will be
609
+ passed to both subsequent branches of the flow, with each branch doing
610
+ a different kind of processing. Output records from both branches
611
+ (which are here turned `to_json` first) will be interspersed in the
612
+ final output when run.
613
+
614
+ A processor like `select`, which filters its inputs, can be used to
615
+ split a flow into records of two types:
616
+
617
+ ```ruby
618
+ # in complicated.rb
619
+ Wukong.dataflow(:complicated) do
620
+ from_json | parser |
621
+ [
622
+ select(&:valid?) | further_processing | ... | to_json,
623
+ select(&:invalid?) | track_errors | null
624
+ ]
625
+ end
626
+ ```
627
+
628
+ Here, only records which respond true to the method `valid?` will pass
629
+ through the first flow (applying `further_processing` and so on) while
630
+ only records which respond true to `invalid?` will pass through the
631
+ second flow (with `track_errors`). The `null` processor at the end of
632
+ this second branch ensures that only records from the first branch
633
+ will be emitted in the final output.
634
+
635
+ Flows can be split over and over again, allowing for rich semantics
636
+ when processing an input source:
637
+
638
+ ```ruby
639
+ # in many_splits.rb
640
+ Wukong.dataflow(:many_splits) do
641
+ from_json | parser | recordize(model: BookReview) |
642
+ [
643
+ map(&:author) | ... | to_json,
644
+ map(&:publisher) |
645
+ [
646
+ select(&:domestic?) | ... | to_json,
647
+ select(&:international?) |
648
+ [
649
+ select(&:north_american?) | ... |
650
+ [
651
+ select(&:american?) | ... | to_json,
652
+ select(&:canadian?) | ... | to_json,
653
+ select(&:mexican?) | ... | to_json,
654
+ ],
655
+ select(&:asian?) | ... | to_json,
656
+ select(&:european?) | ... | to_json,
657
+ ],
658
+ ],
659
+ map(&:title) | ... | to_json
660
+ ]
661
+ end
662
+ ```
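The branching semantics described in this hunk (every branch receives the same record, and a branch ending in `null` contributes nothing to the output) can be modeled in a few lines of plain Ruby. This is a toy sketch, not Wukong's implementation: `fan_out`, the two lambdas, and the sample records are hypothetical names chosen for illustration.

```ruby
# Toy model of a branch point; NOT how Wukong implements `|` with an Array.
# Each branch is a callable that takes one record and returns an Array of outputs.
def fan_out(records, branches)
  records.flat_map do |record|
    # Every branch sees the same record; outputs are interleaved in order.
    branches.flat_map { |branch| branch.call(record) }
  end
end

errors = []

# Plays the role of `select(&:valid?) | further_processing`.
valid_branch = lambda do |record|
  record[:valid] ? [record.merge(processed: true)] : []
end

# Plays the role of `select(&:invalid?) | track_errors | null`:
# it records the error as a side effect and emits nothing.
invalid_branch = lambda do |record|
  errors << record unless record[:valid]
  []
end

records = [{ id: 1, valid: true }, { id: 2, valid: false }, { id: 3, valid: true }]
output = fan_out(records, [valid_branch, invalid_branch])

puts output.inspect
puts errors.inspect
```

Only records 1 and 3 reach the output here, while record 2 is seen solely by the error-tracking branch.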
581
663
 
582
664
  <a name="serialization></a>
583
665
  ## Serialization
@@ -612,19 +694,9 @@ record, merely its representation. Here's a list:
612
694
 
613
695
  When you're writing processors that are capable of running in
614
696
  isolation you'll want to ensure that you deserialize and serialize
615
- records on the way in and out, like this
616
-
617
- ```ruby
618
- Wukong.processor(:on_my_own) do
619
- def process json
620
- obj = MultiJson.load(json)
621
-
622
- # do something with obj...
623
-
624
- yield MultiJson.dump(obj)
625
- end
626
- end
627
- ```
697
+ records on the way in and out, using the serialization/deserialization
698
+ options `--to` and `--from` on the command-line, as <a
699
+ href="#serialization">defined above</a>.
628
700
 
629
701
  For processors which will only run inside a data flow, you can
630
702
  optimize by not doing any (de)serialization until except at the very
@@ -636,7 +708,12 @@ Wukong.dataflow(:complicated) do
636
708
  end
637
709
  ```
638
710
 
639
- in this approach, no serialization will be done between processors.
711
+ in this approach, no serialization will be done between processors,
712
+ only at the beginning and end.
713
+
714
+ (This is actually the implementation behind the serialization options
715
+ themselves -- they dynamically prepend/append the appropriate
716
+ deserializers/serializers.)
640
717
 
641
718
  ### General Purpose
642
719
 
@@ -718,6 +795,137 @@ Wukong.dataflow(:word_count) do
718
795
  end
719
796
  ```
720
797
 
798
+ ## Commands
799
+
800
+ Wukong comes with a few commands built-in.
801
+
802
+ ### wu-local
803
+
804
+ You've seen one already, `wu-local`, in many of the examples above.
805
+ `wu-local` is used to model dataflows locally, using `STDIN` and
806
+ `STDOUT` for input and output.
807
+
808
+ `wu-local` is a "core" Wukong command in the sense that more
809
+ complicated commands like `wu-hadoop` and `wu-storm`, implemented by
810
+ Wukong plugins, ultimately invoke some `wu-local` process.
811
+
812
+ ### wu-source
813
+
814
+ Wukong also comes with another basic command `wu-source`. This
815
+ command works very similarly to `wu-local` except that it doesn't read
816
+ any input from `STDIN`. Instead it generates its *own* input records
817
+ in an easy to configure, periodic way. It thus acts as a *source* of
818
+ data for other processes in a UNIX pipeline.
819
+
820
+ Here's an example using the `identity` processor which will have the
821
+ effect of printing to `STDOUT` the exact input received:
822
+
823
+ ```
824
+ $ wu-source identity
825
+ 1
826
+ 2
827
+ 3
828
+ ...
829
+ ```
830
+
831
+ From this example it's clear that the records produced by `wu-source`
832
+ are consecutive integers starting at 1 and that they are produced at a
833
+ rate of one record per second.
834
+
835
+ `wu-source` can thus be used to turn any processor (or dataflow) into
836
+ a source of data:
837
+
838
+ ```ruby
839
+ # in random_numbers.rb
840
+ Wukong.processor(:random_numbers) do
841
+ def process index
842
+ yield rand() * index.to_i
843
+ end
844
+ end
845
+ ```
846
+
847
+ Run `random_numbers` like this:
848
+
849
+ ```
850
+ $ wu-source random_numbers.rb
851
+ 0.7671364694830113
852
+ 0.5958089791553307
853
+ 1.8284806932633886
854
+ 3.707189931235327
855
+ 4.106618048255548
856
+ ...
857
+ ```
858
+
859
+ Which produces random numbers with an ever greater ceiling.
860
+
861
+ You can also completely ignore the input record from `wu-source` in
862
+ your processor:
863
+
864
+ ```ruby
865
+ # in generator.rb
866
+ Wukong.processor(:generator) do
867
+ def process _
868
+ yield new_record
869
+ end
870
+ def new_record
871
+ MyRecord.new(...)
872
+ end
873
+ end
874
+ ```
875
+
876
+ which can produce `MyRecord` instances as it's driven by `wu-source`.
877
+
878
+ It's easy to generate several thousand events per second using
879
+ `wu-source` this way:
880
+
881
+ ```
882
+ $ wu-source generator.rb --per_sec=2000
883
+ ```
884
+
885
+ or use the `--period` (which is the inverse of `--per_sec`) to spit
886
+ out records at a regular interval (every 5 minutes in this example):
887
+
888
+ ```
889
+ $ wu-source generator.rb --period=300
890
+ ```
891
+
892
+ `wu-source` can naturally combine with other dataflows or programs you
893
+ might write:
894
+
895
+ ```
896
+ $ wu-source generator.rb --per_sec=200 | wu-local my_flow
897
+ ```
898
+ ### wu
899
+
900
+ The `wu` command is a convenience command useful when using any of the
901
+ other `wu-` commands in the context of a Ruby project with a
902
+ [`Gemfile`](http://bundler.io/v1.3/gemfile.html).
903
+
904
+ Instead of typing
905
+
906
+ ```
907
+ $ bundle exec wu-local my_flow --option=value ...
908
+ ```
909
+
910
+ which would run `wu-local` using the exact version of `wukong` (and
911
+ any other dependencies) as declared in your project's `Gemfile` and
912
+ `Gemfile.lock`, the `wu` command lets you type
913
+
914
+ ```
915
+ $ wu local my_flow --option=value ...
916
+ ```
917
+
918
+ essentially adding the `bundle exec` prefix and munging `wu local` to
919
+ `wu-local` for you. This can be very helpful when doing lots of work
920
+ with Wukong.
921
+
922
+ **Note:** If `bundle exec wu-whatever` works in your project but `wu
923
+ whatever` fails it is probably because Bundler is resolving `wu-`
924
+ commands to some installation that is not on your `$PATH` (often the
925
+ case if you ran `bundle install --standalone`). Ensure that the
926
+ `wukong` gem is installed on your system and that it's binaries are
927
+ your `$PATH` to use the `wu` command.
928
+
721
929
  ## Testing
722
930
 
723
931
  Wukong comes with several helpers to make writing specs using
data/bin/wu ADDED
@@ -0,0 +1,34 @@
+ #!/usr/bin/env ruby
+ require 'shellwords'
+ now=Time.now.strftime("%Y-%m-%d %H:%M:%S")
+ if ARGV.empty?
+ abort "ERROR #{now} [wu ] -- Must provide a Wukong command to run. Try the --help option."
+ else
+ if ARGV.size == 1 && ARGV.first == '--help'
+ abort <<EOF
+ usage: wu COMMAND [OPTIONS] [ARG] ...
+
+ wu is a wrapper for easy use of Wukong's command-line tools. It takes
+ your arguments, constructs the name of the proper wu-tool to call, and
+ prepends a call to bundle exec.
+
+ $ wu local ...
+
+ is equivalent to
+
+ $ bundle exec wu-local ...
+
+ You can run any of the wu-tools this way:
+
+ wu-local wu-source
+ wu-hadoop wu-storm
+ wu-deploy wu-load
+ EOF
+ else
+ if ARGV.first =~ /^-/
+ abort "ERROR ${now} [wu ] -- First argument must be the name of a wu tool to run, got <${1}>"
+ else
+ Kernel.exec "bundle exec wu-#{Shellwords.join(ARGV)}"
+ end
+ end
+ end
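
One thing worth flagging in the script above: the error message for a leading-dash argument uses shell-style `${now}` and `${1}`, which Ruby does not interpolate, so those placeholders would be printed literally. A minimal sketch of the same decision logic with Ruby interpolation follows. The `wu_action` helper is hypothetical (the gem's script calls `abort` and `Kernel.exec` directly); it returns the action instead of performing it, so the logic can be exercised without Bundler installed.

```ruby
require 'shellwords'

# Hypothetical helper, not part of the wukong gem: decides what `wu` would do
# for a given argument list and returns it as data instead of executing it.
def wu_action(argv, now = Time.now.strftime("%Y-%m-%d %H:%M:%S"))
  if argv.empty?
    [:abort, "ERROR #{now} [wu ] -- Must provide a Wukong command to run. Try the --help option."]
  elsif argv == ['--help']
    [:help]
  elsif argv.first.start_with?('-')
    # Uses Ruby's #{...} interpolation rather than shell-style ${...}.
    [:abort, "ERROR #{now} [wu ] -- First argument must be the name of a wu tool to run, got <#{argv.first}>"]
  else
    # Same command construction as the original script.
    [:exec, "bundle exec wu-#{Shellwords.join(argv)}"]
  end
end

p wu_action(%w[local my_flow.rb])
p wu_action(%w[--verbose])
```

Returning the action as data is only a convenience for this sketch; the fix to the shipped script would be limited to replacing `${now}` with `#{now}` and `${1}` with `#{ARGV.first}`.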