wukong 3.0.1 → 4.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (69)
  1. data/.gitignore +1 -0
  2. data/Gemfile +1 -1
  3. data/README.md +253 -45
  4. data/bin/wu +34 -0
  5. data/bin/wu-source +5 -0
  6. data/examples/Gemfile +0 -1
  7. data/examples/deploy_pack/Gemfile +0 -1
  8. data/examples/improver/tweet_summary.rb +73 -0
  9. data/examples/ruby_project/Gemfile +0 -1
  10. data/examples/splitter.rb +94 -0
  11. data/examples/twitter.rb +5 -0
  12. data/lib/hanuman.rb +1 -1
  13. data/lib/hanuman/graph.rb +39 -22
  14. data/lib/hanuman/stage.rb +46 -13
  15. data/lib/hanuman/tree.rb +67 -0
  16. data/lib/wukong.rb +6 -1
  17. data/lib/wukong/dataflow.rb +19 -48
  18. data/lib/wukong/driver.rb +176 -65
  19. data/lib/wukong/{local → driver}/event_machine_driver.rb +1 -13
  20. data/lib/wukong/driver/wiring.rb +68 -0
  21. data/lib/wukong/local.rb +6 -4
  22. data/lib/wukong/local/runner.rb +14 -16
  23. data/lib/wukong/local/stdio_driver.rb +72 -12
  24. data/lib/wukong/processor.rb +1 -30
  25. data/lib/wukong/runner.rb +2 -0
  26. data/lib/wukong/runner/command_runner.rb +44 -0
  27. data/lib/wukong/source.rb +33 -0
  28. data/lib/wukong/source/source_driver.rb +74 -0
  29. data/lib/wukong/source/source_runner.rb +38 -0
  30. data/lib/wukong/spec_helpers/shared_examples.rb +0 -1
  31. data/lib/wukong/spec_helpers/unit_tests.rb +6 -5
  32. data/lib/wukong/spec_helpers/unit_tests/unit_test_driver.rb +4 -14
  33. data/lib/wukong/spec_helpers/unit_tests/unit_test_runner.rb +7 -8
  34. data/lib/wukong/version.rb +1 -1
  35. data/lib/wukong/widget/echo.rb +55 -0
  36. data/lib/wukong/widget/{processors.rb → extract.rb} +0 -106
  37. data/lib/wukong/widget/filters.rb +15 -0
  38. data/lib/wukong/widget/logger.rb +56 -0
  39. data/lib/wukong/widget/operators.rb +82 -0
  40. data/lib/wukong/widget/reducers.rb +2 -0
  41. data/lib/wukong/widget/reducers/improver.rb +71 -0
  42. data/lib/wukong/widget/reducers/join_xml.rb +37 -0
  43. data/lib/wukong/widget/serializers.rb +21 -6
  44. data/lib/wukong/widgets.rb +6 -3
  45. data/spec/hanuman/graph_spec.rb +73 -10
  46. data/spec/hanuman/stage_spec.rb +15 -0
  47. data/spec/hanuman/tree_spec.rb +119 -0
  48. data/spec/spec_helper.rb +13 -1
  49. data/spec/support/example_test_helpers.rb +0 -1
  50. data/spec/support/model_test_helpers.rb +1 -1
  51. data/spec/support/shared_context_for_graphs.rb +57 -0
  52. data/spec/support/shared_examples_for_builders.rb +8 -15
  53. data/spec/wukong/driver_spec.rb +152 -0
  54. data/spec/wukong/local/runner_spec.rb +1 -12
  55. data/spec/wukong/local/stdio_driver_spec.rb +73 -0
  56. data/spec/wukong/processor_spec.rb +0 -1
  57. data/spec/wukong/runner_spec.rb +2 -2
  58. data/spec/wukong/source_spec.rb +6 -0
  59. data/spec/wukong/widget/extract_spec.rb +101 -0
  60. data/spec/wukong/widget/logger_spec.rb +23 -0
  61. data/spec/wukong/widget/operators_spec.rb +25 -0
  62. data/spec/wukong/widget/reducers/join_xml_spec.rb +25 -0
  63. data/spec/wukong/wu-source_spec.rb +32 -0
  64. data/spec/wukong/wu_spec.rb +14 -0
  65. data/wukong.gemspec +1 -2
  66. metadata +45 -28
  67. data/lib/wukong/local/tcp_driver.rb +0 -47
  68. data/spec/wu/geo/geolocated_spec.rb +0 -247
  69. data/spec/wukong/widget/processors_spec.rb +0 -125
data/.gitignore CHANGED
@@ -57,3 +57,4 @@ away
  .rbx
  Gemfile.lock
  Backup*of*.numbers
+ *.gem
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
- source :rubygems
+ source 'https://rubygems.org'

  gemspec
data/README.md CHANGED
@@ -5,8 +5,8 @@ at any scale.
  The core concept in Wukong is a **Processor**. Wukong processors are
  simple Ruby classes that do one thing and do it well. This codebase
- implements processors and other core Wukong classes and provides a
- tool, `wu-local`, to run and combine processors on the command-line.
+ implements processors and other core Wukong classes and provides a way
+ to run and combine processors on the command-line.

  Wukong's larger theme is *powerful black boxes, beautiful glue*. The
  Wukong ecosystem consists of other tools which run Wukong processors
@@ -293,7 +293,7 @@ $ cat input.json
  you can feed it directly to a processor

  ```
- $ cat input.json | wu-local --from=json extractor
+ $ cat input.json | wu-local --from=json extractor.rb
  John
  Sally
  ...
@@ -302,9 +302,10 @@ Sally
  Other processors really like Arrays:

  ```ruby
+ # in summer.rb
  Wukong.processor(:summer) do
    def process values
-     yield values.map(&:to_f).inject(0.0) { |sum, summand| sum += summand }
+     yield values.map(&:to_f).inject(&:+)
    end
  end
  ```
@@ -316,7 +317,7 @@ $ cat data.tsv
  4 5 6
  7 8 9
  ...
- $ cat data.tsv | wu-local --from=tsv summer
+ $ cat data.tsv | wu-local --from=tsv summer.rb
  6
  15
  24
@@ -326,13 +327,13 @@ $ cat data.tsv | wu-local --from=tsv summer
  but you can just as easily use the same code with CSV data

  ```
- $ cat data.tsv | wu-local --from=csv summer
+ $ cat data.tsv | wu-local --from=csv summer.rb
  ```

  or a more general delimited format.

  ```
- $ cat data.tsv | wu-local --from=delimited --delimiter='--' summer
+ $ cat data.tsv | wu-local --from=delimited --delimiter='--' summer.rb
  ```

  #### Recordizing data structures into domain models
@@ -357,7 +358,7 @@ combination with the deserializing features above, turn input text
  into instances of Person:

  ```
- $ cat input.json | wu-local --consumes=Person --from=json contact_validator
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator.rb
  #<Person:0x000000020e6120>
  #<Person:0x000000020e6120>
  #<Person:0x000000020e6120>
@@ -367,7 +368,7 @@ $ cat input.json | wu-local --consumes=Person --from=json contact_validator
  processor:

  ```
- $ cat input.json | wu-local --consumes=Person --from=json contact_validator --to=json
+ $ cat input.json | wu-local --consumes=Person --from=json contact_validator.rb --to=json
  {"first_name": "John", "last_name": "Smith", "valid": "true"}
  {"first_name": "Sally", "last_name": "Johnson", "valid": "true"}
  ...
@@ -441,20 +442,20 @@ The default log level is DEBUG.
  ```
  $ echo something | wu-local logs.rb
- DEBUG 2013-01-11 23:40:56 [Logs ] -- event
- INFO 2013-01-11 23:40:56 [Logs ] -- event
- WARN 2013-01-11 23:40:56 [Logs ] -- event
- ERROR 2013-01-11 23:40:56 [Logs ] -- event
- FATAL 2013-01-11 23:40:56 [Logs ] -- event
+ DEBUG 2013-01-11 23:40:56 [Logs ] -- something
+ INFO 2013-01-11 23:40:56 [Logs ] -- something
+ WARN 2013-01-11 23:40:56 [Logs ] -- something
+ ERROR 2013-01-11 23:40:56 [Logs ] -- something
+ FATAL 2013-01-11 23:40:56 [Logs ] -- something
  ```

  though you can set it to something else globally

  ```
  $ echo something | wu-local logs.rb --log.level=warn
- WARN 2013-01-11 23:40:56 [Logs ] -- event
- ERROR 2013-01-11 23:40:56 [Logs ] -- event
- FATAL 2013-01-11 23:40:56 [Logs ] -- event
+ WARN 2013-01-11 23:40:56 [Logs ] -- something
+ ERROR 2013-01-11 23:40:56 [Logs ] -- something
+ FATAL 2013-01-11 23:40:56 [Logs ] -- something
  ```

  or on a per-class basis.
@@ -474,7 +475,6 @@ the command-line. Use wu-local by passing it a processor and feeding
  Params:
  -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
- -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN
  ```

  You can generate custom help messages for your own processors. Here's
@@ -540,22 +540,23 @@ Params:
  --mean=Float The mean of the assumed distribution [Default: 0.0]
  -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path.
  --std_dev=Float The standard deviation of the assumed distribution [Default: 1.0]
- -t, --tcp_port=Integer Consume TCP requests on the given port instead of lines over STDIN

  ```

  <a name="flows"></a>
  ## Combining Processors into Dataflows

- Combining processors which each do one thing well together in a chain
- is mimicing the tried and true UNIX pipeline. Wukong lets you define
- these pipelines more formally as a dataflow.
+ Wukong provides a DSL for combining processors together into
+ dataflows. This DSL is designed to make it easy to replicate the
+ tried and true UNIX philosophy of building simple tools which do one
+ thing well and then combining them together to create more complicated
+ flows.

- Having written the `tokenizer` processor, we can use it in a dataflow
- along with the built-in `regexp` processor to replicate what we did in
- the last example:
+ For example, having written the `tokenizer` processor, we can use it
+ in a dataflow along with the built-in `regexp` processor to replicate
+ what we did in the last example:

- ```
+ ```ruby
  # in find_t_words.rb
  require_relative('processors')
  Wukong.dataflow(:find_t_words) do
@@ -563,9 +564,13 @@ Wukong.dataflow(:find_t_words) do
  end
  ```

- The DSL Wukong provides for combining processors is designed to
- similar to the processing of developing them on the command line. You
- can run this dataflow directly
+ The `|` operator connects the output of one processor (what it
+ `yield`s) with the input of another (its `process` method). In this
+ example, every record emitted by `tokenizer` will be subsequently
+ processed by `regexp`.
+
+ You can run this dataflow directly (mimicking what we did above with
+ single processors chained together on the command-line):

  ```
  $ cat novel.txt | wu-local find_t_words.rb
@@ -576,8 +581,85 @@ times
  ...
  ```

- and it works exactly like manually chaining the two processors
- together.
+ ### More complicated dataflow topologies
+
+ The Wukong dataflow DSL allows for more complicated topologies than
+ just chaining processors together in a linear pipeline.
+
+ The `|` operator, used in the above examples to connect two processors
+ together into a chain, can also be used to connect a single processor
+ to *multiple* processors, creating a branch-point in the dataflow.
+ Each branch of the flow will receive the same records.
+
+ This can be used to perform multiple actions with the same record, as
+ in the following example:
+
+ ```ruby
+ # in book_reviews.rb
+ Wukong.dataflow(:complicated) do
+   from_json | recordize(model: BookReview) |
+   [
+     map(&:author) | do_author_stuff | ... | to_json,
+     map(&:book) | do_book_stuff | ... | to_json,
+   ]
+ end
+ ```
+
+ Each `BookReview` record yielded by the `recordize` processor will be
+ passed to both subsequent branches of the flow, with each branch doing
+ a different kind of processing. Output records from both branches
+ (which are here turned `to_json` first) will be interspersed in the
+ final output when run.
+
+ A processor like `select`, which filters its inputs, can be used to
+ split a flow into records of two types:
+
+ ```ruby
+ # in complicated.rb
+ Wukong.dataflow(:complicated) do
+   from_json | parser |
+   [
+     select(&:valid?) | further_processing | ... | to_json,
+     select(&:invalid?) | track_errors | null
+   ]
+ end
+ ```
+
+ Here, only records which respond true to the method `valid?` will pass
+ through the first flow (applying `further_processing` and so on) while
+ only records which respond true to `invalid?` will pass through the
+ second flow (with `track_errors`). The `null` processor at the end of
+ this second branch ensures that only records from the first branch
+ will be emitted in the final output.
+
+ Flows can be split over and over again, allowing for rich semantics
+ when processing an input source:
+
+ ```ruby
+ # in many_splits.rb
+ Wukong.dataflow(:many_splits) do
+   from_json | parser | recordize(model: BookReview) |
+   [
+     map(&:author) | ... | to_json,
+     map(&:publisher) |
+     [
+       select(&:domestic?) | ... | to_json,
+       select(&:international?) |
+       [
+         select(&:north_american?) | ... |
+         [
+           select(&:american?) | ... | to_json,
+           select(&:canadian?) | ... | to_json,
+           select(&:mexican?) | ... | to_json,
+         ],
+         select(&:asian?) | ... | to_json,
+         select(&:european?) | ... | to_json,
+       ],
+     ],
+     map(&:title) | ... | to_json
+   ]
+ end
+ ```

  <a name="serialization"></a>
  ## Serialization
@@ -612,19 +694,9 @@ record, merely its representation. Here's a list:
  When you're writing processors that are capable of running in
  isolation you'll want to ensure that you deserialize and serialize
- records on the way in and out, like this
-
- ```ruby
- Wukong.processor(:on_my_own) do
-   def process json
-     obj = MultiJson.load(json)
-
-     # do something with obj...
-
-     yield MultiJson.dump(obj)
-   end
- end
- ```
+ records on the way in and out, using the serialization/deserialization
+ options `--to` and `--from` on the command-line, as <a
+ href="#serialization">defined above</a>.

  For processors which will only run inside a data flow, you can
  optimize by not doing any (de)serialization except at the very
@@ -636,7 +708,12 @@ Wukong.dataflow(:complicated) do
  end
  ```

- in this approach, no serialization will be done between processors.
+ in this approach, no serialization will be done between processors,
+ only at the beginning and end.
+
+ (This is actually the implementation behind the serialization options
+ themselves -- they dynamically prepend/append the appropriate
+ deserializers/serializers.)

  ### General Purpose
@@ -718,6 +795,137 @@ Wukong.dataflow(:word_count) do
  end
  ```

+ ## Commands
+
+ Wukong comes with a few commands built-in.
+
+ ### wu-local
+
+ You've seen one already, `wu-local`, in many of the examples above.
+ `wu-local` is used to model dataflows locally, using `STDIN` and
+ `STDOUT` for input and output.
+
+ `wu-local` is a "core" Wukong command in the sense that more
+ complicated commands like `wu-hadoop` and `wu-storm`, implemented by
+ Wukong plugins, ultimately invoke some `wu-local` process.
+
+ ### wu-source
+
+ Wukong also comes with another basic command `wu-source`. This
+ command works very similarly to `wu-local` except that it doesn't read
+ any input from `STDIN`. Instead it generates its *own* input records
+ in an easy to configure, periodic way. It thus acts as a *source* of
+ data for other processes in a UNIX pipeline.
+
+ Here's an example using the `identity` processor which will have the
+ effect of printing to `STDOUT` the exact input received:
+
+ ```
+ $ wu-source identity
+ 1
+ 2
+ 3
+ ...
+ ```
+
+ From this example it's clear that the records produced by `wu-source`
+ are consecutive integers starting at 1 and that they are produced at a
+ rate of one record per second.
+
+ `wu-source` can thus be used to turn any processor (or dataflow) into
+ a source of data:
+
+ ```ruby
+ # in random_numbers.rb
+ Wukong.processor(:random_numbers) do
+   def process index
+     yield rand() * index.to_i
+   end
+ end
+ ```
+
+ Run `random_numbers` like this:
+
+ ```
+ $ wu-source random_numbers.rb
+ 0.7671364694830113
+ 0.5958089791553307
+ 1.8284806932633886
+ 3.707189931235327
+ 4.106618048255548
+ ...
+ ```
+
+ Which produces random numbers with an ever greater ceiling.
+
+ You can also completely ignore the input record from `wu-source` in
+ your processor:
+
+ ```ruby
+ # in generator.rb
+ Wukong.processor(:generator) do
+   def process _
+     yield new_record
+   end
+   def new_record
+     MyRecord.new(...)
+   end
+ end
+ ```
+
+ which can produce `MyRecord` instances as it's driven by `wu-source`.
+
+ It's easy to generate several thousand events per second using
+ `wu-source` this way:
+
+ ```
+ $ wu-source generator.rb --per_sec=2000
+ ```
+
+ or use the `--period` (which is the inverse of `--per_sec`) to spit
+ out records at a regular interval (every 5 minutes in this example):
+
+ ```
+ $ wu-source generator.rb --period=300
+ ```
+
+ `wu-source` can naturally combine with other dataflows or programs you
+ might write:
+
+ ```
+ $ wu-source generator.rb --per_sec=200 | wu-local my_flow
+ ```
+ ### wu
+
+ The `wu` command is a convenience command useful when using any of the
+ other `wu-` commands in the context of a Ruby project with a
+ [`Gemfile`](http://bundler.io/v1.3/gemfile.html).
+
+ Instead of typing
+
+ ```
+ $ bundle exec wu-local my_flow --option=value ...
+ ```
+
+ which would run `wu-local` using the exact version of `wukong` (and
+ any other dependencies) as declared in your project's `Gemfile` and
+ `Gemfile.lock`, the `wu` command lets you type
+
+ ```
+ $ wu local my_flow --option=value ...
+ ```
+
+ essentially adding the `bundle exec` prefix and munging `wu local` to
+ `wu-local` for you. This can be very helpful when doing lots of work
+ with Wukong.
+
+ **Note:** If `bundle exec wu-whatever` works in your project but `wu
+ whatever` fails it is probably because Bundler is resolving `wu-`
+ commands to some installation that is not on your `$PATH` (often the
+ case if you ran `bundle install --standalone`). Ensure that the
+ `wukong` gem is installed on your system and that its binaries are
+ on your `$PATH` to use the `wu` command.
+
  ## Testing

  Wukong comes with several helpers to make writing specs using
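The branching behaviour described in the README's "More complicated dataflow topologies" section (every branch receives the same record, and a branch ending in `null` emits nothing) can be modelled in a few lines of plain Ruby. This is only a sketch under stated assumptions: it does not use or reproduce Wukong's DSL, and `Record`, `run_branches`, `valid_branch` and `error_branch` are hypothetical names introduced for illustration.

```ruby
# Plain-Ruby model of a branching dataflow. This is NOT Wukong's
# implementation; every name below is hypothetical.
Record = Struct.new(:id, :ok) do
  def valid?
    ok
  end

  def invalid?
    !ok
  end
end

# A branch takes one record and returns an array of zero or more outputs,
# mirroring a processor that may or may not `yield`.
def run_branches(records, branches)
  records.flat_map do |record|
    # Every branch sees the same record, in order.
    branches.flat_map { |branch| branch.call(record) }
  end
end

errors = []

# Plays the role of `select(&:valid?) | ... | to_json`:
# emits output only for valid records.
valid_branch = ->(record) { record.valid? ? ["valid:#{record.id}"] : [] }

# Plays the role of `select(&:invalid?) | track_errors | null`:
# tracks the error as a side effect but emits nothing.
error_branch = lambda do |record|
  errors << record.id if record.invalid?
  []
end

records = [Record.new(1, true), Record.new(2, false), Record.new(3, true)]
output  = run_branches(records, [valid_branch, error_branch])

p output # => ["valid:1", "valid:3"]
p errors # => [2]
```

Only the first branch contributes to the final output, which matches the README's description of what the trailing `null` processor is for.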
data/bin/wu ADDED
@@ -0,0 +1,34 @@
+ #!/usr/bin/env ruby
+ require 'shellwords'
+ now=Time.now.strftime("%Y-%m-%d %H:%M:%S")
+ if ARGV.empty?
+   abort "ERROR #{now} [wu ] -- Must provide a Wukong command to run. Try the --help option."
+ else
+   if ARGV.size == 1 && ARGV.first == '--help'
+     abort <<EOF
+ usage: wu COMMAND [OPTIONS] [ARG] ...
+
+ wu is a wrapper for easy use of Wukong's command-line tools. It takes
+ your arguments, constructs the name of the proper wu-tool to call, and
+ prepends a call to bundle exec.
+
+ $ wu local ...
+
+ is equivalent to
+
+ $ bundle exec wu-local ...
+
+ You can run any of the wu-tools this way:
+
+ wu-local wu-source
+ wu-hadoop wu-storm
+ wu-deploy wu-load
+ EOF
+   else
+     if ARGV.first =~ /^-/
+       abort "ERROR #{now} [wu ] -- First argument must be the name of a wu tool to run, got <#{ARGV.first}>"
+     else
+       Kernel.exec "bundle exec wu-#{Shellwords.join(ARGV)}"
+     end
+   end
+ end
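The command string built on the final branch of `bin/wu` can be examined in isolation. The sketch below assumes nothing beyond Ruby's standard `shellwords` library; `wu_command` is a hypothetical helper that does not exist in the gem, and it only builds the string rather than calling `Kernel.exec`.

```ruby
require 'shellwords'

# Hypothetical helper (not part of wukong): returns the string that
# bin/wu would hand to Kernel.exec for the given argument list.
def wu_command(argv)
  # Shellwords.join escapes each argument and joins them with spaces,
  # so the first argument ends up glued onto the "wu-" prefix.
  "bundle exec wu-#{Shellwords.join(argv)}"
end

puts wu_command(%w[local my_flow])
# prints: bundle exec wu-local my_flow

# Arguments containing spaces or shell metacharacters are escaped, and
# splitting the result with Shellwords.split recovers the original words.
command = wu_command(['local', 'my flow.rb', '--option=value'])
p Shellwords.split(command)
# => ["bundle", "exec", "wu-local", "my flow.rb", "--option=value"]
```

Because each argument is escaped before interpolation, passing the result to a shell as a single string keeps arguments with embedded spaces as single words.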