wukong 3.0.0.pre → 3.0.0.pre2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +46 -33
- data/.gitmodules +3 -0
- data/.rspec +1 -1
- data/.travis.yml +8 -1
- data/.yardopts +0 -13
- data/Guardfile +4 -6
- data/{LICENSE.textile → LICENSE.md} +43 -55
- data/README-old.md +422 -0
- data/README.md +279 -418
- data/Rakefile +21 -5
- data/TODO.md +6 -6
- data/bin/wu-clean-encoding +31 -0
- data/bin/wu-lign +2 -2
- data/bin/wu-local +69 -0
- data/bin/wu-server +70 -0
- data/examples/Gemfile +38 -0
- data/examples/README.md +9 -0
- data/examples/dataflow/apache_log_line.rb +64 -25
- data/examples/dataflow/fibonacci_series.rb +101 -0
- data/examples/dataflow/parse_apache_logs.rb +37 -7
- data/examples/{dataflow.rb → dataflow/scraper_macro_flow.rb} +0 -0
- data/examples/dataflow/simple.rb +4 -4
- data/examples/geo.rb +4 -0
- data/examples/geo/geo_grids.numbers +0 -0
- data/examples/geo/geolocated.rb +331 -0
- data/examples/geo/quadtile.rb +69 -0
- data/examples/geo/spec/geolocated_spec.rb +247 -0
- data/examples/geo/tile_fetcher.rb +77 -0
- data/examples/graph/minimum_spanning_tree.rb +61 -61
- data/examples/jabberwocky.txt +36 -0
- data/examples/models/wikipedia.rb +20 -0
- data/examples/munging/Gemfile +8 -0
- data/examples/munging/airline_flights/airline.rb +57 -0
- data/examples/munging/airline_flights/airline_flights.rake +83 -0
- data/{lib/wukong/settings.rb → examples/munging/airline_flights/airplane.rb} +0 -0
- data/examples/munging/airline_flights/airport.rb +211 -0
- data/examples/munging/airline_flights/airport_id_unification.rb +129 -0
- data/examples/munging/airline_flights/airport_ok_chars.rb +4 -0
- data/examples/munging/airline_flights/flight.rb +156 -0
- data/examples/munging/airline_flights/models.rb +4 -0
- data/examples/munging/airline_flights/parse.rb +26 -0
- data/examples/munging/airline_flights/reconcile_airports.rb +142 -0
- data/examples/munging/airline_flights/route.rb +35 -0
- data/examples/munging/airline_flights/tasks.rake +83 -0
- data/examples/munging/airline_flights/timezone_fixup.rb +62 -0
- data/examples/munging/airline_flights/topcities.rb +167 -0
- data/examples/munging/airports/40_wbans.txt +40 -0
- data/examples/munging/airports/filter_weather_reports.rb +37 -0
- data/examples/munging/airports/join.pig +31 -0
- data/examples/munging/airports/to_tsv.rb +33 -0
- data/examples/munging/airports/usa_wbans.pig +19 -0
- data/examples/munging/airports/usa_wbans.txt +2157 -0
- data/examples/munging/airports/wbans.pig +19 -0
- data/examples/munging/airports/wbans.txt +2310 -0
- data/examples/munging/geo/geo_json.rb +54 -0
- data/examples/munging/geo/geo_models.rb +69 -0
- data/examples/munging/geo/geonames_models.rb +78 -0
- data/examples/munging/geo/iso_codes.rb +172 -0
- data/examples/munging/geo/reconcile_countries.rb +124 -0
- data/examples/munging/geo/tasks.rake +71 -0
- data/examples/munging/rake_helper.rb +62 -0
- data/examples/munging/weather/.gitignore +1 -0
- data/examples/munging/weather/Gemfile +4 -0
- data/examples/munging/weather/Rakefile +28 -0
- data/examples/munging/weather/extract_ish.rb +13 -0
- data/examples/munging/weather/models/weather.rb +119 -0
- data/examples/munging/weather/utils/noaa_downloader.rb +46 -0
- data/examples/munging/wikipedia/README.md +34 -0
- data/examples/munging/wikipedia/Rakefile +193 -0
- data/examples/munging/wikipedia/articles/extract_articles-parsed.rb +79 -0
- data/examples/munging/wikipedia/articles/extract_articles-templated.rb +136 -0
- data/examples/munging/wikipedia/articles/textualize_articles.rb +54 -0
- data/examples/munging/wikipedia/articles/verify_structure.rb +43 -0
- data/examples/munging/wikipedia/articles/wp2txt-LICENSE.txt +22 -0
- data/examples/munging/wikipedia/articles/wp2txt_article.rb +259 -0
- data/examples/munging/wikipedia/articles/wp2txt_utils.rb +452 -0
- data/examples/munging/wikipedia/dbpedia/dbpedia_common.rb +4 -0
- data/examples/munging/wikipedia/dbpedia/dbpedia_extract_geocoordinates.rb +78 -0
- data/examples/munging/wikipedia/dbpedia/extract_links.rb +193 -0
- data/examples/munging/wikipedia/dbpedia/sameas_extractor.rb +20 -0
- data/examples/munging/wikipedia/n1_subuniverse/n1_nodes.pig +18 -0
- data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb +21 -0
- data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb.old +27 -0
- data/examples/munging/wikipedia/pagelinks/augment_pagelinks.pig +29 -0
- data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb +14 -0
- data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb.old +25 -0
- data/examples/munging/wikipedia/pagelinks/undirect_pagelinks.pig +29 -0
- data/examples/munging/wikipedia/pageviews/augment_pageviews.pig +32 -0
- data/examples/munging/wikipedia/pageviews/extract_pageviews.rb +85 -0
- data/examples/munging/wikipedia/pig_style_guide.md +25 -0
- data/examples/munging/wikipedia/redirects/redirects_page_metadata.pig +19 -0
- data/examples/munging/wikipedia/subuniverse/sub_articles.pig +23 -0
- data/examples/munging/wikipedia/subuniverse/sub_page_metadata.pig +24 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_from.pig +22 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_into.pig +22 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_within.pig +26 -0
- data/examples/munging/wikipedia/subuniverse/sub_pageviews.pig +29 -0
- data/examples/munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig +24 -0
- data/examples/munging/wikipedia/utils/get_namespaces.rb +86 -0
- data/examples/munging/wikipedia/utils/munging_utils.rb +68 -0
- data/examples/munging/wikipedia/utils/namespaces.json +1 -0
- data/examples/rake_helper.rb +85 -0
- data/examples/server_logs/geo_ip_mapping/munge_geolite.rb +82 -0
- data/examples/server_logs/logline.rb +95 -0
- data/examples/server_logs/models.rb +66 -0
- data/examples/server_logs/page_counts.pig +48 -0
- data/examples/server_logs/server_logs-01-parse-script.rb +13 -0
- data/examples/server_logs/server_logs-02-histograms-full.rb +33 -0
- data/examples/server_logs/server_logs-02-histograms-mapper.rb +14 -0
- data/{old/examples/server_logs/breadcrumbs.rb → examples/server_logs/server_logs-03-breadcrumbs-full.rb} +26 -30
- data/examples/server_logs/server_logs-04-page_page_edges-full.rb +40 -0
- data/examples/string_reverser.rb +26 -0
- data/examples/text/pig_latin.rb +2 -2
- data/examples/text/regional_flavor/README.md +14 -0
- data/examples/text/regional_flavor/article_wordbags.pig +39 -0
- data/examples/text/regional_flavor/j01-article_wordbags.rb +4 -0
- data/examples/text/regional_flavor/simple_pig_script.pig +27 -0
- data/examples/word_count/accumulator.rb +26 -0
- data/examples/word_count/tokenizer.rb +13 -0
- data/examples/word_count/word_count.rb +6 -0
- data/examples/workflow/cherry_pie.dot +97 -0
- data/examples/workflow/cherry_pie.png +0 -0
- data/examples/workflow/cherry_pie.rb +61 -26
- data/lib/hanuman.rb +34 -7
- data/lib/hanuman/graph.rb +55 -31
- data/lib/hanuman/graphvizzer.rb +199 -178
- data/lib/hanuman/graphvizzer/gv_models.rb +161 -0
- data/lib/hanuman/graphvizzer/gv_presenter.rb +97 -0
- data/lib/hanuman/link.rb +35 -0
- data/lib/hanuman/registry.rb +46 -0
- data/lib/hanuman/stage.rb +76 -32
- data/lib/wukong.rb +23 -24
- data/lib/wukong/boot.rb +87 -0
- data/lib/wukong/configuration.rb +8 -0
- data/lib/wukong/dataflow.rb +45 -78
- data/lib/wukong/driver.rb +99 -0
- data/lib/wukong/emitter.rb +22 -0
- data/lib/wukong/model/faker.rb +24 -24
- data/lib/wukong/model/flatpack_parser/flat.rb +60 -0
- data/lib/wukong/model/flatpack_parser/flatpack.rb +4 -0
- data/lib/wukong/model/flatpack_parser/lang.rb +46 -0
- data/lib/wukong/model/flatpack_parser/parser.rb +55 -0
- data/lib/wukong/model/flatpack_parser/tokens.rb +130 -0
- data/lib/wukong/processor.rb +60 -114
- data/lib/wukong/spec_helpers.rb +81 -0
- data/lib/wukong/spec_helpers/integration_driver.rb +144 -0
- data/lib/wukong/spec_helpers/integration_driver_matchers.rb +219 -0
- data/lib/wukong/spec_helpers/processor_helpers.rb +95 -0
- data/lib/wukong/spec_helpers/processor_methods.rb +108 -0
- data/lib/wukong/spec_helpers/shared_examples.rb +15 -0
- data/lib/wukong/spec_helpers/spec_driver.rb +28 -0
- data/lib/wukong/spec_helpers/spec_driver_matchers.rb +195 -0
- data/lib/wukong/version.rb +2 -1
- data/lib/wukong/widget/filters.rb +311 -0
- data/lib/wukong/widget/processors.rb +156 -0
- data/lib/wukong/widget/reducers.rb +7 -0
- data/lib/wukong/widget/reducers/accumulator.rb +73 -0
- data/lib/wukong/widget/reducers/bin.rb +318 -0
- data/lib/wukong/widget/reducers/count.rb +61 -0
- data/lib/wukong/widget/reducers/group.rb +85 -0
- data/lib/wukong/widget/reducers/group_concat.rb +70 -0
- data/lib/wukong/widget/reducers/moments.rb +72 -0
- data/lib/wukong/widget/reducers/sort.rb +130 -0
- data/lib/wukong/widget/serializers.rb +287 -0
- data/lib/wukong/widget/sink.rb +10 -52
- data/lib/wukong/widget/source.rb +7 -113
- data/lib/wukong/widget/utils.rb +46 -0
- data/lib/wukong/widgets.rb +6 -0
- data/spec/examples/dataflow/fibonacci_series_spec.rb +18 -0
- data/spec/examples/dataflow/parsing_spec.rb +12 -11
- data/spec/examples/dataflow/simple_spec.rb +32 -6
- data/spec/examples/dataflow/telegram_spec.rb +36 -36
- data/spec/examples/graph/minimum_spanning_tree_spec.rb +30 -31
- data/spec/examples/munging/airline_flights/identifiers_spec.rb +16 -0
- data/spec/examples/munging/airline_flights_spec.rb +202 -0
- data/spec/examples/text/pig_latin_spec.rb +13 -16
- data/spec/examples/workflow/cherry_pie_spec.rb +34 -4
- data/spec/hanuman/graph_spec.rb +27 -2
- data/spec/hanuman/hanuman_spec.rb +10 -0
- data/spec/hanuman/registry_spec.rb +123 -0
- data/spec/hanuman/stage_spec.rb +61 -7
- data/spec/spec_helper.rb +29 -19
- data/spec/support/hanuman_test_helpers.rb +14 -12
- data/spec/support/shared_context_for_reducers.rb +37 -0
- data/spec/support/shared_examples_for_builders.rb +101 -0
- data/spec/support/shared_examples_for_shortcuts.rb +57 -0
- data/spec/support/wukong_test_helpers.rb +37 -11
- data/spec/wukong/dataflow_spec.rb +77 -55
- data/spec/wukong/local_runner_spec.rb +24 -24
- data/spec/wukong/model/faker_spec.rb +132 -131
- data/spec/wukong/runner_spec.rb +8 -8
- data/spec/wukong/widget/filters_spec.rb +61 -0
- data/spec/wukong/widget/processors_spec.rb +126 -0
- data/spec/wukong/widget/reducers/bin_spec.rb +92 -0
- data/spec/wukong/widget/reducers/count_spec.rb +11 -0
- data/spec/wukong/widget/reducers/group_spec.rb +20 -0
- data/spec/wukong/widget/reducers/moments_spec.rb +36 -0
- data/spec/wukong/widget/reducers/sort_spec.rb +26 -0
- data/spec/wukong/widget/serializers_spec.rb +92 -0
- data/spec/wukong/widget/sink_spec.rb +15 -15
- data/spec/wukong/widget/source_spec.rb +65 -41
- data/spec/wukong/wukong_spec.rb +10 -0
- data/wukong.gemspec +17 -10
- metadata +359 -335
- data/.document +0 -5
- data/VERSION +0 -1
- data/bin/hdp-bin +0 -44
- data/bin/hdp-bzip +0 -23
- data/bin/hdp-cat +0 -3
- data/bin/hdp-catd +0 -3
- data/bin/hdp-cp +0 -3
- data/bin/hdp-du +0 -86
- data/bin/hdp-get +0 -3
- data/bin/hdp-kill +0 -3
- data/bin/hdp-kill-task +0 -3
- data/bin/hdp-ls +0 -11
- data/bin/hdp-mkdir +0 -2
- data/bin/hdp-mkdirp +0 -12
- data/bin/hdp-mv +0 -3
- data/bin/hdp-parts_to_keys.rb +0 -77
- data/bin/hdp-ps +0 -3
- data/bin/hdp-put +0 -3
- data/bin/hdp-rm +0 -32
- data/bin/hdp-sort +0 -40
- data/bin/hdp-stream +0 -40
- data/bin/hdp-stream-flat +0 -22
- data/bin/hdp-stream2 +0 -39
- data/bin/hdp-sync +0 -17
- data/bin/hdp-wc +0 -67
- data/bin/wu-flow +0 -10
- data/bin/wu-map +0 -17
- data/bin/wu-red +0 -17
- data/bin/wukong +0 -17
- data/data/CREDITS.md +0 -355
- data/data/graph/airfares.tsv +0 -2174
- data/data/text/gift_of_the_magi.txt +0 -225
- data/data/text/jabberwocky.txt +0 -36
- data/data/text/rectification_of_names.txt +0 -33
- data/data/twitter/a_atsigns_b.tsv +0 -64
- data/data/twitter/a_follows_b.tsv +0 -53
- data/data/twitter/tweet.tsv +0 -167
- data/data/twitter/twitter_user.tsv +0 -55
- data/data/wikipedia/dbpedia-sentences.tsv +0 -1000
- data/docpages/INSTALL.textile +0 -92
- data/docpages/LICENSE.textile +0 -107
- data/docpages/README-elastic_map_reduce.textile +0 -377
- data/docpages/README-performance.textile +0 -90
- data/docpages/README-wulign.textile +0 -65
- data/docpages/UsingWukong-part1-get_ready.textile +0 -17
- data/docpages/UsingWukong-part2-ThinkingBigData.textile +0 -75
- data/docpages/UsingWukong-part3-parsing.textile +0 -138
- data/docpages/_config.yml +0 -39
- data/docpages/avro/avro_notes.textile +0 -56
- data/docpages/avro/performance.textile +0 -36
- data/docpages/avro/tethering.textile +0 -19
- data/docpages/bigdata-tips.textile +0 -143
- data/docpages/code/api_response_example.txt +0 -20
- data/docpages/code/parser_skeleton.rb +0 -38
- data/docpages/diagrams/MapReduceDiagram.graffle +0 -0
- data/docpages/favicon.ico +0 -0
- data/docpages/gem.css +0 -16
- data/docpages/hadoop-tips.textile +0 -83
- data/docpages/index.textile +0 -92
- data/docpages/intro.textile +0 -8
- data/docpages/moreinfo.textile +0 -174
- data/docpages/news.html +0 -24
- data/docpages/pig/PigLatinExpressionsList.txt +0 -122
- data/docpages/pig/PigLatinReferenceManual.txt +0 -1640
- data/docpages/pig/commandline_params.txt +0 -26
- data/docpages/pig/cookbook.html +0 -481
- data/docpages/pig/images/hadoop-logo.jpg +0 -0
- data/docpages/pig/images/instruction_arrow.png +0 -0
- data/docpages/pig/images/pig-logo.gif +0 -0
- data/docpages/pig/piglatin_ref1.html +0 -1103
- data/docpages/pig/piglatin_ref2.html +0 -14340
- data/docpages/pig/setup.html +0 -505
- data/docpages/pig/skin/basic.css +0 -166
- data/docpages/pig/skin/breadcrumbs.js +0 -237
- data/docpages/pig/skin/fontsize.js +0 -166
- data/docpages/pig/skin/getBlank.js +0 -40
- data/docpages/pig/skin/getMenu.js +0 -45
- data/docpages/pig/skin/images/chapter.gif +0 -0
- data/docpages/pig/skin/images/chapter_open.gif +0 -0
- data/docpages/pig/skin/images/current.gif +0 -0
- data/docpages/pig/skin/images/external-link.gif +0 -0
- data/docpages/pig/skin/images/header_white_line.gif +0 -0
- data/docpages/pig/skin/images/page.gif +0 -0
- data/docpages/pig/skin/images/pdfdoc.gif +0 -0
- data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
- data/docpages/pig/skin/print.css +0 -54
- data/docpages/pig/skin/profile.css +0 -181
- data/docpages/pig/skin/screen.css +0 -587
- data/docpages/pig/tutorial.html +0 -1059
- data/docpages/pig/udf.html +0 -1509
- data/docpages/tutorial.textile +0 -283
- data/docpages/usage.textile +0 -195
- data/docpages/wutils.textile +0 -263
- data/examples/dataflow/complex.rb +0 -11
- data/examples/dataflow/donuts.rb +0 -13
- data/examples/tiny_count/jabberwocky_output.tsv +0 -92
- data/examples/word_count.rb +0 -48
- data/examples/workflow/fiddle.rb +0 -24
- data/lib/away/escapement.rb +0 -129
- data/lib/away/exe.rb +0 -11
- data/lib/away/experimental.rb +0 -5
- data/lib/away/from_file.rb +0 -52
- data/lib/away/job.rb +0 -56
- data/lib/away/job/rake_compat.rb +0 -17
- data/lib/away/registry.rb +0 -79
- data/lib/away/runner.rb +0 -276
- data/lib/away/runner/execute.rb +0 -121
- data/lib/away/script.rb +0 -161
- data/lib/away/script/hadoop_command.rb +0 -240
- data/lib/away/source/file_list_source.rb +0 -15
- data/lib/away/source/looper.rb +0 -18
- data/lib/away/task.rb +0 -219
- data/lib/hanuman/action.rb +0 -21
- data/lib/hanuman/chain.rb +0 -4
- data/lib/hanuman/graphviz.rb +0 -74
- data/lib/hanuman/resource.rb +0 -6
- data/lib/hanuman/slot.rb +0 -87
- data/lib/hanuman/slottable.rb +0 -220
- data/lib/wukong/bad_record.rb +0 -15
- data/lib/wukong/event.rb +0 -44
- data/lib/wukong/local_runner.rb +0 -55
- data/lib/wukong/mapred.rb +0 -3
- data/lib/wukong/universe.rb +0 -48
- data/lib/wukong/widget/filter.rb +0 -81
- data/lib/wukong/widget/gibberish.rb +0 -123
- data/lib/wukong/widget/monitor.rb +0 -26
- data/lib/wukong/widget/reducer.rb +0 -66
- data/lib/wukong/widget/stringifier.rb +0 -50
- data/lib/wukong/workflow.rb +0 -22
- data/lib/wukong/workflow/command.rb +0 -42
- data/old/config/emr-example.yaml +0 -48
- data/old/examples/README.txt +0 -17
- data/old/examples/contrib/jeans/README.markdown +0 -165
- data/old/examples/contrib/jeans/data/normalized_sizes +0 -3
- data/old/examples/contrib/jeans/data/orders.tsv +0 -1302
- data/old/examples/contrib/jeans/data/sizes +0 -3
- data/old/examples/contrib/jeans/normalize.rb +0 -20
- data/old/examples/contrib/jeans/sizes.rb +0 -55
- data/old/examples/corpus/bnc_word_freq.rb +0 -44
- data/old/examples/corpus/bucket_counter.rb +0 -47
- data/old/examples/corpus/dbpedia_abstract_to_sentences.rb +0 -86
- data/old/examples/corpus/sentence_bigrams.rb +0 -53
- data/old/examples/corpus/sentence_coocurrence.rb +0 -66
- data/old/examples/corpus/stopwords.rb +0 -138
- data/old/examples/corpus/words_to_bigrams.rb +0 -53
- data/old/examples/emr/README.textile +0 -110
- data/old/examples/emr/dot_wukong_dir/credentials.json +0 -7
- data/old/examples/emr/dot_wukong_dir/emr.yaml +0 -69
- data/old/examples/emr/dot_wukong_dir/emr_bootstrap.sh +0 -33
- data/old/examples/emr/elastic_mapreduce_example.rb +0 -28
- data/old/examples/network_graph/adjacency_list.rb +0 -74
- data/old/examples/network_graph/breadth_first_search.rb +0 -72
- data/old/examples/network_graph/gen_2paths.rb +0 -68
- data/old/examples/network_graph/gen_multi_edge.rb +0 -112
- data/old/examples/network_graph/gen_symmetric_links.rb +0 -64
- data/old/examples/pagerank/README.textile +0 -6
- data/old/examples/pagerank/gen_initial_pagerank_graph.pig +0 -57
- data/old/examples/pagerank/pagerank.rb +0 -72
- data/old/examples/pagerank/pagerank_initialize.rb +0 -42
- data/old/examples/pagerank/run_pagerank.sh +0 -21
- data/old/examples/sample_records.rb +0 -33
- data/old/examples/server_logs/apache_log_parser.rb +0 -15
- data/old/examples/server_logs/nook.rb +0 -48
- data/old/examples/server_logs/nook/faraday_dummy_adapter.rb +0 -94
- data/old/examples/server_logs/user_agent.rb +0 -40
- data/old/examples/simple_word_count.rb +0 -82
- data/old/examples/size.rb +0 -61
- data/old/examples/stats/avg_value_frequency.rb +0 -86
- data/old/examples/stats/binning_percentile_estimator.rb +0 -140
- data/old/examples/stats/data/avg_value_frequency.tsv +0 -3
- data/old/examples/stats/rank_and_bin.rb +0 -173
- data/old/examples/stupidly_simple_filter.rb +0 -40
- data/old/examples/word_count.rb +0 -75
- data/old/graph/graphviz_builder.rb +0 -580
- data/old/graph_easy/Attributes.pm +0 -4181
- data/old/graph_easy/Graphviz.pm +0 -2232
- data/old/wukong.rb +0 -18
- data/old/wukong/and_pig.rb +0 -38
- data/old/wukong/bad_record.rb +0 -18
- data/old/wukong/datatypes.rb +0 -24
- data/old/wukong/datatypes/enum.rb +0 -127
- data/old/wukong/datatypes/fake_types.rb +0 -17
- data/old/wukong/decorator.rb +0 -28
- data/old/wukong/encoding/asciize.rb +0 -108
- data/old/wukong/extensions.rb +0 -16
- data/old/wukong/extensions/array.rb +0 -18
- data/old/wukong/extensions/blank.rb +0 -93
- data/old/wukong/extensions/class.rb +0 -189
- data/old/wukong/extensions/date_time.rb +0 -53
- data/old/wukong/extensions/emittable.rb +0 -69
- data/old/wukong/extensions/enumerable.rb +0 -79
- data/old/wukong/extensions/hash.rb +0 -167
- data/old/wukong/extensions/hash_keys.rb +0 -16
- data/old/wukong/extensions/hash_like.rb +0 -150
- data/old/wukong/extensions/hashlike_class.rb +0 -47
- data/old/wukong/extensions/module.rb +0 -2
- data/old/wukong/extensions/pathname.rb +0 -27
- data/old/wukong/extensions/string.rb +0 -65
- data/old/wukong/extensions/struct.rb +0 -17
- data/old/wukong/extensions/symbol.rb +0 -11
- data/old/wukong/filename_pattern.rb +0 -74
- data/old/wukong/helper.rb +0 -7
- data/old/wukong/helper/stopwords.rb +0 -195
- data/old/wukong/helper/tokenize.rb +0 -35
- data/old/wukong/logger.rb +0 -38
- data/old/wukong/periodic_monitor.rb +0 -72
- data/old/wukong/schema.rb +0 -269
- data/old/wukong/script.rb +0 -286
- data/old/wukong/script/avro_command.rb +0 -5
- data/old/wukong/script/cassandra_loader_script.rb +0 -40
- data/old/wukong/script/emr_command.rb +0 -168
- data/old/wukong/script/hadoop_command.rb +0 -237
- data/old/wukong/script/local_command.rb +0 -41
- data/old/wukong/store.rb +0 -10
- data/old/wukong/store/base.rb +0 -27
- data/old/wukong/store/cassandra.rb +0 -10
- data/old/wukong/store/cassandra/streaming.rb +0 -75
- data/old/wukong/store/cassandra/struct_loader.rb +0 -21
- data/old/wukong/store/cassandra_model.rb +0 -91
- data/old/wukong/store/chh_chunked_flat_file_store.rb +0 -37
- data/old/wukong/store/chunked_flat_file_store.rb +0 -48
- data/old/wukong/store/conditional_store.rb +0 -57
- data/old/wukong/store/factory.rb +0 -8
- data/old/wukong/store/flat_file_store.rb +0 -89
- data/old/wukong/store/key_store.rb +0 -51
- data/old/wukong/store/null_store.rb +0 -15
- data/old/wukong/store/read_thru_store.rb +0 -22
- data/old/wukong/store/tokyo_tdb_key_store.rb +0 -33
- data/old/wukong/store/tyrant_rdb_key_store.rb +0 -57
- data/old/wukong/store/tyrant_tdb_key_store.rb +0 -20
- data/old/wukong/streamer.rb +0 -30
- data/old/wukong/streamer/accumulating_reducer.rb +0 -83
- data/old/wukong/streamer/base.rb +0 -126
- data/old/wukong/streamer/counting_reducer.rb +0 -25
- data/old/wukong/streamer/filter.rb +0 -20
- data/old/wukong/streamer/instance_streamer.rb +0 -15
- data/old/wukong/streamer/json_streamer.rb +0 -21
- data/old/wukong/streamer/line_streamer.rb +0 -12
- data/old/wukong/streamer/list_reducer.rb +0 -31
- data/old/wukong/streamer/rank_and_bin_reducer.rb +0 -145
- data/old/wukong/streamer/record_streamer.rb +0 -14
- data/old/wukong/streamer/reducer.rb +0 -11
- data/old/wukong/streamer/set_reducer.rb +0 -14
- data/old/wukong/streamer/struct_streamer.rb +0 -48
- data/old/wukong/streamer/summing_reducer.rb +0 -29
- data/old/wukong/streamer/uniq_by_last_reducer.rb +0 -51
- data/old/wukong/typed_struct.rb +0 -12
- data/spec/away/encoding_spec.rb +0 -32
- data/spec/away/exe_spec.rb +0 -20
- data/spec/away/flow_spec.rb +0 -82
- data/spec/away/graph_spec.rb +0 -6
- data/spec/away/job_spec.rb +0 -15
- data/spec/away/rake_compat_spec.rb +0 -9
- data/spec/away/script_spec.rb +0 -81
- data/spec/hanuman/graphviz_spec.rb +0 -29
- data/spec/hanuman/slot_spec.rb +0 -2
- data/spec/support/examples_helper.rb +0 -10
- data/spec/support/streamer_test_helpers.rb +0 -6
- data/spec/support/wukong_widget_helpers.rb +0 -66
- data/spec/wukong/processor_spec.rb +0 -109
- data/spec/wukong/widget/filter_spec.rb +0 -99
- data/spec/wukong/widget/stringifier_spec.rb +0 -51
- data/spec/wukong/workflow/command_spec.rb +0 -5
data/README.md
CHANGED
@@ -1,422 +1,283 @@
|
|
1
|
-
# Wukong
|
2
|
-
|
3
|
-
Wukong is a toolkit for rapid, agile development of
|
4
|
-
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
40
|
-
|
41
|
-
|
42
|
-
|
43
|
-
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
48
|
-
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
* **get shit done**: sometimes ugly tasks require ugly solutions. Shelling out to the hadoop process monitor and parsing its output is acceptable if it is robust and obviates the need for a native protocol handler.
|
74
|
-
* **be clever early, boring late**: magic in service of having a terse language for assembling a graph is great. However, the assembled graph should be stomic and largely free of any conditional logic or dependencies.
|
75
|
-
- for example, the data flow `split` statement allows you to set a condition on each branch. The assembled graph, however, is typically a `fanout` stage followed by `filter` stages.
|
76
|
-
- the graph language has some helpers to refer to graph stages. The compiled graph uses explicit mostly-readable but unambiguous static handles.
|
77
|
-
- some stages offer light polymorphism -- for example, `select` accepts either a regexp or block. This is handled at the factory level, and the resulting stage is free of conditional logic.
|
78
|
-
* **no lock-in**: needless to say, Wukong works seamlessly with the Infochimps platform, making robust, reliable massive-scale dataflows amazingly simple. However, wukong flows are not tied to the cloud: they project to Hadoop, Flume or any of the other open-source components that power our platform.
|
79
|
-
|
80
|
-
__________________________________________________________________________
|
81
|
-
|
82
|
-
<a name="stage"></a>
|
83
|
-
## Stage
|
84
|
-
|
85
|
-
A graph is composed of `stage`s.
|
86
|
-
|
87
|
-
* *desc* (alias `description`)
|
88
|
-
|
89
|
-
#### Actions
|
90
|
-
|
91
|
-
each action
|
92
|
-
|
93
|
-
* the default action is `call`
|
94
|
-
* all stages respond to `nothing`, and like ze goggles, do `nothing`.
|
95
|
-
|
96
|
-
__________________________________________________________________________
|
97
|
-
|
98
|
-
<a name="dataflows"></a>
|
99
|
-
## Workflows
|
100
|
-
|
101
|
-
Wukong workflows work somewhat differently than you may be familiar with Rake and such.
|
102
|
-
|
103
|
-
In wukong, a stage corresponds to a resource; you can then act on that resource.
|
104
|
-
|
105
|
-
Consider first compiling a c program:
|
106
|
-
|
107
|
-
to build the executable, run `cc -o cake eggs.o milk.o flour.o sugar.o -I./include -L./lib`
|
108
|
-
to build files like '{file}.o', run `cc -c -o {file}.o {file}.c -I./include`
|
109
|
-
|
110
|
-
In this case, you define the *steps*, implying the resources.
|
111
|
-
|
112
|
-
|
113
|
-
Something rake can't do (but we should be able to): make it so I can define a dependency that runs **last**
|
114
|
-
|
115
|
-
### Defining jobs
|
116
|
-
|
117
|
-
Wukong.job(:launch) do
|
118
|
-
task :aim do
|
119
|
-
#...
|
120
|
-
end
|
121
|
-
task :enter do
|
122
|
-
end
|
123
|
-
task :commit do
|
124
|
-
# ...
|
125
|
-
end
|
126
|
-
end
|
127
|
-
|
128
|
-
Wukong.job(:recall) do
|
129
|
-
task :smash_with_rock do
|
130
|
-
#...
|
131
|
-
end
|
132
|
-
task :reprogram do
|
133
|
-
# ...
|
134
|
-
end
|
135
|
-
end
|
136
|
-
|
137
|
-
* stages construct resources
|
138
|
-
- these have default actions
|
139
|
-
* hanuman tracks defined order
|
140
|
-
|
141
|
-
* do steps run in order, or is dependency explicit?
|
142
|
-
* what about idempotency?
|
143
|
-
|
144
|
-
* `task` vs `action` vs `resource`; `job`, `task`, `group`, `namespace`.
|
145
|
-
|
146
|
-
### documenting
|
147
|
-
|
148
|
-
Inline option (`:desc` or `:description`?)
|
149
|
-
|
150
|
-
```ruby
|
151
|
-
task :foo, :description => "pity the foo" do
|
152
|
-
# ...
|
153
|
-
end
|
154
|
-
```
|
155
|
-
|
156
|
-
DSL method option
|
157
|
-
|
158
|
-
```ruby
|
159
|
-
task :foo do
|
160
|
-
description "pity the foo"
|
161
|
-
# ...
|
162
|
-
end
|
163
|
-
```
|
164
|
-
|
165
|
-
### actions
|
166
|
-
|
167
|
-
default action:
|
168
|
-
|
169
|
-
```ruby
|
170
|
-
script 'nukes/launch_codes.rb' do
|
171
|
-
# ...
|
172
|
-
end
|
173
|
-
```
|
1
|
+
# Wukong
|
2
|
+
|
3
|
+
Wukong is a toolkit for rapid, agile development of data applications
|
4
|
+
at any scale.
|
5
|
+
|
6
|
+
The core concept in Wukong is a **Processor**. Wukong processors are
|
7
|
+
simple Ruby classes that do one thing and do it well. This codebase
|
8
|
+
implements processors and other core Wukong classes and provides a
|
9
|
+
tool, `wu-local`, to run and combine processors on the command-line.
|
10
|
+
|
11
|
+
Wukong's larger theme is *powerful black boxes, beautiful glue*. The
|
12
|
+
Wukong ecosystem consists of other tools which run Wukong processors
|
13
|
+
in various topologies across a variety of different backends. Code
|
14
|
+
written in Wukong can be easily ported between environments and
|
15
|
+
frameworks: local command-line scripts on your laptop instantly turn
|
16
|
+
into powerful jobs running in Hadoop.
|
17
|
+
|
18
|
+
Here is a list of various other projects which you may also want to
|
19
|
+
peruse when trying to understand the full Wukong experience:
|
20
|
+
|
21
|
+
* <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
|
22
|
+
* <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
|
23
|
+
* <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
|
24
|
+
|
25
|
+
For a more holistic perspective also see the Infochimps Platform
|
26
|
+
Community Edition (**FIXME: link to this**) which combines all the
|
27
|
+
Wukong tools together into a jetpack which fits comfortably over the
|
28
|
+
shoulders of developers.
|
29
|
+
|
30
|
+
<a name="processors"></a>
|
31
|
+
## Writing Simple Processors
|
32
|
+
|
33
|
+
The fundamental unit of computation in Wukong is the processor. A
|
34
|
+
processor is Ruby class which
|
35
|
+
|
36
|
+
* subclasses `Wukong::Processor` (use the `Wukong.processor` method as sugar for this)
|
37
|
+
* defines a `process` method which takes an input record, does something, and calls `yield` on the output
|
38
|
+
|
39
|
+
Here's a processor that reverses all each input record:
|
40
|
+
|
41
|
+
```ruby
|
42
|
+
# in string_reverser.rb
|
43
|
+
Wukong.processor(:string_reverser) do
|
44
|
+
def process string
|
45
|
+
yield string.reverse
|
46
|
+
end
|
47
|
+
end
|
48
|
+
```
|
49
|
+
|
50
|
+
When you're developing your application, run your processors on the
|
51
|
+
command line on flat input files using `wu-local`:
|
52
|
+
|
53
|
+
```
|
54
|
+
$ cat novel.txt
|
55
|
+
It was the best of times, it was the worst of times.
|
56
|
+
...
|
57
|
+
|
58
|
+
$ cat novel.txt | wu-local string_reverser.rb
|
59
|
+
.semit fo tsrow eht saw ti ,semit fo tseb eht saw tI
|
60
|
+
```
|
61
|
+
|
62
|
+
You can use yield as often (or never) as you need. Here's a more
|
63
|
+
complicated example to illustrate:
|
64
|
+
|
65
|
+
```ruby
|
66
|
+
# in processors.rb
|
67
|
+
|
68
|
+
Wukong.processor(:tokenizer) do
|
69
|
+
def process line
|
70
|
+
line.split.each { |token| yield token }
|
71
|
+
end
|
72
|
+
end
|
174
73
|
|
175
|
-
|
176
|
-
|
177
|
-
```ruby
|
178
|
-
script 'nukes/launch_codes.rb', :undo do
|
179
|
-
# ...
|
180
|
-
end
|
181
|
-
```
|
182
|
-
|
183
|
-
<a name="file-name-templates"></a>
|
184
|
-
### File name templates
|
185
|
-
|
186
|
-
* *timestamp*: timestamp of run. everything in this invocation will have the same timestamp.
|
187
|
-
* *user*: username; `ENV['USER']` by default
|
188
|
-
* *sources*: basenames of job inputs, minus extension, non-`\w` replaced with '_', joined by '-', max 50 chars.
|
189
|
-
* *job*: job flow name
|
190
|
-
|
191
|
-
<a name="job-versioning-of-clobbered"></a>
|
192
|
-
### versioning of clobbered files
|
193
|
-
|
194
|
-
* when files are generated or removed, relocate to a timestamped location
|
195
|
-
- a file `/path/to/file.txt` is relocated to `~/.wukong/backups/path/to/file.txt.wukong-20110102120011` where `20110102120011` is the [job timestamp](#file-naming)
|
196
|
-
- accepts a `max_size` param
|
197
|
-
- raises if it can't write to directory -- must explicitly say `--safe_file_ops=false`
|
198
|
-
|
199
|
-
<a name="job-running"></a>
|
200
|
-
### running
|
201
|
-
|
202
|
-
* `clobber` -- run, but clear all dependencies
|
203
|
-
* `undo` --
|
204
|
-
* `clean` --
|
205
|
-
|
206
|
-
|
207
|
-
### Utility and Filesystem tasks
|
208
|
-
|
209
|
-
The primitives correspond heavily with Rake and Chef. However, they extend them in many ways, don't cover all their functionality in many ways, and are incompatible in several ways.
|
210
|
-
|
211
|
-
### Configuration
|
212
|
-
|
213
|
-
|
214
|
-
#### Commandline args
|
215
|
-
|
216
|
-
* handled by configliere: `nukes launch --launch_code=GLG20`
|
217
|
-
|
218
|
-
* TODO: configliere needs context-specific config vars, so I only get information about the `launch` action in the `nukes` job when I run `nukes launch --help`
|
219
|
-
|
220
|
-
|
221
|
-
|
222
|
-
__________________________________________________________________________
|
223
|
-
|
224
|
-
<a name="dataflows"></a>
|
225
|
-
## Dataflows
|
226
|
-
|
227
|
-
|
228
|
-
Data flows
|
229
|
-
|
230
|
-
* you can have a consumer connect to a provider, or vice versa
|
231
|
-
- producer binds to a port, consumers connect to it: pub/sub
|
232
|
-
- consumers open a port, producer connects to many: megaphone
|
74
|
+
Wukong.processor(:starts_with) do
|
233
75
|
|
234
|
-
|
235
|
-
|
236
|
-
|
237
|
-
|
238
|
-
|
239
|
-
|
240
|
-
|
241
|
-
|
242
|
-
|
243
|
-
|
244
|
-
|
245
|
-
|
246
|
-
|
247
|
-
|
248
|
-
|
249
|
-
|
250
|
-
|
251
|
-
|
252
|
-
|
253
|
-
|
254
|
-
|
255
|
-
|
256
|
-
|
257
|
-
|
258
|
-
|
259
|
-
|
260
|
-
|
261
|
-
|
262
|
-
|
263
|
-
|
264
|
-
|
265
|
-
|
266
|
-
|
76
|
+
field :letter, String, :default => 'a'
|
77
|
+
|
78
|
+
def process word
|
79
|
+
yield word if word =~ Regexp.new("^#{letter}", true)
|
80
|
+
end
|
81
|
+
end
|
82
|
+
```
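The two yielding patterns above (many outputs per input, and zero or one output per input) can be tried without Wukong. In this sketch `tokenize` and `starts_with` are hypothetical top-level methods that only mirror the bodies of the processors. It also swaps the `true` flag for `Regexp::IGNORECASE` and escapes the letter, both of which are assumptions added for clarity.

```ruby
# Plain-Ruby stand-ins (hypothetical) for the two processors above.

# Yields once per whitespace-separated token: many outputs per input.
def tokenize(line)
  line.split.each { |token| yield token }
end

# Yields the word only when it starts with the letter: zero or one output.
def starts_with(word, letter = 'a')
  pattern = Regexp.new("^#{Regexp.escape(letter)}", Regexp::IGNORECASE)
  yield word if word =~ pattern
end

matches = []
tokenize("It was the best of times, it was the worst of times.") do |token|
  starts_with(token, 't') { |word| matches << word }
end

p matches
```

Note that splitting on whitespace leaves punctuation attached to tokens, so a real tokenizer would likely strip it first.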
|
83
|
+
|
84
|
+
Let's start by running the `tokenizer`. We've defined two processors
|
85
|
+
in the file `processors.rb` and neither one is named `processors` so
|
86
|
+
we have to tell `wu-local` the name of the processor we want to run
|
87
|
+
explicitly.
|
88
|
+
|
89
|
+
```
|
90
|
+
$ cat novel.txt | wu-local processors.rb --run=tokenizer
|
91
|
+
It
|
92
|
+
was
|
93
|
+
the
|
94
|
+
best
|
95
|
+
of
|
96
|
+
times,
|
97
|
+
...
|
98
|
+
```
|
99
|
+
|
100
|
+
You can combine the output of one processor with another right in the
|
101
|
+
shell. Let's add the `starts_with` filter and also pass in the
|
102
|
+
*field* `letter`, defined in that processor:
|
103
|
+
|
104
|
+
```
|
105
|
+
$ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local processors.rb --run=starts_with --letter=t
|
106
|
+
the
|
107
|
+
times
|
108
|
+
the
|
109
|
+
times
|
110
|
+
...
|
111
|
+
```
|
112
|
+
|
113
|
+
Wanting to match on a regular expression is such a common task that
|
114
|
+
Wukong has a built-in "widget" called `regexp` that you can use
|
115
|
+
directly:
|
116
|
+
|
117
|
+
```
|
118
|
+
$ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local regexp --match='^t'
|
119
|
+
```
|
120
|
+
|
121
|
+
There are many more simple <a href="#widgets">widgets</a> like these.
|
122
|
+
|
123
|
+
<a name="flows"></a>
|
124
|
+
## Combining Processors into Dataflows
|
125
|
+
|
126
|
+
Combining processors which each do one thing well together in a chain
|
127
|
+
mimics the tried-and-true UNIX pipeline. Wukong lets you define
|
128
|
+
these pipelines more formally as a dataflow. Here's the dataflow for
|
129
|
+
the last example:
|
130
|
+
|
131
|
+
```ruby
|
132
|
+
# in find_t_words.rb
|
133
|
+
Wukong.dataflow(:find_t_words) do
|
134
|
+
tokenizer > regexp(match: /^t/)
|
135
|
+
end
|
136
|
+
```
|
137
|
+
|
138
|
+
The DSL Wukong provides for combining processors is designed to
|
139
|
+
be similar to the process of developing them on the command line. You
|
140
|
+
can run this dataflow directly
|
141
|
+
|
142
|
+
```
|
143
|
+
$ cat novel.txt | wu-local find_t_words.rb
|
144
|
+
the
|
145
|
+
times
|
146
|
+
the
|
147
|
+
times
|
148
|
+
...
|
149
|
+
```
|
150
|
+
|
151
|
+
and it works exactly like before.
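As a rough mental model of what chaining stages means, the sketch below wires two stages together in plain Ruby. The `run_chain` helper and the lambda-based stages are hypothetical and are not how Wukong implements its dataflow DSL. They only illustrate that each record emitted by one stage becomes the input of the next.

```ruby
# Hypothetical helper: pushes a record through the stages in order.
# Each stage is a lambda taking (record, emit) and calling emit 0..n times.
def run_chain(stages, record, &sink)
  return sink.call(record) if stages.empty?

  first, *rest = stages
  # Whatever the first stage emits is fed to the remaining stages.
  first.call(record, ->(out) { run_chain(rest, out, &sink) })
end

tokenizer = ->(line, emit) { line.split.each { |token| emit.call(token) } }
regexp_t  = ->(word, emit) { emit.call(word) if word =~ /^t/ }

results = []
line = "It was the best of times, it was the worst of times."
run_chain([tokenizer, regexp_t], line) { |record| results << record }

p results
```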
|
152
|
+
|
153
|
+
<a name="serialization"></a>
|
154
|
+
## Serialization
|
155
|
+
|
156
|
+
The `process` method for a Processor must accept a String argument and
|
157
|
+
yield a String (or something that responds to `to_s` appropriately).
|
158
|
+
|
159
|
+
**Coming Soon:** The ability to define `consumes` and `emits` to
|
160
|
+
automatically handle serialization and deserialization.
|
161
|
+
|
162
|
+
<a name="widgets"></a>
|
163
|
+
## Widgets
|
164
|
+
|
165
|
+
Wukong has a number of built-in widgets that are useful for
|
166
|
+
scaffolding your dataflows.
|
167
|
+
|
168
|
+
### Serializers
|
169
|
+
|
170
|
+
Serializers are widgets which don't change the semantic meaning of a
|
171
|
+
record, merely its representation. Here's a list:
|
172
|
+
|
173
|
+
* `to_json`, `from_json` for turning records into JSON or parsing JSON into records
|
174
|
+
* `to_tsv`, `from_tsv` for turning Array records into TSV or parsing TSV into Array records
|
175
|
+
* `pretty` for pretty printing JSON inputs
|
176
|
+
|
177
|
+
When you're writing processors that are capable of running in
|
178
|
+
isolation you'll want to ensure that you deserialize and serialize
|
179
|
+
records on the way in and out, like this:
|
180
|
+
|
181
|
+
```ruby
|
182
|
+
Wukong.processor(:on_my_own) do
|
183
|
+
def process json
|
184
|
+
obj = MultiJson.load(json)
|
267
185
|
|
268
|
-
|
269
|
-
|
270
|
-
wukong count_followers.rb users.json followers_histogram.tsv
|
186
|
+
# do something with obj...
|
271
187
|
|
272
|
-
|
273
|
-
|
274
|
-
|
275
|
-
|
276
|
-
|
277
|
-
|
278
|
-
|
279
|
-
|
280
|
-
|
281
|
-
|
282
|
-
|
283
|
-
|
284
|
-
|
285
|
-
|
286
|
-
|
287
|
-
|
288
|
-
|
289
|
-
|
290
|
-
|
291
|
-
|
292
|
-
|
293
|
-
|
294
|
-
|
295
|
-
|
296
|
-
|
297
|
-
|
298
|
-
|
299
|
-
|
300
|
-
|
301
|
-
|
302
|
-
|
303
|
-
|
304
|
-
|
305
|
-
|
306
|
-
|
307
|
-
|
308
|
-
|
309
|
-
|
310
|
-
|
311
|
-
|
312
|
-
|
313
|
-
|
314
|
-
|
315
|
-
|
316
|
-
|
317
|
-
|
318
|
-
|
319
|
-
|
320
|
-
|
321
|
-
|
322
|
-
|
323
|
-
|
324
|
-
|
325
|
-
|
326
|
-
|
327
|
-
|
328
|
-
|
329
|
-
|
330
|
-
|
331
|
-
|
332
|
-
|
333
|
-
|
334
|
-
|
335
|
-
|
336
|
-
|
337
|
-
|
338
|
-
|
339
|
-
|
340
|
-
|
341
|
-
|
342
|
-
|
343
|
-
|
344
|
-
|
345
|
-
|
346
|
-
|
347
|
-
|
348
|
-
|
349
|
-
|
350
|
-
|
351
|
-
|
352
|
-
|
353
|
-
|
354
|
-
|
355
|
-
|
356
|
-
|
357
|
-
|
358
|
-
|
359
|
-
|
360
|
-
|
361
|
-
|
362
|
-
|
363
|
-
|
364
|
-
|
365
|
-
|
366
|
-
|
367
|
-
|
368
|
-
|
369
|
-
<a name="refs-dataflow"></a>
|
370
|
-
### Dataflow
|
371
|
-
|
372
|
-
* **Esper**
|
373
|
-
|
374
|
-
- Must read: [StreamSQL Event Processing with Esper](http://www.igvita.com/2011/05/27/streamsql-event-processing-with-esper/)
|
375
|
-
- [Esper docs](http://esper.codehaus.org/esper-4.5.0/doc/reference/en/html_single/index.html#epl_clauses)
|
376
|
-
- [Esper EPL Reference](http://esper.codehaus.org/esper-4.5.0/doc/reference/en/html_single/index.html#epl_clauses)
|
377
|
-
|
378
|
-
* **Storm**
|
379
|
-
|
380
|
-
- [A Storm is coming: more details and plans for release](http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html)
|
381
|
-
- [Storm: distributed and fault-tolerant realtime computation](http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation) -- slideshare presentation
|
382
|
-
- [Storm: the Hadoop of Realtime Processing](http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce)
|
383
|
-
|
384
|
-
* **Kafka**: LinkedIn's high-throughput messaging queue
|
385
|
-
|
386
|
-
- [Kafka's Design: Why we built this](http://incubator.apache.org/kafka/design.html)
|
387
|
-
|
388
|
-
* **ZeroMQ**: tcp sockets like you think they should work
|
389
|
-
|
390
|
-
- [ZeroMQ: A Modern & Fast Networking Stack](http://www.igvita.com/2010/09/03/zeromq-modern-fast-networking-stack/)
|
391
|
-
- [ZeroMQ Guide](http://zguide.zeromq.org/page:all)
|
392
|
-
- [ZeroMQ: An Introduction](http://nichol.as/zeromq-an-introduction)
|
393
|
-
- [Routing with Ruby & ZeroMQ Devices](http://www.igvita.com/2010/11/17/routing-with-ruby-zeromq-devices/)
|
394
|
-
- [Ruby bindings for ZeroMQ](http://zeromq.github.com/rbzmq/) and the [Ruby-FFI bindings](http://www.zeromq.org/bindings:ruby-ffi)
|
395
|
-
- [Learn ruby ZeroMQ](https://github.com/andrewvc/learn-ruby-zeromq) by @andrewvc
|
396
|
-
|
397
|
-
* **Other**
|
398
|
-
|
399
|
-
- [Infopipes: An abstraction for multimedia streaming](http://web.cecs.pdx.edu/~black/publications/Mms062%203rd%20try.pdf) Black et al 2002
|
400
|
-
- [Yahoo Pipes](http://pipes.yahoo.com/pipes/)
|
401
|
-
- [Yahoo Pipes wikipedia page](http://en.wikipedia.org/wiki/Yahoo_Pipes)
|
402
|
-
- [Streambase](http://www.streambase.com/products/streambasecep/faqs/) -- Why is is so goddamn hard to find out anything real about a project once it gets an enterprise version? Seriously, the consistent fundamental brokenness of enterprise product is astonishing. It's like they take inspiration from shitty major-label band websites but layer a whiteout of [web jargon bullshit](http://www.dack.com/web/bullshit.html) in place of inessential flash animation. Anyway I think Streambase is kinda similar but who the hell can tell.
|
403
|
-
- [Scribe](http://www.cloudera.com/blog/2008/11/02/configuring-and-using-scribe-for-hadoop-log-collection/)
|
404
|
-
- [Splunk Case Study](http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/)
|
405
|
-
|
406
|
-
<a name="refs-dataflow"></a>
|
407
|
-
### Messaging Queue
|
408
|
-
|
409
|
-
- [DripDrop](https://github.com/andrewvc/dripdrop) - a message passing library with a unified API abstracting HTTP, zeroMQ and websockets.
|
410
|
-
|
411
|
-
|
412
|
-
<a name="refs-dataflow"></a>
|
413
|
-
### Data Processing
|
414
|
-
|
415
|
-
* **Hadoop**
|
416
|
-
|
417
|
-
- [Hadoop]()
|
418
|
-
|
419
|
-
|
420
|
-
* **Spark/Mesos**
|
421
|
-
|
422
|
-
- [Mesos](http://www.mesosproject.org/)
|
188
|
+
yield MultiJson.dump(obj)
|
189
|
+
end
|
190
|
+
end
|
191
|
+
```
|
192
|
+
|
193
|
+
For processors which will only run inside a data flow, you can
|
194
|
+
optimize by not doing any (de)serialization except at the very
|
195
|
+
beginning and at the end:
|
196
|
+
|
197
|
+
```ruby
|
198
|
+
Wukong.dataflow(:complicated) do
|
199
|
+
from_json > proc_1 > proc_2 > proc_3 ... proc_n > to_json
|
200
|
+
end
|
201
|
+
```
|
202
|
+
|
203
|
+
In this approach, no serialization is done between processors.
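The idea of parsing once at the start and serializing once at the end can be sketched in plain Ruby. This example uses the standard library's `JSON` module rather than `MultiJson`, and the `run_flow` helper and the two stages are hypothetical names chosen for illustration. The intermediate stages pass Ruby Hashes to each other and never touch a String.

```ruby
require 'json'

# Hypothetical stages that operate on Ruby Hashes, not Strings.
add_length = ->(rec) { rec.merge('length' => rec['word'].length) }
upcase     = ->(rec) { rec.merge('word' => rec['word'].upcase) }

# Hypothetical runner: deserialize once, run every stage, serialize once.
def run_flow(json_line, stages)
  record = JSON.parse(json_line)               # like from_json at the start
  stages.each { |stage| record = stage.call(record) }
  JSON.generate(record)                        # like to_json at the end
end

output = run_flow('{"word":"times"}', [add_length, upcase])
puts output
```

With n stages this does one parse and one dump in total, instead of one of each per stage.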
|
204
|
+
|
205
|
+
### General Purpose
|
206
|
+
|
207
|
+
There are several general purpose processors which implement common
|
208
|
+
patterns on input and output data. These are most useful within the
|
209
|
+
context of a dataflow definition.
|
210
|
+
|
211
|
+
* `null` does nothing: it discards every record
|
212
|
+
* `map` performs some block on each input record
|
213
|
+
* `flatten` flattens the input array
|
214
|
+
* `filter`, `select`, `reject` only let certain records through based on a block
|
215
|
+
* `regexp`, `not_regexp` only pass records matching (or not matching) a regular expression
|
216
|
+
* `limit` only let some number of records pass
|
217
|
+
* `logger` sends events to the local log stream
|
218
|
+
* `extract` extracts some part of each input event
|
219
|
+
|
220
|
+
Some of these widgets can be used directly, perhaps with some
|
221
|
+
arguments:
|
222
|
+
|
223
|
+
```ruby
|
224
|
+
Wukong.dataflow(:log_everything) do
|
225
|
+
proc_1 > proc_2 > ... > logger
|
226
|
+
end
|
227
|
+
|
228
|
+
Wukong.dataflow(:log_everything_important) do
|
229
|
+
proc_1 > proc_2 > ... > regexp(match: /important/i) > logger
|
230
|
+
end
|
231
|
+
```
|
232
|
+
|
233
|
+
Other widgets require a block to define their action:
|
234
|
+
|
235
|
+
```ruby
|
236
|
+
Wukong.dataflow(:log_everything_important) do
|
237
|
+
parser > select { |record| record.priority =~ /important/i } > logger
|
238
|
+
end
|
239
|
+
```
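The block passed to a filtering widget behaves much like the block given to Ruby's own `Enumerable#select`: a record passes through when the block returns a truthy value. The sketch below shows that predicate on an in-memory array. The `Record` struct and the sample data are hypothetical, standing in for whatever a parser stage would emit.

```ruby
# Hypothetical record type; a real parser stage would define its own.
Record = Struct.new(:priority, :message)

records = [
  Record.new("IMPORTANT", "disk full"),
  Record.new("low",       "heartbeat"),
  Record.new("Important!", "cpu hot")
]

# Same predicate as in the dataflow above, applied with Enumerable#select.
important = records.select { |record| record.priority =~ /important/i }

puts important.map(&:message)
```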
|
240
|
+
|
241
|
+
### Reducers
|
242
|
+
|
243
|
+
There is a selection of widgets that perform aggregate operations like
|
244
|
+
counting, sorting, and summing.
|
245
|
+
|
246
|
+
* `count` emits a final count of all input records
|
247
|
+
* `sort` can sort input streams
|
248
|
+
* `group` will group records by some extracted part and give a count of each group's size
|
249
|
+
* `moments` will emit more complicated statistics (mean, std. dev.) on the group given some other value to measure
|
250
|
+
|
251
|
+
Here's an example of sorting data right on the command line:
|
252
|
+
|
253
|
+
```
|
254
|
+
$ head tokens.txt | wu-local sort
|
255
|
+
abhor
|
256
|
+
abide
|
257
|
+
abide
|
258
|
+
able
|
259
|
+
able
|
260
|
+
able
|
261
|
+
about
|
262
|
+
...
|
263
|
+
```
|
264
|
+
|
265
|
+
Try adding group:
|
266
|
+
|
267
|
+
```
|
268
|
+
$ head tokens.txt | wu-local sort | wu-local group
|
269
|
+
{:group=>"abhor", :count=>1}
|
270
|
+
{:group=>"abide", :count=>2}
|
271
|
+
{:group=>"able", :count=>3}
|
272
|
+
{:group=>"about", :count=>3}
|
273
|
+
{:group=>"above", :count=>1}
|
274
|
+
...
|
275
|
+
```
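Piping `sort` into `group` suggests that grouping counts runs of adjacent equal records, which is why the input is sorted first. Under that assumption, the same result can be reproduced in plain Ruby with `Enumerable#chunk_while`. The token list here is made-up sample data chosen to match the counts shown above.

```ruby
# Made-up sample standing in for the first ten lines of tokens.txt.
tokens = %w[able abide about abhor able abide able about about above]

# Sort first so equal tokens become adjacent, then count each run.
groups = tokens.sort
               .chunk_while { |a, b| a == b }
               .map { |run| { group: run.first, count: run.size } }

groups.each { |g| p g }
```

Without the sort, equal tokens that are not adjacent would land in separate runs and be counted separately.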
|
276
|
+
|
277
|
+
You can also use these within a more complicated dataflow:
|
278
|
+
|
279
|
+
```ruby
|
280
|
+
Wukong.dataflow(:word_count) do
|
281
|
+
tokenize > remove_stopwords > sort > group
|
282
|
+
end
|
283
|
+
```
|