wukong 3.0.0.pre → 3.0.0.pre2
- data/.gitignore +46 -33
- data/.gitmodules +3 -0
- data/.rspec +1 -1
- data/.travis.yml +8 -1
- data/.yardopts +0 -13
- data/Guardfile +4 -6
- data/{LICENSE.textile → LICENSE.md} +43 -55
- data/README-old.md +422 -0
- data/README.md +279 -418
- data/Rakefile +21 -5
- data/TODO.md +6 -6
- data/bin/wu-clean-encoding +31 -0
- data/bin/wu-lign +2 -2
- data/bin/wu-local +69 -0
- data/bin/wu-server +70 -0
- data/examples/Gemfile +38 -0
- data/examples/README.md +9 -0
- data/examples/dataflow/apache_log_line.rb +64 -25
- data/examples/dataflow/fibonacci_series.rb +101 -0
- data/examples/dataflow/parse_apache_logs.rb +37 -7
- data/examples/{dataflow.rb → dataflow/scraper_macro_flow.rb} +0 -0
- data/examples/dataflow/simple.rb +4 -4
- data/examples/geo.rb +4 -0
- data/examples/geo/geo_grids.numbers +0 -0
- data/examples/geo/geolocated.rb +331 -0
- data/examples/geo/quadtile.rb +69 -0
- data/examples/geo/spec/geolocated_spec.rb +247 -0
- data/examples/geo/tile_fetcher.rb +77 -0
- data/examples/graph/minimum_spanning_tree.rb +61 -61
- data/examples/jabberwocky.txt +36 -0
- data/examples/models/wikipedia.rb +20 -0
- data/examples/munging/Gemfile +8 -0
- data/examples/munging/airline_flights/airline.rb +57 -0
- data/examples/munging/airline_flights/airline_flights.rake +83 -0
- data/{lib/wukong/settings.rb → examples/munging/airline_flights/airplane.rb} +0 -0
- data/examples/munging/airline_flights/airport.rb +211 -0
- data/examples/munging/airline_flights/airport_id_unification.rb +129 -0
- data/examples/munging/airline_flights/airport_ok_chars.rb +4 -0
- data/examples/munging/airline_flights/flight.rb +156 -0
- data/examples/munging/airline_flights/models.rb +4 -0
- data/examples/munging/airline_flights/parse.rb +26 -0
- data/examples/munging/airline_flights/reconcile_airports.rb +142 -0
- data/examples/munging/airline_flights/route.rb +35 -0
- data/examples/munging/airline_flights/tasks.rake +83 -0
- data/examples/munging/airline_flights/timezone_fixup.rb +62 -0
- data/examples/munging/airline_flights/topcities.rb +167 -0
- data/examples/munging/airports/40_wbans.txt +40 -0
- data/examples/munging/airports/filter_weather_reports.rb +37 -0
- data/examples/munging/airports/join.pig +31 -0
- data/examples/munging/airports/to_tsv.rb +33 -0
- data/examples/munging/airports/usa_wbans.pig +19 -0
- data/examples/munging/airports/usa_wbans.txt +2157 -0
- data/examples/munging/airports/wbans.pig +19 -0
- data/examples/munging/airports/wbans.txt +2310 -0
- data/examples/munging/geo/geo_json.rb +54 -0
- data/examples/munging/geo/geo_models.rb +69 -0
- data/examples/munging/geo/geonames_models.rb +78 -0
- data/examples/munging/geo/iso_codes.rb +172 -0
- data/examples/munging/geo/reconcile_countries.rb +124 -0
- data/examples/munging/geo/tasks.rake +71 -0
- data/examples/munging/rake_helper.rb +62 -0
- data/examples/munging/weather/.gitignore +1 -0
- data/examples/munging/weather/Gemfile +4 -0
- data/examples/munging/weather/Rakefile +28 -0
- data/examples/munging/weather/extract_ish.rb +13 -0
- data/examples/munging/weather/models/weather.rb +119 -0
- data/examples/munging/weather/utils/noaa_downloader.rb +46 -0
- data/examples/munging/wikipedia/README.md +34 -0
- data/examples/munging/wikipedia/Rakefile +193 -0
- data/examples/munging/wikipedia/articles/extract_articles-parsed.rb +79 -0
- data/examples/munging/wikipedia/articles/extract_articles-templated.rb +136 -0
- data/examples/munging/wikipedia/articles/textualize_articles.rb +54 -0
- data/examples/munging/wikipedia/articles/verify_structure.rb +43 -0
- data/examples/munging/wikipedia/articles/wp2txt-LICENSE.txt +22 -0
- data/examples/munging/wikipedia/articles/wp2txt_article.rb +259 -0
- data/examples/munging/wikipedia/articles/wp2txt_utils.rb +452 -0
- data/examples/munging/wikipedia/dbpedia/dbpedia_common.rb +4 -0
- data/examples/munging/wikipedia/dbpedia/dbpedia_extract_geocoordinates.rb +78 -0
- data/examples/munging/wikipedia/dbpedia/extract_links.rb +193 -0
- data/examples/munging/wikipedia/dbpedia/sameas_extractor.rb +20 -0
- data/examples/munging/wikipedia/n1_subuniverse/n1_nodes.pig +18 -0
- data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb +21 -0
- data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb.old +27 -0
- data/examples/munging/wikipedia/pagelinks/augment_pagelinks.pig +29 -0
- data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb +14 -0
- data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb.old +25 -0
- data/examples/munging/wikipedia/pagelinks/undirect_pagelinks.pig +29 -0
- data/examples/munging/wikipedia/pageviews/augment_pageviews.pig +32 -0
- data/examples/munging/wikipedia/pageviews/extract_pageviews.rb +85 -0
- data/examples/munging/wikipedia/pig_style_guide.md +25 -0
- data/examples/munging/wikipedia/redirects/redirects_page_metadata.pig +19 -0
- data/examples/munging/wikipedia/subuniverse/sub_articles.pig +23 -0
- data/examples/munging/wikipedia/subuniverse/sub_page_metadata.pig +24 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_from.pig +22 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_into.pig +22 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_within.pig +26 -0
- data/examples/munging/wikipedia/subuniverse/sub_pageviews.pig +29 -0
- data/examples/munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig +24 -0
- data/examples/munging/wikipedia/utils/get_namespaces.rb +86 -0
- data/examples/munging/wikipedia/utils/munging_utils.rb +68 -0
- data/examples/munging/wikipedia/utils/namespaces.json +1 -0
- data/examples/rake_helper.rb +85 -0
- data/examples/server_logs/geo_ip_mapping/munge_geolite.rb +82 -0
- data/examples/server_logs/logline.rb +95 -0
- data/examples/server_logs/models.rb +66 -0
- data/examples/server_logs/page_counts.pig +48 -0
- data/examples/server_logs/server_logs-01-parse-script.rb +13 -0
- data/examples/server_logs/server_logs-02-histograms-full.rb +33 -0
- data/examples/server_logs/server_logs-02-histograms-mapper.rb +14 -0
- data/{old/examples/server_logs/breadcrumbs.rb → examples/server_logs/server_logs-03-breadcrumbs-full.rb} +26 -30
- data/examples/server_logs/server_logs-04-page_page_edges-full.rb +40 -0
- data/examples/string_reverser.rb +26 -0
- data/examples/text/pig_latin.rb +2 -2
- data/examples/text/regional_flavor/README.md +14 -0
- data/examples/text/regional_flavor/article_wordbags.pig +39 -0
- data/examples/text/regional_flavor/j01-article_wordbags.rb +4 -0
- data/examples/text/regional_flavor/simple_pig_script.pig +27 -0
- data/examples/word_count/accumulator.rb +26 -0
- data/examples/word_count/tokenizer.rb +13 -0
- data/examples/word_count/word_count.rb +6 -0
- data/examples/workflow/cherry_pie.dot +97 -0
- data/examples/workflow/cherry_pie.png +0 -0
- data/examples/workflow/cherry_pie.rb +61 -26
- data/lib/hanuman.rb +34 -7
- data/lib/hanuman/graph.rb +55 -31
- data/lib/hanuman/graphvizzer.rb +199 -178
- data/lib/hanuman/graphvizzer/gv_models.rb +161 -0
- data/lib/hanuman/graphvizzer/gv_presenter.rb +97 -0
- data/lib/hanuman/link.rb +35 -0
- data/lib/hanuman/registry.rb +46 -0
- data/lib/hanuman/stage.rb +76 -32
- data/lib/wukong.rb +23 -24
- data/lib/wukong/boot.rb +87 -0
- data/lib/wukong/configuration.rb +8 -0
- data/lib/wukong/dataflow.rb +45 -78
- data/lib/wukong/driver.rb +99 -0
- data/lib/wukong/emitter.rb +22 -0
- data/lib/wukong/model/faker.rb +24 -24
- data/lib/wukong/model/flatpack_parser/flat.rb +60 -0
- data/lib/wukong/model/flatpack_parser/flatpack.rb +4 -0
- data/lib/wukong/model/flatpack_parser/lang.rb +46 -0
- data/lib/wukong/model/flatpack_parser/parser.rb +55 -0
- data/lib/wukong/model/flatpack_parser/tokens.rb +130 -0
- data/lib/wukong/processor.rb +60 -114
- data/lib/wukong/spec_helpers.rb +81 -0
- data/lib/wukong/spec_helpers/integration_driver.rb +144 -0
- data/lib/wukong/spec_helpers/integration_driver_matchers.rb +219 -0
- data/lib/wukong/spec_helpers/processor_helpers.rb +95 -0
- data/lib/wukong/spec_helpers/processor_methods.rb +108 -0
- data/lib/wukong/spec_helpers/shared_examples.rb +15 -0
- data/lib/wukong/spec_helpers/spec_driver.rb +28 -0
- data/lib/wukong/spec_helpers/spec_driver_matchers.rb +195 -0
- data/lib/wukong/version.rb +2 -1
- data/lib/wukong/widget/filters.rb +311 -0
- data/lib/wukong/widget/processors.rb +156 -0
- data/lib/wukong/widget/reducers.rb +7 -0
- data/lib/wukong/widget/reducers/accumulator.rb +73 -0
- data/lib/wukong/widget/reducers/bin.rb +318 -0
- data/lib/wukong/widget/reducers/count.rb +61 -0
- data/lib/wukong/widget/reducers/group.rb +85 -0
- data/lib/wukong/widget/reducers/group_concat.rb +70 -0
- data/lib/wukong/widget/reducers/moments.rb +72 -0
- data/lib/wukong/widget/reducers/sort.rb +130 -0
- data/lib/wukong/widget/serializers.rb +287 -0
- data/lib/wukong/widget/sink.rb +10 -52
- data/lib/wukong/widget/source.rb +7 -113
- data/lib/wukong/widget/utils.rb +46 -0
- data/lib/wukong/widgets.rb +6 -0
- data/spec/examples/dataflow/fibonacci_series_spec.rb +18 -0
- data/spec/examples/dataflow/parsing_spec.rb +12 -11
- data/spec/examples/dataflow/simple_spec.rb +32 -6
- data/spec/examples/dataflow/telegram_spec.rb +36 -36
- data/spec/examples/graph/minimum_spanning_tree_spec.rb +30 -31
- data/spec/examples/munging/airline_flights/identifiers_spec.rb +16 -0
- data/spec/examples/munging/airline_flights_spec.rb +202 -0
- data/spec/examples/text/pig_latin_spec.rb +13 -16
- data/spec/examples/workflow/cherry_pie_spec.rb +34 -4
- data/spec/hanuman/graph_spec.rb +27 -2
- data/spec/hanuman/hanuman_spec.rb +10 -0
- data/spec/hanuman/registry_spec.rb +123 -0
- data/spec/hanuman/stage_spec.rb +61 -7
- data/spec/spec_helper.rb +29 -19
- data/spec/support/hanuman_test_helpers.rb +14 -12
- data/spec/support/shared_context_for_reducers.rb +37 -0
- data/spec/support/shared_examples_for_builders.rb +101 -0
- data/spec/support/shared_examples_for_shortcuts.rb +57 -0
- data/spec/support/wukong_test_helpers.rb +37 -11
- data/spec/wukong/dataflow_spec.rb +77 -55
- data/spec/wukong/local_runner_spec.rb +24 -24
- data/spec/wukong/model/faker_spec.rb +132 -131
- data/spec/wukong/runner_spec.rb +8 -8
- data/spec/wukong/widget/filters_spec.rb +61 -0
- data/spec/wukong/widget/processors_spec.rb +126 -0
- data/spec/wukong/widget/reducers/bin_spec.rb +92 -0
- data/spec/wukong/widget/reducers/count_spec.rb +11 -0
- data/spec/wukong/widget/reducers/group_spec.rb +20 -0
- data/spec/wukong/widget/reducers/moments_spec.rb +36 -0
- data/spec/wukong/widget/reducers/sort_spec.rb +26 -0
- data/spec/wukong/widget/serializers_spec.rb +92 -0
- data/spec/wukong/widget/sink_spec.rb +15 -15
- data/spec/wukong/widget/source_spec.rb +65 -41
- data/spec/wukong/wukong_spec.rb +10 -0
- data/wukong.gemspec +17 -10
- metadata +359 -335
- data/.document +0 -5
- data/VERSION +0 -1
- data/bin/hdp-bin +0 -44
- data/bin/hdp-bzip +0 -23
- data/bin/hdp-cat +0 -3
- data/bin/hdp-catd +0 -3
- data/bin/hdp-cp +0 -3
- data/bin/hdp-du +0 -86
- data/bin/hdp-get +0 -3
- data/bin/hdp-kill +0 -3
- data/bin/hdp-kill-task +0 -3
- data/bin/hdp-ls +0 -11
- data/bin/hdp-mkdir +0 -2
- data/bin/hdp-mkdirp +0 -12
- data/bin/hdp-mv +0 -3
- data/bin/hdp-parts_to_keys.rb +0 -77
- data/bin/hdp-ps +0 -3
- data/bin/hdp-put +0 -3
- data/bin/hdp-rm +0 -32
- data/bin/hdp-sort +0 -40
- data/bin/hdp-stream +0 -40
- data/bin/hdp-stream-flat +0 -22
- data/bin/hdp-stream2 +0 -39
- data/bin/hdp-sync +0 -17
- data/bin/hdp-wc +0 -67
- data/bin/wu-flow +0 -10
- data/bin/wu-map +0 -17
- data/bin/wu-red +0 -17
- data/bin/wukong +0 -17
- data/data/CREDITS.md +0 -355
- data/data/graph/airfares.tsv +0 -2174
- data/data/text/gift_of_the_magi.txt +0 -225
- data/data/text/jabberwocky.txt +0 -36
- data/data/text/rectification_of_names.txt +0 -33
- data/data/twitter/a_atsigns_b.tsv +0 -64
- data/data/twitter/a_follows_b.tsv +0 -53
- data/data/twitter/tweet.tsv +0 -167
- data/data/twitter/twitter_user.tsv +0 -55
- data/data/wikipedia/dbpedia-sentences.tsv +0 -1000
- data/docpages/INSTALL.textile +0 -92
- data/docpages/LICENSE.textile +0 -107
- data/docpages/README-elastic_map_reduce.textile +0 -377
- data/docpages/README-performance.textile +0 -90
- data/docpages/README-wulign.textile +0 -65
- data/docpages/UsingWukong-part1-get_ready.textile +0 -17
- data/docpages/UsingWukong-part2-ThinkingBigData.textile +0 -75
- data/docpages/UsingWukong-part3-parsing.textile +0 -138
- data/docpages/_config.yml +0 -39
- data/docpages/avro/avro_notes.textile +0 -56
- data/docpages/avro/performance.textile +0 -36
- data/docpages/avro/tethering.textile +0 -19
- data/docpages/bigdata-tips.textile +0 -143
- data/docpages/code/api_response_example.txt +0 -20
- data/docpages/code/parser_skeleton.rb +0 -38
- data/docpages/diagrams/MapReduceDiagram.graffle +0 -0
- data/docpages/favicon.ico +0 -0
- data/docpages/gem.css +0 -16
- data/docpages/hadoop-tips.textile +0 -83
- data/docpages/index.textile +0 -92
- data/docpages/intro.textile +0 -8
- data/docpages/moreinfo.textile +0 -174
- data/docpages/news.html +0 -24
- data/docpages/pig/PigLatinExpressionsList.txt +0 -122
- data/docpages/pig/PigLatinReferenceManual.txt +0 -1640
- data/docpages/pig/commandline_params.txt +0 -26
- data/docpages/pig/cookbook.html +0 -481
- data/docpages/pig/images/hadoop-logo.jpg +0 -0
- data/docpages/pig/images/instruction_arrow.png +0 -0
- data/docpages/pig/images/pig-logo.gif +0 -0
- data/docpages/pig/piglatin_ref1.html +0 -1103
- data/docpages/pig/piglatin_ref2.html +0 -14340
- data/docpages/pig/setup.html +0 -505
- data/docpages/pig/skin/basic.css +0 -166
- data/docpages/pig/skin/breadcrumbs.js +0 -237
- data/docpages/pig/skin/fontsize.js +0 -166
- data/docpages/pig/skin/getBlank.js +0 -40
- data/docpages/pig/skin/getMenu.js +0 -45
- data/docpages/pig/skin/images/chapter.gif +0 -0
- data/docpages/pig/skin/images/chapter_open.gif +0 -0
- data/docpages/pig/skin/images/current.gif +0 -0
- data/docpages/pig/skin/images/external-link.gif +0 -0
- data/docpages/pig/skin/images/header_white_line.gif +0 -0
- data/docpages/pig/skin/images/page.gif +0 -0
- data/docpages/pig/skin/images/pdfdoc.gif +0 -0
- data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
- data/docpages/pig/skin/print.css +0 -54
- data/docpages/pig/skin/profile.css +0 -181
- data/docpages/pig/skin/screen.css +0 -587
- data/docpages/pig/tutorial.html +0 -1059
- data/docpages/pig/udf.html +0 -1509
- data/docpages/tutorial.textile +0 -283
- data/docpages/usage.textile +0 -195
- data/docpages/wutils.textile +0 -263
- data/examples/dataflow/complex.rb +0 -11
- data/examples/dataflow/donuts.rb +0 -13
- data/examples/tiny_count/jabberwocky_output.tsv +0 -92
- data/examples/word_count.rb +0 -48
- data/examples/workflow/fiddle.rb +0 -24
- data/lib/away/escapement.rb +0 -129
- data/lib/away/exe.rb +0 -11
- data/lib/away/experimental.rb +0 -5
- data/lib/away/from_file.rb +0 -52
- data/lib/away/job.rb +0 -56
- data/lib/away/job/rake_compat.rb +0 -17
- data/lib/away/registry.rb +0 -79
- data/lib/away/runner.rb +0 -276
- data/lib/away/runner/execute.rb +0 -121
- data/lib/away/script.rb +0 -161
- data/lib/away/script/hadoop_command.rb +0 -240
- data/lib/away/source/file_list_source.rb +0 -15
- data/lib/away/source/looper.rb +0 -18
- data/lib/away/task.rb +0 -219
- data/lib/hanuman/action.rb +0 -21
- data/lib/hanuman/chain.rb +0 -4
- data/lib/hanuman/graphviz.rb +0 -74
- data/lib/hanuman/resource.rb +0 -6
- data/lib/hanuman/slot.rb +0 -87
- data/lib/hanuman/slottable.rb +0 -220
- data/lib/wukong/bad_record.rb +0 -15
- data/lib/wukong/event.rb +0 -44
- data/lib/wukong/local_runner.rb +0 -55
- data/lib/wukong/mapred.rb +0 -3
- data/lib/wukong/universe.rb +0 -48
- data/lib/wukong/widget/filter.rb +0 -81
- data/lib/wukong/widget/gibberish.rb +0 -123
- data/lib/wukong/widget/monitor.rb +0 -26
- data/lib/wukong/widget/reducer.rb +0 -66
- data/lib/wukong/widget/stringifier.rb +0 -50
- data/lib/wukong/workflow.rb +0 -22
- data/lib/wukong/workflow/command.rb +0 -42
- data/old/config/emr-example.yaml +0 -48
- data/old/examples/README.txt +0 -17
- data/old/examples/contrib/jeans/README.markdown +0 -165
- data/old/examples/contrib/jeans/data/normalized_sizes +0 -3
- data/old/examples/contrib/jeans/data/orders.tsv +0 -1302
- data/old/examples/contrib/jeans/data/sizes +0 -3
- data/old/examples/contrib/jeans/normalize.rb +0 -20
- data/old/examples/contrib/jeans/sizes.rb +0 -55
- data/old/examples/corpus/bnc_word_freq.rb +0 -44
- data/old/examples/corpus/bucket_counter.rb +0 -47
- data/old/examples/corpus/dbpedia_abstract_to_sentences.rb +0 -86
- data/old/examples/corpus/sentence_bigrams.rb +0 -53
- data/old/examples/corpus/sentence_coocurrence.rb +0 -66
- data/old/examples/corpus/stopwords.rb +0 -138
- data/old/examples/corpus/words_to_bigrams.rb +0 -53
- data/old/examples/emr/README.textile +0 -110
- data/old/examples/emr/dot_wukong_dir/credentials.json +0 -7
- data/old/examples/emr/dot_wukong_dir/emr.yaml +0 -69
- data/old/examples/emr/dot_wukong_dir/emr_bootstrap.sh +0 -33
- data/old/examples/emr/elastic_mapreduce_example.rb +0 -28
- data/old/examples/network_graph/adjacency_list.rb +0 -74
- data/old/examples/network_graph/breadth_first_search.rb +0 -72
- data/old/examples/network_graph/gen_2paths.rb +0 -68
- data/old/examples/network_graph/gen_multi_edge.rb +0 -112
- data/old/examples/network_graph/gen_symmetric_links.rb +0 -64
- data/old/examples/pagerank/README.textile +0 -6
- data/old/examples/pagerank/gen_initial_pagerank_graph.pig +0 -57
- data/old/examples/pagerank/pagerank.rb +0 -72
- data/old/examples/pagerank/pagerank_initialize.rb +0 -42
- data/old/examples/pagerank/run_pagerank.sh +0 -21
- data/old/examples/sample_records.rb +0 -33
- data/old/examples/server_logs/apache_log_parser.rb +0 -15
- data/old/examples/server_logs/nook.rb +0 -48
- data/old/examples/server_logs/nook/faraday_dummy_adapter.rb +0 -94
- data/old/examples/server_logs/user_agent.rb +0 -40
- data/old/examples/simple_word_count.rb +0 -82
- data/old/examples/size.rb +0 -61
- data/old/examples/stats/avg_value_frequency.rb +0 -86
- data/old/examples/stats/binning_percentile_estimator.rb +0 -140
- data/old/examples/stats/data/avg_value_frequency.tsv +0 -3
- data/old/examples/stats/rank_and_bin.rb +0 -173
- data/old/examples/stupidly_simple_filter.rb +0 -40
- data/old/examples/word_count.rb +0 -75
- data/old/graph/graphviz_builder.rb +0 -580
- data/old/graph_easy/Attributes.pm +0 -4181
- data/old/graph_easy/Graphviz.pm +0 -2232
- data/old/wukong.rb +0 -18
- data/old/wukong/and_pig.rb +0 -38
- data/old/wukong/bad_record.rb +0 -18
- data/old/wukong/datatypes.rb +0 -24
- data/old/wukong/datatypes/enum.rb +0 -127
- data/old/wukong/datatypes/fake_types.rb +0 -17
- data/old/wukong/decorator.rb +0 -28
- data/old/wukong/encoding/asciize.rb +0 -108
- data/old/wukong/extensions.rb +0 -16
- data/old/wukong/extensions/array.rb +0 -18
- data/old/wukong/extensions/blank.rb +0 -93
- data/old/wukong/extensions/class.rb +0 -189
- data/old/wukong/extensions/date_time.rb +0 -53
- data/old/wukong/extensions/emittable.rb +0 -69
- data/old/wukong/extensions/enumerable.rb +0 -79
- data/old/wukong/extensions/hash.rb +0 -167
- data/old/wukong/extensions/hash_keys.rb +0 -16
- data/old/wukong/extensions/hash_like.rb +0 -150
- data/old/wukong/extensions/hashlike_class.rb +0 -47
- data/old/wukong/extensions/module.rb +0 -2
- data/old/wukong/extensions/pathname.rb +0 -27
- data/old/wukong/extensions/string.rb +0 -65
- data/old/wukong/extensions/struct.rb +0 -17
- data/old/wukong/extensions/symbol.rb +0 -11
- data/old/wukong/filename_pattern.rb +0 -74
- data/old/wukong/helper.rb +0 -7
- data/old/wukong/helper/stopwords.rb +0 -195
- data/old/wukong/helper/tokenize.rb +0 -35
- data/old/wukong/logger.rb +0 -38
- data/old/wukong/periodic_monitor.rb +0 -72
- data/old/wukong/schema.rb +0 -269
- data/old/wukong/script.rb +0 -286
- data/old/wukong/script/avro_command.rb +0 -5
- data/old/wukong/script/cassandra_loader_script.rb +0 -40
- data/old/wukong/script/emr_command.rb +0 -168
- data/old/wukong/script/hadoop_command.rb +0 -237
- data/old/wukong/script/local_command.rb +0 -41
- data/old/wukong/store.rb +0 -10
- data/old/wukong/store/base.rb +0 -27
- data/old/wukong/store/cassandra.rb +0 -10
- data/old/wukong/store/cassandra/streaming.rb +0 -75
- data/old/wukong/store/cassandra/struct_loader.rb +0 -21
- data/old/wukong/store/cassandra_model.rb +0 -91
- data/old/wukong/store/chh_chunked_flat_file_store.rb +0 -37
- data/old/wukong/store/chunked_flat_file_store.rb +0 -48
- data/old/wukong/store/conditional_store.rb +0 -57
- data/old/wukong/store/factory.rb +0 -8
- data/old/wukong/store/flat_file_store.rb +0 -89
- data/old/wukong/store/key_store.rb +0 -51
- data/old/wukong/store/null_store.rb +0 -15
- data/old/wukong/store/read_thru_store.rb +0 -22
- data/old/wukong/store/tokyo_tdb_key_store.rb +0 -33
- data/old/wukong/store/tyrant_rdb_key_store.rb +0 -57
- data/old/wukong/store/tyrant_tdb_key_store.rb +0 -20
- data/old/wukong/streamer.rb +0 -30
- data/old/wukong/streamer/accumulating_reducer.rb +0 -83
- data/old/wukong/streamer/base.rb +0 -126
- data/old/wukong/streamer/counting_reducer.rb +0 -25
- data/old/wukong/streamer/filter.rb +0 -20
- data/old/wukong/streamer/instance_streamer.rb +0 -15
- data/old/wukong/streamer/json_streamer.rb +0 -21
- data/old/wukong/streamer/line_streamer.rb +0 -12
- data/old/wukong/streamer/list_reducer.rb +0 -31
- data/old/wukong/streamer/rank_and_bin_reducer.rb +0 -145
- data/old/wukong/streamer/record_streamer.rb +0 -14
- data/old/wukong/streamer/reducer.rb +0 -11
- data/old/wukong/streamer/set_reducer.rb +0 -14
- data/old/wukong/streamer/struct_streamer.rb +0 -48
- data/old/wukong/streamer/summing_reducer.rb +0 -29
- data/old/wukong/streamer/uniq_by_last_reducer.rb +0 -51
- data/old/wukong/typed_struct.rb +0 -12
- data/spec/away/encoding_spec.rb +0 -32
- data/spec/away/exe_spec.rb +0 -20
- data/spec/away/flow_spec.rb +0 -82
- data/spec/away/graph_spec.rb +0 -6
- data/spec/away/job_spec.rb +0 -15
- data/spec/away/rake_compat_spec.rb +0 -9
- data/spec/away/script_spec.rb +0 -81
- data/spec/hanuman/graphviz_spec.rb +0 -29
- data/spec/hanuman/slot_spec.rb +0 -2
- data/spec/support/examples_helper.rb +0 -10
- data/spec/support/streamer_test_helpers.rb +0 -6
- data/spec/support/wukong_widget_helpers.rb +0 -66
- data/spec/wukong/processor_spec.rb +0 -109
- data/spec/wukong/widget/filter_spec.rb +0 -99
- data/spec/wukong/widget/stringifier_spec.rb +0 -51
- data/spec/wukong/workflow/command_spec.rb +0 -5
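
Note on the hunk that follows: the deleted `Gibberish` class picks each character by scanning a table of `[cumulative_frequency, char]` pairs for the first entry whose cumulative frequency exceeds a random offset. The sketch below illustrates that lookup on a toy table. It is not the gem's code: `TOY_CDF` and `sample_from_cdf` are hypothetical names, the offset is passed in explicitly so the behaviour is easy to check, and unlike the removed method it returns `nil` instead of raising when no entry qualifies.

```ruby
# Illustrative sketch only; TOY_CDF and sample_from_cdf are hypothetical
# names and are not part of wukong.
# Each entry is [cumulative_probability, symbol], in ascending order.
TOY_CDF = [[0.5, "a"], [0.8, "b"], [1.0, "c"]].freeze

# Returns the first symbol whose cumulative probability exceeds `offset`,
# skipping any symbol listed in `except`. Returns nil if nothing qualifies.
def sample_from_cdf(cdf, offset, except = [])
  pair = cdf.find { |freq, sym| offset < freq && !except.include?(sym) }
  pair && pair.last
end

rng  = Random.new(42) # fixed seed so repeated runs give the same word
word = Array.new(5) { sample_from_cdf(TOY_CDF, rng.rand) }.join
puts word
```

With this toy table, offsets in `[0, 0.5)` give `"a"`, offsets in `[0.5, 0.8)` give `"b"`, and the rest give `"c"`, so each symbol is drawn in proportion to the width of its interval.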
@@ -1,123 +0,0 @@
-module Wukong
-  module Widget
-    #
-    # Uses bigram frequencies taken from the British National Corpus
-    #
-    class Gibberish < Wukong::Source
-      register_source
-      include CappedGenerator
-
-      field :min_wordlen, Integer, :default => 3, :doc => "Shortest word length to generate"
-      field :max_wordlen, Integer, :default => 10, :doc => "Largest word length to generate"
-      field :max_linelen, Integer, :default => 80, :doc => "Max line length to generate; might be shorter by up to `min_wordlen` chars."
-      attr_accessor :rng
-
-      def setup
-        self.rng = Random.new
-        super
-      end
-
-      def next_item
-        funny_line
-      end
-
-      def random_from_cdf cdf, except=[]
-        offset = rng.rand
-        cdf.find{|freq, c| (offset < freq) && (not except.include?(c)) }.last
-      end
-
-      def funny_charseq(len = 5)
-        char = random_from_cdf(BIGRAM_CDF_HSH['^'])
-        word = [char]
-        len.times do
-          next_char = random_from_cdf(BIGRAM_CDF_HSH[char], [])
-          break if next_char == '$'
-          word << next_char
-        end
-        word.join
-      end
-
-      def funny_word(cap)
-        cap = [max_wordlen, cap].min
-        word = ''
-        100.times do
-          word = funny_charseq(cap)
-          return word if word.length >= min_wordlen && word.length <= cap
-        end
-        word
-      end
-
-      def funny_line
-        min_wordlen = 3
-        min_linelen = max_linelen - min_wordlen
-        line = ""
-        max_linelen.times do
-          line << funny_word(max_linelen-line.length) << " "
-          return line.strip if line.strip.length > min_linelen
-        end
-      end
-
-      UNIGRAM_CDF = [
-        [0.1083564980, "^"],
-        [0.2034715900, "e"],
-        [0.2836505016, "a"],
-        [0.3608587709, "i"],
-        [0.4254609572, "r"],
-        [0.4878711345, "n"],
-        [0.5491446929, "o"],
-        [0.6081589386, "t"],
-        [0.6604080108, "s"],
-        [0.7104964048, "l"],
-        [0.7482015140, "c"],
-        [0.7807004566, "u"],
-        [0.8107323736, "d"],
-        [0.8387591070, "m"],
-        [0.8645706232, "h"],
-        [0.8900984199, "p"],
-        [0.9124467010, "g"],
-        [0.9302838994, "y"],
-        [0.9480503129, "b"],
-        [0.9604887171, "f"],
-        [0.9713402644, "k"],
-        [0.9806072759, "v"],
-        [0.9890852104, "w"],
-        [0.9930607653, "z"],
-        [0.9962141717, "x"],
-        [0.9983324949, "j"],
-        [1.0000000000, "q"],
-      ] # UNIGRAM_CDF
-
-      BIGRAM_CDF_HSH = {
-        '^' => [ [0.1036218381, "s"], [0.1906712502, "c"], [0.2682537643, "p"], [0.3317216047, "a"], [0.3936580351, "m"], [0.4519587055, "b"], [0.5062006040, "d"], [0.5582364154, "t"], [0.6060742359, "r"], [0.6465762814, "h"], [0.6851613871, "f"], [0.7236340466, "e"], [0.7599541648, "g"], [0.7954175502, "i"], [0.8298153741, "l"], [0.8625318598, "u"], [0.8880517895, "o"], [0.9135503009, "w"], [0.9378761593, "n"], [0.9575114053, "k"], [0.9749726916, "v"], [0.9844556534, "j"], [0.9891783932, "y"], [0.9938957784, "q"], [0.9981580244, "z"], [1.0000000000, "x"], ],
-        'a' => [ [0.1475494063, "n"], [0.2731292197, "t"], [0.3885347092, "l"], [0.5028764536, "r"], [0.5712094131, "$"], [0.6238394686, "s"], [0.6733868342, "c"], [0.7157919112, "m"], [0.7521546266, "b"], [0.7858181792, "d"], [0.8182153686, "i"], [0.8462128503, "p"], [0.8732913142, "g"], [0.8945879254, "u"], [0.9098711204, "y"], [0.9241484612, "k"], [0.9383968565, "v"], [0.9508578706, "e"], [0.9609453582, "f"], [0.9688908829, "w"], [0.9766844440, "h"], [0.9836313508, "z"], [0.9897822579, "a"], [0.9943628746, "x"], [0.9964758921, "o"], [0.9984297096, "j"], [1.0000000000, "q"], ],
-        'b' => [ [0.1548283857, "a"], [0.3006433493, "l"], [0.4444335587, "e"], [0.5720257340, "o"], [0.6942947650, "i"], [0.7909604520, "r"], [0.8741386630, "u"], [0.8985663434, "$"], [0.9228307371, "b"], [0.9419352732, "s"], [0.9592109990, "y"], [0.9649586885, "t"], [0.9702165181, "c"], [0.9754416903, "h"], [0.9804382613, "d"], [0.9838672806, "m"], [0.9871983279, "j"], [0.9895496555, "p"], [0.9918683257, "n"], [0.9939257372, "f"], [0.9954606316, "w"], [0.9969628686, "v"], [0.9981385324, "g"], [0.9991182522, "k"], [0.9996081121, "z"], [0.9998367134, "x"], [1.0000000000, "q"], ],
-        'c' => [ [0.1561567107, "o"], [0.3072354045, "h"], [0.4498199612, "a"], [0.5342381436, "e"], [0.6138245160, "$"], [0.6810697689, "i"], [0.7428984704, "t"], [0.8037423445, "k"], [0.8614778568, "r"], [0.9075339304, "u"], [0.9451574185, "l"], [0.9638537531, "y"], [0.9812728895, "c"], [0.9879666390, "s"], [0.9894131044, "n"], [0.9908287939, "m"], [0.9920906041, "q"], [0.9933062506, "d"], [0.9944757332, "p"], [0.9955067245, "z"], [0.9964300003, "g"], [0.9972763364, "f"], [0.9980303450, "x"], [0.9987535777, "b"], [0.9993998707, "v"], [0.9998307328, "w"], [1.0000000000, "j"], ],
-        'd' => [ [0.2735843589, "$"], [0.4794729623, "e"], [0.6243696992, "i"], [0.6957748112, "a"], [0.7647456579, "o"], [0.8067270725, "r"], [0.8421205154, "u"], [0.8667529607, "l"], [0.8871737408, "d"], [0.9052375341, "y"], [0.9209829795, "s"], [0.9341589227, "g"], [0.9428334074, "n"], [0.9506578312, "w"], [0.9579219876, "m"], [0.9650122679, "h"], [0.9720445896, "b"], [0.9768938004, "f"], [0.9816657329, "c"], [0.9861478719, "t"], [0.9899731458, "v"], [0.9937597805, "p"], [0.9968702305, "j"], [0.9984351152, "z"], [0.9994590522, "k"], [0.9997295261, "q"], [1.0000000000, "x"], ],
-        'e' => [ [0.1940659046, "$"], [0.3647626484, "r"], [0.4678651164, "n"], [0.5406383057, "d"], [0.6084582820, "s"], [0.6653470299, "l"], [0.7148364586, "t"], [0.7581282711, "a"], [0.7938804641, "c"], [0.8234045409, "m"], [0.8492808081, "e"], [0.8678553564, "p"], [0.8835079971, "i"], [0.8977820342, "x"], [0.9112447692, "g"], [0.9229689997, "u"], [0.9342845291, "v"], [0.9454719582, "f"], [0.9560981859, "y"], [0.9666695133, "o"], [0.9757646370, "b"], [0.9844937597, "w"], [0.9899593739, "h"], [0.9938206839, "k"], [0.9962789903, "q"], [0.9987189967, "z"], [1.0000000000, "j"], ],
-        'f' => [ [0.1662002052, "i"], [0.2924246665, "o"], [0.4152439593, "e"], [0.5190316261, "a"], [0.6096650807, "l"], [0.6926485680, "f"], [0.7724134714, "r"], [0.8517119134, "u"], [0.9112790372, "$"], [0.9520011195, "t"], [0.9698199459, "y"], [0.9751842523, "s"], [0.9794290512, "c"], [0.9824610505, "m"], [0.9849332960, "n"], [0.9873588954, "b"], [0.9897378487, "w"], [0.9918835712, "h"], [0.9938893554, "d"], [0.9958484933, "p"], [0.9973878160, "g"], [0.9983673850, "k"], [0.9990670772, "j"], [0.9994868924, "v"], [0.9997667693, "x"], [0.9999067077, "z"], [1.0000000000, "q"], ],
-        'g' => [ [0.2294771276, "$"], [0.3781089361, "e"], [0.4743756166, "a"], [0.5630354639, "i"], [0.6496183602, "r"], [0.7118749675, "o"], [0.7740277273, "l"], [0.8293005867, "h"], [0.8777714315, "u"], [0.9066929747, "g"], [0.9340827665, "n"], [0.9539176489, "y"], [0.9633677761, "s"], [0.9717534659, "m"], [0.9797237655, "t"], [0.9840853627, "w"], [0.9880834934, "b"], [0.9911469962, "c"], [0.9936912612, "d"], [0.9960278311, "f"], [0.9975855444, "p"], [0.9983644011, "k"], [0.9988836388, "j"], [0.9993769147, "v"], [0.9996365336, "x"], [0.9998701906, "z"], [1.0000000000, "q"], ],
-        'h' => [ [0.1921235417, "e"], [0.3644435453, "a"], [0.5135657608, "o"], [0.6515836087, "i"], [0.7645154764, "$"], [0.8146424798, "y"], [0.8527884551, "u"], [0.8857868591, "r"], [0.9185379999, "t"], [0.9357115562, "l"], [0.9473778857, "n"], [0.9580551622, "m"], [0.9658776721, "w"], [0.9721491672, "s"], [0.9781284420, "h"], [0.9828714006, "b"], [0.9868051341, "c"], [0.9895699867, "f"], [0.9922674040, "p"], [0.9947400364, "d"], [0.9965832715, "k"], [0.9982691573, "g"], [0.9991233394, "v"], [0.9994829950, "j"], [0.9997752152, "q"], [0.9999550430, "z"], [1.0000000000, "x"], ],
-        'i' => [ [0.1973908665, "n"], [0.3137122288, "s"], [0.4121934907, "c"], [0.4902570808, "t"], [0.5606405603, "o"], [0.6174823966, "a"], [0.6742490851, "l"], [0.7181354315, "$"], [0.7553711928, "d"], [0.7895559513, "e"], [0.8198179919, "m"], [0.8494187314, "r"], [0.8767049169, "g"], [0.9008874961, "v"], [0.9198247552, "p"], [0.9372139685, "z"], [0.9544904600, "f"], [0.9682274876, "b"], [0.9765388402, "k"], [0.9841588325, "u"], [0.9891110760, "i"], [0.9927707765, "x"], [0.9952957444, "q"], [0.9969114232, "h"], [0.9981513628, "j"], [0.9992109475, "y"], [1.0000000000, "w"], ],
-        'j' => [ [0.2434949329, "a"], [0.4396055875, "u"], [0.6343467543, "o"], [0.8184059162, "e"], [0.9104354971, "i"], [0.9367296631, "$"], [0.9457682827, "k"], [0.9520679266, "m"], [0.9569980827, "n"], [0.9619282388, "d"], [0.9668583950, "l"], [0.9709668584, "j"], [0.9748014243, "t"], [0.9786359901, "c"], [0.9821966584, "h"], [0.9857573268, "s"], [0.9887701999, "r"], [0.9912352780, "b"], [0.9934264585, "p"], [0.9953437414, "f"], [0.9972610244, "y"], [0.9983566146, "v"], [0.9989044098, "w"], [0.9994522049, "g"], [1.0000000000, "z"], ],
-        'k' => [ [0.2145110410, "e"], [0.4128214725, "$"], [0.5650430412, "i"], [0.6673260974, "a"], [0.7285462225, "o"], [0.7652782976, "l"], [0.7947388120, "u"], [0.8235042507, "y"], [0.8500775277, "s"], [0.8754745228, "n"], [0.9003903117, "r"], [0.9246644923, "h"], [0.9366411806, "t"], [0.9481366626, "w"], [0.9578142544, "b"], [0.9671175747, "m"], [0.9750307437, "k"], [0.9813933594, "f"], [0.9869539646, "p"], [0.9911778859, "c"], [0.9941185906, "d"], [0.9963642196, "g"], [0.9978613057, "v"], [0.9990375876, "j"], [0.9994653264, "x"], [0.9997861306, "z"], [1.0000000000, "q"], ],
-        'l' => [ [0.1721321920, "e"], [0.3312328418, "i"], [0.4596138030, "a"], [0.5644569032, "l"], [0.6646897335, "$"], [0.7533389707, "o"], [0.8401464166, "y"], [0.8755574604, "u"], [0.8986783120, "d"], [0.9217875801, "t"], [0.9320853942, "s"], [0.9413986030, "m"], [0.9491480267, "f"], [0.9568163653, "c"], [0.9643341172, "k"], [0.9708440964, "p"], [0.9770413187, "v"], [0.9830300363, "b"], [0.9868757891, "g"], [0.9903392834, "n"], [0.9929340083, "w"], [0.9953318120, "h"], [0.9976137798, "r"], [0.9984130478, "z"], [0.9990848962, "j"], [0.9997104102, "x"], [1.0000000000, "q"], ],
-        'm' => [ [0.2069558017, "a"], [0.3746403064, "e"], [0.5182279267, "i"], [0.6378221716, "$"], [0.7489907877, "o"], [0.8112410724, "p"], [0.8594348411, "u"], [0.8979194700, "m"], [0.9363212918, "b"], [0.9540420246, "y"], [0.9636269537, "s"], [0.9718662664, "n"], [0.9770417141, "l"], [0.9810371597, "t"], [0.9841010247, "f"], [0.9867922575, "c"], [0.9893385778, "r"], [0.9917192837, "d"], [0.9936445503, "h"], [0.9955698168, "w"], [0.9970810475, "g"], [0.9981368388, "k"], [0.9987371908, "v"], [0.9993168409, "j"], [0.9996273678, "q"], [0.9998757893, "z"], [1.0000000000, "x"], ],
-        'n' => [ [0.1731959913, "$"], [0.2872468996, "g"], [0.4009352398, "e"], [0.5103657289, "t"], [0.5939609170, "i"], [0.6679155123, "a"], [0.7369707900, "d"], [0.7893386386, "o"], [0.8332651581, "c"], [0.8768384062, "s"], [0.8977279066, "n"], [0.9132160720, "u"], [0.9263893795, "k"], [0.9386981016, "f"], [0.9482736181, "y"], [0.9546882844, "l"], [0.9609170184, "v"], [0.9668947437, "h"], [0.9727330197, "b"], [0.9780785750, "r"], [0.9831359351, "p"], [0.9879051001, "m"], [0.9919026458, "w"], [0.9952773181, "z"], [0.9978431846, "j"], [0.9994514995, "q"], [1.0000000000, "x"], ],
|
106
|
-
'o' => [ [0.1792873578, "n"], [0.3023095060, "r"], [0.3778063954, "l"], [0.4461351994, "u"], [0.5056388308, "m"], [0.5562320680, "$"], [0.6058215838, "s"], [0.6533752497, "t"], [0.6988551895, "p"], [0.7370818222, "c"], [0.7749770375, "o"], [0.8078347079, "g"], [0.8369805032, "d"], [0.8623197326, "w"], [0.8856041740, "v"], [0.9054797504, "i"], [0.9235656727, "b"], [0.9396914977, "a"], [0.9530334163, "f"], [0.9646140881, "e"], [0.9753425436, "k"], [0.9832965618, "x"], [0.9897734052, "y"], [0.9942806822, "h"], [0.9976422017, "z"], [0.9989678714, "j"], [1.0000000000, "q"], ],
|
107
|
-
'p' => [ [0.1630755943, "e"], [0.2878539934, "a"], [0.4087231238, "o"], [0.5266603027, "r"], [0.6227101232, "h"], [0.7102822856, "i"], [0.7778535388, "l"], [0.8271512341, "$"], [0.8667893995, "p"], [0.9040865494, "u"], [0.9379289968, "t"], [0.9606573026, "s"], [0.9756579845, "y"], [0.9790217737, "m"], [0.9823628347, "f"], [0.9856584390, "n"], [0.9888858584, "c"], [0.9914996136, "w"], [0.9938633574, "b"], [0.9958407200, "d"], [0.9972498750, "g"], [0.9985453884, "k"], [0.9992499659, "v"], [0.9995227056, "x"], [0.9997499886, "z"], [0.9999545434, "j"], [1.0000000000, "q"], ],
|
108
|
-
'q' => [ [0.8851774530, "u"], [0.9144050104, "$"], [0.9345859429, "a"], [0.9512874043, "i"], [0.9592901879, "e"], [0.9641614475, "l"], [0.9686847599, "v"], [0.9718162839, "r"], [0.9745998608, "b"], [0.9773834377, "f"], [0.9801670146, "m"], [0.9829505915, "o"], [0.9853862213, "s"], [0.9878218511, "t"], [0.9899095338, "k"], [0.9919972164, "q"], [0.9933890049, "p"], [0.9947807933, "n"], [0.9961725818, "w"], [0.9975643702, "d"], [0.9986082116, "c"], [0.9993041058, "h"], [0.9996520529, "g"], [1.0000000000, "j"], ],
|
109
|
-
'r' => [ [0.1577633281, "e"], [0.2944208938, "a"], [0.4238126886, "i"], [0.5479055899, "$"], [0.6580327633, "o"], [0.6988881305, "t"], [0.7281847248, "u"], [0.7566550510, "y"], [0.7843889208, "d"], [0.8109282943, "s"], [0.8362103032, "r"], [0.8607468746, "m"], [0.8826160368, "n"], [0.9026081334, "c"], [0.9184778704, "g"], [0.9321651818, "b"], [0.9458165685, "l"], [0.9588213105, "k"], [0.9713769938, "p"], [0.9787864636, "v"], [0.9857827993, "f"], [0.9924647938, "h"], [0.9966230780, "w"], [0.9981229343, "z"], [0.9990839201, "j"], [0.9997575083, "q"], [1.0000000000, "x"], ],
|
110
|
-
's' => [ [0.1670682820, "t"], [0.2939047006, "$"], [0.4058387838, "e"], [0.4933650184, "s"], [0.5800473055, "i"], [0.6458085794, "h"], [0.7020976536, "a"], [0.7513464293, "c"], [0.7976858072, "o"], [0.8403273628, "u"], [0.8806036445, "p"], [0.9077099042, "m"], [0.9273649962, "l"], [0.9428558738, "k"], [0.9567810067, "y"], [0.9666307619, "n"], [0.9762029027, "w"], [0.9812332737, "q"], [0.9854863247, "f"], [0.9895728071, "b"], [0.9922823226, "d"], [0.9949141061, "r"], [0.9973237982, "g"], [0.9985119874, "v"], [0.9992337845, "z"], [0.9999222680, "j"], [1.0000000000, "x"], ],
|
111
|
-
't' => [ [0.1842635651, "i"], [0.3614680523, "e"], [0.4876171188, "$"], [0.5830228191, "a"], [0.6670435441, "r"], [0.7509364585, "o"], [0.8272885472, "h"], [0.8631934954, "u"], [0.8966897053, "t"], [0.9280622929, "y"], [0.9435961971, "l"], [0.9553842675, "s"], [0.9665037901, "c"], [0.9732777521, "w"], [0.9779870813, "m"], [0.9817034204, "f"], [0.9851739699, "z"], [0.9884970456, "b"], [0.9917218055, "n"], [0.9942976807, "p"], [0.9964016399, "g"], [0.9975912617, "d"], [0.9984662727, "k"], [0.9993117891, "v"], [0.9998033683, "j"], [0.9999213473, "x"], [1.0000000000, "q"], ],
|
112
|
-
'u' => [ [0.1659793262, "n"], [0.3036437970, "s"], [0.4266152500, "r"], [0.5220573795, "l"], [0.5933801082, "m"], [0.6608108832, "t"], [0.6990520058, "c"], [0.7350257976, "e"], [0.7697320265, "i"], [0.8040097834, "a"], [0.8368414475, "p"], [0.8691910806, "b"], [0.8962205202, "d"], [0.9214825130, "$"], [0.9456197668, "g"], [0.9569207148, "f"], [0.9658829200, "o"], [0.9740595933, "k"], [0.9797189938, "v"], [0.9851641583, "x"], [0.9894667309, "z"], [0.9922696517, "h"], [0.9948047775, "y"], [0.9966793423, "u"], [0.9984289361, "j"], [0.9997143520, "w"], [1.0000000000, "q"], ],
|
113
|
-
'v' => [ [0.4405209116, "e"], [0.6701101928, "i"], [0.8340846481, "a"], [0.9149135988, "o"], [0.9356999750, "$"], [0.9479714500, "u"], [0.9574254946, "r"], [0.9664412722, "y"], [0.9723891811, "s"], [0.9777109942, "l"], [0.9827197596, "v"], [0.9859754570, "n"], [0.9884172302, "c"], [0.9902955172, "t"], [0.9919859755, "d"], [0.9936138242, "m"], [0.9946781868, "g"], [0.9956799399, "k"], [0.9966190834, "p"], [0.9973703982, "b"], [0.9979964939, "f"], [0.9984973704, "h"], [0.9989356374, "j"], [0.9992486852, "q"], [0.9995617330, "x"], [0.9998121713, "w"], [1.0000000000, "z"], ],
|
114
|
-
'w' => [ [0.2197508897, "a"], [0.3764713934, "e"], [0.5284697509, "i"], [0.6595264166, "o"], [0.7374076102, "$"], [0.8001642486, "h"], [0.8445113605, "n"], [0.8720914317, "r"], [0.8938543663, "l"], [0.9145223104, "s"], [0.9261565836, "y"], [0.9366274295, "d"], [0.9460717219, "t"], [0.9542157131, "b"], [0.9619490829, "u"], [0.9690665207, "k"], [0.9750889680, "f"], [0.9811114153, "c"], [0.9867916781, "m"], [0.9911716397, "p"], [0.9946619217, "w"], [0.9976731454, "g"], [0.9985628251, "j"], [0.9993840679, "z"], [0.9998631262, "q"], [1.0000000000, "v"], ],
|
115
|
-
'x' => [ [0.2546458142, "$"], [0.4077276909, "i"], [0.5054277829, "t"], [0.5863845446, "p"], [0.6559337626, "e"], [0.7238270469, "a"], [0.7696412144, "o"], [0.8147194112, "y"], [0.8594296228, "c"], [0.8885004600, "x"], [0.9138914443, "u"], [0.9306347746, "h"], [0.9425942962, "l"], [0.9521619135, "s"], [0.9613615455, "v"], [0.9703771849, "m"], [0.9781048758, "f"], [0.9849126035, "b"], [0.9889604416, "w"], [0.9913523459, "r"], [0.9933762649, "d"], [0.9954001840, "n"], [0.9974241030, "g"], [0.9992640294, "q"], [0.9996320147, "j"], [0.9998160074, "k"], [1.0000000000, "z"], ],
|
116
|
-
'y' => [ [0.5555410988, "$"], [0.5980873695, "l"], [0.6377386722, "s"], [0.6736492860, "a"], [0.7088768175, "e"], [0.7438441271, "p"], [0.7767621898, "n"], [0.8066226458, "m"], [0.8352470481, "c"], [0.8630907849, "t"], [0.8858276681, "o"], [0.9072309144, "r"], [0.9262921641, "d"], [0.9444426373, "i"], [0.9558923983, "b"], [0.9643496080, "g"], [0.9714081254, "u"], [0.9783690596, "w"], [0.9846143838, "f"], [0.9887779332, "h"], [0.9924860944, "k"], [0.9945678691, "v"], [0.9964219497, "z"], [0.9980158085, "x"], [0.9992518622, "y"], [0.9998048336, "j"], [1.0000000000, "q"], ],
|
117
|
-
'z' => [ [0.2946584939, "e"], [0.4734384121, "a"], [0.6180677175, "i"], [0.7089900759, "o"], [0.7905720957, "$"], [0.8610624635, "z"], [0.8884997081, "y"], [0.9098073555, "u"], [0.9306771745, "l"], [0.9455633392, "h"], [0.9519848219, "b"], [0.9576765908, "m"], [0.9627845884, "k"], [0.9676007005, "w"], [0.9724168126, "n"], [0.9767950963, "s"], [0.9810274372, "t"], [0.9852597782, "d"], [0.9886164623, "g"], [0.9910974898, "v"], [0.9934325744, "c"], [0.9957676591, "r"], [0.9976649154, "p"], [0.9988324577, "f"], [0.9995621716, "j"], [0.9998540572, "q"], [1.0000000000, "x"], ],
|
118
|
-
} # BIGRAM_CDF_HSH
|
119
|
-
|
120
|
-
end
|
121
|
-
|
122
|
-
end
|
123
|
-
end
|
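Each row of `BIGRAM_CDF_HSH` pairs a cumulative probability with a following letter, which suggests inverse-CDF sampling: draw a uniform number in [0, 1) and take the first entry whose cumulative value is at least that large. The following is a minimal sketch under that assumption, using a shortened table with invented numbers. `TOY_CDF` and `next_letter` are hypothetical names, not part of Wukong, and reading `$` as an end-of-word marker is also an assumption.

```ruby
# Toy table in the same shape as BIGRAM_CDF_HSH: [cumulative_probability, letter].
# The numbers are invented for illustration only.
TOY_CDF = {
  'q' => [[0.90, 'u'], [0.95, '$'], [1.00, 'a']],
}

# Return the first letter whose cumulative probability is >= draw.
# Returns nil if draw is larger than every cumulative value.
def next_letter(cdf_rows, draw)
  _, letter = cdf_rows.find { |cum, _| draw <= cum }
  letter
end

puts next_letter(TOY_CDF['q'], 0.50)   # draw falls in the first bucket
puts next_letter(TOY_CDF['q'], rand)   # random draw in [0, 1)
```

Because the rows are sorted by cumulative value, a linear scan is enough for 27 entries; a binary search would only matter for much larger alphabets.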
@@ -1,26 +0,0 @@
|
|
1
|
-
module Wukong
|
2
|
-
module Widget
|
3
|
-
|
4
|
-
class Monitor < AsIs
|
5
|
-
include CountingProcessor
|
6
|
-
register_processor
|
7
|
-
|
8
|
-
field :every, Integer, :default => 1000, :doc => "How often to announce progress"
|
9
|
-
|
10
|
-
def process(rec)
|
11
|
-
super(rec)
|
12
|
-
$stderr.puts("%-7d\t%s\t%s" % [count, report, rec.inspect[0..1000]]) if ready?
|
13
|
-
end
|
14
|
-
|
15
|
-
def ready?
|
16
|
-
(count % every) == 0
|
17
|
-
end
|
18
|
-
end
|
19
|
-
|
20
|
-
class DumpSystemConfig < Monitor
|
21
|
-
def setup ; require 'rbconfig' ; end
|
22
|
-
def report() super.merge({ :rbconfig => RbConfig::CONFIG }) end
|
23
|
-
end
|
24
|
-
|
25
|
-
end
|
26
|
-
end
|
@@ -1,66 +0,0 @@
|
|
1
|
-
module Wukong
|
2
|
-
class Counter < Wukong::Processor
|
3
|
-
field :count, Integer, :doc => 'count of records this run'
|
4
|
-
|
5
|
-
def setup
|
6
|
-
super
|
7
|
-
reset!
|
8
|
-
end
|
9
|
-
|
10
|
-
def reset!
|
11
|
-
self.count = 0
|
12
|
-
end
|
13
|
-
|
14
|
-
def beg_group(*args)
|
15
|
-
reset!
|
16
|
-
end
|
17
|
-
|
18
|
-
def end_group(key)
|
19
|
-
emit( [key, count] )
|
20
|
-
end
|
21
|
-
|
22
|
-
def process(record)
|
23
|
-
@count += 1
|
24
|
-
end
|
25
|
-
end
|
26
|
-
|
27
|
-
class GroupArrays < Wukong::Processor
|
28
|
-
def beg_group
|
29
|
-
@records = []
|
30
|
-
end
|
31
|
-
|
32
|
-
def end_group(key)
|
33
|
-
emit(key, @records)
|
34
|
-
end
|
35
|
-
|
36
|
-
def process(record)
|
37
|
-
@records << record
|
38
|
-
end
|
39
|
-
end
|
40
|
-
|
41
|
-
class Group < Wukong::Processor
|
42
|
-
def start(key, *vals)
|
43
|
-
@key = key
|
44
|
-
next_stage.tell(:beg_group, @key)
|
45
|
-
end
|
46
|
-
|
47
|
-
def end_group
|
48
|
-
next_stage.tell(:end_group, @key)
|
49
|
-
end
|
50
|
-
|
51
|
-
def process( (key, *vals) )
|
52
|
-
start(key, *vals) unless defined?(@key)
|
53
|
-
if key != @key
|
54
|
-
end_group
|
55
|
-
start(key, *vals)
|
56
|
-
end
|
57
|
-
emit( [key, *vals] )
|
58
|
-
end
|
59
|
-
|
60
|
-
def finally
|
61
|
-
end_group
|
62
|
-
super()
|
63
|
-
end
|
64
|
-
end
|
65
|
-
|
66
|
-
end
|
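The removed `Group` processor relies on its input being sorted by key: it opens a group on the first record, closes it when the key changes, and closes the last one in `finally`. The same idea can be sketched in plain Ruby without any Wukong classes. `group_sorted` is a hypothetical helper, and it collects groups into an array rather than telling a next stage.

```ruby
# Walk records that are already sorted by key and collect [key, records]
# each time the key changes, plus once more at the end of input.
def group_sorted(records)
  groups  = []
  current = nil            # key of the group being collected
  bucket  = []
  records.each do |key, *vals|
    if !bucket.empty? && key != current
      groups << [current, bucket]   # key changed: close the previous group
      bucket = []
    end
    current = key
    bucket << [key, *vals]
  end
  groups << [current, bucket] unless bucket.empty?   # close the final group
  groups
end

sorted = [['spain', 1], ['sweden', 2], ['sweden', 3]]
group_sorted(sorted).each { |key, recs| puts "#{key}\t#{recs.length}" }
```

If the input is not sorted, the same key shows up in several groups, which is why the local runner pipes through `sort` between map and reduce.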
@@ -1,50 +0,0 @@
|
|
1
|
-
module Wukong
|
2
|
-
module Widget
|
3
|
-
|
4
|
-
class Stringifier < Processor
|
5
|
-
end
|
6
|
-
|
7
|
-
class ToJson < Stringifier
|
8
|
-
def process(record)
|
9
|
-
emit record.to_json
|
10
|
-
# emit MultiJson.dump(record)
|
11
|
-
end
|
12
|
-
register_processor
|
13
|
-
end
|
14
|
-
|
15
|
-
class FromJson < Stringifier
|
16
|
-
# FIXME some of this belongs in gorillib factories...
|
17
|
-
def process(record)
|
18
|
-
obj = MultiJson.load(record)
|
19
|
-
if obj.respond_to?(:has_key?) && obj.has_key?("_metadata")
|
20
|
-
metadata_hash = obj.delete("_metadata")
|
21
|
-
obj.define_singleton_method(:_metadata) do
|
22
|
-
metadata_hash
|
23
|
-
end
|
24
|
-
end
|
25
|
-
if obj.respond_to?(:has_key?) && obj.has_key?("_type")
|
26
|
-
klass = Gorillib::Factory(obj.delete("_type"))
|
27
|
-
obj = klass.receive(obj)
|
28
|
-
end
|
29
|
-
emit obj
|
30
|
-
end
|
31
|
-
register_processor
|
32
|
-
end
|
33
|
-
|
34
|
-
class ToTsv < Stringifier
|
35
|
-
def process(record)
|
36
|
-
emit record.join("\t")
|
37
|
-
end
|
38
|
-
register_processor
|
39
|
-
end
|
40
|
-
|
41
|
-
class FromTsv < Stringifier
|
42
|
-
def process(record)
|
43
|
-
emit record.chomp.split(/\t/)
|
44
|
-
end
|
45
|
-
register_processor
|
46
|
-
end
|
47
|
-
|
48
|
-
end
|
49
|
-
|
50
|
-
end
|
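The `_metadata` handling in the removed `FromJson` can be tried in isolation. This sketch swaps in the standard library's `json` for `MultiJson` and leaves out the `_type` / Gorillib factory step entirely; `parse_with_metadata` is a hypothetical name.

```ruby
require 'json'

# Parse a JSON line and, if it carries a "_metadata" key, move that value
# off the hash and expose it through a singleton method instead.
def parse_with_metadata(line)
  obj = JSON.parse(line)
  if obj.respond_to?(:key?) && obj.key?('_metadata')
    metadata = obj.delete('_metadata')
    obj.define_singleton_method(:_metadata) { metadata }
  end
  obj
end

rec = parse_with_metadata('{"name":"wukong","_metadata":{"source":"log"}}')
puts rec.inspect
puts rec._metadata.inspect
```

The singleton method lives on that one object only, so copies made with `dup`, or records re-serialized with `to_json`, lose the metadata.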
data/lib/wukong/workflow.rb
DELETED
@@ -1,22 +0,0 @@
|
|
1
|
-
module Wukong
|
2
|
-
class WorkflowGraph < Hanuman::Graph
|
3
|
-
end
|
4
|
-
|
5
|
-
class Workflow < WorkflowGraph
|
6
|
-
include Hanuman::IsOwnInputSlot
|
7
|
-
include Hanuman::IsOwnOutputSlot
|
8
|
-
|
9
|
-
#
|
10
|
-
# lifecycle
|
11
|
-
#
|
12
|
-
|
13
|
-
def setup
|
14
|
-
stages.each_value{|stage| stage.setup}
|
15
|
-
end
|
16
|
-
|
17
|
-
def stop
|
18
|
-
stages.each_value{|stage| stage.stop}
|
19
|
-
end
|
20
|
-
|
21
|
-
end
|
22
|
-
end
|
@@ -1,42 +0,0 @@
|
|
1
|
-
module Wukong
|
2
|
-
class Workflow < Hanuman::Graph
|
3
|
-
|
4
|
-
class ActionWithInputs < Hanuman::Action
|
5
|
-
include Hanuman::Slottable
|
6
|
-
include Hanuman::SplatInputs
|
7
|
-
include Hanuman::SplatOutputs
|
8
|
-
|
9
|
-
def self.make(workflow, *input_stages, &block)
|
10
|
-
options = input_stages.extract_options!
|
11
|
-
stage = new
|
12
|
-
workflow.add_stage stage
|
13
|
-
input_stages.map do |input|
|
14
|
-
workflow.connect(input, stage)
|
15
|
-
end
|
16
|
-
stage.receive!(options, &block)
|
17
|
-
stage
|
18
|
-
end
|
19
|
-
end
|
20
|
-
|
21
|
-
#
|
22
|
-
# A command is a workflow action that runs a type of command on many inputs,
|
23
|
-
# under a given configuration, into named outputs.
|
24
|
-
#
|
25
|
-
# @example
|
26
|
-
# bash 'create_archive.sh', filenames, :compression_level => 9 > 'archive.tar.gz'
|
27
|
-
#
|
28
|
-
class Command < ActionWithInputs
|
29
|
-
|
30
|
-
def self.make(workflow, stage_name, *input_stages, &block)
|
31
|
-
options = input_stages.extract_options!
|
32
|
-
super(workflow, *input_stages, options.merge(:name => stage_name, :script => stage_name), &block)
|
33
|
-
end
|
34
|
-
end
|
35
|
-
|
36
|
-
class Shell < Command
|
37
|
-
field :script, String
|
38
|
-
register_action
|
39
|
-
end
|
40
|
-
|
41
|
-
end
|
42
|
-
end
|
data/old/config/emr-example.yaml
DELETED
@@ -1,48 +0,0 @@
|
|
1
|
-
#
|
2
|
-
# Elastic MapReduce config in wukong
|
3
|
-
#
|
4
|
-
|
5
|
-
#
|
6
|
-
# Infrastructure options
|
7
|
-
#
|
8
|
-
|
9
|
-
# == Fill all your information into yet another file with your amazon key
|
10
|
-
:emr_credentials_file: ~/.wukong/credentials.json
|
11
|
-
#
|
12
|
-
# == Use the credentials file, set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars, or enter them here:
|
13
|
-
# :access_key: ASDFAHKHASDF
|
14
|
-
# :secret_access_key: ADSGHASDFJASDFASDF
|
15
|
-
#
|
16
|
-
# == Path to your keypair file
|
17
|
-
# :key_pair_file: ~/.wukong/keypairs/gibbon.pem
|
18
|
-
# == Keypair will be named after your file, or force the name:
|
19
|
-
# :key_pair: ~
|
20
|
-
|
21
|
-
# == Path to the Amazon elastic-mapreduce runner. Get a copy from
|
22
|
-
# http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
|
23
|
-
:emr_runner: ~/ics/hadoop/elastic-mapreduce/elastic-mapreduce
|
24
|
-
|
25
|
-
#
|
26
|
-
# Cluster Config
|
27
|
-
#
|
28
|
-
:num_instances: 1
|
29
|
-
:instance_type: m2.xlarge
|
30
|
-
:master_instance_type: ~
|
31
|
-
:hadoop_version: '0.20'
|
32
|
-
# :availability_zone: us-east-1b
|
33
|
-
|
34
|
-
#
|
35
|
-
# Running and reporting options
|
36
|
-
#
|
37
|
-
:alive: true
|
38
|
-
:enable_debugging: true
|
39
|
-
:emr_runner_verbose: true
|
40
|
-
:emr_runner_debug: ~
|
41
|
-
:step_action: CANCEL_AND_WAIT # CANCEL_AND_WAIT, TERMINATE_JOB_FLOW or CONTINUE
|
42
|
-
|
43
|
-
#
|
44
|
-
# Remote Paths
|
45
|
-
#
|
46
|
-
:emr_root: s3n://emr.infinitemonkeys.info
|
47
|
-
|
48
|
-
|
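The keys in this removed config start with a colon, so Ruby's YAML parser reads them as symbols, and `~` reads as `nil`. A minimal sketch of loading a few of these keys, assuming a Ruby recent enough for `YAML.safe_load` to accept `permitted_classes:`; how Wukong itself loaded the file is not shown in this diff.

```ruby
require 'yaml'

# A few keys copied from the example config above.
EXAMPLE_CONFIG = <<~YAML
  :num_instances: 1
  :instance_type: m2.xlarge
  :master_instance_type: ~
  :hadoop_version: '0.20'
  :alive: true
YAML

# Symbols have to be permitted explicitly under safe_load.
config = YAML.safe_load(EXAMPLE_CONFIG, permitted_classes: [Symbol])

puts config[:num_instances].inspect
puts config[:master_instance_type].inspect
puts config[:hadoop_version].inspect
```

Quoting `'0.20'` matters: unquoted, it would load as a float and lose the trailing zero.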
data/old/examples/README.txt
DELETED
@@ -1,17 +0,0 @@
|
|
1
|
-
Examples:
|
2
|
-
|
3
|
-
|
4
|
-
* sample_records -- extract a random sample from a collection of data
|
5
|
-
|
6
|
-
* word_count
|
7
|
-
|
8
|
-
* apache_log_parser -- example for parsing standard apache webserver log files.
|
9
|
-
|
10
|
-
* wordchains -- solving a word puzzle using breadth-first search of a graph
|
11
|
-
|
12
|
-
* graph -- some generic graph
|
13
|
-
|
14
|
-
* pagerank -- use the pagerank algorithm to find the most 'interesting'
|
15
|
-
(central) nodes of a network graph
|
16
|
-
|
17
|
-
|
@@ -1,165 +0,0 @@
|
|
1
|
-
Notes and scripts from a talk by Fredrik Möllerstrand (@lenbust / http://fredrikmollerstrand.se), with some modifications by @mrflip
|
2
|
-
|
3
|
-
See http://bit.ly/6ItaHI and forks of that gist!
|
4
|
-
|
5
|
-
Wukong who?
|
6
|
-
-----------
|
7
|
-
|
8
|
-
Here be some notes from my talk on [Wukong](http://github.com/mrflip/wukong) at the January meetup of [Got.rb](http://www.meetup.com/got-rb/).
|
9
|
-
|
10
|
-
Wukong is a framework for writing Hadoop jobs in Ruby. Other such frameworks are [MRToolkit](http://code.google.com/p/mrtoolkit/) (which is also written in Ruby and which I have not tried) and [Dumbo](http://github.com/klbostee/dumbo) (which is written in Python and which I love dearly). You could also write your jobs in Java(!) or as bare scripts hooked into [Hadoop Streaming](http://hadoop.apache.org/common/docs/current/streaming.html), but that would be nuts.
|
11
|
-
|
12
|
-
Wukong gives you the option of treating your data as a stream of lines or as a stream of fields or lightweight objects. My forenoon's experience of Wukong covers only the basic text streaming so we'll skip the structured data and interpret the data as dumb chunks of text.
|
13
|
-
|
14
|
-
The data, which I regrettably cannot share with you on the wider interwebs, are records of jeans ordered by stores from a central jean distributor. There is one line per order, and each order is for a specific market and for a specified number of jeans for each size. There are 13 sizes in total.
|
15
|
-
|
16
|
-
An example of which:
|
17
|
-
|
18
|
-
SS10 vaxjobutiken 2010-01-10 10:45:54 sweden Storgatan 64 VÄXJÖ 352 30 SWEDEN retailer 120664 L34 0 0 0 0 0 1 1 1 2 2 1 1 0
|
19
|
-
SS10 vaxjobutiken 2009-01-10 10:45:54 sweden Storgatan 64 VÄXJÖ 352 30 SWEDEN retailer 120721 L32 0 0 0 0 0 1 2 2 2 1 1 1 0
|
20
|
-
SS09 kubic 2010-01-10 13:33:37 spain NULL NULL NULL NULL retailer 120571 L34 0 0 0 0 0 0 0 1 1 1 1 1 0
|
21
|
-
|
22
|
-
The integers at the end there describe how many of each jean size was ordered.
|
23
|
-
|
24
|
-
We'll use Wukong to summarize the orders for each country. The job is run locally (as opposed to on Hadoop) to avoid any startup overhead.
|
25
|
-
|
26
|
-
sizes.rb
|
27
|
-
--------
|
28
|
-
$> ruby sizes.rb --run=local data/orders.tsv data/sizes
|
29
|
-
|
30
|
-
require 'rubygems'
|
31
|
-
require 'wukong'
|
32
|
-
module JeanSizes
|
33
|
-
class Mapper < Wukong::Streamer::RecordStreamer
|
34
|
-
def process(code,model,time,country,j1,j2,j3, n1,n2,c1, venue,n3,n4, *sizes)
|
35
|
-
yield [country, *sizes] if sizes.length == 13
|
36
|
-
end
|
37
|
-
end
|
38
|
-
|
39
|
-
class JeansListReducer < Wukong::Streamer::ListReducer
|
40
|
-
def finalize
|
41
|
-
return if values.empty?
|
42
|
-
sums = []; 13.times{ sums << 0 }
|
43
|
-
values.each do |country, *sizes|
|
44
|
-
sizes.map!(&:to_i)
|
45
|
-
sums = sums.zip(sizes).map{|sum, val| sum + val }
|
46
|
-
end
|
47
|
-
yield [key, *sums]
|
48
|
-
end
|
49
|
-
end
|
50
|
-
end
|
51
|
-
|
52
|
-
Wukong::Script.new(JeanSizes::Mapper, JeanSizes::JeansListReducer).run
|
53
|
-
|
54
|
-
*JeanSizes::Mapper#process*, being a RecordStreamer, is given one set of input fields to work with at a time. It picks out the good parts, namely the country at index 3 and the integers at index 11 through 23. The rest of the fields are unimportant and just given placeholder names.
|
55
|
-
|
56
|
-
The country is promoted to key and the sizes array is value. These are yielded as a list – since the reducer is a list reducer!
|
57
|
-
|
58
|
-
*JeanSizes::JeansListReducer#finalize* is given the key ('sweden' for example) and a list of lists of integers. These are (over)cleverly summarized into one list, *sums*.
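The *sums* step can be tried outside Wukong. This is a small sketch with made-up rows and three sizes instead of 13, to show what the `zip`/`map` pair does on each pass:

```ruby
# Rows as the reducer would see them for one key: [country, *sizes],
# with the sizes still strings. The data is invented for illustration.
values = [
  ['sweden', '0', '1', '2'],
  ['sweden', '1', '1', '0'],
]

sums = Array.new(3, 0)
values.each do |_country, *sizes|
  counts = sizes.map(&:to_i)
  # Pair each running total with the matching count and add them.
  sums = sums.zip(counts).map { |sum, val| sum + val }
end

puts ['sweden', *sums].join("\t")
```

Each pass builds a new array of element-wise totals, so `sums` always has one slot per size.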
|
59
|
-
|
60
|
-
The output of these two steps is a much smaller data set containing the number of jeans of each size purchased, broken down by market.
|
61
|
-
|
62
|
-
An example of which:
|
63
|
-
|
64
|
-
sweden 807 1443 2215 2460 2316 2077 2392 2563 3068 2356 2051 1016 255
|
65
|
-
switzerland 90 201 731 886 585 325 404 624 770 721 635 295 41
|
66
|
-
unitedstates 446 1103 2007 2442 2863 2879 3920 3687 5588 4256 5299 3777 1842
|
67
|
-
|
68
|
-
|
69
|
-
That's all peachy, but what if I'd like to compare the relative amount of large jeans bought in Sweden with those bought in the US? A working hypothesis might be that Swedes wear smaller jean sizes than do Americans. Well, let's normalize the data and see what we can make of it.
|
70
|
-
|
71
|
-
normalize.rb
|
72
|
-
------------
|
73
|
-
$> ruby normalize.rb --run=local data/sizes.tsv data/normalized_sizes.tsv
|
74
|
-
|
75
|
-
require 'rubygems'
|
76
|
-
require 'wukong'
|
77
|
-
require 'active_support/core_ext/enumerable' # for array#sum
|
78
|
-
|
79
|
-
module Normalize
|
80
|
-
class Mapper < Wukong::Streamer::RecordStreamer
|
81
|
-
def process(country, *sizes)
|
82
|
-
sizes.map!(&:to_i)
|
83
|
-
sum = sizes.sum.to_f
|
84
|
-
normalized = sizes.map{|x| 100 * x/sum }
|
85
|
-
s = normalized.join(",")
|
86
|
-
yield [country, s]
|
87
|
-
end
|
88
|
-
end
|
89
|
-
end
|
90
|
-
|
91
|
-
Wukong::Script.new(Normalize::Mapper, nil).run
|
92
|
-
|
93
|
-
Again we're dealing with a record streamer. The normalization divides each jean size by the total number of jeans sold in that country and scales it up by 100 to make the figures into proper percentages.
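The normalization itself is a couple of lines of plain Ruby. In this sketch `normalize` is a hypothetical helper, and the guard against an all-zero row is an addition that the mapper above does not have:

```ruby
# Turn a row of size counts (strings or integers) into percentages of the row total.
def normalize(sizes)
  counts = sizes.map(&:to_i)
  total  = counts.sum.to_f
  return counts.map { 0.0 } if total.zero?   # added guard: avoid dividing by zero
  counts.map { |x| 100 * x / total }
end

puts normalize(%w[1 1 2]).join(",")
```

Converting the total to a float first is what keeps `100 * x / total` from truncating to an integer.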
|
94
|
-
|
95
|
-
You will also notice that I join the list of normalized values with a comma. Why in the name of Buddha would I do that? Bear with me.
|
96
|
-
|
97
|
-
Part of the output looks like so:
|
98
|
-
|
99
|
-
sweden 1.01922538870458,4.06091370558376,8.19776969503178,9.41684319916863,12.2626803629242,10.2442143970582,9.56073384227987,8.30169071505656,9.25696470682282,9.83252727926776,8.85327151364963,5.76761661137536,3.22554858307686
|
100
|
-
switzerland 0.64996829422955,4.67660114140774,10.0665821179455,11.429930247305,12.2067216233354,9.89220038046925,6.40456563094483,5.15218769816107,9.27393785668992,14.0456563094483,11.5884590995561,3.1864299302473,1.42675967025999
|
101
|
-
unitedstates 4.59248547707497,9.41683911341594,13.2114986661348,10.6110847939365,13.9320352040689,9.19245057219078,9.77336757336259,7.17794011319155,7.13804881697375,6.08840908524271,5.0038644693211,2.75000623301503,1.11196988207136
|
102
|
-
|
103
|
-
|
104
|
-
Data visualized
|
105
|
-
---------------
|
106
|
-
|
107
|
-
In the words of the late great R. Dingly, '*data not visualized is not data*'.
|
108
|
-
|
109
|
-
Let's put our hypothesis to work and graph this. The quickest path to a graph just happens to be Google Charts, and here's
|
110
|
-
[a graph comparing jeans sizes bought in Sweden and the US ](http://chart.apis.google.com/chart?cht=bvg&chd=t:1.01922538870458,4.06091370558376,8.19776969503178,9.41684319916863,12.2626803629242,10.2442143970582,9.56073384227987,8.30169071505656,9.25696470682282,9.83252727926776,8.85327151364963,5.76761661137536,3.22554858307686|4.59248547707497,9.41683911341594,13.2114986661348,10.6110847939365,13.9320352040689,9.19245057219078,9.77336757336259,7.17794011319155,7.13804881697375,6.08840908524271,5.0038644693211,2.75000623301503,1.11196988207136&chds=0,20&chs=800x375&chdl=Sweden|USA&chco=eecc00,00eedd).
|
111
|
-
|
112
|
-
It seems that Americans buy smaller sizes while Swedes go for larger breeches, which is quite the opposite of what we thought.
|
113
|
-
|
114
|
-
As someone in the crowd pointed out during the meet, this might be due to the fact that the particular brand of jeans under scrutiny here has become mainstream in Sweden while it is still mostly worn by thin punk-rockers in the States. Again the words of Mr. Dingly echo so true: '*data is nothing without interpretation*'.
|
115
|
-
|
116
|
-
Chaining Runs
|
117
|
-
-------------
|
118
|
-
|
119
|
-
**Local Mode**: If you're running in local mode, chaining is straightforward: just use a single dash '-' as the output file, and wukong will leave its output on STDOUT rather than dumping to a file on disk. (NOTE: this is only true as of wukong v1.4.5.) (OTHER NOTE: if you've already written a file named '-' to disk, use "rm -- -" to remove it).
|
120
|
-
|
121
|
-
./sizes.rb --run=local data/orders.tsv - | ./normalize.rb --run=local - data/normalized_sizes
|
122
|
-
|
123
|
-
For anything fancier (or for earlier versions of wukong): all the local runner does is to take your script and run
|
124
|
-
|
125
|
-
cat input.tsv | myscript.rb --map [..args..] | sort | myscript.rb --reduce [..args..] > output.tsv
|
126
|
-
|
127
|
-
You can do the chaining by hand:
|
128
|
-
|
129
|
-
cat input.tsv | myscript.rb --map [..args..] | sort | myscript.rb --reduce [..args..] | whatever | ./anotherscript.rb --map | sort | ./anotherscript --reduce > output.tsv
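The map, sort, reduce shape of that pipeline can be imitated in a single Ruby process, which is handy for checking the logic before touching the shell. This is a sketch with invented data; `mapper` and `reducer` are hypothetical stand-ins for `myscript.rb --map` and `myscript.rb --reduce`.

```ruby
# Stand-in for `myscript.rb --map`: one input line in, one [key, value] pair out.
mapper  = ->(line) { country, count = line.split("\t"); [country, count.to_i] }
# Stand-in for `myscript.rb --reduce`: all pairs for one key in, one record out.
reducer = ->(key, pairs) { [key, pairs.sum { |_, n| n }] }

input = ["sweden\t2", "spain\t1", "sweden\t3"]

mapped  = input.map { |line| mapper.call(line) }          # cat input.tsv | map
sorted  = mapped.sort                                     # | sort
grouped = sorted.chunk_while { |a, b| a[0] == b[0] }      # consecutive rows sharing a key
output  = grouped.map { |pairs| reducer.call(pairs[0][0], pairs) }   # | reduce

output.each { |rec| puts rec.join("\t") }
```

The sort is what puts rows with the same key next to each other; without it the reducer would see the key more than once.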
|
130
|
-
|
131
|
-
**Hadoop Mode**: Wukong doesn't let you chain jobs or define workflows. There are tools out there that enable this, but there are no mature solutions that I [@mrflip] know of.
|
132
|
-
|
133
|
-
In closing
|
134
|
-
----------
|
135
|
-
|
136
|
-
This has been a quick rundown of Wukong as I know it after a few hours of use. Improvements can certainly be made and I welcome any and all comments. Please consider amending the code in this presentation by forking [this gist](http://gist.github.com/278043).
|
137
|
-
|
138
|
-
Also, you should follow me on twitter: [@lenbust](http://twitter.com/lenbust).
|
139
|
-
|
140
|
-
Postscript
|
141
|
-
----------
|
142
|
-
|
143
|
-
The ListReducer used in sizes.rb is perfect for a small, local run such as this one. However, for large amounts of data it's best to avoid the ListReducer, as it collects every single record in memory before finalizing.
|
144
|
-
|
145
|
-
You can instead use an Accumulating reducer directly. Compare this class with the JeansListReducer above and you'll see we are applying the same basic workflow, just in explicitly separated steps.
|
146
|
-
|
147
|
-
class JeansAccumulatingReducer < Wukong::Streamer::AccumulatingReducer
|
148
|
-
attr_accessor :sums
|
149
|
-
|
150
|
-
# start the sum with 0 for each size
|
151
|
-
def start! *_
|
152
|
-
self.sums = []; 13.times{ self.sums << 0 }
|
153
|
-
end
|
154
|
-
|
155
|
-
# accumulate each size count into sums
|
156
|
-
def accumulate country, *sizes
|
157
|
-
sizes.map!(&:to_i)
|
158
|
-
self.sums = self.sums.zip(sizes).map{|sum, val| sum + val }
|
159
|
-
end
|
160
|
-
|
161
|
-
# emit [country, size_0_sum, size_1_sum, ...]
|
162
|
-
def finalize
|
163
|
-
yield [key, *sums]
|
164
|
-
end
|
165
|
-
end
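To see the start / accumulate / finalize cycle without Hadoop or Wukong, here is a rough sketch of a driver loop. `SumAccumulator` and `run_reducer` are hypothetical names, and the way the key is set in `start!` is an assumption about how an accumulating reducer behaves, not Wukong's actual implementation. It uses three sizes instead of 13.

```ruby
# Keeps only one running total per key in memory, never the full record list.
class SumAccumulator
  attr_reader :key, :sums

  def start!(key, *_)
    @key  = key
    @sums = Array.new(3, 0)
  end

  def accumulate(_country, *sizes)
    @sums = @sums.zip(sizes.map(&:to_i)).map { |sum, val| sum + val }
  end

  def finalize
    yield [key, *sums]
  end
end

# Feed key-sorted rows to the reducer, finalizing whenever the key changes.
def run_reducer(reducer, sorted_rows)
  out = []
  current = nil
  sorted_rows.each do |row|
    if row[0] != current
      reducer.finalize { |rec| out << rec } unless current.nil?
      reducer.start!(*row)
      current = row[0]
    end
    reducer.accumulate(*row)
  end
  reducer.finalize { |rec| out << rec } unless current.nil?
  out
end

rows = [%w[spain 0 1 1], %w[sweden 1 0 2], %w[sweden 0 1 1]]
run_reducer(SumAccumulator.new, rows).each { |rec| puts rec.join("\t") }
```

Memory use stays constant per key here, which is the point of preferring this over a list reducer on large inputs.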
|