wukong 3.0.0.pre → 3.0.0.pre2
- data/.gitignore +46 -33
- data/.gitmodules +3 -0
- data/.rspec +1 -1
- data/.travis.yml +8 -1
- data/.yardopts +0 -13
- data/Guardfile +4 -6
- data/{LICENSE.textile → LICENSE.md} +43 -55
- data/README-old.md +422 -0
- data/README.md +279 -418
- data/Rakefile +21 -5
- data/TODO.md +6 -6
- data/bin/wu-clean-encoding +31 -0
- data/bin/wu-lign +2 -2
- data/bin/wu-local +69 -0
- data/bin/wu-server +70 -0
- data/examples/Gemfile +38 -0
- data/examples/README.md +9 -0
- data/examples/dataflow/apache_log_line.rb +64 -25
- data/examples/dataflow/fibonacci_series.rb +101 -0
- data/examples/dataflow/parse_apache_logs.rb +37 -7
- data/examples/{dataflow.rb → dataflow/scraper_macro_flow.rb} +0 -0
- data/examples/dataflow/simple.rb +4 -4
- data/examples/geo.rb +4 -0
- data/examples/geo/geo_grids.numbers +0 -0
- data/examples/geo/geolocated.rb +331 -0
- data/examples/geo/quadtile.rb +69 -0
- data/examples/geo/spec/geolocated_spec.rb +247 -0
- data/examples/geo/tile_fetcher.rb +77 -0
- data/examples/graph/minimum_spanning_tree.rb +61 -61
- data/examples/jabberwocky.txt +36 -0
- data/examples/models/wikipedia.rb +20 -0
- data/examples/munging/Gemfile +8 -0
- data/examples/munging/airline_flights/airline.rb +57 -0
- data/examples/munging/airline_flights/airline_flights.rake +83 -0
- data/{lib/wukong/settings.rb → examples/munging/airline_flights/airplane.rb} +0 -0
- data/examples/munging/airline_flights/airport.rb +211 -0
- data/examples/munging/airline_flights/airport_id_unification.rb +129 -0
- data/examples/munging/airline_flights/airport_ok_chars.rb +4 -0
- data/examples/munging/airline_flights/flight.rb +156 -0
- data/examples/munging/airline_flights/models.rb +4 -0
- data/examples/munging/airline_flights/parse.rb +26 -0
- data/examples/munging/airline_flights/reconcile_airports.rb +142 -0
- data/examples/munging/airline_flights/route.rb +35 -0
- data/examples/munging/airline_flights/tasks.rake +83 -0
- data/examples/munging/airline_flights/timezone_fixup.rb +62 -0
- data/examples/munging/airline_flights/topcities.rb +167 -0
- data/examples/munging/airports/40_wbans.txt +40 -0
- data/examples/munging/airports/filter_weather_reports.rb +37 -0
- data/examples/munging/airports/join.pig +31 -0
- data/examples/munging/airports/to_tsv.rb +33 -0
- data/examples/munging/airports/usa_wbans.pig +19 -0
- data/examples/munging/airports/usa_wbans.txt +2157 -0
- data/examples/munging/airports/wbans.pig +19 -0
- data/examples/munging/airports/wbans.txt +2310 -0
- data/examples/munging/geo/geo_json.rb +54 -0
- data/examples/munging/geo/geo_models.rb +69 -0
- data/examples/munging/geo/geonames_models.rb +78 -0
- data/examples/munging/geo/iso_codes.rb +172 -0
- data/examples/munging/geo/reconcile_countries.rb +124 -0
- data/examples/munging/geo/tasks.rake +71 -0
- data/examples/munging/rake_helper.rb +62 -0
- data/examples/munging/weather/.gitignore +1 -0
- data/examples/munging/weather/Gemfile +4 -0
- data/examples/munging/weather/Rakefile +28 -0
- data/examples/munging/weather/extract_ish.rb +13 -0
- data/examples/munging/weather/models/weather.rb +119 -0
- data/examples/munging/weather/utils/noaa_downloader.rb +46 -0
- data/examples/munging/wikipedia/README.md +34 -0
- data/examples/munging/wikipedia/Rakefile +193 -0
- data/examples/munging/wikipedia/articles/extract_articles-parsed.rb +79 -0
- data/examples/munging/wikipedia/articles/extract_articles-templated.rb +136 -0
- data/examples/munging/wikipedia/articles/textualize_articles.rb +54 -0
- data/examples/munging/wikipedia/articles/verify_structure.rb +43 -0
- data/examples/munging/wikipedia/articles/wp2txt-LICENSE.txt +22 -0
- data/examples/munging/wikipedia/articles/wp2txt_article.rb +259 -0
- data/examples/munging/wikipedia/articles/wp2txt_utils.rb +452 -0
- data/examples/munging/wikipedia/dbpedia/dbpedia_common.rb +4 -0
- data/examples/munging/wikipedia/dbpedia/dbpedia_extract_geocoordinates.rb +78 -0
- data/examples/munging/wikipedia/dbpedia/extract_links.rb +193 -0
- data/examples/munging/wikipedia/dbpedia/sameas_extractor.rb +20 -0
- data/examples/munging/wikipedia/n1_subuniverse/n1_nodes.pig +18 -0
- data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb +21 -0
- data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb.old +27 -0
- data/examples/munging/wikipedia/pagelinks/augment_pagelinks.pig +29 -0
- data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb +14 -0
- data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb.old +25 -0
- data/examples/munging/wikipedia/pagelinks/undirect_pagelinks.pig +29 -0
- data/examples/munging/wikipedia/pageviews/augment_pageviews.pig +32 -0
- data/examples/munging/wikipedia/pageviews/extract_pageviews.rb +85 -0
- data/examples/munging/wikipedia/pig_style_guide.md +25 -0
- data/examples/munging/wikipedia/redirects/redirects_page_metadata.pig +19 -0
- data/examples/munging/wikipedia/subuniverse/sub_articles.pig +23 -0
- data/examples/munging/wikipedia/subuniverse/sub_page_metadata.pig +24 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_from.pig +22 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_into.pig +22 -0
- data/examples/munging/wikipedia/subuniverse/sub_pagelinks_within.pig +26 -0
- data/examples/munging/wikipedia/subuniverse/sub_pageviews.pig +29 -0
- data/examples/munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig +24 -0
- data/examples/munging/wikipedia/utils/get_namespaces.rb +86 -0
- data/examples/munging/wikipedia/utils/munging_utils.rb +68 -0
- data/examples/munging/wikipedia/utils/namespaces.json +1 -0
- data/examples/rake_helper.rb +85 -0
- data/examples/server_logs/geo_ip_mapping/munge_geolite.rb +82 -0
- data/examples/server_logs/logline.rb +95 -0
- data/examples/server_logs/models.rb +66 -0
- data/examples/server_logs/page_counts.pig +48 -0
- data/examples/server_logs/server_logs-01-parse-script.rb +13 -0
- data/examples/server_logs/server_logs-02-histograms-full.rb +33 -0
- data/examples/server_logs/server_logs-02-histograms-mapper.rb +14 -0
- data/{old/examples/server_logs/breadcrumbs.rb → examples/server_logs/server_logs-03-breadcrumbs-full.rb} +26 -30
- data/examples/server_logs/server_logs-04-page_page_edges-full.rb +40 -0
- data/examples/string_reverser.rb +26 -0
- data/examples/text/pig_latin.rb +2 -2
- data/examples/text/regional_flavor/README.md +14 -0
- data/examples/text/regional_flavor/article_wordbags.pig +39 -0
- data/examples/text/regional_flavor/j01-article_wordbags.rb +4 -0
- data/examples/text/regional_flavor/simple_pig_script.pig +27 -0
- data/examples/word_count/accumulator.rb +26 -0
- data/examples/word_count/tokenizer.rb +13 -0
- data/examples/word_count/word_count.rb +6 -0
- data/examples/workflow/cherry_pie.dot +97 -0
- data/examples/workflow/cherry_pie.png +0 -0
- data/examples/workflow/cherry_pie.rb +61 -26
- data/lib/hanuman.rb +34 -7
- data/lib/hanuman/graph.rb +55 -31
- data/lib/hanuman/graphvizzer.rb +199 -178
- data/lib/hanuman/graphvizzer/gv_models.rb +161 -0
- data/lib/hanuman/graphvizzer/gv_presenter.rb +97 -0
- data/lib/hanuman/link.rb +35 -0
- data/lib/hanuman/registry.rb +46 -0
- data/lib/hanuman/stage.rb +76 -32
- data/lib/wukong.rb +23 -24
- data/lib/wukong/boot.rb +87 -0
- data/lib/wukong/configuration.rb +8 -0
- data/lib/wukong/dataflow.rb +45 -78
- data/lib/wukong/driver.rb +99 -0
- data/lib/wukong/emitter.rb +22 -0
- data/lib/wukong/model/faker.rb +24 -24
- data/lib/wukong/model/flatpack_parser/flat.rb +60 -0
- data/lib/wukong/model/flatpack_parser/flatpack.rb +4 -0
- data/lib/wukong/model/flatpack_parser/lang.rb +46 -0
- data/lib/wukong/model/flatpack_parser/parser.rb +55 -0
- data/lib/wukong/model/flatpack_parser/tokens.rb +130 -0
- data/lib/wukong/processor.rb +60 -114
- data/lib/wukong/spec_helpers.rb +81 -0
- data/lib/wukong/spec_helpers/integration_driver.rb +144 -0
- data/lib/wukong/spec_helpers/integration_driver_matchers.rb +219 -0
- data/lib/wukong/spec_helpers/processor_helpers.rb +95 -0
- data/lib/wukong/spec_helpers/processor_methods.rb +108 -0
- data/lib/wukong/spec_helpers/shared_examples.rb +15 -0
- data/lib/wukong/spec_helpers/spec_driver.rb +28 -0
- data/lib/wukong/spec_helpers/spec_driver_matchers.rb +195 -0
- data/lib/wukong/version.rb +2 -1
- data/lib/wukong/widget/filters.rb +311 -0
- data/lib/wukong/widget/processors.rb +156 -0
- data/lib/wukong/widget/reducers.rb +7 -0
- data/lib/wukong/widget/reducers/accumulator.rb +73 -0
- data/lib/wukong/widget/reducers/bin.rb +318 -0
- data/lib/wukong/widget/reducers/count.rb +61 -0
- data/lib/wukong/widget/reducers/group.rb +85 -0
- data/lib/wukong/widget/reducers/group_concat.rb +70 -0
- data/lib/wukong/widget/reducers/moments.rb +72 -0
- data/lib/wukong/widget/reducers/sort.rb +130 -0
- data/lib/wukong/widget/serializers.rb +287 -0
- data/lib/wukong/widget/sink.rb +10 -52
- data/lib/wukong/widget/source.rb +7 -113
- data/lib/wukong/widget/utils.rb +46 -0
- data/lib/wukong/widgets.rb +6 -0
- data/spec/examples/dataflow/fibonacci_series_spec.rb +18 -0
- data/spec/examples/dataflow/parsing_spec.rb +12 -11
- data/spec/examples/dataflow/simple_spec.rb +32 -6
- data/spec/examples/dataflow/telegram_spec.rb +36 -36
- data/spec/examples/graph/minimum_spanning_tree_spec.rb +30 -31
- data/spec/examples/munging/airline_flights/identifiers_spec.rb +16 -0
- data/spec/examples/munging/airline_flights_spec.rb +202 -0
- data/spec/examples/text/pig_latin_spec.rb +13 -16
- data/spec/examples/workflow/cherry_pie_spec.rb +34 -4
- data/spec/hanuman/graph_spec.rb +27 -2
- data/spec/hanuman/hanuman_spec.rb +10 -0
- data/spec/hanuman/registry_spec.rb +123 -0
- data/spec/hanuman/stage_spec.rb +61 -7
- data/spec/spec_helper.rb +29 -19
- data/spec/support/hanuman_test_helpers.rb +14 -12
- data/spec/support/shared_context_for_reducers.rb +37 -0
- data/spec/support/shared_examples_for_builders.rb +101 -0
- data/spec/support/shared_examples_for_shortcuts.rb +57 -0
- data/spec/support/wukong_test_helpers.rb +37 -11
- data/spec/wukong/dataflow_spec.rb +77 -55
- data/spec/wukong/local_runner_spec.rb +24 -24
- data/spec/wukong/model/faker_spec.rb +132 -131
- data/spec/wukong/runner_spec.rb +8 -8
- data/spec/wukong/widget/filters_spec.rb +61 -0
- data/spec/wukong/widget/processors_spec.rb +126 -0
- data/spec/wukong/widget/reducers/bin_spec.rb +92 -0
- data/spec/wukong/widget/reducers/count_spec.rb +11 -0
- data/spec/wukong/widget/reducers/group_spec.rb +20 -0
- data/spec/wukong/widget/reducers/moments_spec.rb +36 -0
- data/spec/wukong/widget/reducers/sort_spec.rb +26 -0
- data/spec/wukong/widget/serializers_spec.rb +92 -0
- data/spec/wukong/widget/sink_spec.rb +15 -15
- data/spec/wukong/widget/source_spec.rb +65 -41
- data/spec/wukong/wukong_spec.rb +10 -0
- data/wukong.gemspec +17 -10
- metadata +359 -335
- data/.document +0 -5
- data/VERSION +0 -1
- data/bin/hdp-bin +0 -44
- data/bin/hdp-bzip +0 -23
- data/bin/hdp-cat +0 -3
- data/bin/hdp-catd +0 -3
- data/bin/hdp-cp +0 -3
- data/bin/hdp-du +0 -86
- data/bin/hdp-get +0 -3
- data/bin/hdp-kill +0 -3
- data/bin/hdp-kill-task +0 -3
- data/bin/hdp-ls +0 -11
- data/bin/hdp-mkdir +0 -2
- data/bin/hdp-mkdirp +0 -12
- data/bin/hdp-mv +0 -3
- data/bin/hdp-parts_to_keys.rb +0 -77
- data/bin/hdp-ps +0 -3
- data/bin/hdp-put +0 -3
- data/bin/hdp-rm +0 -32
- data/bin/hdp-sort +0 -40
- data/bin/hdp-stream +0 -40
- data/bin/hdp-stream-flat +0 -22
- data/bin/hdp-stream2 +0 -39
- data/bin/hdp-sync +0 -17
- data/bin/hdp-wc +0 -67
- data/bin/wu-flow +0 -10
- data/bin/wu-map +0 -17
- data/bin/wu-red +0 -17
- data/bin/wukong +0 -17
- data/data/CREDITS.md +0 -355
- data/data/graph/airfares.tsv +0 -2174
- data/data/text/gift_of_the_magi.txt +0 -225
- data/data/text/jabberwocky.txt +0 -36
- data/data/text/rectification_of_names.txt +0 -33
- data/data/twitter/a_atsigns_b.tsv +0 -64
- data/data/twitter/a_follows_b.tsv +0 -53
- data/data/twitter/tweet.tsv +0 -167
- data/data/twitter/twitter_user.tsv +0 -55
- data/data/wikipedia/dbpedia-sentences.tsv +0 -1000
- data/docpages/INSTALL.textile +0 -92
- data/docpages/LICENSE.textile +0 -107
- data/docpages/README-elastic_map_reduce.textile +0 -377
- data/docpages/README-performance.textile +0 -90
- data/docpages/README-wulign.textile +0 -65
- data/docpages/UsingWukong-part1-get_ready.textile +0 -17
- data/docpages/UsingWukong-part2-ThinkingBigData.textile +0 -75
- data/docpages/UsingWukong-part3-parsing.textile +0 -138
- data/docpages/_config.yml +0 -39
- data/docpages/avro/avro_notes.textile +0 -56
- data/docpages/avro/performance.textile +0 -36
- data/docpages/avro/tethering.textile +0 -19
- data/docpages/bigdata-tips.textile +0 -143
- data/docpages/code/api_response_example.txt +0 -20
- data/docpages/code/parser_skeleton.rb +0 -38
- data/docpages/diagrams/MapReduceDiagram.graffle +0 -0
- data/docpages/favicon.ico +0 -0
- data/docpages/gem.css +0 -16
- data/docpages/hadoop-tips.textile +0 -83
- data/docpages/index.textile +0 -92
- data/docpages/intro.textile +0 -8
- data/docpages/moreinfo.textile +0 -174
- data/docpages/news.html +0 -24
- data/docpages/pig/PigLatinExpressionsList.txt +0 -122
- data/docpages/pig/PigLatinReferenceManual.txt +0 -1640
- data/docpages/pig/commandline_params.txt +0 -26
- data/docpages/pig/cookbook.html +0 -481
- data/docpages/pig/images/hadoop-logo.jpg +0 -0
- data/docpages/pig/images/instruction_arrow.png +0 -0
- data/docpages/pig/images/pig-logo.gif +0 -0
- data/docpages/pig/piglatin_ref1.html +0 -1103
- data/docpages/pig/piglatin_ref2.html +0 -14340
- data/docpages/pig/setup.html +0 -505
- data/docpages/pig/skin/basic.css +0 -166
- data/docpages/pig/skin/breadcrumbs.js +0 -237
- data/docpages/pig/skin/fontsize.js +0 -166
- data/docpages/pig/skin/getBlank.js +0 -40
- data/docpages/pig/skin/getMenu.js +0 -45
- data/docpages/pig/skin/images/chapter.gif +0 -0
- data/docpages/pig/skin/images/chapter_open.gif +0 -0
- data/docpages/pig/skin/images/current.gif +0 -0
- data/docpages/pig/skin/images/external-link.gif +0 -0
- data/docpages/pig/skin/images/header_white_line.gif +0 -0
- data/docpages/pig/skin/images/page.gif +0 -0
- data/docpages/pig/skin/images/pdfdoc.gif +0 -0
- data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
- data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
- data/docpages/pig/skin/print.css +0 -54
- data/docpages/pig/skin/profile.css +0 -181
- data/docpages/pig/skin/screen.css +0 -587
- data/docpages/pig/tutorial.html +0 -1059
- data/docpages/pig/udf.html +0 -1509
- data/docpages/tutorial.textile +0 -283
- data/docpages/usage.textile +0 -195
- data/docpages/wutils.textile +0 -263
- data/examples/dataflow/complex.rb +0 -11
- data/examples/dataflow/donuts.rb +0 -13
- data/examples/tiny_count/jabberwocky_output.tsv +0 -92
- data/examples/word_count.rb +0 -48
- data/examples/workflow/fiddle.rb +0 -24
- data/lib/away/escapement.rb +0 -129
- data/lib/away/exe.rb +0 -11
- data/lib/away/experimental.rb +0 -5
- data/lib/away/from_file.rb +0 -52
- data/lib/away/job.rb +0 -56
- data/lib/away/job/rake_compat.rb +0 -17
- data/lib/away/registry.rb +0 -79
- data/lib/away/runner.rb +0 -276
- data/lib/away/runner/execute.rb +0 -121
- data/lib/away/script.rb +0 -161
- data/lib/away/script/hadoop_command.rb +0 -240
- data/lib/away/source/file_list_source.rb +0 -15
- data/lib/away/source/looper.rb +0 -18
- data/lib/away/task.rb +0 -219
- data/lib/hanuman/action.rb +0 -21
- data/lib/hanuman/chain.rb +0 -4
- data/lib/hanuman/graphviz.rb +0 -74
- data/lib/hanuman/resource.rb +0 -6
- data/lib/hanuman/slot.rb +0 -87
- data/lib/hanuman/slottable.rb +0 -220
- data/lib/wukong/bad_record.rb +0 -15
- data/lib/wukong/event.rb +0 -44
- data/lib/wukong/local_runner.rb +0 -55
- data/lib/wukong/mapred.rb +0 -3
- data/lib/wukong/universe.rb +0 -48
- data/lib/wukong/widget/filter.rb +0 -81
- data/lib/wukong/widget/gibberish.rb +0 -123
- data/lib/wukong/widget/monitor.rb +0 -26
- data/lib/wukong/widget/reducer.rb +0 -66
- data/lib/wukong/widget/stringifier.rb +0 -50
- data/lib/wukong/workflow.rb +0 -22
- data/lib/wukong/workflow/command.rb +0 -42
- data/old/config/emr-example.yaml +0 -48
- data/old/examples/README.txt +0 -17
- data/old/examples/contrib/jeans/README.markdown +0 -165
- data/old/examples/contrib/jeans/data/normalized_sizes +0 -3
- data/old/examples/contrib/jeans/data/orders.tsv +0 -1302
- data/old/examples/contrib/jeans/data/sizes +0 -3
- data/old/examples/contrib/jeans/normalize.rb +0 -20
- data/old/examples/contrib/jeans/sizes.rb +0 -55
- data/old/examples/corpus/bnc_word_freq.rb +0 -44
- data/old/examples/corpus/bucket_counter.rb +0 -47
- data/old/examples/corpus/dbpedia_abstract_to_sentences.rb +0 -86
- data/old/examples/corpus/sentence_bigrams.rb +0 -53
- data/old/examples/corpus/sentence_coocurrence.rb +0 -66
- data/old/examples/corpus/stopwords.rb +0 -138
- data/old/examples/corpus/words_to_bigrams.rb +0 -53
- data/old/examples/emr/README.textile +0 -110
- data/old/examples/emr/dot_wukong_dir/credentials.json +0 -7
- data/old/examples/emr/dot_wukong_dir/emr.yaml +0 -69
- data/old/examples/emr/dot_wukong_dir/emr_bootstrap.sh +0 -33
- data/old/examples/emr/elastic_mapreduce_example.rb +0 -28
- data/old/examples/network_graph/adjacency_list.rb +0 -74
- data/old/examples/network_graph/breadth_first_search.rb +0 -72
- data/old/examples/network_graph/gen_2paths.rb +0 -68
- data/old/examples/network_graph/gen_multi_edge.rb +0 -112
- data/old/examples/network_graph/gen_symmetric_links.rb +0 -64
- data/old/examples/pagerank/README.textile +0 -6
- data/old/examples/pagerank/gen_initial_pagerank_graph.pig +0 -57
- data/old/examples/pagerank/pagerank.rb +0 -72
- data/old/examples/pagerank/pagerank_initialize.rb +0 -42
- data/old/examples/pagerank/run_pagerank.sh +0 -21
- data/old/examples/sample_records.rb +0 -33
- data/old/examples/server_logs/apache_log_parser.rb +0 -15
- data/old/examples/server_logs/nook.rb +0 -48
- data/old/examples/server_logs/nook/faraday_dummy_adapter.rb +0 -94
- data/old/examples/server_logs/user_agent.rb +0 -40
- data/old/examples/simple_word_count.rb +0 -82
- data/old/examples/size.rb +0 -61
- data/old/examples/stats/avg_value_frequency.rb +0 -86
- data/old/examples/stats/binning_percentile_estimator.rb +0 -140
- data/old/examples/stats/data/avg_value_frequency.tsv +0 -3
- data/old/examples/stats/rank_and_bin.rb +0 -173
- data/old/examples/stupidly_simple_filter.rb +0 -40
- data/old/examples/word_count.rb +0 -75
- data/old/graph/graphviz_builder.rb +0 -580
- data/old/graph_easy/Attributes.pm +0 -4181
- data/old/graph_easy/Graphviz.pm +0 -2232
- data/old/wukong.rb +0 -18
- data/old/wukong/and_pig.rb +0 -38
- data/old/wukong/bad_record.rb +0 -18
- data/old/wukong/datatypes.rb +0 -24
- data/old/wukong/datatypes/enum.rb +0 -127
- data/old/wukong/datatypes/fake_types.rb +0 -17
- data/old/wukong/decorator.rb +0 -28
- data/old/wukong/encoding/asciize.rb +0 -108
- data/old/wukong/extensions.rb +0 -16
- data/old/wukong/extensions/array.rb +0 -18
- data/old/wukong/extensions/blank.rb +0 -93
- data/old/wukong/extensions/class.rb +0 -189
- data/old/wukong/extensions/date_time.rb +0 -53
- data/old/wukong/extensions/emittable.rb +0 -69
- data/old/wukong/extensions/enumerable.rb +0 -79
- data/old/wukong/extensions/hash.rb +0 -167
- data/old/wukong/extensions/hash_keys.rb +0 -16
- data/old/wukong/extensions/hash_like.rb +0 -150
- data/old/wukong/extensions/hashlike_class.rb +0 -47
- data/old/wukong/extensions/module.rb +0 -2
- data/old/wukong/extensions/pathname.rb +0 -27
- data/old/wukong/extensions/string.rb +0 -65
- data/old/wukong/extensions/struct.rb +0 -17
- data/old/wukong/extensions/symbol.rb +0 -11
- data/old/wukong/filename_pattern.rb +0 -74
- data/old/wukong/helper.rb +0 -7
- data/old/wukong/helper/stopwords.rb +0 -195
- data/old/wukong/helper/tokenize.rb +0 -35
- data/old/wukong/logger.rb +0 -38
- data/old/wukong/periodic_monitor.rb +0 -72
- data/old/wukong/schema.rb +0 -269
- data/old/wukong/script.rb +0 -286
- data/old/wukong/script/avro_command.rb +0 -5
- data/old/wukong/script/cassandra_loader_script.rb +0 -40
- data/old/wukong/script/emr_command.rb +0 -168
- data/old/wukong/script/hadoop_command.rb +0 -237
- data/old/wukong/script/local_command.rb +0 -41
- data/old/wukong/store.rb +0 -10
- data/old/wukong/store/base.rb +0 -27
- data/old/wukong/store/cassandra.rb +0 -10
- data/old/wukong/store/cassandra/streaming.rb +0 -75
- data/old/wukong/store/cassandra/struct_loader.rb +0 -21
- data/old/wukong/store/cassandra_model.rb +0 -91
- data/old/wukong/store/chh_chunked_flat_file_store.rb +0 -37
- data/old/wukong/store/chunked_flat_file_store.rb +0 -48
- data/old/wukong/store/conditional_store.rb +0 -57
- data/old/wukong/store/factory.rb +0 -8
- data/old/wukong/store/flat_file_store.rb +0 -89
- data/old/wukong/store/key_store.rb +0 -51
- data/old/wukong/store/null_store.rb +0 -15
- data/old/wukong/store/read_thru_store.rb +0 -22
- data/old/wukong/store/tokyo_tdb_key_store.rb +0 -33
- data/old/wukong/store/tyrant_rdb_key_store.rb +0 -57
- data/old/wukong/store/tyrant_tdb_key_store.rb +0 -20
- data/old/wukong/streamer.rb +0 -30
- data/old/wukong/streamer/accumulating_reducer.rb +0 -83
- data/old/wukong/streamer/base.rb +0 -126
- data/old/wukong/streamer/counting_reducer.rb +0 -25
- data/old/wukong/streamer/filter.rb +0 -20
- data/old/wukong/streamer/instance_streamer.rb +0 -15
- data/old/wukong/streamer/json_streamer.rb +0 -21
- data/old/wukong/streamer/line_streamer.rb +0 -12
- data/old/wukong/streamer/list_reducer.rb +0 -31
- data/old/wukong/streamer/rank_and_bin_reducer.rb +0 -145
- data/old/wukong/streamer/record_streamer.rb +0 -14
- data/old/wukong/streamer/reducer.rb +0 -11
- data/old/wukong/streamer/set_reducer.rb +0 -14
- data/old/wukong/streamer/struct_streamer.rb +0 -48
- data/old/wukong/streamer/summing_reducer.rb +0 -29
- data/old/wukong/streamer/uniq_by_last_reducer.rb +0 -51
- data/old/wukong/typed_struct.rb +0 -12
- data/spec/away/encoding_spec.rb +0 -32
- data/spec/away/exe_spec.rb +0 -20
- data/spec/away/flow_spec.rb +0 -82
- data/spec/away/graph_spec.rb +0 -6
- data/spec/away/job_spec.rb +0 -15
- data/spec/away/rake_compat_spec.rb +0 -9
- data/spec/away/script_spec.rb +0 -81
- data/spec/hanuman/graphviz_spec.rb +0 -29
- data/spec/hanuman/slot_spec.rb +0 -2
- data/spec/support/examples_helper.rb +0 -10
- data/spec/support/streamer_test_helpers.rb +0 -6
- data/spec/support/wukong_widget_helpers.rb +0 -66
- data/spec/wukong/processor_spec.rb +0 -109
- data/spec/wukong/widget/filter_spec.rb +0 -99
- data/spec/wukong/widget/stringifier_spec.rb +0 -51
- data/spec/wukong/workflow/command_spec.rb +0 -5
@@ -1,36 +0,0 @@
-
-
-h2. Bulk Streaming use cases
-
-* Take a bunch of nightly calculations and need to flood it into the DB -- http://sna-projects.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/ In this case, it's important that the bulk load happen efficiently but with low stress on the DB. I'm willing to make it so that data streams to each cassandra node in the sort order and pre-partitioned just like the node wants to use it. Should run at the full streaming speed of the disk.
-
-* Building a new table or moving a legacy database over to cassandra. I want to write from one or several nodes, probably not in the cluster, and data will be completely unpartitioned. I might be able to make some guarantees about uniqueness of keys and rows (that is, you'll generally only see a key once, and/or when you see a key it will contain the entire row). 20k inserts/s per receiving node.
-
-* Using cassandra to replace HDFS. Replication is for compute, not for availability -- so efficient writing at consistency level ANY is important. Would like to get 100k inserts/s per receiving node.
-
-* A brand new user wants to just stuff his goddamn data into the goddamn database and start playing with it. It had better be not-terribly-slow, and it had better be really easy to take whatever insane format it shows up in and cram that into the data hole. It should also be conceptually straightforward: it should look like I'm writing hashes or hashes of hashes.
-
-
-===========================================================================
-From http://sna-projects.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/
-
-Here are the times taken:
-
-* 100GB: 28 mins (400 mappers, 90 reducers)
-* 512GB: 2 hrs 16 mins (2313 mappers, 350 reducers)
-* 1TB: 5 hrs 39 mins (4608 mappers, 700 reducers)
-
-Data transfer between the clusters happens at a steady rate bound by the disk or network. For our Amazon instances this is around 40MB/second.
-
-Online Performance
-
-Lookup time for a single Voldemort node compares well to a single MySQL instance. To test this we ran local tests against the 100GB per-node data from the 1TB test, on an Amazon Extra Large instance with 15GB of RAM and the 4 ephemeral disks in a RAID 10 configuration. We simulated 1 million requests from a real request stream recorded on our production system against each storage system, and saw the following performance for 1 million requests against a single node:
-
-                            MySQL       Voldemort
-Reqs per sec.               727         1291
-Median req. time            0.23 ms     0.05 ms
-Avg. req. time              13.7 ms     7.7 ms
-99th percentile req. time   127.2 ms    100.7 ms
-
-These numbers are both for local requests with no network involved, as the only intention is to benchmark the storage layer of these systems.
@@ -1,19 +0,0 @@
-The idea is that folks would somehow write a Ruby application. On startup, it starts a server speaking InputProtocol:
-
-http://svn.apache.org/viewvc/avro/trunk/share/schemas/org/apache/avro/mapred/tether/InputProtocol.avpr?view=markup
-
-Then it gets a port from the environment variable AVRO_TETHER_OUTPUT_PORT. It connects to this port, using OutputProtocol, and uses the configure() message to send the port of its input server to its parent:
-
-http://svn.apache.org/viewvc/avro/trunk/share/schemas/org/apache/avro/mapred/tether/OutputProtocol.avpr?view=markup
-
-The meat of maps and reduces consists of the parent sending inputs to the child with the input() message and the child sending outputs back to the parent with the output() message.
-
-If it helps any, there's a Java implementation of the child, including a demo WordCount application:
-
-http://svn.apache.org/viewvc/avro/trunk/lang/java/src/test/java/org/apache/avro/mapred/tether/
-
-One nit, should you choose to accept this task: Avro's Ruby RPC support will need to be enhanced, as it doesn't yet support request-only messages. I can probably cajole someone to help with this, and there's a workaround for debugging (switch things to HTTP).
-
-http://svn.apache.org/viewvc/avro/trunk/lang/ruby/lib/avro/ipc.rb?view=markup
-
-
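The tether handshake described in the deleted notes above can be sketched in Ruby. Everything here is illustrative: `TetherChild` and `FakeParent` are hypothetical names (not wukong or avro-gem API), and plain method calls stand in for the Avro RPC messages `configure()`, `input()` and `output()`; a real child would serve InputProtocol over a socket and speak OutputProtocol back to the Hadoop parent.

```ruby
# Hypothetical sketch of the tether-child handshake; not wukong or avro-gem API.
class TetherChild
  attr_reader :input_port

  def initialize(env, parent)
    # the parent advertises its OutputProtocol port via the environment
    @parent_port = Integer(env.fetch('AVRO_TETHER_OUTPUT_PORT'))
    @parent      = parent
  end

  def start!
    # real code would bind a TCPServer speaking InputProtocol and read back
    # the OS-assigned port; here we just pretend one was handed to us
    @input_port = 54321
    # connect back over OutputProtocol and announce our input port
    @parent.configure(@input_port)
    self
  end

  # the parent pushes records in via input(); we answer via output()
  def input(records)
    records.each { |rec| @parent.output(rec.upcase) } # stand-in for the map
  end
end

# Stand-in for the Hadoop-side parent process.
class FakeParent
  attr_reader :child_port, :outputs

  def initialize
    @outputs = []
  end

  def configure(port)
    @child_port = port
  end

  def output(rec)
    @outputs << rec
  end
end

parent = FakeParent.new
child  = TetherChild.new({ 'AVRO_TETHER_OUTPUT_PORT' => '6066' }, parent).start!
child.input(%w[foo bar])
parent.outputs # the parent has received one output per input record
```

The point of the sketch is the ordering: the input server must be up before `configure()` announces its port, since the parent may start sending `input()` messages immediately after.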
@@ -1,143 +0,0 @@
|
|
1
|
-
---
|
2
|
-
layout: default
|
3
|
-
title: mrflip.github.com/wukong - Lessons Learned working with Big Data
|
4
|
-
collapse: false
|
5
|
-
---
|
6
|
-
|
7
|
-
h2. Random Thoughts on Big Data
|
8
|
-
|
9
|
-
Stuff changes when you cross the 100GB barrier. Here are random musings on why it might make sense to
|
10
|
-
|
11
|
-
* Sort everything
|
12
|
-
* Don't do any error handling
|
13
|
-
* Catch errors and emit them along with your data
|
14
|
-
* Make everything ASCII
|
15
|
-
* Abandon integer keys
|
16
|
-
* Use bash as your data-analysis IDE.
|
17
|
-
|
18
|
-
h2(#dropacid). Drop ACID, explore Big Data
|
19
|
-
|
20
|
-
The traditional "ACID quartet":http://en.wikipedia.org/wiki/ACID for relational databases can be re-interpreted in a Big Data context:
|
21
|
-
|
22
|
-
* A -- Associative
|
23
|
-
* C -- Commutative
|
24
|
-
* I -- Idempotent
|
25
|
-
* D -- Distributed
|
26
|
-
* (*) -- (and where possible, left in sort order)
|
27
|
-
|
28
|
-
Finally, where possible leave things in sort order by some appropriate index. Clearly I'm not talking about introducing extra unnecessary sorts on ephemeral data. For things that will be read (and experimented with) much more often than they're written, though, it's worth running a final sort. Now you can
|
29
|
-
|
30
|
-
* Efficiently index into a massive dataset with binary search
|
31
|
-
* Do a direct merge sort on two files with the same sort order
|
32
|
-
* Run a reducer directly across the data
|
33
|
-
* Assign a synthetic key by just serially numbering lines (either distribute a unique prefix to each mapper
|
34
|
-
|
35
|
-
Note: for files that will live on the DFS, you should usually *not* do a total sort,
|
36
|
-
|
37
|
-
h2. If it's not broken, it's wrong
|
38
|
-
|
39
|
-
Something that goes wrong one in a five million times will crop up hundreds of times in a billion-record collection.
|
40
|
-
|
41
|
-
h3. Error is not normally distributed
|
42
|
-
|
43
|
-
What's more, errors introduced will not in general be normally distributed and their impact may not decrease with increasing data size.
|
44
|
-
|
45
|
-
h3. Do your exception handling in-band
|
46
|
-
|
47
|
-
A large, heavily-used cluster will want to have ganglia or "scribe":http://www.cloudera.com/blog/2008/11/02/configuring-and-using-scribe-for-hadoop-log-collection/ or the like collecting and managing log data. "Splunk":http://www.splunk.com/ is a compelling option I haven't myself used, but it is "broadly endorsed.":http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/
|
48
|
-
|
49
|
-
However, it's worth considering another extremely efficient, simple and powerful distributed system for routing massive quantities of data in a structured way, namely wukong|hadoop itself.
|
50
|
-
|
51
|
-
Wukong gives you a BadRecord class -- just rescue errors, pass the full or partial contents of the offending input. and emit the BadRecord instance in-band. They'll be serialized out along with the rest, and at your preference can be made to reduce to a single instance. Do analysis on them at your leisure; by default, any StructStreamer will silently discard *inbound* BadRecords -- they won't survive past the current generation.

h2(#keys). Keys

* Artificial key: assigned externally; the key is not a function of the object's intrinsic values. A social security number is an artificial key.

* Natural key: minimal subset of fields with _intrinsic semantic value_ that _uniquely identify_ the record. My name isn't unique, but my fingerprint is both unique and intrinsic. Given the object (me) you can generate the key, and given the key there's exactly one object (me) that matches.

h4. other fields

* Mutable:
** A user's 'bio' section.

* Immutable:
** A user's created_at date is immutable: it doesn't help identify the person but it will never change.

h4. Natural keys are right for big data

Synthetic keys suck. They demand locality or a central keymaster.

* Use the natural key
* Hash the natural key. This has some drawbacks.
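
For instance, hashing a URL down to a fixed-width key (a hypothetical helper; MD5 is incidental, any stable digest works):

```ruby
require 'digest/md5'

# Any mapper can derive the key independently -- no central keymaster --
# at the cost of losing the natural key's readability and sort order
# (two of the drawbacks alluded to above).
def key_for(url)
  Digest::MD5.hexdigest(url)
end

key_for('http://twitter.com/mrflip')  # always the same 32 hex chars
```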

OK, fine. You need a synthetic key:

* Do a total sort, and use @nl@
* Generate
* Use a single reducer to reduce locality. YUCK.
* Have each mapper generate a unique prefix; number each line as "prefix#{line_number}" or whatever.

How do you get a unique prefix?

* Distribute a unique prefix to each mapper out-of-band. People using Streaming are out of luck.

* Use a UUID -- that's what they're for. Drawback: ridiculously long.

* Hash the machine name, PID and timestamp to something short. Check after the fact that uniqueness was achieved. Use the birthday problem formula to find out how often this will happen. (In practice, almost never.)
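
That last recipe can be sketched in a few lines (names here are illustrative, not Wukong's API):

```ruby
require 'digest/md5'
require 'socket'

# A short per-mapper prefix from hostname, PID and start time; each
# mapper then numbers its own lines locally, no coordination needed.
PREFIX = Digest::MD5.hexdigest(
  [Socket.gethostname, Process.pid, Time.now.to_f].join('-'))[0, 8]

def keyed(lines)
  lines.each_with_index.map { |line, idx| "#{PREFIX}-#{idx}\t#{line}" }
end

# Birthday-problem bound: with n mappers and a b-bit prefix, the odds of
# any collision are roughly n**2 / 2**(b+1). For 1,000 mappers and a
# 32-bit (8 hex char) prefix that's on the order of 1 in 10,000.
odds = 1000.0**2 / 2.0**33
```

Cheap to verify after the fact: count distinct prefixes and compare to the number of map tasks.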

You can consider your fields to be one of three types:

* Key
** natural: a unique username, a URL, the MD5 hash of a URL
** synthetic: an integer generated by some central keymaster
* Mutable:
** eg a user's 'bio' section.
* Immutable:
** A user's created_at date is immutable: it doesn't help identify the person but it will never change.

The meaning of a key depends on its semantics. Is a URL a key?

* A location: (compare: "The head of household residing at 742 Evergreen Terr, Springfield USA")
* An entity handle (URI): (compare: "Homer J Simpson (aka Max Power)")
* An observation of that entity: Many URLs are handles to a __stream__ -- http://twitter.com/mrflip names the resource "mrflip's twitter stream", but loading that page offers only the last 20 entries in that stream. (compare: "The collection of all words spoken by the residents of 742 Evergreen Terr, Springfield USA")

h2(#bashide). The command line is an IDE

{% highlight sh %}
cat /data/foo.tsv | ruby -ne 'puts $_.chomp.scan(/text="([^"]+)"/).join("\t")'
{% endhighlight %}

h2(#encoding). Encode once, and carefully.

Encoding violates idempotence. Data brought in from elsewhere *must* be considered unparsable, ill-formatted and rife with illegal characters.

* Immediately fix a copy of the original data with as minimal encoding as possible.
* Follow this with a separate parse stage to emit perfectly well-formed, tab-separated / newline-delimited data.
* In this parse stage, encode the data to 7 bits, free of internal tabs, backslashes, carriage returns/line feeds or control characters. You want your encoding scheme to be
** perfectly reversible
** widely implemented
** easily parseable
** recognizable: incoming data that is mostly inoffensive (a json record, or each line of a document such as this one) should be minimally altered from its original. This lets you do rough exploration with sort/cut/grep and friends.
** !! Involve **NO QUOTING**, only escaping. I can write a simple regexp to decode entities such as %10 or \n. This regexp will behave harmlessly with ill-formed data (eg %%10 or &&; or \ at end of line) and is robust against data being split or interpolated. Schemes such as "quoting: it's bad", %Q{quoting: "just say no"} or <notextile>tagged markup</notextile> require a recursive parser. An extra or missing quote mark is almost impossible to backtrack.
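
To make the contrast concrete, here is a toy escape-only scheme in the percent-encoding family (illustrative, not a blessed format): decoding is a single regexp substitution, and ill-formed input degrades gracefully instead of derailing a parser.

```ruby
# Escape the troublemakers (%, tab, newline, CR, backslash) as %XX.
def encode(str)
  str.gsub(/[%\t\n\r\\]/) { |ch| '%%%02X' % ch.ord }
end

# Decoding is one regexp -- no state, no recursion, no matched pairs.
# A stray "%" with no hex digits after it simply passes through.
def decode(str)
  str.gsub(/%([0-9A-F]{2})/) { $1.hex.chr }
end

s = "fields\tand\nlines"
encode(s)                     # => "fields%09and%0Alines" -- still grep-able
decode(encode(s)) == s        # perfectly reversible
decode('ends with a bare %')  # => unchanged; nothing to derail
```

Contrast a quoting scheme: one lost quote mark and every record after it is misparsed; here a lost character costs you at most that character.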

In the absence of some lightweight, mostly-transparent, ASCII-compatible *AND* idempotent encoding scheme lurking in a back closet of some algorithms book -- how to handle the initial lousy payload coming off the wire?

* For data that is *mostly* text in a western language, you'll do well with XML encoding (with <notextile>[\n\r\t\\]</notextile> forced to encode as entities).
* URL encoding isn't as recognizable, but is also safe. Use this for things like URIs and filenames, or if you want to be /really/ paranoid about escaping.
* For binary data, Binhex is efficient enough and every toolkit can handle it. There are more data-efficient ascii-compatible encoding schemes but it's not worth the hassle for the 10% or whatever gain in size.
* If your payload itself is XML data, consider using \0 (nul) between records, with a fixed number of tab-separated metadata fields leading the XML data, which can then include tabs, newlines, or whatever the hell it wants. No changes are made to the data apart from a quick gsub to remove any (highly illegal) \0 in the XML data itself. A later parse round will convert it to structured hadoop-able data. Ex:

{% highlight html %}
feed_request 20090809101112 200 OK <?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>infochimps.org — Find Any Dataset in the World</title>
{% endhighlight %}

p. Many of the command line utilities (@cat@, @grep@, etc.) will accept nul-delimited files.

You may be tempted to use XML around your XML so you can XML while you XML. Ultimately, you'll find this can only be done right by doing a full parse of the input -- and at that point you should just translate directly to a reasonable tab/newline format. (Even if that format is tsv-compatible JSON.)
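
Reading those nul-delimited records back is pleasantly dull -- in Ruby you just change the record separator (a sketch; the field names mirror the example above):

```ruby
require 'stringio'

# Each record: tab-separated metadata fields, then a raw XML payload
# that may contain tabs and newlines freely; records end with \0.
def each_record(io)
  io.each_line("\0") do |rec|
    rec.chomp!("\0")
    yield rec.split("\t", 5)  # 4 metadata fields + the XML, untouched
  end
end

sample = "feed_request\t20090809101112\t200\tOK\t<html>\nstill one record\n</html>\0"
each_record(StringIO.new(sample)) do |type, timestamp, code, message, xml|
  # xml arrives whole, newlines and all
end
```

The @split("\t", 5)@ limit is what keeps the payload intact: only the leading metadata tabs are treated as field separators.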

@@ -1,20 +0,0 @@
[
  { // TwitterUser
    "id":123456789,
    // Basic fields
    "screen_name":"nena", "protected":false, "created_at":"Thu Apr 23 02:00:00 +0000 2009",
    "followers_count":0, "friends_count":1, "statuses_count":1, "favourites_count":0,
    // TwitterUserProfile fields
    "name":"nena", "url":null, "location":null, "description":null, "time_zone":null, "utc_offset":null,
    // TwitterUserStyle
    "profile_background_color":"9ae4e8", "profile_text_color":"000000", "profile_link_color":"0000ff", "profile_sidebar_border_color":"87bc44", "profile_sidebar_fill_color":"e0ff92", "profile_background_tile":false,
    "profile_background_image_url":"http:\/\/static.twitter.com\/images\/themes\/theme1\/bg.gif",
    "profile_image_url":"http:\/\/s3.amazonaws.com\/twitter_production\/profile_images\/123456789\/crane_normal.JPG",
    // with enclosed Tweet
    "status": {
      "id":123456789,
      // the twitter_user_id is implied
      "created_at":"Thu Apr 23 02:00:00 +0000 2009", "favorited":false, "truncated":false, "source":"web",
      "in_reply_to_user_id":null, "in_reply_to_status_id":null, "in_reply_to_screen_name":null,
      "text":"My cat's breath smells like cat food." },
  },
@@ -1,38 +0,0 @@
# extract each record from request contents
# and stream it to output
class TwitterRequestParser < Wukong::Streamer::StructStreamer
  def process request
    request.parse do |obj|
      yield obj
    end
  end
end

# Incoming Request:
class TwitterFollowersRequest < Struct.new(
    :url, :scraped_at, :response_code, :response_message, :moreinfo, :contents)
  include Monkeyshines::ScrapeRequest
end

# Outgoing classes:
class TwitterUser < TypedStruct.new( :id, :scraped_at, :screen_name, :protected, :created_at,
    :followers_count, :friends_count, :statuses_count, :favourites_count )
end
class Tweet < TypedStruct.new(:id, :created_at, :twitter_user_id, :favorited, :truncated,
    :text, :source, :in_reply_to_user_id, :in_reply_to_status_id, :in_reply_to_screen_name)
end

# Parsing code:
TwitterFollowersRequest.class_eval do
  include Monkeyshines::RawJsonContents
  def parse &block
    parsed_contents.each do |user_tweet_hash|
      yield AFollowsB.new user_tweet_hash["id"], self.moreinfo[:request_user_id]
      yield TwitterUser.from_hash user_tweet_hash
      yield Tweet.from_hash user_tweet_hash
    end
  end
end

# This makes the script go.
Wukong::Script.new(TwitterRequestParser, TwitterRequestUniqer).run
Binary file

data/docpages/favicon.ico DELETED
Binary file

data/docpages/gem.css DELETED
@@ -1,16 +0,0 @@
#header a { color: #00a; }
hr { border-color: #66a ; }
h2 { border-color: #acc ; }
h1 { border-color: #acc ; }
.download { border-color: #acc ; }
#footer { border-color: #a0e0e8 ; }

#header a { margin-left:0.125em; margin-right:0.125em; }
h1.gemheader {
  margin: -30px 0 0.5em -65px ;
  text-indent: 65px ;
  height: 90px ;
  padding: 50px 0 10px 0px;
  background: url('/images/wukong.png') no-repeat 0px 0px ;
}
.quiet { font-size: 0.85em ; color: #777 ; font-style: italic }
@@ -1,83 +0,0 @@
---
layout: default
title: mrflip.github.com/wukong - NFS on Hadoop FTW
collapse: false
---

h2. Hadoop Config Tips

h3(#hadoopnfs). Setup NFS within the cluster

If you're lazy, I recommend setting up NFS -- it makes dispatching simple config and script files much easier. (And if you're not lazy, what the hell are you doing using Wukong?) Be careful though -- used unwisely, a swarm of NFS requests will mount a devastatingly effective denial of service attack on your poor old master node.

Installing NFS to share files along the cluster gives the following conveniences:
* You don't have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on master.
* The user can now ssh passwordlessly among the nodes, since there's only one shared home directory and since we included the user's own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.

First, you need to take note of the _internal_ name for your master, perhaps something like @domU-xx-xx-xx-xx-xx-xx.compute-1.internal@.

As root, on the master (change @compute-1.internal@ to match your setup):

<pre>
apt-get install nfs-kernel-server
echo "/home *.compute-1.internal(rw)" >> /etc/exports ;
/etc/init.d/nfs-kernel-server restart ;
</pre>

(The @*.compute-1.internal@ part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)

Next, set up a regular user account on the *master only*. In this case our user will be named 'chimpy':

<pre>
visudo # uncomment the last line, to allow group sudo to sudo
groupadd admin
adduser chimpy
usermod -a -G sudo,admin chimpy
su chimpy # now you are the new user
ssh-keygen -t rsa # accept all the defaults
cat ~/.ssh/id_rsa.pub # can paste this public key into your github, etc
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
</pre>

Then on each slave (replacing domU-xx-... by the internal name for the master node):

<pre>
apt-get install nfs-common ;
echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
/etc/init.d/nfs-common restart
mkdir /mnt/home
mount /mnt/home
ln -s /mnt/home/chimpy /home/chimpy
</pre>

You should now be in business.

Performance tradeoffs should be small as long as you're just sending code files and gems around. *Don't* write out log entries or data to NFS partitions, or you'll effectively perform a denial-of-service attack on the master node.

* http://nfs.sourceforge.net/nfs-howto/ar01s03.html
* The "Setting up an NFS Server HOWTO":http://nfs.sourceforge.net/nfs-howto/index.html was an immense help, and I recommend reading it carefully.

h3(#awstools). Tools for EC2 and S3 Management

* http://s3sync.net/wiki
* http://jets3t.s3.amazonaws.com/applications/applications.html#uploader
* "ElasticFox"
* "S3Fox (S3 Organizer)":
* "FoxyProxy":

h3. Random EC2 notes

* "How to Mount EBS volume at launch":http://clouddevelopertips.blogspot.com/2009/08/mount-ebs-volume-created-from-snapshot.html

* The Cloudera AMIs and distribution include BZip2 support. This means that if you have input files with a .bz2 extension, they will be naturally un-bzipped and streamed. (Note that there is a non-trivial penalty for doing so: each bzip'ed file must go, in whole, to a single mapper; and the CPU load for un-bzipping is sizeable.)

* To _produce_ bzip2 files, specify the @--compress_output=@ flag. If you have the BZip2 patches installed, you can give @--compress_output=bz2@; everyone should be able to use @--compress_output=gz@.

* For excellent performance you can patch your install for "Parallel LZO Splitting":http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/

* If you're using XFS, consider setting the nobarrier option: @/dev/sdf /mnt/data2 xfs noatime,nodiratime,nobarrier 0 0@

* The first write to any disk location is about 5x slower than later writes. Explanation, and how to pre-soften a volume, here: http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/index.html?instance-storage.html
data/docpages/index.textile DELETED
@@ -1,92 +0,0 @@
---
layout: default
title: mrflip.github.com/wukong
collapse: true
---
h1(gemheader). wukong %(small):: hadoop made easy%

p(description). {{ site.description }}

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

<notextile><div class="toggle"></notextile>

h2. Documentation index

* "Install and set up wukong":INSTALL.html
** "Get the code":INSTALL.html#getcode
** "Setup":INSTALL.html#setup
** "Installing and Running Wukong with Hadoop":INSTALL.html#gethadoop
** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":INSTALL.html#others

* "Tutorial":tutorial.html
** "Count Words":tutorial.html#wordcount
** "Structured data":tutorial.html#structstream
** "Accumulators":tutorial.html#accumulators including a UniqByLastReducer and a GroupBy reducer.

* "Usage notes":usage.html
** "How to run a Wukong script":usage.html#running
** "How to test your scripts":usage.html#testing
** "Wukong Plays nicely with others":usage.html#playnice
** "Schema export":usage.html#schema_export to Pig and SQL
** "Using wukong with internal streaming":usage.html#stayinruby
** "Using wukong to Batch-Process ActiveRecord Objects":usage.html#activerecord

* "Wutils":wutils.html -- command-line utilities for working with data from the command line
** "Overview of wutils":wutils.html#wutils -- command listing
** "Stupid command-line tricks":wutils.html#cmdlinetricks using the wutils
** "wu-lign":wutils.html#wulign -- present a tab-separated file as aligned columns
** Dear Lazyweb, please build us a "tab-oriented version of the Textutils library":wutils.html#wutilsinc

* Links and tips for "configuring and working with hadoop":hadoop-tips.html
* Some opinionated "thoughts on working with big data,":bigdata-tips.html on why you should drop acid, treat exceptions as records, and happily embrace variable-length strings as primary keys.
* Wukong is licensed under the "Apache License":LICENSE.html (same as Hadoop)

* "More info":moreinfo.html
** "Why is it called Wukong?":moreinfo.html#name
** "Don't Use Wukong, use this instead":moreinfo.html#whateverdude
** "Further Reading and useful links":moreinfo.html#links
** "Note on Patches/Pull Requests":moreinfo.html#patches
** "What's up with Wukong::AndPig?":moreinfo.html#andpig
** "Map/Reduce Algorithms":moreinfo.html#algorithms
** "TODOs":moreinfo.html#TODO

* Work in progress: an intro to data processing with wukong:
** "Part 1, Get Ready":UsingWukong-part1-getready.html
** "Part 2, Thinking Big Data":UsingWukong-part2-ThinkingBigData.html
** "Part 3, Parsing":UsingWukong-part3-parsing.html

<notextile></div></notextile>

{% include intro.textile %}

<notextile><div class="toggle"></notextile>

h2. Credits

Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org

Patches submitted by:
* gemified by Ben Woosley (ben.woosley@gmail.com)
* ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui@masuidrive.jp - http://blog.masuidrive.jp/

Thanks to:
* "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
* "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.

<notextile></div><div class="toggle"></notextile>

h2. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

<notextile></div></notextile>

{% include news.html %}