wukong 3.0.0.pre → 3.0.0.pre2

Sign up to get free protection for your applications and to get access to all the features.
Files changed (476) hide show
  1. data/.gitignore +46 -33
  2. data/.gitmodules +3 -0
  3. data/.rspec +1 -1
  4. data/.travis.yml +8 -1
  5. data/.yardopts +0 -13
  6. data/Guardfile +4 -6
  7. data/{LICENSE.textile → LICENSE.md} +43 -55
  8. data/README-old.md +422 -0
  9. data/README.md +279 -418
  10. data/Rakefile +21 -5
  11. data/TODO.md +6 -6
  12. data/bin/wu-clean-encoding +31 -0
  13. data/bin/wu-lign +2 -2
  14. data/bin/wu-local +69 -0
  15. data/bin/wu-server +70 -0
  16. data/examples/Gemfile +38 -0
  17. data/examples/README.md +9 -0
  18. data/examples/dataflow/apache_log_line.rb +64 -25
  19. data/examples/dataflow/fibonacci_series.rb +101 -0
  20. data/examples/dataflow/parse_apache_logs.rb +37 -7
  21. data/examples/{dataflow.rb → dataflow/scraper_macro_flow.rb} +0 -0
  22. data/examples/dataflow/simple.rb +4 -4
  23. data/examples/geo.rb +4 -0
  24. data/examples/geo/geo_grids.numbers +0 -0
  25. data/examples/geo/geolocated.rb +331 -0
  26. data/examples/geo/quadtile.rb +69 -0
  27. data/examples/geo/spec/geolocated_spec.rb +247 -0
  28. data/examples/geo/tile_fetcher.rb +77 -0
  29. data/examples/graph/minimum_spanning_tree.rb +61 -61
  30. data/examples/jabberwocky.txt +36 -0
  31. data/examples/models/wikipedia.rb +20 -0
  32. data/examples/munging/Gemfile +8 -0
  33. data/examples/munging/airline_flights/airline.rb +57 -0
  34. data/examples/munging/airline_flights/airline_flights.rake +83 -0
  35. data/{lib/wukong/settings.rb → examples/munging/airline_flights/airplane.rb} +0 -0
  36. data/examples/munging/airline_flights/airport.rb +211 -0
  37. data/examples/munging/airline_flights/airport_id_unification.rb +129 -0
  38. data/examples/munging/airline_flights/airport_ok_chars.rb +4 -0
  39. data/examples/munging/airline_flights/flight.rb +156 -0
  40. data/examples/munging/airline_flights/models.rb +4 -0
  41. data/examples/munging/airline_flights/parse.rb +26 -0
  42. data/examples/munging/airline_flights/reconcile_airports.rb +142 -0
  43. data/examples/munging/airline_flights/route.rb +35 -0
  44. data/examples/munging/airline_flights/tasks.rake +83 -0
  45. data/examples/munging/airline_flights/timezone_fixup.rb +62 -0
  46. data/examples/munging/airline_flights/topcities.rb +167 -0
  47. data/examples/munging/airports/40_wbans.txt +40 -0
  48. data/examples/munging/airports/filter_weather_reports.rb +37 -0
  49. data/examples/munging/airports/join.pig +31 -0
  50. data/examples/munging/airports/to_tsv.rb +33 -0
  51. data/examples/munging/airports/usa_wbans.pig +19 -0
  52. data/examples/munging/airports/usa_wbans.txt +2157 -0
  53. data/examples/munging/airports/wbans.pig +19 -0
  54. data/examples/munging/airports/wbans.txt +2310 -0
  55. data/examples/munging/geo/geo_json.rb +54 -0
  56. data/examples/munging/geo/geo_models.rb +69 -0
  57. data/examples/munging/geo/geonames_models.rb +78 -0
  58. data/examples/munging/geo/iso_codes.rb +172 -0
  59. data/examples/munging/geo/reconcile_countries.rb +124 -0
  60. data/examples/munging/geo/tasks.rake +71 -0
  61. data/examples/munging/rake_helper.rb +62 -0
  62. data/examples/munging/weather/.gitignore +1 -0
  63. data/examples/munging/weather/Gemfile +4 -0
  64. data/examples/munging/weather/Rakefile +28 -0
  65. data/examples/munging/weather/extract_ish.rb +13 -0
  66. data/examples/munging/weather/models/weather.rb +119 -0
  67. data/examples/munging/weather/utils/noaa_downloader.rb +46 -0
  68. data/examples/munging/wikipedia/README.md +34 -0
  69. data/examples/munging/wikipedia/Rakefile +193 -0
  70. data/examples/munging/wikipedia/articles/extract_articles-parsed.rb +79 -0
  71. data/examples/munging/wikipedia/articles/extract_articles-templated.rb +136 -0
  72. data/examples/munging/wikipedia/articles/textualize_articles.rb +54 -0
  73. data/examples/munging/wikipedia/articles/verify_structure.rb +43 -0
  74. data/examples/munging/wikipedia/articles/wp2txt-LICENSE.txt +22 -0
  75. data/examples/munging/wikipedia/articles/wp2txt_article.rb +259 -0
  76. data/examples/munging/wikipedia/articles/wp2txt_utils.rb +452 -0
  77. data/examples/munging/wikipedia/dbpedia/dbpedia_common.rb +4 -0
  78. data/examples/munging/wikipedia/dbpedia/dbpedia_extract_geocoordinates.rb +78 -0
  79. data/examples/munging/wikipedia/dbpedia/extract_links.rb +193 -0
  80. data/examples/munging/wikipedia/dbpedia/sameas_extractor.rb +20 -0
  81. data/examples/munging/wikipedia/n1_subuniverse/n1_nodes.pig +18 -0
  82. data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb +21 -0
  83. data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb.old +27 -0
  84. data/examples/munging/wikipedia/pagelinks/augment_pagelinks.pig +29 -0
  85. data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb +14 -0
  86. data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb.old +25 -0
  87. data/examples/munging/wikipedia/pagelinks/undirect_pagelinks.pig +29 -0
  88. data/examples/munging/wikipedia/pageviews/augment_pageviews.pig +32 -0
  89. data/examples/munging/wikipedia/pageviews/extract_pageviews.rb +85 -0
  90. data/examples/munging/wikipedia/pig_style_guide.md +25 -0
  91. data/examples/munging/wikipedia/redirects/redirects_page_metadata.pig +19 -0
  92. data/examples/munging/wikipedia/subuniverse/sub_articles.pig +23 -0
  93. data/examples/munging/wikipedia/subuniverse/sub_page_metadata.pig +24 -0
  94. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_from.pig +22 -0
  95. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_into.pig +22 -0
  96. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_within.pig +26 -0
  97. data/examples/munging/wikipedia/subuniverse/sub_pageviews.pig +29 -0
  98. data/examples/munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig +24 -0
  99. data/examples/munging/wikipedia/utils/get_namespaces.rb +86 -0
  100. data/examples/munging/wikipedia/utils/munging_utils.rb +68 -0
  101. data/examples/munging/wikipedia/utils/namespaces.json +1 -0
  102. data/examples/rake_helper.rb +85 -0
  103. data/examples/server_logs/geo_ip_mapping/munge_geolite.rb +82 -0
  104. data/examples/server_logs/logline.rb +95 -0
  105. data/examples/server_logs/models.rb +66 -0
  106. data/examples/server_logs/page_counts.pig +48 -0
  107. data/examples/server_logs/server_logs-01-parse-script.rb +13 -0
  108. data/examples/server_logs/server_logs-02-histograms-full.rb +33 -0
  109. data/examples/server_logs/server_logs-02-histograms-mapper.rb +14 -0
  110. data/{old/examples/server_logs/breadcrumbs.rb → examples/server_logs/server_logs-03-breadcrumbs-full.rb} +26 -30
  111. data/examples/server_logs/server_logs-04-page_page_edges-full.rb +40 -0
  112. data/examples/string_reverser.rb +26 -0
  113. data/examples/text/pig_latin.rb +2 -2
  114. data/examples/text/regional_flavor/README.md +14 -0
  115. data/examples/text/regional_flavor/article_wordbags.pig +39 -0
  116. data/examples/text/regional_flavor/j01-article_wordbags.rb +4 -0
  117. data/examples/text/regional_flavor/simple_pig_script.pig +27 -0
  118. data/examples/word_count/accumulator.rb +26 -0
  119. data/examples/word_count/tokenizer.rb +13 -0
  120. data/examples/word_count/word_count.rb +6 -0
  121. data/examples/workflow/cherry_pie.dot +97 -0
  122. data/examples/workflow/cherry_pie.png +0 -0
  123. data/examples/workflow/cherry_pie.rb +61 -26
  124. data/lib/hanuman.rb +34 -7
  125. data/lib/hanuman/graph.rb +55 -31
  126. data/lib/hanuman/graphvizzer.rb +199 -178
  127. data/lib/hanuman/graphvizzer/gv_models.rb +161 -0
  128. data/lib/hanuman/graphvizzer/gv_presenter.rb +97 -0
  129. data/lib/hanuman/link.rb +35 -0
  130. data/lib/hanuman/registry.rb +46 -0
  131. data/lib/hanuman/stage.rb +76 -32
  132. data/lib/wukong.rb +23 -24
  133. data/lib/wukong/boot.rb +87 -0
  134. data/lib/wukong/configuration.rb +8 -0
  135. data/lib/wukong/dataflow.rb +45 -78
  136. data/lib/wukong/driver.rb +99 -0
  137. data/lib/wukong/emitter.rb +22 -0
  138. data/lib/wukong/model/faker.rb +24 -24
  139. data/lib/wukong/model/flatpack_parser/flat.rb +60 -0
  140. data/lib/wukong/model/flatpack_parser/flatpack.rb +4 -0
  141. data/lib/wukong/model/flatpack_parser/lang.rb +46 -0
  142. data/lib/wukong/model/flatpack_parser/parser.rb +55 -0
  143. data/lib/wukong/model/flatpack_parser/tokens.rb +130 -0
  144. data/lib/wukong/processor.rb +60 -114
  145. data/lib/wukong/spec_helpers.rb +81 -0
  146. data/lib/wukong/spec_helpers/integration_driver.rb +144 -0
  147. data/lib/wukong/spec_helpers/integration_driver_matchers.rb +219 -0
  148. data/lib/wukong/spec_helpers/processor_helpers.rb +95 -0
  149. data/lib/wukong/spec_helpers/processor_methods.rb +108 -0
  150. data/lib/wukong/spec_helpers/shared_examples.rb +15 -0
  151. data/lib/wukong/spec_helpers/spec_driver.rb +28 -0
  152. data/lib/wukong/spec_helpers/spec_driver_matchers.rb +195 -0
  153. data/lib/wukong/version.rb +2 -1
  154. data/lib/wukong/widget/filters.rb +311 -0
  155. data/lib/wukong/widget/processors.rb +156 -0
  156. data/lib/wukong/widget/reducers.rb +7 -0
  157. data/lib/wukong/widget/reducers/accumulator.rb +73 -0
  158. data/lib/wukong/widget/reducers/bin.rb +318 -0
  159. data/lib/wukong/widget/reducers/count.rb +61 -0
  160. data/lib/wukong/widget/reducers/group.rb +85 -0
  161. data/lib/wukong/widget/reducers/group_concat.rb +70 -0
  162. data/lib/wukong/widget/reducers/moments.rb +72 -0
  163. data/lib/wukong/widget/reducers/sort.rb +130 -0
  164. data/lib/wukong/widget/serializers.rb +287 -0
  165. data/lib/wukong/widget/sink.rb +10 -52
  166. data/lib/wukong/widget/source.rb +7 -113
  167. data/lib/wukong/widget/utils.rb +46 -0
  168. data/lib/wukong/widgets.rb +6 -0
  169. data/spec/examples/dataflow/fibonacci_series_spec.rb +18 -0
  170. data/spec/examples/dataflow/parsing_spec.rb +12 -11
  171. data/spec/examples/dataflow/simple_spec.rb +32 -6
  172. data/spec/examples/dataflow/telegram_spec.rb +36 -36
  173. data/spec/examples/graph/minimum_spanning_tree_spec.rb +30 -31
  174. data/spec/examples/munging/airline_flights/identifiers_spec.rb +16 -0
  175. data/spec/examples/munging/airline_flights_spec.rb +202 -0
  176. data/spec/examples/text/pig_latin_spec.rb +13 -16
  177. data/spec/examples/workflow/cherry_pie_spec.rb +34 -4
  178. data/spec/hanuman/graph_spec.rb +27 -2
  179. data/spec/hanuman/hanuman_spec.rb +10 -0
  180. data/spec/hanuman/registry_spec.rb +123 -0
  181. data/spec/hanuman/stage_spec.rb +61 -7
  182. data/spec/spec_helper.rb +29 -19
  183. data/spec/support/hanuman_test_helpers.rb +14 -12
  184. data/spec/support/shared_context_for_reducers.rb +37 -0
  185. data/spec/support/shared_examples_for_builders.rb +101 -0
  186. data/spec/support/shared_examples_for_shortcuts.rb +57 -0
  187. data/spec/support/wukong_test_helpers.rb +37 -11
  188. data/spec/wukong/dataflow_spec.rb +77 -55
  189. data/spec/wukong/local_runner_spec.rb +24 -24
  190. data/spec/wukong/model/faker_spec.rb +132 -131
  191. data/spec/wukong/runner_spec.rb +8 -8
  192. data/spec/wukong/widget/filters_spec.rb +61 -0
  193. data/spec/wukong/widget/processors_spec.rb +126 -0
  194. data/spec/wukong/widget/reducers/bin_spec.rb +92 -0
  195. data/spec/wukong/widget/reducers/count_spec.rb +11 -0
  196. data/spec/wukong/widget/reducers/group_spec.rb +20 -0
  197. data/spec/wukong/widget/reducers/moments_spec.rb +36 -0
  198. data/spec/wukong/widget/reducers/sort_spec.rb +26 -0
  199. data/spec/wukong/widget/serializers_spec.rb +92 -0
  200. data/spec/wukong/widget/sink_spec.rb +15 -15
  201. data/spec/wukong/widget/source_spec.rb +65 -41
  202. data/spec/wukong/wukong_spec.rb +10 -0
  203. data/wukong.gemspec +17 -10
  204. metadata +359 -335
  205. data/.document +0 -5
  206. data/VERSION +0 -1
  207. data/bin/hdp-bin +0 -44
  208. data/bin/hdp-bzip +0 -23
  209. data/bin/hdp-cat +0 -3
  210. data/bin/hdp-catd +0 -3
  211. data/bin/hdp-cp +0 -3
  212. data/bin/hdp-du +0 -86
  213. data/bin/hdp-get +0 -3
  214. data/bin/hdp-kill +0 -3
  215. data/bin/hdp-kill-task +0 -3
  216. data/bin/hdp-ls +0 -11
  217. data/bin/hdp-mkdir +0 -2
  218. data/bin/hdp-mkdirp +0 -12
  219. data/bin/hdp-mv +0 -3
  220. data/bin/hdp-parts_to_keys.rb +0 -77
  221. data/bin/hdp-ps +0 -3
  222. data/bin/hdp-put +0 -3
  223. data/bin/hdp-rm +0 -32
  224. data/bin/hdp-sort +0 -40
  225. data/bin/hdp-stream +0 -40
  226. data/bin/hdp-stream-flat +0 -22
  227. data/bin/hdp-stream2 +0 -39
  228. data/bin/hdp-sync +0 -17
  229. data/bin/hdp-wc +0 -67
  230. data/bin/wu-flow +0 -10
  231. data/bin/wu-map +0 -17
  232. data/bin/wu-red +0 -17
  233. data/bin/wukong +0 -17
  234. data/data/CREDITS.md +0 -355
  235. data/data/graph/airfares.tsv +0 -2174
  236. data/data/text/gift_of_the_magi.txt +0 -225
  237. data/data/text/jabberwocky.txt +0 -36
  238. data/data/text/rectification_of_names.txt +0 -33
  239. data/data/twitter/a_atsigns_b.tsv +0 -64
  240. data/data/twitter/a_follows_b.tsv +0 -53
  241. data/data/twitter/tweet.tsv +0 -167
  242. data/data/twitter/twitter_user.tsv +0 -55
  243. data/data/wikipedia/dbpedia-sentences.tsv +0 -1000
  244. data/docpages/INSTALL.textile +0 -92
  245. data/docpages/LICENSE.textile +0 -107
  246. data/docpages/README-elastic_map_reduce.textile +0 -377
  247. data/docpages/README-performance.textile +0 -90
  248. data/docpages/README-wulign.textile +0 -65
  249. data/docpages/UsingWukong-part1-get_ready.textile +0 -17
  250. data/docpages/UsingWukong-part2-ThinkingBigData.textile +0 -75
  251. data/docpages/UsingWukong-part3-parsing.textile +0 -138
  252. data/docpages/_config.yml +0 -39
  253. data/docpages/avro/avro_notes.textile +0 -56
  254. data/docpages/avro/performance.textile +0 -36
  255. data/docpages/avro/tethering.textile +0 -19
  256. data/docpages/bigdata-tips.textile +0 -143
  257. data/docpages/code/api_response_example.txt +0 -20
  258. data/docpages/code/parser_skeleton.rb +0 -38
  259. data/docpages/diagrams/MapReduceDiagram.graffle +0 -0
  260. data/docpages/favicon.ico +0 -0
  261. data/docpages/gem.css +0 -16
  262. data/docpages/hadoop-tips.textile +0 -83
  263. data/docpages/index.textile +0 -92
  264. data/docpages/intro.textile +0 -8
  265. data/docpages/moreinfo.textile +0 -174
  266. data/docpages/news.html +0 -24
  267. data/docpages/pig/PigLatinExpressionsList.txt +0 -122
  268. data/docpages/pig/PigLatinReferenceManual.txt +0 -1640
  269. data/docpages/pig/commandline_params.txt +0 -26
  270. data/docpages/pig/cookbook.html +0 -481
  271. data/docpages/pig/images/hadoop-logo.jpg +0 -0
  272. data/docpages/pig/images/instruction_arrow.png +0 -0
  273. data/docpages/pig/images/pig-logo.gif +0 -0
  274. data/docpages/pig/piglatin_ref1.html +0 -1103
  275. data/docpages/pig/piglatin_ref2.html +0 -14340
  276. data/docpages/pig/setup.html +0 -505
  277. data/docpages/pig/skin/basic.css +0 -166
  278. data/docpages/pig/skin/breadcrumbs.js +0 -237
  279. data/docpages/pig/skin/fontsize.js +0 -166
  280. data/docpages/pig/skin/getBlank.js +0 -40
  281. data/docpages/pig/skin/getMenu.js +0 -45
  282. data/docpages/pig/skin/images/chapter.gif +0 -0
  283. data/docpages/pig/skin/images/chapter_open.gif +0 -0
  284. data/docpages/pig/skin/images/current.gif +0 -0
  285. data/docpages/pig/skin/images/external-link.gif +0 -0
  286. data/docpages/pig/skin/images/header_white_line.gif +0 -0
  287. data/docpages/pig/skin/images/page.gif +0 -0
  288. data/docpages/pig/skin/images/pdfdoc.gif +0 -0
  289. data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
  290. data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
  291. data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  292. data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
  293. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
  294. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  295. data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
  296. data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
  297. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  298. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  299. data/docpages/pig/skin/print.css +0 -54
  300. data/docpages/pig/skin/profile.css +0 -181
  301. data/docpages/pig/skin/screen.css +0 -587
  302. data/docpages/pig/tutorial.html +0 -1059
  303. data/docpages/pig/udf.html +0 -1509
  304. data/docpages/tutorial.textile +0 -283
  305. data/docpages/usage.textile +0 -195
  306. data/docpages/wutils.textile +0 -263
  307. data/examples/dataflow/complex.rb +0 -11
  308. data/examples/dataflow/donuts.rb +0 -13
  309. data/examples/tiny_count/jabberwocky_output.tsv +0 -92
  310. data/examples/word_count.rb +0 -48
  311. data/examples/workflow/fiddle.rb +0 -24
  312. data/lib/away/escapement.rb +0 -129
  313. data/lib/away/exe.rb +0 -11
  314. data/lib/away/experimental.rb +0 -5
  315. data/lib/away/from_file.rb +0 -52
  316. data/lib/away/job.rb +0 -56
  317. data/lib/away/job/rake_compat.rb +0 -17
  318. data/lib/away/registry.rb +0 -79
  319. data/lib/away/runner.rb +0 -276
  320. data/lib/away/runner/execute.rb +0 -121
  321. data/lib/away/script.rb +0 -161
  322. data/lib/away/script/hadoop_command.rb +0 -240
  323. data/lib/away/source/file_list_source.rb +0 -15
  324. data/lib/away/source/looper.rb +0 -18
  325. data/lib/away/task.rb +0 -219
  326. data/lib/hanuman/action.rb +0 -21
  327. data/lib/hanuman/chain.rb +0 -4
  328. data/lib/hanuman/graphviz.rb +0 -74
  329. data/lib/hanuman/resource.rb +0 -6
  330. data/lib/hanuman/slot.rb +0 -87
  331. data/lib/hanuman/slottable.rb +0 -220
  332. data/lib/wukong/bad_record.rb +0 -15
  333. data/lib/wukong/event.rb +0 -44
  334. data/lib/wukong/local_runner.rb +0 -55
  335. data/lib/wukong/mapred.rb +0 -3
  336. data/lib/wukong/universe.rb +0 -48
  337. data/lib/wukong/widget/filter.rb +0 -81
  338. data/lib/wukong/widget/gibberish.rb +0 -123
  339. data/lib/wukong/widget/monitor.rb +0 -26
  340. data/lib/wukong/widget/reducer.rb +0 -66
  341. data/lib/wukong/widget/stringifier.rb +0 -50
  342. data/lib/wukong/workflow.rb +0 -22
  343. data/lib/wukong/workflow/command.rb +0 -42
  344. data/old/config/emr-example.yaml +0 -48
  345. data/old/examples/README.txt +0 -17
  346. data/old/examples/contrib/jeans/README.markdown +0 -165
  347. data/old/examples/contrib/jeans/data/normalized_sizes +0 -3
  348. data/old/examples/contrib/jeans/data/orders.tsv +0 -1302
  349. data/old/examples/contrib/jeans/data/sizes +0 -3
  350. data/old/examples/contrib/jeans/normalize.rb +0 -20
  351. data/old/examples/contrib/jeans/sizes.rb +0 -55
  352. data/old/examples/corpus/bnc_word_freq.rb +0 -44
  353. data/old/examples/corpus/bucket_counter.rb +0 -47
  354. data/old/examples/corpus/dbpedia_abstract_to_sentences.rb +0 -86
  355. data/old/examples/corpus/sentence_bigrams.rb +0 -53
  356. data/old/examples/corpus/sentence_coocurrence.rb +0 -66
  357. data/old/examples/corpus/stopwords.rb +0 -138
  358. data/old/examples/corpus/words_to_bigrams.rb +0 -53
  359. data/old/examples/emr/README.textile +0 -110
  360. data/old/examples/emr/dot_wukong_dir/credentials.json +0 -7
  361. data/old/examples/emr/dot_wukong_dir/emr.yaml +0 -69
  362. data/old/examples/emr/dot_wukong_dir/emr_bootstrap.sh +0 -33
  363. data/old/examples/emr/elastic_mapreduce_example.rb +0 -28
  364. data/old/examples/network_graph/adjacency_list.rb +0 -74
  365. data/old/examples/network_graph/breadth_first_search.rb +0 -72
  366. data/old/examples/network_graph/gen_2paths.rb +0 -68
  367. data/old/examples/network_graph/gen_multi_edge.rb +0 -112
  368. data/old/examples/network_graph/gen_symmetric_links.rb +0 -64
  369. data/old/examples/pagerank/README.textile +0 -6
  370. data/old/examples/pagerank/gen_initial_pagerank_graph.pig +0 -57
  371. data/old/examples/pagerank/pagerank.rb +0 -72
  372. data/old/examples/pagerank/pagerank_initialize.rb +0 -42
  373. data/old/examples/pagerank/run_pagerank.sh +0 -21
  374. data/old/examples/sample_records.rb +0 -33
  375. data/old/examples/server_logs/apache_log_parser.rb +0 -15
  376. data/old/examples/server_logs/nook.rb +0 -48
  377. data/old/examples/server_logs/nook/faraday_dummy_adapter.rb +0 -94
  378. data/old/examples/server_logs/user_agent.rb +0 -40
  379. data/old/examples/simple_word_count.rb +0 -82
  380. data/old/examples/size.rb +0 -61
  381. data/old/examples/stats/avg_value_frequency.rb +0 -86
  382. data/old/examples/stats/binning_percentile_estimator.rb +0 -140
  383. data/old/examples/stats/data/avg_value_frequency.tsv +0 -3
  384. data/old/examples/stats/rank_and_bin.rb +0 -173
  385. data/old/examples/stupidly_simple_filter.rb +0 -40
  386. data/old/examples/word_count.rb +0 -75
  387. data/old/graph/graphviz_builder.rb +0 -580
  388. data/old/graph_easy/Attributes.pm +0 -4181
  389. data/old/graph_easy/Graphviz.pm +0 -2232
  390. data/old/wukong.rb +0 -18
  391. data/old/wukong/and_pig.rb +0 -38
  392. data/old/wukong/bad_record.rb +0 -18
  393. data/old/wukong/datatypes.rb +0 -24
  394. data/old/wukong/datatypes/enum.rb +0 -127
  395. data/old/wukong/datatypes/fake_types.rb +0 -17
  396. data/old/wukong/decorator.rb +0 -28
  397. data/old/wukong/encoding/asciize.rb +0 -108
  398. data/old/wukong/extensions.rb +0 -16
  399. data/old/wukong/extensions/array.rb +0 -18
  400. data/old/wukong/extensions/blank.rb +0 -93
  401. data/old/wukong/extensions/class.rb +0 -189
  402. data/old/wukong/extensions/date_time.rb +0 -53
  403. data/old/wukong/extensions/emittable.rb +0 -69
  404. data/old/wukong/extensions/enumerable.rb +0 -79
  405. data/old/wukong/extensions/hash.rb +0 -167
  406. data/old/wukong/extensions/hash_keys.rb +0 -16
  407. data/old/wukong/extensions/hash_like.rb +0 -150
  408. data/old/wukong/extensions/hashlike_class.rb +0 -47
  409. data/old/wukong/extensions/module.rb +0 -2
  410. data/old/wukong/extensions/pathname.rb +0 -27
  411. data/old/wukong/extensions/string.rb +0 -65
  412. data/old/wukong/extensions/struct.rb +0 -17
  413. data/old/wukong/extensions/symbol.rb +0 -11
  414. data/old/wukong/filename_pattern.rb +0 -74
  415. data/old/wukong/helper.rb +0 -7
  416. data/old/wukong/helper/stopwords.rb +0 -195
  417. data/old/wukong/helper/tokenize.rb +0 -35
  418. data/old/wukong/logger.rb +0 -38
  419. data/old/wukong/periodic_monitor.rb +0 -72
  420. data/old/wukong/schema.rb +0 -269
  421. data/old/wukong/script.rb +0 -286
  422. data/old/wukong/script/avro_command.rb +0 -5
  423. data/old/wukong/script/cassandra_loader_script.rb +0 -40
  424. data/old/wukong/script/emr_command.rb +0 -168
  425. data/old/wukong/script/hadoop_command.rb +0 -237
  426. data/old/wukong/script/local_command.rb +0 -41
  427. data/old/wukong/store.rb +0 -10
  428. data/old/wukong/store/base.rb +0 -27
  429. data/old/wukong/store/cassandra.rb +0 -10
  430. data/old/wukong/store/cassandra/streaming.rb +0 -75
  431. data/old/wukong/store/cassandra/struct_loader.rb +0 -21
  432. data/old/wukong/store/cassandra_model.rb +0 -91
  433. data/old/wukong/store/chh_chunked_flat_file_store.rb +0 -37
  434. data/old/wukong/store/chunked_flat_file_store.rb +0 -48
  435. data/old/wukong/store/conditional_store.rb +0 -57
  436. data/old/wukong/store/factory.rb +0 -8
  437. data/old/wukong/store/flat_file_store.rb +0 -89
  438. data/old/wukong/store/key_store.rb +0 -51
  439. data/old/wukong/store/null_store.rb +0 -15
  440. data/old/wukong/store/read_thru_store.rb +0 -22
  441. data/old/wukong/store/tokyo_tdb_key_store.rb +0 -33
  442. data/old/wukong/store/tyrant_rdb_key_store.rb +0 -57
  443. data/old/wukong/store/tyrant_tdb_key_store.rb +0 -20
  444. data/old/wukong/streamer.rb +0 -30
  445. data/old/wukong/streamer/accumulating_reducer.rb +0 -83
  446. data/old/wukong/streamer/base.rb +0 -126
  447. data/old/wukong/streamer/counting_reducer.rb +0 -25
  448. data/old/wukong/streamer/filter.rb +0 -20
  449. data/old/wukong/streamer/instance_streamer.rb +0 -15
  450. data/old/wukong/streamer/json_streamer.rb +0 -21
  451. data/old/wukong/streamer/line_streamer.rb +0 -12
  452. data/old/wukong/streamer/list_reducer.rb +0 -31
  453. data/old/wukong/streamer/rank_and_bin_reducer.rb +0 -145
  454. data/old/wukong/streamer/record_streamer.rb +0 -14
  455. data/old/wukong/streamer/reducer.rb +0 -11
  456. data/old/wukong/streamer/set_reducer.rb +0 -14
  457. data/old/wukong/streamer/struct_streamer.rb +0 -48
  458. data/old/wukong/streamer/summing_reducer.rb +0 -29
  459. data/old/wukong/streamer/uniq_by_last_reducer.rb +0 -51
  460. data/old/wukong/typed_struct.rb +0 -12
  461. data/spec/away/encoding_spec.rb +0 -32
  462. data/spec/away/exe_spec.rb +0 -20
  463. data/spec/away/flow_spec.rb +0 -82
  464. data/spec/away/graph_spec.rb +0 -6
  465. data/spec/away/job_spec.rb +0 -15
  466. data/spec/away/rake_compat_spec.rb +0 -9
  467. data/spec/away/script_spec.rb +0 -81
  468. data/spec/hanuman/graphviz_spec.rb +0 -29
  469. data/spec/hanuman/slot_spec.rb +0 -2
  470. data/spec/support/examples_helper.rb +0 -10
  471. data/spec/support/streamer_test_helpers.rb +0 -6
  472. data/spec/support/wukong_widget_helpers.rb +0 -66
  473. data/spec/wukong/processor_spec.rb +0 -109
  474. data/spec/wukong/widget/filter_spec.rb +0 -99
  475. data/spec/wukong/widget/stringifier_spec.rb +0 -51
  476. data/spec/wukong/workflow/command_spec.rb +0 -5
data/.gitignore CHANGED
@@ -1,46 +1,59 @@
1
+ ## OS
2
+ .DS_Store
3
+ Icon
4
+ nohup.out
5
+ .bak
1
6
 
7
+ *.pem
2
8
 
3
-
4
-
5
-
9
+ ## EDITORS
10
+ \#*
11
+ .\#*
12
+ \#*\#
13
+ *~
14
+ *.swp
15
+ REVISION
16
+ TAGS*
17
+ tmtags
18
+ *_flymake.*
19
+ *_flymake
20
+ *.tmproj
21
+ .project
22
+ .settings
6
23
 
7
24
  ## COMPILED
8
- ## EDITORS
9
- ## OS
10
- ## OTHER SCM
11
- ## PROJECT::GENERAL
12
- ## PROJECT::SPECIFIC
13
- *.log
25
+ a.out
14
26
  *.o
15
27
  *.pyc
16
- *.rdb
17
28
  *.so
18
- *.swp
19
- *.tmproj
20
- *_flymake
21
- *_flymake.*
22
- *private*
23
- *~
24
- .DS_Store
25
- .\#*
26
- .bak
29
+
30
+ ## OTHER SCM
27
31
  .bzr
28
32
  .hg
29
- .project
30
- .settings
31
33
  .svn
32
- .yardoc
33
- /.rbenv-version
34
- Gemfile.lock
35
- Icon?
36
- REVISION
37
- TAGS*
38
- \#*
39
- a.out
34
+
35
+ ## PROJECT::GENERAL
36
+
37
+ log/*
38
+ tmp/*
39
+ pkg/*
40
+
40
41
  coverage
42
+ rdoc
41
43
  doc
42
- nohup.out
43
44
  pkg
44
- rdoc
45
- tmp/*
46
- tmtags
45
+ .rake_test_cache
46
+ .bundle
47
+ .yardoc
48
+
49
+ .vendor
50
+
51
+ ## PROJECT::SPECIFIC
52
+
53
+ old/*
54
+ docpages
55
+ away
56
+
57
+ .rbx
58
+ Gemfile.lock
59
+ Backup*of*.numbers
@@ -1,3 +1,6 @@
1
1
  [submodule "notes"]
2
2
  path = notes
3
3
  url = git://github.com/infochimps-labs/wukong.wiki.git
4
+ [submodule "data"]
5
+ path = data
6
+ url = http://github.com/infochimps-labs/wukong-example-data.git
data/.rspec CHANGED
@@ -1,2 +1,2 @@
1
+ --format=progress
1
2
  --color
2
- --format documentation
@@ -1,12 +1,19 @@
1
1
  language: ruby
2
2
  rvm:
3
3
  - 1.9.3
4
+ - 1.9.2
5
+ - jruby-19mode # JRuby in 1.9 mode
6
+ - rbx-19mode
4
7
 
5
8
  before_install: "git clone -b version_1 git://github.com/infochimps-labs/gorillib.git ~/builds/infochimps-labs/gorillib"
6
9
 
10
+ bundler_args: --without docs support
11
+
7
12
  branches:
8
13
  only:
14
+ - master
9
15
  - wukong_ng
16
+ - vanilla_2
10
17
 
11
18
  notifications:
12
- email: false
19
+ email: false
data/.yardopts CHANGED
@@ -1,19 +1,6 @@
1
1
  --readme README.md
2
2
  --markup markdown
3
3
  -
4
- VERSION
5
4
  CHANGELOG.md
6
5
  LICENSE.md
7
6
  README.md
8
- notes/INSTALL.md
9
- notes/core_concepts.md
10
- notes/knife-cluster-commands.md
11
- notes/philosophy.md
12
- notes/silverware.md
13
- notes/style_guide.md
14
- notes/tips_and_troubleshooting.md
15
- notes/walkthrough-hadoop.md
16
- notes/homebase-layout.txt
17
-
18
- notes/*.md
19
- notes/*.txt
data/Guardfile CHANGED
@@ -1,14 +1,12 @@
1
1
  # -*- ruby -*-
2
2
 
3
- format = 'progress' # 'doc' for more verbose, 'progress' for less
4
- tags = %w[ ] # builder_spec model_spec
3
+ format = 'doc' # 'doc' for more verbose, 'progress' for less
4
+ tags = %w[ only ] # builder_spec model_spec
5
5
 
6
6
  guard 'rspec', :version => 2, :cli => "--format #{format} #{ tags.map{|tag| "--tag #{tag}"}.join(' ') }" do
7
7
  watch(%r{^spec/.+_spec\.rb$})
8
- watch(%r{^examples/(\w+)/(.+)\.rb$}) { |m| "spec/examples/#{m[1]}_spec.rb" }
9
- watch(%r{^examples/(\w+)\.rb$}) { |m| "spec/examples/#{m[1]}_spec.rb" }
10
- watch(%r{^lib/(.+)/(.+)\.rb$}) { |m| "spec/#{m[1]}/#{m[2]}_spec.rb" }
11
- watch(%r{^lib/(\w+)\.rb$}) { |m| "spec/#{m[1]}}_spec.rb" }
8
+ watch(%r{^examples/(.+)\.rb$}) { |m| "spec/examples/#{m[1]}_spec.rb" }
9
+ watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}}_spec.rb" }
12
10
  watch('spec/spec_helper.rb') { 'spec' }
13
11
  watch(/spec\/support\/(.+)\.rb/) { 'spec' }
14
12
  end
@@ -1,107 +1,95 @@
1
- ---
2
- layout: default
3
- title: Apache License
4
- ---
1
+ # License for Wukong
5
2
 
3
+ The wukong code is __Copyright (c) 2011, 2012 Infochimps, Inc__
6
4
 
7
- h1(gemheader). {{ site.gemname }} %(small):: license%
8
-
9
-
10
- The wukong code is __Copyright (c) 2009 Philip (flip) Kromer__
11
-
12
- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
5
+ This code is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
13
6
 
14
7
  http://www.apache.org/licenses/LICENSE-2.0
15
8
 
16
9
  Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an **AS IS** BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
17
10
 
18
- h1. Apache License
11
+ __________________________________________________________________________
12
+
13
+ # Apache License
14
+
19
15
 
20
16
  Apache License
21
17
  Version 2.0, January 2004
22
18
  http://www.apache.org/licenses/
23
19
 
24
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
25
-
26
- <notextile><div class="toggle"></notextile>
20
+ _TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION_
27
21
 
28
- h2. 1. Definitions.
22
+ ## 1. Definitions.
29
23
 
30
24
  * **License** shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
25
+
31
26
  * **Licensor** shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
32
- * **Legal Entity** shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, **control** means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
27
+
28
+ * **Legal Entity** shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
29
+
33
30
  * **You** (or **Your**) shall mean an individual or Legal Entity exercising permissions granted by this License.
31
+
34
32
  * **Source** form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
33
+
35
34
  * **Object** form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
35
+
36
36
  * **Work** shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
37
- * **Derivative Works** shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
38
- * **Contribution** shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, **submitted** means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
39
- * **Contributor** shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
40
37
 
41
- <notextile></div><div class="toggle"></notextile>
38
+ * **Derivative Works** shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
42
39
 
43
- h2. 2. Grant of Copyright License.
40
+ * **Contribution** shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
44
41
 
45
- Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
42
+ * **Contributor** shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
46
43
 
44
+ ## 2. Grant of Copyright License.
47
45
 
48
- <notextile></div><div class="toggle"></notextile>
46
+ Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
49
47
 
50
- h2. 3. Grant of Patent License.
48
+ ## 3. Grant of Patent License.
51
49
 
52
50
  Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
53
51
 
54
- <notextile></div><div class="toggle"></notextile>
55
-
56
- h2. 4. Redistribution.
52
+ ## 4. Redistribution.
57
53
 
58
54
  You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
59
55
 
60
- # You must give any other recipients of the Work or Derivative Works a copy of this License; and
61
- # You must cause any modified files to carry prominent notices stating that You changed the files; and
62
- # You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
63
- # If the Work includes a __NOTICE__ text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
56
+ - (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
57
+ - (b) You must cause any modified files to carry prominent notices stating that You changed the files; and
58
+ - (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
59
+ - (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
64
60
 
65
61
  You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
66
62
 
67
- <notextile></div><div class="toggle"></notextile>
68
-
69
- h2. 5. Submission of Contributions.
63
+ ## 5. Submission of Contributions.
70
64
 
71
65
  Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
72
66
 
73
- <notextile></div><div class="toggle"></notextile>
74
-
75
- h2. 6. Trademarks.
67
+ ## 6. Trademarks.
76
68
 
77
69
  This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
78
70
 
79
- <notextile></div><div class="toggle"></notextile>
71
+ ## 7. Disclaimer of Warranty.
80
72
 
81
- h2. 7. Disclaimer of Warranty.
73
+ Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
82
74
 
83
- Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an **AS IS** BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
84
-
85
- <notextile></div><div class="toggle"></notextile>
86
-
87
- h2. 8. Limitation of Liability.
75
+ ## 8. Limitation of Liability.
88
76
 
89
77
  In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
90
78
 
91
- <notextile></div><div class="toggle"></notextile>
92
-
93
- h2. 9. Accepting Warranty or Additional Liability.
79
+ ## 9. Accepting Warranty or Additional Liability.
94
80
 
95
81
  While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
96
82
 
97
- END OF TERMS AND CONDITIONS
98
-
99
- <notextile></div><div class="toggle"></notextile>
83
+ _END OF TERMS AND CONDITIONS_
100
84
 
101
- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
85
+ ## APPENDIX: How to apply the Apache License to your work.
102
86
 
103
- http://www.apache.org/licenses/LICENSE-2.0
104
-
105
- Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an **AS IS** BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
87
+ To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets `[]` replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
106
88
 
107
- <notextile></div></notextile>
89
+ > Copyright [yyyy] [name of copyright owner]
90
+ >
91
+ > Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
92
+ >
93
+ > http://www.apache.org/licenses/LICENSE-2.0
94
+ >
95
+ > Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
@@ -0,0 +1,422 @@
1
+ # Wukong [![Build Status](https://secure.travis-ci.org/infochimps-labs/wukong.png)](http://travis-ci.org/infochimps-labs/wukong)
2
+
3
+ Wukong is a toolkit for rapid, agile development of dataflows at any scale.
4
+
5
+ (note: the syntax below is mostly false)
6
+
7
+
8
+ <a name="design"></a>
9
+ ## Design Overview
10
+
11
+ The fundamental principle of Wukong/Hanuman is *powerful black boxes, beautiful glue*. In general, they don't do the thing -- they coordinate the boxes that do the thing, to let you implement rapidly, nimbly and readably. Hanuman elegantly describes high-level data flows; Wukong is a pragmatic collection of dataflow primitives. They both emphasize scalability, readability and rapid development over performance or universality.
12
+
13
+ Wukong/Hanuman are chiefly concerned with these specific types of graphs:
14
+
15
+ * **dataflow** -- chains of simple modules to handle continuous data processing -- coordinates Flume, Unix pipes, ZeroMQ, Esper, Storm.
16
+ * **workflows** -- episodic jobs sequences, joined by dependency links -- comparable to Rake, Azkaban or Oozie.
17
+ * **map/reduce** -- Hadoop's standard *disordered/partitioned stream > partition, sort & group > process groups* workflow. Comparable to MRJob and Dumbo.
18
+ * **queue workers** -- pub/sub asynchronously triggered jobs -- comparable Resque, RabbitMQ/AMQP, Amazon Simple Worker, Heroku workers.
19
+
20
+ In addition, wukong stages may be deployed into **http middlware**: lightweight distributed API handlers -- comparable to Rack, Goliath or Twisted.
21
+
22
+ When you're describing a Wukong/Hanuman flow, you're writing pure expressive ruby, not some hokey interpreted language or clumsy XML format. Thanks to JRuby, it can speak directly to Java-based components like Hadoop, Flume, Storm or Spark.
23
+
24
+ ## What's where
25
+
26
+ * Configliere -- Manage settings
27
+ - Layer - Project settings through a late-resolved stack of config objects.
28
+ * Gorillib
29
+ - Type, RecordType
30
+ - TypeConversion
31
+ - Model
32
+ - PathHelpers
33
+ * Wukong
34
+ - fs - Abstracts file hdfs s3n s3hdfs scp
35
+ - streamer - Black-box data transform
36
+ - job - Workflow definition
37
+ - flow - Dataflow definition
38
+ - widgets - Common data transforms
39
+ - RubyHadoop - Hadoop jobs using streamers
40
+ - RubyFlume - Flume decorators using streamers
41
+ * Hanuman -- Elegant small graph assembly
42
+ * Swineherd -- Common interface on ugly tools
43
+ - Turn readable hash into safe commandline (param conv, escaping)
44
+ - Execute command, capture stdin/stderr
45
+ - Summarize execution with a broham-able hash
46
+ - Common modules: Input/output, Java, gnu, configliere
47
+ - Template
48
+ - Hadoop, pig, flume
49
+ - ?? Cp, mv, rm, zip, tar, bz2, gz, ssh, scp
50
+ - ?? Remotely execute command
51
+
52
+ <a name="design-rules"></a>
53
+ ### Story
54
+
55
+ [Narrative Method Structure](http://avdi.org/talks/confident-code-rubymidwest-2011/confident-code.html)
56
+
57
+ * Gather input
58
+ * Perform work
59
+ * Deliver results
60
+ * Handle failure
61
+
62
+
63
+ <a name="design-rules"></a>
64
+ ### Design Rules
65
+
66
+ * **whiteboard rule**: the user-facing conceptual model should match the picture you would draw on the whiteboard in an engineering discussion. The fundamental goal is to abstract the necessary messiness surrounding the industrial-strength components it orchestrates while still providing their essential power.
67
+ * **common cases are simple, complex cases are always possible**: The code should be as simple as the story it tells. For the things you do all the time, you only need to describe how this data flow is different from all other data flows. However, at no point in the project lifecycle should Wukong/Hanuman hit a brick wall or peat bog requiring its total replacement. A complex production system may, for example, require that you replace a critical path with custom Java code -- but that's a small set of substitutions in an otherwise stable, scalable graph. In the world of web programming, Ruby on Rails passes this test; Sinatra and Drupal do not.
68
+ * **petabyte rule**: Wukong/Hanuman coordinate industrial-strength components that wort at terabyte- and petabyte-scale. Conceptual simplicity makes it an excellent tool even for small jobs, but scalability is key. All components must assume an asynchronous, unreliable and distributed system.
69
+ * **laptop rule**:
70
+ * **no dark magick**: the core libraries provide *elegant, predictable magic or no magic at all*. We use metaprogramming heavily, but always predictably, and only in service of making common cases simple.
71
+ - Soupy multi-option `case` statements are a smell.
72
+ - Complex tasks will require code that is more explicit, but readable and organically connected to the typical usage. For example, many data flows will require a custom `Wukong::Streamer` class; but that class is no more complex than the built-in streamer models and receives all the same sugar methods they do.
73
+ * **get shit done**: sometimes ugly tasks require ugly solutions. Shelling out to the hadoop process monitor and parsing its output is acceptable if it is robust and obviates the need for a native protocol handler.
74
+ * **be clever early, boring late**: magic in service of having a terse language for assembling a graph is great. However, the assembled graph should be stomic and largely free of any conditional logic or dependencies.
75
+ - for example, the data flow `split` statement allows you to set a condition on each branch. The assembled graph, however, is typically a `fanout` stage followed by `filter` stages.
76
+ - the graph language has some helpers to refer to graph stages. The compiled graph uses explicit mostly-readable but unambiguous static handles.
77
+ - some stages offer light polymorphism -- for example, `select` accepts either a regexp or block. This is handled at the factory level, and the resulting stage is free of conditional logic.
78
+ * **no lock-in**: needless to say, Wukong works seamlessly with the Infochimps platform, making robust, reliable massive-scale dataflows amazingly simple. However, wukong flows are not tied to the cloud: they project to Hadoop, Flume or any of the other open-source components that power our platform.
79
+
80
+ __________________________________________________________________________
81
+
82
+ <a name="stage"></a>
83
+ ## Stage
84
+
85
+ A graph is composed of `stage`s.
86
+
87
+ * *desc* (alias `description`)
88
+
89
+ #### Actions
90
+
91
+ each action
92
+
93
+ * the default action is `call`
94
+ * all stages respond to `nothing`, and like ze goggles, do `nothing`.
95
+
96
+ __________________________________________________________________________
97
+
98
+ <a name="dataflows"></a>
99
+ ## Workflows
100
+
101
+ Wukong workflows work somewhat differently than you may be familiar with Rake and such.
102
+
103
+ In wukong, a stage corresponds to a product; you can then act on that product.
104
+
105
+ Consider first compiling a c program:
106
+
107
+ to build the executable, run `cc -o cake eggs.o milk.o flour.o sugar.o -I./include -L./lib`
108
+ to build files like '{file}.o', run `cc -c -o {file}.o {file}.c -I./include`
109
+
110
+ In this case, you define the *steps*, implying the products.
111
+
112
+
113
+ Something rake can't do (but we should be able to): make it so I can define a dependency that runs **last**
114
+
115
+ ### Defining jobs
116
+
117
+ Wukong.job(:launch) do
118
+ task :aim do
119
+ #...
120
+ end
121
+ task :enter do
122
+ end
123
+ task :commit do
124
+ # ...
125
+ end
126
+ end
127
+
128
+ Wukong.job(:recall) do
129
+ task :smash_with_rock do
130
+ #...
131
+ end
132
+ task :reprogram do
133
+ # ...
134
+ end
135
+ end
136
+
137
+ * stages construct products
138
+ - these have default actions
139
+ * hanuman tracks defined order
140
+
141
+ * do steps run in order, or is dependency explicit?
142
+ * what about idempotency?
143
+
144
+ * `task` vs `action` vs `product`; `job`, `task`, `group`, `namespace`.
145
+
146
+ ### documenting
147
+
148
+ Inline option (`:desc` or `:description`?)
149
+
150
+ ```ruby
151
+ task :foo, :description => "pity the foo" do
152
+ # ...
153
+ end
154
+ ```
155
+
156
+ DSL method option
157
+
158
+ ```ruby
159
+ task :foo do
160
+ description "pity the foo"
161
+ # ...
162
+ end
163
+ ```
164
+
165
+ ### actions
166
+
167
+ default action:
168
+
169
+ ```ruby
170
+ script 'nukes/launch_codes.rb' do
171
+ # ...
172
+ end
173
+ ```
174
+
175
+ define the `:undo` action:
176
+
177
+ ```ruby
178
+ script 'nukes/launch_codes.rb', :undo do
179
+ # ...
180
+ end
181
+ ```
182
+
183
+ <a name="file-name-templates"></a>
184
+ ### File name templates
185
+
186
+ * *timestamp*: timestamp of run. everything in this invocation will have the same timestamp.
187
+ * *user*: username; `ENV['USER']` by default
188
+ * *sources*: basenames of job inputs, minus extension, non-`\w` replaced with '_', joined by '-', max 50 chars.
189
+ * *job*: job flow name
190
+
191
+ <a name="job-versioning-of-clobbered"></a>
192
+ ### versioning of clobbered files
193
+
194
+ * when files are generated or removed, relocate to a timestamped location
195
+ - a file `/path/to/file.txt` is relocated to `~/.wukong/backups/path/to/file.txt.wukong-20110102120011` where `20110102120011` is the [job timestamp](#file-naming)
196
+ - accepts a `max_size` param
197
+ - raises if it can't write to directory -- must explicitly say `--safe_file_ops=false`
198
+
199
+ <a name="job-running"></a>
200
+ ### running
201
+
202
+ * `clobber` -- run, but clear all dependencies
203
+ * `undo` --
204
+ * `clean` --
205
+
206
+
207
+ ### Utility and Filesystem tasks
208
+
209
+ The primitives correspond heavily with Rake and Chef. However, they extend them in many ways, don't cover all their functionality in many ways, and incompatible in several ways.
210
+
211
+ ### Configuration
212
+
213
+
214
+ #### Commandline args
215
+
216
+ * handled by configliere: `nukes launch --launch_code=GLG20`
217
+
218
+ * TODO: configliere needs context-specific config vars, so I only get information about the `launch` action in the `nukes` job when I run `nukes launch --help`
219
+
220
+
221
+
222
+ __________________________________________________________________________
223
+
224
+ <a name="dataflows"></a>
225
+ ## Dataflows
226
+
227
+
228
+ Data flows
229
+
230
+ * you can have a consumer connect to a provider, or vice versa
231
+ - producer binds to a port, consumers connect to it: pub/sub
232
+ - consumers open a port, producer connects to many: megaphone
233
+
234
+ * you can bring the provider on line first, and the consumers later, or vice versa.
235
+
236
+
237
+ <a name="dataflow-syntax"></a>
238
+ ## Syntax
239
+
240
+ **note: this is a scratch pad; actual syntax evolving rapidly and currently looks not much like the following**
241
+
242
+ read('/foo/bar') # source( FileSource.new('/foo/bar') )
243
+ writes('/foo/bar') # sink( FileSink.new('/foo/bar') )
244
+
245
+ ... | file('/foo/bar') # this we know is a source
246
+ file('/foo/bar') | ... # this we know is a sink
247
+ file('/foo/bar') # don't know; maybe we can guess later
248
+
249
+ Here is an example Wukong script, `count_followers.rb`:
250
+
251
+ from :json
252
+
253
+ mapper do |user|
254
+ year_month = Time.parse(user[:created_at]).strftime("%Y%M")
255
+ emit [ user[:followers_count], year_month ]
256
+ end
257
+
258
+ reducer do
259
+ start{ @count = 0 }
260
+
261
+ each do |followers_count, year_month|
262
+ @count += 1
263
+ end
264
+
265
+ finally{ emit [*@group_key, @count] }
266
+ end
267
+
268
+ You can run this from the commandline:
269
+
270
+ wukong count_followers.rb users.json followers_histogram.tsv
271
+
272
+ It will run in local mode, effectively doing
273
+
274
+ cat users.json | {the map block} | sort | {the reduce block} > followers_histogram.tsv
275
+
276
+ You can instead run it in Hadoop mode, and it will launch the job across a distributed Hadoop cluster
277
+
278
+ wukong --run=hadoop count_followers.rb users.json followers_histogram.tsv
279
+
280
+ <a name="formatters"></a>
281
+ #### Data Formats (Serialization / Deserialization)
282
+
283
+ * tsv/csv
284
+ * json
285
+ * xml
286
+ * avro
287
+ * apache_log
288
+ * flat
289
+ * regexp
290
+ * [Tagged Netstrings](http://tnetstrings.org/)
291
+ * [ZeroMQ Property Language](http://rfc.zeromq.org/spec:4)
292
+
293
+ * gz/bz2/zip/snappy
294
+
295
+ <a name="data-packets"></a>
296
+ #### Data Packets
297
+
298
+ Data consists of
299
+
300
+ - record
301
+ - schema
302
+ - metadata
303
+
304
+ ## Delivery Guarantees
305
+
306
+ Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker records that fact locally. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else it could go. Since the data structure used for storage in many messaging systems scale poorly, this is also a pragmatic choice--since the broker knows what is consumed it can immediately delete it, keeping the data size small.
307
+
308
+ What is perhaps not obvious, is that getting the broker and consumer to come into agreement about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out or whatever) then that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature which means that messages are only marked as sent not consumed when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is around performance, now the broker must keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.
309
+
310
+ So clearly there are multiple possible message delivery guarantees that could be provided:
311
+
312
+ * At most once—this handles the first case described. Messages are immediately marked as consumed, so they can't be given out twice, but many failure scenarios may lead to losing messages.
313
+ * At least once—this is the second case where we guarantee each message will be delivered at least once, but in failure cases may be delivered twice.
314
+ * Exactly once—this is what people actually want, each message is delivered once and only once.
315
+
316
+
317
+ __________________________________________________________________________
318
+
319
+ <a name="design-questions"></a>
320
+ ## Design Questions
321
+
322
+ * **filename helpers**:
323
+ - `':data_dir:/this/that/:script:-:user:-:timestamp:.:ext:'`?
324
+ - `path_to(:data_dir, 'this/that', "???")`?
325
+
326
+ * `class Wukong::Foo::Base` vs `class Wukong::Foo`
327
+ - the latter is more natural, and still allows
328
+ - I'd like
329
+
330
+
331
+
332
+ __________________________________________________________________________
333
+ __________________________________________________________________________
334
+ __________________________________________________________________________
335
+
336
+
337
+ <a name="references"></a>
338
+ ## References
339
+
340
+ <a name="refs-workflow"></a>
341
+ ### Workflow
342
+
343
+ * **Rake**
344
+
345
+ - [Rake Docs](http://rdoc.info/gems/rake/file/README.rdoc)
346
+ - [Rake Tutorial](http://jasonseifer.com/2010/04/06/rake-tutorial) by Jason Seifer -- 2010, with a good overview of why Rake is useful
347
+ - [Rake Tutorial](http://martinfowler.com/articles/rake.html) by Martin Fowler -- from 2005, so may lack some modernities
348
+ - [Rake Tutorial](http://onestepback.org/index.cgi/Tech/Rake/Tutorial/RakeTutorialRules.red) -- from 2005, so may lack some modernities
349
+
350
+ * **Rake Examples**
351
+
352
+ - [resque's redis.rake](https://github.com/defunkt/resque/blob/master/lib/tasks/redis.rake) and [resque/tasks](https://github.com/defunkt/resque/blob/master/lib/resque/tasks.rb)
353
+ - [rails' Rails Ties](https://github.com/rails/rails/tree/master/railties/lib/rails/tasks)
354
+
355
+ * **Thor**
356
+
357
+ - [Thor Wiki](https://github.com/wycats/thor/wiki)
358
+ -
359
+
360
+ * **Chef**
361
+
362
+ - [Chef Wiki](http://wiki.opscode.com/display/chef/Home)
363
+ - specifically, [Chef Resources](http://wiki.opscode.com/display/chef/Resources)
364
+
365
+ * **Other**
366
+
367
+ - [**Gradle**](http://gradle.org/) -- a modern take on `ant` + `maven`. The [Gradle overview](http://gradle.org/overview) states its case.
368
+
369
+ <a name="refs-dataflow"></a>
370
+ ### Dataflow
371
+
372
+ * **Esper**
373
+
374
+ - Must read: [StreamSQL Event Processing with Esper](http://www.igvita.com/2011/05/27/streamsql-event-processing-with-esper/)
375
+ - [Esper docs](http://esper.codehaus.org/esper-4.5.0/doc/reference/en/html_single/index.html#epl_clauses)
376
+ - [Esper EPL Reference](http://esper.codehaus.org/esper-4.5.0/doc/reference/en/html_single/index.html#epl_clauses)
377
+
378
+ * **Storm**
379
+
380
+ - [A Storm is coming: more details and plans for release](http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html)
381
+ - [Storm: distributed and fault-tolerant realtime computation](http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation) -- slideshare presentation
382
+ - [Storm: the Hadoop of Realtime Processing](http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce)
383
+
384
+ * **Kafka**: LinkedIn's high-throughput messaging queue
385
+
386
+ - [Kafka's Design: Why we built this](http://incubator.apache.org/kafka/design.html)
387
+
388
+ * **ZeroMQ**: tcp sockets like you think they should work
389
+
390
+ - [ZeroMQ: A Modern & Fast Networking Stack](http://www.igvita.com/2010/09/03/zeromq-modern-fast-networking-stack/)
391
+ - [ZeroMQ Guide](http://zguide.zeromq.org/page:all)
392
+ - [ZeroMQ: An Introduction](http://nichol.as/zeromq-an-introduction)
393
+ - [Routing with Ruby & ZeroMQ Devices](http://www.igvita.com/2010/11/17/routing-with-ruby-zeromq-devices/)
394
+ - [Ruby bindings for ZeroMQ](http://zeromq.github.com/rbzmq/) and the [Ruby-FFI bindings](http://www.zeromq.org/bindings:ruby-ffi)
395
+ - [Learn ruby ZeroMQ](https://github.com/andrewvc/learn-ruby-zeromq) by @andrewvc
396
+
397
+ * **Other**
398
+
399
+ - [Infopipes: An abstraction for multimedia streamin](http://web.cecs.pdx.edu/~black/publications/Mms062%203rd%20try.pdf) Black et al 2002
400
+ - [Yahoo Pipes](http://pipes.yahoo.com/pipes/)
401
+ - [Yahoo Pipes wikipedia page](http://en.wikipedia.org/wiki/Yahoo_Pipes)
402
+ - [Streambase](http://www.streambase.com/products/streambasecep/faqs/) -- Why is is so goddamn hard to find out anything real about a project once it gets an enterprise version? Seriously, the consistent fundamental brokenness of enterprise product is astonishing. It's like they take inspiration from shitty major-label band websites but layer a whiteout of [web jargon bullshit](http://www.dack.com/web/bullshit.html) in place of inessential flash animation. Anyway I think Streambase is kinda similar but who the hell can tell.
403
+ - [Scribe](http://www.cloudera.com/blog/2008/11/02/configuring-and-using-scribe-for-hadoop-log-collection/)
404
+ - [Splunk Case Study](http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/)
405
+
406
+ <a name="refs-dataflow"></a>
407
+ ### Messaging Queue
408
+
409
+ - [DripDrop](https://github.com/andrewvc/dripdrop) - a message passing library with a unified API abstracting HTTP, zeroMQ and websockets.
410
+
411
+
412
+ <a name="refs-dataflow"></a>
413
+ ### Data Processing
414
+
415
+ * **Hadoop**
416
+
417
+ - [Hadoop]()
418
+
419
+
420
+ * **Spark/Mesos**
421
+
422
+ - [Mesos](http://www.mesosproject.org/)