wukong 3.0.0.pre → 3.0.0.pre2

Files changed (476)
  1. data/.gitignore +46 -33
  2. data/.gitmodules +3 -0
  3. data/.rspec +1 -1
  4. data/.travis.yml +8 -1
  5. data/.yardopts +0 -13
  6. data/Guardfile +4 -6
  7. data/{LICENSE.textile → LICENSE.md} +43 -55
  8. data/README-old.md +422 -0
  9. data/README.md +279 -418
  10. data/Rakefile +21 -5
  11. data/TODO.md +6 -6
  12. data/bin/wu-clean-encoding +31 -0
  13. data/bin/wu-lign +2 -2
  14. data/bin/wu-local +69 -0
  15. data/bin/wu-server +70 -0
  16. data/examples/Gemfile +38 -0
  17. data/examples/README.md +9 -0
  18. data/examples/dataflow/apache_log_line.rb +64 -25
  19. data/examples/dataflow/fibonacci_series.rb +101 -0
  20. data/examples/dataflow/parse_apache_logs.rb +37 -7
  21. data/examples/{dataflow.rb → dataflow/scraper_macro_flow.rb} +0 -0
  22. data/examples/dataflow/simple.rb +4 -4
  23. data/examples/geo.rb +4 -0
  24. data/examples/geo/geo_grids.numbers +0 -0
  25. data/examples/geo/geolocated.rb +331 -0
  26. data/examples/geo/quadtile.rb +69 -0
  27. data/examples/geo/spec/geolocated_spec.rb +247 -0
  28. data/examples/geo/tile_fetcher.rb +77 -0
  29. data/examples/graph/minimum_spanning_tree.rb +61 -61
  30. data/examples/jabberwocky.txt +36 -0
  31. data/examples/models/wikipedia.rb +20 -0
  32. data/examples/munging/Gemfile +8 -0
  33. data/examples/munging/airline_flights/airline.rb +57 -0
  34. data/examples/munging/airline_flights/airline_flights.rake +83 -0
  35. data/{lib/wukong/settings.rb → examples/munging/airline_flights/airplane.rb} +0 -0
  36. data/examples/munging/airline_flights/airport.rb +211 -0
  37. data/examples/munging/airline_flights/airport_id_unification.rb +129 -0
  38. data/examples/munging/airline_flights/airport_ok_chars.rb +4 -0
  39. data/examples/munging/airline_flights/flight.rb +156 -0
  40. data/examples/munging/airline_flights/models.rb +4 -0
  41. data/examples/munging/airline_flights/parse.rb +26 -0
  42. data/examples/munging/airline_flights/reconcile_airports.rb +142 -0
  43. data/examples/munging/airline_flights/route.rb +35 -0
  44. data/examples/munging/airline_flights/tasks.rake +83 -0
  45. data/examples/munging/airline_flights/timezone_fixup.rb +62 -0
  46. data/examples/munging/airline_flights/topcities.rb +167 -0
  47. data/examples/munging/airports/40_wbans.txt +40 -0
  48. data/examples/munging/airports/filter_weather_reports.rb +37 -0
  49. data/examples/munging/airports/join.pig +31 -0
  50. data/examples/munging/airports/to_tsv.rb +33 -0
  51. data/examples/munging/airports/usa_wbans.pig +19 -0
  52. data/examples/munging/airports/usa_wbans.txt +2157 -0
  53. data/examples/munging/airports/wbans.pig +19 -0
  54. data/examples/munging/airports/wbans.txt +2310 -0
  55. data/examples/munging/geo/geo_json.rb +54 -0
  56. data/examples/munging/geo/geo_models.rb +69 -0
  57. data/examples/munging/geo/geonames_models.rb +78 -0
  58. data/examples/munging/geo/iso_codes.rb +172 -0
  59. data/examples/munging/geo/reconcile_countries.rb +124 -0
  60. data/examples/munging/geo/tasks.rake +71 -0
  61. data/examples/munging/rake_helper.rb +62 -0
  62. data/examples/munging/weather/.gitignore +1 -0
  63. data/examples/munging/weather/Gemfile +4 -0
  64. data/examples/munging/weather/Rakefile +28 -0
  65. data/examples/munging/weather/extract_ish.rb +13 -0
  66. data/examples/munging/weather/models/weather.rb +119 -0
  67. data/examples/munging/weather/utils/noaa_downloader.rb +46 -0
  68. data/examples/munging/wikipedia/README.md +34 -0
  69. data/examples/munging/wikipedia/Rakefile +193 -0
  70. data/examples/munging/wikipedia/articles/extract_articles-parsed.rb +79 -0
  71. data/examples/munging/wikipedia/articles/extract_articles-templated.rb +136 -0
  72. data/examples/munging/wikipedia/articles/textualize_articles.rb +54 -0
  73. data/examples/munging/wikipedia/articles/verify_structure.rb +43 -0
  74. data/examples/munging/wikipedia/articles/wp2txt-LICENSE.txt +22 -0
  75. data/examples/munging/wikipedia/articles/wp2txt_article.rb +259 -0
  76. data/examples/munging/wikipedia/articles/wp2txt_utils.rb +452 -0
  77. data/examples/munging/wikipedia/dbpedia/dbpedia_common.rb +4 -0
  78. data/examples/munging/wikipedia/dbpedia/dbpedia_extract_geocoordinates.rb +78 -0
  79. data/examples/munging/wikipedia/dbpedia/extract_links.rb +193 -0
  80. data/examples/munging/wikipedia/dbpedia/sameas_extractor.rb +20 -0
  81. data/examples/munging/wikipedia/n1_subuniverse/n1_nodes.pig +18 -0
  82. data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb +21 -0
  83. data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb.old +27 -0
  84. data/examples/munging/wikipedia/pagelinks/augment_pagelinks.pig +29 -0
  85. data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb +14 -0
  86. data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb.old +25 -0
  87. data/examples/munging/wikipedia/pagelinks/undirect_pagelinks.pig +29 -0
  88. data/examples/munging/wikipedia/pageviews/augment_pageviews.pig +32 -0
  89. data/examples/munging/wikipedia/pageviews/extract_pageviews.rb +85 -0
  90. data/examples/munging/wikipedia/pig_style_guide.md +25 -0
  91. data/examples/munging/wikipedia/redirects/redirects_page_metadata.pig +19 -0
  92. data/examples/munging/wikipedia/subuniverse/sub_articles.pig +23 -0
  93. data/examples/munging/wikipedia/subuniverse/sub_page_metadata.pig +24 -0
  94. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_from.pig +22 -0
  95. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_into.pig +22 -0
  96. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_within.pig +26 -0
  97. data/examples/munging/wikipedia/subuniverse/sub_pageviews.pig +29 -0
  98. data/examples/munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig +24 -0
  99. data/examples/munging/wikipedia/utils/get_namespaces.rb +86 -0
  100. data/examples/munging/wikipedia/utils/munging_utils.rb +68 -0
  101. data/examples/munging/wikipedia/utils/namespaces.json +1 -0
  102. data/examples/rake_helper.rb +85 -0
  103. data/examples/server_logs/geo_ip_mapping/munge_geolite.rb +82 -0
  104. data/examples/server_logs/logline.rb +95 -0
  105. data/examples/server_logs/models.rb +66 -0
  106. data/examples/server_logs/page_counts.pig +48 -0
  107. data/examples/server_logs/server_logs-01-parse-script.rb +13 -0
  108. data/examples/server_logs/server_logs-02-histograms-full.rb +33 -0
  109. data/examples/server_logs/server_logs-02-histograms-mapper.rb +14 -0
  110. data/{old/examples/server_logs/breadcrumbs.rb → examples/server_logs/server_logs-03-breadcrumbs-full.rb} +26 -30
  111. data/examples/server_logs/server_logs-04-page_page_edges-full.rb +40 -0
  112. data/examples/string_reverser.rb +26 -0
  113. data/examples/text/pig_latin.rb +2 -2
  114. data/examples/text/regional_flavor/README.md +14 -0
  115. data/examples/text/regional_flavor/article_wordbags.pig +39 -0
  116. data/examples/text/regional_flavor/j01-article_wordbags.rb +4 -0
  117. data/examples/text/regional_flavor/simple_pig_script.pig +27 -0
  118. data/examples/word_count/accumulator.rb +26 -0
  119. data/examples/word_count/tokenizer.rb +13 -0
  120. data/examples/word_count/word_count.rb +6 -0
  121. data/examples/workflow/cherry_pie.dot +97 -0
  122. data/examples/workflow/cherry_pie.png +0 -0
  123. data/examples/workflow/cherry_pie.rb +61 -26
  124. data/lib/hanuman.rb +34 -7
  125. data/lib/hanuman/graph.rb +55 -31
  126. data/lib/hanuman/graphvizzer.rb +199 -178
  127. data/lib/hanuman/graphvizzer/gv_models.rb +161 -0
  128. data/lib/hanuman/graphvizzer/gv_presenter.rb +97 -0
  129. data/lib/hanuman/link.rb +35 -0
  130. data/lib/hanuman/registry.rb +46 -0
  131. data/lib/hanuman/stage.rb +76 -32
  132. data/lib/wukong.rb +23 -24
  133. data/lib/wukong/boot.rb +87 -0
  134. data/lib/wukong/configuration.rb +8 -0
  135. data/lib/wukong/dataflow.rb +45 -78
  136. data/lib/wukong/driver.rb +99 -0
  137. data/lib/wukong/emitter.rb +22 -0
  138. data/lib/wukong/model/faker.rb +24 -24
  139. data/lib/wukong/model/flatpack_parser/flat.rb +60 -0
  140. data/lib/wukong/model/flatpack_parser/flatpack.rb +4 -0
  141. data/lib/wukong/model/flatpack_parser/lang.rb +46 -0
  142. data/lib/wukong/model/flatpack_parser/parser.rb +55 -0
  143. data/lib/wukong/model/flatpack_parser/tokens.rb +130 -0
  144. data/lib/wukong/processor.rb +60 -114
  145. data/lib/wukong/spec_helpers.rb +81 -0
  146. data/lib/wukong/spec_helpers/integration_driver.rb +144 -0
  147. data/lib/wukong/spec_helpers/integration_driver_matchers.rb +219 -0
  148. data/lib/wukong/spec_helpers/processor_helpers.rb +95 -0
  149. data/lib/wukong/spec_helpers/processor_methods.rb +108 -0
  150. data/lib/wukong/spec_helpers/shared_examples.rb +15 -0
  151. data/lib/wukong/spec_helpers/spec_driver.rb +28 -0
  152. data/lib/wukong/spec_helpers/spec_driver_matchers.rb +195 -0
  153. data/lib/wukong/version.rb +2 -1
  154. data/lib/wukong/widget/filters.rb +311 -0
  155. data/lib/wukong/widget/processors.rb +156 -0
  156. data/lib/wukong/widget/reducers.rb +7 -0
  157. data/lib/wukong/widget/reducers/accumulator.rb +73 -0
  158. data/lib/wukong/widget/reducers/bin.rb +318 -0
  159. data/lib/wukong/widget/reducers/count.rb +61 -0
  160. data/lib/wukong/widget/reducers/group.rb +85 -0
  161. data/lib/wukong/widget/reducers/group_concat.rb +70 -0
  162. data/lib/wukong/widget/reducers/moments.rb +72 -0
  163. data/lib/wukong/widget/reducers/sort.rb +130 -0
  164. data/lib/wukong/widget/serializers.rb +287 -0
  165. data/lib/wukong/widget/sink.rb +10 -52
  166. data/lib/wukong/widget/source.rb +7 -113
  167. data/lib/wukong/widget/utils.rb +46 -0
  168. data/lib/wukong/widgets.rb +6 -0
  169. data/spec/examples/dataflow/fibonacci_series_spec.rb +18 -0
  170. data/spec/examples/dataflow/parsing_spec.rb +12 -11
  171. data/spec/examples/dataflow/simple_spec.rb +32 -6
  172. data/spec/examples/dataflow/telegram_spec.rb +36 -36
  173. data/spec/examples/graph/minimum_spanning_tree_spec.rb +30 -31
  174. data/spec/examples/munging/airline_flights/identifiers_spec.rb +16 -0
  175. data/spec/examples/munging/airline_flights_spec.rb +202 -0
  176. data/spec/examples/text/pig_latin_spec.rb +13 -16
  177. data/spec/examples/workflow/cherry_pie_spec.rb +34 -4
  178. data/spec/hanuman/graph_spec.rb +27 -2
  179. data/spec/hanuman/hanuman_spec.rb +10 -0
  180. data/spec/hanuman/registry_spec.rb +123 -0
  181. data/spec/hanuman/stage_spec.rb +61 -7
  182. data/spec/spec_helper.rb +29 -19
  183. data/spec/support/hanuman_test_helpers.rb +14 -12
  184. data/spec/support/shared_context_for_reducers.rb +37 -0
  185. data/spec/support/shared_examples_for_builders.rb +101 -0
  186. data/spec/support/shared_examples_for_shortcuts.rb +57 -0
  187. data/spec/support/wukong_test_helpers.rb +37 -11
  188. data/spec/wukong/dataflow_spec.rb +77 -55
  189. data/spec/wukong/local_runner_spec.rb +24 -24
  190. data/spec/wukong/model/faker_spec.rb +132 -131
  191. data/spec/wukong/runner_spec.rb +8 -8
  192. data/spec/wukong/widget/filters_spec.rb +61 -0
  193. data/spec/wukong/widget/processors_spec.rb +126 -0
  194. data/spec/wukong/widget/reducers/bin_spec.rb +92 -0
  195. data/spec/wukong/widget/reducers/count_spec.rb +11 -0
  196. data/spec/wukong/widget/reducers/group_spec.rb +20 -0
  197. data/spec/wukong/widget/reducers/moments_spec.rb +36 -0
  198. data/spec/wukong/widget/reducers/sort_spec.rb +26 -0
  199. data/spec/wukong/widget/serializers_spec.rb +92 -0
  200. data/spec/wukong/widget/sink_spec.rb +15 -15
  201. data/spec/wukong/widget/source_spec.rb +65 -41
  202. data/spec/wukong/wukong_spec.rb +10 -0
  203. data/wukong.gemspec +17 -10
  204. metadata +359 -335
  205. data/.document +0 -5
  206. data/VERSION +0 -1
  207. data/bin/hdp-bin +0 -44
  208. data/bin/hdp-bzip +0 -23
  209. data/bin/hdp-cat +0 -3
  210. data/bin/hdp-catd +0 -3
  211. data/bin/hdp-cp +0 -3
  212. data/bin/hdp-du +0 -86
  213. data/bin/hdp-get +0 -3
  214. data/bin/hdp-kill +0 -3
  215. data/bin/hdp-kill-task +0 -3
  216. data/bin/hdp-ls +0 -11
  217. data/bin/hdp-mkdir +0 -2
  218. data/bin/hdp-mkdirp +0 -12
  219. data/bin/hdp-mv +0 -3
  220. data/bin/hdp-parts_to_keys.rb +0 -77
  221. data/bin/hdp-ps +0 -3
  222. data/bin/hdp-put +0 -3
  223. data/bin/hdp-rm +0 -32
  224. data/bin/hdp-sort +0 -40
  225. data/bin/hdp-stream +0 -40
  226. data/bin/hdp-stream-flat +0 -22
  227. data/bin/hdp-stream2 +0 -39
  228. data/bin/hdp-sync +0 -17
  229. data/bin/hdp-wc +0 -67
  230. data/bin/wu-flow +0 -10
  231. data/bin/wu-map +0 -17
  232. data/bin/wu-red +0 -17
  233. data/bin/wukong +0 -17
  234. data/data/CREDITS.md +0 -355
  235. data/data/graph/airfares.tsv +0 -2174
  236. data/data/text/gift_of_the_magi.txt +0 -225
  237. data/data/text/jabberwocky.txt +0 -36
  238. data/data/text/rectification_of_names.txt +0 -33
  239. data/data/twitter/a_atsigns_b.tsv +0 -64
  240. data/data/twitter/a_follows_b.tsv +0 -53
  241. data/data/twitter/tweet.tsv +0 -167
  242. data/data/twitter/twitter_user.tsv +0 -55
  243. data/data/wikipedia/dbpedia-sentences.tsv +0 -1000
  244. data/docpages/INSTALL.textile +0 -92
  245. data/docpages/LICENSE.textile +0 -107
  246. data/docpages/README-elastic_map_reduce.textile +0 -377
  247. data/docpages/README-performance.textile +0 -90
  248. data/docpages/README-wulign.textile +0 -65
  249. data/docpages/UsingWukong-part1-get_ready.textile +0 -17
  250. data/docpages/UsingWukong-part2-ThinkingBigData.textile +0 -75
  251. data/docpages/UsingWukong-part3-parsing.textile +0 -138
  252. data/docpages/_config.yml +0 -39
  253. data/docpages/avro/avro_notes.textile +0 -56
  254. data/docpages/avro/performance.textile +0 -36
  255. data/docpages/avro/tethering.textile +0 -19
  256. data/docpages/bigdata-tips.textile +0 -143
  257. data/docpages/code/api_response_example.txt +0 -20
  258. data/docpages/code/parser_skeleton.rb +0 -38
  259. data/docpages/diagrams/MapReduceDiagram.graffle +0 -0
  260. data/docpages/favicon.ico +0 -0
  261. data/docpages/gem.css +0 -16
  262. data/docpages/hadoop-tips.textile +0 -83
  263. data/docpages/index.textile +0 -92
  264. data/docpages/intro.textile +0 -8
  265. data/docpages/moreinfo.textile +0 -174
  266. data/docpages/news.html +0 -24
  267. data/docpages/pig/PigLatinExpressionsList.txt +0 -122
  268. data/docpages/pig/PigLatinReferenceManual.txt +0 -1640
  269. data/docpages/pig/commandline_params.txt +0 -26
  270. data/docpages/pig/cookbook.html +0 -481
  271. data/docpages/pig/images/hadoop-logo.jpg +0 -0
  272. data/docpages/pig/images/instruction_arrow.png +0 -0
  273. data/docpages/pig/images/pig-logo.gif +0 -0
  274. data/docpages/pig/piglatin_ref1.html +0 -1103
  275. data/docpages/pig/piglatin_ref2.html +0 -14340
  276. data/docpages/pig/setup.html +0 -505
  277. data/docpages/pig/skin/basic.css +0 -166
  278. data/docpages/pig/skin/breadcrumbs.js +0 -237
  279. data/docpages/pig/skin/fontsize.js +0 -166
  280. data/docpages/pig/skin/getBlank.js +0 -40
  281. data/docpages/pig/skin/getMenu.js +0 -45
  282. data/docpages/pig/skin/images/chapter.gif +0 -0
  283. data/docpages/pig/skin/images/chapter_open.gif +0 -0
  284. data/docpages/pig/skin/images/current.gif +0 -0
  285. data/docpages/pig/skin/images/external-link.gif +0 -0
  286. data/docpages/pig/skin/images/header_white_line.gif +0 -0
  287. data/docpages/pig/skin/images/page.gif +0 -0
  288. data/docpages/pig/skin/images/pdfdoc.gif +0 -0
  289. data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
  290. data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
  291. data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  292. data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
  293. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
  294. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  295. data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
  296. data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
  297. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  298. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  299. data/docpages/pig/skin/print.css +0 -54
  300. data/docpages/pig/skin/profile.css +0 -181
  301. data/docpages/pig/skin/screen.css +0 -587
  302. data/docpages/pig/tutorial.html +0 -1059
  303. data/docpages/pig/udf.html +0 -1509
  304. data/docpages/tutorial.textile +0 -283
  305. data/docpages/usage.textile +0 -195
  306. data/docpages/wutils.textile +0 -263
  307. data/examples/dataflow/complex.rb +0 -11
  308. data/examples/dataflow/donuts.rb +0 -13
  309. data/examples/tiny_count/jabberwocky_output.tsv +0 -92
  310. data/examples/word_count.rb +0 -48
  311. data/examples/workflow/fiddle.rb +0 -24
  312. data/lib/away/escapement.rb +0 -129
  313. data/lib/away/exe.rb +0 -11
  314. data/lib/away/experimental.rb +0 -5
  315. data/lib/away/from_file.rb +0 -52
  316. data/lib/away/job.rb +0 -56
  317. data/lib/away/job/rake_compat.rb +0 -17
  318. data/lib/away/registry.rb +0 -79
  319. data/lib/away/runner.rb +0 -276
  320. data/lib/away/runner/execute.rb +0 -121
  321. data/lib/away/script.rb +0 -161
  322. data/lib/away/script/hadoop_command.rb +0 -240
  323. data/lib/away/source/file_list_source.rb +0 -15
  324. data/lib/away/source/looper.rb +0 -18
  325. data/lib/away/task.rb +0 -219
  326. data/lib/hanuman/action.rb +0 -21
  327. data/lib/hanuman/chain.rb +0 -4
  328. data/lib/hanuman/graphviz.rb +0 -74
  329. data/lib/hanuman/resource.rb +0 -6
  330. data/lib/hanuman/slot.rb +0 -87
  331. data/lib/hanuman/slottable.rb +0 -220
  332. data/lib/wukong/bad_record.rb +0 -15
  333. data/lib/wukong/event.rb +0 -44
  334. data/lib/wukong/local_runner.rb +0 -55
  335. data/lib/wukong/mapred.rb +0 -3
  336. data/lib/wukong/universe.rb +0 -48
  337. data/lib/wukong/widget/filter.rb +0 -81
  338. data/lib/wukong/widget/gibberish.rb +0 -123
  339. data/lib/wukong/widget/monitor.rb +0 -26
  340. data/lib/wukong/widget/reducer.rb +0 -66
  341. data/lib/wukong/widget/stringifier.rb +0 -50
  342. data/lib/wukong/workflow.rb +0 -22
  343. data/lib/wukong/workflow/command.rb +0 -42
  344. data/old/config/emr-example.yaml +0 -48
  345. data/old/examples/README.txt +0 -17
  346. data/old/examples/contrib/jeans/README.markdown +0 -165
  347. data/old/examples/contrib/jeans/data/normalized_sizes +0 -3
  348. data/old/examples/contrib/jeans/data/orders.tsv +0 -1302
  349. data/old/examples/contrib/jeans/data/sizes +0 -3
  350. data/old/examples/contrib/jeans/normalize.rb +0 -20
  351. data/old/examples/contrib/jeans/sizes.rb +0 -55
  352. data/old/examples/corpus/bnc_word_freq.rb +0 -44
  353. data/old/examples/corpus/bucket_counter.rb +0 -47
  354. data/old/examples/corpus/dbpedia_abstract_to_sentences.rb +0 -86
  355. data/old/examples/corpus/sentence_bigrams.rb +0 -53
  356. data/old/examples/corpus/sentence_coocurrence.rb +0 -66
  357. data/old/examples/corpus/stopwords.rb +0 -138
  358. data/old/examples/corpus/words_to_bigrams.rb +0 -53
  359. data/old/examples/emr/README.textile +0 -110
  360. data/old/examples/emr/dot_wukong_dir/credentials.json +0 -7
  361. data/old/examples/emr/dot_wukong_dir/emr.yaml +0 -69
  362. data/old/examples/emr/dot_wukong_dir/emr_bootstrap.sh +0 -33
  363. data/old/examples/emr/elastic_mapreduce_example.rb +0 -28
  364. data/old/examples/network_graph/adjacency_list.rb +0 -74
  365. data/old/examples/network_graph/breadth_first_search.rb +0 -72
  366. data/old/examples/network_graph/gen_2paths.rb +0 -68
  367. data/old/examples/network_graph/gen_multi_edge.rb +0 -112
  368. data/old/examples/network_graph/gen_symmetric_links.rb +0 -64
  369. data/old/examples/pagerank/README.textile +0 -6
  370. data/old/examples/pagerank/gen_initial_pagerank_graph.pig +0 -57
  371. data/old/examples/pagerank/pagerank.rb +0 -72
  372. data/old/examples/pagerank/pagerank_initialize.rb +0 -42
  373. data/old/examples/pagerank/run_pagerank.sh +0 -21
  374. data/old/examples/sample_records.rb +0 -33
  375. data/old/examples/server_logs/apache_log_parser.rb +0 -15
  376. data/old/examples/server_logs/nook.rb +0 -48
  377. data/old/examples/server_logs/nook/faraday_dummy_adapter.rb +0 -94
  378. data/old/examples/server_logs/user_agent.rb +0 -40
  379. data/old/examples/simple_word_count.rb +0 -82
  380. data/old/examples/size.rb +0 -61
  381. data/old/examples/stats/avg_value_frequency.rb +0 -86
  382. data/old/examples/stats/binning_percentile_estimator.rb +0 -140
  383. data/old/examples/stats/data/avg_value_frequency.tsv +0 -3
  384. data/old/examples/stats/rank_and_bin.rb +0 -173
  385. data/old/examples/stupidly_simple_filter.rb +0 -40
  386. data/old/examples/word_count.rb +0 -75
  387. data/old/graph/graphviz_builder.rb +0 -580
  388. data/old/graph_easy/Attributes.pm +0 -4181
  389. data/old/graph_easy/Graphviz.pm +0 -2232
  390. data/old/wukong.rb +0 -18
  391. data/old/wukong/and_pig.rb +0 -38
  392. data/old/wukong/bad_record.rb +0 -18
  393. data/old/wukong/datatypes.rb +0 -24
  394. data/old/wukong/datatypes/enum.rb +0 -127
  395. data/old/wukong/datatypes/fake_types.rb +0 -17
  396. data/old/wukong/decorator.rb +0 -28
  397. data/old/wukong/encoding/asciize.rb +0 -108
  398. data/old/wukong/extensions.rb +0 -16
  399. data/old/wukong/extensions/array.rb +0 -18
  400. data/old/wukong/extensions/blank.rb +0 -93
  401. data/old/wukong/extensions/class.rb +0 -189
  402. data/old/wukong/extensions/date_time.rb +0 -53
  403. data/old/wukong/extensions/emittable.rb +0 -69
  404. data/old/wukong/extensions/enumerable.rb +0 -79
  405. data/old/wukong/extensions/hash.rb +0 -167
  406. data/old/wukong/extensions/hash_keys.rb +0 -16
  407. data/old/wukong/extensions/hash_like.rb +0 -150
  408. data/old/wukong/extensions/hashlike_class.rb +0 -47
  409. data/old/wukong/extensions/module.rb +0 -2
  410. data/old/wukong/extensions/pathname.rb +0 -27
  411. data/old/wukong/extensions/string.rb +0 -65
  412. data/old/wukong/extensions/struct.rb +0 -17
  413. data/old/wukong/extensions/symbol.rb +0 -11
  414. data/old/wukong/filename_pattern.rb +0 -74
  415. data/old/wukong/helper.rb +0 -7
  416. data/old/wukong/helper/stopwords.rb +0 -195
  417. data/old/wukong/helper/tokenize.rb +0 -35
  418. data/old/wukong/logger.rb +0 -38
  419. data/old/wukong/periodic_monitor.rb +0 -72
  420. data/old/wukong/schema.rb +0 -269
  421. data/old/wukong/script.rb +0 -286
  422. data/old/wukong/script/avro_command.rb +0 -5
  423. data/old/wukong/script/cassandra_loader_script.rb +0 -40
  424. data/old/wukong/script/emr_command.rb +0 -168
  425. data/old/wukong/script/hadoop_command.rb +0 -237
  426. data/old/wukong/script/local_command.rb +0 -41
  427. data/old/wukong/store.rb +0 -10
  428. data/old/wukong/store/base.rb +0 -27
  429. data/old/wukong/store/cassandra.rb +0 -10
  430. data/old/wukong/store/cassandra/streaming.rb +0 -75
  431. data/old/wukong/store/cassandra/struct_loader.rb +0 -21
  432. data/old/wukong/store/cassandra_model.rb +0 -91
  433. data/old/wukong/store/chh_chunked_flat_file_store.rb +0 -37
  434. data/old/wukong/store/chunked_flat_file_store.rb +0 -48
  435. data/old/wukong/store/conditional_store.rb +0 -57
  436. data/old/wukong/store/factory.rb +0 -8
  437. data/old/wukong/store/flat_file_store.rb +0 -89
  438. data/old/wukong/store/key_store.rb +0 -51
  439. data/old/wukong/store/null_store.rb +0 -15
  440. data/old/wukong/store/read_thru_store.rb +0 -22
  441. data/old/wukong/store/tokyo_tdb_key_store.rb +0 -33
  442. data/old/wukong/store/tyrant_rdb_key_store.rb +0 -57
  443. data/old/wukong/store/tyrant_tdb_key_store.rb +0 -20
  444. data/old/wukong/streamer.rb +0 -30
  445. data/old/wukong/streamer/accumulating_reducer.rb +0 -83
  446. data/old/wukong/streamer/base.rb +0 -126
  447. data/old/wukong/streamer/counting_reducer.rb +0 -25
  448. data/old/wukong/streamer/filter.rb +0 -20
  449. data/old/wukong/streamer/instance_streamer.rb +0 -15
  450. data/old/wukong/streamer/json_streamer.rb +0 -21
  451. data/old/wukong/streamer/line_streamer.rb +0 -12
  452. data/old/wukong/streamer/list_reducer.rb +0 -31
  453. data/old/wukong/streamer/rank_and_bin_reducer.rb +0 -145
  454. data/old/wukong/streamer/record_streamer.rb +0 -14
  455. data/old/wukong/streamer/reducer.rb +0 -11
  456. data/old/wukong/streamer/set_reducer.rb +0 -14
  457. data/old/wukong/streamer/struct_streamer.rb +0 -48
  458. data/old/wukong/streamer/summing_reducer.rb +0 -29
  459. data/old/wukong/streamer/uniq_by_last_reducer.rb +0 -51
  460. data/old/wukong/typed_struct.rb +0 -12
  461. data/spec/away/encoding_spec.rb +0 -32
  462. data/spec/away/exe_spec.rb +0 -20
  463. data/spec/away/flow_spec.rb +0 -82
  464. data/spec/away/graph_spec.rb +0 -6
  465. data/spec/away/job_spec.rb +0 -15
  466. data/spec/away/rake_compat_spec.rb +0 -9
  467. data/spec/away/script_spec.rb +0 -81
  468. data/spec/hanuman/graphviz_spec.rb +0 -29
  469. data/spec/hanuman/slot_spec.rb +0 -2
  470. data/spec/support/examples_helper.rb +0 -10
  471. data/spec/support/streamer_test_helpers.rb +0 -6
  472. data/spec/support/wukong_widget_helpers.rb +0 -66
  473. data/spec/wukong/processor_spec.rb +0 -109
  474. data/spec/wukong/widget/filter_spec.rb +0 -99
  475. data/spec/wukong/widget/stringifier_spec.rb +0 -51
  476. data/spec/wukong/workflow/command_spec.rb +0 -5
data/README.md CHANGED
@@ -1,422 +1,283 @@
1
- # Wukong [![Build Status](https://secure.travis-ci.org/infochimps-labs/wukong.png)](http://travis-ci.org/infochimps-labs/wukong)
2
-
3
- Wukong is a toolkit for rapid, agile development of dataflows at any scale.
4
-
5
- (note: the syntax below is mostly false)
6
-
7
-
8
- <a name="design"></a>
9
- ## Design Overview
10
-
11
- The fundamental principle of Wukong/Hanuman is *powerful black boxes, beautiful glue*. In general, they don't do the thing -- they coordinate the boxes that do the thing, to let you implement rapidly, nimbly and readably. Hanuman elegantly describes high-level data flows; Wukong is a pragmatic collection of dataflow primitives. They both emphasize scalability, readability and rapid development over performance or universality.
12
-
13
- Wukong/Hanuman are chiefly concerned with these specific types of graphs:
14
-
15
- * **dataflow** -- chains of simple modules to handle continuous data processing -- coordinates Flume, Unix pipes, ZeroMQ, Esper, Storm.
16
- * **workflows** -- episodic job sequences, joined by dependency links -- comparable to Rake, Azkaban or Oozie.
17
- * **map/reduce** -- Hadoop's standard *disordered/partitioned stream > partition, sort & group > process groups* workflow. Comparable to MRJob and Dumbo.
18
- * **queue workers** -- pub/sub asynchronously triggered jobs -- comparable to Resque, RabbitMQ/AMQP, Amazon Simple Worker, Heroku workers.
19
-
20
- In addition, wukong stages may be deployed into **http middleware**: lightweight distributed API handlers -- comparable to Rack, Goliath or Twisted.
21
-
22
- When you're describing a Wukong/Hanuman flow, you're writing pure expressive ruby, not some hokey interpreted language or clumsy XML format. Thanks to JRuby, it can speak directly to Java-based components like Hadoop, Flume, Storm or Spark.
23
-
24
- ## What's where
25
-
26
- * Configliere -- Manage settings
27
- - Layer - Project settings through a late-resolved stack of config objects.
28
- * Gorillib
29
- - Type, RecordType
30
- - TypeConversion
31
- - Model
32
- - PathHelpers
33
- * Wukong
34
- - fs - Abstracts file hdfs s3n s3hdfs scp
35
- - streamer - Black-box data transform
36
- - job - Workflow definition
37
- - flow - Dataflow definition
38
- - widgets - Common data transforms
39
- - RubyHadoop - Hadoop jobs using streamers
40
- - RubyFlume - Flume decorators using streamers
41
- * Hanuman -- Elegant small graph assembly
42
- * Swineherd -- Common interface on ugly tools
43
- - Turn readable hash into safe commandline (param conv, escaping)
44
- - Execute command, capture stdin/stderr
45
- - Summarize execution with a broham-able hash
46
- - Common modules: Input/output, Java, gnu, configliere
47
- - Template
48
- - Hadoop, pig, flume
49
- - ?? Cp, mv, rm, zip, tar, bz2, gz, ssh, scp
50
- - ?? Remotely execute command
51
-
52
- <a name="design-rules"></a>
53
- ### Story
54
-
55
- [Narrative Method Structure](http://avdi.org/talks/confident-code-rubymidwest-2011/confident-code.html)
56
-
57
- * Gather input
58
- * Perform work
59
- * Deliver results
60
- * Handle failure
61
-
62
-
63
- <a name="design-rules"></a>
64
- ### Design Rules
65
-
66
- * **whiteboard rule**: the user-facing conceptual model should match the picture you would draw on the whiteboard in an engineering discussion. The fundamental goal is to abstract the necessary messiness surrounding the industrial-strength components it orchestrates while still providing their essential power.
67
- * **common cases are simple, complex cases are always possible**: The code should be as simple as the story it tells. For the things you do all the time, you only need to describe how this data flow is different from all other data flows. However, at no point in the project lifecycle should Wukong/Hanuman hit a brick wall or peat bog requiring its total replacement. A complex production system may, for example, require that you replace a critical path with custom Java code -- but that's a small set of substitutions in an otherwise stable, scalable graph. In the world of web programming, Ruby on Rails passes this test; Sinatra and Drupal do not.
68
- * **petabyte rule**: Wukong/Hanuman coordinate industrial-strength components that work at terabyte- and petabyte-scale. Conceptual simplicity makes it an excellent tool even for small jobs, but scalability is key. All components must assume an asynchronous, unreliable and distributed system.
69
- * **laptop rule**:
70
- * **no dark magick**: the core libraries provide *elegant, predictable magic or no magic at all*. We use metaprogramming heavily, but always predictably, and only in service of making common cases simple.
71
- - Soupy multi-option `case` statements are a smell.
72
- - Complex tasks will require code that is more explicit, but readable and organically connected to the typical usage. For example, many data flows will require a custom `Wukong::Streamer` class; but that class is no more complex than the built-in streamer models and receives all the same sugar methods they do.
73
- * **get shit done**: sometimes ugly tasks require ugly solutions. Shelling out to the hadoop process monitor and parsing its output is acceptable if it is robust and obviates the need for a native protocol handler.
74
- * **be clever early, boring late**: magic in service of having a terse language for assembling a graph is great. However, the assembled graph should be atomic and largely free of any conditional logic or dependencies.
75
- - for example, the data flow `split` statement allows you to set a condition on each branch. The assembled graph, however, is typically a `fanout` stage followed by `filter` stages.
76
- - the graph language has some helpers to refer to graph stages. The compiled graph uses explicit mostly-readable but unambiguous static handles.
77
- - some stages offer light polymorphism -- for example, `select` accepts either a regexp or block. This is handled at the factory level, and the resulting stage is free of conditional logic.
78
- * **no lock-in**: needless to say, Wukong works seamlessly with the Infochimps platform, making robust, reliable massive-scale dataflows amazingly simple. However, wukong flows are not tied to the cloud: they project to Hadoop, Flume or any of the other open-source components that power our platform.
79
-
80
- __________________________________________________________________________
81
-
82
- <a name="stage"></a>
83
- ## Stage
84
-
85
- A graph is composed of `stage`s.
86
-
87
- * *desc* (alias `description`)
88
-
89
- #### Actions
90
-
91
- each action
92
-
93
- * the default action is `call`
94
- * all stages respond to `nothing`, and like ze goggles, do `nothing`.
95
-
96
- __________________________________________________________________________
97
-
98
- <a name="dataflows"></a>
99
- ## Workflows
100
-
101
- Wukong workflows work somewhat differently from what you may be familiar with in Rake and similar tools.
102
-
103
- In wukong, a stage corresponds to a resource; you can then act on that resource.
104
-
105
- Consider first compiling a c program:
106
-
107
- to build the executable, run `cc -o cake eggs.o milk.o flour.o sugar.o -I./include -L./lib`
108
- to build files like '{file}.o', run `cc -c -o {file}.o {file}.c -I./include`
109
-
110
- In this case, you define the *steps*, implying the resources.
111
-
112
-
113
- Something rake can't do (but we should be able to): make it so I can define a dependency that runs **last**
114
-
115
- ### Defining jobs
116
-
117
- Wukong.job(:launch) do
118
- task :aim do
119
- #...
120
- end
121
- task :enter do
122
- end
123
- task :commit do
124
- # ...
125
- end
126
- end
127
-
128
- Wukong.job(:recall) do
129
- task :smash_with_rock do
130
- #...
131
- end
132
- task :reprogram do
133
- # ...
134
- end
135
- end
136
-
137
- * stages construct resources
138
- - these have default actions
139
- * hanuman tracks defined order
140
-
141
- * do steps run in order, or is dependency explicit?
142
- * what about idempotency?
143
-
144
- * `task` vs `action` vs `resource`; `job`, `task`, `group`, `namespace`.
145
-
146
- ### documenting
147
-
148
- Inline option (`:desc` or `:description`?)
149
-
150
- ```ruby
151
- task :foo, :description => "pity the foo" do
152
- # ...
153
- end
154
- ```
155
-
156
- DSL method option
157
-
158
- ```ruby
159
- task :foo do
160
- description "pity the foo"
161
- # ...
162
- end
163
- ```
164
-
165
- ### actions
166
-
167
- default action:
168
-
169
- ```ruby
170
- script 'nukes/launch_codes.rb' do
171
- # ...
172
- end
173
- ```
1
+ # Wukong
2
+
3
+ Wukong is a toolkit for rapid, agile development of data applications
4
+ at any scale.
5
+
6
+ The core concept in Wukong is a **Processor**. Wukong processors are
7
+ simple Ruby classes that do one thing and do it well. This codebase
8
+ implements processors and other core Wukong classes and provides a
9
+ tool, `wu-local`, to run and combine processors on the command-line.
10
+
11
+ Wukong's larger theme is *powerful black boxes, beautiful glue*. The
12
+ Wukong ecosystem consists of other tools which run Wukong processors
13
+ in various topologies across a variety of different backends. Code
14
+ written in Wukong can be easily ported between environments and
15
+ frameworks: local command-line scripts on your laptop instantly turn
16
+ into powerful jobs running in Hadoop.
17
+
18
+ Here is a list of various other projects which you may also want to
19
+ peruse when trying to understand the full Wukong experience:
20
+
21
+ * <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
22
+ * <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
23
+ * <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
24
+
25
+ For a more holistic perspective also see the Infochimps Platform
26
+ Community Edition (**FIXME: link to this**) which combines all the
27
+ Wukong tools together into a jetpack which fits comfortably over the
28
+ shoulders of developers.
29
+
30
+ <a name="processors"></a>
31
+ ## Writing Simple Processors
32
+
33
+ The fundamental unit of computation in Wukong is the processor. A
34
+ processor is a Ruby class which
35
+
36
+ * subclasses `Wukong::Processor` (use the `Wukong.processor` method as sugar for this)
37
+ * defines a `process` method which takes an input record, does something, and calls `yield` on the output
38
+
39
+ Here's a processor that reverses each input record:
40
+
41
+ ```ruby
42
+ # in string_reverser.rb
43
+ Wukong.processor(:string_reverser) do
44
+ def process string
45
+ yield string.reverse
46
+ end
47
+ end
48
+ ```
49
+
50
+ When you're developing your application, run your processors on the
51
+ command line on flat input files using `wu-local`:
52
+
53
+ ```
54
+ $ cat novel.txt
55
+ It was the best of times, it was the worst of times.
56
+ ...
57
+
58
+ $ cat novel.txt | wu-local string_reverser.rb
59
+ .semit fo tsrow eht saw ti ,semit fo tseb eht saw tI
60
+ ```
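To make the `process`/`yield` contract concrete, here is a minimal plain-Ruby sketch that does not load Wukong at all. The `StringReverser` class and the `run_processor` driver are hypothetical stand-ins for illustration; they are not how `wu-local` is implemented.

```ruby
# Plain-Ruby stand-in for a processor (assumption: no Wukong classes involved).
class StringReverser
  # Takes one input record and hands each output record to the caller's block.
  def process(string)
    yield string.reverse
  end
end

# Hypothetical driver: feeds each line to the processor and collects
# whatever the processor yields.
def run_processor(processor, lines)
  output = []
  lines.each do |line|
    processor.process(line.chomp) { |record| output << record }
  end
  output
end

reversed = run_processor(
  StringReverser.new,
  ["It was the best of times, it was the worst of times.\n"]
)
puts reversed
```

The processor never decides where its output goes; it only yields, and the caller supplies the destination.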
61
+
62
+ You can use yield as often (or never) as you need. Here's a more
63
+ complicated example to illustrate:
64
+
65
+ ```ruby
66
+ # in processors.rb
67
+
68
+ Wukong.processor(:tokenizer) do
69
+ def process line
70
+ line.split.each { |token| yield token }
71
+ end
72
+ end
174
73
 
175
- define the `:undo` action:
176
-
177
- ```ruby
178
- script 'nukes/launch_codes.rb', :undo do
179
- # ...
180
- end
181
- ```
182
-
183
- <a name="file-name-templates"></a>
184
- ### File name templates
185
-
186
- * *timestamp*: timestamp of run. everything in this invocation will have the same timestamp.
187
- * *user*: username; `ENV['USER']` by default
188
- * *sources*: basenames of job inputs, minus extension, non-`\w` replaced with '_', joined by '-', max 50 chars.
189
- * *job*: job flow name
190
-
191
- <a name="job-versioning-of-clobbered"></a>
192
- ### versioning of clobbered files
193
-
194
- * when files are generated or removed, relocate to a timestamped location
195
- - a file `/path/to/file.txt` is relocated to `~/.wukong/backups/path/to/file.txt.wukong-20110102120011` where `20110102120011` is the [job timestamp](#file-naming)
196
- - accepts a `max_size` param
197
- - raises if it can't write to directory -- must explicitly say `--safe_file_ops=false`
198
-
199
- <a name="job-running"></a>
200
- ### running
201
-
202
- * `clobber` -- run, but clear all dependencies
203
- * `undo` --
204
- * `clean` --
205
-
206
-
207
- ### Utility and Filesystem tasks
208
-
209
- The primitives correspond heavily with Rake and Chef. However, they extend them in many ways, don't cover all their functionality in many ways, and are incompatible in several ways.
210
-
211
- ### Configuration
212
-
213
-
214
- #### Commandline args
215
-
216
- * handled by configliere: `nukes launch --launch_code=GLG20`
217
-
218
- * TODO: configliere needs context-specific config vars, so I only get information about the `launch` action in the `nukes` job when I run `nukes launch --help`
219
-
220
-
221
-
222
- __________________________________________________________________________
223
-
224
- <a name="dataflows"></a>
225
- ## Dataflows
226
-
227
-
228
- Data flows
229
-
230
- * you can have a consumer connect to a provider, or vice versa
231
- - producer binds to a port, consumers connect to it: pub/sub
232
- - consumers open a port, producer connects to many: megaphone
74
+ Wukong.processor(:starts_with) do
233
75
 
234
- * you can bring the provider on line first, and the consumers later, or vice versa.
235
-
236
-
237
- <a name="dataflow-syntax"></a>
238
- ## Syntax
239
-
240
- **note: this is a scratch pad; actual syntax evolving rapidly and currently looks not much like the following**
241
-
242
- read('/foo/bar') # source( FileSource.new('/foo/bar') )
243
- writes('/foo/bar') # sink( FileSink.new('/foo/bar') )
244
-
245
- ... | file('/foo/bar') # this we know is a source
246
- file('/foo/bar') | ... # this we know is a sink
247
- file('/foo/bar') # don't know; maybe we can guess later
248
-
249
- Here is an example Wukong script, `count_followers.rb`:
250
-
251
- from :json
252
-
253
- mapper do |user|
254
- year_month = Time.parse(user[:created_at]).strftime("%Y%M")
255
- emit [ user[:followers_count], year_month ]
256
- end
257
-
258
- reducer do
259
- start{ @count = 0 }
260
-
261
- each do |followers_count, year_month|
262
- @count += 1
263
- end
264
-
265
- finally{ emit [*@group_key, @count] }
266
- end
76
+ field :letter, String, :default => 'a'
77
+
78
+ def process word
79
+ yield word if word =~ Regexp.new("^#{letter}", true)
80
+ end
81
+ end
82
+ ```
83
+
84
+ Let's start by running the `tokenizer`. We've defined two processors
85
+ in the file `processors.rb` and neither one is named `processors` so
86
+ we have to tell `wu-local` the name of the processor we want to run
87
+ explicitly.
88
+
89
+ ```
90
+ $ cat novel.txt | wu-local processors.rb --run=tokenizer
91
+ It
92
+ was
93
+ the
94
+ best
95
+ of
96
+ times,
97
+ ...
98
+ ```
99
+
100
+ You can combine the output of one processor with another right in the
101
+ shell. Let's add the `starts_with` filter and also pass in the
102
+ *field* `letter`, defined in that processor:
103
+
104
+ ```
105
+ $ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local processors.rb --run=starts_with --letter=t
106
+ the
107
+ times
108
+ the
109
+ times
110
+ ...
111
+ ```
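The same two-step pipeline can be sketched in plain Ruby to show that a processor may yield many records per input (the tokenizer) or none at all (the filter). The `Tokenizer` and `StartsWith` classes and the `pipe` helper below are hypothetical; the sketch assumes nothing about Wukong's internals.

```ruby
# Yields once per whitespace-separated token: many outputs per input.
class Tokenizer
  def process(line)
    line.split.each { |token| yield token }
  end
end

# Yields the word only if it starts with the configured letter: zero or one output.
class StartsWith
  def initialize(letter = 'a')
    @pattern = Regexp.new("^#{Regexp.escape(letter)}", Regexp::IGNORECASE)
  end

  def process(word)
    yield word if word =~ @pattern
  end
end

# Hypothetical helper: runs records through each processor in turn,
# much like chaining commands with a shell pipe.
def pipe(records, *processors)
  processors.reduce(records) do |current, processor|
    out = []
    current.each { |record| processor.process(record) { |o| out << o } }
    out
  end
end

t_words = pipe(
  ["It was the best of times, it was the worst of times."],
  Tokenizer.new,
  StartsWith.new('t')
)
puts t_words
```

In this sketch the punctuation stays attached to each token, because `String#split` only breaks on whitespace.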
112
+
113
+ Wanting to match on a regular expression is such a common task that
114
+ Wukong has a built-in "widget" called `regexp` that you can use
115
+ directly:
116
+
117
+ ```
118
+ $ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local regexp --match='^t'
119
+ ```
120
+
121
+ There are many more simple <a href="#widgets">widgets</a> like these.
122
+
123
+ <a name="flows"></a>
124
+ ## Combining Processors into Dataflows
125
+
126
+ Combining processors which each do one thing well together in a chain
127
+ mimics the tried-and-true UNIX pipeline. Wukong lets you define
128
+ these pipelines more formally as a dataflow. Here's the dataflow for
129
+ the last example:
130
+
131
+ ```ruby
132
+ # in find_t_words.rb
133
+ Wukong.dataflow(:find_t_words) do
134
+ tokenizer > regexp(match: /^t/)
135
+ end
136
+ ```
137
+
138
+ The DSL Wukong provides for combining processors is designed to
139
+ be similar to the process of developing them on the command line. You
140
+ can run this dataflow directly
141
+
142
+ ```
143
+ $ cat novel.txt | wu-local find_t_words.rb
144
+ the
145
+ times
146
+ the
147
+ times
148
+ ...
149
+ ```
150
+
151
+ and it works exactly like before.
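One way a `>` chaining syntax like this could be built is by defining the `>` operator on a stage object so that each stage remembers its downstream neighbour. The `Stage` class below is a hypothetical sketch of that idea only; it is an assumption for illustration and not Wukong's actual dataflow implementation.

```ruby
# Hypothetical stage: wraps a block and forwards its output downstream.
class Stage
  attr_reader :name, :downstream

  def initialize(name, &action)
    @name = name
    @action = action
  end

  # Link this stage to the next one and return the next stage,
  # so that `a > b > c` links a to b and then b to c.
  def >(other)
    @downstream = other
    other
  end

  # Run the action; emitted records go to the downstream stage if there
  # is one, otherwise to the caller's block (the end of the flow).
  def call(record, &sink)
    emit = downstream ? ->(out) { downstream.call(out, &sink) } : sink
    @action.call(record, emit)
  end
end

tokenizer = Stage.new(:tokenizer) { |line, emit| line.split.each { |token| emit.call(token) } }
t_filter  = Stage.new(:regexp)    { |word, emit| emit.call(word) if word =~ /^t/ }

last_stage = tokenizer > t_filter

matches = []
tokenizer.call("It was the best of times") { |word| matches << word }
puts matches
```

Because `>` is left-associative in Ruby, returning the right-hand stage is enough to make longer chains link up in order.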
152
+
153
+ <a name="serialization"></a>
154
+ ## Serialization
155
+
156
+ The process method for a Processor must accept a String argument and
157
+ yield a String argument (or something that will `to_s` appropriately).
158
+
159
+ **Coming Soon:** The ability to define `consumes` and `emits` to
160
+ automatically handle serialization and deserialization.
161
+
162
+ <a name="widgets"></a>
163
+ ## Widgets
164
+
165
+ Wukong has a number of built-in widgets that are useful for
166
+ scaffolding your dataflows.
167
+
168
+ ### Serializers
169
+
170
+ Serializers are widgets which don't change the semantic meaning of a
171
+ record, merely its representation. Here's a list:
172
+
173
+ * `to_json`, `from_json` for turning records into JSON or parsing JSON into records
174
+ * `to_tsv`, `from_tsv` for turning Array records into TSV or parsing TSV into Array records
175
+ * `pretty` for pretty printing JSON inputs
176
+
177
+ When you're writing processors that are capable of running in
178
+ isolation you'll want to ensure that you deserialize and serialize
179
+ records on the way in and out, like this
180
+
181
+ ```ruby
182
+ Wukong.processor(:on_my_own) do
183
+ def process json
184
+ obj = MultiJson.load(json)
267
185
 
268
- You can run this from the commandline:
269
-
270
- wukong count_followers.rb users.json followers_histogram.tsv
186
+ # do something with obj...
271
187
 
272
- It will run in local mode, effectively doing
273
-
274
- cat users.json | {the map block} | sort | {the reduce block} > followers_histogram.tsv
275
-
276
- You can instead run it in Hadoop mode, and it will launch the job across a distributed Hadoop cluster
277
-
278
- wukong --run=hadoop count_followers.rb users.json followers_histogram.tsv
279
-
280
- <a name="formatters"></a>
281
- #### Data Formats (Serialization / Deserialization)
282
-
283
- * tsv/csv
284
- * json
285
- * xml
286
- * avro
287
- * apache_log
288
- * flat
289
- * regexp
290
- * [Tagged Netstrings](http://tnetstrings.org/)
291
- * [ZeroMQ Property Language](http://rfc.zeromq.org/spec:4)
292
-
293
- * gz/bz2/zip/snappy
294
-
295
- <a name="data-packets"></a>
296
- #### Data Packets
297
-
298
- Data consists of
299
-
300
- - record
301
- - schema
302
- - metadata
303
-
304
- ## Delivery Guarantees
305
-
306
- Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker records that fact locally. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else it could go. Since the data structure used for storage in many messaging systems scale poorly, this is also a pragmatic choice--since the broker knows what is consumed it can immediately delete it, keeping the data size small.
307
-
308
- What is perhaps not obvious, is that getting the broker and consumer to come into agreement about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out or whatever) then that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature which means that messages are only marked as sent not consumed when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is around performance, now the broker must keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.
309
-
310
- So clearly there are multiple possible message delivery guarantees that could be provided:
311
-
312
- * At most once—this handles the first case described. Messages are immediately marked as consumed, so they can't be given out twice, but many failure scenarios may lead to losing messages.
313
- * At least once—this is the second case where we guarantee each message will be delivered at least once, but in failure cases may be delivered twice.
314
- * Exactly once—this is what people actually want, each message is delivered once and only once.
315
-
316
-
317
- __________________________________________________________________________
318
-
319
- <a name="design-questions"></a>
320
- ## Design Questions
321
-
322
- * **filename helpers**:
323
- - `':data_dir:/this/that/:script:-:user:-:timestamp:.:ext:'`?
324
- - `path_to(:data_dir, 'this/that', "???")`?
325
-
326
- * `class Wukong::Foo::Base` vs `class Wukong::Foo`
327
- - the latter is more natural, and still allows
328
- - I'd like
329
-
330
-
331
-
332
- __________________________________________________________________________
333
- __________________________________________________________________________
334
- __________________________________________________________________________
335
-
336
-
337
- <a name="references"></a>
338
- ## References
339
-
340
- <a name="refs-workflow"></a>
341
- ### Workflow
342
-
343
- * **Rake**
344
-
345
- - [Rake Docs](http://rdoc.info/gems/rake/file/README.rdoc)
346
- - [Rake Tutorial](http://jasonseifer.com/2010/04/06/rake-tutorial) by Jason Seifer -- 2010, with a good overview of why Rake is useful
347
- - [Rake Tutorial](http://martinfowler.com/articles/rake.html) by Martin Fowler -- from 2005, so may lack some modernities
348
- - [Rake Tutorial](http://onestepback.org/index.cgi/Tech/Rake/Tutorial/RakeTutorialRules.red) -- from 2005, so may lack some modernities
349
-
350
- * **Rake Examples**
351
-
352
- - [resque's redis.rake](https://github.com/defunkt/resque/blob/master/lib/tasks/redis.rake) and [resque/tasks](https://github.com/defunkt/resque/blob/master/lib/resque/tasks.rb)
353
- - [rails' Rails Ties](https://github.com/rails/rails/tree/master/railties/lib/rails/tasks)
354
-
355
- * **Thor**
356
-
357
- - [Thor Wiki](https://github.com/wycats/thor/wiki)
358
- -
359
-
360
- * **Chef**
361
-
362
- - [Chef Wiki](http://wiki.opscode.com/display/chef/Home)
363
- - specifically, [Chef Resources](http://wiki.opscode.com/display/chef/Resources)
364
-
365
- * **Other**
366
-
367
- - [**Gradle**](http://gradle.org/) -- a modern take on `ant` + `maven`. The [Gradle overview](http://gradle.org/overview) states its case.
368
-
369
- <a name="refs-dataflow"></a>
370
- ### Dataflow
371
-
372
- * **Esper**
373
-
374
- - Must read: [StreamSQL Event Processing with Esper](http://www.igvita.com/2011/05/27/streamsql-event-processing-with-esper/)
375
- - [Esper docs](http://esper.codehaus.org/esper-4.5.0/doc/reference/en/html_single/index.html#epl_clauses)
376
- - [Esper EPL Reference](http://esper.codehaus.org/esper-4.5.0/doc/reference/en/html_single/index.html#epl_clauses)
377
-
378
- * **Storm**
379
-
380
- - [A Storm is coming: more details and plans for release](http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html)
381
- - [Storm: distributed and fault-tolerant realtime computation](http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation) -- slideshare presentation
382
- - [Storm: the Hadoop of Realtime Processing](http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce)
383
-
384
- * **Kafka**: LinkedIn's high-throughput messaging queue
385
-
386
- - [Kafka's Design: Why we built this](http://incubator.apache.org/kafka/design.html)
387
-
388
- * **ZeroMQ**: tcp sockets like you think they should work
389
-
390
- - [ZeroMQ: A Modern & Fast Networking Stack](http://www.igvita.com/2010/09/03/zeromq-modern-fast-networking-stack/)
391
- - [ZeroMQ Guide](http://zguide.zeromq.org/page:all)
392
- - [ZeroMQ: An Introduction](http://nichol.as/zeromq-an-introduction)
393
- - [Routing with Ruby & ZeroMQ Devices](http://www.igvita.com/2010/11/17/routing-with-ruby-zeromq-devices/)
394
- - [Ruby bindings for ZeroMQ](http://zeromq.github.com/rbzmq/) and the [Ruby-FFI bindings](http://www.zeromq.org/bindings:ruby-ffi)
395
- - [Learn ruby ZeroMQ](https://github.com/andrewvc/learn-ruby-zeromq) by @andrewvc
396
-
397
- * **Other**
398
-
399
- - [Infopipes: An abstraction for multimedia streaming](http://web.cecs.pdx.edu/~black/publications/Mms062%203rd%20try.pdf) Black et al 2002
400
- - [Yahoo Pipes](http://pipes.yahoo.com/pipes/)
401
- - [Yahoo Pipes wikipedia page](http://en.wikipedia.org/wiki/Yahoo_Pipes)
402
- - [Streambase](http://www.streambase.com/products/streambasecep/faqs/) -- Why is it so goddamn hard to find out anything real about a project once it gets an enterprise version? Seriously, the consistent fundamental brokenness of enterprise product is astonishing. It's like they take inspiration from shitty major-label band websites but layer a whiteout of [web jargon bullshit](http://www.dack.com/web/bullshit.html) in place of inessential flash animation. Anyway I think Streambase is kinda similar but who the hell can tell.
403
- - [Scribe](http://www.cloudera.com/blog/2008/11/02/configuring-and-using-scribe-for-hadoop-log-collection/)
404
- - [Splunk Case Study](http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/)
405
-
406
- <a name="refs-dataflow"></a>
407
- ### Messaging Queue
408
-
409
- - [DripDrop](https://github.com/andrewvc/dripdrop) - a message passing library with a unified API abstracting HTTP, zeroMQ and websockets.
410
-
411
-
412
- <a name="refs-dataflow"></a>
413
- ### Data Processing
414
-
415
- * **Hadoop**
416
-
417
- - [Hadoop]()
418
-
419
-
420
- * **Spark/Mesos**
421
-
422
- - [Mesos](http://www.mesosproject.org/)
188
+ yield MultiJson.dump(obj)
189
+ end
190
+ end
191
+ ```
192
+
193
+ For processors which will only run inside a data flow, you can
194
+ optimize by not doing any (de)serialization except at the very
195
+ beginning and at the end
196
+
197
+ ```ruby
198
+ Wukong.dataflow(:complicated) do
199
+ from_json > proc_1 > proc_2 > proc_3 ... proc_n > to_json
200
+ end
201
+ ```
202
+
203
+ In this approach, no serialization will be done between processors.
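The idea of parsing once at the start and serializing once at the end can be sketched with Ruby's standard `json` library. The stage lambdas (`parse`, `add_length`, `upcase_word`, `dump`) and the sample records are hypothetical; the sketch only illustrates that the intermediate stages pass Ruby hashes, not strings.

```ruby
require 'json'

# First stage: String -> Hash (deserialize once).
parse = ->(line) { JSON.parse(line) }

# Intermediate stages work on Ruby hashes and never touch JSON text.
add_length  = ->(record) { record.merge('length' => record['word'].length) }
upcase_word = ->(record) { record.merge('word' => record['word'].upcase) }

# Last stage: Hash -> String (serialize once).
dump = ->(record) { JSON.generate(record) }

stages = [parse, add_length, upcase_word, dump]

input  = ['{"word":"times"}', '{"word":"best"}']
output = input.map do |line|
  # Thread each record through every stage in order.
  stages.reduce(line) { |value, stage| stage.call(value) }
end

puts output
```

A processor meant to run on its own would instead need both the `parse` and `dump` steps inside its own `process` method.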
204
+
205
+ ### General Purpose
206
+
207
+ There are several general purpose processors which implement common
208
+ patterns on input and output data. These are most useful within the
209
+ context of a dataflow definition.
210
+
211
+ * `null` does what you think it doesn't
212
+ * `map` performs some block on each input record
213
+ * `flatten` flatten the input array
214
+ * `filter`, `select`, `reject` only let certain records through based on a block
215
+ * `regexp`, `not_regexp` only pass records matching (or not matching) a regular expression
216
+ * `limit` only let some number of records pass
217
+ * `logger` send events to the local log stream
218
+ * `extract` extract some part of each input event
219
+
220
+ Some of these widgets can be used directly, perhaps with some
221
+ arguments
222
+
223
+ ```ruby
224
+ Wukong.dataflow(:log_everything) do
225
+ proc_1 > proc_2 > ... > logger
226
+ end
227
+
228
+ Wukong.dataflow(:log_everything_important) do
229
+ proc_1 > proc_2 > ... > regexp(match: /important/i) > logger
230
+ end
231
+ ```
232
+
233
+ Other widgets require a block to define their action:
234
+
235
+ ```ruby
236
+ Wukong.dataflow(:log_everything_important) do
237
+ parser > select { |record| record.priority =~ /important/i } > logger
238
+ end
239
+ ```
240
+
241
+ ### Reducers
242
+
243
+ There is a selection of widgets that do aggregative operations like
244
+ counting, sorting, and summing.
245
+
246
+ * `count` emits a final count of all input records
247
+ * `sort` can sort input streams
248
+ * `group` will group records by some extracted part and give a count of each group's size
249
+ * `moments` will emit more complicated statistics (mean, std. dev.) on the group given some other value to measure
250
+
251
+ Here's an example of sorting data right on the command line
252
+
253
+ ```
254
+ $ head tokens.txt | wu-local sort
255
+ abhor
256
+ abide
257
+ abide
258
+ able
259
+ able
260
+ able
261
+ about
262
+ ...
263
+ ```
264
+
265
+ Try adding group:
266
+
267
+ ```
268
+ $ head tokens.txt | wu-local sort | wu-local group
269
+ {:group=>"abhor", :count=>1}
270
+ {:group=>"abide", :count=>2}
271
+ {:group=>"able", :count=>3}
272
+ {:group=>"about", :count=>3}
273
+ {:group=>"above", :count=>1}
274
+ ...
275
+ ```
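The reason `sort` comes before `group` in the pipeline above can be sketched in plain Ruby, under the assumption that grouping counts runs of consecutive equal records (much like `uniq -c` in the shell). The token list is made up for illustration, and this is not Wukong's own `group` implementation.

```ruby
# Hypothetical unsorted tokens.
tokens = %w[able abide about abhor able abide able]

# Sorting places equal records next to each other.
sorted = tokens.sort

# Counting consecutive runs then yields one entry per distinct record.
groups = sorted
  .chunk_while { |a, b| a == b }
  .map { |run| { group: run.first, count: run.size } }

groups.each { |g| puts g.inspect }
```

Without the sort, a run-based grouping would report the same token several times, once for each separate run.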
276
+
277
+ You can also use these within a more complicated dataflow:
278
+
279
+ ```ruby
280
+ Wukong.dataflow(:word_count) do
281
+ tokenize > remove_stopwords > sort > group
282
+ end
283
+ ```