wukong 3.0.0.pre → 3.0.0.pre2

Files changed (476)
  1. data/.gitignore +46 -33
  2. data/.gitmodules +3 -0
  3. data/.rspec +1 -1
  4. data/.travis.yml +8 -1
  5. data/.yardopts +0 -13
  6. data/Guardfile +4 -6
  7. data/{LICENSE.textile → LICENSE.md} +43 -55
  8. data/README-old.md +422 -0
  9. data/README.md +279 -418
  10. data/Rakefile +21 -5
  11. data/TODO.md +6 -6
  12. data/bin/wu-clean-encoding +31 -0
  13. data/bin/wu-lign +2 -2
  14. data/bin/wu-local +69 -0
  15. data/bin/wu-server +70 -0
  16. data/examples/Gemfile +38 -0
  17. data/examples/README.md +9 -0
  18. data/examples/dataflow/apache_log_line.rb +64 -25
  19. data/examples/dataflow/fibonacci_series.rb +101 -0
  20. data/examples/dataflow/parse_apache_logs.rb +37 -7
  21. data/examples/{dataflow.rb → dataflow/scraper_macro_flow.rb} +0 -0
  22. data/examples/dataflow/simple.rb +4 -4
  23. data/examples/geo.rb +4 -0
  24. data/examples/geo/geo_grids.numbers +0 -0
  25. data/examples/geo/geolocated.rb +331 -0
  26. data/examples/geo/quadtile.rb +69 -0
  27. data/examples/geo/spec/geolocated_spec.rb +247 -0
  28. data/examples/geo/tile_fetcher.rb +77 -0
  29. data/examples/graph/minimum_spanning_tree.rb +61 -61
  30. data/examples/jabberwocky.txt +36 -0
  31. data/examples/models/wikipedia.rb +20 -0
  32. data/examples/munging/Gemfile +8 -0
  33. data/examples/munging/airline_flights/airline.rb +57 -0
  34. data/examples/munging/airline_flights/airline_flights.rake +83 -0
  35. data/{lib/wukong/settings.rb → examples/munging/airline_flights/airplane.rb} +0 -0
  36. data/examples/munging/airline_flights/airport.rb +211 -0
  37. data/examples/munging/airline_flights/airport_id_unification.rb +129 -0
  38. data/examples/munging/airline_flights/airport_ok_chars.rb +4 -0
  39. data/examples/munging/airline_flights/flight.rb +156 -0
  40. data/examples/munging/airline_flights/models.rb +4 -0
  41. data/examples/munging/airline_flights/parse.rb +26 -0
  42. data/examples/munging/airline_flights/reconcile_airports.rb +142 -0
  43. data/examples/munging/airline_flights/route.rb +35 -0
  44. data/examples/munging/airline_flights/tasks.rake +83 -0
  45. data/examples/munging/airline_flights/timezone_fixup.rb +62 -0
  46. data/examples/munging/airline_flights/topcities.rb +167 -0
  47. data/examples/munging/airports/40_wbans.txt +40 -0
  48. data/examples/munging/airports/filter_weather_reports.rb +37 -0
  49. data/examples/munging/airports/join.pig +31 -0
  50. data/examples/munging/airports/to_tsv.rb +33 -0
  51. data/examples/munging/airports/usa_wbans.pig +19 -0
  52. data/examples/munging/airports/usa_wbans.txt +2157 -0
  53. data/examples/munging/airports/wbans.pig +19 -0
  54. data/examples/munging/airports/wbans.txt +2310 -0
  55. data/examples/munging/geo/geo_json.rb +54 -0
  56. data/examples/munging/geo/geo_models.rb +69 -0
  57. data/examples/munging/geo/geonames_models.rb +78 -0
  58. data/examples/munging/geo/iso_codes.rb +172 -0
  59. data/examples/munging/geo/reconcile_countries.rb +124 -0
  60. data/examples/munging/geo/tasks.rake +71 -0
  61. data/examples/munging/rake_helper.rb +62 -0
  62. data/examples/munging/weather/.gitignore +1 -0
  63. data/examples/munging/weather/Gemfile +4 -0
  64. data/examples/munging/weather/Rakefile +28 -0
  65. data/examples/munging/weather/extract_ish.rb +13 -0
  66. data/examples/munging/weather/models/weather.rb +119 -0
  67. data/examples/munging/weather/utils/noaa_downloader.rb +46 -0
  68. data/examples/munging/wikipedia/README.md +34 -0
  69. data/examples/munging/wikipedia/Rakefile +193 -0
  70. data/examples/munging/wikipedia/articles/extract_articles-parsed.rb +79 -0
  71. data/examples/munging/wikipedia/articles/extract_articles-templated.rb +136 -0
  72. data/examples/munging/wikipedia/articles/textualize_articles.rb +54 -0
  73. data/examples/munging/wikipedia/articles/verify_structure.rb +43 -0
  74. data/examples/munging/wikipedia/articles/wp2txt-LICENSE.txt +22 -0
  75. data/examples/munging/wikipedia/articles/wp2txt_article.rb +259 -0
  76. data/examples/munging/wikipedia/articles/wp2txt_utils.rb +452 -0
  77. data/examples/munging/wikipedia/dbpedia/dbpedia_common.rb +4 -0
  78. data/examples/munging/wikipedia/dbpedia/dbpedia_extract_geocoordinates.rb +78 -0
  79. data/examples/munging/wikipedia/dbpedia/extract_links.rb +193 -0
  80. data/examples/munging/wikipedia/dbpedia/sameas_extractor.rb +20 -0
  81. data/examples/munging/wikipedia/n1_subuniverse/n1_nodes.pig +18 -0
  82. data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb +21 -0
  83. data/examples/munging/wikipedia/page_metadata/extract_page_metadata.rb.old +27 -0
  84. data/examples/munging/wikipedia/pagelinks/augment_pagelinks.pig +29 -0
  85. data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb +14 -0
  86. data/examples/munging/wikipedia/pagelinks/extract_pagelinks.rb.old +25 -0
  87. data/examples/munging/wikipedia/pagelinks/undirect_pagelinks.pig +29 -0
  88. data/examples/munging/wikipedia/pageviews/augment_pageviews.pig +32 -0
  89. data/examples/munging/wikipedia/pageviews/extract_pageviews.rb +85 -0
  90. data/examples/munging/wikipedia/pig_style_guide.md +25 -0
  91. data/examples/munging/wikipedia/redirects/redirects_page_metadata.pig +19 -0
  92. data/examples/munging/wikipedia/subuniverse/sub_articles.pig +23 -0
  93. data/examples/munging/wikipedia/subuniverse/sub_page_metadata.pig +24 -0
  94. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_from.pig +22 -0
  95. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_into.pig +22 -0
  96. data/examples/munging/wikipedia/subuniverse/sub_pagelinks_within.pig +26 -0
  97. data/examples/munging/wikipedia/subuniverse/sub_pageviews.pig +29 -0
  98. data/examples/munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig +24 -0
  99. data/examples/munging/wikipedia/utils/get_namespaces.rb +86 -0
  100. data/examples/munging/wikipedia/utils/munging_utils.rb +68 -0
  101. data/examples/munging/wikipedia/utils/namespaces.json +1 -0
  102. data/examples/rake_helper.rb +85 -0
  103. data/examples/server_logs/geo_ip_mapping/munge_geolite.rb +82 -0
  104. data/examples/server_logs/logline.rb +95 -0
  105. data/examples/server_logs/models.rb +66 -0
  106. data/examples/server_logs/page_counts.pig +48 -0
  107. data/examples/server_logs/server_logs-01-parse-script.rb +13 -0
  108. data/examples/server_logs/server_logs-02-histograms-full.rb +33 -0
  109. data/examples/server_logs/server_logs-02-histograms-mapper.rb +14 -0
  110. data/{old/examples/server_logs/breadcrumbs.rb → examples/server_logs/server_logs-03-breadcrumbs-full.rb} +26 -30
  111. data/examples/server_logs/server_logs-04-page_page_edges-full.rb +40 -0
  112. data/examples/string_reverser.rb +26 -0
  113. data/examples/text/pig_latin.rb +2 -2
  114. data/examples/text/regional_flavor/README.md +14 -0
  115. data/examples/text/regional_flavor/article_wordbags.pig +39 -0
  116. data/examples/text/regional_flavor/j01-article_wordbags.rb +4 -0
  117. data/examples/text/regional_flavor/simple_pig_script.pig +27 -0
  118. data/examples/word_count/accumulator.rb +26 -0
  119. data/examples/word_count/tokenizer.rb +13 -0
  120. data/examples/word_count/word_count.rb +6 -0
  121. data/examples/workflow/cherry_pie.dot +97 -0
  122. data/examples/workflow/cherry_pie.png +0 -0
  123. data/examples/workflow/cherry_pie.rb +61 -26
  124. data/lib/hanuman.rb +34 -7
  125. data/lib/hanuman/graph.rb +55 -31
  126. data/lib/hanuman/graphvizzer.rb +199 -178
  127. data/lib/hanuman/graphvizzer/gv_models.rb +161 -0
  128. data/lib/hanuman/graphvizzer/gv_presenter.rb +97 -0
  129. data/lib/hanuman/link.rb +35 -0
  130. data/lib/hanuman/registry.rb +46 -0
  131. data/lib/hanuman/stage.rb +76 -32
  132. data/lib/wukong.rb +23 -24
  133. data/lib/wukong/boot.rb +87 -0
  134. data/lib/wukong/configuration.rb +8 -0
  135. data/lib/wukong/dataflow.rb +45 -78
  136. data/lib/wukong/driver.rb +99 -0
  137. data/lib/wukong/emitter.rb +22 -0
  138. data/lib/wukong/model/faker.rb +24 -24
  139. data/lib/wukong/model/flatpack_parser/flat.rb +60 -0
  140. data/lib/wukong/model/flatpack_parser/flatpack.rb +4 -0
  141. data/lib/wukong/model/flatpack_parser/lang.rb +46 -0
  142. data/lib/wukong/model/flatpack_parser/parser.rb +55 -0
  143. data/lib/wukong/model/flatpack_parser/tokens.rb +130 -0
  144. data/lib/wukong/processor.rb +60 -114
  145. data/lib/wukong/spec_helpers.rb +81 -0
  146. data/lib/wukong/spec_helpers/integration_driver.rb +144 -0
  147. data/lib/wukong/spec_helpers/integration_driver_matchers.rb +219 -0
  148. data/lib/wukong/spec_helpers/processor_helpers.rb +95 -0
  149. data/lib/wukong/spec_helpers/processor_methods.rb +108 -0
  150. data/lib/wukong/spec_helpers/shared_examples.rb +15 -0
  151. data/lib/wukong/spec_helpers/spec_driver.rb +28 -0
  152. data/lib/wukong/spec_helpers/spec_driver_matchers.rb +195 -0
  153. data/lib/wukong/version.rb +2 -1
  154. data/lib/wukong/widget/filters.rb +311 -0
  155. data/lib/wukong/widget/processors.rb +156 -0
  156. data/lib/wukong/widget/reducers.rb +7 -0
  157. data/lib/wukong/widget/reducers/accumulator.rb +73 -0
  158. data/lib/wukong/widget/reducers/bin.rb +318 -0
  159. data/lib/wukong/widget/reducers/count.rb +61 -0
  160. data/lib/wukong/widget/reducers/group.rb +85 -0
  161. data/lib/wukong/widget/reducers/group_concat.rb +70 -0
  162. data/lib/wukong/widget/reducers/moments.rb +72 -0
  163. data/lib/wukong/widget/reducers/sort.rb +130 -0
  164. data/lib/wukong/widget/serializers.rb +287 -0
  165. data/lib/wukong/widget/sink.rb +10 -52
  166. data/lib/wukong/widget/source.rb +7 -113
  167. data/lib/wukong/widget/utils.rb +46 -0
  168. data/lib/wukong/widgets.rb +6 -0
  169. data/spec/examples/dataflow/fibonacci_series_spec.rb +18 -0
  170. data/spec/examples/dataflow/parsing_spec.rb +12 -11
  171. data/spec/examples/dataflow/simple_spec.rb +32 -6
  172. data/spec/examples/dataflow/telegram_spec.rb +36 -36
  173. data/spec/examples/graph/minimum_spanning_tree_spec.rb +30 -31
  174. data/spec/examples/munging/airline_flights/identifiers_spec.rb +16 -0
  175. data/spec/examples/munging/airline_flights_spec.rb +202 -0
  176. data/spec/examples/text/pig_latin_spec.rb +13 -16
  177. data/spec/examples/workflow/cherry_pie_spec.rb +34 -4
  178. data/spec/hanuman/graph_spec.rb +27 -2
  179. data/spec/hanuman/hanuman_spec.rb +10 -0
  180. data/spec/hanuman/registry_spec.rb +123 -0
  181. data/spec/hanuman/stage_spec.rb +61 -7
  182. data/spec/spec_helper.rb +29 -19
  183. data/spec/support/hanuman_test_helpers.rb +14 -12
  184. data/spec/support/shared_context_for_reducers.rb +37 -0
  185. data/spec/support/shared_examples_for_builders.rb +101 -0
  186. data/spec/support/shared_examples_for_shortcuts.rb +57 -0
  187. data/spec/support/wukong_test_helpers.rb +37 -11
  188. data/spec/wukong/dataflow_spec.rb +77 -55
  189. data/spec/wukong/local_runner_spec.rb +24 -24
  190. data/spec/wukong/model/faker_spec.rb +132 -131
  191. data/spec/wukong/runner_spec.rb +8 -8
  192. data/spec/wukong/widget/filters_spec.rb +61 -0
  193. data/spec/wukong/widget/processors_spec.rb +126 -0
  194. data/spec/wukong/widget/reducers/bin_spec.rb +92 -0
  195. data/spec/wukong/widget/reducers/count_spec.rb +11 -0
  196. data/spec/wukong/widget/reducers/group_spec.rb +20 -0
  197. data/spec/wukong/widget/reducers/moments_spec.rb +36 -0
  198. data/spec/wukong/widget/reducers/sort_spec.rb +26 -0
  199. data/spec/wukong/widget/serializers_spec.rb +92 -0
  200. data/spec/wukong/widget/sink_spec.rb +15 -15
  201. data/spec/wukong/widget/source_spec.rb +65 -41
  202. data/spec/wukong/wukong_spec.rb +10 -0
  203. data/wukong.gemspec +17 -10
  204. metadata +359 -335
  205. data/.document +0 -5
  206. data/VERSION +0 -1
  207. data/bin/hdp-bin +0 -44
  208. data/bin/hdp-bzip +0 -23
  209. data/bin/hdp-cat +0 -3
  210. data/bin/hdp-catd +0 -3
  211. data/bin/hdp-cp +0 -3
  212. data/bin/hdp-du +0 -86
  213. data/bin/hdp-get +0 -3
  214. data/bin/hdp-kill +0 -3
  215. data/bin/hdp-kill-task +0 -3
  216. data/bin/hdp-ls +0 -11
  217. data/bin/hdp-mkdir +0 -2
  218. data/bin/hdp-mkdirp +0 -12
  219. data/bin/hdp-mv +0 -3
  220. data/bin/hdp-parts_to_keys.rb +0 -77
  221. data/bin/hdp-ps +0 -3
  222. data/bin/hdp-put +0 -3
  223. data/bin/hdp-rm +0 -32
  224. data/bin/hdp-sort +0 -40
  225. data/bin/hdp-stream +0 -40
  226. data/bin/hdp-stream-flat +0 -22
  227. data/bin/hdp-stream2 +0 -39
  228. data/bin/hdp-sync +0 -17
  229. data/bin/hdp-wc +0 -67
  230. data/bin/wu-flow +0 -10
  231. data/bin/wu-map +0 -17
  232. data/bin/wu-red +0 -17
  233. data/bin/wukong +0 -17
  234. data/data/CREDITS.md +0 -355
  235. data/data/graph/airfares.tsv +0 -2174
  236. data/data/text/gift_of_the_magi.txt +0 -225
  237. data/data/text/jabberwocky.txt +0 -36
  238. data/data/text/rectification_of_names.txt +0 -33
  239. data/data/twitter/a_atsigns_b.tsv +0 -64
  240. data/data/twitter/a_follows_b.tsv +0 -53
  241. data/data/twitter/tweet.tsv +0 -167
  242. data/data/twitter/twitter_user.tsv +0 -55
  243. data/data/wikipedia/dbpedia-sentences.tsv +0 -1000
  244. data/docpages/INSTALL.textile +0 -92
  245. data/docpages/LICENSE.textile +0 -107
  246. data/docpages/README-elastic_map_reduce.textile +0 -377
  247. data/docpages/README-performance.textile +0 -90
  248. data/docpages/README-wulign.textile +0 -65
  249. data/docpages/UsingWukong-part1-get_ready.textile +0 -17
  250. data/docpages/UsingWukong-part2-ThinkingBigData.textile +0 -75
  251. data/docpages/UsingWukong-part3-parsing.textile +0 -138
  252. data/docpages/_config.yml +0 -39
  253. data/docpages/avro/avro_notes.textile +0 -56
  254. data/docpages/avro/performance.textile +0 -36
  255. data/docpages/avro/tethering.textile +0 -19
  256. data/docpages/bigdata-tips.textile +0 -143
  257. data/docpages/code/api_response_example.txt +0 -20
  258. data/docpages/code/parser_skeleton.rb +0 -38
  259. data/docpages/diagrams/MapReduceDiagram.graffle +0 -0
  260. data/docpages/favicon.ico +0 -0
  261. data/docpages/gem.css +0 -16
  262. data/docpages/hadoop-tips.textile +0 -83
  263. data/docpages/index.textile +0 -92
  264. data/docpages/intro.textile +0 -8
  265. data/docpages/moreinfo.textile +0 -174
  266. data/docpages/news.html +0 -24
  267. data/docpages/pig/PigLatinExpressionsList.txt +0 -122
  268. data/docpages/pig/PigLatinReferenceManual.txt +0 -1640
  269. data/docpages/pig/commandline_params.txt +0 -26
  270. data/docpages/pig/cookbook.html +0 -481
  271. data/docpages/pig/images/hadoop-logo.jpg +0 -0
  272. data/docpages/pig/images/instruction_arrow.png +0 -0
  273. data/docpages/pig/images/pig-logo.gif +0 -0
  274. data/docpages/pig/piglatin_ref1.html +0 -1103
  275. data/docpages/pig/piglatin_ref2.html +0 -14340
  276. data/docpages/pig/setup.html +0 -505
  277. data/docpages/pig/skin/basic.css +0 -166
  278. data/docpages/pig/skin/breadcrumbs.js +0 -237
  279. data/docpages/pig/skin/fontsize.js +0 -166
  280. data/docpages/pig/skin/getBlank.js +0 -40
  281. data/docpages/pig/skin/getMenu.js +0 -45
  282. data/docpages/pig/skin/images/chapter.gif +0 -0
  283. data/docpages/pig/skin/images/chapter_open.gif +0 -0
  284. data/docpages/pig/skin/images/current.gif +0 -0
  285. data/docpages/pig/skin/images/external-link.gif +0 -0
  286. data/docpages/pig/skin/images/header_white_line.gif +0 -0
  287. data/docpages/pig/skin/images/page.gif +0 -0
  288. data/docpages/pig/skin/images/pdfdoc.gif +0 -0
  289. data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
  290. data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
  291. data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  292. data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
  293. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
  294. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  295. data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
  296. data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
  297. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  298. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  299. data/docpages/pig/skin/print.css +0 -54
  300. data/docpages/pig/skin/profile.css +0 -181
  301. data/docpages/pig/skin/screen.css +0 -587
  302. data/docpages/pig/tutorial.html +0 -1059
  303. data/docpages/pig/udf.html +0 -1509
  304. data/docpages/tutorial.textile +0 -283
  305. data/docpages/usage.textile +0 -195
  306. data/docpages/wutils.textile +0 -263
  307. data/examples/dataflow/complex.rb +0 -11
  308. data/examples/dataflow/donuts.rb +0 -13
  309. data/examples/tiny_count/jabberwocky_output.tsv +0 -92
  310. data/examples/word_count.rb +0 -48
  311. data/examples/workflow/fiddle.rb +0 -24
  312. data/lib/away/escapement.rb +0 -129
  313. data/lib/away/exe.rb +0 -11
  314. data/lib/away/experimental.rb +0 -5
  315. data/lib/away/from_file.rb +0 -52
  316. data/lib/away/job.rb +0 -56
  317. data/lib/away/job/rake_compat.rb +0 -17
  318. data/lib/away/registry.rb +0 -79
  319. data/lib/away/runner.rb +0 -276
  320. data/lib/away/runner/execute.rb +0 -121
  321. data/lib/away/script.rb +0 -161
  322. data/lib/away/script/hadoop_command.rb +0 -240
  323. data/lib/away/source/file_list_source.rb +0 -15
  324. data/lib/away/source/looper.rb +0 -18
  325. data/lib/away/task.rb +0 -219
  326. data/lib/hanuman/action.rb +0 -21
  327. data/lib/hanuman/chain.rb +0 -4
  328. data/lib/hanuman/graphviz.rb +0 -74
  329. data/lib/hanuman/resource.rb +0 -6
  330. data/lib/hanuman/slot.rb +0 -87
  331. data/lib/hanuman/slottable.rb +0 -220
  332. data/lib/wukong/bad_record.rb +0 -15
  333. data/lib/wukong/event.rb +0 -44
  334. data/lib/wukong/local_runner.rb +0 -55
  335. data/lib/wukong/mapred.rb +0 -3
  336. data/lib/wukong/universe.rb +0 -48
  337. data/lib/wukong/widget/filter.rb +0 -81
  338. data/lib/wukong/widget/gibberish.rb +0 -123
  339. data/lib/wukong/widget/monitor.rb +0 -26
  340. data/lib/wukong/widget/reducer.rb +0 -66
  341. data/lib/wukong/widget/stringifier.rb +0 -50
  342. data/lib/wukong/workflow.rb +0 -22
  343. data/lib/wukong/workflow/command.rb +0 -42
  344. data/old/config/emr-example.yaml +0 -48
  345. data/old/examples/README.txt +0 -17
  346. data/old/examples/contrib/jeans/README.markdown +0 -165
  347. data/old/examples/contrib/jeans/data/normalized_sizes +0 -3
  348. data/old/examples/contrib/jeans/data/orders.tsv +0 -1302
  349. data/old/examples/contrib/jeans/data/sizes +0 -3
  350. data/old/examples/contrib/jeans/normalize.rb +0 -20
  351. data/old/examples/contrib/jeans/sizes.rb +0 -55
  352. data/old/examples/corpus/bnc_word_freq.rb +0 -44
  353. data/old/examples/corpus/bucket_counter.rb +0 -47
  354. data/old/examples/corpus/dbpedia_abstract_to_sentences.rb +0 -86
  355. data/old/examples/corpus/sentence_bigrams.rb +0 -53
  356. data/old/examples/corpus/sentence_coocurrence.rb +0 -66
  357. data/old/examples/corpus/stopwords.rb +0 -138
  358. data/old/examples/corpus/words_to_bigrams.rb +0 -53
  359. data/old/examples/emr/README.textile +0 -110
  360. data/old/examples/emr/dot_wukong_dir/credentials.json +0 -7
  361. data/old/examples/emr/dot_wukong_dir/emr.yaml +0 -69
  362. data/old/examples/emr/dot_wukong_dir/emr_bootstrap.sh +0 -33
  363. data/old/examples/emr/elastic_mapreduce_example.rb +0 -28
  364. data/old/examples/network_graph/adjacency_list.rb +0 -74
  365. data/old/examples/network_graph/breadth_first_search.rb +0 -72
  366. data/old/examples/network_graph/gen_2paths.rb +0 -68
  367. data/old/examples/network_graph/gen_multi_edge.rb +0 -112
  368. data/old/examples/network_graph/gen_symmetric_links.rb +0 -64
  369. data/old/examples/pagerank/README.textile +0 -6
  370. data/old/examples/pagerank/gen_initial_pagerank_graph.pig +0 -57
  371. data/old/examples/pagerank/pagerank.rb +0 -72
  372. data/old/examples/pagerank/pagerank_initialize.rb +0 -42
  373. data/old/examples/pagerank/run_pagerank.sh +0 -21
  374. data/old/examples/sample_records.rb +0 -33
  375. data/old/examples/server_logs/apache_log_parser.rb +0 -15
  376. data/old/examples/server_logs/nook.rb +0 -48
  377. data/old/examples/server_logs/nook/faraday_dummy_adapter.rb +0 -94
  378. data/old/examples/server_logs/user_agent.rb +0 -40
  379. data/old/examples/simple_word_count.rb +0 -82
  380. data/old/examples/size.rb +0 -61
  381. data/old/examples/stats/avg_value_frequency.rb +0 -86
  382. data/old/examples/stats/binning_percentile_estimator.rb +0 -140
  383. data/old/examples/stats/data/avg_value_frequency.tsv +0 -3
  384. data/old/examples/stats/rank_and_bin.rb +0 -173
  385. data/old/examples/stupidly_simple_filter.rb +0 -40
  386. data/old/examples/word_count.rb +0 -75
  387. data/old/graph/graphviz_builder.rb +0 -580
  388. data/old/graph_easy/Attributes.pm +0 -4181
  389. data/old/graph_easy/Graphviz.pm +0 -2232
  390. data/old/wukong.rb +0 -18
  391. data/old/wukong/and_pig.rb +0 -38
  392. data/old/wukong/bad_record.rb +0 -18
  393. data/old/wukong/datatypes.rb +0 -24
  394. data/old/wukong/datatypes/enum.rb +0 -127
  395. data/old/wukong/datatypes/fake_types.rb +0 -17
  396. data/old/wukong/decorator.rb +0 -28
  397. data/old/wukong/encoding/asciize.rb +0 -108
  398. data/old/wukong/extensions.rb +0 -16
  399. data/old/wukong/extensions/array.rb +0 -18
  400. data/old/wukong/extensions/blank.rb +0 -93
  401. data/old/wukong/extensions/class.rb +0 -189
  402. data/old/wukong/extensions/date_time.rb +0 -53
  403. data/old/wukong/extensions/emittable.rb +0 -69
  404. data/old/wukong/extensions/enumerable.rb +0 -79
  405. data/old/wukong/extensions/hash.rb +0 -167
  406. data/old/wukong/extensions/hash_keys.rb +0 -16
  407. data/old/wukong/extensions/hash_like.rb +0 -150
  408. data/old/wukong/extensions/hashlike_class.rb +0 -47
  409. data/old/wukong/extensions/module.rb +0 -2
  410. data/old/wukong/extensions/pathname.rb +0 -27
  411. data/old/wukong/extensions/string.rb +0 -65
  412. data/old/wukong/extensions/struct.rb +0 -17
  413. data/old/wukong/extensions/symbol.rb +0 -11
  414. data/old/wukong/filename_pattern.rb +0 -74
  415. data/old/wukong/helper.rb +0 -7
  416. data/old/wukong/helper/stopwords.rb +0 -195
  417. data/old/wukong/helper/tokenize.rb +0 -35
  418. data/old/wukong/logger.rb +0 -38
  419. data/old/wukong/periodic_monitor.rb +0 -72
  420. data/old/wukong/schema.rb +0 -269
  421. data/old/wukong/script.rb +0 -286
  422. data/old/wukong/script/avro_command.rb +0 -5
  423. data/old/wukong/script/cassandra_loader_script.rb +0 -40
  424. data/old/wukong/script/emr_command.rb +0 -168
  425. data/old/wukong/script/hadoop_command.rb +0 -237
  426. data/old/wukong/script/local_command.rb +0 -41
  427. data/old/wukong/store.rb +0 -10
  428. data/old/wukong/store/base.rb +0 -27
  429. data/old/wukong/store/cassandra.rb +0 -10
  430. data/old/wukong/store/cassandra/streaming.rb +0 -75
  431. data/old/wukong/store/cassandra/struct_loader.rb +0 -21
  432. data/old/wukong/store/cassandra_model.rb +0 -91
  433. data/old/wukong/store/chh_chunked_flat_file_store.rb +0 -37
  434. data/old/wukong/store/chunked_flat_file_store.rb +0 -48
  435. data/old/wukong/store/conditional_store.rb +0 -57
  436. data/old/wukong/store/factory.rb +0 -8
  437. data/old/wukong/store/flat_file_store.rb +0 -89
  438. data/old/wukong/store/key_store.rb +0 -51
  439. data/old/wukong/store/null_store.rb +0 -15
  440. data/old/wukong/store/read_thru_store.rb +0 -22
  441. data/old/wukong/store/tokyo_tdb_key_store.rb +0 -33
  442. data/old/wukong/store/tyrant_rdb_key_store.rb +0 -57
  443. data/old/wukong/store/tyrant_tdb_key_store.rb +0 -20
  444. data/old/wukong/streamer.rb +0 -30
  445. data/old/wukong/streamer/accumulating_reducer.rb +0 -83
  446. data/old/wukong/streamer/base.rb +0 -126
  447. data/old/wukong/streamer/counting_reducer.rb +0 -25
  448. data/old/wukong/streamer/filter.rb +0 -20
  449. data/old/wukong/streamer/instance_streamer.rb +0 -15
  450. data/old/wukong/streamer/json_streamer.rb +0 -21
  451. data/old/wukong/streamer/line_streamer.rb +0 -12
  452. data/old/wukong/streamer/list_reducer.rb +0 -31
  453. data/old/wukong/streamer/rank_and_bin_reducer.rb +0 -145
  454. data/old/wukong/streamer/record_streamer.rb +0 -14
  455. data/old/wukong/streamer/reducer.rb +0 -11
  456. data/old/wukong/streamer/set_reducer.rb +0 -14
  457. data/old/wukong/streamer/struct_streamer.rb +0 -48
  458. data/old/wukong/streamer/summing_reducer.rb +0 -29
  459. data/old/wukong/streamer/uniq_by_last_reducer.rb +0 -51
  460. data/old/wukong/typed_struct.rb +0 -12
  461. data/spec/away/encoding_spec.rb +0 -32
  462. data/spec/away/exe_spec.rb +0 -20
  463. data/spec/away/flow_spec.rb +0 -82
  464. data/spec/away/graph_spec.rb +0 -6
  465. data/spec/away/job_spec.rb +0 -15
  466. data/spec/away/rake_compat_spec.rb +0 -9
  467. data/spec/away/script_spec.rb +0 -81
  468. data/spec/hanuman/graphviz_spec.rb +0 -29
  469. data/spec/hanuman/slot_spec.rb +0 -2
  470. data/spec/support/examples_helper.rb +0 -10
  471. data/spec/support/streamer_test_helpers.rb +0 -6
  472. data/spec/support/wukong_widget_helpers.rb +0 -66
  473. data/spec/wukong/processor_spec.rb +0 -109
  474. data/spec/wukong/widget/filter_spec.rb +0 -99
  475. data/spec/wukong/widget/stringifier_spec.rb +0 -51
  476. data/spec/wukong/workflow/command_spec.rb +0 -5
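
The deleted `Gibberish` widget in the hunk below draws characters by inverse-CDF lookup: each table row is a `[cumulative_probability, symbol]` pair, and a uniform random offset selects the first row whose cumulative value exceeds it. The following is a minimal standalone sketch of that idea, not the widget's code. `TOY_BIGRAMS`, `sample_from_cdf`, and `toy_word` are hypothetical names, and the probabilities are made up. Unlike `funny_charseq` as shown in the hunk, where `char` is not reassigned inside the loop, this sketch advances the chain state on every step.

```ruby
# Toy table in the same shape as BIGRAM_CDF_HSH: each key maps to
# [cumulative_probability, next_symbol] pairs in ascending order.
# '^' marks start-of-word and '$' end-of-word. The values are made up.
TOY_BIGRAMS = {
  "^" => [[1.0, "a"]],
  "a" => [[0.5, "b"], [1.0, "$"]],
  "b" => [[1.0, "$"]],
}.freeze

# Inverse-CDF lookup: return the symbol of the first pair whose cumulative
# probability exceeds offset (a float in [0, 1)), or nil if none does.
def sample_from_cdf(cdf, offset)
  pair = cdf.find { |cumulative, _symbol| offset < cumulative }
  pair && pair.last
end

# Walk the chain from '^', drawing each symbol from the table of the
# previous one, until '$' is drawn or max_steps transitions have been made.
def toy_word(bigrams, max_steps, rng = Random.new)
  char = sample_from_cdf(bigrams["^"], rng.rand)
  word = [char]
  max_steps.times do
    next_char = sample_from_cdf(bigrams[char], rng.rand)
    break if next_char.nil? || next_char == "$"
    word << next_char
    char = next_char # advance the chain state
  end
  word.join
end

puts toy_word(TOY_BIGRAMS, 5) # prints "a" or "ab" with this toy table
```

Passing the generator in as an argument keeps the sketch testable: any object that responds to `rand` with a float in [0, 1) can stand in for `Random.new`.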
@@ -1,123 +0,0 @@
- module Wukong
-   module Widget
-     #
-     # Uses bigram frequencies taken from the British National Corpus
-     #
-     class Gibberish < Wukong::Source
-       register_source
-       include CappedGenerator
-
-       field :min_wordlen, Integer, :default => 3, :doc => "Shortest word length to generate"
-       field :max_wordlen, Integer, :default => 10, :doc => "Largest word length to generate"
-       field :max_linelen, Integer, :default => 80, :doc => "Max line length to generate; might be shorter by up to `min_wordlen` chars."
-       attr_accessor :rng
-
-       def setup
-         self.rng = Random.new
-         super
-       end
-
-       def next_item
-         funny_line
-       end
-
-       def random_from_cdf cdf, except=[]
-         offset = rng.rand
-         cdf.find{|freq, c| (offset < freq) && (not except.include?(c)) }.last
-       end
-
-       def funny_charseq(len = 5)
-         char = random_from_cdf(BIGRAM_CDF_HSH['^'])
-         word = [char]
-         len.times do
-           next_char = random_from_cdf(BIGRAM_CDF_HSH[char], [])
-           break if next_char == '$'
-           word << next_char
-         end
-         word.join
-       end
-
-       def funny_word(cap)
-         cap = [max_wordlen, cap].min
-         word = ''
-         100.times do
-           word = funny_charseq(cap)
-           return word if word.length >= min_wordlen && word.length <= cap
-         end
-         word
-       end
-
-       def funny_line
-         min_wordlen = 3
-         min_linelen = max_linelen - min_wordlen
-         line = ""
-         max_linelen.times do
-           line << funny_word(max_linelen-line.length) << " "
-           return line.strip if line.strip.length > min_linelen
-         end
-       end
-
-       UNIGRAM_CDF = [
-         [0.1083564980, "^"],
-         [0.2034715900, "e"],
-         [0.2836505016, "a"],
-         [0.3608587709, "i"],
-         [0.4254609572, "r"],
-         [0.4878711345, "n"],
-         [0.5491446929, "o"],
-         [0.6081589386, "t"],
-         [0.6604080108, "s"],
-         [0.7104964048, "l"],
-         [0.7482015140, "c"],
-         [0.7807004566, "u"],
-         [0.8107323736, "d"],
-         [0.8387591070, "m"],
-         [0.8645706232, "h"],
-         [0.8900984199, "p"],
-         [0.9124467010, "g"],
-         [0.9302838994, "y"],
-         [0.9480503129, "b"],
-         [0.9604887171, "f"],
-         [0.9713402644, "k"],
-         [0.9806072759, "v"],
-         [0.9890852104, "w"],
-         [0.9930607653, "z"],
-         [0.9962141717, "x"],
-         [0.9983324949, "j"],
-         [1.0000000000, "q"],
-       ] # UNIGRAM_CDF
-
-       BIGRAM_CDF_HSH = {
- '^' => [ [0.1036218381, "s"], [0.1906712502, "c"], [0.2682537643, "p"], [0.3317216047, "a"], [0.3936580351, "m"], [0.4519587055, "b"], [0.5062006040, "d"], [0.5582364154, "t"], [0.6060742359, "r"], [0.6465762814, "h"], [0.6851613871, "f"], [0.7236340466, "e"], [0.7599541648, "g"], [0.7954175502, "i"], [0.8298153741, "l"], [0.8625318598, "u"], [0.8880517895, "o"], [0.9135503009, "w"], [0.9378761593, "n"], [0.9575114053, "k"], [0.9749726916, "v"], [0.9844556534, "j"], [0.9891783932, "y"], [0.9938957784, "q"], [0.9981580244, "z"], [1.0000000000, "x"], ],
- 'a' => [ [0.1475494063, "n"], [0.2731292197, "t"], [0.3885347092, "l"], [0.5028764536, "r"], [0.5712094131, "$"], [0.6238394686, "s"], [0.6733868342, "c"], [0.7157919112, "m"], [0.7521546266, "b"], [0.7858181792, "d"], [0.8182153686, "i"], [0.8462128503, "p"], [0.8732913142, "g"], [0.8945879254, "u"], [0.9098711204, "y"], [0.9241484612, "k"], [0.9383968565, "v"], [0.9508578706, "e"], [0.9609453582, "f"], [0.9688908829, "w"], [0.9766844440, "h"], [0.9836313508, "z"], [0.9897822579, "a"], [0.9943628746, "x"], [0.9964758921, "o"], [0.9984297096, "j"], [1.0000000000, "q"], ],
- 'b' => [ [0.1548283857, "a"], [0.3006433493, "l"], [0.4444335587, "e"], [0.5720257340, "o"], [0.6942947650, "i"], [0.7909604520, "r"], [0.8741386630, "u"], [0.8985663434, "$"], [0.9228307371, "b"], [0.9419352732, "s"], [0.9592109990, "y"], [0.9649586885, "t"], [0.9702165181, "c"], [0.9754416903, "h"], [0.9804382613, "d"], [0.9838672806, "m"], [0.9871983279, "j"], [0.9895496555, "p"], [0.9918683257, "n"], [0.9939257372, "f"], [0.9954606316, "w"], [0.9969628686, "v"], [0.9981385324, "g"], [0.9991182522, "k"], [0.9996081121, "z"], [0.9998367134, "x"], [1.0000000000, "q"], ],
- 'c' => [ [0.1561567107, "o"], [0.3072354045, "h"], [0.4498199612, "a"], [0.5342381436, "e"], [0.6138245160, "$"], [0.6810697689, "i"], [0.7428984704, "t"], [0.8037423445, "k"], [0.8614778568, "r"], [0.9075339304, "u"], [0.9451574185, "l"], [0.9638537531, "y"], [0.9812728895, "c"], [0.9879666390, "s"], [0.9894131044, "n"], [0.9908287939, "m"], [0.9920906041, "q"], [0.9933062506, "d"], [0.9944757332, "p"], [0.9955067245, "z"], [0.9964300003, "g"], [0.9972763364, "f"], [0.9980303450, "x"], [0.9987535777, "b"], [0.9993998707, "v"], [0.9998307328, "w"], [1.0000000000, "j"], ],
- 'd' => [ [0.2735843589, "$"], [0.4794729623, "e"], [0.6243696992, "i"], [0.6957748112, "a"], [0.7647456579, "o"], [0.8067270725, "r"], [0.8421205154, "u"], [0.8667529607, "l"], [0.8871737408, "d"], [0.9052375341, "y"], [0.9209829795, "s"], [0.9341589227, "g"], [0.9428334074, "n"], [0.9506578312, "w"], [0.9579219876, "m"], [0.9650122679, "h"], [0.9720445896, "b"], [0.9768938004, "f"], [0.9816657329, "c"], [0.9861478719, "t"], [0.9899731458, "v"], [0.9937597805, "p"], [0.9968702305, "j"], [0.9984351152, "z"], [0.9994590522, "k"], [0.9997295261, "q"], [1.0000000000, "x"], ],
- 'e' => [ [0.1940659046, "$"], [0.3647626484, "r"], [0.4678651164, "n"], [0.5406383057, "d"], [0.6084582820, "s"], [0.6653470299, "l"], [0.7148364586, "t"], [0.7581282711, "a"], [0.7938804641, "c"], [0.8234045409, "m"], [0.8492808081, "e"], [0.8678553564, "p"], [0.8835079971, "i"], [0.8977820342, "x"], [0.9112447692, "g"], [0.9229689997, "u"], [0.9342845291, "v"], [0.9454719582, "f"], [0.9560981859, "y"], [0.9666695133, "o"], [0.9757646370, "b"], [0.9844937597, "w"], [0.9899593739, "h"], [0.9938206839, "k"], [0.9962789903, "q"], [0.9987189967, "z"], [1.0000000000, "j"], ],
- 'f' => [ [0.1662002052, "i"], [0.2924246665, "o"], [0.4152439593, "e"], [0.5190316261, "a"], [0.6096650807, "l"], [0.6926485680, "f"], [0.7724134714, "r"], [0.8517119134, "u"], [0.9112790372, "$"], [0.9520011195, "t"], [0.9698199459, "y"], [0.9751842523, "s"], [0.9794290512, "c"], [0.9824610505, "m"], [0.9849332960, "n"], [0.9873588954, "b"], [0.9897378487, "w"], [0.9918835712, "h"], [0.9938893554, "d"], [0.9958484933, "p"], [0.9973878160, "g"], [0.9983673850, "k"], [0.9990670772, "j"], [0.9994868924, "v"], [0.9997667693, "x"], [0.9999067077, "z"], [1.0000000000, "q"], ],
98
- 'g' => [ [0.2294771276, "$"], [0.3781089361, "e"], [0.4743756166, "a"], [0.5630354639, "i"], [0.6496183602, "r"], [0.7118749675, "o"], [0.7740277273, "l"], [0.8293005867, "h"], [0.8777714315, "u"], [0.9066929747, "g"], [0.9340827665, "n"], [0.9539176489, "y"], [0.9633677761, "s"], [0.9717534659, "m"], [0.9797237655, "t"], [0.9840853627, "w"], [0.9880834934, "b"], [0.9911469962, "c"], [0.9936912612, "d"], [0.9960278311, "f"], [0.9975855444, "p"], [0.9983644011, "k"], [0.9988836388, "j"], [0.9993769147, "v"], [0.9996365336, "x"], [0.9998701906, "z"], [1.0000000000, "q"], ],
99
- 'h' => [ [0.1921235417, "e"], [0.3644435453, "a"], [0.5135657608, "o"], [0.6515836087, "i"], [0.7645154764, "$"], [0.8146424798, "y"], [0.8527884551, "u"], [0.8857868591, "r"], [0.9185379999, "t"], [0.9357115562, "l"], [0.9473778857, "n"], [0.9580551622, "m"], [0.9658776721, "w"], [0.9721491672, "s"], [0.9781284420, "h"], [0.9828714006, "b"], [0.9868051341, "c"], [0.9895699867, "f"], [0.9922674040, "p"], [0.9947400364, "d"], [0.9965832715, "k"], [0.9982691573, "g"], [0.9991233394, "v"], [0.9994829950, "j"], [0.9997752152, "q"], [0.9999550430, "z"], [1.0000000000, "x"], ],
100
- 'i' => [ [0.1973908665, "n"], [0.3137122288, "s"], [0.4121934907, "c"], [0.4902570808, "t"], [0.5606405603, "o"], [0.6174823966, "a"], [0.6742490851, "l"], [0.7181354315, "$"], [0.7553711928, "d"], [0.7895559513, "e"], [0.8198179919, "m"], [0.8494187314, "r"], [0.8767049169, "g"], [0.9008874961, "v"], [0.9198247552, "p"], [0.9372139685, "z"], [0.9544904600, "f"], [0.9682274876, "b"], [0.9765388402, "k"], [0.9841588325, "u"], [0.9891110760, "i"], [0.9927707765, "x"], [0.9952957444, "q"], [0.9969114232, "h"], [0.9981513628, "j"], [0.9992109475, "y"], [1.0000000000, "w"], ],
101
- 'j' => [ [0.2434949329, "a"], [0.4396055875, "u"], [0.6343467543, "o"], [0.8184059162, "e"], [0.9104354971, "i"], [0.9367296631, "$"], [0.9457682827, "k"], [0.9520679266, "m"], [0.9569980827, "n"], [0.9619282388, "d"], [0.9668583950, "l"], [0.9709668584, "j"], [0.9748014243, "t"], [0.9786359901, "c"], [0.9821966584, "h"], [0.9857573268, "s"], [0.9887701999, "r"], [0.9912352780, "b"], [0.9934264585, "p"], [0.9953437414, "f"], [0.9972610244, "y"], [0.9983566146, "v"], [0.9989044098, "w"], [0.9994522049, "g"], [1.0000000000, "z"], ],
102
- 'k' => [ [0.2145110410, "e"], [0.4128214725, "$"], [0.5650430412, "i"], [0.6673260974, "a"], [0.7285462225, "o"], [0.7652782976, "l"], [0.7947388120, "u"], [0.8235042507, "y"], [0.8500775277, "s"], [0.8754745228, "n"], [0.9003903117, "r"], [0.9246644923, "h"], [0.9366411806, "t"], [0.9481366626, "w"], [0.9578142544, "b"], [0.9671175747, "m"], [0.9750307437, "k"], [0.9813933594, "f"], [0.9869539646, "p"], [0.9911778859, "c"], [0.9941185906, "d"], [0.9963642196, "g"], [0.9978613057, "v"], [0.9990375876, "j"], [0.9994653264, "x"], [0.9997861306, "z"], [1.0000000000, "q"], ],
103
- 'l' => [ [0.1721321920, "e"], [0.3312328418, "i"], [0.4596138030, "a"], [0.5644569032, "l"], [0.6646897335, "$"], [0.7533389707, "o"], [0.8401464166, "y"], [0.8755574604, "u"], [0.8986783120, "d"], [0.9217875801, "t"], [0.9320853942, "s"], [0.9413986030, "m"], [0.9491480267, "f"], [0.9568163653, "c"], [0.9643341172, "k"], [0.9708440964, "p"], [0.9770413187, "v"], [0.9830300363, "b"], [0.9868757891, "g"], [0.9903392834, "n"], [0.9929340083, "w"], [0.9953318120, "h"], [0.9976137798, "r"], [0.9984130478, "z"], [0.9990848962, "j"], [0.9997104102, "x"], [1.0000000000, "q"], ],
104
- 'm' => [ [0.2069558017, "a"], [0.3746403064, "e"], [0.5182279267, "i"], [0.6378221716, "$"], [0.7489907877, "o"], [0.8112410724, "p"], [0.8594348411, "u"], [0.8979194700, "m"], [0.9363212918, "b"], [0.9540420246, "y"], [0.9636269537, "s"], [0.9718662664, "n"], [0.9770417141, "l"], [0.9810371597, "t"], [0.9841010247, "f"], [0.9867922575, "c"], [0.9893385778, "r"], [0.9917192837, "d"], [0.9936445503, "h"], [0.9955698168, "w"], [0.9970810475, "g"], [0.9981368388, "k"], [0.9987371908, "v"], [0.9993168409, "j"], [0.9996273678, "q"], [0.9998757893, "z"], [1.0000000000, "x"], ],
105
- 'n' => [ [0.1731959913, "$"], [0.2872468996, "g"], [0.4009352398, "e"], [0.5103657289, "t"], [0.5939609170, "i"], [0.6679155123, "a"], [0.7369707900, "d"], [0.7893386386, "o"], [0.8332651581, "c"], [0.8768384062, "s"], [0.8977279066, "n"], [0.9132160720, "u"], [0.9263893795, "k"], [0.9386981016, "f"], [0.9482736181, "y"], [0.9546882844, "l"], [0.9609170184, "v"], [0.9668947437, "h"], [0.9727330197, "b"], [0.9780785750, "r"], [0.9831359351, "p"], [0.9879051001, "m"], [0.9919026458, "w"], [0.9952773181, "z"], [0.9978431846, "j"], [0.9994514995, "q"], [1.0000000000, "x"], ],
106
- 'o' => [ [0.1792873578, "n"], [0.3023095060, "r"], [0.3778063954, "l"], [0.4461351994, "u"], [0.5056388308, "m"], [0.5562320680, "$"], [0.6058215838, "s"], [0.6533752497, "t"], [0.6988551895, "p"], [0.7370818222, "c"], [0.7749770375, "o"], [0.8078347079, "g"], [0.8369805032, "d"], [0.8623197326, "w"], [0.8856041740, "v"], [0.9054797504, "i"], [0.9235656727, "b"], [0.9396914977, "a"], [0.9530334163, "f"], [0.9646140881, "e"], [0.9753425436, "k"], [0.9832965618, "x"], [0.9897734052, "y"], [0.9942806822, "h"], [0.9976422017, "z"], [0.9989678714, "j"], [1.0000000000, "q"], ],
107
- 'p' => [ [0.1630755943, "e"], [0.2878539934, "a"], [0.4087231238, "o"], [0.5266603027, "r"], [0.6227101232, "h"], [0.7102822856, "i"], [0.7778535388, "l"], [0.8271512341, "$"], [0.8667893995, "p"], [0.9040865494, "u"], [0.9379289968, "t"], [0.9606573026, "s"], [0.9756579845, "y"], [0.9790217737, "m"], [0.9823628347, "f"], [0.9856584390, "n"], [0.9888858584, "c"], [0.9914996136, "w"], [0.9938633574, "b"], [0.9958407200, "d"], [0.9972498750, "g"], [0.9985453884, "k"], [0.9992499659, "v"], [0.9995227056, "x"], [0.9997499886, "z"], [0.9999545434, "j"], [1.0000000000, "q"], ],
108
- 'q' => [ [0.8851774530, "u"], [0.9144050104, "$"], [0.9345859429, "a"], [0.9512874043, "i"], [0.9592901879, "e"], [0.9641614475, "l"], [0.9686847599, "v"], [0.9718162839, "r"], [0.9745998608, "b"], [0.9773834377, "f"], [0.9801670146, "m"], [0.9829505915, "o"], [0.9853862213, "s"], [0.9878218511, "t"], [0.9899095338, "k"], [0.9919972164, "q"], [0.9933890049, "p"], [0.9947807933, "n"], [0.9961725818, "w"], [0.9975643702, "d"], [0.9986082116, "c"], [0.9993041058, "h"], [0.9996520529, "g"], [1.0000000000, "j"], ],
109
- 'r' => [ [0.1577633281, "e"], [0.2944208938, "a"], [0.4238126886, "i"], [0.5479055899, "$"], [0.6580327633, "o"], [0.6988881305, "t"], [0.7281847248, "u"], [0.7566550510, "y"], [0.7843889208, "d"], [0.8109282943, "s"], [0.8362103032, "r"], [0.8607468746, "m"], [0.8826160368, "n"], [0.9026081334, "c"], [0.9184778704, "g"], [0.9321651818, "b"], [0.9458165685, "l"], [0.9588213105, "k"], [0.9713769938, "p"], [0.9787864636, "v"], [0.9857827993, "f"], [0.9924647938, "h"], [0.9966230780, "w"], [0.9981229343, "z"], [0.9990839201, "j"], [0.9997575083, "q"], [1.0000000000, "x"], ],
110
- 's' => [ [0.1670682820, "t"], [0.2939047006, "$"], [0.4058387838, "e"], [0.4933650184, "s"], [0.5800473055, "i"], [0.6458085794, "h"], [0.7020976536, "a"], [0.7513464293, "c"], [0.7976858072, "o"], [0.8403273628, "u"], [0.8806036445, "p"], [0.9077099042, "m"], [0.9273649962, "l"], [0.9428558738, "k"], [0.9567810067, "y"], [0.9666307619, "n"], [0.9762029027, "w"], [0.9812332737, "q"], [0.9854863247, "f"], [0.9895728071, "b"], [0.9922823226, "d"], [0.9949141061, "r"], [0.9973237982, "g"], [0.9985119874, "v"], [0.9992337845, "z"], [0.9999222680, "j"], [1.0000000000, "x"], ],
111
- 't' => [ [0.1842635651, "i"], [0.3614680523, "e"], [0.4876171188, "$"], [0.5830228191, "a"], [0.6670435441, "r"], [0.7509364585, "o"], [0.8272885472, "h"], [0.8631934954, "u"], [0.8966897053, "t"], [0.9280622929, "y"], [0.9435961971, "l"], [0.9553842675, "s"], [0.9665037901, "c"], [0.9732777521, "w"], [0.9779870813, "m"], [0.9817034204, "f"], [0.9851739699, "z"], [0.9884970456, "b"], [0.9917218055, "n"], [0.9942976807, "p"], [0.9964016399, "g"], [0.9975912617, "d"], [0.9984662727, "k"], [0.9993117891, "v"], [0.9998033683, "j"], [0.9999213473, "x"], [1.0000000000, "q"], ],
112
- 'u' => [ [0.1659793262, "n"], [0.3036437970, "s"], [0.4266152500, "r"], [0.5220573795, "l"], [0.5933801082, "m"], [0.6608108832, "t"], [0.6990520058, "c"], [0.7350257976, "e"], [0.7697320265, "i"], [0.8040097834, "a"], [0.8368414475, "p"], [0.8691910806, "b"], [0.8962205202, "d"], [0.9214825130, "$"], [0.9456197668, "g"], [0.9569207148, "f"], [0.9658829200, "o"], [0.9740595933, "k"], [0.9797189938, "v"], [0.9851641583, "x"], [0.9894667309, "z"], [0.9922696517, "h"], [0.9948047775, "y"], [0.9966793423, "u"], [0.9984289361, "j"], [0.9997143520, "w"], [1.0000000000, "q"], ],
113
- 'v' => [ [0.4405209116, "e"], [0.6701101928, "i"], [0.8340846481, "a"], [0.9149135988, "o"], [0.9356999750, "$"], [0.9479714500, "u"], [0.9574254946, "r"], [0.9664412722, "y"], [0.9723891811, "s"], [0.9777109942, "l"], [0.9827197596, "v"], [0.9859754570, "n"], [0.9884172302, "c"], [0.9902955172, "t"], [0.9919859755, "d"], [0.9936138242, "m"], [0.9946781868, "g"], [0.9956799399, "k"], [0.9966190834, "p"], [0.9973703982, "b"], [0.9979964939, "f"], [0.9984973704, "h"], [0.9989356374, "j"], [0.9992486852, "q"], [0.9995617330, "x"], [0.9998121713, "w"], [1.0000000000, "z"], ],
114
- 'w' => [ [0.2197508897, "a"], [0.3764713934, "e"], [0.5284697509, "i"], [0.6595264166, "o"], [0.7374076102, "$"], [0.8001642486, "h"], [0.8445113605, "n"], [0.8720914317, "r"], [0.8938543663, "l"], [0.9145223104, "s"], [0.9261565836, "y"], [0.9366274295, "d"], [0.9460717219, "t"], [0.9542157131, "b"], [0.9619490829, "u"], [0.9690665207, "k"], [0.9750889680, "f"], [0.9811114153, "c"], [0.9867916781, "m"], [0.9911716397, "p"], [0.9946619217, "w"], [0.9976731454, "g"], [0.9985628251, "j"], [0.9993840679, "z"], [0.9998631262, "q"], [1.0000000000, "v"], ],
115
- 'x' => [ [0.2546458142, "$"], [0.4077276909, "i"], [0.5054277829, "t"], [0.5863845446, "p"], [0.6559337626, "e"], [0.7238270469, "a"], [0.7696412144, "o"], [0.8147194112, "y"], [0.8594296228, "c"], [0.8885004600, "x"], [0.9138914443, "u"], [0.9306347746, "h"], [0.9425942962, "l"], [0.9521619135, "s"], [0.9613615455, "v"], [0.9703771849, "m"], [0.9781048758, "f"], [0.9849126035, "b"], [0.9889604416, "w"], [0.9913523459, "r"], [0.9933762649, "d"], [0.9954001840, "n"], [0.9974241030, "g"], [0.9992640294, "q"], [0.9996320147, "j"], [0.9998160074, "k"], [1.0000000000, "z"], ],
116
- 'y' => [ [0.5555410988, "$"], [0.5980873695, "l"], [0.6377386722, "s"], [0.6736492860, "a"], [0.7088768175, "e"], [0.7438441271, "p"], [0.7767621898, "n"], [0.8066226458, "m"], [0.8352470481, "c"], [0.8630907849, "t"], [0.8858276681, "o"], [0.9072309144, "r"], [0.9262921641, "d"], [0.9444426373, "i"], [0.9558923983, "b"], [0.9643496080, "g"], [0.9714081254, "u"], [0.9783690596, "w"], [0.9846143838, "f"], [0.9887779332, "h"], [0.9924860944, "k"], [0.9945678691, "v"], [0.9964219497, "z"], [0.9980158085, "x"], [0.9992518622, "y"], [0.9998048336, "j"], [1.0000000000, "q"], ],
117
- 'z' => [ [0.2946584939, "e"], [0.4734384121, "a"], [0.6180677175, "i"], [0.7089900759, "o"], [0.7905720957, "$"], [0.8610624635, "z"], [0.8884997081, "y"], [0.9098073555, "u"], [0.9306771745, "l"], [0.9455633392, "h"], [0.9519848219, "b"], [0.9576765908, "m"], [0.9627845884, "k"], [0.9676007005, "w"], [0.9724168126, "n"], [0.9767950963, "s"], [0.9810274372, "t"], [0.9852597782, "d"], [0.9886164623, "g"], [0.9910974898, "v"], [0.9934325744, "c"], [0.9957676591, "r"], [0.9976649154, "p"], [0.9988324577, "f"], [0.9995621716, "j"], [0.9998540572, "q"], [1.0000000000, "x"], ],
118
- } # BIGRAM_CDF_HSH
119
-
120
- end
121
-
122
- end
123
- end
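The table removed above stores, for each letter, a list of `[cumulative probability, next character]` pairs in ascending order. One common way to use such a table is inverse-CDF sampling: draw a uniform number in `[0, 1)` and take the first entry whose cumulative probability reaches it. The sketch below is only an illustration under stated assumptions. `TOY_CDF`, `next_char` and `random_word` are hypothetical names, the probabilities are invented, and treating `"$"` as an end-of-word marker is a guess about the original data rather than something the removed file states.

```ruby
# Hypothetical miniature of the table: each row is [cumulative probability, next char].
# The numbers are made up; "$" is ASSUMED to mean "end of word".
TOY_CDF = {
  'q' => [[0.90, 'u'], [0.95, '$'], [1.00, 'a']],
  'u' => [[0.50, '$'], [1.00, 'q']],
  'a' => [[1.00, '$']],
}.freeze

# Inverse-CDF lookup: first entry whose cumulative probability is >= draw.
def next_char(cdf_row, draw)
  cdf_row.find { |cum_prob, _char| draw <= cum_prob }.last
end

# Walk the table from a starting letter until the end marker is drawn
# (or until max_len, so the walk always terminates).
def random_word(cdf, start, rng: Random.new, max_len: 20)
  word = start.dup
  current = start
  while word.length < max_len
    current = next_char(cdf.fetch(current), rng.rand)
    break if current == '$'
    word << current
  end
  word
end

puts next_char(TOY_CDF['q'], 0.42)
puts random_word(TOY_CDF, 'q', rng: Random.new(1))
```

Because the last entry of every row has cumulative probability 1.0 and `rand` returns a value below 1.0, the lookup always finds a match.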
@@ -1,26 +0,0 @@
1
- module Wukong
2
- module Widget
3
-
4
- class Monitor < AsIs
5
- include CountingProcessor
6
- register_processor
7
-
8
- field :every, Integer, :default => 1000, :doc => "How often to announce progress"
9
-
10
- def process(rec)
11
- super(rec)
12
- $stderr.puts("%-7d\t%s\t%s" % [count, report, rec.inspect[0..1000]]) if ready?
13
- end
14
-
15
- def ready?
16
- (count % every) == 0
17
- end
18
- end
19
-
20
- class DumpSystemConfig < Monitor
21
- def setup ; require 'rbconfig' ; end
22
- def report() super.merge({ :rbconfig => RbConfig::CONFIG }) end
23
- end
24
-
25
- end
26
- end
@@ -1,66 +0,0 @@
1
- module Wukong
2
- class Counter < Wukong::Processor
3
- field :count, Integer, :doc => 'count of records this run'
4
-
5
- def setup
6
- super
7
- reset!
8
- end
9
-
10
- def reset!
11
- self.count = 0
12
- end
13
-
14
- def beg_group(*args)
15
- reset!
16
- end
17
-
18
- def end_group(key)
19
- emit( [key, count] )
20
- end
21
-
22
- def process(record)
23
- @count += 1
24
- end
25
- end
26
-
27
- class GroupArrays < Wukong::Processor
28
- def beg_group(*args)
29
- @records = []
30
- end
31
-
32
- def end_group(key)
33
- emit(key, @records)
34
- end
35
-
36
- def process(record)
37
- @records << record
38
- end
39
- end
40
-
41
- class Group < Wukong::Processor
42
- def start(key, *vals)
43
- @key = key
44
- next_stage.tell(:beg_group, @key)
45
- end
46
-
47
- def end_group
48
- next_stage.tell(:end_group, @key)
49
- end
50
-
51
- def process( (key, *vals) )
52
- start(key, *vals) unless defined?(@key)
53
- if key != @key
54
- end_group
55
- start(key, *vals)
56
- end
57
- emit( [key, *vals] )
58
- end
59
-
60
- def finally
61
- end_group
62
- super()
63
- end
64
- end
65
-
66
- end
@@ -1,50 +0,0 @@
1
- module Wukong
2
- module Widget
3
-
4
- class Stringifier < Processor
5
- end
6
-
7
- class ToJson < Stringifier
8
- def process(record)
9
- emit record.to_json
10
- # emit MultiJson.dump(record)
11
- end
12
- register_processor
13
- end
14
-
15
- class FromJson < Stringifier
16
- # FIXME some of this belongs in gorillib factories...
17
- def process(record)
18
- obj = MultiJson.load(record)
19
- if obj.respond_to?(:has_key?) && obj.has_key?("_metadata")
20
- metadata_hash = obj.delete("_metadata")
21
- obj.define_singleton_method(:_metadata) do
22
- metadata_hash
23
- end
24
- end
25
- if obj.respond_to?(:has_key?) && obj.has_key?("_type")
26
- klass = Gorillib::Factory(obj.delete("_type"))
27
- obj = klass.receive(obj)
28
- end
29
- emit obj
30
- end
31
- register_processor
32
- end
33
-
34
- class ToTsv < Stringifier
35
- def process(record)
36
- emit record.join("\t")
37
- end
38
- register_processor
39
- end
40
-
41
- class FromTsv < Stringifier
42
- def process(record)
43
- emit record.chomp.split(/\t/)
44
- end
45
- register_processor
46
- end
47
-
48
- end
49
-
50
- end
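The removed `FromJson` processor pulls a `_metadata` key out of the parsed record and re-attaches it as a singleton method, so the metadata travels with the object without appearing among its fields. A minimal standalone sketch of that one step follows. It uses the standard-library `json` instead of `MultiJson`, and the sample line and variable names are made up for illustration.

```ruby
require 'json'

# Illustrative input line; the field names other than "_metadata" are invented.
line = '{"_metadata":{"source":"orders.tsv"},"country":"sweden","total":3}'
record = JSON.parse(line)

# Move "_metadata" out of the hash and expose it as a method on this one object.
if record.respond_to?(:key?) && record.key?('_metadata')
  metadata = record.delete('_metadata')
  record.define_singleton_method(:_metadata) { metadata }
end

p record
p record._metadata
```

Only this particular hash gains the `_metadata` method; other hashes are unaffected, which is the point of defining it on the singleton.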
@@ -1,22 +0,0 @@
1
- module Wukong
2
- class WorkflowGraph < Hanuman::Graph
3
- end
4
-
5
- class Workflow < WorkflowGraph
6
- include Hanuman::IsOwnInputSlot
7
- include Hanuman::IsOwnOutputSlot
8
-
9
- #
10
- # lifecycle
11
- #
12
-
13
- def setup
14
- stages.each_value{|stage| stage.setup}
15
- end
16
-
17
- def stop
18
- stages.each_value{|stage| stage.stop}
19
- end
20
-
21
- end
22
- end
@@ -1,42 +0,0 @@
1
- module Wukong
2
- class Workflow < Hanuman::Graph
3
-
4
- class ActionWithInputs < Hanuman::Action
5
- include Hanuman::Slottable
6
- include Hanuman::SplatInputs
7
- include Hanuman::SplatOutputs
8
-
9
- def self.make(workflow, *input_stages, &block)
10
- options = input_stages.extract_options!
11
- stage = new
12
- workflow.add_stage stage
13
- input_stages.map do |input|
14
- workflow.connect(input, stage)
15
- end
16
- stage.receive!(options, &block)
17
- stage
18
- end
19
- end
20
-
21
- #
22
- # A command is a workflow action that runs a type of command on many inputs,
23
- # under a given configuration, into named outputs.
24
- #
25
- # @example
26
- # bash 'create_archive.sh', filenames, :compression_level => 9 > 'archive.tar.gz'
27
- #
28
- class Command < ActionWithInputs
29
-
30
- def self.make(workflow, stage_name, *input_stages, &block)
31
- options = input_stages.extract_options!
32
- super(workflow, *input_stages, options.merge(:name => stage_name, :script => stage_name), &block)
33
- end
34
- end
35
-
36
- class Shell < Command
37
- field :script, String
38
- register_action
39
- end
40
-
41
- end
42
- end
@@ -1,48 +0,0 @@
1
- #
2
- # Elastic MapReduce config in wukong
3
- #
4
-
5
- #
6
- # Infrastructure options
7
- #
8
-
9
- # == Fill all your information into yet another file with your amazon key
10
- :emr_credentials_file: ~/.wukong/credentials.json
11
- #
12
- # == Use the credentials file, set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars, or enter them here:
13
- # :access_key: ASDFAHKHASDF
14
- # :secret_access_key: ADSGHASDFJASDFASDF
15
- #
16
- # == Path to your keypair file
17
- # :key_pair_file: ~/.wukong/keypairs/gibbon.pem
18
- # == Keypair will be named after your file, or force the name:
19
- # :key_pair: ~
20
-
21
- # == Path to the Amazon elastic-mapreduce runner. Get a copy from
22
- # http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
23
- :emr_runner: ~/ics/hadoop/elastic-mapreduce/elastic-mapreduce
24
-
25
- #
26
- # Cluster Config
27
- #
28
- :num_instances: 1
29
- :instance_type: m2.xlarge
30
- :master_instance_type: ~
31
- :hadoop_version: '0.20'
32
- # :availability_zone: us-east-1b
33
-
34
- #
35
- # Running and reporting options
36
- #
37
- :alive: true
38
- :enable_debugging: true
39
- :emr_runner_verbose: true
40
- :emr_runner_debug: ~
41
- :step_action: CANCEL_AND_WAIT # CANCEL_AND_WAIT, TERMINATE_JOB_FLOW or CONTINUE
42
-
43
- #
44
- # Remote Paths
45
- #
46
- :emr_root: s3n://emr.infinitemonkeys.info
47
-
48
-
@@ -1,17 +0,0 @@
1
- Examples:
2
-
3
-
4
- * sample_records -- extract a random sample from a collection of data
5
-
6
- * word_count
7
-
8
- * apache_log_parser -- example for parsing standard apache webserver log files.
9
-
10
- * wordchains -- solving a word puzzle using breadth-first search of a graph
11
-
12
- * graph -- some generic graph
13
-
14
- * pagerank -- use the pagerank algorithm to find the most 'interesting'
15
- (central) nodes of a network graph
16
-
17
-
@@ -1,165 +0,0 @@
1
- Notes and scripts from a talk by Fredrik Möllerstrand (@lenbust / http://fredrikmollerstrand.se), with some modifications by @mrflip
2
-
3
- See http://bit.ly/6ItaHI and forks of that gist!
4
-
5
- Wukong who?
6
- -----------
7
-
8
- Here be some notes from my talk on [Wukong](http://github.com/mrflip/wukong) at the January meetup of [Got.rb](http://www.meetup.com/got-rb/).
9
-
10
- Wukong is a framework for writing Hadoop jobs in Ruby. Other such frameworks are [MRToolkit](http://code.google.com/p/mrtoolkit/) (which is also written in Ruby and which I have not tried) and [Dumbo](http://github.com/klbostee/dumbo) (which is written in Python and which I love dearly). You could also write your jobs in Java(!) or as bare scripts hooked into [Hadoop Streaming](http://hadoop.apache.org/common/docs/current/streaming.html), but that would be nuts.
11
-
12
- Wukong gives you the option of treating your data as a stream of lines or as a stream of fields or lightweight objects. My forenoon's experience of Wukong covers only the basic text streaming so we'll skip the structured data and interpret the data as dumb chunks of text.
13
-
14
- The data, which I regrettably can not share with you on the wider interwebs, are records of jeans ordered by stores from a central jean distributor. There is one line per order, and each order is for a specific market and for a specified number of jeans for each size. There are 13 sizes in total.
15
-
16
- An example of which:
17
-
18
- SS10 vaxjobutiken 2010-01-10 10:45:54 sweden Storgatan 64 VÄXJÖ 352 30 SWEDEN retailer 120664 L34 0 0 0 0 0 1 1 1 2 2 1 1 0
19
- SS10 vaxjobutiken 2009-01-10 10:45:54 sweden Storgatan 64 VÄXJÖ 352 30 SWEDEN retailer 120721 L32 0 0 0 0 0 1 2 2 2 1 1 1 0
20
- SS09 kubic 2010-01-10 13:33:37 spain NULL NULL NULL NULL retailer 120571 L34 0 0 0 0 0 0 0 1 1 1 1 1 0
21
-
22
- The integers at the end there describe how many of each jean size was ordered.
23
-
24
- We'll use Wukong to summarize the orders for each country. The job is run locally (as opposed to on Hadoop) to avoid any startup overhead.
25
-
26
- sizes.rb
27
- --------
28
- $> ruby sizes.rb --run=local data/orders.tsv data/sizes
29
-
30
- require 'rubygems'
31
- require 'wukong'
32
- module JeanSizes
33
- class Mapper < Wukong::Streamer::RecordStreamer
34
- def process(code,model,time,country,j1,j2,j3, n1,n2,c1, venue,n3,n4, *sizes)
35
- yield [country, *sizes] if sizes.length == 13
36
- end
37
- end
38
-
39
- class JeansListReducer < Wukong::Streamer::ListReducer
40
- def finalize
41
- return if values.empty?
42
- sums = []; 13.times{ sums << 0 }
43
- values.each do |country, *sizes|
44
- sizes.map!(&:to_i)
45
- sums = sums.zip(sizes).map{|sum, val| sum + val }
46
- end
47
- yield [key, *sums]
48
- end
49
- end
50
- end
51
-
52
- Wukong::Script.new(JeanSizes::Mapper, JeanSizes::JeansListReducer).run
53
-
54
- *JeanSizes::Mapper#process*, being a RecordStreamer, is given one set of input fields to work with at a time. It picks out the good parts, namely the country at index 3 and the integers at index 11 through 23. The rest of the fields are unimportant and just given placeholder names.
55
-
56
- The country is promoted to key and the sizes array is value. These are yielded as a list – since the reducer is a list reducer!
57
-
58
- *JeanSizes::JeansListReducer#finalize* is given the key ('sweden' for example) and a list of lists of integers. These are (over)cleverly summarized into one list, *sums*.
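The `zip`/`map` summation in the reducer can be tried on its own, outside Wukong. The sketch below applies it to the size columns of the two sample Växjö orders shown earlier; the `rows` and `sums` variable names are just for illustration.

```ruby
# Size columns copied from the two sample "sweden" orders above.
rows = [
  ['sweden', *'0 0 0 0 0 1 1 1 2 2 1 1 0'.split],
  ['sweden', *'0 0 0 0 0 1 2 2 2 1 1 1 0'.split],
]

# Start with one zero per size, then add each order column by column.
sums = Array.new(13, 0)
rows.each do |_country, *sizes|
  sums = sums.zip(sizes.map(&:to_i)).map { |sum, val| sum + val }
end

p sums
```

`zip` pairs the running total for each size with the matching count from the current order, and `map` adds each pair, so `sums` always stays 13 elements long.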
59
-
60
- The output of these two steps is a much smaller data set containing the number of jeans of each size purchased, broken down by market.
61
-
62
- An example of which:
63
-
64
- sweden 807 1443 2215 2460 2316 2077 2392 2563 3068 2356 2051 1016 255
65
- switzerland 90 201 731 886 585 325 404 624 770 721 635 295 41
66
- unitedstates 446 1103 2007 2442 2863 2879 3920 3687 5588 4256 5299 3777 1842
67
-
68
-
69
- That's all peachy, but what if I'd like to compare the relative amount of large jeans bought in Sweden with those bought in the US? A working hypothesis might be that Swedes wear smaller jean sizes than Americans do. Well, let's normalize the data and see what we can make of it.
70
-
71
- normalize.rb
72
- ------------
73
- $> ruby normalize.rb --run=local data/sizes.tsv data/normalized_sizes.tsv
74
-
75
- require 'rubygems'
76
- require 'wukong'
77
- require 'active_support/core_ext/enumerable' # for array#sum
78
-
79
- module Normalize
80
- class Mapper < Wukong::Streamer::RecordStreamer
81
- def process(country, *sizes)
82
- sizes.map!(&:to_i)
83
- sum = sizes.sum.to_f
84
- normalized = sizes.map{|x| 100 * x/sum }
85
- s = normalized.join(",")
86
- yield [country, s]
87
- end
88
- end
89
- end
90
-
91
- Wukong::Script.new(Normalize::Mapper, nil).run
92
-
93
- Again we're dealing with a record streamer. The normalization divides the count for each jean size by the total number of jeans sold in that country and multiplies by 100 to turn the figures into proper percentages.
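The normalization step is small enough to check by hand. Here is a minimal sketch in plain Ruby, assuming Ruby 2.4 or later, where `Array#sum` is built in and the ActiveSupport require is no longer needed. The `normalize` helper name and the zero-total guard are additions for illustration, not part of the original script.

```ruby
# Convert raw counts into percentages of their total.
def normalize(sizes)
  total = sizes.sum.to_f
  # Guard added for this sketch: avoid dividing by zero on an all-zero row.
  return sizes.map { 0.0 } if total.zero?
  sizes.map { |count| 100 * count / total }
end

percentages = normalize([1, 1, 2])
p percentages
puts percentages.join(',')
```

Converting the total to a float first is what keeps `100 * count / total` from being truncated by integer division.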
94
-
95
- You will also notice that I join the list of normalized values with a comma. Why in the name of Buddha would I do that? Bear with me.
96
-
97
- Part of the output looks like so:
98
-
99
- sweden 1.01922538870458,4.06091370558376,8.19776969503178,9.41684319916863,12.2626803629242,10.2442143970582,9.56073384227987,8.30169071505656,9.25696470682282,9.83252727926776,8.85327151364963,5.76761661137536,3.22554858307686
100
- switzerland 0.64996829422955,4.67660114140774,10.0665821179455,11.429930247305,12.2067216233354,9.89220038046925,6.40456563094483,5.15218769816107,9.27393785668992,14.0456563094483,11.5884590995561,3.1864299302473,1.42675967025999
101
- unitedstates 4.59248547707497,9.41683911341594,13.2114986661348,10.6110847939365,13.9320352040689,9.19245057219078,9.77336757336259,7.17794011319155,7.13804881697375,6.08840908524271,5.0038644693211,2.75000623301503,1.11196988207136
102
-
103
-
104
- Data visualized
105
- ---------------
106
-
107
- In the words of the late great R. Dingly, '*data not visualized is not data*'.
108
-
109
- Let's put our hypothesis to work and graph this. The quickest path to a graph just happens to be Google Charts, and here's
110
- [a graph comparing jeans sizes bought in Sweden and the US ](http://chart.apis.google.com/chart?cht=bvg&chd=t:1.01922538870458,4.06091370558376,8.19776969503178,9.41684319916863,12.2626803629242,10.2442143970582,9.56073384227987,8.30169071505656,9.25696470682282,9.83252727926776,8.85327151364963,5.76761661137536,3.22554858307686|4.59248547707497,9.41683911341594,13.2114986661348,10.6110847939365,13.9320352040689,9.19245057219078,9.77336757336259,7.17794011319155,7.13804881697375,6.08840908524271,5.0038644693211,2.75000623301503,1.11196988207136&chds=0,20&chs=800x375&chdl=Sweden|USA&chco=eecc00,00eedd).
111
-
112
- It seems that Americans buy smaller sizes while Swedes go for larger breeches, which is quite the opposite of what we thought.
113
-
114
- As someone in the crowd pointed out during the meet, this might be due to the fact that the particular brand of jeans under scrutiny here has become mainstream in Sweden while it is still mostly worn by thin punk-rockers in the States. Again the words of Mr. Dingly echo so true: '*data is nothing without interpretation*'.
115
-
116
- Chaining Runs
117
- -------------
118
-
119
- **Local Mode**: If you're running in local mode, chaining is straightforward: just use a single dash '-' as the output file, and wukong will leave its output on STDOUT rather than dumping to a file on disk. (NOTE: this is only true as of wukong v1.4.5.) (OTHER NOTE: if you've already written a file named '-' to disk, use "rm -- -" to remove it).
120
-
121
- ./sizes.rb --run=local data/orders.tsv - | ./normalize.rb --run=local - data/normalized_sizes
122
-
123
- For anything fancier (or for earlier versions of wukong): all the local runner does is to take your script and run
124
-
125
- cat input.tsv | myscript.rb --map [..args..] | sort | myscript.rb --reduce [..args..] > output.tsv
126
-
127
- You can do the chaining by hand:
128
-
129
- cat input.tsv | myscript.rb --map [..args..] | sort | myscript.rb --reduce [..args..] | whatever | ./anotherscript.rb --map | sort | ./anotherscript --reduce > output.tsv
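The `map | sort | reduce` pipeline above can be imitated in a few lines of plain Ruby, which may help when reasoning about what each stage receives. This is only a sketch of the idea, not Wukong's local runner. The sample lines and the column-summing reduce step are made up for illustration, and `chunk_while` assumes Ruby 2.3 or later.

```ruby
# Made-up tab-separated input: key followed by two counts.
lines = [
  "sweden\t1\t2",
  "spain\t0\t1",
  "sweden\t3\t0",
]

# "--map": turn each line into [key, value, value, ...]
mapped = lines.map { |line| line.chomp.split("\t") }

# "sort": bring records with the same key next to each other
sorted = mapped.sort_by(&:first)

# "--reduce": handle each run of identical keys as one group
reduced = sorted.chunk_while { |a, b| a.first == b.first }.map do |group|
  key  = group.first.first
  sums = group.map { |_key, *vals| vals.map(&:to_i) }.transpose.map(&:sum)
  [key, *sums]
end

reduced.each { |row| puts row.join("\t") }
```

The reduce stage relies entirely on the sort: it only ever compares a record with its neighbour, which is why the `sort` in the shell pipeline cannot be dropped.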
130
-
131
- **Hadoop Mode**: Wukong doesn't let you chain jobs or define workflows. There are tools out there that enable this, but there are no mature solutions that I [@mrflip] know of.
132
-
133
- In closing
134
- ----------
135
-
136
- This has been a quick rundown of Wukong as I know it after a few hours of use. Improvements can certainly be made and I welcome any and all comments. Please consider amending the code in this presentation by forking [this gist](http://gist.github.com/278043).
137
-
138
- Also, you should follow me on twitter: [@lenbust](http://twitter.com/lenbust).
139
-
140
- Postscript
141
- ----------
142
-
143
- The ListReducer used in sizes.rb is perfect for a small, local run such as this one. However, for large amounts of data it's best to avoid the ListReducer, as it collects every single record in memory before finalizing.
144
-
145
- You can instead use an Accumulating reducer directly. Compare this class with the JeansListReducer above and you'll see we are applying the same basic workflow, just in explicitly separated steps.
146
-
147
- class JeansAccumulatingReducer < Wukong::Streamer::AccumulatingReducer
148
- attr_accessor :sums
149
-
150
- # start the sum with 0 for each size
151
- def start! *_
152
- self.sums = []; 13.times{ self.sums << 0 }
153
- end
154
-
155
- # accumulate each size count into sums
156
- def accumulate country, *sizes
157
- sizes.map!(&:to_i)
158
- self.sums = self.sums.zip(sizes).map{|sum, val| sum + val }
159
- end
160
-
161
- # emit [country, size_0_sum, size_1_sum, ...]
162
- def finalize
163
- yield [key, *sums]
164
- end
165
- end
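To see why the accumulating style needs so little memory, it can help to drive the three hooks by hand. The sketch below is a plain-Ruby stand-in, not Wukong's `AccumulatingReducer`: the hook names `start!`, `accumulate` and `finalize` come from the class above, while `ToyAccumulator`, `run_sorted` and the sample records are hypothetical. It assumes the input is already sorted by key.

```ruby
# Stand-in reducer: keeps only the current key and its running sums.
class ToyAccumulator
  attr_reader :key, :sums

  # Called on the first record of each new key.
  def start!(key, *sizes)
    @key  = key
    @sums = Array.new(sizes.length, 0)
  end

  # Called for every record, including the first one of a key.
  def accumulate(_key, *sizes)
    @sums = @sums.zip(sizes.map(&:to_i)).map { |sum, val| sum + val }
  end

  # Called once the key changes or the input ends.
  def finalize
    yield [key, *sums]
  end
end

# Hypothetical driver: walks key-sorted records one at a time.
def run_sorted(records, reducer)
  results = []
  current = nil
  records.each do |key, *sizes|
    if key != current
      reducer.finalize { |row| results << row } unless current.nil?
      reducer.start!(key, *sizes)
      current = key
    end
    reducer.accumulate(key, *sizes)
  end
  reducer.finalize { |row| results << row } unless current.nil?
  results
end

records = [%w[spain 0 1], %w[sweden 1 2], %w[sweden 3 0]]
results = run_sorted(records, ToyAccumulator.new)
p results
```

At any moment the reducer holds one key and one array of sums, regardless of how many records share that key, whereas the list-based version holds every record for the key until `finalize`.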