wukong 0.1.4 → 1.4.0

Files changed (63)
  1. data/INSTALL.textile +89 -0
  2. data/README.textile +41 -74
  3. data/docpages/INSTALL.textile +94 -0
  4. data/{doc → docpages}/LICENSE.textile +0 -0
  5. data/{doc → docpages}/README-wulign.textile +6 -0
  6. data/docpages/UsingWukong-part1-get_ready.textile +17 -0
  7. data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
  8. data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
  9. data/docpages/_config.yml +39 -0
  10. data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
  11. data/{doc → docpages}/code/api_response_example.txt +0 -0
  12. data/{doc → docpages}/code/parser_skeleton.rb +0 -0
  13. data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
  14. data/docpages/favicon.ico +0 -0
  15. data/docpages/gem.css +16 -0
  16. data/docpages/hadoop-tips.textile +83 -0
  17. data/docpages/index.textile +90 -0
  18. data/docpages/intro.textile +8 -0
  19. data/docpages/moreinfo.textile +174 -0
  20. data/docpages/news.html +24 -0
  21. data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
  22. data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
  23. data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
  24. data/docpages/tutorial.textile +283 -0
  25. data/docpages/usage.textile +195 -0
  26. data/docpages/wutils.textile +263 -0
  27. data/wukong.gemspec +80 -50
  28. metadata +87 -54
  29. data/doc/INSTALL.textile +0 -41
  30. data/doc/README-tutorial.textile +0 -163
  31. data/doc/README-wutils.textile +0 -128
  32. data/doc/TODO.textile +0 -61
  33. data/doc/UsingWukong-part1-setup.textile +0 -2
  34. data/doc/UsingWukong-part2-scraping.textile +0 -2
  35. data/doc/hadoop-nfs.textile +0 -51
  36. data/doc/hadoop-setup.textile +0 -29
  37. data/doc/index.textile +0 -124
  38. data/doc/links.textile +0 -42
  39. data/doc/usage.textile +0 -102
  40. data/doc/utils.textile +0 -48
  41. data/examples/and_pig/sample_queries.rb +0 -128
  42. data/lib/wukong/and_pig.rb +0 -62
  43. data/lib/wukong/and_pig/README.textile +0 -12
  44. data/lib/wukong/and_pig/as.rb +0 -37
  45. data/lib/wukong/and_pig/data_types.rb +0 -30
  46. data/lib/wukong/and_pig/functions.rb +0 -50
  47. data/lib/wukong/and_pig/generate.rb +0 -85
  48. data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
  49. data/lib/wukong/and_pig/junk.rb +0 -51
  50. data/lib/wukong/and_pig/operators.rb +0 -8
  51. data/lib/wukong/and_pig/operators/compound.rb +0 -29
  52. data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
  53. data/lib/wukong/and_pig/operators/execution.rb +0 -15
  54. data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
  55. data/lib/wukong/and_pig/operators/foreach.rb +0 -98
  56. data/lib/wukong/and_pig/operators/groupies.rb +0 -212
  57. data/lib/wukong/and_pig/operators/load_store.rb +0 -65
  58. data/lib/wukong/and_pig/operators/meta.rb +0 -42
  59. data/lib/wukong/and_pig/operators/relational.rb +0 -129
  60. data/lib/wukong/and_pig/pig_struct.rb +0 -48
  61. data/lib/wukong/and_pig/pig_var.rb +0 -95
  62. data/lib/wukong/and_pig/symbol.rb +0 -29
  63. data/lib/wukong/and_pig/utils.rb +0 -0
data/doc/hadoop-setup.textile DELETED
@@ -1,29 +0,0 @@
-
- h2. Hadoop on EC2
-
- * http://www.cloudera.com/hadoop-ec2
- * http://www.cloudera.com/hadoop-ec2-ebs-beta
-
-
- h3. Setup NFS within the cluster
-
- *
- * http://nfs.sourceforge.net/nfs-howto/ar01s03.html
-
-
- h3. Miscellaneous Hadoop Tips
-
- * The Cloudera AMIs and distribution include BZip2 support. This means that if you have input files with a .bz2 extension, they will be naturally un-bzipped and streamed. (Note that there is a non-trivial penalty for doing so: each bzip'ed file must go, in whole, to a single mapper; and the CPU load for un-bzipping is sizeable.)
-
- * To _produce_ bzip2 files, specify the new @--compress_output=@ flag. If you have the BZip2 patches installed, you can give @--compress_output=bz2@; everyone should be able to use @--compress_output=gz@.
-
- * For excellent performance you can patch your install for "Parallel LZO Splitting":http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/
-
-
- h3. Tools for EC2 and S3 Management
-
- * http://s3sync.net/wiki
- * http://jets3t.s3.amazonaws.com/applications/applications.html#uploader
- * "ElasticFox"
- * "S3Fox (S3 Organizer)":
- * "FoxyProxy":
data/doc/index.textile DELETED
@@ -1,124 +0,0 @@
- ---
- layout: default
- title: mrflip.github.com/wukong
- collapse: false
- ---
-
- h1(gemheader). wukong %(small):: hadoop made easy%
-
-
- p(description). {{ site.description }}
-
-
- Treat your dataset like a
- * stream of lines when it's efficient to process by lines
- * stream of field arrays when it's efficient to deal directly with fields
- * stream of lightweight objects when it's efficient to deal with objects
-
- Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
-
- <notextile><div class="toggle"></notextile>
-
- h2. How to write a Wukong script
-
- Here's a script to count words in a text stream:
-
- <pre><code>
- require 'wukong'
- module WordCount
-   class Mapper < Wukong::Streamer::LineStreamer
-     # Emit each word in the line.
-     def process line
-       words = line.strip.split(/\W+/).reject(&:blank?)
-       words.each{|word| yield [word, 1] }
-     end
-   end
-
-   class Reducer < Wukong::Streamer::ListReducer
-     def finalize
-       yield [ key, values.map(&:last).map(&:to_i).sum ]
-     end
-   end
- end
-
- Wukong::Script.new(
-   WordCount::Mapper,
-   WordCount::Reducer
- ).run # Execute the script
- </code></pre>
-
- The first class, the Mapper, eats lines and craps @[word, count]@ records: word is the /key/, its count is the /value/.
-
- In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/]
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Structured data stream
-
- You can also use structs to treat your dataset as a stream of objects:
-
- <pre><code>
- require 'wukong'
- require 'my_blog' # defines the blog models
- # structs for our input objects
- Tweet = Struct.new( :id, :created_at, :twitter_user_id,
-   :in_reply_to_user_id, :in_reply_to_status_id, :text )
- TwitterUser = Struct.new( :id, :username, :fullname,
-   :homepage, :location, :description )
- module TwitBlog
-   class Mapper < Wukong::Streamer::RecordStreamer
-     # Watch for tweets by me
-     MY_USER_ID = 24601
-     #
-     # If this tweet is by me, convert it to a Post.
-     #
-     # If it is a tweet not by me, convert it to a Comment that
-     # will be paired with the correct Post.
-     #
-     # If it is a TwitterUser, convert it to a User record and
-     # a user_location record
-     #
-     def process record
-       case record
-       when TwitterUser
-         user     = MyBlog::User.new.merge(record) # grab the fields in common
-         user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
-         yield user
-         yield user_loc
-       when Tweet
-         if record.twitter_user_id == MY_USER_ID
-           post = MyBlog::Post.new.merge record
-           post.link  = "http://twitter.com/statuses/show/#{record.id}"
-           post.body  = record.text
-           post.title = record.text[0..65] + "..."
-           yield post
-         else
-           comment = MyBlog::Comment.new.merge record
-           comment.body    = record.text
-           comment.post_id = record.in_reply_to_status_id
-           yield comment
-         end
-       end
-     end
-   end
- end
- Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
- </code></pre>
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. More info
-
- There are many useful examples (including an actually-useful version of the WordCount script) in the examples/ directory.
-
- h3. Authors
-
- Philip (flip) Kromer (flip@infochimps.org)
-
- Patches submitted by:
- * gemified by Ben Woosley (ben.woosley@gmail.com)
- * ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui@masuidrive.jp - http://blog.masuidrive.jp/
-
- <notextile></div></notextile>
-
- {% include news.html %}
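The word-count script in the deleted index.textile maps each line to @[word, 1]@ pairs, the shuffle groups pairs by key, and the reducer sums each group. The same map/shuffle/reduce shape can be sketched in plain Ruby with no wukong gem or Hadoop required -- a minimal illustration, where the method names are invented for this sketch and are not part of Wukong's API (plain @empty?@ stands in for ActiveSupport's @blank?@):

```ruby
# Mapper stage: each line becomes a stream of [word, 1] pairs.
def map_lines(lines)
  lines.flat_map do |line|
    line.strip.split(/\W+/).reject(&:empty?).map { |word| [word, 1] }
  end
end

# Shuffle + reducer stage: group pairs by word, sum the counts per group.
def reduce_pairs(pairs)
  pairs.group_by(&:first)
       .map { |word, ps| [word, ps.map(&:last).sum] }
       .sort
end

counts = reduce_pairs(map_lines(["hello world", "hello wukong"]))
puts counts.inspect  # => [["hello", 2], ["world", 1], ["wukong", 1]]
```

In the real script, Hadoop streaming performs the shuffle between the Mapper and Reducer classes; here @group_by@ plays that role in-process.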
data/doc/links.textile DELETED
@@ -1,42 +0,0 @@
- h3. Setting up your cluster:
-
- * "Running Hadoop On Ubuntu Linux (Single-Node Cluster)":http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) and "Running Hadoop On Ubuntu Linux (Multi-Node Cluster)":http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
- * "Running Hadoop MapReduce on Amazon EC2 and S3":http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
-
-
-
-
- * "Hadoop Overview by Doug Cutting":http://video.google.com/videoplay?docid=-4912926263813234341 - the founder of the Hadoop project. 49m
-
- * "Cluster Computing and Map|Reduce":http://www.youtube.com/results?search_query=cluster+computing+and+mapreduce
- ** "Lecture 1: Overview":http://www.youtube.com/watch?v=yjPBkvYh-ss
- ** "Lecture 2 (technical): Map|Reduce":http://www.youtube.com/watch?v=-vD6PUdf3Js
- ** "Lecture 3 (technical): GFS (Google File System)":http://www.youtube.com/watch?v=5Eib_H_zCEY
- ** "Lecture 4 (theoretical): Canopy Clustering":http://www.youtube.com/watch?v=1ZDybXl212Q
- ** "Lecture 5 (theoretical): Breadth-First Search":http://www.youtube.com/watch?v=BT-piFBP4fE
-
-
- * http://www.cloudera.com/hadoop-training
-
- ** "Thinking at Scale":http://www.cloudera.com/hadoop-training-thinking-at-scale
- ** "Mapreduce and HDFS":http://www.cloudera.com/hadoop-training-mapreduce-hdfs
- ** "A Tour of the Hadoop Ecosystem":http://www.cloudera.com/hadoop-training-ecosystem-tour
- ** "Programming with Hadoop":http://www.cloudera.com/hadoop-training-programming-with-hadoop
-
- ** "Hadoop and Hive: introduction":http://www.cloudera.com/hadoop-training-hive-introduction
- ** "Hadoop and Hive: tutorial":http://www.cloudera.com/hadoop-training-hive-tutorial
- ** "Hadoop and Pig: Introduction":http://www.cloudera.com/hadoop-training-pig-introduction
- ** "Hadoop and Pig: Tutorial":http://www.cloudera.com/hadoop-training-pig-tutorial
-
- ** "Mapreduce Algorithms":http://www.cloudera.com/hadoop-training-mapreduce-algorithms
- ** "Exercise: Getting started with Hadoop":http://www.cloudera.com/hadoop-training-exercise-getting-started-with-hadoop
- ** "Exercise: Writing mapreduce programs":http://www.cloudera.com/hadoop-training-exercise-writing-mapreduce-programs
-
-
-
-
- ---------------------------------------------------------------------------
-
- * "Hadoop Wiki: Hadoop Streaming":http://wiki.apache.org/hadoop/HadoopStreaming
- * "Hadoop Docs: Hadoop Streaming":http://hadoop.apache.org/common/docs/current/streaming.html
-
data/doc/usage.textile DELETED
@@ -1,102 +0,0 @@
- ---
- layout: default
- title: Usage notes
- ---
-
- h1(gemheader). {{ site.gemname }} %(small):: usage%
-
-
- <notextile><div class="toggle"></notextile>
-
- h2. How to run a Wukong script
-
- To run your script using local files and no connection to a hadoop cluster,
-
- pre. your/script.rb --run=local path/to/input_files path/to/output_dir
-
- To run the command across a Hadoop cluster,
-
- pre. your/script.rb --run=hadoop path/to/input_files path/to/output_dir
-
- You can set the default in the config/wukong-site.yaml file, and then just use @--run@ instead of @--run=something@ -- it will just use the default run mode.
-
- If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths. (Your script path, of course, lives on the local filesystem.)
-
- You can supply arbitrary command line arguments (they wind up as key-value pairs in the options hash your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:
-
- pre. ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
-   --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir
-
- Note that all @--options@ must precede (in any order) all non-options.
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. How to test your scripts
-
- To run the mapper on its own:
-
- pre. cat ./local/test/input.tsv | ./examples/word_count.rb --map | more
-
- or, if your test data lies on the HDFS,
-
- pre. hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
-
- Next, graduate to running in @--run=local@ mode so you can inspect the reducer.
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. What tools does Wukong work with?
-
- Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line. We're looking forward to being friends with "martinis":http://datamapper.org and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Design
-
- ...
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Caveats
-
- ...
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. TODOs
-
- ...
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Note on Patches/Pull Requests
-
- * Fork the project.
- * Make your feature addition or bug fix.
- * Add tests for it. This is important so I don't break it in a future version unintentionally.
- * Commit, but do not mess with the rakefile, version, or history. (If you want to have your own version, that is fine -- but bump the version in a commit by itself, so I can ignore it when I pull.)
- * Send me a pull request. Bonus points for topic branches.
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Endnotes
-
-
- h3. Why is it called Wukong?
-
- Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
-
- bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.
-
- bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]
-
- The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
-
-
- * What's up with Wukong::AndPig?
- ** @Wukong::AndPig@ is a small library to more easily generate code for the "Pig":http://hadoop.apache.org/pig data analysis language. See its "README":wukong/and_pig/README.textile for more.
-
-
-
- <notextile></div></notextile>
-
data/doc/utils.textile DELETED
@@ -1,48 +0,0 @@
-
- <something to tab and align table>
-
-
- * uniq - report or filter out repeated lines in a file
- ** -c produces line<tab>count
- ** --ignore f1,f2,... discards given fields from consideration. field syntax same as for cut, etc.
-
- * sort - sort lines of text files
- ** columns indexed as tab-separated
- ** can specify any column order, uses same field spec as cut
- * tsort - topological sort of a directed graph
-
- * cut - select portions of each line of a file
- ** can reorder columns
- * nl - line numbering filter
- ** takes prefix, suffix
- ** count \t line -OR- line \t count
-
- * wc - word, line, character, and byte count
- ** field count (tab-separated fields)
- * paste - merge corresponding or subsequent lines of files
- * expand, unexpand - expand tabs to spaces, and vice versa
- * seq
- * simple row, column sums
- * join - relational database operator
- * tac
-
- * cat - concatenate and print files
- * head - display first lines of a file
- * tail - display the last part of a file
- * shuf
- * split - split a file into pieces
- * csplit - split files based on context
- * tee - pipe fitting
-
- * ls - list directory contents.
- * df - display free disk space
- * du - display disk usage statistics
- ** tab-delimited, space aligned
-
- * od - octal, decimal, hex, ASCII dump
- * printf - formatted output
- * cksum, sum - display file checksums and block counts
- * md5sum
-
- * diff
- * comm
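The deleted utils.textile sketches tab-separated-field wrappers around the classic Unix tools. The flavor of pipeline they target can be shown with the stock tools alone -- a small illustration, with input data invented for the example:

```shell
# Count occurrences of the first tab-separated field:
# cut projects the field, sort groups identical values, uniq -c counts them,
# and awk normalizes uniq's left-padding into clean count<tab>value output.
printf 'alice\tx\nbob\ty\nalice\tz\n' \
  | cut -f1 \
  | sort \
  | uniq -c \
  | awk '{print $1 "\t" $2}'
```

This prints @2<tab>alice@ and @1<tab>bob@ -- the per-key counting that the wutils aim to make a one-flag operation.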
data/examples/and_pig/sample_queries.rb DELETED
@@ -1,128 +0,0 @@
- #!/usr/bin/env ruby
- $: << File.dirname(__FILE__) + '/../../lib'
- require 'wukong'         ; include Wukong
- require 'wukong/and_pig' ; include Wukong::AndPig
-
- # PIG_DIR = '/usr/local/share/pig'
- PIG_DIR = '/public/share/pig'
- # full pathname to the pig executable
- # Wukong::AndPig::PIG_EXECUTABLE = "#{PIG_DIR}/bin/pig"
- Wukong::AndPig::PIG_EXECUTABLE = "/public/bin/pig -x local"
-
- #
- HDFS_BASE_DIR = 'foo/meta/lang'
- Wukong::AndPig::PigVar.working_dir = HDFS_BASE_DIR
- Wukong::AndPig.comments = false
- # Wukong::AndPig.emit_dest = :captured
-
- Wukong::AndPig::PigVar.emit "REGISTER #{PIG_DIR}/contrib/piggybank/java/piggybank.jar"
-
- #
- # Load basic types
- #
-
- # class Token < Struct.new(:rsrc, :context, :user_id, :token, :usages)
- # end
- # :tokens_users_0 << Token.pig_load('meta/datanerds/token_count/users_tokens')
- # :tokens_users_0 << Token.pig_load('/tmp/users_tokens.tsv')
- # :tokens_users   << :tokens_users_0.generate(:user_id, :token, :usages)
- # :tokens_users.checkpoint!
-
- class Token < TypedStruct.new(
-     [:user_id, Integer], [:token, String], [:usages, Integer])
- end
- :tokens_users << Token.pig_load('/tmp/users_tokens.tsv')
- :tokens_users.describe
-
- pig_comment %Q{
- # ***************************************************************************
- #
- # Global totals
- #
- # Each row in Tokens lists a (user, token, usages)
- # We want
- #   Sum of all usage counts  = total tokens seen in tweet stream.
- #   Number of distinct tokens
- #   Number of distinct users <- different than total in twitter_users.tsv
- #     because we want only users that say stuff.
- }
-
- def count_distinct relation, field, options={}
-   result_name = options[:as] || "#{relation.name}_#{field}_count".to_sym
-   a = relation.
-     generate(field).set!.describe.
-     distinct(options).set!
-   result_name << a.
-     group(:all).set!.
-     generate(["COUNT(#{a.relation}.#{field})", :u_count, Integer]).set!
- end
-
- pig_comment "Count Users"
- tok_users_count  = count_distinct(:tokens_users, :user_id).checkpoint!
-
- pig_comment "Count Tokens"
- tok_tokens_count = count_distinct(:tokens_users, :token, :parallel => 10).checkpoint!
-
-
- pig_comment %Q{
- # ***************************************************************************
- #
- # Statistics for each user
- }
-
- def user_stats users_tokens
-   users_tokens.describe.
-     group( :user_id).set!.describe.
-     generate(
-       [:group,                                      :user_id],
-       ["(int)COUNT(#{users_tokens.relation})",      :tot_tokens, Integer],
-       ["(int)SUM(#{users_tokens.relation}.usages)", :tot_usages, Integer],
-       ["FLATTEN(#{users_tokens.relation}.token)",   :token,      String ],
-       ["FLATTEN(#{users_tokens.relation}.usages)",  :usages,     Integer]).set!.describe.
-     # ["FLATTEN(#{users_tokens.relation}.(token, usages) )", [:token, :usages], TypedStruct.new([:token, String], [:usages, Integer])]).set!.
-     generate(:user_id, :token, :usages,
-       ["(float)(1.0*usages / tot_usages)",                                     :usage_pct,    Float],
-       ["(float)(1.0*usages / tot_usages) * (1.0*(float)usages / tot_usages)",  :usage_pct_sq, Float]).set!
- end
-
- :user_stats << user_stats(:tokens_users)
- :user_stats.describe.checkpoint!
- puts "UserStats = LOAD 'foo/meta/lang/user_stats' AS (user_id, token, usages, usage_pct, usage_pct_sq) ;"
-
- UserStats = TypedStruct.new([:user_id,      Integer],
-   [:token,        String],
-   [:usages,       Integer],
-   [:usage_pct,    Float],
-   [:usage_pct_sq, Float])
- :user_stats << UserStats.pig_load('foo/meta/lang/user_stats')
-
- def range_and_dispersion user_stats
-
-   n_users  = 436
-   n_tokens = 61630
-
-   token_stats = user_stats.group(:token).set!
-   token_stats = token_stats.foreach(
-     ["(float)SUM(#{user_stats.relation}.usage_pct) / #{n_users.to_f}", :avg_uspct   ],
-     ["(float)SUM(#{user_stats.relation}.usage_pct_sq)",                :sum_uspct_sq],
-     ["org.apache.pig.piggybank.evaluation.math.SQRT(
-         (sum_uspct_sq /436) -
-         ( (SUM(#{user_stats.relation}.usage_pct)/436.0) * (SUM(#{user_stats.relation}.usage_pct)/436.0) )
-       )", :stdev_uspct],
-     ["1 - ( ( stdev_uspct / avg_uspct ) / org.apache.pig.piggybank.evaluation.math.SQRT(436.0 - 1.0) )", :dispersion],
-     [
-       [:group, :token, String ],
-       ["(int)COUNT(#{user_stats.relation}) ",                                  :range,      Integer],
-       ["(int)COUNT(#{user_stats.relation}) / #{n_users.to_f}",                 :pct_range,  Integer],
-       ["(int)SUM( #{user_stats.relation}.usages)",                             :tot_usages, Integer],
-       ["(float)( 1.0e6*SUM(#{user_stats.relation}.usages) / #{n_tokens.to_f})", :ppm_usages, Float],
-       [:avg_uspct,   :avg_uspct],
-       [:stdev_uspct, :stdev_uspct],
-       [:dispersion,  :dispersion]
-     ]
-   ).set!
- end
-
- range_and_dispersion(:user_stats).checkpoint!
-
- Wukong::AndPig.finish
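The @count_distinct@ helper in the deleted sample_queries.rb emits a Pig pipeline: project one field, apply DISTINCT, then GROUP ALL and COUNT the result. For readers who just want the aggregation it computes, here is a plain-Ruby restatement -- a sketch only, with rows invented for illustration, not part of the Wukong::AndPig API:

```ruby
# Plain-Ruby equivalent of the count_distinct Pig pipeline above:
# project the field, keep distinct values, count them.
def count_distinct(rows, field)
  rows.map { |row| row[field] }.uniq.size
end

tokens_users = [
  { user_id: 1, token: "pig",    usages: 3 },
  { user_id: 1, token: "hadoop", usages: 1 },
  { user_id: 2, token: "pig",    usages: 2 },
]

count_distinct(tokens_users, :user_id)  # => 2
count_distinct(tokens_users, :token)    # => 2
```

The Pig version does the same thing out-of-core across the cluster; the @:parallel => 10@ option in the script maps to Pig's PARALLEL clause for the reduce stage.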