wukong 1.5.2 → 1.5.3

@@ -1,3 +1,8 @@
1
+ h2. Wukong v1.5.3
2
+
3
+ * A couple of bugfixes. Sorry about that.
4
+ * Documentation fixes
5
+
1
6
  h2. Wukong v1.5.0
2
7
 
3
8
  h4. Elastic Map-Reduce
@@ -16,6 +21,11 @@ Incompatible changes to option handling and script launching:
16
21
  * Script doesn't use extra_options any more. You should relocate them to the initializer or to configliere.
17
22
  * there is no more default_mapper or default_reducer
18
23
 
24
+ h2. Wukong v1.4.12 2010-08-31
25
+
26
+ * Improvements to the pig conversion methods
27
+ * @hdp-rm@ respects the -skipTrash option
28
+
19
29
 
20
30
  h2. Wukong v1.4.11 2010-07-30
21
31
 
@@ -30,9 +30,7 @@ I'm pushing to release "Wukong 3.0 the actual 1.0 release".
30
30
  * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
31
31
  * May make some things that are derived classes into mixin'ed modules
32
32
  * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
33
-
34
-
35
- *
33
+
36
34
 
37
35
  h2. Help!
38
36
 
@@ -0,0 +1,3 @@
1
+ #!/usr/bin/env bash
2
+
3
+ exec hadoop dfs -cp "$@"
@@ -7,7 +7,7 @@ USAGE= %Q{
7
7
  # space aligned file that is still suitable for further processing. For example,
8
8
  # given the log-file input
9
9
  #
10
- # <pre><code>
10
+ # # cat tag_usage.tsv
11
11
  # 2009-07-21T21:39:40 day 65536 3.15479 68750 1171316
12
12
  # 2009-07-21T21:39:45 doing 65536 1.04533 26230 1053956
13
13
  # 2009-07-21T21:41:53 hapaxlegomenon 65536 0.87574e-05 23707 10051141
@@ -15,11 +15,10 @@ USAGE= %Q{
15
15
  # 2009-07-21T21:44:29 world 65536 1.09110 32850 200916
16
16
  # 2009-07-21T21:44:39 world+series 65536 0.49380 9929 7972025
17
17
  # 2009-07-21T21:44:54 iranelection 65536 2.91775 14592 136342
18
- # </code></pre>
19
18
  #
20
19
  # wulign will reformat it to read
21
20
  #
22
- # <pre><code>
21
+ # # cat tag_usage.tsv | wu-lign
23
22
  # 2009-07-21T21:39:40 day 65536 3.154791234 68750 1171316
24
23
  # 2009-07-21T21:39:45 doing 65536 1.045330000 26230 1053956
25
24
  # 2009-07-21T21:41:53 hapaxlegomenon 65536 0.000008757 23707 10051141
@@ -27,65 +26,61 @@ USAGE= %Q{
27
26
  # 2009-07-21T21:44:29 world 65536 1.091100000 32850 200916
28
27
  # 2009-07-21T21:44:39 world+series 65536 0.493800000 9929 7972025
29
28
  # 2009-07-21T21:44:54 iranelection 65536 2.917750000 14592 136342
30
- # </code></pre>
31
29
  #
32
30
  # The fields are still tab-delimited by exactly one tab -- only spaces are used to
33
31
  # pad out fields. You can still use cuttab and friends to manipulate columns.
34
32
  #
35
- # wulign isn't intended to be smart, or correct, or reliable -- only to be
36
- # useful for previewing and organizing tab-formatted files. In general
37
- # @wulign(foo).split("\t").map(&:strip)@ *should* give output semantically
38
- # equivalent to its input. (That is, the only changes should be insertion of
39
- # spaces and re-formatting of numerics.) But still -- reserve its use for human
40
- # inspection only.
41
- #
42
- # (Note: tab characters in this source code file have been converted to spaces;
43
- # replace whitespace with tab in the first example if you'd like to play along at
44
- # home.)
45
- #
46
- # h2. How it works
47
- #
48
- # Wulign takes the first 1000 lines, splits by TAB characters into fields, and
49
- # tries to guess the format -- int, float, or string -- for each. It builds a
50
- # consensus of the width and type for corresponding columns in the chunk. If a
51
- # column has mixed numeric and string formats it degrades to :mixed, which is
52
- # basically treated as :string. If a column has mixed :float and :int elements all
53
- # of them are formatted as float.
54
- #
55
33
  # h2. Command-line arguments
56
34
  #
57
35
  # You can give sprintf-style positional arguments on the command line that will be
58
36
  # applied to the corresponding columns. (Blank args are used for placeholding and
59
37
  # auto-formatting is still applied). So with the example above,
60
38
  #
61
- # @cat foo | wulign '' '' '' '%8.4e'@
39
+ # cat foo | wulign '' '' '' '%8.4e'
62
40
  #
63
41
  # will format the fourth column with "%8.4e", while the first three columns and
64
42
  # fifth-and-higher columns are formatted as usual.
65
43
  #
66
- # <pre><code>
67
44
  # ...
68
45
  # 2009-07-21T21:39:45 doing 65536 1.0453e+00 26230 1053956
69
46
  # 2009-07-21T21:41:53 hapaxlegomenon 65536 8.7574e-06 23707 10051141
70
47
  # 2009-07-21T21:44:00 concert 500 2.9290e-01 13367 9733414
71
48
  # ....
72
- # </code></pre>
49
+ #
50
+ # h2. How it works
51
+ #
52
+ # Wu-lign takes the first 500ish lines, splits into fields on TAB characters,
53
+ # and tries to guess the format (int, float, or string) for each. It builds a
54
+ # consensus of the width and type for corresponding columns in the chunk. If a
55
+ # column has mixed numeric and string formats it degrades to :mixed, which is
56
+ # basically treated as :string. If a column has mixed :float and :int elements all
57
+ # of them are formatted as float.
73
58
  #
74
59
  # h2. Notes
75
60
  #
76
- # * It has no knowledge of header rows. An all-text first line will screw everything up.
61
+ # * Header rows: the first line is used for width alignment but not for type detection.
62
+ # This means that an initial row of text headers will inform column spacing
63
+ # but still allow a column of floats (say) to be properly aligned as floats.
77
64
  #
78
- # * It also requires a unanimous vote. One screwy line can coerce the whole mess
79
- # to :mixed; width formatting will still be applied, though.
65
+ # * It requires a unanimous vote. One screwy line can coerce the whole mess to
66
+ # :mixed; width formatting will still be applied, though.
80
67
  #
81
- # * It won't set columns wider than 70 chars -- this allows for the occasional
68
+ # * It won't set columns wider than 100 chars -- this allows for the occasional
82
69
  # super-wide column without completely breaking your screen.
83
70
  #
84
71
  # * For :float values, wulign tries to guess at the right number of significant
85
72
  # digits to the left and right of the decimal point.
86
73
  #
87
- # * wulign does not parse 'TSV files' in their strict sense -- there is no quoting
88
- # or escaping; every tab delimits a field, every newline a record.
74
+ # * wulign parses only plain-jane 'TSV files': no quoting or escaping; every tab
75
+ # delimits a field, every newline a record.
76
+ #
77
+ # wulign isn't intended to be smart, or correct, or reliable -- only to be
78
+ # useful for previewing and organizing tab-formatted files. In general
79
+ # wulign(foo).split("\t").map(&:strip) *should* give output semantically
80
+ # equivalent to its input. (That is, the only changes should be insertion of
81
+ # spaces and re-formatting of numerics.) But still -- reserve its use for human
82
+ # inspection only.
83
+ #
89
84
  }
90
85
 
91
86
  if ARGV[0] == '--help'
@@ -111,7 +106,7 @@ def get_type val
111
106
  when val == '' then type = nil
112
107
  when val =~ INT_RE then type = :int
113
108
  when val =~ FLOAT_RE then type = :float
114
- else type = :str end
109
+ else type = :str end
115
110
  end
116
111
 
117
112
  def consensus_type val, alltype, is_first
@@ -149,7 +144,7 @@ FORMAT_GUESSING_LINES.times do
149
144
  row.each_with_index{|col,i|
150
145
  next if skip_col[i]
151
146
  # Let the first row be text (headers)
152
- col_types[i] = consensus_type(col, col_types[i], rows.length == 1)
147
+ col_types[i] = consensus_type(col, col_types[i], rows.length == 1)
153
148
  if col_types[i] == :float
154
149
  mantissa, radix = f_width(col)
155
150
  col_minmag[i] = [radix, col_minmag[i], 1].compact.max
@@ -175,7 +170,7 @@ def dump_row row, format
175
170
  puts row.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
176
171
  end
177
172
  def dump_header row, maxw
178
- puts row.zip(maxw).map{|col, width| "%-#{width}s" % col.to_s }.join("\t")
173
+ puts row.zip(maxw).map{|col, width| "%-#{width}s" % col.to_s }.join("\t")
179
174
  end
180
175
 
181
176
  pad = [''] * maxw.length
@@ -38,7 +38,7 @@ wu-lign isn't intended to be smart, or correct, or reliable -- only to be useful
38
38
 
39
39
  h2. How it works
40
40
 
41
- Wu-Lign takes the first 1000 lines, splits by TAB characters into fields, and tries to guess the format -- int, float, or string -- for each. It builds a consensus of the width and type for corresponding columns in the chunk. If a column has mixed numeric and string formats it degrades to :mixed, which is basically treated as :string. If a column has mixed :float and :int elements all of them are formatted as float.
41
+ Wu-Lign takes the first 500ish lines, splits into fields on TAB characters, and tries to guess the format (int, float, or string) for each. It builds a consensus of the width and type for corresponding columns in the chunk. If a column has mixed numeric and string formats it degrades to :mixed, which is basically treated as :string. If a column has mixed :float and :int elements all of them are formatted as float.
42
42
 
43
43
  h2. Command-line arguments
44
44
 
@@ -58,8 +58,8 @@ will format the fourth column with "%8.4e", while the first three columns and fi
58
58
 
59
59
  h2. Notes
60
60
 
61
- * It has no knowledge of header rows. An all-text first line will screw everything up.
62
- * It also requires a unanimous vote. One screwy line can coerce the whole mess to :mixed; width formatting will still be applied, though.
63
- * It won't set columns wider than 70 chars -- this allows for the occasional super-wide column without completely breaking your screen.
64
- * For :float values, wu-lign tries to guess at the right number of significant digits to the left and right of the decimal point.
65
- * wu-lign does not parse 'TSV files' in their strict sense -- there is no quoting or escaping; every tab delimits a field, every newline a record.
61
+ * Header rows: the first line is used for width alignment but not for type detection. This means that an initial row of text headers will inform column spacing but still allow a column of floats (say) to be properly aligned as floats.
62
+ * It requires a unanimous vote. One screwy line can coerce the whole mess to :mixed; width formatting will still be applied, though.
63
+ * It won't set columns wider than 100 chars -- this allows for the occasional super-wide column without completely breaking your screen.
64
+ * For :float values, wulign tries to guess at the right number of significant digits to the left and right of the decimal point.
65
+ * wulign parses only plain-jane 'TSV files': no quoting or escaping; every tab delimits a field, every newline a record.
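The consensus rule described under "How it works" and "Notes" can be sketched in a few lines of Ruby. This is an illustrative sketch under stated assumptions, not wu-lign's own code: `sketch_type`, `sketch_consensus`, `column_type` and both regexes are hypothetical (the real `INT_RE` and `FLOAT_RE` are not shown in this diff).

```ruby
# Assumed patterns; the real INT_RE / FLOAT_RE in wu-lign are not shown in this diff.
SKETCH_INT_RE   = /\A[+-]?\d+\z/
SKETCH_FLOAT_RE = /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/

# Classify a single cell: nil for blank, else :int, :float or :str.
def sketch_type(val)
  case val
  when ''              then nil
  when SKETCH_INT_RE   then :int
  when SKETCH_FLOAT_RE then :float
  else                      :str
  end
end

# Fold one cell's type into the running type for its column.
def sketch_consensus(val, alltype, is_first)
  type = sketch_type(val)
  return alltype if type.nil? || is_first   # blanks and the header row cast no vote
  return type    if alltype.nil?            # first real vote sets the type
  return alltype if alltype == type         # unanimous so far
  return :float  if [alltype, type].sort == [:float, :int]  # int + float => float
  :mixed                                    # anything else degrades to :mixed
end

# Hypothetical helper: run the consensus over one column, header first.
def column_type(cells)
  cells.each_with_index.inject(nil) { |alltype, (val, i)| sketch_consensus(val, alltype, i == 0) }
end

p column_type(["score", "1", "2.5", ""])   # => :float
p column_type(["name", "3", "abc"])        # => :mixed
```

The text header in the first row does not force the column to `:mixed`, which matches the "Header rows" note above; a single stray string in a later row still does.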
@@ -3,6 +3,15 @@ Dir[File.dirname(__FILE__)+'/vendor/**/lib'].each{|dir| $: << dir }
3
3
  require 'rubygems'
4
4
  require 'wukong'
5
5
 
6
+ #
7
+ # * Copy the emr.yaml from here into ~/.wukong/emr.yaml
8
+ # and edit it to suit.
9
+ # * Download the Amazon elastic-mapreduce runner. Get a copy from
10
+ # http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
11
+ # * Find out what breaks, fix it or ask us for help (coders@infochimps.org) and
12
+ # submit a patch
13
+ #
14
+
6
15
  class FooStreamer < Wukong::Streamer::LineStreamer
7
16
  def initialize *args
8
17
  super *args
@@ -0,0 +1,52 @@
1
+ #
2
+ # Elastic MapReduce config in wukong
3
+ #
4
+
5
+ #
6
+ # Infrastructure options
7
+ #
8
+
9
+ # == Fill all your information into yet another file with your amazon key. Sorry
10
+ # that it needs to be in so many stupid places; nobody can agree on a
11
+ # filename or format.
12
+ :emr_credentials_file: ~/.wukong/credentials.json
13
+ #
14
+ # == Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars, or enter them here:
15
+ # :access_key: ASDFAHKHASDF
16
+ # :secret_access_key: ADSGHASDFJASDFASDF
17
+ #
18
+ # == Path to your keypair file.
19
+ :key_pair_file: ~/.wukong/keypairs/gibbon.pem
20
+ # == Keypair will be named after your file, or force the name:
21
+ # :key_pair: ~
22
+
23
+ # == Path to the Amazon elastic-mapreduce runner. Get a copy from
24
+ # http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
25
+ :emr_runner: ~/ics/hadoop/elastic-mapreduce/elastic-mapreduce
26
+
27
+ #
28
+ # Cluster Config
29
+ #
30
+ :num_instances: 1
31
+ :instance_type: m2.xlarge
32
+ :master_instance_type: ~
33
+ :hadoop_version: '0.20'
34
+ :availability_zone: us-east-1b
35
+
36
+ #
37
+ # Running and reporting options
38
+ #
39
+ :alive: false
40
+ :enable_debugging: true
41
+ :emr_runner_verbose: true
42
+ :emr_runner_debug: ~
43
+ :step_action: CANCEL_AND_WAIT # CANCEL_AND_WAIT, TERMINATE_JOB_FLOW or CONTINUE
44
+
45
+ #
46
+ # Remote Paths
47
+ #
48
+ # Wukong is opinionated about the paths and locations of scripts and
49
+ # everything. Make an S3 bucket and let the wookiee win -- or hack
50
+ # lib/wukong/script/emr_command.rb to be more flexible and send us back a patch.
51
+ #
52
+ :emr_root: s3n://emr.infinitemonkeys.info
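The keys in this file begin with a colon, so Ruby's YAML parser reads them as symbols. The following is a minimal sketch of that behaviour using a few keys copied from the file above. How Wukong itself reads `emr.yaml` is not shown in this diff, so the loading call here is an assumption, and `permitted_classes:` assumes Psych 3.1 or later (Ruby 2.6+).

```ruby
require 'yaml'

# A few keys copied from the emr.yaml above.
EMR_YAML_SNIPPET = <<-EOS
:num_instances: 1
:instance_type: m2.xlarge
:master_instance_type: ~
:hadoop_version: '0.20'
:alive: false
:step_action: CANCEL_AND_WAIT # CANCEL_AND_WAIT, TERMINATE_JOB_FLOW or CONTINUE
EOS

# safe_load refuses symbols unless Symbol is explicitly permitted.
config = YAML.safe_load(EMR_YAML_SNIPPET, permitted_classes: [Symbol])

p config[:num_instances]         # an Integer
p config[:master_instance_type]  # "~" is YAML's null, so this is nil
p config[:hadoop_version]        # quoted, so it stays a String rather than a Float
p config[:step_action]           # the trailing comment is dropped
```

Quoting `'0.20'` matters: unquoted, it would be parsed as the float 0.2.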
@@ -5,7 +5,9 @@ require 'wukong/bad_record'
5
5
  autoload :TypedStruct, 'wukong/typed_struct'
6
6
  require 'configliere'; Configliere.use :define
7
7
  module Wukong
8
- autoload :Dfs, 'wukong/dfs'
9
- autoload :Script, 'wukong/script'
10
- autoload :Streamer, 'wukong/streamer'
8
+ autoload :Dfs, 'wukong/dfs'
9
+ autoload :Script, 'wukong/script'
10
+ autoload :Streamer, 'wukong/streamer'
11
+ autoload :Store, 'wukong/store'
12
+ autoload :FilenamePattern, 'wukong/filename_pattern'
11
13
  end
@@ -0,0 +1,73 @@
1
+ module Wukong
2
+ class FilenamePattern
3
+ # the filename pattern, e.g. 'ripd/:handle/:date/:handle+:timestamp-:pid-:hostname.tsv'
4
+ attr_accessor :pattern
5
+ # custom token replacements
6
+ attr_accessor :token_val_defaults
7
+
8
+ DEFAULT_PATTERN_STR = ":dest_dir/:handle_prefix/:handle/:date/:handle:timestamp-:pid-:hostname.tsv"
9
+
10
+ def initialize pattern, token_val_defaults={}
11
+ self.pattern = pattern
12
+ self.token_val_defaults = token_val_defaults
13
+ end
14
+
15
+ #
16
+ # walk through pattern, replacing tokens (eg :time or :pid) with the
17
+ # corresponding value.
18
+ #
19
+ def make token_vals={}
20
+ token_vals = token_val_defaults.merge token_vals
21
+ token_vals[:timestamp] ||= Time.now.utc.strftime("%Y%m%d%H%M%S")
22
+ # CHH_NOTE: The following is broken for patterns that need a ":" or
23
+ # patterns that need text following a token with no special chars in
24
+ # between.
25
+ val = pattern.gsub(/:(\w+)/){ replace($1, token_vals) }
26
+ val
27
+ end
28
+
29
+ def to_s token_vals={}
30
+ make token_vals
31
+ end
32
+
33
+ #
34
+ # substitute for token
35
+ #
36
+ def replace token, token_vals
37
+ token = token.to_sym
38
+ return token_vals[token] if token_vals.include? token
39
+ case token
40
+ when :pid then pid
41
+ when :hostname then hostname
42
+ when :handle then token_vals[:handle]
43
+ when :handle_prefix then token_vals[:handle].to_s[0..5]
44
+ when :timestamp then token_vals[:timestamp]
45
+ when :date then token_vals[:timestamp][ 0..7]
46
+ when :time then token_vals[:timestamp][ 8..13]
47
+ when :hour then token_vals[:timestamp][ 8..9]
48
+ when :h4 then "%0.2d" % (( token_vals[:timestamp][8..9].to_i / 4 ) * 4)
49
+ when :min then token_vals[:timestamp][10..11]
50
+ when :sec then token_vals[:timestamp][12..13]
51
+ when :s10 then "%0.2d" % (( token_vals[:timestamp][12..13].to_i / 10 ) * 10)
52
+ else
53
+ raise "Don't know how to encode token #{token} #{token_vals[token]}"
54
+ end
55
+ end
56
+
57
+ # Memoized: the hostname for the machine running this script.
58
+ def hostname
59
+ @hostname ||= ENV['HOSTNAME'] || `hostname`.delete("\n")
60
+ end
61
+ # Memoized: the Process ID for this invocation.
62
+ def pid
63
+ @pid ||= Process.pid
64
+ end
65
+
66
+ # Characters deemed safe in a filename;
67
+ SAFE_CHARS = 'a-zA-Z0-9_\-\.\+\/\;'
68
+ def self.sanitize str
69
+ str.gsub(%r{[^#{SAFE_CHARS}]+}, '-')
70
+ end
71
+
72
+ end
73
+ end
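The token substitution in `make` and `replace` can be tried in isolation. The sketch below is a stand-alone approximation, not the class itself: `expand_pattern` is a hypothetical helper, it handles only a few of the tokens listed above, and the paths and handle are made-up values.

```ruby
# Hypothetical stand-alone helper approximating FilenamePattern#make.
def expand_pattern(pattern, token_vals = {})
  vals = { :timestamp => Time.now.utc.strftime("%Y%m%d%H%M%S") }.merge(token_vals)
  pattern.gsub(/:(\w+)/) do
    token = $1.to_sym
    # Explicit values win over derived ones, as in #replace.
    next vals[token].to_s if vals.key?(token)
    case token
    when :date then vals[:timestamp][0..7]
    when :hour then vals[:timestamp][8..9]
    when :h4   then "%0.2d" % ((vals[:timestamp][8..9].to_i / 4) * 4)
    else raise ArgumentError, "Don't know how to encode token #{token}"
    end
  end
end

path = expand_pattern(":rootdir/:date/:handle+:timestamp-h:h4.tsv",
                      :rootdir   => "/data/ripd",
                      :handle    => "tweets",
                      :timestamp => "20100819143000")
puts path   # prints /data/ripd/20100819/tweets+20100819143000-h12.tsv
```

The limitation flagged in the CHH_NOTE shows up with a pattern such as `":handle_v2.tsv"`: because `\w+` is greedy, the token is read as `handle_v2` rather than `handle` followed by literal text, and the sketch raises.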
@@ -66,17 +66,17 @@ module TokyoDbConnection
66
66
  ].freeze unless defined?(TokyoDbConnection::TyrantDb::DB_SERVERS)
67
67
 
68
68
  DB_PORTS = {
69
- :screen_names => 12002,
70
- :search_ids => 12003,
69
+ :tw_screen_names => 12002,
70
+ :tw_search_ids => 12003,
71
71
  #
72
- :tw_user_info => 14000,
73
- :tw_wordbag => 14101,
74
- :tw_influence => 14102,
75
- :tw_trstrank => 14103,
76
- :tw_conversation => 14104,
72
+ :tw_user_info => 14000,
73
+ :tw_wordbag => 14101,
74
+ :tw_influence => 14102,
75
+ :tw_trstrank => 14103,
76
+ :tw_conversation => 14104,
77
77
  #
78
- :screen_names2 => 12004,
79
- :search_ids2 => 12005,
78
+ :tw_screen_names2 => 12004,
79
+ :tw_search_ids2 => 12005,
80
80
  #
81
81
  :tw_user_info2 => 14200,
82
82
  :tw_wordbag2 => 14201,
@@ -84,7 +84,7 @@ module TokyoDbConnection
84
84
  :tw_trstrank2 => 14203,
85
85
  :tw_conversation2 => 14204,
86
86
  :tw_strong_links2 => 14205,
87
- :tw_word_stats2 => 14206,
87
+ :tw_word_stats2 => 14210,
88
88
  #
89
89
  :ip_geo_census => 14400,
90
90
  } unless defined?(TokyoDbConnection::TyrantDb::DB_PORTS)
@@ -123,7 +123,7 @@ module TokyoDbConnection
123
123
  end
124
124
 
125
125
  def handle_error action, e
126
- warn "#{action} failed: #{e} #{e.backtrace.join("\t")}" ;
126
+ Log.warn "#{action} failed: #{e} #{e.backtrace.join("\t")}" ;
127
127
  invalidate!
128
128
  end
129
129
 
@@ -1,72 +1,70 @@
1
- module Monkeyshines
2
- module Monitor
3
- #
4
- # Accepts a lightweight call every iteration.
5
- #
6
- # Once either a time or an iteration criterion is met, executes the block
7
- # and resets the timer until next execution.
8
- #
9
- # Note that the +time_interval+ is measured *excution to execution* and not
10
- # in multiples of iter_interval. Say I set a time_interval of 300s, and
11
- # happen to iterate at 297s and 310s after start. Then the monitor will
12
- # execute at 310s, and the next execution will happen on or after 610s.
13
- #
14
- # Also note that when *either* criterion is met, *both* criteria are
15
- # reset. Say I set a time interval of 300s and an +iter_interval+ of 10_000;
16
- # and that at 250s I reach iteration 10_000. Then the monitor will execute
17
- # on or after 20_000 iteration or 550s, whichever happens first.
18
- #
19
- class PeriodicMonitor
20
- attr_accessor :time_interval, :iter_interval
21
- attr_accessor :last_time, :current_iter, :iter, :started_at
1
+ module Wukong::Monitor
2
+ #
3
+ # Accepts a lightweight call every iteration.
4
+ #
5
+ # Once either a time or an iteration criterion is met, executes the block
6
+ # and resets the timer until next execution.
7
+ #
8
+ # Note that the +time_interval+ is measured *excution to execution* and not
9
+ # in multiples of iter_interval. Say I set a time_interval of 300s, and
10
+ # happen to iterate at 297s and 310s after start. Then the monitor will
11
+ # execute at 310s, and the next execution will happen on or after 610s.
12
+ #
13
+ # Also note that when *either* criterion is met, *both* criteria are
14
+ # reset. Say I set a time interval of 300s and an +iter_interval+ of 10_000;
15
+ # and that at 250s I reach iteration 10_000. Then the monitor will execute
16
+ # on or after 20_000 iteration or 550s, whichever happens first.
17
+ #
18
+ class PeriodicMonitor
19
+ attr_accessor :time_interval, :iter_interval
20
+ attr_accessor :last_time, :current_iter, :iter, :started_at
22
21
 
23
- def initialize options={}
24
- self.started_at = Time.now.utc.to_f
25
- self.last_time = started_at
26
- self.iter = 0
27
- self.current_iter = 0
28
- self.time_interval = options[:time]
29
- self.iter_interval = options[:iters]
30
- end
22
+ def initialize options={}
23
+ self.started_at = Time.now.utc.to_f
24
+ self.last_time = started_at
25
+ self.iter = 0
26
+ self.current_iter = 0
27
+ self.time_interval = options[:time]
28
+ self.iter_interval = options[:iters]
29
+ end
31
30
 
32
- # True if more than +iter_interval+ has elapsed since last execution.
33
- def enough_iterations?
34
- iter % iter_interval == 0 if iter_interval
35
- end
31
+ # True if more than +iter_interval+ has elapsed since last execution.
32
+ def enough_iterations?
33
+ iter % iter_interval == 0 if iter_interval
34
+ end
36
35
 
37
- # True if more than +time_interval+ has elapsed since last execution.
38
- def enough_time? now
39
- (now - last_time) > time_interval if time_interval
40
- end
36
+ # True if more than +time_interval+ has elapsed since last execution.
37
+ def enough_time? now
38
+ (now - last_time) > time_interval if time_interval
39
+ end
41
40
 
42
- # Time since monitor was created
43
- def since
44
- Time.now.utc.to_f - started_at
45
- end
46
- # Overall iterations per second
47
- def rate
48
- iter.to_f / since.to_f
49
- end
50
- # "Instantaneous" iterations per second
51
- def inst_rate now
52
- current_iter.to_f / (now-last_time).to_f
53
- end
41
+ # Time since monitor was created
42
+ def since
43
+ Time.now.utc.to_f - started_at
44
+ end
45
+ # Overall iterations per second
46
+ def rate
47
+ iter.to_f / since.to_f
48
+ end
49
+ # "Instantaneous" iterations per second
50
+ def inst_rate now
51
+ current_iter.to_f / (now-last_time).to_f
52
+ end
54
53
 
55
- #
56
- # if the interval conditions are met, executes block; otherwise just does
57
- # bookkeeping and returns.
58
- #
59
- def periodically &block
60
- self.iter += 1
61
- self.current_iter += 1
62
- now = Time.now.utc.to_f
63
- if enough_iterations? || enough_time?(now)
64
- block.call(iter, (now-last_time))
65
- self.last_time = now
66
- self.current_iter = 0
67
- end
68
- end
54
+ #
55
+ # if the interval conditions are met, executes block; otherwise just does
56
+ # bookkeeping and returns.
57
+ #
58
+ def periodically &block
59
+ self.iter += 1
60
+ self.current_iter += 1
61
+ now = Time.now.utc.to_f
62
+ if enough_iterations? || enough_time?(now)
63
+ block.call(iter, (now-last_time))
64
+ self.last_time = now
65
+ self.current_iter = 0
69
66
  end
70
-
71
67
  end
72
68
  end
69
+
70
+ end
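A short usage sketch for the monitor follows. To stay self-contained it uses a trimmed, stand-alone copy of the logic above under the hypothetical name `SketchPeriodicMonitor`; only the iteration criterion is exercised, so the result does not depend on wall-clock time.

```ruby
# Trimmed, stand-alone copy of the monitor logic, for illustration only.
class SketchPeriodicMonitor
  attr_accessor :time_interval, :iter_interval, :last_time, :iter

  def initialize(options = {})
    self.last_time     = Time.now.utc.to_f
    self.iter          = 0
    self.time_interval = options[:time]
    self.iter_interval = options[:iters]
  end

  # nil (falsy) when no iteration interval was given
  def enough_iterations?
    iter % iter_interval == 0 if iter_interval
  end

  # nil (falsy) when no time interval was given
  def enough_time?(now)
    (now - last_time) > time_interval if time_interval
  end

  # Cheap bookkeeping on every call; yields only when a criterion is met.
  def periodically
    self.iter += 1
    now = Time.now.utc.to_f
    if enough_iterations? || enough_time?(now)
      yield iter, (now - last_time)
      self.last_time = now
    end
  end
end

monitor  = SketchPeriodicMonitor.new(:iters => 1_000)
fired_at = []
2_500.times { monitor.periodically { |iter, _elapsed| fired_at << iter } }
p fired_at   # => [1000, 2000]
```

As the diff shows, `enough_iterations?` tests the running total `iter` rather than `current_iter`, so the iteration criterion fires on every multiple of `iter_interval` regardless of when the time criterion last fired.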
@@ -32,6 +32,7 @@ module Wukong
32
32
  Settings.define :max_maps_per_node, :jobconf => true, :description => 'mapred.max.maps.per.node', :wukong => true
33
33
  Settings.define :max_maps_per_cluster, :jobconf => true, :description => 'mapred.max.maps.per.cluster', :wukong => true
34
34
  Settings.define :max_record_length, :jobconf => true, :description => 'mapred.linerecordreader.maxlength', :wukong => true # "Safeguards against corrupted data: lines longer than this (in bytes) are treated as bad records."
35
+ Settings.define :min_input_split_size, :jobconf => true, :description => 'mapred.min.split.size', :wukong => true
35
36
  Settings.define :noempty, :description => "don't create zero-byte reduce files (hadoop mode only)", :wukong => true
36
37
 
37
38
  #
@@ -48,14 +49,14 @@ module Wukong
48
49
  hadoop_commandline = [
49
50
  hadoop_runner,
50
51
  "jar #{Settings[:hadoop_home]}/contrib/streaming/hadoop-*streaming*.jar",
51
- hadoop_jobconf_options,
52
- "-D mapred.job.name '#{job_name}",
53
- "-mapper '#{map_commandline}'",
54
- "-reducer '#{reduce_commandline}'",
52
+ "-mapper '#{mapper_commandline}'",
53
+ "-reducer '#{reducer_commandline}'",
55
54
  "-input '#{input_paths}'",
56
55
  "-output '#{output_path}'",
56
+ hadoop_jobconf_options,
57
+ "-jobconf mapred.job.name='#{job_name}'",
57
58
  hadoop_recycle_env,
58
- hadoop_other_args(input_paths, output_path),
59
+ hadoop_other_args,
59
60
  ].flatten.compact.join(" \t\\\n ")
60
61
  Log.info " Launching hadoop!"
61
62
  execute_command!(hadoop_commandline)
@@ -94,7 +95,7 @@ module Wukong
94
95
  # if not, the resulting nil will be elided later
95
96
  def jobconf option
96
97
  if options[option]
97
- "-D %s=%s" % [options.description_for(option), options[option]]
98
+ "-jobconf %s=%s" % [options.description_for(option), options[option]]
98
99
  end
99
100
  end
100
101
 
@@ -1,14 +1,15 @@
1
- module Monkeyshines
1
+ module Wukong
2
2
  module Store
3
- extend FactoryModule
4
- autoload :Base, 'monkeyshines/store/base'
5
- autoload :FlatFileStore, 'monkeyshines/store/flat_file_store'
6
- autoload :ConditionalStore, 'monkeyshines/store/conditional_store'
7
- autoload :ChunkedFlatFileStore, 'monkeyshines/store/chunked_flat_file_store'
8
- autoload :KeyStore, 'monkeyshines/store/key_store'
9
- autoload :TokyoTdbKeyStore, 'monkeyshines/store/tokyo_tdb_key_store'
10
- autoload :TyrantTdbKeyStore, 'monkeyshines/store/tyrant_tdb_key_store'
11
- autoload :TyrantRdbKeyStore, 'monkeyshines/store/tyrant_rdb_key_store'
12
- autoload :ReadThruStore, 'monkeyshines/store/read_thru_store'
3
+ # extend FactoryModule
4
+ autoload :Base, 'wukong/store/base'
5
+ autoload :FlatFileStore, 'wukong/store/flat_file_store'
6
+ # autoload :ConditionalStore, 'monkeyshines/store/conditional_store'
7
+ autoload :ChunkedFlatFileStore, 'wukong/store/chunked_flat_file_store'
8
+ autoload :ChhChunkedFlatFileStore, 'wukong/store/chh_chunked_flat_file_store'
9
+ # autoload :KeyStore, 'monkeyshines/store/key_store'
10
+ # autoload :TokyoTdbKeyStore, 'monkeyshines/store/tokyo_tdb_key_store'
11
+ # autoload :TyrantTdbKeyStore, 'monkeyshines/store/tyrant_tdb_key_store'
12
+ # autoload :TyrantRdbKeyStore, 'monkeyshines/store/tyrant_rdb_key_store'
13
+ # autoload :ReadThruStore, 'monkeyshines/store/read_thru_store'
13
14
  end
14
15
  end
@@ -1,18 +1,16 @@
1
- module Monkeyshines
1
+ module Wukong
2
2
  module Store
3
3
  class Base
4
- attr_accessor :options
5
- def initialize _options={}
6
- self.options = _options
7
- Log.info "Creating #{self.class}"
4
+ def initialize options={}
5
+ Log.info "Creating #{self.class} with #{options.inspect}"
8
6
  end
9
7
 
10
- #
8
+ # Iterate through each object, casting it as a new instance of klass.
11
9
  def each_as klass, &block
12
10
  self.each do |*args|
13
11
  begin
14
12
  item = klass.new *args[1..-1]
15
- rescue Exception => e
13
+ rescue StandardError => e
16
14
  Log.info [args, e.to_s, self].join("\t")
17
15
  raise e
18
16
  end
@@ -0,0 +1,37 @@
1
+ module Wukong
2
+ module Store
3
+ class ChhChunkedFlatFileStore < Wukong::Store::FlatFileStore
4
+ attr_accessor :filename_pattern, :handle, :rootdir
5
+
6
+ # Move to configliere
7
+ Settings.define :chunk_file_pattern, :default => ":rootdir/:date/:handle:timestamp-:pid.tsv",:description => "The pattern for chunked files."
8
+ Settings.define :chunk_file_rootdir, :default => nil, :description => "The root directory for the chunked files."
9
+
10
+ # Note that filemode is inherited from flat_file
11
+
12
+ def initialize options={}
13
+ # super wants a :filename in the options or it will fail. We need to get the initial filename
14
+ # set up before we call super, so we need all of the parts of the pattern set up.
15
+ self.rootdir = options[:rootdir] || Settings[:chunk_file_rootdir]
16
+ self.handle = options[:handle]
17
+ pattern = options[:pattern] || Settings[:chunk_file_pattern]
18
+ self.filename_pattern = FilenamePattern.new(pattern, :handle => handle, :rootdir => self.rootdir)
19
+ options[:filename] = filename_pattern.make()
20
+
21
+ super options
22
+
23
+ self.mkdir!
24
+ end
25
+
26
+ def new_chunk
27
+ new_filename = filename_pattern.make()
28
+ Log.info "Rotating chunked file #{filename} into #{new_filename}"
29
+ self.flush
30
+ self.close
31
+ @filename = new_filename
32
+ self.mkdir!
33
+ end
34
+
35
+ end
36
+ end
37
+ end
@@ -1,22 +1,29 @@
1
- module Monkeyshines
1
+ module Wukong
2
2
  module Store
3
- class ChunkedFlatFileStore < Monkeyshines::Store::FlatFileStore
4
- attr_accessor :filename_pattern, :chunk_monitor, :handle
3
+ class ChunkedFlatFileStore < Wukong::Store::FlatFileStore
4
+ attr_accessor :filename_pattern, :chunk_monitor, :handle, :chunktime, :rootdir
5
5
 
6
- DEFAULT_OPTIONS = {
7
- :chunktime => 4*60*60, # default 4 hours
8
- :pattern => ":rootdir/:date/:handle+:timestamp-:pid.tsv",
9
- :rootdir => nil,
10
- :filemode => 'w',
11
- }
6
+ # Move to configliere
7
+ Settings.define :chunk_file_pattern, :default => ":rootdir/:date/:handle:timestamp-:pid.tsv",:description => "The pattern for chunked files."
8
+ Settings.define :chunk_file_chunktime, :default => 4*60*60,:description => "The time interval to keep a chunk file open."
9
+ Settings.define :chunk_file_rootdir, :default => nil, :description => "The root directory for the chunked files."
10
+
11
+ # Note that filemode is inherited from flat_file
12
12
 
13
- def initialize _options
14
- self.options = DEFAULT_OPTIONS.deep_merge(_options)
15
- raise "You don't really want a chunk time this small: #{options[:chunktime]}" unless options[:chunktime] > 600
16
- self.chunk_monitor = Monkeyshines::Monitor::PeriodicMonitor.new( :time => options[:chunktime] )
17
- self.handle = options[:handle] || Monkeyshines::CONFIG[:handle]
18
- self.filename_pattern = Monkeyshines::Utils::FilenamePattern.new(options[:pattern], :handle => handle, :rootdir => options[:rootdir])
19
- super options.merge(:filename => filename_pattern.make())
13
+ def initialize options={}
14
+ # super wants a :filename in the options or it will fail. We need to get the initial filename
15
+ # set up before we call super, so we need all of the parts of the pattern set up.
16
+ self.chunktime = options[:chunktime] || Settings[:chunk_file_chunktime]
17
+ self.rootdir = options[:rootdir] || Settings[:chunk_file_rootdir]
18
+ self.handle = options[:handle]
19
+ pattern = options[:pattern] || Settings[:chunk_file_pattern]
20
+ self.filename_pattern = FilenamePattern.new(pattern, :handle => handle, :rootdir => self.rootdir)
21
+ options[:filename] = filename_pattern.make()
22
+
23
+ super options
24
+
25
+ Log.warn "You don't really want a chunk time this small: #{self.chunktime}" unless self.chunktime > 600
26
+ self.chunk_monitor = Wukong::PeriodicMonitor.new( :time => self.chunktime )
20
27
  self.mkdir!
21
28
  end
22
29
 
@@ -1,6 +1,6 @@
1
1
  require 'fileutils'; include FileUtils
2
2
 
3
- module Monkeyshines
3
+ module Wukong
4
4
  module Store
5
5
  #
6
6
  class FlatFileStore < Store::Base
@@ -10,7 +10,7 @@ module Monkeyshines
10
10
  # +filename_root+ : first part of name for files
11
11
  #
12
12
  def initialize options={}
13
- Log.debug "New #{self.class} as #{options.inspect}"
13
+ super options
14
14
  self.filename = options[:filename] or raise "Missing filename in #{self.class}"
15
15
  self.filemode = options[:filemode] || 'r'
16
16
  skip!(options[:skip]) if options[:skip]
@@ -21,7 +21,6 @@ module Monkeyshines
21
21
  #
22
22
  def each &block
23
23
  file.each do |line|
24
- next if line[0..0] == '#'
25
24
  attrs = line.chomp.split("\t")
26
25
  next if attrs.blank?
27
26
  yield *attrs
@@ -54,6 +53,10 @@ module Monkeyshines
54
53
  @file = nil
55
54
  end
56
55
 
56
+ def flush
57
+ @file.flush if @file
58
+ end
59
+
57
60
  # Ensure the file's directory exists
58
61
  def mkdir!
59
62
  dir = File.dirname(filename)
@@ -64,7 +67,7 @@ module Monkeyshines
64
67
 
65
68
  # write to the file
66
69
  def save obj
67
- file << obj.to_flat.join("\t")+"\n"
70
+ file.puts obj
68
71
  obj
69
72
  end
70
73
 
@@ -74,14 +77,10 @@ module Monkeyshines
74
77
  File.size(filename)
75
78
  end
76
79
 
77
- def set key, *args, &block
78
- tok, obj = block.call
79
- save obj
80
- end
81
-
82
- # delegates to +#save+ -- writes the object to the file
80
+ # delegates to +#save+ -- writes the object to the file. Returns self for chaining on the stream.
83
81
  def <<(obj)
84
82
  save obj
83
+ self
85
84
  end
86
85
 
87
86
  end
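With `<<` now returning `self`, writes can be chained in the usual stream style. The sketch below shows the effect with a hypothetical `SketchStore` that copies only the `save`, `<<` and `flush` methods from the diff, and uses a `StringIO` in place of a real file.

```ruby
require 'stringio'

# Hypothetical stand-in for FlatFileStore: just the methods touched by this diff.
class SketchStore
  def initialize(io)
    @file = io
  end

  # write one record per line; returns the object, as in the diff
  def save(obj)
    @file.puts obj
    obj
  end

  # delegates to #save and returns self so calls can be chained
  def <<(obj)
    save obj
    self
  end

  def flush
    @file.flush if @file
  end
end

io    = StringIO.new
store = SketchStore.new(io)
store << "a\t1" << "b\t2"
store.flush
print io.string
```

Before this change `<<` returned the saved object, so the second `<<` in a chain would have been sent to that object (here, a String) instead of the store.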
@@ -0,0 +1,7 @@
1
+ module Wukong
2
+ module Streamer
3
+ class EmStreamer
4
+
5
+ end
6
+ end
7
+ end
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{wukong}
8
- s.version = "1.5.2"
8
+ s.version = "1.5.3"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Philip (flip) Kromer"]
12
- s.date = %q{2010-08-11}
12
+ s.date = %q{2010-08-19}
13
13
  s.description = %q{ Treat your dataset like a:
14
14
 
15
15
  * stream of lines when it's efficient to process by lines
@@ -37,6 +37,7 @@ Gem::Specification.new do |s|
37
37
  "bin/hdp-bzip",
38
38
  "bin/hdp-cat",
39
39
  "bin/hdp-catd",
40
+ "bin/hdp-cp",
40
41
  "bin/hdp-du",
41
42
  "bin/hdp-get",
42
43
  "bin/hdp-kill",
@@ -138,6 +139,7 @@ Gem::Specification.new do |s|
138
139
  "examples/count_keys.rb",
139
140
  "examples/count_keys_at_mapper.rb",
140
141
  "examples/emr/elastic_mapreduce_example.rb",
142
+ "examples/emr/emr.yaml",
141
143
  "examples/keystore/cassandra_batch_test.rb",
142
144
  "examples/keystore/conditional_outputter_example.rb",
143
145
  "examples/network_graph/adjacency_list.rb",
@@ -185,6 +187,7 @@ Gem::Specification.new do |s|
185
187
  "lib/wukong/extensions/string.rb",
186
188
  "lib/wukong/extensions/struct.rb",
187
189
  "lib/wukong/extensions/symbol.rb",
190
+ "lib/wukong/filename_pattern.rb",
188
191
  "lib/wukong/keystore/cassandra_conditional_outputter.rb",
189
192
  "lib/wukong/keystore/redis_db.rb",
190
193
  "lib/wukong/keystore/tyrant_db.rb",
@@ -205,6 +208,7 @@ Gem::Specification.new do |s|
205
208
  "lib/wukong/script/local_command.rb",
206
209
  "lib/wukong/store.rb",
207
210
  "lib/wukong/store/base.rb",
211
+ "lib/wukong/store/chh_chunked_flat_file_store.rb",
208
212
  "lib/wukong/store/chunked_flat_file_store.rb",
209
213
  "lib/wukong/store/conditional_store.rb",
210
214
  "lib/wukong/store/factory.rb",
@@ -222,6 +226,7 @@ Gem::Specification.new do |s|
222
226
  "lib/wukong/streamer/count_keys.rb",
223
227
  "lib/wukong/streamer/count_lines.rb",
224
228
  "lib/wukong/streamer/counting_reducer.rb",
229
+ "lib/wukong/streamer/em_streamer.rb",
225
230
  "lib/wukong/streamer/filter.rb",
226
231
  "lib/wukong/streamer/line_streamer.rb",
227
232
  "lib/wukong/streamer/list_reducer.rb",
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: wukong
3
3
  version: !ruby/object:Gem::Version
4
- hash: 7
4
+ hash: 5
5
5
  prerelease: false
6
6
  segments:
7
7
  - 1
8
8
  - 5
9
- - 2
10
- version: 1.5.2
9
+ - 3
10
+ version: 1.5.3
11
11
  platform: ruby
12
12
  authors:
13
13
  - Philip (flip) Kromer
@@ -15,7 +15,7 @@ autorequire:
15
15
  bindir: bin
16
16
  cert_chain: []
17
17
 
18
- date: 2010-08-11 00:00:00 -05:00
18
+ date: 2010-08-19 00:00:00 -05:00
19
19
  default_executable:
20
20
  dependencies:
21
21
  - !ruby/object:Gem::Dependency
@@ -131,6 +131,7 @@ files:
131
131
  - bin/hdp-bzip
132
132
  - bin/hdp-cat
133
133
  - bin/hdp-catd
134
+ - bin/hdp-cp
134
135
  - bin/hdp-du
135
136
  - bin/hdp-get
136
137
  - bin/hdp-kill
@@ -232,6 +233,7 @@ files:
232
233
  - examples/count_keys.rb
233
234
  - examples/count_keys_at_mapper.rb
234
235
  - examples/emr/elastic_mapreduce_example.rb
236
+ - examples/emr/emr.yaml
235
237
  - examples/keystore/cassandra_batch_test.rb
236
238
  - examples/keystore/conditional_outputter_example.rb
237
239
  - examples/network_graph/adjacency_list.rb
@@ -279,6 +281,7 @@ files:
279
281
  - lib/wukong/extensions/string.rb
280
282
  - lib/wukong/extensions/struct.rb
281
283
  - lib/wukong/extensions/symbol.rb
284
+ - lib/wukong/filename_pattern.rb
282
285
  - lib/wukong/keystore/cassandra_conditional_outputter.rb
283
286
  - lib/wukong/keystore/redis_db.rb
284
287
  - lib/wukong/keystore/tyrant_db.rb
@@ -299,6 +302,7 @@ files:
299
302
  - lib/wukong/script/local_command.rb
300
303
  - lib/wukong/store.rb
301
304
  - lib/wukong/store/base.rb
305
+ - lib/wukong/store/chh_chunked_flat_file_store.rb
302
306
  - lib/wukong/store/chunked_flat_file_store.rb
303
307
  - lib/wukong/store/conditional_store.rb
304
308
  - lib/wukong/store/factory.rb
@@ -316,6 +320,7 @@ files:
316
320
  - lib/wukong/streamer/count_keys.rb
317
321
  - lib/wukong/streamer/count_lines.rb
318
322
  - lib/wukong/streamer/counting_reducer.rb
323
+ - lib/wukong/streamer/em_streamer.rb
319
324
  - lib/wukong/streamer/filter.rb
320
325
  - lib/wukong/streamer/line_streamer.rb
321
326
  - lib/wukong/streamer/list_reducer.rb