hog 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,22 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ .yardoc
+ Gemfile.lock
+ InstalledFiles
+ _yardoc
+ coverage
+ doc/
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
+ *.bundle
+ *.so
+ *.o
+ *.a
+ mkmf.log
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in hog.gemspec
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
+ Copyright (c) 2014 jondot
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,173 @@
+ # Hog
+
+ Supercharge your Ruby Pig UDFs with Hog.
+
+
+ ## Why Use Hog?
+
+ Use Hog when you want to process (map) tuples with Hadoop, and generally
+ believe that:
+
+ * It's too much overhead to write custom M/R code
+ * You're too lazy to process with Pig data pipes and sprinkled custom UDFs, and they're not
+   powerful enough anyway
+ * Data processing with Hive queries is wrong
+ * Cascading is overkill for your case
+ * Streaming Hadoop (Unix processes) is a last resort
+
+ And instead want to develop within your familiar Ruby ecosystem but
+ also:
+
+ * Control and tweak job parameters via Pig
+ * Use Pig's job management
+ * Use Pig's I/O (serdes) easily
+
+ And you also want to use gems, jars, and everything Ruby (JRuby) has to offer for a really quick turnaround and a pleasant development experience.
+
+
+ ## Quickstart
+
+ Let's say you want to perform geolocation, detect a device's form
+ factor, and shove bits and fields from the nested structure into a Hive table to be able to query
+ it efficiently.
+
+ This is your raw event:
+
+ ```javascript
+ {
+   "x-forwarded-for": "8.8.8.8",
+   "user-agent": "123",
+   "pageview": {
+     "userid": "ud-123",
+     "timestamp": 123123123,
+     "headers": {
+       "user-agent": "...",
+       "cookies": "..."
+     }
+   }
+ }
+ ```
+
+ To do this with Pig alone, you would need:
+
+ * A Pig UDF for geolocation
+ * A Pig UDF for form-factor detection
+ * A way to extract JSON fields from a tuple in Pig
+ * A way to combine everything in Pig code
+
+ And you'll give up:
+
+ * Testability
+ * Maintainability
+ * Development speed
+ * Eventually, sanity
+
+
+ #### Using Hog
+
+ With Hog, you'll mostly feel you're _describing_ how you want the data to be
+ shaped, and it'll do the heavy lifting.
+
+ You always have the option to 'drop' to code with the `prepare`
+ block.
+
+
+ ```ruby
+ # gem install hog
+ require 'hog'
+
+ TupleProcessor = Hog.tuple "mytuple" do
+
+   # Your own custom feature extraction. This can be any free-style
+   # Ruby or Java code.
+   #
+   prepare do |hsh|
+     loc = ip2location(hsh["x-forwarded-for"])
+     hsh.merge!(loc)
+
+     form_factor = formfactor(hsh["user-agent"])
+     hsh.merge!(form_factor)
+   end
+
+
+   # Describe your data columns (e.g. for Hive), and how you'd like
+   # to pull them out of your nested data.
+   #
+   # You can also put a fixed value, or serialize a sub-nested
+   # structure as-is for later use with Hive's json_object.
+   #
+   chararray :id,         :path => 'pageview.userid'
+   chararray :created_at, :path => 'pageview.timestamp', :with_nil => ''
+   float     :sampling,   :value => 1.0
+   chararray :ev_type,    :value => 'pageview'
+   chararray :ev_info,    :json => 'pageview.headers', :default => {}
+ end
+ ```
+
+ As you'll notice, there's very little code: you specify only the bare
+ essentials (types and fields), which you would have had to specify with Pig in any case.
+
+
+ Hog will generate a `TupleProcessor` that conforms to your description and logic.
+
+
+ #### Testing
+
+ To test your mapper, just use `TupleProcessor` within your specs and
+ give it a raw event. It's plain Ruby.
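A minimal spec-style sketch of what that looks like. To keep the snippet self-contained (no spec framework and no `hog` gem loaded here), the generated processor is replaced by a hand-rolled stand-in that resolves fields the same way Hog's `:path` option does; `StandInProcessor` and the field choices are illustrative, not part of Hog:

```ruby
require 'json'

# Hypothetical stand-in for the TupleProcessor that Hog.tuple generates;
# it resolves dotted :path lookups by walking the parsed JSON hash.
class StandInProcessor
  def self.process(line)
    hsh = JSON.parse(line)
    id = 'pageview.userid'.split('.').inject(hsh) { |acc, k| acc[k] }
    ts = 'pageview.timestamp'.split('.').inject(hsh) { |acc, k| acc[k] }
    [id.to_s, ts.to_s]
  end
end

# The "spec": feed a raw event, assert on the produced row.
raw = '{"pageview":{"userid":"ud-123","timestamp":123123123}}'
row = StandInProcessor.process(raw)
raise "unexpected row" unless row == ["ud-123", "123123123"]
```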
+
+
+ #### Hadoop
+
+ To hook up with Hadoop, rig `TupleProcessor` within a Pig JRuby UDF shell like so:
+
+ ```ruby
+ require 'tuple_processor'
+ require 'pigudf'
+
+ class TupleProcessorUdf < PigUdf
+   # TupleProcessor will automatically generate your UDF schema, no matter how complex or
+   # convoluted.
+   outputSchema TupleProcessor.schema
+
+   # Use TupleProcessor as the processing logic.
+   def process(line)
+     # Since one raw JSON event can produce several rows, 'process' returns
+     # an array. We pass that to Pig as Pig's own 'DataBag'.
+     # You can also skip this and just return whatever TupleProcessor#process returns.
+     databag = DataBag.new
+     TupleProcessor.process(line).each do |res|
+       databag.add(res)
+     end
+
+     databag
+   end
+ end
+ ```
+
+
+ ## Gems and Jars
+
+ You can't really use gems or jars with Pig Ruby UDFs unless all machines have
+ them, and then managing it becomes ugly.
+
+ The solution is neat: pack everything as you would with a JRuby project,
+ for example using `warbler`; your job code becomes a standalone jar
+ which you deploy, version, and possibly generate via CI.
+
+
+ ## Related Projects
+
+ * PigPen - https://github.com/Netflix/PigPen
+ * datafu - https://github.com/linkedin/datafu
+
+
+ ## Contributing
+
+ Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).
+
+ ## Copyright
+
+ Copyright (c) 2014 [Dotan Nahum](http://gplus.to/dotan) [@jondot](http://twitter.com/jondot). See MIT-LICENSE for further details.
+
+
data/Rakefile ADDED
@@ -0,0 +1,2 @@
+ require "bundler/gem_tasks"
+
data/bin/hog ADDED
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+
+ require 'hog'
+
+ # hog binary is reserved for future use
data/example/runner.pig ADDED
@@ -0,0 +1,13 @@
+ -- Since most work is in your TupleProcessor, this file serves as a simple shim
+ -- that loads data into raw lines and sends them off to your UDF.
+ -- As a result, it will mostly stay the same.
+ -- You can tweak any M/R or Pig params here as well.
+
+ REGISTER 'your-warbled-dependencies-jar.jar';
+ REGISTER 'tuple_processor_udf.rb' USING jruby AS processor;
+
+ data = FOREACH ( LOAD '/raw-data' AS (line:CHARARRAY) )
+        GENERATE FLATTEN(processor.process(line));
+
+ STORE data INTO '/out-data' USING PARQUETFILE;
+
data/example/tuple_processor.rb ADDED
@@ -0,0 +1,29 @@
+ require 'hog'
+
+ TupleProcessor = Hog.tuple "mytuple" do
+
+   # Your own custom feature extraction. This can be any free-style
+   # Ruby or Java code.
+   #
+   prepare do |hsh|
+     loc = ip2location(hsh["x-forwarded-for"])
+     hsh.merge!(loc)
+
+     form_factor = formfactor(hsh["user-agent"])
+     hsh.merge!(form_factor)
+   end
+
+
+   # Describe your data columns (e.g. for Hive), and how you'd like
+   # to pull them out of your nested data.
+   #
+   # You can also put a fixed value, or serialize a sub-nested
+   # structure as-is for later use with Hive's json_object.
+   #
+   chararray :id,         :path => 'pageview.userid'
+   chararray :created_at, :path => 'pageview.timestamp', :with_nil => ''
+   float     :sampling,   :value => 1.0
+   chararray :ev_type,    :value => 'pageview'
+   chararray :ev_info,    :json => 'pageview.headers', :default => {}
+ end
+
data/example/tuple_processor_udf.rb ADDED
@@ -0,0 +1,22 @@
+ require 'pigudf'
+ require 'tuple_processor'
+
+ class TupleProcessorUdf < PigUdf
+   # TupleProcessor will automatically generate your UDF schema, no matter how complex or
+   # convoluted.
+   outputSchema TupleProcessor.schema
+
+   # Use TupleProcessor as the processing logic.
+   def process(line)
+     # Since one raw JSON event can produce several rows, 'process' returns
+     # an array. We pass that to Pig as Pig's own 'DataBag'.
+     # You can also skip this and just return whatever TupleProcessor#process returns.
+     databag = DataBag.new
+     TupleProcessor.process(line).each do |res|
+       databag.add(res)
+     end
+
+     databag
+   end
+ end
+
data/hog.gemspec ADDED
@@ -0,0 +1,23 @@
+ # coding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require 'hog/version'
+
+ Gem::Specification.new do |spec|
+   spec.name          = "hog"
+   spec.version       = Hog::VERSION
+   spec.authors       = ["jondot"]
+   spec.email         = ["jondotan@gmail.com"]
+   spec.summary       = %q{}
+   spec.description   = %q{}
+   spec.homepage      = ""
+   spec.license       = "MIT"
+
+   spec.files         = `git ls-files -z`.split("\x0")
+   spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+   spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+   spec.require_paths = ["lib"]
+
+   spec.add_development_dependency "bundler", "~> 1.6"
+   spec.add_development_dependency "rake"
+ end
data/lib/hog/field.rb ADDED
@@ -0,0 +1,52 @@
+ module Hog
+   class Field
+     attr_reader :opts
+
+     def initialize(type, name, opts={})
+       @name = name
+       @type = type
+       @opts = opts
+       opts[:path] ||= name.to_s
+     end
+
+     def get(hash)
+       val = nil
+       if opts[:value]
+         val = opts[:value]
+       elsif opts[:json]
+         lkval = lookup(opts[:json], hash) || opts[:default]
+         val = JSON.dump(lkval)
+       else
+         val = lookup(opts[:path], hash)
+       end
+       get_value(val)
+     end
+
+     def to_s
+       "#{@name}:#{@type}"
+     end
+
+     private
+
+     def get_value(val)
+       # xxx optimize?
+       if @type == "int" && [TrueClass, FalseClass].include?(val.class)
+         val ? 1 : 0
+       elsif @type == "chararray"
+         val.to_s
+       else
+         val
+       end
+     end
+
+     def lookup(path, hash)
+       begin
+         path.split('.').inject(hash) { |acc, value| acc[value] }
+       rescue
+         return opts[:with_nil] if opts[:with_nil]
+         raise "No path: #{path} in hash: #{hash}"
+       end
+     end
+   end
+ end
+
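The dotted-path lookup in `Field#lookup` above is a plain fold over the parsed hash. A standalone illustration of the same traversal (the sample event is made up for the example):

```ruby
require 'json'

hash = JSON.parse('{"pageview":{"headers":{"cookies":"c=1"}}}')

# The same fold Field#lookup performs for :path => 'pageview.headers.cookies':
# split on '.' and walk one hash level per segment.
value = 'pageview.headers.cookies'.split('.').inject(hash) { |acc, k| acc[k] }
# => "c=1"

# A missing key makes the fold index into nil, which raises; Field rescues
# that and substitutes :with_nil when given, or re-raises with the path.
```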
data/lib/hog/testing/pig_stubs.rb ADDED
@@ -0,0 +1,16 @@
+ class PigUdf
+   def self.outputSchema(str)
+
+   end
+ end
+
+ class DataBag
+   attr_reader :data
+   def initialize
+     @data = []
+   end
+   def add(el)
+     @data << el
+   end
+ end
+
data/lib/hog/tuple.rb ADDED
@@ -0,0 +1,41 @@
+ # todos
+ # - safe hash path lookup + provide default value?
+
+
+ module Hog
+   class Tuple
+     def initialize(name, block)
+       @name = name
+       @fields = []
+       instance_eval(&block)
+     end
+
+     def process(hsh)
+       @prepare.call(hsh) if @prepare
+       res = []
+       @fields.each do |f|
+         res << f.get(hsh)
+       end
+       res
+     end
+
+     def prepare(&block)
+       @prepare = block
+     end
+
+     def to_s
+       "(#{@fields.map { |f| f.to_s }.join(',')})"
+     end
+
+     def method_missing(meth, *args, &block)
+       if [:chararray, :float, :double, :int, :long, :map].include?(meth)
+         f = Field.new(meth.to_s, *args)
+         @fields << f
+         return f
+       else
+         super
+       end
+     end
+   end
+ end
+
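The `method_missing` hook above is what lets the DSL read like a schema: the block is `instance_eval`'d, so any call named after a Pig type lands in `method_missing` and becomes a field. A stripped-down model of that dispatch (`MiniTuple` is a made-up name for illustration, not part of Hog):

```ruby
# Minimal model of Tuple's DSL dispatch: type-named calls append fields.
class MiniTuple
  TYPES = [:chararray, :float, :double, :int, :long, :map]

  def initialize(&block)
    @fields = []
    instance_eval(&block)   # run the schema block against this instance
  end

  def method_missing(meth, *args, &block)
    super unless TYPES.include?(meth)  # unknown names still raise NoMethodError
    @fields << [args.first, meth]      # record field name and Pig type
  end

  def to_s
    "(#{@fields.map { |name, type| "#{name}:#{type}" }.join(',')})"
  end
end

t = MiniTuple.new do
  chararray :id
  float :sampling
end
puts t.to_s  # prints "(id:chararray,sampling:float)"
```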
data/lib/hog/utils.rb ADDED
@@ -0,0 +1,12 @@
+ module Hog
+   module Utils
+     def explode(hsh, path, into_path)
+       arr = hsh.delete(path)
+       exploded = []
+       arr.each do |item|
+         exploded << ({ into_path => item }.merge(hsh))
+       end
+       exploded
+     end
+   end
+ end
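`explode` turns one event carrying an array into one row per element: the array field is deleted from the hash, and each element is re-attached under a new key alongside the remaining fields. A standalone sketch with the same logic inlined (sample data made up for the example):

```ruby
# Same logic as Hog::Utils#explode, inlined so the snippet runs on its own.
def explode(hsh, path, into_path)
  arr = hsh.delete(path)  # remove the array field from the event
  arr.map { |item| { into_path => item }.merge(hsh) }
end

rows = explode({ "tags" => ["a", "b"], "id" => 1 }, "tags", "tag")
# rows == [{"tag"=>"a", "id"=>1}, {"tag"=>"b", "id"=>1}]
```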
data/lib/hog/version.rb ADDED
@@ -0,0 +1,3 @@
+ module Hog
+   VERSION = "0.0.1"
+ end
data/lib/hog.rb ADDED
@@ -0,0 +1,19 @@
+ require 'hog/version'
+ require 'json'
+ require 'hog/tuple'
+ require 'hog/field'
+ require 'hog/utils'
+
+ module Hog
+   class << self
+     include Hog::Utils
+   end
+
+   def self.tuple(name, &block)
+     Tuple.new(name, block)
+   end
+ end
metadata ADDED
@@ -0,0 +1,95 @@
+ --- !ruby/object:Gem::Specification
+ name: hog
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ prerelease:
+ platform: ruby
+ authors:
+ - jondot
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2014-06-18 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.6'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.6'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: ''
+ email:
+ - jondotan@gmail.com
+ executables:
+ - hog
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - .gitignore
+ - Gemfile
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - bin/hog
+ - example/runner.pig
+ - example/tuple_processor.rb
+ - example/tuple_processor_udf.rb
+ - hog.gemspec
+ - lib/hog.rb
+ - lib/hog/field.rb
+ - lib/hog/testing/pig_stubs.rb
+ - lib/hog/tuple.rb
+ - lib/hog/utils.rb
+ - lib/hog/version.rb
+ homepage: ''
+ licenses:
+ - MIT
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 1.8.23
+ signing_key:
+ specification_version: 3
+ summary: ''
+ test_files: []