hackboxen 0.1.0

@@ -0,0 +1,44 @@
1
+ h3. Deprecations/Changes
2
+
3
+ * HackBoxen::Paths methods have changed.
4
+
5
+ * You no longer need to include the line @HACKBOX_DIR = File.basename(__FILE__)@ at the top of the Rakefile.
6
+
7
+ * @Settings@ as used by the Rakefile is now @WorkingConfig@.
8
+
9
+ * @working_environment@ is now only available in the JSON flavor and the @env/@ directory has moved to the same level as @fixd/@.
10
+
11
+ * Output data and Icss need to end up in @fixd/data/@.
12
+
13
+ * @rake scaffold@ is no longer the command to build a hackbox.
14
+
15
+ * Config files are no longer read as a directory, nor are they read from the dataroot, because the @config/@ output directory is no longer created.
16
+
17
+ * Much old code was refactored or removed.
18
+
19
+ h3. New Functionality
20
+
21
+ * Default Hackboxen paths can be accessed by using @path_to(:fixd_dir)@. See HackBoxen::Paths for using/adding others.
22
+
23
+ * You may now @require 'hackboxen'@ anywhere and it will recognize if you are in a hackbox directory (or not) and allow you appropriate access to HackBoxen methods.
24
+
25
+ * Tasks for moving icss and endpoint code have been added. Include @'hb:icss'@ and @'hb:endpoint'@ in the default Rakefile task if you want to use them.
26
+
27
+ * @filesystem_scheme@ now defaults to the local filesystem if not specified in the @config.yaml@.
28
+
29
+ * A logging helper has been added. Use @include HackBoxen::Logging@ and then @logs_to STDOUT, 'file'@ inside of a class to access an instance variable @@log@ that contains a formatted log4r Logger.
30
+
31
+ * A binary executable, @hb-scaffold@, has been added. It can be run from anywhere and is designed to replace the rake task.
32
+
33
+ h3. Still Needing
34
+
35
+ * Make Hackboxen a gem. This will require the separation of the actual hackbox code from the hackboxen library (not done yet) and the creation of a coderoot to connect the hackbox library with the other code (implemented).
36
+
37
+ * When Hackboxen is a gem, its version can be added to the requires hash in a hackbox @config.yaml@ so we can keep track of potentially breaking changes to legacy code.
38
+
39
+ * Full spec coverage for the hackboxen library.
40
+
41
+ * Implementation of @'hb:mini'@ and @ConfigValidator@. Some of this code is written, but it needs to be fleshed out and decided upon.
42
+
43
+ * Complete separation of @config.yaml@ from @icss.yaml@. This would not affect hackbox running much.
44
+
data/Gemfile ADDED
@@ -0,0 +1,12 @@
1
+ source :rubygems
2
+ gem 'swineherd', '>=0.0.4'
3
+ gem 'configliere', '0.4.6'
4
+ gem 'rake', '0.8.7'
5
+
6
+ group :development do
7
+ gem "shoulda", ">= 0"
8
+ gem "bundler", "~> 1.0.0"
9
+ gem "jeweler", "~> 1.5.2"
10
+ gem "rcov", ">= 0"
11
+ end
12
+
@@ -0,0 +1,34 @@
1
+ GEM
2
+ remote: http://rubygems.org/
3
+ specs:
4
+ configliere (0.4.6)
5
+ erubis (2.7.0)
6
+ git (1.2.5)
7
+ gorillib (0.1.1)
8
+ jeweler (1.5.2)
9
+ bundler (~> 1.0.0)
10
+ git (>= 1.2.5)
11
+ rake
12
+ rake (0.8.7)
13
+ rcov (0.9.9)
14
+ right_aws (2.1.0)
15
+ right_http_connection (>= 1.2.5)
16
+ right_http_connection (1.3.0)
17
+ shoulda (2.11.3)
18
+ swineherd (0.0.4)
19
+ configliere
20
+ erubis
21
+ gorillib
22
+ right_aws
23
+
24
+ PLATFORMS
25
+ ruby
26
+
27
+ DEPENDENCIES
28
+ bundler (~> 1.0.0)
29
+ configliere (= 0.4.6)
30
+ jeweler (~> 1.5.2)
31
+ rake (= 0.8.7)
32
+ rcov
33
+ shoulda
34
+ swineherd (>= 0.0.4)
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2011 Infochimps
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,203 @@
1
+ h1. Hackboxen
2
+
3
+ The Hackboxen library is designed to encapsulate data collecting and processing tasks into simple and easy to implement packages.
4
+
5
+ Every hackbox has the following two parts:
6
+
7
+ * An engine, which contains configuration information and data processing code.
8
+ * An output directory, which will contain the fully processed data along with a descriptive schema. This directory may be either local or remote (e.g. S3/HDFS).
9
+
10
+ A hackbox **dataset** is defined by a @namespace@ and a @protocol@. The @namespace@ must be dot-separated (.), and both the @namespace@ and the @protocol@ may contain only lowercase letters, numbers and underscores.
11
+
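The naming rule above can be sketched as a small check. This is only an illustration: the constant and method names are hypothetical and not part of the Hackboxen library, and it assumes that a single-segment namespace (no dot) is acceptable.

```ruby
# Hypothetical helpers; only the character rules come from the README.
SEGMENT = /\A[a-z0-9_]+\z/

# A namespace is one or more dot-separated segments of [a-z0-9_].
# Assumption: a namespace with a single segment is allowed.
def valid_namespace?(namespace)
  parts = namespace.to_s.split('.', -1) # -1 keeps empty trailing segments
  !parts.empty? && parts.all? { |part| part.match?(SEGMENT) }
end

# A protocol is a single segment of [a-z0-9_].
def valid_protocol?(protocol)
  protocol.to_s.match?(SEGMENT)
end

puts valid_namespace?('language.corpora.word_freq')
puts valid_protocol?('bnc')
```

Splitting with a limit of @-1@ keeps empty segments, so names such as @foo..bar@ or @foo.@ are rejected rather than silently accepted.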
12
+ h2. Hackbox Engine
13
+
14
+ A hackbox engine contains:
15
+
16
+ * @Rakefile@: **(required)** Used to read and combine all the sources of config metadata and execute @main@.
17
+ * @Gemfile@: **(optional)** A list of gems necessary for this hackbox to run. Processed automatically by "Bundler":https://github.com/carlhuda/bundler.
18
+ * @config/@: **(required)** A subdirectory containing:
19
+ ** @config.yaml@ **(required)** A dataset specific default configuration YAML file.
20
+ ** @protocol.icss.yaml@ **(optional)** An "Icss":http://github.com/infochimps/icss schema file describing the output data and publishing targets.
21
+ * @engine/@: **(required)** A subdirectory containing:
22
+ ** @main@: **(required)** An executable data processing file. This may be written in any language.
23
+ ** **(optional)** Any other executable and support files. There is no restriction on language and complexity.
24
+
25
+ The hackbox engine lives in the @coderoot@ directory specified by your configuration settings. An example hackbox engine directory structure:
26
+
27
+ <pre><code>coderoot
28
+ └── language
29
+ └── corpora
30
+ └── word_freq
31
+ └── bnc
32
+ ├── config
33
+ │ ├── config.yaml
34
+ │ └── bnc.icss.yaml
35
+ ├── engine
36
+ │ ├── main
37
+ │ └── bnc_endpoint.rb
38
+ └── Rakefile
39
+ </code></pre>
40
+
41
+ h2. Hackbox Output Directory
42
+
43
+ The hackbox output directory is where all of the data that a hackbox acquires, reads, or creates lives. The location of the data directory is determined by the @dataroot@ variable specified in your configuration settings. An example hackbox output directory structure:
44
+
45
+ <pre><code>dataroot
46
+ └── language
47
+ └── corpora
48
+ └── word_freq
49
+ └── bnc
50
+ ├── fixd
51
+ │ ├── code
52
+ │ │ └── bnc_endpoint.rb
53
+ │ ├── data
54
+ │ │ └── bnc_fixd_data.tsv
55
+ │ └── env
56
+ │ └── working_environment.json
57
+ ├── log
58
+ │ └── bnc_run_0.log
59
+ ├── rawd
60
+ │ └── bnc_data_in_process
61
+ ├── ripd
62
+ │ └── bnc_download.zip
63
+ └── tmp
64
+ </code></pre>
65
+
66
+ * @log/@: **(optional)** All logging from a hackbox run goes here.
67
+ * @tmp/@: **(optional)** If needed, any truly ephemeral output of the workflow should go here.
68
+ * @ripd/@: **(required)** This will contain virginal downloaded source data adhering to the directory structure from which it was pulled.
69
+ * @rawd/@: **(optional)** This will contain all intermediate data processing outputs.
70
+ * @fixd/@: **(required)** See the output interface described below.
71
+
72
+ Engine and output directories are generally created dynamically and are not meant to be archival.
73
+
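As a rough sketch of how such an output directory could be created dynamically: the helper names below are hypothetical, and the mapping from @namespace@ and @protocol@ to a path is inferred from the example tree above rather than taken from the library.

```ruby
require 'fileutils'
require 'tmpdir'

# Subdirectories shown in the example output tree.
OUTPUT_SUBDIRS = %w[fixd/code fixd/data fixd/env log rawd ripd tmp].freeze

# Assumption: each namespace segment becomes a directory, followed by the protocol.
def output_dir(dataroot, namespace, protocol)
  File.join(dataroot, *namespace.split('.'), protocol)
end

# Creates the skeleton; mkdir_p is a no-op for directories that already exist.
def create_output_dirs(dataroot, namespace, protocol)
  base = output_dir(dataroot, namespace, protocol)
  OUTPUT_SUBDIRS.each { |sub| FileUtils.mkdir_p(File.join(base, sub)) }
  base
end

# Use a scratch directory as the dataroot so the example leaves nothing behind.
Dir.mktmpdir do |dataroot|
  puts create_output_dirs(dataroot, 'language.corpora.word_freq', 'bnc')
end
```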
74
+ h3. Output Interface (fixd/)
75
+
76
+ @fixd/@ is the final output directory and contains the following:
77
+
78
+ * @env/@: **(required)** This directory contains a file describing the environment in which the hackbox was run.
79
+ ** @working_environment.json@: **(required)** All runtime config metadata used to generate the schema and output data.
80
+ * @code/@: **(optional)** A directory containing the code assets described in the icss.
81
+ * @data/@: **(required)** A directory containing a single dataset or subdirectories named for each dataset. Each contains:
82
+ ** @protocol.icss.json@: **(required)** An "Icss":http://github.com/infochimps/icss schema file describing its respective dataset.
83
+ ** **(required)** One or more data files that collectively adhere to the schema of this dataset.
84
+
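A minimal sketch of checking the required parts of this interface after a run. The helper name is hypothetical, and it checks only that the required entries exist, not that the data matches its schema.

```ruby
require 'fileutils'
require 'tmpdir'

# Required entries, taken from the list above.
REQUIRED_FIXD_ENTRIES = [
  'env/working_environment.json',
  'data'
].freeze

# Returns the required entries that are absent from the given fixd directory.
def missing_fixd_entries(fixd_dir)
  REQUIRED_FIXD_ENTRIES.reject { |entry| File.exist?(File.join(fixd_dir, entry)) }
end

# Demonstration against an empty scratch directory.
Dir.mktmpdir do |fixd|
  puts missing_fixd_entries(fixd).inspect
end
```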
85
+ h2. Hackbox Configuration
86
+
87
+ Hackbox configuration may come from one or more YAML files and, optionally, the command line. Configuration is read in using "Configliere":https://github.com/mrflip/configliere in the following order:
88
+
89
+ * @/etc/hackbox/hackbox.yaml@: Machine-wide config.
90
+ * @~/.hackbox/hackbox.yaml@: Install-specific config.
91
+ * @config/config.yaml@: Hackbox-specific config.
92
+ * @rake task -- --args=@: Command line arguments.
93
+
94
+ Later sources on this list overwrite earlier sources. The combined configuration metadata is serialized as JSON to @working_environment.json@ in the @fixd/env@ directory. This happens before any other code executes, so a hackbox can read the file if necessary.
95
+
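The precedence rule can be illustrated with plain Ruby. This sketch does not use Configliere: the helper name is hypothetical, and it performs a shallow @Hash#merge@, which may differ from how Configliere combines nested settings.

```ruby
require 'yaml'

# Hypothetical helper: merge YAML files in order, with later files winning.
# Missing files and files that do not contain a mapping are skipped.
def layered_config(paths, cli_overrides = {})
  merged = paths.inject({}) do |config, path|
    next config unless File.file?(path)
    loaded = YAML.load_file(path)
    loaded.is_a?(Hash) ? config.merge(loaded) : config
  end
  # Command line arguments come last, so they override every file.
  merged.merge(cli_overrides)
end

# The three file locations listed above, in reading order.
sources = [
  '/etc/hackbox/hackbox.yaml',
  File.expand_path('~/.hackbox/hackbox.yaml'),
  'config/config.yaml'
]

config = layered_config(sources, 'dataroot' => '/data/hb')
puts config['dataroot']
```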
96
+ h1. Getting Started
97
+
98
+ Here are the general guidelines for creating your own hackbox.
99
+
100
+ h3. Hackboxen Dependencies
101
+
102
+ Clone the Hackboxen repo:
103
+
104
+ <pre><code>git clone git@github.com:infochimps/hackboxen.git
105
+ </code></pre>
106
+
107
+ Add Hackboxen to your $RUBYLIB:
108
+
109
+ <pre><code>export RUBYLIB=$RUBYLIB:/path/to/hackboxen/lib
110
+ </code></pre>
111
+
112
+ Install Hackboxen dependencies:
113
+
114
+ <pre><code>cd hackboxen
115
+ sudo bundle install
116
+ rake install # optionally: rake install -- --dataroot=/data/hb --coderoot=/code/hb
117
+ </code></pre>
118
+
119
+ This will install the following gems: "configliere":http://github.com/mrflip/configliere, "icss":http://github.com/infochimps/icss, "swineherd":http://github.com/ganglion/swineherd, and "rake":http://github.com/jimweirich/rake. This will also create a @.hackbox@ directory with a @hackbox.yaml@ file that contains default values for @coderoot@, @dataroot@, @s3_filesystem@, @os@, and @machine@. The @rake install@ command has the optional arguments @--dataroot=@ and @--coderoot=@.
120
+
121
+ A default @hackbox.yaml@ file:
122
+
123
+ <pre><code>---
124
+ coderoot: /code/hb/
125
+ dataroot: /data/hb/
126
+ s3_filesystem:
127
+ access_key:
128
+ secret_key:
129
+ mini_bucket:
130
+ requires:
131
+ machine: x86_64
132
+ os: darwin
133
+ </code></pre>
134
+
135
+ h3. Creating a Hackbox
136
+
137
+ Hackboxen comes with a scaffold executable, @hb-scaffold@, that creates a template hackbox for you. Required arguments are @--namespace=@ and @--protocol=@. Optional arguments are @--targets=@, @--s3access=@, and @--s3secret=@.
138
+
139
+ <pre><code>hb-scaffold --namespace=foo.bar --protocol=baz --targets=catalog,mysql
140
+ </code></pre>
141
+
142
+ This will create the following directories and files:
143
+
144
+ <pre><code>coderoot
145
+ └── foo
146
+ └── bar
147
+ └── baz
148
+ ├── config
149
+ │ ├── config.yaml
150
+ │ └── baz.icss.yaml
151
+ ├── engine
152
+ │ ├── main
153
+ │ └── baz_endpoint.rb
154
+ └── Rakefile
155
+ </code></pre>
156
+
157
+ h3. Running a hackbox
158
+
159
+ Externally, the execution of a hackbox appears as:
160
+
161
+ * A @Rakefile@ is run with @rake@ from the shell with one of the following targets:
162
+ ** @get_data@: Performs only the ingest step. The input data (in @ripd@/@rawd@) and any required metadata should exist after this step.
163
+ ** @default@: Runs @get_data@ and then performs the processing step by executing the @main@ file.
164
+
165
+ Execution Results:
166
+
167
+ * If there is no failure, @rake@ can be silent.
168
+ * If there is a failure, @rake@ ends with a thrown exception.
169
+ * After a successful execution, the complete output interface (@fixd@) must exist, with no additional interaction outside of @rake@.
170
+
171
+ The rough steps of hackbox internal execution are:
172
+
173
+ * The configuration sources (command line and files) are read and combined.
174
+ * The output directory structure (@fixd@) is created.
175
+ * The hackbox engine is run and the "troop ready" output datasets are created in @fixd@.
176
+
177
+ *Note: Hackbox execution should be idempotent (when it is sensible and efficient), leveraging this behavior from @rake@.*
178
+
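One way to get this behavior from @rake@ is a @file@ task, which runs only when its target is missing or older than its prerequisites. The sketch below is an assumption about how a hackbox might use it, not code from the library; the file names and the scratch dataroot are hypothetical.

```ruby
require 'rake'
require 'fileutils'
require 'tmpdir'

# Hypothetical layout: one ripd input and one fixd output under a scratch dataroot.
dataroot = Dir.mktmpdir('hb')
ripd = File.join(dataroot, 'ripd', 'source.tsv')
fixd = File.join(dataroot, 'fixd', 'data', 'output.tsv')

FileUtils.mkdir_p(File.dirname(ripd))
File.write(ripd, "a\t1\n")

RUNS = [] # records each time the processing step actually executes

# The body runs only if the output is missing or older than the input.
file fixd => ripd do |t|
  FileUtils.mkdir_p(File.dirname(t.name))
  File.write(t.name, File.read(ripd).upcase)
  RUNS << t.name
end

Rake::Task[fixd].invoke   # output is missing, so the step runs
Rake::Task[fixd].reenable # allow the task to be invoked again in this process
Rake::Task[fixd].invoke   # output is up to date, so the step is skipped
puts RUNS.length
```

The @reenable@ call is only there so that the second @invoke@ reaches the timestamp check within a single process. Across separate @rake@ runs the same check happens without it.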
179
+ h3. Hackboxen Best Practices
180
+
181
+ One should try to avoid redundant computation. In particular, output creation should be idempotent. Incrementally updated information can make this hard, but it should still be done if not too painful.
182
+
183
+ Files read and written by the hackbox should use the @Swineherd::FileSystem@ abstraction. See "swineherd":http://github.com/infochimps/swineherd.
184
+
185
+ Implementation of the @Gorillib::Receiver@ pattern is recommended. See "gorillib":http://github.com/infochimps/gorillib.
186
+
187
+ Any and all output datasets must include an appropriately descriptive schema. See "icss":http://github.com/infochimps/icss.
188
+
189
+ h2. Contributing to hackboxen
190
+
191
+ * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
192
+ * Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it
193
+ * Fork the project
194
+ * Start a feature/bugfix branch
195
+ * Commit and push until you are happy with your contribution
196
+ * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
197
+ * Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate the change to its own commit so I can cherry-pick around it.
198
+
199
+ h2. Copyright
200
+
201
+ Copyright (c) 2011 Infochimps. See LICENSE.txt for
202
+ further details.
203
+
@@ -0,0 +1,49 @@
1
+ require 'rubygems'
2
+ require 'bundler'
3
+ begin
4
+ Bundler.setup(:default, :development)
5
+ rescue Bundler::BundlerError => e
6
+ $stderr.puts e.message
7
+ $stderr.puts "Run `bundle install` to install missing gems"
8
+ exit e.status_code
9
+ end
10
+ require 'rake'
11
+
12
+ require 'jeweler'
13
+ Jeweler::Tasks.new do |gem|
14
+ gem.name = "hackboxen"
15
+ gem.homepage = "http://github.com/infochimps/hackboxen"
16
+ gem.executables = ["hb-install", "hb-scaffold", "hb-runner"]
17
+ gem.license = "MIT"
18
+ gem.summary = "A simple framework to assist in standardizing the data-munging input/output process."
19
+ gem.description = "A simple framework to assist in standardizing the data-munging input/output process."
20
+ gem.email = "travis@infochimps.com"
21
+ gem.authors = ["kornypoet", "Ganglion", "bollacker"]
22
+ end
23
+ Jeweler::RubygemsDotOrgTasks.new
24
+
25
+ require 'rake/testtask'
26
+ Rake::TestTask.new(:test) do |test|
27
+ test.libs << 'lib' << 'test'
28
+ test.pattern = 'test/**/test_*.rb'
29
+ test.verbose = true
30
+ end
31
+
32
+ require 'rcov/rcovtask'
33
+ Rcov::RcovTask.new do |test|
34
+ test.libs << 'test'
35
+ test.pattern = 'test/**/test_*.rb'
36
+ test.verbose = true
37
+ end
38
+
39
+ task :default => :test
40
+
41
+ require 'rake/rdoctask'
42
+ Rake::RDocTask.new do |rdoc|
43
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
44
+
45
+ rdoc.rdoc_dir = 'rdoc'
46
+ rdoc.title = "hackboxen #{version}"
47
+ rdoc.rdoc_files.include('README*')
48
+ rdoc.rdoc_files.include('lib/**/*.rb')
49
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 0.1.0
@@ -0,0 +1,101 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'rubygems'
4
+ require 'json'
5
+ require 'yaml'
6
+
7
+ class MetaBox
8
+
9
+ attr_accessor :cache
10
+
11
+ def initialize
12
+ @cache = {}
13
+ @unreadable = []
14
+ end
15
+
16
+ def lookup
17
+ readable = {}
18
+ paths_to_cfg.each do |path|
19
+ config = read_config path
20
+       if config.is_a?(Hash) && config['namespace'] && config['protocol']
21
+         name = "#{config['namespace']}.#{config['protocol']}"
22
+ readable[name] = path
23
+ else
24
+         @unreadable << path unless @unreadable.include?(path)
25
+ end
26
+ end
27
+ readable
28
+ end
29
+
30
+ def paths_to_cfg
31
+ Dir["../**/config.yaml"]
32
+ end
33
+
34
+ def read_config cfg_path
35
+ begin
36
+       return YAML.load(File.read(cfg_path))
37
+ rescue
38
+ return nil
39
+ end
40
+ end
41
+
42
+ def add_to_cache *args
43
+ args.flatten.each do |name|
44
+ @cache[name] = read_config(lookup[name]) if lookup[name]
45
+ end
46
+ list_cache
47
+ end
48
+
49
+ def clear_cache
50
+ @cache = {}
51
+ list_cache
52
+ end
53
+
54
+ def list_readable
55
+ lookup.keys
56
+ end
57
+
58
+ def list_cache
59
+ @cache.keys
60
+ end
61
+
62
+ def describe name
63
+ cfg = read_config(lookup[name]) if lookup[name]
64
+ puts JSON.pretty_generate cfg
65
+ name
66
+ end
67
+
68
+ def describe_cache *args
69
+ if args.empty?
70
+ @cache.each { |key, val| puts JSON.pretty_generate val }
71
+ list_cache
72
+ else
73
+ args.each { |val| puts JSON.pretty_generate @cache[val] }
74
+ end
75
+ end
76
+
77
+ def each_insert key, val
78
+ @cache.each { |name, cfg| cfg[key] = val }
79
+ describe_cache
80
+ end
81
+
82
+ def search query
83
+ results = []
84
+ lookup.each do |name, path|
85
+ cfg = read_config path
86
+ results << name if cfg[query]
87
+ end
88
+ results
89
+ end
90
+
91
+ def write_cache
92
+ @cache.each do |name, cfg|
93
+ File.open(lookup[name], 'w') do |file|
94
+ file.puts cfg.to_yaml
95
+ end
96
+ end
97
+ list_cache
98
+ end
99
+
100
+ end
101
+